
Lecture Notes in Artificial Intelligence 6471

Edited by R. Goebel, J. Siekmann, and W. Wahlster

Subseries of Lecture Notes in Computer Science


Jaume Bacardit Will Browne
Jan Drugowitsch Ester Bernadó-Mansilla
Martin V. Butz (Eds.)

Learning
Classifier Systems
11th International Workshop, IWLCS 2008
Atlanta, GA, USA, July 13, 2008
and 12th International Workshop, IWLCS 2009
Montreal, QC, Canada, July 9, 2009
Revised Selected Papers

Series Editors
Randy Goebel, University of Alberta, Edmonton, Canada
Jörg Siekmann, University of Saarland, Saarbrücken, Germany
Wolfgang Wahlster, DFKI and University of Saarland, Saarbrücken, Germany

Volume Editors
Jaume Bacardit
University of Nottingham, Nottingham, NG8 1BB, UK
E-mail: jaume.bacardit@nottingham.ac.uk

Will Browne
Victoria University of Wellington, Wellington 6140, New Zealand
E-mail: will.browne@vuw.ac.nz

Jan Drugowitsch
University of Rochester, Rochester, NY 14627, USA
E-mail: JDrugowitsch@bcs.rochester.edu

Ester Bernadó-Mansilla
Universitat Ramon Llull, 08022 Barcelona, Spain
E-mail: esterb@salle.url.edu

Martin V. Butz
University of Würzburg, 97070 Würzburg, Germany
E-mail: mbutz@psychologie.uni-wuerzburg.de

Library of Congress Control Number: 2010940267

CR Subject Classification (1998): I.2.6, I.2, H.3, D.2.4, D.2.8, F.1, H.4, H.2.8

LNCS Sublibrary: SL 7 Artificial Intelligence

ISSN 0302-9743
ISBN-10 3-642-17507-4 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-17507-7 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
springer.com
© Springer-Verlag Berlin Heidelberg 2010
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper 06/3180
Preface

Learning Classifier Systems (LCS) constitute a fascinating concept at the intersection of machine learning and evolutionary computation. LCSs' genetic search, generally in combination with reinforcement learning techniques, can be applied to both temporal and spatial problem solving and promotes powerful search in a wide variety of domains. The LCS concept allows many representations of the learned knowledge, from simple production rules to artificial neural networks to linear approximations, often in a human-readable form.
The concepts underlying LCS have been developed for over 30 years, with the annual International Workshop on Learning Classifier Systems supporting the field since 1992. From 1999 onwards the workshop has been held yearly, in conjunction with PPSN in 2000 and 2002 and with GECCO in 1999, 2001, and from 2003 onwards. This book is the continuation of the six volumes containing selected and revised papers from the previous workshops, published by Springer as LNAI 1813, LNAI 1996, LNAI 2321, LNAI 2661, LNCS 4399, and LNAI 4998.
The articles in this book have been loosely organized into four overlapping themes. Firstly, the breadth of research into LCS and related areas is demonstrated. Then the ability to approximate complex multidimensional function surfaces is shown by the latest research on computed predictions and piecewise approximations. This work leads on to LCS for complex domains, such as temporal decision making and continuous domains, where traditional learning approaches often require problem-dependent manual tuning of the algorithms and discretization of problem spaces, resulting in a loss of information. Finally, diverse application examples are presented to demonstrate the versatility and broad applicability of the LCS approach.
Pier Luca Lanzi and Daniele Loiacono investigate the use of general-purpose Graphical Processing Units (GPUs), which are becoming increasingly common in evolutionary computation, for speeding up the matching of environmental states to rules in LCS. Depending on the problem investigated and the representation scheme used, they find that the use of GPUs improves the matching speed by 3 to 50 times when compared with matching on standard CPUs. Association rule mining, where interesting associations in the occurrence of items in streams of unlabelled examples are to be extracted, is addressed by Albert Orriols-Puig and Jorge Casillas. Their novel CSar Michigan-style learning classifier system shows promising results when compared with the benchmark approach to this problem. Stewart Wilson shows that there is still much scope for generating novel approaches with the LCS concept. He proposes an automatic system for creating pattern generators and recognizers based on a three-cornered competitive co-evolutionary algorithm approach.

Patrick O. Stalph and Martin V. Butz investigate current capabilities and challenges facing XCSF, an LCS in which each rule builds a locally linear approximation to the payoff surface within its matching region. It is noted that the XCSF approach was the most popular branch of LCS research within the latest editions of this workshop. In a second paper the same authors investigate the impact of variable offspring set sizes, which show promise beyond the standard two offspring used in many genetics-based machine learning techniques. The model used in XCSF by Gerard David Howard, Larry Bull, and Pier Luca Lanzi uses an artificial neural network, instead of standard rules, for matching and action selection, thus illustrating the flexible nature of LCS techniques. Their method is compared with principles from the NEAT (NeuroEvolution of Augmenting Topologies) approach and augmented with previous LCS neural constructivism work to improve its performance in continuous environments.
Gilles Enée and Mathias Peroumalnak also examine how LCS copes with complex environments by introducing the Adapted Pittsburgh Classifier System and applying it to maze-type environments containing aliasing squares. This work shows that the LCS is capable of building accurate strategies in non-Markovian environments without the use of rules with memory.
Ajay Kumar Tanwani and Muddassar Farooq compare three LCS-based data mining techniques to three benchmark algorithms on biomedical data sets, showing that, although not completely dominant, the GAssist LCS approach is in general able to provide the best classification results on the majority of datasets tested. Illustrating the diversity of application domains for LCS, supply chain management sales is investigated by María Franco, Ivette Martínez, and Celso Gorrín, showing that the set of generated rules solves the sales problem in a satisfactory manner. Richard Preen uses the well-established XCS LCS to identify trade entry and exit timings for financial time-series forecasting. These results show the promise of LCS in this difficult domain, given its noisy, dynamic, and temporal nature. In the final application paper, Jose G. Moreno-Torres, Xavier Llorà, David E. Goldberg, and Rohit Bhargava provide an approach to the homogenization of laboratory data through the use of a genetic-programming-based algorithm.
As in the previous volumes, we hope that this book will be a useful resource for researchers interested in learning classifier systems and will provide insights into the most relevant topics. Finally, we hope it will encourage new researchers, business, and industry to investigate the LCS concept as a method to discover solutions to their varied problems.

September 2010

Jaume Bacardit
Will Browne
Jan Drugowitsch
Organization

The post-proceedings of the International Workshops on Learning Classifier Systems 2008 and 2009 were assembled by the organizing committee of IWLCS 2009.

IWLCS 2008

Organizing Committee: Jaume Bacardit (University of Nottingham, UK)
Ester Bernadó-Mansilla (Universitat Ramon Llull, Spain)
Martin V. Butz (Universität Würzburg, Germany)

Advisory Committee: Tim Kovacs (University of Bristol, UK)
Xavier Llorà (Univ. of Illinois at Urbana-Champaign, USA)
Pier Luca Lanzi (Politecnico di Milano, Italy)
Wolfgang Stolzmann (Daimler Chrysler AG, Germany)
Keiki Takadama (Tokyo Institute of Technology, Japan)
Stewart Wilson (Prediction Dynamics, USA)

IWLCS 2009

Organizing Committee: Jaume Bacardit (University of Nottingham, UK)
Will Browne (Victoria University of Wellington, New Zealand)
Jan Drugowitsch (University of Rochester, USA)

Advisory Committee: Ester Bernadó-Mansilla (Universitat Ramon Llull, Spain)
Martin V. Butz (Universität Würzburg, Germany)
Tim Kovacs (University of Bristol, UK)
Xavier Llorà (Univ. of Illinois at Urbana-Champaign, USA)
Pier Luca Lanzi (Politecnico di Milano, Italy)
Wolfgang Stolzmann (Daimler Chrysler AG, Germany)
Keiki Takadama (Tokyo Institute of Technology, Japan)
Stewart Wilson (Prediction Dynamics, USA)

Referees

Ester Bernadó-Mansilla, Lashon Booker, Will Browne, Larry Bull, Martin V. Butz, Jan Drugowitsch, Ali Hamzeh, Francisco Herrera, John Holmes, Tim Kovacs, Pier Luca Lanzi, Xavier Llorà, Daniele Loiacono, Drew Mellor, Luis Miramontes Hercog, Albert Orriols-Puig, Wolfgang Stolzmann, Keiki Takadama, Stewart W. Wilson

Past Workshops
1st IWLCS October 1992 NASA Johnson Space Center, Houston, TX, USA
2nd IWLCS July 1999 GECCO 1999, Orlando, FL, USA
3rd IWLCS September 2000 PPSN 2000, Paris, France
4th IWLCS July 2001 GECCO 2001, San Francisco, CA, USA
5th IWLCS September 2002 PPSN 2002, Granada, Spain
6th IWLCS July 2003 GECCO 2003, Chicago, IL, USA
7th IWLCS June 2004 GECCO 2004, Seattle, WA, USA
8th IWLCS June 2005 GECCO 2005, Washington, DC, USA
9th IWLCS July 2006 GECCO 2006, Seattle, WA, USA
10th IWLCS July 2007 GECCO 2007, London, UK
11th IWLCS July 2008 GECCO 2008, Atlanta, GA, USA
12th IWLCS July 2009 GECCO 2009, Montreal, Canada
13th IWLCS July 2010 GECCO 2010, Portland, OR, USA
Table of Contents

LCS and Related Methods


Speeding Up Matching in Learning Classifier Systems Using CUDA . . . . 1
Pier-Luca Lanzi and Daniele Loiacono

Evolution of Interesting Association Rules Online with Learning


Classifier Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Albert Orriols-Puig and Jorge Casillas

Coevolution of Pattern Generators and Recognizers . . . . . . . . . . . . . . . . . . 38


Stewart W. Wilson

Function Approximation
How Fitness Estimates Interact with Reproduction Rates:
Towards Variable Offspring Set Sizes in XCSF . . . . . . . . . . . . . . . . . . . . . . . 47
Patrick O. Stalph and Martin V. Butz

Current XCSF Capabilities and Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . 57


Patrick O. Stalph and Martin V. Butz

LCS in Complex Domains


Recursive Least Squares and Quadratic Prediction in Continuous
Multistep Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Daniele Loiacono and Pier-Luca Lanzi

Use of a Connection-Selection Scheme in Neural XCSF . . . . . . . . . . . . . . . 87


Gerard David Howard, Larry Bull, and Pier-Luca Lanzi

Building Accurate Strategies in Non Markovian Environments without
Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Gilles Enée and Mathias Peroumalnak

Classification Potential vs. Classification Accuracy: A Comprehensive


Study of Evolutionary Algorithms with Biomedical Datasets . . . . . . . . . . . 127
Ajay Kumar Tanwani and Muddassar Farooq

Applications
Supply Chain Management Sales Using XCSR . . . . . . . . . . . . . . . . . . . . . . . 145
María Franco, Ivette Martínez, and Celso Gorrín

Identifying Trade Entry and Exit Timing Using Mathematical Technical


Indicators in XCS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
Richard Preen

On the Homogenization of Data from Two Laboratories Using Genetic


Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Jose G. Moreno-Torres, Xavier Llorà, David E. Goldberg, and
Rohit Bhargava

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199


Speeding Up Matching in Learning Classifier
Systems Using CUDA

Pier Luca Lanzi and Daniele Loiacono

Dipartimento di Elettronica e Informazione


Politecnico di Milano
Milano, Italy
{lanzi,loiacono}@elet.polimi.it

Abstract. We investigate the use of NVIDIA's Compute Unified Device Architecture (CUDA) to speed up matching in classifier systems. We compare CUDA-based matching and CPU-based matching on (i) real inputs using interval-based conditions and on (ii) binary inputs using ternary conditions. Our results show that on small problems, due to the memory transfer overhead introduced by CUDA, matching is faster when performed using the CPU. As the problem size increases, CUDA-based matching can outperform CPU-based matching, resulting in a 3–12× speedup when the interval-based representation is applied to match real-valued inputs and a 20–50× speedup for the ternary-based representation.

1 Introduction

Learning classifier systems [10,8,17] combine evolutionary computation with methods of temporal difference learning to solve classification and reinforcement learning problems. A classifier system maintains a population of condition-action-prediction rules, called classifiers, which identifies its current knowledge about the problem to be solved. At each time step, the system receives the current state of the problem and matches it against all the classifiers in the population. The result is a match set containing the classifiers that can be applied to the problem in its current state. Based on the value of the actions in the match set, the classifier system selects an action to perform on the problem to progress toward its solution. As a consequence of the executed action, the system receives a numerical reward that is distributed to the classifiers accountable for it. While the classifier system is interacting with the problem, a genetic algorithm is applied to the population to discover better classifiers through selection, recombination, and mutation.

Matching is the main and most computationally demanding process of a classifier system [14,3]; it can occupy up to 65%-85% of the overall computation time [14]. Accordingly, several methods have been proposed in the literature to speed up matching in learning classifier systems. Llorà and Sastry [14] compared the typical encoding of classifier conditions for binary inputs, an encoding based on the underlying binary arithmetic, and a version of the
J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 1–20, 2010.
© Springer-Verlag Berlin Heidelberg 2010

same encoding optimized via vector instructions. Their results show that binary encodings combined with optimizations based on the underlying integer arithmetic can speed up the matching process by up to 80 times. The analysis of Llorà and Sastry [14] did not consider the influence of classifier generality on the complexity of matching. As noted in [3], matching usually stops as soon as it is determined that the classifier cannot be applied to the current problem instance (e.g., [1,12]). Accordingly, matching a population of highly specific classifiers takes much less time than matching a population of highly general classifiers. Butz et al. [3] extended the analysis in [14] (i) by considering more encodings (the specificity-based encoding used in Butz's implementation [1] and the encoding used in some implementations of Alecsys [7]); and (ii) by taking into account classifier generality. Their results show that, overall, specificity-based matching can be 50% faster than character-based encoding when general populations are involved, but it can be slower than character-based encoding if more specific populations are considered. Binary encoding was confirmed to be the fastest option, with a reported improvement of up to 90% compared to the usual character-based encoding. Butz et al. [3] also proposed a specificity-based encoding for real-coded inputs which could halve the time required to match a population.
In this work, we took a different approach to speeding up matching in classifier systems, based on the use of Graphical Processing Units (GPUs). More precisely, we used NVIDIA's Compute Unified Device Architecture (CUDA) to implement matching for (i) real inputs using interval-based conditions and for (ii) binary inputs using ternary conditions. We tested our GPU-based matching by applying the same experimental design used in [14,3]. Our results show that on small problems, due to the memory transfer overhead introduced by GPUs, matching is faster when performed using the usual CPU. On larger problems, involving either more variables or more classifiers, GPU-based matching can outperform the CPU-based implementation with a 3–12× speedup when the interval-based representation is applied to match real-valued inputs and a 20–50× speedup for the ternary-based representation.

2 General-Purpose Computation on GPUs


Graphics Processing Units (GPUs) currently provide the best floating-point performance, with a throughput that is at least ten times higher than that provided by multi-core CPUs. Such a large performance gap has pushed developers to move several computationally intensive parts of their software to GPUs. Many-core GPUs perform better than general-purpose multi-core CPUs on floating-point computation because they have a different underlying design
philosophy (see Figure 1). The design of a CPU is optimized for sequential code performance. It exploits sophisticated control logic to execute instructions from a single thread in parallel while maintaining the appearance of sequential

execution. In addition, large cache memories are provided to reduce the instruction and data access latencies required in large complex applications.
On the other hand, the GPU design is optimized for the execution of a massive number of threads. It exploits the large number of executing threads to find work to do during long-latency memory accesses, minimizing the control logic required for each thread. Small cache memories are provided so that when multiple threads access the same memory data, they do not all need to access the DRAM. As a result, much more chip area is dedicated to floating-point calculations.

Fig. 1. An overview of the CPU and GPU design philosophies

2.1 The CUDA Programming Model

NVIDIA's Compute Unified Device Architecture (CUDA)¹ allows developers to write computationally intensive applications on a GPU by using an extension of C which provides abstractions for parallel programming. In CUDA, GPUs are represented as devices that can run a large number of threads. Parallel tasks are represented as kernels mapped over a domain. Each kernel represents a sequential task to be executed as a thread on each point of the domain. The data to be processed by the GPU must be loaded into the board memory and, unless deallocated or overwritten, remain available for subsequent kernels. Kernels have built-in variables to identify themselves in the domain and to access the data in the board memory. The domain is defined as a 5-dimensional structure consisting of a two-dimensional grid of three-dimensional thread blocks. Thread blocks are limited to 512 total threads; each block is assigned to a single processing element and runs as a unit until completion without preemption. Note that the resources used by a block are released only after the execution of all the threads in the block has completed. Once a block is assigned to a streaming multiprocessor, it is further divided into groups of 32 threads, called warps. All threads within the same block are simultaneously live and they are temporally multiplexed but, at any time, the processing element executes only one of its resident warps. When the number of thread blocks in a grid exceeds the hardware
¹ http://www.nvidia.com/object/cuda_home_new.html

resources, new blocks are assigned to processing elements as soon as previous ones have completed their execution. In addition to the global shared memory of the device, GPUs also have a private memory visible only to threads within the same block, called per-block shared memory (PBSM).

2.2 Performance Issues


Although CUDA is very intuitive, it requires a deep knowledge of the underlying hardware architecture. CUDA developers need to take into account specific features of the GPU architecture, such as memory transfer overhead, shared memory bank conflicts, and the impact of control flow.
In fact, in CUDA, it is necessary to manage the communication between main memory and GPU shared memory explicitly. Developers have to reduce the transfer overhead by avoiding frequent data transfers between the GPU and CPU. Accordingly, rather than increasing the amount of communication with the CPU, computation on the GPU is usually duplicated, and computation is typically overlapped with data communication.

Once the memory transfer overhead has been optimized, developers must optimize the access to the global memory of the device, which represents one of the most important performance issues in CUDA. In general, CUDA applications exploit massive data parallelism in that they process a massive amount of data within a short period of time. Therefore, a CUDA kernel must be able to access a massive amount of data from the global memory within a very short period of time. As memory access is a very slow process, modern DRAMs use a parallel process to increase their data access rate. When a memory location is accessed, many consecutive locations are also accessed. If an application exploits data from multiple, consecutive locations before moving on to other locations, the DRAMs can supply the data at a much higher rate than for accesses to a random sequence of locations. In CUDA, it is possible to take advantage of the fact that threads in a warp are executing the same instruction at any given point in time. When all threads in a warp execute a load instruction, the hardware detects whether the threads access consecutive global memory locations. The most favorable access pattern is achieved when the same instruction for all threads in a warp accesses consecutive global memory locations. In this case, the hardware combines, or coalesces, all these accesses into a consolidated access to the DRAMs that requests all the consecutive locations involved. Such coalesced access allows the DRAMs to deliver data at a rate close to the maximal global memory bandwidth.

Finally, control flow instructions (e.g., the if or switch statements) can significantly affect the instruction throughput when threads within the same warp follow different branches. When executing different branches, either the execution of each path must be serialized or all threads within the warp must execute each instruction, with predication used to mask out the effects of instructions that should not be executed [19]. Thus, kernels should be optimized to avoid excessive use of control flow

statements or to ensure that the branches executed will be the same across the
whole warp.

3 The XCS Classifier System

XCS [17] maintains a population of condition-action-prediction rules (or classifiers), which represents the system's current knowledge about a problem solution. Each classifier represents a portion of the overall solution. The classifier's condition identifies a part of the problem domain; the classifier's action represents a decision on the part of the domain identified by its condition; the classifier's prediction p estimates the value of the action in terms of the problem solution. Classifier conditions are usually strings defined over the ternary alphabet {0,1,#}, in which the don't care symbol # indicates that the corresponding position can match either a 0 or a 1. Actions are usually binary strings.

XCS applies supervised or reinforcement learning to evaluate the classifiers' prediction and a genetic algorithm to discover better classifiers by selecting, recombining, and mutating existing ones. To guide the evolutionary process, the classifiers keep three additional parameters: the prediction error ε, which estimates the average absolute error of the classifier prediction p; the fitness F, which estimates the average relative accuracy of the payoff prediction given by p and is a function of the prediction error ε; and the numerosity num, which indicates how many copies of classifiers with the same condition and the same action are present in the population.
At time t, XCS builds a match set [M] containing the classifiers in the population [P] whose condition matches the current input s_t; for each classifier, the match procedure scans all the input bits to check whether the classifier condition contains a don't care symbol (#) or an input bit is equal to the corresponding character in the condition. If [M] contains fewer than mna actions, covering takes place and creates a new classifier with a random action and a condition, with a proportion P# of don't care symbols, that matches s_t. For each possible action a in [M], XCS computes the system prediction P(s_t, a), which estimates the payoff that XCS expects if action a is performed in s_t. The system prediction P(s_t, a) is computed as the fitness-weighted average of the predictions of the classifiers in [M] that advocate action a:
$$P(s_t, a) = \frac{\sum_{cl_k \in [M](a)} p_k \cdot F_k}{\sum_{cl_i \in [M](a)} F_i} \qquad (1)$$

where [M](a) represents the subset of classifiers of [M] with action a, p_k identifies the prediction of classifier cl_k, and F_k identifies the fitness of classifier cl_k. Next, XCS selects an action to perform; the classifiers in [M] that advocate the selected action form the current action set [A]. The selected action a_t is performed, and a scalar reward r_{t+1} is returned to XCS together with a new

input s_{t+1}. The incoming reward r_{t+1} is used to compute the estimated payoff P(t) as,

$$P(t) = r_{t+1} + \gamma \max_{a \in [M]} P(s_{t+1}, a) \qquad (2)$$

Next, the parameters of the classifiers in [A] are updated [5]. First, the prediction p is updated with learning rate β (0 < β ≤ 1) as,

$$p \leftarrow p + \beta \, (P(t) - p) \qquad (3)$$

Then, the prediction error ε and the fitness F are updated [17,5].

On a regular basis (dependent on the parameter θ_ga), the genetic algorithm is applied to the classifiers in [A]. It selects two classifiers, copies them, and with probability χ performs crossover on the copies; then, with probability μ it mutates each allele. The resulting offspring classifiers are inserted into the population and two other classifiers are deleted from the population to keep the population size N constant.

4 Matching Interval-Based Conditions Using GPUs


Learning classifier systems typically assume that inputs are encoded as binary strings and that classifier conditions are strings defined over the ternary alphabet {0,1,#} [9,8,16,17]. There are, however, several representations that can deal with real-valued inputs: center-based intervals [18], simple intervals [19,15], convex hulls [13], ellipsoids [2], and hyper-ellipsoids [4].

4.1 Interval-Based Conditions and Matching


In the interval-based case [19], a condition is represented by a concatenation of n real interval predicates, int_i = (l_i, u_i); given an input s consisting of n real numbers, a condition matches s if, for every i ∈ {1, ..., n}, the predicate l_i ≤ s_i ≤ u_i is verified. The matching is straightforward and its pseudocode is reported as Algorithm 1: the condition (identified by the variable condition) is represented as a vector of intervals; the inputs are a vector of real values (in double precision); the n inputs (i.e., inputs.size()) are scanned and each input is tested against the corresponding interval; the process stops either when all the inputs have matched or as soon as one of the intervals does not match (when result in Algorithm 1 becomes false).
Butz et al. [3] showed that this matching procedure can be sped up by changing the order in which the inputs are tested: if smaller (more specific) intervals are tested first, the match is more likely to fail early, which speeds up the matching process. Their results on matching alone showed that this specificity-based matching could produce a 60% speed increase when applied to populations containing classifiers with highly specific conditions. However, they reported no significant improvement when their specificity-based matching was applied to typical testbeds.

Algorithm 1. Matching for interval-based conditions in XCSLib.


// representation of classifier condition
vector<interval> condition;

// representation of classifier inputs


vector<double> inputs;

// matching procedure
int pos = 0;
bool result = true;

while ( (result) && (pos<inputs.size()) )


{
result =
((inputs[pos]>=condition[pos].lower) &&
(condition[pos].upper>=inputs[pos]));
pos++;
}
return result;

4.2 Interval-Based Matching Using CUDA


Implementing interval-based matching using CUDA is straightforward and involves three simple design steps. First, we need to decide how to represent classifier conditions in the graphics board memory; then, we have to decide how parallelization is organized; finally, we need to implement the required kernel functions. Once these steps are performed, the matching of interval-based conditions on the GPU consists of (i) transferring the data to the board memory of the GPU, (ii) invoking the kernels that perform the matching, and finally (iii) retrieving the result from the board memory.

Condition Representation. An interval-based condition can be easily encoded using two arrays of float variables, one to store all the condition's lower bounds and one to store all the condition's upper bounds. Algorithm 2 reports the matching algorithm using the lower and upper bound vectors.

We can apply the same principle to encode a population of N classifiers using two matrices of float variables lb and ub which contain all the lower bounds and all the upper bounds of the conditions in the population. Given a problem with n real inputs, the matrices lb and ub can be organized either (i) by rows, putting in each row of the matrices the n lower/upper bounds of the same classifier (Figure 2a), or (ii) by columns, putting in each column of the matrices the n lower/upper bounds of the same classifier (Figure 2b). In both representations, the matrices lb and ub are then linearized into arrays to be stored in the GPU memory. In particular, when the representation by rows is used, the

Algorithm 2. Matching for interval-based conditions using arrays.


// representation of classifier condition
float lb[n];
float ub[n];

// representation of classifier inputs


float inputs[n];

// matching procedure
int pos = 0;
bool result = true;

while ( (result) && (pos<n) )


{
result = ((inputs[pos]>=lb[pos]) && (ub[pos]>=inputs[pos]));
pos++;
}
return result;

first n values of lb contain the lower bounds of the first classifier condition in the population, while the first n values of ub contain the upper bounds of the same condition. The next n values in lb and ub contain the lower and upper bounds of the second classifier condition, and so on for all the N classifiers in the population. In contrast, when the representation by columns is used, the first N values of lb contain the lower bounds associated with the first input of the N classifiers in the population; similarly, the first N values of ub contain the corresponding upper bounds. The next N values in lb and ub contain the lower and upper bounds associated with the second input, and so on for all the n inputs of the problem.


Fig. 2. Classifier conditions in the GPU global memory are represented as two matrices
lb and ub which can be stored (a) by rows or (b) by columns; cli represents the variables
in the classifier condition; si shows which variables are matched in parallel by the
kernel
Speeding Up Matching in Learning Classifier Systems Using CUDA 9

Matching. To perform matching, the classifier conditions in the population are
stored (either by rows or by columns) in the GPU main memory as the two
vectors lb and ub of n·N elements each; the current input is stored in the GPU
memory as a vector s of n floats. A result vector matched of N integers in the
GPU memory is used to store the result of a matching procedure: a 1 in position
i means that the condition of classifier cli matched the current input; a 0 in the
same position means that the condition of cli did not match. Then, matching
is performed by running the matching kernel on the data structures that have
been loaded into the device memory.
Memory Organization. As we previously noted, the vectors lb and ub can be
stored in the device memory by rows (Figure 2a) or by columns (Figure 2b).
To maximize the performance of a GPU implementation, at each clock cycle,
the GPU must access very close memory positions, since the GPU accesses blocks
of contiguous memory locations. Note that, while the representation of lb and
ub by rows (Figure 2a) appears to be straightforward, it also provides the least
parallelization possible. As an example, consider the first two classifiers in the
population (cl0 and cl1) whose lower bounds are respectively stored in positions
from 0 to n-1 for cl0 and from n to 2n-1 for cl1. At the first clock cycle, one
kernel will start the matching of the first condition and will access the value in
lb[0], while the second kernel will access the value in lb[n] (i.e., the first lower
bound of cl0 and cl1, respectively). When n is large, these two memory positions will

Algorithm 3. Kernel for interval-based matching in CUDA using a row-based


representation.

// LB and UB represent the classifier condition


// n is the size of the input
// N is the population size
__global__ void
match( float* LB, float* UB, float *input, int *matched, int n, int N)
{
// computes position of the classifier condition in the arrays LB and UB
const unsigned int tidx = threadIdx.x + blockIdx.x*blockDim.x;
const unsigned int pos = tidx*n;

if (tidx<N)
{
int has_matched=1,i=0;
while ( (has_matched) && i<n )
{
has_matched = (input[i] >= LB[pos+i]) && (input[i] <= UB[pos+i]);
i++;
}
matched[tidx]=has_matched;
}
}

be too distant and require the GPU to perform two separate memory accesses.
Accordingly, the GPU will remain idle for a significant amount of time to access
memory. In contrast, if lb and ub are represented by columns (Figure 2b),
the same operations will access contiguous memory locations. In fact, at the
first clock cycle, one kernel will now access the value in lb[0] (the first lower
bound of cl0), while the second kernel will access the nearby memory position
lb[1] where the first lower bound of cl1 is stored. As a result, the GPU can
perform several operations using just one memory access, resulting in the maximum
parallelization possible.
Kernels are the basic computation units in CUDA and they are the source
of the parallelization. Kernels are executed in parallel on separate GPU cores,
grouped into blocks whose size depends on the model of GPU used and must be
properly set to achieve the best parallelization. As soon as a core completes the
execution of a block of kernels, a new block is assigned to it.
In our case, a kernel is in charge of matching one classifier. Accordingly,
the GPU will execute N kernels, one for each classifier in the population.
We used blocks of 64 kernels, which we empirically found to be the best block
size on the card models we tested.
Algorithm 3 shows the kernel for interval-based matching using CUDA when
the representation of lb and ub by rows is used. Each kernel reads the condition
of a classifier from the device memory and checks whether it matches the
current input. If a match is found, the position of the matched array in the
device memory corresponding to the classifier is set to one; otherwise it is set to
zero.

Algorithm 4. Kernel for interval-based matching in CUDA using a column-based


representation.
// LB and UB represent the classifier condition
// n is the size of the input
// N is the population size
__global__ void
matchReal( float* LB, float* UB, float *input, int *matched, int n, int N)
{
// access thread id
const unsigned int tidx = threadIdx.x + blockIdx.x*blockDim.x;

if (tidx<N)
{
int has_matched=1,i=0;
while ( (has_matched) && i<n )
{
has_matched = (input[i] >= LB[i*N+tidx]) && (input[i] <= UB[i*N+tidx]);
i++;
}
matched[tidx]=has_matched;
}
}

Algorithm 4 shows the kernel for column-based interval matching using
CUDA. The only difference with respect to the row-based implementation is the
computation of the index of the interval to be tested.

5 Matching Ternary Conditions Using GPUs


Ternary conditions are usually implemented using a character-based encoding
that represents conditions as strings of characters and encodes each one of the
three symbols {0, 1, #} as a character variable (e.g., as a char in C/C++).
Character-based encoding is very simple and for this reason widely used, but it
is also highly inefficient in that (1) it wastes 75% of the memory by using 8-bit
characters to encode three symbols and (2) it processes input information that
is in principle useless [3]. More compact encodings were used in the early days

Algorithm 5. Binary representation and matching.


// representation of classifier condition
bitset<n> fp;
bitset<n> sp;

// representation of classifier inputs
bitset<n> inputs;

// matching procedure
bitset<n> result = ((inputs^fp) & (inputs^sp));

return result.none();

Algorithm 6. Improved binary representation and matching.


// representation of classifier condition (m is the number of unsigned integers
// necessary to represent n bits)
unsigned int fp[m];
unsigned int sp[m];

// representation of classifier inputs
unsigned int inputs[m];

// matching procedure
bool matched = true;
int i = 0;
while ( (matched) && i<m )
{
matched = ( ( (fp[i]^inputs[i]) & (sp[i]) ) == 0);
i++;
}

return matched;

Algorithm 7. Kernel for the row-based matching on ternary conditions in


CUDA.
// fp and sp represent the classifier condition
// m is the number of integers required to represent the classifier condition (and the input)
// N is the population size
__global__ void
matchBinary( int* fp, int* sp, int *input, int *matched, int m, int N)
{
// access thread id
const unsigned int tidx = threadIdx.x + blockIdx.x*blockDim.x;
// base position in fp and sp arrays
const unsigned int pos = tidx*m;

if (tidx<N)
{
int has_matched=1,i=0;
while ( (has_matched) && i<m )
{
unsigned int sp_i = sp[pos+i];
unsigned int input_i = input[i];
unsigned int fp_i = fp[pos+i];
has_matched= ( ( (fp_i^input_i) & (sp_i) ) == 0);
i++;
}
matched[tidx]=has_matched;
}
}

of classifier systems research [7] and similar ones have been recently proposed to
speed up the matching using standard CPUs [14].
In their famous classifier system Alecsys [7], Dorigo, Colombetti and colleagues
implemented classifier conditions as arrays of bits packed inside unsigned
integers. In Alecsys, a condition was represented by two arrays, fp and
sp, of unsigned integers; a one in the condition was represented by a bit set to
one in the same position of fp and sp; a zero was represented by a bit set to zero
in the same position of fp and sp; a don't care (#) could be either represented
by a 0 in fp and a 1 in sp or by a 1 in fp and a 0 in sp. Given the bit-encoded
inputs i, a condition matches if (fp^i) & (sp^i) returns a set of zero bits, where ^
is the bitwise exclusive or and & is the bitwise logical and. Algorithm 5 shows the
C++ implementation of the encoding used in Alecsys and the corresponding
matching taken from [3]. The condition is represented as two variables, fp and
sp, using the Standard Template Library (STL) bitset class [11], which encodes
a set of bits; the condition matches if the resulting bitset has all the bits set to
zero, i.e., if result.none() returns true.
We can apply the same approach we used for interval-based conditions to
speed up the matching of ternary conditions using CUDA. For this purpose,
we need to modify Alecsys's encoding as follows. A classifier condition is still
represented using two arrays, fp and sp, each one representing part of the condition.
In the case of the GPU representation, however, the first array fp encodes
only the specific positions while the second array sp encodes only the general
positions. As a result, this encoding reduces the number of bitwise operations

Algorithm 8. Kernel for the column-based matching on ternary conditions in


CUDA.
// fp and sp represent the classifier condition
// m is the number of integers required to represent the classifier condition (and the input)
// N is the population size
__global__ void
matchBinary( int* fp, int* sp, int *input, int *matched, int m, int N)
{
// access thread id
const unsigned int tidx = threadIdx.x + blockIdx.x*blockDim.x;

if (tidx<N)
{
int has_matched=1,i=0;
while ( (has_matched) && i<m )
{
unsigned int fp_i= fp[i*N+tidx];
unsigned int sp_i = sp[i*N+tidx];
unsigned int input_i = input[i];
has_matched= ( ( (fp_i^input_i) & (sp_i) ) == 0);
i++;
}
matched[tidx]=has_matched;
}
}

needed to match an input bitstring. In fact, given an input bitstring i, a condition
matches if the expression (fp^i) & sp returns all zero bits (^ is the bitwise exclusive
or and & is the bitwise logical and). This new matching procedure requires
only one bitwise exclusive or and one bitwise and, while, in Alecsys, matching
required two bitwise exclusive ors and one bitwise and. This small modification
dramatically reduces the number of registers and memory accesses required to
perform fast bitwise operations on the GPU. Finally, since the Standard Template
Library (STL) bitset is unavailable on GPUs, the arrays fp and sp
must be represented as two arrays of unsigned integers. Each unsigned integer
is used to encode 32 bits (i.e., the size of integers and unsigned integers in
the CUDA specification). The matching procedure for ternary conditions using
CUDA is reported as Algorithm 6. As the algorithm shows, using two arrays of
unsigned integers instead of two bitsets allows the matching procedure to stop as
soon as a non-matching position is found, as happens in the character-based
encoding (see [3]).
As in the case of interval-based conditions, with ternary conditions we
can also have (i) a row-based matching, in which the unsigned integers representing a
condition are stored in subsequent positions, and (ii) a column-based matching,
in which the same unsigned integers are stored N positions apart. The kernels
implementing the two approaches are reported as Algorithm 7 and Algorithm 8,
respectively.

6 Experimental Results

We performed two sets of experiments to evaluate the speed-up introduced by
the use of GPU-based implementations of matching for interval-based conditions
and ternary conditions.

6.1 Design of Experiments

In this work, we used an experimental design similar to the one applied in [3],
which was inspired by the previous work of Llorà and Sastry [14]. We generated
a population of N interval-based or ternary conditions of length n with different
generality and 1000 random input configurations. For interval-based conditions,
the generality of a random population was determined by setting an adequate
value of the parameter r0 (see [19] for details); for ternary conditions, generality
was set using the don't care probability P#. We matched each random input
against the N conditions using one of the kernels previously discussed and measured
the average time required to perform all the match operations using the
functions provided by the CUDA distribution. We repeated this procedure 10
times. Overall, we tested two matching kernels (one using the representation
by rows and one using the representation by columns) on the CPU2, on a Tesla
C1060 and on a GeForce 9600 GT (see Appendix A). The performance was
measured as the average CPU time to perform the 1000 matches over the N
conditions. The reported average performance takes into account (i) the time to
load each one of the 1000 inputs to be matched into the GPU; and (ii) the time
to move the result vector from the GPU to main CPU memory.

6.2 Interval-Based Conditions

Table 1 reports the average matching time for one condition using either the
CPU, a Tesla C1060 GPU, or a GeForce 9600 GT GPU, when (i) the number
of inputs n is 10, 100, or 1000; (ii) the generality is chosen in {0.25, 0.50, 0.75,
0.90, 0.95}; (iii) the population size N is either 1000 (Table 1a), 10000 (Table 1b),
or 100000 (Table 1c); and (iv) the representation is either row-based or column-based.
As expected, the Tesla C1060 is always faster than the GeForce 9600 GT which, on
the other hand, has only 512 MB of memory and cannot manage a large population
of 100000 classifiers (Table 1c). As anticipated, the column-based representation
results in superior performance on GPUs. However, on the CPU, the column-based
representation can be 10 times slower than its row-based counterpart. In fact, on
the CPU, row-based matching allows for the caching of contiguous data positions
(both condition bounds and inputs), which significantly speeds up the matching
process. In contrast, column-based matching accesses data positions in a scattered
way with respect to the storage, resulting in slower matching on CPUs.
2
Experiments have been performed on a machine with two quad-core Xeon processors
(2.66 GHz) and 8 GB of RAM running Linux Fedora Core 6.

Table 1. Time (ms) required to match 1000 instances when the problem consists of 10,
100 or 1000 real inputs; the population size N is (a) 1000, (b) 10000, and (c) 100000;
the population generality gen varies between 0.25 and 0.95; statistics are averages over
10 runs (each entry lists the mean followed by the standard deviation)
n gen CPUrow CPUcol TESLArow TESLAcol GF9600row GF9600col
10 0.25 0.029 0.002 0.032 0.001 0.047 0.000 0.045 0.000 0.046 0.002 0.044 0.002
10 0.50 0.032 0.002 0.035 0.002 0.047 0.000 0.045 0.000 0.052 0.002 0.052 0.006
10 0.75 0.032 0.003 0.037 0.003 0.047 0.000 0.045 0.000 0.056 0.001 0.054 0.004
10 0.90 0.030 0.003 0.034 0.002 0.047 0.000 0.045 0.000 0.059 0.002 0.055 0.002
10 0.95 0.029 0.002 0.033 0.002 0.047 0.000 0.045 0.000 0.059 0.001 0.055 0.001
100 0.25 0.159 0.010 0.170 0.008 0.162 0.001 0.147 0.000 0.246 0.005 0.206 0.006
100 0.50 0.207 0.011 0.216 0.014 0.167 0.000 0.147 0.000 0.297 0.001 0.252 0.010
100 0.75 0.237 0.004 0.250 0.006 0.171 0.000 0.147 0.000 0.350 0.010 0.295 0.011
100 0.90 0.257 0.005 0.268 0.003 0.174 0.001 0.147 0.000 0.381 0.017 0.315 0.002
100 0.95 0.265 0.009 0.277 0.009 0.175 0.000 0.147 0.000 0.384 0.005 0.325 0.007
1000 0.25 1.654 0.029 7.487 0.121 1.463 0.006 1.148 0.001 3.009 0.056 1.678 0.026
1000 0.50 2.154 0.026 9.423 0.094 1.588 0.005 1.148 0.000 3.678 0.051 1.982 0.037
1000 0.75 2.537 0.013 11.002 0.073 1.659 0.005 1.148 0.000 4.133 0.038 2.222 0.039
1000 0.90 2.719 0.022 11.815 0.065 1.694 0.004 1.148 0.000 4.374 0.028 2.362 0.025
1000 0.95 2.779 0.017 12.101 0.049 1.703 0.003 1.148 0.001 4.444 0.033 2.392 0.012

(a)
n gen CPUrow CPUcol TESLArow TESLAcol GF9600row GF9600col
10 0.25 0.282 0.009 0.321 0.009 0.136 0.001 0.091 0.001 0.258 0.006 0.204 0.005
10 0.50 0.312 0.017 0.353 0.015 0.153 0.001 0.091 0.001 0.316 0.002 0.241 0.001
10 0.75 0.300 0.003 0.345 0.018 0.164 0.001 0.091 0.001 0.365 0.004 0.272 0.005
10 0.90 0.284 0.008 0.324 0.011 0.168 0.001 0.091 0.001 0.390 0.007 0.291 0.006
10 0.95 0.277 0.009 0.318 0.013 0.169 0.001 0.091 0.001 0.399 0.005 0.294 0.004
100 0.25 2.442 0.041 3.651 0.094 0.935 0.004 0.367 0.001 2.244 0.009 1.468 0.008
100 0.50 2.663 0.046 4.106 0.077 1.067 0.004 0.368 0.000 2.829 0.007 1.878 0.005
100 0.75 2.799 0.039 4.506 0.080 1.155 0.002 0.369 0.001 3.307 0.008 2.207 0.005
100 0.90 2.869 0.026 4.643 0.068 1.202 0.002 0.368 0.001 3.576 0.006 2.395 0.009
100 0.95 2.884 0.013 4.701 0.067 1.218 0.001 0.369 0.001 3.657 0.005 2.448 0.003
1000 0.25 19.232 0.106 121.913 0.574 29.569 0.208 2.505 0.002 117.867 0.535 10.975 0.112
1000 0.50 23.703 0.083 158.491 0.753 41.444 0.132 2.512 0.001 164.823 0.680 12.382 0.211
1000 0.75 27.025 0.103 189.205 0.427 44.579 0.109 2.510 0.001 199.877 0.362 14.107 0.243
1000 0.90 28.793 0.094 205.408 0.642 44.872 0.076 2.511 0.001 216.719 0.288 14.912 0.324
1000 0.95 29.354 0.098 210.484 0.627 44.839 0.076 2.511 0.001 222.016 0.283 15.206 0.482
(b)
n gen CPUrow CPUcol TESLArow TESLAcol GF9600row GF9600col
10 0.25 3.042 0.016 5.598 0.092 1.079 0.002 0.628 0.001 2.355 0.009 1.795 0.008
10 0.50 3.290 0.022 5.891 0.119 1.242 0.003 0.630 0.002 2.951 0.007 2.180 0.011
10 0.75 3.265 0.017 5.733 0.086 1.353 0.002 0.632 0.002 3.433 0.007 2.487 0.015
10 0.90 3.073 0.025 5.592 0.077 1.396 0.001 0.630 0.002 3.694 0.013 2.659 0.031
10 0.95 2.983 0.028 5.536 0.062 1.407 0.001 0.630 0.002 3.772 0.003 2.699 0.007
100 0.25 29.947 0.086 42.500 0.124 9.171 0.010 3.329 0.001 21.414 0.091 13.847 0.053
100 0.50 31.040 0.076 47.296 0.149 10.499 0.007 3.340 0.001 27.100 0.084 17.631 0.104
100 0.75 31.086 0.093 51.373 0.187 11.329 0.010 3.341 0.001 32.071 0.036 20.926 0.074
100 0.90 30.946 0.061 53.472 0.231 11.757 0.007 3.341 0.001 34.769 0.043 22.711 0.035
100 0.95 30.849 0.060 54.254 0.205 11.900 0.005 3.341 0.001 35.654 0.026 23.304 0.056
1000 0.25 192.730 0.724 1253.090 4.225 329.946 0.878 23.186 0.075 - -
1000 0.50 236.868 0.827 1640.120 3.299 423.478 0.421 23.286 0.047 - -
1000 0.75 270.308 0.831 1950.790 3.422 445.080 0.353 23.306 0.054 - -
1000 0.90 287.616 0.792 2120.160 2.686 449.720 0.333 23.255 0.052 - -
1000 0.95 292.706 0.729 2177.270 2.225 450.194 0.289 23.299 0.067 - -
(c)

In small populations (i.e., when N = 1000), GPUs provide no speed-up on
very small problems, when n is 10 or 100. When more variables are involved
(n is 1000), GPUs achieve a rather limited speedup (Table 1a). In fact, when
we compare the fastest CPU implementation (CPUrow) against the fastest GPU
implementation (TESLAcol), we note a speedup of up to 2.42.
However, as the number of classifiers increases, GPUs scale up very well:
when N = 10000 the speedup ranges from 3 up to 11 in the largest problems
involving 1000 inputs; the speedup is even higher in huge populations containing
100000 classifiers, where it is between 4.8 and 12.56.
Note that the performance of the Tesla C1060 is not influenced by the classifiers'
generality. In contrast, the average match time for the CPU increases with the
classifiers' generality. This is not surprising and can be easily explained. In the
experiments performed, the classifiers' generality ranges between 0.25 and 0.95;
thus, at least one out of four classifiers will match. In the GPUs many matches
are run in parallel and the overall matching time depends on the slowest match.
Accordingly, even when classifier generality is 0.25 the overall matching time is
almost the same. However, this does not happen with the GeForce 9600 GT,
where the results are very similar to the ones of the CPU, i.e., matching time
increases with the classifiers' generality (as in [3]). This is due to the stricter
requirements that the GeForce 9600 GT poses on the memory access pattern.
To maximize parallelization with the GeForce 9600 GT, cores need to access
memory positions that are both contiguous and adequately aligned, whereas
the Tesla C1060 only poses constraints on the former. As more and more matches
are performed, the memory access pattern of the GeForce 9600 GT tends to
diverge (accessed memory positions become more and more misaligned), resulting
in a worsening of the overall performance.

6.3 Ternary Conditions

We repeated a similar set of experiments using ternary conditions. As previously
done, we used a CPU, a Tesla C1060 GPU and a GeForce 9600 GT GPU; the
number of binary inputs n was chosen in {32, 512, 1024, 4096, 10240}; the generality,
tuned by the parameter P#, was selected from {0.0, 0.25, 0.50, 0.75, 0.99,
1.0}; the population size N was either 1000, 10000, or 100000; we considered
both row-based and column-based representations. Table 2, Table 3, and Table 4

Table 2. Time (ms) required to match 1000 instances when the problem size is 32, 512,
1024, 4096 or 10240 bits, the population size N is 1000, and the population generality
gen is 0.00, 0.25, 0.50, 0.75, 0.99 or 1.0. Data are averages over 10 runs (each entry
lists the mean followed by the standard deviation)
n P# CPUrow CPUcol TESLArow TESLAcol GF9600row GF9600col
32 0.0 0.003 0.000 0.004 0.000 0.034 0.000 0.034 0.001 0.030 0.003 0.028 0.001
32 0.25 0.003 0.000 0.004 0.000 0.035 0.000 0.034 0.000 0.029 0.002 0.029 0.002
32 0.5 0.004 0.000 0.004 0.000 0.035 0.000 0.035 0.000 0.029 0.001 0.029 0.001
32 0.75 0.004 0.000 0.004 0.000 0.035 0.001 0.035 0.000 0.029 0.001 0.029 0.002
32 0.99 0.006 0.000 0.004 0.000 0.035 0.001 0.035 0.001 0.029 0.001 0.029 0.002
32 1.0 0.004 0.000 0.004 0.000 0.035 0.001 0.035 0.000 0.029 0.001 0.029 0.002
512 0.0 0.006 0.000 0.005 0.000 0.037 0.000 0.035 0.000 0.032 0.001 0.031 0.001
512 0.25 0.006 0.000 0.005 0.000 0.037 0.001 0.035 0.000 0.033 0.001 0.033 0.002
512 0.5 0.006 0.000 0.005 0.000 0.037 0.000 0.035 0.000 0.033 0.002 0.032 0.002
512 0.75 0.006 0.001 0.005 0.000 0.038 0.000 0.037 0.000 0.033 0.002 0.032 0.001
512 0.99 0.034 0.002 0.038 0.003 0.051 0.000 0.050 0.000 0.052 0.001 0.050 0.001
512 1.0 0.041 0.003 0.047 0.004 0.061 0.000 0.050 0.001 0.085 0.002 0.077 0.001
1024 0.0 0.007 0.001 0.006 0.000 0.038 0.001 0.037 0.001 0.035 0.001 0.035 0.002
1024 0.25 0.007 0.001 0.006 0.000 0.037 0.000 0.037 0.001 0.037 0.003 0.034 0.001
1024 0.5 0.007 0.001 0.006 0.000 0.038 0.001 0.037 0.000 0.036 0.002 0.036 0.003
1024 0.75 0.008 0.001 0.006 0.000 0.039 0.000 0.038 0.001 0.038 0.002 0.037 0.002
1024 0.99 0.037 0.002 0.037 0.002 0.064 0.000 0.066 0.000 0.066 0.004 0.063 0.001
1024 1.0 0.080 0.005 0.091 0.007 0.069 0.001 0.067 0.001 0.152 0.011 0.133 0.004
4096 0.0 0.016 0.001 0.014 0.001 0.047 0.001 0.046 0.001 0.058 0.003 0.055 0.003
4096 0.25 0.015 0.001 0.014 0.001 0.047 0.001 0.045 0.001 0.057 0.002 0.054 0.002
4096 0.5 0.015 0.001 0.013 0.001 0.046 0.001 0.045 0.001 0.058 0.002 0.053 0.000
4096 0.75 0.016 0.001 0.013 0.001 0.048 0.001 0.046 0.001 0.057 0.002 0.056 0.003
4096 0.99 0.047 0.003 0.045 0.002 0.086 0.001 0.088 0.001 0.102 0.005 0.093 0.002
4096 1.0 0.312 0.004 0.350 0.011 0.276 0.001 0.170 0.001 0.633 0.002 0.436 0.002
10240 0.0 0.031 0.002 0.029 0.002 0.061 0.002 0.061 0.002 0.096 0.002 0.094 0.004
10240 0.25 0.031 0.002 0.029 0.002 0.062 0.002 0.062 0.002 0.098 0.005 0.093 0.002
10240 0.5 0.031 0.002 0.029 0.002 0.062 0.002 0.061 0.002 0.099 0.005 0.095 0.004
10240 0.75 0.032 0.002 0.029 0.002 0.063 0.002 0.061 0.002 0.097 0.003 0.095 0.003
10240 0.99 0.067 0.004 0.064 0.004 0.101 0.002 0.103 0.002 0.142 0.004 0.133 0.003
10240 1.0 0.768 0.002 3.491 0.020 0.435 0.001 0.373 0.001 1.335 0.002 0.969 0.003

Table 3. Time (ms) required to match 1000 instances when the problem size is 32, 512,
1024, 4096 or 10240 bits, the population size N is 10000, and the population generality
gen is 0.00, 0.25, 0.50, 0.75, 0.99 or 1.0. Data are averages over 10 runs (each entry
lists the mean followed by the standard deviation).

n P# CPUrow CPUcol TESLArow TESLAcol GF9600row GF9600col


32 0.0 0.034 0.002 0.036 0.002 0.065 0.001 0.066 0.000 0.075 0.002 0.075 0.002
32 0.25 0.034 0.002 0.038 0.003 0.066 0.001 0.066 0.001 0.076 0.003 0.077 0.003
32 0.5 0.032 0.002 0.037 0.003 0.066 0.001 0.066 0.001 0.075 0.003 0.076 0.003
32 0.75 0.035 0.003 0.036 0.002 0.065 0.001 0.066 0.001 0.077 0.004 0.077 0.004
32 0.99 0.061 0.004 0.036 0.002 0.066 0.001 0.066 0.001 0.078 0.003 0.079 0.010
32 1.0 0.037 0.003 0.037 0.003 0.066 0.001 0.066 0.001 0.075 0.003 0.077 0.003
512 0.0 0.050 0.002 0.037 0.003 0.078 0.001 0.066 0.001 0.088 0.002 0.079 0.002
512 0.25 0.054 0.004 0.036 0.002 0.079 0.001 0.065 0.001 0.090 0.002 0.080 0.003
512 0.5 0.053 0.004 0.040 0.003 0.079 0.001 0.066 0.001 0.090 0.003 0.077 0.002
512 0.75 0.056 0.004 0.040 0.003 0.081 0.001 0.067 0.001 0.091 0.003 0.082 0.004
512 0.99 0.322 0.010 0.333 0.011 0.173 0.001 0.109 0.001 0.284 0.003 0.224 0.001
512 1.0 0.375 0.000 0.429 0.011 0.323 0.001 0.111 0.001 0.625 0.011 0.456 0.011
1024 0.0 0.072 0.010 0.040 0.003 0.075 0.001 0.067 0.001 0.095 0.003 0.084 0.003
1024 0.25 0.071 0.008 0.040 0.003 0.075 0.001 0.067 0.001 0.093 0.003 0.083 0.004
1024 0.5 0.067 0.007 0.039 0.003 0.075 0.001 0.066 0.001 0.098 0.010 0.084 0.004
1024 0.75 0.072 0.011 0.042 0.003 0.076 0.001 0.068 0.001 0.095 0.003 0.084 0.003
1024 0.99 0.359 0.007 0.381 0.016 0.194 0.001 0.131 0.001 0.330 0.007 0.260 0.002
1024 1.0 0.747 0.007 0.877 0.015 0.423 0.001 0.163 0.001 1.194 0.004 0.854 0.003
4096 0.0 0.728 0.008 0.048 0.004 0.090 0.001 0.076 0.001 0.126 0.003 0.102 0.003
4096 0.25 0.732 0.003 0.047 0.003 0.090 0.001 0.076 0.002 0.125 0.004 0.103 0.004
4096 0.5 0.730 0.004 0.049 0.003 0.091 0.002 0.076 0.002 0.126 0.003 0.101 0.003
4096 0.75 0.730 0.003 0.051 0.004 0.092 0.002 0.078 0.002 0.134 0.017 0.103 0.003
4096 0.99 1.582 0.067 0.427 0.009 0.255 0.001 0.164 0.002 0.450 0.002 0.300 0.002
4096 1.0 3.571 0.075 7.691 0.075 2.259 0.001 0.467 0.001 5.985 0.001 3.218 0.002
10240 0.0 0.360 0.003 0.060 0.003 0.100 0.002 0.090 0.002 0.197 0.005 0.140 0.003
10240 0.25 0.361 0.002 0.062 0.005 0.102 0.002 0.091 0.003 0.196 0.003 0.139 0.002
10240 0.5 0.362 0.004 0.061 0.004 0.102 0.002 0.090 0.002 0.195 0.001 0.142 0.007
10240 0.75 0.364 0.002 0.066 0.004 0.103 0.002 0.092 0.002 0.198 0.004 0.139 0.002
10240 0.99 1.483 0.013 0.452 0.009 0.268 0.003 0.178 0.001 0.610 0.007 0.341 0.005
10240 1.0 9.350 0.106 67.344 0.055 4.421 0.001 0.998 0.001 21.601 0.007 7.764 0.007

Table 4. Time (ms) required to match 1000 instances when the problem size is 32,
512, 1024, 4096 or 10240 bits, the population size N is 100000, and the population
generality gen is 0.00, 0.25, 0.50, 0.75, 0.99 or 1.0. Data are averages over 10 runs
(each entry lists the mean followed by the standard deviation).

n P# CPUrow CPUcol TESLArow TESLAcol GF9600row GF9600col


32 0.0 0.317 0.009 0.349 0.003 0.358 0.003 0.357 0.003 0.548 0.017 0.542 0.005
32 0.25 0.314 0.000 0.348 0.001 0.356 0.003 0.357 0.003 0.551 0.014 0.553 0.020
32 0.5 0.314 0.000 0.351 0.009 0.358 0.003 0.359 0.003 0.541 0.005 0.546 0.011
32 0.75 0.336 0.003 0.351 0.008 0.360 0.003 0.357 0.004 0.550 0.014 0.550 0.013
32 0.99 0.568 0.009 0.348 0.000 0.359 0.003 0.357 0.003 0.548 0.011 0.559 0.023
32 1.0 0.339 0.000 0.348 0.001 0.359 0.003 0.358 0.003 0.563 0.017 0.573 0.031
512 0.0 4.226 0.115 0.353 0.009 0.494 0.003 0.359 0.004 0.681 0.030 0.560 0.017
512 0.25 4.276 0.046 0.349 0.001 0.493 0.002 0.357 0.003 0.658 0.009 0.553 0.009
512 0.5 4.195 0.229 0.352 0.003 0.494 0.001 0.358 0.004 0.662 0.013 0.548 0.011
512 0.75 4.276 0.061 0.384 0.001 0.499 0.002 0.366 0.003 0.662 0.006 0.554 0.007
512 0.99 4.819 0.043 7.296 0.214 1.445 0.004 0.789 0.002 2.617 0.023 2.015 0.018
512 1.0 4.644 0.106 8.449 0.191 2.949 0.001 0.817 0.001 5.959 0.014 4.289 0.089
1024 0.0 5.255 0.090 0.350 0.000 0.440 0.002 0.360 0.003 0.674 0.019 0.550 0.011
1024 0.25 5.269 0.007 0.350 0.001 0.441 0.004 0.359 0.003 0.669 0.007 0.570 0.050
1024 0.5 5.244 0.088 0.352 0.001 0.437 0.001 0.359 0.003 0.649 0.004 0.534 0.003
1024 0.75 5.241 0.088 0.386 0.001 0.447 0.004 0.366 0.002 0.660 0.005 0.544 0.005
1024 0.99 9.374 0.013 9.308 0.010 1.589 0.005 0.979 0.001 2.905 0.005 2.257 0.010
1024 1.0 9.377 0.108 16.793 0.074 3.915 0.005 1.311 0.001 11.650 0.016 8.243 0.016
4096 0.0 7.333 0.010 0.361 0.001 0.517 0.001 0.365 0.003 0.809 0.010 0.561 0.012
4096 0.25 7.338 0.002 0.360 0.001 0.517 0.001 0.364 0.000 0.801 0.003 0.555 0.003
4096 0.5 7.332 0.008 0.363 0.002 0.517 0.001 0.364 0.000 0.803 0.003 0.559 0.014
4096 0.75 7.331 0.011 0.507 0.033 0.527 0.001 0.373 0.001 0.817 0.004 0.563 0.004
4096 0.99 19.103 0.012 9.451 0.014 1.999 0.003 1.082 0.002 3.924 0.008 2.411 0.007
4096 1.0 37.569 0.185 115.499 0.296 22.712 0.004 4.239 0.001 58.887 0.041 30.017 0.039
10240 0.0 5.711 0.021 0.391 0.004 0.737 0.001 0.378 0.001 - -
10240 0.25 5.710 0.019 0.392 0.004 0.737 0.001 0.379 0.002 - -
10240 0.5 5.718 0.014 0.396 0.001 0.737 0.001 0.379 0.001 - -
10240 0.75 5.727 0.036 0.566 0.007 0.744 0.001 0.386 0.001 - -
10240 0.99 19.537 0.074 9.496 0.018 2.392 0.006 1.095 0.004 - -
10240 1.0 93.529 0.208 690.014 0.985 68.851 0.020 9.764 0.003 - -

report the average matching time for one condition when N is 1000 (Table 2),
10000 (Table 3), and 100000 (Table 4), respectively.
The results confirm several of the previous findings. Column-based matching
outperforms row-based matching on GPUs. The Tesla C1060 is generally faster than
the GeForce 9600 GT, as expected. Again, in the smaller population, the CPU is
generally faster than both GPUs. In addition, also with 10000 classifiers, when only
32 or 512 binary inputs are considered (i.e., when conditions are represented
by one to 16 unsigned integers), the CPU is faster; as the population size or the
number of inputs increases, the GPUs outperform the CPU on larger problems.
When P# ≥ 0.99, the speedup provided by the Tesla C1060 with respect to the
column-based implementation on the CPU can be close to 50. Compared to
the row-based implementation on the CPU, the results show that the Tesla C1060
implementation outperforms the CPU on medium and large problems (when n > 512
and N ≥ 10000) with a speedup near 20.
As before, column-based matching outperforms row-based matching on GPUs.
However, while with interval-based conditions the CPU performed best with
the row-based implementation, in this case the column-based implementation always
performs better except when classifiers are fully general (i.e., P# = 1.0). To understand
this result, we need to consider the memory access patterns in the two implementations.
When P# is not very high, i.e., P# < 0.99, the probability of
matching is easily close to zero when more than a few dozen inputs are considered3.
Accordingly, the matching process is very likely to stop very early, before
the first 100 bits have been tested. This is why the average matching times of
classifiers with P# in the range [0, 0.75] are very close. As a result, with the
column-based representation only the small memory areas where the first bits
are stored are accessed. Thus, in this case, the cache locality is exploited across
the matching of several classifiers: as the matching involves few initial inputs
for each classifier, once the data is loaded for matching one classifier, it is
readily available for the following ones. In contrast, in the row-based implementation,
the pattern of memory accesses spreads over the whole memory where
the classifiers are allocated. On the other hand, when the classifiers are fully
general (i.e., when P# = 1.0) the matching involves all the inputs for all the
classifiers. Accordingly, the locality is fully exploited by the row-based implementation,
because it performs a sequential memory access pattern. In contrast,
in this case, the memory access pattern of the column-based representation is highly
inefficient.

7 Conclusions
In this paper, we studied GPU-based parallelization of the matching in learning
classifier systems for real inputs (using interval-based conditions) and binary

3
The probability of matching an input with n bits for a classifier generated with
a don't-care probability P# is ((1 + P#)/2)^n; thus, when P# = 0.75, the probability of
matching an input of size n = 100 is lower than 10^-5.
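This estimate can be checked numerically (a minimal sketch; the function name is ours):

```python
# A classifier bit generated with don't-care probability P# matches a random
# input bit with probability (1 + P#)/2, so an n-bit input matches with
# probability ((1 + P#)/2)**n.
def match_probability(p_hash: float, n: int) -> float:
    return ((1.0 + p_hash) / 2.0) ** n

# With P# = 0.75 and n = 100 the probability drops below 1e-5, which is why
# matching almost always stops within the first few tested bits.
print(match_probability(0.75, 100))
```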
Speeding Up Matching in Learning Classifier Systems Using CUDA 19

inputs (using ternary conditions). In particular, we applied NVIDIA's Compute
Unified Device Architecture (CUDA) to implement matching procedures that
could exploit the massive parallelism available in GPUs. Our results show
that on small problems, CPU-based matching is faster due to the transfer
overhead introduced by GPUs. However, as the problem size increases, the transfer
overhead becomes less significant with respect to the time gained through
parallelization. Accordingly, GPU-based matching significantly outperforms CPU-based
matching, providing a 3-12× speedup with the interval-based representation
and a 20-50× speedup with the ternary representation.

References
1. Butz, M.V.: XCS (+ tournament selection) classifier system implementation in C,
version 1.2. Technical Report 2003023, Illinois Genetic Algorithms Laboratory,
University of Illinois at Urbana-Champaign (2003)
2. Butz, M.V.: Kernel-based, ellipsoidal conditions in the real-valued XCS classifier
system. In: Beyer, H.-G., O'Reilly, U.-M. (eds.) GECCO, pp. 1835-1842. ACM,
New York (2005)
3. Butz, M.V., Lanzi, P.L., Llorà, X., Loiacono, D.: An analysis of matching in learning
classifier systems. In: Ryan, C., Keijzer, M. (eds.) Proceedings of the Genetic and
Evolutionary Computation Conference, GECCO 2008, Atlanta, GA, USA, July
12-16. ACM Press, New York (2008)
4. Butz, M.V., Lanzi, P.L., Wilson, S.W.: Hyper-ellipsoidal conditions in XCS: rotation,
linear approximation, and solution structure. In: Cattolico [6], pp. 1457-1464
5. Butz, M.V., Wilson, S.W.: An algorithmic description of XCS. Journal of Soft
Computing 6(3-4), 144-153 (2002)
6. Cattolico, M. (ed.): Proceedings of the Genetic and Evolutionary Computation
Conference, GECCO 2006, Seattle, Washington, USA, July 8-12. ACM, New York
(2006)
7. Dorigo, M., Colombetti, M.: Robot Shaping: An Experiment in Behavior Engineering.
MIT Press/Bradford Books (1998)
8. Goldberg, D.E.: Genetic Algorithms in Search, Optimization, and Machine Learning.
Addison-Wesley, Reading (1989)
9. Holland, J.H.: Escaping Brittleness: The Possibilities of General-Purpose Learning
Algorithms Applied to Parallel Rule-Based Systems. In: Mitchell, Michalski, Carbonell
(eds.) Machine Learning, an Artificial Intelligence Approach, vol. II, ch. 20,
pp. 593-623. Morgan Kaufmann, San Francisco (1986)
10. Holland, J.H., Reitman, J.S.: Cognitive systems based on adaptive algorithms
(1978); Reprinted in: Fogel, D.B. (ed.): Evolutionary Computation. The Fossil
Record. IEEE Press, Los Alamitos (1998) ISBN: 0-7803-3481-7
11. Josuttis, N.M.: The C++ Standard Library: A Tutorial and Reference. Addison-Wesley
Professional, Reading (1999)
12. Lanzi, P.L.: The XCS library (2002)
13. Lanzi, P.L., Wilson, S.W.: Using convex hulls to represent classifier conditions. In:
Cattolico [6], pp. 1481-1488
14. Llorà, X., Sastry, K.: Fast rule matching for learning classifier systems via vector
instructions. In: Cattolico [6], pp. 1513-1520
15. Stone, C., Bull, L.: For real! XCS with continuous-valued inputs. Evolutionary
Computation 11(3), 298-336 (2003)
20 P.L. Lanzi and D. Loiacono

16. Wilson, S.W.: ZCS: A zeroth level classifier system. Evolutionary Computation
2(1), 1-18 (1994), http://prediction-dynamics.com/
17. Wilson, S.W.: Classifier Fitness Based on Accuracy. Evolutionary Computation
3(2), 149-175 (1995)
18. Wilson, S.W.: Get real! XCS with continuous-valued inputs. In: Lanzi, P.L., Stolzmann,
W., Wilson, S.W. (eds.) IWLCS 1999. LNCS (LNAI), vol. 1813, pp. 209-222.
Springer, Heidelberg (2000)
19. Wilson, S.W.: Mining oblique data with XCS. In: Lanzi, P.L., Stolzmann, W., Wilson,
S.W. (eds.) IWLCS 2000. LNCS, vol. 1996, pp. 158-176. Springer, Heidelberg
(2001)

A Device Specifications
Table 5. Specification of GeForce 9600GT

Major revision number: 1


Minor revision number: 1
Total amount of global memory: 536608768 bytes
Number of multiprocessors: 8
Number of cores: 64
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.60 GHz
Concurrent copy and execution: Yes

Table 6. Specification of Tesla C1060

Major revision number: 1


Minor revision number: 3
Total amount of global memory: 4294705152 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.30 GHz
Concurrent copy and execution: Yes
Evolution of Interesting Association Rules
Online with Learning Classifier Systems

Albert Orriols-Puig1 and Jorge Casillas2

1 Grup de Recerca en Sistemes Intel·ligents
La Salle - Universitat Ramon Llull
Quatre Camins 2, 08022 Barcelona, Spain
aorriols@salle.url.edu
2 Dept. Computer Science and Artificial Intelligence
University of Granada
18071 Granada, Spain
casillas@ugr.es

Abstract. This paper presents CSar, a Michigan-style learning classifier
system designed to extract quantitative association rules from streams
of unlabeled examples. The main novelty of CSar with respect to existing
association rule miners is that it evolves its knowledge online and
is thus prepared to adapt quickly and efficiently to changes in the variable
associations hidden in the stream of unlabeled data.
The results provided in this paper show that CSar is able to evolve interesting
rules on problems that contain both categorical and continuous
attributes. Moreover, the comparison of CSar with Apriori on a problem
that consists only of categorical attributes highlights the competitiveness
of CSar with respect to more specific learners that perform enumeration
to return all possible association rules. These promising results encourage
us to investigate CSar further.

1 Introduction
Association rule mining [2] aims at extracting interesting associations among the
attributes (i.e., associations that occur with a certain frequency and strength)
of repositories of unlabeled data. Research on association rule mining
originally focused on extracting rules that identified strong relationships
between the occurrence of two or more attributes or items in collections of binary
data, e.g., if item X occurs then item Y will also occur [2,3,14]. Later on, several
researchers concentrated on extracting association rules from data described
by continuous attributes [10,22], which posed new challenges to the field. Several
algorithms applied a discretization method in advance to transform
the original data into binary values [16,18,22,24] and then used a binary association
rule miner. This led to further research on designing discretization procedures
that avoid losing useful information. Other approaches mined interval-based
association rules and permitted the algorithm to independently move the interval

J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 21-37, 2010.
© Springer-Verlag Berlin Heidelberg 2010

bound of each rule's variables [17]. Also, fuzzy modeling was introduced to create
fuzzy association rules (e.g., see [13,15]).
Association rules are widely used in areas such as telecommunication
networks, market and risk management, and inventory control. All these applications
are characterized by generating data online, so that data may be made
available in the form of streams [1,19]. Nonetheless, all the aforementioned algorithms
were designed for static collections of data. Learning from data streams
has received special attention in the last few years, particularly in
supervised learning [1,19]. However, few proposals for online binary association
rule miners can be found in the literature, and most of them are only able to
deal with problems with categorical attributes (e.g., see [23]).
In this paper, we address the problem of mining association rules from streams
of examples online. We propose a learning classifier system (LCS) whose architecture
is inspired by XCS [25,26] and UCS [6], which we refer to as the classifier
system for association rule mining (CSar). CSar uses an interval-based representation
for evolving quantitative association rules from data with continuous
attributes and a discrete representation for categorical attributes. The system
receives a stream of unlabeled examples which are used to create new rules and
to tune the parameters of the existing ones, with the aim of evolving as many
interesting rules as possible. CSar is first compared with Apriori [3] on a problem
defined only by categorical attributes. The results on this problem indicate that
CSar can evolve rules of similar interest to those created by Apriori, one of the
most referenced algorithms in the association rule mining realm, which considers all
the possible combinations of attribute values to create all interesting association
rules (notice that this approach can only be used in domains with categorical
data). The experimentation is then extended by considering a collection of
real-world problems and by analyzing the behavior of different configurations of
CSar on these problems. The results show that CSar is able to create highly
supported and interesting interval-based association rules in which the intervals
have not been prefixed by a discretization algorithm.
The remainder of this paper is organized as follows. Section 2 introduces the
basic concepts of association rules and reviews the main proposals in the literature
for both binary and quantitative association rule mining. Section 3 describes
our proposal in detail. Section 4 explains the methodology followed in
the experiments, and Section 5 analyzes the results of these experiments. Finally,
Section 6 summarizes, concludes, and outlines future work.

2 Framework

Before proceeding with the description of our proposal, this section introduces
some important concepts of association rules. We first describe the problem of
extracting association rules from categorical data. Then, we extend the problem
to mining association rules from data with continuous attributes and review
different proposals that can be found in the literature.

2.1 Association Rule Mining

The problem of association rule mining was first defined over binary data in [2]
as follows. Let I = {i_1, i_2, . . . , i_ℓ} be a set of binary attributes called items. Let
T be a set of transactions, where each transaction t is represented as a binary
vector of length ℓ. Each position i of t indicates whether the item i_i is present
(t_i = 1) or not (t_i = 0) in the transaction. X is an itemset if X ⊆ I. An itemset
X has a support supp(X) which is computed as

supp(X) = |X(T)| / |T| ,   (1)

That is, the support is the number of transactions in the database which contain the
itemset X, |X(T)|, divided by the total number of transactions in the database,
|T|. An itemset is said to be a frequent itemset if its support is greater than a
user-set threshold, typically referred to as minsupp in the literature.
Then, an association rule R is an implication of the form X ⇒ Y, where both
X and Y are itemsets and X ∩ Y = ∅. Typically, association rules are assessed
with two quality measures, their support (supp) and their confidence (conf).
The support of a rule is defined as the ratio of the support of the union of antecedent
and consequent to the number of transactions in the database, i.e.,

supp(R) = supp(X ∪ Y) / |T| .   (2)

The confidence is computed as the ratio of the support of the union of antecedent
and consequent to the support of the antecedent, i.e.,

conf(R) = supp(X ∪ Y) / supp(X) .   (3)

Therefore, support indicates the frequency of occurring patterns, and confidence
evaluates the strength of the implication denoted by the association rule.
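Eqs. (1)-(3) can be illustrated on a toy transaction set (a sketch with our own function names; since support() already divides by |T|, the |T| factors cancel in the confidence ratio):

```python
# Toy transactions: each transaction is the set of items it contains.
transactions = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}]

def support(itemset, transactions):
    # Eq. (1): fraction of transactions containing every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Eq. (3): supp(X U Y) / supp(X).
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"a", "b"}, transactions))       # 2/4 = 0.5
print(confidence({"a"}, {"b"}, transactions))  # 0.5 / 0.75 = 2/3
```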
Since the proposal of AIS [2], the first algorithm to mine association rules from
categorical examples, several algorithms have been designed to perform this task.
Agrawal et al. [3] presented the Apriori algorithm, probably the most influential
categorical association rule miner. This work resulted in several papers which
designed modifications of the initial Apriori algorithm (e.g., see [8,21]).
All these algorithms used the same methodology as Apriori to mine association
rules, which basically consists of two different phases: (1) identification of all
frequent itemsets (i.e., all itemsets whose support is greater than minsupp),
and (2) generation of association rules from these frequent itemsets.
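The two phases can be sketched as a brute-force miner (our own code, not the actual Apriori implementation: real Apriori prunes candidate itemsets level by level instead of enumerating all subsets):

```python
from itertools import combinations

def mine_rules(transactions, minsupp, minconf):
    # Phase 1: find frequent itemsets (brute force over all subsets here).
    items = sorted(set().union(*transactions))
    n = len(transactions)
    freq = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = sum(set(combo) <= t for t in transactions) / n
            if s >= minsupp:
                freq[frozenset(combo)] = s
    # Phase 2: split each frequent itemset into antecedent => consequent and
    # keep the rules whose confidence reaches minconf. Every antecedent is a
    # subset of a frequent itemset, so its support is already in freq.
    rules = []
    for itemset, s in freq.items():
        for k in range(1, len(itemset)):
            for ante in map(frozenset, combinations(sorted(itemset), k)):
                conf = s / freq[ante]
                if conf >= minconf:
                    rules.append((set(ante), set(itemset - ante), s, conf))
    return rules

transactions = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}]
print(len(mine_rules(transactions, minsupp=0.5, minconf=0.6)))  # 6
```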

2.2 Quantitative Association Rules

Early research in the realm of association rules only addressed the problem of
extracting association rules from binary data. Therefore, these types of rules
only permitted reflecting whether particular items were present in the transaction,
but they did not consider their quantities. Later on, researchers focused

on algorithms that were able to extract association rules from databases that
contained quantitative attributes.
Srikant and Agrawal [22] designed an Apriori-like approach to mine quantitative
association rules. The authors used equi-depth partitioning to transform
continuous attributes into categorical attributes. Moreover, the authors identified
the problem of the sharp boundary between discrete intervals, which highlighted
that quantitative mining algorithms may either ignore or over-emphasize the
items that lie near the boundaries of intervals. Attempting to address this problem,
several authors applied different clustering mechanisms to extract the best
possible intervals from the data [16,18]. A completely different approach was
taken in [17], where a genetic-algorithm-based technique was used to evolve
interval-based association rules without applying any discretization procedure
to the variables. The GA was responsible for creating new promising association
rules and for evolving the intervals of the variables of the association rules. The
problem associated with creating variables with unbounded intervals is that, in
general, the support of small intervals is smaller than the support of large intervals,
which pushes the system to create rules with large intervals that cover nearly
the whole domain. To avoid this, the system penalized the fitness of rules that had
large intervals. A similar approach was followed in [20], where the authors proposed
a framework in which finding good intervals from which interesting association
rules could be extracted was addressed as an optimization problem.
As done in [17,20], CSar does not apply any discretization mechanism to the
original data, and interval bounds are evolved by the genetic procedure. The
main novelty of our proposal is that association rules are not mined from static
databases but from streams of examples. This characteristic guides some parts
of the algorithm design, which is described in detail in the next section.

3 Description of CSar
CSar is a Michigan-style LCS for mining interval-based association rules from
data that contain both quantitative and categorical attributes. The learning
architecture of CSar is inspired by UCS [6] and XCS [25,26]. CSar aims at evolving
populations of interesting association rules, i.e., rules with large support and
confidence. For this purpose, CSar evaluates a set of association rules online and
evolves this rule set by means of a steady-state genetic algorithm (GA) [11,12]
that is applied to population niches. In what follows, a detailed description of the
system is provided, focusing on the differences in the knowledge representation
and learning process with respect to those of XCS and UCS.

3.1 Knowledge Representation


CSar evolves a population of classifiers [P], where each classifier consists of a
quantitative association rule and a set of parameters. The quantitative association
rule is represented as

if x_i ∈ v_i and . . . and x_j ∈ v_j then x_k ∈ v_k ,



where the antecedent is represented by a set of a input variables x_i, . . ., x_j
(0 < a < ℓ, 0 ≤ i < ℓ, and 0 ≤ j < ℓ, where ℓ is the number of variables of the
problem) and the consequent contains a single variable x_k. Note that we permit
rules to have an arbitrary number of variables in the antecedent, but we only
enable them to have a single variable in the consequent. Restricting the number
of consequent variables to one aims at simplifying the creation of niches (see
next subsection).
For quantitative attributes, a representation similar to that of XCSR [27] is used,
in which both antecedent and consequent variables are represented by the
interval of values to which the variable applies, i.e., v_i = [l_i, u_i]. A maximum
interval length maxInt is set to avoid having large intervals that contain nearly
all the possible values of a given variable; therefore, ∀i : u_i - l_i ≤ maxInt.
Categorical attributes are represented by one of the possible categorical values
x_ij, i.e., v_i = x_ij. A rule matches an input example if, for all the variables in the
antecedent and consequent of the rule, the corresponding value of the example
is either included in the interval defined for a continuous variable or equal to the
value defined for a categorical variable.
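The matching test just described can be sketched as follows (the data layout is our own assumption):

```python
# A condition set maps a variable index to either an interval (for continuous
# attributes) or a single categorical value. A rule matches an example when
# every listed variable is inside its interval or equal to its value.
def matches(conditions, example):
    for i, cond in conditions.items():
        if cond[0] == "interval":
            _, low, up = cond
            if not (low <= example[i] <= up):
                return False
        else:  # ("cat", value)
            if example[i] != cond[1]:
                return False
    return True

rule_conditions = {0: ("interval", 0.2, 0.6), 2: ("cat", "red")}
print(matches(rule_conditions, [0.4, 7.0, "red"]))  # True
print(matches(rule_conditions, [0.9, 7.0, "red"]))  # False
```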
Each classifier has seven main parameters: (1) the support supp, i.e., the occurring
frequency of the rule; (2) the confidence conf, which indicates the strength
of the implication; (3) the fitness F, which denotes the quality of the given rule;
(4) the experience exp, which counts the number of times that the antecedent of
the rule has matched an input instance; (5) the consequent matching sum cm,
which counts the number of times that the whole rule has matched an input
instance; (6) the numerosity num, which reckons the number of copies of the
classifier in the population; and (7) the time of creation of the classifier, tcreate.
The next subsection explains how the classifiers are created and evolved and
how their parameters are updated.

3.2 Learning Process Organization

At each learning iteration, CSar receives an input example (e_1, e_2, . . ., e_ℓ). Then,
the system creates the match set [M] with all the classifiers in the population that
match the input example. If [M] contains fewer than mna classifiers, the covering
operator is triggered to create as many new matching classifiers as required to
have mna classifiers in [M]. Then, the classifiers in [M] are organized into association
set candidates following one of the two methodologies explained below. Each
association set is given a probability of being selected that is proportional to the
average confidence of the classifiers that belong to this association set. The selected
association set [A] is checked for subsumption with the aim of diminishing
the number of rules that express similar associations among variables. Then, the
parameters of all the classifiers in [M] are updated. At the end of the iteration,
a GA is applied to the selected association set if the average time since the last
application of the GA to the classifiers of the selected association set is greater
than θGA (θGA is a user-set parameter). Finally, for each continuous attribute,
we maintain a list with no repeated elements that stores the last few values
seen for the attribute (in our experiments we stored the last hundred different

values). This list is used by the mutation operator with the aim of preventing the
existence of intervals that cover the same examples but are slightly different. In
what follows, we provide details about (1) the covering operator, (2) the procedures
to create association set candidates, (3) the association set subsumption mechanism,
and (4) the parameter update procedure. The next section explains the
discovery component in more detail. It is worth noting that some of the operators are
similar to those of several existing systems, such as the ones described in [5,9].

Covering Operator. The purpose of the covering operator is to feed classifiers
that denote interesting associations among variables into the population. Given
the sampled input example e, the covering operator creates a new matching
classifier as follows. Each variable is selected with probability 1 - P# to belong
to the rule's antecedent, with the restriction that, at the end of this process,
at least one variable has to be selected. The values of the selected variables
are initialized differently depending on the type of attribute. For categorical
attributes, the variable is initialized to the corresponding input value e_i. For
continuous attributes, the interval [l_i, u_i] that represents the variable is obtained
by generalizing the input value e_i, i.e.,

l_i = e_i - rand(maxInt/2) and   (4)

u_i = e_i + rand(maxInt/2),   (5)

where maxInt is the maximum interval length. Finally, one of the previously
unselected variables is randomly chosen to form the consequent of the rule,
which is initialized following the same procedure. Note that the association rule
created is supported by, at least, the sampled example.
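The interval initialization of Eqs. (4)-(5) can be sketched as (function name and layout are ours):

```python
import random

# Seed an interval around the sampled value e_i, as in Eqs. (4)-(5):
# each bound moves away from e_i by a random amount in [0, maxInt/2],
# so the interval always covers e_i and never exceeds maxInt in length.
def cover_continuous(e_i, max_int, rng=random):
    low = e_i - rng.uniform(0, max_int / 2)
    up = e_i + rng.uniform(0, max_int / 2)
    return low, up

low, up = cover_continuous(0.5, max_int=0.2)
assert low <= 0.5 <= up and up - low <= 0.2 + 1e-12
```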

Creation of Association Set Candidates. The aim of creating association
set candidates, or niches, is to group rules that express similar associations in
order to establish a competition among them and so let the best ones take over their
niche. Whilst the creation of these niches of similar rules is quite immediate in
reinforcement learning [25] and classification [6] tasks, several approaches could
be used to form groups of similar rules in association rule mining. Herein, we
propose two alternatives, which are guided by different heuristics:

Grouping by antecedent. This strategy considers that two rules are similar
if they have exactly the same variables in their antecedent, regardless of
their corresponding values v_i. Therefore, this grouping strategy creates Na
association set candidates, where Na is the number of rules in [M] with
different variables in the antecedent. Each association set contains rules that
have exactly the same variables in the antecedent. The underlying idea is
that rules with the same antecedent may express similar knowledge. Note
that, under this strategy, rules with different variables in the consequent can
be grouped in the same association set.
Grouping by consequent. This strategy groups in the same association set
the classifiers in [M] that have the same variable in the consequent with

equivalent values. We consider that two continuous variables are equivalent
if their intervals overlap and that two categorical variables are equivalent
if they have the same categorical value. For this purpose, the following
process is used. The rules in [M] are sorted in ascending order according to the
variable that they have in their consequent. Given two rules r_1 and r_2 that
have the same variable in the consequent, we consider that r_1 is smaller than
r_2 if

l_1 < l_2 or (l_1 = l_2 and u_1 > u_2)   if continuous attribute,
ord(x_1) < ord(x_2)   if categorical attribute,

where l_1, l_2, u_1, and u_2 are the lower and upper bounds of the consequent
variable of r_1 and r_2 for a continuous attribute, x_1 and x_2 are the
values of the consequent variable for a categorical attribute, and ord(x_i)
maps each categorical value to a numeric value. It is worth noting that, given
two continuous variables with the same lower bound, we sort
first the rule with the most general variable (i.e., the rule with the larger u_i).
We take this approach with the aim of forming association set candidates
with the largest number of overlapping classifiers by using the procedure
explained as follows.
Once [M] has been sorted, the association set candidates are built as follows.
At the beginning, an association set candidate is created and the first
classifier in [M] is added to it. Then, the following classifier is added if it has
the same variable in the consequent and its lower bound is smaller than the
minimum upper bound of the classifiers in the association set. This process
is repeated until the first classifier that violates this condition is found. In
that case, a new association set candidate is created, and the same process
is applied to add new classifiers to this association set. The underlying idea
of this association set strategy is that rules that explain the same region of
the consequent may denote the same associations among variables.

The cost of both methodologies for creating the association sets is dominated by the
cost of sorting the population. We applied a quicksort strategy for this purpose,
which has a cost of O(n log n), where n is the match set size.
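The sort-and-scan grouping just described can be sketched as (our own data layout; rules are (consequent variable, l, u) triples):

```python
# Group rules by consequent: sort by (variable, lower bound asc, upper bound
# desc), then start a new group whenever the next rule's lower bound is not
# below the minimum upper bound of the current group (i.e., it no longer
# overlaps every rule already in the group).
def group_by_consequent(rules):
    rules = sorted(rules, key=lambda r: (r[0], r[1], -r[2]))
    groups = []
    for rule in rules:
        var, low, up = rule
        if groups and groups[-1][0][0] == var and low < min(r[2] for r in groups[-1]):
            groups[-1].append(rule)   # still overlaps the whole group
        else:
            groups.append([rule])     # start a new association set candidate
    return groups

groups = group_by_consequent([(0, 0.0, 0.5), (0, 0.3, 0.8), (0, 0.6, 0.9)])
print([len(g) for g in groups])  # [2, 1]
```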

Association Set Subsumption. A subsumption mechanism inspired by the
one presented in [26] was designed with the aim of reducing the number of
different rules that express the same or similar knowledge. The process works as
follows. Each rule of the selected association set is checked for subsumption against
each other rule in the same association set. A rule r_i is a candidate subsumer of
r_j if it satisfies the following three conditions: (1) r_i has higher confidence and
it is experienced enough (i.e., conf_i > conf_0 and exp_i > θexp, where conf_0 and
θexp are user-set parameters); (2) all the variables in the antecedent of r_i are
also present in the antecedent of r_j and both rules have the same variable in the
consequent (r_j can have more variables in the antecedent than r_i); and (3) r_i

is more general than r_j. A rule r_i is more general than r_j if all the input and
output variables of r_i are also defined in r_j, each categorical variable of r_i
has the same value as the corresponding variable in r_j, and the interval [l_i, u_i] of
each continuous variable in r_i includes the interval [l_j, u_j] of the corresponding
variable in r_j (i.e., l_i ≤ l_j and u_i ≥ u_j).
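The generality test can be sketched as (our own rule encoding):

```python
# A rule's variables are encoded as a dict: var -> ("interval", l, u) or
# ("cat", value). r_i is more general than r_j if every variable of r_i also
# appears in r_j with an enclosed interval or an identical categorical value.
def more_general(ri, rj):
    for var, cond in ri.items():
        if var not in rj:
            return False
        other = rj[var]
        if cond[0] == "interval":
            if not (cond[1] <= other[1] and cond[2] >= other[2]):
                return False
        elif cond != other:
            return False
    return True

print(more_general({0: ("interval", 0.0, 1.0)},
                   {0: ("interval", 0.2, 0.8), 1: ("cat", "red")}))  # True
print(more_general({1: ("cat", "blue")}, {1: ("cat", "red")}))       # False
```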

Parameter Update. At the end of each learning iteration, the parameters of
all the classifiers that belong to the match set are updated. First, we increment
the experience of the classifier. Next, we increment the consequent matching
estimate cm if the rule's consequent also matches the input example. These two
parameters are used to update the support and confidence of rule i as follows.
Support is computed as

supp_i = cm_i / (ctime - tcreate_i),   (6)

where ctime is the time of the current iteration and tcreate_i is the iteration in
which classifier i was created. Then, the confidence is computed as

conf_i = cm_i / exp_i.   (7)

Lastly, the fitness of each rule i in [M] is updated with the following formula

F_i = (conf_i · supp_i)^ν,   (8)

where ν is a user-set parameter that permits controlling the pressure toward
highly fit classifiers. Note that with this fitness computation, the system presses
toward the evolution of rules with not only high confidence but also
high support. We empirically tested computing the fitness only from conf, but
preliminary experiments indicated that CSar could obtain a larger variety of
interesting association rules if support was included in the fitness computation.
Finally, the association set size estimate of all the rules that belong to the
selected association set is updated. Each rule maintains the average size of all the
association sets in which it has participated.
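The update of Eqs. (6)-(8) can be sketched as (names are ours; each call assumes the rule's antecedent matched the current example):

```python
# Online bookkeeping for one rule: exp counts antecedent matches, cm counts
# whole-rule matches; support, confidence, and fitness follow Eqs. (6)-(8).
class RuleStats:
    def __init__(self, tcreate):
        self.exp = 0
        self.cm = 0
        self.tcreate = tcreate

    def update(self, whole_rule_matches, ctime, nu=1.0):
        self.exp += 1
        if whole_rule_matches:
            self.cm += 1
        supp = self.cm / (ctime - self.tcreate)      # Eq. (6)
        conf = self.cm / self.exp                    # Eq. (7)
        fitness = (conf * supp) ** nu                # Eq. (8)
        return supp, conf, fitness

r = RuleStats(tcreate=0)
for t in range(1, 5):
    supp, conf, fitness = r.update(whole_rule_matches=(t % 2 == 1), ctime=t)
print(supp, conf, fitness)  # 0.5 0.5 0.25
```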

3.3 Discovery Component

CSar uses a steady-state niched GA to discover new promising rules. The GA is
applied to the selected association set [A]. Therefore, niching is intrinsically
provided, since the GA is applied to rules that are similar according to one of
the heuristics for association set formation.
The GA is triggered when the average time since its last application to
the classifiers in [A] exceeds the threshold θGA. It selects two parents p_1 and p_2
from [A] using proportionate selection [11], where the probability of selecting a
classifier k is

p_sel^k = F_k / Σ_{i∈[A]} F_i .   (9)

The two parents are copied into offspring ch_1 and ch_2, which undergo crossover
and mutation if required.
The system applies uniform crossover with probability Pχ. First, it considers
each variable in the antecedent of both rules. If only one parent has the variable,
one child is randomly selected and the variable is copied to this child. If
both parents contain the variable, the variable is copied to each offspring. The
procedure ensures that, at the end of the process, each offspring has at least
one input variable. Then, the rule consequent is crossed by adding to the first
offspring the consequent of one of the parents (which is randomly selected) and
adding to the remaining offspring the consequent of the other parent.
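The proportionate selection of Eq. (9) amounts to a roulette wheel over the fitnesses in [A] (a sketch; names are ours):

```python
import random

# Roulette-wheel selection: draw a point in [0, sum of fitnesses] and walk
# the cumulative fitness until it is reached, so classifier k is chosen with
# probability F_k / sum_i F_i, as in Eq. (9).
def roulette_select(association_set, rng=random):
    total = sum(f for _, f in association_set)
    threshold = rng.uniform(0, total)
    acc = 0.0
    for rule, fitness in association_set:
        acc += fitness
        if threshold <= acc:
            return rule
    return association_set[-1][0]  # guard against floating-point drift

rng = random.Random(42)
picks = [roulette_select([("a", 1.0), ("b", 3.0)], rng) for _ in range(1000)]
print(picks.count("b") / len(picks))  # roughly 0.75
```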
Three types of mutation can be applied to a rule: (1) introduction/removal
of antecedent variables (with probability P_I/R), (2) mutation of variable values
(with probability Pμ), and (3) mutation of the consequent variable (with
probability P_C). The first type of mutation randomly chooses whether a new
antecedent variable has to be added to the rule or one of the antecedent variables has
to be removed from it. If a variable has to be added, one of the non-existing
variables is randomly selected and added to the rule. This operation can only be
applied if the rule does not already have all the possible variables. If a variable has to
be removed, one of the existing variables is randomly selected and removed from the
rule. This operation can only be applied if the rule has at least two variables in
the antecedent. The second type of mutation selects one of the existing variables
of the rule and mutates its value. For continuous variables, two random amounts
ranging in [-m_0, m_0] are added to the lower bound and the upper bound respectively,
where m_0 is a user-set parameter. If the interval surpasses the maximum
length or the lower bound becomes greater than the upper bound, the interval
is repaired. Finally, the lower and upper bounds of the mutated variable
are set to the closest values in the list of seen values for this variable.
This process is applied to avoid having rules in the population with very similar
interval bounds in their variables, since keeping all of them may not only provide no
additional knowledge, but also hinder human experts from reading the whole
population. For categorical variables, a new value for the variable is randomly
selected. The last type of mutation randomly selects one of the variables in the
antecedent and exchanges it with the output variable.
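The continuous-value mutation with repair and snapping can be sketched as (our own function; the list of seen values stands in for the per-attribute list described above):

```python
import random

# Perturb both interval bounds by random amounts in [-m0, m0], repair the
# interval if it inverts or exceeds maxInt, then snap each bound to the
# closest recently seen value for that attribute.
def mutate_interval(low, up, m0, max_int, seen_values, rng=random):
    low += rng.uniform(-m0, m0)
    up += rng.uniform(-m0, m0)
    if low > up:
        low, up = up, low            # repair an inverted interval
    if up - low > max_int:
        up = low + max_int           # repair an over-long interval
    snap = lambda v: min(seen_values, key=lambda s: abs(s - v))
    return snap(low), snap(up)       # snapping preserves low <= up

seen = [0.0, 0.25, 0.5, 0.75, 1.0]
low, up = mutate_interval(0.3, 0.6, m0=0.1, max_int=0.5, seen_values=seen)
assert low <= up and low in seen and up in seen
```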
After crossover and mutation, the new offspring are introduced into the population.
First, each classifier is checked for subsumption [26] against its parents.
To decide whether a parent can subsume an offspring, the same procedure explained
for association set subsumption is followed. If a parent is identified as a possible
subsumer of the offspring, the offspring is not inserted and the numerosity
of the parent is increased by one. Otherwise, we check [A] for the most general
rule that can subsume the offspring. If no subsumer can be found, the classifier
is inserted into the population.
If the population is full, excess classifiers are deleted from [P] with probability
proportional to their association set size estimate as. Moreover, if a classifier
k is sufficiently experienced (exp_k > θdel) and its fitness F_k is significantly

lower than the average tness of the classiers in [P] (F k < F[P ] where F[P ] =

i[P ] F ), its deletion probability is further increased. That is, each classier
1 i
N
has a deletion probability pk of
dk
pk =  , (10)
j[P ] dj
where
 asnumF[P ]
if expk > del and F k < F[P ]
dk = Fk (11)
as num otherwise.
Thus, the deletion algorithm balances the classier allocation in the dierent
association sets by pushing toward the deletion of rules belonging to large correct
sets. At the same time, it favors the search toward highly t classiers, since the
deletion probability of rules whose tness is much smaller than the average tness
is increased.
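The deletion scheme of Eqs. (10) and (11) amounts to a roulette wheel over deletion votes. A minimal sketch follows; the `Rule` fields and the parameter names `theta_del` and `delta` are our own labels for the quantities in the equations.

```python
import random
from dataclasses import dataclass

@dataclass
class Rule:
    fitness: float
    as_size: float      # association set size estimate (as)
    numerosity: int = 1
    experience: int = 0

def deletion_vote(rule, mean_fitness, theta_del=50, delta=0.1):
    """Eq. (11): the vote of an experienced, low-fitness rule is scaled
    up by the ratio of the mean fitness to its own fitness."""
    vote = rule.as_size * rule.numerosity
    if rule.experience > theta_del and rule.fitness < delta * mean_fitness:
        vote *= mean_fitness / rule.fitness
    return vote

def delete_one(population, theta_del=50, delta=0.1):
    """Eq. (10): remove one rule with probability p_k = d_k / sum_j d_j."""
    mean_f = sum(r.fitness for r in population) / len(population)
    votes = [deletion_vote(r, mean_f, theta_del, delta) for r in population]
    pick = random.uniform(0, sum(votes))
    acc = 0.0
    for i, vote in enumerate(votes):
        acc += vote
        if pick <= acc:
            return population.pop(i)
    return population.pop()
```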

3.4 Rule Set Reduction


At the end of the learning process, the final rule set is processed to provide
the user with only interesting rules. For this purpose, we apply the following
reduction mechanism. First, we remove all rules whose experience is smaller
than θ_exp (θ_exp is a user-set parameter). Then, each rule is checked against each
other rule for subsumption, following the same procedure used for association rule
subsumption but with the following exception: now, a rule r_i is a candidate
subsumer for r_j if r_i and r_j have the same variables in their antecedent and
consequent, r_i is more general than r_j, and r_i has higher confidence than r_j. Note
that, during learning, the subsumption mechanism requires that the confidence
of r_i be greater than conf_0.

After applying the rule set reduction mechanism, we make sure that the final
population consists of different rules. Other policies can easily be incorporated
into this process, such as removing rules whose support and confidence are
below a predefined threshold. Nonetheless, in our experiments we return all the
experienced rules in the final population that are not subsumed by any other.
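The reduction mechanism can be sketched as follows. The dict-based rule representation is an assumption made for the sketch, not CSar's actual data structure.

```python
def is_more_general(ri, rj):
    """ri is more general than rj if every antecedent interval of ri
    contains the corresponding interval of rj."""
    return all(ri['antecedent'][v][0] <= rj['antecedent'][v][0] and
               ri['antecedent'][v][1] >= rj['antecedent'][v][1]
               for v in rj['antecedent'])

def reduce_rule_set(rules, theta_exp=1000):
    """Drop inexperienced rules, then drop every rule for which a
    candidate subsumer exists: same antecedent variables, same
    consequent, more general, and with higher confidence."""
    experienced = [r for r in rules if r['experience'] >= theta_exp]
    kept = []
    for r in experienced:
        has_subsumer = any(
            other is not r and
            set(other['antecedent']) == set(r['antecedent']) and
            other['consequent'] == r['consequent'] and
            is_more_general(other, r) and
            other['confidence'] > r['confidence']
            for other in experienced)
        if not has_subsumer:
            kept.append(r)
    return kept
```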
This section has described the mechanisms that CSar uses to evolve a
population of interesting association rules online. Unlike other quantitative
association-rule miners, CSar is characterized by a maximum population size
that limits the number of different interesting association rules that can exist
in the final population. The system organizes rules into different association
sets and uses a GA to make rules in the same association set compete.
Therefore, CSar does not aim at returning all possible association rules, but
at providing the user with a population of limited size containing phenotypically
different and interesting association rules.

4 Experimental Methodology

Having carefully described the system, we are now in a position to experimentally
analyze the behavior of CSar. The aim of the experimental analysis
Evolution of Interesting Association Rules 31

Table 1. Properties of the data sets. The columns describe: the identifier of the data
set (Id.); the name of the data set (dataset); the number of instances (#Inst); the total
number of features (#Fea); the number of real features (#Re); the number of integer
features (#In); and the number of nominal features (#No).

Id. dataset #Inst #Fea #Re #In #No


adl Adult 48841 15 0 6 9
ann Annealing 898 39 6 0 33
aud Audiology 226 70 0 0 70
aut Automobile 205 26 15 0 11
bpa Bupa 345 7 6 0 1
col Horse colic 368 23 7 0 16
gls Glass 214 10 9 0 1
h-s Heart-s 270 14 13 0 1
irs Iris 150 5 4 0 1
let Letter recognition 20000 17 0 16 1
pim Pima 768 9 8 0 1
tao Tao 1888 3 2 0 1
thy Thyroid 215 6 5 0 1
wdbc Wisc. diagnose breast-cancer 569 31 30 0 1
wne Wine 178 14 13 0 1
wpbc Wisc. prognostic breast-cancer 198 34 33 0 1

was to (1) study whether CSar could actually evolve a set of interesting associa-
tion rules, and (2) examine the behavior of the system under different configurations.
With these objectives in mind, we performed the following two experiments.

Our first concern was to analyze whether CSar could evolve the most inter-
esting association rules despite having a fixed population size. Therefore,
we compared CSar with Apriori [3], probably the most influential association
rule miner, on the zoo problem [4]. We selected the zoo problem for this analysis
since Apriori only works on problems described by categorical attributes, and
the zoo problem satisfies this requirement. More specifically, the zoo problem is
defined by (1) fifteen binary attributes which indicate whether the animal has
each of a total of fifteen characteristics, such as whether it has a tail or hair, and (2) two
categorical attributes that can take more than two values and which represent
the number of legs and the type of animal.

Secondly, we studied the impact of using the two different procedures
to create association rule candidates and of using progressively bigger max-
imum intervals. For this purpose, we ran CSar (1) with both antecedent- and
consequent-grouping strategies to create association set candidates and (2) with
different maximum interval lengths on a collection of real-world problems ex-
tracted from the UCI repository [4] and from local repositories [7]. The charac-
teristics of these problems are reported in Table 1.
In all runs, CSar employed the following configuration: num iterations =
100 000, popSize = 6 400, conf_0 = 0.95, ν = 10, θ_mna = 10, {θ_del, θ_GA} = 50, θ_exp
= 1000, P_χ = 0.8, {P_I/R, P_μ, P_C} = 0.1, m_0 = 0.2. Association set subsumption
was activated in all runs.

5 Analysis of the Results

With the aims of the experiments in mind, in what follows we discuss the
experimental results.

5.1 Ability of CSar to Discover Interesting Rules

In order to study the ability of CSar to extract interesting association rules,
we first compared the system with Apriori on a problem with only categorical
attributes, the zoo problem. CSar was run with both the antecedent-grouping and
consequent-grouping strategies. As we wanted to analyze the interestingness of
the rules created by the systems, we report the number of rules with differ-
ent minimum supports and confidences obtained by CSar with the two group-
ing strategies (see Figure 1). The same information is reported for Apriori in
Figure 2; however, in this case, the resulting rules of Apriori have been filtered.

[Figure: two panels plotting the number of rules against minimum support, one
curve per minimum confidence threshold from 0.05 to 0.95; panel (a) antecedent
grouping, panel (b) consequent grouping.]

Fig. 1. Number of rules evolved with minimum support and confidence for the zoo
problem with (a) antecedent-grouping and (b) consequent-grouping strategies. The
curves are averages over five runs with different random seeds.

[Figure: number of rules against minimum support, one curve per minimum
confidence threshold from 0.05 to 0.95.]

Fig. 2. Number of rules created by Apriori with minimum support and confidence for
the zoo problem. Lower confidence and support are not shown since Apriori creates all
possible combinations of attributes, exponentially increasing the number of rules.

Table 2. Comparison of the number of rules evolved by CSar with antecedent- and
consequent-grouping strategies to form the association set candidates with the number
of rules evolved by Apriori at high support and confidence values

                 antecedent grouping       consequent grouping       Apriori
Support  Conf:   0.4      0.6      0.8     0.4     0.6     0.8       0.4   0.6   0.8
0.40             275±30   271±27   230±23  65±10   63±9    59±9      2613  2514  2070
0.50             123±4    123±4    106±3   61±8    61±8    58±8      530   523   399
0.60             58±2     58±2     51±4    51±8    51±8    47±7      118   118   93
0.70             21±1     21±1     19±1    19±2    19±2    18±2      30    30    27
0.80             2±0      2±0      2±0     2±0     2±0     2±0       2     2     2
0.90             0±0      0±0      0±0     0±0     0±0     0±0       0     0     0
1.00             0±0      0±0      0±0     0±0     0±0     0±0       0     0     0

That is, Apriori is a two-phase algorithm that exhaustively explores the whole fea-
ture space, discovers all the itemsets with a minimum predefined support, and
creates all the possible rules from these itemsets. Therefore, some of the rules
supplied by Apriori are included in other rules. We consider that a rule r1 is
included in another rule r2 if r1 has, at least, the same variables with the same
values in the rule antecedent and the rule consequent as r2 (r1 may have more
variables). In the results provided herein, we removed from the final population
all the rules that were included in other rules. Thus, we provide an upper bound
on the number of different rules that can be generated.
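The inclusion test used to filter Apriori's output can be sketched as follows; the dict-based rule representation is our own assumption.

```python
def is_included(r1, r2):
    """r1 is included in r2 if r1 contains at least the same
    variable-value pairs as r2 in both antecedent and consequent
    (r1 may have extra variables)."""
    return (all(r1['antecedent'].get(v) == val
                for v, val in r2['antecedent'].items()) and
            all(r1['consequent'].get(v) == val
                for v, val in r2['consequent'].items()))

def filter_included(rules):
    """Keep only rules that are not included in any other rule."""
    return [r1 for r1 in rules
            if not any(r2 is not r1 and is_included(r1, r2) for r2 in rules)]
```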
Two important observations can be made from these results. First, the re-
sults clearly show that Apriori creates a higher number of rules than CSar
(for the sake of clarity, Table 2 specifies the number of rules for support values
ranging from 0.4 to 1.0 and confidence values of {0.4, 0.6, 0.8}). This behavior
was expected, since CSar has a limited population size, while Apriori returns
all possible association rules. Nevertheless, it is worth noting that CSar and
Apriori found exactly the same number of highly interesting rules; that is, both
systems discovered two rules with both confidence and support higher than 0.8.
This highlights the robustness of CSar, whose mechanisms guide the system to
discover the most interesting rules.

Second, focusing on the results reported in Figure 1, we can see that the
populations evolved with the antecedent-grouping strategy are larger than those
built with the consequent-grouping strategy. This behavior will also be present,
and discussed in more detail, in the extended experimental analysis conducted
in the next subsection.

5.2 Study of the Behavior of CSar

After showing that CSar can create highly interesting association rules in a
case-study problem characterized by categorical attributes, we now extend the
experimentation by running the system on 16 real-world data sets. We ran the
system with (1) the antecedent-grouping and consequent-grouping strategies and (2)

Table 3. Average (± standard deviation of the) number of rules with support and
confidence greater than 0.60 created by CSar with antecedent- and consequent-grouping
strategies and with maximum interval sizes of MI = {0.10, 0.25, 0.50}. The average and
standard deviation are computed on five runs with different random seeds.

       antecedent                         consequent
       MI=0.10   MI=0.25   MI=0.50        MI=0.10   MI=0.25   MI=0.50
adl    135±3     294±15    567±66         46±1      74±3      147±23
ann    1736±133  1765±79   1702±135       478±86    525±112   489±34
aud    2206±80   2017±147  1999±185       1014±12   982±100   880±215
aut    84±14     192±7     710±106        25±6      58±3      188±6
bpa    11±4      174±15    365±42         17±2      100±4     123±22
col    134±14    188±7     377±64         180±13    191±7     198±8
gls    33±4      160±17    694±26         23±2      89±6      205±23
h-s    28±1      61±4      248±32         13±1      29±1      92±13
irs    0±0       0±0       50±5           0±0       0±0       28±8
let    0±0       113±17    991±40         0±0       103±6     205±13
pim    4±1       93±9      570±51         3±0       53±5      154±25
tao    0±0       0±0       8±1            0±0       0±0       5±2
thy    46±2      152±4     350±27         29±2      80±3      160±2
wdbc   0±0       419±43    1143±131       0±0       145±17    304±16
wne    116±9     273±48    536±34         26±3      65±9      137±17
wpbc   0±0       0±0       740±234        0±0       0±0       264±34

allowing intervals of maximum length maxInt = {0.1, 0.25, 0.5} for continuous
variables. Note that by using different grouping strategies we change the
way the system creates association set candidates; therefore, as competition
is held among rules within the same association set, the resulting rules can
differ in the two cases. On the other hand, an increasingly larger maximum interval
length for continuous variables enables the system to obtain more general rules.
Table 3 reports the number of rules, with confidence and support greater than
or equal to 0.6, created by the different configurations of CSar. All the reported
results are averages of five runs with different random seeds.
Comparing the results obtained with the two grouping schemes, we
can see that the antecedent-grouping strategy yielded larger populations than
the consequent-grouping strategy, on average. This behavior was expected, since
antecedent grouping creates smaller association sets and thus maintains
more diversity in the population. Nonetheless, a closer examination of the final
populations indicates that the difference in the final number of rules decreases if
we consider only the rules with the highest confidence and support. For example,
considering all the rules with confidence and support greater than or equal to
0.60, the antecedent-grouping strategy results in populations 2.16 times bigger than
those of the consequent-grouping strategy. However, considering only the rules
with confidence and support greater than or equal to 0.85, the average difference
in population size is reduced to a factor of 1.12. This indicates that a large proportion
of the most interesting rules are discovered by both strategies. It is therefore worth
highlighting that the lower number of rules evolved by the consequent-
grouping strategy can be considered an advantage, since this strategy avoids
creating and maintaining uninteresting rules in the population, which implies a
lower computational time to evolve the population.
Focusing on the impact of varying the interval length, the results indicate that
for lower maximum interval lengths CSar tends to evolve rules with less support.
This behavior can easily be explained as follows. Large maximum interval lengths
enable the existence of highly general rules, which have higher support.
Moreover, if both antecedent and consequent variables are maximally general,
rules will also have high confidence. Taking this idea to the extreme, rules that
contain variables whose intervals range from the minimum value to the maximum
value of the variable will have maximum confidence and support. Nonetheless,
these rules will be uninteresting for human experts. On the other hand, small
interval lengths may result in more interesting association rules, though too
small lengths may yield rules that denote strong associations but have little
support. This highlights a trade-off in the setting of this parameter, which should
be adjusted for each particular problem. As a rule of thumb, similarly to what can
be done with other association rule miners, the practitioner may start with
small interval lengths and increase them if rules with enough support for the
particular domain are not obtained.
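The rule of thumb above can be sketched as a simple tuning loop. This is purely illustrative: `run_csar`, the candidate lengths, and the acceptance thresholds are stand-ins, not part of CSar itself.

```python
def tune_interval_length(run_csar, lengths=(0.10, 0.25, 0.50),
                         min_support=0.6, min_rules=10):
    """Start with a small maximum interval length and enlarge it until
    the run yields enough sufficiently supported rules. `run_csar`
    maps a maximum interval length to a list of (support, confidence)
    pairs for the evolved rules."""
    strong = []
    for max_int in lengths:
        rules = run_csar(max_int)
        strong = [r for r in rules if r[0] >= min_support]
        if len(strong) >= min_rules:
            return max_int, strong
    return lengths[-1], strong
```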

6 Summary, Conclusion, and Further Work

In this paper, we presented CSar, a Michigan-style LCS designed to evolve
quantitative association rules. The experiments conducted in this paper have
shown that the method holds promise for online extraction of both categorical
and quantitative association rules. Results with the zoo problem indicated
that CSar was able to create interesting categorical rules, similar
to those built by Apriori. Experiments with a collection of real-world problems
also pointed out the capability of CSar to extract quantitative association rules
and served to analyze the behavior of different configurations of the system.
These results encourage us to study the system further, with the aim of apply-
ing CSar to mine quantitative association rules from new challenging real-world
problems.
Several future lines of work can be followed in light of the present work. First,
we aim at comparing CSar with other quantitative association rule miners to
see if the online architecture can extract knowledge similar to that obtained by
approaches that pass several times through the learning data set. Moreover,
the online architecture of CSar makes the system suitable for mining association
rules from changing environments with concept drift [1]; and we think that the
existence of concept drift may be a common trait in many real-world problems
to which association rules have historically been applied, such as profile mining
from customer information. Therefore, it would be interesting to analyze how
CSar adapts to domains in which variable associations change over time.

Acknowledgements

The authors thank the support of the Ministerio de Ciencia y Tecnología under
projects TIN2008-06681-C06-01 and TIN2008-06681-C06-05, the Generalitat de
Catalunya under Grant 2005SGR-00302, and the Andalusian Government under
grant P07-TIC-3185.

References

1. Aggarwal, C. (ed.): Data streams: Models and algorithms. Springer, Heidelberg (2007)
2. Agrawal, R., Imielinski, T., Swami, A.: Mining association rules between sets of
items in large databases. In: Proceedings of the ACM SIGMOD International Con-
ference on Management of Data, Washington D.C., pp. 207–216 (May 1993)
3. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large
databases. In: Bocca, J.B., Jarke, M., Zaniolo, C. (eds.) Proceedings of the 20th
International Conference on Very Large Data Bases, VLDB, Santiago, Chile, pp.
487–499 (September 1994)
4. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository, University of
California (2007), http://www.ics.uci.edu/~mlearn/MLRepository.html
5. Bacardit, J., Krasnogor, N.: Fast rule representation for continuous attributes in
genetics-based machine learning. In: GECCO 2008: Proceedings of the 10th Annual
Conference on Genetic and Evolutionary Computation, pp. 1421–1422. ACM, New
York (2008)
6. Bernadó-Mansilla, E., Garrell, J.M.: Accuracy-based learning classifier systems:
Models, analysis and applications to classification tasks. Evolutionary Computa-
tion 11(3), 209–238 (2003)
7. Bernadó-Mansilla, E., Llorà, X., Garrell, J.M.: XCS and GALE: A comparative
study of two learning classifier systems on data mining. In: Lanzi, P.L., Stolzmann,
W., Wilson, S.W. (eds.) IWLCS 2001. LNCS (LNAI), vol. 2321, pp. 115–132.
Springer, Heidelberg (2002)
8. Cai, C.H., Fu, A.W.-C., Cheng, C.H., Kwong, W.W.: Mining association rules with
weighted items. In: International Database Engineering and Application Sympo-
sium, pp. 68–77 (1998)
9. Divina, F.: Hybrid Genetic Relational Search for Inductive Learning. PhD thesis,
Department of Computer Science, Vrije Universiteit, Amsterdam, The Netherlands
(2004)
10. Fukuda, T., Morimoto, Y., Morishita, S., Tokuyama, T.: Mining optimized asso-
ciation rules for numeric attributes. In: PODS 1996: Proceedings of the Fifteenth
ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems,
pp. 182–191. ACM, New York (1996)
11. Goldberg, D.E.: Genetic algorithms in search, optimization & machine learning,
1st edn. Addison-Wesley, Reading (1989)
12. Holland, J.H.: Adaptation in natural and artificial systems. The University of
Michigan Press (1975)
13. Hong, T.P., Kuo, C.S., Chi, S.C.: Trade-off between computation time and number
of rules for fuzzy mining from quantitative data. International Journal of Uncer-
tainty, Fuzziness, and Knowledge-Based Systems 9(5), 587–604 (2001)

14. Houtsma, M., Swami, A.: Set-oriented mining of association rules. Technical Report
RJ 9567, Almaden Research Center, San Jose, California (October 1993)
15. Kaya, M., Alhajj, R.: Genetic algorithm based framework for mining fuzzy associ-
ation rules. Fuzzy Sets and Systems 152(3), 587–601 (2005)
16. Lent, B., Swami, A.N., Widom, J.: Clustering association rules. In: Proceedings of
the IEEE International Conference on Data Engineering, pp. 220–231 (1997)
17. Mata, J., Alvarez, J.L., Riquelme, J.C.: An evolutionary algorithm to discover
numeric association rules. In: SAC 2002: Proceedings of the 2002 ACM Symposium
on Applied Computing, pp. 590–594. ACM, New York (2002)
18. Miller, R.J., Yang, Y.: Association rules over interval data. In: SIGMOD 1997:
Proceedings of the 1997 ACM SIGMOD International Conference on Management
of Data, pp. 452–461. ACM, New York (1997)
19. Núñez, M., Fidalgo, R., Morales, R.: Learning in environments with unknown dy-
namics: Towards more robust concept learners. Journal of Machine Learning Re-
search 8, 2595–2628 (2007)
20. Salleb-Aouissi, A., Vrain, C., Nortet, C.: QuantMiner: A genetic algorithm for
mining quantitative association rules. In: Veloso, M.M. (ed.) Proceedings of the
2007 International Joint Conference on Artificial Intelligence, pp. 1035–1040 (2007)
21. Savasere, A., Omiecinski, E., Navathe, S.: An efficient algorithm for mining as-
sociation rules in large databases. In: Proceedings of the 21st VLDB Conference,
Zurich, Switzerland, pp. 432–443 (1995)
22. Srikant, R., Agrawal, R.: Mining quantitative association rules in large relational
tables. In: Jagadish, H.V., Mumick, I.S. (eds.) Proceedings of the 1996 ACM
SIGMOD International Conference on Management of Data, Montreal, Quebec,
Canada, pp. 1–12 (1996)
23. Wang, C.-Y., Tseng, S.-S., Hong, T.-P., Chu, Y.-S.: Online generation of association
rules under multidimensional consideration based on negative border. Journal of
Information Science and Engineering 23, 233–242 (2007)
24. Wang, K., Tay, S.H.W., Liu, B.: Interestingness-based interval merger for numeric
association rules. In: Proceedings of the 4th International Conference on Knowledge
Discovery and Data Mining, KDD, pp. 121–128. AAAI Press, Menlo Park (1998)
25. Wilson, S.W.: Classifier fitness based on accuracy. Evolutionary Computation 3(2),
149–175 (1995)
26. Wilson, S.W.: Generalization in the XCS classifier system. In: 3rd Annual Conf.
on Genetic Programming, pp. 665–674. Morgan Kaufmann, San Francisco (1998)
27. Wilson, S.W.: Get real! XCS with continuous-valued inputs. In: Lanzi, P.L., Stolz-
mann, W., Wilson, S.W. (eds.) IWLCS 1999. LNCS (LNAI), vol. 1813, pp. 209–219.
Springer, Heidelberg (2000)
Coevolution of Pattern Generators and
Recognizers

Stewart W. Wilson

Prediction Dynamics, Concord MA 01742 USA


Department of Industrial and Enterprise Systems Engineering
The University of Illinois at Urbana-Champaign IL 61801 USA
wilson@prediction-dynamics.com

Abstract. Proposed is an automatic system for creating pattern generators
and recognizers that may provide new and human-independent
insight into the pattern recognition problem. The system is based on a
three-cornered coevolution of image-transformation programs.

1 Introduction

Pattern recognition is a very difficult problem for computer science. A major
reason is that in many cases pattern classes are not well-specified, frustrating the
design of algorithms (including learning algorithms) to identify or discriminate
them. Intrinsic specification (via formal definition) is often impractical: consider
the class consisting of hand-written letters "A". Extrinsic specification (via finite
sets of examples) has problems of generalization and over-fitting.
Many interesting pattern classes are hard to specify because they exist only
in relation to human or animal brains. Humans employ mental processes such as
scaling, point-of-view adjustment, contrast and texture interpretation, saccades,
etc., permitting classes to be characterized very subtly. It is likely that truly
powerful computer pattern recognition methods will need to employ all such
techniques, which is not generally the case today. In this paper we are concerned
mainly with human-related pattern classes.
A further challenge for pattern recognition research is to create problems with
large sets of examples that can be learned from. An automatic pattern generator
would be valuable, but it should be capable of producing examples of each class
that are diverse and subtle as well as numerous.
This paper proposes an automatic pattern generation and recognition process,
and speculates that it would shed light on both the formal characterization prob-
lem and recognition techniques. The process would permit unlimited generation
of examples and very great flexibility of methods, by relying on competitive and
cooperative coevolution of pattern generators and recognizers.
The paper is organized into a first part in which the pattern recognition
problem is discussed in greater detail; a second part in which the competitive
and cooperative method is explained in concept; and a third part containing
suggestions for a specific implementation.

J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 38–46, 2010.
© Springer-Verlag Berlin Heidelberg 2010

2 Pattern Recognition Problem


The following is a viewpoint on the pattern recognition problem and what makes
it difficult. Let us first see some examples of what are generally regarded as
patterns.

Characters, such as letters and numerals. Members of a class can differ in
numerous ways, including placement in the field of view, size, orientation, shape,
thickness, contrast, constituent texture, distortion including angle of view, noise
of construction, and masking noise, among others.

Patterns in time series, such as musical phrases, price data configurations,
and event sequences. Members of a class can differ in time-scale, shape, intensity,
texture, etc.

Natural patterns, such as trees, landscapes, terrestrial features, and cloud
patterns. Members of a class can differ in size, shape, contrast, color, texture,
etc.

Circumstantial patterns, such as situations, moods, plots. Members of a class
can differ along a host of dimensions that are themselves often hard to define.
This sampling illustrates the very high diversity within even ordinary pattern
classes and suggests that identifying a class member while differentiating it from
members of other classes should be very difficult indeed. Yet human beings learn
to do it, and apparently quite easily. While that of course has been pointed out
before, we note two processes which may play key roles: transformation and
context.

Transformative processes would include, among others, centering an object of
interest in the field of view via saccades, i.e., translation, and scaling it to a size
appropriate for further steps. Contextual processes would include adjusting the
effective brightness (of a visual object) relative to its background, and seeing a
textured object as in fact a single object on a differently textured background.
It is clear that contextual processes are also transformations, and that viewpoint
will be taken here.
A transformational approach to pattern recognition would imply a sequence
in which the raw stimulus is successively transformed to a form that permits it
to be matched against standard or iconic exemplars, or produces a signal that
is associated with a class. Human pattern recognition is generally rapid, and
its steps are not usually conscious, except in difficult cases or in initial learning.
However, people asked for the reasons behind a particular recognition will often cite
transformational steps like those above that allow the object to be interpreted
to some standard form. For this admittedly informal reason, transformations are
emphasized in the algorithms proposed here.

It is possible to provide a more formal framework. Pattern recognition can be
viewed as a process in which examples are mapped to classes. But the mappings
are complicated. They are unlike typical functions that map vectors of elements
into, e.g., reals. In such a function, each element has a definite position in the
vector (its index). Each position can be thought of as a place, and there is a
value there. An ordinary function is thus a mapping of values in places into
an outcome. Call it a place/value (PV) mapping. If you slide the values along
the places, or expand them from a point, the outcome is generally completely
different. The function depends on just which values are in which places.

Patterns, on the other hand, are relative place/relative value (RPRV) map-
pings. Often, a given instance can be transformed into another instance with
the same outcome by a transformation that maintains the relative places
or values of the elements; examples include such transformations as scaling, trans-
lation, rotation, contrast, even texture. The RPRV property, however, makes
pattern recognition very difficult for machine learning methods that attach ab-
solute significance to input element positions and values.
There is considerable work on relative-value, or relational, learning systems,
e.g., in classifier systems [5,4] and in reinforcement learning generally [1]. But
for human-related pattern classes, what seems to be required is a method that
is intrinsically able to deal with both relative value and relative place. This
suggests that the method must be capable of transformations, both of its input
and in subsequent stages. The remainder of the paper lays out one proposal for
achieving this.
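The PV/RPRV distinction can be made concrete with a toy example. The particular mappings are entirely our own illustration: the PV mapping weights absolute positions, while the RPRV mapping depends only on differences between neighboring elements.

```python
def pv_value(x):
    """A place/value mapping: the outcome depends on which value
    sits in which absolute position."""
    weights = [3, 1, 0, 0, 0, 0]
    return sum(w * v for w, v in zip(weights, x))

def rprv_value(x):
    """A relative place/relative value mapping: it depends only on
    differences between neighboring elements, so it is unchanged by
    sliding the pattern along the places or by adding a constant to
    every value."""
    diffs = [b - a for a, b in zip(x, x[1:])]
    return max(diffs)  # characterize the pattern by its largest upward step

original = [0, 5, 1, 0, 0, 0]
shifted = [0, 0, 0, 5, 1, 0]          # same shape, slid two places right
brighter = [v + 2 for v in original]  # same shape, all values raised
```

Shifting the pattern changes the PV outcome but leaves the RPRV outcome intact, which is exactly why methods that attach absolute significance to positions and values struggle with pattern classes.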

3 Let the Computer Do It

Traditionally, pattern recognition research involves choosing a domain, creating
a source of exemplars, and trying learning algorithms that seem likely to work
in that domain. Here, however, we are looking broadly at human-related pattern
recognition, or relative place/relative value mappings (Sec. 2). Such a large task
calls for an extensive source of pattern examples. It also calls for experimentation
with a very wide array of transformation operators. Normally, for practicality,
one would narrow the domain and the choice of operators. Instead, we want to
leave both as wide as possible, in hopes of achieving significant generality. While
it changes the problem somewhat, there fortunately appears to be a way of doing
this by allowing the computer itself to pose and solve the problem.
Imagine a kind of communication game (Figure 1). A sender, or source, S,
wants to send messages to a friend F. The messages are in English, and the
letters are represented in binary by ASCII bytes. As long as F can decode bytes
to ASCII (and knows English), F will understand S's messages. But there is also
an enemy E that sees the messages and is not supposed to understand them.

S and F decide to encrypt the messages. But instead of encrypting prior to con-
version to bits, or encrypting the resulting bit pattern, they decide to encrypt each
bit. That is, E's problem is to tell which bits are 1s and which 0s. If E can do that,
the messages will be understandable. Note that F also must decrypt the bits.

For this peculiar setup, S and F agree that when S intends to send a 0, S will
send a variant of the letter A; for a 1, S will send a variant of B. S will produce
these variants using a generation program. Each variant of A created will in
general be different; similarly for B. F will know that 0 and 1 are represented
[Figure: S transmits messages to F; E intercepts them.]

Fig. 1. S sends messages to F that are sniffed by E

by variants of A and B, respectively, and will use a recognition program to
tell which is which. E, also using a recognition program, knows only that the
messages are in a binary code but does not know anything about how 0s and 1s
are represented.
In this setup, S s objective is to send variants of As and Bs that F will
recognize but E will not recognize. The objectives of both F and E are to
recognize the letters; for this F has some prior information that E does not
have. All the agents will require programs: S for generation and F and E for
recognition. The programs will be evolved using evolutionary computation. Each
agent will maintain its own population of candidate programs. The overall system
will carry out a coevolution [2] in which each agent attempts to evolve the best
program consistent with its objectives.
Evolution requires a fitness measure, which we need to specify for each of
the agents. For each bit transmitted by S, F either recognizes it or does not,
and E either recognizes it or does not. S's aim is for F to recognize correctly
but not E; call this a success for S. A simple fitness measure for an S program
would be the number of its successes divided by a predetermined number of
transmissions, T, assuming that S sends 0s and 1s with equal probability. A
success for F, as well as for E, would be a correct recognition. A simple fitness
measure for their programs would be the number of correct recognitions, again
divided by T transmissions.
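These fitness measures can be sketched in a few lines; `play_round`, `fitness_over_T`, and the program interfaces are illustrative assumptions, not Wilson's specification:

```python
import random

def play_round(gen, f_rec, e_rec):
    """One transmission: S picks a random bit and generates a variant;
    F and E each guess the bit. Returns (s_success, f_success, e_success),
    where S succeeds only if F guesses right and E guesses wrong."""
    bit = random.randint(0, 1)
    signal = gen(bit)          # S's generation program produces a variant
    f_ok = (f_rec(signal) == bit)
    e_ok = (e_rec(signal) == bit)
    return (f_ok and not e_ok), f_ok, e_ok

def fitness_over_T(gen, f_rec, e_rec, T=100):
    """Simple fitness for each agent: successes divided by T transmissions."""
    s = f = e = 0
    for _ in range(T):
        s_ok, f_ok, e_ok = play_round(gen, f_rec, e_rec)
        s += s_ok; f += f_ok; e += e_ok
    return s / T, f / T, e / T
```

With a perfect F and a hopeless E, S's fitness is 1.0 by construction, matching the "success for S" definition above.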
S's population would consist of individuals, each of which consists of a generation
program. To send a bit, S picks an individual, randomly¹ decides whether
to send a 0 or a 1, then, as noted above, generates a variant of A for 0, or of B
for 1, the variant differing each time the program is called.
The system determines whether the transmission was a success (for S). After
a total of T transmissions using a given S individual, its fitness is updated. F
and E each have populations of individual recognition programs. Like S, after T
recognition attempts using a population individual, its fitness is updated based
on its number of successes.
The testing of individuals could be arranged so that for each transmission,
individuals from the S, F, and E populations would be selected at random.
Or an individual from S could be used for T successive transmissions with F
¹ For our purposes, the bits need not encode natural language.
42 S.W. Wilson

and E individuals still randomly picked on each transmission. Various testing


schemes are possible. Selection, reproduction, and genetic operations would occur
in a population at intervals long enough so that the average individual gets
adequately evaluated.
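The random-pairing testing scheme might be sketched as follows; `coevolve`, the per-individual success tallies, and the stubbed reproduction step are illustrative assumptions:

```python
import random

def coevolve(s_pop, f_pop, e_pop, play_round, T=50, generations=10):
    """For each transmission, one individual is drawn at random from each of
    the S, F, and E populations; fitness is the success fraction over its
    evaluations. `play_round(s, f, e)` must return a (s_success, f_success,
    e_success) triple. Selection and reproduction are left as a stub."""
    for _ in range(generations):
        stats = {id(ind): [0, 0]
                 for pop in (s_pop, f_pop, e_pop) for ind in pop}
        for _ in range(T * max(len(s_pop), len(f_pop), len(e_pop))):
            s, f, e = (random.choice(p) for p in (s_pop, f_pop, e_pop))
            for ind, ok in zip((s, f, e), play_round(s, f, e)):
                stats[id(ind)][0] += ok   # successes
                stats[id(ind)][1] += 1    # evaluations
        fitness = {key: ok / max(n, 1) for key, (ok, n) in stats.items()}
        # selection, reproduction, and genetic operators would act on each
        # population here, at intervals long enough that the average
        # individual gets adequately evaluated
    return fitness
```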
Will the coevolution work? It seems there should be pressure for improvement
in each of the populations. Some initial programs in S should be better than
others; similarly for F and E. The three participants should improve, but the
extent is unknown. It could be that all three success rates end up not much
above 50%. The best result would be 100% for S and F and 0% for E. But that
is unlikely since some degree of success by E would be necessary to push S and
F toward higher performance.

4 Some Implementation Suggestions

Having described a communications game in which patterns are generated and


recognized, and a scheme for coevolving the corresponding programs, it remains
to suggest the form of these programs. For concreteness, we consider generation
and recognition of two-dimensional, gray-scale visual patterns and take the
transformational viewpoint of Section 2.
The programs would be compounds of operators that take an input image and
transform it into an output image. The input of one of S's generating programs
would be an image of an archetypical A or B, and its output would be, via
transforms, a variant of the input. A recognition program would take such a
variant as input and, via transforms, output a further variant. F would match
its program's output against the same archetypes of A and B, picking the better
match, and deciding 0 or 1 accordingly. E would simply compute the average
gray level of its program's output image and compare that to a threshold to
decide between 0 and 1.
For a typical transformation, we imagine in effect a function that takes an
image (an array of real numbers) as input and produces an image as output.
The value at a point x, y of the output may depend on the value at a point (not
necessarily the same point) of the input, or on the values of a collection of input
points. As a simple example, in a translation transformation, the value at each
output point would equal the value at an input point that is displaced linearly
from the output point. In general, we would like the value at an output point
potentially to be a rather complicated function of the points of the input image.
Sims [6], partly with an artistic or visual design purpose, evolved images
using fitnesses based on human judgements. In his system, a candidate image
was generated by a Lisp-like tree of elementary functions taking as inputs x, y,
and the outputs of other elementary functions. The elementary functions included
standard Lisp functions as well as various image-processing operators, such as
blurs, convolutions, or gradients, that use neighboring pixel values to calculate
their outputs. Noise-generating functions were also included.
The inputs to the function tree were simply the coordinates x and y, so that
the tree in effect performed a transformation of the blank x-y plane to yield the
output image. The results of evolving such trees of functions could be surprising
and beautiful. Sims' article gives a number of examples of the images, including
one (Figure 2) having the following symbolic expression:
(round (log (+ y (color-grad (round (+ (abs (round
(log (+ y (color-grad (round (+ y (log (invert y) 15.5))
x) 3.1 1.86 #(0.95 0.7 0.59) 1.35)) 0.19) x)) (log (invert
y) 15.5)) x) 3.1 1.9 #(0.95 0.7 0.35) 1.35)) 0.19) x).

Fig. 2. Evolved image from Sims [6]. Gray-scale rendering of color original. © 1991
Association for Computing Machinery, Inc. Reprinted with permission.

Such an image-generating program is a good starting point for us, except for
two missing properties. First, the program does not transform an input image;
its only inputs are x and y. Second, the program is deterministic: it is not able
to produce different outputs for the same image input, a property required in
order to produce image variants.
To transform an image, the program needs to take as input not only x and
y, but also the input image values. A convenient way to do this appears to be
to add the image to the function set. That is, add Im(x, y) to the function set,
where Im is a function that maps image points to image values of the current
input. For example, consider the expression
(* k (Im (- x x0) (- y y0))).
The effect is to produce an output that translates the input by x0 and y0 in the
x and y directions and alters its contrast by the factor k. It seems fairly clear
that adding the current input image, as a kind of function, to the function set
(it could apply at any stage) is quite general and would permit a great variety
of image transformations.
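A minimal sketch of Im as an addition to the function set, using the translation-and-contrast expression above as the example (`image_fn`, `make_translate_contrast`, and the zero-padding outside the input are hypothetical helper choices):

```python
def image_fn(pixels):
    """Wrap a 2D list as an Im-style function over integer coordinates.
    Points outside the input image are taken to be 0."""
    h, w = len(pixels), len(pixels[0])
    def im(x, y):
        if 0 <= x < w and 0 <= y < h:
            return pixels[y][x]
        return 0.0
    return im

def make_translate_contrast(im, x0, y0, k):
    """The compound (* k (Im (- x x0) (- y y0))): because Im looks up the
    current input image, the whole expression acts as an image-to-image
    transform that shifts by (x0, y0) and scales contrast by k."""
    def out(x, y):
        return k * im(x - x0, y - y0)
    return out
```

Applying the transform to a tiny 2x2 image shifts the pixel values by one column and doubles them, which is exactly what the symbolic expression describes.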
To allow different transformations from the same program is not difficult. One
approach is to include a switch function, Sw, in the function set. Sw would have
two inputs and would pass one or the other of them to its output, depending on
the setting of a random variable at evaluation time (i.e., set when a new image is
to be processed and not reset until the next image). The random variable would
be a component of a vector of random binary variables, one variable for each
specific instance of Sw in the program. Then, at evaluation time, the random
vector would be re-sampled and the resulting component values would define a
specific path through the program tree. The number of distinct paths is 2 raised
to the number of instances of Sw, and equals the number of distinct input image
variants that the program can create. If that number turns out to be too small,
other techniques for creating variation will be required.
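One way to sketch Sw; the class-level `registry` standing in for the program-wide vector of random binary variables is a simplifying assumption:

```python
import random

class Sw:
    """Switch node: passes one of its two subtrees through, chosen by one
    component of a random binary vector that is re-sampled once per input
    image and then held fixed for the whole evaluation."""
    registry = []            # one random bit per Sw instance in the program

    def __init__(self, left, right):
        self.index = len(Sw.registry)
        Sw.registry.append(0)
        self.left, self.right = left, right

    def __call__(self, x, y):
        branch = self.left if Sw.registry[self.index] == 0 else self.right
        return branch(x, y)

def resample():
    """Pick a new path through the tree; 2 ** len(registry) distinct
    variants are possible, one per setting of the binary vector."""
    Sw.registry[:] = [random.randint(0, 1) for _ in Sw.registry]
```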
The transformation programs just described would be directly usable by S to
generate variants of A and B, starting with archetypes of each. F and E would
also use such programs, but not alone. Recognition, in the present approach,
reverses generation: it takes a received image and attempts to transform it back
into an archetype. Since it does not know the identity of the received image, how
does the recognizer know which transformations to apply?
We suggest that a recognition program be a kind of Pittsburgh classifier
system [7] in which each classifier has a condition part intended to be matched
against the input, and an action part that is a transformation program of the kind
used by S (but without Sw). In the simplest case, the classifier condition would
be an image-like array of reals to be matched against the input image; the best-matching
classifier's transformation program would then be applied to the image.
The resulting output would then be matched (by F) against archetypes A and B,
and the better-matching character selected. E, as noted earlier, would compare
the average of the output image with a threshold. It might be desirable for
recognition to take more than one match-transform step; they could be chained
up to a certain number, or until a sufficiently sharp A/B decision (or difference
from threshold) occurred.²
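A single match-transform step of such a recognizer might look like this sketch; `recognize`, the `(condition, transform)` pairs, and the similarity function `match` are illustrative assumptions:

```python
def recognize(image, classifiers, match, archetypes):
    """One match-transform step of the sketched Pittsburgh recognizer:
    the best-matching classifier's transformation program is applied, then
    (for F) the output is compared against the A/B archetypes and the
    better match decides the bit; archetypes[0] stands for A (bit 0).
    `match(a, b)` returns a similarity score (higher is more similar)."""
    condition, transform = max(classifiers, key=lambda c: match(c[0], image))
    output = transform(image)
    scores = [match(arch, output) for arch in archetypes]
    return output, scores.index(max(scores))
```

In the toy usage below, "images" are just numbers and similarity is negative distance; the second classifier matches a received value near 10, and its transform maps it back near the archetype for bit 0.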

5 Discussion and Conclusion

A coevolutionary framework has been proposed that, if it works, may create
interesting pattern generators and recognizers. We must ask: is it relevant to the
kinds of natural patterns noted in Section 2?
Natural patterns are not ones created by generators to communicate with
friends without informing enemies³. Instead, natural patterns seem to be clusters
of variants that become as large as possible without confusing their natural
recipients, and no intruder is involved. Perhaps that framework, which also may
² Recognition will probably require a chain of steps, as the system changes its center
of attention or other viewpoint. State memory from previous steps will likely be
needed, which favors use of a Pittsburgh over a Michigan [3,8] classifier system,
since the former is presently more adept at internal state.
³ There may be special cases!
suggest a coevolution, ought to be explored. But the present framework should
give insights, too.
A basic hypothesis here is that recognition is a process of transforming a pattern
into a standard or archetypical instance. Success by the present scheme,
since it uses transformations, would tend to support that hypothesis. More
important, the kinds of operators that are useful will be revealed (though extracting
such information from symbolic expressions can be a chore). For instance, will
the system evolve operators similar to human saccades, and will it size-normalize
centered objects? It would also be interesting to observe what kinds of matching
templates evolve in the condition parts of the recognizer classifiers. For instance,
are large-area, relatively crude templates relied upon to get a rough idea of which
transforms to apply? If so, it would be in contrast to recognition approaches that
proceed from the bottom up (e.g., finding edges) instead of the top down.
Such autonomously created processes would seem of great interest to more
standard studies of pattern recognition. The reason is that standard studies involve
choices of method that are largely arbitrary, and if they work there is still
a question of generality. In contrast, information gained from a relatively unconstrained
evolutionary approach might, by virtue of its human-independence,
have greater credibility and extensibility.
It is unclear how well the present framework will work, for instance whether
F's excess of a priori information over E's will be enough to drive the coevolution.
It is also unclear, even if it works, whether the results will have wider
relevance. But the proposal is offered in the hope that its difference from traditional
approaches will inspire new experiments and thinking about a central
problem in computer science.

References
1. Dzeroski, S., de Raedt, L., Driessens, K.: Relational reinforcement learning. Machine
Learning 43, 7–52 (2001)
2. Hillis, W.D.: Co-evolving parasites improve simulated evolution as an optimization
procedure. Physica D 42, 228–234 (1990)
3. Holland, J.H.: Escaping Brittleness: The Possibilities of General-Purpose Learning
Algorithms Applied to Parallel Rule-Based Systems. In: Mitchell, Michalski, Carbonell
(eds.) Machine Learning, an Artificial Intelligence Approach, vol. II, ch. 20,
pp. 593–623. Morgan Kaufmann, San Francisco (1986)
4. Mellor, D.: A first order logic classifier system. In: Beyer, H.-G., O'Reilly, U.-M.,
Arnold, D.V., Banzhaf, W., Blum, C., Bonabeau, E.W., Cantu-Paz, E., Dasgupta,
D., Deb, K., Foster, J.A., de Jong, E.D., Lipson, H., Llora, X., Mancoridis, S.,
Pelikan, M., Raidl, G.R., Soule, T., Tyrrell, A.M., Watson, J.-P., Zitzler, E. (eds.)
GECCO 2005: Proceedings of the 2005 Conference on Genetic and Evolutionary
Computation, Washington DC, USA, June 25-29, vol. 2, pp. 1819–1826. ACM Press,
New York (2005)
5. Shu, L., Schaeffer, J.: VCS: Variable Classifier System. In: Schaffer, J.D. (ed.)
Proceedings of the 3rd International Conference on Genetic Algorithms (ICGA
1989), George Mason University, pp. 334–339. Morgan Kaufmann, San Francisco
(June 1989), http://www.cs.ualberta.ca/~jonathan/Papers/Papers/vcs.ps
6. Sims, K.: Artificial evolution for computer graphics. Computer Graphics 25(4),
319–328 (1991), http://doi.acm.org/10.1145/122718.122752; also
http://www.karlsims.com/papers/siggraph91.html
7. Smith, S.F.: A Learning System Based on Genetic Adaptive Algorithms. PhD thesis,
University of Pittsburgh (1980)
8. Wilson, S.W.: Classifier Fitness Based on Accuracy. Evolutionary Computation 3(2),
149–175 (1995)
How Fitness Estimates Interact with
Reproduction Rates: Towards Variable Offspring
Set Sizes in XCSF

Patrick O. Stalph and Martin V. Butz

Department of Cognitive Psychology III, University of Würzburg
Röntgenring 11, 97080 Würzburg, Germany
{patrick.stalph,butz}@psychologie.uni-wuerzburg.de
http://www.coboslab.psychologie.uni-wuerzburg.de

Abstract. Despite many successful applications of the XCS classifier
system, a rather crucial aspect of XCS's learning mechanism has hardly
ever been modified: exactly two classifiers are reproduced when XCSF's
iterative evolutionary algorithm is applied in a sampled problem niche.
In this paper, we investigate the effect of modifying the number of
reproduced classifiers. In the investigated problems, increasing the number
of reproduced classifiers increases the initial learning speed. In less
challenging approximation problems, the final approximation accuracy
is also not affected. In harder problems, however, learning may stall,
yielding worse final accuracies. In this case, over-reproductions of inaccurate,
ill-estimated, over-general classifiers occur. Since the quality of
the fitness signal decreases if there is less time for evaluation, a higher
reproduction rate can deteriorate the fitness signal, thus (dependent on
the difficulty of the approximation problem) preventing further learning
improvements. In order to speed up learning where possible while
still assuring learning success, we propose an adaptive offspring set size
that may depend on the current reliability of classifier parameter estimates.
Initial experiments with a simple offspring set size adaptation
show promising results.

Keywords: LCS, XCS, Reproduction, Selection Pressure.

1 Introduction
Learning classifier systems were introduced over thirty years ago [1] as cognitive
systems. Over all these years, it has been clear that there is a strong interaction
between parameter estimations (be it by traditional bucket brigade techniques
[2], the Widrow-Hoff rule [3,4], or by recursive least squares and related
linear approximation techniques [5,6]) and the genetic algorithm, in which the
successful identification and propagation of better classifiers depends on the accuracy
of these estimates. Various control parameters have been used to balance
genetic reproduction with the reliability of the parameter estimation but, to the
best of our knowledge, there is no study that addresses the estimation problem
explicitly.

J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 47–56, 2010.
© Springer-Verlag Berlin Heidelberg 2010
48 P.O. Stalph and M.V. Butz

In the XCS classifier system [4], reproduction takes place by means of a steady-state,
niched GA. Reproductions are activated in current action sets (or match
sets in function approximation problems, as well as in the original XCS paper).
Upon reproduction, two offspring classifiers are generated, which are mutated
and recombined with certain probabilities. Reproduction is balanced by the GA
threshold θGA. It specifies that GA reproduction is activated only if the average time
of the last GA activation in the set lies longer in the past than θGA. It has
been shown that the threshold can delay learning speed, but it also prevents the
neglect of rarely sampled problem niches in the case of unbalanced data sets [7].
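The θGA check can be sketched roughly as follows; the `(timestamp, numerosity)` pair representation and the numerosity-weighted average are simplifying assumptions, not XCS's exact bookkeeping:

```python
def ga_should_run(action_set, current_time, theta_ga):
    """GA activation test: reproduce only if the numerosity-weighted average
    of the classifiers' last-GA timestamps lies more than theta_ga time
    steps in the past. Classifiers are (timestamp, numerosity) pairs here."""
    total = sum(num for _, num in action_set)
    avg_ts = sum(ts * num for ts, num in action_set) / total
    return current_time - avg_ts > theta_ga
```

Raising `theta_ga` delays reproduction in frequently sampled sets, which is the mechanism referred to above for protecting rarely sampled niches.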
Nonetheless, the reproduction of two classifiers seems to be rather arbitrary,
except for the fact that two offspring classifiers are needed for simple recombination
mechanisms. Unless the learning classifier system has a hard time
learning the problem, the reproduction of more than two classifiers could speed
up learning. Thus, this study investigates the effect of modifying the number of
offspring classifiers generated upon GA invocation. We further focus our study
on the real-valued domain and thus on the XCSF system [8,9]. Besides, we use
the rotating hyperellipsoidal representation for the evolving classifier condition
structures [10].
This paper is structured as follows. Since we assume general knowledge of
XCS¹, we immediately start investigating the performance of XCSF on various test
problems and with various offspring set sizes. Next, we discuss the results and
provide some theoretical considerations. Finally, we propose a road-map for further
studying the observed effects and adapting the offspring set sizes according
to the perceived problem difficulty and learning progress, as well as to the estimated
reliability of available classifier estimates.

2 Increased Offspring Set Sizes

To study the effects of increased offspring set sizes, we chose four challenging
functions defined in [0, 1]², each with rather distinct regularities:

f1(x1, x2) = sin(4π(x1 + x2))    (1)

f2(x1, x2) = exp(−8 Σi (xi − 0.5)²) · cos(8π √(Σi (xi − 0.5)²))    (2)

f3(x1, x2) = max{exp(−10(2x1 − 1)²), exp(−50(2x2 − 1)²),
               1.25 exp(−5((2x1 − 1)² + (2x2 − 1)²))}    (3)

f4(x1, x2) = sin(4π(x1 + sin(πx2)))    (4)
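For reference, the four benchmark functions can be coded directly; the exact placement of the π factors is an assumption from the standard definitions of these benchmarks and should be checked against the original paper:

```python
import math

def f1(x1, x2):
    """Diagonal sine wave."""
    return math.sin(4 * math.pi * (x1 + x2))

def f2(x1, x2):
    """Radial sine: a cosine over the distance from the center (0.5, 0.5),
    damped by a Gaussian-like envelope."""
    r2 = (x1 - 0.5) ** 2 + (x2 - 0.5) ** 2
    return math.exp(-8 * r2) * math.cos(8 * math.pi * math.sqrt(r2))

def f3(x1, x2):
    """Crossed ridge: maximum of two axis-aligned ridges and a central bump."""
    u, v = 2 * x1 - 1, 2 * x2 - 1
    return max(math.exp(-10 * u * u),
               math.exp(-50 * v * v),
               1.25 * math.exp(-5 * (u * u + v * v)))

def f4(x1, x2):
    """Sine-in-sine: one sine wave nested inside another."""
    return math.sin(4 * math.pi * (x1 + math.sin(math.pi * x2)))
```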

Function f1 has been used in various studies [10] and has a diagonal regularity. It
requires the evolution of stretched hyperellipsoids that are rotated by 45°. Function
f2 is a radial sine function that requires a somewhat circular distribution of
¹ For details about XCS refer to [4,11].
Towards Variable Offspring Set Sizes in XCSF 49

[Surface plots with x, y, and prediction axes: (a) sine function, (b) radial sine function, (c) crossed ridge function, (d) sine-in-sine function.]
Fig. 1. Final function approximations, including contour lines, are shown on the left-hand
side. The corresponding population distributions after compaction are shown on
the right-hand side. For visualization purposes, the conditions are drawn 80% smaller
than their actual size.

[Plots of prediction error and macro classifier counts over 100k learning steps: (a) sine function, (b) radial sine function; the left-hand panels compare offspring set sizes of 2, 4, and 8, the right-hand panels compare 2, 10%, and 50% of the match set size.]
Fig. 2. Different selection strengths with fixed (left-hand side) or match-set-size-relative
(right-hand side) offspring set sizes can speed up learning significantly but potentially
increase the final error level reached. The vertical axis is log-scaled. Error bars represent
one standard deviation and the thin dashed line shows the target error ε0 = 0.01.

classifiers. Function f3 is a crossed ridge function, for which it has been shown
that XCSF performs competitively in comparison with deterministic machine
learning techniques [10]. Finally, function f4 twists two sine functions so that it
becomes very hard for the evolutionary algorithm to receive enough signal from
the parameter estimates in order to structure the problem space more effectively
for an accurate function approximation.
Figure 1 shows the approximation surfaces and spatial partitions generated
by XCSF with a population size of N = 6400 and with compaction [10] activated
after 90k learning iterations.² The graphs on the left-hand side show the
actual function predictions and qualitatively confirm that XCSF is able to learn
accurate approximations for all four functions. On the right-hand side, the corresponding
condition structures of the final populations are shown. In XCS and

² Other parameters were set to the following values: β = .1, α = .5, λ = 1, ε0 = .01,
ν = 5, θGA = 50, χ = 1.0, μ = .05, r0 = 1, θdel = 20, δ = 0.1, θsub = 20. All
experiments in this paper are averaged over 20 runs.

[Plots of prediction error and macro classifier counts over 100k learning steps: (a) crossed ridge function, (b) sine-in-sine function; the left-hand panels compare offspring set sizes of 2, 4, and 8, the right-hand panels compare 2, 10%, and 50% of the match set size.]
Fig. 3. While in the crossed ridge function larger offspring set sizes mainly speed up
learning, in the challenging sine-in-sine function, larger offspring set sizes can strongly
affect the final error level reached.

XCSF, two classifiers are selected for reproduction, crossover, and mutation. We
now investigate the influence of modified reproduction sizes.
Performance of the standard setting, where two classifiers are selected for reproduction
(with replacement), is compared with four other reproduction size
choices. In the first experiment, the offspring set size was set to four and eight
classifiers, respectively. Thus, four (eight) classifiers are reproduced upon GA invocation,
and crossover is applied twice (four times) before the mutation operator
is applied. In a second, more aggressive setting, the offspring set size is set relative
to the current match set size, namely to 10% and 50% of the match set size.
Especially the last setting was expected to reveal that excessive reproduction
can deteriorate learning.
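The offspring set size variants can be sketched as follows; `n_offspring`, `reproduce`, and the fitness-proportionate selection placeholder are illustrative assumptions (XCSF typically uses set-based tournament selection):

```python
import random

def n_offspring(mode, match_set_size):
    """Offspring per GA invocation: a fixed count (2, 4, 8) or a fraction
    of the current match set (0.1 for 10%, 0.5 for 50%)."""
    if isinstance(mode, int):
        return mode
    return max(2, round(mode * match_set_size))

def reproduce(match_set, fitness, mode, crossover, mutate):
    """Select n parents with replacement (fitness-proportionate here as a
    placeholder), pair them for crossover, then mutate every child.
    Assumes n is even, as in the compared settings."""
    n = n_offspring(mode, len(match_set))
    weights = [fitness(c) for c in match_set]
    parents = random.choices(match_set, weights=weights, k=n)
    children = []
    for a, b in zip(parents[0::2], parents[1::2]):
        children.extend(mutate(c) for c in crossover(a, b))
    return children
```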
Learning progress is shown in Figure 2 for functions f1 and f2. It can be seen
that in both cases standard XCSF with two offspring classifiers learns significantly
slower than settings with a larger number of offspring classifiers. The
number of distinct classifiers in the population (so-called macro classifiers), on
the other hand, shows that initially larger offspring set sizes increase the population
sizes much faster. Thus, an initially higher diversity due to larger offspring
sets yields faster initial learning progress. However, towards the end of the run,

standard XCSF actually reaches a slightly lower error than the settings with
larger offspring sets. This effect is the more pronounced the larger the offspring
set. In the radial sine function, this effect is not as strong as in the sine function.
Similar observations can also be made in the crossed ridge function, which
is shown in Figure 3(a). In the sine-in-sine function f4 (Figure 3(b)), larger
offspring set sizes degrade performance most severely. While a selection of four
offspring classifiers, as well as a selection of a size of 10% of the match set size, still
shows slight error decreases, larger offspring set sizes completely stall learning
despite large and diverse populations. It appears that the larger offspring set sizes
prevent the population from identifying relevant structures and thus prevent the
development of accurate function approximations.

3 Theoretical Considerations
What is the effect of increasing the number of offspring generated upon GA
invocation? The results indicate that initially, faster learning can be induced.
However, later on, learning potentially stalls.
Previously, learning in XCS was characterized as an interactive learning process
in which several evolutionary pressures [12] foster learning progress: (1) A
fitness pressure is induced since usually, on average, more accurate classifiers are
selected for reproduction than for deletion. (2) A set pressure, which causes an
intrinsic generalization pressure, is induced since, also on average, more general
classifiers are selected for reproduction than for deletion. (3) Mutation pressure
causes diversification of classifier conditions. (4) Subsumption pressure causes
convergence to maximally accurate, general classifiers, if found. Since fitness and
set pressure work on the same principle, increasing the number of reproductions
increases both pressures equally. Thus, their balance is maintained.
However, the fitness pressure only applies if there is a strong-enough fitness signal,
which depends on the number of evaluations a classifier underwent before
the reproduction process. The mutation pressure also depends on the number of
reproductions; thus, a faster diversification can be expected given larger offspring
set sizes.
Another analysis estimated the reproductive opportunities a superior classifier
might have before being deleted [13]. Moreover, a niche support bound was
derived [14], which characterizes the probability that a classifier is sustained
in the population, given that it represents an important problem niche for the
final solution. Both of these bounds assume that the accuracy of the classifier
is accurately specified. However, the larger the offspring set size, the faster
the classifier turnaround, thus the shorter the average time a classifier
stays in the population, and thus the fewer the number of iterations available
to a classifier until it is deleted. The effect is that the GA in XCS has to work
with classifier parameter estimates that are less reliable, since they underwent
fewer updates on average. Thus, larger offspring set sizes induce larger noise in
the selection process.
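A toy Monte-Carlo sketch (hypothetical numbers, not XCSF itself) illustrates this selection noise: two classifiers whose true accuracies differ by a fixed gap are each judged by the mean of a few noisy evaluations, and the probability that the truly better one wins the comparison grows with the number of evaluations it receives before selection.

```python
import random

def p_correct_selection(true_gap, n_evals, trials=2000, seed=1):
    """Estimate how often the truly better of two classifiers wins a
    pairwise fitness comparison, when each fitness estimate is the sum of
    n_evals Bernoulli evaluations (accuracies 0.5 + true_gap vs. 0.5)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        good = sum(rng.random() < 0.5 + true_gap for _ in range(n_evals))
        bad = sum(rng.random() < 0.5 for _ in range(n_evals))
        wins += good > bad
    return wins / trials
```

With a faster classifier turnaround, `n_evals` shrinks, the comparison gets noisier, and the fitness signal can no longer separate better from worse classifiers.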
As long as the fitness pressure leads in the right direction because the parameter
estimates have enough signal, learning proceeds faster. This latter reason

stands also in relation to the estimated learning speed of XCS approximated


elsewhere [15]. Since reproductions of more accurate classifiers are increased,
learning speed increases as long as more accurate classifiers are detected.
Due to this reasoning, however, it can also be expected that learning can stall
prematurely. This should be the case when the noise induced by an increased
reproduction rate is too high, so that the identification of more accurate classifiers
becomes impossible. Better offspring classifiers get deleted before their
fitness is sufficiently evaluated. In other words, the fitness signal is too weak
for the selection process. This signal-to-noise ratio (fitness signal to selection
noise) depends on (1) the problem structure at hand, (2) the solution representation
given to XCS (condition and prediction structures), and (3) the
population size. Thus, it is hard to specify the ratio exactly, and future research
is needed to derive mathematical bounds on this problem. Nonetheless, these
considerations explain the general observations in the considered functions: the
more complex the function, the more problematic larger offspring sets become;
even the traditional two offspring classifiers may be too fast to reach the target
error ε0.
To control the signal-to-noise problem, consequently, it is important to balance
reproduction rates and offspring set sizes problem-dependently. A similar
suggestion was made elsewhere for the control of the parameter θGA [7]. In the following,
we investigate an approach that decreases the offspring set size over a
learning experiment to get the best of both worlds: fast initial learning speeds
and maximally accurate final solution representations.

4 Adapting Offspring Set Sizes

As a first approach to determine whether it can be useful to use larger initial offspring
set sizes and to decrease those sizes during the run, we linearly scale the offspring
set size from a 10% offspring set size down to two over the 100k learning iterations.
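The linear schedule can be sketched as follows; `offspring_schedule` and its parameter names are illustrative, not from the paper:

```python
def offspring_schedule(t, total_steps, match_set_size, start_frac=0.1):
    """Linear anneal of the offspring set size: from start_frac of the
    match set at t = 0 down to the traditional two offspring at
    t = total_steps, never dropping below two."""
    start = max(2, round(start_frac * match_set_size))
    size = start + (2 - start) * t / total_steps
    return max(2, round(size))
```

Early in the run the GA reproduces aggressively for fast initial progress; late in the run it falls back to two offspring, giving parameter estimates more time to stabilize.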
Figure 4 shows the resulting performance in all four functions, comparing the
linear scaling with the traditional two offspring classifiers and fixed 10% offspring. In
graphs 4(a)-(c) we can see that the scaling technique reaches maximum accuracy.
Particularly in graph 4(a) we can see that the performance stalling is overcome
and an error level is reached that is similar to the one reached with the traditional
XCS setting. However, performance in function f4 shows that the error initially still stays
on a high level, but it starts decreasing further, when compared to a 10%
offspring set size, later in the run.
Thus, the results show that a linear reduction of offspring set sizes can have
positive effects on initial learning speed, while low reproduction rates at the end
of a run allow for a refinement of the final solution structure. However, the
results also suggest that the simple linear scheme is not necessarily optimal and
its success is highly problem-dependent. Future research needs to investigate
flexible adaptation schemes that take the signal-to-noise ratio into account.

[Plots of prediction error and macro classifier counts over 100k learning steps for (a) the sine function, (b) the radial sine function, (c) the crossed ridge function, and (d) the sine-in-sine function, comparing two offspring (sel2), a fixed 10% offspring set size (sel10%), and the linearly decreasing schedule (sel10%to2).]
Fig. 4. When decreasing the number of generated offspring over the learning trial,
learning speed is kept high while the error convergence reaches the level that is reached
by always generating two offspring classifiers (a, b, c). However, in the case of the challenging
sine-in-sine function, further learning would be necessary to reach a similarly
low error level (d).

5 Conclusions

This paper has shown that a fixed offspring set size does not necessarily yield
the best learning speed that XCSF can achieve. Larger offspring set sizes can
strongly increase the initial learning speed but do not necessarily reach maximum
accuracy. Adaptive offspring set sizes, if scheduled appropriately, can get the best
of both worlds in yielding high initial learning speed and a low final error. The
results, however, also suggest that a simple adaptation scheme is not generally
applicable. Furthermore, the theoretical considerations suggest that a signal-to-noise
estimate could be used to control the GA offspring schedule and the
offspring set sizes. Given a strong fitness signal, a larger set of offspring could
be generated.
Another consideration that needs to be taken into account in such an offspring
generation scheme, however, is the fact that problem domains may be
strongly unbalanced, in which case some subspaces may be very easily approximated
while others may be very hard. In these cases, it has been shown that
the θGA threshold can be increased to ensure a representation of the complete
problem [7]. Future research should consider adapting θGA hand-in-hand with
the offspring set sizes. In which way this may be accomplished exactly still needs
to be determined. Nonetheless, it is hoped that the results and considerations
of this work provide clues in the right direction in order to speed up XCS(F)
learning and to make learning even more robust in hard problems.

Acknowledgments

The authors acknowledge funding from the Emmy Noether program of the German Research Foundation (grant BU1335/3-1) and would like to thank their colleagues at the Department of Psychology and the COBOSLAB team.

References

1. Holland, J.H.: Adaptation. In: Progress in Theoretical Biology, vol. 4, pp. 263–293. Academic Press, New York (1976)
2. Holland, J.H.: Properties of the bucket brigade algorithm. In: Proceedings of the 1st International Conference on Genetic Algorithms, Hillsdale, NJ, USA, pp. 1–7. L. Erlbaum Associates Inc., Mahwah (1985)
3. Widrow, B., Hoff, M.E.: Adaptive switching circuits. Western Electronic Show and Convention, Convention Record, Part 4, 96–104 (1960)
4. Wilson, S.W.: Classifier fitness based on accuracy. Evolutionary Computation 3(2), 149–175 (1995)
5. Lanzi, P.L., Loiacono, D., Wilson, S.W., Goldberg, D.E.: Prediction update algorithms for XCSF: RLS, Kalman filter, and gain adaptation. In: GECCO 2006: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, pp. 1505–1512. ACM, New York (2006)
6. Drugowitsch, J., Barry, A.: A formal framework and extensions for function approximation in learning classifier systems. Machine Learning 70, 45–88 (2008)
7. Orriols-Puig, A., Bernadó-Mansilla, E.: Bounding XCS's parameters for unbalanced datasets. In: GECCO 2006: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, pp. 1561–1568. ACM, New York (2006)
8. Wilson, S.W.: Get real! XCS with continuous-valued inputs. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 1999. LNCS (LNAI), vol. 1813, pp. 209–219. Springer, Heidelberg (2000)
9. Wilson, S.W.: Classifiers that approximate functions. Natural Computing 1, 211–234 (2002)
10. Butz, M.V., Lanzi, P.L., Wilson, S.W.: Function approximation with XCS: Hyperellipsoidal conditions, recursive least squares, and compaction. IEEE Transactions on Evolutionary Computation 12, 355–376 (2008)
11. Butz, M.V., Wilson, S.W.: An algorithmic description of XCS. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2000. LNCS (LNAI), vol. 1996, pp. 267–274. Springer, Heidelberg (2001)
56 P.O. Stalph and M.V. Butz

12. Butz, M.V., Kovacs, T., Lanzi, P.L., Wilson, S.W.: Toward a theory of generalization and learning in XCS. IEEE Transactions on Evolutionary Computation 8, 28–46 (2004)
13. Butz, M.V., Goldberg, D.E., Tharakunnel, K.: Analysis and improvement of fitness exploitation in XCS: Bounding models, tournament selection, and bilateral accuracy. Evolutionary Computation 11, 239–277 (2003)
14. Butz, M.V., Goldberg, D.E., Lanzi, P.L., Sastry, K.: Problem solution sustenance in XCS: Markov chain analysis of niche support distributions and the impact on computational complexity. Genetic Programming and Evolvable Machines 8, 5–37 (2007)
15. Butz, M.V., Goldberg, D.E., Lanzi, P.L.: Bounding learning time in XCS. In: Deb, K., et al. (eds.) GECCO 2004. LNCS, vol. 3103, pp. 739–750. Springer, Heidelberg (2004)
Current XCSF Capabilities and Challenges

Patrick O. Stalph and Martin V. Butz

Department of Cognitive Psychology III, University of Würzburg
Röntgenring 11, 97080 Würzburg, Germany
{patrick.stalph,butz}@psychologie.uni-wuerzburg.de
http://www.coboslab.psychologie.uni-wuerzburg.de

Abstract. Function approximation is an important technique used in many different domains, including numerical mathematics, engineering, and neuroscience. The XCSF classifier system is able to approximate complex multi-dimensional function surfaces using a patchwork of simpler functions. Typically, locally linear functions are used due to the tradeoff between expressiveness and interpretability. This work discusses XCSF's current capabilities, but also points out current challenges that can hinder learning success. A theoretical discussion on when XCSF works is intended to improve the comprehensibility of the system. Current advances with respect to scalability theory show that the system constitutes a very effective machine learning technique. Furthermore, the paper points out how to tune relevant XCSF parameters in actual applications and how to choose appropriate condition and prediction structures. Finally, a brief comparison to the Locally Weighted Projection Regression (LWPR) algorithm highlights positive as well as negative aspects of both methods.

Keywords: LCS, XCS, XCSF, LWPR.

1 Introduction

The increasing interest in Learning Classifier Systems (LCS) [1] has propelled research, and LCS have proven their capabilities in various applications, including multistep problems [2,3], datamining tasks [4,5], as well as robot applications [6,7]. The focus of this work is on the Learning Classifier System XCSF [8], which is a modified version of the original XCS [2]. XCSF is able to approximate multi-dimensional, real-valued function surfaces from samples by locally weighted, usually linear, models.
While XCS theory has been investigated thoroughly in the binary domain [5], theory on real-valued input and output spaces remains sparse. There are two important questions: When does the system work at all, and how does it scale with increasing complexity? We will address these questions by first carrying over parts of the XCS theory and, second, showing the results of a scalability analysis, which suggests that XCSF scales optimally in the required population size.

J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 57–69, 2010.
© Springer-Verlag Berlin Heidelberg 2010

However, when theory tells us that a system is applicable to a specific problem type, the problem is still not solved. The practitioner has to choose appropriate parameters and has to decide on the solution representation, which for XCSF means condition and prediction structures. Therefore, we give a short guide on the system's relevant parameters and how to set them appropriately. Furthermore, a brief discussion of condition and prediction structures is provided to foster the understanding of how XCSF's generalization power can be fully exploited.
Finally, we briefly compare XCSF with Locally Weighted Projection Regression (LWPR). LWPR is a statistics-based greedy algorithm for function approximation that also uses spatially localized linear models to predict the value of non-linear functions. A discussion of pros and cons points out the capabilities of each algorithm.
The remainder of this article is structured as follows. Section 2 is concerned with theoretical aspects of XCSF, that is, (1) when the system works at all and (2) how XCSF scales with increasing problem complexity. In contrast, Section 3 discusses how to set relevant parameters given an actual, unknown problem. In Section 4, we briefly compare XCSF with LWPR, and the article ends with a short summary and concluding remarks.

2 Theory

We assume sufficient knowledge about the XCSF Learning Classifier System and directly start with a theoretical analysis. We carry over preconditions for successful learning known from binary XCS and propose a scalability model, which shows how the population size scales with increasing function complexity and dimensionality.

2.1 Preconditions - When It Works

In order to successfully approximate a function, XCSF has to overcome the same challenges that were identified for XCS in binary domains [5]. These challenges were described as (1) covering challenge, (2) schema challenge, (3) reproductive opportunity challenge, (4) learning time challenge, and (5) solution sustenance challenge. The following paragraphs briefly summarize results from a recent study [9] that investigated the mentioned challenges in depth with respect to XCSF.

Covering Challenge. The initial population of XCSF should be able to cover the whole input space, because otherwise the deletion mechanism creates holes in the input space and local knowledge about these subspaces is lost (the so-called covering-deletion cycle [10]). Consequently, when successively sampled problem instances tend to be located in empty subspaces, each hole is covered with a default classifier and another hole is created due to the deletion mechanism. In analogy to results with binary XCS, there is a linear relation between initial classifier volume and the required population size to master the covering challenge. In particular, the population size has to grow inversely linear with the initial classifier volume.
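As a rough illustration of this bound, the following sketch estimates the population size needed to cover the input space [0, 1]^n, assuming (hypothetically) hypercubic initial conditions of side 2·r0 and a small safety factor; it is not the paper's exact derivation.

```python
import math

def min_population_for_covering(r0, n, safety=2.0):
    """Rough covering-challenge bound: in [0,1]^n, an initial classifier
    assumed to be a hypercube of side 2*r0 covers volume (2*r0)^n, so
    roughly N >= safety / (2*r0)^n classifiers are needed to cover the
    input space. N grows inversely linear in the initial classifier
    volume, as stated above."""
    volume = (2.0 * r0) ** n
    return math.ceil(safety / volume)
```

For example, halving the initial classifier volume doubles the estimated population size, and the estimate grows quickly with the dimensionality n.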

Schema and Reproductive Opportunity Challenge. When the covering challenge is met, it is required that the genetic algorithm (a) discovers better substructures and (b) reproduces these substructures. In binary genetic algorithms such substructures are often termed building blocks, as proposed in John H. Holland's schema theory [1].
However, the definition of real-valued schemata is non-trivial [11,12,13,14] and it is even more difficult to define building blocks for infinite input and output spaces [15,16]. While the stepwise character of binary functions emphasizes the processing of building blocks via crossover, the smooth character of real-valued functions emphasizes hill-climbing mechanisms. To the best of our knowledge, there is no consensus in the literature on this topic, and consequently it remains unclear how a building block can be defined for the real-valued XCSF Learning Classifier System.
If XCSF's fitness landscape is neither flat nor deceptive, there remains one last problem: noise on the fitness signal due to a finite number of samples. Prediction parameter estimates rely on the samples seen so far, and so do the prediction error and the fitness. If the classifier turnaround (that is, reproduction and deletion of classifiers) is too high, the selection mechanism cannot identify better substructures and the learning process is stuck [17], which can be alleviated by slowing down the learning, e.g. by increasing θGA [18].
Learning Time Challenge. The learning time mainly depends on the number of mutations from initial classifiers to the target shape of accurate and maximally general classifiers. A too-small population size may delay learning, because good classifiers get deleted and knowledge is lost. Furthermore, redundancy in the space of possible mutations (e.g. rotation for dimensions n > 3 is not unique) may increase the learning time. A recent study estimated a linear relation between the number of required mutations and the learning time [9].

Solution Sustenance Challenge. Finally, XCSF has to assure that the evolved accurate solution is sustained. This challenge is mainly concerned with the deletion probability. Given that the population size is high enough, the GA has enough room to work without destroying accurate classifiers. The resulting bound states that the population size needs to grow inversely linear in the volume of the accurate classifiers to be sustained.

2.2 A Scalability Model

Given that all of the above challenges are overcome and the system is able to learn an accurate approximation of the problem at hand, it is important to know how changes in the function complexity or dimensionality affect XCSF's learning performance. In particular, we model the relation between

- the function complexity η (defined via the prediction error),
- the input space dimensionality n,
- XCSF's population size N, and
- the target error ε₀.

In order to simplify the model, we assume a uniform function structure and uniform sampling¹. This also implies a uniform classifier structure, that is, uniform shape and size. Without loss of generality, let the n-dimensional input space be confined to [0, 1]^n. Furthermore, we assume that XCSF evolves an optimal solution [19]. This includes four properties, namely

1. completeness, that is, each possible input is covered in that at least one classifier matches.
2. correctness, that is, the population predicts the function surface accurately in that the prediction error is below the target error ε₀.
3. minimality, that is, the population contains the minimum number of classifiers needed to represent the function completely and correctly.
4. non-overlappingness, that is, no input is matched by more than one classifier.

In sum, we assume a uniform patchwork of equally sized, non-overlapping, accurate, and maximally general classifiers. These assumptions reflect reality on uniform functions except for non-overlappingness, which is almost impossible for real-valued input spaces.
We consider a uniformly sampled function of uniform structure

f_η : [0, 1]^n → R, (1)

where n is the dimensionality of the input space and η reflects the function complexity. Since we fix neither the condition type nor the predictor used in XCSF, we have to define the complexity via the prediction error. We define η such that a linear increase in this value results in the same increase in the prediction error. Thus, saying that the function is twice as complex implies that the prediction error is twice as high for the same classifiers. Since the classifier volume V influences the prediction error in a polynomial fashion on uniform functions, we can summarize the assumptions in the following equation:

ε = η · V^(1/n). (2)

We can now derive the optimal classifier volume and the optimal population size. Using the target error ε₀, we get an optimal volume of

V_opt = (ε₀ / η)^n. (3)

The volume of the input space to be covered is one, and it follows that the optimal population size is

N_opt = 1 / V_opt = (η / ε₀)^n. (4)

To sum up, the dimensionality n has an exponential influence on the population size, while the function complexity η and the target error ε₀ have a polynomial influence. Increasing the function complexity will require a polynomial increase of the population size in the order n.
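A minimal sketch of Equation 4, assuming the symbols above (η for the function complexity, ε₀ for the target error, n for the dimensionality):

```python
def optimal_population_size(eta, eps0, n):
    """Equation 4: N_opt = (eta / eps0)^n. Exponential in the
    dimensionality n, polynomial (of order n) in the complexity eta
    and in 1/eps0."""
    return (eta / eps0) ** n
```

Doubling η on a two-dimensional problem therefore quadruples the required population size, while adding a dimension multiplies it by another factor of η/ε₀.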
¹ Non-uniform sampling is discussed elsewhere [18].

[Figure 1 plot: number of macro classifiers (log-scale) against the gradient (log-scale) for dimensions 1D to 6D; data points with corresponding theory lines.]

Fig. 1. Comparative plots of the final population size after condensation (data points) and the developed scalability theory (solid lines) for dimensions n = 1 to n = 6. The number of macro classifiers is plotted against the function complexity, which is modeled via the increasing gradient. The order of the polynomials is equal to the dimension n, which requires an exponential increase in population size. An increasing function complexity results in a polynomial increase. Apart from an approximately constant overhead due to overlapping classifiers, the scalability model fits reality.

Note that no assumptions are made about the condition type or the predictor used. The intentionally simple Equations 3 and 4 hide a complex geometric problem in the variable η. For example, assume a three-dimensional non-linear function that is approximated using linear predictions and rotating ellipsoidal conditions. Calculating the prediction error is non-trivial for such a setup. When the above bounds are required exactly, this geometric problem has to be solved anew for any condition-prediction-function combination.
In order to validate the scalability model, we conducted experiments with interval conditions and constant predictions on a linear function². XCSF with constant predictions equals XCSR [20]; however, only one dummy action is available. As done before in [19] with respect to XCS, we analyze a restricted class of problems for XCSF. On the one hand, the constant prediction makes this setup a worst-case scenario in terms of required population size. On the other hand, the simple setup allows for solving the geometric problem analytically; thus, we can compare the theoretical population size bound from Equation 4 with the actual population size that is required to approximate the respective function. A so-called bisection algorithm runs XCSF with different population size settings in a binary search fashion. On termination, the bisection procedure returns the approximately minimal population size N that is required for successful learning.
² Other settings: 500000 iterations, ε₀ = 0.01, β = 0.1, α = 1, δ = 0.1, ν = 5, χ = 1, µ = 0.05, r0 = 1, θGA = 50, θdel = 20, θsub = 20. GA subsumption was applied. Uniform crossover was applied.

For details of the bisection algorithm and how the geometric problem is solved,
please refer to [9].
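The bisection procedure can be sketched as follows; the `learns(N)` interface standing in for a full XCSF run (returning whether the final error falls below the target error) is a hypothetical simplification of the actual experimental setup in [9].

```python
def bisect_min_population(learns, n_low, n_high, tol=16):
    """Binary search for the approximately minimal population size N
    for which a run still succeeds. Assumes monotonicity: learns(N)
    is False for all N below some threshold and True above it, with
    learns(n_low) False and learns(n_high) True initially."""
    assert not learns(n_low) and learns(n_high)
    while n_high - n_low > tol:
        mid = (n_low + n_high) // 2
        if learns(mid):
            n_high = mid  # success: the threshold is at or below mid
        else:
            n_low = mid   # failure: the threshold is above mid
    return n_high
```

In the real experiments each `learns(N)` evaluation is itself a (repeated) XCSF run, so the logarithmic number of probes of the binary search matters.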
Figure 1 shows the results of the bisection experiments on the one- to six-dimensional linear function f(x1, . . . , xn) = Σ_{i=1}^{n} x_i, where solid lines represent the developed theory (Equation 4) and the data shown represent the final population size after condensation [21]. For each dimension n, the function difficulty was linearly increased by increasing the gradient of the linear function. The polynomials are shown as straight lines on a log-log-scale plot, where the gradient of a line equals the order of the corresponding polynomial.
We observe an approximately constant overhead from scalability theory to actual population size. This overhead is expected, since the scalability model assumes non-overlappingness. Most importantly, the prediction of the model lies parallel to the actual data, which indicates that the dimension n fits the exponent of the theoretical model. Thus, the experiment confirms the scalability model: Problem dimensionality has an exponential influence on the required population size (given full problem space sampling). Furthermore, a linear increase in the problem difficulty (or a linear decrease of the target error ε₀) induces a polynomial increase in the population size.

3 How to Set XCSF's Parameters

Although theory shows that XCSF can work optimally, it is also important to understand the influence of XCSF's parameter settings such as population size, condition structures, and prediction types. Besides the importance and the direct influence of a parameter, the interdependencies between parameters are also relevant for the practitioner. In the following, we give a brief overview of important parameters, their dependencies, and how to tune them in actual applications.

3.1 Important Parameters and Interdependencies

A long list of available parameters exists for both XCS and XCSF. Among obviously important parameters, such as the population size N, there are less frequently tuned parameters (e.g. θGA) and parameters that are rarely changed at all, such as the crossover rate χ or the accuracy scale α. The most important parameters are summarized here.

Population Size N. This parameter specifies the available workspace for the evolutionary search. Therefore it is crucial to set this value high enough to prevent deletion of good classifiers (see Section 2.1).

Target Error ε₀. The error threshold defines the desired accuracy. Evolutionary pressures drive classifiers towards this threshold of accurate and maximally general classifiers.

Condition Type. The structuring capability of XCSF is defined by this setting. Various condition structures are available, including simple axis-parallel intervals [22], rotating ellipsoids [23], and arbitrary shapes using gene expression programming [24].

Prediction Type. Typically linear predictors are used for a good balance of expressiveness and interpretability. However, others are possible, such as constant predictors [8] or polynomial ones [25].

Learning Time. The number of iterations should be set high enough to assure that the prediction error converges to a value below the desired ε₀.

GA Frequency Threshold θGA. This threshold specifies that GA reproduction is activated only if the average time of the last GA activation in the set lies longer in the past than θGA. Increasing this value delays learning, but may also prevent forgetting and overgeneralization in unbalanced data sets [18].

Mutation Rate µ. The probability of mutation is closely related to the available mutation options of the condition type, and thus it is also connected to the dimensionality of the problem. It should be set according to the problem at hand, e.g. µ = 1/m, where m is the number of available mutation options.

Initial Classifier Size r0. On the one hand, this value should be set high enough to meet the covering challenge, that is, it should be set such that simple covering with less than N classifiers is sufficient to cover the whole input space. On the other hand, the initial size should be small enough to yield a fitness signal upon crossover or mutation in order to prevent oversized classifiers from taking over the population.

The other parameters can be set to their default values, thus ensuring a good balance of the evolutionary pressures.
The strongest interdependencies can be found between population size N, target error ε₀, condition structure, and prediction type, as indicated by the scalability model of Section 2.2. Changing any of these will affect XCSF's learning performance significantly. For example, with a higher population size a lower target error can be reached. An appropriate condition structure may turn a polynomial problem into a linear one, thus requiring fewer classifiers. Advanced predictors are able to approximate more complex functions and thus enable a coarse structuring of the input space, again reducing the required population size. When tuning any of these settings, the related parameters should be kept in mind.
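For illustration, the parameters discussed above might be collected in a configuration like the following; the names and default values are assumptions following common XCS(F) conventions, not the API of any specific implementation.

```python
# Hypothetical configuration mirroring the parameter guide above.
xcsf_config = {
    "N": 6400,           # maximum population size (workspace for the GA)
    "epsilon_0": 0.01,   # target error
    "theta_GA": 50,      # GA frequency threshold
    "mu": None,          # mutation rate; set to 1/m below
    "r0": 1.0,           # initial classifier size
    "condition": "rotating_ellipsoid",  # structuring representation
    "prediction": "linear",             # local model type
    "max_steps": 500_000,               # learning time (iterations)
}
# Rule of thumb from the text: mu = 1/m for m available mutation options,
# e.g. m = 5 hypothetical options for a 2D rotating-ellipsoid condition.
xcsf_config["mu"] = 1.0 / 5
```

Remaining parameters (β, α, δ, ν, χ, deletion and subsumption thresholds) would stay at their defaults, as recommended above.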

3.2 XCSF's Solution Representation

Before running XCSF with some arbitrary settings on a particular problem, a few things have to be considered. This mainly concerns the condition and prediction structures, that is, XCSF's solution representation. The next two paragraphs highlight some issues about different representations.

Selecting an Appropriate Predictor. The first step is to select the type of prediction to be used for the function approximation. Linear predictions have a reasonable computational complexity and good expressiveness, while the final solution is well interpretable. In some cases, it might be required to invert the approximated function after learning, which is easily possible with a linear predictor. However, if prior knowledge suggests a special type of function (e.g. polynomials or sinusoidal functions), this knowledge can be exploited by using corresponding predictors. The complexity of the prediction mainly influences the classifier updates, which is, depending on the dimensionality, usually a minor factor.
Structuring Capabilities. Closely related to the predictor is the condition structure. The simplest formulation is intervals, that is, rectangles. Alternatively, spheres or ellipsoids (also known as radial basis functions or receptive fields) can be used. More advanced structures include rotation, which allows for exploiting interdimensional dependencies, but also increases the complexity of (1) the evolutionary search space and (2) the computational time for matching, which are major influences on the learning time. On the other hand, if interdependencies can be exploited, the required population size may shrink dramatically, effectively speeding up the whole learning process by orders of magnitude. Finally, it is also possible to use arbitrary structures such as gene expression programming or neural networks. However, the improved generalization capabilities can reduce the interpretability of the developed solutions, and learning success can usually not be guaranteed because the used genetic operators may not necessarily yield a mainly local phenotypic search through the expressible condition structures.

3.3 When XCSF Fails

Even the best condition and prediction structures do not necessarily guarantee successful learning. This section discusses some issues where fine-tuning of some parameters may help to reach the desired accuracy. Furthermore, we point out when XCSF reaches its limits, so that simple parameter tuning cannot overcome learning failures.
Ideally, given an unknown function, XCSF's prediction error quickly drops below ε₀ (see Figure 2(a) for a typical performance graph). When XCSF is not able to accurately learn the function, there are four possible main reasons:

1. The prediction error has not yet converged.
2. The prediction error converged to an average error above the target error.
3. The prediction error stays at an initially very low level, but the function surface is not fully approximated.
4. The prediction error stays at an initially high level.

In case 1, the learning time is too short to allow for an appropriate structuring of the input space. Increasing the number of iterations will solve this issue. In contrast, case 2 indicates that the function is too difficult to approximate with the given population size, target error, predictor, and condition structure. Figure 2(b) illustrates a problem in which the system does not reach the target error. Increasing the learning time allows for a settling of the prediction error, but the target error is only reached when the maximum population size is increased.
While in the previous examples XCSF just does not reach the target error, in other scenarios the system completely fails to learn anything due to bad parameter choices. There are two major factors that may prevent learning completely: covering-deletion cycles and flat fitness landscapes. Although case 3

[Figure 2 plots: prediction error, number of macro classifiers, and match set macro classifiers over the number of learning steps (1000s) for (a) the crossed ridge function in 2D and (b) the sine-in-sine function in 2D.]

Fig. 2. Typical performance measurements on two benchmark functions. The target error ε₀ = 0.01 is represented by a dashed line. (a) The chosen settings are well suited for the crossed-ridge function, and the prediction error converges to a value below the target error. (b) In contrast, the sine-in-sine function is too difficult for the same settings, and the system neither reaches the target error nor does the prediction error converge within the given learning time.

[Figure 3 plots: prediction error, number of macro classifiers, and match set macro classifiers over the number of learning steps (1000s) for (a) the 20D sine function with a too small r0 and (b) the 20D sine function with a too large r0.]

Fig. 3. Especially on high-dimensional functions, it is crucial to set the initial classifier size r0 to a reasonable value. (a) A small initial size leads to a covering-deletion cycle. (b) When the fitness landscape is too flat, the evolutionary search is unable to identify better substructures and oversized classifiers prevent learning.

seems strange, there is a simple explanation. If the population size and initial classifier size are set such that the input space cannot be covered by the covering mechanism, the system continuously covers and deletes classifiers without any knowledge gain (the so-called covering-deletion cycle [10]). Typically, the average match set size is one, the population size quickly reaches the maximum, and the average prediction error is almost zero because the error during covering is zero. As an example, we equip XCSF with a small initial classifier size r0 and run the system on a 20-dimensional sine function as shown in Figure 3(a). Especially high-dimensional input spaces are prone to this problematic cycle, because (1) the initial classifier volume has to be high enough to allow for a complete coverage, but (2) the initial volume must not exceed the size beyond which the GA no longer receives a sufficient fitness signal.
The latter may be the case when a single mutation of the initial covering shape cannot produce a sufficiently small classifier that captures the (possibly fine-grained) structure of the underlying function. Thus, the GA is missing a fitness gradient and, due to higher reproductive opportunities, over-general classifiers take over the population, as shown in Figure 3(b). Typically, the prediction error does not drop at all. Here XCSF reaches its limits, and simple parameter tuning may not help to overcome the problem with a reasonable population size. Possibly, a refined initial classifier size hits a reasonable fitness signal and prevents over-general classifiers. Otherwise, it might be necessary to reconsider the condition structure or the corresponding evolutionary operators.
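The symptoms described above suggest a simple diagnostic heuristic for monitoring a run; the thresholds and labels below are illustrative assumptions, not part of XCSF itself.

```python
def diagnose_run(avg_matchset_size, pop_size, max_pop, pred_error, eps0):
    """Heuristic diagnosis of the two complete-failure modes described
    above. Covering-deletion cycle: match sets of size ~1, population at
    its maximum, near-zero error (the error during covering is zero).
    Flat fitness landscape: error stays high despite populated match
    sets. Thresholds are illustrative."""
    if avg_matchset_size <= 1.05 and pop_size >= max_pop and pred_error < eps0:
        return "covering-deletion cycle (increase r0 or N)"
    if pred_error > eps0 and avg_matchset_size > 1.05:
        return "flat fitness / problem too difficult (tune r0, conditions, or N)"
    return "ok"
```

Such a check would be run on the averaged statistics that Figures 2 and 3 plot (prediction error, macro classifiers, match set size).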

4 A Brief Comparison with Locally Weighted Projection Regression

Apart from traditional function fitting, where the general type of the underlying function has to be known before fitting the data, the so-called Locally Weighted Projection Regression (LWPR) algorithm [26,27] also approximates functions iteratively by means of local linear models, as does XCSF. The following paragraphs highlight the main differences of LWPR to XCSF and sketch some theoretical thoughts on performance as well as on the applicability of both systems.
The locality of each model is defined by so-called receptive fields, which correspond to XCSF's rotating hyperellipsoidal condition structures [23]. However, in contrast to the steady-state GA in XCSF, the receptive fields in LWPR are structured by means of a statistical gradient descent. The center, that is, the position of a receptive field, is never changed once it is created. Based on the prediction errors, the receptive fields can shrink in specific directions, which theoretically minimizes the error. Indefinite shrinking is prevented by introducing a penalty term, which penalizes small receptive fields. Thus, receptive fields shrink due to prediction errors and enlarge if the influence of prediction errors is less than the influence of the penalty term. However, the ideal statistics from batch learning can only be estimated in an iterative algorithm, and experimental validation is required to shed light on the actual performance of both systems when compared on benchmark functions.
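A receptive field's activation in LWPR is commonly modeled as a Gaussian kernel with a distance metric D that the gradient descent reshapes; the following is a generic sketch of that kernel, not LWPR's actual implementation.

```python
import math

def rf_activation(x, center, D):
    """Gaussian receptive-field activation, w = exp(-0.5 (x-c)^T D (x-c)),
    with a positive definite distance metric D. Shrinking a field in a
    specific direction corresponds to increasing the matching entries of
    D; the field's center c stays fixed after creation, as noted above."""
    d = [xi - ci for xi, ci in zip(x, center)]
    q = sum(d[i] * D[i][j] * d[j]
            for i in range(len(d)) for j in range(len(d)))
    return math.exp(-0.5 * q)
```

Local predictions are then blended by these activation weights, much like XCSF's match-set-weighted prediction, which is what makes the two approaches directly comparable.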
One disadvantage of LWPR is that all its statistics are based on linear predictions and the ellipsoidal shape of receptive fields. Thus, alternative predictions or conditions cannot be applied directly. In contrast, a wide variety of prediction types and condition structures is available for XCSF, allowing for a higher representational flexibility. Furthermore, it is easily possible to decouple conditions and predictions in XCSF [6], in which case conditions cluster a contextual space for the predictions in another space. Since the fitness signal for the GA is based only on prediction errors, no coupling is necessary. It remains an open research challenge to realize similar mechanisms and modifications with LWPR.

On the other hand, the disadvantage of XCSF is a higher population size during learning, which is necessary for the niched evolutionary algorithm to work successfully. Different condition shapes have to be evaluated with several samples before a stable fitness value can be used in the evolutionary selection process. Nevertheless, it has been shown that both systems achieve comparable prediction errors in particular scenarios [23]. Future research will compare XCSF and LWPR in detail, including theoretical considerations as well as empirical evaluations on various benchmark functions.

5 Summary and Conclusions


This article discussed XCSF's current capabilities as well as scenarios that pose
a challenge for the system. From a theoretical point of view, we analyzed the
preconditions for successful learning and, if these conditions are met, how the
system scales to higher problem complexities, including function structure and
dimensionality.
In order to successfully learn the surface of a given function, XCSF has to
overcome the same challenges that were identified for XCS: the covering challenge,
schema challenge, reproductive opportunity challenge, learning time challenge,
and solution sustenance challenge. Given a uniform function structure and uniform
sampling, the scalability model predicts an exponential influence of the
input space dimensionality on the population size. Moreover, a polynomial increase
in the required population size is expected when the function complexity
is linearly increased or when the target error is linearly decreased.
From a practitioner's viewpoint, we highlighted XCSF's important parameters
and gave a brief guide on how to set them appropriately. Additional
parameter tuning suggestions may help if initial settings fail to reach the desired
target error in certain cases. Examples illustrated when XCSF completely fails
due to a covering-deletion cycle or due to flat fitness landscapes. Thus, failures
in actual applications can be understood, and refined parameter choices can
eventually resolve the problem.
Finally, a brief comparison with a statistics-based machine learning technique,
namely Locally Weighted Projection Regression (LWPR), discussed advantages
and disadvantages of the evolutionary approach employed in XCSF. A current
study, which also includes empirical experiments, supports the presented comparison
with respect to several relevant performance measures [28].

Acknowledgments
The authors acknowledge funding from the Emmy Noether program of the German
Research Foundation (grant BU1335/3-1) and would like to thank their colleagues
at the Department of Psychology and the COBOSLAB team.
68 P.O. Stalph and M.V. Butz

References
1. Holland, J.H.: Adaptation in Natural and Artificial Systems: An Introductory Analysis
with Applications to Biology, Control, and Artificial Intelligence. The MIT Press,
Cambridge (1992)
2. Wilson, S.W.: Classifier fitness based on accuracy. Evolutionary Computation 3(2),
149–175 (1995)
3. Butz, M.V., Goldberg, D.E., Lanzi, P.L.: Gradient descent methods in learning
classifier systems: Improving XCS performance in multistep problems. Technical
report, Illinois Genetic Algorithms Laboratory (2003)
4. Bernadó-Mansilla, E., Garrell-Guiu, J.M.: Accuracy-based learning classifier systems:
Models, analysis, and applications to classification tasks. Evolutionary Computation
11, 209–238 (2003)
5. Butz, M.V.: Rule-Based Evolutionary Online Learning Systems: A Principled Approach
to LCS Analysis and Design. Springer, Heidelberg (2006)
6. Butz, M.V., Herbort, O.: Context-dependent predictions and cognitive arm control
with XCSF. In: GECCO 2008: Proceedings of the 10th Annual Conference on
Genetic and Evolutionary Computation, pp. 1357–1364. ACM, New York (2008)
7. Stalph, P.O., Butz, M.V., Pedersen, G.K.M.: Controlling a four degree of freedom
arm in 3D using the XCSF learning classifier system. In: Mertsching, B., Hund,
M., Aziz, Z. (eds.) KI 2009. LNCS, vol. 5803, pp. 193–200. Springer, Heidelberg
(2009)
8. Wilson, S.W.: Classifiers that approximate functions. Natural Computing 1,
211–234 (2002)
9. Stalph, P.O., Llorà, X., Goldberg, D.E., Butz, M.V.: Resource management and
scalability of the XCSF learning classifier system. Theoretical Computer Science
(in press), http://dx.doi.org/10.1016/j.tcs.2010.07.007
10. Butz, M.V., Kovacs, T., Lanzi, P.L., Wilson, S.W.: How XCS evolves accurate classifiers.
In: Proceedings of the Genetic and Evolutionary Computation Conference
(GECCO 2001), pp. 927–934 (2001)
11. Wright, A.H.: Genetic algorithms for real parameter optimization. In: Foundations
of Genetic Algorithms, pp. 205–218. Morgan Kaufmann, San Francisco (1991)
12. Goldberg, D.E.: Real-coded genetic algorithms, virtual alphabets, and blocking.
Complex Systems 5, 139–167 (1991)
13. Radcliffe, N.J.: Equivalence class analysis of genetic algorithms. Complex Systems
5, 183–205 (1991)
14. Mühlenbein, H., Schlierkamp-Voosen, D.: Predictive models for the breeder genetic
algorithm I. Continuous parameter optimization. Evolutionary Computation 1,
25–49 (1993)
15. Beyer, H.G., Schwefel, H.P.: Evolution strategies - a comprehensive introduction.
Natural Computing 1(1), 3–52 (2002)
16. Bosman, P.A.N., Thierens, D.: Numerical optimization with real-valued estimation-
of-distribution algorithms. In: Scalable Optimization via Probabilistic Modeling.
SCI, vol. 33, pp. 91–120. Springer, Heidelberg (2006)
17. Stalph, P.O., Butz, M.V.: How fitness estimates interact with reproduction
rates: Towards variable offspring set sizes in XCSF. In: Bacardit, J. (ed.) IWLCS
2008/2009. LNCS (LNAI), vol. 6471, pp. 47–56. Springer, Heidelberg (2010)
18. Orriols-Puig, A., Bernadó-Mansilla, E.: Bounding XCS's parameters for unbalanced
datasets. In: GECCO 2006: Proceedings of the 8th Annual Conference on
Genetic and Evolutionary Computation, pp. 1561–1568. ACM, New York (2006)
19. Kovacs, T., Kerber, M.: What makes a problem hard for XCS? In: Lanzi, P.L.,
Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2000. LNCS (LNAI), vol. 1996, pp.
251–258. Springer, Heidelberg (2001)
20. Wilson, S.W.: Get real! XCS with continuous-valued inputs. In: Lanzi, P.L., Stolzmann,
W., Wilson, S.W. (eds.) IWLCS 1999. LNCS (LNAI), vol. 1813, pp. 209–219.
Springer, Heidelberg (2000)
21. Wilson, S.W.: Generalization in the XCS classifier system. In: Genetic Programming
1998: Proceedings of the Third Annual Conference, pp. 665–674 (1998)
22. Stone, C., Bull, L.: For real! XCS with continuous-valued inputs. Evolutionary
Computation 11(3), 299–336 (2003)
23. Butz, M.V., Lanzi, P.L., Wilson, S.W.: Function approximation with XCS: Hyperellipsoidal
conditions, recursive least squares, and compaction. IEEE Transactions
on Evolutionary Computation 12, 355–376 (2008)
24. Wilson, S.W.: Classifier conditions using gene expression programming. In: Bacardit,
J., Bernadó-Mansilla, E., Butz, M.V., Kovacs, T., Llorà, X., Takadama,
K. (eds.) IWLCS 2006 and IWLCS 2007. LNCS (LNAI), vol. 4998, pp. 206–217.
Springer, Heidelberg (2008)
25. Lanzi, P.L., Loiacono, D., Wilson, S.W., Goldberg, D.E.: Extending XCSF beyond
linear approximation. In: GECCO 2005: Proceedings of the 2005 Conference on
Genetic and Evolutionary Computation, pp. 1827–1834 (2005)
26. Vijayakumar, S., Schaal, S.: Locally weighted projection regression: An O(n) algorithm
for incremental real time learning in high dimensional space. In: ICML 2000:
Proceedings of the Seventeenth International Conference on Machine Learning, pp.
1079–1086 (2000)
27. Vijayakumar, S., D'Souza, A., Schaal, S.: Incremental online learning in high dimensions.
Neural Computation 17(12), 2602–2634 (2005)
28. Stalph, P.O., Rubinsztajn, J., Sigaud, O., Butz, M.V.: A comparative study: Function
approximation with LWPR and XCSF. In: GECCO 2010: Proceedings of the
12th Annual Conference on Genetic and Evolutionary Computation (in press, 2010)
Recursive Least Squares and Quadratic Prediction in Continuous Multistep Problems

Daniele Loiacono and Pier Luca Lanzi

Dipartimento di Elettronica e Informazione


Politecnico di Milano
Milano, Italy
{loiacono,lanzi}@elet.polimi.it

Abstract. XCS with computed prediction, namely XCSF, has been recently
extended in several ways. In particular, a novel prediction update
algorithm based on recursive least squares and the extension to polynomial
prediction led to significant improvements of XCSF. However,
these extensions have been studied so far only on single-step problems,
and it is currently not clear whether these findings might be extended to
multistep problems. In this paper we investigate this issue by analyzing
the performance of XCSF with recursive least squares and with quadratic
prediction on continuous multistep problems. Our results show that both
these extensions improve the convergence speed of XCSF toward an optimal
performance. As shown by the analysis reported in this paper,
these improvements are due to the capability of recursive least squares
and of polynomial prediction to provide a more accurate approximation
of the problem value function after the first few learning problems.

1 Introduction
Learning Classifier Systems are a genetics-based machine learning technique for
solving problems through the interaction with an unknown environment. The
XCS classifier system [16] is probably the most successful learning classifier system
to date. It couples effective temporal difference learning, implemented as
a modification of the well-known Q-learning [14], to a niched genetic algorithm
guided by an accuracy-based fitness to evolve accurate, maximally general solutions.
In [18] Wilson extended XCS with the idea of computed prediction to
improve the estimation of the classifier prediction. In XCS with computed prediction,
XCSF in brief, the classifier prediction is not memorized into a parameter
but computed as a linear combination of the current input and a weight vector
associated with each classifier. Recently, in [11] the classifier weight update has
been improved with a recursive least squares approach, and the idea of computed
prediction has been further extended to polynomial prediction. Both the recursive
least squares update and the polynomial prediction have been effectively
applied to solve function approximation problems as well as to learn Boolean
functions. However, it is not currently clear whether these findings might
be extended also to continuous multistep problems, where Wilson's XCSF has
J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 70–86, 2010.
© Springer-Verlag Berlin Heidelberg 2010
Recursive Least Squares and Quadratic Prediction 71

been already successfully applied [9]. In this paper we investigate this important
issue. First, we extend the recursive least squares update algorithm to multistep
problems with covariance resetting, a well-known approach to deal with a non-stationary
target. Then, to test our approach, we compare the usual Widrow-Hoff
update rule to the recursive least squares one (extended with covariance resetting)
on a class of continuous multistep problems, the 2D Gridworld problems [1].
Our results show that XCSF with recursive least squares outperforms XCSF with
the Widrow-Hoff rule in terms of convergence speed, although both finally reach an
optimal performance. Thus, the results confirm the findings of previous works
on XCSF with recursive least squares applied to single-step problems. In addition,
we performed a similar experimental analysis to investigate the effect of
polynomial prediction on the same set of problems. Also in this case, the results
suggest that quadratic prediction results in a faster convergence of XCSF
toward the optimal performance. Finally, to explain why recursive least squares
and polynomial prediction increase the convergence speed of XCSF, we show
that they improve the accuracy of the payoff landscape learned in the first few
learning problems.

2 XCS with Computed Prediction


XCSF differs from XCS in three respects: (i) classifier conditions are extended
for numerical inputs, as done in XCSI [17]; (ii) classifiers are extended with a
vector of weights w, used to compute the prediction; finally, (iii) the original
update of the classifier prediction is modified so that the weights are updated
instead of the classifier prediction. These three modifications result in a version of
XCS, XCSF [18,19], that maps numerical inputs into actions with an associated
computed prediction. In the original paper [18] classifiers have no action and it is
assumed that XCSF outputs the estimated prediction, instead of the action itself.
In this paper, we consider the version of XCSF with actions and linear prediction
(named XCS-LP [19]) in which more than one action is available. As said before,
throughout the paper we do not keep the (rather historical) distinction between
XCSF and XCS-LP, since the two systems are basically identical except for the
use of actions in the latter case.

Classifiers. In XCSF, classifiers consist of a condition, an action, and four main
parameters. The condition specifies which input states the classifier matches;
as in XCSI [17], it is represented by a concatenation of interval predicates,
int_i = (l_i, u_i), where l_i (lower) and u_i (upper) are integers, though they
might also be real. The action specifies the action for which the payoff is predicted.
The four parameters are: the weight vector w, used to compute the classifier
prediction as a function of the current input; the prediction error ε, which
estimates the error affecting the classifier prediction; the fitness F, which estimates
the accuracy of the classifier prediction; and the numerosity num, a counter used to
represent different copies of the same classifier. Note that the size of the weight
vector w depends on the type of approximation. In the case of piecewise-linear
approximation, considered in this paper, the weight vector w has one weight w_i
72 D. Loiacono and P.L. Lanzi

for each possible input, and an additional weight w_0 corresponding to a constant
input x_0, which is set as a parameter of XCSF.
Performance Component. XCSF works as XCS. At each time step t, XCSF
builds a match set [M] containing the classifiers in the population [P] whose
condition matches the current sensory input s_t; if [M] contains less than θ_mna
actions, covering takes place and creates a new classifier that matches the current
inputs and has a random action. Each interval predicate int_i = (l_i, u_i) in
the condition of a covering classifier is generated as l_i = s_t(i) − rand(r_0) and
u_i = s_t(i) + rand(r_0), where s_t(i) is the input value of state s_t matched by the
interval predicate int_i, and the function rand(r_0) generates a random integer in
the interval [0, r_0] with r_0 a fixed integer. The weight vector w of covering classifiers
is randomly initialized with values from [−1, 1]; all the other parameters are
initialized as in XCS (see [3]).
For each action a_i in [M], XCSF computes the system prediction, which estimates
the payoff that XCSF expects when action a_i is performed. As in XCS, in XCSF
the system prediction of action a is computed by the fitness-weighted average
of all matching classifiers that specify action a. However, in contrast with XCS,
in XCSF the classifier prediction is computed as a function of the current state s_t
and the classifier weight vector w. Accordingly, in XCSF the system prediction is
a function of both the current state s_t and the action a. Following a notation
similar to [2], the system prediction for action a in state s_t, P(s_t, a), is defined
as:

    P(s_t, a) = \frac{\sum_{cl \in [M]|_a} cl.p(s_t) \cdot cl.F}{\sum_{cl \in [M]|_a} cl.F}    (1)

where cl is a classifier, [M]|_a represents the subset of classifiers in [M] with action
a, cl.F is the fitness of cl, and cl.p(s_t) is the prediction of cl computed in the state
s_t. In particular, when piecewise-linear approximation is considered, cl.p(s_t) is
computed as:

    cl.p(s_t) = cl.w_0 \cdot x_0 + \sum_{i>0} cl.w_i \cdot s_t(i)    (2)
where cl.w_i is the weight w_i of cl and x_0 is a constant input. The values of
P(s_t, a) form the prediction array. Next, XCSF selects an action to perform.
The classifiers in [M] that advocate the selected action are put in the current
action set [A]; the selected action is sent to the environment and a reward P is
returned to the system.
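As a concrete illustration, the matching and fitness-weighted prediction steps above can be sketched in Python. The dictionary-based classifier representation and helper names are our own illustrative choices, not taken from the paper's implementation:

```python
X0 = 1.0  # the constant input x0 (a parameter of XCSF)

def classifier_prediction(cl, s):
    """Piecewise-linear prediction cl.p(s) = w0*x0 + sum_i wi*s(i) (Eq. 2)."""
    w = cl["w"]
    return w[0] * X0 + sum(wi * si for wi, si in zip(w[1:], s))

def matches(cl, s):
    """Interval-predicate matching: l_i <= s(i) <= u_i for every attribute."""
    return all(l <= si <= u for (l, u), si in zip(cl["condition"], s))

def prediction_array(population, s):
    """Fitness-weighted system prediction P(s, a) per matched action (Eq. 1)."""
    match_set = [cl for cl in population if matches(cl, s)]
    pa = {}
    for a in {cl["action"] for cl in match_set}:
        subset = [cl for cl in match_set if cl["action"] == a]
        pa[a] = (sum(classifier_prediction(cl, s) * cl["F"] for cl in subset)
                 / sum(cl["F"] for cl in subset))
    return pa
```

Action selection (e.g., ε-greedy over `pa`) then proceeds as in XCS.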
Reinforcement Component. XCSF uses the incoming reward P to update the
parameters of the classifiers in the action set [A]. The weight vector w of the classifiers
in [A] is updated using a modified delta rule [15]. For each classifier cl ∈ [A],
each weight cl.w_i is adjusted by a quantity Δw_i computed as:

    \Delta w_i = \frac{\eta}{|s_t|^2} (P - cl.p(s_t)) \, s_t(i)    (3)

where η is the correction rate and |s_t|^2 is the norm of the input vector s_t (see [18]
for details). Equation 3 is usually referred to as the normalized Widrow-Hoff
update or modified delta rule, because of the presence of the term |s_t|^2 [5].
The values Δw_i are used to update the weights of classifier cl as:

    cl.w_i \leftarrow cl.w_i + \Delta w_i    (4)

Then the prediction error ε is updated as:

    cl.\varepsilon \leftarrow cl.\varepsilon + \beta (|P - cl.p(s_t)| - cl.\varepsilon)    (5)

Finally, the classifier fitness is updated as in XCS.
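The updates of Equations 3–5 can be sketched as follows. The parameter names (`eta` for the correction rate, `beta` for the learning rate) and the functional style are illustrative assumptions; note that, following Equation 3, the normalization uses |s_t|² and not the norm of the extended input vector:

```python
def widrow_hoff_update(w, s, P, x0=1.0, eta=0.2):
    """Normalized Widrow-Hoff update of the weight vector (Eqs. 3 and 4).

    w[0] pairs with the constant input x0; s is the current state and
    P the target payoff.
    """
    x = [x0] + list(s)                   # full input vector
    pred = sum(wi * xi for wi, xi in zip(w, x))
    norm2 = sum(si * si for si in s)     # |s_t|^2, as in Eq. 3
    delta = eta * (P - pred) / norm2
    return [wi + delta * xi for wi, xi in zip(w, x)]

def error_update(eps, P, pred, beta=0.2):
    """Prediction-error update (Eq. 5), with beta the learning rate."""
    return eps + beta * (abs(P - pred) - eps)
```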
Discovery Component. The genetic algorithm and subsumption deletion in
XCSF work as in XCSI [17]. On a regular basis, depending on the parameter θ_GA,
the genetic algorithm is applied to the classifiers in [A]. It selects two classifiers with
probability proportional to their fitness, copies them, and with probability χ
performs crossover on the copies; then, with probability μ, it mutates each allele.
Crossover and mutation work as in XCSI [17,18]. The resulting offspring are
inserted into the population and two classifiers are deleted to keep the population
size constant.
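The fitness-proportional selection step described above can be sketched as roulette-wheel sampling; the dictionary classifier representation with a fitness field `F` is a hypothetical stand-in:

```python
import random

def proportional_select(action_set):
    """Roulette-wheel selection: pick a classifier with probability
    proportional to its fitness F."""
    total = sum(cl["F"] for cl in action_set)
    r = random.random() * total
    acc = 0.0
    for cl in action_set:
        acc += cl["F"]
        if acc >= r:
            return cl
    return action_set[-1]  # guard against floating-point round-off
```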

3 Improving and Extending Computed Prediction


The idea of computed prediction, introduced by Wilson in [18], has been recently
improved and extended in several ways [11,12,6,10]. In particular, Lanzi et al.
extended computed prediction to polynomial functions [7] and introduced
in [11] a novel prediction update algorithm based on recursive least squares. Although
these extensions proved to be very effective in single-step problems, both
in function approximation problems [11,7] and in Boolean problems [8], they have
never been applied to multistep problems so far. In the following, we briefly describe
the classifier update algorithm based on recursive least squares and how it
can be applied to multistep problems. Finally, we show how computed prediction
can be extended to polynomial prediction.

3.1 XCSF with Recursive Least Squares


In XCSF with recursive least squares, the Widrow-Hoff rule used to update the
classifier weights is replaced with a more effective update algorithm based on
recursive least squares (RLS). At time step t, given the current state s_t and the
target payoff P, recursive least squares updates the weight vector w as

    w_t = w_{t-1} + k_t \, [P - x_t^T w_{t-1}],

where x_t = [x_0 \; s_t]^T and k_t, called the gain vector, is computed as

    k_t = \frac{V_{t-1} x_t}{1 + x_t^T V_{t-1} x_t},    (6)

while the matrix V_t is computed recursively by

    V_t = \left(I - k_t x_t^T\right) V_{t-1}.    (7)
The matrix V_t is usually initialized as V_0 = δ_rls I, where δ_rls is a positive
constant and I is the n × n identity matrix. A higher δ_rls denotes that the initial
parametrization is uncertain; accordingly, the algorithm will initially use a higher,
thus faster, update rate (k_t). A lower δ_rls denotes that the initial parametrization is
rather certain; accordingly, the algorithm will use a slower update. It is worthwhile
to say that the recursive least squares approach presented above involves two basic
underlying assumptions [5,4]: (i) the noise on the target payoff P used for updating
the classifier weights can be modeled as a unitary-variance white noise, and
(ii) the optimal classifier weight vector does not change during the learning process,
i.e., the problem is stationary. While the first assumption is often reasonable
and usually has a small impact on the final outcome, the second assumption is
not justified in many problems and may have a big impact on the performance.
In the literature [5,4] many approaches have been introduced for relaxing this assumption.
In particular, a straightforward approach is the resetting of the matrix
V: every τ_rls updates, the matrix V is reset to its initial value δ_rls I. Intuitively,
this prevents RLS from converging toward a fixed parameter estimate by continually
restarting the learning process. We refer the interested reader to [5,4] for a more
detailed analysis of recursive least squares and other related approaches, like the
well-known Kalman filter. The extension of XCSF with recursive least squares is
straightforward: we added to each classifier the matrix V as an additional parameter
and we replaced the usual update of the classifier weights with the recursive least
squares update described above and reported as Algorithm 1.

Algorithm 1. Update classifier cl with the RLS algorithm

 1: procedure UPDATE_PREDICTION(cl, s, P)
 2:   error ← P − cl.p(s);                       ▷ Compute the current error
 3:   x(0) ← x_0;                                ▷ Build x by adding x_0 to s
 4:   for i ∈ {1, ..., |s|} do
 5:     x(i) ← s(i);
 6:   end for
 7:   if # of updates since last reset > τ_rls then
 8:     cl.V ← δ_rls I                           ▷ Reset cl.V
 9:   end if
10:   β_rls ← (1 + x^T cl.V x)^{−1};
11:   cl.V ← cl.V − β_rls cl.V x x^T cl.V;       ▷ Update cl.V
12:   k ← cl.V x;                                ▷ Compute the gain vector
13:   for i ∈ {0, ..., |s|} do                   ▷ Update classifier weights
14:     cl.w_i ← cl.w_i + k(i) · error;
15:   end for
16: end procedure
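A possible NumPy rendering of Algorithm 1, including the covariance resetting of Section 3.1, is sketched below. The field names (`w`, `V`, `updates`) and the in-place update style are our own assumptions, not the paper's code. Note that after line 11 of the algorithm, cl.V x equals the gain vector k_t of Equation 6, which the sketch exploits:

```python
import numpy as np

def rls_update(cl, s, P, x0=1.0, delta_rls=10.0, tau_rls=50):
    """One RLS weight update with covariance resetting (Algorithm 1).

    cl is assumed to carry a weight vector cl["w"], a covariance
    matrix cl["V"], and an update counter cl["updates"].
    """
    x = np.concatenate(([x0], np.asarray(s, dtype=float)))  # input vector
    error = P - cl["w"] @ x              # current prediction error
    cl["updates"] += 1
    if cl["updates"] > tau_rls:          # covariance resetting
        cl["V"] = delta_rls * np.eye(len(x))
        cl["updates"] = 0
    beta = 1.0 / (1.0 + x @ cl["V"] @ x)
    cl["V"] = cl["V"] - beta * np.outer(cl["V"] @ x, x @ cl["V"])
    k = cl["V"] @ x                      # gain vector k_t (= updated V times x)
    cl["w"] = cl["w"] + k * error
```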

Computational Complexity. It is worth comparing the complexity of the
Widrow-Hoff rule and recursive least squares, both in terms of the memory required
for each classifier and the time required by each classifier update. For each classifier,
recursive least squares stores the matrix cl.V, which is n × n; thus its additional
space complexity is O(n^2), where n = |x| is the size of the input vector. With
respect to the time required for each update, the Widrow-Hoff update rule involves
only n scalar multiplications and, thus, is O(n); instead, recursive least
squares requires a matrix multiplication, which is O(n^2). Therefore, recursive
least squares is more complex than the Widrow-Hoff rule both in terms of memory
and time requirements.

3.2 Beyond Linear Prediction


Usually in XCSF the classifier prediction is computed as a linear function, so
that piecewise-linear approximations of the action-value function are evolved.
However, XCSF can be easily extended to evolve also polynomial approximations.
Let us consider a simple problem with a single-variable state space. At
time step t, the classifier prediction is computed as

    cl.p(s_t) = w_0 x_0 + w_1 s_t,

where x_0 is a constant input and s_t is the current state. Thus, we can introduce
a quadratic term in the approximation evolved by XCSF:

    cl.p(s_t) = w_0 x_0 + w_1 s_t + w_2 s_t^2.    (8)

To learn the new set of weights we use the usual XCSF update algorithm (e.g.,
either RLS or Widrow-Hoff) applied to the input vector x_t, defined as
x_t = ⟨x_0, s_t, s_t^2⟩.
When more variables are involved, so that s_t = ⟨s_t(1), ..., s_t(n)⟩, we define

    x_t = ⟨x_0, s_t(1), s_t^2(1), ..., s_t(n), s_t^2(n)⟩,

and apply XCSF to the newly defined input space. The same approach can
be generalized to allow the approximation of any polynomial of order k by
extending the input vector x_t with higher-order terms. However, in this paper, for
the sake of simplicity, we will limit our analysis to quadratic prediction.
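The feature expansion that turns linear prediction into quadratic prediction is a simple preprocessing step; a minimal sketch, with an illustrative helper name:

```python
def quadratic_input(s, x0=1.0):
    """Expand state s into <x0, s(1), s(1)^2, ..., s(n), s(n)^2>."""
    x = [x0]
    for si in s:
        x.extend((si, si * si))
    return x
```

The existing Widrow-Hoff or RLS update can then be applied unchanged to the expanded vector.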

4 Experimental Design
To study how recursive least squares and quadratic prediction affect the
performance of XCSF on continuous multistep problems, we considered a well-known
class of problems: the 2D gridworld problems, introduced in [1]. They
are two-dimensional environments in which the current state is defined by a
pair of real-valued coordinates ⟨x, y⟩ ∈ [0, 1]^2, the only goal is in position ⟨1, 1⟩,
and there are four possible actions (left, right, up, and down), coded with two
bits; each action corresponds to a step of size s in the corresponding direction;
actions that would take the system outside the domain [0, 1]^2 take the system to
the nearest position of the grid border. The system can start anywhere but in the
goal position, and it reaches the goal position when both coordinates are equal to or
greater than one. When the system reaches the goal it receives 0; in all the other
cases it receives −0.5. We call the problem described above the empty gridworld,
Fig. 1. The 2D Continuous Gridworld problems: (a) the optimal value function of
Grid(0.05) when γ = 0.95; (b) the Puddles(0.05) environment; (c) the optimal value
function of Puddles(0.05) when γ = 0.95

dubbed Grid(s), where s is the agent step size. Figure 1a shows the optimal
value function associated with the empty gridworld problem when s = 0.05 and
γ = 0.95.

A slightly more challenging problem can be obtained by adding some obstacles
to the empty gridworld environment, as proposed in [1]: each obstacle represents
an area in which there is an additional cost for moving. These areas are called
puddles [1], since they actually create a sort of puddle in the optimal value
function. Figure 1b depicts the Puddles(s) environment, which is derived from
Grid(s) by adding two puddles (the gray areas). When the system is in a puddle,
it receives an additional negative reward of −2, i.e., the action has an additional
cost of −2; in the area where the two puddles overlap, the darker gray region, the
two negative rewards add up, i.e., the action has a total additional cost of −4.
We call this second problem the puddle world, dubbed Puddles(s), where s is the
agent step size. Figure 1c shows the optimal value function of the puddle world
when s = 0.05 and γ = 0.95.

The performance is computed as the average number of steps to reach the
goal during the last 100 test problems. To speed up the experiments, problems
can last at most 500 steps; when this limit is reached the problem stops even if
the system did not reach the goal. All the statistics reported in this paper are
averaged over 20 experiments.
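The Grid(s)/Puddles(s) dynamics described above can be sketched as a single step function. The puddle rectangles are passed in as parameters because the paper does not give their exact coordinates:

```python
def grid_step(x, y, action, s=0.05, puddles=()):
    """One step of Grid(s)/Puddles(s); returns (x, y, reward, done).

    puddles is a tuple of rectangles (x0, y0, x1, y1), each adding a
    cost of -2 when the agent is inside it.
    """
    dx, dy = {"left": (-s, 0.0), "right": (s, 0.0),
              "up": (0.0, s), "down": (0.0, -s)}[action]
    x = min(max(x + dx, 0.0), 1.0)       # clip to the grid border
    y = min(max(y + dy, 0.0), 1.0)
    if x >= 1.0 and y >= 1.0:            # goal: both coordinates >= 1
        return x, y, 0.0, True
    reward = -0.5
    for (px0, py0, px1, py1) in puddles:  # overlapping puddles add up
        if px0 <= x <= px1 and py0 <= y <= py1:
            reward -= 2.0
    return x, y, reward, False
```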

5 Experimental Results
Our aim is to study how the RLS update and quadratic prediction affect
the performance of XCSF on continuous multistep problems. To this purpose,
we applied XCSF with different types of prediction, i.e., linear and quadratic,
and with different update rules, i.e., Widrow-Hoff and RLS, to the Grid(0.05)
and Puddles(0.05) problems. In addition, we also compared the performance
of XCSF to the one obtained with tabular Q-learning [13], a standard reference
in the RL literature. In order to apply tabular Q-learning to the 2D Gridworld
problems, we discretized the continuous problem space, using the step size
s = 0.05 as the resolution for the discretization process. In the first set of experiments
we investigated the effect of the RLS update on the performance of XCSF, while
in the second set of experiments we extended our analysis also to quadratic
prediction. Finally, we analyzed the results obtained and the accuracy of the
action-value approximations learned by the different versions of XCSF.
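After discretizing [0, 1]² at resolution s, the tabular Q-learning baseline reduces to the standard one-step backup; a minimal sketch over a hypothetical dict-of-dicts table indexed by discretized states:

```python
def q_update(Q, s, a, r, s_next, done, alpha=0.2, gamma=0.95):
    """One tabular Q-learning backup: Q(s,a) += alpha * (target - Q(s,a)),
    with target = r + gamma * max_a' Q(s', a') for non-terminal steps."""
    target = r if done else r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
```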

5.1 Results with Recursive Least Squares


In the first set of experiments we compared Q-learning and XCSF with the
two different updates on the 2D continuous gridworld problems. For XCSF we
used the following parameter settings: N = 5000, ε_0 = 0.05, β = 0.2, α = 0.1,
γ = 0.95, ν = 5, χ = 0.8, μ = 0.04, p_explr = 0.5, θ_del = 50, θ_GA = 50, and
δ = 0.1; GA-subsumption is on with θ_sub = 50, while action-set subsumption
is off; the parameters for integer conditions are m_0 = 0.5 and r_0 = 0.25 [17]; the
parameter x_0 for XCSF is 1 [18]. In addition, with the RLS update we used
δ_rls = 10 and τ_rls = 50. Accordingly, for Q-learning we set α = 0.2, γ = 0.95,
and p_explr = 0.5. Figure 2a compares the performance of Q-learning and of
the two versions of XCSF on the Grid(0.05) problem. All the systems are able
to reach an optimal performance, and XCSF with the RLS update is able to learn
much faster than XCSF with the Widrow-Hoff update, although Q-learning is
even faster. This is not surprising, as Q-learning is provided with the optimal
state space discretization to solve the problem, while XCSF has to search for
it. However, it is worthwhile to notice that when the RLS update rule is used,
XCSF is able to learn almost as fast as Q-learning. Moving to the more difficult
Puddles(0.05) problem, we find very similar results, as shown by Figure 2b.
[Plots omitted: average number of steps to the goal vs. learning problems; the optimum for Grid(0.05) is 21 steps.]
Fig. 2. The performance of Q-learning (reported as QL), XCSF with the Widrow-Hoff
update (reported as WH), and XCSF with the RLS update (reported as RLS) applied
to: (a) the Grid(0.05) problem; (b) the Puddles(0.05) problem. Curves are averages
over 20 runs.

Also in this case, XCSF with the RLS update is able to learn faster than XCSF with
the usual Widrow-Hoff update rule, and the difference with Q-learning is even
less evident. Therefore, our results suggest that the RLS update rule is able to
exploit the collected experience more effectively than the Widrow-Hoff rule, and
they confirm the previous findings on single-step problems reported in [11].

5.2 Results with Quadratic Prediction

In the second set of experiments, we compared linear prediction to quadratic
prediction on the Grid(0.05) and the Puddles(0.05) problems, using both
Widrow-Hoff and RLS updates. Parameters are set as in the previous experiments.
Table 1a reports the performance of the systems in the first 500 test
problems as a measure of the convergence speed. As found in the previous set of
experiments, the RLS update leads to a faster convergence, also when quadratic
prediction is used. In addition, the results suggest that quadratic prediction also
affects the learning speed: both with the Widrow-Hoff update and with the RLS
update, quadratic prediction outperforms linear prediction. In particular, XCSF
with quadratic prediction and the RLS update is able to learn even faster
than Q-learning in both the Grid(0.05) and Puddles(0.05) problems. However, as
Table 1b shows, all the systems reach an optimal performance. Finally, it can
be noticed that the number of macroclassifiers evolved (Table 1c) is very similar
for all the systems, suggesting that XCSF with quadratic prediction does not
evolve a more compact solution.

Table 1. XCSF applied to the Grid(0.05) and Puddles(0.05) problems: (a) average
number of steps to reach the goal per episode in the first 500 test problems; (b) average
number of steps to reach the goal per episode in the last 500 test problems; (c) size of
the population evolved. Statistics are averages over 20 experiments.

5.3 Analysis of Results


Our results suggest that in continuous multistep problems, the RLS update and
quadratic prediction do not give any advantage either in terms of final
performance or in terms of population size. On the other hand, both these extensions
lead to an effective improvement of the learning speed, that is, they
play an important role in the early stage of the learning process. However, this

[Plots omitted: average absolute error vs. learning problems for linear/quadratic prediction with WH/RLS updates.]
Fig. 3. Average absolute error of the value functions learned by XCSF on (a) the
Grid(0.05) problem and (b) the Puddles(0.05) problem. Curves are averages over 20
runs.

result is not surprising: (i) the RLS update exploits the collected experience
more effectively and learns an accurate approximation faster; (ii) quadratic
prediction allows a broader generalization in the early stages, which leads very
quickly to a rough approximation of the payoff landscape. Figure 3 reports the
error of the value function learned by the four XCSF versions during the learning
process. The error of a learned value function is measured as the absolute error
with respect to the optimal value function, computed as the average of the absolute
errors over a uniform grid of 100 × 100 samples of the problem space. For
each version of XCSF this error measure is computed at different stages of the
learning process and then averaged over the 20 runs to generate the error curves
reported in Figure 3. The results confirm our hypothesis: both quadratic prediction
and the RLS update lead very quickly to accurate approximations of the optimal value
function, although the final approximations are as accurate as the ones evolved
by XCSF with the Widrow-Hoff rule and linear prediction. To better understand
how the different versions of XCSF approximate the value function, Figures 4, 5,
6, and 7 show examples of the value functions evolved on the Grid(0.05) problem
at different stages of learning.
Recursive Least Squares and Quadratic Prediction 81

[Three surface plots of the value function V(x,y) over the unit square: panels (a), (b), (c).]

Fig. 4. Examples of the value function evolved by XCSF with linear prediction and
Widrow-Hoff update on the Grid(0.05) problem: (a) after 50 learning episodes, (b) after
500 learning episodes, (c) at the end of the experiment (after 5000 learning episodes)

[Three surface plots of the value function V(x,y) over the unit square: panels (a), (b), (c).]

Fig. 5. Examples of the value function evolved by XCSF with linear prediction and
RLS update on the Grid(0.05) problem: (a) after 50 learning episodes, (b) after 500
learning episodes, (c) at the end of the experiment (after 5000 learning episodes)

[Three surface plots of the value function V(x,y) over the unit square: panels (a), (b), (c).]

Fig. 6. Examples of the value function evolved by XCSF with quadratic prediction
and Widrow-Hoff update on the Grid(0.05) problem: (a) after 50 learning episodes,
(b) after 500 learning episodes, (c) at the end of the experiment (after 5000 learning
episodes)

[Three surface plots of the value function V(x,y) over the unit square: panels (a), (b), (c).]

Fig. 7. Examples of the value function evolved by XCSF with quadratic prediction and
RLS update on the Grid(0.05) problem: (a) after 50 learning episodes, (b) after 500
learning episodes, (c) at the end of the experiment (after 5000 learning episodes)

Figure 5, Figure 6, and Figure 7 show examples of the value functions
learned by XCSF at different stages of the learning process. In particular,
Figure 4a and Figure 5a show the value function learned by XCSF with linear
prediction after a few learning episodes, using respectively the Widrow-Hoff
update and the RLS update. While the value function learned by XCSF with
Widrow-Hoff is flat and very uninformative, the one learned by XCSF with the
RLS update provides a rough approximation to the slope of the optimal value
function, although it is still far from accurate. Finally, Figure 6 and Figure 7
report similar examples of value functions learned by XCSF with quadratic
prediction. Figure 7a shows how XCSF with both quadratic prediction and RLS
update can learn a rough approximation of the optimal value function after
very few learning episodes. A similar analysis can be performed on the
Puddles(0.05) problem, but it is not reported here due to lack of space.
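As a concrete illustration, the grid-based error measure described above can be sketched as follows. This is a minimal sketch, not the authors' code; `learned_V` and `optimal_V` are hypothetical callables standing in for the learned and optimal value functions.

```python
# Sketch: mean absolute error between a learned value function and the
# optimal one, averaged over a uniform 100x100 grid of the problem space.
def value_function_error(learned_V, optimal_V, n=100):
    total = 0.0
    for i in range(n):
        for j in range(n):
            x, y = i / (n - 1), j / (n - 1)  # uniform samples of [0,1]^2
            total += abs(learned_V(x, y) - optimal_V(x, y))
    return total / (n * n)
```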

6 Conclusions

In this paper we investigated the application to multistep problems of two
successful extensions of XCSF: the recursive least squares update algorithm and
quadratic prediction. First, we extended the recursive least squares approach,
originally devised only for single-step problems, to multistep problems by adding
covariance resetting, a technique to deal with a non-stationary target. Second,
we showed how the linear prediction used by XCSF can be extended to quadratic
prediction in a very straightforward way. Then the recursive least squares update
and the quadratic prediction were compared to the usual XCSF on the
2D Gridworld problems. Our results suggest that the recursive least squares
update as well as the quadratic prediction lead to faster convergence of
XCSF toward optimal performance. The analysis of the accuracy of the value
function estimate showed that recursive least squares and quadratic prediction
play an important role in the early stage of the learning process. The capability
of recursive least squares to exploit the collected experience more effectively,
and the broader generalization allowed by quadratic prediction, lead to a
more accurate estimate of the value function after only a few learning episodes.
In conclusion, we showed that previous findings on recursive least squares and
polynomial prediction in single-step problems extend to continuous multistep
problems. Further investigations will include the analysis of the generalizations
evolved by XCSF with recursive least squares and quadratic prediction.
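A minimal sketch of an RLS prediction update with covariance resetting is given below. The reset schedule (`tau_reset`), the initialization constant `delta_rls`, and the class structure are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

# Sketch of a recursive least squares (RLS) weight update with periodic
# covariance resetting, so the estimator can track a non-stationary target.
class RLSPredictor:
    def __init__(self, n_inputs, x0=1.0, delta_rls=100.0, tau_reset=50):
        self.w = np.zeros(n_inputs + 1)             # weights incl. offset w0
        self.V = delta_rls * np.eye(n_inputs + 1)   # inverse-covariance estimate
        self.x0 = x0
        self.delta_rls = delta_rls
        self.tau_reset = tau_reset
        self.updates = 0

    def predict(self, s):
        x = np.concatenate(([self.x0], s))          # augment state with x0
        return float(self.w @ x)

    def update(self, s, target):
        x = np.concatenate(([self.x0], s))
        if self.updates > 0 and self.updates % self.tau_reset == 0:
            self.V = self.delta_rls * np.eye(len(x))   # covariance reset
        k = self.V @ x / (1.0 + x @ self.V @ x)        # RLS gain vector
        self.w += k * (target - self.w @ x)
        self.V -= np.outer(k, x @ self.V)
        self.updates += 1
```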

References

1. Boyan, J.A., Moore, A.W.: Generalization in reinforcement learning: Safely approximating the value function. In: Tesauro, G., Touretzky, D.S., Leen, T.K. (eds.) Advances in Neural Information Processing Systems 7, pp. 369-376. The MIT Press, Cambridge (1995)
2. Butz, M.V., Pelikan, M.: Analyzing the evolutionary pressures in XCS. In: Spector, L., Goodman, E.D., Wu, A., Langdon, W.B., Voigt, H.-M., Gen, M., Sen, S., Dorigo, M., Pezeshk, S., Garzon, M.H., Burke, E. (eds.) Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2001), July 7-11, pp. 935-942. Morgan Kaufmann, San Francisco (2001)
3. Butz, M.V., Wilson, S.W.: An algorithmic description of XCS. Journal of Soft Computing 6(3-4), 144-153 (2002)
4. Goodwin, G.C., Sin, K.S.: Adaptive Filtering: Prediction and Control. Prentice-Hall Information and System Sciences Series (March 1984)
5. Haykin, S.: Adaptive Filter Theory, 4th edn. Prentice-Hall, Englewood Cliffs (2001)
6. Lanzi, P.L., Loiacono, D.: XCSF with neural prediction. In: IEEE Congress on Evolutionary Computation, CEC 2006, pp. 2270-2276 (2006)
7. Lanzi, P.L., Loiacono, D., Wilson, S.W., Goldberg, D.E.: Extending XCSF beyond linear approximation. In: Genetic and Evolutionary Computation GECCO 2005, Washington DC, USA, pp. 1859-1866. ACM Press, New York (2005)
8. Lanzi, P.L., Loiacono, D., Wilson, S.W., Goldberg, D.E.: XCS with computed prediction for the learning of Boolean functions. In: Proceedings of the IEEE Congress on Evolutionary Computation CEC 2005, Edinburgh, UK, pp. 588-595. IEEE, Los Alamitos (September 2005)
9. Lanzi, P.L., Loiacono, D., Wilson, S.W., Goldberg, D.E.: XCS with computed prediction in continuous multistep environments. In: Proceedings of the IEEE Congress on Evolutionary Computation CEC 2005, Edinburgh, UK, pp. 2032-2039. IEEE, Los Alamitos (September 2005)
10. Lanzi, P.L., Loiacono, D., Wilson, S.W., Goldberg, D.E.: Prediction update algorithms for XCSF: RLS, Kalman filter, and gain adaptation. In: GECCO 2006: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, pp. 1505-1512. ACM Press, New York (2006)
11. Lanzi, P.L., Loiacono, D., Wilson, S.W., Goldberg, D.E.: Generalization in the XCSF classifier system: Analysis, improvement, and extension. Evolutionary Computation 15(2), 133-168 (2007)
12. Loiacono, D., Marelli, A., Lanzi, P.L.: Support vector regression for classifier prediction. In: GECCO 2007: Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, pp. 1806-1813. ACM Press, New York (2007)
13. Watkins, C.J.C.H.: Learning from delayed rewards. PhD thesis, University of Cambridge (1989)
14. Watkins, C.J.C.H., Dayan, P.: Technical note: Q-learning. Machine Learning 8, 279-292 (1992)
15. Widrow, B., Hoff, M.E.: Adaptive switching circuits. In: Neurocomputing: Foundations of Research, pp. 126-134. The MIT Press, Cambridge (1988)
16. Wilson, S.W.: Classifier fitness based on accuracy. Evolutionary Computation 3(2), 149-175 (1995), http://prediction-dynamics.com/
17. Wilson, S.W.: Mining oblique data with XCS. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (workshop organisers) Proceedings of the International Workshop on Learning Classifier Systems (IWLCS-2000), in the Joint Workshops of SAB 2000 and PPSN 2000, pp. 158-174 (2000)
18. Wilson, S.W.: Classifiers that approximate functions. Journal of Natural Computing 1(2-3), 211-234 (2002)
19. Wilson, S.W.: Classifier systems for continuous payoff environments. In: Deb, K., Poli, R., Banzhaf, W., Beyer, H.-G., Burke, E., Darwen, P., Dasgupta, D., Floreano, D., Foster, J., Harman, M., Holland, O., Lanzi, P.L., Spector, L., Tettamanzi, A., Thierens, D., Tyrrell, A. (eds.) GECCO 2004. LNCS, vol. 3103, pp. 824-835. Springer, Heidelberg (2004)
Use of a Connection-Selection Scheme in Neural XCSF

Gerard David Howard1, Larry Bull1, and Pier-Luca Lanzi2


1 Department of Computer Science, University of the West of England, Bristol, UK
{gerard2.howard,larry.bull}@uwe.ac.uk
2 Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milan, Italy
pierluca.lanzi@polimi.it

Abstract. XCSF is a modern form of Learning Classifier System (LCS) that


has proven successful in a number of problem domains. In this paper we exploit
the modular nature of XCSF to include a number of extensions, namely a neural
classifier representation, self-adaptive mutation rates and neural constructivism.
It is shown that, via constructivism, appropriate internal rule complexity
emerges during learning. It is also shown that self-adaptation allows this rule
complexity to emerge at a rate controlled by the learner. We evaluate this
system on both discrete and continuous-valued maze environments. The main
contribution of this work is the implementation of a feature selection derivative
(termed connection selection), which is applied to modify network connectivity
patterns. We evaluate the effect of connection selection, in terms of both
solution size and system performance, on both discrete and continuous-valued
environments.

Keywords: feature selection, neural network, self-adaptation.

1 Introduction
Two main theories to explain the emergence of complexity in the brain are construc-
tivism (e.g. [1]), where complexity develops by adding neural structure to a simple
network, and selectionism [2], where an initial amount of over-complexity is gradually
pruned over time through experience. We are interested in the feasibility of combin-
ing both approaches to realize flexible learning within Learning Classifier Systems
(LCS) [3], exploiting their Genetic Algorithm (GA) [4] foundation in particular. In
this paper we present a form of neural LCS [5] based on XCSF [6] which includes the
use of self-adaptive search operators to exploit both constructivism and selectionism
during reinforcement learning.
The focus of this paper centres around the impact of a form of feature selection that
we apply to the neural classifiers, allowing a more granular exploration of the net-
work weight space. Unlike traditional feature selection, which acts only on input
channels, we allow every connection in our networks to be enabled or disabled. We
term this addition connection selection, and evaluate in detail the effects of its in-
clusion in our LCS, in terms of solution size, internal knowledge representation and
stability of evolved solutions in two evaluation environments; the first a discrete maze
and the second a continuous maze.

J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 87106, 2010.
Springer-Verlag Berlin Heidelberg 2010

For clarity's sake, we shall refer to the system without connection selection as
N-XCSF, and the version with connection selection as N-XCSFcs. Applications of this
type of learning system are varied, including (but not limited to) agent navigation,
data mining and function approximation; we are interested in the field of simulated
agent navigation. The rest of this paper is organized as follows: section 2 details
background research, section 3 introduces the evaluation environments used, and
section 4 shows the implementation of neural XCSF. Section 5 describes connection
selection, section 6 provides results of the experiments conducted, and section 7
provides a brief discussion and suggests further avenues of research.

2 Background

2.1 Neural Classifier Systems

The benefits of Artificial Neural Network (ANN) representations mimic those of their
real-life inspiration, including flexibility, robustness to noise and graceful
performance degradation. The type of neural network used in our project is the
Multi-Layer Perceptron (MLP) [7]. There are a number of neural LCS in the literature
that are relevant to this paper. The initial work exploring artificial neural networks
within LCS used traditional feedforward MLPs to represent the rules [5]. Recurrent
MLPs were then shown able to provide memory for a simple maze task [8]. Radial
Basis Function networks [9] were later used for both simulated [10] and real [11]
robotics tasks. Both forms of neural representation have been shown amenable to a
constructionist approach wherein the number of nodes within the hidden layer is un-
der evolutionary control, along with the network connection weights [5][11]. Here a
mutation operator either adds or removes nodes from the hidden-layer. MLPs have
also been used in LCS to calculate the predicted payoff [12][13][14], to compute only
the action [15], and to predict the next sensory state [16].

2.2 Neural Constructivism

Heuristic approaches to neural constructivism include FAST [17]. Here, a learning


agent is made to navigate a discrete maze environment using Q-learning [18]. The
system begins with a single network, and more are added if the oscillation in Q-value
between two states is greater than a given threshold (e.g. there exist two states
specifying different payoffs/actions, with only one network to cover both states). More
networks are added until the solution space is fully covered by a number of neural
networks, which allows the system to select optimal actions for each location within
the environment.
With regards to the use of constructivism in LCS, the first implementation is
described in [5], where Wilson's Zeroth-level Classifier System (ZCS) [19] is used as a
basis, the resulting system (NCS) being evaluated on the Woods1 environment. The author
implements a constructivist approach to topology evolution using fully-connected,
MLPs to represent a classifier condition. Each classifier begins with one hidden layer
node. A constructivism event may be triggered during a GA cycle, which adds a
single fully-connected hidden-layer neuron to the classifier condition or removes
one from it. The
author then proceeds to define the use of NCS in continuous-valued environments
using a bounded-range representation, which reduces the number of neurons required
by each MLP. This constructivist LCS was then modified to include parameter
self-adaptation in [11]. The probabilities of constructivism events occurring are
self-adaptive in the same way as the mutation rate μ in [20], where an Evolutionary
Strategy inspired implementation is used to control the amount of genetic mutation
that occurs within each GA niche in a classifier system. This allows classifiers that
match in suboptimal niches to search more broadly within the solution space when
μ is large, and decreases the mutation rate when an optimal solution has been found,
to maintain stability within the niche. In both cases it is reported that networks of
different structure evolve to handle different areas of the problem space, thereby
identifying the underlying structure of the task.
Constructivism leads us to the field of variable length neural representations. Tra-
ditional genetic crossover operators are of questionable utility when applied to the
variable-length genomes that constructivism generates, as all rely on randomly pick-
ing points within the genome to perform crossover on. This can have the effect of
breaking the genome in areas that rely on spatial proximity to provide high-utility. A
number of methods, notably Harvey's Species Adaptive Genetic Algorithm (SAGA)
[21] and Hutt and Warwick's Synapsing Variable-Length Crossover (SVLC) [22],
provide means of crossing variable-length genetic strings, with SVLC reporting
superior performance to SAGA on a variable-length test problem. SVLC also elimi-
nates the main weakness of SAGA; that the initial crossover point on the first genome
is still chosen randomly, with only the second subject to a selection heuristic. It
should be noted that neither N-XCSF nor N-XCSFcs use any version of crossover
during a GA cycle; the reasoning behind this omission being twofold. Firstly, directly
addressing the problem would require increasing the complexity of the system (add-
ing SVLC-like functionality, for example). Secondly, and more importantly, experi-
mental evidence suggests that sufficient solution space exploration can be obtained
via a combination of GA mutation, self-adaptive mutation and neural constructivism,
to produce optimal solutions in both discrete and continuous environments. This view
is reinforced elsewhere in literature, e.g. [23].
Aside from GA-based crossover difficulties, there are also problems related to
creating novel network structures of high utility. For example, the competing conven-
tions problem (e.g. [24]) demonstrates how two networks of different structure but
identical utility may compete with each other for fitness, despite being essentially the
same network. Neuro Evolution of Augmenting Topologies (NEAT) [25] presents a
method for addressing this problem under constructivism. Each gene under the
NEAT scheme encodes a connection, specifying the input neuron, the output neuron,
the connection weight, and a Boolean flag indicating whether the connection is
currently enabled or disabled. Each gene also has a marker that corresponds to that
gene's first appearance in the population; markers are passed down from parents to
children during a GA event, based on the assumption that genes from the same origin
are more likely to encode similar functions. The marker is retained to make it more
likely that homologous genes will be selected during crossover. NEAT has been
applied to evolve robot controllers [26].
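The NEAT-style gene structure described above can be sketched as a small record type. The field names here are illustrative, not NEAT's own identifiers.

```python
from dataclasses import dataclass

# Sketch of a NEAT-style connection gene: a connection between two neurons,
# its weight, an enabled/disabled flag, and an innovation marker recording
# when the gene first appeared in the population (used to align homologous
# genes during crossover).
@dataclass
class ConnectionGene:
    in_node: int
    out_node: int
    weight: float
    enabled: bool
    innovation: int   # historical marker, inherited by offspring

gene = ConnectionGene(in_node=0, out_node=3, weight=0.42,
                      enabled=True, innovation=7)
```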

2.3 Feature Selection

Feature selection is a method of streamlining the data input to a process, where the
input data can be imagined as a vector of inputs, with dimension >1. This can be done
manually (by a human with relevant domain knowledge), although this process can be
error-prone, costly in terms of both time and potentially money, and, of course, re-
quires expert domain knowledge. A popular alternative in the machine learning com-
munity is automatic feature selection.
The use of feature selection brings two major benefits: firstly, that the amount of
data being input to a process can be reduced (increasing computational efficiency),
and secondly that noisy connections (or those otherwise inhibitory to the successful
performance of the system) can be disabled. Useful features within the input vector
are preserved as the performance of the system can be expected to drop if they are
disabled, with the converse being true for disabling noisy/low-fitness connections.
This is especially useful when considering the case of mobile robot control, where
sensors are invariably subject to a certain level of noise that can be automatically
filtered out by the feature selection mechanism. This description of the concept of
feature selection can be seen to display a strong relationship with the MLP (and
indeed any connectionist neural) paradigm, which uses a collection of clearly
discretised input channels to produce an output. It can be demonstrated that
disabling connections within the input layer of an MLP can have a (sometimes
drastic) effect on the output of the network [27].
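To illustrate why disabling individual connections affects network output, here is a minimal sketch (our own illustration, not taken from [27]) of an MLP forward pass with a binary enable/disable mask over every connection, mirroring the connection-selection idea rather than input-only feature selection.

```python
import numpy as np

# Sketch: an MLP forward pass in which a binary mask enables or disables
# every individual connection (not just input features). A disabled
# connection contributes nothing, so masking a useful connection changes
# the network's output.
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, w_hidden, w_out, mask_hidden, mask_out):
    h = sigmoid((w_hidden * mask_hidden) @ x)   # element-wise connection mask
    return sigmoid((w_out * mask_out) @ h)
```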
Related work on the subject of feature selection in neural networks can be found in
[28] and [29], who explore the use of feature selection in a variety of neural networks.
Also especially pertinent is the implementation of feature selection within the NEAT
framework (FS-NEAT) [30], whose authors apply their system to a double pole
balancing task with 256 inputs. FS-NEAT performs feature selection by giving each input feature a
small chance (1/I, where I is the dimension of the input vector) to be connected to
every output node. An unaltered NEAT mutation sequence then allows these connec-
tions to connect to nodes in the hidden layers of the networks, as well as providing the
ability to add further input nodes to the networks, again with a small probability of
input addition.
The authors make the point that NEAT, following a constructivist methodology,
tends to evolve small networks without superfluous connections. They observe both
quicker convergence to optimality and only around 32% of the available input nodes
connected in the best-performing network, a reduction from 256 inputs to an average
useful subset of 83.6 enabled input nodes. Also highly relevant is the derivative
FD-NEAT (Feature Deselection NEAT) [31], where all connections are enabled by
default, and pruning rather than growing of connections takes place (it should be
noted that FS-NEAT and neural constructivism [1] are similar, as are FD-NEAT and
Edelman's theory of neural Darwinism [2]). Consistent
between all four papers mentioned above is that they perform input feature selection
only (in other words, only input connections are viable candidates for enabling/
disabling).
A comparative study into neuroevolution for both classification and regression
tasks (supervised) can be found in [32], where the authors compare purely heuristic
approaches with an ensemble of evolutionary neural networks (ENNs), whose MLPs

are designed through evolutionary computing. In the former case, randomly-weighted


fully-connected networks with hidden layer size N (determined experimentally) are
used to solve the tasks. In the latter, each network begins with a bounded-random
number of hidden layer nodes. A feature-selection derivative similar to our approach
is then implemented, whereby each network connection is probabilistically enabled.
Structural mutation is then applied so that, with each GA application, a random
number of either nodes or connections is added or deleted. Also similar to our
implementation, the authors disable crossover, citing [17], due to its negligible impact on the final
solution performance. They then expand this work to evolve topologies and weights
simultaneously, as evolving one without the other was revealed to be disruptive to the
learning process. In their implementation, the non-adaptive rates of weight mutation
and topological mutation are controlled by individual variables, each with a 50%
chance of altering the network.
Finally, it should be noted that this work builds on a previous publication [33],
which introduces the design of the N-XCSF (and N-XCS [ibid.], which does not in-
clude function approximation). The research highlights the benefits of N-XCSF,
mainly in terms of generalization capability and population size reduction. It is shown
that the use of MLPs allows the same classifier to match in multiple locations within
the same environmental payoff level, indicating differing actions thanks to action
computation. It is also shown that the inclusion of function approximation allows the
same classifier to match accurately in many payoff levels; combined, these two
features allow the system to perform optimally with a degree of generalization
(i.e. fewer total networks required in [P]).

3 Environments
Discrete maze experiments are conducted on a real-valued version of the Maze4
environment [34] (Figure 1). In the diagram, 'O' represents an obstacle that the agent
cannot traverse, 'G' is the goal state, which the agent must reach to receive reward,
and '*' is a free space that the agent can occupy. The environmental discount rate
γ = 0.71. The environmental representation was altered to loosely approximate a real
robot's sensor readings: the binary string normally used to represent a given input
state st is replaced with a real-valued counterpart in the same way as [5]. That is, each
exclusive object type the agent could encounter is represented by a random real
number within a specified range ([0.0, 0.1] for free space, [0.4, 0.5] for an obstacle and
[0.9, 1.0] for the goal state). In the discrete environment, the input state st consists
of the cell contents of the 8 cells directly surrounding the agent's current position, and
the boundedly-random numeric representation attempts to emulate the sensory noise
that real robots encounter. Performance is gauged by a step-to-goal count, the
number of discrete movements required to reach the goal state from a random starting
position in the maze; in Maze4 this figure is 3.5 on average. Upon reaching the goal
state, the agent receives a reward of 1000. Action calculation is covered in section 4.
The test environment for the continuous experiments is the 2-D continuous grid world,
Grid(0.05) (Figure 2) [35]. This is a two-dimensional environment where the agent's
current state, st, consists of the x and y components of the agent's current location
within the environment; to emulate sensory noise both the x and y location of the

agent are subject to random noise of +/- [0%-5%] of the agent's true position. Both x
and y are bounded in the range [0,1]; any movement outside of this range takes the
agent to the nearest grid boundary. The environmental discount rate γ = 0.95. The
agent moves a predetermined step size (in this case 0.05) within this environment. The
only goal state is in the top-right corner of the grid, where (x + y > 1.90). The agent
can start anywhere except the goal state, and must reach a goal state in the fewest
possible movements, whereupon it receives a reward of 1000. Again, action calculation
is covered in section 4.

O O O O O O O O
O * * O * * G O
O O * * O * * O
O O * O * * O O
O * * * * * * O
O O * O * * * O
O * * * * O * O
O O O O O O O O

Fig. 1. The discrete Maze4 environment

Fig. 2. The continuous Grid(0.05) environment
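The boundedly-random sensor encoding described above (free space in [0.0, 0.1], obstacle in [0.4, 0.5], goal in [0.9, 1.0]) can be sketched as follows; the function and dictionary names are illustrative.

```python
import random

# Sketch of the noisy state encoding: each of the 8 cells surrounding the
# agent is encoded as a random real drawn from a range that depends on
# the cell's contents, emulating sensory noise.
RANGES = {'*': (0.0, 0.1),   # free space
          'O': (0.4, 0.5),   # obstacle
          'G': (0.9, 1.0)}   # goal state

def encode_state(surrounding_cells):
    return [random.uniform(*RANGES[c]) for c in surrounding_cells]

s = encode_state(['O', '*', '*', 'O', 'G', '*', 'O', '*'])
```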

4 Neural XCSF (N-XCSF)


XCSF [6] is a form of classifier system in which a classifier's prediction (that is, the
reward a classifier expects to gain from executing its action based on the current input
state) is computed. Like other classifier systems, XCSF evolves a population of
classifiers, [P], to cover a problem space. Each classifier consists of a condition and
an action, as well as a number of other parameters. In our case, a fully-connected
Multi-Layer Perceptron neural network [7] is used in place of the traditional ternary
condition, and is used to calculate the action. Prediction computation is unchanged,
computed linearly using a separate series of weights.
Each classifier is represented by a vector that details the connection weights of an
MLP. Each connection weight is initialized uniformly randomly in the range [-1, 1].
In the discrete case, there are 8 input neurons, representing the contents of the cells in
the 8 compass directions surrounding the agent's current location. For the continuous
environment, each network comprises 2 input neurons (representing the noisy x and y
location of the agent). Both network types also contain a number of hidden-layer
neurons under evolutionary control (see Section 4.2), and 3 output neurons. Each
node (hidden and output) in the neural network has a sigmoidal activation function to
constrain the range of output values. The first two output neurons represent the
strength of action passed to the left and right motors of the robot respectively, and the
third output neuron is a 'don't-match' neuron, which excludes the classifier from the

match set if it has activation greater than 0.5. This is necessary as the action of the
classifier must be re-calculated for each state the classifier encounters, so each
classifier sees each input. The outputs at the other two neurons (real numbers) are
mapped to a single discrete movement, which varies between discrete and continuous
environments. In the discrete case, the outputs at the other two neurons are mapped to
a movement in one of eight compass directions (N, NE, E, etc.). This takes place in a
way similar to [5], where three ranges of discrete output are possible for each node:
0.0<x<0.4 (low), 0.4<x<0.6 (medium), and 0.6<x<1.00 (high). The unequal partition-
ing is used to counteract the insensitivity of the sigmoid function to values within the
extreme reaches of its range. A discrete movement is mapped from these continuous
action inputs (high, high) = north, (high, med) = northeast, (high, low) = east, and so
on. It should be noted that the final two motor pairings (low, medium) and (low,
low) both produce a move to the northwest.
In the continuous environment, movement is constrained to one of four compass
directions (North, east, south, west). This takes place similarly to discrete environ-
ments, except here there are four possible directions and only two ranges of discrete
output are possible: 0.0<x<0.5 (low), and 0.5<x<1.00 (high). The combined actions of
each motor translate to a discrete movement according to the two motor output
strengths (high, high) = north, (high, low) = east, (low, high) = south, and (low,
low) = west.
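The discrete-maze mapping from the two motor outputs to a compass move can be sketched as below. The text specifies only the output bands, the first three pairings and the two northwest cases; the intermediate assignments here are illustrative assumptions.

```python
# Sketch of the discrete-maze action mapping: each motor output in [0,1]
# is banded into low/medium/high, and the (left, right) pair selects one
# of eight compass moves. (low, med) and (low, low) both map to NW, as
# noted in the text; entries marked "assumed" are illustrative only.
def band(x):
    if x < 0.4:
        return 'low'
    if x < 0.6:
        return 'med'
    return 'high'

MOVES = {('high', 'high'): 'N', ('high', 'med'): 'NE', ('high', 'low'): 'E',
         ('med', 'high'): 'SE',   # assumed
         ('med', 'med'): 'S',     # assumed
         ('med', 'low'): 'SW',    # assumed
         ('low', 'high'): 'W',    # assumed
         ('low', 'med'): 'NW', ('low', 'low'): 'NW'}

def action(left, right):
    return MOVES[(band(left), band(right))]
```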
At each time-step, XCSF builds a match set, [M], from [P] consisting of all clas-
sifiers whose conditions match the current input state st. In neural XCSF, every ac-
tion must be present in each [M]. If this is not the case, covering is used to generate
classifiers that advocate the missing action(s); covering repeatedly generates random
networks until the network action matches the desired output for a given input state.
Once [M] is formed, a prediction array is created. In XCSF, each classifier prediction
(cl.p) is calculated as a product of the environmental input (or state, st) and the predic-
tion weight vector (w) associated with each classifier. This vector has one element for
each input (8 in the discrete case, 2 in the continuous case), plus an additional element
w0 which corresponds to x0, a constant input that is set as a parameter of XCSF. A
classifier's prediction is calculated as shown in equation 1:

cl.p(st) = w0 · x0 + Σi wi · st(i)    (1)
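Equation 1 amounts to a dot product between the prediction weight vector and the state augmented by the constant input x0; a minimal sketch:

```python
# Sketch of XCSF's computed prediction: the dot product of the
# classifier's prediction weights with the input state augmented by the
# constant input x0.
def computed_prediction(w, state, x0=1.0):
    x = [x0] + list(state)   # augment the state with x0
    return sum(wi * xi for wi, xi in zip(w, x))

p = computed_prediction([0.5, 2.0, -1.0], [0.3, 0.7])
```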

The prediction array is the fitness weighted-average of the calculated predictions for
each possible action. An action selection policy is used to decide which action should
be taken (in [6], a random action selection policy is used on explore trials, and deter-
ministic on exploit trials). All classifiers that advocate the selected action form the
action set [A]. The action is taken and, if the goal state is reached, a reward is returned
from the environment that is used to update the parameters of the classifiers in [A]. A
discounted reward is propagated to the previous action set [A-1] if it exists. The predic-
tion weight vector of each classifier in the action set is updated using a version of the
delta rule, rather than updating the classifiers prediction value (equation 2). Each
prediction weight is then updated (equation 3) and prediction error is calculated (equa-
tion 4). Here, the vector x is the state st augmented by the parameter x0.

Δw = (η / |x|²) (P − cl.p(st))    (2)

wi ← wi + Δw · xi    (3)

ε ← ε + β (|P − cl.p(st)| − ε)    (4)
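A minimal sketch of the update cycle of equations 2-4 follows, assuming a correction rate eta for the weights and a learning rate beta for the error (the symbol names are assumptions, as the original symbols are not legible in this copy).

```python
# Sketch of the prediction-weight and error updates: a normalized
# delta-rule correction on the weights, then an error-estimate update.
def update_classifier(w, eps, state, P, x0=1.0, eta=0.2, beta=0.2):
    x = [x0] + list(state)
    pred = sum(wi * xi for wi, xi in zip(w, x))
    norm2 = sum(xi * xi for xi in x)
    delta = (eta / norm2) * (P - pred)               # correction term
    w = [wi + delta * xi for wi, xi in zip(w, x)]    # weight update
    eps = eps + beta * (abs(P - pred) - eps)         # prediction error update
    return w, eps
```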

Further details of the update procedure used in XCSF can be found in [6]. The GA
may then fire if the average time since the last GA application to the classifiers in [A]
exceeds a threshold θGA. Our GA is modified to be a two-stage process. Stage 1
(section 4.1) controls the rates of mutation and constructivism/connection selection
that occur within the system, with stage 2 (section 4.2 and section 5) controlling the
evolution of neural architecture in terms of both neurons and connections. Deletion
occurs as in [6]. This cycle of [P] → [M] → [A] → reward is called a trial. Each
experiment consists of 50,000 trials (20,000 in the continuous case). Each trial is
either in exploration mode (roulette wheel action selection) or exploitation mode
(deterministic action selection). We employ roulette wheel action selection on
exploration trials to discourage potentially time-wasting agent movements, especially
as the agent's payoff landscape becomes more accurate.

4.1 Self-adaptation

Traditionally in XCSF, two offspring classifiers are generated by reproducing, cross-
ing, and mutating the parents. The offspring are inserted into the population, and two
classifiers are deleted. As in all other models of classifier systems, parents
stay in the population, competing with their offspring. A GA is periodically triggered
in [A] to evolve fitter classifiers within an environmental niche.
It is potentially beneficial if a learning system is able to exert some form of control
over its own learning interactions with the environment. To this end, we include a
number of self-adaptive mechanisms which grant the learner the flexibility to tailor its
internal knowledge representation in a problem-dependent manner, at a rate controlled
by the learner. We apply self-adaptation as in [20] to dynamically control the
amount of genetic search (the frequency of network weight mutation events) taking
place within the niche. This provides stability to parts of the problem space that are
already solved, as the mutation rate for a niche is typically directly proportional to
its distance from the goal state during learning; generalization learning, along with the
value function learning, occurs faster nearer the goal state. Self-adaptive mutation is
applied here, whereby the μ value (rate of mutation per allele) of each classifier is
initialized uniformly randomly in the range [0,1]. During a GA cycle, a parent's μ
value is modified: μ ← μ · e^N(0,1). The offspring then applies its own μ to itself (for
each allele) before being inserted into the population.
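A sketch of this self-adaptation step (variable names, the clamp of μ to [0,1], and the ±0.1 weight-perturbation range are our own illustrative choices):

```python
import math
import random

def self_adapt_mu(parent_mu):
    """Offspring rate: mu <- mu * e^N(0,1); the clamp to [0,1] is our assumption."""
    return min(1.0, parent_mu * math.exp(random.gauss(0.0, 1.0)))

def mutate_weights(weights, mu):
    """The offspring applies its own mu per allele; the +-0.1 step is illustrative."""
    return [w + random.uniform(-0.1, 0.1) if random.random() < mu else w
            for w in weights]

random.seed(0)
mu = self_adapt_mu(0.5)                      # inherited, then perturbed
child_weights = mutate_weights([0.0] * 10, mu)
```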

4.2 Neural Constructivism

Implementation of NC in this system is based on the work of Bull [5]. Each rule has a
varying number of hidden layer neurons (initially 1, and always > 0), with additional
neurons being added to or removed from the hidden layer depending on the constructiv-
ism element of the system. Constructivism takes place during a GA cycle, after muta-
tion. Two new self-adaptive parameters, ψ and ω, are added. Here, ψ represents the
Use of a Connection-Selection Scheme in Neural XCSF 95

probability of performing a constructivism event and ω is the probability of adding a
neuron, with removal occurring with probability 1 − ω. As with self-adaptive mutation,
both are initially randomly generated uniformly in the range [0,1], and offspring clas-
sifiers have their parents' ψ and ω values modified during reproduction, as with μ.
We feel it is important to draw a number of comparisons between NEAT [25] and our
constructivism mechanism, mainly with respect to GA action, which in our case is
confined to the area of search space covered by the current action set [A] when a GA
event is triggered. This encourages the GA to select similar classifiers within a niche,
much as NEAT selects structures from its own niches. In N-XCSF, as successful
classifiers evolve within a niche, offspring tend to share the same heredity (a nexus of
high-fitness parents that, due to roulette selection, will be more likely to be repeatedly
selected for reproduction within that niche). This phenomenon shares certain traits
with the NEAT genetic marker mechanism, whereby genes that possess a common
ancestry are more likely to be mated together, although our approach does not require
niches to be predefined.
The effect of the mechanisms described in sections 4.1 and 4.2 is to tailor the evo-
lution of the classifier to the complexity of the environment, either by altering the
amount of mutation that takes place in a given niche at a given time [20], or by adapt-
ing the hidden layer topology of the neural networks to reflect the complexity of the
problem space considered by the network [5,11]. During a GA cycle, the operators are
applied in the following order: (1) self-adaptation; (2) mutate MLP weights; (3) enable/disable
connections; (4) add/remove nodes.
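The GA-cycle ordering and the constructivism event can be outlined as follows (a schematic with our own names; psi and omega transliterate ψ and ω, the dictionary stands in for a classifier, and the clamp to [0,1] is our assumption):

```python
import random

def constructivism(n_hidden, psi, omega):
    """With probability psi a constructivism event occurs: a hidden node is
    added with probability omega, or removed with probability 1 - omega.
    The hidden layer always keeps at least one node."""
    if random.random() < psi:
        if random.random() < omega:
            n_hidden += 1
        elif n_hidden > 1:
            n_hidden -= 1
    return n_hidden

def ga_cycle(child):
    """Operator order: (1) self-adapt the rates, (2) mutate weights,
    (3) enable/disable connections, (4) add/remove nodes.
    Steps 2 and 3 are elided here."""
    for key in ("mu", "psi", "omega"):
        # rate <- rate * e^N(0,1), i.e. a log-normal perturbation
        child[key] = min(1.0, child[key] * random.lognormvariate(0.0, 1.0))
    # ... (2) mutate MLP weights with rate mu, (3) flip connection flags ...
    child["n_hidden"] = constructivism(child["n_hidden"], child["psi"], child["omega"])
    return child

random.seed(1)
child = ga_cycle({"mu": 0.5, "psi": 0.5, "omega": 0.5, "n_hidden": 1})
```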

5 Connection Selection (N-XCSFcs)


Feature selection is a way of streamlining the inputs to a given process. Automatic fea-
ture selection includes both wrapper approaches (where feature subsets can change
during the running of the algorithm) and filter approaches (where the subset selection
is a pre-processing step). It should be noted that traditional feature selection is ap-
plied only to input channels (in a neural network, this corresponds to connections
between the input and hidden layers). In this work we allow any connection within the
network to be disabled, allowing potentially more parsimonious results whilst retain-
ing the capability to filter out unnecessary or noisy channels. We term this network-
wide feature selection "connection selection". Additional information can be found in
[36], where many neuro-evolution methodologies are compared and contrasted by the
authors, including connection selection. They report favourable performance using a
connection selection scheme similar to our own, although in a single neural network
as opposed to the collection of networks used in this paper.
The purpose of the experimentation carried out in sections 6 and 7 is to assess the
impact of a network-wide feature selection scheme on our neural XCSF system, both
in terms of performance and computational efficiency. Connection selection is im-
plemented in our system as follows: each connection in a classifier's condition has a
Boolean flag attached to it. During a GA cycle, and based on a new self-adaptive
parameter τ (which is initialized and self-adapted in the same manner as the other
parameters), the Boolean flag can be flipped. If the flag is false, the connection is
disabled (set to 0.0, and not a viable target for connection weight mutation). If the flag
was false but is then flipped to true, the connection weight is randomly initialised
uniformly in the range [-1, 1]. All flags are initially set to true for newly initialised
classifiers and classifiers created via covering. During a node addition event, the flags
representing the new node's connections are set probabilistically, with P(connection
enabled) = 0.5. Sharing something of a middle ground between FS-NEAT [30] and
FD-NEAT [31], we grow whole neurons (using constructivism: high-granularity
feature selection), but tend to prune connections from those neurons (using our net-
work-wide feature selection implementation: low-granularity feature deselection). The
exception to this is node addition, which produces neurons that are on average 50%
connected.
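The flag mechanics described above can be sketched as follows (tau stands in for the new self-adaptive parameter; function names are ours):

```python
import random

def apply_connection_selection(weights, flags, tau):
    """During a GA cycle each Boolean flag may flip with probability tau.
    A newly enabled connection gets a fresh uniform weight in [-1, 1];
    a disabled one is zeroed and excluded from weight mutation."""
    for i in range(len(flags)):
        if random.random() < tau:
            flags[i] = not flags[i]
            weights[i] = random.uniform(-1.0, 1.0) if flags[i] else 0.0
    return weights, flags

def new_node_connections(n_connections):
    """Node addition: each of the new node's connections is enabled
    with probability 0.5 (P(connection enabled) = 0.5)."""
    flags = [random.random() < 0.5 for _ in range(n_connections)]
    weights = [random.uniform(-1.0, 1.0) if f else 0.0 for f in flags]
    return weights, flags

random.seed(2)
weights, flags = new_node_connections(11)   # 11 connections per node, discrete case
weights, flags = apply_connection_selection(weights, flags, tau=0.1)
```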

6 Experimentation
Following this brief introduction is a comparison of neural XCSF with (N-XCSFcs)
and without (N-XCSF) connection selection, in both discrete (Maze4) and continuous
(Grid(0.05)) environments. In Figure 3(a), as well as in all other steps-to-goal graphs
(4(a), 5(a), 6(a)), the red dashed line represents optimal performance.

6.1 Discrete Environment

Discrete-environment experiments are parameterized as follows: N=7000, β=0.2,
ε0=0.01, ν=5, θGA=25, θDEL=50. Additionally, the x0 parameter is set to 1.0 and the
correction rate η to 0.2. Each experiment is repeated ten times, with the results being
the mean of these 10 runs. Every 50 trials, the current state of the system is
analyzed. All averages are means over the population as a whole.

6.1.1 System Performance


In the Maze4 environment, it can be seen that the system without connection selection
initially descends steeply to the optimal steps-to-goal value (Figure 3(a)), but then
takes some time to transition from near-optimal to optimal performance. Connection
selection (Figure 4(a)) shows a less uniform curvature of descent, and can be seen to
reach optimality slightly more quickly. Connection selection has the advantage
of being able to disable connections as a binary event, and hence can more dras-
tically alter network performance (effectively allowing it to perform connection
weight changes that are out of range of the standard mutation operator), whilst also
providing an extra degree of freedom in which to alter network behaviour. The
non-uniform curvature of the plot could hint at the potential disruptiveness of
connection selection, although since the granularity of the changes (and hence the magni-
tude of potential disruption) is minimal (i.e. a connection is the smallest possible
component to alter in an MLP), and small network alterations usually result in small
performance alterations, this does not prevent the connection selection-enabled sys-
tem from performing optimally.
After each explore-exploit cycle, an additional knowledge test trial is held, with the
agent always starting in the closest available location to the top left-hand corner of the

Fig. 3. (a) Steps to goal (b) average number of nodes per classifier in the population (c) self-
adaptive parameter values in N-XCSF (with no connection selection) in Maze4

maze, and the steps-to-goal count recorded. Under the standard maze scenario used
above, and in the LCS literature, it is not possible to perform standard statistical tests
for significant differences in performance, as performance is plotted as a 50-point
moving average due to the random start location. Using an extra exploit trial from a
fixed position eases statistical comparison and allows us to define stability. Stability is
defined as follows: a solution can be said to be stable if, for each of 50 consecutive
knowledge test trials interspersed between standard explore and exploit trials from
the constant location in the maze, the solution always finds the optimal path to the
goal. The first trial at which each run of a system reaches stability is recorded, and
this set of 10 numbers is compared to the sets produced by the other variants of the
system using a standard T-test. We also record various other indicators, namely the
average self-adaptive mutation rate, μ, and the average number of connected hidden
layer nodes.
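The stability criterion can be made concrete as follows (a small helper of our own; `optimal` records whether each knowledge test trial from the fixed location found the optimal path):

```python
def first_stable_trial(optimal, window=50):
    """Index of the first knowledge test trial that starts a run of `window`
    consecutive optimal trials; None if the run never becomes stable."""
    streak = 0
    for i, ok in enumerate(optimal):
        streak = streak + 1 if ok else 0
        if streak == window:
            return i - window + 1
    return None

# A run that is optimal from trial 3 onwards becomes stable at trial 3:
assert first_stable_trial([False, True, False] + [True] * 60) == 3
```

One such index is collected per run, giving the set of 10 numbers compared by the T-test.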

Fig. 4. (a) Steps to goal (b) average number of nodes per classifier in the population (c) self-
adaptive parameter values (d) average enabled connections per classifier (%) in N-XCSFcs
(with connection selection) in Maze4

Table 1 shows that the stabilities of the two systems are not overly affected by the
addition of connection selection. Most indicative of this fact is the P value of 0.644. It
does, however, show that connection selection has the potential to be performance-
enhancing (comparing average steps to stability).

Table 1. Detailing the average time to stability and T-Test results when comparing N-XCSF
and N-XCSFcs in Maze4

Average P value
Connection selection 8838.40 0.64
No connection selection 10358.8

6.1.2 Effect on Self-adaptive Mutation Rates


Comparison between Figures 3(c) and 4(c) shows that overall the plots follow the same
pattern (i.e. μ is the highest value, then ψ, and finally ω); additionally, all three plots
share similar curvature. The most obvious difference is that all three self-adaptive
values are higher in the connection selection version. A possible explanation for
these results is that, since disabled connections cannot be mutated weight-wise, more
search is required by the enabled connections to effectively search all of the connec-
tion weight space (e.g. higher mutation rates make up for the lack of mutation variety
due to there being fewer connections to mutate). Interestingly enough, where 60% of
connections are enabled, the difference in final μ values is approximately 60% (μ ≈
0.06 for the standard system vs. μ ≈ 0.1 for the connection-selection version; Figures
3(c) and 4(c)/4(d)). Table 2 highlights statistically significant differences in the final
self-adaptive mutation rate parameters, whereas Table 3 shows variances in the aver-
age number of connected hidden layer nodes per classifier that are also statistically
significant. This indicates that, although performance is not statistically different
when connection selection is added, evidence suggests that the internal problem re-
presentation (that is, the way the system solves the problem) is altered significantly.

Table 2. Detailing the average self-adaptive mutation rate and T-Test results when comparing
N-XCSF and N-XCSFcs in Maze4

Average P value
Connection selection 0.11 2.99E-09
No connection selection 0.058

Table 3. Detailing the average number of hidden layer nodes and T-Test results when compar-
ing N-XCSF and N-XCSFcs in Maze4

Average P value
Connection selection 2.81 9.16E-08
No connection selection 1.50

6.1.3 Effect on Computational Efficiency


We explore the effect of connection selection on computational efficiency in three
ways: the size of the population, the size of the action sets produced, and the number of
enabled connections within the population. When connection selection is enabled, the
average final population is 3041, whereas with no connection selection this value is
4652.5. This translates into a saving during match set generation, where the entire popu-
lation must be processed to derive their actions from the current input. Interestingly, the

reverse is true for the action set size estimate: 225.7 is the average when connection
selection is enabled, 143.8 without. So even though match set generation is quicker
with connection selection, all action set-based operations (overall action determination,
parameter updates, reinforcement, GA activation) can be expected to be computation-
ally less efficient with a connection selection scheme applied. In terms of actual en-
abled connections within the population, we can observe that the average number of
connected nodes in the hidden layers of the classifiers (Figures 3(b) and 4(b)) does not
favour connection selection (1.5 connected nodes vs. 2.7 connected nodes). However,
connection selection has only 60% enabled connections on average (Figure 4(d)). We
can then calculate the number of connections enabled in the entire population as:

average population size * average connected hidden layer nodes * average
enabled connections per connected hidden layer node.

Connection selection: 2.7 * (0.6 * 11) = 17.82 connections per network
3041 * 2.7 * (0.6 * 11) = 54,190.62 connections in population

No connection selection: 1.5 * (1.0 * 11) = 16.5 connections per network
4652 * 1.5 * (1.0 * 11) = 76,758 connections in population
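The arithmetic above can be reproduced directly (a small helper of our own; 11 is the number of connections per hidden layer node in the discrete case):

```python
def population_connections(pop_size, nodes_per_net, frac_enabled, conns_per_node=11):
    """Connections per network and in the whole population, per the formula above."""
    per_network = nodes_per_net * frac_enabled * conns_per_node
    return per_network, pop_size * per_network

per_net_cs, total_cs = population_connections(3041, 2.7, 0.6)   # with selection
per_net_no, total_no = population_connections(4652, 1.5, 1.0)   # without
# per-network: 17.82 vs. 16.5; population totals: 54,190.62 vs. 76,758
```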

Hence, even though there are more connections per network in the connection selection
case (as there are more hidden layer nodes on average in those networks), the smaller
required population means that fewer computations are necessary. For a
neural representation to function, it is postulated that information from all surround-
ing locations would be needed to make an accurate decision with regard to move-
ment in the environment (i.e. keeping a Markov problem structure). Observations of
the final networks agree with this, showing that connections are more frequently cut
between the hidden and output neurons.

6.2 Continuous Environment

The continuous-environment experiments are parameterized as follows: N=20000,
β=0.2, ε0=0.005, ν=5, θGA=50, θDEL=50, x0=1, η=0.2. As the continuous environments
have fewer connections per node, it is suggested that more of the connections will be
required to preserve necessary classifier utility. Note that due to the differing envi-
ronmental representations, and difficulties found in the continuous environment with
calculating certain actions in certain areas of the state space, a single bias node was
added to all networks, providing a constant weighted positive input in the range [0,1]
to each hidden layer node.

6.2.1 System Performance


Comparison of Figures 5(a) and 6(a) reveals that both N-XCSF and N-XCSFcs have
very similar performance in the continuous environment. Connection selection (Fig-
ure 6(a)) shows a less uniform curvature of descent, and can be seen to form optimal
solutions with fewer connected hidden layer nodes (Figures 5(b) and 6(b)). These
figures also reveal some of the disruption that connection selection can add to the
solution, evidenced most obviously in the uneven curvature in Figure 6(b),
which is echoed in Figure 6(c) (the path of the self-adaptive parameter values towards the end of the trial

Table 4. Detailing the average time to stability and T-Test results when comparing N-XCSF
and N-XCSFcs in the continuous Grid (0.05) environment

Average P value
Connection selection 11453.5 0.45
No connection selection 13453.7

and the unsteady path after the initial steep descent). Figure 6(d) shows over 80%
network connectivity in the continuous case; the reason for this reasonably high value
is given in section 6.2. Table 4 shows the results of T-Tests carried out to assess the
impact of connection selection in the continuous environment; it can be seen that
connection selection in this case is beneficial, in terms of possessing a lower average
steps-to-stability. However, a P value of 0.45 shows that any performance differences
are not statistically significant.

Fig. 5. (a) Steps to goal (b) average number of nodes per classifier in the population (c) self-
adaptive parameter values in N-XCSF (with no connection selection) in Grid (0.05)

6.2.2 Effect on Self-adaptive Mutation Rates


Again, it can be said that the self-adaptation mechanisms in both versions of the sys-
tem perform comparably (Figures 5(c) and 6(c)). Connection selection (N-XCSFcs)
introduces a slight instability that is not apparent in N-XCSF. This observation was
also made about the systems when compared in the Maze4 environment (Figures 3(c)
and 4(c)). However, unlike the discrete environment results, the actual self-adaptive
parameter values are very similar between N-XCSF and N-XCSFcs, being on average
only slightly higher in the connection selection version.

Fig. 6. (a) Steps to goal (b) average number of nodes per classifier in the population (c) self-
adaptive parameter values (d) average enabled connections per classifier (%) in N-XCSFcs
(with connection selection) in Grid (0.05)

The final value of one self-adaptive parameter shows the most pronounced of these dif-
ferences, differing between the continuous and discrete environments by a factor of ten
(0.3 and 0.03, respectively). Table 5 shows the impact on the self-adaptive mutation rate
when connection selection is added in the continuous environment. Both the connection
selection and non-connection selection versions share similar average μ values, although
the P value (0.02) reveals that the difference is nonetheless statistically significant. Table 6 indi-
cates that, in contrast to the discrete case (Table 3), the average numbers of hidden layer
nodes evolved by the connection selection and non-connection selection versions in the
continuous case are not statistically significantly different. These results indicate that
the impact of connection selection is smaller in the continuous environment than in the
discrete environment. A possible explanation is that, since there are fewer connections
per node in the continuous environment (5 as opposed to 11), the ability of a connec-
tion selection scheme to alter the functionality of a network is reduced.

Table 5. Detailing the average self-adaptive mutation rate and T-Test results when comparing
N-XCSF and N-XCSFcs in the continuous Grid (0.05) environment

Average P value
Connection selection 0.45 0.02
No connection selection 0.47

Table 6. Detailing the average number of hidden layer nodes and T-Test results when compar-
ing N-XCSF and N-XCSFcs in the continuous Grid (0.05) environment

Average P value
Connection selection 1.36 0.46
No connection selection 1.29

6.2.3 Effect on Computational Efficiency


Similarly to section 6.1.3, the average number of connections in the entire populations
of the final solutions can be compared, the results being calculated as follows:

Average connections in population = average population size * average connected
hidden layer nodes * average enabled connections per connected hidden layer
node.

Connection selection: 2.1 * (0.82 * 5) = 8.61 connections per network
8128 * 2.1 * (0.82 * 5) = 69,982 connections in population

No connection selection: 2.3 * (1.0 * 5) = 11.5 connections per network
12187 * 2.3 * (1.0 * 5) = 140,150.5 connections in population

These results show that, even with 20% more network connectivity (Figure 4(d) vs.
Figure 6(d)), the reduced population needs of N-XCSFcs in the continuous environment
provide a greater efficiency enhancement: not only does each network contain fewer
connections, the overall number of networks in the final solution is also significantly
reduced.

7 Discussion
This paper has detailed the implementation of an XCSF system for simulated agent
navigation, as well as the addition of various other elements, namely neural classifier
representation, self-adaptive mutation rates, and neural constructivism. The effects of
a network-wide feature selection derivative have been examined, with particular em-
phasis placed on computational efficiency and final solution parsimony. Furthermore,
it has been shown that such a scheme can have a significant impact on both of these
factors in solving both discrete and continuous agent navigation tasks. The research
presented here could be extended in a number of ways, including comparison of other
network types or classifier representations on the same tasks. We also aim to investi-
gate the effects of different methods of performing constructivism (see e.g. [35]).

References
[1] Quartz, S.R., Sejnowski, T.J.: The Neural Basis of Cognitive Development: A Construc-
tionist Manifesto. Behavioural and Brain Sciences 20(4), 537–596 (1997)
[2] Edelman, G.: Neural Darwinism: The Theory of Neuronal Group Selection. Basic
Books, New York (1987)
[3] Holland, J.H.: Adaptation. In: Rosen, R., Snell, F.M. (eds.) Progress in Theoretical Biol-
ogy, vol. 4, pp. 263–293. Academic Press, New York (1976)
[4] Holland, J.H.: Adaptation in Natural and Artificial Systems. University of Michigan
Press, Ann Arbor (1975)
[5] Bull, L.: On Using Constructivism in Neural Classifier Systems. In: Guervós, J.J.M.,
Adamidis, P.A., Beyer, H.-G., Fernández-Villacañas, J.-L., Schwefel, H.-P. (eds.) PPSN
2002. LNCS, vol. 2439, pp. 558–567. Springer, Heidelberg (2002)
[6] Wilson, S.W.: Function Approximation with a Classifier System. In: Spector, L.D., Wu,
G.E.A., Langdon, W.B., Voight, H.M., Gen, M. (eds.) Proceedings of the Genetic and
Evolutionary Computation Conference (GECCO 2001), pp. 974–981. Morgan Kauf-
mann, San Francisco (2001)
[7] Rumelhart, D.E., McClelland, J.L.: Parallel Distributed Processing. MIT Press, Cam-
bridge (1986)
[8] Bull, L., Hurst, J.: A Neural Learning Classifier System with Self-Adaptive Constructiv-
ism. In: IEEE Congress on Evolutionary Computation. IEEE Press, Los Alamitos (2003)
[9] Buhmann, M.D.: Radial Basis Functions: Theory and Implementations. Cambridge Uni-
versity Press, Cambridge (2003)
[10] Bull, L., O'Hara, T.: Accuracy-based Neuro and Neuro-Fuzzy Classifier Systems. In:
Langdon, W.B., Cantu-Paz, E., Mathias, K., Roy, R., Davis, D., Poli, R., Balakrishnan,
K., Hanavar, V., Rudolph, G., Wegener, J., Bull, L., Potter, M.A., Schultz, A.C., Miller,
J.F., Burke, E., Jonoska, N. (eds.) GECCO 2002: Proceedings of the Genetic and Evolu-
tionary Computation Conference, pp. 905–911. Morgan Kaufmann, San Francisco
(2002)
[11] Hurst, J., Bull, L.: A Neural Learning Classifier System with Self-Adaptive Constructiv-
ism for Mobile Robot Control. Artificial Life 12(3), 353–380 (2006)
[12] Giani, A., Baiardi, F., Starita, A.: PANIC: A Parallel Evolutionary Rule Based System.
In: Proceedings of the Fourth Annual Conference on Evolutionary Programming, EP
1995 (1995)

[13] O'Hara, T., Bull, L.: Prediction Calculation in Accuracy-based Neural Learning Classifier
Systems. Tech report UWELCSG04-004 (2004)
[14] Lanzi, P.L., Loiacono, D.: XCSF with Neural Prediction. In: IEEE Congress on Evolu-
tionary Computation, CEC 2006, pp. 2270–2276 (2006)
[15] Dam, H.H., Abbass, H.A., Lokan, C., Yao, X.: Neural-Based Learning Classifier Sys-
tems. IEEE Trans. on Knowl. and Data Eng. 20(1), 26–39 (2008)
[16] O'Hara, T., Bull, L.: Building Anticipations in an Accuracy-based Learning Classifier
System by use of an Artificial Neural Network. In: Proceedings of the IEEE Congress on
Evolutionary Computation, pp. 2046–2052. IEEE Press, Los Alamitos (2005)
[17] Pérez-Uribe, A., Sanchez, E.: FPGA Implementation of an Adaptable-Size Neural Net-
work. In: Vorbrüggen, J.C., von Seelen, W., Sendhoff, B. (eds.) ICANN 1996. LNCS,
vol. 1112, pp. 383–388. Springer, Heidelberg (1996)
[18] Watkins, C.J.C.H.: Learning with Delayed Rewards. PhD thesis, Psychology Depart-
ment, University of Cambridge, England (1989)
[19] Wilson, S.W.: ZCS: A Zeroth-level Classifier System. Evolutionary Computation 2(1),
1–18 (1994)
[20] Bull, L., Hurst, J., Tomlinson, A.: Self-Adaptive Mutation in Classifier System Control-
lers. In: Meyer, J.-A., Berthoz, A., Floreano, D., Roitblatt, H., Wilson, S.W. (eds.) From
Animals to Animats 6: The Sixth International Conference on the Simulation of Adap-
tive Behaviour. MIT Press, Cambridge (2000)
[21] Harvey, I., Husbands, P., Cliff, D.: Seeing the Light: Artificial Evolution, Real Vision.
In: Cliff, D., Husbands, P., Meyer, J.-A., Wilson, S.W. (eds.) From Animals to Animats
3: Proceedings of the Third International Conference on Simulation of Adaptive Beha-
viour, pp. 392–401. MIT Press, Cambridge (1994)
[22] Hutt, B., Warwick, K.: Synapsing Variable-Length Crossover: Meaningful Crossover for
Variable-Length Genomes. IEEE Transactions on Evolutionary Computation 11(1),
118–131 (2007)
[23] Rocha, M., Cortez, P., Neves, J.: Evolutionary Neural Network Learning. In: Pires, F.M.,
Abreu, S.P. (eds.) EPIA 2003. LNCS (LNAI), vol. 2902, pp. 24–28. Springer, Heidel-
berg (2003)
[24] Schaffer, J.D., Whitley, D., Eshelman, L.J.: Combinations of genetic algorithms and
neural networks: A survey of the state of the art. In: Whitley, D., Schaffer, J. (eds.) Pro-
ceedings of the International Workshop on Combinations of Genetic Algorithms and
Neural Networks (COGANN 1992), pp. 1–37. IEEE Press, Piscataway (1992)
[25] Stanley, K.O., Miikkulainen, R.: Evolving Neural Networks Through Augmenting To-
pologies. Evolutionary Computation 10(2), 99–127 (2002)
[26] Stanley, K.O., Miikkulainen, R.: Competitive Coevolution through Evolutionary Com-
plexification. Journal of Artificial Intelligence Research 21, 63–100 (2004)
[27] Basheer, A., Hajmeer, M.: Artificial neural networks: fundamentals, computing, design,
and application. Journal of Microbiological Methods 43(1) (2000)
[28] Belue, L.M., Bauer Jr., K.W.: Determining input features for multilayer perceptrons.
Neurocomputing 7, 111–121 (1995)
[29] Basak, J., Mitra, S.: Feature selection using radial basis function networks. Neural Com-
put. Appl. 8, 297–302 (1999)
[30] Whiteson, S., Stone, P., Stanley, K.O., Miikkulainen, R., Kohl, N.: Automatic feature se-
lection in neuroevolution. In: Proceedings of the 2005 Conference on Genetic and Evolu-
tionary Computation, Washington DC, USA, June 25-29 (2005)
[31] Tan, M., Hartley, M., Bister, M., Deklerck, R.: Automated feature selection in neuroevo-
lution. Evolutionary Intelligence 1(4), 271–292 (2009)

[32] Rocha, M., Cortez, P., Neves, J.: Evolution of neural networks for classification and re-
gression. Neurocomput. 70(16-18), 2809–2816 (2007)
[33] Howard, D., Bull, L., Lanzi, P.-L.: Self-Adaptive Constructivism in Neural XCS and
XCSF. In: Keijzer, M., et al. (eds.) GECCO 2008: Proceedings of the Genetic and Evo-
lutionary Computation Conference. ACM Press, New York (2008)
[34] Lanzi, P.L.: An Analysis of Generalization in the XCS Classifier System. Evolutionary
Computation 7(2), 125–149 (1999)
[35] Boyan, J.A., Moore, A.W.: Generalization in reinforcement learning: Safely approximat-
ing the value function. In: Tesauro, G., Touretzky, D.S., Leen, T.K. (eds.) Advances in
Neural Information Processing Systems 7, pp. 369–376. The MIT Press, Cambridge
(1995)
[36] Schlessinger, E., Bentley, P.J., Lotto, R.B.: Analysing the Evolvability of Neural Net-
work Agents through Structural Mutations. In: Capcarrère, M.S., Freitas, A.A., Bentley,
P.J., Johnson, C.G., Timmis, J. (eds.) ECAL 2005. LNCS (LNAI), vol. 3630, pp. 312–
321. Springer, Heidelberg (2005)
Building Accurate Strategies in Non Markovian
Environments without Memory

Enée Gilles and Péroumalnaïk Mathias

Université des Antilles Guyane
LAMIA Laboratory
Campus Fouillole, BP 592
97157 Pointe-à-Pitre Cedex
Guadeloupe
gilles.enee@univ-ag.fr, mperouma@univ-ag.fr

Abstract. This paper focuses on the study of the behavior of a genetic
algorithm based classifier system, the Adapted Pittsburgh Classifier Sys-
tem (A.P.C.S.), on maze-type environments containing aliasing squares.
This type of environment is often used in the reinforcement learning litera-
ture to assess the performance of learning methods when facing prob-
lems containing non-Markovian situations.
Through this study, we discuss the performance of the APCS on
two mazes (Woods 101 and Maze E2) and also the efficiency of an
improvement of the APCS learning method inspired by the XCS: the
covering mechanism. We manage to show that, without any memory
mechanism, the APCS is able to build and to keep accurate strategies
that produce regular sub-optimal solutions to these maze problems. This
statement is shown through a comparison between the results obtained
by the XCS on two specific maze problems and those obtained by the
APCS.

1 Introduction

Classifier systems based on genetic algorithms are rule-based systems whose di-
agnosis ability is known to be used on parameter optimisation problems [10].
Nevertheless, this kind of classifier system needs to perform a learning step before
being used in a production and/or a diagnostic context. Most often, this learning
stage is performed on a sample of data representing the available and validated/
expertised data of the considered environment. Tendencies contained in this
data set are assimilated by the classifier system using reinforcement learning.
For this purpose, the system is continuously exposed to signals created using the
learning sample. At this point, the action performed by the classifier in reaction
to the incoming signal is rewarded thanks to the fitness function. This function
is defined depending on the learning problem considered: its aim is to maintain
accurate classifiers within the population by preventing them from being deleted or
lost through genetic pressure.

J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 107–126, 2010.
© Springer-Verlag Berlin Heidelberg 2010
108 É. Gilles and P. Mathias

In the literature, we encounter various methods used to successfully perform


this type of reinforcement learning but, concerning learning classier system
using genetic algorithms, Q-Learning reinforcement methods and anticipation
based methods are the most widely used [13,5,4,14].
However, when facing some multi-step problems, most of these methods have
difficulty building accurate strategies. A frequently used solution is to add
a certain amount of information in order to build more precise strategies [11,13].
The main drawback of this solution is the need to determine how much information
should be added to solve a given multi-step problem. In this study, we chose to
focus on another possibility, which allows us to create a cognitive pattern within the
learning system by using a different structure of knowledge. Our assessment relies
on a parallel exploration of the available cognitive space by different collections
of classification rules. Our approach is mainly supported by the structure of the
cognitive system we use: the Pittsburgh Classifier System.
This paper is structured as follows. In Section 2 we introduce the type of
multi-step environment chosen and the measures we used. Then, in Section 3,
we describe the main algorithm of the Adapted Pittsburgh Classifier System, including
the improvements brought by the covering mechanism. After this point, we
describe the experiments we conducted and discuss the obtained results in Section 4.
This discussion is extended in Section 5 with a comparison between
the results previously obtained and results obtained with the eXtended Classifier
System (XCS) [5] on the same reinforcement learning problems. We then
conclude on the measured improvements yielded by the covering mechanism and on
further results.

2 Context and Related Work


2.1 Maze Problems
Maze problems, as simplified reinforcement learning problems, are often used in
the classifier systems literature to assess the efficiency of a learning method (XCS,
ZCS), an improvement of an existing classifier system (XCSM, ZCSM), or to
validate a new algorithm (ACS, AgentP, ATNoSFERES) [18]. Moreover, some
mazes also offer perceptually similar situations which require different actions to
reach the goal. These situations are designated in previous studies as aliasing
situations. In addition, the abilities needed by a learning classifier system
to solve a maze problem may be related to the abilities needed to solve a given
optimisation problem with a learning sample containing missing or aliased data.
A maze can be defined as a given number of neighbouring cells. A cell is a
formally defined bounded space: it is the elementary unit of a maze. When it is
not empty, a cell can contain an obstacle, food, an animat, or
possibly a predator of the animat.
The maze problem is defined as follows: an animat is randomly placed in
the maze and its aim is to move to a cell containing food. To perform
this task, it possesses a limited perception of its environment. This perception
is defined by collecting the state (i.e. obstacle, food or empty) of the eight cells
Building Accurate Strategies in Non Markovian Environments 109

surrounding its position. The animat can move only to an empty cell among
these neighbouring cells, moving step by step through the maze in order to fulfill
its goal.
A cognitive system studied in this kind of environment has to pilot
an animat through the maze in order to reach the food. The problem given
to this cognitive system is to attempt to adopt a policy of moves inside this
environment. This strategy must allow the animat to complete its goal within an
accurate and finite number of steps.
Maze environments offer plenty of parameters that allow one to evaluate the
complexity of a given maze and the efficiency of a given learning method. Aliasing
positions are described in [2] as positions with identical perceptions for the
animat. According to this study, there exist three types of aliasing positions. Type
I aliasing positions are located at different distances from the food but require
the same actions to get closer to it (Action_{x,y} = Action_{x',y'}, D(x,y) ≠ D(x',y')).
Type II aliasing positions are located at different distances from the
food and require different actions to get closer to this objective (Action_{x,y} ≠
Action_{x',y'}, D(x,y) ≠ D(x',y')). At last, type III aliasing squares, which also
require different actions to reach the goal, are located at the same distance from
it (Action_{x,y} ≠ Action_{x',y'}, D(x,y) = D(x',y')).
The following chart (Fig. 1), extracted from the same study, presents most
of the mazes available in the literature. It was built considering both the
type of the aliasing squares contained in each maze and the ratio Qm/Dm of the
average number of steps Qm taken by a Q-Learning algorithm to reach the food over the
average distance to the food Dm measured for this maze.
We have chosen to focus our experimental part on the study of two mazes:
the Woods101 and the Maze E2. The Woods101 maze is easy to represent
(Fig. 2) and offers a high complexity (Qm = 402.3, Dm = 2.7, Qm/Dm = 149 [18]).

Fig. 1. Complexity chart of maze-type environments



(a) Maze Woods101 (b) Maze E2

Fig. 2. Maze environments used in this study

On the other hand, the E2 maze, which is also easy to represent, has a higher
complexity than the Woods101 (Qm = 710.23, Dm = 2.33, Qm/Dm = 304.81 [18]).
Moreover, this maze also presents both type II and type III aliasing squares.

2.2 Random Walk and Optimal Number of Steps

In order to determine the quality of the results obtained on a given maze, there
are two situations that we need to test and establish clearly: the performance
observed when using a random method, and the performance that would be obtained
considering the optimal choices. We refer to as optimal the best choices that could
be made by a cognitive system which is not able to distinguish two perceptually
similar situations.
We have chosen to perform the same measures on the random walk and on
the optimal choice case as on the performances of the classifier systems. Concerning
the random walk, the animat is randomly placed in the maze and at each
time step it chooses a random direction among the eight directions available.
We measure the number of steps taken by the animat and the final distance of the
animat to the food (these choices are clearly established in Section 4.2; please
refer to it for more details).
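As a rough sketch of this random-walk baseline, the following Python snippet simulates it on a toy grid. The maze encoding ('#' obstacle, 'F' food, '.' empty) and the function signature are our own illustration, not the paper's:

```python
import random

# Eight moves, clockwise from North, as (row, column) offsets.
MOVES = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

def random_walk(maze, start, max_steps=50):
    """Walk randomly until the food ('F') is reached or max_steps is exceeded.

    Returns (steps_taken, final_position)."""
    r, c = start
    for step in range(1, max_steps + 1):
        dr, dc = random.choice(MOVES)
        nr, nc = r + dr, c + dc
        if maze[nr][nc] != '#':        # only non-obstacle cells are enterable
            r, c = nr, nc
        if maze[r][c] == 'F':
            return step, (r, c)
    return max_steps, (r, c)
```

The `max_steps` cap plays the role of the MaxSteps threshold used later in the experimental settings.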
To calculate the optimal number of steps that should be taken by the animat
to fulfill its goal, we had to consider the fact that the mazes we have chosen
contain aliasing squares. As the learning classifier systems we study do not use
memory mechanisms, we must take into account that neither the XCS nor
the APCS is able to tell the difference between two squares of a given aliasing
situation. As a consequence, we need to reevaluate the optimal number of steps
for each maze considering the optimal policy that these classifier systems would
be able to adopt to solve the problem.
Concerning the Woods101 environment, each aliasing square (dashed in Fig.
3a) forces the system to maintain 2 different moves in order to reach the
food. If the animat is placed on the left-side aliasing position, it must go south-east.
On the contrary, if the animat is placed on the right-side aliasing position,
it must go south-west to reach the food. As a consequence, the optimal number
of steps taken when starting from those positions should be set to 3 instead of 2
(Fig. 3b). This, in turn, increases the number of steps needed to reach the food
from squares situated behind the aliasing positions. Those modifications raise the

(a) Woods101 optimal policy      (b) Woods101 optimal number of steps
(c) Maze E2 optimal policy       (d) Maze E2 optimal number of steps

Fig. 3. Considered optimal policies

average number of steps for this maze to 3.5 instead of 2.7 [2] considering this
new optimal policy.
The impact of the aliasing positions is harder to evaluate when considering
the maze E2. Sixteen aliasing squares are at the same distance from the food (2
squares) and offer the same perception (8 empty squares). As a consequence,
when we consider each one of these squares, we notice that at least 2 opposite
available diagonal directions can be kept in order to reach the food (see Fig. 3c).
Consequently, in order to establish a metric for this maze, we will consider the
following statement to evaluate the optimal number of steps for those squares:
we exclude action chains that allow moves from one aliasing position to another
of the same type. These modifications raise the average number of steps when
using an optimal policy on this maze to 3.16 instead of 2.33.

3 Principles of Studied CS
3.1 The Adapted Pittsburgh Classifier System
The Adapted Pittsburgh Classifier System (APCS) is derived from the original
work of Smith on LS1 [16]. Like Michigan classifier systems (ZCS, XCS) [15],
Pittsburgh classifier systems rely on two basic elements: the classifier, which is
the container of the knowledge acquired by the system, and the genetic algorithm,
which allows this knowledge to evolve. Nevertheless, instead of considering a
global collection of classifiers, Pittsburgh classifier systems co-evolve multiple
small collections of classifiers in parallel.

In the next sub-sections, we detail how the structure, evaluation and
evolution mechanisms of the APCS differ from the original work done by Smith on
LS1 and from other existing Pittsburgh approaches.

Structure. Like other classifier systems, the APCS is built with production rules,
also called classifiers. These classifiers are formed of two parts: a condition part
(also called sensor) which is sensitive to signals from the environment, and an
action part (also called effector) which is the answer proposed by the classifier to
signals that activate the condition part. When dealing with multi-step problems,
this answer may induce a modification of the perception of the system.
In this case study, the condition part is defined over a ternary alphabet
{0, 1, #}, where # (wildcard) stands for both 0 and 1. The action part contains
only bits {0, 1}. As we said before, Pittsburgh classifier systems evolve collections
of classifiers. These collections are designated individuals. Contrary to other
Pittsburgh classifier systems (GABIL [7], GALE [3], GAssist [1]), the individuals
of the APCS contain a fixed number of classifiers (see Fig. 4).
Table 1 summarizes the structural differences between two recent Pittsburgh
classifier systems and the APCS. We can notice that, regarding the structure of the
population, the main difference resides in the composition of the individuals. As
a consequence, the classical genetic operators (crossover and mutation) are also
impacted.
In the APCS, a population of individuals is initially created using four
parameters:

- A fixed number Ni of individuals in the population.
- A fixed number Nc of classifiers per individual.
- A fixed size Lc for all classifiers.
- An allelic probability P# of having a wildcard in the condition part.

This population is first filled with random classifiers. It is also possible to fill the
initial population with specific classifiers uploaded from a file. We will now focus
on how this population is evaluated in the case of the APCS.
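A minimal sketch of this initialization, assuming the four parameters above (Ni, Nc, Lc, P#) and our own hypothetical representation of a classifier as a (condition, action) pair of strings:

```python
import random

def random_classifier(l_c, p_wild, action_bits=3):
    """One classifier: a ternary condition of length Lc and a binary action."""
    condition = ''.join('#' if random.random() < p_wild else random.choice('01')
                        for _ in range(l_c))
    action = ''.join(random.choice('01') for _ in range(action_bits))
    return (condition, action)

def init_population(n_i, n_c, l_c, p_wild):
    """APCS population: Ni individuals, each a fixed-size list of Nc classifiers."""
    return [[random_classifier(l_c, p_wild) for _ in range(n_c)]
            for _ in range(n_i)]
```

Note that, unlike GABIL/GALE/GAssist, every individual here has exactly Nc classifiers; that fixed size is what the genetic operators below rely on.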

Evaluation mechanism. As they are collections of classifiers, the individuals
of an APCS interact with their environment through their sensors and effectors.

Fig. 4. APCS Population



Table 1. Comparison between the GALE/GAssist structure and the APCS structure

Element                           GALE/GAssist              APCS
Number of individuals Ni          Fixed                     Fixed
Classifiers (Nc) per individual   Variable                  Fixed
Crossover operator                Variable-length operator  Standard monopoint
Mutation operator                 Allelic                   Allelic

As these collections evolve in parallel, each individual is assigned either to a local
partition of the global environment or to a copy of this environment, which
provides it with elements to perform its learning. An individual is rewarded through
a fitness function which measures the adequation of the answers given by this
individual to its cognitive context. As a consequence, a scalar (the strength)
is assigned to each individual and globally reflects the mean strength of the
classifiers composing it.
As first noted by Smith, by Énée [8] and by other studies related to Pittsburgh
classifier systems [1], at the end of the simulation each individual tends
to be a potential solution to the problem. In order to ensure that the strength
of an individual reflects its composition, all individuals are submitted to a fixed
number K of trials.
A trial consists of four steps (see Fig. 5). First, the environment sends a signal
to the individual which is currently evaluated. Then, thanks to that signal, the
individual forms a match set [M] which contains all of its classifiers that have
been activated by this signal. Next, a classifier is selected in [M] to perform its
action. We can use four methods to select a classifier in [M]: (1) randomly, (2)
the one with the lowest wildcard rate in the condition part (specific), (3) the one with the
highest wildcard rate in the condition part (generic), or (4) the first one in [M]. The
selected classifier is used to perform its action upon the environment. At
last, the fitness function gives a K-weighted reward to the individual considering
the action expressed.
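Assuming a classifier is represented as a (condition, action) pair of strings (our own convention, not the paper's), the match-set construction and the four selection strategies can be sketched as:

```python
import random

def matches(condition, signal):
    """Ternary match: '#' stands for both 0 and 1."""
    return all(c == '#' or c == s for c, s in zip(condition, signal))

def match_set(individual, signal):
    """[M]: every classifier of the individual activated by the signal."""
    return [cl for cl in individual if matches(cl[0], signal)]

def select(match_list, strategy="random"):
    """Pick one classifier from [M] using one of the four strategies."""
    if strategy == "random":
        return random.choice(match_list)
    if strategy == "specific":   # fewest wildcards in the condition part
        return min(match_list, key=lambda cl: cl[0].count('#'))
    if strategy == "generic":    # most wildcards in the condition part
        return max(match_list, key=lambda cl: cl[0].count('#'))
    return match_list[0]         # "first"
```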

Fig. 5. Evaluation mechanism



Individuals are evaluated separately. Due to that fact, regarding multi-step
problems, the environment related to each individual may change along the
learning stage. The major interest we found in separated environments is that
the genetic algorithm mixes together the positive experiences of each individual and
allows the population of individuals to evolve and to improve their adequation to
the global environment.
In the following sub-section, we describe this evolution mechanism more
precisely.

Evolution mechanism: the genetic algorithm. The evolution algorithm
(here a genetic algorithm) is applied once all individuals have been evaluated K
times. The reward attributed to each individual can either be continuous, using
a Q-Learning method for the fitness / strength update, or be reset at each
generation and based only upon the trials that occurred during the last generation.
As for other learning classifier systems, the GA is essential to the APCS: it allows
individuals to exchange cognitive material. The GA applies three main operators
to individuals of the population using their fitness. It first selects parents
that will eventually reproduce using the crossover operator and the mutation
operator to create new offspring.
The selection mechanism mainly uses the roulette wheel or tournament
methods to select parents that will be kept to generate the next generation.
The crossover operator randomly chooses a {n, i} pair, where i ∈ {2, ..., Lc − 1} is
the position where the crossover will occur within classifier n (n ∈ {1, ..., Nc}) in each
individual selected for reproduction. The crossover is one-point and manipulates
individuals of the same size. The reason why Énée chose fixed-length individuals
[8] essentially comes from Smith's work [16] and from Bacardit's work
on the bloat effect [1]. Both studies have stated that a crossover operator
between individuals of different lengths will mainly accentuate the difference in
length between the two offspring. The selection mechanism would then progressively
erase short individuals that would not be able to answer problems because
they do not have enough classifiers.
Thus, bigger individuals would be selected and the individual size would tend
to grow through generations. Smith also observed that while the individuals
tend to grow, they first reach an optimal size to properly answer the problem.
Then additional classifiers appear and produce noise in the answer: this effect is also
known as the bloat effect [1]. In our case study, fixed-size individuals and
monopoint crossover appeared to be the simplest way to avoid this side effect
without discarding the advantages of the intrinsic parallelism of the individuals
of an APCS.
The mutation operator is allelic: each position (allele) of a classifier
can be mutated depending upon a mutation probability PMut.
Crossover and mutation are expressed as probabilities.
The best genetic algorithm parameter values usually taken by Smith are (except for the
selection mechanism):

- Selection mechanism: roulette wheel.
- Crossover probability: 50-100%.
- Allelic mutation probability: 0.01-0.5%.
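A sketch of the monopoint crossover and allelic mutation under these settings. Flattening each fixed-size individual into one genome string is our own simplification of the {n, i} cut-point choice; the key property, that equal-length parents yield equal-length offspring, is preserved:

```python
import random

def one_point_crossover(parent_a, parent_b):
    """Swap tails at a single random cut; equal-length parents keep
    equal-length offspring, which avoids the bloat effect discussed above."""
    assert len(parent_a) == len(parent_b)
    cut = random.randrange(1, len(parent_a))
    return (parent_a[:cut] + parent_b[cut:],
            parent_b[:cut] + parent_a[cut:])

def allelic_mutation(genome, p_mut=0.005, alphabet="01#"):
    """Each position (allele) mutates independently with probability PMut."""
    return ''.join(random.choice(alphabet) if random.random() < p_mut else allele
                   for allele in genome)
```

In a full implementation the action bits would be restricted to the binary alphabet; the single `alphabet` parameter here is a simplification.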

As in other Pittsburgh LCS, the individuals contained in a population of the APCS
tend to be very similar after a given number of generations, which is problem-dependent
[9]. At this point, each individual is a complete solution to the problem.

Further discussion on the APCS and the covering mechanism. Figure 3 shows
the optimal policy that could be used to solve mazes E2 and Woods101 without
memory. This optimal policy can be considered only when the learning system is
able to keep different actions associated with the same biased perceptual situation.
The structure of the individuals used in the APCS allows it to maintain classifiers that
match the same perceptual situation but act differently. This particularity is
mainly due to the fact that each individual is globally rewarded. Due to that
fact, the strength of an individual directly reflects the adequation of its collection
of classifiers to the environment.
As previously described in this section, there are four mechanisms that can be
used during the evaluation stage to select a classifier within the match set [M]:
(1) random, (2) specific, (3) generic, (4) first. If the random classifier selection
mechanism is chosen, the system is guaranteed to fire, at least sometimes, the
adequate classifier if it is in its pool [M].
These two previous points indicate that the APCS should be efficient when
facing problems with non-Markovian chains to the solution. As a consequence,
our system should be able to build accurate strategies without memory. This
assessment is strongly linked to the ability of the fitness function to reward the
best-fitted individuals.
Algorithm 1 proposes a complete view of the evaluation mechanism.
The ending criterion evoked in this algorithm can either be a number of
generations, or be met when the K trials are successful for an individual. Thus, to
evaluate an APCS, the following parameters are needed:

- Simulator / environment reset mode: to zero / to previous status.
- A number of trials.
- A rule selection mechanism: random / specific / generic / first.
- A number of generations.

In the introduction of this study, we announced that the covering mechanism
is newly implemented in the APCS. Directly inspired by Wilson's work on
XCS [17], this mechanism consists here in replacing the sensor part of a classifier
when no classifier matches a given signal from the environment. To enhance this
mechanism, we add a parameter to each classifier, called the covering time Ct, in
order to measure the number of generations for which a classifier has not been activated
before it should be covered. This allows every classifier to have a chance
of being useful to its cognitive pool, i.e. the individual, before being covered.
The covering mechanism replaces the condition part of a classifier with the message
from the environment, adding wildcards depending on the wildcard probability
P# (see Section 3.1). Nevertheless, the action part of the covered classifier is kept
as it was before covering.
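A minimal sketch of this covering step, with the same hypothetical (condition, action) string representation of a classifier. Note that only the condition part is rebuilt from the unmatched message:

```python
import random

def cover(classifier, message, p_wild):
    """Replace the condition with the environment message, generalized with
    wildcards at rate P#; the action part is kept as it was."""
    _, action = classifier
    condition = ''.join('#' if random.random() < p_wild else bit
                        for bit in message)
    return (condition, action)
```

By construction, the covered classifier is guaranteed to match the message that triggered the covering, since every condition position is either the corresponding message bit or a wildcard.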
Now that we have presented the main features of our system, the APCS, we
focus on the other learning classifier system used in this study: the eXtended
Classifier System (XCS).

3.2 The eXtended Classifier System

The eXtended Classifier System is a classifier system issued from the Michigan approach,
first built by Wilson in 1995 [17]. It started to become mature around
2000 thanks to Butz, Lanzi and Kovacs, who carried out rigorous performance and
accuracy studies of this system, allowing it to evolve through new mechanisms:
improvements to the main algorithm, and the addition of memory (XCSM1, XCSM2 [12]).
As a consequence, this system shows quite good results, even on maze-type
environments with aliasing squares (see [13]).

Table 2. Differences between XCS and APCS

Element                           XCS                   APCS
Number of individuals Ni          Adaptive              Fixed
Classifiers (Nc) per individual   1                     Fixed
Crossover operator                Literature standards  Standard monopoint
Mutation operator                 Allelic               Allelic

The XCS principle, as described in [17], consists in a population of classifiers
(sensor plus effector) called individuals, whose evolution relies on the ability
of each classifier to predict, thanks to a Q-Learning algorithm, the reward that
should be obtained by an individual for a given action in answer to a given signal.
These classifiers are mixed with a classical GA and converge easily towards quite
generalist classifiers that answer the problem globally.
Table 2 summarizes the main structural differences between XCS and the APCS.
The main difference resides in the fact that a classifier is an individual in the
design of XCS. As a consequence, a given reward concerns only one classifier.
This mainly explains why XCS without memory may not be able to solve non-Markovian
processes (please also refer to [11]). In this study, we have compared
the performances of the APCS with those obtained with an XCS on the same
maze.
To perform our measures, we have used version 1.2 of the XCS [6], which
corresponds to the algorithmic version of XCS published in the article by Butz
and Wilson [5]. This version includes most of the mechanisms developed so far
for the XCS, without including the register memory mechanism. We strongly
recommend that the reader consult [17] and [5] for further details on XCS.

Algorithm 1. The Adapted Pittsburgh-style classifier system

Begin
  // Random initialization of P, or initialization from a file.
  Fill(P);
  Generation = 1;
  Repeat
    For (all Ij of P) Do
      // Reset the environment to individual Ij's last status, or to zero.
      Reset_Environment(j);
      // In case of continuous reward, remove the following line.
      RewardIj = 0;
      For (a number of trials k) Do
        // Store a message from the environment.
        Fill(Message);
        // Create the match set from the classifiers of Ij that match the signal.
        Fill(Match-List);
        If (IsEmpty(Match-List)) Then
          // Select a classifier not used for the last Ct generations.
          ReplacePosition = ChooseCoveredClassifier();
          // Replace the condition part of classifier number ReplacePosition.
          CoverClassifier(ReplacePosition, Message);
          Fill(Match-List);
        Endif
        If (Size(Match-List) > 1) Then
          // Choose a classifier in Match-List using a strategy described in this section.
          C = BestChoice(Match-List);
        Else
          C = First(Match-List);
        Endif
        A = Action(C);
        // Act upon the environment using action A.
        DoAction(A);
        // ActionReward is the fitness function.
        RewardIj = RewardIj + ActionReward(A)/k;
      EndFor
      // Change the strength of individual Ij using RewardIj.
      ChangeStrength(Ij, RewardIj);
    EndFor
    // The genetic algorithm is applied after every individual has been evaluated.
    ApplyGA(P);
    Generation = Generation + 1;
  Until (ending criterion met);
End

4 Experimental Results
4.1 Experimental Settings
For each experiment, we submitted 20000 consecutive problems to the system:
for each problem, the animat is randomly put on a free square of the maze, and
the trial stops when one of these two conditions is fulfilled:
1. the position of the animat in the maze is equal to the position of the food;
2. the number of steps taken by the animat surpasses a certain threshold
(MaxSteps, equal to 50 steps for all presented results).
When the problem is solved, we record the starting distance of the animat to the
food, its final distance to the food (1), and its total number of steps. Both APCS
and XCS results shown in this paper are averaged over 10 experiments.
As done in [5] and in [13], the signal received by the system consists of a 16-bit
string that represents each of the 8 squares surrounding the animat. These
squares are encoded clockwise, starting from North: (00) stands for an empty cell,
(11) for food and (10) for an obstacle. As a consequence, the sensor part of the
classifier also contains 16 positions. Each position in the sensor can be randomly
occupied by 0, 1 or a wildcard (#). The effector part, coded by a string
of 3 bits, stands for one of the eight directions available to the animat, coded
clockwise, like the sensor part.
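Using an illustrative grid encoding of our own ('.' empty, 'F' food, '#' obstacle), the 16-bit sensor string can be built as:

```python
# Two-bit codes, as in the paper: empty (00), food (11), obstacle (10).
CODE = {'.': '00', 'F': '11', '#': '10'}

# The eight surrounding cells, clockwise starting from North.
NEIGHBOURS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

def perceive(maze, r, c):
    """16-bit perception of the animat standing at (r, c)."""
    return ''.join(CODE[maze[r + dr][c + dc]] for dr, dc in NEIGHBOURS)
```

For an animat standing one cell west of the food in a corridor, the neighbours read (clockwise from North) obstacle, obstacle, food, obstacle, obstacle, obstacle, empty, obstacle, giving the string "1010111010100010".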
The specific settings used for XCS are the same as those used by Lanzi in 1999
[13]; please refer to this experiment for more details.
Concerning the APCS, each evaluation group, i.e. individual, controls an animat.
As a consequence, during the experiment, each group is submitted to the
20000 problems and solves them asynchronously. The experiment stops when all
evaluation groups have solved at least 20000 problems. The number of moves is
measured during each of the K trials (see Section 3.1).

Algorithm 2. Algorithmic view of the fitness function used in this study

if (next position of Ij = food)
    RewardIj ← RewardIj + 1.0/K
else if (next position of Ij = obstacle)
    RewardIj ← RewardIj − 0.5/K
else
    RewardIj ← RewardIj + 0.2/K
endif
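Reading the reward increments in the listing above as being averaged over the K trials (consistent with the K-weighted reward of Section 3.1), the fitness update can be sketched in Python as follows (the `next_cell` naming is ours):

```python
def action_reward(next_cell, K):
    """Per-trial reward increment for individual Ij (cf. Algorithm 2)."""
    if next_cell == 'food':
        return 1.0 / K
    if next_cell == 'obstacle':
        return -0.5 / K
    return 0.2 / K       # any move toward an empty cell
```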

In Algorithm 2, we propose an algorithmic view of the fitness function we
use in this study. This function is defined by the move done by a given animat
at each trial: if the move is correct (i.e. if the animat moves toward an empty cell), the
evaluated individual receives a reward of 0.2/K and the movement is performed by
the animat of the group; if the animat reaches the food, the individual receives a reward of 1.0/K and
the animat of this group is randomly placed in the maze. Otherwise, the individual
receives a negative reward of −0.5/K and the animat of the group is not moved.
During the K trials, for each individual, the random selection method is used to
select a classifier from [M] (see Section 3.1).

(1) Due to MaxSteps, it may be greater than 0.
Concerning the GA step, the mutation mechanism reinforces the
exploratory ability of the system by creating new classifiers. These classifiers
are created by modifying existing classifiers locally and randomly according
to a certain rate, which is expressed by the chosen mutation probability PMut.
However, two consecutive positions of the considered mazes rarely differ from
one another by more than 4 bits, and the non-sense sequence (01) can, in the
present case, invalidate the activation of a classifier. As a consequence, the higher
the number of mutated bits, the more the system may lose classifiers that could
have allowed it to find the food. Experiments performed in [8] have
also validated that a high mutation rate (PMut > 0.2) prevents the system from
keeping optimal classifiers.
The other important parameter of the GA step, the crossover rate, has a great
influence on the homogeneity and the stabilisation of the system: combined with
elitism, it allows the system to preserve the best behaviours expressed inside the
population. Énée measured in [8] that a low crossover rate (PCross < 0.6)
slows the convergence of the system by preventing good genetic precursors from being
replicated inside the population.
Elitism was set to 60% in order to keep the best parents and to stabilize the
population more rapidly.
As shown in [9,8], stable results are obtained using a mutation probability
PMut set to 0.005 and a crossover probability PCross set to 0.75. As a consequence,
we have chosen to use those parameter values to perform our experiments.

4.2 Results without Covering


For the presented experiments, we made two significant measures: the first measure,
the average number of steps taken by the animats, allows us to assess the
ability of the system to conform to a moving policy inside the maze. Due to
that fact, the evolution of this measure reflects and characterizes the evolution of
this ability.
As a second measure, the average final distance of the evaluation groups at
the end of a trial allows us to validate the results outlined by the first measure.
As the animat is randomly placed in the environment when the number of steps
it has taken crosses a certain threshold, this second measure shows the efficiency
of the policy built by the system.
Now, we can study the influence of the parameters proper to the classifier system.
The number of individuals and the number of classifiers are the only parameters
that really have an influence on the cognitive properties of the CS [8,16]. First,
for a fixed number of classifiers (Nc = 30), let us compare the results obtained for NI
between 20 and 50 (Table 3).
120 . Gilles and P. Mathias

Table 3. Measure of the influence of the variation of the number of individuals on the
average number of steps (Nc = 30, NI = 20, ..., 50)

                  Woods 101                          Maze E2
              Avg. nb of steps  Avg. final dist.  Avg. nb of steps  Avg. final dist.
Random walk        29.19             0.66             32.49             0.78
NI = 20             8.72             0.12             24.84             0.58
NI = 30             7.52             0.08             15.03             0.11
NI = 40             7.11             0.05             13.43             0.08
NI = 50             6.56             0.03             12.62             0.06

Table 4. Measure of the influence of the variation of the number of classifiers on the
average number of steps (NI = 30, Nc = 20, ..., 50)

                  Woods 101                          Maze E2
              Avg. nb of steps  Avg. final dist.  Avg. nb of steps  Avg. final dist.
Random walk        29.19             0.66             32.49             0.78
Nc = 20            12.13             0.33             33.35             1.25
Nc = 30             7.52             0.08             15.03             0.11
Nc = 40            11.10             0.28             12.53             0.06
Nc = 50             8.67             0.18             11.88             0.07

When the number of individuals changes, it also modifies the number of
evaluation groups (see Section 3.1). As, in this experiment, each group controls the
moves of an animat, the classifier system is able to learn on NI different situations
for each trial. As a consequence, raising the number of individuals
also raises the exploratory ability of the system, which accelerates and improves
the convergence of the system.
When considering the evolution of the average final distance to the food
(Table 3, column Avg. final dist.), we can conclude that raising the number
of individuals contributes to the system's stabilization by diffusing a common
accurate strategy more efficiently.
Let us now consider the influence of the variation of the number of classifiers
contained in an individual on the performances of the system on this problem.
The following experiments (Table 4) were conducted with a fixed number
of individuals (NI = 30) and various numbers of classifiers (Nc between 20 and
50).
Each classifier determines the answer of the individual to one or several given
signals coming from the environment. As a consequence, the number of classifiers
contained in an individual can possibly have an influence on the number of
signals which may trigger an answer from a given individual.
As shown by Smith [16], if the potential information contained in an individual
is too high, the exceeding information generates noise that disturbs
the answer of the system and the evolution of better-fitted classifiers. In addition
to this phenomenon, as non-Markovian situations may induce a rise in the
average wildcard rate of the classifiers [2], each additional classifier may bring
more unnecessary information. This side effect emphasizes the fact that there exists
a potential information threshold on the information contained in an individual.
When considering the obtained measures, we can conclude that this threshold
depends on the considered problem.
As a conclusion: if, at the beginning, providing the individuals of the APCS with
additional cognitive capacity can improve the quality of the answer of the system,
additional classifiers may contain useless precursors that disturb the convergence
of the system. As said earlier, this issue is successfully addressed in studies of
other Pittsburgh approaches as the bloat effect [1].

4.3 Results Obtained Using Covering


In order to improve our results, we adapted the covering mechanism used
in the XCS to allow the APCS to generate well-fitted classifiers when encountering
an unknown signal. To measure the impact of the activation of this mechanism
on the evolution of the classifiers contained in the individuals of the APCS, we
chose to measure the changes registered when choosing different values for
the Ct parameter. To prevent any additional perturbation, these tests were
performed using a fixed number of classifiers and a fixed number of individuals.
In this paper, we present those results for NI = 30 and Nc = 30
for the Woods 101, and NI = 40, Nc = 30 for the maze E2, with Ct
between 0 (no covering) and 30 (covering uses the classifiers that have been least triggered
during the last 30 generations).
We can notice in table 5 that the number of steps done by the system to reach
the food decreases when increasing the Ct parameter. This improvement occurs
due to the modication done to the available pool of classiers by the covering
mechanism. As we have measured, a classier that has not been activated on Ct
generations has a strong probability to contain one or many defective precursors.
While this mechanism allows to remove those precursors from the population, it
also increases the accuracy of the strategy built by the system. This phenomenon
is suggested by the evolution of the average nal distance to the food of the
animats: the tendancies observed on the rst measure are assessed by the second
one.
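The staleness-based covering rule described above can be sketched as follows. This is a minimal illustration written from the description in the text, not the authors' implementation: the `Classifier` record, the `make_covering_rule` factory and the generation counter are all hypothetical stand-ins.

```python
class Classifier:
    """Toy classifier record: a condition plus the last generation
    in which it matched a signal from the environment."""
    def __init__(self, condition, last_active=0):
        self.condition = condition
        self.last_active = last_active

def apply_covering(population, current_gen, ct, make_covering_rule):
    """Replace every classifier that has not been triggered during the
    last `ct` generations (Ct = 0 disables covering, as in the paper)."""
    if ct == 0:
        return list(population)
    return [make_covering_rule() if current_gen - cl.last_active >= ct else cl
            for cl in population]
```

With `ct = 7` and the current generation at 12, a classifier last active at generation 2 would be replaced, while one active at generation 10 would survive.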

Table 5. Measure of the influence of the variation of the parameter Ct on the obtained
results

          Woods 101 (NI = 30, Nc = 30)         Maze E2 (NI = 40, Nc = 30)
          Avg. nb of steps  Avg. final dist.   Avg. nb of steps  Avg. final dist.
Ct = 0    7.52              0.08               13.43             0.08
Ct = 3    6.22              0.03               8.65              0.01
Ct = 7    5.70              0.001              8.47              0.01
Ct = 11   6.03              0.02               8.60              0.01
Ct = 15   5.84              0.01               8.64              0.01
Ct = 20   5.86              0.01               8.48              0.01
Ct = 30   5.77              0.01               8.83              0.01
122 . Gilles and P. Mathias

(a) Woods 101 (b) Maze E2

Fig. 6. Measure of the gain related to the covering mechanism

As a summary, through the results presented in Table 5, we deduce that the
classifiers whose condition part is replaced by the covering mechanism carry
an amount of information whose usability decreases with the number of
evaluation steps passed without being activated. As the animat problem studied
is a multi-step environment, it may occur that a classifier that has not been
triggered at a given evaluation step ti can be triggered at step ti+n,
depending on the moves of the animats through the environment. However,
the number of squares that can be occupied by an animat is finite, so there
exists an upper bound on the number of available signals generated by this
environment. If we relate this hypothesis to the previous one, we can suppose
that, for each environment, every situation will have been tested by the system
after a finite number of generations.
Figures 6a and 6b plot the average gain in % over all the conducted experi-
ments as we increase the Ct parameter. We can observe that this gain becomes
equal to 0 once a certain value of Ct is crossed. Thus, we can suppose that there
exists a finite number of generations N_CT which may allow us to diagnose that,
if a classifier has not matched any signal emitted by the environment during the
last N_CT generations, the signals corresponding to this classifier are not avail-
able in the current environment. Due to that fact, and according to the structure
of the APCS (see Section 3.1), if we extend the range of this conclusion, we can
also suppose that, for each environment of this type, there exists a threshold value
of evaluation steps K N_CT over which a higher value of
the Ct parameter does not carry any additional knowledge about the potentially
useless information carried by a classifier.

5 Comparison between APCS and XCS

As shown by the results of the experiments presented in this paper, the APCS
manages to evolve classifiers allowing it to adopt a stable moving policy whose
quality is greatly improved by the recently added covering mechanism.

(a) XCS (2000 Individuals) (b) APCS (NI = 30, Nc = 20, Ct = 7)

Fig. 7. Best performances with Woods 101

We will now focus on a comparative study of the best results we obtained
with XCS (Figs. 7 and 8) against the best results obtained with the APCS.
The parameters used for XCS during those experiments are those used in
the experiments conducted by Lanzi in 1999 on this type of environment [13],
except for the number of individuals, which is 2000 for Woods 101
and 8000 for Maze E2.
We will center our discussion on the policies built from the classifiers. When
we consider the differences between the best results obtained by XCS and
the best results obtained by the APCS, we notice two differences. The first one,
significant on Maze E2 but not on Woods 101, is that the average number
of steps taken by the APCS is closer to the optimum than the average number of
steps taken by XCS. This statement finds its foundation in the second
difference noticed: the average final distance to the food for the APCS on the
two mazes is lower than the one measured for XCS. This difference means
that the APCS accurately finds the food more often than XCS, which implies
that the policy built by the APCS is more stable and accurate.
However, the learning strategies employed by the two systems are quite differ-
ent from one another: the XCS learning stage focuses on the value function, and
the policy deployed by this system is strongly dependent on this value function.
In addition to that fact, in order to keep its predictions accurate, this
mechanism requires maintaining most of the actions available for a given signal.
XCS was built to find Markov chains to solve a problem, so it is not supposed
to maintain, at the same level of prediction, classifiers with identical condition
parts and different actions. Reward in XCS is given to each classifier that is part
of the Markov chain leading to the solution. This is what we pointed out while
describing both classifier systems. The APCS maintains knowledge structures with
several classifiers. As the reward concerns the whole knowledge structure, it is
possible to have two or more classifiers with the same condition part but with
different actions upon the environment. Due to those facts, even if XCS succeeds
in keeping an almost stable policy for the Woods 101 environment, it fails when
facing an environment with numerous aliasing situations.

(a) XCS (8000 Individuals) (b) APCS (NI = 40, Nc = 50, Ct = 11)

Fig. 8. Best performances with Maze E2

By contrast, the learning mechanism used by the APCS relies on its cog-
nitive capacity (Nc), which is highly problem-dependent but allows it to develop
strategies based on its past actions. The most accurate classifiers will tend to
stay in the population because they allow their owner to reach the food
more accurately and more often than the other individuals of the APCS. As
a consequence, in a multistep environment, the classifiers contained by those
strong individuals allow them to build reward-dependent action chains.
Moreover, instead of solving one problem at a time, the APCS tries to solve
NI problems at the same time (see Section 3.1.2), which allows it to explore a
higher number of situations in a short simulation time. Due to all those facts,
the system tends to find a sub-optimal solution that allows it to be rewarded
accurately and more often.

Fig. 9. Example of policy deployed by the APCS to solve the Maze E2



We observe the formation of policies (see Fig. 9 for the policy obtained with the
APCS in Maze E2) that make it possible to reach the food from every position,
even when the environment contains aliasing squares. This policy evolves with
the frequency of the reward encountered by the individuals, which allows them
to adapt and modify (via the genetic algorithm) their classifiers.

6 Conclusions
Through this paper, we have shown and studied results which indicate that,
without any knowledge of its environment, and even when facing non-Markovian
positions, the Adapted Pittsburgh Classifier System, improved with the covering
mechanism and with fitted parameters, is able to adopt an almost stable
policy in maze environments containing aliasing squares. This policy allows it to
reach the food accurately in a low, though not optimal, number of steps.
When studying the number of classifiers contained by an individual, we have
shown that raising this local cognitive capacity can benefit the system if it
remains under a problem-dependent threshold (see also [9]). Cognitive capacity
provided beyond this threshold led to the conservation of defective precursors and
a strong disturbance of the system's answer due to them. Fortunately, as shown in
the experiments, those precursors are assimilated or eliminated by the system in
a number of trials that depends on the amount of useless information they carry.
We have also shown that the covering mechanism we propose has a notice-
able influence on the performance of the system: classifiers eliminated by this
mechanism carry information whose usability decreases with the number of
evaluation steps during which those classifiers are not triggered
by a signal from the environment.
Some interesting further work remains to be finalized, especially concerning
the precise construction of the policy evolved by the APCS along the experiments.

References
1. Bacardit, J., Garrell-Guiu, J.M.: Bloat control and generalization pressure using
the minimum description length principle for a Pittsburgh approach learning clas-
sifier system. In: Kovacs, T., Llorà, X., Takadama, K., Lanzi, P.L., Stolzmann, W.,
Wilson, S.W. (eds.) IWLCS 2003. LNCS (LNAI), vol. 4399, pp. 59–79. Springer,
Heidelberg (2007)
2. Bagnall, A.J., Zatuchna, Z.: On the classification of maze problems. In: Bull, L.,
Kovacs, T. (eds.) Applications of Learning Classifier Systems. Studies in Fuzziness
and Soft Computing, vol. 183, pp. 307–316. Springer, Heidelberg (2005)
3. Bernadó-Mansilla, E., Llorà, X., Garrell-Guiu, J.M.: XCS and GALE: A compara-
tive study of two learning classifier systems on data mining. In: Lanzi, P.L., Stolz-
mann, W., Wilson, S.W. (eds.) IWLCS 2001. LNCS (LNAI), vol. 2321, pp. 115–132.
Springer, Heidelberg (2002)
4. Bull, L.: Lookahead and latent learning in ZCS. In: GECCO 2002: Proceedings of
the Genetic and Evolutionary Computation Conference, New York, July 9-13, pp.
897–904. Morgan Kaufmann Publishers, San Francisco (2002)

5. Butz, M., Wilson, S.W.: An algorithmic description of XCS. In: Lanzi, P.L., Stolz-
mann, W., Wilson, S.W. (eds.) IWLCS 2000. LNCS (LNAI), vol. 1996, pp. 253–272.
Springer, Heidelberg (2001)
6. Butz, M.V.: Documentation of XCS+TS c-code 1.2. IlliGAL Report 2003023, Illi-
nois Genetic Algorithms Laboratory (October 2003)
7. De Jong, K.A., Spears, W.M., Gordon, D.F.: Using Genetic Algorithms for Concept
Learning. Machine Learning 13(3), 161–188 (1993)
8. Énée, G.: Systèmes de Classeurs et Communication dans les Systèmes Multi-
Agents. PhD thesis, École Doctorale de STIC, Université de Nice Sophia-Antipolis
(January 2003)
9. Énée, G., Barbaroux, P.: Adapted Pittsburgh-style classifier-system: Case-study.
In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2003. LNCS (LNAI),
vol. 2661, pp. 30–45. Springer, Heidelberg (2003)
10. Holmes, J.H., Lanzi, P.L., Stolzmann, W., Wilson, S.W.: Learning classifier sys-
tems: New models, successful applications. Inf. Process. Lett. 82(1), 23–30 (2002)
11. Lanzi, P.L.: Adding Memory to XCS. In: Proceedings of the IEEE Conference on
Evolutionary Computation (ICEC 1998). IEEE Press, Los Alamitos (1998),
http://ftp.elet.polimi.it/people/lanzi/icec98.ps.gz
12. Lanzi, P.L.: An analysis of the memory mechanism of XCSM. In: Proceedings of
the Third Genetic Programming Conference, pp. 643–651. Morgan Kaufmann, San
Francisco (1998), http://ftp.elet.polimi.it/people/lanzi/gp98.ps.gz
13. Lanzi, P.L., Wilson, S.W.: Optimal classifier system performance in non-Markovian
environments. Technical Report 99.36, Illinois Genetic Algorithms Laboratory, Mi-
lan, Italy (1999)
14. Sigaud, O.: Les systèmes de classeurs: un état de l'art. Revue d'Intelligence Arti-
ficielle, RSTI série RIA, Lavoisier, vol. 21 (February 2007)
15. Sigaud, O., Wilson, S.W.: Learning classifier systems: a survey. Soft Com-
put. 11(11), 1065–1078 (2007)
16. Smith, S.F.: A Learning System Based on Genetic Adaptive Algorithms. PhD the-
sis, University of Pittsburgh (1980)
17. Wilson, S.W.: Classifier fitness based on accuracy. Evolutionary Computation 3(2),
148–175 (1995)
18. Zatuchna, Z.V.: AgentP: A Learning Classifier System with Associative Perception
in Maze Environments. PhD thesis, School of Computing Sciences, UEA (2005)
Classification Potential vs. Classification Accuracy:
A Comprehensive Study of Evolutionary Algorithms
with Biomedical Datasets

Ajay Kumar Tanwani and Muddassar Farooq

Next Generation Intelligent Networks Research Center (nexGIN RC)


National University of Computer & Emerging Sciences (FAST-NU)
Islamabad, Pakistan
{ajay.tanwani,muddassar.farooq}@nexginrc.org

Abstract. Biomedical datasets pose a unique challenge for machine learning and
data mining techniques to extract accurate, comprehensible and hidden knowl-
edge from them. In this paper, we investigate the role of a biomedical dataset
on the classification accuracy of an algorithm. To this end, we quantify the com-
plexity of a biomedical dataset in terms of its missing values, imbalance ratio,
noise and information gain. We have performed our experiments using six well-
known evolutionary rule learning algorithms XCS, UCS, GAssist, cAnt-Miner,
SLAVE and Ishibuchi on 31 publicly available biomedical datasets. The results
of our experiments and statistical analysis show that GAssist gives better clas-
sification results on the majority of biomedical datasets among the compared
schemes, but cannot be categorized as the best classifier. Moreover, our analysis
reveals that the nature of a biomedical dataset, not the selection of the evolutionary
algorithm, plays a major role in determining the classification accuracy of a dataset.
We further show that noise is a dominating factor in determining the complex-
ity of a dataset and it is inversely proportional to the classification accuracy of
all evaluated algorithms. Towards the end, we provide researchers with a meta-
classification model that can be used to determine the classification potential of a
dataset on the basis of its complexity measures.

Keywords: Classification, Evolutionary Rule Learning Algorithms, Biomedical
Datasets, Performance Measures.

1 Introduction
Recent advancements in the field of bioinformatics and computational biology are in-
creasing the complexity of underlying biomedical datasets. The use of sophisticated
equipment like mass spectrometers and magnetic resonance imaging (MRI) scanners
generates large amounts of data that pose a number of issues regarding electronic
storage and efficient processing. One of the major challenges in this context is to
automatically extract accurate, comprehensible, and hidden knowledge from large
amounts of raw data. The discovered knowledge can then help medical experts in the
classification of
anomalies for these datasets.
Well-known data mining techniques for knowledge extraction and classification in-
clude probabilistic methods, neural networks, support vector machines, decision trees,

J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 127144, 2010.
c Springer-Verlag Berlin Heidelberg 2010
128 A.K. Tanwani and M. Farooq

instance based learners, rough sets and evolutionary algorithms. The evolutionary algo-
rithms inspired from the evolution process in the biological species show a number
of desirable properties like self-adaptation, robustness, collective learning etc., which
make them suitable for challenging real world problems. The Evolutionary Computa-
tion (EC) paradigm has been successfully used in several data mining techniques in-
cluding but not limited to genetic based machine learning systems (GBML), learning
classifier systems (LCS), ant colony inspired classifiers, and hybrid variants of evolu-
tionary fuzzy systems and neural networks. The evolutionary classifiers are becoming
popular for data mining of medical datasets because of their ability to find hidden pat-
terns in electronic records that are not otherwise obvious even to physicians [1].
However, it is not obvious to a researcher working on the classification of biomed-
ical datasets to choose a suitable classifier. Consequently, the common methodology
adopted by researchers is to empirically evaluate their dataset with a few well-known
machine learning techniques and select the one that gives better results. As a result,
no attempt is made to systematically investigate the factors that define the accuracy of
a classifier. An important contribution of this paper is to show that the accuracy of
a classifier depends on the complexity of a dataset. We define the complexity of a
dataset in terms
of missing values, imbalance ratio, noise and information gain. Moreover, we eval-
uate the performance of six well-known evolutionary rule learning classifiers XCS,
UCS, GAssist, cAnt-Miner, SLAVE and Ishibuchi on 31 publicly available biomedical
datasets. The results of our experiments provide two valuable insights: (1) classification
accuracy strongly depends on the complexity of a biomedical dataset, and (2) the
noise of a dataset predominantly defines its complexity. To conclude, we propose that
researchers
should first evaluate the complexity of their medical dataset and then use our proposed
meta-model to determine its classification potential.
The remaining paper is organized as follows: we review related work in Section 2
and introduce the evolutionary algorithms used in our study in Section 3. In Section 4,
we quantify the complexity of the biomedical datasets. We report the results of our
experiments, followed by statistical analysis and discussion, in Section 5. Finally, we
conclude the paper with an outlook on our future work.

2 Related Work

We now present a brief overview of different studies that analyze the performance of
evolutionary algorithms on various biomedical domains. In [2], Wong et al. applied
evolutionary algorithms to discover knowledge in the form of rules and causal structures
from fracture and scoliosis databases. Their results suggest that evolutionary algorithms
are useful in finding interesting patterns. John Holmes in [3] presented his stimulus
response learning classifier system, EpiCS, to enhance classification accuracy in an
imbalanced class dataset. He, however, used artificially created liver cancer dataset.
Bernadó-Mansilla in [4] characterized the complexity of the classification problem by a
set of geometrical descriptors and analyzed the competence of XCS in this domain. The
authors in [5] compared XCS with Bayesian network, SMO and C4.5 for mining breast
cancer data and showed that XCS provides significantly higher accuracy, followed by
C4.5. However, its rules are considered more comprehensible and descriptive by the
Classification Potential vs. Classification Accuracy 129

domain experts. The work in [6] evaluates two competitive learning classifier systems,
XCS and UCS, for extracting knowledge from imbalanced data using both fabricated
and real world problems. The results of their study prove the robustness of these algo-
rithms compared with IBk, C4.5 and SMO. In [7], the authors compared the Pittsburgh
and Michigan style classifier using XCS and GAssist on 13 publicly available datasets
to reveal important differences between the two systems. The comparative study per-
formed in [8] between evolutionary algorithms (XCS and Gale) and non-evolutionary
algorithms (instance based, decision trees, rule-learning, statistical models and support
vector machines) on several datasets suggests evolutionary algorithms as more suitable
for data mining and classification. The results of the experiments carried in [9] show
better classification accuracy for well-known ant colony inspired, Ant-Miner, compared
with C4.5 on 4 biomedical datasets. The authors in [10] have analyzed several strate-
gies of evolutionary fuzzy models for data mining and knowledge discovery. In our
earlier work [11], we provide several guidelines to select a suitable machine learning
scheme for classification of biomedical datasets, however, the work is limited to non-
evolutionary algorithms.
A common theme observed in various studies is that they are inclined towards par-
ticular classifier(s) instead of the biomedical dataset(s). In contrast, our study uses a
novel methodology to quantify the complexity of a dataset, which we show, defines the
accuracy of a classifier. Moreover, we also build a meta-model of our findings that can
be used to determine the classification potential of a biomedical dataset.

3 Evolutionary Algorithms
We have selected a diverse set of well-known evolutionary rule learning algorithms
for our empirical study. The selected algorithms are: (1) reinforcement learning based
Michigan style XCS [12], (2) supervised learning based Michigan style UCS [13], (3)
Pittsburgh style GAssist [14], (4) Ant Colony Optimization (ACO) inspired cAnt-Miner
[15], (5) genetic fuzzy iterative learner SLAVE [16], and (6) genetic fuzzy classifier
Ishibuchi [17]. In all our experiments, the parameters are selected to achieve the best
operating point on the ROC (Receiver Operating Characteristic) curve [18].

3.1 XCS
XCS is a reinforcement learning based Michigan-style classifier that evolves a set of
rules as a population of classifiers (P ). Each rule consists of a condition, an action and
three performance parameters: (1) payoff prediction (p), (2) prediction error (ε), and
(3) fitness (F ). The first step in classification is to build a match set (M ) that consists
of rules whose conditions are satisfied. The payoff prediction of each rule is computed
and its corresponding action set (A) is created. The online learning is made possible
with a reward (r), returned by the environment, that is subsequently used to tune the
performance parameters of the rules in the action set. The updated fitness is inversely
proportional to the prediction error. Finally, a genetic algorithm (GA), with crossover
and mutation probabilities χ and μ respectively, is applied to the rules in the action set
and
consequently new rules are added to the population. Some rules are also deleted from
the population depending on their experience.

The parameter configuration of XCS used in our experiments is as follows: population
size N = 6400, learning rate β = 0.2, θ_sub = θ_del = 50, tournament size τ = 0.4,
χ = 0.8, μ = 0.04, and the number of explorations is kept at 100,000.
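As a concrete illustration of the payoff-prediction step described above, the following sketch computes the fitness-weighted prediction array from a match set. The dictionary-based rule representation is our own simplification for exposition, not the XCS implementation used in the experiments.

```python
def prediction_array(match_set):
    """Fitness-weighted payoff prediction per action, XCS-style:
    P(a) = sum(p_i * F_i) / sum(F_i) over matching rules advocating a."""
    sums, fits = {}, {}
    for rule in match_set:  # each rule: {'action': ..., 'p': ..., 'F': ...}
        a = rule['action']
        sums[a] = sums.get(a, 0.0) + rule['p'] * rule['F']
        fits[a] = fits.get(a, 0.0) + rule['F']
    # normalize each action's weighted payoff by the total fitness behind it
    return {a: sums[a] / fits[a] for a in sums}
```

The action with the highest entry in this array would be selected during exploitation; during exploration an action is chosen randomly.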

3.2 UCS
UCS is an accuracy based Michigan-style classifier which is in principle quite similar
to XCS. However, it uses a supervised learning scheme to compute fitness instead of
reinforcement learning employed by XCS. UCS like XCS also evolves a population
of rules (P ). Each rule has two parameters: (1) accuracy (acc), and (2) fitness (F ).
During the training phase, for every instance a set of rules whose conditions are satisfied
become part of its match set (M ). The rules that perform correct classification become
part of the correct set (C), and the others become part of the incorrect set (!C). Finally,
the genetic algorithm GA is applied to the correct set to update its population. Every
instance during testing is classified through weighted voting, on the basis of fitness, to
select the action.
We have used the following parameter settings: N = 6400, number of iterations =
100,000 and acc0 = 0.99. The other tuning parameters of the GA are kept the same
as in XCS.
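The fitness-weighted voting that UCS uses at test time can be sketched as below; the pair-based rule encoding is a hypothetical simplification, not the KEEL implementation.

```python
def ucs_vote(match_set):
    """Pick the class with the largest total fitness among the matching
    rules (UCS-style weighted voting during the test phase)."""
    votes = {}
    for predicted_class, fitness in match_set:  # (class, fitness) pairs
        votes[predicted_class] = votes.get(predicted_class, 0.0) + fitness
    return max(votes, key=votes.get)
```

This makes the supervised difference from XCS concrete: instead of a payoff prediction, each rule simply backs the class it was rewarded for predicting correctly.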

3.3 GAssist
GAssist (Genetic Algorithms based claSSIfier sySTem), in contrast to XCS and UCS,
is a Pittsburgh-style learning classifier in which the rules are assembled in the form
of a decision list. GAssist-ADI uses Adaptive Discretization Intervals (ADI) rule rep-
resentation. In such systems, the continuous space is discretized into fixed intervals
for developing rules. Generalization is introduced by deleting and selecting rule sets
as a function of their accuracy and length. The crossover between two rules takes
place across attribute boundaries rather than attribute intervals.
The GAssist parameter setting is as follows: crossover probability = 0.6, number of
iterations = 500, minimum number of rules for rule deletion = 12, and a set of uniform
discretizers with 4, 5, 6, 7, 8, 10, 15, 20 and 25 bins.

3.4 cAnt-Miner
Ant Miner, inspired by behavior of real ant colonies, uses Ant Colony Optimization
(ACO) to construct classification rules from the training data. The Rule Discovery pro-
cess consists of 3 steps i.e. rule generation, rule pruning and rule updating. In the rule
generation step, an ant starts with an empty rule list and adds one term at a time based
on the probability of that attribute-value pair. It continues to add terms to the rule with-
out duplication until all the attributes are exhausted or the new terms make the rule more
specific, as defined by a user-specified threshold. In the rule pruning step, the terms
that degrade the accuracy of the rule are removed from it one by one. While updating
rules, the pheromone values of terms are increased or decreased on the basis of
their usage in the rule discovery process. cAnt-Miner is a variant of Ant-Miner for
real-valued attributes.
The parameters of cAnt-Miner are: the number of ants = 3000, minimum cases per
rule = 5, maximum number of uncovered cases = 10 and convergence test size = 10.
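The probabilistic term selection in the rule generation step can be sketched as a roulette-wheel draw over pheromone-weighted terms. The `pheromone` and `heuristic` tables below are invented placeholders for illustration, not cAnt-Miner's actual data structures.

```python
import random

def choose_term(terms, pheromone, heuristic, rng):
    """Pick the next attribute-value term with probability proportional
    to pheromone[t] * heuristic[t], as in ACO-style rule construction."""
    weights = [pheromone[t] * heuristic[t] for t in terms]
    threshold = rng.random() * sum(weights)
    acc = 0.0
    for term, w in zip(terms, weights):
        acc += w
        if acc >= threshold:
            return term
    return terms[-1]  # guard against floating-point round-off
```

Over many draws, terms with higher pheromone-times-heuristic products are selected proportionally more often, which is what lets the colony converge on useful attribute-value pairs.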

3.5 SLAVE
SLAVE (Structural Learning Algorithm in Vague Environment) is totally different from
the classical Michigan-style and Pittsburgh-style rule learning algorithms. In this ap-
proach, every entity in the population represents a unique rule. But during an iteration
of a genetic algorithm, only the best individual is added to the final set of rules which
is eventually used for classification. In this way, SLAVE combines its iterative learning
approach with the fuzzy models. The fitness of the rules is determined by their com-
pleteness and consistency.
In our experiments, the parameter configuration of SLAVE is: the number of labels
= 5, population size = 100, number of iterations allowed without change = 500 and
mutation probability = 0.01.

3.6 Ishibuchi
Ishibuchi et al. proposed a fuzzy rule learning method for multidimensional pattern
classification problem with continuous attributes. The classification is done with the
help of a fuzzy-rule base in which each fuzzy if-then rule is handled as an individual,
and a fitness value is assigned to each rule. The criterion for assigning a class label
is based on a simple heuristic procedure which assigns a grade of certainty to each
fuzzy if-then rule. Because it uses linguistic values with fixed membership functions
as antecedent fuzzy sets, a linguistic interpretation of each fuzzy if-then rule is easily
obtained which greatly helps in comprehending the generated solution.
The experiments are carried out with the following parameters: the number of labels =
5, population size = 100, number of evaluations = 10,000, along with crossover and
mutation probabilities of 1.0 and 0.9, respectively.

4 Nature of Biomedical Datasets


Biomedical datasets provide a whole spectrum of difficulties (high dimensionality,
multiple classes, imbalanced classes, missing values and noisy data) that affect the
classification accuracy of algorithms. The inconsistencies and inherent complexities
in biomedical datasets obtained from different sources justify the need to separately
investigate the impact of the nature of biomedical dataset in classification. To this end,
we have selected 31 diverse biomedical datasets publicly available from UCI machine
learning repository [19]. We now introduce four parameters that we use to quantify the
complexity of a biomedical dataset: (1) missing values, (2) imbalance ratio, (3) noise,
and (4) information gain.

4.1 Missing Values


A major focus of the machine learning community has been to analyze the effect of
missing data on the accuracy of a classifier. The missing data is generally classified
into three types: (1) missing completely at random (MCAR), (2) missing at random
(MAR), and (3) not missing at random (NMAR). The datasets obtained from clinical
databases contain several missing fields which can belong to all three categories of
missing values. In Table 1 we see that VA-Heart dataset contains up to 27% of missing
values in its attributes.

4.2 Imbalance Ratio


Orriols-Puig and Bernadó-Mansilla compute class imbalance as the ratio between the
number of majority class instances and the number of minority class instances [6].
But, this is only suitable for two-class problems as it does not include proportion of
other class instances for a multi-class dataset. For example, Thyroid0387 has a total of
32 classes with 6771 majority class instances and only 1 minority class instance. The
imbalance ratio, using the above method, is 6771 which definitely does not represent the
true picture because the distribution of instances of other classes is relatively uniform.
Therefore, we use the following definition of the imbalance ratio I_r to cater for the
proportions of all class distributions:

    I_r = \frac{N_c - 1}{N_c} \sum_{i=1}^{N_c} \frac{I_i}{I_n - I_i}          (1)
where I_r is in the range 1 \le I_r < \infty, and I_r = 1 corresponds to a completely
balanced dataset having equal instances of all classes. N_c is the number of classes,
I_i is the number of instances of class i and I_n is the total number of instances.
Hyperthyroid is the most imbalanced dataset in our repository, with an imbalance
ratio of 28.81.
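Equation (1) is easy to check numerically. The helper below is our own, written directly from the definition, and returns exactly 1 for a perfectly balanced dataset:

```python
def imbalance_ratio(class_counts):
    """I_r = ((N_c - 1) / N_c) * sum_i I_i / (I_n - I_i),
    where class_counts[i] is the number of instances of class i."""
    nc = len(class_counts)
    total = sum(class_counts)
    return (nc - 1) / nc * sum(c / (total - c) for c in class_counts)
```

For N_c equally sized classes each term is 1/(N_c - 1), so the sum is N_c/(N_c - 1) and the prefactor cancels it, giving I_r = 1; skewed class counts drive the ratio upward.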

4.3 Noise
Noise is of two types: (1) attribute noise, and (2) class noise. Research has shown that
the impact of class noise on classification accuracy is significantly more as compared
to the attribute noise [20] and hence, we only quantify class noise in our study. The
common sources of class noise are inconsistent and mislabeled instances. A number
of research efforts have been made to quantify the level of noise in a dataset, but its
definition still remains subjective. Brodley and Friedl characterized noise as the pro-
portion of incorrectly classified instances by a set of trained classifiers [21]. We use a
similar approach to quantify noise but utilize confusion matrices for a set of classifiers
to determine noisy instances. Noise is then quantified as the sum of all off-diagonal
entities (incorrectly classified instances) where each entity is the minimum of all the
corresponding elements in a set of confusion matrices. The defined criterion is based
upon two assumptions: (1) an inconsistent or misclassified instance is likely to confuse
every classifier, and (2) the bias of an algorithm towards particular class instances can
be factored out by using a set of classifiers. The advantage of our approach is that we
separately identify misclassified instances of every class and only categorize those as
noisy which are misclassified by all the classifiers.
The confusion matrix of the n-th classifier in a set of n classifiers can in general be
represented as:

    C_n = \begin{pmatrix}
            i_{11}^n & i_{12}^n & \cdots & i_{1j}^n \\
            i_{21}^n & i_{22}^n & \cdots & i_{2j}^n \\
            \vdots   & \vdots   & \ddots & \vdots   \\
            i_{i1}^n & i_{i2}^n & \cdots & i_{ij}^n
          \end{pmatrix}

where the diagonal elements of C_n represent the correctly classified instances and
the off-diagonal elements are the incorrectly classified instances. The percentage of
class noise in a dataset of I_n instances can be computed as below:

    Noise = \frac{1}{I_n} \sum_{i=1}^{N_c} \sum_{j=1}^{N_c}
            \min\big(C_1(i,j), C_2(i,j), \ldots, C_n(i,j)\big) \times 100      (2)

where i \neq j and \min(C_1(i,j), C_2(i,j), \ldots, C_n(i,j)) is the entity for
corresponding i and j that represents the minimum number of class instances
misclassified by all the classifiers. We have used five well-known and diverse
machine learning algorithms as the set of
classifiers in our study: Naive Bayes (probabilistic), SMO (support vector machines),
J48 (decision trees), Ripper (inductive rule learner) and IBk (instance based learner). We
use the standard implementations of these schemes in the Waikato Environment for
Knowledge Analysis (WEKA) [22]. It is evident from Table 1 that biomedical datasets
are generally associated with high levels of noise.
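The element-wise-minimum criterion of Eq. (2) can be sketched directly. The nested-list confusion matrices below are illustrative inputs of our own devising, not the WEKA output format.

```python
def class_noise_percent(confusion_matrices, n_instances):
    """Percentage of instances misclassified by *all* classifiers:
    sum of the element-wise minimum over the off-diagonal cells."""
    nc = len(confusion_matrices[0])
    noisy = sum(min(cm[i][j] for cm in confusion_matrices)
                for i in range(nc) for j in range(nc) if i != j)
    return noisy / n_instances * 100
```

Taking the minimum cell-by-cell is what factors out individual classifier bias: an instance only counts as noise if every classifier in the ensemble gets it wrong in the same way.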

4.4 Information Gain

Information gain is an information-theoretic measure that evaluates the quality of
attributes in a dataset [22]. It measures the reduction in uncertainty if the values of an
attribute are known. For a given attribute X and a class attribute Y , the uncertainty is
given by their respective entropies H(X) and H(Y ). Then the information gain of X
with respect to Y is given by I(Y ; X), where

I(Y ; X) = H(Y ) H(Y |X) (3)

The average and total information gain of a biomedical dataset shown in Table 1 gives
a direct measure of the quality of its attributes for classification.
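Eq. (3) can be computed directly from label counts. The sketch below (function names ours) handles a nominal attribute; a continuous attribute would first be discretized, as WEKA does.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H, in bits, of a discrete label sequence."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(y, x):
    """I(Y; X) = H(Y) - H(Y|X) for a nominal attribute x and class labels y."""
    n = len(y)
    h_cond = 0.0
    for value in set(x):  # H(Y|X): entropy of y within each value of x, weighted
        subset = [yi for yi, xi in zip(y, x) if xi == value]
        h_cond += len(subset) / n * entropy(subset)
    return entropy(y) - h_cond

# An attribute that perfectly predicts a balanced 2-class dataset has gain H(Y) = 1 bit:
y = ['a', 'a', 'b', 'b']
x = [0, 0, 1, 1]
print(information_gain(y, x))  # 1.0
```

The "average information gain" column of Table 1 is then just this quantity averaged over all attributes of a dataset.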

5 Results and Discussions

We now present the results of the experiments we have conducted to analyze the nature
of 31 biomedical datasets with six evolutionary algorithms. We have used the standard
ACO framework MYRA [23] for cAnt-Miner, and Knowledge Extraction based on
Evolutionary Learning (KEEL) [24] for the other evolutionary classifiers, to remove any
implementation bias from our study. We evaluate the classification accuracy of the evolu-
tionary algorithms using standard ten-fold stratified cross-validation in order to ensure
a systematic and unbiased analysis. The results summarized in Table 1 show the nature
of each dataset in terms of its quantified parameters, along with the resulting classification
accuracies of all the algorithms. We first interpret the obtained results
using statistical procedures to analyze the effect of the evolutionary learning paradigm, and
then discuss in detail the role of the nature of a biomedical dataset on classification accuracy.
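The stratified split behind ten-fold cross-validation can be sketched as follows. This is a simplified, deterministic round-robin scheme for illustration only; the actual experiments rely on the KEEL and WEKA implementations.

```python
from collections import defaultdict

def stratified_folds(labels, k=10):
    """Split instance indices into k folds that preserve class proportions.
    Returns a list of k index lists (the held-out test fold of each round)."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):  # deal each class out round-robin
            folds[pos % k].append(idx)
    return folds

# A 70/30 two-class dataset yields ten folds of 7 positives and 3 negatives each:
labels = ['pos'] * 70 + ['neg'] * 30
folds = stratified_folds(labels, k=10)
print([len(f) for f in folds])  # [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
```

Stratification matters here precisely because several of the biomedical datasets are imbalanced; a plain random split could leave a minority class absent from some test folds.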

5.1 Statistical Analysis of Results

In this section, we provide a statistical analysis of the results in Table 1 to
systematically quantify the performance of the evolutionary algorithms. The common ap-
proach used by many researchers in such cases is to make pairwise comparisons between
all the classifiers using statistical tests such as the paired t-test or the Wilcoxon

Table 1. The table shows: (1) a summary of the datasets used, in alphabetical order: number of instances, classes, attributes (continuous, binary, nominal),
percentage of missing values in the attributes, noise, imbalance ratio, average information gain (Avg Info Gain) and total information gain (Net Info Gain);
(2) classification accuracies of the evolutionary rule-learning algorithms; the bold entry in every row represents the best accuracy.
Nature of Dataset | Evolutionary Rule-Learning Classifiers
Columns (in row order): Dataset, Instances, Classes, Attributes (Con, Bin, Nom), Missing Values, Noise, Imb Ratio, Avg Info Gain, Net Info Gain,
XCS, UCS, GAssist, cAnt-Miner, SLAVE, Ishibuchi, Mean, Std
Ann-Thyroid 7200 3 6 15 0 0 0.11 8.37 0.037 0.78 97.08 96.99 94.67 99.15 93.29 92.61 95.63 2.52
Breast Cancer 699 2 1 0 9 0.23 2.72 1.21 0.451 4.51 96.14 96.57 94.56 93.56 94.70 94.71 95.04 1.11
Breast Cancer Diagnostic 569 2 31 0 0 0 2.11 1.14 0.303 9.39 93.67 92.44 95.43 93.15 91.56 92.09 93.06 1.38
Breast Cancer Prognostic 198 2 33 0 0 0.06 13.64 1.76 0.004 0.15 65.76 72.82 70.29 73.82 74.29 76.29 72.21 3.72
Cardiac Arrhythmia 452 16 272 7 0 0.32 11.28 1.57 0.047 13.06 - 61.31 54.86 67.92 65.49 - 62.39 5.72
Cleveland-Heart 303 5 10 3 0 0.15 17.82 1.37 0.115 1.49 58.09 52.15 57.41 57.74 48.85 54.44 54.78 3.71
Contraceptive Method 1473 3 2 3 4 0 31.98 1.04 0.041 0.36 53.43 47.32 55.54 50.92 25.46 43.58 46.04 10.95
Dermatology 366 6 1 1 32 0.06 0.82 1.05 0.442 15.02 94.84 96.99 92.64 91.00 3.83 30.60 68.32 40.53

Echocardiogram 132 2 8 2 2 4.67 6.06 1.24 0.084 1.01 88.63 84.78 96.21 83.19 92.47 93.24 89.75 5.10
E-Coli 336 8 7 0 1 0 6.55 1.25 0.678 5.42 90.51 93.73 74.74 79.17 82.72 67.89 81.46 9.68
Habermans Survival 306 3 3 0 0 0 16.67 1.57 0.023 0.07 74.23 74.20 69.96 71.53 73.18 73.20 72.72 1.67
Hepatitis 155 2 6 0 13 5.67 10.97 2.05 0.058 1.10 81.33 81.29 91.50 80.00 81.96 80.04 82.69 4.39
Horse Colic 368 2 8 4 15 19.39 11.96 1.15 0.061 1.64 84.23 81.47 93.73 83.97 67.33 63.05 78.96 11.54
Hungarian Heart 294 5 10 3 0 20.46 13.61 1.74 0.079 1.02 65.98 62.26 75.14 62.95 64.60 63.95 65.81 4.75
Hyper Thyroid 3772 5 7 21 1 2.17 0.34 28.81 0.012 0.36 97.35 97.88 98.57 98.12 97.43 97.30 97.77 0.51
Hypo-Thyroid 3163 2 7 18 0 6.74 0.54 9.99 0.024 0.60 97.16 97.85 99.43 98.96 95.51 95.23 97.36 1.74
Liver Disorders 345 2 6 0 0 0 21.88 1.05 0.011 0.06 63.26 67.27 61.18 65.48 58.54 58.27 62.33 3.67
Lung Cancer 32 3 0 0 56 0.28 9.86 1.02 0.152 8.50 30.83 44.99 41.67 45.83 - - 40.83 6.90
Lymph Nodes 148 4 3 9 6 0 10.81 1.46 0.138 2.48 79.19 81.09 78.57 77.81 69.57 72.90 76.52 4.36
Mammographic Masses 961 2 1 0 4 3.37 14.15 1.01 0.193 0.97 80.75 82.21 83.25 81.06 65.56 66.60 76.57 8.18
New Thyroid 215 3 5 0 0 0 2.79 1.78 0.602 3.01 94.93 92.60 92.19 90.24 91.23 86.15 91.22 2.94
Pima Indians Diabetes 768 2 8 0 0 0 20.18 1.20 0.064 0.52 73.71 74.76 72.15 75.00 72.67 68.62 72.81 2.34
Post Operative Patient 90 3 0 0 8 0.44 30.00 1.90 0.016 0.13 70.00 63.33 61.11 60.00 70.00 71.11 65.93 5.00
Promoters Genes Sequence 106 2 0 0 58 0 4.72 1.00 0.078 4.51 2.82 76.27 62.91 75.45 27.27 - 48.94 32.56
Protein Data 21618 3 0 0 1 0 45.48 1.19 0.065 0.07 51.41 51.21 54.52 54.46 54.52 54.52 53.44 1.65
Sick 2800 2 7 21 1 2.24 0.71 7.72 0.013 0.37 93.89 97.57 97.32 97.18 93.86 93.89 95.62 1.91
Splice-Junction Gene Sequence 3190 3 0 0 61 0 4.6 1.15 0.022 3.94 5.60 57.30 92.45 83.80 52.55 - 58.34 34.01
Statlog Heart 270 2 7 3 3 0 15.19 1.03 0.092 1.19 80.74 83.33 81.11 75.19 72.22 73.33 77.65 4.65
Switzerland Heart 123 5 10 3 0 17.07 32.52 1.14 0.023 0.30 31.67 31.67 65.83 30.19 31.79 42.37 38.92 13.91
Thyroid0387 9172 32 7 21 1 5.50 1.35 2.99 0.091 2.64 74.13 81.92 79.83 85.47 76.46 74.02 78.64 4.59
VA-Heart 200 5 10 3 0 26.85 27.00 1.04 0.023 0.30 32.00 28.99 58.50 29.00 20.00 33.50 33.66 13.04
Mean (dataset parameters): Instances 1930, Classes 4.5, Con 15, Bin 4, Nom 9, Missing 3.73, Noise 12.49, Imb Ratio 2.97, Avg Info Gain 0.13, Net Info Gain 2.74
Mean accuracy (rank): XCS 70.11 (5), UCS 74.34 (3), GAssist 77.33 (1), cAnt-Miner 74.56 (2), SLAVE 66.96 (6), Ishibuchi 70.87 (4)
Std. dev. of accuracy: XCS 26.71, UCS 19.95, GAssist 16.69, cAnt-Miner 18.76, SLAVE 24.92, Ishibuchi 19.21
Average ranks: XCS 3.29 (3), UCS 3.02 (2), GAssist 2.71 (1), cAnt-Miner 3.35 (4), SLAVE 4.35 (6), Ishibuchi 4.27 (5)

signed rank test, and to report significant differences between the pairs [6][8]. Demsar
has criticized the misuse of these approaches for multiple-classifier comparisons be-
cause: (1) neither test reasons about comparing the means of more than two random
variables, and (2) doing so always rejects a certain proportion of the null hypotheses
purely by chance [25]. In this paper, we instead use more specialized methods for comparing
the average ranks of the evolutionary classifiers (see Table 1), as suggested by Demsar [25]
and Garcia [26].
Global Comparison of Evolutionary Classifiers. We use the two most widely used non-
parametric tests for comparing multiple hypotheses among the classifiers: (1) the Fried-
man test [27], and (2) the Iman and Davenport test [28]. These tests use the χ² and F
distributions, respectively, to check whether the distributions of observed and expected
frequencies differ from each other.
The Friedman and Iman-Davenport tests perform a global analysis to check whether
the measured average ranks of all the classifiers differ significantly from the
mean rank (3.5 in our case). The corresponding statistics χ²_F and F_F, calculated as
prescribed by Friedman and by Iman and Davenport, are:
χ²_F = 19.94,  F_F = 4.44
The critical values χ²_C and F_C, obtained from the χ² and F distribution tables at
α = 0.05 with 5 and (5, 150) degrees of freedom respectively, are:
χ²_C(5) = 11.07,  F_C(5, 150) = 2.27
Since the critical values are lower than the test statistics, the null hypothesis can be
rejected and post-hoc tests can be applied to detect significant differences between
classifiers.
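Plugging the average ranks from Table 1 into the standard Friedman and Iman-Davenport formulas reproduces these statistics; because the published ranks are rounded to two decimals, the computed values come out slightly below the reported 19.94 and 4.44.

```python
def friedman_stats(avg_ranks, n_datasets):
    """Friedman chi-square statistic and its Iman-Davenport F correction,
    computed from k classifiers' average ranks over N datasets:
      chi2_F = 12N/(k(k+1)) * (sum R_j^2 - k(k+1)^2/4)
      F_F    = (N-1) chi2_F / (N(k-1) - chi2_F)
    """
    k, n = len(avg_ranks), n_datasets
    chi2 = 12.0 * n / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4.0)
    f_stat = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return chi2, f_stat

# Average ranks of the six classifiers from Table 1, N = 31 datasets:
ranks = [3.29, 3.02, 2.71, 3.35, 4.35, 4.27]
chi2, f_stat = friedman_stats(ranks, n_datasets=31)
print(f"chi2_F = {chi2:.2f}, F_F = {f_stat:.2f}")  # chi2_F = 19.19, F_F = 4.24
```

Both values still exceed the critical values 11.07 and 2.27, so the same global conclusion follows even from the rounded ranks.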
Comparison with the Control Classifier GAssist. The results in Table 1 show that
GAssist provides the best overall classification accuracy (77.33) with the lowest
standard deviation (16.69). Moreover, it outperforms the other classifiers on 13
biomedical datasets. To compare the performance of GAssist with the other evolutionary
algorithms, we now set up multiple hypotheses in which every other evolutionary
classifier is statistically compared with GAssist.
We use two post-hoc tests to determine the statistical significance of the results: (1) the
Bonferroni-Dunn test [29], and (2) the Holm test [30]. These post-hoc tests differ in how
they adjust the significance threshold across the multiple hypotheses. The Bonferroni-Dunn
test controls the family-wise error rate in a single step by dividing α by the number of
comparisons (k − 1). The Holm test is a step-down procedure in which the hypotheses are
tested on the p-values arranged in ascending order. Starting from the lowest p-value,
hypotheses are rejected in turn while p_i ≤ α/(k − i); at the first p-value exceeding its
threshold, that hypothesis and all remaining ones are retained. The Holm test is more
powerful, as it makes no additional assumptions about the hypotheses and, in general,
rejects more hypotheses than the Bonferroni-Dunn test. The probability corresponding to the
test statistic is obtained from the normal distribution table via the z-value comparing the
i-th and j-th classifiers. If the probability is less than the appropriate significance level,
the null hypothesis is rejected. The results of the comparison with the control classifier
GAssist are shown in Table 2.
136 A.K. Tanwani and M. Farooq

Table 2. Test statistics for comparison with the control classifier GAssist (α = 0.05, k = 6,
N = 31, R_j = 2.71). The null hypothesis is rejected for bold entries in the p column.

i | Algorithm   | z = (R_i − R_j)/√(k(k+1)/6N) | p       | Bonferroni-Dunn α/(k−1) | Holm α/(k−i)
1 | SLAVE       | 3.462                        | 5.36E-4 | 0.01                    | 0.01
2 | Ishibuchi   | 3.292                        | 9.93E-4 | 0.01                    | 0.0125
3 | cAnt-Miner  | 1.358                        | 0.174   | 0.01                    | 0.017
4 | XCS         | 1.222                        | 0.222   | 0.01                    | 0.025
5 | UCS         | 0.645                        | 0.519   | 0.01                    | 0.05

Table 3. Test statistics for pairwise comparisons (α = 0.05, k = 6, N = 31). The null hypothesis is
rejected for bold entries in the p column.

i  | Algorithms              | z = (R_i − R_j)/√(k(k+1)/6N) | p       | Nemenyi 2α/(k(k−1)) | Holm α/(m−i+1), m = k(k−1)/2
1  | GAssist vs SLAVE        | 3.462                        | 5.36E-4 | 0.003               | 0.003
2  | GAssist vs Ishibuchi    | 3.292                        | 9.93E-4 | 0.003               | 0.004
3  | UCS vs SLAVE            | 2.817                        | 0.004   | 0.003               | 0.004
4  | UCS vs Ishibuchi        | 2.647                        | 0.008   | 0.003               | 0.004
5  | XCS vs SLAVE            | 2.240                        | 0.025   | 0.003               | 0.004
6  | cAnt-Miner vs SLAVE     | 2.104                        | 0.035   | 0.003               | 0.005
7  | XCS vs Ishibuchi        | 2.070                        | 0.038   | 0.003               | 0.0055
8  | cAnt-Miner vs Ishibuchi | 1.935                        | 0.053   | 0.003               | 0.006
9  | GAssist vs cAnt-Miner   | 1.358                        | 0.174   | 0.003               | 0.007
10 | XCS vs GAssist          | 1.222                        | 0.222   | 0.003               | 0.008
11 | UCS vs cAnt-Miner       | 0.713                        | 0.476   | 0.003               | 0.01
12 | UCS vs GAssist          | 0.645                        | 0.519   | 0.003               | 0.0125
13 | XCS vs UCS              | 0.577                        | 0.564   | 0.003               | 0.017
14 | SLAVE vs Ishibuchi      | 0.170                        | 0.865   | 0.003               | 0.025
15 | XCS vs cAnt-Miner       | 0.136                        | 0.892   | 0.003               | 0.05

The last column gives the critical values of the tests used. If the p-value is less than
or equal to this critical value, the null hypothesis is rejected for the corresponding test.
It can be seen that the results of GAssist are statistically significant compared with SLAVE
and Ishibuchi, and hence the null hypothesis can be rejected for those pairs, while little
can be concluded about the other algorithms from the given results.

Pairwise Comparisons. Since GAssist cannot be declared the best classifier against all
the other classifiers on the basis of the previous section, we now make pairwise comparisons
to analyze the statistical differences between all the classifiers. Along with the Holm test,
we use the pairwise counterpart of the Bonferroni-Dunn test, the Nemenyi test [31],
for comparing all classifiers with each other. The Nemenyi test is more conservative than
the Bonferroni-Dunn test, as it divides the significance level by the number of pairwise
comparisons (k(k − 1)/2 instead of (k − 1)). The results in Table 3 show that the Ne-
menyi test rejects the hypotheses of GAssist against SLAVE and Ishibuchi, while the
Holm method additionally rejects the hypothesis for UCS vs SLAVE.
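The Holm step-down procedure described above can be sketched directly from the p-values of the control-classifier comparison in Table 2 (function name ours):

```python
def holm_rejections(p_values, alpha=0.05):
    """Holm's step-down procedure against a control classifier: sort the
    hypotheses by p-value and reject the i-th smallest while p_i <= alpha/(k-i),
    where k-1 is the number of comparisons. Stops at the first retention
    and returns the names of the rejected hypotheses."""
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    k = len(ordered) + 1  # k-1 classifiers are compared against the control
    rejected = []
    for i, (name, p) in enumerate(ordered, start=1):
        if p <= alpha / (k - i):
            rejected.append(name)
        else:
            break  # step-down: retain this and every remaining hypothesis
    return rejected

# p-values from Table 2 (control classifier = GAssist):
p = {"SLAVE": 5.36e-4, "Ishibuchi": 9.93e-4,
     "cAnt-Miner": 0.174, "XCS": 0.222, "UCS": 0.519}
print(holm_rejections(p))  # ['SLAVE', 'Ishibuchi']
```

This reproduces the bold entries of Table 2: only the SLAVE and Ishibuchi hypotheses fall under their step-down thresholds (0.01 and 0.0125).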

5.2 Effect of Evolutionary Algorithm

The use of statistical analysis provides deeper insight into the obtained results than sim-
ply averaging the classification accuracies, which is only a raw measure for ranking the
performance of the algorithms. We now discuss the role of the evolutionary learning
paradigm in classifying biomedical datasets, based on the obtained results:

Pittsburgh-Style GAssist. The results of our experiments show that GAssist, a
Pittsburgh-style learning classifier system, performs better than the other evolutionary rule-
learning algorithms. Its greater accuracy results from its fitness function, which
combines the accuracy and complexity of an individual using the Minimum Description
Length (MDL) principle to yield near-optimal rules [14].

Nature-Inspired cAnt-Miner. cAnt-Miner closely follows GAssist's policy of gen-
erating simpler rules. The ants build rules by selecting attribute-value pairs on the
basis of their entropy and pheromone values [32]. Consequently, it uses only high-qual-
ity attributes (we model quality with information gain) in the formulation of its rules.
Moreover, its pruning mechanism yields simpler and shorter rules, thereby achieving
greater classification accuracy.

Michigan-Style UCS and XCS. The Michigan-style learning classifier systems UCS and
XCS use online learning to evolve a set of condition-action rules from each training
instance. Thus, they can be more useful for identifying hidden patterns and generating in-
formation-rich rules, compared with the simple and generic rules of GAssist and cAnt-Miner.
We therefore suggest that when medical experts are available to refine rules, a Michigan-style
classifier can prove useful for knowledge extraction.

Genetic Fuzzy SLAVE and Ishibuchi. The results show that the genetic fuzzy rule-
learning classifiers are not generally suitable for the classification of biomedical datasets.
The fuzzy rules they generate can, however, be particularly useful for evaluating the uncer-
tainty associated with a prognosis.

5.3 Effect of Nature of Dataset


A careful look at the results of Table 1 enables the reader to draw an important con-
clusion: the variance in accuracy of the classifiers on a particular dataset is significantly
smaller than the variance in accuracy of the same classifier across different
datasets. The statement holds for more than 25 datasets, the notable exceptions being
Dermatology, Splice-Junction Gene Sequence, and Promoters Gene Sequence. Conse-
quently, we can say that accuracy strongly depends on the nature of the biomedical
dataset. We now discuss the important factors that determine the net classification potential
of a dataset.

Role of Multiple Classes. It can be inferred from Table 1 that for multi-class problems,
UCS gives significantly better accuracy than the other classifiers. The reason is
that it promotes into the correct set only those highly rewarded classifiers of the match
set that predict the same class as the training example [33]. In comparison,
GAssist has serious problems with multi-class datasets, especially when the
number of output classes is greater than 5. On these datasets, the average accuracy of
UCS is 83.49% compared with 75.52% for GAssist.

Role of Instances. It is obvious from Table 1 that the evolutionary algorithms overfit on
datasets with a small number of instances. Consequently, the accuracy of the classifiers on
the Lung Cancer, Post Operative Patient, Promoters Gene Sequence and Switzerland Heart
datasets degrades severely. We argue that during training, the classifiers create small dis-
juncts with rare cases [34]; as a result, their accuracy degrades significantly during
testing.

Role of Attributes. The attributes of a dataset vary in three aspects: (1) number, (2)
type (continuous, binary and nominal), and (3) quality. We see in Table 1 that the number
and type of attributes play little role in defining the classification potential of a dataset.
The very poor performance of XCS on the Splice-Junction Gene Sequence, Promoters Gene
Sequence and Lung Cancer datasets came as a surprise to us. Our analysis reveals that
the large number of nominal attributes in these datasets (61, 58 and 56, respectively) is the
main cause of the poor performance. Our conclusion is that XCS is unable
to cater for a large number of nominal attributes in a dataset.
Recall that we quantify the quality of attributes with information gain. The graph in
Figure 1 clearly shows that classification accuracy increases with the
information gain of a dataset's attributes.

Fig. 1. Average Information Gain vs Classification Accuracy

Role of Missing Values. Missing or incomplete data degrade the accuracy of
learning algorithms. Therefore, a number of imputation methods, such as Wild-to-Wild, the
mean or mode method, random assignment, the InGr imputation model and listwise deletion,
have been proposed to increase the accuracy of a classifier. Figure 2 reveals that
GAssist is relatively more resilient to missing values than the other algorithms.
GAssist replaces a missing value with the mean of its class for real-valued attributes,
and with the mode for nominal attributes.
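GAssist's replacement strategy can be sketched as follows. For brevity this sketch pools all instances when computing the statistic, whereas GAssist conditions on the instance's class; the function name is ours.

```python
from statistics import mean, mode

def impute_column(values, nominal=False):
    """Replace None entries with the column mean (real-valued attributes)
    or the column mode (nominal attributes). A faithful re-implementation
    of GAssist would compute the statistic per class; this pools everything."""
    observed = [v for v in values if v is not None]
    fill = mode(observed) if nominal else mean(observed)
    return [fill if v is None else v for v in values]

print(impute_column([1.0, None, 3.0]))                     # [1.0, 2.0, 3.0]
print(impute_column(['a', 'a', None, 'b'], nominal=True))  # ['a', 'a', 'a', 'b']
```

Because the fill value is derived from the training data itself, this kind of imputation degrades gracefully as the proportion of missing values grows, which is consistent with the resilience seen in Figure 2.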

Role of Imbalanced Classes. A learning algorithm may develop a bias towards the
majority class during classification. However, Figure 3 shows that the net accuracy of the
evolutionary classifiers remains unaffected even on datasets with high imbalance ratios.

Fig. 2. Missing Values vs Classification Accuracy

Fig. 3. Class Imbalance (Log Scale) vs Classification Accuracy

Role of Noise. The results in Table 1 show that the classification potential of a dataset
is inversely proportional to its level of noise. Consequently, the accuracy on
noisy datasets is very low (see Figure 4). GAssist shows more resilience
to noise because of the added generalization pressure of its MDL-based bloat control.
The MDL principle forces GAssist to reduce the size and length of
its individuals; in short, its simple evolution policy makes it resilient to noise.

5.4 Combined Effect of Nature of Dataset


Our facet-wise study of the dataset parameters shows that noise, information gain and miss-
ing values play a significant role in defining the classification accuracy of an algorithm,
while the imbalance ratio does not dominate the resulting accuracy. We now summarize our
findings in Figure 5 for a better understanding of the combined effect of the com-
plexity parameters.
It is obvious from Figure 5 that the noise in a dataset effectively determines the classifica-
tion accuracy. A high average information gain yields better classification
accuracy, while the percentage of missing values in a dataset has only a minor impact on the
accuracy.

Fig. 4. Noise vs Classification Accuracy

Fig. 5. Relationship between Classification Accuracy and Nature of Dataset: x-axis contains
biomedical datasets in increasing order of their classification accuracies; y-axis contains nor-
malized parameters of datasets, 1-Average Information Gain, 2-Missing Values, 3-Noise, 4-
Classification Accuracy

Meta-Model for Classification Potential of a Dataset. In this section, we apply our
meta-model framework [35] to obtain a measure of the classification potential of a dataset
based on its complexity parameters. We create a meta-dataset comprising three at-
tributes for the complexity parameters: average information gain, missing values, and noise.
We categorize the output class, classification potential, into three classes based on the
classification accuracy: good (greater than 0.8), satisfactory (0.6-0.8) and bad (less than
0.6). The interesting patterns in this meta-dataset are extracted using two clas-
sifiers: (1) GAssist, which gives good classification results, and (2) Boosted J48 [22], to
compare the results with a well-known non-evolutionary algorithm.
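The discretization of accuracy into the three meta-dataset output classes can be written as a small mapping. The handling of the exact boundary values (0.6 and 0.8) is our assumption; the paper only states the intervals.

```python
def classification_potential(accuracy):
    """Map a classification accuracy in [0, 1] onto the three output classes
    of the meta-dataset: good (> 0.8), satisfactory (0.6-0.8), bad (< 0.6).
    Boundary handling is an assumption, not specified in the paper."""
    if accuracy > 0.8:
        return "good"
    if accuracy >= 0.6:
        return "satisfactory"
    return "bad"

print([classification_potential(a) for a in (0.95, 0.72, 0.41)])
# ['good', 'satisfactory', 'bad']
```

Each of the 31 datasets thus contributes one meta-instance of the form (average information gain, missing values, noise) → potential class.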

Classification Rules of GAssist


0:Noise is [>0.667]|bad
1:MissingValues is [>0.905]|bad
2:MissingValues is [<0.125]|Noise is [>0.145]|satisfactory
3:Noise is [>0.287]|satisfactory
4:AvgInfoGain is [<0.29]|MissingValues is [>0.6]|bad
5:Default rule -> good

The classification rules generated by both classifiers support our thesis that a noise
level greater than about 0.25 severely degrades the classification potential of a dataset. As
expected, GAssist is able to generate more generic and comprehensible rules; for ex-
ample, if the noise level is above 0.667, the classification potential is bad irrespective of
the other parameters. The knowledge extracted by the two algorithms provides the same
generalization. Hence, our proposed meta-model can be effectively used to determine the
true classification potential of a biomedical dataset. We believe it can prove to be a
very effective tool for analyzing the inherent complexities of a dataset and its pre-processing
needs.

Decision Tree of J48

Noise <= 0.26297


| MissingValues <= 0.016387
| | AvgInfoGain <= 0.65192
| | | AvgInfoGain <= 0.059957: good
| | | AvgInfoGain > 0.059957: satisfactory
| | AvgInfoGain > 0.65192: good
| MissingValues > 0.016387: good
Noise > 0.26297
| MissingValues <= 0.002235: satisfactory
| MissingValues > 0.002235: bad

6 Conclusion
In this paper, we have quantified the complexity of biomedical datasets in terms of
missing values, noise, imbalance ratio and information gain. The effect of this complexity
on classification accuracy is evaluated using six well-known evolutionary rule-learning
algorithms. The results of our experiments show that GAssist provides better classification
accuracy than the other algorithms on most of the datasets. Our analysis
reveals, however, that the classification accuracy on a biomedical dataset is a function
of the nature of the dataset rather than of the choice of a particular evolutionary
learner. The major contribution of this paper is a unique methodology for determining the
classification potential of a dataset using a meta-model framework. In the future, we
would like to present the generated rules of the different classifiers to medical experts
for their feedback.

Acknowledgements

The authors of this paper are supported, in part, by the National ICT R&D Fund, Min-
istry of Information Technology, Government of Pakistan. The information, data, com-
ments, and views detailed herein may not necessarily reflect the endorsements or views
of the National ICT R&D Fund.

References
1. Pena-Reyes, C.A., Sipper, M.: Evolutionary computation in medicine: an overview. Journal
of Artificial Intelligence in Medicine 19(1), 1-23 (2000)
2. Wong, M.L., Lam, W., Leung, K.S., Ngan, P.S., Cheng, J.C.V.: Discovering knowledge from
medical databases using evolutionary algorithms. IEEE Engineering in Medicine and Biol-
ogy 19(4), 45-55 (2000)
3. Holmes, J.H.: Learning classifier systems applied to knowledge discovery in clinical research
databases. In: Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2000. LNCS (LNAI),
vol. 1996, pp. 243-261. Springer, Heidelberg (2001)
4. Bernado-Mansilla, E.: Domain of competence of XCS classifier system in complexity mea-
surement space. IEEE Transactions on Evolutionary Computation 9(1), 82-104 (2005)
5. Kharbat, F., Bull, L., Odeh, M.: Mining breast cancer data with XCS. In: Genetic and Evolution-
ary Computation Conference (GECCO), pp. 2066-2073, UK (2007)
6. Orriols-Puig, A., Bernado-Mansilla, E.: Evolutionary rule-based systems for imbalanced data sets. Soft
Computing - A Fusion of Foundations, Methodologies and Applications 13(3), 213-225
(2009)
7. Bacardit, J., Butz, M.V.: Data mining in learning classifier systems: comparing XCS with
GAssist. In: Kovacs, T., Llora, X., Takadama, K., Lanzi, P.L., Stolzmann, W., Wilson, S.W.
(eds.) IWLCS 2003. LNCS (LNAI), vol. 4399, pp. 282-290. Springer, Heidelberg (2007)
8. Bernado, E., Llora, X., Garrell, J.M.: XCS and GALE: a comparative study of two learning
classifier systems with six other learning algorithms on classification tasks. In: Lanzi, P.L.,
Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2001. LNCS (LNAI), vol. 2321, pp. 115-132.
Springer, Heidelberg (2002)
9. Parpinelli, R.S., Lopes, H.S., Freitas, A.A.: An ant colony based system for data mining: ap-
plications to medical data. In: Int. Conf. on Knowledge Discovery and Data Mining, Boston,
pp. 55-62 (2000)
10. Galea, M., Shen, Q., Levine, J.: Evolutionary approaches to fuzzy modelling for classifica-
tion. Knowledge Engineering Review 19(1), 27-59 (2004)
11. Tanwani, A.K., Afridi, J., Shafiq, M.Z., Farooq, M.: Guidelines to select machine learning
scheme for classification of biomedical datasets. In: Pizzuti, C., Ritchie, M.D., Giacobini, M.
(eds.) EvoBIO 2009. LNCS, vol. 5483, pp. 128-139. Springer, Heidelberg (2009)
12. Butz, M.V., Kovacs, T., Lanzi, P.L., Wilson, S.W.: Toward a theory of generalization and
learning in XCS. IEEE Transactions on Evolutionary Computation 8(1), 28-46 (2004)
13. Bernado-Mansilla, E., Garrell-Guiu, J.M.: Accuracy-based learning classifier systems: mod-
els, analysis and applications to classification tasks. Evolutionary Computation 11(3), 209-
238 (2006)
14. Bacardit, J., Garrell, J.M.: Bloat control and generalization pressure using the minimum de-
scription length principle for a Pittsburgh approach learning classifier system. In: Kovacs,
T., Llora, X., Takadama, K., Lanzi, P.L., Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2003.
LNCS (LNAI), vol. 4399, pp. 59-79. Springer, Heidelberg (2007)
15. Otero, F.E.B., Freitas, A.A., Johnson, C.J.: cAnt-Miner: an ant colony classification algo-
rithm to cope with continuous attributes. In: Ant Colony Optimization and Swarm Intelli-
gence, Belgium, pp. 48-59 (2008)
16. Gonzalez, A., Perez, R.: SLAVE: a genetic learning system based on an iterative approach.
IEEE Transactions on Fuzzy Systems 7(2), 176-191 (1999)
17. Ishibuchi, H., Nakashima, T., Murata, T.: Performance evaluation of fuzzy classifier systems
for multidimensional pattern classification problems. IEEE Transactions on Systems, Man,
and Cybernetics 29(5), 601-618 (1999)
18. Fawcett, T.: ROC graphs: notes and practical considerations for researchers. TR HPL-2003-4,
HP Labs, USA (2004)
19. UCI repository of machine learning databases, University of California-Irvine, Department
of Information and Computer Science,
www.ics.uci.edu/mlearn/MLRepository.html (last accessed: June 25, 2010)
20. Zhu, X., Wu, X.: Class noise vs. attribute noise: a quantitative study of their impacts. Artifi-
cial Intelligence Review 22(3), 177-210 (2004)
21. Brodley, C.E., Friedl, M.A.: Identifying mislabeled training data. Journal of Artificial Intel-
ligence Research 11, 131-167 (1999)
22. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd
edn. Morgan Kaufmann, San Francisco (2005)
23. Otero, F.E.B.: MYRA: Ant Colony Optimization Framework,
http://sourceforge.net/projects/myra/ (last accessed: June 27, 2010)
24. Alcala-Fdez, J., Sanchez, L., Garcia, S., del Jesus, M.J., Ventura, S., Garrell, J.M., Otero, J.,
Romero, C., Bacardit, J., Rivas, V.M., Fernandez, J.C., Herrera, F.: KEEL: a software tool
to assess evolutionary algorithms for data mining problems. Soft Computing 13, 307-318
(2008)
25. Demsar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine
Learning Research 7, 1-30 (2006)
26. Garcia, S., Herrera, F.: An extension on "Statistical comparisons of classifiers over multiple
data sets" for all pairwise comparisons. Journal of Machine Learning Research 9, 2677-
2694 (2008)
27. Friedman, M.: A comparison of alternative tests of significance for the problem of m rank-
ings. Annals of Mathematical Statistics 11, 86-92 (1940)
28. Iman, R.L., Davenport, J.M.: Approximations of the critical region of the Friedman statistic.
Communications in Statistics, 571-595 (1980)
29. Dunn, O.J.: Multiple comparisons among means. Journal of the American Statistical Asso-
ciation 56, 52-64 (1961)
30. Holm, S.: A simple sequentially rejective multiple test procedure. Scandinavian Journal of
Statistics 6, 65-70 (1979)
31. Nemenyi, P.B.: Distribution-free multiple comparisons. PhD thesis, Princeton University
(1963)
32. Parpinelli, R.S., Lopes, H.S., Freitas, A.A.: Data mining with an ant colony optimization
algorithm. IEEE Transactions on Evolutionary Computation 6(4), 321-332 (2002)
33. Orriols-Puig, A., Bernado-Mansilla, E.: Revisiting UCS: description, fitness sharing and
comparison with XCS. In: Bacardit, J., Bernado-Mansilla, E., Butz, M.V., Kovacs, T., Llora,
X., Takadama, K. (eds.) IWLCS 2006 and IWLCS 2007. LNCS (LNAI), vol. 4998, pp. 96-
116. Springer, Heidelberg (2008)
34. Jo, T., Japkowicz, N.: Class imbalances versus small disjuncts. SIGKDD Explorations News-
letter 6(1), 40-49 (2004)
35. Tanwani, A.K., Farooq, M.: The role of biomedical dataset in classification. In: Combi,
C., Shahar, Y., Abu-Hanna, A. (eds.) Artificial Intelligence in Medicine. LNCS (LNAI),
vol. 5651, pp. 370-374. Springer, Heidelberg (2009)
Supply Chain Management Sales Using XCSR

María Franco, Ivette Martínez, and Celso Gorrín

Departamento de Computación y Tecnología de la Información
Universidad Simón Bolívar, Caracas, Venezuela
maria@gia.usb.ve,
martinez@ldc.usb.ve

Abstract. The Trading Agent Competition in its Supply Chain
Management category (TAC SCM) is an international forum where teams develop
agents that control a computer assembly company in a simulated environ-
ment. TAC SCM involves the following problems: to determine when to
send offers, decide the final sales prices of the goods offered and plan the
factory and delivery schedules. In this work, we developed a TAC SCM
agent called TicTACtoe, that uses Wilson's XCSR classifier system to
decide the final sales prices. In addition, we developed an adaptation of
this classifier system, that we called the blocking classifiers technique, which
allows the use of XCSR within environments with single-step tasks and
delayed rewards. Our results show that XCSR allows generating a set of
rules that solves the TAC SCM sales problem in a satisfactory way. More-
over, we found that the blocking mechanism improves the performance
of the agent in the TAC SCM scenario.

1 Introduction

Supply chain management embodies the management of all the processes and
information that move along the supply chain, from the supplier to the
manufacturer right through to the retailer and the final customer. Nowadays,
supply chain management is one of the most important industrial activities.
Planning the activities through the supply chain is vital to the competitiveness
of manufacturing enterprises. According to [6], while today's supply chains are
essentially static, relying on long-term relationships among key trading partners,
more flexible and dynamic practices offer the prospect of better matches between
suppliers and customers as market conditions change.
The Trading Agent Competition of Supply Chain Management (TAC SCM) [6]
was designed to expose the participants to the typical challenges presented in a
dynamic supply chain. These challenges include competing for the components
provided by the suppliers, managing the inventory, transforming components
into final products and competing for the customers. These problems can be
classified into three main problems: purchases, production and sales.
Pardoe and Stone performed experiments applying different learning techniques to
the sales decisions of TAC SCM agents [11]. One of their main conclusions was that
winning offers in TAC SCM is a very complex problem because the winning prices
may vary very quickly. Therefore, that work affirms that taking decisions based
on previous states of the current game is inaccurate, while using information
taken from many previous games yields better results.

J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 145-165, 2010.
© Springer-Verlag Berlin Heidelberg 2010
The goal of this work is to present an approach to the TAC SCM problem
using an evolutionary reinforcement learning system. We specifically use XCSR
to solve one of the most important sales problems: pricing the products in
order to compete in the market and maximize profit at the same time.

2 TAC SCM
The TAC SCM competition [6] was designed by a team of researchers from the e-Supply Chain Management Lab at Carnegie Mellon University in collaboration with the Swedish Institute of Computer Science (SICS). In this contest, each team has to develop an intelligent agent capable of handling the main supply chain management problems (which orders to accept, deciding the sale price for products, competing over the market, among others).

Agents compete against each other in a simulation that lasts 220 days and includes customers and suppliers to deal with. The main goal of the competitors is to maximize the final profit by selling assembled computers to the customers. The profit of an agent is calculated by subtracting the production costs from the income. This profit is reflected in the amount of money the agents have at the end of the game, which determines which agent is the winner.
Each TAC SCM simulation has three actors: customers who buy computers, manufacturers (agents) who produce and sell computers, and suppliers who provide the unassembled components to the manufacturers. A detailed description of these actors can be found in [6].

At the beginning of each day, the agent receives requests for quotes (also known as RFQs) from the customers. Afterwards, the agent decides which RFQs should be accepted and what the final offer price should be. After sending the offers, the agent waits for the orders from the customers. Only the best-priced offers are accepted and turn into orders. If the agent receives an order, it decides when to produce and deliver it and, even more importantly, how many components it should buy to accomplish the production schedules. In order to buy the components, the agent sends the suppliers RFQs for the spare parts. In response, the suppliers send offers to the agent, who has to decide whether or not to accept them.
Each team competing in a TAC SCM game has to develop a manufacturer agent that deals with the main decisions of supply chain management: how many components to buy, when to produce an order and which RFQs to accept. Moreover, when accepting an RFQ, the agent should decide what the final price for these goods should be. In this work, these three problems will be referred to as the purchase problem, the production problem and the sales problem.

More than 30 agents participate in this competition each year. Among the most successful solutions to the TAC SCM problem are TacTex-06,
Supply Chain Management Sales Using XCSR 147

PhantAgent and CMieux. TacTex-06 [12] is an agent that uses a prediction model trained with the Additive Regression with Decision Stumps algorithm [17] within the purchase strategy. In addition, this agent also uses another prediction model for the sales strategy, based on the idea that the winning prices follow a normal distribution. Another interesting approach to the problem is presented by PhantAgent [13], which uses heuristics to solve the purchase and sales problems. Furthermore, the CMieux agent [2] uses a forecasting module that predicts the sales prices of components and products for the following days. However, the code for these solutions was not available at the time, which encouraged us to create our solution to the problem from scratch.

In our solution, we addressed the sales problem using an evolutionary reinforcement learning technique. The other problems were solved using simple static strategies in order to evaluate the impact of the learning system on the sales problem. Our approach to the problem is explained in more detail in Section 4.

3 XCSR
XCS is a Michigan-style Learning Classifier System first described by Wilson [14]. This system is based on the work proposed by Holland [9] but uses the accuracy instead of the payoff as the measure of goodness of a classifier. In our implementation we used XCSR [15], a version of XCS that accepts real numbers as inputs. To do this, each feature in the condition is represented by a lower and an upper bound, while the action remains discrete.

We decided to use this approach because all the inputs of the decision we wanted to make were real-valued and the decisive thresholds needed to be found dynamically. We also chose XCSR because the rule system can constantly adapt to new environments using a fixed rate of exploration [3], and because the rules it generates are interpretable by human beings [10].
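The interval representation used by XCSR can be sketched as follows. This is a minimal illustration of how a real-valued condition matches an input, not the actual library code; the class and field names are our own.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class IntervalClassifier:
    """One XCSR rule: a [lower, upper] interval per feature plus a discrete action."""
    lowers: List[float]
    uppers: List[float]
    action: int

    def matches(self, state: List[float]) -> bool:
        # The rule matches when every input value falls inside its interval.
        return all(l <= x <= u for l, x, u in zip(self.lowers, state, self.uppers))

# A rule covering feature 0 in [0.0, 0.5] and feature 1 in [0.2, 1.0], proposing action 3:
rule = IntervalClassifier(lowers=[0.0, 0.2], uppers=[0.5, 1.0], action=3)
```

For example, `rule.matches([0.4, 0.9])` holds while `rule.matches([0.6, 0.9])` does not, since 0.6 falls outside the first interval.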

4 TicTACtoe
TicTACtoe is our approach to the TAC SCM problem. TicTACtoe has three modules: Purchase, Production and Sales (see Figure 1). Each module manages one of the sub-problems in supply chain management. Every module takes its own decisions using information taken from the environment and from the other modules. In the next subsections, we will focus on the details of these modules.
In addition to these modules, we provided the agent with memory through an organizer structure. This structure keeps track of: orders scheduled for production, possible order commitments, actually produced orders¹ and the possible future inventory². This memory allows the agent to record the decisions taken each day

¹ Production schedules may vary due to the lack of components.
² The future inventory is based on the component orders placed by the agent.

Fig. 1. TicTACtoe Architecture

and to consider events that will happen in the future, both of which are used to make further decisions.

4.1 Purchases
The purchase module is in charge of sending RFQs to the suppliers in order to buy the components necessary for production. This module has two tasks: a) creating the RFQs to get the current component prices and b) deciding which supplier offers to accept.

Supplier RFQ creation. First, the agent calculates how many components are needed for production within the next ten days. These calculations are based on the current inventory, the orders scheduled for the next ten days and the component orders that have already been placed. The agent always sends the RFQ to its favourite supplier for that particular component, which is the one that has given the best prices lately. There is only one favourite supplier for each component. However, the agent also asks the other suppliers for their current prices in order to update the favourite supplier if necessary.
The favourite supplier is preferred in order to get lower prices. This is based on the assumption that the state of a supplier does not change drastically: if a supplier gives an agent the best price, it will probably continue giving good prices for some time.

Accepting the offers. When the agent asks for components, the suppliers might not be able to comply with the agent's requirements. When a supplier is not able to deliver the products the agent asks for, it sends two types of adjusted offers instead: offers that vary the quantity and offers with a later due date. If this happens, the priority of TicTACtoe is to accept the complete offers first and then the ones that vary the quantity. Once an order is set, the agent adds a record of the components' arrival to calculate the future inventory.

Furthermore, the agent keeps a historical record of the base price of each component. The component base price is calculated every day as a weighted average, as shown in equation (1):

    P_d^c = S_d^c · w + P_{d-1}^c · (1 − w)    (1)

where P_d^c is the base price for component c on day d, S_d^c is the supplier's price
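Equation (1) is an exponentially weighted moving average of the daily supplier quotes. A minimal sketch; the weight w = 0.3 is an arbitrary illustrative choice, as the paper does not report the value used:

```python
def update_base_price(prev_base: float, supplier_price: float, w: float = 0.3) -> float:
    """Equation (1): blend today's supplier quote into the running base price."""
    return supplier_price * w + prev_base * (1.0 - w)

# Three days of quotes for one component, starting from a base price of 100:
price = 100.0
for quote in [110.0, 105.0, 120.0]:
    price = update_base_price(price, quote)
print(round(price, 2))  # 108.52
```

Recent quotes dominate the estimate while older prices decay geometrically, which matches the favourite-supplier assumption that prices do not change drastically from day to day.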

4.2 Production
The production module is in charge of scheduling the production of the active orders (the orders that are waiting to be elaborated and delivered). This module prioritizes the active orders with sooner due dates and higher penalties (in case the orders are behind schedule). The agent loops over the orders, checking whether there is enough inventory of products to deliver them. If the agent has enough products, the order is delivered. This strategy is used by PhantAgent [13] to avoid extra storage charges. In case there are not enough components to deliver the order, the agent verifies whether the order is beyond the latest possible delivery day³. If it is already too late, the customer will not receive the order anymore, so the agent cancels it and frees all the associated components and products in order to be able to use them to fulfill other orders.
If there are not enough products to fulfill the order but the customer can still wait for it, the agent tries to produce it. To produce an order scheduled for a specific day, the agent checks whether there are enough components. When there are not enough components to produce the desired quantity, the agent produces the maximum quantity allowed. If the agent cannot produce an order completely, it continues producing it the next day.
At the end of the day, the production module determines the number of late orders and the number of active orders. This information is used by the sales module to adjust the quantity of free cycles the agent can offer. This forces the agent to save cycles for late-order production.

4.3 Sales
The sales module is in charge of pricing the products and dealing with the customers. This module checks the customer RFQs every day and sends offers to the ones that meet the following characteristics: (a) a reserve price higher than the product's base price and (b) a due date earlier than the end of the simulation. The agent calculates the base price of a product as the sum of the estimated prices of all its spare parts. This estimates how profitable the order would be.
Afterwards, the agent uses the set of rules generated by XCSR to determine the discount factor over the reserve price of each RFQ. The reserve price is the maximum price a customer is willing to pay for an order. The agent that offers the lowest price wins the bid. The implementation of XCSR is explained in greater detail in Section 5.

³ The latest possible delivery day is determined when the customer sends the RFQ.

The final offer price is determined by equation (2), where BasePrice is the calculated cost of the product based on recent experience, ReservePrice is the reference price determined by the customer and d is the discount factor determined by the XCSR:

    OfferPrice = BasePrice + Revenue · (1 − d)    (2)

    Revenue = ReservePrice − BasePrice    (3)
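Equations (2) and (3) combine as follows; the numeric values are illustrative only:

```python
def offer_price(base_price: float, reserve_price: float, d: float) -> float:
    """Equations (2)-(3): the discount d chosen by XCSR scales the revenue
    margin between the product's base price and the customer's reserve price."""
    revenue = reserve_price - base_price          # equation (3)
    return base_price + revenue * (1.0 - d)       # equation (2)

# A 50% discount (d = 0.5) on an RFQ with base price 1500 and reserve price 2000:
print(offer_price(1500.0, 2000.0, 0.5))  # 1750.0
```

With d = 0 the agent bids the full reserve price; with d = 0.9 (the largest action) it keeps only 10% of the attainable margin, which maximizes its chance of winning the bid.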

Once the agent calculates the offer price for each RFQ, a production schedule is generated including these possible orders⁴. The orders that involve higher revenues have higher priority. In order to save production cycles for future orders that would need to be delivered earlier, the agent always tries to produce an order as late as possible according to its due date. This strategy is very similar to the one used in [12]. If there are no free cycles, the agent checks the inventory to see if there are enough products to deliver these orders the next day. If none of these options is possible, the least profitable RFQs are discarded.
Moreover, the daily free cycles are multiplied by a factor between 0 and 1, inversely proportional to the quantity of late orders that the agent has. This helps the agent get back on schedule, by leaving some cycles for the production of late orders.
Our agent remembers all the placed offers as possible commitments. However, customers only accept the best-priced offers. In case a customer rejects an offer, the commitment is removed and all the associated components and cycles are released.

5 XCSR Inside TicTACtoe


One of the most important decisions in supply chain management is deciding the final price for the products. This price should be low enough to win the order and, at the same time, high enough to maximize the agent's profit.
The decision taken by the XCSR is the final price discount the agent should offer to win the bid. This decision is taken inside the Sales Module by accessing the XCSR library through two methods. The first one feeds in the current state of the environment, finds the match set and the action that should take effect. Moreover, it associates the action set to the RFQ, in order to reward it later. The second one rewards the action set and saves the error information to compute further population statistics.

5.1 Classifier Structure

In the following sections, we explain the structure used to represent the TAC SCM sales problem using real inputs and discrete actions.

⁴ This helps the agent to calculate how many free cycles are left for the production of further orders.

Condition. There are simulation values known by the agent that provide important information for its future decisions. Including all these values in the classifier structure would decrease the efficiency of the GA in terms of execution time. To avoid this, we selected the most important features for the decision we wanted to make. Preliminary experiments showed that the most suitable features for the classifier are:

x1: The rate of late orders⁵ over the total of active orders. This determines how much work is late and how convenient it is to make a good offer when the agent is already behind schedule.

    x1 = lateOrders / totalOrders    (4)

x2: The rate of factory cycles that remained unused the day before. This helps the agent determine whether it should raise or lower the price discount. For example, if the factory is full, the agent should give low discounts in order to try to finish its active orders before getting new ones.

    x2 = freeFactoryCapacity / totalFactoryCapacity    (5)

x3: The rate of the base price over the reserve price indicated by the customer. This represents how profitable an order would be. The agent discards the cases where the base price is higher than the reserve price.

    x3 = basePrice / reservePrice    (6)

x4: The number of days between the current date and the day the order should be delivered. This indicates how much time the agent has to produce and deliver an order. This value is scaled between 0 and 1, considering that the due dates are, at most, 12 days after the current date.

    x4 = (dueDate − day) / 12    (7)

x5: The current day of the simulation, normalized by the maximum number of days a game has. This value is very important because different situations arise as the days go by. For example, in the middle of a simulation components start to become scarce and their prices start rising. This feature helps the agent to distinguish different stages of the simulation that require specific behaviours.

    x5 = day / 220    (8)

All the features are normalized between 0 and 1, so that these values can be used as upper and lower bounds. This aspect is explained in more detail in Section 6.1.

⁵ The late orders are the active orders that are producing penalties because they are going to be delivered after the due date.
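Assembled together, equations (4)-(8) yield the five-feature input vector. A sketch, with function and argument names of our own choosing:

```python
def sales_features(late_orders: int, total_orders: int,
                   free_cycles: int, total_cycles: int,
                   base_price: float, reserve_price: float,
                   due_date: int, day: int) -> list:
    """Build the normalized [0, 1] condition vector of equations (4)-(8)."""
    x1 = late_orders / total_orders if total_orders else 0.0   # eq. (4)
    x2 = free_cycles / total_cycles                            # eq. (5)
    x3 = base_price / reserve_price                            # eq. (6)
    x4 = (due_date - day) / 12                                 # eq. (7)
    x5 = day / 220                                             # eq. (8)
    return [x1, x2, x3, x4, x5]

# Day 100 of the game, an RFQ due on day 106:
features = sales_features(late_orders=2, total_orders=10,
                          free_cycles=500, total_cycles=2000,
                          base_price=1500.0, reserve_price=2000.0,
                          due_date=106, day=100)
```

Because every feature lies in [0, 1], the resulting vector can be matched directly against the interval conditions described in Section 6.1.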

Action. Our implementation of XCSR has 10 actions that represent the different discounts over the possible revenue. The revenue is computed as the difference between the base price and the reserve price determined by the customers. The different discounts go from 0% to 90% in 10% steps.

Reward. The reward is determined by the profit obtained through an RFQ, scaled by the amount of money that its fabrication implied when the agent sent the offer. There are three different scenarios in which the agent can reward an action set.

When the offer is not accepted by the customer. In this case the RFQ did not make the agent earn or lose any money, and the reward is zero.

When the order is delivered. In this case we consider the money earned by the sale and the money lost because of the penalty (if the order was delivered late). The profit and the loss are scaled by the investment of the agent, which is calculated based on the base price of the product. The reward in this case is calculated using equation (9):

    reward = (profit)² − (loss)²    (9)

where

    profit = offeredPrice / basePrice    (10)

    loss = max(day − dueDate, 0) · penalty / (basePrice · quantity)    (11)

When we scale the profit and the loss using the base price, we obtain a percentage value of the money earned with the order. One could think that a good approximation for the reward function is to subtract the expenses from the profit; but the net earnings are not the same for the different products⁶. If we used the earnings as the reward, the rules that obtain the highest rewards would only be the ones that sell the most expensive products. However, we want to learn how to sell different types of products, not only the most expensive ones. Therefore, it is more appropriate to reward a classifier based on the profit margin.

When the order is canceled without being delivered. In this case the agent did not produce the order on time, so the order only produced losses for the agent, due to the penalties the agent had to pay to the customer. The money invested in the production is not considered an expense, because these products can be used to fulfill another order. In this situation, the reward is calculated by equation (12), which is equation (9) with the term corresponding to the profit eliminated:

    reward = −(loss)²    (12)

In equations (9) and (12), the terms corresponding to the profit and the loss are squared in order to give the classifier a stronger reward when these quantities are significantly greater.

⁶ Products produce different earnings depending on their production costs.
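The three reward cases can be summarized in one function. This is a sketch of equations (9)-(12) under our reading of them; the function and argument names are ours:

```python
def compute_reward(accepted: bool, delivered: bool,
                   offered_price: float, base_price: float,
                   penalty: float, quantity: int,
                   day: int, due_date: int) -> float:
    """Delayed reward for one action set (equations 9-12)."""
    if not accepted:
        return 0.0                                   # rejected offer: no money moved
    days_late = max(day - due_date, 0)
    loss = days_late * penalty / (base_price * quantity)   # eq. (11)
    if not delivered:
        return -loss ** 2                            # cancelled order, eq. (12)
    profit = offered_price / base_price              # eq. (10)
    return profit ** 2 - loss ** 2                   # delivered order, eq. (9)

# An order delivered on time at the offered price:
r = compute_reward(True, True, offered_price=1850.0, base_price=1500.0,
                   penalty=100.0, quantity=10, day=105, due_date=106)
```

Note how the squared terms make a 20% margin worth more than twice a 10% margin, steering the system towards rules with consistently high profit margins rather than high absolute earnings.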

6 Implementation Details

The XCSR library implementation is based on Butz's XCS library (Version 1.0) [4]. We adapted this library to XCSR using a lower- and upper-bound notation as in XCSI [16], but allowing real values for the bounds. In the following subsections, we explain some characteristics of the system relevant to our implementation.

6.1 Don't Care

The don't care in our library was implemented as the absence of a lower or upper bound, depending on the allele we wanted to modify. To implement this don't care, we had to put a restriction on the data: all the features must be bounded between 0 and 1. Putting a don't care in an allele is then equivalent to putting a 0 or a 1, depending on whether it is a lower or an upper bound. In this way, we open the range to the maximum limit, so the allele matches all the states.

6.2 Classifier Subsumption

Since Butz's library was oriented to boolean features, we had to implement different subsumption rules adapted to our classifier structure, where all the features are bounded between 0 and 1. The rules used are the same as those used by Wilson in [16], where a classifier is more general than another if all the ranges of the first classifier contain those of the second one. For example, (l_i, u_i) subsumes (l_j, u_j) if u_i ≥ u_j and l_i ≤ l_j. The actions of the classifiers must be the same for the subsumption to occur.
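Under these rules, the containment check can be sketched as follows, reusing a minimal rule structure of our own:

```python
from typing import NamedTuple, List

class Rule(NamedTuple):
    lowers: List[float]
    uppers: List[float]
    action: int

def subsumes(general: Rule, specific: Rule) -> bool:
    """A rule subsumes another when it has the same action and every one of
    its intervals contains the corresponding interval of the other rule."""
    if general.action != specific.action:
        return False
    return all(gl <= sl and gu >= su
               for gl, gu, sl, su in zip(general.lowers, general.uppers,
                                         specific.lowers, specific.uppers))

wide = Rule([0.0, 0.1], [1.0, 0.9], action=2)
narrow = Rule([0.2, 0.3], [0.8, 0.7], action=2)
```

Here `subsumes(wide, narrow)` holds but not the converse, so `narrow` would be absorbed by `wide` during subsumption, provided `wide` is also accurate and sufficiently experienced.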

6.3 Crossover Operators

We implemented a restricted two-point crossover operator between conditional ranges that generates new individuals with valid ranges. This means that only the points between an upper bound and a lower bound can be chosen as cut points. This crossover operator is equivalent to the boolean two-point crossover operator, because it only crosses full conditions. The ranges of the new individuals are always valid, because they are combinations of the parents' ranges.
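Because cut points are restricted to interval boundaries, the condition can be treated as a list of (lower, upper) pairs that are exchanged whole. A sketch of this idea:

```python
import random

def two_point_crossover(parent_a, parent_b, rng=random):
    """Two-point crossover over whole (lower, upper) pairs: cut points fall
    only between intervals, so children always inherit valid ranges."""
    n = len(parent_a)
    i, j = sorted(rng.sample(range(n + 1), 2))
    child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_a, child_b

a = [(0.0, 0.2), (0.1, 0.5), (0.4, 0.9)]
b = [(0.3, 0.7), (0.0, 1.0), (0.2, 0.6)]
child_a, child_b = two_point_crossover(a, b, random.Random(1))
```

Since no cut ever falls between a lower and its matching upper bound, the invariant lower ≤ upper is preserved in every child without any repair step.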

6.4 Additional Adaptations

Additional adaptations were necessary to include XCSR in our TAC SCM agent due to the characteristics of the problem.

Blocking classifiers. In our classifier system, the reward of an action set is given based on the amount of money the agent wins or loses when making the corresponding offer. This value depends only on the discount given by the agent in the offer, which is why the problem is modelled as a single-step problem. Nevertheless, the agent only knows the reward a few days after making the decision. This differs from the classic problems used as benchmarks (e.g., the boolean multiplexer), in which the reward arrives immediately after applying an action.

Considering the delayed reward, it is necessary to save the action set associated to each order, so these classifiers can be given a reward when the agent gets the final result. Since we are interested in continuing to learn while a classifier waits for its reward, classifiers are used in multiple learning iterations in parallel. This aspect of the online learning, in addition to the delayed reward, presented a new problem to us. The problem occurs when a classifier that is waiting for a reward is selected for deletion or subsumption. Since these mechanisms can be executed by any learning iteration, they could erase this classifier based on information that is not up to date. Consequently, the knowledge represented by this classifier and its upcoming rewards would be lost.
In order to avoid the deletion of classifiers expecting a reward based on incomplete information, we implemented a simple counting semaphore. Each classifier has a counter that indicates the number of rewards it is expecting. A single classifier participates in many decisions each day and needs to wait for a reward for each one of them. Therefore, we only consider for deletion the classifiers that are not blocked, i.e., the ones whose counter is at zero.
We also had to add another important restriction to the subsumption mechanism: a classifier cannot be subsumed while it is blocked, because its information is not sufficiently up to date to become part of another classifier. Blocked classifiers may participate in all the other mechanisms, like crossover and mutation.
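The counting semaphore can be sketched as a per-classifier counter consulted by the deletion and subsumption routines. This is a minimal illustration, not the library's actual structure:

```python
class BlockableClassifier:
    """A classifier whose deletion/subsumption is blocked while it still
    awaits delayed rewards from outstanding offers."""
    def __init__(self):
        self.pending_rewards = 0     # the counting semaphore

    def on_offer_sent(self):
        self.pending_rewards += 1    # block: action set stored with an RFQ

    def on_reward_received(self):
        self.pending_rewards -= 1    # unblock once the outcome is known

    @property
    def blocked(self) -> bool:
        # Blocked rules may still cross over and mutate, but are never
        # considered for deletion or for being subsumed.
        return self.pending_rewards > 0

cl = BlockableClassifier()
cl.on_offer_sent()
cl.on_offer_sent()
cl.on_reward_received()
```

After the calls above the classifier is still blocked, since one of its two outstanding offers has not yet been resolved; the next reward releases it.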

Dynamic population generation. The version of XCS used has a dynamic population generation method, in which the population starts empty. Each time the algorithm generates a match set, it inserts new classifiers into the population until all the actions are covered. In other words, the algorithm guarantees that there is at least one classifier for each possible action. If there is no classifier in the population for a specific action, covering is performed and the new classifier is inserted into the population. The advantage of this technique is that the population grows dynamically as the states occur in the experiment, covering the whole search space.
On the other hand, the population has a limited size, and covering all the actions leads to the loss of old classifiers when inserting the new ones. Moreover, different groups of classifiers activate themselves in different stages of the simulation. Considering that blocking classifiers places restrictions on the deletion of the active individuals, this increases the probability of deleting classifiers that activate themselves in other stages of the simulation. In order to avoid deleting good but inactive classifiers in advanced execution stages, we only apply this population generation method until the population reaches its size limit. After that, when covering is necessary, we generate only one rule with a random action. However, we continue inserting and deleting individuals when applying the genetic algorithm over the action set.
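A sketch of this two-phase covering; the `Rule` structure and the interval half-width `spread` are our own illustrative choices, not the library's:

```python
import random
from dataclasses import dataclass
from typing import List

@dataclass
class Rule:
    lowers: List[float]
    uppers: List[float]
    action: int

def cover(state, action, spread=0.05):
    # Hypothetical covering: build an interval around each input value.
    return Rule([max(0.0, x - spread) for x in state],
                [min(1.0, x + spread) for x in state], action)

def build_match_set(population, state, n_actions, pop_limit, rng=random):
    match = [r for r in population
             if all(l <= x <= u for l, x, u in zip(r.lowers, state, r.uppers))]
    if len(population) < pop_limit:
        # Early phase: guarantee at least one matching rule per action.
        missing = set(range(n_actions)) - {r.action for r in match}
    else:
        # Late phase: cover with a single random action only if nothing matches.
        missing = set() if match else {rng.randrange(n_actions)}
    for a in missing:
        r = cover(state, a)
        population.append(r)
        match.append(r)
    return match

pop: List[Rule] = []
m = build_match_set(pop, [0.5, 0.5], n_actions=3, pop_limit=10)
```

On the first call the empty population is filled with one rule per action; once the population is full, an unmatched state triggers only a single covering rule, limiting the displacement of older classifiers.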

Variable ε-greedy action selection policy. The base library interleaves exploitation and exploration, rewarding the classifiers only during exploration and taking learning statistics only during exploitation. In this problem, the classifier system learns while the agent competes in a simulation. Since all the decisions taken by the XCSR affect the final result, regardless of whether they were determined by exploration or exploitation, we changed the algorithm in order to reward the classifiers in both cases.
Considering the dynamic characteristics of the simulation, we decided to use an ε-greedy action selection policy. This consists of selecting the best possible action with probability 1 − ε and exploring the rest of the time. However, we made a slight modification so that ε starts at 1 and decreases linearly until it reaches a threshold. This forces the system to explore more at the beginning of the simulation and less towards the end of it. When ε reaches the threshold, its value remains constant, allowing the agent to keep performing some exploration, which facilitates its adaptation to changes in the simulation.
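A sketch of the decaying ε-greedy policy. The floor of ε = 0.7 matches the best configuration reported in Section 7.1 (an exploitation rate 1 − ε of 30%); the decay length is our own illustrative choice, as the paper does not report it:

```python
import random

def epsilon(day: int, start: float = 1.0, floor: float = 0.7,
            decay_days: int = 50) -> float:
    """Decay epsilon linearly from `start` to `floor`, then hold it constant."""
    if day >= decay_days:
        return floor
    return start - (start - floor) * day / decay_days

def select_action(prediction, day: int, rng=random):
    """prediction[a] is the system prediction for discount action a."""
    if rng.random() < epsilon(day):
        return rng.randrange(len(prediction))                        # explore
    return max(range(len(prediction)), key=prediction.__getitem__)   # exploit
```

Unlike the base library's strict explore/exploit interleaving, every action chosen here is rewarded, since both kinds of decision affect the agent's final bank balance.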

7 Experiments and Results

We designed three experiments to test the effectiveness of the proposed mechanisms. First, we tested the performance of the XCSR against other, static solutions. Afterwards, we analysed the impact of the blocking-classifiers technique and the application of different exploration and exploitation rates.
During the execution of the experiments, each agent plays separately against five dummy agents⁷. The source code for these dummy agents can be found in [1]. In each experiment, the agents ran for 40 games. In the first game, the XCSR population is empty. At the end of each game, the population is saved, and at the beginning of the next game it is recovered and established as the initial population. All data presented in the following figures correspond to the last 25 games; the first 15 games were taken as the training stage. However, during these 25 games, the agent still performs some explorative actions due to the dynamic nature of the simulation (see Section 6.4).
The length of each game is 220 simulation days. Each simulation day lasts 5 seconds⁸, considering that none of the agents needs more time to complete its daily actions.
The performance of the agents was evaluated using two main performance measures:

(a) Final result: the final amount of money in the agent's bank account. This indicates how much money the agent earned and how profitable its investments were.
(b) Received orders: the number of orders placed by the customers. This value indicates the percentage of the market the agent served. It is directly linked to the decisions taken by the XCSR, because if the agent gives a better price, it receives more orders.

⁷ These agents come along with the TAC SCM library. They are used for testing purposes and use simple but coherent strategies to handle the different problems.
⁸ The standard parameters for the games are 220 simulation days with a duration of 15 seconds.

Over these two performance measures, we applied non-parametric tests to determine whether the differences between the agents were significant. Since the variables under study take integer values, we cannot assume normality. Therefore, we applied the Kruskal-Wallis test [7] to determine whether there are significant differences between the agents. After that, we used the Wilcoxon test [7] to perform pair-wise comparisons between the agents.
Additional performance measures are also considered in some experiments to gain more observational insight into the performance of the agents:

(a) Factory usage: the percentage of usage of the factory capacity. This value indicates how many factory cycles are used on average. It represents the productive capacity of the agents, which should be used to the maximum.
(b) Penalties: the amount of money paid to the customers for late deliveries. This indicates how many late orders the agent had.
(c) Interests: the amount of money paid to the bank for having a negative balance in the bank account.
(d) Total income: the total amount of money earned by the agents without considering the losses.
(e) Component costs: the amount of money spent buying components.
(f) Storage costs: the amount of money spent storing components to be used in future production.

The component costs, storage costs, penalties and interests are represented as percentages of the total revenue, while the final result and the total income are represented in US dollars. The combination of these measures with the main ones shows how effective the learning was, considering that we wanted to learn a discount strategy that maximizes the revenue of the agent by winning profitable and manageable orders. However, these performance measures are shown only as a support for the two main measures. Therefore, no statistical tests were performed over them.
The parameters used in our implementation of XCSR for the calculation of price discounts are α = 0.1, β = 0.2, δ = 0.1, ν = 5, θGA = 25, ε0 = 10, θdel = 20, χ = 0.8, μ = 0.04, P# = 0.1, pI = 10.0, εI = 0, FI = 0.01, θsub = 20, θmna = 1, s0 = 0.05 and N = 1000. The meaning of these parameters is explained in [5]. Moreover, the sources of TicTACtoe can be found at http://www.gia.usb.ve/~maria/tictactoe.

7.1 Experiment 1: TicTACtoe Performance


The goal of this experiment is to compare three different strategies for determining the price discount: learning using XCSR (L-TicTACtoe), Random and Static. All these strategies were tested using the base version of TicTACtoe. In this experiment, we also compare L-TicTACtoe with the dummy agent provided by the server.
The learning version of TicTACtoe, L-TicTACtoe, uses an exploitation rate (1 − ε) of 30% and a population size of 1000 classifiers using the blocking mechanism. This configuration was the most favourable according to Sections 7.2 and

7.3. Moreover, preliminary experiments [8] showed that a population size of 1000 produces the best results for this problem.
The other versions of TicTACtoe involved are Random and Static. The first one decides the price discount randomly, while the second one gives a discount on day d as follows:

    discount(d) = 80%  if freeFactoryCapacity(d−1) > 80%
                  10%  if freeFactoryCapacity(d−1) < 5%
                  30%  in all other cases

where freeFactoryCapacity(d−1) is the percentage of free factory capacity on simulation day d − 1. These naive rules try to avoid factory saturation by raising prices every time the free factory capacity goes below 5%, and try to attract customers when this value goes over 80%.
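The Static baseline's piecewise rule, expressed directly:

```python
def static_discount(free_capacity_yesterday: float) -> float:
    """Hand-coded pricing baseline: small discounts when the factory is nearly
    full, large ones when it is nearly idle (capacity given as a fraction)."""
    if free_capacity_yesterday > 0.80:
        return 0.80        # factory mostly idle: attract customers
    if free_capacity_yesterday < 0.05:
        return 0.10        # factory saturated: raise prices
    return 0.30            # all other cases
```

Unlike the XCSR strategy, this rule ignores the RFQ's profitability, due date and stage of the game, which is precisely the information carried by features x1-x5.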
Figure 2(a) shows the global performance of the agents. These results clearly show that L-TicTACtoe outperforms Random and Static, with an average result twice as high as Random's and four times higher than the dummy agent's. Considering that these agents differ only in their pricing strategy, it is evident that a change in this strategy affects the global performance of the agent.
(a) Final Result    (b) Received Orders

Fig. 2. Comparison of the dummy agent and the TicTACtoe agent using different pricing strategies in terms of final result and received orders

Table 1 shows the p-values of the statistical comparisons among the agents. This table shows that L-TicTACtoe is significantly better than the other solutions presented. Moreover, the learning agent performs better than the Random agent in 99.9% of the cases, supporting the statements above.
Even though we could expect L-TicTACtoe to manage more orders than the other agents, Figure 2(b) reveals that Random and Static win more offers. However, Table 2 indicates that Random and Static deliver more orders late and therefore incur more penalties. These results show that the pricing strategy of these agents is less advantageous because they commit to orders which they cannot deliver on time and hence are penalized.
We can also observe in this table that Static gets negative interests. In other words, this agent had to pay the bank for having a negative balance in its bank

Table 1. Statistical comparison of the TicTACtoe agent using different pricing strategies. Column Kt shows the p-value for the Kruskal-Wallis test and column Wilcox. test shows the p-values for the Wilcoxon tests.

Final Result
Agent      Avg            Std            Kt       Wilcox. test (p-values)
                                                  Random   Dummy    Static
L-Tic      17684004.68    3827810.34     0.0000   0.0013   0.0000   0.0000
Random     10157938.28    10313748.01                      0.0016   0.0000
Dummy      4800379.28     3909834.59                                0.0000
Static     -17317410.84   18120669.01

Received Orders
Agent      Avg            Std            Kt       Wilcox. test (p-values)
                                                  Random   Dummy    Static
L-Tic      5453.96        649.66         0.0000   0.0000   0.0000   0.0000
Random     6924.28        593.06                           0.0000   0.0000
Dummy      2676.88        344.13                                    0.0000
Static     7841.80        223.25

Table 2. Results in terms of penalties, interest and factory usage of the TicTACtoe
agent using dierent pricing strategies and the dummy agent

Agent Penalties (US$) Interest (US$) Fact. usage (%)


L-TicTACtoe 412209.08 703824.06 241286.56 93755.58 69.12 8.05
Random 5596709.56 6221754.37 66187.76 248242.79 85.56 6.90
Static 10535156.52 10403092.31 -781983.80 594421.05 93.44 1.53
Dummy 882527.44 1539109.47 37324.28 94207.14 34.56 4.28

account. This indicates that the strategy taken by Static is decient, because it
incurs in negative balances on most of the simulation days. On the other hand,
L-TicTACtoe is the agent that earns more interests from the bank and presents
the lowest variance. This shows that this agent has a more stable behavior in
terms of bank account balances.
Regarding factory utilization, we can see that the agents Random
and Static achieve a higher factory utilization. High factory utilization suggests
proficient management of the productive capacity. However, the penalties obtained
by these agents demonstrate that they are exceeding their production
capacity. L-TicTACtoe does not use the factory as much as these agents,
but it still presents a better solution to this problem because it efficiently serves
a considerable portion of the market.
Finally, through this experiment we can confirm that the strategy used by
L-TicTACtoe improves the global performance of our solution to the TAC SCM
problem. Furthermore, the static and random strategies show poor results as
a consequence of their inability to adapt to new situations. These results indicate
that we have accomplished the goal of applying an evolutionary rule learning
technique inside the sales strategy of a TAC SCM agent.

Supply Chain Management Sales Using XCSR 159

7.2 Experiment 2: Classifiers Blocking


In this experiment, we compare the performance of the TicTACtoe agent with
and without the blocking classifier technique described in Section 6.4. The agent
that does not block classifiers allows classifiers to be erased freely (regardless of
whether they are waiting for a reward), while the other one preserves these
classifiers. In order to keep the agents as similar as possible, both versions of
TicTACtoe used an exploitation rate of 70% and a population size of 1000
classifiers. The goal of this comparison was to determine the impact of this technique.
The results of this experiment will show whether this simplification leads to
information loss when we continue learning without waiting for the rewards.
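A minimal sketch of the blocking idea follows. The class and function names are our own, not the paper's implementation: the point is only that, with blocking enabled, deletion never considers a classifier that is still waiting for its delayed reward.

```python
import random
from dataclasses import dataclass

@dataclass
class Classifier:
    condition: str
    action: int
    fitness: float = 1.0
    awaiting_reward: bool = False   # blocked while an offer's outcome is pending

def delete_one(population, use_blocking=True):
    """Remove one classifier, roulette-style on inverse fitness.

    With blocking enabled, classifiers still waiting for their delayed
    reward are excluded from deletion, so their knowledge is not lost
    before the reward arrives.
    """
    candidates = [c for c in population
                  if not (use_blocking and c.awaiting_reward)]
    if not candidates:          # everything is blocked: delete nothing this step
        return None
    weights = [1.0 / max(c.fitness, 1e-9) for c in candidates]
    victim = random.choices(candidates, weights=weights, k=1)[0]
    population.remove(victim)
    return victim
```

With `use_blocking=False` this reduces to ordinary fitness-proportionate deletion, which is the t-noblock behaviour compared below.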
[Figure: boxplots of (a) Final Result (US$) and (b) Number of Orders for the t-block and t-noblock agents.]

Fig. 3. Comparison of the performance of TicTACtoe with and without the blocking
classifiers technique in terms of final result and received orders

Table 3 shows that t-block (L-TicTACtoe with blocking) receives 312 more
orders than t-noblock (L-TicTACtoe without blocking). This difference is small
and is not strong enough to draw any conclusions about the performance of the
agents, as shown in Table 3. However, Figure 3(a) shows that t-block obtains
a better final result than t-noblock more frequently. According to Table 3, this
difference is not statistically significant at a significance level of 0.05.
However, we could say that t-block behaves better than t-noblock in 94.4% of
the cases. This difference in the final balance is explained by the high penalties
obtained by t-noblock, as shown in Table 4. These penalties indicate that this
agent does not develop an appropriate set of rules to determine the final sale
price for an RFQ. Moreover, t-noblock makes offers at very low prices for orders
that have a high penalty and are very difficult to produce because of the lack
of the required components. When this agent offers products at low prices, it
obtains plenty of orders, but most of them do not represent a profitable portion
of the market considering its penalties.
Moreover, we can observe in Table 4 that t-noblock earns negative interest
from the bank, while t-block earns positive interest. This implies that, on
average, the agent that does not block classifiers incurs debts, while the other
agent maintains a positive balance in its bank account. This factor, in addition
to the penalties, explains why in Figure 3(a) the agent t-noblock ends with less
money than agent t-block.

Table 3. Statistical results from the comparison of both agents using and not using
the blocking classifier technique. The columns W (p-val) show the p-value of the
Wilcoxon test between agents.

                 Final Result                        Received Orders
Agent      Avg          Std          W (p-val)   Avg      Std     W (p-val)
t-block    15206827.28  11654157.53  0.0567      7448.44  278.09  0.3159
t-noblock  7860693.72   14530705.27              7136.20  726.64

Table 4. Results in terms of interest, penalties, component costs and storage costs of
the TicTACtoe agent using and not using the blocking classifiers technique (mean ±
standard deviation)

           Interests (US$)    Penalties (US$)      Comp. costs (%)  Storage (%)
t-block    156981 ± 250208    4791314 ± 6357927    85.08 ± 3.26     1.13 ± 0.20
t-noblock  -35459 ± 408030    8083767 ± 7679835    85.07 ± 4.83     1.28 ± 0.28
To determine the impact of the blocking technique, it is also important to analyze
the experience of the XCSR system in each agent. The experience is a measure
of classifier usage; it indicates how many times a classifier has been used.
In Figure 4, we can observe that the mean experience of the population of
t-block is higher than the mean experience of t-noblock. This pattern occurs
because t-noblock allows erasing classifiers at any time based on incomplete and
inaccurate information. These rules are still waiting for a reward that will determine
whether they performed well. Consequently, classifiers that could lead to good decisions
are erased before the reward arrives, and their knowledge is completely lost.
It is interesting to notice that the ratio between the average experience
and the day of the simulation is approximately 0.05. This means that each
classifier is used during at most 5% of the simulation. Considering that the simulation
has 220 days, this 5% corresponds to 11 days. Our explanation for this behaviour
is that the generated rules are, in fact, detecting different stages during the
simulation, and not all the classifiers are used in the same stages.
The blocking classifier technique increases the global experience of the population
and the probability of survival of possibly good sub-solutions. Nevertheless,
the trade-off of using this mechanism is that the system could also
block bad solutions, and the probability of erasing good rules that have not
been activated gets higher.
The results of this experiment show that agents using the blocking classifiers
technique inside XCSR preserve important information in the classifiers. This
might lead to better performance in environments with single-step tasks and
delayed rewards. As further work, more experimentation will be carried out
to validate these hypotheses and determine further advantages and disadvantages
this mechanism could have.

Fig. 4. Mean experience of the XCSR population during 8800 days (40 simulations)

7.3 Experiment 3: Exploitation Rate

The aim of this experiment is to determine the best exploitation rate, or value
of (1 − ε), for this particular problem. We tested the performance of the agent
using different exploitation rates (0.9, 0.7, 0.5, 0.3) to determine which one is
the most suitable for the problem that we are trying to solve. We also included
two extreme exploitation rates, 0 and 1, for control. Afterwards, we analyse the
two most interesting cases and compare them with the results of their dummy
competitors.⁹ The rest of the parameters of the algorithm and the agent remained
the same. For this experiment we used a population of 1000 individuals and the
blocking mechanism.
The TicTACtoe and dummy agents involved in this experiment will be referred
to as tx and dx respectively, where x stands for the final exploitation rate,
or 1 − ε (see Section 6.4).
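The exploitation rate (1 − ε) can be read as standard ε-greedy action selection. A sketch, with function and parameter names of our own choosing rather than the agent's actual code:

```python
import random

def select_action(prediction_array, exploit_rate, rng=random):
    """Epsilon-greedy selection over an action -> predicted-payoff map.

    With probability `exploit_rate` (= 1 - epsilon) pick the action with
    the highest predicted payoff; otherwise explore with a uniformly
    random action. t70 corresponds to exploit_rate = 0.7, t0 to pure
    exploration, t100 to pure exploitation.
    """
    if rng.random() < exploit_rate:
        return max(prediction_array, key=prediction_array.get)
    return rng.choice(list(prediction_array))
```

The control agents fall out as the two extremes: `exploit_rate=0.0` never consults the predictions, while `exploit_rate=1.0` never explores.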
Figure 5 shows the results according to the main measures of performance: the
final results and the number of received orders. In Figure 5(a) we can observe that
the agents with the smallest final balance in the bank account at the end of the game
are t0, followed by t100. The same behaviour can be observed in Figure 5(b).
This means that constant exploration (t0, always giving the price discount
in a random manner) produces the worst results. On the other hand, pure
exploitation (t100) does not achieve good performance either, because it is
incapable of adapting to new environments. Agents that combine exploitation
and exploration during the whole learning process obtain the best results, due
to the dynamic characteristics of the environment. According to Table 5, we
can say that the agents t0 and t100 are significantly worse than the rest of the
agents, in terms of final result and received orders.
It is worth noticing the curve in these two figures. This suggests that the
exploration rate does, in fact, affect the strategy developed, and that a balance
between exploitation and exploration is necessary to achieve good performance.

⁹ In this experiment we ran our base agent only with dummy competitors, using
different policies each time.

[Figure: boxplots of (a) Final Result (US$) and (b) Number of Orders for agents t0, t10, t30, t50, t70, t90, and t100.]

Fig. 5. Comparison of the performance of TicTACtoe using different exploitation rates
in terms of final result and received orders.
In Figure 5(b) we can see that the agent that serves the most orders is t70.
According to Table 5, there are no significant differences between t30, t50 and
t70 in terms of the final result, but there are differences in terms of the orders.
Moreover, the agent with the highest final result on average turns out to be t30.
This situation is clarified by Table 6, which compares the performance of
these two agents against their dummy competitors. Despite the efforts of t70 to
serve the largest portion of the market, this agent gets plenty of penalties for
late deliveries. Moreover, although the agent t30 does not have as many orders
as t70, this situation helps the agent to fulfil the orders that it already has. In
the end, t30 does not accumulate as many penalties as t70, producing steadier
behaviour (lower variance). We could say that agent t30 is learning how to
handle a number of orders that minimizes the obtained penalty and maximizes
the final revenue.
We can also notice in this table that the implementation of TicTACtoe, no
matter the exploitation rate used (t30 or t70), gets a higher final revenue and
handles a larger portion of the market than the dummy competitors. Furthermore,
there is also a difference in the behaviour of both dummy agents, since the
performance of the agents is relative to the competitors' behaviour. We can
notice that agent t70 makes it more difficult for the competitor d70 to obtain
customers.
Regarding factory usage, it is considered that a good agent uses its factory
capacity as much as possible to complete orders [13]. This helps the agent to
obtain higher revenues at the end of the game. Even though both configurations
t30 and t70 have the same production strategy, t70 makes more use of these
resources than t30. This behaviour is explained by the fact that agent t70 has
more orders to attend to. Consequently, considering this performance measure,
the agent t70 learns a better strategy. Nevertheless, the production and purchase
strategies are still very simple, which makes it harder for this agent to deliver
these orders on time.

Table 5. Statistical results from the comparisons of the agents using different values
for the exploitation rate. Column Kt shows the p-value for the Kruskal-Wallis test and
column Wilcox. test shows the p-values for the Wilcoxon tests.

Agent  Avg       Std       Kt / Wilcox. test p-values (vs. t30, t50, t70)
Final Result
t0     9235641   3638978   0.0000 0.0000 0.0084
t10    14543996  2925682   0.0006 0.0021 0.0593
t30    17684005  3827810   0.5768 0.6581
t50    17486431  5252774   0.0000 0.7437
t70    15206827  11654158
t90    13201893  5666869   0.0000 0.1823 0.0000
t100   10945370  3575705   0.0000 0.0000 0.0000
Received Orders
t0     4226.76   574.72    0.0000 0.0000 0.0000
t10    5301.56   674.03    0.4320 0.0004 0.0000
t30    5453.96   649.66    0.0012 0.0000
t50    6398.00   1374.57   0.0000 0.0002
t70    7448.44   278.09
t90    6284.28   362.91    0.0000 0.0303 0.0000
t100   4586.12   660.45    0.0000 0.0000 0.0000

Table 6. Comparisons between the agents using 30% and 70% exploitation rates in
terms of penalties, factory usage and total income (mean ± standard deviation)

       Penalties (US$)     Factory Usage (%)  Total income (US$)
t30    412209 ± 703824     69.12 ± 8.05       108718063 ± 12611823
d30    882527 ± 1539109    34.56 ± 4.28       59104351 ± 6999544
t70    4791314 ± 6357927   91.16 ± 2.95       142639167 ± 6966361
d70    807530 ± 1301921    29.52 ± 5.23       49645853 ± 8701695
Regarding the total income, we can notice that both TicTACtoe agents
have incomes proportional to the number of received orders. Also, both agents
have higher incomes than their competitors. This shows that the developed
strategies give competitive prices according to the cost of the products and do
not offer the products below the production costs.

8 Conclusion
We designed and implemented a supply chain management agent for the TAC
SCM problem. Our agent solves the production and purchase sub-problems
using static strategies, while it solves the sales sub-problem using a dynamic
strategy.
Moreover, the purchase strategy is based on the acquisition of components
considering production commitments for the next simulation days. The production
strategy is based on manufacturing goods, prioritizing orders according to
their expected profits and due dates.
In addition, we implemented a dynamic sales strategy built on Wilson's XCSR
classifier system. Through the XCSR mechanism, we obtained a suitable set of
rules for the TAC SCM sales problem. This set of rules worked better than the
strategies used for control.
As our initial solution for the TAC SCM sales problem encountered an issue
when handling delayed rewards in a single-step environment, we introduced a
blocking classifier technique. We showed that the use of this technique yields
more experienced populations and improves the quality of the generated strategies
in this scenario. However, more experimentation needs to be carried out
regarding this matter.

References
1. Trading agent competition - TAC SCM game description,
http://www.sics.se/tac/page.php?id=13
2. Benisch, M., Sardinha, A., Andrews, J., Sadeh, N.: CMieux: adaptive strategies
for competitive supply chain trading. In: ICEC 2006: Proceedings of the 8th International
Conference on Electronic Commerce, pp. 47–58. ACM Press, New York (2006)
3. Bull, L.: Applications of Learning Classifier Systems. Springer, Heidelberg (2004)
4. Butz, M.: IlliGAL Java-XCS - LCS Web (2006)
5. Butz, M.V., Wilson, S.W.: An algorithmic description of XCS. In: Lanzi, P.L.,
Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2000. LNCS (LNAI), vol. 1996, p.
253. Springer, Heidelberg (2001)
6. Collins, J., Arunachalam, R., Sadeh, N., Eriksson, J., Finne, N., Janson, S.: The
Supply Chain Management Game for the 2007 Trading Agent Competition,
Pittsburgh, Pennsylvania (2006)
7. Conover, W.J.: Practical Nonparametric Statistics. John Wiley & Sons, Chichester
(December 1998)
8. Franco, M., Gorrín, C.: Diseño e implementación de un agente de corretaje en una
cadena de suministros en un ambiente simulado. Universidad Simón Bolívar (2007)
9. Holland, J.H.: Adaptation. In: Rosen, R., Snell, F.M. (eds.) Progress in Theoretical
Biology IV, pp. 263–293. Academic Press, New York (1976)
10. Lanzi, P.: Learning classifier systems: then and now. Evolutionary Intelligence 1(1),
63–82 (2008)
11. Pardoe, D., Stone, P.: Bidding for customer orders in TAC SCM. In: Faratin, P.,
Rodríguez-Aguilar, J.-A. (eds.) AMEC 2004. LNCS (LNAI), vol. 3435, pp. 143–157.
Springer, Heidelberg (2006)
12. Pardoe, D., Stone, P.: An autonomous agent for supply chain management. In: Adomavicius,
G., Gupta, A. (eds.) Handbooks in Information Systems Series: Business
Computing, vol. 3, pp. 141–172. Emerald Group (2009)
13. Stan, M., Stan, B., Florea, A.M.: A dynamic strategy agent for supply chain management.
In: Proceedings of the Eighth International Symposium on Symbolic and
Numeric Algorithms for Scientific Computing, pp. 227–232. IEEE Computer Society,
Los Alamitos (2006)
14. Wilson, S.W.: Classifier fitness based on accuracy. Evolutionary Computation 3(2),
149–175 (1995)
15. Wilson, S.W.: Get real! XCS with Continuous-Valued inputs. In: Lanzi, P.L., Stolzmann,
W., Wilson, S.W. (eds.) IWLCS 1999. LNCS (LNAI), vol. 1813, p. 209.
Springer, Heidelberg (2000)
16. Wilson, S.W.: Mining oblique data with XCS. In: Lanzi, P.L., Stolzmann, W.,
Wilson, S.W. (eds.) IWLCS 2000. LNCS (LNAI), vol. 1996, pp. 158–174. Springer,
Heidelberg (2001)
17. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques.
Morgan Kaufmann, San Francisco (2005)
Identifying Trade Entry and Exit Timing Using
Mathematical Technical Indicators in XCS

Richard Preen

Department of Computer Science
University of the West of England
Bristol, BS16 1QY, UK
richard.preen@live.uwe.ac.uk

Abstract. This paper extends current LCS research into financial time series
forecasting by analysing the performance of agents utilising mathematical
technical indicators for both environment classification and in selecting actions to
be executed. It compares these agents with traditional models which only use
such indicators to classify the environment and exit at the close of the next day.
It is proposed that XCS agents utilising mathematical technical indicators for
exit conditions will not only outperform similar agents which close the trade at
the end of the next day, but also result in fewer trades and consequently lower
commissions paid. The results show that in four of five assets, agents using
indicator exit conditions outperformed those exiting at the close of the next day,
before commissions were factored in. After commissions are factored in, the
performance gap between the two agent classes widens further.

Keywords: Computational Finance, Learning Classifier Systems, XCS.

1 Introduction

The primary objective of this paper is to extend the current research into the use of the
XCS Learning Classifier System [28] within the domain of financial time series
forecasting. Recent work (e.g., [9], [21], [13], and [24]) has demonstrated the successful
application of XCS in this area. However, in each of the studies, agents are trained on
daily price data to evolve trade entry rules composed of mathematical technical
indicators in conjunction with a fixed rule to close the trade the following day, i.e., the
exit timing is not evolved. It is posited that by utilising mathematical technical
indicators to identify the timing of the market exit, as opposed to simply exiting on the next
day, not only are the associated transaction costs reduced, but the excess returns are
increased due to an inherent noise reduction from requiring less prediction accuracy.
Initially, several XCS agents are produced to replicate the traditional model and
demonstrate their application to financial time series forecasting. In extending this
work, the agents additionally evolve mathematical technical indicators to identify
appropriate exit conditions. These two models are then compared, and the agents are
furthermore benchmarked against a buy-and-hold strategy to evaluate whether
market-beating excess returns can be generated.

J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 166–184, 2010.
Springer-Verlag Berlin Heidelberg 2010

Brock, Lakonishok and LeBaron [4] investigated two of the most popular trading
rules from technical analysis (moving averages and trading range breakout) on the
Dow Jones Industrial Average over the period 1897-1986. They generated typical
returns of 0.8% over a 10-day period, compared to a normal 10-day upward drift of
0.17%. After the buy signals were generated, the market increased at a rate of 12%
per year. Following the sell signals, a decrease of 7% per year was noted. Subsequently,
Detry and Grégoire [6] successfully replicated the results for the moving
average tests on a series of formally selected European indexes. Moreover, technical
analysis has been shown useful in the foreign exchange markets by Dooley and
Schaffer [8], Sweeney [25], Levich and Thomas [12], Neely et al. [15], Dewachter
[7], Okunev and White [16], and Olson [17].
The primary benefit from the use of mathematical technical indicators in financial
time series forecasting is that the algorithms are precisely defined. This means that the
signals they produce are free from errors of subjective human judgement and emotion,
are replicable, and can easily be tested over large amounts of data and varying assets
to quantify performance. Learning Classifier Systems (LCS) [10] can easily co-evolve
different combinations of these indicators to form entry/exit rules for financial trading,
and even evolve the technical indicators themselves.

2 Related Work

There has been widespread research on Artificial Neural Networks (ANN) and
Genetic Programming (GP) for financial time series forecasting. GP examples include
Neely et al. [15], Allen and Karjalainen [1], and Chen [5]. Examples of ANN
forecasting of financial time series include Tsibouris and Zeidenberg [25], Steiner and
Wittkemper [23], Kalyvas [11], and Srinivasa, Venugopal and Patnaik [22]. In contrast
to ANN and GP, comparatively little research has been conducted into the use of
LCS for financial time series forecasting. Early examples of LCS research in this area
include Beltrametti et al. [2] using LCS to predict currencies, and Mahfoud and Mani
[14] and Schulenburg and Ross ([18], [19], and [20]) predicting stocks.
More recently, Stone and Bull [24] created a single-step ZCS [27] agent to forecast
long or short positions on the Foreign Exchange (FX) market, trading with the full
amount of the balance each time. The architecture was modified by utilising the
NewBoole update mechanism, tweaking the covering algorithm, and introducing a
new specialize operator. The agent was required to always be in the market. Daily
price and interest rate data were used, covering the period of January 1974 to October
1995 for the U.S. Dollar (USD), German Deutsche Mark (DEM), British Pound
(GBP), Japanese Yen (JPY), and Swiss Franc (CHF). These were then used to create
the currency pairs USD-GBP, USD-DEM, USD-CHF, USD-JPY, DEM-JPY, and
GBP-CHF.
The mathematical technical indicators used were based on four primitive functions
of the time series, which could return either the average price over a specified period,
the minimum price over a specified period, the maximum price over a specified
period, or the price at a specified day. ZCS was used to generate the indicators, where an
indicator is a ratio of two of the primitive functions. For example, a log indicator:

lag(4)/max(10) with a range [0.032, 0.457] and an action of 1 translates to go long if
the price 4 days ago is between 1.033 and 1.579 times the maximum price over the
past 10 days (since exp(0.032) ≈ 1.033 and exp(0.457) ≈ 1.579). The genetic search
took place on the range and historical period parameters. Crossover was applied by
switching the period parameters. For example, crossing two initial indicators,
lag(8)/max(22) and min(12)/avg(40), yields the two indicators lag(12)/max(40) and
min(8)/avg(22). Mutation was then used to modify the range and period parameters
in the normal way. An 8-bit encoding was used, which limited parameters to the range
[0, 255]. The reward given was based on the additional return of the next day's price
over any interest potentially accrued on the margin. Commission was set at 2.5 basis
points, so an action taken could be correct even though it produced a negative return.
Therefore, a fixed reward of 1000 was given only for actions resulting in positive
returns.
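The primitive functions and the log-indicator match test can be sketched as follows. This is a paraphrase for illustration, not Stone and Bull's code, and the names (`PRIMITIVES`, `log_indicator`) are our own; the example evaluates the lag(4)/max(10) rule above, whose log range [0.032, 0.457] corresponds to price ratios of roughly 1.033 to 1.579.

```python
import math

# Primitive functions over a daily price series (prices[-1] is today).
PRIMITIVES = {
    "lag": lambda p, n: p[-1 - n],            # price n days ago
    "avg": lambda p, n: sum(p[-n:]) / n,      # average over the past n days
    "min": lambda p, n: min(p[-n:]),          # minimum over the past n days
    "max": lambda p, n: max(p[-n:]),          # maximum over the past n days
}

def log_indicator(prices, f1, n1, f2, n2, lo, hi):
    """True when log(f1(n1) / f2(n2)) falls inside the evolved range [lo, hi]."""
    ratio = math.log(PRIMITIVES[f1](prices, n1) / PRIMITIVES[f2](prices, n2))
    return lo <= ratio <= hi

# The lag(4)/max(10) example with range [0.032, 0.457]:
prices = [100, 101, 103, 102, 104, 105, 107, 106, 108, 110, 109]
signal = log_indicator(prices, "lag", 4, "max", 10, 0.032, 0.457)
```

The genetic search then operates on `(f1, n1, f2, n2, lo, hi)`: crossover swaps the period parameters between two such indicators, and mutation perturbs the range and periods.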
The ZCS agent produced Annual Percent Rate (APR) excess returns on 5 out of 6
currency pairs. Additionally, the number of ZCS runs with positive excess returns
correlated well with the mean excess return achieved. However, while it was found
that it is possible to achieve excess returns with ZCS, the performance (using the
derived mathematical technical indicators) was not as good as a Genetic
Programming benchmark. The most likely reason for this was the ZCS agent's high trading
frequency and associated costs. It is suggested that the cause is the single-step
model, and that using a multi-step model could reduce the frequency. However, it
seems more likely that the cause is the requirement that the agent always have a
presence in the market. The rationale for this is unclear, particularly since a major
advantage private traders have over institutions is being able to stay out of the market
until the exact moment when a high-probability opportunity occurs. Moreover, the
technical indicators used were extremely primitive. If the indicators had tougher
constraints for providing an entry signal, this could easily have been used to reduce
the trading frequency and perhaps provide superior performance.
Gershoff [9] investigated the use of a hierarchical configuration of XCS agents
(HXCS). Here, agents would take the inputs from technical indicators and attempt to
learn profitable rules to trade the market data provided. The three mathematical
technical indicators used were Rate of Change (ROC), Simple Moving Average (SMA)
and Relative Strength Index (RSI). The HXCS comprised four Micro Agents: an
RSI Agent, a Volume Agent, a Random Agent, and a Constant Agent.
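The three indicators named above have standard textbook definitions; a minimal sketch in our own formulation (the paper does not give its exact parameterisation):

```python
def sma(prices, n):
    """Simple Moving Average over the last n closes."""
    return sum(prices[-n:]) / n

def roc(prices, n):
    """Rate of Change: percent change versus the close n days ago."""
    return 100.0 * (prices[-1] - prices[-1 - n]) / prices[-1 - n]

def rsi(prices, n=14):
    """Relative Strength Index computed over the last n daily changes."""
    deltas = [b - a for a, b in zip(prices[-n - 1:-1], prices[-n:])]
    gains = sum(d for d in deltas if d > 0)
    losses = -sum(d for d in deltas if d < 0)
    if losses == 0:          # all gains: maximally overbought reading
        return 100.0
    rs = gains / losses
    return 100.0 - 100.0 / (1.0 + rs)
```

Each Micro Agent observes such indicator values as its environment state and emits a buy/sell signal from its evolved rules.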
After viewing a state, each Micro Agent produced an action signal (0 or 1, to
buy or sell). The signals were then sent to an Aggregate Agent and treated as votes
for the action. The signal that received the majority of the vote was then designated as
the aggregate signal for the collection of the Micro Agents. By receiving the votes,
the Aggregate Agent could deduce the competitiveness of the vote for each action
(i.e., the confidence value). The confidence value was then used either to simply
select the action, or as an indication to a Meta Agent of whether the aggregate signal won
the vote with more than a specified threshold. The Meta Agent received the set of
majority signals and confidence indicators, and produced an action signal to execute
in the environment.
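The vote aggregation can be sketched as follows; the structure and threshold value are ours for illustration, not Gershoff's code:

```python
from collections import Counter

def aggregate(signals, threshold=0.75):
    """Majority-vote the Micro Agents' buy(1)/sell(0) signals.

    Returns the winning signal, its confidence (the winner's share of
    the vote), and whether that confidence clears the Meta Agent's
    threshold for a decisive vote.
    """
    votes = Counter(signals)
    signal, count = votes.most_common(1)[0]
    confidence = count / len(signals)
    return signal, confidence, confidence >= threshold
```

For example, three buy votes against one sell vote yield the aggregate signal 1 with confidence 0.75, while a split vote produces a low-confidence result the Meta Agent can discount.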

The payoff (or feedback) given to the agents for executing a particular action was
decided based upon whether the following day's price closed above or below the
current day's. A payoff of zero was awarded for executing a wrong action and a
constant non-zero value was awarded for executing a correct action. The agents were
assessed using daily price data for IBM, EXXON, Ford, CitiGroup, Coca-Cola, and
Banco Santander Central Hispano. The training period ran from January 1990 to
December 2003 and then an evaluation phase took place on data from January 2005 to
June 2006.
The results found that the Meta Agents usually outperformed the individual
technical agents and that the Micro Agents could not outperform both the buy-and-hold and
bank strategies. Further, the Meta Agent always outperformed the Random
Agent. However, in terms of accuracy, the Meta Agents performed the same as or worse
than the Micro Agents. In summary, the major finding of this model was that a
Hierarchical XCS using multiple agents can produce better results than a single-agent
XCS. The fact that the Meta Agents always outperformed the Random Agent
also illustrates that the system is capable of learning useful rules, even though in this
case they were not able to outperform the relevant real-world benchmarks.
Schulenburg and Wong [21] explored portfolio allocation using a HXCS. Agents
received inputs from technical indicators and attempted to learn profitable rules to
trade the market data provided. In addition to a Technical Analysis (TA) Agent, a
Market (Mkt) Agent and an Options Agent were created to provide further information
to the decision-making process.
The TA Agent incorporated rules based upon inputs from the following four
mathematical technical indicators: Rate of Change (ROC), Relative Strength Index
(RSI), Ultimate Oscillator (ULTOSC), and On Balance Volume (OBV). The Mkt
Agent integrated rules from the following four general market indicators: the daily
percent return of the S&P500 Index, the daily S&P500 Index volume, the daily 10-year
T-note bond yield, and the daily 3-month T-bill bond yield. The Options Agent
included rules from the following five Options market indicators: Delta (i.e., the
measurement of the sensitivity of an Option value to the underlying stock price), Gamma
(i.e., the measurement of the second-order sensitivity of the Option value to the
underlying stock price), Vega (i.e., the measurement of the sensitivity of Option value to
the stock price volatility), Theta (i.e., the measurement of the sensitivity of Option
value to the passage of time), and implied volatility (i.e., the stock volatility estimate
given by the Black-Scholes formula).
The daily stock data tested was for CitiGroup, IBM, General Motors, Eastman
Kodak, and Exxon Mobil over the period 4th January 1996 to 28th April 2006. A
commission fee of 0.5% of the transaction value was set. In contrast to Gershoff's HXCS,
the agents attempted to predict the price movement of tomorrow's stock price and the
percentage of total wealth to invest, instead of just buy or sell signals. The agents
were given the choice between investing in the risky stock and investing in safe
treasury bills, which returned a variable interest rate based on real-world values.
The input data from the indicators was first divided into nine discrete cut points by
using leave-one-out cross-validation. The target series then underwent two phases of
discretization. The first phase quantized the data using the unsupervised method of
histogram equalisation in order to add class label information to the target series.
Subsequently, the supervised method of entropy-based discretization was used to split
the series into intervals in order to maximise the information gain. Once quantization
had been completed, a binary vector was mapped to the intervals so that it could be
used by an XCS agent.
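The quantization-to-bitstring step can be illustrated with simple quantile cut points. This is a deliberate simplification with names of our own choosing: the paper's first phase is histogram equalisation (for which equally populated bins are a stand-in) followed by entropy-based splitting, which is omitted here.

```python
def quantile_cuts(values, n_bins):
    """Cut points splitting `values` into n_bins equally populated bins
    (a stand-in for histogram equalisation)."""
    s = sorted(values)
    return [s[len(s) * i // n_bins] for i in range(1, n_bins)]

def to_bits(x, cuts):
    """Thermometer encoding: bit i is 1 if x exceeds cut point i, giving
    a binary vector an XCS condition can match against."""
    return [1 if x > c else 0 for c in cuts]
```

A value in the third of four bins, for instance, maps to the vector [1, 1, 0], which the classifier conditions then match with the usual ternary alphabet.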
Next, the cumulative performance of the Meta Agent was evaluated. If its prediction
accuracy was less than the specified threshold value, all agents (including the
Meta Agent itself) were destroyed and a new set of agents with a new discretization
process was launched. The new set of cut points was based on the preceding ten
days. All new agents then started their training phases by exploring the new training
environment. After completing training they were placed back into the real-world
environment.
The best results of the agents were compared against four benchmarks: buy and
hold, bank, price trend, and a Random Agent. In the case of CitiGroup, all of the XCS
agents outperformed all four of the benchmark agents. Moreover, in all five stocks, all
XCS agents outperformed the Random Agents. The authors suggest that there is a
mere 0.00003% probability that this occurred by chance and that it provides solid
proof that stock prices have a rational component.
Further, the XCS agents discovered a famous 1960s trading rule¹. This last dis-
covery highlights one of the major benefits to using XCS (as opposed to other alterna-
tives such as an ANN) to forecast financial time series. The ability to have the rules in
an easily human readable form enables the researcher to evaluate the logic of any
discovered rule and decide whether it makes sense. This is important because if the
rule does not make any logical sense to a trader then it is quite possible that the rule
has been derived from over-fitting the data and its use in the future is questionable.
Interestingly, in contrast to Gershoff's findings, the Meta Agents here did not per-
form very well in comparison to the single agents. In 3 of the 5 stocks, the Meta
Agents underperformed all three of the single agents. If we are to use the best results
as indicative of performance (as suggested by [21]), this provides mixed information
on the effectiveness of HXCS as opposed to standard XCS agents.
Liu and Nagao [13] conducted a further study of the application of HXCS to
financial time series forecasting. Here performance was evaluated on the prediction
accuracy of the next day's direction. Two Meta Agents were used, with their binary
perceptions set solely according to comparisons between various moving averages.
The moving averages used were of the form MA(t, m), where the average is
calculated from time t back to time t−m. Agent1 consisted of a bitstring of length 24
where each bit was set according to the evaluation of 24 pairs of successive moving
averages with an interval length of 20. Agent2 consisted of a bitstring of length 18
where the first 6 moving averages used an interval length of 10 and a further 12
moving averages used an interval length of 5; e.g., bit18 is set to logical 1 if
MA(t−4, 5) < MA(t, 5). Furthermore, a fuzzy matching mechanism was used whereby
a classifier is said to have matched the environment state even if up to 10% of the
bits are non-matching. For each
environment state received by the HXCS, each Meta Agent receives the input, con-
structs a match set, and then calculates an average prediction value for the set. The
agent with the highest average match set prediction value is then chosen to advocate
¹ If the ultimate oscillator is greater than 70, and the previous stock price change is within 2 to
3%, then tomorrow's stock price change will be −2.5 to −3.5%.
Identifying Trade Entry and Exit Timing Using Mathematical Technical Indicators 171
an action in the normal XCS procedure and parameters are updated for the action set
of that Meta Agent.
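The fuzzy matching step can be sketched as follows. This is a minimal illustration, not code from [13]: it assumes ternary bitstring conditions with '#' as the don't-care symbol and applies the 10% tolerance described above.

```python
def fuzzy_match(condition, state, tolerance=0.1):
    """Return True if the ternary condition matches the binary state,
    allowing up to `tolerance` of the bits to mismatch."""
    mismatches = sum(1 for c, s in zip(condition, state)
                     if c != '#' and c != s)
    return mismatches <= tolerance * len(state)

# With an 18-bit state, 10% tolerance permits a single mismatching bit.
state = '101010101010101010'
cond  = '1#101010101010101#'   # '#' = don't care
print(fuzzy_match(cond, state))  # True
```

A classifier one bit away from the state still matches here; two or more mismatches (above 1.8 bits) cause the match to fail.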
Experiments were run on four indexes (NIKKEI, NASDAQ, TOPIX, HSI) and 11
other stocks selected from the NIKKEI using daily closing price data from January
2000 to December 2004. The direction hit-rate of both Meta Agents always provided
superior performance to a trend-following strategy that predicted the direction of the
next day based on the change from the previous day. In addition, HXCS outperformed
the Meta Agents by 2-3%. For example, the trend following strategy correctly pre-
dicted the direction 56% of the time for the NASDAQ, whereas Agent1 was correct
66.9% of the time, Agent2 70.8%, and HXCS 73.8%.

3 Learning Framework

Perhaps the biggest limitation shared by [9], [21], [13], and [24] is that
they all attempt to use daily data to forecast the next day's price. Since the accuracy
of agents' predictions depends largely on how well the problem is represented [21],
we should adopt an approach that mimics how real trading is conducted as closely as
possible.
Figure 1 shows the daily price chart of the EURUSD currency pair with the vertical
dotted line in the centre marking August 15th 2007. At the close of this day, the
Relative Strength Index (RSI) indicator set to 14 periods (i.e., calculating the RSI over
the previous fourteen daily open, high, low, close bars), RSI(14), produces a value of
31.2109. For Agent 1 in [9], this value would set bit6 (RSI(14) ≤ 35) to 1. On the
following day the price closed lower (at 1.3426) than its open (of 1.3442). Supposing
that the agent had identified this rule as part of a buy signal, the trade would have
resulted in a loss under the model and negative feedback would have been given.

Fig. 1. EURUSD Daily Price Chart 01.08.2007 – 31.08.2007


However, if we look at the bigger picture in Figure 2 we can see that in fact this
would have been an excellent place to enter the market. In Figure 2, the vertical dot-
ted line highlights the same day as in Figure 1 but illustrates that, in the bigger pic-
ture, the EUR continued to climb in value against the USD during the subsequent
months following the RSI signal. Clearly, this method of evaluating the model and
providing feedback is far too short-sighted and demands far too much accuracy.
Real traders utilise Stop Losses (SL): triggers set a certain distance from the entry
price that exit the market at a loss. This buffer exists in part because markets are
infamous for swaying noisily whilst actually moving towards a logical target (the
'drunken man' analogy). Furthermore, most real traders would never attempt to
predict the closing price of the next bar (e.g., the next day when using daily data)
because that demands far too much accuracy of a widely acknowledged noisy system.
They would simply exit the market at their SL, or attempt to exit the market in profit
at some multiple of the initial risk (i.e., the SL distance). Through such a method,
successful traders can lose half, or more, of their trades whilst still finishing profitably.
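The stop-loss/take-profit arithmetic can be made concrete with a small sketch. The helper below is illustrative only (not taken from any of the systems reviewed): it walks forward through subsequent prices for a long trade and exits at the stop loss or at a take-profit set at a multiple of the initial risk.

```python
def trade_outcome(entry, stop_loss, reward_risk=2.0, prices=()):
    """Profit (or loss) per share of a long trade: exit at the stop
    loss, or at a take-profit `reward_risk` times the initial risk."""
    risk = entry - stop_loss
    take_profit = entry + reward_risk * risk
    for p in prices:
        if p <= stop_loss:
            return stop_loss - entry    # lose one unit of risk
        if p >= take_profit:
            return take_profit - entry  # gain `reward_risk` units
    return prices[-1] - entry if prices else 0.0

# Risking $5 to make $10: the trade survives the noisy dips
# and exits at the take-profit level.
print(trade_outcome(100.0, 95.0, 2.0, [98, 96, 99, 104, 110]))  # 10.0
```

With a 2:1 reward-to-risk ratio, a trader wins $10 per winning trade and loses $5 per losing trade, which is why a win rate below 50% can still be profitable.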

Fig. 2. EURUSD Daily Price Chart 01.08.2007 – 30.11.2007

If the models are intended to replicate real traders, we must adopt a more real-
world approach. Such an approach must seek to avoid pre-specifying the exact bar to
exit the trade and provide feedback. One approach commonly used in real trading is to
define the exit conditions in terms of fixed price levels. For example, if the agent
discovered a buy signal, the SL is set $5 below the entry price, and a Take Profit (TP)
(i.e., a price level at which a trade is considered a winner and profit is taken) is set
$10 above the entry price. A more sophisticated technique would be to test
combinations of SL and TP to find the optimal pair in addition to the entry signal.
However, this might easily lead towards curve-fitting the model too specifically to
the training set.
Perhaps the most widely used method to identify when to exit a trade is the same as
that used to enter the trade in the first place: technical analysis. For example, if a rule
to buy an asset is 'if RSI(14) < 30 then buy', then a suitable exit rule might be 'if
RSI(14) > 70 then exit'. This risks complicating the model and exponentially increas-
ing the search space, but it is the only way to provide a real-world measurement of
success.
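As a toy illustration of such a paired entry/exit rule (the RSI series is assumed to be precomputed, and the thresholds follow the example above; this is not the agents' actual mechanism):

```python
def rsi_rule_trades(rsi_series, prices, entry_th=30, exit_th=70):
    """Apply 'buy when RSI < entry_th, exit when RSI > exit_th' to a
    price series; return the profit of each completed round trip."""
    profits, entry_price = [], None
    for rsi, price in zip(rsi_series, prices):
        if entry_price is None and rsi < entry_th:
            entry_price = price                  # enter long
        elif entry_price is not None and rsi > exit_th:
            profits.append(price - entry_price)  # exit long
            entry_price = None
    return profits

print(rsi_rule_trades([50, 25, 40, 75, 60], [10, 9, 11, 14, 13]))  # [5]
```

The trade opens at 9 (RSI fell below 30) and closes at 14 (RSI rose above 70), so success is measured by the realised round-trip profit rather than by the next bar's close.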

4 Implementation

4.1 Data

The data used is the daily price/volume information over the period of February 3rd
1992 to December 14th 2007 for Exxon Mobil Corp. (XOM) (Figure 3.a), the Dow
Jones Industrial Average (DJI) (Figure 3.b), General Motors Corporation (GM)
(Figure 3.c), and Intel Corp. (INTC) (Figure 3.d). In addition, data over the period of
December 26th 1991 to December 14th 2007 is used for 30-Year Treasury Bonds
(TYX) (Figure 3.e). They were chosen to include one index (DJI), two ranging assets
(GM and INTEL), one falling asset (TYX), and one increasing asset (XOM).
Moreover, the assets represent diverse market sectors: automobiles, technology,
bonds, oil, and an index average. For DJI, the adjusted closing price is divided by
1000 to enable the agents to purchase shares with a balance of $10,000 or less. In all
cases, 4000 data points (i.e., days) are used. The first 3000 data points form a training
set used to evolve new rules and the most recent 1000 data points are used as a trading
set to evaluate these rules.

4.2 XCS

The traditional ternary representation is used, where the environment inputs are dis-
cretized as outlined in the following sections. A fixed reward of 1000 is given to prof-
itable actions and 0 to actions which result in no profit or a loss. XCS parameters used
are as follows (taken from [3] and not further optimised so as not to bias the results
used to compare the models): α=1, β=0.2, δ=0.1, θGA=25, θdel=20, θsub=20, P#=0.6,
ν=5, χ=0.8, ε0=10, μ=0.04. Each agent is shown the training set only once before
being evaluated on the trading set. The alternation between exploring and exploiting
rules is modified as in [21] to:

(1)

Running the equation above over 1000 iterations (i.e., the length of the trading set)
produced a range of 896 to 932 exploit steps being executed. Thus, over 1000 itera-
tions, exploits are conducted approximately 89.6 - 93.2% of the time. This produces
an increasing bias towards exploiting the knowledge acquired as the rules become
more evolved, which is important since the system will perform a single pass through
the data.
Fig. 3. Daily Adjusted Closing Price data used in experimentation: (a) XOM, (b) DJI,
(c) GM, (d) INTEL, (e) TYX


4.3 Agent 1 - Entries

Agent 1 utilises three stochastic indicators with the periods (8, 3, 3), (32, 12, 12), and
(128, 48, 48). The (8, 3, 3) was chosen simply because it is the most commonly used
configuration, then the two subsequent combinations are each four times greater,
thereby providing a short-term trend, intermediate-term trend, and long-term trend.
The direction of the stochastic indicators and their position (i.e., the value between 0
and 100) is used to classify the environment. The signal line was used for the (8,3,3)
parameters to smooth the line to reduce noise whereas the (32,12,12) and (128,48,48)
main lines are already sufficiently smoothed.
The real-numbered indicators are discretized through a simple mechanism. A 9-bit
binary string is composed where the first two bits classify the (8,3,3) signal line's
position, the third and fourth bits classify the (32,12,12) main line's position, and the
fifth and sixth bits classify the (128,48,48) main line's position. The indicator-to-
binary encoding for each indicator's position is summarised below in Figure 4.

Indicator value   Binary
0–24              00
25–49             01
50–74             10
75–100            11

Fig. 4. Indicator Value to Binary Encoding

Lastly, three bits are used to classify the direction of each of the stochastic lines as in
Figure 5.

Bit7 = 1 if Stochastic (8,3,3) current signal line > Stochastic (8,3,3) previous signal line, else 0
Bit8 = 1 if Stochastic (32,12,12) current main line > Stochastic (32,12,12) previous main line, else 0
Bit9 = 1 if Stochastic (128,48,48) current main line > Stochastic (128,48,48) previous main line, else 0

Fig. 5. Agent 1 Encoding
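A minimal sketch of this 9-bit encoding follows. The stochastic line values are assumed to be already computed; the function and argument names are illustrative, not from the paper's implementation.

```python
def quantize(value):
    """Map an oscillator position in [0, 100] to the 2-bit code of Fig. 4."""
    if value < 25:
        return '00'
    if value < 50:
        return '01'
    if value < 75:
        return '10'
    return '11'

def agent1_state(sig8, main32, main128,
                 prev_sig8, prev_main32, prev_main128):
    """9-bit environment string: three 2-bit positions (Fig. 4),
    then three direction bits (1 = line rising since yesterday)."""
    bits = quantize(sig8) + quantize(main32) + quantize(main128)
    bits += '1' if sig8 > prev_sig8 else '0'
    bits += '1' if main32 > prev_main32 else '0'
    bits += '1' if main128 > prev_main128 else '0'
    return bits

print(agent1_state(31.2, 55.0, 80.0, 28.0, 60.0, 75.0))  # '011011101'
```

In the example, the (8,3,3) signal line sits in the 25–49 band and is rising, the (32,12,12) main line sits in the 50–74 band and is falling, and the (128,48,48) main line sits in the 75–100 band and is rising.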

4.4 Agent 2 - Entries

The second agent is a trend-following agent comprised mostly of Exponential Moving
Averages (EMA). A 20, 50 and 100 period EMA is constructed. Each EMA's direction
(i.e., rising or falling) and the position of the current price relative to the EMA (i.e.,
above or below) is used to classify the environment. In addition, the direction of the
Moving Average Convergence Divergence (MACD) (12, 26, 9) main line, and the
direction of the Stochastic (32, 12, 12) main line are used to provide additional trend
information. The encoding is summarised below in Figure 6.
Bit1 = 1 if EMA (20) current > EMA (20) previous, else 0
Bit2 = 1 if EMA (50) current > EMA (50) previous, else 0
Bit3 = 1 if EMA (100) current > EMA (100) previous, else 0
Bit4 = 1 if price current > EMA (20) current, else 0
Bit5 = 1 if price current > EMA (50) current, else 0
Bit6 = 1 if price current > EMA (100) current, else 0
Bit7 = 1 if Stochastic (32,12,12) current main line > Stochastic (32,12,12) previous main line, else 0
Bit8 = 1 if MACD (12,26,9) current main line > MACD (12,26,9) previous main line, else 0

Fig. 6. Agent 2 Encoding
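The 8-bit encoding of Figure 6 might be sketched as follows, with all indicator values assumed precomputed (names are illustrative):

```python
def agent2_state(ema, prev_ema, price, stoch, prev_stoch, macd, prev_macd):
    """8-bit trend-following state of Fig. 6. `ema`/`prev_ema` are dicts
    keyed by period (20, 50, 100); other arguments are current/previous
    indicator values."""
    bits = ''
    for n in (20, 50, 100):                      # Bits 1-3: EMA rising?
        bits += '1' if ema[n] > prev_ema[n] else '0'
    for n in (20, 50, 100):                      # Bits 4-6: price above EMA?
        bits += '1' if price > ema[n] else '0'
    bits += '1' if stoch > prev_stoch else '0'   # Bit 7: stochastic rising?
    bits += '1' if macd > prev_macd else '0'     # Bit 8: MACD rising?
    return bits

ema = {20: 101.0, 50: 99.0, 100: 97.0}
prev = {20: 100.5, 50: 99.5, 100: 96.0}
print(agent2_state(ema, prev, price=102.0, stoch=60.0, prev_stoch=55.0,
                   macd=0.4, prev_macd=0.6))  # '10111110'
```

Here the 20- and 100-period EMAs are rising, the price sits above all three EMAs, the stochastic is rising, and the MACD is falling.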

4.5 Agent 3 - Entries


Agent 3 is the first Agent (Tt1) from [18]. The agent consists of comparisons between
the current price and the previous price, a series of Simple Moving Averages (SMA),
and the highest and lowest prices observed. The environment bit string consists of 7
binary digits and is encoded as follows in Figure 7.

Bit1 = 1 if price current > price previous, else 0
Bit2 = 1 if price current > 1.2 × SMA(5), else 0
Bit3 = 1 if price current > 1.1 × SMA(10), else 0
Bit4 = 1 if price current > 1.05 × SMA(20), else 0
Bit5 = 1 if price current > 1.025 × SMA(30), else 0
Bit6 = 1 if price current > highest price, else 0
Bit7 = 1 if price current < lowest price, else 0

Fig. 7. Agent 3 Encoding
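Similarly, a sketch of Agent 3's 7-bit encoding (the SMA values and the highest and lowest observed prices are assumed given; names are illustrative):

```python
def agent3_state(price, prev_price, sma, highest, lowest):
    """7-bit state of Fig. 7; `sma` is a dict of simple moving
    averages keyed by period."""
    thresholds = [(5, 1.2), (10, 1.1), (20, 1.05), (30, 1.025)]
    bits = '1' if price > prev_price else '0'        # Bit 1
    for n, factor in thresholds:                     # Bits 2-5
        bits += '1' if price > factor * sma[n] else '0'
    bits += '1' if price > highest else '0'          # Bit 6: new high?
    bits += '1' if price < lowest else '0'           # Bit 7: new low?
    return bits

sma = {5: 10.0, 10: 10.0, 20: 10.0, 30: 10.0}
print(agent3_state(price=11.0, prev_price=10.5, sma=sma,
                   highest=12.0, lowest=9.0))  # '1001100'
```

The scaling factors shrink with the averaging window, so each bit asks whether the price has stretched unusually far above its own recent trend.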

4.6 Agent Exits

There are three sets of exit conditions for each agent. First, there is the traditional
model where the next day is used as the only exit condition, meaning that any trade
entered today is exited at tomorrow's closing price. In addition, there are two sets of
technical indicator exit conditions: a simple set with only 4 exit conditions (see
Figure 8) and a more advanced set comprising 16 exit conditions (see Figure 9). To
keep the current study simple, the agents were only allowed to buy or hold, with
selling not permitted. In both the 4 and 16 exit sets, one of the actions causes the
agent to move to the next day without trading (i.e., holds for one day), where reward
is given if the price remained unchanged or decreased.

The executable actions in the set of four:


1. Do not enter any trades today (i.e., hold for one day.)
2. Buy today and exit when MACD (12,26,9) decreases.
3. Buy today and exit when EMA (20) decreases.
4. Buy today and exit when both MACD (12,26,9) and EMA (20) decrease.
Fig. 8. Four Technical Exit Conditions
This is implemented by moving forward each day in the index and comparing the
indicator values with the exit conditions (as would happen in live trading). When a
match is found, the result of the action is calculated, the balance updated, and reward
given. The comparison of the indicator values was implemented by individually
checking each rule. This was done for simplicity and to ensure that the rules were
functioning correctly. However, with a bigger set of exit conditions to test (since we
are testing every applicable combination), one would assign bits to each condition in
the same manner the environment conditions are constructed, and then any invalid
actions (e.g., EMA (20) cannot be rising and falling simultaneously) would be
removed by forcing XCS to choose another action.
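A sketch of this forward-scanning exit evaluation follows. It is illustrative only: the exit predicates here are hypothetical stand-ins for the indicator conditions of Figures 8 and 9.

```python
def execute_buy(entry_index, prices, exit_conditions):
    """Enter long at `entry_index`, then walk forward day by day until
    every exit predicate holds (e.g. 'MACD decreases AND EMA(20)
    decreases'), as would happen in live trading. Each predicate takes
    a day index and returns True/False."""
    for day in range(entry_index + 1, len(prices)):
        if all(cond(day) for cond in exit_conditions):
            return prices[day] - prices[entry_index]  # realised profit
    return prices[-1] - prices[entry_index]  # forced exit at series end

# Hypothetical exit: leave once the price itself falls for two days.
prices = [10.0, 10.5, 11.0, 10.8, 10.6, 10.9]
falling = lambda d: prices[d] < prices[d - 1]
two_day_fall = [falling, lambda d: d >= 2 and prices[d - 1] < prices[d - 2]]
print(execute_buy(0, prices, two_day_fall))
```

The trade entered at 10.0 rides the rise to 11.0 and exits at 10.6, the first day on which both predicates hold simultaneously.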

The executable actions in the set of sixteen:

1. Do not enter any trades today (i.e., hold for one day.)
2. Buy today and exit when MACD (12,26,9) decreases.
3. Buy today and exit when EMA (20) decreases.
4. Buy today and exit when Stochastic (32,12,12) decreases.
5. Buy today and exit when EMA (50) decreases.
6. Buy today and exit when MACD (12,26,9) and EMA (20) decrease.
7. Buy today and exit when MACD (12,26,9) and Stochastic (32,12,12) decrease.
8. Buy today and exit when MACD (12,26,9) and EMA (50) decrease.
9. Buy today and exit when EMA (20) and Stochastic (32,12,12) decrease.
10. Buy today and exit when EMA (20) and EMA (50) decrease.
11. Buy today and exit when Stochastic (32,12,12) and EMA (50) decrease.
12. Buy today and exit when MACD (12,26,9) and EMA (20) and Stochastic
(32,12,12) decrease.
13. Buy today and exit when MACD (12,26,9) and EMA (20) and EMA (50) de-
crease.
14. Buy today and exit when MACD (12,26,9) and Stochastic (32,12,12) and EMA
(50) decrease.
15. Buy today and exit when EMA (20) and Stochastic (32,12,12) and EMA (50)
decrease.
16. Buy today and exit when EMA (20) and Stochastic (32,12,12) and EMA(50)
and MACD (12,26,9) decrease.

Fig. 9. Sixteen Technical Exit Conditions

5 Experimentation
Tables 1 to 5 present a comparison between the agents with the next day as the exit
condition, 4 technical indicator exits as the exit conditions, and with 16 technical
indicator exits as the exit conditions. Each agent starts with an initial balance of
$10,000. The results presented are the best run and the average run of 100 experi-
ments. The highest performing result in each category is highlighted in bold.
The results from the experiments comparing the next-day-exit agents with the
agents using technical indicator exit conditions, after being shown the training set
only once (Tables 1-5), show that for XOM, the agent with the highest balance
($25,648.75) and highest average balance ($15,899.56) was Agent 2 with 16 technical
indicator exits. For DJI, Agent 1 with 4 technical indicator exits produced the highest
balance ($15,120.46) and Agent 3 with 16 technical indicator exits achieved the high-
est average balance ($12,102.06). For INTEL, Agent 2 with 4 technical indicator exits
produced the highest balance ($21,000.59) and the highest average balance
($10,522.50). In the case of GM, again Agent 2 with 4 technical indicator exits pro-
duced both the highest balance ($20,116.72) and the highest average balance
($9,645.54). Lastly, for TYX, Agent 1 with next-day-exit conditions produced both
the highest balance ($15,671.20) and highest average balance ($11,389.56).
The results have shown that in all cases (except TYX), an agent using technical in-
dicator exits was superior to exiting at the next day for both the highest achievable
balance and the average balance over its experiments. Moreover, since commissions
are not factored into the agents at this stage, it is highly likely that the gap between
the two agent classes would further widen.

Table 1. XOM

Agent Best ($) Average ($)


Agent 3: Next Day Exit 16,568.02 13,518.73
Agent 2: Next Day Exit 17,015.35 12,863.05
Agent 1: Next Day Exit 18,085.78 13,815.44
Agent 3: 16 Technical Exits 25,648.60 15,442.76
Agent 2: 16 Technical Exits 25,648.75 15,899.56
Agent 1: 16 Technical Exits 22,883.49 15,849.93
Agent 3: 4 Technical Exits 16,133.73 14,825.81
Agent 2: 4 Technical Exits 21,105.34 13,823.89
Agent 1: 4 Technical Exits 19,904.95 14,224.36
Buy and Hold 24,634.00 24,634.00

Table 2. DJI

Agent Best ($) Average ($)


Agent 3: Next Day Exit 13,180.21 11,314.48
Agent 2: Next Day Exit 13,664.05 11,338.99
Agent 1: Next Day Exit 12,782.90 11,280.55
Agent 3: 16 Technical Exits 14,589.01 12,102.06
Agent 2: 16 Technical Exits 14,068.26 11,835.86
Agent 1: 16 Technical Exits 14,443.68 12,027.56
Agent 3: 4 Technical Exits 13,701.04 11,975.34
Agent 2: 4 Technical Exits 14,664.57 11,868.51
Agent 1: 4 Technical Exits 15,120.46 12,033.45
Buy and Hold 12,918.69 12,918.69
Table 3. INTEL

Agent Best ($) Average ($)


Agent 3: Next Day Exit 12,672.98 9,512.07
Agent 2: Next Day Exit 14,240.27 9,727.86
Agent 1: Next Day Exit 13,476.69 9,731.87
Agent 3: 16 Technical Exits 12,889.49 8,391.51
Agent 2: 16 Technical Exits 13,736.25 8,860.61
Agent 1: 16 Technical Exits 15,759.57 8,481.99
Agent 3: 4 Technical Exits 16,511.56 9,504.32
Agent 2: 4 Technical Exits 21,000.59 10,522.50
Agent 1: 4 Technical Exits 16,568.16 9,924.76
Buy and Hold 8,894.74 8,894.74


Table 4. GM

Agent Best ($) Average ($)


Agent 3: Next Day Exit 13,505.11 8,251.02
Agent 2: Next Day Exit 14,324.42 7,927.37
Agent 1: Next Day Exit 16,789.67 8,579.46
Agent 3: 16 Technical Exits 15,605.10 8,827.06
Agent 2: 16 Technical Exits 18,114.27 9,254.52
Agent 1: 16 Technical Exits 17,338.24 9,153.40
Agent 3: 4 Technical Exits 15,804.40 9,226.62
Agent 2: 4 Technical Exits 20,116.72 9,645.54
Agent 1: 4 Technical Exits 14,565.23 8,362.22
Buy and Hold 5,970.25 5,970.25
Table 5. TYX

Agent Best ($) Average ($)


Agent 3: Next Day Exit 14,180.51 10,959.06
Agent 2: Next Day Exit 14,297.20 10,730.10
Agent 1: Next Day Exit 15,671.20 11,389.56
Agent 3: 16 Technical Exits 12,773.89 10,010.81
Agent 2: 16 Technical Exits 12,503.13 9,632.41
Agent 1: 16 Technical Exits 12,047.33 9,815.09
Agent 3: 4 Technical Exits 11,346.18 9,870.72
Agent 2: 4 Technical Exits 14,297.84 10,014.32
Agent 1: 4 Technical Exits 12,260.75 9,936.21
Buy and Hold 9,227.80 9,227.80

Table 6. t-Stats of Tech Exits vs. Next Day (N.D.) exits. Two-Sample Assuming Unequal
Variances. Results in bold are statistically significant at the 95% confidence level.

                Agent 1                          Agent 2                          Agent 3
Stock  4 Ex. vs. N.D.  16 Ex. vs. N.D.  4 Ex. vs. N.D.  16 Ex. vs. N.D.  4 Ex. vs. N.D.  16 Ex. vs. N.D.
XOM 1.90 6.48 3.15 9.40 4.10 5.80
DJI 5.60 6.19 3.73 4.05 3.73 5.82
INTEL 0.86 -6.20 3.61 -4.06 -0.04 -5.72
GM -0.69 1.93 5.13 4.09 2.73 1.87
TYX -8.34 -9.60 -4.08 -6.90 -7.96 -6.30

However, in the case of TYX, the best performing agent was Agent 1 with next-
day-exit conditions. Furthermore, all next-day-exit agents surpassed the technical
indicator exit agents in terms of both highest balance and average balance, showing
that for some assets next-day-exits can be the best. However, introducing commis-
sions would likely reduce this gap and perhaps even erase the next-day-exit agents'
advantage.
Nevertheless, the fact that the next-day-exit agents beat the technical indicator exits is
perhaps explainable by the split between the training and trading set, since the train-
ing set for TYX primarily decreases but the trading set moves in a side-ways range.
Table 6 presents the t-Stats for the three agent types where exiting at the close of
the next day is compared with both the 4 and 16 technical indicator exit sets. It is
shown that almost all of the results are statistically significant at the 95% confidence
level. In particular, for XOM and DJI, all agents utilising technical indicator exits
surpassed the same agents when exiting at the close of the next day, and these results
were statistically significant. Additionally, Agent 2 when using 4 indicator exits has
provided statistically significant and superior results when compared to exiting at the
close of the next day in all cases except for TYX.
Finally, when comparing the best performing agents with a buy and hold strategy,
we observe that for INTEL, GM, and TYX, all of the agents using technical indicator
exits beat this strategy. Further, the best performing agents on all assets were always
able to beat the buy and hold balance; however, the average of the agents' balances
did not. Furthermore, should commissions be introduced (the cost would vary from
broker to broker), these results when compared to a buy and hold strategy would
deteriorate to some extent.
However, the agents' average balances only outperformed a buy and hold strategy
when the stocks declined. An explanation for this is that when the agent exits the
market wrongfully, although there is no actual loss, there is an opportunity cost
because the market increases and the agent underperforms its benchmark. Thus,
stocks which generally decline over the period analysed are much easier to beat,
because agents have the choice to be in or out of the market, while it is much harder
to beat those that are generally rising.
Table 7 shows the average number of trades executed over 100 tests of each asset
by Agent 2. Again, the agent is shown the training set only once before being
assessed on the trading set. The table shows that when using 4 technical indicator
exits, the agent always trades fewer times than with next-day-exit conditions. Further,
this is statistically significant (as shown in Table 8). In some cases 40% fewer trades
are executed, which would result in substantial transaction fee savings. When
utilising 16 technical indicator exits, Agent 2 trades a similar number of times as the
agents using next-day-exit conditions. This is a result of adding more exit conditions,
which increases the probability of closing the trade after a short period of time. Thus,
the 16 technical indicator exit agents tested do not offer any transaction fee savings in
comparison to the traditional model.

Table 7. Average Number of Trades Executed by Agent 2.

Agent 2: XOM DJI INTEL GM TYX


Next-day-exit 243.25 267.20 266.83 154.37 160.89
4 Tech- Exits 164.84 170.74 168.30 136.14 105.82
16 Tech- Exits 241.17 255.23 255.55 144.69 158.54

Table 8. t-Stats of Number of trades Executed by Agent 2 with Tech Exits vs. Next Day (N.D.)
exits. Two-Sample Assuming Unequal Variances. Results in bold are statistically significant at
the 95% confidence level.

Agent 2: XOM DJI INTEL GM TYX


4 Tech- Exits vs. N.D. 4.63 5.51 6.60 1.98 3.58
16 Tech- Exits vs. N.D. 0.13 0.51 0.60 1.36 0.13
6 Conclusions
Agents utilising mathematical technical indicators for the exit conditions outper-
formed similar agents which used the next day as the exit condition in all cases except
for TYX (30-Year Treasury bond), even before taking commissions into account,
which would penalise the most active agents (i.e., the agents using next-day-exit).
Moreover, these results were achieved with generic XCS parameters and not tuned to
improve performance.
The reason TYX was anomalous is attributable to either the position of the cut-off
point between the training and trading set, or the TYX data being inherently noisier
than the other assets, which were all stocks. The cut point in this asset is particularly
important because it resulted in a training set which primarily declined and a trading
set that ranged sideways. Thus, the agents would have adapted rules to trade within
this downward environment but were not prepared for the environment within which
they were assessed.
An analysis of the number of trades executed by each agent showed that, on average,
31.73% fewer trades were executed when using 4 technical indicator exit conditions;
this would result in substantial transaction savings and further boost the
performance of these agents in comparison to the agents using next-day-exit condi-
tions. However, the agents using 16 mathematical technical indicator exits executed
with approximately the same frequency as the agents using next-day-exit conditions.
This was a result of having more rules with different exit conditions that could be
triggered, so the agents were closing the trades with greater frequency.

References
1. Allen, F., Karjalainen, R.: Using Genetic Algorithms to find technical trading rules. Journal
of Financial Economics 51(2), 245–271 (1999)
2. Beltrametti, L., Fiorentini, R., Marengo, L., Tamborini, R.: A learning-to-forecast experiment
on the foreign exchange market with a Classifier System. Journal of Economic Dynamics
and Control 21(8–9), 1543–1575 (1997)
3. Butz, M., Sastry, K., Goldberg, D.: Strong, Stable, and Reliable Fitness Pressure in XCS
due to Tournament Selection. Genetic Programming and Evolvable Machines 6(1), 53–77
(2005)
4. Brock, W., Lakonishock, J., LeBaron, B.: Simple Technical Trading Rules and the Stochastic
Properties of Stock Returns. Journal of Finance 47, 1731–1764 (1992)
5. Chen, S.-H.: Genetic Algorithms and Genetic Programming in Computational Finance.
Kluwer Academic Publishers, Norwell (2002)
6. Detry, P.J., Grégoire, P.: Other evidences of the predictive power of technical analysis: the
moving average rules on European indexes. CeReFiM, Belgium, pp. 1–25 (1999)
7. Dewachter, H.: Can Markov switching models replicate chartist profits in the foreign
exchange market? Journal of International Money and Finance 20(1), 25–41 (2001)
8. Dooley, M., Schaffer, J.: Analysis of Short-Run Exchange Rate Behavior: March 1973 to
November 1981. In: Bigman, D., Taya, T. (eds.) Floating Exchange Rates and State of
World Trade and Payments, pp. 43–70. Ballinger Publishing Company, Cambridge (1983)
9. Gershoff, M.: An Investigation of HXCS Traders. Master of Science thesis, School of
Informatics, University of Edinburgh (2006)
10. Holland, J.: Adaptation in Natural and Artificial Systems. University of Michigan Press
(1975)
11. Kalyvas, E.: Using Neural Networks and Genetic Algorithms to Predict Stock Market
Returns. Master of Science thesis, University of Manchester (2001)
12. Levich, R., Thomas, L.: The Merits of Active Currency Management: Evidence from
International Bond Portfolios. Financial Analysts Journal 49(5), 63–70 (1993)
13. Liu, S., Nagao, T.: HXCS and its Application to Financial Time Series Forecasting. IEEJ
Transactions on Electrical and Electronic Engineering 1, 417–425 (2006)
14. Mahfoud, S., Mani, G.: Financial forecasting using Genetic Algorithms. Applied Artificial
Intelligence 10(6), 543–565 (1996)
15. Neely, C., Weller, P., Dittmar, R.: Is Technical Analysis in the Foreign Exchange Market
Profitable? A Genetic Programming Approach. Journal of Financial and Quantitative
Analysis 32(4), 405–426 (1997)
16. Okunev, J., White, D.: Do momentum-based strategies still work in foreign currency
markets? Journal of Financial and Quantitative Analysis 38, 425–447 (2003)
17. Olson, D.: Have trading rule profits in the currency market declined over time? Journal of
Banking and Finance 28, 85–105 (2004)
18. Schulenburg, S., Ross, P.: An Adaptive Agent Based Economic Model. In: Lanzi, P.L., et
al. (eds.) IWLCS 1999. LNCS (LNAI), vol. 1996, pp. 265–284. Springer, Heidelberg (2001)
19. Schulenburg, S., Ross, P.: Strength and money: An LCS approach to increasing returns. In:
Lanzi, P.L. (ed.) IWLCS 2000. LNCS (LNAI), vol. 1996, pp. 114–137. Springer,
Heidelberg (2001)
20. Schulenburg, S., Ross, P.: Explorations in LCS models of stock trading. In: Lanzi, P.L.,
Stolzmann, W., Wilson, S.W. (eds.) IWLCS 2001. LNCS (LNAI), vol. 2321, pp. 151–180.
Springer, Heidelberg (2002)
21. Schulenburg, S., Wong, S.Y.: Portfolio allocation using XCS experts in technical analysis,
market conditions and options market. In: Proceedings of the 2007 GECCO Conference
Companion on Genetic and Evolutionary Computation, pp. 2965–2972. ACM, New York
(2007)
22. Srinivasa, K.G., Venugopal, K.R., Patnaik, L.M.: An efficient fuzzy based neuro-genetic
algorithm for stock market prediction. International Journal of Hybrid Intelligent
Systems 3(2), 63–81 (2006)
23. Steiner, M., Wittkemper, H.G.: Neural networks as an alternative stock market model. In:
Refenes, A.P. (ed.) Neural Networks in the Capital Markets, pp. 137–149. John Wiley and
Sons, Chichester (1996)
24. Stone, C., Bull, L.: Foreign Exchange Trading using a Learning Classifier System. In: Bull,
L., Bernadó-Mansilla, E., Holmes, J. (eds.) Learning Classifier Systems in Data Mining,
pp. 169–190. Springer, Heidelberg (2008)
25. Sweeney, R.J.: Beating the foreign exchange market. Journal of Finance 41, 163–182 (1986)
26. Tsibouris, G., Zeidenberg, M.: Testing the Efficient Market Hypothesis with Gradient
Descent Algorithms, pp. 127–136. John Wiley and Sons Ltd., Chichester (1996)
27. Wilson, S.W.: ZCS: A Zeroth Level Classifier System. Evolutionary Computation 2(1),
1–18 (1994)
28. Wilson, S.W.: Classifier Fitness Based on Accuracy. Evolutionary Computation 3(2),
149–175 (1995)
184 R. Preen

Appendix: Mathematical Technical Indicators

Simple Moving Average: SMA(N)

SMA_t = (Close_t + Close_{t-1} + ... + Close_{t-N+1}) / N

where Close is the closing price being averaged and N is the number of days in the moving average.

Exponential Moving Average: EMA(N)

EMA_t = Close_t × K + EMA_{t-1} × (1 − K)

where K = 2/(N+1), N is the number of days in the EMA, Close_t is today's closing price, and EMA_{t-1} is the EMA of yesterday.
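A short Python sketch of the recursion; seeding the series with the first close is an assumption of this sketch (the appendix does not specify initialization):

```python
def ema(closes, n):
    """Exponential moving average with smoothing factor K = 2/(N+1).
    Applies EMA_t = Close_t*K + EMA_{t-1}*(1-K) over the price series."""
    k = 2.0 / (n + 1)
    val = closes[0]           # seeding choice (an assumption of this sketch)
    for c in closes[1:]:
        val = c * k + val * (1 - k)
    return val

print(round(ema([10.0, 11.0, 12.0], 3), 3))  # K = 0.5 -> 11.25
```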

Moving Average Convergence Divergence: MACD(a,b,c)

MACD main line = EMA(a) − EMA(b)
MACD signal line = EMA(c)

where EMA(c) is an exponential moving average of the MACD main line.

Stochastic Oscillator: Stochastic(FastK, SlowK, SlowD)

Stochastic main line: Stoch_t = Stoch_{t-1} + (Fast − Stoch_{t-1}) / SlowK
Stochastic signal line: Sig_t = Sig_{t-1} + (Stoch_t − Sig_{t-1}) / SlowD

where Stoch_t is today's stochastic main line; Stoch_{t-1} is the stochastic main line of yesterday; Fast = 100 × ((Close_t − L) / (H − L)); Close_t is today's closing price; L is the lowest low price over the last FastK days; and H is the highest high price over the last FastK days.
On the Homogenization of Data from Two
Laboratories Using Genetic Programming

Jose G. Moreno-Torres¹, Xavier Llorà², David E. Goldberg³, and Rohit Bhargava⁴

¹ Department of Computer Science and Artificial Intelligence,
Universidad de Granada, 18071 Granada, Spain
jose.garcia.mt@decsai.ugr.es
² National Center for Supercomputing Applications (NCSA),
University of Illinois at Urbana-Champaign,
1205 W. Clark Street, Urbana, Illinois, USA
xllora@illinois.edu
³ Illinois Genetic Algorithms Laboratory (IlliGAL),
University of Illinois at Urbana-Champaign,
104 S. Mathews Ave, Urbana, Illinois, USA
deg@illinois.edu
⁴ Department of Bioengineering,
University of Illinois at Urbana-Champaign,
405 N. Mathews Ave, Urbana, Illinois, USA
rbx@uiuc.edu

Abstract. In experimental sciences, diversity tends to hinder the proper generalization of predictive models across data provided by different laboratories. Thus, training on a data set produced by one lab and testing on data provided by another lab usually results in low classification accuracy. Despite the fact that the same protocols were followed, variability in the measurements can introduce unforeseen variations that affect the quality of the model. This paper proposes a Genetic Programming based approach, where a transformation of the data from the second lab is evolved, driven by classifier performance. A real-world problem, prostate cancer diagnosis, is presented as an example where the proposed approach was capable of repairing the fracture between the data of two different laboratories.

1 Introduction
The assumption that a properly trained classifier will be able to predict the behavior of unseen data from the same problem is at the core of any automatic classification process. However, this hypothesis tends to prove unreliable when dealing with biological data (or data from other experimental sciences), especially when such data is provided by more than one laboratory, even if all laboratories follow the same protocols to obtain it.
J. Bacardit et al. (Eds.): IWLCS 2008/2009, LNAI 6471, pp. 185–197, 2010.
© Springer-Verlag Berlin Heidelberg 2010

This paper presents an example of such a case: a prostate cancer diagnosis problem where a classifier built using the data of the first laboratory performs very accurately on the test data from that same laboratory, but comparatively poorly on the data from the second one. It is assumed that this behavior is due to a fracture between the data of the two laboratories, and a Genetic Programming (GP) method is developed to homogenize the data in subsequent subsets. We consider this method a form of feature extraction because the new dataset is constructed with new features which are functional mappings of the old ones.
The method presented in this paper attempts to optimize a transformation over the data from the second laboratory, in terms of classifier performance. That is, the data from the second lab is transformed into a new dataset where the classifier, trained on the data from the first lab, performs as accurately as possible. If the performance achieved by the classifier on this new, transformed dataset is equivalent to the one obtained on the data from the first lab, we understand the data has been homogenized.
More formally, the classifier f is trained on data from one laboratory (dataset A), such that y = f(x_A) is the class prediction for one instance x_A of dataset A. For the data from the other lab (dataset B), it is assumed that there exists a transformation T such that f(T(x_B)) is a good classifier for instances x_B of dataset B. The goodness of the classifier is measured by the loss function l(f(T(x_B)), y), where y is the class associated with x_B, and l(·,·) is a measure of distance between f(T(x_B)) and y. The aim is to find a transformation T such that the average loss over all instances in B is minimized.
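The objective can be sketched as plain average-loss minimization; the 0/1 loss and all names below are illustrative stand-ins, not the paper's implementation:

```python
def zero_one_loss(pred, label):
    """l(., .): 0 if the prediction matches the label, 1 otherwise."""
    return 0.0 if pred == label else 1.0

def average_loss(classifier, transform, dataset_b):
    """Mean loss of f(T(x)) over labelled instances (x, y) of dataset B."""
    total = sum(zero_one_loss(classifier(transform(x)), y)
                for x, y in dataset_b)
    return total / len(dataset_b)

# Toy example: f thresholds the first attribute; T doubles it.
f = lambda x: 1 if x[0] >= 1.0 else 0
T = lambda x: [2 * x[0]]
data_b = [([0.6], 1), ([0.3], 0)]
print(average_loss(f, T, data_b))  # both instances correct -> 0.0
```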
The remainder of this paper is organized as follows: In Section 2, some preliminaries about the techniques used and some approaches to similar problems in the literature are presented. Section 3 gives a description of the proposed algorithm. Section 4 details the real-world biological dataset that motivates this paper. Section 5 includes the experimental setup, along with the results obtained and an analysis. Finally, some concluding remarks are made in Section 6.

2 Preliminaries
This section is divided in the following way: In Section 2.1 we introduce the notation that is used in this paper. We then include a brief summary of what has been done in feature extraction in Section 2.2, and a short review of the different approaches found in the specialized literature on the use of GP for feature extraction in Section 2.3.

2.1 Notation
When describing the problem, datasets A, B and S correspond to:
A: The original dataset, provided by the first lab, that was used to build the classifier.
B: The problem dataset, from the second lab. The classifier is not accurate on this dataset, and that is what the proposed algorithm attempts to solve.
S: The solution dataset, the result of applying the evolved transformation to the samples in dataset B. The goal is to have the classifier performance be as high as possible on this dataset.

2.2 Feature Extraction

Feature extraction is one form of pre-processing, which creates new features as functional mappings of the old ones. An early proposer of the term was probably Wyse in 1980 [1], in a paper about intrinsic dimensionality estimation. Multiple techniques have been applied to feature extraction throughout the years, ranging from principal component analysis (PCA) to support vector machines (SVMs) to GAs (see [2,3,4], respectively, for some examples).
Among the foundational papers in the literature, Liu's book in 1998 [5] is one of the earlier compilations of the field. A workshop held in 2003 [6] led Guyon & Elisseeff to publish a book with an important treatment of the foundations of feature extraction [7].

2.3 Genetic Programming-Based Feature Extraction

Genetic Programming (GP) has been used extensively to optimize feature extraction and selection tasks. One of the first contributions in this line was the work published by Tackett in 1993 [8], who applied GP to feature discovery and image discrimination tasks.
We can consider two main branches in the philosophy of GP-based feature extraction:
1. On the one hand, there are proposals that focus only on the feature extraction procedure, of which there are multiple examples: Sherrah et al. [9] presented in 1997 the evolutionary pre-processor (EPrep), which searches for an optimal feature extractor by minimizing the misclassification error over three randomly selected classifiers. Kotani et al.'s work from 1999 [10] determined the optimal polynomial combinations of raw features to pass to a k-nearest neighbor classifier. In 2001, Bot [11] evolved transformed features, one at a time, again for a k-NN classifier, utilizing each new feature only if it improved the overall classification performance. Zhang & Rockett [12], in 2006, used multiobjective GP to learn optimal feature extraction in order to fold the high-dimensional pattern vector into a one-dimensional decision space where the classification would be trivial. Lastly, also in 2006, Guo & Nandi [13] optimized a modified Fisher discriminant using GP, and Zhang & Rockett [14] later extended their work by using a multiobjective approach to prevent tree bloat.
2. On the other hand, some authors have chosen to evolve a full classifier with an embedded feature extraction step. As an example, Harris [15] proposed in 1997 a co-evolutionary strategy involving the simultaneous evolution of the feature extraction procedure along with a classifier. More recently, Smith & Bull [16] developed a hybrid feature construction and selection method using GP together with a GA.

2.4 Finding and Repairing Fractures between Data

Among the proposals to quantify the fracture in the data, we would like to mention the one by Wang et al. [17], where the authors present the idea of correspondence tracing. They propose an algorithm for the discovery of changes of classification characteristics, which is based on the comparison between two rule-based classifiers, one built from each dataset. Yang et al. [18] presented in 2008 the idea of conceptual equivalence as a method for contrast mining, which consists of the discovery of discrepancies between datasets. Lastly, it is important to mention the work by Cieslak and Chawla [19], which presents a statistical framework to analyze changes in data distribution resulting in fractures between the data.
The fundamental difference between the mentioned works and this one is that we focus on repairing the fracture by modifying the data, using a general method that works with any kind of data fracture, while they propose methods to quantify said fracture that work only under certain conditions.

3 A Proposal for GP-Based Feature Extraction to Homogenize Data from Two Laboratories

The problem we are attempting to solve is the design of a method that can evolve a transformation from a dataset (dataset B), on which a classification model built using the data from a different dataset (dataset A) is not accurate, into a new dataset (dataset S) on which the classifier is more accurate. Said classifier is kept unchanged throughout the process.
We decided to use GP to solve the problem for a number of reasons:
1. It is well suited to evolve arbitrary expressions because its chromosomes are trees. This is useful in our case because we want to have the maximum possible flexibility in terms of the functional expressions of these transformations.
2. GP provides highly interpretable solutions. This is an advantage because our goal is not only to have a new dataset where the classifier works, but also to analyze what the problem was in the first dataset.
Once GP was chosen, we needed to decide what terminals and operators to use, how to calculate the fitness of an individual, and which evolutionary parameters (population size, number of generations, selection and mutation rates, etc.) are appropriate for the problem at hand.

3.1 Solution Representation: Context-Free Grammar

The representation of the solutions was achieved by extending GP to evolve more than one tree per solution. Each individual is composed of n trees, where n is the number of attributes present in the dataset. We are trying to develop a new dataset with the same number of attributes as the old one, since this new dataset needs to be fed to the existing model. In the tree structure, the leaves are either constants (we use the Ephemeral Random Constant approach [20]) or attributes from the original dataset. The intermediate nodes are functions from the function set, which is specific to each problem.

The attributes of the transformed dataset are represented by algebraic expressions. These expressions are generated according to the rules of a context-free grammar, which allows the absence of some of the functions or terminals. The grammar corresponding to the example problem looks like this:

Start → Tree Tree
Tree → Node
Node → Node Operator Node
Node → Terminal
Operator → + | − | × | ÷
Terminal → x0 | x1 | E
E → realNumber (represented by e)
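A minimal sketch of growing random expression trees according to this grammar; the tuple-based tree encoding, the 30% early-termination probability, and the depth bound are assumptions of this sketch, not details from the paper:

```python
import random

OPERATORS = ['+', '-', '*', '/']

def random_tree(terminals, depth, rng):
    """Grow a random expression tree following the grammar:
    Node -> Node Operator Node | Terminal, where a Terminal is an
    attribute name or an ephemeral random constant ('E')."""
    if depth == 0 or rng.random() < 0.3:
        t = rng.choice(terminals + ['E'])
        # 'E' expands to a concrete random constant when generated
        return round(rng.uniform(-1, 1), 3) if t == 'E' else t
    return (rng.choice(OPERATORS),
            random_tree(terminals, depth - 1, rng),
            random_tree(terminals, depth - 1, rng))

rng = random.Random(42)
print(random_tree(['x0', 'x1'], 3, rng))  # e.g. a nested (op, left, right) tuple
```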

3.2 Fitness Evaluation

The fitness evaluation procedure is probably the most-discussed design aspect in the literature on GP-based feature extraction. As stated before, the idea is to have the provided classifier's performance drive the evolution. To achieve that, our method calculates fitness as the classifier's accuracy over the dataset obtained by applying the transformations encoded in the individual (training-set accuracy).
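The fitness computation can be sketched in a few lines; the toy classifier and candidate transformation below are illustrative stand-ins, not the paper's C4.5 model:

```python
def fitness(individual, classifier, dataset_b):
    """Training-set accuracy of the fixed classifier on the transformed data.
    `individual` maps one instance of dataset B to a transformed instance."""
    hits = sum(1 for x, y in dataset_b if classifier(individual(x)) == y)
    return hits / len(dataset_b)

clf = lambda x: 1 if x[0] > 0.5 else 0   # stand-in for the fixed classifier
shift = lambda x: [x[0] + 0.4]           # a candidate transformation
data_b = [([0.2], 1), ([0.3], 1), ([0.0], 0)]
print(fitness(shift, clf, data_b))  # all three instances correct -> 1.0
```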

3.3 Genetic Operators

This section details the choices made for the selection, crossover and mutation operators. Since the objective of this work is not to squeeze the maximum possible performance from GP, but rather to show that it is an appropriate technique for the problem and that it can indeed solve it, we did not pay special attention to these choices, and picked the most common ones in the specialized literature.

Tournament selection without replacement: To perform this selection, s individuals are first randomly picked from the population (where s is the tournament size), while avoiding using any member of the population more than once. The selected individual is then the one with the best fitness among those picked in the first stage.
One-point crossover: A subtree from one of the parents is substituted by one from the other parent. This procedure is carried out in the following way:

1. Randomly select a non-root, non-leaf node on each of the two parents.
2. The first child is the result of swapping the subtree below the selected node in the father for that of the mother.
3. The second child is the result of swapping the subtree below the selected node in the mother for that of the father.

Swap mutation: This is a conservative mutation operator that helps diversify the search within a close neighborhood of a given solution. It consists of exchanging the primitive associated with a node for one that has the same number of arguments.
Replacement mutation: This is a more aggressive mutation operator that leads to diversification in a larger neighborhood. The procedure to perform this mutation is the following:
1. Randomly select a non-root, non-leaf node on the tree to mutate.
2. Create a random tree of depth no more than a fixed maximum depth. In this work, the maximum depth allowed was 5.
3. Swap the subtree below the selected node for the randomly generated one.
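Tournament selection without replacement, the selection operator used here, can be sketched as follows (all names are illustrative):

```python
import random

def tournament_select(population, fitnesses, s, rng):
    """Tournament selection without replacement: draw s distinct individuals,
    then return the fittest among them."""
    picked = rng.sample(range(len(population)), s)   # no repeats
    best = max(picked, key=lambda i: fitnesses[i])
    return population[best]

pop = ['a', 'b', 'c', 'd']
fit = [0.1, 0.9, 0.5, 0.3]
print(tournament_select(pop, fit, 2, random.Random(0)))  # one of 'a'..'d'
```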

3.4 Function Set

Which functions to include in the function set is usually problem-dependent. Since one of our goals is to have an algorithm as universal and robust as possible, where the user does not need to fine-tune any parameters to achieve good performance, we decided not to study the effect of different function set choices. We chose the default functions most authors use in the literature: {+, −, ×, ÷, exp, cos}.

3.5 Parameters
Table 1 summarizes the parameters used for the experiments.

Table 1. Evolutionary parameters for an nv-dimensional problem

Parameter                                  Value
Number of trees                            nv
Population size                            400 · nv
Duration of the run                        100 generations
Selection operator                         Tournament without replacement
Tournament size                            log2(nv) + 1
Crossover operator                         One-point crossover
Crossover probability                      0.9
Mutation operators                         Replacement & swap mutations
Replacement mutation probability           0.001
Swap mutation probability                  0.01
Maximum depth of the swapped-in subtree    5
Function set                               {+, −, ×, ÷, cos, exp}
Terminal set                               {x0, x1, ..., x_{nv-1}, e}

3.6 Execution Flow

Algorithm 1 contains a summary of the execution flow of the GP procedure, which follows a classical evolutionary scheme. It stops after a user-defined number of generations.

Algorithm 1. Execution flow of the GP method

1. Randomly create the initial population by applying the context-free grammar in Section 3.1.
2. Repeat Ng times (where Ng is the number of generations):
   2.1 Evaluate the current population, using the procedure seen in Section 3.2.
   2.2 Apply selection and crossover to create a new population that will replace the old one.
   2.3 Apply the mutation operators to the new population.
3. Return the best individual ever seen.
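Applying the returned best individual to an instance means interpreting each of its trees. A minimal recursive evaluator, assuming a tuple-based tree encoding and protected division (the paper does not state how division by zero is handled):

```python
def eval_tree(tree, x):
    """Recursively evaluate an expression tree against one instance x.
    Leaves are attribute names ('x0', 'x1', ...) or numeric constants."""
    if isinstance(tree, (int, float)):
        return float(tree)
    if isinstance(tree, str):
        return float(x[int(tree[1:])])   # 'x3' -> x[3]
    op, left, right = tree
    a, b = eval_tree(left, x), eval_tree(right, x)
    if op == '+': return a + b
    if op == '-': return a - b
    if op == '*': return a * b
    if op == '/': return a / b if b != 0 else 1.0   # protected division
    raise ValueError("unknown operator: %r" % op)

# (x0 * 2.0) + x1 evaluated at x = (3.0, 4.0)
print(eval_tree(('+', ('*', 'x0', 2.0), 'x1'), (3.0, 4.0)))  # 10.0
```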

4 Case Study: Prostate Cancer Diagnosis

Prostate cancer is the most common non-skin malignancy in the western world. The American Cancer Society estimated 192,280 new cases and 27,360 deaths related to prostate cancer in 2009 [21]. Recognizing the public health implications of this disease, men are actively screened through digital rectal examinations and/or serum prostate specific antigen (PSA) level testing. If these screening tests are suspicious, prostate tissue is extracted, or biopsied, from the patient and examined for structural alterations. Due to imperfect screening technologies and repeated examinations, it is estimated that more than one million people undergo biopsies in the US alone.

4.1 Diagnostic Procedure

Biopsy, followed by manual examination under a microscope, is the primary means to definitively diagnose prostate cancer, as well as most internal cancers in the human body. Pathologists are trained to recognize patterns of disease in the architecture of tissue, local structural morphology, and alterations in cell size and shape. Specific patterns of specific cell types distinguish cancerous and non-cancerous tissues. Hence, the primary task of the pathologist examining tissue for cancer is to locate foci of the cells of interest and examine them for alterations indicative of disease. A detailed explanation of the procedure is beyond the scope of this paper and can be found elsewhere [22,23,24,25].
Operator fatigue is well documented, and guidelines limit the workload and rate of examination of samples by a single operator (examination speed and throughput). Importantly, inter- and intra-pathologist variation complicates decision making. For this reason, it would be extremely interesting to have an accurate automatic classifier to help reduce the load on the pathologists. This was partially achieved in [24], but some issues remain open.

4.2 The Generalization Problem

Llorà et al. [24] successfully applied a genetics-based approach to the development of a classifier that obtained human-competitive results based on FTIR data. However, the classifier built from the data obtained from one laboratory proved remarkably inaccurate when applied to classify data from a different hospital. Since the entire experimental procedure was identical, using the same machine, measuring and post-processing, and the exact same lab protocols, both for tissue extraction and staining, there was no factor that could explain this discrepancy.
What we attempt to do with this work is develop an algorithm that can evolve a transformation over the data from the second laboratory, creating a new dataset where the classifier built from the first lab is as accurate as possible.

4.3 Pre-processing of the Data

The biological data obtained from the laboratories has an enormous size (in the range of 14 GB of storage per sample), and parallel computing was needed to achieve better-than-human results. For this reason, feature selection was performed on the dataset obtained by FTIR. It was done by applying an evaluation of pairwise error and incremental increase in classification accuracy for every class, resulting in a subset of 93 attributes. This reduced dataset provided enough information for classifier performance to be rather satisfactory: a simple C4.5 classifier achieved 95% accuracy on the data from the first lab, but only 80% on the second one. The dataset consists of 789 samples from one laboratory and 665 from the other one. These samples represent 0.01% of the total data available for each data set, and were selected by applying stratified sampling without replacement. A detailed description of the data pre-processing procedure can be found in [22].
The experiments reported in this paper were performed utilizing the reduced dataset, since the associated computational costs make it unfeasible to work with the complete one. The reduced dataset is made of 93 real attributes, and there are two classes (positive and negative diagnosis). The dataset consists of 789 samples from one laboratory and 665 from the other one, with a 60%-40% class distribution.

5 Experimental Study
This section is organized in the following way: To begin with, a general description of the experimental procedure and the parameters used for the experiment is presented in Section 5.1. The results obtained are presented in Section 5.2, a statistical analysis is shown in Section 5.3, and lastly some sample transformations are shown in Section 5.4.

5.1 Experimental Framework

The experimental methodology can be summarized as follows:
1. Consider each of the provided datasets (one from each lab) to be datasets A and B, respectively.
2. From dataset A, build a classifier. We chose C4.5 [26], but any other classifier would work exactly the same, since the proposed method uses the learned classifier as a black box.
3. Apply our method to dataset B in order to evolve a transformation that will create a solution dataset S. Use 5-fold cross-validation over dataset S, so that training and test set accuracy results can be obtained.
4. Check the performance of the step 2 classifier on dataset S. Ideally, it should be close to the one on dataset A, meaning the proposed method has successfully discovered the hidden transformation and inverted it.
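The methodology can be illustrated end-to-end with a toy stand-in for C4.5 (a one-dimensional threshold rule, since the method treats the classifier as a black box) and a hand-picked repairing transformation; everything below is a sketch, not the paper's experiment:

```python
def train_stump(data):
    """Fit a 1-D threshold rule (black-box stand-in for C4.5): pick the
    threshold with the best training accuracy on dataset A."""
    best = (0.0, None)
    for thr in sorted({x[0] for x, _ in data}):
        acc = sum((x[0] >= thr) == bool(y) for x, y in data) / len(data)
        if acc > best[0]:
            best = (acc, thr)
    return lambda x, t=best[1]: int(x[0] >= t)

# Dataset A, and a shifted dataset B simulating the inter-lab fracture
data_a = [([0.1], 0), ([0.2], 0), ([0.8], 1), ([0.9], 1)]
data_b = [([x[0] - 0.5], y) for x, y in data_a]
clf = train_stump(data_a)
acc_b = sum(clf(x) == y for x, y in data_b) / len(data_b)
T = lambda x: [x[0] + 0.5]               # the repairing transformation
acc_s = sum(clf(T(x)) == y for x, y in data_b) / len(data_b)
print(acc_b, acc_s)  # accuracy jumps from 0.5 on B to 1.0 on S
```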

5.2 Performance Results

This section presents the results for the Prostate Cancer problem, in terms of classifier accuracy. The results obtained can be seen in Table 2.

Table 2. Classifier performance results

Classifier performance on dataset ...

A-training  A-test   B        S-training  S-test
0.95435     0.92015  0.83570  0.95191     0.92866

The performance results are promising. First and foremost, the proposed method was able to find a transformation over the data from the second laboratory that made the classifier work just as well as it did on the data from the first lab, effectively finding the fracture in the data (that is, the difference in data distribution between the data sets provided by the two labs) that prevented the classifier from working accurately.

5.3 Statistical Analysis

To complete the experimental study, we performed a statistical comparison of the classifier performance over datasets A, B and S.
In [27,28,29,30], a set of simple, safe and robust non-parametric tests for statistical comparisons of classifiers is recommended. One of them is the Wilcoxon Signed-Ranks Test [31,32], which is the test that we have selected for the comparison.
In order to perform the Wilcoxon test, we used the results from each partition of the 5-fold cross-validation procedure. We ran the experiment four times, resulting in 4 × 5 = 20 performance samples for the statistical test. R+ corresponds to a win for the first algorithm in the comparison, R− to a win for the second one.
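The R+ and R− rank sums can be computed directly from the 20 paired samples; note that with n = 20 the maximum possible rank sum is 1 + 2 + ... + 20 = 210, exactly the value reported when one method wins every pairing. A self-contained sketch (names are illustrative; the paired values below are synthetic, not the paper's measurements):

```python
def wilcoxon_rank_sums(a, b):
    """Signed-rank sums for paired samples: rank |a_i - b_i| (ties get the
    average rank), then R+ sums ranks where a > b and R- where a < b.
    Zero differences are dropped, as in the standard test."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    ranked = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(ranked):
        j = i
        while j + 1 < len(ranked) and abs(diffs[ranked[j + 1]]) == abs(diffs[ranked[i]]):
            j += 1
        avg = (i + j) / 2 + 1          # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ranked[k]] = avg
        i = j + 1
    r_plus = float(sum(r for d, r in zip(diffs, ranks) if d > 0))
    r_minus = float(sum(r for d, r in zip(diffs, ranks) if d < 0))
    return r_plus, r_minus

# One method uniformly better: all 20 ranks go to R+ (sum 1..20 = 210)
a = [0.9 + 0.001 * i for i in range(20)]
b = [0.8 + 0.001 * i for i in range(20)]
print(wilcoxon_rank_sums(a, b))  # (210.0, 0.0)
```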
We can conclude that our method has proved capable of fully homogenizing the data from both laboratories with regard to classifier performance, both on the training and the test sets.

Table 3. Wilcoxon signed-ranks test results

Comparison                R+   R−   p-value     null hypothesis of equality
A-test vs B               210  0    1.91E−007   rejected (A-test outperforms B)
B vs S-test               0    210  1.91E−007   rejected (S-test outperforms B)
A-training vs S-training  126  84               accepted
A-test vs S-test          84   126              accepted

5.4 Obtained Transformations

Figure 1 contains a sample of some of the evolved expressions for the best individual found by our method. Since the dataset has 93 attributes, the individual was composed of 93 trees, but for space reasons only the attributes relevant to the C4.5 classifier are included here.

 

Fig. 1. Tree representation of the expressions contained in a solution to the Prostate Cancer problem

6 Concluding Remarks

We have presented a new algorithm that approaches a common real-life problem for which not many solutions have been proposed in evolutionary computing. The problem in question is the repair of fractures between data by adjusting the data itself, not the classifiers built from it.

We have developed a solution to the problem by means of a GP-based algorithm that performs feature extraction on the problem dataset, driven by the accuracy of the previously built classifier.
We have applied our method to a real-world problem where data from two different laboratories regarding prostate cancer diagnosis was provided, and where the classifier learned from one did not perform well enough on the other. Our algorithm was capable of learning a transformation over the second dataset that made the classifier fit just as well as it did on the first one. The validation results with 5-fold cross-validation also support the idea that the algorithm obtains good results and has strong generalization power.
We have applied a statistical analysis methodology that supports the claim that the classifier performance obtained on the solution dataset significantly outperforms the one obtained on the problem dataset.
Lastly, we have shown the learned transformations. Unfortunately, we have not been able to extract any useful information from them yet.

Acknowledgments
Jose García Moreno-Torres was supported by a scholarship from Obra Social "la Caixa" and is currently supported by an FPU grant from the Ministerio de Educación y Ciencia of the Spanish Government and the KEEL project. Rohit Bhargava would like to acknowledge collaborators over the years, especially Dr. Stephen M. Hewitt and Dr. Ira W. Levin of the National Institutes of Health, for numerous useful discussions and guidance. Funding for this work was provided in part by the University of Illinois Research Board and by the Department of Defense Prostate Cancer Research Program. This work was also funded in part by the National Center for Supercomputing Applications and the University of Illinois, under the auspices of the NCSA/UIUC faculty fellows program.

References
1. Wyse, N., Dubes, R., Jain, A.: A critical evaluation of intrinsic dimensionality algorithms. In: Gelsema, E.S., Kanal, L.N. (eds.) Pattern Recognition in Practice, Amsterdam, pp. 415–425. Morgan Kaufmann Publishers, Inc., San Francisco (1980)
2. Kim, K.A., Oh, S.Y., Choi, H.C.: Facial feature extraction using PCA and wavelet multi-resolution images. In: Sixth IEEE International Conference on Automatic Face and Gesture Recognition, p. 439. IEEE Computer Society, Los Alamitos (2004)
3. Podolak, I.T.: Facial component extraction and face recognition with support vector machines. In: FGR 2002: Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition, Washington, DC, USA, p. 83. IEEE Computer Society, Los Alamitos (2002)
4. Pei, M., Goodman, E.D., Punch, W.F.: Pattern discovery from data using genetic algorithms. In: Proceedings of the 1st Pacific-Asia Conference on Knowledge Discovery & Data Mining, PAKDD 1997 (1997)
5. Liu, H., Motoda, H.: Feature Extraction, Construction and Selection: A Data Mining Perspective. SECS, vol. 453. Kluwer Academic, Boston (1998)
6. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)
7. Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L. (eds.): Feature Extraction, Foundations and Applications. Springer, Heidelberg (2006)
8. Tackett, W.A.: Genetic programming for feature discovery and image discrimination. In: Proceedings of the 5th International Conference on Genetic Algorithms, pp. 303–311. Morgan Kaufmann Publishers Inc., San Francisco (1993)
9. Sherrah, J.R., Bogner, R.E., Bouzerdoum, A.: The evolutionary pre-processor: Automatic feature extraction for supervised classification using genetic programming. In: Proc. 2nd International Conference on Genetic Programming (GP 1997), pp. 304–312. Morgan Kaufmann, San Francisco (1997)
10. Kotani, M., Ozawa, S., Nakai, M., Akazawa, K.: Emergence of feature extraction function using genetic programming. In: KES, pp. 149–152 (1999)
11. Bot, M.C.J.: Feature extraction for the k-nearest neighbour classifier with genetic programming. In: Miller, J., Tomassini, M., Lanzi, P.L., Ryan, C., Tetamanzi, A.G.B., Langdon, W.B. (eds.) EuroGP 2001. LNCS, vol. 2038, pp. 256–267. Springer, Heidelberg (2001)
12. Zhang, Y., Rockett, P.I.: A generic optimal feature extraction method using multiobjective genetic programming. Technical Report VIE 2006/001, Department of Electronic and Electrical Engineering, University of Sheffield, UK (2006)
13. Guo, H., Nandi, A.K.: Breast cancer diagnosis using genetic programming generated feature. Pattern Recognition 39(5), 980–987 (2006)
14. Zhang, Y., Rockett, P.I.: A generic multi-dimensional feature extraction method using multiobjective genetic programming. Evolutionary Computation 17(1), 89–115 (2009)
15. Harris, C.: An Investigation into the Application of Genetic Programming Techniques to Signal Analysis and Feature Detection. University College, London (September 26, 1997)
16. Smith, M.G., Bull, L.: Genetic programming with a genetic algorithm for feature construction and selection. Genetic Programming and Evolvable Machines 6(3), 265–281 (2005)
17. Wang, K., Zhou, S., Fu, C.A., Yu, J.X.: Mining changes of classification by correspondence tracing. In: Proceedings of the 2003 SIAM International Conference on Data Mining, SDM 2003 (2003)
18. Yang, Y., Wu, X., Zhu, X.: Conceptual equivalence for contrast mining in classification learning. Data & Knowledge Engineering 67(3), 413–429 (2008)
19. Cieslak, D.A., Chawla, N.V.: A framework for monitoring classifiers' performance: when and why failure occurs? Knowledge and Information Systems 18(1), 83–108 (2009)
20. Koza, J.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. The MIT Press, Cambridge (1992)
21. American Cancer Society: How many men get prostate cancer?
    http://www.cancer.org/docroot/CRI/content/CRI_2_2_1X_How_many_men_get_prostate_cancer_36.asp
22. Fernandez, D.C., Bhargava, R., Hewitt, S.M., Levin, I.W.: Infrared spectroscopic imaging for histopathologic recognition. Nature Biotechnology 23(4), 469–474 (2005)
23. Levin, I.W., Bhargava, R.: Fourier transform infrared vibrational spectroscopic imaging: integrating microscopy and molecular recognition. Annual Review of Physical Chemistry 56, 429–474 (2005)
24. Llorà, X., Reddy, R., Matesic, B., Bhargava, R.: Towards better than human capability in diagnosing prostate cancer using infrared spectroscopic imaging. In: Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, GECCO 2007, pp. 2098–2105. ACM, New York (2007)
25. Llorà, X., Priya, A., Bhargava, R.: Observer-invariant histopathology using genetics-based machine learning. Natural Computing: An International Journal 8(1), 101–120 (2009)
26. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco (1993)
27. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, 1–30 (2006)
28. García, S., Herrera, F.: An extension on "statistical comparisons of classifiers over multiple data sets" for all pairwise comparisons. Journal of Machine Learning Research 9, 2677–2694 (2008)
29. García, S., Fernández, A., Luengo, J., Herrera, F.: A study of statistical techniques and performance measures for genetics-based machine learning: Accuracy and interpretability. Soft Computing 13(10), 959–977 (2009)
30. García, S., Fernández, A., Luengo, J., Herrera, F.: Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information Sciences 180(10), 2044–2064 (2010)
31. Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics Bulletin 1(6), 80–83 (1945)
32. Sheskin, D.J.: Handbook of Parametric and Nonparametric Statistical Procedures, 4th edn. Chapman & Hall/CRC (2007)
Author Index

Bhargava, Rohit 185
Bull, Larry 87
Butz, Martin V. 47, 57
Casillas, Jorge 21
Enée, Gilles 107
Farooq, Muddassar 127
Franco, María 145
Goldberg, David E. 185
Gorrin, Celso 145
Howard, Gerard David 87
Lanzi, Pier-Luca 1, 70, 87
Llorà, Xavier 185
Loiacono, Daniele 1, 70
Martínez, Ivette 145
Moreno-Torres, Jose G. 185
Orriols-Puig, Albert 21
Peroumalnaïk, Mathias 107
Preen, Richard 166
Stalph, Patrick O. 47, 57
Tanwani, Ajay Kumar 127
Wilson, Stewart W. 38