

Search-Intensive Concept Induction

Attilio Giordana and Filippo Neri

Università di Torino
Dipartimento di Informatica
Corso Svizzera 185
10149 TORINO (Italy)

{attilio, neri}@di.unito.it

Abstract
This paper describes REGAL, a distributed genetic algorithm-based system, designed
for learning First Order Logic concept descriptions from examples. The system is a
hybrid between the Pittsburgh and the Michigan approaches, as the population
constitutes a redundant set of partial concept descriptions, each evolved separately. In
order to increase effectiveness, REGAL is specifically tailored to the concept learning
task; hence, REGAL is task-dependent, but, on the other hand, domain-independent.
The system proved to be particularly robust with respect to parameter setting across a
variety of different application domains.

REGAL is based on a selection operator, called the Universal Suffrage operator, which provably allows the population to asymptotically converge, on average, to an equilibrium state in which several species coexist. The system is presented both in a serial and in a parallel version, and a new distributed computational model is proposed and discussed.

The system has been tested on a simple artificial domain, for the sake of illustration,
and on several complex real-world and artificial domains, in order to show its power,
and to analyze its behaviour under various conditions. The results obtained so far
suggest that genetic search may be a valuable alternative to logic-based approaches to
learning concepts, when no (or little) a priori knowledge is available and a very large
hypothesis space has to be explored.
1. Introduction
Learning concepts from examples has been widely investigated in Machine Learning
both for its theoretical relevance and for its potential impact on many application
domains. Nevertheless, in spite of the great number of algorithms proposed so far, the problem is still far from a satisfactory general solution and, hence, this learning task remains the focus of much ongoing and new research.

One of the major obstacles in the path to effectively learning real-world concepts is computational complexity. As Mitchell [1982] noticed, the problem can be formulated as a search in a hypothesis space corresponding to candidate descriptions of the concept in a specified language. In most useful applications this space is so large that finding "good" descriptions seems almost hopeless. Nevertheless, we can think of at least two ways of surmounting this obstacle: either to use as much a priori knowledge as possible, in order to focus the search towards smaller but interesting subspaces (see, for instance, [Mitchell, Keller & Kedar-Cabelli, 1986; Tecuci, 1991; Saitta, Botta & Neri, 1993]), or to increase the algorithm's exploration power and ability to exploit computational resources, such as parallelism. Genetic Algorithms [Holland, 1975; De Jong, 1975] follow this second line.

Genetic Algorithms (GAs) are not learning algorithms per se, but they may offer a powerful and domain-independent search method to a variety of learning tasks. In fact, even though GAs have been successfully applied mainly to optimization problems, machine learning has also benefited from them. The best-known approach in this sense is the Genetic Classifiers paradigm, proposed by Holland [1986] and then investigated by many others. However, this paradigm binds the GAs to a very specific model of learning agents, which does not always fit the learning task one has to face. Therefore, other authors have investigated how to introduce GAs into existing learning frameworks. For example, proposals for using GAs in supervised concept learning have been put forward both for propositional concept representation languages [Wilson, 1987; McCallum & Spackman, 1990; De Jong & Spears, 1991; Greene & Smith, 1993; De Jong, Spears & Gordon, 1993; Janikow, 1993; Venturini, 1993] and for First Order Logic languages [Giordana & Sale, 1992; Giordana & Saitta, 1993]. Genetic algorithms have also been used to learn rules for sequential decision making [Grefenstette, Ramsey & Schultz, 1990] and to refine numeric constants in a knowledge base [Bala, De Jong & Pachowicz, 1991; Botta, Giordana & Saitta, 1993].
From the experience gathered so far, genetic algorithms seem to emerge, also in
Machine Learning, as an appealing alternative to classical search algorithms for
exploring large spaces of inductive hypotheses. In particular, they exhibit two major
characteristics which are particularly attractive: as already mentioned, GAs are highly
parallel in nature and can exploit parallel machines, whereas classical search
algorithms do not easily lend themselves to parallel processing; moreover, the type of
search performed by a GA may, in some cases, give it the capability of escaping from local minima, whereas greedy algorithms may not. For instance, Giordana and Sale [1992] present a typical induction problem in which a genetic algorithm can easily find a "perfect" solution, whereas any induction algorithm guided by information gain heuristics is bound to be misled towards suboptimal solutions. The opposite is also true: some problems, easy to solve with information-gain-based heuristics, can be hard for a genetic algorithm. For this reason, GAs assume a complementary role with respect to classical search algorithms, which aim more at a strong exploitation of the information gathered from the data than at a wide exploration of the hypothesis space.

This paper systematizes and extends in many respects the framework for learning
concept descriptions in First Order Logic previously proposed in [Giordana & Sale,
1992; Giordana & Saitta, 1993, 1994; Giordana, Neri & Saitta, 1994; Giordana, Saitta
& Zini, 1994; Neri & Giordana, 1995; Neri & Saitta, 1995a,b]. More specifically, it describes in all relevant details a substantially new version of the system REGAL, which has already been summarized in [Neri & Giordana, 1995], and focuses on the problem of learning multimodal concepts, i.e. concepts requiring disjunctive descriptions. The starting point of REGAL is the selection operator, especially tailored to the problem at hand, described in [Giordana, Saitta & Zini, 1994], where the properties of this new selection operator have been theoretically investigated and experimentally confirmed. However, the important novelty of the current version of REGAL, with respect to the one reported in [Giordana, Saitta & Zini, 1994], consists in a new distributed architecture for the genetic algorithm, which extends and modifies the classical network model [Goldberg, 1989a; Pettey & Leuze, 1989] in order to distribute different niches on different computational nodes.

The paper is organized as follows: Section 2 discusses the problem of learning disjunctive concepts with GAs, and introduces the main ideas of the approach. Sections 3 and 4 describe the concept representation language, the bit string encoding and the genetic operators. The content of these sections is mostly a more coherent and organic presentation of material scattered across several previous papers. Section 5 describes the novel distributed model of REGAL and is a more comprehensive presentation of the work presented in [Neri & Giordana, 1995]. Section 6 contains an extended set of experimental results on some well-known datasets. Finally, Section 7 provides a comparison with related work and Section 8 discusses some open problems.

2. Learning Multimodal Concepts


Given a concept description language ℒ, multimodal concepts cannot be described by a single conjunctive formula belonging to ℒ, but require a disjunctive description involving two or more conjunctive formulas. Therefore, multimodality depends upon the concept description language. However, as it is difficult to know a priori what language is most suited to a target concept, the ability to deal with disjunctive descriptions would greatly enhance the usefulness of an inductive algorithm.

Learning multimodal concepts bears similarities to the problem of learning multiple concepts from the same learning set. In some sense, a multimodal concept can be seen as a collection of subconcepts to be learned separately. The difference, however, is that in a multiple-concept learning task the clustering of the instances into the subsets corresponding to different concepts is given by a teacher, whereas in a multimodal concept learning problem the subconcepts have to be selected by the algorithm itself. In this sense, there is a stronger similarity with conceptual clustering.

Disjunctive concepts may be difficult to learn, especially in the presence of noise, because small disjuncts, covering few examples, may be confused with stochastic fluctuations. Concerning this point, Holte, Acker and Porter [1989] and Quinlan [1991] noticed that small disjuncts are more error-prone than larger ones, and suggested approaches to increase their reliability. On the other hand, small disjuncts may be necessary to improve predictive accuracy [Danyluk & Provost, 1992].

In classical (non GA-based) rule learning algorithms, the most common strategy used
for dealing with multimodality consists in learning one disjunct at a time [Michalski,
1980, 1983; Michalski et al., 1986]. The learning algorithm searches for some
conjunctive concept description; then, the positive instances covered by the
description are removed from the learning set and the algorithm starts again looking
for another description covering some (or all) of the remaining positive examples.
The procedure is repeated until all the positive instances are covered. In the case of multiple-concept learning, the same policy is iterated for each concept. On the other hand, there also exist rule learning algorithms which work on an entire classification theory, dealing with both multiple and multimodal concepts at the same time (see, for instance, [Botta & Giordana, 1993]). Decision tree construction is also typically done by generating all the disjuncts at the same time.
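As an illustration, the one-disjunct-at-a-time covering strategy described above can be sketched in a few lines. The representation of examples and the `learn_one_disjunct` procedure are hypothetical stand-ins for any conjunctive learner, not the algorithms cited above:

```python
def sequential_covering(positives, negatives, learn_one_disjunct):
    """One-disjunct-at-a-time (covering) strategy.

    learn_one_disjunct(pos, neg) is assumed to return a predicate
    (a callable) covering some of `pos` and none of `neg`.
    """
    theory = []                      # the disjunctive description being built
    remaining = set(positives)
    while remaining:
        disjunct = learn_one_disjunct(remaining, negatives)
        covered = {p for p in remaining if disjunct(p)}
        if not covered:              # learner made no progress: stop
            break
        theory.append(disjunct)
        remaining -= covered         # remove covered positives and iterate
    return theory
```

The returned theory classifies an instance as positive iff at least one of its disjuncts is satisfied.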

2.1. Genetic Algorithms for Concept Induction


Machine-learning-oriented genetic algorithms fall into two broad classes, corresponding to substantially different approaches, named the Pittsburgh and the Michigan approaches, early exemplified by the LS-1 system [Smith, 1983] and by the Classifier systems [Holland, 1986], respectively. In the Pittsburgh approach each individual in the population encodes a whole solution, whereas in the Michigan approach each individual represents only a part of the solution, and the global one is obtained by promoting co-operation and competition inside the population.

Both paradigms present advantages and drawbacks. Encoding a whole knowledge base in each individual allows an easier control of the genetic search, but introduces a large redundancy that can lead to populations that are hard to manage and to excessively long chromosomes. On the other hand, a simple GA can be used, because convergence to a single "best" individual is appropriate. In the Michigan approach, co-operation between different individuals reduces redundancy and more complex problems can be handled, but more sophisticated strategies may also have to be designed; Holland's bucket brigade is one example of this [Holland, 1986].

The task of learning concept descriptions can be mapped to both the Pittsburgh and the Michigan approaches. In the former, each chromosome encodes a whole classification theory, while the component rules correspond to different concept modalities. As mentioned before, a simple GA can be used, as a complete concept description corresponds to a "best" individual, according to a task-dependent fitness function. Recent instances of this approach are the systems GABIL [De Jong, Spears & Gordon, 1993] and GIL [Janikow, 1993], which learn concept descriptions in the VL1 language [Michalski et al., 1986], and the system described in [Koza, 1991], which learns Lisp programs.

In the Michigan approach, on the contrary, individuals naturally correspond to partial concept descriptions, which are evolved as a whole. In this approach specific strategies have to be designed in order to extract a non-redundant concept description from the population. The whole power of the Michigan approach, which also allows chains of rules to be learned, seems excessive for the considered task, and simplifications have been proposed [Wilson, 1987; Bonelli et al., 1990].

Another way of modifying the Michigan approach to learn disjunctive concepts is to allow "niche" and "species" formation, as is done in REGAL. This alternative approach requires a more complex population dynamics, as each disjunct, associated with a species, usually corresponds to a local maximum of the fitness function.

A way out of the dilemma is proposed by Venturini [1993], who suggested learning one disjunct at a time. However, this strategy offers an oversimplified solution to the problem and tends to find less general descriptions than the ones found by the above-mentioned approaches.

2.2. Theory of Niches and Species Formation


Species formation seems particularly appealing for concept learning. First of all, it allows several disjuncts, and also several concepts, to be learned simultaneously. Furthermore, computational resources are exploited more effectively by avoiding useless replications and redundancies, and by naturally exploiting distributed facilities. Early proposed methods for niche and species formation include "Crowding" [De Jong, 1975] and "Sharing Functions" [Goldberg & Richardson, 1987].

Crowding is a variant of a simple GA with respect to the replacement of older individuals: new individuals, generated by crossover and mutation, replace the older ones that are most similar to them, according to a given similarity measure. In this way, subpopulations are likely to emerge, because genetic pressure tends to manifest itself primarily among similar individuals.
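A minimal sketch of this replacement scheme on bit strings follows; the sampling of `crowding_factor` candidates follows De Jong's crowding in spirit, but the code is illustrative rather than the original algorithm:

```python
import random

def hamming(a, b):
    """Number of differing positions between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))

def crowding_replace(population, offspring, crowding_factor=3):
    """Crowding-style replacement: each new individual replaces the most
    similar member of a small random sample of the population."""
    for child in offspring:
        sample = random.sample(range(len(population)), crowding_factor)
        closest = min(sample, key=lambda i: hamming(population[i], child))
        population[closest] = child
    return population
```

Because a child tends to displace a look-alike rather than an arbitrary individual, dissimilar subpopulations are less likely to erase one another.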

The method based on sharing functions modifies the selection probability, with the aim of inhibiting the excessive growth of the genetic pressure of a subpopulation. This is achieved by reducing the fitness of an individual depending on the number of existing individuals similar to it. In the initial formulation, genotypic sharing was considered [Goldberg & Richardson, 1987]. The fitness value f(j), associated with an individual j, was considered as a reward from the environment to be shared with other individuals. Similarity between two individuals, j and j', was defined as the Hamming distance between their associated bit strings. Phenotypic sharing can also be used, by considering the distance between two individuals in their semantic domain [Deb & Goldberg, 1989]. By applying both crowding and sharing functions to a set of benchmark optimization problems, Deb and Goldberg [1989] observed that crowding was less effective, because it sometimes could not find some of the existing fitness maxima. In an earlier version of REGAL, we also tried to apply crowding; the method proved to work well for learning many unimodal concepts at one time, but was unable to allow a stable formation of subpopulations representative of disjunctive definitions of the same concept. In all the experiments performed, in the long term a single disjunct overcame the other ones.
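Genotypic sharing can be sketched as follows; the triangular sharing function and the `sigma_share` and `alpha` parameters are the standard Goldberg-Richardson formulation, reproduced here only for illustration:

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def shared_fitness(population, fitness, sigma_share=4.0, alpha=1.0):
    """Genotypic fitness sharing: each raw fitness value is divided by a
    niche count, i.e. the summed sharing contributions of all individuals
    within distance sigma_share. Cost is O(M^2) in the population size M."""
    def sh(d):   # triangular sharing function
        return 1.0 - (d / sigma_share) ** alpha if d < sigma_share else 0.0

    shared = []
    for ind, f in zip(population, fitness):
        niche_count = sum(sh(hamming(ind, other)) for other in population)
        shared.append(f / niche_count)   # niche_count >= sh(0) = 1
    return shared
```

Crowded niches thus see their fitness deflated, which is exactly the source of the quadratic cost and the parameter sensitivity discussed below.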
Phenotypic or genotypic sharing works well in vectorial spaces, where the notion of a distance has a clear semantics. Deb and Goldberg [1989] found a performance difference between phenotypic and genotypic distances, in favour of the phenotypic one: using the Hamming distance as the genotypic distance, the GA showed a behaviour intermediate between crowding and phenotypic sharing with respect to the ability to find all the existing maxima. Another method for generating niches, which is based on tags and potentially does not suffer from the previous problems, has been recently proposed by Spears [1994]; in principle, it could be a viable alternative to the solution adopted for REGAL that is described in the following.

2.3. REGAL's Species Formation Approach


REGAL's approach to concept learning is a hybrid between the Michigan and the Pittsburgh approaches, because neither does a single individual encode a complete description, nor is the whole population a potential solution. In REGAL each individual encodes a partial solution, i.e. one disjunct consisting of a conjunctive formula, and the whole population is a redundant set of these partial solutions.

The difference with respect to the Michigan approach is that each individual evolves separately, and only at the end of the run is a complete solution formed. Despite its apparent similarity to a classifier system, REGAL deeply differs from it, because it uses a totally different mechanism to promote the formation of species of conjunctive formulas competing to cover the instances in the learning set.

Species formation is achieved in REGAL by acting on the selection mechanism, as the sharing function method does. The latter maintains the basic selection scheme of a simple GA and achieves species formation through the definition of a more complex "shared fitness", which reduces the genetic pressure of individuals similar to each other. The approach taken in REGAL is the opposite: a very simple fitness function is used, whereas the selection mechanism itself is modified, in order to take advantage of the peculiar characteristics of the addressed task. In other words, the proposed selection method is task-dependent, but domain-independent.

There are several reasons why we did not simply use sharing functions. The first, and most fundamental, one is the difficulty of defining a meaningful distance measure. REGAL is designed to learn concepts expressed in a First Order Logic (FOL) language, and no satisfactory general definition of a distance among FOL formulas exists. Even if such a measure could be defined, it would be strictly domain-dependent; this would limit the applicability of the system to a whole spectrum of domains. The situation is made even worse by the problem-dependency of the parameters appearing in the sharing function method. A second reason is computational complexity: the shared fitness evaluation requires, at each generation, a number of steps proportional to M², where M is the cardinality of the population. A third reason derives from the fact that sharing functions (based on either genotypic or phenotypic distance) can be proved unable to separate two species under certain conditions that cannot be tested a priori [Giordana, Neri & Saitta, 1994; Neri & Saitta, 1995b].

The selection mechanism used in REGAL is obtained by introducing an operator, called Universal Suffrage (US), whose basic properties are investigated in [Giordana, Saitta & Zini, 1994; Neri & Saitta, 1995b]. The US operator does not use any distance measure, and does not require any additional parameter setting when a new problem is addressed. Furthermore, the cost of fitness evaluation at each generation is proportional to M. Finally, if any disjunct, however small, is necessary for covering a part of the concept extension, the selection mechanism is "guaranteed" to maintain it in the population.

When growing different species, an important issue is how to avoid "lethal matings", i.e., matings that are bound to produce bad offspring. If two conjunctive subconcepts, φ and ψ, cannot be generalized further to a single conjunctive description, crossing φ with ψ will inexorably generate less fit individuals. We will discuss later on how REGAL faces this problem, by imposing a long-term control strategy on the basic genetic evolution, which exploits the features of the distributed implementation.

Learning one disjunct at a time may be an effective solution to the problem of lethal matings [Venturini, 1993]. However, one can expect more difficulties in generalizing the latest and smallest disjuncts, because of the absence of individuals covering the previously removed examples. A possible remedy for this drawback may be to use task-oriented mutation operators, such as the "dropping condition" or "adding condition" operators, in order to obtain from them what cannot be obtained by pure crossover [De Jong, Spears & Gordon, 1993; Greene & Smith, 1993; Janikow, 1993].
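Such task-oriented mutation operators can be sketched as follows; the list-of-conditions representation of a conjunctive formula is an illustrative assumption, not the encoding of any of the cited systems:

```python
import random

def dropping_condition(formula, rng=random):
    """Generalize a conjunctive formula (a list of conditions) by
    removing one randomly chosen conjunct."""
    if len(formula) <= 1:
        return list(formula)
    out = list(formula)
    out.pop(rng.randrange(len(out)))
    return out

def adding_condition(formula, candidates, rng=random):
    """Specialize a conjunctive formula by adding one condition
    from `candidates` that is not already present."""
    unused = [c for c in candidates if c not in formula]
    if not unused:
        return list(formula)
    return list(formula) + [rng.choice(unused)]
```

Dropping a conjunct enlarges the set of covered instances, adding one shrinks it, so the two operators move a hypothesis along the generality ordering without relying on crossover.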

3. Knowledge Representation and Encoding

REGAL is able to learn from structured concept instances. In particular, a concept instance is represented as a collection of objects, each one described by an attribute vector. As discussed by Quinlan [1991], a First Order Logic concept description language, i.e. a language with variables, is more appropriate for dealing with this kind of representation.

The concept description language ℒ used by REGAL is a First Order Logic language, intermediate between VL2 and VL21 [Michalski, 1983; Michalski et al., 1986]. More precisely, ℒ is a Horn clause language, in which terms can be variables or disjunctions of constants, and negation occurs in a restricted form. Internal disjunction has been widely used in concept learning (see, for instance, the systems AQ [Michalski et al., 1986], Induce [Michalski, 1980], GABIL [De Jong, Spears & Gordon, 1993], and GIL [Janikow, 1993]), because it allows compact inductive hypotheses to be expressed. An example of an atomic expression containing a disjunctive term is "color(x, yellow ∨ green)", which is semantically equivalent to "color(x, yellow) ∨ color(x, green)", but is more natural.

In REGAL, an atomic formula of arity m has the syntactic form P(x1, x2, ..., xm, K), where x1, x2, ..., xm are variables and the term K is a disjunction of constant terms, denoted by [v1, v2, ..., vn], or the negation of such a disjunction, denoted by ¬[v1, v2, ..., vn]. At most one term can be a (positive or negative) disjunctive expression, while all the others must be variables. Examples of well-formed atomic expressions are:

color(x1, [yellow, green]), color(x1, ¬[blue, black])
far(x1, x2, [2, 3, 4]), greater(x1, x2), tall(x1)

Notice that the expression color(x1, ¬[blue, black]) is semantically equivalent to ¬color(x1, blue) ∧ ¬color(x1, black). Therefore, the negation of an internal disjunction is equivalent to a conjunction of negated atoms in clausal form, as, for instance, in FOIL [Quinlan, 1990]. When a formula φ is evaluated on a concept instance x, each variable in φ has to be bound to some object occurring in the description of x. Then, the predicates occurring in φ are evaluated on the basis of the attributes of the objects bound to their variables. As the binding between the variables in φ and the objects in x can be chosen in many ways, φ is said to be true of x iff there exists at least one choice such that all the predicates occurring in φ are true. How the predicate semantics is evaluated in terms of the object attributes must be explicitly defined by the user before beginning to run REGAL on a specific application. In order to clarify this point, suppose a concept instance is composed of three objects:
o1: <color = black, position = 1, size = 4, height = 3>
o2: <color = green, position = 2, size = 1, height = 1>
o3: <color = blue, position = 3, size = 10, height = 5>

Each object is described by four attributes: COLOR, POSITION, SIZE and HEIGHT. Suppose, moreover, that the semantics for the predicates color(x1, [yellow, green]), far(x1, x2, [2, 3, 4]), greater(x1, x2) and tall(x1) is defined by the expressions:

member(COLOR(x1), K); member(|POSITION(x1) - POSITION(x2)|, K);
SIZE(x1) > SIZE(x2); HEIGHT(x1) > 6

respectively. Then, the predicate color(x1, [yellow, green]) will be true when the variable x1 is bound to object o2, whereas it is false for any other binding (the term K is bound to [yellow, green]). In a similar way, far(x1, x2, [2, 3, 4]) is true only when x1 and x2 are bound to o3 and o1 (in either order, the distance being symmetric; the term K is bound to [2, 3, 4]), greater(x1, x2) is true for the three bindings <x1 = o1, x2 = o2>, <x1 = o3, x2 = o1>, <x1 = o3, x2 = o2>, and tall(x1) is always false.
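The example above can be reproduced in a few lines; the representation of objects as dictionaries and the `holds` helper are illustrative, not REGAL's code:

```python
from itertools import product

# The concept instance of the example: three objects, each an attribute vector
instance = {
    "o1": {"color": "black", "position": 1, "size": 4,  "height": 3},
    "o2": {"color": "green", "position": 2, "size": 1,  "height": 1},
    "o3": {"color": "blue",  "position": 3, "size": 10, "height": 5},
}

# User-defined predicate semantics, in terms of the object attributes
def color(x, K):     return instance[x]["color"] in K
def far(x1, x2, K):  return abs(instance[x1]["position"] - instance[x2]["position"]) in K
def greater(x1, x2): return instance[x1]["size"] > instance[x2]["size"]
def tall(x):         return instance[x]["height"] > 6

def holds(formula, n_vars):
    """A formula is true of the instance iff at least one binding of its
    variables to objects satisfies all its predicates."""
    return any(formula(*binding) for binding in product(instance, repeat=n_vars))
```

For instance, `holds(lambda x: color(x, ["yellow", "green"]), 1)` is true (x bound to o2), while `holds(tall, 1)` is false for every binding.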

In the following we will show how concept descriptions in the language ℒ can be represented on a bit string of fixed length.

3.1. Biasing the Induction Process by means of a Language Template

All learning algorithms use some kind of bias (i.e., a set of implicit or explicit constraints) to restrict the inductive hypothesis space. A method widely used in systems integrating induction and deduction in FOL frameworks consists in using meta-syntactic constructs to define the set of admissible formulas to be explored. Examples of this approach include the predicate schemes [Bergadano, Giordana & Ponsero, 1989; Bergadano & Gunetti, 1994; Baroglio, Botta & Saitta, 1994], the schemata used in MOBAL [Morik, 1991], and the relational clichés used in FOCL [Silverstein & Pazzani, 1991]. REGAL adopts a similar approach and limits the hypothesis space by means of a Language Template. Informally, the language template is a formula L belonging to ℒ, such that every admissible conjunctive concept description can be obtained from L by deleting some constants from the internal disjunctions occurring in it.

Before giving a formal definition of L, it may be useful to revisit the logical notions, informally introduced above, in order to define the concept of completed form for a disjunctive term and for a predicate. Given a FOL formula φ(x1, x2, ..., xn), to evaluate the truth of φ in a universe U means to search for a binding between the variables x1, x2, ..., xn and some constants a1, a2, ..., an, defined in U, such that φ(a1, a2, ..., an) is satisfied. Notice that in supervised concept learning each learning instance is a specific universe. We can introduce the following:

Definition 1: Given a predicate P(x1, x2, ..., xm, K), where K is a positive disjunction of constants [v1, v2, ..., vn], P is said to be in completed form if the set [v1, v2, ..., vn] is such that P can be satisfied for any binding of the variables x1, x2, ..., xm.

Definition 1 states that a predicate P containing a disjunctive term in completed form is true of every instance in the learning set. Considering again the predicate color(x, K), it will be in completed form if K is the set of all possible colors. In order to distinguish predicates in completed form from the others, the former will be denoted in italics throughout the paper.

Definition 2: A language template L is a formula of ℒ containing at least one predicate in completed form.

The purpose of a template L is to define a learning problem for REGAL. The predicates not in completed form occurring in L have the role of constraints and must be satisfied by the specific binding chosen for the variables in L. On the contrary, the predicates in completed form (true by definition) are used to define the search space characterizing the learning problem. Deleting a constant from a positive completed term occurring in a predicate P makes the atomic assertion P more specific. More generally, any formula obtained by dropping some constants from L is more specific than L itself. Given a template L, the search space explored by REGAL is restricted to the set H(L) of formulas that can be obtained by deleting some constants from the completed terms occurring in L.

It is easy to see that, L being a conjunctive formula, it can be written as the conjunction of two sub-formulas: Lc, consisting of non-completed predicates, and Ls, consisting of completed predicates. The template must be declared at the beginning, by separately specifying Lc and Ls. As a particular case, the formula Lc can be empty, whereas Ls must always contain at least one completed predicate.

As the set of constants necessary to complete a disjunctive expression may be very large (or even infinite), a special constant symbol "*" has been introduced in L, playing the role of a wildcard. Let K = [v1, v2, ..., vn, *] be a completed term; the meaning of the constant "*" is implicitly defined as * = ¬[v1, v2, ..., vn]. In other words, it means "any other value not explicitly mentioned in the disjunctive term it occurs in". The semantics associated with the symbol "*" is local to the specific disjunctive term it occurs in, and is statically defined when the template is declared. Afterwards, it maintains the initial semantics in all the disjunctive terms derived from the same completed term. An example of a language template is reported in Figure 1, together with two formulas belonging to its associated hypothesis space.

It is worth noting how the use of the wildcard "*" naturally leads to the introduction of negated disjunctive terms. For instance, the predicate shape(x, ¬[square, circle]), occurring in φ1, is just a rewriting of the predicate shape(x, [triangle, *]), where * = ¬[square, triangle, circle], according to the declaration in the template. More generally, "*" can always be eliminated from an inductive hypothesis by introducing a negated term. Finally, we notice that completed predicates occurring in an inductive hypothesis can be dropped without modifying their semantics, being tautologically true. As an example, in formula φ2 the completed predicate shape(x, [square, triangle, circle, *]) has been dropped.

Template L
L = weight(x, [3, 4, 5]) ∧ color(x, [red, blue, *]) ∧
    shape(x, [square, triangle, circle, *]) ∧ far(x, y, [1, 2, 3, 4, 5, *])
Lc ≡ weight(x, [3, 4, 5])
Ls ≡ color(x, [red, blue, *]) ∧ shape(x, [square, triangle, circle, *]) ∧
    far(x, y, [1, 2, 3, 4, 5, *])
φ1 = weight(x, [3, 4, 5]) ∧ color(x, [red]) ∧ shape(x, ¬[square, circle]) ∧
    far(x, y, [1, 2])
φ2 = weight(x, [3, 4, 5]) ∧ color(x, ¬[red]) ∧ far(x, y, [1, 2])

Figure 1 – Example of a language template L with its sub-formulas Lc and Ls. Formulas φ1 and φ2 belong to the space H(L). All formulas are to be considered implicitly existentially quantified.

3.2. Mapping FOL Formulas to Bit Strings

An inductive hypothesis φ, belonging to the space H(L), corresponds to the body of an implication rule φ → h, where h is the name of a target concept. As we will show in the following, formulas in H(L) can be easily represented on a fixed-length bit string. We observe that only the completed predicates in Ls need to be processed, while the sub-formula Lc must occur, by default, in any inductive hypothesis, so that it does not need to be represented. It is immediate to map Ls to a fixed-length bit string s(Ls), where each constant occurring in a term in completed form is associated with a specific bit in s(Ls). By keeping adjacent the bits corresponding to literals that are adjacent in the template, each completed predicate will correspond to a specific substring and, hence, decoding a bit string is straightforward.
The semantic interpretation of the alleles in the bit string is as follows: if the bit corresponding to a given term v in a predicate P is set to 1, then v belongs to the current internal disjunction of P, whereas, if it is set to 0, it does not belong to it. As removing a term from the internal disjunction of a predicate makes it more constrained, i.e., more specific, the operations turning a 1 into a 0 in a bit string will be called specializations. For the opposite reason, the operations turning a 0 into a 1 will be called generalizations. Examples of bit strings encoding the formulas of Figure 1 are given in Figure 2.

The bit string univocally characterizes an inductive hypothesis φ, and it is the information processed by the genetic algorithm. As a matter of fact, in REGAL an individual is a more complex structure, containing also other information processed by the evaluation procedure. In particular, it contains a bitmap F, specifying the set of examples satisfying φ, the class h assigned to φ, and the fitness f. In the evaluation phase, each newly generated individual is matched against all the learning examples and the bitmap F is filled. Afterwards, the class h is assigned: if there is a single target concept, the class is univocally determined; otherwise the system assigns the concept which has the most representatives among the learning instances satisfying φ. Finally, the fitness f is evaluated with respect to the assigned class h.

color(x, [red, blue, *]) ∧ shape(x, [square, triangle, circle, *]) ∧ far(x, y, [1, 2, 3, 4, 5, *])

s(Ls) = 1 1 1   1 1 1 1   1 1 1 1 1 1
φ1  ⇒  1 0 0   0 1 0 1   1 1 0 0 0 0
φ2  ⇒  0 1 1   1 1 1 1   1 1 0 0 0 0

Figure 2 – Bit strings corresponding to the template Ls and to the formulas φ1 and φ2
reported in Figure 1. Every term, including '*', occurring in an internal disjunction has a
corresponding bit on the bit string.
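To make the mapping concrete, the encoding of Section 3.2 can be sketched as follows. The template mirrors the one of Figure 2; the dictionary-based representation of a hypothesis is our own illustrative choice, not REGAL's actual data structure.

```python
# A minimal sketch of REGAL's bit-string encoding of internal disjunctions,
# assuming the template of Figure 2. Predicate and term names follow the
# figure; the hypothesis representation is illustrative.
TEMPLATE = [
    ("color", ["red", "blue", "*"]),
    ("shape", ["square", "triangle", "circle", "*"]),
    ("far",   ["1", "2", "3", "4", "5", "*"]),
]

def encode(hypothesis):
    """Map {predicate: set of admitted terms} to a list of 0/1 bits."""
    bits = []
    for pred, terms in TEMPLATE:
        admitted = hypothesis.get(pred, set())
        bits.extend(1 if t in admitted else 0 for t in terms)
    return bits

def decode(bits):
    """Inverse mapping: each substring gives one predicate's internal disjunction."""
    hypothesis, i = {}, 0
    for pred, terms in TEMPLATE:
        hypothesis[pred] = {t for t, b in zip(terms, bits[i:i + len(terms)]) if b}
        i += len(terms)
    return hypothesis

phi1 = {"color": {"red"}, "shape": {"triangle", "*"}, "far": {"1", "2"}}
s1 = encode(phi1)
assert len(s1) == 13          # one bit per term of the template, '*' included
assert decode(s1) == phi1     # decoding a bit string is straightforward
```

Because substrings of adjacent bits correspond to single predicates, decoding reduces to slicing the string at known offsets, exactly as stated above.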

3.3. Comparing REGAL's and Other Systems' Description Languages
The concept description language used by REGAL has two features that need further
discussion, namely the fixed size of the template, which imposes an upper bound on
the complexity of the inductive hypotheses, and the restricted use of negation.
The existence of an upper bound on the maximum complexity of inductive
hypotheses, be it explicit or implicit, holds for most learning algorithms. Apart
from systems working in propositional logic, where such a bound is naturally
determined by the number of attributes used to describe the learning events, many
systems working in FOL also have such a bound. For instance, the system FOIL
[Quinlan, 1990], which learns concept definitions represented as Horn clauses using a
general-to-specific strategy, adopts a heuristic rule that halts the specialization of an
inductive hypothesis when the complexity of the formula becomes greater than the
complexity of the instances it covers. A similar criterion is also used by other
systems, such as SMART+ [Botta & Giordana, 1993]. In other cases, the complexity
limit is determined by the amount of resources the system can use.

From these considerations we can conclude that the apparent limitation due to the
predefined complexity of the template is not a real one, provided that sufficiently long
templates can be handled when necessary. As a matter of fact, REGAL proved able
to work well with strings several hundreds of bits long. Therefore, we may assume
to be quite free in designing the template according to the real needs of the problem,
without encountering excessive difficulties.

In a FOL language the choice of an adequate template may be non-trivial. In
order to estimate a reasonable complexity for a template, we can observe that a
non-constructive, bottom-up generalization process, using only "dropping condition"
and "turning constants into variables" rules, will never generate a conjunctive
description more complex than that of the most complex example1.

Therefore, one method for designing the template is to examine the learning
instances and to insert in the template only the literals that occur in ground form in
at least one of them. In this way, the hypothesis space will not be larger than the
one explored by an exhaustive bottom-up generalization algorithm.

A second, more ambitious alternative is to use a deduction system to derive the
template from an existing domain theory and, possibly, from the learning set. This is
the approach we are currently pursuing.

The limitation on the use of negation is more substantial. As explained before, the
negation used in REGAL is basically the negation of atoms. In a previous version of
REGAL we also considered another form of negation, which greatly increases the
power of the language, namely the negation of existentially quantified formulas. This
form of negation, widely used in Logic Programming, can be learned by systems such as

1 Of course, this is not true if the concept description language is not function-free.
SMART+ [Botta & Giordana, 1993], FOIL [Quinlan, 1990] and FOCL [Pazzani &
Kibler, 1992]. A negated existentially quantified formula has the following syntax:

¬∃<y1, ..., ym> [ψ(x1, ..., xn, y1, ..., ym)]

where ψ is a conjunction of (possibly internally disjunct) predicates, each one
containing at least one variable in the set y1, ..., ym. In theory, it is easy to extend the
method we propose for mapping Ls into a bit string so as to also account
for this form of negation. One such method has been described in [Giordana & Saitta,
1993]. Currently, this form of negation is not in use, because the genetic search had
difficulties dealing with it, owing to the highly epistatic interaction between the
positive and the negative parts of a formula during reproduction by crossover. In
order to gain an intuition of the problem, let us consider the formula:

φ(x1, ..., xn) ∧ ¬∃<y1, ..., ym> [ψ(x1, ..., xn, y1, ..., ym)]

The positive part φ requires at least one binding b1 for the variables x1, ..., xn such
that φ is satisfied. The negated part requires that, given b1, there must not be any
binding b2 for y1, ..., ym such that ψ is satisfied. Modifying the formula φ by
crossover or mutation can easily modify the set of possible bindings b1, so that the
negated part becomes unsatisfied and the whole formula becomes false. Therefore,
experimentation showed, in general, a very unstable behavior under crossover and
mutation.

4. The Genetic Operators


In this section, the genetic operators and the fitness function used by REGAL are
described. The system uses selection, crossover, mutation, and seeding operators. The
most relevant differences with respect to a classical Genetic Algorithm concern
selection and seeding; hence, we start by describing crossover and mutation, which are
closer to the usual ones.

4.1. Crossover and Mutation Operators


The crossover operators have already been used in GA-SMART [Giordana & Sale,
1992]; they consist of the two-point [De Jong, 1975] and uniform [Syswerda, 1989]
crossovers, and of the generalizing and specializing crossovers, specifically designed
for the task at hand. These crossovers were chosen empirically, after experimenting
with several datasets.

The generalizing and specializing crossovers need additional explanation. As
shown in Figure 2, the string s(Ls) is divided into substrings, each one
corresponding to a specific predicate P. In both crossover types, a set D of predicates
is randomly selected from Ls. The specializing (generalizing) crossover works as
follows:
(a) The substrings in the parent strings s1 and s2, corresponding to the predicates not
selected in D, are copied unchanged into the corresponding offspring s1' and s2'.
(b) For each predicate Pi ∈ D, a new substring si' is generated by AND-ing (OR-ing)
the bits of the corresponding substrings s1(Pi) and s2(Pi). The substring si' is then
copied into both s1' and s2'.
The set D is created by assigning to each predicate Pi an equal probability pd of
being selected. As explained in Section 3.2, turning a 0 into a 1 in a bit string adds a
term to an internal disjunction occurring in the corresponding inductive hypothesis,
making it more general, i.e., less constrained. The generalizing crossover is so named
because the offspring it generates are more general than the parents. The symmetric
consideration holds for the specializing crossover.
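The two steps above can be sketched as follows. The bit-level representation and the `layout` parameter (one substring length per predicate) are illustrative assumptions; the AND/OR merging of the selected substrings is the mechanism described in (a) and (b).

```python
import random

# Sketch of the specializing/generalizing crossover. Parents are aligned
# bit strings partitioned into per-predicate substrings; `layout` gives the
# substring length of each predicate and p_d the per-predicate selection
# probability (both names are illustrative).
def gen_spec_crossover(s1, s2, layout, p_d, generalize, rng=random):
    off1, off2, i = [], [], 0
    for length in layout:
        a, b = s1[i:i + length], s2[i:i + length]
        if rng.random() < p_d:            # predicate selected into D
            if generalize:                # OR the bits -> more general offspring
                merged = [x | y for x, y in zip(a, b)]
            else:                         # AND the bits -> more specific offspring
                merged = [x & y for x, y in zip(a, b)]
            off1.extend(merged); off2.extend(merged)
        else:                             # unselected substrings copied unchanged
            off1.extend(a); off2.extend(b)
        i += length
    return off1, off2

s1 = [1, 0, 0, 0, 1, 0, 1]
s2 = [0, 1, 1, 1, 1, 1, 0]
o1, o2 = gen_spec_crossover(s1, s2, layout=[3, 4], p_d=1.0, generalize=True)
assert o1 == o2 == [1, 1, 1, 1, 1, 1, 1]       # OR of every substring
o1, _ = gen_spec_crossover(s1, s2, layout=[3, 4], p_d=1.0, generalize=False)
assert o1 == [0, 0, 0, 0, 1, 0, 0]             # AND of every substring
```

Note that both offspring receive the same merged substring for a selected predicate, as point (b) prescribes.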

In the remainder of the paper, E and C denote the numbers of positive and negative
examples of a target concept, respectively. Given a pair of strings (s1, s2), generated
by the mating procedure, crossover is applied with an assigned probability pc
(currently, pc = 0.6). Then, the specific crossover type is selected stochastically by
taking into account the features of s1 and s2. The conditional probabilities pu of
uniform crossover, p2pt of two-point crossover, ps of specializing crossover and pg of
generalizing crossover, given that crossover application has been selected, are
assigned as follows:
pu = (1 − α fn) β
p2pt = (1 − α fn) (1 − β)        (4.1)
ps = α fn ρ
pg = α fn (1 − ρ)

In expressions (4.1), α and β (α, β ∈ [0, 1]) are tunable parameters, fn is the normalized
mean value of the fitness of the two formulas φ1 and φ2, defined by strings s1 and s2,
respectively:

fn = [f(φ1) + f(φ2)] / (2 fMax) ≤ 1

and ρ is the ratio

ρ = [n⁺(φ1) + n⁻(φ1) + n⁺(φ2) + n⁻(φ2)] / [2 (E + C)]

where n⁺(φ) and n⁻(φ) are the numbers of positive and negative learning instances
covered by φ, respectively.

Expressions (4.1) have been empirically determined, after long experimentation
with GA-SMART [Giordana & Sale, 1992], with the goal of realizing a form of
adaptive choice of the operators. Adaptive selection of the genetic operators is not a
novelty: the system GABIL [De Jong, Spears & Gordon, 1993], for instance, also
uses a method for dynamically selecting the most suitable operators to apply.

In the specific case, the normalized mean fitness fn determines the probabilities
pu + p2pt = 1 − α fn and ps + pg = α fn of choosing the crossover between the uniform
and the two-point crossover, on the one side, or between the specializing and
generalizing crossover, on the other. When s1 and s2 have a low fitness, i.e., are not
yet good inductive hypotheses, the chance of applying the first two crossovers is high,
in order to exploit their greater exploration power. On the contrary, high values of the
fitness privilege the use of the specializing and generalizing crossovers, in order to
refine the inductive hypotheses. The choice between the uniform and the two-point
crossover is statically controlled by the parameter β, whereas the choice between the
specializing and generalizing crossover is controlled by the ratio ρ, which is
evaluated on-line on the basis of the examples covered by s1 and s2: when they cover
many examples, the specializing crossover is privileged, in order to increase the
chances of creating a more consistent offspring; otherwise, the generalizing crossover
gets a higher probability.
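This adaptive scheme can be sketched as follows. The values of α and β are illustrative, fMax = 1.1 comes from the fitness (4.5) used later, and the normalization of ρ by 2(E + C) follows our reading of the ratio above.

```python
# Sketch of the adaptive operator choice of expressions (4.1). Parameter
# values for alpha and beta are illustrative assumptions; f_max = 1.1 is
# the maximum of the fitness (4.5).
def crossover_probabilities(f1, f2, n_cov1, n_cov2, E, C,
                            alpha=0.5, beta=0.5, f_max=1.1):
    fn = (f1 + f2) / (2 * f_max)          # normalized mean fitness, <= 1
    rho = (n_cov1 + n_cov2) / (2 * (E + C))   # normalized joint coverage
    p_u   = (1 - alpha * fn) * beta           # uniform crossover
    p_2pt = (1 - alpha * fn) * (1 - beta)     # two-point crossover
    p_s   = alpha * fn * rho                  # specializing crossover
    p_g   = alpha * fn * (1 - rho)            # generalizing crossover
    return p_u, p_2pt, p_s, p_g

p = crossover_probabilities(f1=1.0, f2=0.8, n_cov1=30, n_cov2=20, E=40, C=10)
assert abs(sum(p) - 1.0) < 1e-9   # the four probabilities always sum to 1
```

High fn shifts mass from (p_u, p_2pt) to (p_s, p_g); within the latter pair, high coverage ρ favors specialization, exactly as described above.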

We notice that the parameters occurring in (4.1) are not critical; in fact, after an initial
tuning, they have never been changed across the range of applications tried.

As far as the mutation operator is concerned, it is identical to the classical one. It is
applied to offspring with probability pm = 0.0001 and can affect any bit of s(Ls).

4.2. Seeding Operator


As will be explained later, the selection operator used by REGAL favors
hypotheses covering more positive instances of the learning set. If some instance is
not yet covered, it is sensible to dynamically generate new individuals covering it.
This facility is offered by the seeding operator, which is reminiscent of the use of a
"seed" in Induce [Michalski, 1980, 1983] and Rigel [Gemello, Mana & Saitta, 1991],
and can be seen as a more sophisticated version of the new event operator used in
GIL [Janikow, 1993] and of the creation operator used in SIA [Venturini, 1993].
More precisely, the seeding operator acts as a function which, given a positive
instance, returns a formula covering it. The abstract description of the seeding
algorithm is the following:

Seeding(x)
  Let x be a positive learning instance
  Generate a random bit string s defining a formula φ ∈ H(L)
  Randomly select a binding b on x compatible with Lc
  Turn to 1 the smallest set of bits in s necessary to satisfy φ on x
  Return(φ)

As an example, consider again the language template reported in Figure 1 and
suppose that the example x to be covered consists of the following sequence of three
objects:

o1: <color = blue, position = 1, weight = 4, shape = square>
o2: <color = green, position = 2, weight = 1, shape = square>
o3: <color = blue, position = 5, weight = 10, shape = circle>

Suppose, moreover, that the binding <x = o1; y = o3> satisfies the constraint
weight(x, [3, 4, 5]). Let the string s = "1000110101001", corresponding to the formula
φ = color(x, [red]) ∧ shape(x, [triangle, circle]) ∧ far(x, y, [1, 3, *]), be selected:
φ is not satisfied by x. Then, the string s will be modified into s' = "1101110101001",
corresponding to φ' = color(x, [red, blue]) ∧ shape(x, [square, triangle, circle]) ∧
far(x, y, [1, 3, *]), which is now satisfied by x.

Provided that compatibility with the constraints imposed by the template Lc is
assured, the seeding algorithm tries to introduce as much randomness as possible, in
order to increase the genetic diversity of the population. It is used at the beginning of
the search to initialize the population A(0) and, afterwards, it is called either by the
selection operator or as an alternative to the classical mutation operator.
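A simplified, single-binding sketch of the seeding operator is given below. The template and the way the covering bit is located are illustrative assumptions; REGAL actually selects a binding compatible with Lc over multi-object FOL instances.

```python
import random

# Simplified seeding sketch: the feature values observed in the instance
# determine which bits must be set to 1 for the formula to cover it.
# Template and instance format are illustrative.
TEMPLATE = [("color", ["red", "blue", "*"]),
            ("shape", ["square", "triangle", "circle", "*"])]

def _covering_index(terms, observed):
    # Bit of the observed value, or of '*' if the value is not a template term.
    return terms.index(observed) if observed in terms else terms.index("*")

def seeding(instance, rng=random):
    """Return a random bit string guaranteed to cover `instance`."""
    bits = []
    for pred, terms in TEMPLATE:
        sub = [rng.randint(0, 1) for _ in terms]   # random start
        sub[_covering_index(terms, instance[pred])] = 1   # smallest fix to cover x
        bits.extend(sub)
    return bits

def covers(bits, instance):
    i, ok = 0, True
    for pred, terms in TEMPLATE:
        sub = bits[i:i + len(terms)]
        ok = ok and sub[_covering_index(terms, instance[pred])] == 1
        i += len(terms)
    return ok

x = {"color": "blue", "shape": "square"}
assert all(covers(seeding(x), x) for _ in range(100))
```

The random initial string preserves genetic diversity; only the bits strictly needed for coverage are forced to 1, mirroring the "smallest set of bits" step of the algorithm.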

4.3. The Universal Suffrage Selection Operator


In this section, we describe the new Universal Suffrage (US) selection operator,
especially tailored to the concept learning problem. The basic idea can be explained
through a metaphor. Conjunctive formulas are "candidates" to be elected to a
parliament (the population), whereas the positive training examples are the voters; an
example can vote for one of the formulas that cover it. The main departure of the US
operator from the others proposed so far resides in the fact that the individuals to be
mated are not chosen directly from the current population but, instead, indirectly,
through the selection of an equal number of positive examples, as we will now
describe.
Let the population A(t) = {φ1^x1(t), ..., φm^xm(t)} at generation t be a multiset of
cardinality M; A(t) contains m different individuals (i.e., conjunctive formulas of the
description language), each one occurring with multiplicity xj(t) (1 ≤ j ≤ m). Let
COV(φj) be the subset of positive examples covered by φj. Informally, the selection
procedure works as follows. At each generation t, a number γ·M ≤ M of examples is
randomly selected with replacement² from the set of positive examples. The parameter
γ is the generation gap and represents the proportion of individuals selected for
mating. To each selected example xk, a set R(xk), containing the formulas belonging
to A(t) and covering xk, is associated. The set R(xk) corresponds to a "roulette wheel"
rk, divided into sectors, each one associated to a φj ∈ R(xk). The extension of the
sector associated to φj is proportional to the ratio between φj's total fitness (i.e., the
fitness of φj multiplied by φj's multiplicity xj(t) in A(t)) and the sum of the total
fitness values of all the formulas occurring in R(xk). At each spin of the wheel rk,
a winning formula is chosen. More formally:

Universal Suffrage Selection

Let B(t) = ∅
Randomly select with replacement γ·M examples from the set of positive examples
for each selected example xk do
  if R(xk) ≠ ∅ then spin rk and add the winning formula to B(t)
  else create a new formula ψ covering xk by applying the
       seeding operator, and add ψ to B(t)
end

Figure 3 shows a graphical representation of the basic cycle of REGAL. According to
the "parliament" metaphor, at each generation γ·M positive examples are requested
to express their preference, which they do by "voting" for one of their
"representatives" through the spinning of their own roulette wheel. It is important to
notice that only formulas covering the same examples compete with each other, and
the examples (stochastically) "vote" for the best of them. The selection process favors
formulas with larger coverage (i.e., occurring in more roulette wheels) and formulas
with higher total fitness (i.e., with greater probability of winning in the roulette wheels
in which they appear).

2 Extraction is performed with replacement for dealing with the case E < M.
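The selection procedure can be sketched as follows. The representation of individuals as (formula, fitness, multiplicity, coverage) tuples and the `seed_fn` callable standing in for the seeding operator are illustrative assumptions.

```python
import random

# Sketch of one Universal Suffrage selection step. Each population entry is
# (formula, fitness, multiplicity, set-of-covered-positive-examples); the
# seeding fallback is a caller-supplied callable (names are illustrative).
def us_selection(population, positives, n_select, seed_fn, rng=random):
    mates = []
    for _ in range(n_select):
        x = rng.choice(positives)                 # extraction with replacement
        # Roulette wheel R(x): formulas covering x, weighted by total fitness.
        wheel = [(phi, f * mult) for phi, f, mult, cov in population if x in cov]
        if not wheel:                             # no representative: seed one
            mates.append(seed_fn(x))
            continue
        total = sum(w for _, w in wheel)
        spin, acc = rng.uniform(0, total), 0.0
        for phi, w in wheel:                      # spin the wheel
            acc += w
            if spin <= acc:
                mates.append(phi)
                break
    return mates

pop = [("phi1", 1.0, 2, {0, 1}), ("phi2", 0.5, 1, {2})]
sel = us_selection(pop, positives=[0, 1, 2, 3], n_select=50,
                   seed_fn=lambda x: f"seed({x})")
assert all(s in ("phi1", "phi2", "seed(3)") for s in sel)
```

Only formulas covering the drawn example enter its wheel, so competition is local to each example, as the text emphasizes.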

[Figure 3 diagram: from the population A(t), with |A(t)| = M, γ·M positive examples
are extracted with replacement from the learning set of cardinality E; roulette-wheel
spinning, backed by seeding, produces B(t) with |B(t)| = γ·M; crossovers and mutation
yield Bnew(t) with |Bnew(t)| = γ·M; finally A(t+1) = [A(t) − B'(t)] ∪ Bnew(t), with
|A(t+1)| = M.]

Figure 3 – Graphical representation of REGAL's basic cycle.

The name Universal Suffrage derives from the fact that each example has the same
probability of being extracted. More precisely, the process of extracting the examples
xk from the positive learning set at each generation follows a Binomial distribution
with probability of success 1/E and γ·M trials. As the selection process is the same
at each generation, the probability distribution of the number of times xk is selected
in t generations is still Binomial, with γ·M·t trials and the same probability of
success 1/E.

By considering the inverse Bernoulli sampling problem, it is easy to prove that, for

M ≥ lg(1/η) / [γ t lg(E/(E−1))]        (4.2)

a generic example xk has a probability greater than η of being extracted within the
first t generations. From (4.2) we deduce that the most favorable case is γ = 1. In fact,
when the whole population is renewed at each generation, a smaller M is sufficient to
let xk appear, for a given t.

The average number of generations t_all necessary to extract all the examples is at
most:

t_all = E² / (γ M)        (4.3)

Again, the most convenient value is γ = 1. Clearly, it is usually not necessary to
perform E² trials, because any formula not coinciding with the description of a
single example will cover several concept examples at the same time. A detailed
derivation of (4.2) and (4.3) may be found in [Neri & Saitta, 1995b].
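A small numeric sketch of bounds (4.2) and (4.3) follows; the coupon-collector comparison at the end is our own sanity check, not part of the original derivation, and all parameter values are illustrative.

```python
import math

# Numeric sketch of bounds (4.2) and (4.3), with gamma*M examples drawn
# with replacement per generation. Parameter values are illustrative.
def min_population(E, eta, gamma, t):
    """Smallest M satisfying bound (4.2)."""
    return math.log(1 / eta) / (gamma * t * math.log(E / (E - 1)))

def t_all(E, gamma, M):
    """Upper bound (4.3) on the average generations to extract every example."""
    return E ** 2 / (gamma * M)

E = 100
# gamma = 1 is the most favorable case: halving gamma doubles the required M.
assert min_population(E, eta=0.01, gamma=0.5, t=10) > \
       min_population(E, eta=0.01, gamma=1.0, t=10)

# Coupon-collector expectation E*H_E / (gamma*M) never exceeds bound (4.3).
harmonic = sum(1 / k for k in range(1, E + 1))
assert E * harmonic / (1.0 * 100) <= t_all(E, gamma=1.0, M=100)
```

The second assertion illustrates why (4.3) is a loose upper bound: the expected number of draws to collect all E examples grows like E·ln E, well below E².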

The US-based selection process can be formally described by an (E × m) stochastic
matrix P(t), whose rows correspond to the examples xk in the positive learning set,
and whose columns correspond to the formulas φj in A(t). Each entry pkj(t) is equal
to the probability that φj wins in the corresponding roulette wheel rk, given that xk
has been extracted:

pkj(t) = νkj fj xj(t) / Σ(i=1..m) νki fi xi(t)        (1 ≤ k ≤ E, 1 ≤ j ≤ m)        (4.4)

In (4.4), fj is the fitness of φj and νkj is the characteristic function of the set COV(φj):

νkj = 1 if xk ∈ COV(φj), and νkj = 0 otherwise        (1 ≤ k ≤ E, 1 ≤ j ≤ m)

The pkj's are normalized over the k-th row of P.

4.4. Fitness Function


The definition of the fitness to be used in (4.4) deserves some discussion. In concept
learning, the criteria employed to evaluate solutions have traditionally been
completeness, consistency and simplicity. However, while the concepts of completeness
and consistency find a precise definition in Logic, that of simplicity is not
univocally accepted and finds different interpretations in different contexts. The most
frequently used meaning is that of syntactic simplicity, which reduces to counting the
number of literals necessary to represent a formula. The drawback is that this
definition strictly depends on the representation language and, moreover, the same
formula can be rewritten in equivalent forms having different simplicity values. In the
case of REGAL, a computational definition of simplicity has been adopted,
corresponding to the number of tests that have to be performed by the matcher in
order to verify whether a formula is satisfied by an example. REGAL's matcher simply
checks that, for each predicate in the language template, the associated feature does
not assume a value outside the set of terms in the corresponding internal disjunction,
i.e., it checks that no negated condition is violated. Then, a complexity measure c(φ)
for a formula φ can be immediately obtained by counting the number of terms in the
template not occurring in φ or, equivalently, by counting the number of '0's
occurring in the bit string s(φ). Starting from this measure of complexity, a measure
of simplicity z(φ) has been defined as the ratio

z(φ) = [length(s(φ)) − c(φ)] / length(s(φ)).

As the US operator handles completeness through the roulette-wheel spinning
mechanism, only consistency and simplicity are considered in the fitness function.
More precisely, we use the function

f(φ) = f(z, w) = (1 + A z) e^(−w)        (4.5)

In (4.5), the parameter A is user-tunable and its current value is A = 0.1. We observe
that fMax = 1.1 and lim(w→∞) f(z, w) = 0. The function f is plotted in Figure 4.

Expression (4.5) derives from extensive experimentation with alternative fitness
functions. It proved to give reasonable evaluations, but it is in no way meant to be
the unique or the "best" choice.
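A sketch of the fitness (4.5), together with the simplicity measure z defined above, is given below. Taking w to be the number of negative examples covered by the formula is our assumption: the text only states that w measures (lack of) consistency.

```python
import math

# Sketch of REGAL's fitness (4.5). The simplicity z is computed from the bit
# string as in Section 4.4; interpreting w as the number of covered negative
# examples is an assumption, since the text defines w only as a consistency
# measure.
def simplicity(bits):
    c = bits.count(0)                 # c(phi): tests the matcher must perform
    return (len(bits) - c) / len(bits)

def fitness(bits, w, A=0.1):
    return (1 + A * simplicity(bits)) * math.exp(-w)

all_ones = [1] * 13
assert fitness(all_ones, w=0) == 1.1                      # f_max = 1 + A
assert fitness(all_ones, w=5) < fitness(all_ones, w=0)    # inconsistency penalized
```

The exponential term dominates: covering even a few negative examples quickly drives the fitness toward 0, while the simplicity bonus only modulates it within [1, 1.1].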

With regard to the complexity of the selection operator, in order to compute the
probability values in (4.4), we need to evaluate, for each φj, a number of sums at most
equal to the number of available roulette wheels: as there is at most one roulette wheel
per positive example, the number of steps performed is at most E·m ≤ E·M. Hence,
the global complexity is of the order O(EM). This complexity is to be contrasted with
that of using sharing functions, i.e., O(M²E) [Deb & Goldberg, 1989]. The latter
value refers to a phenotypic distance taking into account the coverage of the
examples, in order to make a fair comparison with the US operator.

Figure 4 – Fitness function f(φ) vs. simplicity (z) and consistency (w) of a formula φ.
Increasing values of z (w) correspond to decreasing simplicity (consistency).

Finally, we want to stress an important point regarding fitness evaluation and
probability comparison: in REGAL there is no notion of a global mean value of the
fitness; in fact, the probabilities in (4.4) are normalized only with respect to the mean
fitness of the individuals belonging to the same roulette wheel. Individuals not
occurring in the same roulette wheel are not comparable. We argue that it is exactly
this "locality" property of the fitness that determines REGAL's ability to maintain
different species in equilibrium.

In [Giordana, Neri & Saitta, 1994] a general method for investigating the macroscopic
properties of populations evolving according to a transition probability matrix has
been proposed. This method, based on the definition of a Virtual Average Population,
allows some of the technical difficulties involved in this kind of analysis to be
overcome, while still providing accurate estimates of some of the parameters
controlling the evolution. The method is well suited to a variety of approaches, and
its application to several selection methods can be found in [Neri & Saitta, 1995a, b].

5. The Distributed Model

As the number of individuals necessary to find a solution may be quite large, even in
simple applications, the idea of exploiting the explicit parallelism of genetic
algorithms, in order to reduce both this number and the elapsed time for learning a
satisfactory concept description, is a natural one.

Distributing the population over a network of homogeneous nodes, according to the
classical network model [Goldberg, 1989a; Pettey & Leuze, 1989], does not work
well, because it does not comply with the US selection policy. In fact, it can be
theoretically proved and experimentally verified that each node still needs a large
population in order to prevent small disjuncts from disappearing.

The alternative architecture we ended up with relies on a distributed implementation
of the US algorithm, based on the idea of co-evolution [Husbands & Mills, 1991;
Potter, DeJong & Grefenstette, 1995]. The computational environment still consists
of a network of fully interconnected subpopulations exchanging individuals at each
generation, but each node performs the US selection only on a subset of the learning
set, which differs from node to node; the individuals arriving at a node from the
network are allowed to participate in the US selection on the local wheels.

The basic architecture underlying the distributed model is represented in Figure 5: a
SUPERVISOR processor coordinates a set of Nodal Genetic Algorithms (NGAs),
each one searching a subset of the hypothesis space, extensionally biased, defined by
the subset of the learning set it currently has to deal with.

The SUPERVISOR realizes an explicit long-term strategy aimed at forming a
classification theory (i.e., a complete concept description) with the help of the parallel
evolving populations, and at shifting the focus of the genetic search toward specific
regions of the hypothesis space, according to the partial concept descriptions that
have emerged so far.

In this section, we discuss the two fundamental aspects of the process of
cooperation in REGAL: the selection and migration model, concerned with the
interaction among NGAs, and the interaction between the SUPERVISOR and the
NGA processors. In the following, NGA processors will also be called "nodes", for
the sake of simplicity.

The search activity of each node is biased by the subset of training examples assigned
to it, which is exploited by the US selection operator. Any change in this set results
in a shift of the region of the hypothesis space that will be explored. Consequently,
the SUPERVISOR processor, which constantly monitors the state of all the NGAs,
can easily focus the hypothesis space exploration by modifying the set of examples
assigned to any NGA.

 

 

 


Figure 5 – Abstract view of REGAL. The SUPERVISOR processor imposes a long-term
searching policy by coordinating the short-term searching strategies of the Nodal Genetic
Algorithms (NGAs).

Nodal Genetic Algorithm (Node n)

Initialize An(0) and evaluate An(0)
while not done do
  Receive μ·|An(t)| individuals from the network
    and store them in Anet(t)
  Select Bn(t) from An(t) ∪ Anet(t) according to
    the US operator
  Select ρ·|Anet(t)| individuals from Anet(t) and
    add them to Bn(t)
  Recombine Bn(t) using crossover and mutation
  Update An(t) and Anet(t) with the new
    individuals in Bn(t)
  Send Anet(t) on the network
  Send the status to the SUPERVISOR
  Check for messages from the SUPERVISOR
    and update the niche status
end

Figure 6 – Abstract description of the Nodal Genetic Algorithm. The parameter
μ ∈ [0, 1] is the migration rate; the parameter ρ ∈ [0, 1] is the foreign mating rate.
An(t) represents the population of a single node n at time t; Anet(t) represents the
communication buffer of node n.

From another perspective, each NGA can be seen as a niche needed for the survival
of a species or a group of species (formulas representing one of the target concept
modalities). The SUPERVISOR distributes the examples of the learning set according
to the emerging species, thus identifying NGAs with niches where specific species
can grow. The aim of this association is that, in the limit, niches and concept
modalities will coincide. Therefore, the mating between individuals of different
niches can be controlled by tuning the migration parameter μ. Furthermore, the
SUPERVISOR performs two other fundamental tasks: it periodically extracts a
classification theory from the global population and, when it finds a satisfactory one
according to a chosen criterion, it halts the whole system. Figures 6 and 7 provide an
abstract description of the nodal GA and of the SUPERVISOR, respectively.

SUPERVISOR

Start N Nodal GAs
while not solved do
  Wait for messages from the NGAs
  Update the current classification theory
  if the current classification theory is satisfactory
    then broadcast done to all Nodal GAs
         solved = true
    else reassign the niches to the NGAs
end

Figure 7 – Abstract description of the SUPERVISOR processor's algorithm. N is
the number of nodes in the network.

5.1. Selection and Migration Model


The selection and migration model describes the interactions among the NGA nodes,
involving the exchange of individuals (formulas) representative of niches, and the
exploitation of the migrating individuals to generate others, which may represent
larger niches than their parents did (i.e., more general formulas).

Migration and selection are strictly interleaved activities. In the following, a complete
selection-migration cycle is described and schematically represented in Figure 8.
For the sake of clarity, let us focus on an NGA n (1 ≤ n ≤ N), with its currently
assigned subset of training examples. At each generation t, NGA n receives from
the other NGAs a set of foreign individuals Anet(t), of cardinality μ|An(t)| (0 ≤ μ ≤ 1),
where μ is the migration rate.

The individuals sent to n by the other NGAs and covering some example assigned
to n compete, together with the individuals of the nodal population An(t), for
being selected for reproduction through the US selection procedure. Moreover, a set
of incoming individuals of cardinality ρμ|An(t)|, where ρ is called the foreign mating
rate, is randomly selected for reproduction. After the reproduction phase, γ·|An(t)| of
the newly generated individuals replace some elements of n's population, while
the others are sent to the other NGAs.

The two-step policy for selecting the incoming information is based on the following
motivations. The first step aims at the formation of more representative individuals
(more general formulas) by exploiting "foreign" individuals that are also
representative of n's current species. The second step allows the possible re-creation
of schemata missing from n's population. The combination of these two policies also
helps to ease the lethal mating problem, by reducing the number of matings
between individuals of different species, while still allowing the information
contained in individuals representative of them to be exploited.

Finally, the routing of individuals towards the other NGAs is realized by partitioning
the set of outgoing individuals according to the number of nodes directly connected
to n; a different partition is sent to each neighbour node.

5.2. The Exploration Policy

The SUPERVISOR's task consists in monitoring the NGAs' activity, periodically
extracting a classification theory from the nodal populations, and shifting the focus of
the hypothesis space exploration any time a significant change in the classification
theory is detected. As said before, the SUPERVISOR focuses the search of the NGAs
by assigning a specific subset of learning examples to each node n. In the
following, we describe the formation of niches, corresponding to partial
descriptions of the target concept, in the nodes of the network.

The SUPERVISOR's monitoring activity consists in collecting the status reports
periodically sent to it by the NGA processes. A status report from a node n contains
a copy of the best solution φbest(n) it has currently found to cover its assigned
example set. The choice of φbest(n) is made according to a local measure p(φ),
depending on φ's completeness and consistency on the assigned set; the measure
currently used is simply the product p(φ) = P × f(φ), where P is the number of
assigned positive examples covered by φ. Any formula covering at least one positive
example, with fitness higher than that of any formula previously received, is recorded
by the SUPERVISOR.

From all recorded formulas, a disjunctive concept description, which becomes the
current best solution, is assembled. The algorithm used to this aim is, at the moment,
quite simple. First, the set COVER(t) is constructed as the union of the positive
examples covered by all the formulas received up to generation t. Then, the formulas
are sorted in decreasing order of p(φ) (re-evaluated on the whole learning set), and
the first K best ones, able to cover the entire set COVER(t), are selected as the current
concept description BEST(t).
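The assembly of BEST(t) can be sketched as follows; the tuple-based representation of the recorded formulas is an illustrative assumption.

```python
# Sketch of the SUPERVISOR's assembly of BEST(t): recorded formulas are
# sorted by the local measure p(phi) and the shortest prefix covering
# COVER(t) is retained. Names and data layout are illustrative.
def assemble_best(recorded):
    """recorded: list of (formula, p_value, set-of-covered-positive-examples)."""
    cover_t = set().union(*(cov for _, _, cov in recorded)) if recorded else set()
    best, covered = [], set()
    for phi, p, cov in sorted(recorded, key=lambda r: -r[1]):
        best.append(phi)
        covered |= cov
        if covered == cover_t:        # first K best formulas covering COVER(t)
            break
    return best

recorded = [("phi_a", 3.0, {1, 2, 3}), ("phi_b", 2.0, {3, 4}), ("phi_c", 1.0, {4})]
assert assemble_best(recorded) == ["phi_a", "phi_b"]   # phi_c adds no coverage
```

Since formulas are taken in decreasing p(φ) order, redundant low-scoring disjuncts are naturally excluded once COVER(t) is reached.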

Finally, the previous classification theory is compared with the new one. If a
significant improvement is detected, the example subsets assigned to the nodes are
modified in order to distribute the task of improving the current concept description
BEST(t). In practice, this is realized by distributing the formulas in BEST(t) among
the nodes, and defining the generic node n's subset as the union of the positive
examples covered by the formulas assigned to n and the examples not covered by any
formula in BEST(t).

The SUPERVISOR stops the genetic search when the current solution BEST(t) does
not improve for an assigned number of consecutive generations, or when a maximum
generation limit is reached.
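One possible reading of this stopping rule, sketched over a per-generation history of the quality of BEST(t) (the patience and limit values are parameters):

```python
def should_stop(history, patience, max_gen):
    """Stop when BEST(t) has not improved for `patience` consecutive
    generations, or when the generation limit is reached."""
    t = len(history)
    if t >= max_gen:
        return True
    return t > patience and max(history[-patience:]) <= max(history[:-patience])
```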

5.3. The Freezing Mechanism


In the hypothetical case of unbounded computational resources and no time limit, the
exploration strategy may devise an assignment of the example sets En to the nodes
such that as many niches can simultaneously be formed as there are modalities in the
concept. In practice, as we usually deal with bounded resources and a time limit, a
"freezing" mechanism has been introduced, in order to let the system learn at least
a reasonable classification theory under the given constraints. The freezing
mechanism is a restating of the "learn one disjunct at a time" policy in terms
suited to our framework. The basic idea is the following: if an NGA n is not able to
improve its best description φbest within an a priori fixed amount of time, it is
assumed that φbest represents a target concept modality and cannot be further
improved. Consequently, it can be saved and the positive examples it covers can be
removed from the learning set, in order to shift the focus of the hypothesis space
exploration.

This mechanism is especially useful in those situations where the number of target
concept modalities is higher than the number of available searching processors. Its
benefits are noticeable also when searching for small disjuncts; in fact, freezing a
description and returning a node to the set of available resources increases the
computational power that can be directed to explore regions of the hypothesis space
where small disjuncts are more likely to exist.

When the freezing mechanism is used, the whole process halts and a final
classification theory is returned when all the examples have been removed from the
learning set.
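The freezing check can be sketched as follows, assuming each node tracks how many generations its best description has stalled (the argument names are illustrative):

```python
def freeze_step(phi_best, covered_pos, stalled_for, max_stall,
                frozen, learning_pos):
    """If the node's best description has stalled for max_stall generations,
    assume it is a concept modality: save it and remove its positives from
    the learning set, shifting the focus of the search."""
    if stalled_for >= max_stall:
        frozen.append(phi_best)
        learning_pos -= covered_pos   # in-place removal from the learning set
        return True                   # the node becomes available again
    return False
```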
[Figure 8: block diagram of a node NGA n in the network. Individuals are selected,
with Universal Suffrage, from the local POPULATION; individuals selected with
replacement migrate through the COMMBUFFER, while individuals selected without
replacement, from both the population and the buffer, are SELECTED FOR MATING;
mating, crossover and mutation produce the new individuals.]

Figure 8 – Scheme of the selection and migration model. COMMBUFFER is the
communication buffer, where individuals arriving to n from other nodes or leaving n
for other nodes are temporarily stored. The value m is the cardinality of the
population in each node, i.e., m = M/N.
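The mixing of local individuals and migrants sketched in Figure 8 can be illustrated, in a much simplified form, as follows; the probability-µ draw and the pool construction are assumptions, and REGAL's actual scheme also sends individuals, selected with replacement, out through the buffer:

```python
import random

def mating_pool(population, comm_buffer, mu, rng):
    """Build a mating pool of the same size as the local population: with
    probability mu an individual is taken (without replacement) from the
    communication buffer of migrants, otherwise from the local population."""
    pool = []
    for _ in range(len(population)):
        if comm_buffer and rng.random() < mu:
            pool.append(comm_buffer.pop())
        else:
            pool.append(rng.choice(population))
    return pool
```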

5.4. Experimental Analysis


In order to investigate the behavior of the distributed model in a controlled
setting, a simple learning task has been constructed. It consists in learning the
concept "Bicycle" from a set of 500 positive and 600 negative examples of the
concept, each one described by 6 attributes and encoded into a string of 18 bits.

The target concept has four disjuncts: the first one, φ1, covers 312 positive
instances, the second one, φ2, covers 167, the third one, φ3, covers 56, and the
fourth one, φ4, covers 35. Moreover, φ1 and φ2 have 51 positive instances in
common, whereas φ1 and φ3 have 19. The fitness values of the individuals
corresponding to the four disjuncts are the following:
f1 = f(φ1) = 1.667    f2 = f(φ2) = 1.556
f3 = f(φ3) = 1.500    f4 = f(φ4) = 1.500
The relations between the extensions of the four disjuncts on the learning set are
represented in Figure 9.

[Figure 9: Venn diagram of the extensions of the four disjuncts: φ1 alone covers
242 instances, φ1 ∩ φ2 covers 51, φ2 alone 116, φ1 ∩ φ3 19, φ3 alone 37, and φ4
covers 35.]

Figure 9 – Training set coverage of the four disjuncts of the "Bicycles" concept.

Several runs have been performed by changing the values of the global population
size M, of the node number N, of the migration rate µ and of the halt criterion.
For each setting of the four parameters the run has been repeated four times.

A first set of experiments aimed at detecting the population size necessary for
REGAL to solve the problem using a single node, relying on the US operator only for
promoting niche formation. Typical results are reported in Table I.

Table I
Experimental results on the "Bicycles" dataset

Population   Halt criterion h0            # Gen   Halt criterion h1       # Gen
M = 560      (no complete and consistent    -     (312 0) (167 0)           70
             solution found)                      ( 68 1) (105 0)
             Total Cover: -                       Total Cover: (500 1)
M = 640      (312 0) (167 0) ( 47 0)       15     (312 0) (167 0)           70
             ( 41 0) ( 25 0)                      ( 56 0) (105 0)
             Total Cover: (500 0)                 Total Cover: (500 0)

In this experimentation, the main focus was on the population size M. The system
had two halting criteria:
h0: a first complete and consistent solution has been found.
h1: a maximum of 70 generations has been reached.

From Table I it appears that there is a quite sharp critical value Mcr (between 560
and 640, in this example), such that for M < Mcr no complete and consistent
solution could be found within the given time bound. When M > Mcr, a complete and
consistent solution was always found within 15 generations. By letting the system
run even after a complete and consistent solution has been found, a more general
solution is usually reached (fewer and larger disjuncts).

Table II
Global population size M = 240. The nodal population size m = M/N ranges from m = 60 to m =
15. Values in parentheses, such as (x,y), denote the number of positive and negative learning
examples covered by the corresponding formula, respectively. When the halt criterion h0 is met, a
complete and consistent solution is found within 9 generations, except for (N = 16, µ = 0), where
12 generations were needed, and for (N = 4, µ = 0.1), where no complete and consistent solution
has been found.

Migr. Rate (µ) 0.0 0.1 0.2 0.4


Halt Criterion h0 h1 h0 h1 h0 h1 h0 h1
#Nodes (N)
(312 0) (312 0) (312 0) (312 0) (312 0) (312 0) (312 0)
N=4 (167 0) (167 0) (167 0) (167 0) (167 0) (167 0) (167 0)
m = 60 (56 0) (56 0) (56 0) (56 0) (56 0) (41 0) (56 0)
(105 0) (105 0) (53 0) (58 0) (58 0) (40 0) (35 0)

Total Cover (500 0) (500 0) (494 0) (500 0) (500 0) (494 0) (500 0)


(312 0) (312 0) (312 0) (312 0) (312 0) (312 0) (312 0) (312 0)
N=8 (167 0) (167 0) (167 0) (167 0) (85 0) (167 0) (167 0) (167 0)
m = 30 (32 0) (56 0) (56 0) (56 0) (56 0) (56 0) (34 0) (56 0)
(25 0) (32 0) (47 0) (47 0) (73 0) (62 0) (45 0) (32 0)
(31 0) (16 0) (36 0)
(36 0) (32 0)

Total Cover (500 0) (500 0) (500 0) (500 0) (500 0) (500 0) (500 0) (500 0)
(312 0) (312 0) (312 0) (312 0) (312 0) (312 0) (312 0) (312 0)
N = 16 (167 0) (167 0) (167 0) (167 0) (85 0) (167 0) (167 0) (167 0)
m = 15 (25 0) (56 0) (34 0) (56 0) (40 0) (56 0) (41 0) (41 0)
(17 0) (105 0) (35 0) (58 0) (17 0) (32 0) (41 0) (62 0)
(15 0) (27 0) (16 0) (62 0)
(16 0) (31 0) (16 0)
(35 0)

Total Cover (500 0) (500 0) (500 0) (500 0) (500 0) (500 0) (500 0) (500 0)

The goal of a second group of experiments was to test whether distributing a fixed
number M of individuals among communicating nodes in a network could bring
improvements from the computational point of view, the quality of the solution
being equivalent. Two sets of experiments, one with M = 240 and one with M = 280,
are described. Typical results are reported in Table II and Table III.

Table III
Global population size M = 280. All the complete and consistent solutions found
with the halt criterion h0 required at most 9 generations.

Migr. Rate (µ) 0.0 0.1 0.2 0.4


Halt Criterion h0 h1 h0 h1 h0 h1 h0 h1
#Nodes (N)
(312 0) (312 0) (312 0) (312 0) (312 0) (312 0) (312 0) (312 0)
N=4 (167 0) (167 0) (167 0) (167 0) (167 0) (167 0) (167 0) (167 0)
m = 70 (56 0) (56 0) (41 0) (56 0) (40 0) (56 0) (56 0) (56 0)
(58 0) (105 0) (41 0) (58 0) (32 0) (32 0) (47 0) (47 0)
(32 0)

Total Cover (500 0) (500 0) (500 0) (500 0) (500 0) (500 0) (500 0) (500 0)
(312 0) (312 0) (312 0) (312 0) (312 0) (312 0) (312 0) (312 0)
N=8 (167 0) (167 0) (167 0) (167 0) (167 0) (167 0) (167 0) (167 0)
m = 35 (41 0) (56 0) (41 0) (41 0) (41 0) (56 0) (56 0) (56 0)
(25 0) (105 0) (62 0) (41 0) (62 0) (105 0) (16 0) (105 0)
(105 0) (31 0) (35 0) (31 0)

Total Cover (500 0) (500 0) (500 0) (500 0) (500 0) (500 0) (500 0) (500 0)
(312 0) (312 0) (312 0) (312 0) (312 0) (312 0) (312 0) (312 0)
N = 16 (167 0) (167 0) (167 0) (167 0) (167 0) (167 0) (167 0) (167 0)
m = 17 (41 0) (56 0) (47 0) (40 0) (41 0) (56 0) (56 0) (56 0)
(58 0) (105 0) (25 0) (47 0) (41 0) (105 0) (105 0) (105 0)
(34 0) (31 0) (105 0)
(13 0)

Total Cover (500 0) (500 0) (500 0) (500 0) (500 0) (500 0) (500 0) (500 0)

The first feature of the distributed version, emerging from Tables II and III, is
the strong reduction (from ~600 to ~280) in the size of the global population
necessary to find a complete and consistent solution. Moreover, the following
observations can be made:
• For values M ≤ 240, a complete and consistent solution may not always be found,
  whereas it always is with M ≥ 280. The same critical change in behavior, noticed
  in the concentrated GA, is observed.
• Increasing N, with constant M, leads to a fragmentation of the first proposed
  complete and consistent solution: smaller disjuncts are generated, corresponding
  to more specific concept descriptions.
• Giving the system more time (using h1 instead of the greedy criterion h0)
  increases its probability of generalizing.
• The migration rate value does not seem to be very critical in this test case, as
  the effects of its variation are barely noticeable.
In any case, the results of this experimentation are to be considered only
suggestive, given the simplicity of the task.

6. Experimental Evaluation
In this section we will present some results obtained by REGAL on three datasets
taken from the Irvine database: the "Mushrooms" dataset, provided by Schlimmer, the
"Splice Junctions" dataset, provided by Towell, Noordewier, and Shavlik, framed in
propositional calculus, and the "Office Document Classification" dataset, provided
by Esposito, Malerba and Semeraro, framed in First Order Logic. The primary goal of
experimenting with these datasets is to evaluate REGAL in comparison with other
systems described in the literature. However, a second type of evaluation we wanted
to perform concerns REGAL's ability to discover specific disjuncts hidden in a
dataset. For this task, real datasets are not appropriate, because the real
structure of the concept hidden in the instances is not directly known, but only
guessed from the experimental results. Therefore, a First Order Logic artificial
learning problem, which we call the "Thousand Trains" problem, has been constructed
in order to challenge REGAL's ability to discover specific disjuncts hidden in the
dataset. This learning problem was first used in Giordana & Saitta [1993] to test
an older version of REGAL.

The "Mushrooms" dataset, not too complex and not too easy, has been used to study
experimentally how REGAL's performance changes depending on the number N of
subpopulations and on the migration rate µ. To this aim, extensive runs have been

Table IV
REGAL's configuration used for the "Splice-Junctions", "Office
Document" and "Thousand Trains" problems.

Parameter                Value
Number of Nodes          4
Population per Node      200
Migration rate µ         0.2
Freezing                 Yes (except for Splice-Junctions)

performed using a CM5 Connection Machine3. On the contrary, the experimentation
with the other three datasets, aimed only at checking the performance of REGAL as a
learning algorithm, has been done using a less expensive computational environment,
consisting of a Sparc-5 workstation. In this second group of experiments, the same
values for the system parameters (as reported in Table IV) have been used, without
trying to find any optimal setting. This is implicit evidence of REGAL's robustness
with respect to parameter variations.

6.1 "Mushrooms" Dataset


The problem, proposed by Schlimmer [1987], consists in discriminating between
edible and poisonous mushrooms. The dataset contains 8124 instances, 4208 of edible
mushrooms and 3916 of poisonous ones. Each instance is described by a vector of 22
discrete, multi-valued attributes. The problem is encoded in propositional calculus
and can be formulated in REGAL's concept description language by using only unary
predicates, and by allowing just one variable in a formula, as has been done for
the "Bicycles" test case.

3 The CM5 Connection Machine used in this experimentation consists of 64 Sparc-10
processors, connected by a high-speed packet-switching network. In this way every
processor is directly connected with all the others.
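This propositional encoding with internal disjunction can be sketched as follows; the attribute names and value sets below are hypothetical placeholders, not the actual 22 mushroom attributes:

```python
# One predicate per attribute; its internal disjunction has one bit per value.
# A set bit means the corresponding value is admitted by the formula.
ATTRS = [("cap_shape", ["bell", "conical", "flat"]),   # hypothetical values
         ("odor", ["almond", "foul", "none"])]

def encode(admitted):
    """admitted: dict attr -> set of admitted values (missing = all admitted)."""
    return [1 if v in admitted.get(a, set(vs)) else 0
            for a, vs in ATTRS for v in vs]

def covers(bits, instance):
    """A formula covers an instance iff every attribute's value bit is set."""
    i = 0
    for a, vs in ATTRS:
        if bits[i + vs.index(instance[a])] == 0:
            return False
        i += len(vs)
    return True
```

With 22 attributes and their full value sets, the same construction yields the 126-bit template mentioned below.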

By defining a predicate for each attribute, with an internal disjunction
corresponding to the set of all possible values it can assume in the dataset, a
template of 126 bits has been obtained. Randomly selected sets of 4000 instances
(2000 edible + 2000 poisonous) have been used as learning set, while the remaining
4124 instances have been used for testing. With such a large learning set, REGAL
can always find a perfect definition for both classes, covering all the examples
and no counterexamples on the test set, as other induction algorithms also can. In
Table V some results comparing REGAL with other well known systems are reported.

Table V
Comparative results among different approaches to the "Mushrooms" classification
problem. LS = cardinality of the learning set. e = error rate on the test set.

User                      System (Method)                LS     e [%]
[Neri & Giordana] REGAL (Genetic Algorithm) 4000 0.0
[Rachlin et al., 1994] PEBLS (Memory-Based) 7311 0.0
[Rachlin et al., 1994] (Nearest Neighbours) 7311 0.0
[Holte, 1993] C4.5 (Decision Tree) 5416 0.0
[Schlimmer, 1987] STAGGER (Rules) 5416 5.0
[Yeung, 1991] Neural Network 300 0.9

In the further experimentation, the aim was to evaluate other factors
characterizing the quality of the found classification theory, such as the global
complexity and the total number and size of the disjuncts, under different
distribution and cooperation conditions of the same global population. According to
the definition of complexity given in Section 4.4 for a single formula, the global
complexity (C) of a classification theory has been defined as the total number of
0's occurring in the bit strings describing the disjuncts. The smaller the total
number of 0's surviving in the final solution, the better REGAL followed the
simplicity criterion embedded in its fitness function. The other measures monitored
in the experiments are the CPU time (T, measured in minutes) spent for the global
computation, the number of disjuncts (ND) occurring in the final solution, and the
maximum (MXD), average (AVG) and minimum (SMD) sizes of their extension. Moreover,
the CPU time (Tc, in minutes) spent to find the first complete and consistent
concept description, and the global complexity (Cc) of that description, have also
been measured.
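The global complexity measure just defined amounts to counting zeros over the disjuncts' bit strings, for instance:

```python
def global_complexity(disjuncts):
    """C = total number of 0's in the bit strings describing the disjuncts.
    Fewer 0's means wider internal disjunctions, i.e. simpler and more
    general formulas, so lower C is better."""
    return sum(bits.count('0') for bits in disjuncts)

print(global_complexity(["110111", "101011"]))  # -> 3
```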

Among the large set of experiments performed, a subset of eight, showing typical
behaviours, is reported. In these experiments the global population size has been
kept constant (M = 800), whereas the number of nodes N was set to 16, 32 and
64, and the migration rate to 0.0, 0.2, 0.5, and 0.9, respectively. In every run REGAL
has been stopped after all nodes have evolved for 100 generations. The results are
summarized in Table VI: columns 3 to 8 report the measures concerning the solution
extracted at the 100th generation. Moreover, columns 9 and 10 report the time Tc

Table VI
Influence of the population size, the processor number and the migration rate on REGAL's
performance in learning the "Poisonous mushrooms" concept. The different fields have the
following meaning:
Mode = (processor number, population per processor).
µ = migration rate (percentage of the population).
T = CPU time (in minutes) for completing 100 generations.
C = complexity of the solution after 100 generations.
ND = number of disjuncts in the final solution.
MXD, SMD, AVG = number of examples covered by the largest, the smallest and the
  average disjunct, respectively.
Tc = CPU time (in minutes) necessary for finding the first complete and consistent solution.
Cc = complexity of the first complete and consistent solution found in a run.
Mode µ T [min] C ND MXD SMD AVG Tc Cc


0.0 505 40 4 1946 49 631 18 353
(16, 50) 0.2 441 28 4 1946 168 1135 14 217
0.5 497 25 4 1946 426 1107 18 194
0.9 575 34 5 1946 598 1086 30 214
0.0 177 47 4 1946 84 953 8 294
(32, 25) 0.2 192 30 4 1946 707 1282 7 243
0.5 184 16 4 1946 136 1027 11 191
0.9 208 23 4 1946 112 885 12 348
0.0 77 94 6 1930 225 1287 6 417
(64, 12) 0.2 76 37 4 1946 70 878 4 280
0.5 95 29 4 1946 546 1089 4 577
0.9 104 27 4 1946 832 1201 4 638
[Figures 10 and 11: line plots of the global complexity C (Figure 10) and of the
number of disjuncts ND (Figure 11) versus the generation number (0 to 100), for
migration rates 0.0, 0.2, 0.5 and 0.9 and for the three population distributions
(a), (b) and (c).]

Figure 10 – Time evolution of the global complexity C for migration rate µ = 0.0,
0.2, 0.5 and 0.9 for three different population distributions: (a) 16 nodes, 50
individuals per node; (b) 32 nodes, 25 individuals per node; (c) 64 nodes, 12
individuals per node.

Figure 11 – Time evolution of the number of disjuncts ND corresponding to the three
population distributions described in Figure 10.

necessary for obtaining a first complete and consistent solution, and the
complexity of that solution, respectively. The time elapsed between Tc and T (end
of the computation) has been spent by REGAL in trying to find a simpler and more
general solution. The time evolution of the complexity C and of the total number of
disjuncts ND versus the generation number is reported in Figures 10 and 11,
respectively.

The findings emerging from the experiments confirm the data obtained from the
"Bicycles" test case: the Selection and Migration model, together with the US
operator, has been successful in allowing all the disjuncts to be learned at the
same time, with a relatively small global population size and in a reasonable time,
under different operating conditions. Moreover, comparing the solutions found with
different values of the migration rate, it appears that migration rates different
from 0 result in significantly improved solutions in terms of simplicity (the error
rate on the test set being equal to 0). Finally, we observe a super-linear speed-up
when increasing the number of processors working in parallel. Let us now examine
Fig. 12, which reports the complexity evolution versus CPU time for the best and
the worst results obtained in the three experimental settings (a), (b) and (c)
described in Fig. 10. The complexity shows a minimum at the configuration with 32
nodes and 25 individuals per node. Therefore, considering Fig. 12, it seems that
the speed-up obtained by increasing the processors from 32 to 64 does not produce a
real gain, because a simpler solution can be obtained in less time (50 min) using
only 32 processors.
[Figure 12: line plots of the complexity C versus CPU time for m = 50, 25 and 12
individuals per node, with migration rate 0.0 (panel a) and 0.5 (panel b).]

Figure 12 – Comparison of the time evolution of the complexity versus the CPU time
[min], for the population distributions (a), (b) and (c) described in Figure 10.
(a) Migration rate = 0. (b) Migration rate = 0.5. As we can see, the population of
m = 25 individuals per node, with 32 nodes and µ = 0.5, has reached the best
performance of all runs already at ~70 generations.

From the experiments performed it follows that:
• The finer the granularity of the distribution, the more complex the proposed
  solution, when there is no cooperation between niches (i.e., µ = 0);
• When cooperation between niches is possible (i.e., µ ≠ 0), a proper distribution
  may result both in a reduction of the time elapsed before a suitable solution is
  found and in a reduction of the complexity of the solution itself.

6.2. "Splice Junctions" Dataset


This set contains data from a Molecular Biology problem. Given a sequence of bases,
corresponding to a gene in a DNA molecule, the problem consists in recognizing
Donor and Acceptor sites, i.e. boundaries between Exons (parts of the DNA sequence
coding for protein synthesis and then retained after splicing) and Introns (parts of the
DNA sequence not coding for protein synthesis and hence spliced out). The task is
then a classification problem with three classes: Exon/Intron boundaries (referred to
as Donor sites), Intron/Exon boundaries (Acceptor sites) and no boundary. The
dataset consists of 3190 DNA sequences, each 60 nucleotides long. Each nucleotide
may take a value in {A, G, T, C}4.

REGAL has been evaluated in two different settings. In the first one, the template
was 300 bits long and contained a different unary predicate for each of the 60
positions. All the predicates had the same internal disjunction [A, G, T, C, *],
where "*" accounts for the possible occurrence of D and N. A learning set of 2000
instances randomly selected from the dataset has been used. In the second setting,
a learning set of only 1000 examples has been used, but the search space has been
limited to 12 bases, exploiting a subset of the domain theory described by Towell
and Shavlik [1994]. For each of the two settings, REGAL has been run three times
for 300 generations, for each of the three classes. In each run REGAL learned a
very general set of rules consistent with the learning set. The average error rate
on an independent test set is reported in Table VII, where it is compared with the
best results published by Towell and Shavlik [1994].
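The matching of such a template against a sequence can be sketched as follows (a simplified illustration of the 5-bits-per-position internal disjunction; a 2-position mini-template stands in for the full 60-position one):

```python
ALPHABET = "AGTC*"   # '*' stands for the ambiguous symbols N and D

def covers(template_bits, sequence):
    """template_bits holds 5 bits per sequence position (the internal
    disjunction [A, G, T, C, *]); the sequence is covered iff, at every
    position, the bit of its symbol is set."""
    for pos, symbol in enumerate(sequence):
        s = symbol if symbol in "AGTC" else "*"
        if template_bits[5 * pos + ALPHABET.index(s)] == 0:
            return False
    return True

# A 2-position mini-template: position 0 admits only A or G, position 1 anything.
bits = [1, 1, 0, 0, 0] + [1, 1, 1, 1, 1]
```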

The results have been obtained after halting REGAL at the 300th generation; they
show that the system can easily handle quite large learning sets and templates. It
may be surprising that the number of generations necessary to solve this difficult
problem is just three times larger than the one required for the simpler
"Mushrooms" application, whereas the complexity of the search space was much higher
(2^300 versus 2^126).

4 There is actually about 1% of occurrences of the symbols N and D, corresponding
to the nucleotides not recognized or recognized ambiguously.

Table VII
Comparative results among different approaches to the "Splice Junctions"
classification problem. LS = cardinality of the learning set. TS = cardinality of
the test set. e = error rate on the test set.

User                       System (Method)           LS     e (E/I) [%]   e (I/E) [%]   e (Neither) [%]
[Towell & Shavlik, 1994] KBANN 1000 7.56 8.47 4.62
(NN + Domain theory)
[Towell & Shavlik, 1994] Neural Network 1000 5.74 10.75 5.29
(Backpropagation)
[Towell & Shavlik, 1994] ID3 1000 10.58 13.99 8.84
(Decision tree)
[Neri & Giordana, 1995] REGAL 2000 4.40 4.20 5.20
(GA, 60 bases)
[Neri & Saitta] REGAL 1000 3.42 6.85 4.29
(GA, 12 bases)

6.3. "Office Document" Classification


The dataset contains 230 examples, subdivided at the source into a learning set of 120
and a test set of 110 instances, corresponding to the front page of official documents
from six different companies. The proposed problem consists in recognizing the
company from the typical pattern appearing in the document.

Each example in the dataset is described as a sequence of 9 to 15 items, each
corresponding to a specific rectangular area on the page layout. Moreover, each
item is characterized by a type attribute, and by 8 numeric attributes identifying
the coordinates of the four corners of the corresponding rectangle.

REGAL has been supplied with a concept description language containing 5 unary
predicates, one identifying the type and the others the position and the size of an
item, and two binary predicates defining the relative position of two items. Each
(absolute or relative) predicate was provided with a large internal disjunction
containing the set of possible positions an item can assume in the page, with a
granularity of 35 pixels. Using only three variables in the concept description
language, a template of 380 bits has been obtained.

REGAL easily discovered a classification theory consisting of 7 rules correctly
classifying all the learning examples. On the test set, the same rules exhibited a
5% misclassification error (1 omission error and 5 commission errors). An
evaluation of INDUBI [Esposito et al., 1992], an Induce-like learner, on a subset
of this dataset shows an 8.5% prediction error. However, a direct comparison is not
possible, because these results have been obtained using a smaller learning set.
This experiment shows again that REGAL can easily handle large templates.

6.4. "Thousand Trains" Dataset


This dataset has been designed by extending the well known train set used by
Michalski [1980]. There are two concepts to discriminate: "Trains going East" and
"Trains going West". Each learning event is therefore represented as a sequence of
items (coaches), each one described by a vector of attributes referring to shape,
color, position, length, number of wheels and number of loads. Thousands of trains
have been generated by a program which randomly selects the values of the
attributes. Then, each train has been classified using a set of disjunctive rules.
The challenge for REGAL was to discover the original rules or a set of equivalent
ones.

The rules for classifying trains going East (Class 1) are reported in the following:

Class 1 - "Trains going East"


Rule 1: In second position there is an open-top small coach, carrying one
load, followed by an open-top small coach.
Rule 2: In third position there is a closed-top small coach carrying one load
and an open-top small coach in fifth position.
Rule 3: In position two, three or four there is a small coach, with two wheels
and carrying one load, immediately followed by a long white coach
carrying one load.
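As an illustration of the kind of test such a rule performs, Rule 1 might be checked as follows; the coach attribute names are hypothetical, and we assume positions are counted from the engine at index 0, so "second position" is index 2:

```python
def rule1(train):
    """Rule 1: in second position there is an open-top ("ot") small (short)
    coach carrying one load, immediately followed by an open-top small coach.
    `train` is a list of coach dicts; index 0 is the engine."""
    if len(train) < 4:
        return False
    a, b = train[2], train[3]
    return (a["shape"] == "ot" and a["length"] == 1 and a["nloads"] == 1
            and b["shape"] == "ot" and b["length"] == 1)
```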

The learning set used for the experiments contained 500 instances of Class 1 (East)
and 500 instances of Class 2 (West). Rule 1 covered 98 instances of Class 1, Rule 2
covered 206 and Rule 3 covered 209. The three subsets were slightly overlapping,
because 13 instances verified more than one rule. An independent test set was also
generated, in which the three disjuncts covered 103, 212 and 189 positive
instances, respectively. The concept description language used by REGAL is very
similar to the one described in [Michalski, 1980] and is reported in Table VIII.

The experiment has been repeated six times, using both the current version of
REGAL, with the setting described in Table IV but without distributing the
population (i.e., using only one node), and an older version based on classical
selection. In the first case, a perfect solution has been found in all the runs
(see the second row of Table IX), even if not always the same one. Moreover, we
notice that 5 disjuncts were found in place of the three used for classifying the
learning set. In the best solution

Table VIII
Predicates and template characterizing the concept description language used by
REGAL for learning the concept of "Trains going East".

Predicates mapped into the bit string Comment


Position(x, [0, 1, 2, 3, 4]) Ordinal position of the coach,
starting from the engine (position 0)
Length(x, [1, 2]) Length of the coach (Short or Long)
Wheels(x, [2, 3]) Number of wheels
Nloads(x, [0, 1, 2, *]) Number of loads
Color(x, [yl, wh, rd, gr, gy, bk]) yl = yellow, wh =white, etc.
Shape(x, [ot, en, us, or, cr, el, jt, he, sl]) ot = open-top, en = engine, etc.
Distant(x,y, [0, 1, *]) Number of coaches between x and y
Constraint predicates Comment
Follow(x,y) Item y follows item x
Coach(x) Item x is a coach
Template

Coach(x) Position(x, [0, 1, 2, 3, 4] ) Length(x, [1, 2] ) Wheels(x, [2, 3] )


Nloads(x, [0, 1, 2, *] ) Color(x, [yl, wh, rd, gr, gy, bk] )
Shape(x, [ot, en, us, or, cr, el, jt, he, sl] )
Coach(y) Position(y, [0, 1, 2, 3, 4] ) Length(y, [1, 2] ) Wheels(y, [2, 3] )
Nloads(y, [0, 1, 2, *] ) Color(y, [yl, wh, rd, gr, gy, bk] )
Shape(y, [ot, en, us, or, cr, el, jt, he, sl] )
Position(z, [0, 1, 2, 3, 4] ) Length(z, [1, 2] ) Wheels(z, [2, 3] )
Nloads(z, [0, 1, 2, *] ) Color(z, [yl, wh, rd, gr, gy, bk] )
Shape(z, [ot, en, us, or, cr, el, jt, he, sl] )
Follow(x,y) Distant(x,y, [0, 1, *] ) Distant(y,z, [0, 1, *] ) Distant(x, z, [0, 1, *] )

we had five disjuncts covering 223, 216, 108, 83 and 74 examples. They represent a
consistent and complete solution to the problem. The first two disjuncts cover the
examples (plus some others) of Rules 3 and 2, respectively, and can thus be seen as
a further generalization of them. On the contrary, the third disjunct has not been
found, because REGAL preferred an alternative solution consisting of three larger
disjuncts, which proved to be a correct solution too.

These results, as well as the ones obtained on the "Bicycles" domain, show that
REGAL is biased towards generalization and tries to find disjuncts as large and
simple as possible. This bias is mostly due to the Universal Suffrage operator,
which favours the formation of large disjuncts.

As a comparison, we see that learning one disjunct at a time led to solutions
containing a greater number of disjuncts, much smaller than the ones found with the
Universal Suffrage operator. In particular, the disjuncts found in the last part of
the run tended to become smaller and smaller as the set of examples still to be
covered decreased. In the present case, the examples of Rule 1 have been covered by
a set of 3 to 4 small disjuncts, which proved slightly inconsistent on the test
set, as described in Table IX.

Table IX
Results for the "Thousand Trains" problem, obtained using a learning set of 1000 learning events
and a test set of 1000 independently generated events. Vavg is the average completeness and Wavg
is the average consistency over 6 runs. Runs of the current version of REGAL, exploiting only one
node and the freezing mechanism, give results comparable to the first row of the table.
Method # Disjuncts Learning Set Test Set Generations
Vavg Wavg Vavg Wavg
Univ. Suff., Freezing 5 1.0 1.0 1.0 1.0 200
Class. Sel., One at a time 7.6 1.0 1.0 0.98 0.99 800

6.5. Discussion
On the basis of the experimentation described so far, REGAL's capabilities emerge
quite clearly. A first positive feature is its ability to successfully deal with difficult
applications, with relatively small global populations and exploiting the GA's explicit
parallelism.

A second feature is the robustness with respect to the control parameters. Many
runs have been executed under very different conditions and with different
population sizes, but REGAL was always able to find a reasonable solution.
Moreover, we found that REGAL is very easy to use, i.e., it is easy to find a
reasonable template and to define the predicates in the concept description
language.

A third feature characterizing REGAL is its bias toward generalization; the fitness
function it currently uses, combined with the Universal Suffrage operator, allows
very general solutions to be found. This can be an advantage in the presence of
large datasets, which, luckily enough, REGAL can handle pretty well, but can become
a disadvantage when datasets are small. As an example, the classification error on
the "Office Documents" dataset (230 examples) is relatively high with respect to
the other applications. On the other hand, we may notice that 5 out of the 6 errors
are commission errors, due exactly to overgeneralization. It could be worth
investigating the possibility of flexibly controlling this generalization bias.

Finally, REGAL seems better suited to coarse-grained multiprocessors or LAN
architectures than to fine-grained parallel processors. Even though this point still needs
further investigation, it seems that fragmenting the population over too many nodes
prevents good generalizations from being found. The reason is that, when a species is
confined to a single node with a small population, its chances of interacting with other
species become very low, and the generalization process slows down. A possible way
of exploiting fine-grained parallelism, which we are currently exploring, is to allow the
same niche to extend over more than one node, so that species have a greater number
of representatives in the network when the population is distributed over a larger
number of processors.

7. Related Work
Even though the preferential area of application for GAs has so far been function
optimization, a few systems oriented to concept learning already exist, all of them
searching propositional hypothesis spaces. REGAL, with its ability to deal with First
Order Logic hypotheses, thus constitutes an exception. On the other hand, it shares
with other systems, such as GABIL [De Jong, Spears & Gordon, 1993] and COGIN
[Greene & Smith, 1993], the fixed-length bit string encoding approach, aimed at
exploiting standard genetic operators.

GABIL learns single concepts described by an unordered set of propositional
classification rules in DNF, in which each conjunctive term may have internal
disjunction, as in REGAL5. The system follows the Pittsburgh approach: each
individual, corresponding to a whole classification theory, is a concatenation of a
variable number of rules. Classical crossover and mutation are used, provided that the
crossover's cut points preserve the rule semantics. Individuals are evaluated according
to a fitness function equal to the square of the proportion of correctly classified
examples. The main novelties in GABIL are the possibility of running the system in
a "batch-incremental" mode, i.e., restarting the system every time a new example is
misclassified, and the addition of task-specific operators, closely matching
Michalski's Adding alternatives and Dropping conditions generalization rules
[Michalski, 1983]. Finally, the system is adaptive, as each individual is provided with
control bits corresponding to the operators; if the bit corresponding to an operator is
set to 1, the operator can be applied to the individual, otherwise it cannot.
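The fitness computation just described is simple enough to sketch. The following is an illustrative reconstruction: the rule representation and the helper names (`covers`, `classify`) are ours, not GABIL's.

```python
def covers(rule, x):
    # A rule lists the admissible values for each attribute
    # (internal disjunction); it covers x if every attribute matches.
    return all(x[attr] in allowed for attr, allowed in rule.items())

def classify(theory, x):
    # An individual is a whole theory: a disjunction of rules.
    return any(covers(rule, x) for rule in theory)

def gabil_fitness(theory, examples):
    # GABIL-style fitness: the square of the proportion of correctly
    # classified examples.
    correct = sum(1 for x, label in examples
                  if classify(theory, x) == label)
    return (correct / len(examples)) ** 2

theory = [{"size": {"small"}, "color": {"red", "blue"}}]
examples = [({"size": "small", "color": "red"}, True),
            ({"size": "small", "color": "green"}, False),
            ({"size": "large", "color": "red"}, False),
            ({"size": "small", "color": "blue"}, False)]
print(gabil_fitness(theory, examples))  # 3 of 4 correct: (0.75)**2 = 0.5625
```

Squaring the proportion sharpens the selective pressure toward individuals that classify nearly all examples correctly.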

The system COGIN learns multiple, disjunctive concepts, expressed in a
propositional language with internal disjunction. Negation is handled by means of a
special symbol, introduced to complement the set of values of a given attribute, as is
done in REGAL. The chromosome is a fixed-length bit string and corresponds to
one classification rule. The approach is a hybrid one between Pittsburgh and
Michigan. The population A(t), of variable cardinality, is the current solution and
contains individuals which are all different. At each cycle, A(t) is duplicated into
A'(t) and individuals from the two are paired. One-point crossover is applied, and
the union of old and new individuals is ranked according to the fitness. A competitive

5 REGAL's concept description language differs from GABIL's in that predicates may contain
variables, which places it in the First Order Logic framework.
replacement operator generates A(t+1), which contains as many individuals as are
necessary to cover the whole training set, starting from the fittest ones. The system
takes into account the specific characteristics of the discrimination task, in that it is
"coverage-based". The system halts when a maximum number of generations is
reached, and the solution is the best one found during the whole evolution process. A
lexicographically-screened fitness function evaluates rules, taking into account the
rule's information gain and the number of misclassified examples. It is interesting to
note that, in COGIN, the role of selection has been transferred to the competitive
replacement step.
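The cycle just described can be sketched as follows. This is an illustrative reconstruction under our own simplifications (rules as bit strings, user-supplied fitness and coverage functions), not COGIN's actual implementation:

```python
import random

def cogin_generation(population, fitness, covers, training_set):
    # One COGIN-style cycle: duplicate the population, pair individuals,
    # apply 1-point crossover, then competitively keep the fittest rules
    # until the training set is fully covered.
    mates = population[:]          # A'(t), a duplicate of A(t)
    random.shuffle(mates)
    offspring = []
    for a, b in zip(population, mates):
        cut = random.randint(1, len(a) - 1)
        offspring.extend([a[:cut] + b[cut:], b[:cut] + a[cut:]])
    # Rank the union of old and new individuals (all kept distinct).
    pool = sorted(set(population + offspring), key=fitness, reverse=True)
    # Competitive replacement: add rules, fittest first, until coverage.
    new_pop, uncovered = [], set(training_set)
    for rule in pool:
        if not uncovered:
            break
        new_pop.append(rule)
        uncovered -= {e for e in uncovered if covers(rule, e)}
    return new_pop

# Toy run: rules are bit strings; rule r covers example e if bit e is set.
random.seed(1)
pop = ["1010", "0101", "1111"]
result = cogin_generation(pop, fitness=lambda r: r.count("1"),
                          covers=lambda r, e: r[e] == "1",
                          training_set=[0, 1, 2, 3])
print(result)  # ["1111"]: one rule already covers every example
```

Note how the coverage test, not a fixed population size, determines the cardinality of A(t+1).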

In contrast to REGAL, COGIN and GABIL, the two systems GIL [Janikow, 1993]
and SIA [Venturini, 1993] adopt a logical representation for the chromosome and use
special-purpose, task-dependent operators to generate offspring.

GIL learns a single concept, expressed in VL1 [Michalski et al., 1986]. The
system follows the Pittsburgh approach: the population has fixed cardinality and each
individual represents a whole set of rules. The chromosome has an external
representation as a logical formula and an internal one as a fixed-length bit vector.
The system allows the initial population to contain random individuals, randomly
chosen single positive examples, or even a priori given hypotheses.
Individuals are evaluated according to a fitness function which takes into account
completeness, consistency and simplicity of the rules. GIL utilizes 14 genetic
operators, working at the level of whole classification theories, rules, or
conditions; some are specialization operators, some are generalization
operators, while others have mixed effects. Among them, the New-Event operator,
which adds to a set of rules the description of a single, not yet covered, positive
example, acts as the "seed" does in AQ [Michalski et al., 1986], and plays a role
similar to that of the seeding operator in REGAL. A drawback of GIL is its excessive
number of parameters, which makes the system unstable across different domains and
across different runs in the same domain.

The system SIA learns classification rules with weights in a propositional language.
The system, which allows for multiple classifications and missing attribute values, is
basically an AQ system in which the Star search method has been replaced by a GA.
Chromosomes represent conjunctive rules, and disjuncts are learned one at a time.
This system, too, has an operator, called Creation, that generates a formula covering a
given example. Moreover, there are generalization operators and crossover. Generated
offspring may compete with their parents to enter the new generation. The fitness
function takes into account completeness, consistency and simplicity of the rules.
SIA's approach is basically Michigan's, but with a search space reduced to the part
containing only generalizations of a given example. As SIA may generate a redundant
set of rules, a pruning phase may follow, in which rules are evaluated in the context
of the others and the worst ones are dropped. The system worked reasonably well on
artificial domains, but had difficulties when applied to a real-world problem from the
French justice system.

The idea of avoiding a distance measure and of "sharing examples" among the
formulas covering them had already been used by McCallum and Spackman [1990].
The method they propose is more similar to the one implemented in a previous version
of REGAL [Giordana & Sale, 1992]: species formation is controlled through a
modification of the fitness function. This method often proved unable to let
stable subpopulations form. A theoretical investigation of McCallum and
Spackman's method shows that, under pure selection, the population's asymptotic
behaviour will converge, on average, to a homogeneous population, as crowding
does [Neri & Saitta, 1995a]. Methods avoiding the use of a distance measure have
also been proposed by Spears [1994] and Hekanaho [1995].

Finally, the systems BOOLE [Wilson, 1987] and NEWBOOLE [Bonelli et al., 1990],
even though they appear more akin to Classifier Systems, are actually concept learners
and have been used to learn Boolean functions.

Tasks related to concept learning in which GAs have been used include rule
refinement and feature selection. In rule refinement, rules previously acquired by
means of other algorithms are modified using a GA [Bala, De Jong & Pachowicz,
1991], or parameters occurring in the rules are tuned so as to increase their classification
power [Botta, Giordana & Saitta, 1993]. The use of GAs for feature selection has
been explored in [Siedlecki & Sklansky, 1989; Vafaie & De Jong, 1991], and
proposed for feature construction in [Giordana, Lo Bello & Saitta, 1994]. Rule
induction with GAs is also the basis of the system SAMUEL [Grefenstette, Ramsey
& Schultz, 1990], which learns rules for sequential decision making from "episodes".

Distributing GAs is a research topic that is receiving increasing attention. Several
models of distribution have been proposed and analysed in recent years (see, for
example, [Pettey & Leuze, 1989; Baluja, 1993; Grefenstette, 1981; Kommu &
Pomeranz, 1992; Maruyama et al., 1993; Tanese, 1987, 1989]). Most of them were
aimed at increasing computation speed, and, in fact, linear [Tanese, 1987] and even
superlinear speed-ups [Shonkwiler, 1993] have been reported. One of the main issues
in distributing GAs is to determine a suitable level for the amount and frequency of
communication among processors. In fact, too tight a communication
favors premature convergence and increases computation time, whereas too loose a
communication may reduce the quality of the solution, reducing the exploration
power of the single GAs by reducing their population size. One way of resolving this
conflict is to forcibly differentiate the nodal GAs, as REGAL does by assigning
to each node a different subset of examples to cover. The idea of differentiating nodal
GAs has been used before, by a variety of methods. For instance, in the GAMAS
system [Potts, Giddens & Yadav, 1994], four GAs, namely four functionally
differentiated species, cooperate in searching for a solution to NP-hard optimization
problems while avoiding premature convergence. In particular, one species has great
exploration power, given to it by a high mutation rate, another is more aimed at
exploitation, with a very low mutation rate, whereas a third one behaves in between.
A special SPECIES I, held in isolation, is generated by artificial selection, i.e., it
contains the fittest individuals found by the other three. Moreover, evolution goes
through a series of cycles, because the three evolving species, other than SPECIES I,
are restarted every 60 generations.

The formation of different species in different nodes of a network has also been the
goal of Cohoon et al. [1987], who applied to GAs the theory of Punctuated
Equilibria [Vose & Liepins, 1991]. Evolution proceeds through epochs, during which
populations evolve independently, reaching an equilibrium that is eventually disrupted
by a "catastrophe", i.e., by letting individuals migrate among nodes, thus changing the
species' environment.
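The epoch/catastrophe scheme is easy to sketch as an island model. The following is an illustration under our own simplifications (binary tournaments plus mutation inside each island, and ring migration of each island's best as the "catastrophe"), not the algorithm of Cohoon et al.:

```python
import random

def evolve_epoch(island, steps, mutate, fitness):
    # Isolated evolution: repeated binary tournaments; the loser is
    # overwritten by a mutated copy of the winner.
    for _ in range(steps):
        i, j = random.sample(range(len(island)), 2)
        if fitness(island[i]) < fitness(island[j]):
            i, j = j, i
        island[j] = mutate(island[i])

def punctuated_search(islands, epochs, steps, mutate, fitness):
    # Epochs of isolated evolution, punctuated by a "catastrophe":
    # each island's best migrates to its ring neighbour, replacing the
    # neighbour's worst individual and changing its environment.
    for _ in range(epochs):
        for isl in islands:
            evolve_epoch(isl, steps, mutate, fitness)
        bests = [max(isl, key=fitness) for isl in islands]
        for i, isl in enumerate(islands):
            worst = min(range(len(isl)), key=lambda j: fitness(isl[j]))
            isl[worst] = bests[(i - 1) % len(islands)]
    return islands

# Toy run on OneMax: individuals are bit lists, fitness counts the ones.
random.seed(0)
def mutate(x):
    y = x[:]
    k = random.randrange(len(y))
    y[k] ^= 1
    return y

islands = [[[random.randint(0, 1) for _ in range(8)] for _ in range(6)]
           for _ in range(3)]
init_best = max(sum(ind) for isl in islands for ind in isl)
punctuated_search(islands, epochs=5, steps=30, mutate=mutate, fitness=sum)
best = max((ind for isl in islands for ind in isl), key=sum)
print(sum(best))
```

Tuning the epoch length and the number of migrants is precisely the communication trade-off discussed above: frequent migration homogenizes the islands, infrequent migration leaves each small population searching alone.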

8. Conclusions
In this paper we have described REGAL, a new GA-based system for learning concept
descriptions from examples, which exhibits several novelties with respect to other
systems devoted to the same task. In particular, it is able to deal with First Order
Logic concept description languages, uses a new kind of selection operator allowing
the formation of species, is especially tailored to classification tasks, and uses a new
strategy for running the genetic search in parallel.

Considering the experience gained so far with the use of REGAL, we think we have
succeeded in showing that the system is able to work in non-trivial domains and
has the capability to deal with complex and large datasets, competing favourably with
various alternative approaches. Nevertheless, REGAL's behavior and potential may
raise more questions than they answer.

An important aspect concerns the parallel implementation. REGAL's distributed
architecture has been experimentally proven successful in dealing with hard
applications with relatively small global populations. The comparison reported in Fig.
12 shows that too fine-grained a distribution of the niches is not effective with respect
to the generalization capabilities. Hence, REGAL's current architecture is not suited
to exploiting highly parallel machines. One possibility for overcoming this weakness is
to try to improve the current architecture as suggested in Section 6.4. Alternatively,
other methods for exploiting massive parallelism also seem worthwhile. First of
all, running independent experiments in parallel, to quickly gather statistical
significance, looks more promising than trying to speed up the execution of a single
experiment using a distributed population network. Second, the most expensive part
of REGAL's algorithm is the evaluation of the individuals. This task, too, could be
performed in parallel, using the well-known master-slave model [Goldberg, 1989].
Therefore, a sound parallel implementation of REGAL seems to be a two-level
architecture, consisting of a small network of nodes, each one implemented as a
master-slave parallel processor.
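The master-slave pattern for fitness evaluation alone can be sketched as follows. The fitness function and worker count are stand-ins, not REGAL's code, and on a real multiprocessor the workers would be separate processors rather than threads:

```python
from concurrent.futures import ThreadPoolExecutor

def fitness(individual):
    # Stand-in for the expensive step: matching one formula
    # against every training example.
    return sum(individual)

def evaluate_population(population, workers=4):
    # Master-slave sketch: the master farms the individuals out to a
    # pool of workers and collects the fitness values in order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fitness, population))

print(evaluate_population([[0, 1, 1], [1, 1, 1], [0, 0, 1]]))  # [2, 3, 1]
```

Since individuals are evaluated independently, this step parallelizes with no communication beyond distributing the individuals and gathering the scores.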

Finally, a third aspect concerns the role genetic algorithms can play in machine
learning. We agree with the view that learning should exploit available
background and domain-specific knowledge as much as possible, trying to limit the
search in the hypothesis space by means of a priori constraints. In this case, in fact,
"symbolic" search algorithms may be more convenient, both because they are able to
exploit that very a priori knowledge, and because they require fewer computational
resources when working inside small search spaces. On the other hand, when little or
no knowledge is available, or when, notwithstanding the imposed constraints, the
search space still remains very large, we have come to believe, on the basis of extensive
experimental work, that genetic algorithms have the potential to become the
winning approach, because of the exploration power they offer, without requiring a
reduction of the representation language of the hypotheses.
References
Bala J., De Jong K.A. and Pachowicz P. (1991). "Learning Noise Tolerant
Classification Procedures by Integrating Inductive Learning and Genetic
Algorithms". Proc. First International Workshop on Multistrategy Learning,
Center for Artificial Intelligence, George Mason University (Harpers Ferry,
WV), pp. 316-323.
Baluja S. (1993). "Structure and Performance of Fine-Grained Parallelism in Genetic
Search". Proc. 5th Int. Conf. on Genetic Algorithms, Morgan Kaufmann
(Urbana-Champaign, IL), pp. 155-163.
Baroglio C., Botta M. and Saitta L. (1994). "WHY: A System that Learns from a
Causal Model and a Set of Examples". In R. Michalski & G. Tecuci (Eds.),
Machine Learning: A Multistrategy Approach, Vol. IV, Morgan Kaufmann (Los
Altos, CA), pp. 319-347.
Bergadano F., Giordana A. and Ponsero S. (1989). "Deduction in Top-Down
Inductive Learning". Proc. 6th Int. Workshop on Machine Learning, Morgan
Kaufmann (Ithaca, NY), pp. 23-25.
Bergadano F. and Gunetti D. (1994). "Learning Clauses by Tracing Derivations".
Proc. 4th Int. Workshop on Inductive Logic Programming, GMD Technical
Report #237 (Bad Honnef, Germany), pp. 11-29.
Bonelli P., Parodi A., Sen S. and Wilson S. (1990). "NEWBOOLE: A Fast GBML
System". Proc. Int. Conf. on Machine Learning 1990, Morgan Kaufmann
(Austin, TX), pp. 153-159.
Botta M. and Giordana A. (1993). "Smart+: A Multistrategy Learning Tool". Proc.
13th Int. Joint Conference on Artificial Intelligence, Morgan Kaufmann
(Chambéry, France), pp. 937-943.
Botta M., Giordana A. and Saitta L. (1993). "Learning Fuzzy Concept Definition".
Proc. 2nd IEEE Int. Conf. on Fuzzy Systems, IEEE Press (San Francisco, CA),
pp. 18-22.
Cohoon J. P., Hedge S. U., Martin W. N. and Richards D. (1987). "Punctuated
Equilibria: A Parallel Genetic Algorithm". Proc. 2nd Int. Conf. on Genetic
Algorithms, Morgan Kaufmann (Cambridge, MA), pp. 148-154.
Danyluk A. P. and Provost F. J. (1993). "Small Disjuncts in Action: Learning to
Diagnose Errors in the Local Loop of the Telephone Network". Proc. Int. Conf.
on Machine Learning, Morgan Kaufmann (Amherst, MA), pp. 81-88.
De Jong K. A. (1975). "Analysis of the Behaviour of a Class of Genetic Adaptive
Systems". Doctoral Dissertation, Dept. of Computer and Communication
Sciences, University of Michigan, Ann Arbor, MI.
De Jong K. and Spears W. M. (1991). "Learning Concept Classification Rules Using
Genetic Algorithms". Proc. 12th Int. Joint Conference on Artificial Intelligence,
Morgan Kaufmann (Sydney, Australia), pp. 768-774.
De Jong K. A., Spears W. M. and Gordon F. D. (1993). "Using Genetic Algorithms
for Concept Learning". Machine Learning, 13, 161-188, Kluwer Academic
Publishers.
Deb K. and Goldberg D. (1989). "An Investigation of Niches and Species Formation
in Genetic Function Optimization". Proc. 3rd Int. Conf. on Genetic Algorithms,
Morgan Kaufmann (Fairfax, VA), pp. 42-50.
Esposito F., Malerba D. and Semeraro G. (1992). "Classification in Noisy
Environments Using a Distance Measure Between Structural Symbolic
Descriptions". IEEE Trans. on Pattern Analysis and Machine Intelligence,
PAMI-14, 390-402, IEEE Computer Society.
Gemello R., Mana F. and Saitta L. (1991). "RIGEL: An Inductive Learning System".
Machine Learning, 6, 7-36, Kluwer Academic Publishers.
Giordana A. and Sale C. (1992). "Genetic Algorithms for Learning Relations". Proc.
9th Int. Conf. on Machine Learning, Morgan Kaufmann (Aberdeen, UK), pp.
169-178.
Giordana A. and Saitta L. (1993). "REGAL: An Integrated System for Learning
Relations Using Genetic Algorithms". Proc. 2nd International Workshop on
Multistrategy Learning, Center for Artificial Intelligence, George Mason
University (Harpers Ferry, VA), pp. 234-249.
Giordana A. and Saitta L. (1994). "Learning Disjunctive Concepts by Means of
Genetic Algorithms". Proc. Int. Conf. on Machine Learning, Morgan
Kaufmann (New Brunswick, NJ), pp. 96-104.
Giordana A., Lo Bello G. and Saitta L. (1994). "Abstraction in Propositional
Calculus". Proc. Workshop on Knowledge Compilation and Speed-Up Learning
(Amherst, MA), pp. 56-64.
Giordana A., Neri F. and Saitta L. (1994). "Formal Models of Selection in Genetic
Algorithms". Proc. Int. Conf. on Methodologies for Intelligent Systems
(Charlotte, NC), pp. 124-133, Springer-Verlag.
Giordana A., Saitta L. and Zini F. (1994). "Learning Disjunctive Concepts with
Distributed Genetic Algorithms". Proc. First IEEE Conf. on Evolutionary
Computation, IEEE Press (Orlando, FL), pp. 115-119.
Goldberg D.E. and Richardson J. (1987). "Genetic Algorithms with Sharing for
Multimodal Function Optimization". Proc. 2nd Int. Conf. on Genetic Algorithms,
Morgan Kaufmann (Cambridge, MA), pp. 41-49.
Goldberg D.E. (1989). Genetic Algorithms in Search, Optimization, and Machine
Learning, Addison-Wesley (Reading, MA).
Greene D.P. and Smith S.F. (1993). "Competition-Based Induction of Decision
Models from Examples". Machine Learning, 13, 229-258, Kluwer Academic
Publishers.
Grefenstette J. J. (1981). "Parallel Adaptive Algorithms for Function Optimization".
Technical Report CS-81-19, Vanderbilt University (Nashville, TN).
Grefenstette J. J., Ramsey C. L. and Schultz A. C. (1990). "Learning Sequential
Decision Rules Using Simulation Models and Competitions". Machine Learning,
5, 355-381, Kluwer Academic Publishers.
Hekanaho J. (1995). "Symbiosis in Multimodal Concept Learning". Proc. 12th
International Conference on Machine Learning, Morgan Kaufmann (Lake
Tahoe, CA), pp. 278-285.
Holland J. H. (1975). Adaptation in Natural and Artificial Systems. University of
Michigan Press, Ann Arbor, MI.
Holland J. H. (1986). "Escaping Brittleness: The Possibilities of General Purpose
Learning Algorithms Applied to Parallel Rule-Based Systems". In R. Michalski,
J. Carbonell & T. Mitchell (Eds.), Machine Learning: An AI Approach, Vol. II,
Morgan Kaufmann, Los Altos, CA, pp. 593-623.
Holte R., Acker L. and Porter B. (1989). "Concept Learning and the Problem of Small
Disjuncts". Proc. 11th Int. Joint Conference on Artificial Intelligence, Morgan
Kaufmann (Detroit, MI), pp. 813-818.
Holte R.C. (1993). "Very Simple Classification Rules Perform Well on Most
Commonly Used Datasets". Machine Learning, 11, 63-90, Kluwer Academic
Publishers.
Husbands P. and Mill F. (1991). "Co-evolving Parasites Improve Simulated Evolution
as an Optimization Procedure". Proc. 4th Int. Conf. on Genetic Algorithms,
Morgan Kaufmann (Fairfax, VA), pp. 264-270.
Janikow C. Z. (1993). "A Knowledge Intensive Genetic Algorithm for Supervised
Learning". Machine Learning, 13, 189-228, Kluwer Academic Publishers.
Kommu V. and Pomeranz I. (1992). "Effect of Communication in a Parallel Genetic
Algorithm". Proc. Int. Conf. on Parallel Processing, Morgan Kaufmann, pp. III-
310-317.
Koza J. R. (1992). Genetic Programming: On the Programming of Computers by
Means of Natural Selection. MIT Press, Cambridge, MA.
Maruyama T., Hirose T. and Konagaya A. (1993). "A Fine Grained Parallel Genetic
Algorithm for Distributed Parallel Systems". Proc. 5th Int. Conf. on Genetic
Algorithms, Morgan Kaufmann (Urbana-Champaign, IL), pp. 184-190.
McCallum R. A. and Spackman K. A. (1990). "Using Genetic Algorithms to Learn
Disjunctive Rules from Examples". Proc. Int. Conf. on Machine Learning,
Morgan Kaufmann (Austin, TX), pp. 149-152.
Michalski R. (1983). "A Theory and Methodology of Inductive Learning". In R.
Michalski, J. Carbonell & T. Mitchell (Eds.), Machine Learning: An AI
Approach, Vol. I, Morgan Kaufmann, Los Altos, CA, pp. 83-134.
Michalski R., Mozetic I., Hong J. and Lavrac N. (1986). "The Multi-Purpose
Incremental Learning System AQ15 and its Testing Application to Three
Medical Domains". Proc. Fifth National Conference on Artificial Intelligence,
American Association for Artificial Intelligence (Philadelphia, PA), pp. 1041-
1045.
Michalski R.S. (1980). "Pattern Recognition as a Rule-Guided Inductive Inference".
IEEE Trans. on Pattern Analysis and Machine Intelligence, PAMI-2, 349-361,
IEEE Computer Society.
Mitchell T. M. (1982). "Generalization as Search". Artificial Intelligence, 18, 203-
226, Elsevier.
Mitchell T. M., Keller R. M. and Kedar-Cabelli S. T. (1986). "Explanation-Based
Generalization: A Unifying View". Machine Learning, 1, 47-80, Kluwer
Academic Publishers.
Morik K. (1991). "Balanced Cooperative Modeling". Proc. 1st Multistrategy
Learning Workshop, Center for Artificial Intelligence, George Mason University
(Harpers Ferry, WV), pp. 65-80.
Neri F. and Giordana A. (1995). "A Distributed Genetic Algorithm for Concept
Learning". Proc. Int. Conf. on Genetic Algorithms, Morgan Kaufmann
(Pittsburgh, PA), pp. 436-443.
Neri F. and Saitta L. (1995a). "A Formal Analysis of Selection Schemes". Proc. Int.
Conf. on Genetic Algorithms, Morgan Kaufmann (Pittsburgh, PA), pp. 32-39.
Neri F. and Saitta L. (1995b). "Comparative Formal Analysis of Selection Operators".
Technical Report TR 94-73, Dip. Informatica, University of Torino (Torino,
Italy).
Pazzani M. and Kibler D. (1992). "The Utility of Knowledge in Inductive Learning".
Machine Learning, 9, 57-94, Kluwer Academic Publishers.
Pettey C.C. and Leuze M.R. (1989). "A Theoretical Investigation of a Parallel Genetic
Algorithm". Proc. 3rd Int. Conf. on Genetic Algorithms, Morgan Kaufmann
(Fairfax, VA), pp. 398-405.
Potter M.A., De Jong K.A. and Grefenstette J.J. (1995). "A Coevolutionary Approach
to Learning Sequential Decision Rules". Proc. Int. Conf. on Genetic Algorithms,
Morgan Kaufmann (Pittsburgh, PA), pp. 366-372.
Potts J. C., Giddens T. D. and Yadav S. B. (1994). "The Development and Evaluation
of an Improved Genetic Algorithm Based on Migration and Artificial Selection".
IEEE Trans. on Systems, Man, and Cybernetics, SMC-24, 73-86, IEEE
Computer Society.
Quinlan J. R. (1990). "Learning Logical Definitions from Relations". Machine
Learning, 5, 239-266, Kluwer Academic Publishers.
Quinlan J. R. (1991). "Improved Estimates for the Accuracy of Small Disjuncts".
Machine Learning, 6, 93-98, Kluwer Academic Publishers.
Rachlin J., Kasif S., Salzberg S. and Aha D.W. (1994). "Towards a Better
Understanding of Memory-Based Reasoning Systems". Proc. 11th Machine
Learning Conference, Morgan Kaufmann (New Brunswick, NJ), pp. 242-250.
Saitta L., Botta M. and Neri F. (1993). "Multistrategy Learning and Theory
Revision". Machine Learning, 11, 153-172, Kluwer Academic Publishers.
Schlimmer J.S. (1987). "Concept Acquisition through Representational Adjustment".
TR 87-19, Dept. of Information and Computer Science, Univ. of California,
Irvine, CA.
Shonkwiler R. (1993). "Parallel Genetic Algorithms". Proc. 5th Int. Conf. on Genetic
Algorithms, Morgan Kaufmann (Urbana-Champaign, IL), pp. 199-205.
Siedlecki W. and Sklansky J. (1989). "A Note on Genetic Algorithms for Large Scale
Feature Selection". Pattern Recognition Letters, 10, 335-347, Elsevier.
Silverstein G. and Pazzani M. (1991). "Relational Clichés: Constraining Constructive
Induction during Relational Learning". Proc. 8th Int. Workshop on Machine
Learning, Morgan Kaufmann (Evanston, IL), pp. 203-207.
Smith S. (1983). "Flexible Learning of Problem Solving Heuristics through Adaptive
Search". Proc. 8th Int. Joint Conf. on Artificial Intelligence, Morgan Kaufmann
(Karlsruhe, Germany), pp. 422-425.
Spears W. (1994). "Simple Subpopulation Schemes". Proc. of Workshop on
Evolutionary Programming.
Syswerda G. (1989). "Uniform Crossover in Genetic Algorithms". Proc. 3rd Int.
Conf. on Genetic Algorithms, Morgan Kaufmann (Fairfax, VA), pp. 2-9.
Tanese R. (1987). "Parallel Genetic Algorithm for a Hypercube". Proc. 2nd Int. Conf.
on Genetic Algorithms, Morgan Kaufmann (Cambridge, MA), pp. 177-183.
Tanese R. (1989). "Distributed Genetic Algorithms". Proc. 3rd Int. Conf. on Genetic
Algorithms, Morgan Kaufmann (Fairfax, VA), pp. 434-439.
Towell G.G. and Shavlik J.W. (1994). "Knowledge-Based Artificial Neural
Networks". Artificial Intelligence, 70, 119-165, Elsevier.
Tecuci G. (1991). "Learning as Understanding the External World". Proc. First
International Workshop on Multistrategy Learning, Center for Artificial
Intelligence, George Mason University (Harpers Ferry, WV), pp. 49-64.
Vafaie H. and De Jong K.A. (1991). "Improving the Performance of Rule Induction
System Using Genetic Algorithms". Proc. First International Workshop on
Multistrategy Learning, Center for Artificial Intelligence, George Mason
University (Harpers Ferry, VA), pp. 305-315.
Venturini G. (1993). "SIA: A Supervised Inductive Algorithm with Genetic Search
for Learning Attribute Based Concepts". Proc. European Conference on
Machine Learning (Vienna, Austria), pp. 280-296, Springer-Verlag.
Vose M.D. and Liepins G.E. (1991). "Punctuated Equilibria in Genetic Search".
Complex Systems, 5, 31-44.
Wilson S. (1987). "Classifier Systems and the Animat Problem". Machine Learning,
2, 199-228, Kluwer Academic Publishers.
Yeung D.Y. (1991). "A Neural Network Approach to Constructive Induction". Proc.
8th Int. Conf. on Machine Learning, Morgan Kaufmann (Evanston, IL), pp. 228-
232.

Acknowledgements
The authors are very grateful to the Centre National de Calcul Parallèle en Sciences
de la Terre (Paris, France) and to the Thinking Machines Corporation for making the
CM5 Connection Machine available to us. Particular thanks go to Andreas Pirklbauer,
to Patrick Stoclet and to Prof. Ottorino.