Search-Intensive Concept Induction

Attilio Giordana and Filippo Neri
Università di Torino
Dipartimento di Informatica
Corso Svizzera 185
10149 TORINO (Italy)
{attilio, neri}@di.unito.it
Abstract
This paper describes REGAL, a distributed genetic algorithm-based system designed
for learning First Order Logic concept descriptions from examples. The system is a
hybrid between the Pittsburgh and the Michigan approaches, as the population
constitutes a redundant set of partial concept descriptions, each evolved separately. In
order to increase effectiveness, REGAL is specifically tailored to the concept learning
task; hence, REGAL is task-dependent but, on the other hand, domain-independent.
The system proved to be particularly robust with respect to parameter settings across a
variety of different application domains.
The system has been tested on a simple artificial domain, for the sake of illustration,
and on several complex real-world and artificial domains, in order to show its power,
and to analyze its behaviour under various conditions. The results obtained so far
suggest that genetic search may be a valuable alternative to logic-based approaches to
learning concepts, when no (or little) a priori knowledge is available and a very large
hypothesis space has to be explored.
1. Introduction
Learning concepts from examples has been widely investigated in Machine Learning,
both for its theoretical relevance and for its potential impact on many application
domains. Nevertheless, in spite of the great number of algorithms proposed so far, the
problem is still far from a satisfactory general solution and, hence, this learning task
is still the focus of much ongoing and new research.
One of the major obstacles on the path to effectively learning real-world concepts is
computational complexity. As Mitchell [1982] noticed, the problem can be
formulated as a search in a hypothesis space corresponding to candidate descriptions
of the concept in a specified language. In most useful applications this space is so
large that finding "good" descriptions seems almost hopeless. Nevertheless, we can
think of at least two ways of surmounting this obstacle: either to use as much a priori
knowledge as possible, in order to focus the search towards smaller but interesting
subspaces (see, for instance, [Mitchell, Keller & Kedar-Cabelli, 1986; Tecuci, 1991;
Saitta, Botta & Neri, 1993]), or to increase the algorithm's exploration power and its
ability to exploit computational resources, such as parallelism. Genetic Algorithms
[Holland, 1975; De Jong, 1975] follow this second line.
Genetic Algorithms (GAs) are not learning algorithms per se, but they may offer a
powerful and domain-independent search method to a variety of learning tasks. In fact,
even though GAs have been successfully applied mainly to optimization problems,
machine learning has also benefited from them. The most well-known approach in
this sense is represented by the Genetic Classifiers paradigm, proposed by Holland
[1986] and then investigated by many others. However, this paradigm binds the GAs
to a very specific model of learning agents, which does not always fit the learning task
one has to face. Therefore, other authors investigated how to introduce GAs into
already existing learning frameworks. For example, proposals for using GAs in
supervised concept learning have been put forward both for propositional concept
representation languages [Wilson, 1987; McCallum & Spackman, 1990; De Jong &
Spears, 1991; Greene & Smith, 1993; De Jong, Spears & Gordon, 1993; Janikow,
1993; Venturini, 1993] and for First Order Logic languages [Giordana & Sale, 1992;
Giordana & Saitta, 1993]. Genetic algorithms have also been used to learn rules for
sequential decision making [Grefenstette, Ramsey & Schultz, 1990] and to refine
numeric constants in a knowledge base [Bala, De Jong & Pachowicz, 1991; Botta,
Giordana & Saitta, 1993].
From the experience gathered so far, genetic algorithms seem to emerge, also in
Machine Learning, as an appealing alternative to classical search algorithms for
exploring large spaces of inductive hypotheses. In particular, they exhibit two major
characteristics which are particularly attractive: as already mentioned, GAs are highly
parallel in nature and can exploit parallel machines, whereas classical search
algorithms do not easily lend themselves to parallel processing; moreover, the type of
search performed by a GA may, in some cases, give it the capability of escaping from
local minima, whereas greedy algorithms may not. For instance, Giordana and Sale
[1992] present a typical induction problem in which a genetic algorithm can easily
find a "perfect" solution, whereas any induction algorithm guided by information
gain heuristics is bound to be misled towards suboptimal solutions. The opposite is
also true: some problems, easy to solve with information gain based heuristics, can be
hard for a genetic algorithm. For this reason, GAs assume a complementary role with
respect to classical search algorithms, which aim more at a strong exploitation of
the information gathered from the data than at a wide exploration of the hypothesis
space.
This paper systematizes and extends in many respects the framework for learning
concept descriptions in First Order Logic previously proposed in [Giordana & Sale,
1992; Giordana & Saitta, 1993, 1994; Giordana, Neri & Saitta, 1994; Giordana, Saitta
& Zini, 1994; Neri & Giordana, 1995; Neri & Saitta, 1995a,b]. More specifically, it
describes in all relevant details a substantially new version of the system REGAL,
already summarized in [Neri & Giordana, 1995], and focuses on the
problem of learning multimodal concepts, i.e. concepts requiring disjunctive
descriptions. The starting point of REGAL is the selection operator, especially
tailored to tackle the problem at hand, described in [Giordana, Saitta & Zini, 1994],
where the properties of this new selection operator have been theoretically
investigated and experimentally confirmed. However, the important novelty of the
current version of REGAL, with respect to the one reported in [Giordana, Saitta &
Zini, 1994], consists in a new distributed architecture for the genetic algorithm, which
extends and modifies the classical network model [Goldberg, 1989a; Pettey & Leuze,
1989] in order to distribute different niches on different computational nodes.
In classical (non GA-based) rule learning algorithms, the most common strategy used
for dealing with multimodality consists in learning one disjunct at a time [Michalski,
1980, 1983; Michalski et al., 1986]. The learning algorithm searches for some
conjunctive concept description; then, the positive instances covered by the
description are removed from the learning set and the algorithm starts again, looking
for another description covering some (or all) of the remaining positive examples.
The procedure is repeated until all the positive instances are covered. In the case of
multiple concept learning, the same policy is iterated for each concept. On the other
hand, there also exist rule learning algorithms which work on an entire classification
theory, dealing with both multiple and multimodal concepts at the same time (see, for
instance, [Botta & Giordana, 1993]). Decision tree construction is also typically
done by generating all the disjuncts at the same time.
The task of learning concept descriptions can be mapped to both the Pittsburgh and
the Michigan approach. In the former, each chromosome encodes a whole
classification theory, while the component rules correspond to different concept
modalities. As mentioned before, a simple GA can be used, as a complete concept
description corresponds to a "best" individual, according to a task-dependent fitness
function. Recent instances of this approach are the systems GABIL [De Jong, Spears
& Gordon, 1993] and GIL [Janikow, 1993], which learn concept descriptions in the
VL1 language [Michalski et al., 1986], and the system described in [Koza, 1991],
which learns Lisp programs.
A way out of the dilemma is proposed by Venturini [1993], who suggested learning
one disjunct at a time. However, this strategy offers an oversimplified solution to the
problem and tends to find less general descriptions than the ones formed by the above
mentioned approaches.
The method based on sharing functions modifies the selection probability, with the
aim of inhibiting the excessive growth of the genetic pressure of a subpopulation.
This is achieved by reducing the fitness of an individual depending on the number of
existing individuals similar to it. In the initial formulation, genotypic sharing was
considered [Goldberg & Richardson, 1987]. The fitness value f(j), associated to an
individual j, was considered as a reward from the environment to be shared with
other individuals. Similarity between two individuals, j and j', was defined as the
Hamming distance between their associated bit strings. Phenotypic sharing can also
be used, by considering the distance between two individuals in their semantic
domain [Deb & Goldberg, 1989]. By applying both crowding and sharing functions to
a set of benchmark optimization problems, Deb and Goldberg [1989] observed that
crowding was less effective, because it sometimes could not find some of the existing
fitness maxima. In an earlier version of REGAL we also tried to apply crowding; the
method proved to work well for learning many unimodal concepts at one time, but was
unable to allow a stable formation of subpopulations representative of disjunctive
definitions of the same concept. In all the experiments performed, in the long term a
single disjunct overcame the other ones.
Phenotypic or genotypic sharing works well in vectorial spaces, where the notion of a
distance has a clear semantics. Deb & Goldberg [1989] found a performance
difference between phenotypic and genotypic distances, in favour of the phenotypic
one: by using the Hamming distance as genotypic distance, the GA showed a
behaviour intermediate between crowding and phenotypic sharing with respect to the
ability of finding all the existing maxima. Another method for generating niches,
which is based on tags and potentially does not suffer from the previous problems,
has been recently proposed by Spears [1994]; in principle, it could be a viable
alternative to the solution adopted for REGAL that is described in the following.
The difference with respect to the Michigan approach is that each individual evolves
separately and a complete solution is formed only at the end of the run. Despite its
apparent similarity to a classifier system, REGAL deeply differs from it, because it
uses a totally different mechanism to promote the formation of species of conjunctive
formulas, competing for covering the instances in the learning set.
There are several reasons why we did not simply use sharing functions. The first, and
most fundamental one, is the difficulty of defining a meaningful distance measure.
REGAL is designed to learn concepts expressed in a First Order Logic (FOL)
language, and no satisfactory general definition of a distance among FOL formulas
exists. Even if such a measure could be defined, it would be strictly domain dependent;
this fact would be a limiting factor for the applicability of the system to a whole
spectrum of domains. The situation is made even worse by the problem-dependency
of the parameters appearing in the sharing function method. A second reason is
computational complexity: the shared fitness evaluation requires, at each generation, a
number of steps proportional to M², where M is the cardinality of the population. A third
reason derives from the fact that sharing functions (based on either genotypic or
phenotypic distance) can be proved unable to separate two species under
certain conditions that cannot be tested a priori [Giordana, Neri & Saitta, 1994; Neri & Saitta,
1995b].
When growing different species, an important issue is how to avoid "lethal matings",
i.e., matings that are bound to produce bad offspring. If two conjunctive sub-concepts,
φ and ψ, cannot be generalized further to a single conjunctive description,
crossing φ with ψ will inexorably generate less fit individuals. We will discuss later
on how REGAL faces this problem, by imposing a long-term control strategy on the
basic genetic evolution, which exploits the features of the distributed implementation.
Learning one disjunct at a time may be an effective solution to the problem of lethal
matings [Venturini, 1993]. However, one can expect more difficulties in generalizing
the latest and smaller disjuncts, because of the absence of individuals covering the
previously removed examples. A possible remedy to this drawback may be to use
task-oriented mutation operators, such as "dropping condition" or "adding condition"
operators, in order to obtain from them what cannot be obtained by pure crossover
[De Jong, Spears & Gordon, 1993; Greene & Smith, 1993; Janikow, 1993].
In REGAL, an atomic formula of arity m has the syntactic form P(x1, x2, ..., xm, K),
where x1, x2, ..., xm are variables and the term K is a disjunction of constant terms,
denoted by [v1, v2, ..., vn], or the negation of such a disjunction, denoted by ¬[v1, v2,
..., vn]. At most one term can be a (positive or negative) disjunctive expression, while
all the others must be variables. Examples of well formed atomic expressions are:

color(x1, [yellow, green]), color(x1, ¬[blue, black])
far(x1, x2, [2, 3, 4]), greater(x1, x2), tall(x1)

Notice that the expression color(x1, ¬[blue, black]) is semantically equivalent to
¬color(x1, blue) ∧ ¬color(x1, black). Therefore, the negation of an internal disjunction
is equivalent to a conjunction of negated atoms in clausal form, as, for instance, in
FOIL [Quinlan, 1990]. When a formula φ is evaluated on a concept instance x, each
variable in φ has to be bound to some object occurring in the description of x. Then,
the predicates occurring in φ are evaluated on the basis of the attributes of the objects
bound to their variables. As the binding between the variables in φ and the objects in
x can be chosen in many ways, φ is said to be true of x iff there exists at least one choice
such that all the predicates occurring in φ are true. How the predicate semantics is
evaluated in terms of the object attributes must be explicitly defined by the user
before beginning to run REGAL on a specific application. In order to clarify this
point, suppose a concept instance is composed of three objects:
o1:<color = black, position = 1, size = 4, height = 3>
o2:<color = green, position = 2, size = 1, height = 1>
o3:<color = blue, position = 3, size = 10, height = 5>
Each object is described by four attributes: COLOR, POSITION, SIZE and HEIGHT.
Suppose, moreover, that the semantics for the predicates color(x1, [yellow, green]),
far(x1, x2, [2, 3, 4]), greater(x1, x2) and tall(x1) is defined by the expressions:

member(COLOR(x1), K); member(POSITION(x1) − POSITION(x2), K);
SIZE(x1) > SIZE(x2); HEIGHT(x1) > 6

respectively. Then, the predicate color(x1, [yellow, green]) will be true when the
variable x1 is bound to object o2, whereas it is false for any other binding (the term
K is bound to [yellow, green]). In a similar way, far(x1, x2, [2, 3, 4]) is true only when
x1 is bound to o3 and x2 is bound to o1 (the term K is bound to [2, 3, 4]), greater(x1, x2)
is true for the three bindings <x1=o1, x2=o2>, <x1=o3, x2=o1>, <x1=o3, x2=o2>, and
tall(x1) is always false.
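The existential binding semantics just described can be illustrated in code. The following is a minimal sketch, not REGAL's implementation: objects are assumed to be attribute dictionaries, predicate semantics are plain Python functions, and far checks the signed position difference, following the single satisfying binding given in the text.

```python
from itertools import permutations

# Hypothetical encoding of the instance above: each object is a dict of attributes.
objects = {
    "o1": {"color": "black", "position": 1, "size": 4, "height": 3},
    "o2": {"color": "green", "position": 2, "size": 1, "height": 1},
    "o3": {"color": "blue", "position": 3, "size": 10, "height": 5},
}

# User-defined predicate semantics, mirroring the definitions in the text.
def color(x, K):     return x["color"] in K
def far(x1, x2, K):  return x1["position"] - x2["position"] in K
def greater(x1, x2): return x1["size"] > x2["size"]
def tall(x):         return x["height"] > 6

# A conjunctive formula is true of the instance iff SOME binding of its
# variables to distinct objects satisfies every predicate (existential choice).
def holds(formula, n_vars, objs):
    for binding in permutations(objs.values(), n_vars):
        if all(pred(*[binding[i] for i in idx], *args)
               for pred, idx, args in formula):
            return True
    return False

# Illustrative formula: color(x1, [yellow, green]) AND far(x2, x1, [1]).
phi = [(color, (0,), (["yellow", "green"],)),
       (far,   (1, 0), ([1],))]
print(holds(phi, 2, objects))  # True: x1 = o2, x2 = o3
```

Here the binding <x1=o2, x2=o3> satisfies both predicates, so the formula is true of the instance even though most bindings fail.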
In the following we will show how concept descriptions in the language can be
represented on a bit string of fixed length.
Before giving a formal definition of L, it may be useful to revisit the logical notions,
informally introduced above, in order to define the concept of completed form for a
disjunctive term and for a predicate. Given a FOL formula φ(x1, x2, ..., xn), evaluating
the truth of φ in a universe U means searching for a binding between the
variables x1, x2, ..., xn and some constants a1, a2, ..., an, defined in U, such that φ(a1,
a2, ..., an) is satisfied. Notice that in supervised concept learning each learning
instance is a specific universe. We can introduce the following:
It is worth noting how the use of the wildcard "*" naturally leads to the introduction
of negated disjunctive terms. For instance, the predicate shape(x, ¬[square, circle]),
occurring in φ1, is just a rewriting of the predicate shape(x, [triangle, *]), where * =
¬[square, triangle, circle], according to the declaration in the template. More
generally, "*" can always be eliminated from an inductive hypothesis by introducing a
negated term. Finally, we notice that completed predicates occurring in an inductive
hypothesis can be dropped without modifying their semantics, since they are tautologically
true. As an example, in formula φ2 the completed predicate shape(x, [square, triangle,
circle, *]) has been dropped.
Template L

L = weight(x, [3, 4, 5]) ∧ color(x, [red, blue, *]) ∧
    shape(x, [square, triangle, circle, *]) ∧ far(x, y, [1, 2, 3, 4, 5, *])

Lc ≡ weight(x, [3, 4, 5])
Ls ≡ color(x, [red, blue, *]) ∧ shape(x, [square, triangle, circle, *]) ∧
     far(x, y, [1, 2, 3, 4, 5, *])

φ1 = weight(x, [3, 4, 5]) ∧ color(x, [red]) ∧ shape(x, ¬[square, circle]) ∧
     far(x, y, [1, 2])
φ2 = weight(x, [3, 4, 5]) ∧ color(x, ¬[red]) ∧ far(x, y, [1, 2])

          color(x, [red, blue, *])   shape(x, [square, triangle, circle, *])   far(x, y, [1, 2, 3, 4, 5, *])
s(Ls) =          1 1 1                          1 1 1 1                               1 1 1 1 1 1
φ1  ⇒            1 0 0                          0 1 0 1                               1 1 0 0 0 0
φ2  ⇒            0 1 1                          1 1 1 1                               1 1 0 0 0 0

Figure 2 – Bit strings corresponding to the template Ls and to the formulas φ1 and φ2
reported in Figure 1. Every term, including '*', occurring in an internal disjunction has a
corresponding bit on the bit string.
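The mapping from internal disjunctions to a fixed-length bit string can be sketched as follows. This is an illustrative reading of the encoding, not REGAL's code: a formula is assumed to be represented as a map from predicate names to the set of admitted terms, with a dropped (completed) predicate encoded as all ones.

```python
# Template mirroring L_s of Figure 2: each predicate lists its internal disjunction.
template = [
    ("color", ["red", "blue", "*"]),
    ("shape", ["square", "triangle", "circle", "*"]),
    ("far",   ["1", "2", "3", "4", "5", "*"]),
]

def encode(formula):
    """Map {predicate: set of admitted terms} to a fixed-length bit string.
    A predicate missing from the formula is completed: all its bits are 1."""
    bits = []
    for pred, terms in template:
        admitted = formula.get(pred)
        for t in terms:
            bits.append("1" if admitted is None or t in admitted else "0")
    return "".join(bits)

# phi1 = color(x, [red]) ^ shape(x, [triangle, *]) ^ far(x, y, [1, 2])
phi1 = {"color": {"red"}, "shape": {"triangle", "*"}, "far": {"1", "2"}}
# phi2 = color(x, [blue, *]) ^ far(x, y, [1, 2])   (shape dropped, i.e. completed)
phi2 = {"color": {"blue", "*"}, "far": {"1", "2"}}

print(encode(phi1))  # 1000101110000
print(encode(phi2))  # 0111111110000
```

The two strings reproduce the rows of Figure 2: a 0 forbids a term, a 1 admits it, and an all-ones group is a tautological (droppable) predicate.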
From these considerations we can conclude that the apparent limitation due to the
predefined complexity of the template is not a real one, provided that sufficiently long
templates can be handled when necessary. As a matter of fact, REGAL proved
able to work well with strings several hundreds of bits long. Then, we may assume
we are quite free to design the template according to the real needs of the problem,
without encountering excessive difficulties.
In a FOL language the choice of an adequate template may be a non-trivial one. In
order to estimate a reasonable complexity for a template, we can observe that a
non-constructive, bottom-up generalization process, using only "dropping condition"
and "turning constants into variables" rules, will never generate a conjunctive
description more complex than that of the most complex example¹.

Therefore, a method which can be used to design the template is to examine the
learning instances and to insert in the template only the literals that occur in ground
form in at least one of them. In this way, the hypothesis space will not be larger than
the one explored by an exhaustive bottom-up generalization algorithm.
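The template-design heuristic just described (keep only literals occurring in ground form in some instance) might be sketched like this. The instance encoding as lists of (predicate, value) pairs is an assumption of the sketch, not REGAL's actual input format.

```python
# Hypothetical learning instances, each a list of ground literals (predicate, value).
instances = [
    [("color", "red"), ("shape", "square"), ("far", 2)],
    [("color", "blue"), ("far", 1), ("far", 3)],
]

def build_template(instances):
    """Collect, per predicate, the constants seen in ground form in the data."""
    template = {}
    for inst in instances:
        for pred, value in inst:
            template.setdefault(pred, set()).add(value)
    # The wildcard '*' stands for every constant never seen in the data.
    return {pred: sorted(map(str, vals)) + ["*"] for pred, vals in template.items()}

print(build_template(instances))
# {'color': ['blue', 'red', '*'], 'far': ['1', '2', '3', '*'], 'shape': ['square', '*']}
```

Any hypothesis expressible over this template is no more complex than the most complex example, matching the bound stated above.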
¹ Of course, this is not true if the concept description language is not function free.

The limitation concerning the use of negation is more substantial. As explained before, the
negation used in REGAL is basically the negation of atoms. In a previous version of
REGAL we also considered another form of negation, which greatly increases the power of
the language, namely the negation of existentially quantified formulas. This form of
negation, widely used in Logic Programming, can be learned by systems such as
SMART+ [Botta & Giordana, 1993], FOIL [Quinlan, 1990] and FOCL [Pazzani &
Kibler, 1992]. A negated existentially quantified formula has the following syntax:
¬∃<y1, ..., ym> [ψ(x1, ..., xn, y1, ..., ym)]
where ψ is a conjunction of (possibly internally disjunctive) predicates, each one
containing at least one variable in the set y1, ..., ym. In theory, it is easy to extend the
method we propose for mapping Ls into a bit string in such a way as to also account
for this form of negation. One such method has been described in [Giordana & Saitta,
1993]. Currently, this form of negation is not in use, because the genetic search had
difficulties dealing with it, owing to the highly epistatic interaction between the
positive and the negative parts of a formula during reproduction by crossover. In
order to gain an intuition of the problem, let us consider the formula:
φ(x1, ..., xn) ∧ ¬∃<y1, ..., ym> [ψ(x1, ..., xn, y1, ..., ym)]
The positive part φ requires at least one binding b1 for the variables x1, ..., xn such
that φ is satisfied. The negated part requires that, given b1, there must not be any
binding b2 for y1, ..., ym such that ψ is satisfied. Modifying the formula φ by
crossover or mutation can easily modify the set of possible bindings b1, so that the
negated part becomes unsatisfied and the whole formula becomes false. Accordingly,
the experimentation showed, in general, a very unstable behavior under crossover and
mutation.
In the remainder of the paper, let E⁺ and E⁻ be the sets of positive and negative
examples of a target concept, respectively, and let E and C be their cardinalities.
Given a pair of strings (s1, s2) generated by the mating procedure, crossover will be
applied with an assigned probability pc (currently, pc = 0.6). Then, the specific
crossover type is selected stochastically by taking into account the features of s1 and
s2. The conditional probabilities pu of uniform crossover, p2pt of two-point crossover,
ps of specializing crossover and pg of generalizing crossover, given that crossover
application has been selected, are assigned as follows:
pu = (1 − α·fn) · β
p2pt = (1 − α·fn) · (1 − β)        (4.1)
ps = α·fn · ρ
pg = α·fn · (1 − ρ)

In expressions (4.1), α and β (α, β ∈ [0, 1]) are tunable parameters and fn is the
normalized mean value of the fitness of the two formulas φ1 and φ2, defined by
strings s1 and s2, respectively:

fn = [f(φ1) + f(φ2)] / (2·fMax) ≤ 1

ρ = [n⁺(φ1) + n⁻(φ1) + n⁺(φ2) + n⁻(φ2)] / [2·(E + C)]

where n⁺(φ) and n⁻(φ) are the numbers of positive and negative learning instances
covered by φ, respectively.
In the specific case, the normalized mean value fn determines the probabilities
pu + p2pt = 1 − α·fn and ps + pg = α·fn of choosing the crossover between the uniform
and the two-point crossover, on the one side, or between the specializing and the
generalizing crossover, on the other. When s1 and s2 have a low fitness, i.e. are not
yet good inductive hypotheses, the chance of applying the first two crossovers is high,
in order to exploit their greater exploration power. On the contrary, high values of the
fitness privilege the use of the specializing and generalizing crossovers, in order to
refine the inductive hypotheses. The choice between the uniform and the two-point
crossover is statically controlled by the parameter β, whereas the choice between the
specializing and generalizing crossover is controlled by the parameter ρ, which is
evaluated on-line on the basis of the examples covered by s1 and s2: when they cover
many examples, the specializing crossover is privileged, in order to increase the
chances of creating a more consistent offspring; otherwise, the generalizing crossover
gets a higher probability.
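Expressions (4.1) and the selection policy just described can be illustrated with a short sketch; the parameter values α = β = 0.5 and the fitness and coverage figures below are arbitrary placeholders, not REGAL's settings.

```python
import random

def crossover_probabilities(f_n, rho, alpha=0.5, beta=0.5):
    """Expressions (4.1): probabilities of the four crossover types,
    conditioned on crossover having been selected."""
    p_u   = (1 - alpha * f_n) * beta          # uniform
    p_2pt = (1 - alpha * f_n) * (1 - beta)    # two-point
    p_s   = alpha * f_n * rho                 # specializing
    p_g   = alpha * f_n * (1 - rho)           # generalizing
    return p_u, p_2pt, p_s, p_g

def choose_crossover(f_n, rho, rng=random):
    names = ["uniform", "two-point", "specializing", "generalizing"]
    return rng.choices(names, weights=crossover_probabilities(f_n, rho))[0]

# Low normalized fitness -> exploration (uniform / two-point) dominates ...
print(crossover_probabilities(f_n=0.1, rho=0.5))
# ... high fitness -> refinement (specializing / generalizing) gains weight.
print(crossover_probabilities(f_n=1.0, rho=0.8))
```

Note that the four probabilities sum to 1 for any fn and ρ in [0, 1], since the β/(1 − β) and ρ/(1 − ρ) pairs each split their share exactly.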
We notice that the parameters occurring in (4.1) are not critical; in fact, after an initial
tuning, they have never been changed across the range of applications tried.
Suppose, moreover, that the binding <x=o1; y=o3> satisfies the constraint weight(x,
[3, 4, 5]). Let the string s = "1000110101001", corresponding to the formula φ =
color(x, [red]) ∧ shape(x, [triangle, circle]) ∧ far(x, y, [1, 2, *]), be selected: φ is not
satisfied by x. Then, the string s will be modified into s' = "1101110101001",
corresponding to φ' = color(x, [red, blue]) ∧ shape(x, [square, triangle, circle]) ∧
far(x, y, [1, 2, *]), which is now satisfied by x.
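The bit-flipping step of the seeding operator, illustrated by the example above, can be sketched as follows. The template layout and the observed values are hypothetical, and only the generalizing 0 → 1 flips are modelled.

```python
# Template layout mirroring L_s of the text (3 + 4 + 6 = 13 bits).
template = [("color", ["red", "blue", "*"]),
            ("shape", ["square", "triangle", "circle", "*"]),
            ("far",   ["1", "2", "3", "4", "5", "*"])]

def seed_towards(s, observed):
    """Generalize string s so that it admits the terms actually taken by the
    instance under the chosen binding; observed maps predicate -> term."""
    bits, i = list(s), 0
    for pred, terms in template:
        for t in terms:
            if observed.get(pred) == t:
                bits[i] = "1"   # 0 -> 1 flip: admit the observed term
            i += 1
    return "".join(bits)

# Illustrative string: color [red], shape [triangle, circle], far [1, 2, *];
# the instance exhibits color = blue and shape = square under the binding.
s = "1000110110001"
print(seed_towards(s, {"color": "blue", "shape": "square"}))
```

Each flip strictly generalizes the formula, so the modified string covers the instance while still covering everything the original string covered.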
Provided that compatibility with the constraints imposed by the template Lc is
assured, the seeding algorithm tries to introduce as much randomness as possible, in
order to increase genetic diversity in the population. It is used at the beginning of the
search to initialize the population A(0) and, then, it is called either by the selection
operator or as an alternative to the classical mutation operator.
² Extraction is performed with replacement for dealing with the case E < M.
[Figure – Scheme of the Universal Suffrage selection step: a set ext = {xk} of
γ·M examples is extracted with replacement from E⁺ (|E⁺| = E); a roulette wheel rk is
associated with each extracted example; individuals of the population A(t) (|A(t)| = M),
plus those produced by seeding, are extracted without replacement by spinning the
roulette wheels, and crossover and mutation produce the new set Bnew(t), with
|Bnew(t)| = γ·M.]
The name Universal Suffrage derives from the fact that each example has the same
probability of being extracted. More precisely, the process of extracting the examples
xk from E⁺, at each generation, follows a Binomial distribution with probability of
success 1/E and number of trials γ·M. As the selection process is the same at each
generation, the probability distribution of the number of times xk is selected in t
generations is still a Binomial one, with γ·M·t trials and the same probability of
success 1/E.
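The Binomial behaviour of the extraction step can be checked with a small Monte Carlo sketch; E, M, γ and the number of generations below are arbitrary illustrative values, not REGAL's settings.

```python
import random
from collections import Counter

# At each generation, gamma*M examples are drawn uniformly with replacement
# from the E positive examples, so each example is selected a
# Binomial(gamma*M, 1/E) number of times per generation.
E, M, gamma, generations = 20, 50, 1.0, 10
rng = random.Random(0)

counts = Counter()
for _ in range(generations):
    for _ in range(int(gamma * M)):
        counts[rng.randrange(E)] += 1

expected = gamma * M * generations / E   # mean of the t-generation Binomial
print(sum(counts.values()))              # 500 draws in total
print(expected)                          # 25.0 expected selections per example
```

With these values every example is selected about 25 times on average, and the chance that some example is never drawn in 500 trials is negligible, which is the intuition behind bound (4.2).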
By considering the inverse Bernoulli sampling problem, it is easy to prove that, for

M ≥ [1/(γ·t)] · lg(1/η) / lg(E/(E−1))        (4.2)

a generic example xk has a probability greater than 1 − η of being extracted within the
first t generations. From (4.2) we deduce that the most favorable case is γ = 1. In fact,
when the whole population is renewed at each generation, a smaller M is sufficient to
let xk appear, given t.
The average number of generations t_all necessary to extract all the examples is at
most:

t_all = E² / (γ·M)        (4.3)
In (4.4), fj is the fitness of φj and νkj is the characteristic function of the set COV(φj):

νkj = 1 if xk ∈ COV(φj), 0 otherwise        (1 ≤ k ≤ E, 1 ≤ j ≤ m)

z(φ) = [length(s(φ)) − c(φ)] / length(s(φ))
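The characteristic function νkj can be sketched directly from its definition; the examples and coverage sets below are invented for illustration, and n⁺(φj) is recovered as a column sum of the matrix.

```python
# Hypothetical positive examples and coverage sets of two formulas.
examples = ["x1", "x2", "x3", "x4"]                       # E = 4
cov = {"phi1": {"x1", "x2"}, "phi2": {"x2", "x3", "x4"}}  # m = 2 formulas

# nu[(k, j)] = 1 iff example x_k belongs to COV(phi_j), 0 otherwise.
nu = {(k, j): 1 if x in cov_j else 0
      for k, x in enumerate(examples)
      for j, cov_j in enumerate(cov.values())}

# n+(phi_j) is the column sum of nu, i.e. the number of covered examples.
n_plus = {name: sum(nu[k, j] for k in range(len(examples)))
          for j, name in enumerate(cov)}
print(n_plus)  # {'phi1': 2, 'phi2': 3}
```

The same matrix is what the selection operator needs in order to associate roulette wheels to examples, which is where the O(E·M) complexity discussed below comes from.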
In (4.5), the parameter A is user-tunable and its current value is A = 0.1. We observe
that fMax = 1.1 and lim_{w→∞} f(z, w) = 0. The function f is reported in Figure 4.
With regard to the complexity of the selection operator, in order to compute the
probability values in (4.4) we need to evaluate, for each φj, a number of sums at most
equal to the number of roulette wheels available: as there is at most one roulette wheel
for each example in E⁺, the number of steps performed is at most E·m ≤ E·M. Hence,
the global complexity is of the order O(E·M). This complexity is to be contrasted with
that of using sharing functions, i.e. O(M²·E) [Deb & Goldberg, 1989]. The latter
value refers to a phenotypic distance taking into account the coverage of
the examples, in order to make a fair comparison with the US operator.

Figure 4 – Fitness function f(φ) vs. simplicity (z) and consistency (w) of a formula φ.
Increasing values of z (w) correspond to decreasing simplicity (consistency).
In [Giordana, Neri & Saitta, 1994] a general method for investigating the macroscopic
properties of populations evolving according to a transition probability matrix has
been proposed. This method, based on the definition of a Virtual Average Population,
allows some of the technical difficulties involved in this kind of analysis to be
overcome, while providing, at the same time, accurate estimates of some of the
parameters controlling the evolution. The method is well suited to a variety of
approaches, and its application to several selection methods can be found in [Neri &
Saitta, 1995a, b].
In this section, we will discuss the two fundamental aspects of the process of
cooperation in REGAL: the selection and migration model, concerned with the
interaction among NGAs, and the interactions between the SUPERVISOR and the
NGA processors. In the following, NGA processors will also be called "nodes", for
the sake of simplicity.
The search activity of each node is biased by the subset of training examples assigned
to it, exploited by the US selection operator. Any change in this set will result in a
shift of the region of the hypothesis space that will be explored. Consequently, the
SUPERVISOR processor, which constantly monitors the state of all the NGAs, can
easily focus the hypothesis space exploration by modifying the set of examples
assigned to any NGA.
From another perspective, each NGA can be seen as a niche needed for the survival of
a species or of a group of species (formulas representing one of the target concept
modalities). The SUPERVISOR distributes the examples of the learning set according
to the emerging species, thus identifying NGAs with niches, where specific species
can grow up. The aim of this association is that there will be, in the limit, an
identification between niches and concept modalities. Therefore, the mating between
individuals of different niches can be controlled by tuning the migration parameter μ.
Furthermore, the SUPERVISOR performs two other fundamental tasks: it
periodically extracts a classification theory from the global population and, when it
finds a satisfactory one according to a chosen criterion, halts the whole system. In
Figures 6 and 7, an abstract description of the nodal GA and of the SUPERVISOR,
respectively, is provided.
Migration and selection are strictly interleaved activities. In the following, a complete
cycle of selection and migration is described and schematically represented in Figure 8.
For the sake of clarity, let us focus on an NGA n (1 ≤ n ≤ N). Let E_n be the subset of
training examples currently assigned to n. At each generation t, NGA n receives from
the other NGAs a set of foreign individuals A_n^ext(t), of cardinality μ·|An(t)|
(0 ≤ μ ≤ 1), where μ is the migration rate.

The individuals sent to n by other NGAs and covering some example belonging to
E_n will compete, together with the individuals in the nodal population An(t), for
being selected for reproduction through the US selection procedure. Moreover, a set
of incoming individuals of cardinality ρ·μ·|An(t)|, where ρ is called the foreign mating
rate, is randomly selected for reproduction. After the reproduction phase, γ·|An(t)| of
the newly generated individuals will replace some elements of n's population, while
the others will be sent to the other NGAs.
The two-step policy for selecting the incoming information is based on the following
motivations. The first step aims at the formation of more representative individuals
(more general formulas) by exploiting "foreign" individuals which are also
representative of the current species of n. The second step allows a possible recreation
of missing schemata in n's population. The combination of these two policies is also
useful in easing the lethal mating problem, by reducing the number of matings
between individuals of different species, while still allowing for the possibility of
exploiting the information contained in individuals representative of them.
Finally, the routing of individuals towards the other NGAs is realized by partitioning
the set of outgoing individuals according to the number of nodes directly connected to
n; a different partition is sent to each neighbour node.
From all recorded formulas, a disjunctive concept description, which becomes the
current best solution, is assembled. The algorithm used to this aim is, at the moment,
quite simple. First, the set COVER(t) is constructed as the union of the positive
examples covered by all the formulas received up to generation t. Then, the formulas
are sorted in decreasing order according to p(φ) (re-evaluated on the whole set E⁺) and
the first K best ones, able to cover the entire set COVER(t), are selected as the current
concept description BEST(t).
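One greedy reading of this assembly step is sketched below. The formula names, the scores standing in for p(φ), and the coverage sets are invented, and the skip-a-formula-that-adds-no-coverage rule is an assumption of the sketch rather than a detail stated in the text.

```python
# Hypothetical recorded formulas: (name, score standing in for p(phi), coverage).
formulas = [("phi_a", 0.9, {1, 2, 3, 4}),
            ("phi_b", 0.8, {3, 4, 5}),
            ("phi_c", 0.7, {5}),
            ("phi_d", 0.6, {6})]

def assemble_best(formulas):
    """Sort by score and keep the first formulas jointly covering COVER(t)."""
    cover_t = set().union(*(c for _, _, c in formulas))   # COVER(t)
    best, covered = [], set()
    for name, _, cov in sorted(formulas, key=lambda f: -f[1]):
        if covered >= cover_t:
            break                       # COVER(t) already covered
        if not cov <= covered:          # keep only formulas adding coverage
            best.append(name)
            covered |= cov
    return best

print(assemble_best(formulas))  # ['phi_a', 'phi_b', 'phi_d']
```

Here phi_c is discarded because its coverage is already subsumed, yielding a compact disjunctive description BEST(t).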
Finally, a comparison between the previous classification theory and the new one is
performed. If a significant improvement is detected, the example sets En are modified
in order to distribute the task of improving the current concept description BEST(t).
In practice, this is realized by distributing the formulas in BEST(t) to the nodes and
defining the generic set En as the union of the positive examples covered by the
formulas assigned to node n and the examples not covered by any other formula in
BEST(t).
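This redistribution step can be illustrated as follows. The round-robin assignment of formulas to nodes is an assumption made for the sketch, since the text does not specify the assignment policy:

```python
def distribute_tasks(best, nodes, positives):
    """Sketch of the SUPERVISOR's redistribution step: each node receives
    some formulas of BEST(t); its example set is the union of the positives
    they cover plus every example no formula in BEST(t) covers.
    (Illustrative names and assignment policy.)
    """
    covered_by_best = set().union(*best) if best else set()
    uncovered = positives - covered_by_best
    # Round-robin assignment of formulas to nodes (an assumption).
    assignment = {n: best[i::len(nodes)] for i, n in enumerate(nodes)}
    return {n: set().union(*fs, uncovered) if fs else set(uncovered)
            for n, fs in assignment.items()}

best = [{1, 2, 3}, {4, 5}, {6}]
sets = distribute_tasks(best, ["n0", "n1"], positives=set(range(1, 9)))
print(sets["n0"], sets["n1"])
```

Every node thus keeps a stake in the still-uncovered examples, while specializing on the disjuncts assigned to it.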
The SUPERVISOR stops the genetic search when the current solution BEST(t) does
not improve for an assigned number of consecutive generations, or when a maximum
generation limit is reached.
This mechanism is especially useful in those situations where the number of target
concept modalities is higher than the number of available searching processors. Its
benefits are also noticeable when searching for small disjuncts; in fact, freezing
a description and returning its node to the set of available resources increases the
computational power that can be directed towards regions of the hypothesis space
where small disjuncts are more likely to exist.
When the freezing mechanism is used, the whole process halts, and a final
classification theory is returned, when all the examples are removed from the learning
set.
[Figure: architecture of a nodal GA (NGA n) within the network. Individuals are
selected with replacement from the population through selection with Universal
Suffrage; mating, crossover and mutation produce the new individuals, and a
communication buffer links the node to the rest of the network of NGAs.]
The target concept has four disjuncts: the first one, φ1, covers 312 positive instances;
the second, φ2, covers 167; the third, φ3, covers 56; and the fourth, φ4, covers 35.
Moreover, φ1 and φ2 have 51 positive instances in common, whereas φ1 and φ3 have
19. The fitness values of the individuals corresponding to the four disjuncts are the
following:
f1 = f(φ1) = 1.667   f2 = f(φ2) = 1.556
f3 = f(φ3) = 1.500   f4 = f(φ4) = 1.500
The relations between the extensions of the four disjuncts on the learning set are
represented in Figure 9.
[Venn diagram: φ1 alone covers 242 instances; φ1∩φ2 = 51; φ2 alone 116;
φ1∩φ3 = 19; φ3 alone 37; φ4 covers 35.]
Figure 9 – Training set coverage of the four disjuncts of the "Bicycles" concept.
Several runs have been performed by changing the values of the global population
size M, of the node number N, of the migration rate m, and of the halt criterion. For
each setting of the four parameters the run has been repeated four times.
A first set of experiments aimed at determining the population size necessary for
REGAL to solve the problem using a single node, relying on the US operator only
for promoting niche formation. Typical results are reported in Table I.
Table I
Experimental results on the "Bicycles" dataset
In this experimentation, the main focus was on the population size M. The system had
two halting criteria:
h0 ⇒ A first complete and consistent solution has been found.
h1 ⇒ A maximum of 70 generations has been reached.
From Table I it appears that there is a quite sharp critical value Mcr (between 560 and
640, in this example), such that for M < Mcr no complete and consistent solution
could be found within the given time bound. When M > Mcr, a complete and
consistent solution was always found within 15 generations. By letting the system run
even after a complete and consistent solution has been found, a more general solution
is usually reached (fewer and larger disjuncts).
Table II
Global population size M = 240. The nodal population size m = M/N ranges from m = 60 to m =
15. Values in parentheses, such as (x, y), denote the number of positive and negative learning
examples covered by the corresponding formula, respectively. When the halt criterion h0 is met, a
complete and consistent solution is found within 9 generations, except for (N = 16, m = 0), where
12 generations were needed, and for (N = 4, m = 0.1), where no complete and consistent solution
was found.
Total Cover (500 0) (500 0) (500 0) (500 0) (500 0) (500 0) (500 0) (500 0)
(312 0) (312 0) (312 0) (312 0) (312 0) (312 0) (312 0) (312 0)
N = 16 (167 0) (167 0) (167 0) (167 0) (85 0) (167 0) (167 0) (167 0)
m = 15 (25 0) (56 0) (34 0) (56 0) (40 0) (56 0) (41 0) (41 0)
(17 0) (105 0) (35 0) (58 0) (17 0) (32 0) (41 0) (62 0)
(15 0) (27 0) (16 0) (62 0)
(16 0) (31 0) (16 0)
(35 0)
Total Cover (500 0) (500 0) (500 0) (500 0) (500 0) (500 0) (500 0) (500 0)
The goal of a second group of experiments was to test whether distributing a fixed
number M of individuals among communicating nodes in a network could bring
computational improvements, the quality of the solution being equivalent. Two sets
of experiments, one with M = 240 and one with M = 280, are
described. Typical results are reported in Table II and Table III.
Table III
Global population size M = 280. All the complete and consistent solutions, found with the halt
criterion h0, required at most 9 generations.
Total Cover (500 0) (500 0) (500 0) (500 0) (500 0) (500 0) (500 0) (500 0)
(312 0) (312 0) (312 0) (312 0) (312 0) (312 0) (312 0) (312 0)
N=8 (167 0) (167 0) (167 0) (167 0) (167 0) (167 0) (167 0) (167 0)
m = 35 (41 0) (56 0) (41 0) (41 0) (41 0) (56 0) (56 0) (56 0)
(25 0) (105 0) (62 0) (41 0) (62 0) (105 0) (16 0) (105 0)
(105 0) (31 0) (35 0) (31 0)
Total Cover (500 0) (500 0) (500 0) (500 0) (500 0) (500 0) (500 0) (500 0)
(312 0) (312 0) (312 0) (312 0) (312 0) (312 0) (312 0) (312 0)
N = 16 (167 0) (167 0) (167 0) (167 0) (167 0) (167 0) (167 0) (167 0)
m = 17 (41 0) (56 0) (47 0) (40 0) (41 0) (56 0) (56 0) (56 0)
(58 0) (105 0) (25 0) (47 0) (41 0) (105 0) (105 0) (105 0)
(34 0) (31 0) (105 0)
(13 0)
Total Cover (500 0) (500 0) (500 0) (500 0) (500 0) (500 0) (500 0) (500 0)
The first feature of the distributed version emerging from Tables II and III is the
strong reduction (from ~600 to ~280) in the global population size necessary to find
a complete and consistent solution. Moreover, the following observations can be
made:
• For values M ≤ 240, a complete and consistent solution may not always be found,
whereas with M ≥ 280 it always is. The same critical change in behavior noticed
in the concentrated GA is observed.
• Increasing N, with constant M, leads to a fragmentation of the first proposed
complete and consistent solution: smaller disjuncts are generated, corresponding
to more specific concept descriptions.
• Giving the system more time (using h1 instead of the greedy criterion h0)
increases its probability of generalizing.
• The migration rate does not seem to be very critical in this test case, as the
effects of its variation are barely noticeable.
In any case, the results of this experimentation should be considered only suggestive,
given the simplicity of the task.
6. Experimental Evaluation
In this section we present some results obtained by REGAL on three datasets
taken from the Irvine database: the "Mushrooms" dataset, provided by Schlimmer, the
"Splice Junctions" dataset, provided by Towell, Noordewier, and Shavlik, framed in
propositional calculus, and the "Office Document Classification" dataset, provided by
Esposito, Malerba and Semeraro, framed in First Order Logic. The primary goal of
experimenting with these datasets is to evaluate REGAL in comparison with other
systems described in the literature. However, a second type of evaluation we wanted
to perform concerns REGAL's ability to discover specific disjuncts hidden in a
dataset. For this task, real datasets are not appropriate, because the true structure of
the concept hidden in the instances is not directly known, but only guessed from the
experimental results. Therefore, a First Order Logic artificial learning problem, which
we call the "Thousand Trains" problem, has been constructed in order to challenge
REGAL's ability to discover specific disjuncts hidden in the dataset. This learning
problem was first used in Giordana & Saitta [1993] to test an older version of
REGAL.
The "Mushrooms" dataset, neither too complex nor too easy, has been used to study
experimentally how REGAL's performance changes depending on the number N of
subpopulations and on the migration rate m. To this aim, extensive runs have been
Table IV
REGAL's configuration used for the "Splice-Junctions", "Office
Document" and "Thousand Trains" problems.
Parameter Value
Number of Nodes 4
Population per Node 200
Migration rate m 0.2
Freezing Yes
(except for Splice-Junctions)
3 The CM5 Connection Machine used in this experimentation consists of 64 Sparc-10 processors,
connected by a high-speed packet-switching network. In this way every processor is directly
connected with all the others.
only unary predicates, and by allowing just one variable in a formula, as was done
for the "Bicycles" test case.
Table V
Comparative results among different approaches to the "Mushrooms" classification problem.
LS = Cardinality of the learning set. e = Error rate on the test set.
e [%]
User System (Method) LS
[Neri & Giordana] REGAL (Genetic Algorithm) 4000 0.0
[Rachlin et al., 1994] PEBLS (Memory-Based) 7311 0.0
[Rachlin et al., 1994] (Nearest Neighbours) 7311 0.0
[Holte, 1993] C4.5 (Decision Tree) 5416 0.0
[Schlimmer, 1987] STAGGER (Rules) 5416 5.0
[Yeung, 1991] Neural Network 300 0.9
In further experimentation, the aim was to evaluate other factors characterizing
the quality of the found classification theory, such as the global complexity and the
total number and size of the disjuncts, under different distribution and cooperation
conditions of the same global population. According to the definition of complexity
given in Section 4.4 for a single formula, the global complexity (C) of a classification
theory has been defined as the total number of 0's occurring in the bit strings
describing the disjuncts. The smaller the total number of 0's surviving in the final
solution, the better REGAL followed the simplicity criterion embedded in its fitness
function. The other measures monitored in the experiments are the Cpu time (T,
measured in minutes) spent for the global computation, the number of disjuncts (ND)
occurring in the final solution, and the maximum (MXD), average (AVG) and
minimum (SMD) sizes of their extensions. Moreover, the Cpu time (Tc, in minutes)
spent to find the first complete and consistent concept description and the global
complexity (Cc) of that description have been measured as well.
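As defined above, the global complexity metric reduces to a simple count over the bit-string encodings; a minimal sketch in Python (the bit strings below are hypothetical disjuncts, not taken from the experiments):

```python
def global_complexity(disjuncts):
    """Global complexity C of a classification theory: the total number of
    0's occurring in the bit strings describing its disjuncts (Section 4.4).
    """
    return sum(s.count("0") for s in disjuncts)

# Two hypothetical disjuncts encoded as bit strings:
print(global_complexity(["110101", "011110"]))   # → 4  (2 zeros + 2 zeros)
```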
Table VI
Influence of the population size, the processor number and the migration rate on REGAL's
performance in learning the "Poisonous mushrooms" concept. The different fields have the
following meaning:
Mode = (processor number, population per processor).
m = migration rate (percentage of the population).
T = Cpu time (in minutes) for completing 100 generations.
C = complexity of the solution after 100 generations.
ND = Number of disjuncts in the final solution.
MXD, SMD, AVG = Number of examples covered by the largest, the smallest and the
average disjunct, respectively.
Tc = Cpu time (in minutes) necessary for finding the first complete and consistent solution.
Cc = Complexity of the first complete and consistent solution found in a run.
[Figures 10 and 11: time evolution of the complexity C (left panels) and of the
number of disjuncts ND (right panels) over 100 generations, for nodal population
sizes m = 50 (a), m = 25 (b) and m = 12 (c), each with migration rates
migr = 0.0, 0.2, 0.5 and 0.9.]
Tc and Cc denote the Cpu time necessary for obtaining a first complete and consistent
solution, and the complexity of the solution itself, respectively. The time elapsed
between Tc and T (end of the
computation) has been spent by REGAL in trying to find a simpler and more general
solution. The time evolution of the complexity C and of the total number of disjuncts
ND versus the generation number is reported in Figures 10 and 11, respectively.
The findings emerging from the experiments confirm the data obtained from the
"Bicycles" test case: the Selection and Migration model, together with the US
operator, has been successful in allowing all the disjuncts to be learned at one time
with a relatively small global population size, in a reasonable time, under different
operative conditions. Moreover, comparing the solutions found with different values
of the migration rate, it appears that migration rates different from 0 result in
significantly improved solutions in terms of simplicity (the error rate on the test set
being equal to 0). Finally, we observe a super-linear speed-up when increasing the
number of processors working in parallel. Let us now examine Fig. 12, which reports
the complexity evolution versus Cpu time for the best and the worst results obtained
in the three experimental settings (a), (b) and (c) described in Fig. 10. The
complexity shows a minimum at the configuration with 32 nodes and 25 individuals
per node. Therefore, considering Fig. 12, it seems that the speed-up obtained by
increasing the processors from 32 to 64 does not produce a real gain, because a
simpler solution can be obtained in less time (50 min) using only 32 processors.
Figure 12 – Comparison of the time evolution of the complexity versus the Cpu time [min],
for the population distributions (a), (b) and (c) described in Figure 10. (a) Migration rate = 0.
(b) Migration rate = 0.5. As can be seen, the population of m = 25 individuals per node, with 32
nodes and migr = 0.5, reached the best performance of all runs already at ~70 generations.
REGAL has been evaluated in two different settings. In the first one, the template was
300 bits long and contained a different unary predicate for each of the 60
positions. All the predicates had the same internal disjunction [A, G, T, C, *], where
"*" accounts for the possible occurrence of D and N. A learning set of 2000 instances
randomly selected from the dataset has been used. In the second setting, a learning set
of only 1000 examples has been used, but the search space has been limited to 12
bases, exploiting a subset of the domain theory described by Towell and Shavlik
[1994]. For each of the two settings, REGAL has been run three times for 300
generations, for each of the three classes. In each run REGAL learned a very
general set of rules consistent with the learning set. The average error rate on an
independent test set is reported in Table VII, where it is compared with the best
results published by Towell and Shavlik [1994].
The results have been obtained after halting REGAL at the 300th generation; they
show that the system can easily handle quite large learning sets and templates. It may
be surprising that the number of generations necessary to solve this difficult problem
is just three times the one required for the simpler "Mushrooms" application, whereas
the complexity of the search space was much higher (2^300 versus 2^126).
4 There is actually 1% of occurrences of the symbols N and D, corresponding to nucleotides not
recognized or recognized ambiguously.
Table VII
Comparative results among different approaches to the "Splice Junctions" classification problem.
LS = Cardinality of the learning set. TS = Cardinality of the test set. e = Error rate on the test set.
System LS e e e
User (Method) (E/I) (I/E) (Neither)
[%] [%] [%]
[Towell & Shavlik, 1994] KBANN 1000 7.56 8.47 4.62
(NN + Domain theory)
[Towell & Shavlik, 1994] Neural Network 1000 5.74 10.75 5.29
(Backpropagation)
[Towell & Shavlik, 1994] ID3 1000 10.58 13.99 8.84
(Decision tree)
[Neri & Giordana, 1995] REGAL 2000 4.40 4.20 5.20
(GA, 60 bases)
[Neri & Saitta] REGAL 1000 3.42 6.85 4.29
(GA, 12 bases)
REGAL has been supplied with a concept description language containing five unary
predicates, one identifying the type of an item and the others its position and size,
and two binary predicates defining the relative position of two items. Each (absolute
or relative) predicate was provided with a large internal disjunction containing the
set of possible positions an item can assume in the page, with a granularity of 35
pixels. Using only three variables in the concept description language, a template of
380 bits has been obtained.
These results have been obtained using a smaller learning set. This experiment shows
again that REGAL can easily handle large templates.
The rules for classifying trains going East (Class 1) are reported in the following:
The learning set used for the experiments contained 500 instances of Class 1 (East)
and 500 instances of Class 2 (West). Rule 1 covered 98 instances of Class 1, Rule 2
covered 206 and Rule 3 covered 209. The three subsets were slightly overlapping, as
13 instances verified more than one rule. An independent test set was also generated,
in which the three disjuncts covered 103, 212 and 189 positive instances,
respectively. The concept description language used by REGAL is very similar to the
one described in [Michalski, 1980] and is reported in Table VIII.
The experiment has been repeated six times, using both the current version of
REGAL, with the setting described in Table IV and without distributing the
population (i.e., using only one node), and an older one based on classical selection.
In the first case, a perfect solution has been found in all the runs (see the second row
in Table IX), even if not always the same one. Moreover, we notice that 5 disjuncts
were found in place of the three used for classifying the learning set. In the best
solution
Table VIII
Predicates and template characterizing the concept description language used by
REGAL for learning the concept of "Trains going East".
we had five disjuncts covering 223, 216, 108, 83 and 74 examples. They represent a
consistent and complete solution to the problem. The first two disjuncts cover the
examples (plus some others) of Rule 3 and Rule 2, respectively, and thus each can be
seen as a further generalization of them. On the contrary, the third disjunct has not
been found, because REGAL preferred an alternative solution consisting of three
larger disjuncts, which proved to be a correct solution too.
These results, as well as the ones obtained on the "Bicycles" domain, show that
REGAL is biased towards generalization and tries to find disjuncts as large and
simple as possible. This bias is mostly due to the Universal Suffrage operator, which
helps the formation of large disjuncts.
Table IX
Results for the "Thousand Trains" problem, obtained using a learning set of 1000 learning events
and a test set of 1000 events independently generated. Vavg is the average completeness and Wavg
is the average consistency over 6 runs. Runs of the current version of REGAL exploiting only one
node and the freezing mechanism give results comparable to the first row of the table.
Method # Disjuncts Learning Set Test Set Generations
Vavg Wavg Vavg Wavg
Univ.Suff. , Freezing 5 1.0 1.0 1.0 1.0 200
Class. Sel., One at a time 7.6 1.0 1.0 0.98 0.99 800
6.5. Discussion
On the basis of the experimentation described so far, REGAL's capabilities emerge
quite clearly. A first positive feature is its ability to successfully deal with difficult
applications, with relatively small global populations, exploiting the GA's explicit
parallelism.
A second feature is robustness with respect to the control parameters. Many runs
have been executed under very different conditions and with different population
sizes, but REGAL was always able to find a reasonable solution. Moreover, we found
REGAL very easy to use: it is easy to find a reasonable template and to define the
predicates in the concept description language.
A third feature characterizing REGAL is its bias toward generalization; the fitness
function it currently uses, combined with the Universal Suffrage operator, allows very
general solutions to be found. This can be an advantage in the presence of large
datasets, which, luckily enough, REGAL can handle pretty well, but it can become a
disadvantage when datasets are small. As an example, the classification error on the
"Office Documents" dataset (230 examples) is relatively high with respect to the
other applications. On the other hand, we may notice that 5 out of the 6 errors are
commission errors, due exactly to overgeneralization. It could be worth investigating
the possibility of flexibly controlling this generalization bias.
7. Related Work
Even though the preferential area of application for GAs has so far been function
optimization, a few systems oriented to concept learning already exist, all of them
searching propositional hypothesis spaces. REGAL, with its ability to deal with First
Order Logic hypotheses, thus constitutes an exception. On the other hand, it shares
with other systems, such as GABIL [De Jong, Spears & Gordon, 1993] and COGIN
[Greene & Smith, 1993], the fixed-length bit string encoding approach, aimed at
exploiting standard genetic operators.
5 REGAL's concept description language differs from GABIL's because of the possibility of
having variables in the predicates, which sets it in the First Order Logic framework.
In contrast to REGAL, COGIN and GABIL, the two systems GIL [Janikow, 1993]
and SIA [Venturini, 1993] adopt a logical representation for the chromosome and use
special-purpose, task-dependent operators to generate offspring.
GIL learns a single concept, expressed in VL1 [Michalski et al., 1986]. The
system follows the Pittsburgh approach: the population has fixed cardinality and each
individual represents a whole set of rules. The chromosome has an external
representation as a logical formula and an internal one as a fixed-length bit vector.
The system allows the initial population to contain random individuals, randomly
chosen single positive examples, or even a priori given hypotheses. Individuals are
evaluated according to a fitness function which takes into account completeness,
consistency and simplicity of the rules. GIL utilizes 14 genetic operators, working at
the level of whole classification theories, of rules, or of conditions; some are
specialization operators, others are generalization operators, while yet others have
mixed effects. Among them, we may notice that the New-Event operator, which adds
to a set of rules the description of a single, not yet covered, positive example, acts as
the "seed" in AQ [Michalski et al., 1986], and has a role similar to that of the seeding
operator in REGAL. A drawback of GIL is the excessive number of parameters,
which makes the system unstable across different domains and across different runs
in the same domain.
The system SIA learns classification rules with weights in a propositional language.
The system, which allows for multiple classifications and missing attribute values, is
basically an AQ system where the Star search method has been replaced by a GA.
Chromosomes represent conjunctive rules and disjuncts are learned one at a time.
This system also has an operator, called Creation, generating a formula covering a
given example. Moreover, there are generalization operators and crossover. Generated
offspring possibly compete with the parents to enter the new generation. The fitness
function takes into account completeness, consistency and simplicity of the rules.
SIA's approach is basically Michigan's, but with a search space reduced to that part
The idea of avoiding the use of a distance measure and of "sharing examples" among
the formulas covering them had already been used by McCallum and Spackman
[1990]. The method they propose is more similar to the one implemented in a
previous version of REGAL [Giordana & Sale, 1992]: species formation is controlled
through a modification of the fitness function. This method often proved unable to
let stable subpopulations form. A theoretical investigation of McCallum and
Spackman's method shows that, under pure selection, the population's asymptotic
behaviour converges, on average, to a homogeneous population, as crowding does
[Neri & Saitta, 1995a]. Methods avoiding the use of a distance measure have also
been proposed by Spears [1994] and Hekanaho [1995].
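To illustrate the idea of controlling species formation through a fitness modification, the following sketch divides an individual's raw fitness by a crowding estimate over the examples it covers. This is a generic example-sharing scheme for illustration, not the exact method of McCallum and Spackman or of [Giordana & Sale, 1992]:

```python
def shared_fitness(raw, coverage):
    """Example-sharing sketch: raw fitness is divided by how crowded an
    individual's niche is, measured by how many individuals cover each
    of its examples (illustrative, hypothetical names).

    raw      : list of raw fitness values
    coverage : coverage[i] = set of examples individual i covers
    """
    counts = {}
    for cov in coverage:
        for e in cov:
            counts[e] = counts.get(e, 0) + 1
    shared = []
    for f, cov in zip(raw, coverage):
        # niche size: average number of competitors per covered example
        niche = sum(counts[e] for e in cov) / len(cov) if cov else 1.0
        shared.append(f / niche)
    return shared

# Three individuals: two crowd the same examples, one has a niche of its own.
print(shared_fitness([1.0, 1.0, 1.0], [{1, 2}, {1, 2}, {3}]))   # → [0.5, 0.5, 1.0]
```

Under pure selection such a modification rewards individuals in underpopulated niches, but, as noted above, it does not by itself guarantee that stable subpopulations form.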
Finally, the systems BOOLE [Wilson, 1987] and NEWBOOLE [Bonelli et al., 1990],
even though they look more like Classifier Systems, are actually concept learners
and have been used to learn Boolean functions.
Tasks related to concept learning in which GAs have been used include rule
refinement and feature selection. In rule refinement, rules previously acquired by
means of other algorithms are modified using a GA [Bala, De Jong & Pachowicz,
1991], or parameters occurring in the rules are tuned so as to increase their
classification power [Botta, Giordana & Saitta, 1993]. The use of GAs for feature
selection has been explored in [Siedlecki & Sklansky, 1989; Vafaie & De Jong, 1991],
and proposed for feature construction in [Giordana, Lo Bello & Saitta, 1994]. Rule
induction with GAs is also the basis of the system SAMUEL [Grefenstette, Ramsey
& Schultz, 1990], which learns rules for sequential decision making from "episodes".
communication may reduce the quality of the solution, reducing the exploration
power of the single GAs by reducing their population size. One way of resolving this
conflict is to try to forcibly differentiate the nodal GAs, as REGAL does by assigning
each node a different subset of examples to cover. The idea of differentiating nodal
GAs has been used before, through a variety of methods. For instance, in the GAMAS
system [Potts, Giddens & Yadav, 1994] four GAs, namely four functionally
differentiated species, cooperate to avoid premature convergence in searching a
solution for NP-hard optimization problems. In particular, one species has great
exploration power, given to it by a high mutation rate, another is more aimed at
exploitation, with a very low mutation rate, whereas a third one behaves in between.
A special SPECIES I, held in isolation, is generated by artificial selection, i.e., it
contains the most fit individuals found by the other three. Moreover, evolution goes
through a series of cycles, because the three evolving species, other than SPECIES I,
are restarted every 60 generations.
The formation of different species in different nodes of a network has also been the
goal of a paper by Cohoon et al. [1987], who applied to GAs the theory of Punctuated
Equilibria [Vose & Liepins, 1990]. Evolution proceeds through epochs, during which
populations evolve independently, reaching an equilibrium, eventually disrupted by a
"catastrophe", i.e., by letting individuals migrate among nodes, thus changing the
species environment.
8. Conclusions
In this paper we described REGAL, a new GA-based system for learning concept
descriptions from examples, which exhibits several novelties with respect to other
systems devoted to the same task. In particular, it is able to deal with First Order
Logic concept description languages, uses a new kind of selection operator allowing
the formation of species, is especially tailored to classification tasks, and uses a new
strategy for running the genetic search in parallel.
Considering the experience gained so far with REGAL, we think we have been
successful in showing that the system is able to work in non-trivial domains and to
deal with complex and large datasets, competing favourably with various alternative
approaches. Nevertheless, REGAL's behavior and potential may raise more questions
than they answer.
applications with relatively small global populations. The comparison reported in Fig.
12 shows that a too fine-grained distribution of the niches is not effective with respect
to the generalization capabilities. Hence, REGAL's current architecture is not suitable
for exploiting highly parallel machines. One possibility for overcoming this weakness
is to improve the current architecture as suggested in Section 6.4. Alternatively,
other methods for exploiting massive parallelism also seem convenient. First of all,
running independent experiments in parallel, to quickly gather statistical significance,
looks more promising than trying to speed up the execution of a single experiment
using a network of distributed populations. Second, the most expensive part of
REGAL's algorithm is the evaluation of the individuals. This task too could be done
in parallel, using the well-known master-slave model [Goldberg, 1989a]. Therefore, a
sound parallel implementation of REGAL seems to be a two-level architecture,
consisting of a small network of nodes, each one implemented as a master-slave
parallel processor.
Finally, a third aspect concerns the role genetic algorithms can play in machine
learning. We agree with the view that learning should exploit available background
and domain-specific knowledge as much as possible, trying to limit the search in the
hypothesis space by means of a priori constraints. In this case, in fact, "symbolic"
search algorithms may be more convenient, both because they are able to exploit that
very a priori knowledge, and because they require fewer computational resources
when working inside small search spaces. On the other hand, when little or no
knowledge is available, or when, notwithstanding the imposed constraints, the search
space still remains very large, we have come to believe, on the basis of extensive
experimental work, that genetic algorithms have the potential of becoming the
winning approach, because of the exploration power they offer, without requiring a
reduction of the hypothesis representation language.
References
Bala J., De Jong K.A. and Pachowicz P. (1991). "Learning Noise Tolerant
Classification Procedures by Integrating Inductive Learning and Genetic
Algorithms". Proc. First International Workshop on Multistrategy Learning,
Center for Artificial Intelligence, George Mason University (Harpers Ferry,
WV), pp. 316-323.
Baluja S. (1993). "Structure and Performance of Fine-Grained Parallelism in Genetic
Search". Proc. 5th Int. Conf. on Genetic Algorithms, Morgan Kaufmann
(Urbana-Champaign, IL), pp. 155-163.
Baroglio C., Botta M. and Saitta L. (1994). "WHY: A System that Learns from a
Causal Model and a Set of Examples". In R. Michalski & G. Tecuci (Eds.),
Machine Learning: A Multistrategy Approach, Vol. IV, Morgan Kaufmann (Los
Altos, CA), pp. 319-347.
Bergadano F., Giordana A. and Ponsero S. (1989). "Deduction in Top-Down
Inductive Learning". Proc. 6th Int. Workshop on Machine Learning, Morgan
Kaufmann (Ithaca, NY), pp. 23-25.
Bergadano F. and Gunetti D. (1994). "Learning Clauses by Tracing Derivations".
Proc. 4th Int. Workshop on Inductive Logic Programming, GMD Technical
Report # 237 (Bad Honnef, Germany), pp. 11-29.
Bonelli P., Parodi A., Sen S. and Wilson S. (1990). "NEWBOOLE: A Fast GBML
System". Proc. Int. Conf. on Machine Learning 1990, Morgan Kaufmann
(Austin, TX), pp. 153-159.
Botta M. and Giordana A. (1993). "Smart+: A Multistrategy Learning Tool". Proc.
13th Int. Joint Conference on Artificial Intelligence, Morgan Kaufmann
(Chambéry, France), pp. 937-943.
Botta M., Giordana A. and Saitta L. (1993). "Learning Fuzzy Concept Definition".
Proc. 2nd IEEE Int. Conf. on Fuzzy Systems, IEEE Press (San Francisco, CA),
pp. 18-22.
Cohoon J. P., Hedge S. U., Martin W. N. and Richards D. (1990). ÒPunctuated
Equilibria: A Parallel Genetic AlgorithmÓ, Proc. Int. Conf. on Genetic
Algorithms 1997 , Morgan Kaufmann (Cambridge, MA), pp. 148-154.
Danyluk A. P. and Provost F. J. (1993): ÒSmall Disjuncts in Action: Learning to
Diagnose Errors in the Local Loop of the Telephone NetworkÓ. Proc. Int. Conf.
on Machine Learning , Morgan Kaufmann (Amherst, MA), pp. 81-88.
De Jong K. A. (1975). ÒAnalysis of the Behaviour of a Class of Genetic Adaptive
SystemsÓ. Doctoral Dissertation, Dept. of Computer and Communication
Sciences, University of Michigan, Ann Arbor, MI.
De Jong K. and Spears W. M. (1991). ÒLearning Concept Classification Rules Using
Genetic AlgorithmsÓ. Proc. 12th Int. Joint Conference on Artificial Intelligence,
Morgan Kaufmann (Sidney, Australia), pp. 768-774.
Siedlecki W. and Sklanski J. (1989). "A Note on Genetic Algorithms for Large Scale
Feature Selection". Pattern Recognition Letters, 10, 335-347, Elsevier.
Silverstein G. and Pazzani M. (1991). "Relational Clichés: Constraining Constructive
Induction during Relational Learning". Proc. 8th Int. Workshop on Machine
Learning, Morgan Kaufmann (Evanston, IL), pp. 203-207.
Smith S. (1983). "Flexible Learning of Problem Solving Heuristics through Adaptive
Search". Proc. 8th Int. Joint Conf. on Artificial Intelligence, Morgan Kaufmann
(Karlsruhe, Germany), pp. 422-425.
Spears W. (1994). "Simple Subpopulation Schemes". Proc. of Workshop on
Evolutionary Programming.
Syswerda G. (1989). "Uniform Crossover in Genetic Algorithms". Proc. 3rd Int.
Conf. on Genetic Algorithms, Morgan Kaufmann (Fairfax, VA), pp. 2-9.
Tanese R. (1987). "Parallel Genetic Algorithm for a Hypercube". Proc. 2nd Int. Conf.
on Genetic Algorithms, Morgan Kaufmann (Cambridge, MA), pp. 177-183.
Tanese R. (1989). "Distributed Genetic Algorithms". Proc. 3rd Int. Conf. on Genetic
Algorithms, Morgan Kaufmann (Fairfax, VA), pp. 434-439.
Tecuci G. (1991). "Learning as Understanding the External World". Proc. First
International Workshop on Multistrategy Learning, Center for Artificial
Intelligence, George Mason University (Harpers Ferry, WV), pp. 49-64.
Towell G.G. and Shavlik J.W. (1994). "Knowledge-Based Artificial Neural
Networks". Artificial Intelligence, 70, 119-165, Elsevier.
Vafaie H. and De Jong K.A. (1991). "Improving the Performance of a Rule Induction
System Using Genetic Algorithms". Proc. First International Workshop on
Multistrategy Learning, Center for Artificial Intelligence, George Mason
University (Harpers Ferry, WV), pp. 305-315.
Venturini G. (1993). "SIA: A Supervised Inductive Algorithm with Genetic Search
for Learning Attribute Based Concepts". Proc. European Conference on
Machine Learning, Springer Verlag (Vienna, Austria), pp. 280-296.
Vose M.D. and Liepins G.E. (1991). "Punctuated Equilibria in Genetic Search".
Complex Systems, 5, 31-44.
Wilson S. (1987). "Classifier Systems and the Animat Problem". Machine Learning,
2, 199-228, Kluwer Academic Publishers.
Yeung D.Y. (1991). "A Neural Network Approach to Constructive Induction". Proc.
8th Int. Conf. on Machine Learning, Morgan Kaufmann (Evanston, IL), pp.
228-232.
Acknowledgements
The authors are very grateful to the Centre National de Calcul Parallèle en Sciences
de la Terre (Paris, France) and to the Thinking Machines Corporation for making the
CM-5 Connection Machine available to us. Particular thanks go to Andreas Pirklbauer,
to Patrick Stoclet and to Prof. Ottorino.