
Learning Control Strategies for Chemical Processes

A Distributed Approach
Riyaz Sikora, University of Illinois

JUNE 1992
0885/9000/92/0400-0035 $3.00 © 1992 IEEE

THE DISTRIBUTED LEARNING SYSTEM COMBINES SPLIT-BASED AND GENETIC ALGORITHMS. SUCH HYBRID MACHINE-LEARNING METHODS SHOW GREAT PROMISE WHEN AN INDUCTIVE LEARNER CANNOT REPRESENT A CONCEPT COVERING ALL POSITIVE EXAMPLES AND EXCLUDING ALL NEGATIVE EXAMPLES.

The problem of inducing rules from examples has received much attention, and such algorithms as version spaces, AQ, ID3, and PLS1 have been successful (PLS1 is described in a sidebar on p. 36). Similarity-based learning methods fall into two categories. Some methods, including PLS1 and ID3, use data to instantiate a parameterized model that is defined over an instance-space. An instance is a data point or unit of a sample and is represented as a list of attribute values. The attributes define an instance-space in which each attribute is a distinct dimension; for example, (Age, Height). The problem here is to fit the data to a model of the function. Other methods, such as genetic algorithms, use data to select a candidate concept defined over a hypothesis-space, that is, the space of all possible concepts, with each point representing a whole concept. The problem here is to optimize some measure of hypothesis quality. Instance-space algorithms are generally fast; hypothesis-space algorithms are slow but stable. All these algorithms operate on complete data sets to find the concept or rule that explains that data. Taking a different approach, Michael Shaw and I designed the Distributed Learning System (DLS), which combines the features of instance-space and hypothesis-space methods. This algorithm decomposes a data set of training examples into subsets. After applying an inductive learning program on each subset, it synthesizes the results using a genetic algorithm (see the sidebar on p. 37). This parallel distributed approach is more efficient, since each inductive learning program works on only a subset of data. Also, since the genetic algorithm searches globally in the hypothesis space, this approach gives a more accurate concept description. We implemented DLS in Common Lisp on a Texas Instruments Explorer machine.

The problem

Since it will be easier to discuss how DLS works from the perspective of a specific application, consider the process control problem to which I applied this approach. A process in a particular chemical plant produces an undesirable by-product that is not in the data and is not directly measurable. An expensive chemical, which we'll call Fix, can be added in just sufficient quantities to remove the by-product chemically. The goal is to change the controllable process variables V1, V2, ..., V9 to reduce the amount of needed chemical Fix. Since it is not possible to derive mathematical equations linking the amount of Fix used to the process variables V1 to V9, we used inductive learning to generate decision rules explaining when the amount of Fix is minimal. Each example in the data set consists of a set of plant readings showing the values of all the process variables and the amount of Fix being used. To dichotomize the decisions, the factory experts used a threshold value of 0.35 (all values were scaled between 0 and 1 to

The PLS1 algorithm

The PLS1 algorithm follows an inductive process starting with the entire space of possible events (called the feature space). It splits the space into two regions: those more likely to be in a specific class (positive events), and those more likely to be in other classes (negative events). The splitting continues until a stopping criterion is satisfied. Each split uses only one attribute, chosen according to an information-theoretic approach. In each iteration, the region R in the feature space can be defined by a tuple (r, u, e):

r is the region or disjunct, represented as a conjunction of conditions. This representation resembles DLS, in which a disjunction of regions constitutes a concept or hypothesis.

u is the utility function, that is, the ratio of positive events to the total events covered by the disjunct. Since the algorithm's purpose is to maximize the dissimilarity between disjuncts, each split is made to maximize the difference in the utilities of the two disjuncts (known as the distance function).

e is the error rate associated with a disjunct. It is based on the number of positive events covered by the disjunct as compared to the total number of positive events (always less than 1).

The distance function dist is defined as

dist = |log u1 - log u2| - f x log(e1 x e2)

where u1 and u2 are utilities for a tentative region dichotomy, e1 and e2 are their respective error factors, and f is a constant representing the degree of confidence. Larger values of dist correspond to higher dissimilarity. Let S be the set of positive and negative training events, and let R be the region that contains all events in S. The PLS1 algorithm follows these steps:

While any trial hyperplane remains untested, do
Begin
  Choose a hyperplane not previously selected to become a tentative boundary for two subregions of R, r1 and r2.
  Using the events from S, determine the utilities u1 and u2 of r1 and r2, and their error factors e1 and e2.
  If this tentative dichotomy produces a dissimilarity d larger than any previous value for d,
  Then create two permanent regions R1 = (r1, u1, e1) and R2 = (r2, u2, e2) having the (previously recorded) common boundary that gives the most dissimilar probabilities;
  Else place R in the defined region set to be output, and quit.
End.
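The split-selection loop above can be sketched in a few lines of Python (the article's implementation was in Common Lisp). This is an illustrative reading of the sidebar, not the original code: it assumes a single numeric attribute, takes the distance function as dist = |log u1 - log u2| - f x log(e1 x e2) with f = 1, and clamps utilities and error factors with a small EPS (my assumption) so the logarithms stay finite.

```python
import math

EPS = 1e-3  # clamp utilities/errors to keep the logarithms finite (my assumption)

def utility_and_error(events, lo, hi):
    """u: fraction of positives among events in [lo, hi);
    e: positives in [lo, hi) relative to all positives (kept below 1)."""
    inside = [cls for x, cls in events if lo <= x < hi]
    total_pos = sum(cls for _, cls in events)
    pos_in = sum(inside)
    u = max(pos_in / len(inside), EPS) if inside else EPS
    e = min(max(pos_in / total_pos, EPS), 1 - EPS) if total_pos else EPS
    return u, e

def best_split(events, lo, hi, f=1.0):
    """Try each observed value as a tentative boundary and keep the one
    that maximizes dist = |log u1 - log u2| - f * log(e1 * e2)."""
    best = None
    for x, _ in events:
        if not (lo < x < hi):
            continue
        u1, e1 = utility_and_error(events, lo, x)
        u2, e2 = utility_and_error(events, x, hi)
        dist = abs(math.log(u1) - math.log(u2)) - f * math.log(e1 * e2)
        if best is None or dist > best[0]:
            best = (dist, x)
    return best  # (dissimilarity, boundary) or None

# One numeric attribute; positives (class 1) cluster at higher values
events = [(0.1, 0), (0.2, 0), (0.3, 0), (0.6, 1), (0.7, 1), (0.9, 1)]
print(best_split(events, 0.0, 1.0))
```

On this toy data the chosen boundary is 0.6, which cleanly separates the negative events below from the positive events above.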

protect the process information). We converted the problem to a single-concept learning problem by treating the amount of Fix as a class membership variable: positive when it is below the 0.35 threshold, else negative. The data set has 572 examples, each taken at a different time. The most common criterion for evaluating a learning algorithm is its prediction accuracy. However, since the objective of this study was to learn strategies for controlling the use of a chemical, it was more important that the rules or strategies learned be concise and easy to implement.

Consider a general learning algorithm that takes the training data set (X, u) as its input and finds a concept u(X) that explains it. For each instance of (X, u), X is the vector of attribute values and u is the corresponding classification variable (usually a binary variable, with 1 for positive and 0 for negative examples). The output concept, u(X), is the classification variable as a function of the attribute vector. We can thus address the problem of learning the concept as the problem of finding the function u(X). DLS uses inductive learning programs to generate hypotheses that can be input into the genetic algorithm. By providing a useful initial population, these inputs decrease the time needed to reach the correct concept. The genetic algorithm searches globally and with implicit parallelism, which improves the quality of the learning results. Figure 1 shows a functional diagram of DLS, where (X, u) is represented as P and (X, u)i, the ith subsample, is represented by Pi.

Figure 1. The Distributed Learning System.



Decomposing the data set. We decompose the data set into subsamples, since averaging a statistic over subsamples gives a better estimate than does calculating the statistic from the whole data set P. We use the jackknife technique of drawing random samples, in which one or more data points are removed from the data set P to get a subsample Pi. This is repeated to get more subsamples Pi (i = 1, 2, ..., n). Specifically, we used the leave-out-r

IEEE EXPERT

Genetic algorithms
Genetic algorithms are adaptive parallel-search algorithms that can locate global maxima without getting trapped in local maxima. Goldberg describes genetic algorithms as search algorithms based on the mechanics of natural selection and natural genetics. A genetic algorithm includes

(1) a chromosomal representation of a solution to the problem;
(2) a way to create an initial population of solutions;
(3) an evaluation function that rates solutions in terms of fitness; and
(4) genetic operators that alter the composition of solutions during reproduction.

Starting from an initial population of solutions, the genetic algorithm works with one population of solutions at a time. The algorithm evaluates each solution and assigns it a fitness score. By applying recombination and genetic operators to the old population, the algorithm generates a new population of solutions, which it then explores. Three genetic operators are commonly used. The reproduction operator duplicates the members of the population (solutions) that will be used to derive new members. The number of copies of each member is proportional to its fitness score. After reproduction, new individuals are generated by selecting two individuals at a time from the resulting population and applying the crossover operator. This exchanges genes between the two selected individuals (parents) to form two different individuals, and it is usually applied with a constant probability Pc. The mutation operator randomly changes some genes in a selected individual. It is applied much less often, at a rate of Pm (where Pm << 1). When applying a genetic algorithm, the user assigns values for such parameters as the population size, the number of generations, and the probability of applying crossover and mutation operators. Since the genetic algorithm works with string structures (analogous to chromosomes in biological systems), the solutions should be encoded and represented in string form. This low-level representation is called a genotype, and the corresponding set of apparent characteristics is called a phenotype. The individual elements of the genotype are called genes, and their possible values are alleles. The basic genetic algorithm is as follows:
Procedure GA (population size n, maximum number of generations Ng)
begin
  select an initial population of n genotypes (g);
  no-of-generations = 0;
  repeat
    for each member g of the population,
      compute f(g), the fitness measure for each member; /*evaluation*/
    repeat
      stochastically select a pair of genotypes g1, g2 with probability increasing with their fitness f; /*reproduction*/
      using the genotype representation of g1 and g2, mutate a random bit with probability Pm; /*mutation*/
      randomly select a crossover point and perform crossover on g1 and g2 to give new genotypes g1' and g2'; /*crossover*/
    until the new population is filled with n individuals;
    no-of-generations = no-of-generations + 1;
  until all the members converge; /*termination*/
end;

Reference
1. D.E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, Reading, Mass., 1989.
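The sidebar's procedure translates almost line for line into Python. The sketch below is generic, not the DLS implementation: the fitness function (count of 1-bits, plus 1 to keep all selection weights positive), the population size, and the operator rates are placeholder choices, and it runs a fixed number of generations rather than testing convergence.

```python
import random

def genetic_algorithm(fitness, n=20, length=12, ngen=60, pc=0.7, pm=0.01):
    """Basic GA from the sidebar: evaluate each genotype, reproduce in
    proportion to fitness, cross over with probability pc, and mutate
    each bit with probability pm, for ngen generations."""
    random.seed(0)  # deterministic for the illustration
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(n)]
    for _ in range(ngen):
        scores = [fitness(g) for g in pop]                     # evaluation
        new_pop = []
        while len(new_pop) < n:
            g1, g2 = random.choices(pop, weights=scores, k=2)  # reproduction
            g1, g2 = g1[:], g2[:]
            if random.random() < pc:                           # one-point crossover
                cut = random.randrange(1, length)
                g1[cut:], g2[cut:] = g2[cut:], g1[cut:]
            for g in (g1, g2):                                 # bitwise mutation
                for i in range(length):
                    if random.random() < pm:
                        g[i] = 1 - g[i]
            new_pop += [g1, g2]
        pop = new_pop[:n]
    return max(pop, key=fitness)

# Stand-in fitness: count of 1-bits (+1 keeps every selection weight positive)
best = genetic_algorithm(lambda g: sum(g) + 1)
print(sum(best), "ones in", best)
```

With this fitness the population drifts toward the all-ones genotype, illustrating how selection, crossover, and mutation cooperate.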

technique (0 ≤ r < 1), which obtains a random subsample where each example in the original data set has a probability of (1 - r) of being included in the subsample. We also use a decomposability index d to calculate the amount of overlap among subsamples. It is defined (for a particular decomposition of the data set) as the ratio of the average number of examples in each subsample to the total number of examples in the data set: d = Num x (1 - r)/Num = (1 - r), where Num is the total number of examples. In general, there is no overlap when d = 1/n (where n is the number of subsamples). PLS1 programs within DLS use each subset to generate a concept to be passed to the genetic algorithm. To compare the concepts based on their scope, we defined an index s as the ratio of the number of examples covered by a given concept or rule to the number of all possible examples. In other words, s is the fraction of instance-space covered. Thus, if s1 > s2, then a

concept C1 has more scope and hence more predictive power than C2.
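A minimal sketch of the leave-out-r decomposition in Python, assuming each example is kept in a subsample independently with probability (1 - r); the data values and function name are stand-ins for illustration.

```python
import random

def decompose(data, n, r, seed=0):
    """Draw n subsamples: keep each example with probability (1 - r),
    so the decomposability index d -- the expected fraction of the data
    in each subsample -- equals (1 - r)."""
    rng = random.Random(seed)
    return [[ex for ex in data if rng.random() < (1 - r)] for _ in range(n)]

data = list(range(572))                   # stand-in for the plant readings
subsamples = decompose(data, n=5, r=0.8)  # d = 1 - r = 0.2 = 1/n
for i, s in enumerate(subsamples, 1):
    print(f"P_{i}: {len(s)} examples")    # each near 572 * 0.2, about 114
```

With d = 1/n the subsamples together cover roughly the whole data set with little expected overlap, matching the text's no-overlap condition.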

Representing control strategies. A control strategy will be represented as a categorical classification rule defined by a set of attributes. This classification rule also corresponds to a binary class membership function that holds over the instance space. Given the range of attribute values and function values, a correct concept can be expressed as one of many candidate hypotheses. To learn concepts, a program must be able to form the hypothesis of the correct concept. I use the terms rule and control strategy interchangeably. In general, rules predict the value of a dependent variable for a given combination of independent variables, and strategies find the combination of independent variables that gives the desired value of a dependent variable. All the rules and strategies together constitute a concept. Concepts are represented in disjunctive normal form. In other words, a concept C of size k (also called k-DNF) is represented as C = C1 ∨ C2 ∨ ... ∨ Ck, where each disjunct Ci is a conjunction of conditions. For example, in a problem involving m attributes, each disjunct Ci will be of the form Ci = c1,i ∧ c2,i ∧ ... ∧ cm,i, where each cj,i, j = 1, 2, ..., m, represents an interval range for the attribute j. Consider the concept: If C Then Fix < 0.35, where

C = [(V3 ≥ 0.262) And (V6 ≥ 0.071)] Or [(V1 ≤ 0.5) And (V7 ≥ 0.65)]
  = C1 ∨ C2

In this case, k = 2 and the disjuncts are of the form

C1 = c3,1 ∧ c6,1
C2 = c1,2 ∧ c7,2, where

c3,1 = (V3 ≥ 0.262)
c6,1 = (V6 ≥ 0.071)
c1,2 = (V1 ≤ 0.5)
c7,2 = (V7 ≥ 0.65)
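The 2-DNF concept above can be evaluated mechanically. A small Python sketch; the dictionary representation and function names are illustrative assumptions, not the article's Lisp data structures:

```python
# Each disjunct maps an attribute to an allowed interval [low, high];
# attributes absent from a disjunct are unconstrained.
C1 = {"V3": (0.262, 1.0), "V6": (0.071, 1.0)}
C2 = {"V1": (0.0, 0.5), "V7": (0.65, 1.0)}
concept = [C1, C2]  # C = C1 v C2, a 2-DNF concept

def satisfies(disjunct, reading):
    """True when every interval condition in the disjunct holds."""
    return all(lo <= reading[v] <= hi for v, (lo, hi) in disjunct.items())

def predicts_low_fix(concept, reading):
    """True if any disjunct covers the reading, predicting Fix < 0.35."""
    return any(satisfies(d, reading) for d in concept)

reading = {f"V{i}": 0.5 for i in range(1, 10)}
reading.update(V3=0.3, V6=0.1)             # satisfies C1's two conditions
print(predicts_low_fix(concept, reading))  # True
```

This also makes the point about unconstrained attributes concrete: V1 never appears in C1, so C1 covers the reading regardless of V1's value.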

If a particular cj,i is not present in Ci, it implies that the attribute j can take any value in Ci. In the above case, for example, there is no c1,1 in C1, implying that the attribute V1 can take any value in C1. C1 and C2 are control strategies for keeping the value of chemical Fix below 0.35. Stated differently, the control strategy C1 keeps the control variable V3 above 0.262 and V6 above 0.071 to keep the value of Fix below the 0.35 threshold. In this application, one of the main criteria in judging a concept is ease of use: We want to have a minimum number of disjuncts k (that is, a minimum k-DNF concept) and a minimum number of terms cj,i in each disjunct Ci. PLS1 uses this type of representation. The genetic algorithm in DLS must represent control strategies in a way compatible with the PLS1 representation. I obtained the genetic algorithm's representation (called its genotype) by mapping the characteristics (that is, the phenotype) of Ci = (c1,i ∧ ... ∧ cm,i) into binary form (see the sidebar on p. 37). For example, if the problem has two attributes A1 and A2, and if we use a 3-bit binary representation to convert the integers into binary form, then a disjunct Ci = c1,i ∧ c2,i, where c1,i = (1 ≤ A1 ≤ 3) and c2,i = (4 ≥ A2 ≥ 2), can be represented as

((1 3) (4 2))                phenotype
(0 0 1 0 1 1 1 0 0 0 1 0)   genotype

Deriving rules. Genetic algorithms can be used to derive rules in two ways: by letting each member in the population represent a complete concept (C1 ∨ C2 ∨ ... ∨ Ck), or by letting each member be a single disjunct Ci = (c1,i ∧ ... ∧ cm,i). DLS uses the second method because, unlike the first method, it does not have to know the number of disjuncts beforehand. If the data set is decomposed into n subsets and each subset is used by a PLS1 program, then the PLS1 programs will generate n concepts, C^1, C^2, ..., C^n, where each C^i = C^i_1 ∨ ... ∨ C^i_mi. The genetic algorithm takes each C^i_j (for j = 1 to mi) from C^i (for i = 1 to n) as one member of the population. The initial population input to the genetic algorithm therefore has a total of m1 + m2 + ... + mn members, represented as genotypes. Thus, translating the output of the inductive-learning component into input for the genetic algorithm merely consists of breaking each concept C^i into its members and converting each C^i_j into its binary genotype.

The genetic algorithm's fitness function. The goal of the ideal learning system is to obtain a concept that covers all the positive examples without covering any negative examples. Due to noise in real-world data, we have to relax this requirement and allow a few negative examples to be covered as well. On the other hand, the positive and negative examples in the application described here constitute two classes that are exclusive of each other. When the learning program gives a concept C for the positive examples, we also want C* (the complement of C) to represent the concept for the negative examples. Thus a concept C actually represents two mutually exclusive classes and hence is associated with two accuracy terms. Consequently, the evaluation function should maximize both accuracy terms. The two accuracy terms are defined as

F1 = pos/Pos and F2 = (Neg - neg)/Neg

where Pos is the total number of positive examples, Neg is the total number of negative examples, pos is the number of positive examples covered by the concept, and neg is the number of negative examples covered. The multiobjective function is reduced to a single-objective function by taking a convex combination of F1 and F2:

F = a.F1 + (1 - a).F2

where a is any number (0 ≤ a ≤ 1). Let a = Pos/(Pos + Neg), so that the two accuracy terms are weighted proportionally to their class representativeness. The objective function then becomes

F = [pos + Neg - neg] / (Pos + Neg)

Multiplying by the constant (Pos + Neg), we get the fitness function F' = (pos + Neg - neg), where 0 < F' ≤ Pos + Neg.

The algorithm tries to find the best possible hypothesis (or disjunct) by applying genetic operators to the initial population. It then retains the best hypothesis, removes the positive examples that are covered, and repeats the process to find a new hypothesis that covers as many of the remaining positive instances as possible. The process terminates either when all the positive instances are covered, or when the algorithm cannot find a disjunct covering a certain threshold of positive examples (I kept the threshold at 5 for this application). The final concept is the disjunction of all the hypotheses found. This method of searching the instance-space is called explanation-based filtering.

THE WRONG BIAS CAN MAKE LEARNING INEFFICIENT OR A CONCEPT HARD TO LEARN. COMBINING ALGORITHMS CAN ENSURE THAT WE AVOID THESE PROBLEMS.

A comparative study

While doing a project for a chemical company, I compared DLS's distributed approach to C4.5 (the latest version of ID3) and PLS1. I divided the data set of 572 examples into a training set of 458 examples (290 positive and 168 negative) and a testing set of 114 examples (65 positive and 49 negative). As in the earlier example, the positive examples correspond to the value of the expensive chemical Fix being less than 0.35.
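The simplified fitness F' = pos + Neg - neg is easy to verify in code. A Python sketch with a toy single-attribute rule; the function names are hypothetical:

```python
def fitness(concept_covers, examples):
    """F' = pos + Neg - neg, the simplified fitness from the text.
    `examples` is a list of (attribute_value, label) with label 1/0;
    `concept_covers` decides whether the candidate disjunct covers x."""
    Pos = sum(1 for _, y in examples if y == 1)
    Neg = len(examples) - Pos
    pos = sum(1 for x, y in examples if y == 1 and concept_covers(x))
    neg = sum(1 for x, y in examples if y == 0 and concept_covers(x))
    return pos + Neg - neg  # lies in (0, Pos + Neg]

# Toy check: a threshold rule on a single attribute
examples = [(0.2, 1), (0.3, 1), (0.4, 0), (0.9, 0)]
print(fitness(lambda x: x < 0.35, examples))  # covers both positives, no negatives: 2 + 2 - 0 = 4
```

A perfect disjunct scores Pos + Neg, and every wrongly covered negative costs exactly as much as a missed positive, which is what the convex weighting by class representativeness produces.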

C4.5. Since C4.5 explains its examples using only a decision tree, I had to extract the rules and strategies from it. The best tree generated by C4.5 had 70 leaves (after pruning). Since I was interested in knowing only when the value of Fix is less than 0.35, I needed to consider only those parts of the tree that corresponded to positive class membership; in this case, 20 leaves. Thus the size of the concept in disjunctive normal form was 20. Table 1 presents the six best control strategies learned by C4.5, along with the number of positive and negative examples explained by each (in the classification column) and their s indices (reflecting the fraction of instance-space covered). For example, to keep the value of the chemical Fix below 0.35, the first strategy, C1, calls for maintaining the process variable V1 above 0.028, V2 above 0.256, V3 above 0.043, V4 below 0.405, V5 below 0.815,

Table 1. The six best control strategies learned by C4.5. (Columns: interval bounds on the process variables V1 through V9; classification, the pair of positive and negative examples explained; and the s index. The classification pairs for C1 through C6 are (66, 7), (47, 7), (38, 7), (20, 7), (20, 20), and (14, 6); C1's s index is 0.126.)

Table 2. The six best control strategies learned by PLS1. (Columns: interval bounds on the process variables V1 through V9; classification; prediction; and the s index. The best strategy, C1', explains (183, 14), predicts (41, 6), and is discussed in the text.)

and V9 above 0.457. In other words, the strategy imposes constraints on six of the nine process variables. The s index is one measure of this restrictiveness. If the strategy is very restrictive, that is, if it calls for severe control of all variables, the s index will be low. In this case, the best strategy (C1) has an s index of 0.126 and explains only 66 of the 290 cases where the value of Fix is below 0.35 (positive examples), but wrongly explains seven of the 168 cases where the value of Fix is above 0.35 (negative examples). While the tree's prediction accuracy was 78.3 percent, Table 1 does not include a column for prediction, since C4.5 does not predict results for individual leaves of the tree.

PLS1. PLS1 describes its concepts in the required disjunctive normal form. The variable sig is the significance level (between 0 and 1) corresponding to the approximate noise level in the data set. The concept given by the PLS1 program for sig = 1 had a size of 18, and its prediction accuracy was 77.2 percent. Table 2 presents the six best control strategies of the 18 that PLS1 developed. The best strategy, C1', constrains five of the nine process variables and has an s index (scope) of 0.1335, about the same as that of the best strategy from C4.5. It explains 183 of the 290 positive cases but also covers 14 of the 168 negative cases, a much better rule than the best one from C4.5. In terms of prediction, C1' correctly predicts 41 of the testing set's 65 positive cases but wrongly covers six of the 49 negative cases. Interestingly, the s indices generally decrease as the rules become weaker.

DLS. For this comparison, I used the parameter values n = 5 and d = 0.2. For the genetic algorithm, I used the Stochastic Universal Sampling (SUS) algorithm for reproduction, a uniform crossover operator with a 0.7 probability, a 0.05 probability of mutation, and 100 generations. The population size of the genetic algorithm was determined by the PLS1 programs' output and was found to be proportional to the value of d for each n. I scaled the real-valued data set (between 0 and 1) to integer values from 0 to 63 so that the genetic algorithm could use a 6-bit binary representation for the genotypes. Later, I transformed the resulting concept given by DLS back to the original real-valued range. For ease of understanding, DLS outputs are reported as infinite intervals (even though they are actually finite closed intervals) if either end point of an interval corresponds to the domain's lower or upper limit.
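The 6-bit scaling between [0, 1] reals and genotype bits can be sketched as follows. The function names are illustrative, and rounding to the nearest of the 64 levels is an assumption about the quantization:

```python
def real_to_gene(x, bits=6):
    """Scale a value in [0, 1] to an integer in [0, 63] and emit its bits,
    most significant bit first."""
    level = round(x * (2 ** bits - 1))
    return [(level >> i) & 1 for i in reversed(range(bits))]

def gene_to_real(gene):
    """Decode a bit list back to the original [0, 1] range (quantized)."""
    level = int("".join(map(str, gene)), 2)
    return level / (2 ** len(gene) - 1)

g = real_to_gene(0.262)          # 0.262 maps to level 17 of 63
print(g, round(gene_to_real(g), 3))
```

The round trip is lossy by at most half a quantization step (1/126), which is why DLS transforms its final concept back to the real-valued range only at the end.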


Table 3. The control strategies learned by DLS. (Columns: interval bounds on the process variables V1 through V9; classification; prediction; and the s index.)

C1'': V3 ≥ 0.262, V6 ≥ 0.071, V7 ≤ 0.579; classification (235, 39); prediction (55, 14); s index 0.397.
C2'': classification (42, 11); prediction (6, 6); s index 0.006.

Table 4. Overall comparison of C4.5, PLS1, and DLS results.

LEARNING METHOD         PREDICTION ACCURACY (PERCENT)   NUMBER OF RULES/CONTROL STRATEGIES
C4.5                    78.3                            20
PLS1 (sig = 1)          77.2                            18
DLS (n = 5, d = 0.2)    79                              2

Figure 2. Plot of accuracy of PLS1 and DLS for various values of n and d. (Curves: DLS with n = 2, 5, and 7; PLS1 with sig = 0 and sig = 1; the horizontal axis is d.)

Taking a closer look at Tables 1 and 2, we can see patterns across the control strategies generated by C4.5 and PLS1. For example, all the strategies place a lower bound on the value of V3 and V6, leave some process variables unconstrained, and so on. It is exactly these generalizations that DLS captures in its strategies. It has a lower bound on V3 that falls between the lower bounds produced by C4.5 and PLS1. Although it is not obvious, the strategies generated by DLS combine and refine those generated by PLS1. For example, taking the union of C1' and C2' from PLS1's output would eliminate one of the variables and place an upper bound of 0.579 on V7, which is exactly what DLS does. DLS has combined C1' and C2' from the output of PLS1, modified it to remove the restriction on that variable,

and relaxed the lower bound on V3 to 0.262 to produce a more powerful rule C1''. This demonstrates DLS's ability to refine rules, that is, to combine and modify one or more rules from PLS1's output to form a more concise and powerful rule. The concept learned by DLS, shown in Table 3, has only two control strategies, as compared to 18 from PLS1 and 20 from C4.5. The best DLS strategy, C1'', calls for maintaining the process variable V3 above 0.262, V6 above 0.071, and V7 below 0.579. Because it restricts only three of the nine process variables, C1'' is easy to implement, and it has an s index of about 0.397, compared with 0.134 for PLS1's output and 0.126 for C4.5's output. In terms of classification, C1'' explains 235 of the 290 positive cases, much more than any strategy from the other two tests, but at the cost

of also covering 39 of the 168 negative examples. In terms of prediction, C1'' predicts 55 of the 65 positive test cases, again much better than the previous results, but also wrongly predicts 14 of the 49 negative test cases. Overall, the prediction accuracy of the DLS results is 79 percent. DLS stops learning after covering 277 of the 290 positive training-set examples, because the genetic algorithm cannot find a new strategy explaining more than the threshold of five positive examples. The results of the experiment show that DLS's control strategy is about 90 percent more concise than those produced by PLS1 and C4.5, and its predictions are more accurate. Table 4 compares the overall concepts generated by each method.

An empirical study

Two DLS parameters, the number of subsamples n and the decomposability index d, must be tuned to achieve good performance. Using the same data set from the chemical process control problem, Michael Shaw and I experimented further to gauge the effect of changing these parameters on system performance. We randomly divided the data set into a training set of 458 and a testing set of 114 examples. We used the same parameter values for the genetic algorithm as in the first set of experiments, and a different training and testing set in each of five runs, averaging the results. Figures 2 and 3 show the prediction accuracy and CPU time, respectively, of PLS1 and DLS for different values of n as a function of d. For each value of n, DLS is generally the most accurate when the average amount of overlap between subsamples is minimal, that is, around d = 1/n. Also, the peak performances when n = 2 and n = 5 are better than PLS1 in terms of prediction accuracy, and the CPU times are comparable. The average size of the concept for DLS was approximately three, and for PLS1 it was 49.8 for sig = 0 and 17.4 for sig = 1. Thus the accuracy for n = 5 and d = 0.2 is comparable to that of PLS1 (about 2 percent better), but rule size improves by about 82 percent. However, the prediction accuracy of DLS starts to decrease as the value of n increases, due to the loss of data representativeness as the data set is decomposed into too many insignificant subsamples. The performance of DLS thus depends on both the parameters n and d.

Figure 3. Plot of CPU time of PLS1 and DLS for various values of n and d. (Curves: DLS with n = 2, 5, and 7; PLS1 with sig = 0 and sig = 1; the horizontal axis is d.)

Other applications

We also applied this approach to financial problems, using real-world data sets for bankruptcy analysis, loan default evaluation, and loan risk classification. An example on loan default evaluation again demonstrates DLS's ability to refine rules. We used this data set to classify firms into those that would default on loan payments and those that would not. Our testing set included 16 different positive instances and a data set of 32 examples, of which 16 belonged to the default class and the other 16 to the nondefault class. We set the parameter values at n = 20, d = 0.6, the probability of crossover = 0.7, and the probability of mutation = 0.01. Since the testing set had only positive examples, one could argue that the best rule would be "all firms default." However, this involves using information about the testing set prior to learning. Figure 4 lists the rules obtained by PLS1 and DLS, along with the number of instances they correctly predicted. The PLS1 attribute (total-debt to total-assets ratio) has the value of ≤0.763 in C1' and >0.763 in C2', with the cutoff value of (current-asset to current-liability ratio) remaining the same

PLS1 RULES

C1': If [(total-debt to total-assets ratio ≤ 0.763) And (current-asset to current-liability ratio ≥ 1.967)] Then the firm would default on loan payment
C2': If [(net-income to total-assets ratio ≥ -0.015) And (total-debt to total-assets ratio > 0.763) And (current-asset to current-liability ratio ≤ 1.967)] Then the firm would default on loan payment
C3': If [(1.967 < current-asset to current-liability ratio ≤ 2.574) And (working-capital to sales ratio ≥ 0.226)] Then the firm would default on loan payment
C4': If [(current-asset to current-liability ratio ≥ 4.578)] Then the firm would default on loan payment

DLS RULES

C1: If [(current-asset to current-liability ratio ≤ 1.967)] Then the firm would default on loan payment (14 of 16 correctly predicted)
C2: If [(1.967 < current-asset to current-liability ratio ≤ 2.574) And (working-capital to sales ratio < 0.226)] Then the firm would default on loan payment
C3: If [(current-asset to current-liability ratio ≥ 4.578)] Then the firm would default on loan payment

Figure 4. Rules obtained by PLS1 and DLS, and the number of instances they correctly predicted from the total of 16.

in both rules. DLS combined these PLS1 rules to get the simpler rule C1, which eliminates the variable (total-debt to total-assets ratio). Thus DLS reduces the rule size from four to three and improves the prediction accuracy from 68.8 percent (for PLS1) to 93.8 percent. The predictions of DLS's rule C1, 14 out of 16 default loans, outnumber the combined predictions of PLS1's output.

Limitations and future work

While this distributed-learning algorithm improves performance, it has limitations.

Real-coded and symbolic-coded genetic algorithms. Since genetic algorithms work on strings of 0s and 1s, our method is currently limited to problems where the inductive learner's output can be translated easily into binary code. Although we used a binary-coded genetic algorithm in DLS, genetic algorithms do work on more than binary alphabets. I am working on replacing the binary-coded genetic algorithm with a real-coded one; that is, one that works directly on real intervals. The only changes needed are slightly different crossover and mutation operators: The uniform crossover operator can be used with representations involving real intervals by treating each attribute's interval range as a single entity (see Figure 5). The mutation operator can be designed as a specialization/generalization operator that either increases or decreases (with equal probability) the interval range of an attribute.

The same argument also applies to problems involving symbolic values. Instead of using intervals, the genetic algorithm would use sets to represent symbolic variables. The crossover operator would remain the same, but the mutation operator would have to be changed. The new mutation operator could simply replace one value of the symbolic variable with another value from its domain. For example, if color is one of the attributes, and if the domain of color is {red, blue, white, yellow}, then the mutation operator would change the current value of color in the concept by either adding a color from the domain or removing an existing color.

Handling complex representation languages. Although the task is nontrivial, we should also be able to extend DLS to handle complex representation languages based on first-order predicates. This would require developing new genetic operators (crossover and mutation) that respect language syntax. We could use John Koza's genetic-programming paradigm, in which populations of computer programs are genetically bred using a crossover (recombination) operator appropriate for genetically mating computer programs. Many seemingly different problems in artificial intelligence, symbolic processing, and machine learning require computer programs that produce a desired output for particular inputs. Applying this idea to DLS, we recognize that the outputs of inductive learners can be thought of as programs that take an object as input and give that object's classification as output. For example, the learner's output could be in the form of a schema (as in explanation-based learning) for a particular concept. Given an object as its input, the schema would determine whether the object belongs to the concept. Extending DLS to handle languages based on first-order predicates would require mapping a learner's output to a function or program (such as a Lisp s-expression), and incorporating into the genetic algorithm special genetic operators similar to the ones used in the genetic-programming paradigm.

This argument also applies to representation languages based on decision trees. The trees generated by several inductive learners (ID3, for example) can be mapped to Lisp s-expressions and combined to make a more concise tree. The genetic algorithm can then operate on the s-expressions by applying recombination operators. Koza's example shows how a genetic algorithm learns the tree for a symbolic problem involving the attributes Temperature, Humidity, Outlook, and Windy.3
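As a toy illustration of this idea (the tuple encoding and all function names here are my own, not Koza's or DLS's actual code), learner outputs can be encoded as small decision-tree programs and recombined by swapping randomly chosen subtrees:

```python
import random

# Sketch: decision trees encoded as nested tuples
#   ("if", (attr_index, threshold), yes_subtree, no_subtree)
# with Boolean leaves. Crossover swaps random subtrees, as in
# the genetic-programming paradigm.

def classify(tree, example):
    """Run the tree-program on an example (a list of attribute values)."""
    if isinstance(tree, bool):
        return tree
    _, (attr, thresh), yes, no = tree
    return classify(yes if example[attr] <= thresh else no, example)

def subtrees(tree, path=()):
    """Yield (path, subtree) for every node in the tree."""
    yield path, tree
    if not isinstance(tree, bool):
        yield from subtrees(tree[2], path + (2,))
        yield from subtrees(tree[3], path + (3,))

def replace_at(tree, path, new):
    """Return a copy of tree with the subtree at `path` replaced."""
    if not path:
        return new
    node = list(tree)
    node[path[0]] = replace_at(node[path[0]], path[1:], new)
    return tuple(node)

def crossover(t1, t2, rng=random):
    """Swap a random subtree of t1 with a random subtree of t2."""
    p1, s1 = rng.choice(list(subtrees(t1)))
    p2, s2 = rng.choice(list(subtrees(t2)))
    return replace_at(t1, p1, s2), replace_at(t2, p2, s1)

t1 = ("if", (0, 1.967), True, False)                      # a one-test rule
t2 = ("if", (1, 0.226), True, ("if", (0, 4.578), False, True))
child_a, child_b = crossover(t1, t2)
print(classify(t1, [1.5, 0.3]))   # True: attribute 0 is <= 1.967
```

Because crossover always exchanges whole subtrees, every child is again a syntactically valid tree-program, which is exactly the property the new genetic operators must preserve.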

Attribute-based decomposition. Another limitation of DLS is the way in which it decomposes its data set: Since it allocates a fraction of the examples to each inductive learner, DLS might not be effective if the data set is small. However, we could instead use an attribute-based decomposition scheme, where each learning program gets only the data corresponding to a subset of attributes. This might be especially useful in real-world situations where data is spatially distributed (as in air traffic control). In these cases it is more efficient to use the inductive learners where the data resides and then synthesize the results, rather than collect the data at one place and then analyze it.

In this method of decomposition, however, the results of the inductive learners are underspecified with respect to the whole problem. This can be overcome by filling the unspecified portions of the partial solutions with the complete domains of the respective attributes. We could then use the genetic algorithm on these complete solutions. For example, let's assume that four attributes, A1, A2, A3, and A4, are distributed two each to two learning programs. The learners generate partial solutions that give a range for each attribute; for example, ((0, 3.5) (-8.7, 3)) could be partial solution 1 using the first two attributes, and ((9, 12.5) (3, 8.1)) could be partial solution 2 using the next two attributes. In other words, partial solution 1 says that [(0 ≤ A1 ≤ 3.5) and (-8.7 ≤ A2 ≤ 3)], and partial solution 2 says that [(9 ≤ A3 ≤ 12.5) and (3 ≤ A4 ≤ 8.1)]. If we know the domains of the respective attributes (which we usually do in real-world problems), we can expand the partial solutions by combining them with the domains of the missing attributes (that is, the "don't cares"). Say, for example, the domains of the attributes are (0, 10), (-10, 10), (0, 20), (0, 10), respectively. Then the above partial solutions can be converted into complete solutions:

((0, 3.5) (-8.7, 3) (0, 20) (0, 10))
((0, 10) (-10, 10) (9, 12.5) (3, 8.1))

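Assuming the interval-list representation above, the domain-filling step might look like this sketch (the function and variable names are my own, not DLS's):

```python
# Sketch: expand attribute-based partial solutions into complete
# solutions by filling every unspecified attribute with its full
# domain (the "don't cares"), as described above.

def expand(partial, attr_indices, domains):
    """Place each interval of `partial` at its attribute position;
    every attribute not in `attr_indices` keeps its whole domain."""
    full = list(domains)                      # start with all don't-cares
    for idx, interval in zip(attr_indices, partial):
        full[idx] = interval
    return full

domains = [(0, 10), (-10, 10), (0, 20), (0, 10)]   # domains of A1..A4

# Partial solution 1 uses attributes A1, A2; partial solution 2 uses A3, A4.
sol1 = expand([(0, 3.5), (-8.7, 3)], [0, 1], domains)
sol2 = expand([(9, 12.5), (3, 8.1)], [2, 3], domains)

print(sol1)   # [(0, 3.5), (-8.7, 3), (0, 20), (0, 10)]
print(sol2)   # [(0, 10), (-10, 10), (9, 12.5), (3, 8.1)]
```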
Parent1: [(-0.5 0.5) (0 1) (0.12 0.34) (0 0.5)]
Mask:    1 0 1 0
Parent2: [(0 0.3) (0.32 0.7) (0.01 1) (-0.1 0.43)]

Child1: [(0 0.3) (0 1) (0.01 1) (0 0.5)]
Child2: [(-0.5 0.5) (0.32 0.7) (0.12 0.34) (-0.1 0.43)]

Figure 5. Crossover example.
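Figure 5's mask-based crossover, together with the interval-widening/narrowing mutation described earlier, might be sketched as follows (a toy illustration with my own function names, not the DLS implementation):

```python
import random

# Sketch of Figure 5: each attribute's interval is one gene. Where the
# mask is 1, Child1 takes Parent2's gene; where it is 0, Child1 takes
# Parent1's. Child2 gets the complementary choices.

def uniform_crossover(p1, p2, mask):
    c1 = [b if m else a for a, b, m in zip(p1, p2, mask)]
    c2 = [a if m else b for a, b, m in zip(p1, p2, mask)]
    return c1, c2

# Mutation as specialization/generalization: widen or narrow one
# attribute's interval by a small step, with equal probability.
# A real implementation would also clip the result to the
# attribute's domain.
def mutate(concept, step=0.1, rng=random):
    i = rng.randrange(len(concept))
    lo, hi = concept[i]
    delta = step if rng.random() < 0.5 else -step   # generalize or specialize
    out = list(concept)
    out[i] = (lo - delta, hi + delta)
    return out

p1 = [(-0.5, 0.5), (0, 1), (0.12, 0.34), (0, 0.5)]
p2 = [(0, 0.3), (0.32, 0.7), (0.01, 1), (-0.1, 0.43)]
c1, c2 = uniform_crossover(p1, p2, [1, 0, 1, 0])
print(c1)   # [(0, 0.3), (0, 1), (0.01, 1), (0, 0.5)]
print(c2)   # [(-0.5, 0.5), (0.32, 0.7), (0.12, 0.34), (-0.1, 0.43)]
```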

This accomplishes two things: it completely specifies solutions that a genetic algorithm can use, and it retains the partial solutions in their original form. This also corresponds nicely to the theory of schemata of genetic algorithms,12 since the partial results of the inductive learners correspond to schemata (that is, similarity templates that the genetic algorithm treats as building blocks in searching for an optimal solution). My initial experiments using attribute-based decomposition on a real-world bankruptcy prediction problem show promising results. However, more work needs to be done before drawing any conclusions.

negative examples, or in the case of noisy data. In such situations, a genetic algorithm can look for an optimal concept that covers as many positive examples as possible and as few negative examples as possible. This is important, since such situations are common in real-world applications. The successful implementation of DLS demonstrates a promising direction for hybrid machine-learning systems, which synergistically combine separate learning paradigms.
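One way to sketch such a coverage-driven search criterion (a toy formulation of my own, not the fitness function actually used in DLS): score a candidate concept by the positive examples it covers minus the negative examples it covers, and let the genetic algorithm maximize that score.

```python
# Sketch: a concept is one interval per attribute; fitness rewards
# covering positives and penalizes covering negatives, so an optimal
# concept need not cover every positive or exclude every negative.

def covers(concept, example):
    return all(lo <= x <= hi for (lo, hi), x in zip(concept, example))

def fitness(concept, positives, negatives):
    tp = sum(covers(concept, e) for e in positives)  # positives covered
    fp = sum(covers(concept, e) for e in negatives)  # negatives covered
    return tp - fp

# Hypothetical toy data over two attributes.
pos = [(1.0, 2.0), (1.5, 2.5), (0.9, 1.8)]
neg = [(5.0, 2.0), (1.2, 9.0)]
concept = [(0.5, 2.0), (1.0, 3.0)]
print(fitness(concept, pos, neg))   # 3: covers all positives, no negatives
```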

Using different learners in DLS. Although we used only one similarity-based learning algorithm (PLS1), we could use different learning algorithms on different samples. For instance, we could use PLS1 and ID3 simultaneously on different samples, and then decompose their results into rules before giving them to the genetic algorithm for refinement. Another important concept in inductive learning is that of bias: The representational language used by a learning algorithm constrains the search space of possible hypotheses. Since different algorithms use different biases (either implicitly or explicitly), they provide DLS with the unique approach of using multiple biases. This also has important implications for solution quality and efficiency: Before testing, we do not know which bias is suitable for the problem at hand, and the wrong bias can sometimes make learning inefficient or a concept hard to learn. Combining different algorithms can ensure that we avoid these problems.

Acknowledgments
I thank Michael Shaw for his encouragement and continued support during this work, and Selwyn Piramuthu for providing the results of C4.5 on the process-control data set. I would also like to thank the anonymous reviewers for their comments and suggestions.

References
1. T. Mitchell, "Version Spaces: A Candidate Elimination Approach to Rule Learning," Proc. Fifth Int'l Joint Conf. Artificial Intelligence, Morgan Kaufmann, San Mateo, Calif., 1977, pp. 305-310.
2. R.S. Michalski, "A Theory and Methodology of Inductive Learning," in Machine Learning, R. Michalski, J. Carbonell, and T. Mitchell, eds., Tioga Publishing, Palo Alto, Calif., 1983, pp. 83-134.
3. J.R. Quinlan, "Induction of Decision Trees," Machine Learning, Vol. 1, No. 1, 1986, pp. 81-106.
4. L.A. Rendell, "A General Framework for Induction and a Study of Selective Induction," Machine Learning, Vol. 1, No. 2, 1986, pp. 177-226.
5. R. Sikora and M. Shaw, "A Double-Layered Learning Approach to Acquiring Rules for Financial Classification," Tech. Report 90-1693, Bureau of Economic and Business Research, Univ. of Illinois at Urbana-Champaign, 1990.
6. L.A. Rendell, "Induction as Optimization," IEEE Trans. Systems, Man, and Cybernetics, Vol. 20, No. 2, 1990, pp. 326-338.
7. R. Sikora and M. Shaw, "The Distributed Learning System: A Group Problem-Solving Approach to Rule Learning," Tech. Report 91-0143, Bureau of Economic and Business Research, Univ. of Illinois at Urbana-Champaign, 1991.
8. B. Efron, The Jackknife, the Bootstrap, and Other Resampling Plans, SIAM, Philadelphia, Pa., 1982.
9. J.E. Baker, "Reducing Bias and Inefficiency in the Selection Algorithm," Proc. Second Int'l Conf. Genetic Algorithms, Lawrence Erlbaum, Hillsdale, N.J., 1987, pp. 14-21.
10. G. Syswerda, "Uniform Crossover in Genetic Algorithms," Proc. Third Int'l Conf. Genetic Algorithms, Morgan Kaufmann, San Mateo, Calif., 1989, pp. 2-9.
11. J.R. Koza, "Genetic Programming: A Paradigm for Genetically Breeding Populations of Computer Programs to Solve Problems," Tech. Report STAN-CS-90-1314, Dept. of Computer Science, Stanford Univ., Stanford, Calif., 1990.
12. J. Holland, Adaptation in Natural and Artificial Systems, Univ. of Michigan Press, Ann Arbor, Mich., 1975.

A distributed machine-learning method is useful when the language of the empirical inductive learner cannot represent a concept covering all the positive examples without covering any

Riyaz Sikora is a doctoral candidate in management information systems at the University of Illinois, Urbana-Champaign. His research interests include business applications of genetic algorithms, machine learning, intelligent manufacturing, and modeling of group decision making in distributed systems. After receiving his BS from Osmania University in India, he was a research assistant at Washington University. He is a member of The Institute of Management Sciences, ACM, and the Decision Sciences Institute. Readers can reach Sikora at the Beckman Institute, University of Illinois, 405 N. Matthews Avenue, Urbana, IL 61801, or by e-mail at sikora@uxh.cso.uiuc.edu.
