In Proceedings AI'92 (Adams & Sterling, Eds), 343-348, Singapore: World Scientific, 1992

LEARNING WITH CONTINUOUS CLASSES

J. R. QUINLAN
Basser Department of Computer Science, University of Sydney
Sydney, Australia 2006

ABSTRACT: Some empirical learning tasks are concerned with predicting
values rather than the more familiar categories. This paper describes a new
system, m5, that constructs tree-based piecewise linear models. Four case
studies are presented in which m5 is compared to other methods.

1. Introduction
One branch of machine learning, empirical learning, is concerned with building or
revising models in the light of large numbers of exemplary cases, taking into account
typical problems such as missing data and noise. Many such models involve classification
and, for these, learning algorithms that generate decision trees are efficient, robust, and
relatively simple [1, 7].

Other tasks, however, require the learned model to predict a numeric value associated
with a case rather than the class to which the case belongs. For instance, a case might be
the composition of an alloy and the associated value the temperature at which it melts.
Some researchers have attempted to use decision tree methods for value prediction by
dividing the value's range into small "categories" such as 0-4%, 5-9%, etc., then using
systems that build classification models. These attempts often fail, partly because algorithms
for building decision trees cannot make use of the implicit ordering of such classes.
Several more effective learning methods are available for predicting real values. The
cart program [1] builds regression trees that differ from decision trees only in having values
rather than classes at the leaves. mars [3], a recent and elegant system, constructs models
whose basis functions are splines. Many classical statistical methods such as multiple
linear regression address the same task, postulating a simple form for the model and then
finding parameter values that maximise its fit to the training data [6].
This paper describes m5, a new system for learning models that predict values. Like
cart, m5 builds tree-based models but, whereas regression trees have values at their
leaves, the trees constructed by m5 can have multivariate linear models; these model trees
are thus analogous to piecewise linear functions. m5 learns efficiently and can tackle
tasks with very high dimensionality (up to hundreds of attributes). This ability sets m5
(and cart) apart from mars, whose computational requirements grow very rapidly with
dimensionality, effectively limiting its applicability to tasks with no more than 20 or so
attributes. The advantage of m5 over cart is that model trees are generally much smaller
than regression trees and have proven more accurate in the tasks investigated.
This paper gives an overview of the algorithm for constructing model trees, then reports
the performance of m5 on four learning tasks.
2. Constructing Model Trees
We suppose that we have a collection T of training cases. Each case is specified by its
values of a fixed set of attributes, either discrete or numeric, and has an associated target
value. The aim is to construct a model that relates the target values of the training cases
to their values of the other attributes. The worth of the model will generally be measured
by the accuracy with which it predicts the target values of unseen cases.
Tree-based models are constructed by the divide-and-conquer method. The set T is
either associated with a leaf, or some test is chosen that splits T into subsets corresponding
to the test outcomes and the same process is applied recursively to the subsets. This
relentless division often produces over-elaborate structures that must be pruned back, for
instance by replacing a subtree with a leaf.
The first step in building a model tree is to compute the standard deviation of the target
values of cases in T. Unless T contains very few cases or their values vary only slightly,
T is split on the outcomes of a test. Every potential test is evaluated by determining the
subset of cases associated with each outcome; let $T_i$ denote the subset of cases that have
the $i$th outcome of the potential test. If we treat the standard deviation $sd(T_i)$ of the
target values of cases in $T_i$ as a measure of error, the expected reduction in error as a
result of this test can be written

$$\Delta \mathit{error} = sd(T) - \sum_i \frac{|T_i|}{|T|} \times sd(T_i).$$
After examining all possible tests, m5 chooses one that maximises this expected error
reduction. (For comparison, cart chooses a test to give the greatest expected reduction
in either variance or absolute deviation.)
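To make the split criterion concrete, here is a minimal Python sketch (an illustration, not m5's code); `cases` is assumed to be a list of (attributes, target) pairs, and each candidate test is assumed to come paired with the partition of `cases` it induces:

```python
import statistics

def sd(cases):
    """Standard deviation of the target values of a set of cases."""
    targets = [target for _, target in cases]
    return statistics.pstdev(targets) if len(targets) > 1 else 0.0

def error_reduction(cases, subsets):
    """Expected error reduction of a test splitting `cases` into `subsets`:
    sd(T) - sum_i |T_i|/|T| * sd(T_i)."""
    n = len(cases)
    return sd(cases) - sum(len(s) / n * sd(s) for s in subsets)

def best_test(cases, candidate_tests):
    """Pick the (test, subsets) pair that maximises expected error reduction."""
    return max(candidate_tests, key=lambda ts: error_reduction(cases, ts[1]))
```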
The major innovations of m5 come into play after the initial tree has been grown. A
detailed discussion is precluded by the length of this paper, but the main ideas are:
Error estimates : m5 often needs to estimate the accuracy of a model on unseen cases.
First, the residual of a model on a case is just the absolute difference between the actual
target value of the case and the value predicted by the model. To estimate the error of a
model derived from a set of training cases, m5 first determines the average residual of the
model on these cases. This will generally underestimate the error on unseen cases, so m5
multiplies the value by $(n + \nu)/(n - \nu)$, where $n$ is the number of training cases
and $\nu$ is the number of parameters in the model. The effect is to increase the estimated
error of models with many parameters constructed from small numbers of cases.
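A sketch of this estimate (the function name is ours, not m5's):

```python
def estimated_error(residuals, n_params):
    """Average absolute residual, inflated by (n + v)/(n - v) to approximate
    the error on unseen cases; n = number of training cases, v = number of
    parameters in the model."""
    n = len(residuals)
    average = sum(abs(r) for r in residuals) / n
    if n <= n_params:
        return float('inf')  # too few cases to support this many parameters
    return average * (n + n_params) / (n - n_params)
```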
Linear models : A multivariate linear model is constructed for the cases at each node
of the model tree using standard regression techniques [6]. However, instead of using all
attributes, this model is restricted to the attributes that are referenced by tests or linear
models somewhere in the subtree at this node. As m5 will compare the accuracy of a
linear model with the accuracy of a subtree, this ensures a level playing field in which the
two types of models use the same information.
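A minimal sketch of such a restricted fit, using ordinary least squares via numpy (illustrative, not the paper's implementation):

```python
import numpy as np

def fit_restricted_model(X, y, used_columns):
    """Least-squares linear model over only the attributes referenced in the
    subtree (`used_columns`), plus an intercept term."""
    cols = sorted(used_columns)
    design = np.column_stack([X[:, cols], np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef  # one weight per used attribute (sorted order), intercept last
```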
Simplification of linear models : After each linear model is obtained as above, it is simplified
by eliminating parameters to minimise its estimated error. Even though the elimination
of parameters generally causes the average residual to increase, it also reduces the
multiplicative factor above, so the estimated error can decrease. m5 uses a greedy search
to remove variables that contribute little to the model; in some cases, m5 removes all
variables, leaving only a constant.
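Continuing the sketch above (reusing fit_restricted_model and estimated_error), one greedy backward elimination under these assumptions might look like this:

```python
def model_error(cols, X, y):
    """Fit on `cols` and return the estimated error, counting one parameter
    per attribute plus the intercept."""
    coef = fit_restricted_model(X, y, cols)
    design = np.column_stack([X[:, sorted(cols)], np.ones(len(y))])
    return estimated_error(list(y - design @ coef), len(cols) + 1)

def simplify(cols, X, y):
    """Greedily drop attributes while the estimated error does not increase;
    may remove every attribute, leaving only a constant model."""
    cols = set(cols)
    best = model_error(cols, X, y)
    improved = True
    while improved and cols:
        improved = False
        for c in sorted(cols):
            err = model_error(cols - {c}, X, y)
            if err <= best:
                cols, best, improved = cols - {c}, err, True
                break
    return cols
```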
Pruning : Each non-leaf node of the model tree is examined, starting near the bottom.
m5 selects as the final model for this node either the simplified linear model above or the
model subtree, depending on which has the lower estimated error. If the linear model is
chosen, the subtree at this node is pruned to a leaf.
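A sketch of that pruning pass over a hypothetical node structure (the fields are ours, for illustration only):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    linear_error: float            # estimated error of the simplified linear model here
    subtree_error: float = 0.0     # estimated error of the model subtree below
    children: list = field(default_factory=list)

def prune(node: Node) -> Node:
    """Bottom-up: keep whichever of the simplified linear model or the model
    subtree has the lower estimated error; prune to a leaf if the model wins."""
    node.children = [prune(child) for child in node.children]
    if node.children and node.linear_error <= node.subtree_error:
        node.children = []         # subtree replaced by a leaf
    return node
```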
Smoothing : Pregibon [5] observes that the prediction accuracy of tree-based models can be
improved by a smoothing process. When the value of a case is predicted by a model tree,
the value given by the model at the appropriate leaf is adjusted to reflect the predicted
values at nodes along the path from the root to that leaf. The form of smoothing used by
m5 differs from that developed by Pregibon, but the motivation is similar. m5's smoothed
predicted value is backed up from the leaf to the root as follows:
• The predicted value at the leaf is the value computed by the model at that leaf.

• If the case follows branch $S_i$ of subtree $S$, let $n_i$ be the number of training cases at
$S_i$, $PV(S_i)$ the predicted value at $S_i$, and $M(S)$ the value given by the model at $S$.
The predicted value backed up to $S$ is

$$PV(S) = \frac{n_i \times PV(S_i) + k \times M(S)}{n_i + k}$$

where $k$ is a smoothing constant (default 15).


Smoothing has most effect on a case when the models along the path predict very different
values and when some models were constructed from few training cases.
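A sketch of the back-up computation; `path_models` is assumed to hold one prediction function per node on the root-to-leaf path (root first) and `path_counts` the number of training cases at each of those nodes (both names are illustrative):

```python
def smoothed_prediction(path_models, path_counts, case, k=15.0):
    """Back the leaf's value up to the root:
    PV(S) = (n_i * PV(S_i) + k * M(S)) / (n_i + k),
    where n_i counts training cases at the child S_i on the case's path."""
    pv = path_models[-1](case)                     # value from the leaf model
    for model, n_child in zip(reversed(path_models[:-1]),
                              reversed(path_counts[1:])):
        pv = (n_child * pv + k * model(case)) / (n_child + k)
    return pv
```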
3. Case Studies
m5 has been evaluated on several learning tasks for which results of other methods
are also available. When comparing approaches, a common measure of performance is
relative error, the ratio of the variance of the residuals to the variance of the target values
themselves. Another useful statistic is the correlation between actual and predicted values
(although a correlation coefficient of 1 indicates only that there is a linear relationship
between actual and predicted values). Some authors report percentage deviation, the
average over the cases of the ratio of the residual to the target value; this is not useful if
the target values include 0!
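The three measures above, sketched in Python (statistics.correlation needs Python 3.10 or later):

```python
import statistics

def relative_error(actual, predicted):
    """Variance of the residuals over variance of the target values."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return statistics.pvariance(residuals) / statistics.pvariance(actual)

def correlation(actual, predicted):
    """Pearson correlation between actual and predicted values."""
    return statistics.correlation(actual, predicted)

def percentage_deviation(actual, predicted):
    """Average |residual / target|; breaks down when a target value is 0."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)
```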
In most trials, m5's performance was assessed in a 10-way cross-validation in which
the available data was divided into ten equal-sized blocks. For each block in turn, a model
was constructed using only cases in the remaining nine blocks, then tested on cases in
the "hold-out" block. Each case was thus tested once using a model constructed without
reference to the case.
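The procedure is the standard one; a generic sketch, with `build` and `score` as assumed callables:

```python
import random

def cross_validate(build, score, cases, n_folds=10):
    """Divide `cases` into `n_folds` blocks; for each block, build a model
    on the other blocks and score it on the held-out block, so every case
    is tested once by a model constructed without reference to it."""
    indices = list(range(len(cases)))
    random.shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    results = []
    for hold_out in folds:
        held = set(hold_out)
        train = [cases[i] for i in indices if i not in held]
        model = build(train)
        results.append(score(model, [cases[i] for i in hold_out]))
    return results
```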
All m5 results reported below were obtained using the same default values of parame-
ters such as the minimum number of cases that can be split and the smoothing constant.
To show the effects of forming models at the leaves and of smoothing, results are also
given with these features disabled.
CHMIN ≤ 7 : RM0
CHMIN > 7 :
|   MMAX > 24000 : RM1
|   MMAX ≤ 24000 :
|   |   CACH ≤ 48 : RM2
|   |   CACH > 48 : average 217.5

Figure 1: Model tree for CPU performance. (In the original figure the linear
models RM0-RM2 are listed beside the tree, with coefficients over cycle time,
minimum and maximum memory, cache size, and channels; the coefficient table
is not reliably recoverable here.)

                      Original Attributes        Transformed Attributes
                    Correlation  Percentage    Correlation  Percentage
                                 Deviation                  Deviation
Ein-Dor (retrial)       --           --            .966        33.9%
ibp                     --         35.0%            --         33.0%
m5                     .921        34.9%           .956        34.0%
m5 (no smoothing)      .908        37.2%           .957        33.9%
m5 (no models)         .803        49.9%           .853        48.6%

Table 1: CPU performance data

3.1 CPU Performance


Ein-Dor and Feldmesser [2] carried out experiments to predict the relative performance
of new CPUs as a function of six easily-determined variables: cache size, cycle time,
minimum and maximum channels, and minimum and maximum memory. They did not
use the attributes directly, but developed from them three composite attributes. Then,
using 209 cases, they constructed a linear model of the square root of performance as a
function of the three transformed attributes.
Kibler, Aha and Albert [4] studied this same data using an instance-based approach. ibp
predicts a case's target value by interpolating from known values of similar cases. Results
were reported for both the six original and three transformed attributes. ibp gave better
predictive accuracy than Ein-Dor and Feldmesser's model without the need to resort to
indirect methods such as predicting the square root of the target value.
When m5 was run on the 209 cases using the original attributes, it produced the
compact model tree of Figure 1. All results with the original and transformed datasets
are shown in Table 1. Note that reported accuracies for Ein-Dor and Feldmesser are for
retrial performance on the training data from which the model was derived; their model
would presumably give worse results on unseen cases. Predictive accuracies for ibp and
m5 shown in the table were obtained by cross-validation (n-way for ibp and 10-way for
m5). The accuracy of the models developed by m5 was comparable to that of ibp, but
performance degraded on the original data when smoothing was disabled, and (markedly)
on both datasets when the system was prevented from forming models at leaves.
                    Correlation  Percentage  Relative
                                 Deviation    Error
ibp                     --         11.8%        --
m5                     .916        12.7%      16.1%
m5 (no smoothing)      .885        14.2%      23.2%
m5 (no models)         .857        15.2%      27.6%

Table 2: Car price data

3.2 Car Prices


Kibler, Aha and Albert also evaluated ibp on a collection of 159 car descriptions. Each
case has values for 16 attributes such as number of doors and engine size, with the target
value being the car price. This task is considerably more challenging than the previous
one, since it has higher dimensionality and yet there are fewer training cases. Table 2
shows results for both ibp and m5 on this data; m5 gave a higher percentage deviation
than ibp, but the contributions of smoothing and linear models are both apparent.
3.3 An Artificial Dataset
Breiman et al. [1] reported regression tree results for a small dataset affected by noise.
In this study there are ten variables, $X_1, \ldots, X_{10}$; $X_1$ takes values -1 or 1 and
the others -1, 0, or 1, all values being equiprobable. The generating function is

$$\text{target value} = \begin{cases} 3 + 3X_2 + 2X_3 + X_4 + \epsilon & \text{if } X_1 = 1 \\ -3 + 3X_5 + 2X_6 + X_7 + \epsilon & \text{if } X_1 = -1 \end{cases}$$

where $\epsilon$ is a random Gaussian variable with variance 2, representing noise. From 200
training cases, cart constructed a regression tree with 13 leaves. When given a similar
dataset, m5 produced a model tree with just two leaves (as we would hope). Table 3
compares published results from cart with experimental results from m5, both obtained
by a 10-way cross-validation. The better performance of m5 on this data demonstrates
clearly the benefit of allowing linear models, rather than just averages, at leaves.
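For reference, the generating process just described is easy to reproduce (a sketch; the noise term uses standard deviation sqrt(2), giving variance 2):

```python
import random

def make_case():
    """One case from Breiman et al.'s generating function: X1 in {-1, 1},
    X2..X10 in {-1, 0, 1}, all values equiprobable."""
    x = [random.choice([-1, 1])] + [random.choice([-1, 0, 1]) for _ in range(9)]
    eps = random.gauss(0.0, 2 ** 0.5)          # Gaussian noise, variance 2
    if x[0] == 1:                              # x[0] is X1, x[1] is X2, ...
        target = 3 + 3 * x[1] + 2 * x[2] + x[3] + eps
    else:
        target = -3 + 3 * x[4] + 2 * x[5] + x[6] + eps
    return x, target

training_set = [make_case() for _ in range(200)]
```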
3.4 Activity of Drugs
The last example concerns a very high-dimensional real-world task: predicting the
activity levels of LHRH peptides (strings of 10 amino acids). Data provided by Arris
Pharmaceutical consists of 563 cases, each of which records the values of 128 attributes
and a measured activity.

                    Correlation  Relative
                                  Error
cart                    --          17%
m5                     .951         9.5%
m5 (no smoothing)      .955         8.9%
m5 (no models)         .914        16.6%

Table 3: An artificial task

                    Correlation  Relative
                                  Error
m5                     .887        21.4%
m5 (no smoothing)      .872        24.1%
m5 (no models)         .856        27.0%
linear regression      .805        38.5%

Table 4: Drug activity data
On a cross-validation trial, the model trees constructed by m5, with an average of
9.6 leaves, were still quite simple. The results in Table 4 show the contribution of both
smoothing and linear models. For comparison, the Table also shows the performance of a
standard multivariate linear model using the same cross-validation runs.
4. Conclusion
Model trees, like regression trees, can be learned efficiently from large datasets: the
model tree for the car price data was constructed in less than a second, while that for
the LHRH data (563 cases × 128 attributes) still required less than a minute on my
DECstation 5000. However, model trees have advantages over regression trees in both
compactness and prediction accuracy, attributable to the ability of model trees to exploit
local linearity in the data. There is one other noteworthy difference: regression trees will
never give a predicted value lying outside the range observed in the training cases, whereas
model trees can extrapolate. (Whether this last is an advantage or a cause for concern is
still an open question!)
m5 could be strengthened in many respects. I am exploring alternative heuristics to
guide the selection of tests at internal nodes and different smoothing algorithms. Further,
the models at the leaves could be generalised to allow non-linear functions of the attributes,
albeit with an increase in computation.
Acknowledgements
Thanks to Arris Pharmaceutical of San Francisco for permission to use the LHRH
data. This research was supported by a grant from the Australian Research Council and
by a research agreement with Digital Equipment Corporation.
References
1. Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J., Classification and Regres-
sion Trees (Wadsworth, Belmont CA, 1984).
2. Ein-Dor, P. and Feldmesser, J., Attributes of the performance of central processing
units: a relative performance prediction model, Communications ACM 30, 4 (1987)
308-317.
3. Friedman, J.H., Multivariate Adaptive Regression Splines, Technical Report 102, Lab-
oratory for Computational Statistics, Stanford University, Stanford CA (1988).
4. Kibler, D., Aha, D.W., and Albert, M.K., Instance-based prediction of real-valued
attributes, Technical Report 88-07, ICS, University of California, Irvine CA (1988).
5. Pregibon, D., private communications, 1989, 1992.
6. Press, W.H., Flannery, B.P., Teukolsky, S.A., and Vetterling, W.T., Numerical Recipes
in C (Cambridge University Press, 1988).
7. Quinlan, J.R., C4.5: Programs for Machine Learning (Morgan Kaufmann, San Mateo
CA, 1992).
