Learning With Continuous Classes: in Proceedings AI'92 (Adams & Sterling, Eds), 343-348, Singapore: World Scientific, 1992
J. R. QUINLAN
Basser Department of Computer Science, University of Sydney
Sydney, Australia 2006
1. Introduction
One branch of machine learning, empirical learning, is concerned with building or
revising models in the light of large numbers of exemplary cases, taking into account
typical problems such as missing data and noise. Many such models involve classification
and, for these, learning algorithms that generate decision trees are efficient, robust, and
relatively simple [1,7].
Other tasks, however, require the learned model to predict a numeric value associated
with a case rather than the class to which the case belongs. For instance, a case might be
the composition of an alloy and the associated value the temperature at which it melts.
Some researchers have attempted to use decision tree methods for value prediction by
dividing the value's range into small "categories" such as 0-4%, 5-9%, etc., then using
systems that build classification models. These attempts often fail, partly because
algorithms for building decision trees cannot make use of the implicit ordering of such classes.
Several more effective learning methods are available for predicting real values. The
cart program [1] builds regression trees that differ from decision trees only in having values
rather than classes at the leaves. mars [3], a recent and elegant system, constructs models
whose basis functions are splines. Many classical statistical methods such as multiple
linear regression address the same task, postulating a simple form for the model and then
finding parameter values that maximise its fit to the training data [6].
This paper describes m5, a new system for learning models that predict values. Like
cart, m5 builds tree-based models but, whereas regression trees have values at their
leaves, the trees constructed by m5 can have multivariate linear models; these model trees
are thus analogous to piecewise linear functions. m5 learns efficiently and can tackle
tasks with very high dimensionality (up to hundreds of attributes). This ability sets m5
(and cart) apart from mars, whose computational requirements grow very rapidly with
dimensionality, effectively limiting its applicability to tasks with no more than 20 or so
attributes. The advantage of m5 over cart is that model trees are generally much smaller
than regression trees and have proven more accurate in the tasks investigated.
This paper gives an overview of the algorithm for constructing model trees, then reports
the performance of m5 on four learning tasks.
2. Constructing Model Trees
We suppose that we have a collection T of training cases. Each case is specified by its
values of a fixed set of attributes, either discrete or numeric, and has an associated target
value. The aim is to construct a model that relates the target values of the training cases
to their values of the other attributes. The worth of the model will generally be measured
by the accuracy with which it predicts the target values of unseen cases.
Tree-based models are constructed by the divide-and-conquer method. The set T is
either associated with a leaf, or some test is chosen that splits T into subsets corresponding
to the test outcomes and the same process is applied recursively to the subsets. This
relentless division often produces over-elaborate structures that must be pruned back, for
instance by replacing a subtree with a leaf.
The first step in building a model tree is to compute the standard deviation of the target
values of cases in T. Unless T contains very few cases or their values vary only slightly,
T is split on the outcomes of a test. Every potential test is evaluated by determining the
subset of cases associated with each outcome; let T_i denote the subset of cases that have
the ith outcome of the potential test. If we treat the standard deviation sd(T_i) of the
target values of cases in T_i as a measure of error, the expected reduction in error as a
result of this test can be written

    expected error reduction = sd(T) - sum_i ( |T_i| / |T| ) * sd(T_i)
After examining all possible tests, m5 chooses one that maximises this expected error
reduction. (For comparison, cart chooses a test to give the greatest expected reduction
in either variance or absolute deviation.)
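The splitting criterion just described can be sketched in a few lines. This is an illustrative reconstruction, not m5's code: the function names and the toy data are mine, and cases are represented simply as (attribute-values, target) pairs.

```python
import statistics

def sd(cases):
    """Standard deviation of the target values of a set of cases."""
    targets = [target for _, target in cases]
    return statistics.pstdev(targets) if len(targets) > 1 else 0.0

def error_reduction(cases, subsets):
    """Expected error reduction for one candidate test.

    `cases` is the full set T at this node; `subsets` is the partition
    T_i induced by the test's outcomes.  Following the criterion in the
    text: sd(T) minus the size-weighted sum of the sd(T_i).
    """
    n = len(cases)
    return sd(cases) - sum(len(Ti) / n * sd(Ti) for Ti in subsets)

# Toy illustration (hypothetical data): a test that separates low from
# high target values yields a large expected error reduction.
T = [({}, 1.0), ({}, 1.2), ({}, 9.8), ({}, 10.0)]
split = [T[:2], T[2:]]
print(error_reduction(T, split))
```

A test that mixed low and high targets in each subset would leave the weighted standard deviations close to sd(T), giving a reduction near zero, so the search naturally prefers tests that separate the target values.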
The major innovations of m5 come into play after the initial tree has been grown. A
detailed discussion is precluded by the length of this paper, but the main ideas are:
Error estimates : m5 often needs to estimate the accuracy of a model on unseen cases.
First, the residual of a model on a case is just the absolute difference between the actual
target value of the case and the value predicted by the model. To estimate the error of a
model derived from a set of training cases, m5 first determines the average residual of the
model on these cases. This will generally underestimate the error on unseen cases, so m5
multiplies the value by (n + v)/(n - v), where n is the number of training cases and v is
the number of parameters in the model. The effect is to increase the estimated error of
models with many parameters constructed from small numbers of cases.
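The pessimistic correction above is simple enough to state directly in code. A minimal sketch (the function name is mine, not m5's):

```python
def estimated_error(avg_residual, n_cases, n_params):
    """Pessimistic error estimate described in the text: the average
    residual on the training cases, inflated by (n + v) / (n - v),
    where n is the number of training cases and v the number of
    parameters in the model."""
    return avg_residual * (n_cases + n_params) / (n_cases - n_params)

# A 3-parameter model fitted to 10 cases is penalised far more heavily
# than the same model fitted to 100 cases (toy numbers):
print(estimated_error(2.0, 10, 3))   # 2.0 * 13/7
print(estimated_error(2.0, 100, 3))  # 2.0 * 103/97
```

As n grows with v fixed, the multiplier tends to 1, so the estimate converges to the raw average residual; as v approaches n it blows up, penalising models with nearly as many parameters as cases.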
Linear models : A multivariate linear model is constructed for the cases at each node
of the model tree using standard regression techniques [6]. However, instead of using all
attributes, this model is restricted to the attributes that are referenced by tests or linear
models somewhere in the subtree at this node. As m5 will compare the accuracy of a
linear model with the accuracy of a subtree, this ensures a level playing field in which the
two types of models use the same information.
Simplification of linear models : After each linear model is obtained as above, it is simplified
by eliminating parameters to minimise its estimated error. Even though the elimination
of parameters generally causes the average residual to increase, it also reduces the mul-
tiplicative factor above, so the estimated error can decrease. m5 uses a greedy search
to remove variables that contribute little to the model; in some cases, m5 removes all
variables, leaving only a constant.
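The greedy search can be sketched as follows. This is a reconstruction under stated assumptions: `residual_fn(params)` stands for a helper that refits the linear model on the given attributes and returns its average residual, and the +1 for the constant term is my bookkeeping, not a detail from the paper.

```python
def simplify(model_params, residual_fn, n_cases):
    """Greedy simplification in the spirit of the text: repeatedly drop
    the attribute whose removal most reduces the estimated error, and
    stop when no removal improves it.

    `model_params` is a list of attribute names in the current linear
    model; `residual_fn(params)` is an assumed helper that refits the
    model on those attributes and returns its average residual.
    """
    def est(params):
        v = len(params) + 1  # parameters: one coefficient each, plus the constant
        return residual_fn(params) * (n_cases + v) / (n_cases - v)

    params = list(model_params)
    best = est(params)
    improved = True
    while improved and params:
        improved = False
        for p in list(params):
            trial = [q for q in params if q != p]
            e = est(trial)
            if e < best:  # removal lowers the pessimistic estimate
                best, params, improved = e, trial, True
    return params

# Toy residual function (hypothetical): dropping attributes only
# slightly worsens the fit, so the inflation factor dominates and
# every attribute is eliminated, leaving a constant-only model.
print(simplify(["a", "b", "c"], lambda ps: 1.0 + 0.05 * (3 - len(ps)), 10))
```

With a residual function that degrades sharply on removal, the same search keeps all attributes; the balance between the residual and the (n + v)/(n - v) factor decides.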
Pruning : Each non-leaf node of the model tree is examined, starting near the bottom.
m5 selects as the final model for this node either the simplified linear model above or the
model subtree, depending on which has the lower estimated error. If the linear model is
chosen, the subtree at this node is pruned to a leaf.
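A bottom-up pruning pass of this kind can be sketched as below. The node layout and field names are assumptions for the sketch, as is the choice of a case-weighted average for the subtree's estimated error; the paper's text specifies only that the comparison is by estimated error.

```python
class Node:
    """Minimal model-tree node for this sketch (layout assumed)."""
    def __init__(self, model_error, n_cases, children=()):
        self.model_error = model_error  # estimated error of the simplified linear model here
        self.n_cases = n_cases          # training cases reaching this node
        self.children = list(children)  # empty for a leaf

def prune(node):
    """Return the estimated error of the final model at `node`, pruning
    the subtree to a leaf whenever the node's own linear model is at
    least as accurate as the (already pruned) subtree below it."""
    if not node.children:
        return node.model_error
    # Estimated error of the subtree: case-weighted average over children.
    subtree_error = sum(prune(c) * c.n_cases for c in node.children) / node.n_cases
    if node.model_error <= subtree_error:
        node.children = []  # prune: keep only the linear model
        return node.model_error
    return subtree_error

# Toy tree (hypothetical numbers): the root's linear model (error 2.0)
# beats the weighted subtree error (1.5*10 + 3.0*10)/20 = 2.25, so the
# subtree is pruned to a leaf.
root = Node(2.0, 20, [Node(1.5, 10), Node(3.0, 10)])
print(prune(root), root.children)
```

Because the recursion prunes children first, a deep subtree is judged only after its own branches have been replaced by their best models, which is what makes the bottom-up order matter.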
Smoothing : Pregibon [5] observes that the prediction accuracy of tree-based models can be
improved by a smoothing process. When the value of a case is predicted by a model tree,
the value given by the model at the appropriate leaf is adjusted to reflect the predicted
values at nodes along the path from the root to that leaf. The form of smoothing used by
m5 differs from that developed by Pregibon, but the motivation is similar. m5's smoothed
predicted value is backed up from the leaf to the root as follows:
- The predicted value at the leaf is the value computed by the model at that leaf.
- If the case follows branch S_i of subtree S, let n_i be the number of training cases at
  S_i, PV(S_i) the predicted value backed up to S_i, and M(S) the value given by the
  model at S. The predicted value backed up to S is

      PV(S) = ( n_i * PV(S_i) + k * M(S) ) / ( n_i + k )

  where k is a smoothing constant.
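The back-up rule of this form is easy to demonstrate. A sketch under stated assumptions: the path representation and the value k = 15 are mine for illustration, not taken from the paper.

```python
def smoothed_prediction(leaf_value, path, k=15.0):
    """Back a prediction up from leaf to root with the rule
    PV(S) = (n_i * PV(S_i) + k * M(S)) / (n_i + k).

    `leaf_value` is the value computed by the model at the leaf.
    `path` lists (n_i, node_value) pairs from the leaf's parent up to
    the root: n_i is the number of training cases on the branch the
    case followed, node_value the prediction of the linear model at
    that internal node.  k is a smoothing constant (value assumed).
    """
    pv = leaf_value
    for n_i, node_value in path:
        pv = (n_i * pv + k * node_value) / (n_i + k)
    return pv

# Toy numbers: a leaf supported by only 5 training cases is pulled
# strongly toward its parent's prediction, while a leaf supported by
# 500 cases is barely adjusted.
print(smoothed_prediction(10.0, [(5, 4.0)]))    # (5*10 + 15*4) / 20
print(smoothed_prediction(10.0, [(500, 4.0)]))
```

The effect is that sparsely supported leaves borrow strength from the larger models above them, while well-populated leaves keep predictions close to their own linear model.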