
AI Magazine Volume 18 Number 4 (1997) (© AAAI)


Machine-Learning Research
Four Current Directions

Thomas G. Dietterich

Copyright © 1997, American Association for Artificial Intelligence. All rights reserved.

■ Machine-learning research has been making great progress in many directions. This article summarizes four of these directions and discusses some current open problems. The four directions are (1) the improvement of classification accuracy by learning ensembles of classifiers, (2) methods for scaling up supervised learning algorithms, (3) reinforcement learning, and (4) the learning of complex stochastic models.

The last five years have seen an explosion in machine-learning research. This explosion has many causes: First, separate research communities in symbolic machine learning, computational learning theory, neural networks, statistics, and pattern recognition have discovered one another and begun to work together. Second, machine-learning techniques are being applied to new kinds of problems, including knowledge discovery in databases, language processing, robot control, and combinatorial optimization, as well as to more traditional problems such as speech recognition, face recognition, handwriting recognition, medical data analysis, and game playing.

In this article, I selected four topics within machine learning where there has been a lot of recent activity. The purpose of the article is to describe the results in these areas to a broader AI audience and to sketch some of the open research problems. The topic areas are (1) ensembles of classifiers, (2) methods for scaling up supervised learning algorithms, (3) reinforcement learning, and (4) the learning of complex stochastic models.

The reader should be cautioned that this article is not a comprehensive review of each of these topics. Rather, my goal is to provide a representative sample of the research in each of these four areas. In each of the areas, there are many other papers that describe relevant work. I apologize to those authors whose work I was unable to include in the article.

Ensembles of Classifiers

The first topic concerns methods for improving accuracy in supervised learning. I begin by introducing some notation. In supervised learning, a learning program is given training examples of the form {(x1, y1), ..., (xm, ym)} for some unknown function y = f(x). The xi values are typically vectors of the form <xi,1, xi,2, ..., xi,n> whose components are discrete or real valued, such as height, weight, color, and age. These are also called the features of xi. I use the notation xij to refer to the jth feature of xi. In some situations, I drop the i subscript when it is implied by the context.

The y values are typically drawn from a discrete set of classes {1, ..., K} in the case of classification or from the real line in the case of regression. In this article, I focus primarily on classification. The training examples might be corrupted by some random noise.

Given a set S of training examples, a learning algorithm outputs a classifier. The classifier is a hypothesis about the true function f. Given new x values, it predicts the corresponding y values. I denote classifiers by h1, ..., hL.

An ensemble of classifiers is a set of classifiers whose individual decisions are combined in some way (typically by weighted or unweighted voting) to classify new examples. One of the most active areas of research in supervised learning has been the study of methods for constructing good ensembles of classifiers. The main discovery is that ensembles are often much more accurate than the individual classifiers that make them up.

An ensemble can be more accurate than its component classifiers only if the individual classifiers disagree with one another (Hansen and Salamon 1990). To see why, imagine that we have an ensemble of three classifiers {h1, h2, h3} and consider a new case x.
If the three classifiers are identical, then when h1(x) is wrong, h2(x) and h3(x) are also wrong. However, if the errors made by the classifiers are uncorrelated, then when h1(x) is wrong, h2(x) and h3(x) might be correct, so that a majority vote correctly classifies x. More precisely, if the error rates of L hypotheses hl are all equal to p < 1/2 and if the errors are independent, then the probability that the majority vote is wrong is the area under the binomial distribution where more than L/2 hypotheses are wrong. Figure 1 shows this area for a simulated ensemble of 21 hypotheses, each having an error rate of 0.3. The area under the curve for 11 or more hypotheses being simultaneously wrong is 0.026, which is much less than the error rate of the individual hypotheses.

Figure 1. The Probability That Exactly l (of 21) Hypotheses Will Make an Error, Assuming Each Hypothesis Has an Error Rate of 0.3 and Makes Its Errors Independently of the Other Hypotheses.
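The 0.026 figure can be checked directly from the binomial distribution. The short Python sketch below (illustrative only, not from the article) sums the probability that 11 or more of 21 independent classifiers, each with error rate 0.3, are wrong at the same time.

import math

def majority_vote_error(n_classifiers=21, error_rate=0.3):
    # Probability that more than half of the classifiers err simultaneously,
    # assuming each errs independently with the same error rate.
    return sum(
        math.comb(n_classifiers, k)
        * error_rate ** k
        * (1 - error_rate) ** (n_classifiers - k)
        for k in range(n_classifiers // 2 + 1, n_classifiers + 1)
    )

print(round(majority_vote_error(), 3))  # prints 0.026, matching the text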
Of course, if the individual hypotheses make uncorrelated errors at rates exceeding 0.5, then the error rate of the voted ensemble increases as a result of the voting. Hence, the key to successful ensemble methods is to construct individual classifiers with error rates below 0.5 whose errors are at least somewhat uncorrelated.

Methods for Constructing Ensembles

Many methods for constructing ensembles have been developed. Some methods are general, and they can be applied to any learning algorithm. Other methods are specific to particular algorithms. I begin by reviewing the general techniques.

Subsampling the Training Examples

The first method manipulates the training examples to generate multiple hypotheses. The learning algorithm is run several times, each time with a different subset of the training examples. This technique works especially well for unstable learning algorithms—algorithms whose output classifier undergoes major changes in response to small changes in the training data. Decision tree, neural network, and rule-learning algorithms are all unstable. Linear-regression, nearest-neighbor, and linear-threshold algorithms are generally stable.

The most straightforward way of manipulating the training set is called bagging. On each run, bagging presents the learning algorithm with a training set that consists of a sample of m training examples drawn randomly with replacement from the original training set of m items. Such a training set is called a bootstrap replicate of the original training set, and the technique is called bootstrap aggregation (Breiman 1996a). Each bootstrap replicate contains, on the average, 63.2 percent of the original training set, with several training examples appearing multiple times.
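A minimal sketch of the bagging procedure follows. It assumes a generic base learning algorithm, written here as the placeholder learn, and builds each classifier from a bootstrap replicate of the m training examples.

import random

def bagging(examples, learn, n_classifiers=50):
    # Train an ensemble on bootstrap replicates of the training set.
    ensemble = []
    m = len(examples)
    for _ in range(n_classifiers):
        # Draw m examples with replacement: a bootstrap replicate containing,
        # on average, about 63.2 percent of the distinct original examples.
        replicate = [random.choice(examples) for _ in range(m)]
        ensemble.append(learn(replicate))
    return ensemble

def bagged_vote(ensemble, x):
    # Classify a new example by unweighted voting.
    predictions = [h(x) for h in ensemble]
    return max(set(predictions), key=predictions.count)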
Another training-set sampling method is to construct the training sets by leaving out disjoint subsets of the training data. For example, the training set can be divided randomly into 10 disjoint subsets. Then, 10 overlapping training sets can be constructed by dropping out a different one of these 10 subsets. This same procedure is used to construct training sets for tenfold cross-validation; so, ensembles constructed in this way are sometimes called cross-validated committees (Parmanto, Munro, and Doyle 1996).

The third method for manipulating the training set is illustrated by the ADABOOST algorithm, developed by Freund and Schapire (1996, 1995) and shown in figure 2. Like bagging, ADABOOST manipulates the training examples to generate multiple hypotheses. ADABOOST maintains a probability distribution pl(x) over the training examples. In each iteration l, it draws a training set of size m by sampling with replacement according to the probability distribution pl(x). The learning algorithm is then applied to produce a classifier hl. The error rate εl of this classifier on the training examples (weighted according to pl(x)) is computed and used to adjust the probability distribution on the training examples. (In figure 2, note that the probability distribution is obtained by normalizing a set of weights wl(i) over the training examples.)
The effect of the change in weights is to place more weight on training examples that were misclassified by hl and less weight on examples that were correctly classified. In subsequent iterations, therefore, ADABOOST constructs progressively more difficult learning problems.

The final classifier, hf, is constructed by a weighted vote of the individual classifiers. Each classifier is weighted according to its accuracy for the distribution pl that it was trained on.

Figure 2. The ADABOOST.M1 Algorithm. The formula [[E]] is 1 if E is true and 0 otherwise.

Input: a set S of m labeled examples S = {(xi, yi), i = 1, 2, ..., m} with labels yi ∈ Y = {1, ..., K}; Learn (a learning algorithm); a constant L.
[1]  initialize for all i: w1(i) := 1/m                        (initialize the weights)
[2]  for l = 1 to L do
[3]    for all i: pl(i) := wl(i) / Σi wl(i)                    (compute normalized weights)
[4]    hl := Learn(pl)                                         (call Learn with normalized weights)
[5]    εl := Σi pl(i) [[hl(xi) ≠ yi]]                          (calculate the error of hl)
[7]    if εl > 1/2 then
[8]      L := l − 1
[9]      goto [13]
[10]   βl := εl / (1 − εl)
[11]   for all i: wl+1(i) := wl(i) βl^(1 − [[hl(xi) ≠ yi]])    (compute new weights)
[12] end for
[13] Output: hf(x) = argmax over y ∈ Y of Σl=1..L (log 1/βl) [[hl(x) = y]]

In line 4 of the ADABOOST algorithm (figure 2), the base learning algorithm Learn is called with the probability distribution pl. If the learning algorithm Learn can use this probability distribution directly, then this procedure generally gives better results. For example, Quinlan (1996) developed a version of the decision tree–learning program C4.5 that works with a weighted training sample. His experiments showed that it worked extremely well. One can also imagine versions of back propagation that scaled the computed output error for training example (xi, yi) by the weight pl(i). Errors for important training examples would cause larger gradient-descent steps than errors for unimportant (low-weight) examples.

However, if the algorithm cannot use the probability distribution pl directly, then a training sample can be constructed by drawing a random sample with replacement in proportion to the probabilities pl. This procedure makes ADABOOST more stochastic, but experiments have shown that it is still effective.
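For concreteness, the resampling variant just described can be sketched in Python as follows. The base algorithm learn is a placeholder, and the weight update and final weighted vote follow figure 2; the small guard against a zero error rate is an implementation detail added here, not part of the published algorithm.

import math
import random

def adaboost_m1(examples, learn, n_rounds):
    # examples: list of (x, y) pairs; learn: maps a training list to a classifier h(x).
    m = len(examples)
    weights = [1.0 / m] * m
    hypotheses, betas = [], []
    for _ in range(n_rounds):
        total = sum(weights)
        probs = [w / total for w in weights]
        # Resampling variant: draw a weighted bootstrap sample rather than
        # passing the distribution to the learner directly.
        sample = random.choices(examples, weights=probs, k=m)
        h = learn(sample)
        error = sum(p for p, (x, y) in zip(probs, examples) if h(x) != y)
        if error > 0.5:
            break
        beta = max(error, 1e-10) / (1.0 - error)   # guard against error = 0
        hypotheses.append(h)
        betas.append(beta)
        # Correctly classified examples are down-weighted by beta.
        weights = [w * (beta if h(x) == y else 1.0)
                   for w, (x, y) in zip(weights, examples)]
    return hypotheses, betas

def adaboost_predict(hypotheses, betas, x, classes):
    # Each hypothesis votes for its predicted class with weight log(1/beta).
    scores = {y: 0.0 for y in classes}
    for h, beta in zip(hypotheses, betas):
        scores[h(x)] += math.log(1.0 / beta)
    return max(scores, key=scores.get)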
Figure 3 compares the performance of C4.5 to C4.5 with ADABOOST.M1 (using random sampling). One point is plotted for each of 27 test domains taken from the Irvine repository of machine-learning databases (Merz and Murphy 1996). We can see that most points lie above the line y = x, which indicates that the error rate of ADABOOST is less than the error rate of C4.5. Figure 4 compares the performance of bagging (with C4.5) to C4.5 alone. Again, we see that bagging produces sizable reductions in the error rate of C4.5 for many problems. Finally, figure 5 compares bagging with boosting (both using C4.5 as the underlying algorithm). The results show that the two techniques are comparable, although boosting appears to still have an advantage over bagging.

Figure 3. Comparison of ADABOOST.M1 (applied to C4.5) with C4.5 by Itself. Each point represents 1 of 27 test domains. Points lying above the diagonal line exhibit lower error with ADABOOST.M1 than with C4.5 alone. Based on data from Freund and Schapire (1996). As many as 100 hypotheses were constructed.

Manipulating the Input Features

A second general technique for generating multiple classifiers is to manipulate the set of input features available to the learning algorithm. For example, in a project to identify volcanoes on Venus, Cherkauer (1996) trained an ensemble of 32 neural networks. The 32 networks were based on 8 different subsets of the 119 available input features and 4 different network sizes. The input-feature subsets were selected (by hand) to group features that were based on different image-processing operations (such as principal component analysis and the fast Fourier transform). The resulting ensemble classifier was able to match the performance of human experts in identifying volcanoes.
Tumer and Ghosh (1996) applied a similar technique to a sonar data set with 25 input features. However, they found that deleting even a few of the input features hurt the performance of the individual classifiers so much that the voted ensemble did not perform well. Obviously, this technique only works when the input features are highly redundant.

Figure 4. Comparison of Bagging (applied to C4.5) with C4.5 by Itself. Each point represents 1 of 27 test domains. Points lying above the diagonal line exhibit lower error with bagging than with C4.5 alone. Based on data from Freund and Schapire (1996). Bagging voted 100 classifiers.

Figure 5. Comparison of Bagging (applied to C4.5) with ADABOOST.M1 (applied to C4.5). Each point represents 1 of 27 test domains. Points lying above the diagonal line exhibit lower error with boosting than with bagging. Based on data from Freund and Schapire (1996).

Manipulating the Output Targets

A third general technique for constructing a good ensemble of classifiers is to manipulate the y values that are given to the learning algorithm. Dietterich and Bakiri (1995) describe a technique called error-correcting output coding (ECOC). Suppose that the number of classes, K, is large. Then, new learning problems can be constructed by randomly partitioning the K classes into two subsets: Al and Bl. The input data can then be relabeled so that any of the original classes in set Al are given the derived label 0, and the original classes in set Bl are given the derived label 1. These relabeled data are then given to the learning algorithm, which constructs a classifier hl. By repeating this process L times (generating different subsets Al and Bl), we obtain an ensemble of L classifiers h1, ..., hL.

Now, given a new data point x, how should we classify it? The answer is to have each hl classify x. If hl(x) = 0, then each class in Al receives a vote. If hl(x) = 1, then each class in Bl receives a vote. After each of the L classifiers has voted, the class with the highest number of votes is selected as the prediction of the ensemble.
An equivalent way of thinking about this method is that each class j is encoded as an L-bit code word Cj, where bit l is 1 if and only if j ∈ Bl. The l-th learned classifier attempts to predict bit l of these code words. When the L classifiers are applied to classify a new point x, their predictions are combined into an L-bit string. We then choose the class j whose code word Cj is closest (in Hamming distance) to the L-bit output string. Methods for designing good error-correcting codes can be applied to choose the code words Cj (or, equivalently, subsets Al and Bl).
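The code-word view translates directly into code. The sketch below uses random class partitions and Hamming-distance decoding as described; the two-class learner learn_binary and the data representation are generic placeholders rather than the exact procedure of Dietterich and Bakiri.

import random

def train_ecoc(examples, classes, learn_binary, n_bits):
    # Learn one two-class classifier per bit of the code words.
    codewords = {c: [] for c in classes}
    classifiers = []
    for _ in range(n_bits):
        # Randomly split the classes: set B gets derived label 1, set A gets 0.
        b_side = set(random.sample(list(classes), k=len(classes) // 2))
        for c in classes:
            codewords[c].append(1 if c in b_side else 0)
        relabeled = [(x, 1 if y in b_side else 0) for x, y in examples]
        classifiers.append(learn_binary(relabeled))
    return classifiers, codewords

def ecoc_predict(classifiers, codewords, x):
    # Choose the class whose code word is nearest in Hamming distance
    # to the L-bit string of predictions.
    bits = [h(x) for h in classifiers]
    def hamming(codeword):
        return sum(b != p for b, p in zip(codeword, bits))
    return min(codewords, key=lambda c: hamming(codewords[c]))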
Dietterich and Bakiri report that this technique improves the performance of both the C4.5 and back-propagation algorithms on a variety of difficult classification problems. Recently, Schapire (1997) showed how ADABOOST can be combined with error-correcting output coding to yield an excellent ensemble-classification method that he calls ADABOOST.OC. The performance of the method is superior to the ECOC method (and bagging) but essentially the same as another (complex) algorithm called ADABOOST.M2. Hence, the main advantage of ADABOOST.OC is implementation simplicity: It can work with any learning algorithm for solving two-class problems.

Ricci and Aha (1997) applied a method that combines error-correcting output coding with feature selection. When learning each classifier, hl, they apply feature-selection techniques to choose the best features for learning the classifier. They obtained improvements in 7 of 10 tasks with this approach.
Injecting Randomness

The last general-purpose method for generating ensembles of classifiers is to inject randomness into the learning algorithm. In the back-propagation algorithm for training neural networks, the initial weights of the network are set randomly. If the algorithm is applied to the same training examples but with different initial weights, the resulting classifier can be different (Kolen and Pollack 1991).

Although this method is perhaps the most common way of generating ensembles of neural networks, manipulating the training set might be more effective. A study by Parmanto, Munro, and Doyle (1996) compared this technique to bagging and tenfold cross-validated committees. They found that cross-validated committees worked best, bagging second best, and multiple random initial weights third best on one synthetic data set and two medical diagnosis data sets.

For the C4.5 decision tree algorithm, it is also easy to inject randomness (Kwok and Carter 1990). The key decision of C4.5 is to choose a feature to test at each internal node in the decision tree. At each internal node, C4.5 applies a criterion known as the information-gain ratio to rank the various possible feature tests. It then chooses the top-ranked feature-value test. For discrete-valued features with V values, the decision tree splits the data into V subsets, depending on the value of the chosen feature. For real-valued features, the decision tree splits the data into two subsets, depending on whether the value of the chosen feature is above or below a chosen threshold. Dietterich and Kong (1995) implemented a variant of C4.5 that chooses randomly (with equal probability) among the top 20 best tests. Table 1 compares a single run of C4.5 to ensembles of 200 classifiers constructed by bagging C4.5 and injecting randomness into C4.5. The results show that injecting randomness obtains the best performance in three of the domains. In particular, notice that injected randomness obtains perfect test-set performance in the letter-recognition task.

Table 1. Results on Five Domains (best error rate in boldface).

Task                 Test-Set Size   C4.5     200-Fold Bootstrap C4.5   200-Fold Random C4.5
Vowel                462             0.5758   0.5152                    0.4870 (1)
Soybean              376             0.1090   0.0984                    0.1090
Part of Speech       3060            0.0827   0.0765                    0.0788 (2)
NETTALK              7242            0.3000   0.2670 (3)                0.2500 (3)
Letter Recognition   4000            0.2010   0.0038 (3)                0.0000 (3)

(1) Difference from C4.5 significant at p < 0.05.
(2) 256-fold random.
(3) Difference from C4.5 significant at p < 0.001.
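The randomization used by Dietterich and Kong's variant amounts to replacing the usual argmax over candidate tests with a uniform choice among the best-scoring ones. A small sketch, with the scoring function standing in for C4.5's information-gain ratio:

import random

def choose_split(candidate_tests, score, top_k=20):
    # candidate_tests: possible feature-value tests at this node.
    # score: ranks tests (a stand-in for the information-gain ratio).
    ranked = sorted(candidate_tests, key=score, reverse=True)
    # Inject randomness: pick uniformly among the top-ranked tests.
    return random.choice(ranked[:top_k])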
Ali and Pazzani (1996) injected randomness into the FOIL algorithm for learning Prolog-style rules. FOIL works somewhat like C4.5 in that it ranks possible conditions to add to a rule using an information-gain criterion. Ali and Pazzani computed all candidate conditions that scored within 80 percent of the top-ranked candidate and then applied a weighted random-choice algorithm to choose among them. They compared ensembles of 11 classifiers to a single run of FOIL and found statistically significant improvements in 15 of 29 tasks and statistically significant loss of performance in only one task. They obtained similar results using 11-fold cross-validation to construct the training sets.

Raviv and Intrator (1996) combine bootstrap sampling of the training data with injecting noise into the input features for the learning algorithm. To train each member of an ensemble of neural networks, they draw training examples with replacement from the original training data. The x values of each training example are perturbed by adding Gaussian noise to the input features. They report large improvements in a synthetic benchmark task and a medical diagnosis task.

A method closely related to these techniques for injecting randomness is the Markov chain Monte Carlo (MCMC) method, which has been applied to neural networks by MacKay (1992) and Neal (1993) and to decision trees by Chipman, George, and McCulloch (1996). The basic idea of the MCMC method (and related methods) is to construct a Markov process that generates an infinite sequence of hypotheses hl.


In a Bayesian setting, the goal is to generate each hypothesis hl with probability P(hl|S), where S is the training sample. P(hl|S) is computed in the usual way as the (normalized) product of the likelihood P(S|hl) and the prior probability P(hl) of hl. To apply MCMC, we define a set of operators that convert one hl into another. For a neural network, such an operator might adjust one of the weights in the network. In a decision tree, the operator might interchange a parent and a child node in the tree or replace one node with another. The MCMC process works by maintaining a current hypothesis hl. At each step, it selects an operator, applies it (to obtain hl+1), and then computes the likelihood of the resulting classifier on the training data. It then decides whether to keep hl+1 or discard it and go back to hl. Under various technical conditions, it is possible to prove that a process of this kind eventually converges to a stationary probability distribution in which the hl's are sampled in proportion to their posterior probabilities. In practice, it can be difficult to tell when this stationary distribution is reached. A standard approach is to run the Markov process for a long period (discarding all generated classifiers) and then collect a set of L classifiers from the Markov process. These classifiers are then combined by weighted vote according to their posterior probabilities.
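The accept-or-discard step can be sketched with a Metropolis-style acceptance rule, although the article does not commit to a particular criterion. In the sketch below, the proposal operator, the log posterior, and the burn-in schedule are all assumptions for illustration; the proposal is taken to be symmetric.

import math
import random

def mcmc_hypotheses(h0, propose, log_posterior, n_steps, collect_every):
    # propose(h): apply one operator to h (assumed symmetric).
    # log_posterior(h): log P(S|h) + log P(h), up to a constant.
    h, collected = h0, []
    for step in range(n_steps):
        candidate = propose(h)
        log_ratio = log_posterior(candidate) - log_posterior(h)
        # Metropolis rule: always keep improvements, sometimes keep worse moves.
        if log_ratio >= 0 or random.random() < math.exp(log_ratio):
            h = candidate
        # Discard an initial burn-in period, then collect hypotheses periodically.
        if step >= n_steps // 2 and step % collect_every == 0:
            collected.append(h)
    return collected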
Algorithm-Specific Methods for Generating Ensembles

In addition to these general-purpose methods for generating diverse ensembles of classifiers, there are several techniques that can be applied to the back-propagation algorithm for training neural networks.

Rosen (1996) trains several neural networks simultaneously and forces the networks to be diverse by adding a correlation penalty to the error function that back propagation minimizes. Specifically, during training, Rosen keeps track of the correlations in the predictions of each network. He applies back propagation to minimize an error function that is a sum of the usual squared output prediction error and a term that measures the correlations with the other networks. He reports substantial improvements in three simple synthetic tasks.

Opitz and Shavlik (1996) take a similar approach, but they use a kind of genetic algorithm to search for a good population of neural network classifiers. In each iteration, they apply genetic operators to the current ensemble to generate new network topologies. These networks are then trained using an error function that combines the usual squared output prediction error with a multiplicative term that incorporates the diversity of the classifiers. After training the networks, they prune the population to retain the N best networks using a criterion that considers both accuracy and diversity. In a comparison with bagging, they found that their method gave excellent results in four real-world domains.

Abu-Mostafa (1990) and Caruana (1996) describe a technique for training a neural network on auxiliary tasks as well as the main task. The key idea is to add some output units to the network whose role is to predict the values of the auxiliary tasks and to include the prediction error for these auxiliary tasks in the error criterion that back propagation seeks to minimize. Because these auxiliary output units are connected to the same hidden units as the primary task output, the auxiliary outputs can influence the behavior of the network on the primary task. Parmanto et al. (1994) show that diverse classifiers can be learned by training on the same primary task but with many different auxiliary tasks. One good source of auxiliary tasks is to have the network attempt to predict one of its input features (in addition to the primary output task). They apply this method to medical diagnosis problems.

More recently, Munro and Parmanto (1997) developed an approach in which the value of the auxiliary output is determined dynamically by competition among the set of networks. Each network has a primary output y and a secondary (auxiliary) output z. During training, each network looks at the training example x and computes its primary and secondary output predictions. The network whose secondary output prediction is highest is said to be the winner for this example. It is given a target value for z of 1; the remaining networks are given a target value of 0 for the secondary output. All networks are given the value yi as the target for the primary output. The effect is to encourage different networks to become experts at predicting the secondary output z in different regions of the input space. Because the primary and secondary output share the hidden layer, the errors in the primary output become decorrelated. They show that this method substantially outperforms an ensemble of ordinary networks trained using different initial random weights when trained on a synthetic classification task.

In addition to these methods for training ensembles of neural networks, there are also methods that are specific to decision trees. Buntine (1990) developed an algorithm for learning option trees. These are decision trees where an internal node can contain several alternative splits (each producing its own sub–decision tree). To classify an example, each of these sub–decision trees is evaluated, and the resulting classifications are voted. Kohavi and Kunz (1997) describe an option tree algorithm and compare its performance to bagged C4.5 trees.
They show that option trees generally match the performance of bagging and produce a much more understandable result.

This completes my review of methods for generating ensembles using a single learning algorithm. Of course, one can always generate an ensemble by combining classifiers constructed by different learning algorithms. Learning algorithms based on very different principles will probably produce very diverse classifiers. However, often some of these classifiers perform much worse than others. Furthermore, there is no guarantee of diversity. Hence, when classifiers from different learning algorithms are combined, they should be checked (for example, by cross-validation) for accuracy and diversity, and some form of weighted combination should be used. This approach has been shown effective in some applications (for example, Zhang, Mesirov, and Waltz [1992]).

Methods for Combining Classifiers

Given that we have trained an ensemble of classifiers, how should we combine their individual classification decisions? Many methods have been explored. They can be subdivided into unweighted vote, weighted vote, and gating networks.

The simplest approach is to take an unweighted vote, as is done in bagging, ECOC, and many other methods. Although it might appear that more intelligent voting schemes should do better, the experience in the forecasting literature has been that simple, unweighted voting is robust (Clemen 1989). One refinement on simple majority vote is appropriate when each classifier hl can produce class-probability estimates rather than a simple classification decision. A class-probability estimate for data point x is the probability that the true class is k: P(f(x) = k | hl), for k = 1, ..., K. We can combine the class probabilities of all the hypotheses so that the class probability of the ensemble is

P(f(x) = k) = (1/L) Σl=1..L P(f(x) = k | hl).

The predicted class of x is then the class having the highest class probability.
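In code, this combination rule is simply an average of the per-classifier class-probability estimates. The sketch below assumes each classifier returns a dictionary mapping class labels to probabilities; the names are illustrative.

def ensemble_class_probabilities(hypotheses, x, classes):
    # Average P(f(x) = k | hl) over the L classifiers in the ensemble.
    avg = {k: 0.0 for k in classes}
    for h in hypotheses:
        probs = h(x)   # assumed to return a dict: class label -> probability
        for k in classes:
            avg[k] += probs[k] / len(hypotheses)
    return avg

def ensemble_predict(hypotheses, x, classes):
    avg = ensemble_class_probabilities(hypotheses, x, classes)
    return max(avg, key=avg.get)   # class with the highest averaged probability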
Many different weighted voting methods have L different learning algorithms A1, ..., AL
have been developed for ensembles. For regres- and a set S of training examples {(x1, y1), …, (xm,
sion problems, Perrone and Cooper (1993) and ym)}. As usual, we apply each of these algo-
Hashem (1993) apply least squares regression rithms to our training data to produce hy-
to find weights that maximize the accuracy of potheses h1, …, hL. The goal of stacking is to
the ensemble on the training data. They show learn a good combining classifier h* such that
that the weight applied to hl should be inverse- the final classification will be computed by h *
ly proportional to the variance of the estimates (h1(x), …, hL(x)). Wolpert (1992) proposed the
of hl. A difficulty with applying linear least following scheme for learning h* using a form
squares is that the various hypotheses hl can be of leave-one-out cross-validation.
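The gating computation is just a softmax over linear scores. A minimal forward-pass sketch follows; the parameter vectors vl are taken as given here rather than learned, and the max-subtraction is only a numerical-stability detail.

import math

def gating_weights(v, x):
    # v: list of parameter vectors (one per classifier); x: input feature vector.
    # Returns wl = exp(zl) / sum_u exp(zu), with zl = vl . x.
    z = [sum(vi * xi for vi, xi in zip(vl, x)) for vl in v]
    zmax = max(z)
    expz = [math.exp(zl - zmax) for zl in z]
    total = sum(expz)
    return [e / total for e in expz]

def gated_vote(classifiers, v, x, classes):
    # Weighted vote of the classifiers using the gating weights for this x.
    w = gating_weights(v, x)
    scores = {k: 0.0 for k in classes}
    for weight, h in zip(w, classifiers):
        scores[h(x)] += weight
    return max(scores, key=scores.get)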
A fourth approach to combining classifiers, called stacking, works as follows: Suppose we have L different learning algorithms A1, ..., AL and a set S of training examples {(x1, y1), ..., (xm, ym)}. As usual, we apply each of these algorithms to our training data to produce hypotheses h1, ..., hL. The goal of stacking is to learn a good combining classifier h* such that the final classification will be computed by h*(h1(x), ..., hL(x)). Wolpert (1992) proposed the following scheme for learning h* using a form of leave-one-out cross-validation.


Let hl(–i) be a classifier constructed by algorithm Al applied to all the training examples in S except example i. In other words, each algorithm is applied to the training data m times, leaving out one training example each time. We can then apply each classifier hl(–i) to example xi to obtain the predicted class ŷil. This procedure gives us a new data set containing level-two examples whose features are the classes predicted by each of the L classifiers. Each example has the form

〈(ŷi1, ŷi2, ..., ŷiL), yi〉.

Now, we can apply some other learning algorithm to this level-two data to learn h*. Breiman (1996b) applied this approach to combining different forms of linear regression with good results.
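Wolpert's scheme can be sketched directly. Each algorithm is trained m times with one example held out, the held-out predictions form the level-two training set, and a final learner is fit to that set; the algorithm list and level_two_learner below are generic placeholders.

def stack(examples, algorithms, level_two_learner):
    # examples: list of (x, y); algorithms: functions mapping a training list
    # to a classifier. Returns the level-one classifiers and the combiner h*.
    level_two_data = []
    for i, (x_i, y_i) in enumerate(examples):
        held_out = examples[:i] + examples[i + 1:]
        # Leave-one-out predictions of each algorithm on the held-out example
        # become the features of one level-two example.
        features = [algorithm(held_out)(x_i) for algorithm in algorithms]
        level_two_data.append((features, y_i))
    level_one = [algorithm(examples) for algorithm in algorithms]
    h_star = level_two_learner(level_two_data)
    return level_one, h_star

def stacked_predict(level_one, h_star, x):
    return h_star([h(x) for h in level_one])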
Why Ensembles Work

I already gave the basic intuition for why ensembles can improve performance: Uncorrelated errors made by the individual classifiers can be removed by voting. However, there is a deeper question lurking here: Why should it be possible to find ensembles of classifiers that make uncorrelated errors? There is another question as well: Why shouldn't we be able to find a single classifier that performs as well as the ensemble?

There are at least three reasons why good ensembles can be constructed and why it might be difficult or impossible to find a single classifier that performs as well as the ensemble. To understand these reasons, we must consider the nature of machine-learning algorithms. Machine-learning algorithms work by searching a space of possible hypotheses H for the most accurate hypothesis (that is, the hypothesis that best approximates the unknown function f). Two important aspects of the hypothesis space H are its size and whether it contains good approximations to f.

If the hypothesis space is large, then we need a large amount of training data to constrain the search for good approximations. Each training example rules out (or makes less plausible) all those hypotheses in H that misclassify it. In a two-class problem, ideally each training example can eliminate half of the hypotheses in H; so, we require O(log|H|) examples to select a unique classifier from H.

The first cause of the need for ensembles is that the training data might not provide sufficient information for choosing a single best classifier from H. Most of our learning algorithms consider very large hypothesis spaces; so, even after eliminating hypotheses that misclassify training examples, there are many hypotheses remaining. All these hypotheses appear equally accurate with respect to the available training data. We might have reasons for preferring some of these hypotheses over others (for example, preferring simpler hypotheses or hypotheses with higher prior probability), but nonetheless, there are typically many plausible hypotheses. From this collection of surviving hypotheses in H, we can easily construct an ensemble of classifiers and combine them using the methods described earlier.

A second cause of the need for ensembles is that our learning algorithms might not be able to solve the difficult search problems that we pose. For example, the problem of finding the smallest decision tree that is consistent with a set of training examples is NP-hard (Hyafil and Rivest 1976). Hence, practical decision tree algorithms use search heuristics to guide a greedy search for small decision trees. Similarly, finding the weights for the smallest possible neural network consistent with the training examples is also NP-hard (Blum and Rivest 1988). Neural network algorithms therefore use local search methods (such as gradient descent) to find locally optimal weights for the network. A consequence of these imperfect search algorithms is that even if the combination of our training examples and our prior knowledge (for example, preferences for simple hypotheses, Bayesian priors) determines a unique best hypothesis, we might not be able to find it. Instead, we typically find a hypothesis that is somewhat more complex (or has somewhat lower posterior probability). If we run our search algorithms with a slightly different training sample or injected noise (or any of the other techniques described earlier), we find a different (suboptimal) hypothesis. Therefore, ensembles can be seen as a way of compensating for imperfect search algorithms.

A third cause of the need for ensembles is that our hypothesis space H might not contain the true function f. Instead, H might include several equally good approximations to f. By taking weighted combinations of these approximations, we might be able to represent classifiers that lie outside H. One way to understand this point is to visualize the decision boundaries constructed by learning algorithms. A decision boundary is a surface such that examples that lie on one side of the surface are assigned to a different class than examples that lie on the other side of the surface.

The decision boundaries constructed by decision tree–learning algorithms are line segments (or, more generally, hyperplane segments) parallel to the coordinate axes. If the true boundary between two classes is a diagonal line, then decision tree algorithms must approximate the diagonal by a "staircase" of axis-parallel segments (figure 6). Different bootstrap training samples (or different weighted samples created by ADABOOST) shift the locations of the staircase approximation, and by voting among these different approximations, it is possible to construct better approximations to the diagonal decision boundary.

Figure 6. Decision Boundaries. a. This figure shows the true diagonal decision boundary and three staircase approximations to it (of the kind that are created by decision tree algorithms). b. This figure shows the voted decision boundary, which is a much better approximation to the diagonal boundary.

Interestingly, these improved staircase approximations are equivalent to complex decision trees. However, these trees are so large that were we to include them in our hypothesis space H, the space would be far too large for the available training data. Hence, we can see that ensembles provide a way of overcoming representational inadequacies in our hypothesis space.

Open Problems Concerning Ensembles

Ensembles are well established as a method for obtaining highly accurate classifiers by combining less accurate ones. There are still many questions, however, about the best way to construct ensembles as well as issues about how best to understand the decisions made by ensembles.

Faced with a new learning problem, what is the best approach to constructing and applying an ensemble of classifiers? In principle, there can be no single best ensemble method, just as there can be no single best learning algorithm. However, some methods might be uniformly better than others, and some methods might be better than others in certain situations.

Experimental studies have shown that ADABOOST is one of the best methods for constructing ensembles of decision trees. Schapire (1997) compares ADABOOST.M2 and ADABOOST.OC to bagging and error-correcting output coding and shows that the ADABOOST methods are generally superior. However, Quinlan (1996) has shown that in domains with noisy training data, ADABOOST.M1 can perform badly—it places high weight on incorrectly labeled training examples and, consequently, constructs bad classifiers. Kong and Dietterich (1995) showed that combining bagging with error-correcting output coding improved the performance of both methods, which suggests that combinations of other ensemble methods should be explored as well. Dietterich and Kong also showed that error-correcting output coding does not work well with highly local algorithms (such as nearest-neighbor methods).

There have been few systematic studies of methods for constructing ensembles of neural networks, rule-learning systems, and other types of classifier. Much work remains in this area.

Although ensembles provide accurate classifiers, some problems may limit their practical application. One problem is that ensembles can require large amounts of memory to store and large amounts of computation to apply. For example, earlier I mentioned that an ensemble of 200 decision trees attains perfect performance on a letter-recognition benchmark task. However, these 200 decision trees require 59 megabytes of storage, which makes them impractical for most present-day computers. An important line of research, therefore, is to find ways of converting these ensembles into less redundant representations, perhaps through the deletion of highly correlated members of the ensemble or by representational transformations.

A second difficulty with ensemble classifiers is that an ensemble provides little insight into how it makes its decisions. A single decision tree can often be interpreted by human users, but an ensemble of 200 voted decision trees is much more difficult to understand. Can methods be found for obtaining explanations (at least locally) from ensembles? One example of work on this question is Craven's TREPAN algorithm (Craven and Shavlik 1996).

Scaling Up Machine-Learning Algorithms

A second major research area has explored techniques for scaling up learning algorithms so that they can apply to problems with millions of training examples, thousands of features, and hundreds of classes.


Large machine-learning problems are beginning to arise in database-mining applications, where there can be millions of transactions every day and where it is desirable to have machine-learning algorithms that can analyze such large data sets in just a few hours of computer time. Another area where large learning problems arise is in information retrieval from full-text databases and the World Wide Web. In information retrieval, each word in a document can be treated as an input feature; so, each training example can be described by thousands of features. Finally, applications in speech recognition, object recognition, and character recognition for Chinese and Japanese present situations in which hundreds or thousands of classes must be discriminated.

Learning with Large Training Sets

Decision tree algorithms have been extended to handle large data sets in three different ways. One approach is based on intelligently sampling subsets of the training data as the tree is grown. To describe how this process works, I must review how decision tree algorithms operate.

Decision trees are constructed by starting with the entire training set and an empty tree. A test is chosen for the root of the tree, and the training data are then partitioned into disjoint subsets depending on the outcome of the test. The algorithm is then applied recursively to each of these disjoint subsets. The algorithm terminates when all (or most) of the training examples within a subset of the data belong to the same class. At this point, a leaf node is created and labeled with the class.

The process of choosing a test for the root of the tree (or for the root of each subtree) involves analyzing the training data and choosing the one feature that is the best predictor of the output class. In a large and redundant data set, it might be possible to make this choice based on only a sample of the data. Musick, Catlett, and Russell (1993) presented an algorithm that dynamically chooses the sample based on how difficult the decision is at each node. The algorithm typically behaves by using a small sample near the root of the tree and then progressively enlarging the sample as the tree grows. This technique can reduce the time required to grow the tree without reducing the accuracy of the tree at all.

A second approach is based on developing clever data structures that avoid the need to store all training data in random-access memory. The hardest step for most decision tree algorithms is to find tests for real-valued features. These tests usually take the form xj ≤ θ for some threshold value θ. The standard approach is to sort the training examples at the current node according to the values of feature xj and then make a sequential pass over the sorted examples to choose the threshold θ. In standard depth-first algorithms, the data must be sorted by each candidate feature xj at each node of the tree, which can become expensive.

Shafer, Agrawal, and Mehta (1996) describe their SPRINT method in which the training data are broken up into a separate disk file for each attribute (and sorted by attribute value). For feature j, the disk file contains records of the form <i, xij, yi>, where i is the index of the training example, xij is the value of feature j for training example i, and yi is the class of example i. To choose a splitting threshold θ for feature j, it is a simple matter to make a serial scan of this disk file and construct class histograms for each potential splitting threshold. Once a value is chosen, the disk file is partitioned (logically) into two files, one containing examples with values less than or equal to θ and the other containing examples with values greater than θ. As these files are written to disk, a hash table is built (in the main memory) in which the index for each training example i is associated with the left or right child nodes of the newly created split. Then, each of the disk files for the other attributes xl, l ≠ j, is read and split into a file for the left child and a file for the right child. The index of each example is looked up in the hash table to determine the child to which it belongs. SPRINT can be parallelized easily, and it has been applied to data sets containing 2.5 million examples. Mehta, Agrawal, and Rissanen (1996) have developed a closely related algorithm, SLIQ, that makes more use of main memory but also scales to millions of training examples and is slightly faster than SPRINT. Both SPRINT and SLIQ scale approximately linearly in the number of training examples and the number of features, aside from the cost of performing the initial sorting of the data (which need only be done once).
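The serial scan over a presorted attribute file can be illustrated as follows. The records are (index, value, class) triples sorted by value; running class counts give the class histogram on each side of every candidate threshold. The Gini-style impurity used here is only a stand-in, since the article does not specify SPRINT's splitting criterion.

def best_threshold(records, classes):
    # records: list of (index, value, class) sorted by value, as in an attribute file.
    def gini(counts, total):
        if total == 0:
            return 0.0
        return 1.0 - sum((c / total) ** 2 for c in counts.values())

    left = {c: 0 for c in classes}
    right = {c: 0 for c in classes}
    for _, _, y in records:
        right[y] += 1

    n, best_theta, best_score = len(records), None, float("inf")
    for pos in range(n - 1):
        _, value, y = records[pos]
        left[y] += 1          # move one record to the "<= theta" side
        right[y] -= 1
        if value == records[pos + 1][1]:
            continue          # only consider thresholds between distinct values
        n_left = pos + 1
        score = (n_left * gini(left, n_left)
                 + (n - n_left) * gini(right, n - n_left)) / n
        if score < best_score:
            best_score = score
            best_theta = (value + records[pos + 1][1]) / 2.0
    return best_theta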
A third approach to large data sets is to take advantage of ensembles of decision trees (Chan and Stolfo 1995). The training data can randomly be partitioned into N disjoint subsets. A separate decision tree can be grown from each subset in parallel. The trees can then vote to make classification decisions. Although the accuracy of each of the individual decision trees is less than the accuracy of a single tree grown with all the data, the accuracy of the ensemble is often better than the accuracy of a single tree. Hence, with N parallel processors, we can achieve a speedup of N in the time required to construct the decision trees.
A fourth approach to the problem of choosing splits for real-valued input features is to discretize the values of such features. For example, one feature of a training example might be employee income measured in dollars. This could take on tens of thousands of distinct values, and the running time of most decision-tree learning algorithms is linear in the number of distinct values of each feature. This problem can be solved by grouping income into a small number of ranges (for example, $0–$10,000; $10,000–$25,000; $25,000–$50,000; $50,000–$100,000; and greater than $100,000). If the ranges are chosen well, the resulting decision trees will still be accurate. Several simple and fast algorithms have been developed for choosing good discretization points (Kohavi and Sahami 1996; Fayyad and Irani 1993; Catlett 1991).

A very effective rule-learning algorithm, called RIPPER, has been developed by William Cohen (1995) based on an earlier algorithm, IREP, developed by Furnkranz and Widmer (1994). RIPPER constructs rules of the form test1 ∧ test2 ∧ ... ∧ testl ⇒ ci, where each test has the form xj = θ (for discrete features) or xj ≤ θ or xj > θ (for real-valued features). A rule is said to cover a training example if the example satisfies all the tests on the left-hand side of the rule. For a training set of size m, the run time of RIPPER scales as O(m (log m)^2). This is a major improvement over the rule-learning program C4.5RULES (Quinlan 1993), which scales as O(m^3).
Figure 7 shows pseudocode for RIPPER. RIPPER works by building an initial set of rules and optimizing the set of rules k times, where k is a parameter (typically set to 2). I describe RIPPER for the case where there are only two classes, 1 and 2. Examples of class 1 will be referred to as the positive examples, and examples of class 2 will be referred to as the negative examples. RIPPER can easily be extended to handle larger numbers of classes.

Figure 7. The RIPPER Algorithm (Cohen 1995).

procedure BuildRuleSet(P, N)
  P = positive examples
  N = negative examples
  RuleSet = {}
  DL = DescriptionLength(RuleSet, P, N)
  while P ≠ {}
    // Grow and prune a new rule
    split (P, N) into (GrowPos, GrowNeg) and (PrunePos, PruneNeg)
    Rule := GrowRule(GrowPos, GrowNeg)
    Rule := PruneRule(Rule, PrunePos, PruneNeg)
    add Rule to RuleSet
    if DescriptionLength(RuleSet, P, N) > DL + 64 then
      // Prune the whole rule set and exit
      for each rule R in RuleSet (considered in reverse order)
        if DescriptionLength(RuleSet − {R}, P, N) < DL then
          delete R from RuleSet
          DL := DescriptionLength(RuleSet, P, N)
        end if
      end for
      return (RuleSet)
    end if
    DL := DescriptionLength(RuleSet, P, N)
    delete from P and N all examples covered by Rule
  end while
end BuildRuleSet

procedure OptimizeRuleSet(RuleSet, P, N)
  for each rule R in RuleSet
    delete R from RuleSet
    UPos := examples in P not covered by RuleSet
    UNeg := examples in N not covered by RuleSet
    split (UPos, UNeg) into (GrowPos, GrowNeg) and (PrunePos, PruneNeg)
    RepRule := GrowRule(GrowPos, GrowNeg)
    RepRule := PruneRule(RepRule, PrunePos, PruneNeg)
    RevRule := GrowRule(GrowPos, GrowNeg, R)
    RevRule := PruneRule(RevRule, PrunePos, PruneNeg)
    choose better of RepRule and RevRule and add to RuleSet
  end for
end OptimizeRuleSet

procedure Ripper(P, N, k)
  RuleSet := BuildRuleSet(P, N)
  repeat k times: RuleSet := OptimizeRuleSet(RuleSet, P, N)
  return (RuleSet)
end Ripper
To build a set of rules, RIPPER constructs one rule at a time. Before learning each rule, it divides the training data into a growing set (containing two-thirds of the data) and a pruning set (containing the remaining one-third). It then iteratively adds tests to the rule until the rule covers no negative examples. Tests are selected using an information-gain heuristic developed for Quinlan's (1990) FOIL system. Once a rule is grown, it is immediately pruned by deleting tests in reverse order, testl, testl–1, ..., test1, to find the pruned rule that maximizes the quantity (p – n)/(p + n), where p is the number of positive pruning examples covered by the pruned rule, and n is the number of negative pruning examples covered by the pruned rule.

Once the rule has been grown and pruned, RIPPER adds it to the rule set and discards all training examples that are covered by this new rule. It uses a description-length criterion to decide when to stop adding rules. The description length of a set of rules is the number of bits needed to represent the rules plus the number of bits needed to identify the training examples that are exceptions to the rules. Minimum–description-length criteria of this kind have been applied successfully to rule- and tree-learning algorithms (Quinlan and Rivest 1989). RIPPER stops adding rules when the description length of the rule set is more than 64 bits larger than the best description length observed to this point. It then considers the rules in reverse order and deletes any rule that will reduce the total description length of the rule set.
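The pruning step described above (deleting trailing tests so as to maximize (p − n)/(p + n) on the pruning data) can be sketched as follows. The rule representation and the covers predicate are placeholders, and the search over prefixes is an assumption about how the deletions are tried.

def prune_rule(tests, prune_pos, prune_neg, covers):
    # tests: ordered tests of a grown rule; covers(tests, x) -> bool.
    def value(ts):
        p = sum(1 for x in prune_pos if covers(ts, x))
        n = sum(1 for x in prune_neg if covers(ts, x))
        return (p - n) / (p + n) if (p + n) > 0 else -1.0

    best, best_value = list(tests), value(tests)
    current = list(tests)
    while len(current) > 1:
        current = current[:-1]      # delete the most recently added test
        if value(current) > best_value:
            best, best_value = list(current), value(current)
    return best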


To optimize a set of rules, RIPPER considers deleting each rule in turn and regrowing and repruning it. Two candidate replacement rules are grown and pruned. The first candidate is grown starting with an empty rule, whereas the second candidate is grown starting with the current rule. The better of the two candidates is selected (using a description-length heuristic) and added to the rule set.

Cohen compared RIPPER to C4.5RULES on 37 data sets and found that RIPPER matched or beat C4.5RULES in 22 of the 37 problems. The rule sets that it finds are always smaller than those constructed by C4.5RULES. An implementation of RIPPER is available for research use from Cohen (www.research.att.com/~wcohen/ripperd.html).

Learning with Many Features

In many learning problems, there are hundreds or thousands of potential features describing each input object x. Popular learning algorithms such as C4.5 and back propagation do not scale well when there are many features. Indeed, from a statistical point of view, examples with many irrelevant, but noisy, input features provide little information. It is easy for learning algorithms to be confused by the noisy features and construct poor classifiers. Hence, in practical applications, it is wise to carefully choose which features to provide to the learning algorithm.

Research in machine learning has sought to automate the selection and weighting of features, and many different algorithms have been developed for this purpose. An excellent review has been written by Wettschereck, Aha, and Mohri (1997). A comprehensive review of the statistical literature on feature selection can be found in Miller (1990). I discuss a few of the most significant methods here.

Three main approaches have been pursued. The first approach is to perform some initial analysis of the training data and select a subset of the features to feed to the learning algorithm. A second approach is to try different subsets of the features on the learning algorithm, estimate the performance of the algorithm with these features, and keep the subsets that perform best. The third approach is to integrate the selection and weighting of features directly into the learning algorithm. I discuss two examples of each approach.

Selecting and Weighting Features by Preprocessing

A simple preprocessing technique is to compute the mutual information (also called the information gain) between each input feature and the class. The mutual information between two random variables is the average reduction in uncertainty about the second variable given a value of the first. For discrete feature j, the mutual information weight wj can be computed as

wj = Σv Σc P(y = c, xj = v) · log [ P(y = c, xj = v) / (P(y = c) P(xj = v)) ],

where P(y = c) is the proportion of training examples in class c, and P(xj = v) is the probability that feature j takes on value v. For real-valued features, the sums become integrals that must be approximated. A good approximation is to apply a discretization algorithm, such as the one advocated by Fayyad and Irani (1993); convert the real-valued feature into a discrete-valued feature; and then apply the previous formula. Wettschereck and Dietterich (1995) have obtained good results with nearest-neighbor algorithms using mutual information weighting.
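For a single discrete feature, the weight above can be computed by tabulating joint and marginal frequencies. The sketch below uses natural logarithms and skips zero-count cells; it is an illustration of the formula, not code from any of the cited systems.

import math
from collections import Counter

def mutual_information_weight(feature_values, labels):
    # feature_values: the values of one discrete feature; labels: class labels.
    n = len(feature_values)
    joint = Counter(zip(feature_values, labels))
    p_x = Counter(feature_values)
    p_y = Counter(labels)
    w = 0.0
    for (v, c), count in joint.items():
        p_vc = count / n
        # p_vc * log( p_vc / (P(xj = v) * P(y = c)) ); zero cells contribute nothing.
        w += p_vc * math.log(p_vc / ((p_x[v] / n) * (p_y[c] / n)))
    return w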
dreds or thousands of potential features describing each input object x. Popular learning algorithms such as C4.5 and back propagation do not scale well when there are many features. Indeed, from a statistical point of view, examples with many irrelevant, but noisy, input features provide little information. It is easy for learning algorithms to be confused by the noisy features and construct poor classifiers. Hence, in practical applications, it is wise to carefully choose which features to provide to the learning algorithm.

Research in machine learning has sought to automate the selection and weighting of features, and many different algorithms have been developed for this purpose. An excellent review has been written by Wettschereck, Aha, and Mohri (1997). A comprehensive review of the statistical literature on feature selection can be found in Miller (1990). I discuss a few of the most significant methods here.

Three main approaches have been pursued. The first approach is to perform some initial analysis of the training data and select a subset of the features to feed to the learning algorithm. A second approach is to try different subsets of the features on the learning algorithm, estimate the performance of the algorithm with these features, and keep the subsets that perform best. The third approach is to integrate the selection and weighting of features directly into the learning algorithm. I discuss two examples of each approach.

Selecting and Weighting Features by Preprocessing
A simple preprocessing technique is to compute the mutual information (also called the information gain) between each input feature and the class. The mutual information between two random variables is the average reduction in uncertainty about the second variable given a value of the first.
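Written out for a discrete feature xj and a class label y drawn from K classes, the weight assigned to feature j is the standard mutual information (the notation here is mine; the probabilities are estimated from the training sample):

\[
I(x_j; y) \;=\; \sum_{v}\sum_{k=1}^{K} P(x_j = v,\, y = k)\,
\log \frac{P(x_j = v,\, y = k)}{P(x_j = v)\,P(y = k)} .
\]

Features with the largest values of I(xj; y) are retained, or the values are used directly as feature weights.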

A problem with mutual information weighting is that it treats each feature independently. For features whose predictive power is only apparent in combination with other features, mutual information assigns a weight of zero. For example, a difficult class of learning problems involves learning parity functions with random irrelevant features. A parity function over n binary features is equal to 1 if and only if an odd number of the features are equal to 1. Suppose we define a learning problem in which there are 4 relevant features and 10 irrelevant (random) binary features, and the class is the 4-parity of the 4 relevant features. The mutual information weights of all features will be approximately zero using the previous formula.

An algorithm that overcomes this problem (and is one of the most successful preprocessing algorithms to date) is the RELIEF-F algorithm (Kononenko 1994), which is an extension of an earlier algorithm called RELIEF (Kira and Rendell 1992). The basic idea of these algorithms is to draw examples at random, compute their nearest neighbors, and adjust a set of feature weights to give more weight to features that discriminate the example from neighbors of different classes. Specifically, let x be a randomly chosen training example, and let xs and xd be the two training examples nearest to x (in Euclidean distance) in the same class and in a different class, respectively. The goal of RELIEF is to set the weight wj on input feature j to be

wj = P(xj ≠ xdj) – P(xj ≠ xsj).

In other words, the weight wj should be maximized when xd has a high probability of taking on a different value for feature j, and xs has a low probability of taking on a different value for feature j. RELIEF-F computes a more reliable estimate of this probability difference by computing the B nearest neighbors of x in each class.

Figure 8 describes the RELIEF-F algorithm. In this algorithm, δ(u, v) for two feature values u and v is defined as follows:

δ(u, v) = |u – v| for real-valued features
δ(u, v) = 0 if u = v, and 1 if u ≠ v, for discrete features.

procedure Relief-F(L, B)
  L = the number of random examples to draw
  B = the number of near neighbors to compute
  for all features j: wj := 0.0
  pc := the fraction of the training examples belonging to class c
  for l := 1 to L do
    randomly select an instance (xt, yt)
    let Hit be the set of B examples (xi, yi) nearest to xt such that yi = yt
    for each class c ≠ yt
      let Mc be a set of B examples (xi, yi) nearest to xt such that yi = c
    end for
    for each feature j
      wj := wj – Σ(xi,yi)∈Hit δ(xtj, xij)/(L·B)
               + Σc≠yt [pc/(1 – pyt)] Σ(xi,yi)∈Mc δ(xtj, xij)/(L·B)
    end for j
  end for l
  return wj ∀ j
end Relief-F

Figure 8. The RELIEF-F Algorithm.

Kononenko, Simec, and Robnik-Sikonja (1997) have shown that RELIEF-F is effective at detecting relevant features, even when these features are highly dependent on other features.

For the 4-parity problem mentioned previously (with 10 irrelevant random features), RELIEF-F can correctly separate the 4 relevant features from the 10 irrelevant ones given 400 training examples. Kononenko computes the 10 nearest neighbors in each class (that is, B = 10). In his experiments, he sets the number of sample points L to be equal to the number of training examples, but in large data sets, good results can be obtained from much smaller samples.
ously (with 10 irrelevant random features), RE-
LIEF-F can correctly separate the 4 relevant fea-
tures from the 10 irrelevant ones given 400
training examples. Kononenko computes the Figure 8. The RELIEF-F Algorithm.
10 nearest neighbors in each class (that is, B =
10). In his experiments, he sets the number of
sample points L to be equal to the number of considered. This method is only practical for
training examples, but in large data sets, good data sets with relatively small numbers of fea-
results can be obtained from much smaller tures and fast learning algorithms, but it gave
samples. excellent results on the University of Califor-
Kononenko et al. have also experimented nia at Irvine benchmarks.
with integrating RELIEF-F into a decision Moore and Lee (1994) describe a much more
tree–learning algorithm called ASSISTANT-R. They efficient approach to feature selection that
show that ASSISTANT-R is able to perform much combines leave-one-out cross-validation
better than the original ASSISTANT program (LOOCV) with the nearest-neighbor algo-
(which uses mutual information to choose fea- rithm. In leave-one-out cross-validation, each
tures) in domains with highly dependent fea- training example is temporarily deleted from
tures but gives essentially the same perfor- the training data, and the nearest-neighbor
mance in domains with independent features. learning algorithm is applied to predict the
Selecting and Weighting Features by class of the example. The total number of clas-
Testing with the Learning Algorithm sification errors is the leave-one-out cross-vali-
John, Kohavi, and Pfleger (1994) describe a dated estimate of the error rate of the learning
computationally expensive method that they algorithm. Moore and Lee use the LOOCV er-
call the wrapper method for selecting input fea- ror to compare different sets of features with
tures. The idea is to generate sets of features, the goal of finding the set of relevant features
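A bare-bones version of this wrapper loop, with greedy forward selection and 10-fold cross-validation, might look as follows (an illustrative sketch: train_and_test stands in for whatever learning algorithm is being wrapped and is assumed to return the number of errors it makes on the test split):

import numpy as np

def cv_error(X, y, features, train_and_test, folds=10):
    # 10-fold cross-validated error rate of the learner restricted to `features`.
    m = len(y)
    order = np.random.permutation(m)
    errors = 0
    for f in range(folds):
        test = order[f::folds]                     # every folds-th example
        train = np.setdiff1d(order, test)
        errors += train_and_test(X[np.ix_(train, features)], y[train],
                                 X[np.ix_(test, features)], y[test])
    return errors / m

def forward_selection(X, y, train_and_test):
    # Greedy wrapper: repeatedly add the single feature that most reduces CV error.
    selected, best_err = [], np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = [(cv_error(X, y, selected + [j], train_and_test), j)
                  for j in remaining]
        err, j = min(scores)
        if err >= best_err:                        # no single addition helps; stop
            break
        best_err = err
        selected.append(j)
        remaining.remove(j)
    return selected

Backward elimination is the same loop started from the full feature set, deleting one feature at a time.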
Moore and Lee (1994) describe a much more efficient approach to feature selection that combines leave-one-out cross-validation (LOOCV) with the nearest-neighbor algorithm. In leave-one-out cross-validation, each training example is temporarily deleted from the training data, and the nearest-neighbor learning algorithm is applied to predict the class of the example. The total number of classification errors is the leave-one-out cross-validated estimate of the error rate of the learning algorithm. Moore and Lee use the LOOCV error to compare different sets of features with the goal of finding the set of relevant features that minimizes the LOOCV error.

They combine two clever ideas to achieve this. The first idea is called racing: Suppose we are considering two different sets of relevant features, A and B. We repeatedly choose a training example at random, temporarily delete it from the training set, and apply the nearest-neighbor rule to classify it using the features in set A and the features in set B. We count the number of classification errors corresponding to each set. As we process more and more training examples in this leave-one-out fashion, the error rate for feature set A can become so much larger than the error rate for B that we can conclude with high confidence that B is the better feature set and terminate the race. In their paper, Moore and Lee (1994) apply Bayesian statistics to make this termination decision.

The second idea is based on schemas. We can represent each set of relevant features by a


bit vector where a 1 in position j means that feature j is relevant, and a 0 means it is irrelevant. A schema is a vector containing 0s, 1s, and ❀s. A ❀ in position j means that this feature should be selected randomly to be relevant 50 percent of the time. Moore and Lee race pairs of schemas against one another as follows: A training example is randomly selected and temporarily deleted from the training set. The nearest-neighbor algorithm is applied to classify it using each of the two schemas being raced. To classify an example using a schema, features indicated by a 0 are ignored, features indicated by a 1 are selected, and features indicated by a ❀ are selected with probability 0.5. Suppose, for illustration, that we have five features. Moore and Lee begin by conducting five simultaneous pairwise races:

1❀❀❀❀ races against 0❀❀❀❀
❀1❀❀❀ races against ❀0❀❀❀
❀❀1❀❀ races against ❀❀0❀❀
❀❀❀1❀ races against ❀❀❀0❀
❀❀❀❀1 races against ❀❀❀❀0

All the races are terminated as soon as one of the schemas is found to be better than its opponent. In the next iteration, all single-bit refinements of the winning schema are raced against one another. For example, suppose schema ❀1❀❀❀ was the winner of the first race. Then, the next iteration involves the following four pairwise races:

11❀❀❀ races against 01❀❀❀
❀11❀❀ races against ❀10❀❀
❀1❀1❀ races against ❀1❀0❀
❀1❀❀1 races against ❀1❀❀0

This process continues until all the ❀s are removed from the winning schema. Moore and Lee found that this method never misses important relevant features, even when the features are highly dependent. However, in some rare cases, the races can take a long time to conclude. The problem appears to be that the algorithm can be slow to replace a ❀ with a 0. In contrast, the algorithm is quick to replace a ❀ with a 1. Moore and Lee, therefore, investigated an algorithm, called SCHEMATA+, that terminates each race after 2000 evaluations (in favor of the 0, using the race statistics to choose the feature least likely to be relevant). The median number of training examples that must be evaluated by SCHEMATA+ is only 13 percent of the number of evaluations required by greedy forward selection and only 11 percent of the number of evaluations required by greedy backward elimination, but it achieves the same levels of accuracy. An important direction for future research is to compare the accuracy and speed of SCHEMATA+ and RELIEF-F to see which works better with various learning algorithms.
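The core of a single race can be sketched as follows (an illustrative simplification: it races two fixed feature subsets with leave-one-out 1-nearest-neighbor evaluation and a crude error-margin stopping rule in place of the Bayesian test, and it does not resample the ❀ positions for every example as the schema version does):

import numpy as np

def nn_predict(X_train, y_train, x, features):
    # 1-nearest-neighbor prediction using only the given features.
    d = ((X_train[:, features] - x[features]) ** 2).sum(axis=1)
    return y_train[np.argmin(d)]

def race(X, y, feats_a, feats_b, max_evals=2000, margin=20):
    # Race feature sets A and B; stop early when one is clearly worse.
    m = len(y)
    errs_a = errs_b = 0
    for _ in range(max_evals):
        i = np.random.randint(m)
        keep = np.arange(m) != i                 # temporarily delete example i
        errs_a += nn_predict(X[keep], y[keep], X[i], feats_a) != y[i]
        errs_b += nn_predict(X[keep], y[keep], X[i], feats_b) != y[i]
        if abs(errs_a - errs_b) > margin:        # terminate the race early
            break
    return 'A' if errs_a <= errs_b else 'B'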
Integrating Feature Weighting into the Learning Algorithm
I now discuss two methods that integrate feature selection directly into the learning algorithm. Both of them have been shown to work well experimentally, and the second method, called WINNOW, works extremely well in problems with thousands of potentially relevant input features.

The first algorithm is called the variable-kernel similarity metric, or VSM method (Lowe 1995). VSM is a form of Gaussian radial basis function method. To classify a new data point xt, it defines a multivariate Gaussian probability distribution ϕ centered on xt with standard deviation σ. Each example (xi, yi) in the training data set "votes" for class yi with an amount ϕ(||xi – xt||; σ). The class with the highest vote is assigned to be the class yt = f(xt) of the data point xt.

The key to the effectiveness of VSM is that it learns a weighted distance metric for measuring the distance between the new data point x and each training point xi. VSM also adjusts the size σ of the Gaussian distribution depending on the local density of training examples in the neighborhood of xt.

In detail, VSM is controlled by a set of learned feature weights w1, …, wn, a kernel radius parameter r, and a number of neighbors R. To classify a new data point xt, VSM first computes the weighted distances to the R nearest neighbors (xi, yi):

di = [Σj=1..n wj² (xtj – xij)²]^1/2.

It then computes a kernel width σ from the average distance to the R/2 nearest neighbors:

σ = (2r/R) Σi=1..R/2 di.

Finally, it computes the probability that the new point xt belongs to each class c (the quantity vi is the vote from training example i, and [[yi = c]] is 1 if yi = c and 0 otherwise):

vi = exp(–di²/2σ)

P[f(xt) = c] = Σi=1..R vi [[yi = c]] / Σi=1..R vi.

VSM then guesses the class with the highest probability.

How does VSM learn the values of the feature weights and the kernel radius r? It does so by performing gradient-descent search to minimize the LOOCV error of the VSM classifier. Lowe (1995) shows how to compute the gradient of the LOOCV error with respect to each of the weights wj and the parameter r. Starting


with initial values for these parameters, VSM computes the R nearest neighbors of each training example (xi, yi). It then computes the gradient and performs a search in the direction of the gradient to minimize LOOCV error (and keeps this set of nearest neighbors fixed). The search along the direction of the gradient is called a line search, and there are several efficient algorithms available (Press et al. 1992). Even though the weights are changing during the line search, the set of nearest neighbors (and the gradient) is not recomputed. Once the error is minimized along this direction of the gradient, the R nearest neighbors are recomputed, a new gradient is computed, and a new line search is performed. Lowe applied the conjugate gradient algorithm to select each new search direction, and he reports that generally only 5 to 30 line searches were required to minimize the LOOCV error. The resulting classifiers gave excellent results on two challenging benchmark tasks.
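The classification step can be written compactly (an illustrative sketch: classes are assumed to be numbered 0, …, K – 1, and the gradient-descent learning of w and r is not shown):

import numpy as np

def vsm_classify(X, y, w, r, R, x_t, num_classes):
    # Variable-kernel similarity metric: classify x_t from training data (X, y).
    d = np.sqrt((((X - x_t) * w) ** 2).sum(axis=1))     # weighted distances d_i
    nearest = np.argsort(d)[:R]                         # R nearest neighbors
    sigma = (2.0 * r / R) * d[nearest][: R // 2].sum()  # width from R/2 nearest
    votes = np.exp(-d[nearest] ** 2 / (2.0 * sigma))    # v_i = exp(-d_i^2 / 2 sigma)
    prob = np.bincount(y[nearest], weights=votes, minlength=num_classes)
    prob /= prob.sum()
    return int(prob.argmax()), prob

The learned weights wj stretch or shrink each feature axis, so an irrelevant feature can be effectively removed by driving its weight toward zero.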
function y = f(x) belongs to some set of classi-
The last feature weighting algorithm I dis-
fiers H and derive a bound on the maximum
cuss is the WINNOW algorithm developed by Lit-
number of mistakes that WINNOW makes when
tlestone (1988). WINNOW is a linear threshold
an adversary is allowed to choose f and the or-
algorithm for two-class problems with binary
der in which the training examples are pre-
(that is, 0/1 valued) input features. It classifies
sented to WINNOW. Littlestone proves the fol-
a new example x into class 2 if
lowing result:
∑wx
j j j >θ Theorem 1: Let f be a disjunction of r out of
and into class 1 otherwise. WINNOW is an online its n input features. Then, WINNOW will learn f
algorithm; it accepts examples one at a time and make no more than 2 + 3r(1 + log2 n) mis-
and updates the weights wj as necessary. takes (for α = 2, β = 1/2, and θ = n).
Pseudocode for the algorithm is shown in fig- This theorem shows that the convergence
ure 9. time of WINNOW is linear in the number of rel-
WINNOW initializes its weights wj to 1. It then evant features r and only logarithmic in the to-
accepts a new example (x, y) and applies the tal number of features n. Similar results hold
threshold rule to compute the predicted class for other values of the α, β, and θ parameters,
y’. If the predicted class is correct (y’ = y), WIN- which permits WINNOW to learn functions
NOW does nothing. However, if the predicted where u out of v of the features must be 1 and
class is wrong, WINNOW updates its weights as many other interesting classes of Boolean
follows. If y’ = 0 and y = 1, then the weights are functions.
too low; so, for each feature such that xj = 1, wj WINNOW is an example of an exponential up-
:= wj · α, where α is a number greater than 1 date algorithm. The weights of the relevant fea-
called the promotion parameter. If y’ = 1 and y = tures grow exponentially, but the weights of
0, then the weights were too high; so, for each the irrelevant features shrink exponentially.
feature xj = 1, it decreases the corresponding General results in computational learning the-
weight by setting wj := wj · β, where β is a num- ory have developed similar exponential update
ber less than 1 called the demotion parameter. algorithms for many applications. A common
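Figure 9 translates almost line for line into numpy (an illustrative sketch; the default parameter values are the ones used in Littlestone's theorem quoted below):

import numpy as np

def winnow(examples, n, alpha=2.0, beta=0.5, theta=None):
    # Online WINNOW: `examples` yields (x, y) pairs with x a 0/1 vector of
    # length n and y in {0, 1}; returns the learned weight vector.
    if theta is None:
        theta = float(n)
    w = np.ones(n)
    for x, y in examples:
        y_hat = int(np.dot(w, x) >= theta)
        if y_hat == 0 and y == 1:
            w[x == 1] *= alpha          # promote: weights were too low
        elif y_hat == 1 and y == 0:
            w[x == 1] *= beta           # demote: weights were too high
    return w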
The fact that WINNOW leaves the weights unchanged when the predicted class is correct is somewhat puzzling to many people. It turns out to be critical for its successful behavior, both theoretically and experimentally. One explanation is that like ADABOOST, this strategy focuses WINNOW's attention on its mistakes. Another explanation is that it helps prevent overfitting.

Littlestone's theoretical analysis of WINNOW introduced the worst-case mistake bound method. The idea is to assume that the true function y = f(x) belongs to some set of classifiers H and derive a bound on the maximum number of mistakes that WINNOW makes when an adversary is allowed to choose f and the order in which the training examples are presented to WINNOW. Littlestone proves the following result:

Theorem 1: Let f be a disjunction of r out of its n input features. Then, WINNOW will learn f and make no more than 2 + 3r(1 + log2 n) mistakes (for α = 2, β = 1/2, and θ = n).

This theorem shows that the convergence time of WINNOW is linear in the number of relevant features r and only logarithmic in the total number of features n. Similar results hold for other values of the α, β, and θ parameters, which permits WINNOW to learn functions where u out of v of the features must be 1 and many other interesting classes of Boolean functions.

WINNOW is an example of an exponential update algorithm. The weights of the relevant features grow exponentially, but the weights of the irrelevant features shrink exponentially. General results in computational learning theory have developed similar exponential update algorithms for many applications. A common property of these algorithms is that they excel when the number of relevant features is small compared to the total number of features.

WINNOW has been applied to several experimental learning problems. Blum (1997) describes an application to a calendar scheduling task. In this task, a calendar system is given a description of a proposed meeting (including the list of invitees and their properties). It must then predict the start time, day of week, location, and duration of the meeting. There were


34 input features available, some with many possible values. Blum defined Boolean features corresponding to all possible values of all pairs of the input features. For example, given two input features—event-type and position-of-attendees—he would define a separate Boolean feature for each legal combination, such as event-type = meeting and position-of-attendees = grad-student. There were a total of 59,731 Boolean input features. He then applied WINNOW with α = 3/2 and β = 1/2. He modified WINNOW to prune (set to 0) weights that become very small (less than 0.00001). WINNOW found that 561 of the Boolean features were actually useful for prediction. Its accuracy was better than that of the best previous classifier for this task (which used a greedy forward-selection algorithm to select relevant features for a decision tree–learning algorithm).

Golding and Roth (1996) describe an application of WINNOW to context-sensitive spelling correction. This is the task of identifying spelling errors where one legal word is substituted for another, such as "It's not to late," where to is substituted for too. The Random House dictionary (Flexner 1983) lists many sets of commonly confused words, and Golding and Roth developed a separate WINNOW classifier for each of the listed sets (for example, {to, too, two}). WINNOW's task is to decide whether each occurrence of these words is correct or incorrect based on its context.

Golding and Roth use two kinds of Boolean input feature. The first kind are context words. A context word is a feature that is true if a particular word (for example, "cloudy") appears within 10 words before or after the target word. The second kind are collocation features. These test for a string of two words or part-of-speech tags immediately adjacent to the target word. For example, the sequence target to VERB is a collocation feature that checks whether the target word is immediately followed by the word to and then a word that can potentially be a verb (according to a dictionary lookup). Based on the one-million-word Brown corpus (Kucera and Francis 1967), Golding and Roth defined more than 10,000 potentially relevant features.

Golding and Roth applied WINNOW (with α = 3/2, β varying between 0.5 and 0.9, and θ = 1). They compared its accuracy to the best previous method, which is a modified naive Bayesian algorithm. Figure 10 shows the results for 21 sets of frequently confused words.

Figure 10. Comparison of the Percentage of Correct Classifications for a Modified Bayesian Method and WINNOW for 21 Sets of Frequently Confused Words. Trained on 80 percent of the Brown corpus and tested on the remaining 20 percent. Points lying above the line y = x correspond to cases where WINNOW is more accurate. (The plot shows WINNOW accuracy against modified Bayes accuracy, both ranging from 70 to 100 percent.)

An important advantage of WINNOW, in addition to its speed and ability to operate online, is that it can adapt rapidly to changes in the target function. This advantage was shown to be important in the calendar scheduling problem, where the scheduling of different types of meeting can shift when semesters change (or during summer break). Blum's experiments showed that WINNOW was able to respond quickly to such changes. By comparison, to enable the decision tree algorithm to respond to changes, it was necessary to decide which old training examples could be deleted. This task is difficult to do because although some kinds of meeting might change with the change in semesters, other meetings might stay the same. A decision to keep, for example, only the most recent 180 training examples means that examples of rare meetings whose scheduling does not change will be lost, which hurts the performance of the decision tree approach. In contrast, because WINNOW only revises the weights on features when they have the value 1, features describing such rare meetings are likely to retain their weights even as the weights for other features are being modified rapidly.

Summary: Scaling Up Learning Algorithms
This concludes my review of methods for scaling up learning algorithms to apply to very large problems. With the techniques described here, problems having one million training examples can be solved in reasonable amounts of


computer time. However, it is not clear whether the current stock of ideas will permit the solution of problems with billions of training examples. An important open problem is to gather more practical experience with very large problems, so that we can understand their properties and determine where these algorithms fail.

A recurring theme is the use of subsamples of the training data to make critical intermediate decisions (such as the choice of relevant features). Another theme is the development of efficient online algorithms, such as WINNOW. These are anytime algorithms that can produce a useful answer regardless of how long they are permitted to run. The longer they run, the better the result they produce.

An important open topic is the problem of handling thousands of output classes. The section Ensembles of Classifiers has already described the two methods that are most appropriate in this case: error-correcting output coding and ADABOOST.OC. Both of these methods should scale well with the number of classes. Error-correcting output coding has been tested on problems with as many as 126 classes, but tests on very large problems with thousands of classes have not yet been performed.

Reinforcement Learning
The previous two sections discussed problems in supervised learning from examples. This section addresses problems of sequential decision making and control that come under the heading of reinforcement learning.

Work in reinforcement learning dates back to the earliest days of AI when Arthur Samuel (1959) developed his famous checkers program. More recently, there have been several important advances in the practice and theory of reinforcement learning. Perhaps the most famous work is Gerry Tesauro's (1992) TD-GAMMON program, which has learned to play backgammon better than any other computer program and almost as well as the best human players. Two other interesting applications are the work of Zhang and Dietterich (1995) on job-shop scheduling and Crites and Barto (1995) on real-time scheduling of passenger elevators.

Kaelbling, Littman, and Moore (1996) published an excellent survey of reinforcement learning, and Mahadevan and Kaelbling (1996) report on a recent National Science Foundation–sponsored workshop on the subject. Two new books (Barto and Sutton 1997; Bertsekas and Tsitskilis 1996) describe the newly developed reinforcement learning algorithms and the theory behind them. I summarize these developments here.

An Introduction to Dynamic Programming
The most important insight of the past five years is that reinforcement learning is best analyzed as a form of online, approximate dynamic programming (Barto, Bradtke, and Singh 1995). I introduce this insight using the following notation: Consider a robot interacting with an external environment. At each time t, the environment is in some state st, and the robot has available some set of actions A. The robot executes an action at, which causes the environment to move to a new state st+1. A convenient way of specifying the desired behavior of the robot is to define an immediate reward function R(st, a, st+1) that specifies a real-valued reward for this transition from st to st+1. For example, we can assign a positive reward to actions that reach a desired goal location, and we can assign a negative reward to undesirable actions such as colliding with walls, people, or other robots. The immediate reward in all other states could be defined as 0. Our long-term goal for the robot can then be defined as some function of the immediate rewards it receives. A commonly used criterion is the cumulative discounted reward,

Σt=0..∞ γ^t R(st, a, st+1),

where 0 ≤ γ < 1 is a discount factor that controls the relative importance of short-term and long-term rewards.

A procedure or rule for choosing each action a given state s is called the policy of the robot, and it can be formalized as a function a = π(s). The goal of reinforcement learning algorithms is to compute the optimal policy, denoted π*, which maximizes the cumulative discounted reward.

Researchers in dynamic programming (for example, Bellman [1957]) found it convenient to define a real-valued function fπ(s) called the value function of policy π. The value function fπ(s) gives the expected cumulative discounted reward that will be received by starting in state s and executing policy π. It can be defined recursively by the formula

fπ(s) = Σs′ P(s′|s, π(s)) · [R(s, π(s), s′) + γfπ(s′)], (1)

where P(s′|s, π(s)) is the probability that the next state will be s′ given that the current state is s, and we take action π(s).

Given a policy π, a reward function R, and the transition probability function P, it is possible to compute the value function fπ by solving the system of linear equations containing one equation of the form of equation 1 for each possible state s. The system of equations


backups can be performed in any convenient


order—even at random. The only requirement
P(s’|s,a) R(s’|s,a) f(s’) is that γ must be small enough so that the sum
of the expected discounted rewards converges,
0.8 s’1 0 10 and the simple backups must be performed un-
til fπ(s) converges for every state s.
s
a 0.1
s’2 Given a value function fπ(s), it is possible to
0 5
compute a new policy π’ that is guaranteed to
be at least as good π. This new policy is com-
0.1 s’3 2 0 puted by performing a one-step lookahead
search and choosing the action whose backed-
up value is the largest:
π’(s) := argmaxa ∑ s’ P(s’|s, a) [R(s, a, s’)
Figure 11. An Example of a Simple Backup. + γfπ(s’)]. (3)
Each arc is labeled with the probability of making the transition. The resulting In words, we consider each possible action a,
states are labeled with their associated immediate reward and their value. compute its expected backed-up value, and de-
fine π’(s) to be the action that gives the maxi-
mum backed-up value.
By alternately computing the value fπ of pol-
icy π and then updating π using equation 2, we
procedure PolicyIteration(P, R)
let πγbe an arbitrary initial policy
can converge to the optimal policy π* and its
repeat until πγis unchanged optimal value function f*. Note that if we plug
perform simple backups to compute f πγfor each state (Equation 2).
update πγfor each state (Equation 3).
f* into equation 2, the new policy is un-
end changed—it is still the optimal policy. This de-
end PolicyIteration fines the algorithm known as policy iteration,
shown in figure 12. It is easy to show that pol-
icy iteration converges in a fixed number of it-
erations. Unfortunately, each iteration can be
Figure 12. The Policy-Iteration Algorithm. expensive because it requires computing the
value of the current policy.
An alternative to policy iteration is to work
directly with the value function. Bellman
proved that the value function f* of the opti-
can be solved using standard methods, such as mal policy π* is the unique fixed point of the
Gaussian elimination or Gauss-Seidel iteration, Bellman equation:
or it can be solved iteratively by converting the f(s) := maxa ∑s’ P(s’|s, a) [R(s, a, s’) + γf(s’)] (4)
equation into an assignment statement:
In other words, we perform a one-step looka-
fπ(s) := ∑s’ P(s’|s, π(s)) · [R(s, π(s), s’)
head just as we did for the policy-improve-
+ γfπ(s’)], (2)
ment step of policy iteration, but instead of re-
This assignment statement is called a simple membering the best action, we update our
backup because it can be viewed as taking the estimate of the value of state s. This is called a
current estimated value(s) fπ(s’) and “backing Bellman backup. It is more expensive than a
them up” to compute a revised estimate for simple backup because we must consider each
fπ(s). An example is shown in figure 11. In state possible action a and then each possible result-
s, we perform action a = π(s). There are three ing state s’. By performing enough Bellman
possible resulting states: backups in every state, we can converge to the
s1' , s2' , and s3' , optimal value function f*. This is called the
value-iteration algorithm, and it is summa-
with probabilities 0.8, 0.1, and 0.1, respective-
rized in figure 13.
ly. The immediate rewards for reaching these
Unfortunately, value iteration can be more
states are 0, 0, and 2, respectively, and the esti-
difficult than policy iteration because (1) each
mated values f(s’) of the states are 10, 5, and 0,
backup is a more expensive Bellman backup
respectively. Assuming γ = 1, the new estimat-
rather than a simple backup and (2) the value
ed value of fπ(s) is
function can take a long time to converge. In-
fπ(s) := (0 + 0.8 · 10) + (0 + 0.1 · 5) deed, it is possible for the optimal policy to
+ (2 + 0.1 · 0) have converged long before the value function
:= 10.5. converges.
To compute the value of a policy, the simple A hybrid algorithm that combines aspects of


both value iteration and policy iteration is


called modified policy iteration. This algorithm is procedure ValueIteration(P, R)
essentially the same as policy iteration except let f be an arbitrary initial value function
repeat until f is unchanged in all states
that only a fixed number of simple backups are for each state s, perform a Bellman backup (Equation 4)
performed in each state in each iteration. end
for each state s, compute the optimal policy (Equation 3)
Thus, the estimated value function for the cur- end ValueIteration
rent policy fπ does not completely converge to
the correct value function, but under fairly
mild conditions, it can be shown that the algo-
rithm will still converge to the optimal policy. Figure 13. The Value-Iteration Algorithm.
All three of these algorithms—policy itera-
tion, value iteration, and modified policy iter- Temporal Difference
ation—require performing backups in every
state; so, their running time scales with the
Learning and TD(λ)
number of states. In fact, value iteration can I begin by describing a simplified version of
run for an infinite amount of time without Sutton’s TD(λ) algorithm, called TD(0). TD(0) is a
converging. Policy iteration requires time at method for computing approximate simple
least O(n3) for problems with n states just to backups online. Suppose we are in state s, and
compute the value of the policy in each itera- we follow the current policy by taking action a
tion. For small problems, it is not a difficulty, = π(s). If we are interacting with a real external
but for problems of interest in AI, the state environment (or with a simulator), the envi-
space often has 1020 or 1040 possible states, ronment makes a probabilistic transition to a
which renders these algorithms infeasible. new state s’ and produces the immediate re-
Bellman termed this the curse of dimensionality ward R(s, a, s’). The TD(0) algorithm observes
because the number of states, and, hence, the this new state and reward and updates the val-
running time, increases exponentially with the ue function as follows:
number of dimensions in the state space. fπ(s) := (1 – α) fπ(s) + α[R(s, a, s’) + γfπ(s’)],
Another drawback of these algorithms is where α is a learning rate parameter. Typical val-
that they require a complete model of the sys- ues for α are between 0.01 and 0.5. (Technical-
tem, by which I mean the transition probabil- ly, the α values must shrink to 0 over time for
ities P(s’|s, a) and the reward function R(s, a, TD(0) to converge.)
s’). There are many applications where this The basic idea of TD(0) is that if we visit s
model is unavailable (for example, in a robot many times and apply action a many times,
interacting with an unknown environment) or then by sampling over time, we get the same
where the model cannot easily be converted effect as if we performed a simple backup. We
into a transition probability matrix. For exam- can do this by sampling from the probability
ple, in the elevator control problem studied by distribution P(s’|s, a) rather than by having di-
Crites and Barto (1995), a software simulator is rect computational access to P.
available that can take a current state s and a This strategy allows TD(0) to compute the
proposed action a and generate the next state value of a policy without having an explicit
s’ according to the transition probability distri- model. The environment serves as its own
bution. However, this distribution is not ex- model!
plicitly represented anywhere, so it is not avail- When Sutton developed TD(0), he also intro-
able for direct use by a dynamic programming duced a second idea. Instead of storing a sepa-
algorithm. The problem of constructing an ex- rate value fπ(s) for each state s, suppose we rep-
plicit probability transition matrix and reward resent the value function as a neural network
function has been called the curse of modeling, (or some other differentiable function approxi-
and in many problems, it is just as severe as the mator) of the form f(s, W), where W is a vector
curse of dimensionality. Reinforcement learn- of adjustable weights. With this representation,
ing algorithms provide a way of overcoming we can’t directly assign a value to a state, but we
can adjust the weights so that f(s, W) is closer to
these two curses.
the desired value by defining an error function:
Reinforcement learning algorithms have
introduced three key innovations: (1) sto- 1 π
J (W ) = ( f ( s, W ) – [ R( s, a, s' ) + γf π ( s' , W )])2
chastic approximation of backups, (2) value- 2
function approximation, and (3) model-free This is (one half) the squared difference be-
learning. I discuss these innovations in the tween the current estimated value of state s,
context of an algorithm known as TD(λ) devel- fπ(s, W), and the backed-up value, which is
oped by Sutton (1988). called the temporal difference error. Our goal is to


s t–7 s t–6 s t–5 s t–4 s t–3 s t–2 s t–1 st s t+1

Figure 14. A Sequence of States.


The eligibility of each state (with λ = 0.8) is shown as a vertical bar.

modify W to reduce the temporal difference ∞

error J(W). Differentiating J(W) (and treating Σ λi ∇ w t − i f ( st − i , Wt − i )


i= 0

only the first occurrence of W as adjustable), can be implemented by maintaining a current


we obtain the learning rule gradient vector G:
W:= W – α∇wf(s, W) (fπ(s, W) Gt : = λGt – 1 + ∇ wt f ( st , Wt ).
– [R(s, a, s’) + γfπ(s’, W)]),
With this change, we get the full algorithm
where ∇wf(s, W) is the gradient of f with respect TD(λ) shown in figure 15. Readers familiar with
to the weights W. This takes a step of size α in the momentum method for stabilizing back
the direction of the decreasing gradient scaled propagation will note that the eligibility trace
by the size of the temporal difference error. mechanism is similar. However, in the mo-
By using a smoothly parameterized function mentum method, previous weight changes are
approximation f(s, W), TD(0) can circumvent remembered, but in the eligibility trace mech-
the curse of dimensionality, provided that f(s, anism, previous gradient vectors are remem-
W) can accurately approximate the true value bered, and future temporal differences deter-
function f(s) with a small number of parame- mine the step size along these previous
ters W. gradients.
Sutton introduced one other wonderful idea There have been many theoretical studies of
in the TD(λ) algorithm—the eligibility trace. Sup- the behavior of TD(λ). The most general results
pose we have visited a sequence of states s1, s2, have been obtained by Tsitsiklis and Van Roy
(1996). They analyze the case where the func-
..., st, st+1, and we are updating the value of f(st,
tion approximator f(s, W) is a linear combina-
Wt). Sutton suggested that we might want to
tion of fixed (and arbitrary) orthogonal basis
update the values of the preceding states as well
functions. To analyze how well f(s, W) can ap-
because the key idea of dynamic programming
proximate fπ, some notion of the distance be-
is to propagate information about expected re-
tween two value functions is needed. Tsitsiklis
wards backward through the state space. Sutton and Van Roy define the following measure:
proposed that we should remember the gradi-
||f – fπ||D = (∑s[f(s) – fπ(s)]2 D(s))1/2,
ent ∇wt-if(st-i,Wt-i) for each state st-i, for i = 1, …, n,
that we have visited. When we update f(st-i,Wt-i) where D(s) is the probability that the policy π
by taking a step in the direction of –∇tf(st, Wt), visits state s.
we also take a smaller step in the direction of With this measure, let
–∇wt–1f(st-1,Wt-1) and an even smaller step in the fˆ π
direction of –∇wt–2f(st-2,Wt-2), and so on. Each step be the best approximation of fπ that can be rep-
size will be decreased by a factor of λ < 1, which resented by a linear combination of the given
gives us the learning rule basis functions. Tsitsiklis and Van Roy prove
∞ that TD(λ) will converge to a function f such that
W : = W – α ( ∑ λi ∇ w t − i f ( st − i , Wt − i ))( f π ( s t , W )
i= 0
|| fˆ π – f * ||D
|| f – f π ||D ≤ .
–[ R( st , a, st +1 ) + γf π ( st +1 , W )]). 1 – γ (1 – λ ) / (1 – λγ )
The value of λi is called the eligibility of state st-i. The quantity on the left-hand side is the er-
Figure 14 shows the eligibility of a sequence of ror between the function f learned by TD(λ) and
states as a bar graph. the true value function fπ for policy π. The nu-
The infinite sum merator on the right-hand side is the inherent


approximation error resulting from the use of


a linear combination of the given, fixed-basis procedure TD(πγ,λ, γγ,α, s0 )
functions. The denominator is the error that initialize G = 0
initialize W randomly
results from the fact that approximation errors s0 is the starting state
in one state will be propagated backward to while W has not converged do
take action a = πγ(st )
earlier states. For large λ, the denominator ap- observe resulting state st+1 and reward R(st , a, st+1 )
proaches 1 (no error), but for λ = 0, the denom- G := λG + ∇W f (s³ t , W ) ´
inator becomes 1 – γ, which could be small αG · f (st , W ) −
W := W − [R(st , a, st+1 ) + γγf (st+1 , W )]
end while
and, hence, produce large errors. For this result return W which defines f πγ
to hold, it is essential that the backups per- end TD
formed by TD(λ) be performed according to the
current policy π. If this condition is not ob-
served, then TD(λ) may fail to converge. Figure 15. The TD(λ) Algorithm for Computing the Value of a Policy π.
The TD(λ) algorithm provides a way of com-
puting the value of a fixed policy without di-
rect access to the transition probabilities and is easy to construct examples where the strate-
reward function. However, this algorithm is gy of always performing the best action based
only of limited utility unless we can perform on the current approximation to the value
the policy-improvement step from equation 3, function, f(s, W), leads to a local minimum.
thereby implementing the policy-iteration al- For example, consider the navigation problem
gorithm. Unfortunately, policy improvement shown in figure 16. There is a large mountain
requires access to a model that can generate separating the start state from the goal. A net-
the possible next states and their probabilities. work of roads passes to the north of the moun-
tain, and a similar (and shorter) network passes
Applications of TD(λ) to the south. Suppose that our initial policy
There are many domains where such a model takes us to the north side, and we eventually
is available. For example, in game-playing set- reach the goal. After updating our value func-
tings, such as backgammon, it is easy to com- tion using TD(λ), suppose that the north side
pute the set of available moves from each state still appears to be shorter than the south side
and the probabilities of all successor states. (because our estimates for the values of the
Hence, TD(λ) can be combined with policy im- states along the south side are too high). Then,
provement to learn an approximately optimal in future trials, we will continue to take the
policy for backgammon, which is what north roads, and we will never try the south-
Tesauro (1995, 1992) did in his famous TD-GAM- ern route.
MON system. This is called the problem of exploration. The
TD-GAMMON uses a neural network represen- heart of the problem is that to find the optimal
tation of the value of a state. The state of the policy, it is necessary to prove that every off-
backgammon game is described by a vector of policy action leads to expected results that are
198 features that encode the locations of the worse than the actions of the optimal policy.
pieces on the board and the values shown on In this situation, it is essential to explore the
the dice. TD-GAMMON begins with a randomly southern path to determine whether it is worse
initialized neural network. It plays a series of (or better!) than the northern path. The strate-
games against itself. At each step, it makes a gy of always taking the action that appears to
full one-step lookahead search, applies the be optimal based on the current value function
neural network to evaluate each of the result- is called the pure exploitation strategy. The ex-
ing states, and makes the move corresponding ample in figure 16 shows that the pure ex-
to the highest backed-up value. In short, it ap- ploitation strategy does not always find the op-
plies equation 3 to compute the action to per- timal policy. Hence, online reinforcement
form next, which is a form of local policy im- learning algorithms must balance exploitation
provement. After making the move, it observes with exploration.
the resulting state and applies the TD(λ) rule to Fortunately, in backgammon, the random
update the value function. In effect, TD-GAM- dice rolls inject so much randomness into the
MON is executing a form of modified policy it- game that TD-GAMMON thoroughly investigates
eration where it alternates between one step of the possible moves in the game. Experimental-
policy evaluation (the TD(λ) update) and one ly, the performance of TD-GAMMON is outstand-
step of policy improvement (the computation ing. It plays much better than any other com-
of the best move to make). puter program, and it is nearly as good as the
There is no guarantee that this algorithm world’s best players. The results of three ver-
will converge to the optimal policy. Indeed, it sions of the program in three separate matches


its resource constraints. The actions in this


search space identify the earliest constraint vi-
olation and repair it by moving tasks later in
time (thus lengthening the schedule). The
search terminates when a violation-free sched-
ule is found.
Zhang and Dietterich reformulated this job-
shop scheduling problem as a reinforcement
learning problem where the optimal policy
chooses a sequence of repairs that produces the
shortest possible schedule. The immediate re-
ward function gives a small cost to each repair
Start G action and a final reward that is inversely pro-
portional to the final length of the schedule.
Zhang and Dietterich applied TD(λ) with a feed-
forward neural network to represent the value
function. The actions in this domain are deter-
ministic, so deliberate exploration is needed.
Their system makes a random exploratory
move with a given probability, β, which is grad-
ually decreased during learning. After learning,
their system finds schedules that are substan-
Figure 16. Two Networks of Roads around a Mountain. tially shorter than the best previous method
Without exploration, an initial policy that follows the northern roads will never (for the same expenditure of computer time).
discover that the southern roads provide a shorter route to the goal. This approach of converting combinatorial op-
timization problems into reinforcement learn-
against human players are shown in table 2. In ing problems should be applicable to many
some situations, the moves chosen by TD-GAM- other important industrial domains.
MON have been adopted by expert humans. These two applications show that when a
A similar strategy was applied by Zhang and simulator is available for a task, it is possible to
Dietterich (1995) to the problem of job-shop solve the reinforcement learning problem us-
scheduling. In job-shop scheduling, a set of ing TD(λ) even in very large search spaces. How-
tasks must be scheduled to avoid resource con- ever, in domains involving interaction with
flicts. Each task requires certain resources hard-to-model real-world environments (for
throughout its duration, and each task has pre- example, robot navigation, factory automa-
requisite tasks that must be completed before it tion), some other method is needed. Two ap-
can be executed. An optimal schedule is one proaches have been explored.
that completes all its tasks in the minimum One approach is to learn a predictive model
amount of time but satisfies all resource and of the environment by interacting with it.
prerequisite constraints. Each interaction with the environment pro-
Zweben, Daun, and Deale (1994) developed vides a training example for supervised learn-
a repair-based search space for this task in ing of the form s’ = env(s, a), where env is the
which each state is a complete schedule (that environment, s and a are the current state and
is, all tasks have assigned start times). The action, and s’ is the resulting state. Standard
starting state is a critical path schedule in which supervised learning algorithms can be applied
every task is scheduled as early as possible, sub- to learn this model, which can then be com-
ject to its prerequisite constraints and ignoring bined with TD(λ) to learn an optimal policy.

Program Training Games Opponents Results


TD-GAMMON 1.0 300,000 Robertie, Davis, Magriel –13 pts/51 games (–0.25 ppg)
TD-GAMMON 2.0 800,000 Goulding, Woolsey, Snellings, Russell, Sylvester –7 pts/38 games (–0.18 ppg)
TD-GAMMON 2.1 1,500,000 Robertie –1 pts/40 games ( –0.02 ppg)

ppg = points per game.

Table 2. Summary of the Performance of TD-GAMMON against Some of the World’s Best Players.
The results are expressed in net points won (or lost) and in points won per game. Taken from Tesauro (1995).


f(s) s

a1 a2 a3 Q(s,a) s,a1

s’1 s’2
f(s’) s’1 s’2 s’3 s’4 s’5 s’6

Q(s’,a’) s’1,a4 s’1,a5 s’1,a6 s’2,a4 s’2,a5 s’2,a6

Figure 17. A Comparison of the Value Function f and the Q Function.


Black nodes represent situations where the agent has chosen an action. White nodes are states where the agent has not yet chosen an action.

The second approach is a model-free algo- a, and observed the resulting state s’ and im-
rithm called Q-LEARNING, developed by Watkins mediate reward R(s, a, s’). We can update the Q
(Watkins 1989; Watkins and Dayan 1992), function as follows:
which is the subject of the next section.
Q(s, a) := (1 – α)Q(s, a) +
Model-Free Reinforcement α[R(s, a, s’) + γmaxa Q(s’, a)].
Learning (Q-LEARNING) Suppose that every time we visit state s, we
Q-LEARNING is an online approximation of value
choose the action a uniformly at random.
iteration. The key to Q-LEARNING is to replace Then, the effect is to approximate a full Bell-
the value function f(s) with an action-value man backup (see equation 4). Each value Q(s,
function, Q(s, a). The quantity Q(s, a) gives the a) is the expected cumulative discounted re-
expected cumulative discounted reward of per- ward of executing action a in s, and the maxi-
forming action a in state s and then pursuing mum of these values is f(s). The random choice
the current policy thereafter. Hence, the value of a ensures that we learn Q values for every ac-
of a state is the maximum of the Q values for tion available in state s. The online updates en-
that state: sure that we experience the resulting states s’
in proportion to their probabilities P(s’|s, a).
f(s) = maxa Q(s,a). In general, we can choose which actions to
We can write down the Q version of the Bell- perform in any way we like, as long as there is
man equation as follows: a nonzero probability of performing every ac-
Q(s, a) = tion in every state. Hence, a reasonable strate-
∑s’ P(s’|s, a) [R(s, a, s’) + maxa’ γQ(s’, a’)]. gy is to choose the best action (that is, the one
whose estimated value Q(s, a) is largest) most
The role of the Q function is illustrated in
of the time but to choose a random action
figure 17, where it is contrasted with the value
with some small probability. Watkins proves
function f. In the left part of the figure, we see
that as long as every action is performed in
that the Bellman backup updates the value of
every state infinitely many times, the Q func-
f(s) by considering the values f(s’) of states that
tion converges to the optimal function Q* with
result from different possible actions. In the
probability 1.
right part of the figure, the analogous backup
Once we have learned Q*, we must convert
works by taking the best of the values Q(s’, a’)
it into a policy. Although this conversion pre-
and backing them up to compute an updated
sented an insurmountable difficulty for TD(0),
value for Q(s, a).
it is trivial for the Q representation. The opti-
The Q function can be learned by an algo-
mal policy can be computed as
rithm that exploits the same insight as
TD(0)—online sampling of the transition prob- π*(s) = argmaxa Q*(s, a).
abilities—and an additional idea—online sam- Crites and Barto (1995) applied Q-LEARNING
pling of the available actions. Specifically, sup- to a problem of controlling 4 elevators in a 10-
pose we have visited state s, performed action story office building during peak “down traf-


700 1
1.12 2.74
^| ^|
600
0.8

Percent of waits > 60 seconds


500
Squared Wait Time

0.6
400

300 0.4

200
0.2
100

0 0
SECTOR DLB HUFF/B LQF HUFF FIM ESA/nq ESA Q SECTOR DLB HUFF/B LQF HUFF FIM ESA/nq ESA Q
Policy Policy

Figure 18. Comparison of Learned Elevator Policy Q with Eight Published Heuristic Policies.

fic.” People press elevator call buttons on the ING is to the problem of assigning radio chan-
various floors of the building to call the eleva- nels for cellular telephone traffic. Singh and
tor to the floor. Once inside the elevator, they Bertsekas (1997) showed that Q-LEARNING could
can also press destination buttons to request find a much better policy than some sophisti-
that the elevator stop at various floors. The el- cated and complex published methods.
evator control decisions are (1) to decide, after
stopping at a floor, which direction to go next Open Problems in
and (2) to decide, when approaching a floor, Reinforcement Learning
whether to stop at the floor or skip it. Crites Many important problems remain unsolved in
and Barto applied rules to make the first deci- reinforcement learning, which reflects the rel-
sion; so, the reinforcement learning problem is ative youth of the field. I discuss a few of these
to learn whether to stop or skip floors. The goal problems here.
of the controller is to minimize the square of First, the use of multilayer sigmoidal neural
the time that passengers must wait for the ele- networks for value-function approximation
vator to arrive after pressing the call button. has worked, but there is no reason to believe
Crites and Barto used a team of four Q-learn- that such networks are well suited to reinforce-
ers, one for each of the four elevator cars. Each ment learning. First, they tend to forget
Q-function was represented as a neural net- episodes (both good and bad) unless they are
work with 47 input features, 20 sigmoidal hid- retrained on the episodes frequently. Second,
den units, and 2 linear output units (to repre- the need to make small gradient-descent steps
sent Q(s, stop) and Q(s, skip)). The immediate makes learning slow, particularly in the early
reward was the (negative of the) squared wait stages. An important open problem is to clarify
time since the previous action. They employed what properties an ideal value-function ap-
a form of random exploration in which ex- proximator would possess and develop func-
ploratory actions are more likely to be chosen tion approximators with these properties. Ini-
if they have higher estimated Q values. tial research suggests that value-function
Figure 18 compares the performance of the approximators should be local averagers that
learned policy to that of eight heuristic algo- compute the value of a new state by interpolat-
rithms, including the best nonproprietary al- ing among the values of previously visited
gorithms. The left-hand graph shows the states (Gordon 1995).
squared wait time, and the right-hand graph A second key problem is to develop rein-
shows the percentage of passengers that had to forcement methods for hierarchical problem
wait more than 60 seconds for the elevator. solving. For very large search spaces, where the
The learned Q policy performs better than all distance to the goal and the branching factor
the other methods. are big, no search method can work well. Of-
Another interesting application of Q-LEARN- ten such large search spaces have a hierarchical


(or approximately hierarchical) structure that choosing actions to optimize expected utility
can be exploited to reduce the cost of search. is exactly the problem faced by general intelli-
There have been several studies of ideas for hi- gent agents. Reinforcement learning provides
erarchical reinforcement learning (for exam- one approach to attacking these problems.
ple, Dayan and Hinton [1993], Kaelbling
[1993], and Singh [1992]).
The third key problem is to develop intelli- Learning Stochastic Models
gent exploration methods. Weak exploration The final topic that I discuss is the area of learn-
methods that rely on random or biased ran- ing stochastic models. Traditionally, re-
dom choice of actions cannot be expected to searchers in machine learning have sought
scale well to large, complex spaces. A property general-purpose learning algorithms—such as
of the successful applications shown previous- the decision tree, rule, neural network, and
ly (particularly, backgammon and job-shop nearest-neighbor algorithms—that could effi-
scheduling) is that even random search reach- ciently search a large and flexible space of clas-
es a goal state and receives a reward. In do- sifiers for a good fit to training data. Although
mains where success is contingent on a long sequence of successful choices, random search has a low probability of receiving any reward. More intelligent search methods, such as means-ends analysis, need to be integrated into reinforcement learning systems as they have been integrated into other learning architectures such as SOAR (Laird, Newell, and Rosenbloom 1987) and PRODIGY (Minton et al. 1989).

A fourth problem is that optimizing cumulative discounted reward is not always appropriate. In problems where the system needs to operate continuously, a better goal is to maximize the average reward per unit time. However, algorithms for this criterion are more complex and not as well behaved. Several new methods have been put forward recently (Mahadevan 1996; Ok and Tadepalli 1996; Schwartz 1993).

The fifth, and perhaps most difficult, problem is that existing reinforcement learning algorithms assume that the entire state of the environment is visible at each time step. This assumption is not true in many applications, such as robot navigation or factory control, where the available sensors provide only partial information about the environment. A few algorithms for the solution of hidden-state reinforcement learning problems have been developed (Littman, Cassandra, and Kaelbling 1995; McCallum 1995; Parr and Russell 1995; Cassandra, Kaelbling, and Littman 1994). Exact solution appears to be difficult. The challenge is to find approximate methods that scale well to large hidden-state applications.

Despite these substantial open problems, reinforcement learning methods are already being applied to a wide range of industrial problems where traditional dynamic programming methods are infeasible. Researchers in the area are optimistic that reinforcement learning algorithms can solve many problems that have resisted solution by machine-learning methods in the past.

Learning Stochastic Models

Although the learning algorithms discussed so far in this article are general, they have a major drawback. In a practical problem where there is extensive prior knowledge, it can be difficult to incorporate this prior knowledge into these general algorithms. A secondary problem is that the classifiers constructed by these general learning algorithms are often difficult to interpret—their internal structure might not have any correspondence to the real-world process that is generating the training data.

Over the past five years or so, there has been tremendous interest in a more knowledge-based approach based on stochastic modeling. A stochastic model describes the real-world process by which the observed data are generated. Sometimes, the terms generative stochastic model and causal model are used to emphasize this perspective. The stochastic model is typically represented as a probabilistic network—a graph structure that captures the probabilistic dependencies (and independencies) among a set of random variables. Each node in the graph has an associated probability distribution, and from these individual distributions, the joint distribution of the observed data can be computed. To solve a learning problem, the programmer designs the structure of the graph and chooses the forms of the probability distributions, yielding a stochastic model with many free parameters (that is, the parameters of the node-probability distributions). Given a training sample, learning algorithms can be applied to determine the values of the free parameters, thereby fitting the model to the data. Once a stochastic model has been learned, probabilistic inference can be carried out to support tasks such as classification, diagnosis, and prediction.

More details on probabilistic networks are given in two recent textbooks: Jensen (1996) and Castillo, Gutierrez, and Hadi (1997).


Probabilistic Networks

Figure 19 is an example of a probabilistic network that might be used to diagnose diabetes. There are six variables (with their abbreviations): (1) Age: age of patient (A); (2) Preg: number of pregnancies (N); (3) Mass: body mass (M); (4) Insulin: blood-insulin level (after a glucose-tolerance test) (I); (5) Glucose: blood-glucose level (after a glucose-tolerance test) (G); and (6) Diabetes: true if the patient has diabetes (D).

Figure 19. A Probabilistic Network for Diabetes Diagnosis.

In a medical diagnosis setting, the first five variables would be observed and then the computer would estimate the probability that the patient has diabetes (that is, estimate the probability that the diabetes variable is true).

This network corresponds to the following decomposition of the joint-probability distribution among the six variables:

P(A, N, M, I, G, D) = P(A) · P(N) · P(M|A, N) · P(D|M, A, N) · P(I|D) · P(G|I, D).

Each node in the network corresponds to a probability distribution of the form P(Node|Parents), where the Parents of Node are the nodes with arcs pointing to Node. In other words, if we believe the network is a correct representation of the relationships among the variables, then it should be possible to factor the joint probability distribution into the product of these smaller distributions.

The structure of the network—particularly the arcs that are absent—can be viewed as specifying conditional independencies. Two variables A and B are conditionally independent given C if

P(A, B|C) = P(A|C) · P(B|C).

In the graph, Age affects Insulin only through the Diabetes node; so, Age and Insulin are conditionally independent given Diabetes. More generally, given the values of its parents, a node is independent of all other nodes in the graph except its descendants. Formally,

P(A, B|Parents(A)) = P(B|Parents(A)) · P(A|Parents(A)),

unless B is a descendant of A.

In addition to specifying the structure of the network, we need to specify how each probability distribution is represented. One standard approach is to discretize the variables into a small number of values and represent each probability distribution as a table. For example, suppose we discretized Age into the values {0–25, 26–50, 51–75, > 75}, Preg into the values {0, 1, > 1}, and Mass into the values {0–50 kilograms (kg), 51–100 kg, > 100 kg}. Then, the probability distributions for these three nodes could be represented by the probability tables shown in table 3. The learning task is to fill in the probability values in these tables. The table for P(A) requires three independent parameters (because the four values must sum to 1). The table for P(N) requires 2 parameters, and the table for P(M|A, N) requires 24 parameters (because each row must sum to 1). Similar tables would be required for the other three nodes in the network.

Given a set of training examples, this learning problem is easy to solve. Each probability can be computed directly from the training data. For example, the cell P(N = 1) can be computed as the proportion of patients in the sample that had exactly 1 pregnancy. The parameter P(M = 51–100|A = 26–50, N = 1) is the fraction of training examples with Age = 26–50 and Preg = 1 that have a Mass = 51–100 kg. Technically, these are the maximum-likelihood estimates of each of the probabilities. A difficulty that can arise is that some of the cells in the tables can have few examples; so, the resulting probability estimates are uncertain. One solution is to smooth probabilities for adjacent cells in the table. For example, we might require that P(M = 51–100|A = 26–50, N = 1) have a value similar to P(M = 51–100|A = 26–50, N = >1). Of course, we could also take an ensemble approach and generate an ensemble of fitted stochastic models, as described in the section Ensembles of Classifiers.
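To make the counting concrete, here is a minimal sketch (in Python; it is my illustration, not code from the article) of how the table P(M|A, N) could be estimated from a list of discretized patient records. The field names and the optional smoothing constant are assumptions for the example; with smoothing set to zero, the estimates are exactly the maximum-likelihood proportions described above.

```python
from collections import Counter, defaultdict

# Each training record is a dict of discretized values, e.g.
# {"Age": "26-50", "Preg": "1", "Mass": "51-100", ...}  (assumed field names)
MASS_VALUES = ["0-50", "51-100", ">100"]  # assumed discretization of Mass

def estimate_mass_cpt(records, smoothing=0.0):
    """Estimate P(Mass | Age, Preg) from discretized training records.

    Returns a dict mapping (age, preg) -> {mass_value: probability}.
    With smoothing=0.0 these are the maximum-likelihood estimates; a small
    positive value (e.g., 1.0) gives Laplace-smoothed estimates, one simple
    way to handle table cells that contain few training examples.
    """
    counts = defaultdict(Counter)
    for r in records:
        counts[(r["Age"], r["Preg"])][r["Mass"]] += 1

    cpt = {}
    for parents, c in counts.items():
        total = sum(c.values()) + smoothing * len(MASS_VALUES)
        cpt[parents] = {v: (c[v] + smoothing) / total for v in MASS_VALUES}
    return cpt

if __name__ == "__main__":
    data = [
        {"Age": "26-50", "Preg": "1", "Mass": "51-100"},
        {"Age": "26-50", "Preg": "1", "Mass": "0-50"},
        {"Age": "0-25", "Preg": "0", "Mass": "51-100"},
    ]
    table = estimate_mass_cpt(data)
    # Proportion of patients with Age 26-50 and Preg 1 whose Mass is 51-100 kg.
    print(table[("26-50", "1")]["51-100"])  # 0.5
```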
The process of learning a stochastic model consists of three steps: (1) choosing the graph structure, (2) specifying the form of the probability distribution at each node in the graph, and (3) fitting the parameters of these probability distributions to the training data. In most current applications, steps 1 and 2 are performed by a user, and step 3 is performed by a learning algorithm. However, later in this section, I briefly discuss methods for automating step 1—learning the graph structure.

Age      P(A)            Preg    P(N)
0–25                     0
26–50                    1
51–75                    >1
>75

P(M|A, N)
Age      Preg    0–50    51–100    >100
0–25     0
0–25     1
0–25     >1
26–50    0
26–50    1
26–50    >1
51–75    0
51–75    1
51–75    >1
>75      0
>75      1
>75      >1

Table 3. Probability Tables for the Age, Preg, and Mass Nodes from Figure 19.
A learning algorithm must fill in the actual probability values based on the observed training data.

Once we learn the model, how can we apply it to predict whether new patients have diabetes? For a new case, we observe the values of all the variables in the model except for the Diabetes node. Our goal is to compute the probability of the node; that is, we seek the distribution P(D|A, N, M, I, G). We could precompute this distribution offline before observing any of the data, but the resulting conditional probability table would be immense. A better approach is to wait until the values of the variables have been observed and then compute the single corresponding row of the class probability table.

This inference problem has been studied intensively, and a general and elegant algorithm—the junction-tree algorithm—has been developed (Jensen, Lauritzen, and Olesen 1990). In addition, efficient online algorithms have been discovered (D’Ambrosio 1993). In the worst case, these algorithms require exponential time, but if the probabilistic network is sparsely connected, the running time is reasonable.

The Naive Bayes Classifier

A simple approach to stochastic modeling for classification problems is the so-called naive Bayes classifier (Duda and Hart 1973). In this approach, the training examples are assumed to be produced by the probabilistic network shown in figure 20, where the class variable is y, and the features are x1, …, xn. According to this model, the environment generates an example by first choosing (stochastically) which class to generate. Then, once the class is chosen, the features describing the example are generated independently according to their individual distributions P(xj|y).

Figure 20. Probabilistic Network for the Naive Bayes Classifier.

In most applications, the values of the features are discretized so that each feature takes on only a small number of discrete values. The probability distributions are represented as tables as in the diabetes example, and the network can be learned directly from the training data by counting the fraction of examples in each class that take on each feature value.


Because of the simple form of the network, it is easy to derive a classification rule through the application of Bayes's rule. Suppose there are only two classes, 1 and 2. Then, our decision rule is to classify a new example into class 1 if P(y = 1|x) > P(y = 2|x) or, equivalently, if P(y = 1|x)/P(y = 2|x) > 1. By Bayes's rule, we can write

P(y = 1|x) = P(x|y = 1) · P(y = 1)/P(x)
P(y = 2|x) = P(x|y = 2) · P(y = 2)/P(x).

Dividing the first equation by the second allows us to cancel the normalizing denominator P(x) and obtain

P(y = 1|x)/P(y = 2|x) = [P(x|y = 1) · P(y = 1)] / [P(x|y = 2) · P(y = 2)].

The quantity P(x|y = 1) is just the product of the individual probabilities P(xj|y = 1); so, we have

P(y = 1|x)/P(y = 2|x) = [P(x1|y = 1) · P(x2|y = 1) ⋯ P(xn|y = 1) · P(y = 1)] / [P(x1|y = 2) · P(x2|y = 2) ⋯ P(xn|y = 2) · P(y = 2)].

This gives the decision rule that we should classify an example into class 1 if and only if

[P(x1|y = 1)/P(x1|y = 2)] · [P(x2|y = 1)/P(x2|y = 2)] ⋯ [P(xn|y = 1)/P(xn|y = 2)] · [P(y = 1)/P(y = 2)] > 1.

Despite the fact that the naive Bayes model barely deserves the name model in many applications, it performs surprisingly well. Figure 21 compares the performance of C4.5 to the naive Bayes classifier on 28 benchmark tasks (Domingos and Pazzani 1996). The results show that except for a few domains where naive Bayes performs badly, it is typically competitive with, or superior to, C4.5. Domingos and Pazzani showed that the algorithm is robust to violations of the assumption that the features are generated independently.

Figure 21. Comparison of C4.5 and the Naive Bayesian Classifier on 28 Data Sets. (The plot shows naive Bayes accuracy against C4.5 accuracy.)
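The decision rule can be implemented directly from counts of the training data. The sketch below is my own illustration (not code from the article): it assumes two classes labeled 1 and 2 and discrete features, adds Laplace smoothing so that unseen feature values do not zero out the product, and evaluates the product of ratios in log space to avoid numerical underflow.

```python
import math
from collections import Counter

class NaiveBayes:
    """Two-class naive Bayes over discrete features (classes 1 and 2)."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing

    def fit(self, X, y):
        # X: list of feature tuples; y: list of class labels in {1, 2}.
        self.classes = [1, 2]
        self.n_features = len(X[0])
        self.class_counts = Counter(y)
        # feature_counts[c][j][v] = number of class-c examples with x_j = v
        self.feature_counts = {c: [Counter() for _ in range(self.n_features)]
                               for c in self.classes}
        self.values = [set() for _ in range(self.n_features)]
        for xi, yi in zip(X, y):
            for j, v in enumerate(xi):
                self.feature_counts[yi][j][v] += 1
                self.values[j].add(v)
        return self

    def _log_p_x_given_y(self, x, c):
        # log P(x|y=c) = sum_j log P(x_j|y=c), with Laplace smoothing.
        total = 0.0
        for j, v in enumerate(x):
            num = self.feature_counts[c][j][v] + self.smoothing
            den = self.class_counts[c] + self.smoothing * len(self.values[j])
            total += math.log(num / den)
        return total

    def predict(self, x):
        # Classify into class 1 iff the log-odds, including the prior
        # ratio P(y=1)/P(y=2), is positive.
        log_odds = (self._log_p_x_given_y(x, 1) - self._log_p_x_given_y(x, 2)
                    + math.log(self.class_counts[1] / self.class_counts[2]))
        return 1 if log_odds > 0 else 2

if __name__ == "__main__":
    X = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "hot")]
    y = [1, 1, 2, 2]
    clf = NaiveBayes().fit(X, y)
    print(clf.predict(("sunny", "hot")))  # 1
```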
Unsupervised Learning

An important application of stochastic models is to problems of unsupervised learning. In unsupervised learning, we are given a collection of examples {x1, …, xm}, and our goal is to construct some model of how these examples are generated. For example, we might believe that these examples belong to some collection of classes, and we want to determine the properties of these classes. This is sometimes called clustering the data into classes, and many algorithms have been developed that apply a measure of the distance between two examples to group together nearby examples.

A stochastic modeling approach for unsupervised learning works essentially the same way as for supervised learning. We begin by defining the structure of the model as a probabilistic network. One commonly used model is the same naive network used by the naive Bayes classifier shown in figure 20.

Although we can use the same network structure, the problem of learning the parameters of the network is much more difficult because our data do not contain the values of the class variable y. This is a simple case of the problem of fitting stochastic models that contain hidden variables, which are variables whose values are not observed in the training data.

Many different algorithms have been developed for fitting networks containing hidden variables. As with the naive Bayes classification algorithm, the basic goal (at least for this article) is to compute the maximum-likelihood estimates of the parameters of the network, which I shall refer to as the vector of weights W. In other words, we want to find the value of W that maximizes P(S|W), where S is the observed training sample. This is typically formulated as the equivalent problem of maximizing the log likelihood: log P(S|W). Under the assumption that each training example in S is generated independently, this is equivalent to maximizing ∑i log P(xi|W), where xi is the i-th training example.

I briefly sketch three algorithms: (1) gradient descent, (2) expectation maximization, and (3) Gibbs sampling.

Gradient Descent for Bayes Networks

Russell et al. (1995) describe a method for computing the gradient of the log likelihood: ∇W log P(xi|W). Let us focus on a particular node in the network representing variable V. Let U be the parents of V, and let w = P(V = v|U = u) be the entry in the conditional probability table for V when V = v and U = u. Then, Russell et al. show that

∂ log P(xi|W)/∂w = P(V = v, U = u|W, xi) / w.

The conditional probability in the numerator can be computed by any algorithm for inference in probabilistic networks (including the junction-tree algorithm mentioned earlier).

With this formula, the gradient of the parameters of a network can be computed with respect to a training sample S. We can then apply standard gradient-descent algorithms, such as fixed step-size methods or the conjugate gradient algorithm, to search for the maximum-likelihood set of weights. One subtlety is that we must ensure that the weights lie between 0 and 1 and sum to 1 appropriately. It suffices to constrain and renormalize them after each step of gradient descent. Russell et al. have tested this algorithm on a wide variety of probabilistic network structures.
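As a rough illustration of one such gradient step (again my sketch, not the algorithm as published by Russell et al.), suppose a routine infer_joint(x, v, u) is available that returns P(V = v, U = u|W, x) under the current parameters, for example, by calling a junction-tree implementation. The renormalization at the end is the constraint-handling step mentioned above.

```python
def gradient_step(cpt, training_set, infer_joint, learning_rate=0.01):
    """One gradient-ascent step on a CPT stored as cpt[u][v] = P(V=v | U=u).

    infer_joint(x, v, u) is assumed to return P(V=v, U=u | W, x) under the
    current parameters; summing the per-example terms gives the gradient of
    the log likelihood of the whole training sample.
    """
    # Accumulate d log P(x|W) / d w = P(V=v, U=u | W, x) / w over the sample.
    grads = {u: {v: 0.0 for v in dist} for u, dist in cpt.items()}
    for x in training_set:
        for u, dist in cpt.items():
            for v, w in dist.items():
                if w > 0.0:
                    grads[u][v] += infer_joint(x, v, u) / w

    # Take a step uphill and renormalize each row so it stays a distribution.
    for u, dist in cpt.items():
        for v in dist:
            dist[v] = max(dist[v] + learning_rate * grads[u][v], 1e-12)
        total = sum(dist.values())
        for v in dist:
            dist[v] /= total
    return cpt
```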
The Expectation Maximization Algorithm

The second algorithm is the expectation-maximization (EM) algorithm (Dempster, Laird, and Rubin 1977). EM can be applied to probabilistic networks if the node-probability distributions belong to the exponential family of distributions (which includes the binomial, multinomial, exponential, Poisson, and normal distributions). In particular, conditional probability tables have multinomial distributions; so, the EM algorithm can be applied to the case described in this article.

The EM algorithm is an iterative algorithm that starts with an initial value for W and incrementally modifies W to increase the likelihood of the observed data. One way to understand EM is to imagine that each training example is augmented to include parameters describing values of the hidden variables. For concreteness, consider the simple case of figure 20 with the class variable y hidden. Assume y takes on two values, 1 and 2. Then, we would augment each training example with P(y = 1|x) and P(y = 2|x). (Actually, P(y = 2|x) is redundant in this case because P(y = 2|x) = 1 – P(y = 1|x).) More generally, each observed example xi will be augmented with the expected values of the sufficient statistics for describing the probability distribution of the hidden variables.

The EM algorithm alternates between two steps until W converges:

First is the E-step: Given the current value of W, compute the augmentations P(yi = 1|xi) and P(yi = 2|xi) for each example xi.

Second is the M-step: Given the augmented data set, compute the maximum-likelihood estimates for W under the assumption that the probability distribution of the values of the hidden variables is correctly specified in each augmented training example.

In the case of figure 20, the E-step applies Bayes's theorem:

P(yi = 1|xi) = P(xi|yi = 1) · P(yi = 1)/P(xi).

All the quantities on the right-hand side can be computed given the current value for W. The quantity P(xi|yi = 1) is the product of P(xij|yi = 1) for each feature j, and P(yi = 1) is the current estimated probability of generating an example in class 1, P(y = 1).

The M-step for naive Bayes classification must estimate each of the weights from the augmented training examples. To estimate P(y = 1), we sum the augmented value P(yi = 1) over all i and divide by the sample size m:

P(y = 1) = (1/m) ∑i P(yi = 1).

To estimate the conditional probability that feature xj is 1 for examples in class 1, we take each training example i that has xij = 1, sum up the augmented values P(yi = 1), and divide by the total ∑i P(yi = 1) = m · P(y = 1):

P(xj = 1|y = 1) = [∑{i: xij = 1} P(yi = 1)] / [m · P(y = 1)].

In effect, we treat each augmented training example as if it were a member of class 1 with probability P(yi = 1) and as if it were a member of class 2 with probability 1 – P(yi = 1).
WINTER 1997 125


A well-known application of the EM algorithm in unsupervised clustering is the AUTOCLASS program (Cheeseman et al. 1988). In addition to discrete variables (of the kind I have been discussing), AUTOCLASS can handle continuous variables. Instead of computing the maximum-likelihood estimates of the parameters, it adopts a prior probability distribution over the parameter values and computes the maximum a posteriori probability (MAP) values. This is easily accomplished by a minor modification of EM. One of the most interesting applications of AUTOCLASS was to the problem of analyzing the infrared spectra of stars. AUTOCLASS discovered a new class of star, and this discovery was subsequently accepted by astronomers (Cheeseman et al. 1988).

Gibbs Sampling

The final algorithm that I discuss is a Monte Carlo technique called Gibbs sampling (Geman and Geman 1984). Gibbs sampling is a method for generating random samples from a joint-probability distribution P(A1, …, An) when sampling directly from the joint distribution is difficult. Suppose we know the conditional distribution of each variable Ai in terms of all the others: P(A1|A2, …, An), P(A2|A1, A3, …, An), …, P(An|A1, …, An–1). The Gibbs sampler works as follows: We start with a set of arbitrary values, a1, …, an, for the random variables. For each value of i, we then sample a new value ai for random variable Ai according to the distribution P(Ai|A1 = a1, …, Ai–1 = ai–1, Ai+1 = ai+1, …, An = an). If we repeat this long enough, then under certain mild conditions the empirical distribution of these generated points will converge to the joint distribution.

How is this useful for learning in probabilistic networks? Suppose we want to take a full Bayesian approach to learning the unknown parameters in the network. In such cases, we want to compute the posterior probability of the unknown parameters given the data. Let W denote the vector of unknown parameters. In a full Bayesian approach, we treat W as a random variable with a prior probability distribution P(W). Given the training examples {x1, …, xm}, we want to compute P(W|x1, …, xm), which is difficult because of the hidden classes y1, …, ym. However, using the Gibbs sampler, we can generate samples from the distribution P(W, y1, …, ym|x1, …, xm). Then, by simply ignoring the y values, we obtain samples for W: W1, …, WL.

These W values constitute an ensemble of learned values for the network parameters. To classify a new data point x, we can apply each of the values of Wl to predict the class of x and have them vote. The voting is with equal weight. If some W values have higher posterior probability than others, then they will appear more often in our sample.

To apply the Gibbs sampler to our unsupervised learning problem, we begin by choosing random initial values for the parameters W of the network and setting l to zero. We then repeat the following loop L times.

First, compute new values for y1, …, ym: From the probabilistic network in figure 20, we can compute P(yi|W, xi) because we know (current guesses for) W and (observed values for) xi. To sample from this distribution, we flip a biased coin with probability of heads P(yi|W, xi).

Second, compute new values for W: Let w0 be the parameter that represents the probability of generating an example from class 0. Suppose the prior distribution, P(w0), is the uniform distribution for all values 0 ≤ w0 ≤ 1. Let m0 be the number of training examples (currently) assigned to class 0. Then, the posterior probability P(w0|y1, …, ym) has a special form known as a beta distribution with parameters m0 + 1 and m – m0 + 1. Algorithms are available for drawing samples from this distribution. Similarly, let wjv0 be the parameter that represents the probability that the j-th feature will have the value v when drawn from class 0: P(xj = v|yi = 0). Again assuming uniform priors for P(wjv0), this variable also has a beta distribution with parameters cjv0 + 1 and m0 – cjv0 + 1, where cjv0 is the number of training examples in class 0 having xj = v. As with w0, we can sample from this distribution. We can do the same thing for the parameters concerning class 1: wjv1.

Third, record the value of the parameter vector: Let l := l + 1 and set Wl := W. To allow the Gibbs sampler to converge to a stationary distribution, we should perform some number of iterations before recording L values of the parameter vector. This procedure gives us an ensemble of L probabilistic networks that can be applied to classify new data points.
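A minimal version of this loop, under the same two-class, binary-feature assumptions used earlier, might look like the following sketch (my illustration; it is not the BUGS system or code from the article). The beta posteriors described in the text are sampled with random.betavariate.

```python
import random

def gibbs_naive_bayes(X, n_samples=100, burn_in=100, seed=0):
    """Gibbs sampling for a two-class naive Bayes model with hidden classes.

    X is a list of binary feature tuples.  Returns a list of sampled
    parameter vectors (w0, wj), where w0 approximates P(y = 0) and
    wj[c][j] approximates P(x_j = 1 | y = c), drawn from the posterior
    under uniform (beta(1, 1)) priors.
    """
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    y = [rng.randint(0, 1) for _ in range(m)]      # arbitrary initial classes
    w0 = 0.5
    wj = [[0.5] * n for _ in range(2)]
    samples = []

    for it in range(burn_in + n_samples):
        # Step 1: resample each hidden class label y_i given W and x_i.
        for i, x in enumerate(X):
            like = [w0, 1.0 - w0]
            for c in range(2):
                for j in range(n):
                    like[c] *= wj[c][j] if x[j] == 1 else (1.0 - wj[c][j])
            p0 = like[0] / (like[0] + like[1] + 1e-300)
            y[i] = 0 if rng.random() < p0 else 1

        # Step 2: resample the parameters from their beta posteriors.
        m0 = y.count(0)
        w0 = rng.betavariate(m0 + 1, m - m0 + 1)
        for c in range(2):
            mc = y.count(c)
            for j in range(n):
                cjv = sum(1 for i, x in enumerate(X) if y[i] == c and x[j] == 1)
                wj[c][j] = rng.betavariate(cjv + 1, mc - cjv + 1)

        # Step 3: record the parameter vector, discarding the burn-in phase.
        if it >= burn_in:
            samples.append((w0, [row[:] for row in wj]))
    return samples
```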
There is a close resemblance between Gibbs sampling and the EM algorithm. In both algorithms, the training examples are augmented with information about the hidden variables. In EM, this information describes the probability distribution of the hidden variables, whereas in Gibbs sampling, this information consists of random values drawn according to the probability distribution. In EM, the parameters are recomputed to be their maximum-likelihood estimates based on the current values of the hidden variables, whereas in Gibbs sampling, new parameter values are sampled from the posterior distribution given the current values of the hidden variables. Hence, EM can be viewed as a maximum-likelihood approximation (or MAP approximation) to Gibbs sampling.

One potential problem that can arise in Gibbs sampling is caused by symmetries in the stochastic model. In the case we are considering, for example, suppose there are two true underlying hidden classes, class A and class B. We want the model to generate examples from class A when y = 1 and from class B when y = 2. However, there is nothing to force the model to do this. It could just as easily use y = 1 to represent examples of class B and y = 2 to represent examples of class A. If permitted to run long enough, in fact, the Gibbs sampler should explore both possibilities. If our sample of weight vectors Wl includes weight vectors corresponding to both of these alternatives, then when we combine these weight vectors, we will get bad results. In practice, this problem often does not arise, but in general, steps must be taken to remove symmetries from the model (see Neal [1993]).

Gibbs sampling is a general method; it can often be applied in situations where the EM algorithm cannot. The generality of Gibbs sampling has made it possible to construct a general-purpose programming environment, called BUGS, for learning stochastic models (Gilks, Thomas, and Spiegelhalter 1993). In this environment, the user specifies the graph structure of the stochastic model, the form of the probability distribution at each node, and prior distributions for each parameter. The system then develops a Gibbs sampling algorithm for fitting this model to the training data. BUGS can be downloaded from www.mrc-bsu.cam.ac.uk/bugs/.

More Sophisticated Stochastic Models

This section presents three examples of more sophisticated stochastic models that have been developed: (1) the hierarchical mixture of experts (HME) model, (2) the hidden Markov model (HMM), and (3) the dynamic probabilistic network (DPN). Like the naive model from figure 20, these models can be applied to a wide number of problems without performing the kind of detailed modeling of causal connections that we performed in the diabetes example from figure 19.

The Hierarchical Mixture of Experts

The HME model (Jordan and Jacobs 1994) is intended for supervised learning in situations where one believes the training data are being generated by a mixture of separate experts. For example, in a speech-recognition system, we might face the task of distinguishing the spoken words bee, tree, gate, and mate. This task naturally decomposes into two hard subproblems: (1) distinguishing bee from tree and (2) distinguishing gate from mate.

Figure 22 shows the probabilistic network for a simple mixture-of-experts model for this case. According to this model, a training example (xi, yi) is generated by first generating the data point xi, then choosing an expert ei stochastically (depending on the value of xi), and then choosing the class yi depending on the values of xi and ei.

Figure 22. Probabilistic Model Describing a Mixture of Experts.

There are three things to note about this model. First, the direction of causality is reversed from the naive Bayes network of figure 20. Second, if we assume that all the features of each xi will always be observed, we do not need to model the probability distribution P(X) on the bottom node in the graph. Third, unless we make some strong assumptions, the probability distribution P(Y|X, E) is going to be extremely complex because it must specify the probability of the classes as a function of every possible combination of features for X and expert E.

In their development of this general model, Jordan and Jacobs assume that each probability distribution has a simple form (see figure 23, which shows the model as a kind of neural network classifier). Each value of the random variable E specifies a different expert. The input features x are fed into each of the experts and into a gating network. The output of each expert is a probability distribution over the possible classes. The output of the gate is a probability distribution over the experts. This overall model has the following analytic form:

P(y|x) = ∑e ge(x) pe(y|x),

where the index e varies over the different experts. The value ge(x) is the output of the gating network for expert e. The value pe(y|x) is the probability distribution over the various classes output by expert e.

Figure 23. The Mixture-of-Experts Model Viewed as a Specialized Neural Network.

Jordan and Jacobs have investigated networks where the individual experts and the gating network have simple forms. In a two-class problem, each expert has the form

Pe(y|x) = σ(weT x),

where weT is the transpose of a vector of parameters, x is the vector of input feature values, and σ is the usual logistic sigmoid function σ(z) = 1/(1 + exp(–z)). The gating network was described earlier in Ensembles of Classifiers. The gating values are computed according to

ze = veT x
ge = exp(ze) / ∑u exp(zu).

In other words, ze is the dot product of a weight vector ve and the input features x. The output ge is the soft-max of the ze values. The ge values are all positive and sum to 1. This is known as the multinomial logit model in statistics.
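For concreteness, the forward computation of this one-level mixture can be sketched as follows (my illustration). The expert and gate weight vectors are taken as given here; fitting them is the learning problem discussed next.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def mixture_of_experts_p1(x, expert_weights, gate_weights):
    """P(y = 1 | x) for a one-level mixture of logistic experts.

    expert_weights: list of weight vectors w_e, one per expert, so that
        p_e(y = 1 | x) = sigmoid(w_e . x).
    gate_weights: list of weight vectors v_e; the gate outputs
        g_e(x) = softmax(v_e . x), which are positive and sum to 1.
    The prediction is P(y = 1 | x) = sum_e g_e(x) * p_e(y = 1 | x).
    """
    z = [dot(v, x) for v in gate_weights]
    z_max = max(z)                                # subtract max for stability
    exp_z = [math.exp(zi - z_max) for zi in z]
    g = [e / sum(exp_z) for e in exp_z]           # soft-max gating values
    p = [sigmoid(dot(w, x)) for w in expert_weights]
    return sum(ge * pe for ge, pe in zip(g, p))

if __name__ == "__main__":
    x = [1.0, 0.5]                                # includes a bias-like feature
    experts = [[2.0, -1.0], [-1.5, 3.0]]
    gates = [[0.5, 0.0], [-0.5, 0.0]]
    print(round(mixture_of_experts_p1(x, experts, gates), 3))
```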


The problem of learning the parameters for a mixture-of-experts model is similar to the problem of unsupervised learning, except that here, the hidden variable is not the class yi of each training example i but, rather, the expert ei that was responsible for generating the training example. If we knew which expert generated each training example, then we could fit the parameters directly from the training data, as we did for the naive Bayes algorithm. Jordan and Jacobs have applied both gradient-descent and the EM algorithm to solve this learning problem. For the particular choice of sigmoid and soft-max functions for the experts and gates, the EM algorithm has a particularly efficient implementation as a sequence of weighted least squares problems.

The simple one-level hierarchy of experts shown in figure 22 can be extended to deeper hierarchies. Figure 24 shows a three-level hierarchy. All the fitting algorithms apply to this more general case as well.

Figure 24. The Hierarchical Mixture-of-Experts Model.

The Hidden Markov Model

Figure 25 shows the probabilistic network of a hidden Markov model (HMM) (Rabiner 1989). An HMM generates examples that are strings of length n over some alphabet A of letters. Hence, each example x is a string of letters: o1o2…on (where the o stands for observable). To generate a string, the HMM begins by generating an initial state s1 according to probability distribution P(s1). From this state, it then generates the first letter of the string according to the distribution P(o1|s1). It then generates the next state s2 according to P(s2|s1). Then, it generates the second letter according to P(o2|s2), and so on.

Figure 25. The Probabilistic Network for a Hidden Markov Model.

In some applications, the transition probability distribution is the same for all pairs of adjacent state variables: P(st|st–1). Similarly, the output (or emission) probability distribution can be the same for all states: P(o|s). In this variation, the HMM is equivalent to a stochastic finite-state automaton (FSA). At each time step, the FSA makes a probabilistic state transition and then generates an output letter.

Another common variation on HMMs is to include an absorbing state or halting state as one of the values of the state variable. If the HMM makes a transition into this state, it terminates the string being generated, permitting HMMs to model strings of variable length.

HMMs have been applied widely in speech recognition, where the alphabet of letters consists of frames of the speech signal (Rabiner 1989). Each word in the language can be modeled as an HMM. Given a new spoken word, a speech-recognition system computes the likelihood that each of the word HMMs generated this spoken word. The recognizer then predicts the most likely word. A similar analysis can be applied at the level of whole sentences by concatenating word-level HMMs. Then, the goal is to find the sentence most likely to have generated the speech signal.

To learn an HMM, a set of training examples is provided, where each example is a string. The sequence of states that generated the string is hidden. Once again, however, the EM algorithm can be applied. In the context of HMMs, it is known as the Baum-Welch, or forward-backward, algorithm. In the E-step, the algorithm augments each training example with statistics describing the hypothesized string of states that generated the example. In the M-step, the parameters of the probability distributions are reestimated from the augmented training data.

Stolcke and Omohundro (1994) developed algorithms for learning the structure and parameters of HMMs from training examples. They applied their techniques to several problems in speech recognition and incorporated their algorithm into a speaker-independent speech-recognition system.
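The likelihood that a given word HMM generated an observed string, which is the quantity such a recognizer compares across word models, can be computed with the standard forward recursion. The sketch below is my illustration for discrete outputs, not code from the article.

```python
def hmm_likelihood(obs, init, trans, emit):
    """Probability that an HMM generates the observation sequence obs.

    init[s]      = P(s_1 = s)
    trans[s][t]  = P(s_{k+1} = t | s_k = s)
    emit[s][o]   = P(o_k = o | s_k = s)
    Uses the forward algorithm: alpha_k(s) = P(o_1..o_k, s_k = s).
    """
    states = list(init)
    alpha = {s: init[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {t: sum(alpha[s] * trans[s][t] for s in states) * emit[t][o]
                 for t in states}
    return sum(alpha.values())

if __name__ == "__main__":
    init = {"s1": 0.6, "s2": 0.4}
    trans = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
    emit = {"s1": {"a": 0.9, "b": 0.1}, "s2": {"a": 0.2, "b": 0.8}}
    # A recognizer would evaluate each word model on the same string and
    # predict the word whose model assigns the highest likelihood.
    print(hmm_likelihood("aab", init, trans, emit))
```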
The Dynamic Probabilistic Network

In the HMM, the number of values for the state variable at each point in time can become large. Consequently, the number of parameters in the state-transition probability distribution P(st|st–1) can become intractably large. One solution is to represent the causal structure within each state by a stochastic model. For example, consider a mobile robot with a steerable television camera. The images observed by the camera will be the output alphabet of the HMM. The hidden state will consist of the location of the robot within a room and the direction the camera is pointing. Suppose there are 100 possible locations and 15 possible camera directions. In an HMM, the hidden state will be a single variable with 1500 possible values.

However, suppose that the robot has separate commands to change its location and steer its camera. At each time step, it chooses to perform exactly one of these two actions. Then, it makes sense to represent the hidden state by two separate state variables: (1) robot location and (2) camera direction. Figure 26 shows the resulting stochastic model, which is variously called a DPN (Kanazawa, Koller, and Russell 1995), a dynamic belief network (Dean and Kanazawa 1989), and a factorial HMM (Ghahramani and Jordan 1996).

Figure 26. A Simple Dynamic Probabilistic Network with Two Independent State Variables: Robot Location (Lt) and Camera Direction (St).

Unfortunately, inference and learning with DPNs is computationally challenging. The overall approach of applying the EM algorithm or Gibbs sampling is still sound. However, the E-step of computing the augmented training examples is itself difficult. Ghahramani and Jordan (1996) and Kanazawa, Koller, and Russell (1995) describe algorithms that can perform approximate E-steps. A recent review of this active research area can be found in Smyth, Heckerman, and Jordan (1997).

Application-Specific Stochastic Models

A major motivation for the stochastic modeling approach to machine learning is to communicate background knowledge to the learning algorithm. Although the general models discussed here achieve this goal to some extent, it is possible to go much further in this direction, and many applications of machine learning are pursuing this approach. To give a flavor of the kinds of model being developed, I describe one example from the many recently published papers.

Revow, Williams, and Hinton (1996) developed a stochastic model for handwritten digit recognition, shown in figure 27. To generate a digit according to this model, we first randomly choose one of the 10 digits, which determines the home locations of 8 points that control a uniform B spline (denoted h1, …, h8 in the figure). The B spline specifies the shape of the digit. The next step randomly perturbs these control points (using a Gaussian distribution) to produce the control points that will be used to generate the handwritten digit (denoted k1, …, k8 in the figure).

Figure 27. Probabilistic Network for Generating a Digit and Choosing N Pixels to Ink.

Next we generate a six-degree-of-freedom affine transformation that models translation, rotation, skewing, and so on. From the randomly chosen affine transformation and the control points, we deterministically lay out a sequence of beads along the spline curve. These beads act as circular Gaussian ink generators that ink the pixels in the image. A parameter σ specifies the standard deviation of the Gaussian ink generators. The beads are placed a distance 2σ apart along the spline; hence, the number of beads varies as a function of σ. At this point, we have made all the choices that are independent of the image pixels.

Now, the model generates N inked pixels (indicated in the diagram by the box with an N in the lower left corner). To generate each pixel, it first generates a Boolean variable that indicates whether the pixel will be a noise pixel or a pixel from the digit. If it is a noise pixel, the location is chosen uniformly at random and inked. If it is a digit pixel, one of the l beads is chosen at random (specified by the variable bead index), and a pixel is inked according to a symmetric Gaussian probability distribution.


The parameters of the model include the probability of noise pixels, the standard deviation σ of the Gaussian ink generators, the variance of the Gaussian used to perturb the control points, the parameters controlling the affine transformation, and the home locations of the eight control points.

We want to perform two tasks with this model: (1) classification and (2) learning. Let us consider classification first. Given an image and 10 learned spline models, our goal is to find the spline model that is most likely to have generated the image. We do this by computing the probability that the spline model for each digit generated the image. This computation is difficult because we are given only the locations of the inked pixels and the (hypothesized) class of the digit. In principle, we should consider all possible locations for the control points, all possible values for σ, and all possible choices for whether each inked pixel is a noise pixel or a digit pixel. Each such combination could have generated the observed image (with a certain probability). We should integrate over these parameters to compute the probability that a given digit model generated the given image.

In practice, Revow et al. take a maximum-likelihood approach. They compute the values for the control points (k1, …, k8), σ, and the noise-digit choices that maximize the likelihood of generating the observed image. The EM algorithm is applied to compute this. Figure 28 shows the fitting of the model for a 3 and for a 5 to a typical image. Notice that during the fitting, σ is initialized to a large value so that the ink generators are likely to capture inked pixels. When EM begins to converge, σ is decreased (according to a given schedule), and new beads are positioned along the spline using the current value of σ. Then, fitting is resumed. This task is performed approximately six times for each image.

Figure 28. Some Stages of Fitting Models to an Image of a 3 (from Revow et al. [1996]. Reproduced with permission.). The image is displayed in the top row. The next row shows the model for a 3 being fitted. The bottom row attempts to fit the model of a 5. The light circles indicate the value of σ. In the bottom two rows, the image has been artificially thinned so that the circles are visible.

Now let's consider learning. The goal of learning is to learn for each digit the home locations of the control points and the variance for perturbing these control points. In addition, the parameters controlling the affine transformation must be learned, but these parameters are assumed to be the same for all classes. To solve this learning problem, Revow et al. again apply EM. In the E-step, the training examples are augmented with the maximum-likelihood locations of the control points and the value of σ, which are computed using a nested EM, as described earlier. In the M-step, the average home location of each control point and the average σ are computed. The model for each digit is trained by fitting only to training examples of that digit. The running time is dominated by the cost of the inner (classification) EM algorithm.

Revow et al. report results that are competitive with the best-known methods on the task of recognizing digits from U.S. mail zip codes. To achieve this performance, they use some postprocessing steps that analyze how well each digit model fits the image. They also considered learning mixtures of digit models for cases where there are significantly different ways of writing a digit (for example, the digit 7 with and without a central horizontal bar).

The advantage of the stochastic modeling approach is that learning is fast both computationally, because EM is quick, and statistically, because there are only 22 parameters in each digit model (2 coordinates for each control point and 6 affine transformation parameters). Another advantage is that the digit images do not need to be preprocessed (for example, to remove slants and scale to a standard size). This preprocessing can be a significant source of errors as well as require extra implementation effort.
Articles

chastic models can fit a single digit in the con-


text of several other digits—so that precise seg-
mentation is not required prior to classifica-
tion. The primary disadvantage is that
classification is slower because of the need to
perform a search to fit each digit model.

Learning the Structure


of Stochastic Models
All the methods I have discussed to this point
learn the parameters of a stochastic model
whose structure is given. An important re-
search question is whether this structure can
be learned from data as well. Recent research
has made major progress in developing algo-
rithms for this problem.
One of the first algorithms was developed by
Chow and Liu (1968) for learning a network
structure in the form of a directed tree. This al-
gorithm first constructs a complete undirected
graph where the nodes are the variables, and
the edges are labeled with the mutual informa-
tion between the variables. The algorithm then
finds the maximum weighted spanning tree of
this graph, chooses a root node arbitrarily, and
orders the arcs to point away from the root. A
nice feature of this algorithm is that it is quite
fast—it runs in polynomial time.
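A compact rendering of this procedure is sketched below (my own illustration): pairwise mutual information is estimated from discrete data, a maximum-weight spanning tree is grown with Prim's method, and the arcs are then directed away from an arbitrarily chosen root.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(data, i, j):
    """Empirical mutual information between discrete variables i and j."""
    n = len(data)
    pi, pj, pij = Counter(), Counter(), Counter()
    for row in data:
        pi[row[i]] += 1
        pj[row[j]] += 1
        pij[(row[i], row[j])] += 1
    mi = 0.0
    for (a, b), c in pij.items():
        p_ab = c / n
        mi += p_ab * math.log(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

def chow_liu_tree(data, n_vars, root=0):
    """Return the directed edges (parent, child) of the Chow-Liu tree."""
    # Complete undirected graph with edges weighted by mutual information.
    weight = {(i, j): mutual_information(data, i, j)
              for i, j in combinations(range(n_vars), 2)}
    # Maximum-weight spanning tree (Prim's algorithm), with arcs oriented
    # to point away from the chosen root.
    in_tree, edges = {root}, []
    while len(in_tree) < n_vars:
        best = max(((i, j) for i, j in weight
                    if (i in in_tree) != (j in in_tree)),
                   key=lambda e: weight[e])
        parent, child = best if best[0] in in_tree else (best[1], best[0])
        edges.append((parent, child))
        in_tree.add(child)
    return edges

if __name__ == "__main__":
    # Variable 1 copies variable 0; variable 2 is only weakly related.
    data = [(0, 0, 1), (1, 1, 0), (0, 0, 0), (1, 1, 1), (0, 0, 1), (1, 1, 0)]
    print(chow_liu_tree(data, 3))
```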
Cooper and Herskovits (1992) developed an algorithm, called K2, for learning the structure of a stochastic model where all variables are observed in the training data. They adopt a maximum a posteriori (MAP) approach within a Bayesian framework. The key contribution of their paper was to derive a formula for the posterior probability of a network. This formula can be updated incrementally as changes are made to the network.

Using their formula, they implemented a simple greedy search algorithm for finding (approximately) the MAP network structure. Their algorithm requires the user to provide an ordering of the variables, and it will consider adding arcs only from variables earlier in the ordering to variables later in the ordering. This approach significantly constrains the search and also ensures that the learned network will remain acyclic. K2 begins with a network having no arcs—all variables are independent of one another. K2 evaluates the posterior probability of adding each possible single edge and makes the highest-ranking addition. This greedy algorithm is continued until no single change improves the posterior probability.

Heckerman, Geiger, and Chickering (1995) describe a modification to the K2 method for computing posterior probabilities and a local search algorithm that uses this improvement. Their method requires a prior probability distribution in the form of a prior network and two parameters: (1) an equivalent sample size (which controls how much new data are required to override the prior) and (2) a penalty for each arc that is different from the prior network. Their local search algorithm considers all one-step changes to the network (arc addition, deletion, and reversal) and retains the change that most increases the posterior probability of the network. They obtain results comparable to K2. Interestingly, they found that the most important role for the prior network was to provide a good starting point for local search (rather than to bias the objective function for guiding the search).
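The local search shared by K2 and its successors can be abstracted as hill climbing over candidate structures, as in the sketch below (my illustration). The scoring function, which for K2 is the Bayesian posterior formula and elsewhere might be a penalized likelihood, is assumed to be supplied by the caller, and the moves here are simplified to arc additions and deletions only.

```python
def hill_climb_structure(variables, score, max_steps=100):
    """Greedy hill climbing over network structures.

    score(edges) is assumed to return the quality of the structure given the
    data (e.g., a Bayesian posterior or penalized log likelihood).  Starting
    from the empty graph, the single arc addition or deletion that most
    improves the score is applied at each step, rejecting additions that
    would create a directed cycle; the search stops when no change helps.
    """
    def has_cycle(edges):
        graph = {v: [] for v in variables}
        for a, b in edges:
            graph[a].append(b)
        def reachable(src, dst, seen):
            if src == dst:
                return True
            seen.add(src)
            return any(reachable(n, dst, seen)
                       for n in graph[src] if n not in seen)
        return any(reachable(b, a, set()) for a, b in edges)

    edges = set()
    best_score = score(edges)
    for _ in range(max_steps):
        candidates = []
        for a in variables:
            for b in variables:
                if a == b:
                    continue
                if (a, b) in edges:
                    candidates.append(edges - {(a, b)})   # arc deletion
                else:
                    candidates.append(edges | {(a, b)})   # arc addition
        best_move, best_gain = None, 0.0
        for cand in candidates:
            if has_cycle(cand):
                continue
            gain = score(cand) - best_score
            if gain > best_gain:
                best_move, best_gain = cand, gain
        if best_move is None:
            break
        edges, best_score = best_move, best_score + best_gain
    return edges
```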


Friedman and Goldszmidt (1996) developed a network learning algorithm, called tree-augmented naive Bayes (TAN), specifically for supervised learning. The TAN algorithm starts with a naive Bayes network of the kind shown in figure 20 and considers adding arcs to improve the posterior probability of the network. They apply a modification of the Chow and Liu algorithm to learn a tree structure of arcs connecting the xj variables to one another. Figure 29 shows a network learned by TAN for a diabetes diagnosis problem. Compared to figure 19, the directions of the arcs are wrong. Nonetheless, the network gives accurate classifications. Figure 30 compares the performance of TAN to C4.5 on 22 benchmark problems. The plot shows that TAN outperforms C4.5 on most of the domains.

Figure 29. Probabilistic Network Constructed by the TAN Algorithm for a Diabetes Diagnosis Task.

Figure 30. Comparison of TAN and C4.5 on 22 Benchmark Tasks. Points above the diagonal line correspond to cases where TAN gave more accurate results than C4.5.

There are many other important papers on the topic of structure learning for probabilistic networks. Four important references are (1) Verma and Pearl (1990), (2) Spirtes and Meek (1995), (3) Spiegelhalter et al. (1993), and (4) Spirtes, Glymour, and Scheines (1993) and the references therein.

Summary: Stochastic Models

This completes my review of methods for learning with stochastic models. There are several good survey articles of this topic (Buntine 1996, 1994; Heckerman 1996).

The area of stochastic modeling is active right now. Journals and conferences that were once devoted exclusively to neural network applications are now presenting many papers on stochastic modeling. The research community is still gathering experience and developing improved algorithms for fitting and reasoning with stochastic models. Many of the stochastic models we would like to work with are intractable. The challenge is to find general-purpose, tractable approximation algorithms for reasoning with these elegant and expressive stochastic models.

Concluding Remarks

Any survey must choose particular areas and omit others. Let me briefly mention some other active areas. A central topic in machine learning is the control of overfitting. There have been many developments in this area as researchers explored various penalty functions and resampling techniques (including cross-validation) for preventing overfitting. An understanding of the overfitting process has been obtained through the statistical concepts of bias and variance, and several authors have developed bias-variance decompositions for classification problems.

Another active topic has been the study of algorithms for learning relations expressed as Horn-clause programs. This area is also known as inductive logic programming, and many algorithms and theoretical results have been developed in this area.

Finally, many papers have addressed practical problems that arise in applications such as visualization of learned knowledge, methods for extracting understandable rules from neural networks, algorithms for identifying noise and outliers in data, and algorithms for learning easy-to-understand classifiers.

There have been many exciting developments in the past five years, and the relevant literature in machine learning has been growing rapidly. As more areas within AI and computer science apply machine-learning methods to attack their problems, I expect that the flow of interesting problems and practical solutions will continue. It is an exciting time to be working in machine learning.
References

Abu-Mostafa, Y. 1990. Learning from Hints in Neural Networks. Journal of Complexity 6:192–198.

Ali, K. M., and Pazzani, M. J. 1996. Error Reduction through Learning Multiple Descriptions. Machine Learning 24(3): 173–202.

Barto, A. G.; Bradtke, S. J.; and Singh, S. P. 1995. Learning to Act Using Real-Time Dynamic Programming. Artificial Intelligence 72:81–138.

Barto, A. G., and Sutton, R. 1997. Introduction to Reinforcement Learning. Cambridge, Mass.: MIT Press.

Bellman, R. E. 1957. Dynamic Programming. Princeton, N.J.: Princeton University Press.

Bertsekas, D. P., and Tsitsiklis, J. N. 1996. Neuro-Dynamic Programming. Belmont, Mass.: Athena Scientific.

Blum, A., and Rivest, R. L. 1988. Training a 3-Node Neural Network Is NP-Complete (Extended Abstract). In Proceedings of the 1988 Workshop on Computational Learning Theory, 9–18. San Francisco, Calif.: Morgan Kaufmann.

Blum, A. 1997. Empirical Support for WINNOW and Weighted-Majority Algorithms: Results on a Calendar Scheduling Domain. Machine Learning 26(1): 5–24.

Breiman, L. 1996a. Bagging Predictors. Machine Learning 24(2): 123–140.

Breiman, L. 1996b. Stacked Regressions. Machine Learning 24:49–64.

Buntine, W. 1996. A Guide to the Literature on Learning Probabilistic Networks from Data. IEEE Transactions on Knowledge and Data Engineering 8:195–210.

Buntine, W. 1994. Operations for Learning with Graphical Models. Journal of Artificial Intelligence Research 2:159–225.

Buntine, W. L. 1990. A Theory of Learning Classification Rules. Ph.D. thesis, School of Computing Science, University of Technology.

Caruana, R. 1996. Algorithms and Applications for Multitask Learning. In Proceedings of the Thirteenth International Conference on Machine Learning, ed. L. Saitta, 87–95. San Francisco, Calif.: Morgan Kaufmann.

Cassandra, A. R.; Kaelbling, L. P.; and Littman, M. L. 1994. Acting Optimally in Partially Observable Stochastic Domains. In Proceedings of the Twelfth National Conference on Artificial Intelligence, 1023–1028. Menlo Park, Calif.: American Association for Artificial Intelligence.

Castillo, E.; Gutierrez, J. M.; and Hadi, A. 1997. Expert Systems and Probabilistic Network Models. New York: Springer-Verlag.

Catlett, J. 1991. On Changing Continuous Attributes into Ordered Discrete Attributes. In Proceedings of the European Working Session on Learning, ed. Y. Kodratoff, 164–178. Berlin: Springer-Verlag.

Chan, P. K., and Stolfo, S. J. 1995. Learning Arbiter and Combiner Trees from Partitioned Data for Scaling Machine Learning. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, 39–44. Menlo Park, Calif.: AAAI Press.

Cheeseman, P.; Self, M.; Kelly, J.; Taylor, W.; Freeman, D.; and Stutz, J. 1988. Bayesian Classification. In Proceedings of the Seventh National Conference on Artificial Intelligence, 607–611. Menlo Park, Calif.: American Association for Artificial Intelligence.

Cherkauer, K. J. 1996. Human Expert-Level Performance on a Scientific Image Analysis Task by a System Using Combined Artificial Neural Networks. In Working Notes of the AAAI Workshop on Integrating Multiple Learned Models, ed. P. Chan, 15–21. Available at www.cs.fit.edu/imlm/.

Chipman, H.; George, E.; and McCulloch, R. 1996. Bayesian CART, Department of Statistics, University of Chicago. Available at gsbrem.uchicago.edu/Papers/cart.ps.

Chow, C., and Liu, C. 1968. Approximating Discrete Probability Distributions with Dependence Trees. IEEE Transactions on Information Theory 14:462–467.

Clemen, R. T. 1989. Combining Forecasts: A Review and Annotated Bibliography. International Journal of Forecasting 5:559–583.

Cohen, W. W. 1995. Fast Effective Rule Induction. In Proceedings of the Twelfth International Conference on Machine Learning, 115–123. San Francisco, Calif.: Morgan Kaufmann.

Cooper, G. F., and Herskovits, E. 1992. A Bayesian Method for the Induction of Probabilistic Networks from Data. Machine Learning 9:309–347.

Craven, M. W., and Shavlik, J. W. 1996. Extracting Tree-Structured Representations from Trained Networks. In Advances in Neural Information Processing Systems 8, eds. D. S. Touretzky, M. C. Mozer, and M. Hasselmo, 24–30. Cambridge, Mass.: MIT Press.

Crites, R. H., and Barto, A. G. 1995. Improving Elevator Performance Using Reinforcement Learning. In Advances in Neural Information Processing Systems 8, 1017–1023. San Francisco, Calif.: Morgan Kaufmann.

D'Ambrosio, B. 1993. Incremental Probabilistic Inference. In Ninth Annual Conference on Uncertainty in AI, eds. D. Heckerman and A. Mamdani, 301–308. San Francisco, Calif.: Morgan Kaufmann.

Dayan, P., and Hinton, G. 1993. Feudal Reinforcement Learning. In Advances in Neural Information Processing Systems 5, 271–278. San Francisco, Calif.: Morgan Kaufmann.

Dean, T., and Kanazawa, K. 1989. A Model for Reasoning about Persistence and Causation. Computational Intelligence 5(3): 142–150.

Dempster, A. P.; Laird, N. M.; and Rubin, D. B. 1977. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society B 39:1–38.

Dietterich, T. G., and Bakiri, G. 1995. Solving Multiclass Learning Problems via Error-Correcting Output Codes. Journal of Artificial Intelligence Research 2:263–286.

Dietterich, T. G., and Kong, E. B. 1995. Machine Learning Bias, Statistical Bias, and Statistical Variance of Decision Tree Algorithms, Department of Computer Science, Oregon State University. Available at ftp.cs.orst.edu/pub/tgd/papers/tr-bias.ps.gz.


Domingos, P., and Pazzani, M. 1996. Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier. In Proceedings of the Thirteenth International Conference on Machine Learning, ed. L. Saitta, 105–112. San Francisco, Calif.: Morgan Kaufmann.

Duda, R. O., and Hart, P. E. 1973. Pattern Classification and Scene Analysis. New York: Wiley.

Fayyad, U. M., and Irani, K. B. 1993. Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, 1022–1027. Menlo Park, Calif.: International Joint Conferences on Artificial Intelligence.

Flexner, S. B. 1983. Random House Unabridged Dictionary, 2d ed. New York: Random House.

Freund, Y., and Schapire, R. E. 1995. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. In Proceedings of the Second European Conference on Computational Learning Theory, 23–37. Berlin: Springer-Verlag.

Freund, Y., and Schapire, R. E. 1996. Experiments with a New Boosting Algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, ed. L. Saitta, 148–156. San Francisco, Calif.: Morgan Kaufmann.

Friedman, N., and Goldszmidt, M. 1996. Building Classifiers Using Bayesian Networks. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, 1277–1284. Menlo Park, Calif.: American Association for Artificial Intelligence.

Furnkranz, J., and Widmer, G. 1994. Incremental Reduced Error Pruning. In Proceedings of the Eleventh International Conference on Machine Learning, 70–77. San Francisco, Calif.: Morgan Kaufmann.

Geman, S., and Geman, D. 1984. Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6:721–741.

Ghahramani, Z., and Jordan, M. I. 1996. Factorial Hidden Markov Models. In Advances in Neural Information Processing Systems 8, eds. D. S. Touretzky, M. C. Mozer, and M. Hasselmo, 472–478. Cambridge, Mass.: MIT Press.

Gilks, W.; Thomas, A.; and Spiegelhalter, D. 1993. A Language and Program for Complex Bayesian Modelling. The Statistician 43:169–178.

Golding, A. R., and Roth, D. 1996. Applying WINNOW to Context-Sensitive Spelling Correction. In Proceedings of the Thirteenth International Conference on Machine Learning, ed. L. Saitta, 182–190. San Francisco, Calif.: Morgan Kaufmann.

Gordon, G. J. 1995. Stable Function Approximation in Dynamic Programming. In Proceedings of the Twelfth International Conference on Machine Learning, 261–268. San Francisco, Calif.: Morgan Kaufmann.

Hansen, L., and Salamon, P. 1990. Neural Network Ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence 12:993–1001.

Hashem, S. 1993. Optimal Linear Combinations of Neural Networks. Ph.D. thesis, School of Industrial Engineering, Purdue University.

Heckerman, D. 1996. A Tutorial on Learning with Bayesian Networks, MSR-TR-95-06, Advanced Technology Division, Microsoft Research, Redmond, Washington.

Heckerman, D.; Geiger, D.; and Chickering, D. M. 1995. Learning Bayesian Networks: The Combination of Knowledge and Statistical Data. Machine Learning 20:197–243.

Hyafil, L., and Rivest, R. L. 1976. Constructing Optimal Binary Decision Trees Is NP-Complete. Information Processing Letters 5(1): 15–17.

Jensen, F. V.; Lauritzen, S. L.; and Olesen, K. G. 1990. Bayesian Updating in Recursive Graphical Models by Local Computations. Computational Statistical Quarterly 4:269–282.

Jensen, F. 1996. An Introduction to Bayesian Networks. New York: Springer.

John, G.; Kohavi, R.; and Pfleger, K. 1994. Irrelevant Features and the Subset Selection Problem. In Proceedings of the Eleventh International Conference on Machine Learning, 121–129. San Francisco, Calif.: Morgan Kaufmann.

Jordan, M. I., and Jacobs, R. A. 1994. Hierarchical Mixtures of Experts and the EM Algorithm. Neural Computation 6(2): 181–214.

Kaelbling, L. P. 1993. Hierarchical Reinforcement Learning: Preliminary Results. In Proceedings of the Tenth International Conference on Machine Learning, 167–173. San Francisco, Calif.: Morgan Kaufmann.

Kaelbling, L. P.; Littman, M. L.; and Moore, A. W. 1996. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research 4:237–285.

Kanazawa, K.; Koller, D.; and Russell, S. 1995. Stochastic Simulation Algorithms for Dynamic Probabilistic Networks. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, 346–351. San Francisco, Calif.: Morgan Kaufmann.

Kira, K., and Rendell, L. A. 1992. A Practical Approach to Feature Selection. In Proceedings of the Ninth International Conference on Machine Learning, 249–256. San Francisco, Calif.: Morgan Kaufmann.

Kohavi, R., and Kunz, C. 1997. Option Decision Trees with Majority Votes. In Proceedings of the Fourteenth International Conference on Machine Learning, 161–169. San Francisco, Calif.: Morgan Kaufmann.

Kohavi, R., and Sahami, M. 1996. Error-Based and Entropy-Based Discretizing of Continuous Features. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 114–119. San Francisco, Calif.: Morgan Kaufmann.

Kolen, J. F., and Pollack, J. B. 1991. Back Propagation Is Sensitive to Initial Conditions. In Advances in Neural Information Processing Systems 3, 860–867. San Francisco, Calif.: Morgan Kaufmann.

Kong, E. B., and Dietterich, T. G. 1995. Error-Correcting Output Coding Corrects Bias and Variance. In The Twelfth International Conference on Machine Learning, eds. A. Prieditis and S. Russell, 313–321. San Francisco, Calif.: Morgan Kaufmann.

Kononenko, I. 1994. Estimating Attributes: Analysis and Extensions of Relief. In Proceedings of the 1994 European Conference on Machine Learning, 171–182. Amsterdam: Springer Verlag.
Articles

Kononenko, I.; Simec, E.; and Robnik-Sikonja, M. Theoretic Subsampling for Induction on Large Data-
1997. Overcoming the Myopic of Inductive Learning bases. In Proceedings of the Tenth International Confer-
Algorithms with RELIEF-F. Applied Intelligence. Forth- ence on Machine Learning, ed. P. E. Utgoff, 212–219.
coming. San Francisco, Calif.: Morgan Kaufmann.
Kucera, H., and Francis, W. N. 1967. Computational Neal, R. 1993. Probabilistic Inference Using Markov
Analysis of Present-Day American English. Providence, Chain Monte Carlo Methods, CRG-TR-93-1, Depart-
R.I.: Brown University Press. ment of Computer Science, University of Toronto.
Kwok, S. W., and Carter, C. 1990. Multiple Decision Ok, D., and Tadepalli, P. 1996. Auto-Exploratory Av-
Trees. In Uncertainty in Artificial Intelligence 4, eds. R. erage Reward Reinforcement Learning. In Proceed-
D. Schachter, T. S. Levitt, L. N. Kannal, and J. F. Lem- ings of the Thirteenth National Conference on Arti-
mer, 327–335. Amsterdam: Elsevier Science. ficial Intelligence, 881–887. Menlo Park, Calif.:
Laird, J. E.; Newell, A.; and Rosenbloom, P. S. 1987. American Association for Artificial Intelligence.
SOAR: An Architecture for General Intelligence. Arti- Opitz, D. W., and Shavlik, J. W. 1996. Generating Ac-
ficial Intelligence 33(1): 1–64. curate and Diverse Members of a Neural-Network
Littlestone, N. 1988. Learning Quickly When Irrele- Ensemble. In Advances in Neural Information Process-
vant Attributes Abound: A New Linear-Threshold Al- ing Systems 8, eds. D. S. Touretzky, M. C. Mozer, and
gorithm. Machine Learning 2:285–318. M. Hesselmo, 535–541. Cambridge, Mass.: MIT Press.
Littman, M. L.; Cassandra, A.; and Kaelbling, L. P. Parmanto, B.; Munro, P. W.; and Doyle, H. R. 1996.
1995. Learning Policies for Partially Observable En- Improving Committee Diagnosis with Resampling
vironments: Scaling Up. In Proceedings of the Twelfth Techniques. In Advances in Neural Information Pro-
International Conference on Machine Learning, cessing Systems 8, eds. D. S. Touretzky, M. C. Mozer,
362–370. San Francisco, Calif.: Morgan Kaufmann. and M. Hesselmo, 882–888. Cambridge, Mass.: MIT
Press.
Lowe, D. G. 1995. Similarity Metric Learning for a
Variable-Kernel Classifier. Neural Computation 7(1): Parmanto, B.; Munro, P. W.; Doyle, H. R.; Doria, C.;
72–85. Aldrighetti, L.; Marino, I. R.; Mitchel, S.; and Fung, J.
J. 1994. Neural Network Classifier for Hepatoma De-
MacKay, D. 1992. A Practical Bayesian Framework
tection. In Proceedings of the World Congress on Neural
for Backpropagation Networks. Neural Computation
Networks 1, 285–290. Hillsdale, N.J.: Lawrence Erl-
4(3): 448–472.
baum.
Mahadevan, S. 1996. Average Reward Reinforcement
Parr, R., and Russell, S. 1995. Approximating Opti-
Learning: Foundations, Algorithms, and Empirical
mal Policies for Partially Observable Stochastic Do-
Results. Machine Learning 22:159–195.
mains. In Proceedings of the Fourteenth Internation-
Mahadevan, S., and Kaelbling, L. P. 1996. The Na- al Joint Conference on Artificial Intelligence,
tional Science Foundation Workshop on Reinforce- 1088–1094. Menlo Park, Calif.: International Joint
ment Learning. AI Magazine 17(4): 89–97. Conferences on Artificial Intelligence.
McCallum, R. A. 1995. Instance-Based Utile Distinc- Perrone, M. P., and Cooper, L. N. 1993. When Net-
tions for Reinforcement Learning with Hidden State. works Disagree: Ensemble Methods for Hybrid Neur-
In Proceedings of the Twelfth International Conference al Networks. In Neural Networks for Speech and Image
on Machine Learning, 387–396. San Francisco, Calif.: Processing, ed. R. J. Mammone. New York: Chapman
Morgan Kaufmann. and Hall.
Mehta, M.; Agrawal, R.; and Rissanen, J. 1996. SLIQ: A Press, W. H.; Flannery, B. P.; Teukolsky, S. A.; and Ver-
Fast Scalable Classifier for Data Mining. In Lecture rerling, W. T. 1992. Numerical Recipes in C: The Art of
Notes in Computer Science, 18–32. New York: Springer- Scientific Computing, 2d ed. Cambridge, U.K.: Cam-
Verlag. bridge University Press.
Merz, C. J., and Murphy, P. M. 1996. UCI Repository Quinlan, J. R. 1996. Bagging, Boosting, and C4.5. In
of Machine Learning Databases. Available at Proceedings of the Thirteenth National Conference
www.ics.uci.edu/mlearn/MLRepository.html. on Artificial Intelligence, 725–730. Menlo Park,
Miller, A. J. 1990. Subset Selection in Regression. New Calif.: American Association for Artificial Intelli-
York: Chapman and Hall. gence.
Minton, S.; Carbonell, J. G.; Knoblock, C. A.; Kuok- Quinlan, J. R. 1993. C4.5: Programs for Empirical
ka, D. R.; Etzioni, O.; and Gil, Y. 1989. Explanation- Learning. San Francisco, Calif.: Morgan Kaufmann.
Based Learning: A Problem-Solving Perspective. Arti- Quinlan, J. R. 1990. Learning Logical Definitions
ficial Intelligence 40:63–118. from Relations. Machine Learning 5(3): 239–266.
Moore, A. W., and Lee, M. S. 1994. Efficient Algo- Quinlan, J. R., and Rivest, R. L. 1989. Inferring Deci-
rithms for Minimizing Cross-Validation Error. In Pro- sion Trees Using the Minimum Description Length
ceedings of the Eleventh International Conference on Ma- Principle. Information and Computation 80:227–248.
chine Learning, eds. W. Cohen and H. Hirsh, 190–198. Rabiner, L. R. 1989. A Tutorial on Hidden Markov
San Francisco, Calif.: Morgan Kaufmann. Models and Selected Applications in Speech Recog-
Munro, P., and Parmanto, B. 1997. Competition nition. In Proceedings of the IEEE 77(2): 257–286.
among Networks Improves Committee Performance. Washington, D.C.: IEEE Computer Society.
In Advances in Neural Information Processing Systems 9, Raviv, Y., and Intrator, N. 1996. Bootstrapping with
592–598. Cambridge, Mass.: MIT Press. Noise: An Effective Regularization Technique. Con-
Musick, R.; Catlett, J.; and Russell, S. 1992. Decision- nection Science 8(3–4): 355–372.

WINTER 1997 135


Articles

Revow, M.; Williams, C. K. I.; and Hinton, G. E. Sutton, R. S. 1988. Learning to Predict by the Meth-
1996. Using Generative Models for Handwritten Dig- ods of Temporal Differences. Machine Learning 3(1):
it Recognition. IEEE Transactions on Pattern Analysis 9–44.
and Machine Intelligence 18(6): 592–606. Tesauro, G. 1995. Temporal Difference Learning and
Ricci, F., and Aha, D. W. 1997. Extending Local TD - GAMMON . Communications of the ACM 28(3):
Learners with Error-Correcting Output Codes, Naval 58–68.
Center for Applied Research in Artificial Intelligence, Tesauro, G. 1992. Practical Issues in Temporal Differ-
Washington, D.C. ence Learning. Machine Learning 8:257–278.
Rosen, B. E. 1996. Ensemble Learning Using Decor- Tsitsiklis, J. N., and Van Roy, B. 1996. An Analysis of
related Neural Networks. Connection Science 8(3–4): Temporal-Difference Learning with Function Ap-
373–384. proximation, LIDS-P-2322, Laboratory for Informa-
Russell, S.; Binder, J.; Koller, D.; and Kanazawa, K. tion and Decision Systems, Massachusetts Institute
1995. Local Learning in Probabilistic Networks with of Technology.
Hidden Variables. In Proceedings of the Fourteenth Tumer, K., and Ghosh, J. 1996. Error Correlation and
International Joint Conference on Artificial Intelli- Error Reduction in Ensemble Classifiers. Connection
gence, 1146–1152. Menlo Park, Calif.: International Science 8(3–4): 385–404.
Joint Conferences on Artificial Intelligence. Verma, T., and Pearl, J. 1990. Equivalence and Syn-
Samuel, A. L. 1959. Some Studies in Machine Learn- thesis of Causal Models. In Proceedings of the Sixth
ing Using the Game of Checkers. IBM Journal of Re- Conference on Uncertainty in Artificial Intelligence,
search and Development 3:211–229. 220–227. San Francisco, Calif.: Morgan Kaufmann.
Schapire, R. E. 1997. Using Output Codes to Boost Watkins, C. J. C. H. 1989. Learning from Delayed Re-
Multiclass Learning Problems. In Proceedings of the wards. Ph.D. thesis, King’s College, Oxford.
Fourteenth International Conference on Machine Learn- Watkins, C. J., and Dayan, P. 1992. Technical Note:
ing, 313–321. San Francisco, Calif.: Morgan Kauf- Q-LEARNING. Machine Learning 8:279–292.
mann. Wettschereck, D.; Aha, D. W.; and Mohri, T. 1997. A
Schwartz, A. 1993. A Reinforcement Learning Review and Empirical Evaluation of Feature Weight-
Method for Maximizing Undiscounted Rewards. In ing Methods for a Class of Lazy Learning Algorithms.
Proceedings of the Tenth International Conference on Artificial Intelligence Review 10:1–37.
Machine Learning, ed. P. Utgoff, 298–305. San Francis- Wettschereck, D., and Dietterich, T. G. 1995. An Ex-
co, Calif.: Morgan Kaufmann. perimental Comparison of the Nearest-Neighbor
Shafer, J.; Agrawal, R.; and Mehta, M. 1996. SPRINT: A and Nearest-Hyperrectangle Algorithms. Machine
Scalable Parallel Classifier for Data Mining. In Pro- Learning 19:5–27.
ceedings of the Twenty-Second VLDB Conference, Wolpert, D. 1992. Stacked Generalization. Neural
544–555. San Francisco, Calif.: Morgan Kaufmann. Networks 5(2): 241–260.
Singh, S. P. 1992. Transfer of Learning by Composing Zhang, W., and Dietterich, T. G. 1995. A Reinforce-
Solutions to Elemental Sequential Tasks. Machine ment Learning Approach to Job-Shop Scheduling. In
Learning 8(3): 323–340. Proceedings of the Twelfth International Joint Con-
Singh, S., and Bertsekas, D. 1997. Reinforcement ference on Artificial Intelligence, 1114–1120. Menlo
Learning for Dynamic Channel Allocation in Cellu- Park, Calif.: International Joint Conferences on Arti-
lar Telephone Systems. In Advances in Neural Infor- ficial Intelligence.
mation Processing Systems 10, 974–980. Cambridge, Zhang, X.; Mesirov, J. P.; and Waltz, D. L. 1992. Hy-
Mass.: MIT Press. brid System for Protein Secondary Structure Predic-
Smyth, P.; Heckerman, D.; and Jordan, M. I. 1997. tion. Journal of Molecular Biology 225:1049–1063.
Probabilistic Independence Networks for Hidden Zweben, M; Daun, B.; and Deale, M. 1994. Schedul-
Markov Probability Models. Neural Computation 9(2): ing and Rescheduling with Iterative Repair. In Intel-
227–270. ligent Scheduling 8, eds. M. Zweben and M. Fox,
241–255. San Francisco, Calif.: Morgan Kaufmann.
Thomas G. Dietterich (Ph.D., Stanford University) has been a faculty member in computer science at Oregon State University since 1985. He also held a position as senior scientist at Arris Pharmaceutical Corporation from 1991 to 1993, where he applied machine-learning methods to drug design problems. He is editor (with Jude Shavlik) of Readings in Machine Learning and executive editor of the journal Machine Learning. His e-mail address is tgd@cs.orst.edu.