Articles

Machine-Learning Research: Four Current Directions

Thomas G. Dietterich
The last five years have seen an explosion in machine-learning research. This explosion has many causes: First, separate research communities in symbolic machine learning, computational learning theory, neural networks, statistics, and pattern recognition have discovered one another and begun to work together. Second, machine-learning techniques are being applied to new kinds of problem, including knowledge discovery in databases, language processing, robot control, and combinatorial optimization, as well as to more traditional problems such as speech recognition, face recognition, handwriting recognition, medical data analysis, and game playing.

In this article, I selected four topics within machine learning where there has been a lot of recent activity. The purpose of the article is to describe the results in these areas to a broader AI audience and to sketch some of the open research problems. The topic areas are (1) ensembles of classifiers, (2) methods for scaling up supervised learning algorithms, (3) reinforcement learning, and (4) the learning of complex stochastic models.

The reader should be cautioned that this article is not a comprehensive review of each of these topics. Rather, my goal is to provide a representative sample of the research in each of these four areas. In each of the areas, there are many other papers that describe relevant work. I apologize to those authors whose work I was unable to include in the article.

[…]ued, such as height, weight, color, and age. These are also called the features of xi. I use the notation xij to refer to the jth feature of xi. In some situations, I drop the i subscript when it is implied by the context.

The y values are typically drawn from a discrete set of classes {1, ..., K} in the case of classification or from the real line in the case of regression. In this article, I focus primarily on classification. The training examples might be corrupted by some random noise.

Given a set S of training examples, a learning algorithm outputs a classifier. The classifier is a hypothesis about the true function f. Given new x values, it predicts the corresponding y values. I denote classifiers by h1, ..., hL.

An ensemble of classifiers is a set of classifiers whose individual decisions are combined in some way (typically by weighted or unweighted voting) to classify new examples. One of the most active areas of research in supervised learning has been the study of methods for constructing good ensembles of classifiers. The main discovery is that ensembles are often much more accurate than the individual classifiers that make them up.

An ensemble can be more accurate than its component classifiers only if the individual classifiers disagree with one another (Hansen and Salamon 1990). To see why, imagine that we have an ensemble of three classifiers {h1, h2, h3}, and consider a new case x. If the three clas-[…]
Copyright © 1997, American Association for Artificial Intelligence. All rights reserved. 0738-4602-1997 / $2.00 WINTER 1997 97
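The disagreement argument can be made concrete with a small numeric sketch. If the three classifiers' errors were statistically independent, the majority vote would err only when at least two of the three err. The error rate p and the independence assumption here are illustrative idealizations, not results from the article:

```python
# Error rate of a majority vote over three classifiers whose errors are
# independent and each occur with probability p: the ensemble is wrong
# exactly when at least two of the three members are wrong.
def majority_vote_error(p):
    return 3 * p**2 * (1 - p) + p**3

for p in (0.1, 0.2, 0.3, 0.4):
    print(f"individual error {p:.1f} -> ensemble error {majority_vote_error(p):.3f}")
```

For any p below 0.5, the ensemble error is below p, and the benefit vanishes as p approaches 0.5, which is the idealized form of the Hansen and Salamon observation.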
Table 1. Test-set error rates for a single run of C4.5 and for ensembles of 200 classifiers constructed by bagging C4.5 and by injecting randomness into C4.5.

Task                 Test-Set Size   C4.5     200-Fold Bootstrap C4.5   200-Fold Random C4.5
Vowel                      462       0.5758          0.5152                  0.4870¹
Soybean                    376       0.1090          0.0984                  0.1090
Part of Speech            3060       0.0827          0.0765                  0.0788²
NETTALK                   7242       0.3000          0.2670³                 0.2500³
Letter Recognition        4000       0.2010          0.0038³                 0.0000³
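The "random" column of the table corresponds to a C4.5 variant that chooses each split at random among the best-scoring candidates rather than always taking the top-ranked test. A minimal sketch of that selection step, assuming a precomputed list of (score, test) candidates; C4.5's actual gain-ratio computation is omitted:

```python
import random

# Randomized test selection in the spirit of the Dietterich-Kong C4.5
# variant: rank candidate tests by score, then choose uniformly among
# the k best instead of deterministically taking the top-ranked one.
def pick_random_split(scored_tests, k=20, rng=random):
    ranked = sorted(scored_tests, key=lambda st: st[0], reverse=True)
    score, test = rng.choice(ranked[:k])
    return test
```

Running the tree-growing procedure many times with this stochastic choice yields a diverse ensemble from a single training set.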
[…]er, hl, they apply feature-selection techniques to choose the best features for learning the classifier. They obtained improvements in 7 of 10 tasks with this approach.

Injecting Randomness
The last general-purpose method for generating ensembles of classifiers is to inject randomness into the learning algorithm. In the back-propagation algorithm for training neural networks, the initial weights of the network are set randomly. If the algorithm is applied to the same training examples but with different initial weights, the resulting classifier can be different (Kolen and Pollack 1991).

Although this method is perhaps the most common way of generating ensembles of neural networks, manipulating the training set might be more effective. A study by Parmanto, Munro, and Doyle (1996) compared this technique to bagging and tenfold cross-validated committees. They found that cross-validated committees worked best, bagging second best, and multiple random initial weights third best on one synthetic data set and two medical diagnosis data sets.

For the C4.5 decision tree algorithm, it is also easy to inject randomness (Kwok and Carter 1990). The key decision of C4.5 is to choose a feature to test at each internal node in the decision tree. At each internal node, C4.5 applies a criterion known as the information-gain ratio to rank the various possible feature tests. It then chooses the top-ranked feature-value test. For discrete-valued features with V values, the decision tree splits the data into V subsets, depending on the value of the chosen feature. For real-valued features, the decision tree splits the data into two subsets, depending on whether the value of the chosen feature is above or below a chosen threshold. Dietterich and Kong (1995) implemented a variant of C4.5 that chooses randomly (with equal probability) among the top 20 best tests. Table 1 compares a single run of C4.5 to ensembles of 200 classifiers constructed by bagging C4.5 and injecting randomness into C4.5. The results show that injecting randomness obtains the best performance in three of the domains. In particular, notice that injected randomness obtains perfect test-set performance in the letter-recognition task.

Ali and Pazzani (1996) injected randomness into the FOIL algorithm for learning Prolog-style rules. FOIL works somewhat like C4.5 in that it ranks possible conditions to add to a rule using an information-gain criterion. Ali and Pazzani computed all candidate conditions that scored within 80 percent of the top-ranked candidate and then applied a weighted random-choice algorithm to choose among them. They compared ensembles of 11 classifiers to a single run of FOIL and found statistically significant improvements in 15 of 29 tasks and statistically significant loss of performance in only one task. They obtained similar results using 11-fold cross-validation to construct the training sets.

Raviv and Intrator (1996) combine bootstrap sampling of the training data with injecting noise into the input features for the learning algorithm. To train each member of an ensemble of neural networks, they draw training examples with replacement from the original training data. The x values of each training example are perturbed by adding Gaussian noise to the input features. They report large improvements in a synthetic benchmark task and a medical diagnosis task.

A method closely related to these techniques for injecting randomness is the Markov chain Monte Carlo (MCMC) method, which has been applied to neural networks by MacKay (1992) and Neal (1993) and to decision trees by Chipman, George, and McCulloch (1996). The basic idea of the MCMC method (and related methods) is to construct a Markov process that generates an infinite sequence of hypotheses hl.
In a Bayesian setting, the goal is to generate each hypothesis hl with probability P(hl|S), where S is the training sample. P(hl|S) is computed in the usual way as the (normalized) product of the likelihood P(S|hl) and the prior probability P(hl) of hl. To apply MCMC, we define a set of operators that convert one hl into another. For a neural network, such an operator might adjust one of the weights in the network. In a decision tree, the operator might interchange a parent and a child node in the tree or replace one node with another. The MCMC process works by maintaining a current hypothesis hl. At each step, it selects an operator, applies it (to obtain hl+1), and then computes the likelihood of the resulting classifier on the training data. It then decides whether to keep hl+1 or discard it and go back to hl. Under various technical conditions, it is possible to prove that a process of this kind eventually converges to a stationary probability distribution in which the hl's are sampled in proportion to their posterior probabilities. In practice, it can be difficult to tell when this stationary distribution is reached. A standard approach is to run the Markov process for a long period (discarding all generated classifiers) and then collect a set of L classifiers from the Markov process. These classifiers are then combined by weighted vote according to their posterior probabilities.

Algorithm-Specific Methods for Generating Ensembles
In addition to these general-purpose methods for generating diverse ensembles of classifiers, there are several techniques that can be applied to the back-propagation algorithm for training neural networks. Rosen (1996) trains several neural networks simultaneously and forces the networks to be diverse by adding a correlation penalty to the error function that back propagation minimizes. Specifically, during training, Rosen keeps track of the correlations in the predictions of each network. He applies back propagation to minimize an error function that is a sum of the usual squared output prediction error and a term that measures the correlations with the other networks. He reports substantial improvements in three simple synthetic tasks.

Opitz and Shavlik (1996) take a similar approach, but they use a kind of genetic algorithm to search for a good population of neural network classifiers. In each iteration, they apply genetic operators to the current ensemble to generate new network topologies. These networks are then trained using an error function that combines the usual squared output prediction error with a multiplicative term that incorporates the diversity of the classifiers. After training the networks, they prune the population to retain the N best networks using a criterion that considers both accuracy and diversity. In a comparison with bagging, they found that their method gave excellent results in four real-world domains.

Abu-Mostafa (1990) and Caruana (1996) describe a technique for training a neural network on auxiliary tasks as well as the main task. The key idea is to add some output units to the network whose role is to predict the values of the auxiliary tasks and to include the prediction error for these auxiliary tasks in the error criterion that back propagation seeks to minimize. Because these auxiliary output units are connected to the same hidden units as the primary task output, the auxiliary outputs can influence the behavior of the network on the primary task. Parmanto et al. (1994) show that diverse classifiers can be learned by training on the same primary task but with many different auxiliary tasks. One good source of auxiliary tasks is to have the network attempt to predict one of its input features (in addition to the primary output task). They apply this method to medical diagnosis problems.

More recently, Munro and Parmanto (1997) developed an approach in which the value of the auxiliary output is determined dynamically by competition among the set of networks. Each network has a primary output y and a secondary (auxiliary) output z. During training, each network looks at the training example x and computes its primary and secondary output predictions. The network whose secondary output prediction is highest is said to be the winner for this example. It is given a target value for z of 1; the remaining networks are given a target value of 0 for the secondary output. All networks are given the value yi as the target for the primary output. The effect is to encourage different networks to become experts at predicting the secondary output z in different regions of the input space. Because the primary and secondary outputs share the hidden layer, the errors in the primary output become decorrelated. They show that this method substantially outperforms an ensemble of ordinary networks trained using different initial random weights when trained on a synthetic classification task.

In addition to these methods for training ensembles of neural networks, there are also methods that are specific to decision trees. Buntine (1990) developed an algorithm for learning option trees. These are decision trees where an internal node can contain several alternative splits (each producing its own sub-decision tree). To classify an example, each of these sub-decision trees is evaluated, and the resulting classifications are voted. Kohavi and Kunz (1997) describe an option tree algorithm and compare its performance to bagged C4.5 trees.
They show that option trees generally match the performance of bagging and produce a much more understandable result.

This completes my review of methods for generating ensembles using a single learning algorithm. Of course, one can always generate an ensemble by combining classifiers constructed by different learning algorithms. Learning algorithms based on very different principles will probably produce very diverse classifiers. However, often some of these classifiers perform much worse than others. Furthermore, there is no guarantee of diversity. Hence, when classifiers from different learning algorithms are combined, they should be checked (for example, by cross-validation) for accuracy and diversity, and some form of weighted combination should be used. This approach has been shown effective in some applications (for example, Zhang, Mesirov, and Waltz [1992]).

Methods for Combining Classifiers

Given that we have trained an ensemble of classifiers, how should we combine their individual classification decisions? Many methods have been explored. They can be subdivided into unweighted vote, weighted vote, and gating networks.

The simplest approach is to take an unweighted vote, as is done in bagging, ECOC, and many other methods. Although it might appear that more intelligent voting schemes should do better, the experience in the forecasting literature has been that simple, un-[…]

[…] choosing less correlated subsets of the ensemble and combining them by linear least squares.

For classification problems, weights are usually obtained by measuring the accuracy of each individual classifier hl on the training data (or a holdout data set) and constructing weights that are proportional to these accuracies. Ali and Pazzani (1996) describe a method that they call likelihood combination, in which they apply the naive Bayes algorithm (see The Naive Bayes Classifier) to learn weights for classifiers. In ADABOOST, the weight for classifier hl is computed from the accuracy of hl measured on the weighted training distribution that was used to learn hl. A Bayesian approach to weighted vote is to compute the posterior probability of each hl. This method requires the definition of a prior distribution P(hl) that is multiplied with the likelihood P(S|hl) to estimate the posterior probability of each hl. Ali and Pazzani experiment with this method. Earlier work on Bayesian voting of decision trees was performed by Buntine (1990).

The third approach to combining classifiers is to learn a gating network or a gating function that takes as input x and produces as output the weights wl to be applied to compute the weighted vote of the classifiers hl. Jordan and Jacobs (1994) learn gating networks that have the form

wl = e^(zl) / Σu e^(zu)
zl = vl^T x.
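This gating computation can be sketched directly from the two formulas: a dot product zl for each classifier, followed by a softmax normalization. The parameter vectors v here would be learned by the mixture-of-experts training procedure, which is omitted; the max-subtraction is a standard numerical-stability step, not part of the published formula:

```python
import math

# Gating function of the Jordan-Jacobs form: z_l = v_l . x and
# w_l = exp(z_l) / sum_u exp(z_u). `v` holds one parameter vector per
# classifier; the returned weights are used in the weighted vote.
def gating_weights(v, x):
    z = [sum(vj * xj for vj, xj in zip(vl, x)) for vl in v]
    m = max(z)                          # subtract max for numerical stability
    e = [math.exp(zl - m) for zl in z]
    s = sum(e)
    return [el / s for el in e]
```

The weights are nonnegative and sum to 1, so the combined prediction is a convex combination of the individual classifiers' votes that depends on the input x.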
Large machine-learning problems are beginning to arise in database-mining applications, where there can be millions of transactions every day and where it is desirable to have machine-learning algorithms that can analyze such large data sets in just a few hours of computer time. Another area where large learning problems arise is in information retrieval from full-text databases and the World Wide Web. In information retrieval, each word in a document can be treated as an input feature; so, each training example can be described by thousands of features. Finally, applications in speech recognition, object recognition, and character recognition for Chinese and Japanese present situations in which hundreds or thousands of classes must be discriminated.

Learning with Large Training Sets

Decision tree algorithms have been extended to handle large data sets in three different ways. One approach is based on intelligently sampling subsets of the training data as the tree is grown. To describe how this process works, I must review how decision tree algorithms operate.

Decision trees are constructed by starting with the entire training set and an empty tree. A test is chosen for the root of the tree, and the training data are then partitioned into disjoint subsets depending on the outcome of the test. The algorithm is then applied recursively to each of these disjoint subsets. The algorithm terminates when all (or most) of the training examples within a subset of the data belong to the same class. At this point, a leaf node is created and labeled with the class.

The process of choosing a test for the root of the tree (or for the root of each subtree) involves analyzing the training data and choosing the one feature that is the best predictor of the output class. In a large and redundant data set, it might be possible to make this choice based on only a sample of the data. Musick, Catlett, and Russell (1993) presented an algorithm that dynamically chooses the sample based on how difficult the decision is at each node. The algorithm typically behaves by using a small sample near the root of the tree and then progressively enlarging the sample as the tree grows. This technique can reduce the time required to grow the tree without reducing the accuracy of the tree at all.

A second approach is based on developing clever data structures that avoid the need to store all training data in random-access memory. The hardest step for most decision tree algorithms is to find tests for real-valued features. These tests usually take the form xj ≤ θ for some threshold value θ. The standard approach is to sort the training examples at the current node according to the values of feature xj and then make a sequential pass over the sorted examples to choose the threshold θ. In standard depth-first algorithms, the data must be sorted by each candidate feature xj at each node of the tree, which can become expensive.

Shafer, Agrawal, and Mehta (1996) describe their SPRINT method, in which the training data are broken up into a separate disk file for each attribute (and sorted by attribute value). For feature j, the disk file contains records of the form <i, xij, yi>, where i is the index of the training example, xij is the value of feature j for training example i, and yi is the class of example i. To choose a splitting threshold θ for feature j, it is a simple matter to make a serial scan of this disk file and construct class histograms for each potential splitting threshold. Once a value is chosen, the disk file is partitioned (logically) into two files, one containing examples with values less than or equal to θ and the other containing examples with values greater than θ. As these files are written to disk, a hash table is built (in main memory) in which the index of each training example i is associated with the left or right child node of the newly created split. Then, each of the disk files for the other attributes xl, l ≠ j, is read and split into a file for the left child and a file for the right child. The index of each example is looked up in the hash table to determine the child to which it belongs. SPRINT can be parallelized easily, and it has been applied to data sets containing 2.5 million examples. Mehta, Agrawal, and Rissanen (1996) have developed a closely related algorithm, SLIQ, that makes more use of main memory but also scales to millions of training examples and is slightly faster than SPRINT. Both SPRINT and SLIQ scale approximately linearly in the number of training examples and the number of features, aside from the cost of performing the initial sorting of the data (which need only be done once).

A third approach to large data sets is to take advantage of ensembles of decision trees (Chan and Stolfo 1995). The training data can randomly be partitioned into N disjoint subsets. A separate decision tree can be grown from each subset in parallel. The trees can then vote to make classification decisions. Although the accuracy of each of the individual decision trees is less than the accuracy of a single tree grown with all the data, the accuracy of the ensemble is often better than the accuracy of a single tree. Hence, with N parallel processors, we can achieve a speedup of N in the time required to construct the decision trees.
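The partition-and-vote scheme can be sketched as follows, with a one-split decision stump standing in for a full decision tree learner; the stump and the data layout are illustrative simplifications of the Chan and Stolfo setup:

```python
import random
from collections import Counter

def train_stump(data):
    """Train a one-split decision stump on data = [(features, label), ...]."""
    best = None
    for j in range(len(data[0][0])):
        for theta in sorted({x[j] for x, _ in data}):
            left = [y for x, y in data if x[j] <= theta]
            right = [y for x, y in data if x[j] > theta]
            if not left or not right:
                continue
            ly = Counter(left).most_common(1)[0][0]
            ry = Counter(right).most_common(1)[0][0]
            errs = sum(y != ly for y in left) + sum(y != ry for y in right)
            if best is None or errs < best[0]:
                best = (errs, j, theta, ly, ry)
    if best is None:  # no valid split: fall back to the majority class
        maj = Counter(y for _, y in data).most_common(1)[0][0]
        return lambda x: maj
    _, j, theta, ly, ry = best
    return lambda x: ly if x[j] <= theta else ry

def partitioned_ensemble(data, n_parts, seed=0):
    """Grow one classifier per disjoint subset; combine by majority vote."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    parts = [shuffled[i::n_parts] for i in range(n_parts)]
    members = [train_stump(part) for part in parts if part]

    def classify(x):
        votes = Counter(h(x) for h in members)
        return votes.most_common(1)[0][0]

    return classify
```

In a parallel implementation, each part would be trained on its own processor; the vote at classification time is the only step that needs all N members.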
[…]repruning it. Two candidate replacement rules are grown and pruned. The first candidate is grown starting with an empty rule, whereas the second candidate is grown starting with the current rule. The better of the two candidates is selected (using a description-length heuristic) and added to the rule set.

Cohen compared RIPPER to C4.5RULES on 37 data sets and found that RIPPER matched or beat C4.5RULES in 22 of the 37 problems. The rule sets that it finds are always smaller than those constructed by C4.5RULES. An implementation of RIPPER is available for research use from Cohen (www.research.att.com/~wcohen/ripperd.html).

Learning with Many Features

In many learning problems, there are hundreds or thousands of potential features describing each input object x. Popular learning algorithms such as C4.5 and back propagation do not scale well when there are many features. Indeed, from a statistical point of view, examples with many irrelevant, but noisy, input features provide little information. It is easy for learning algorithms to be confused by the noisy features and construct poor classifiers. Hence, in practical applications, it is wise to carefully choose which features to provide to the learning algorithm.

Research in machine learning has sought to automate the selection and weighting of features, and many different algorithms have been developed for this purpose. An excellent review has been written by Wettschereck, Aha, and Mohri (1997). A comprehensive review of the statistical literature on feature selection can be found in Miller (1990). I discuss a few of the most significant methods here.

Three main approaches have been pursued. The first approach is to perform some initial analysis of the training data and select a subset of the features to feed to the learning algorithm. A second approach is to try different subsets of the features on the learning algorithm, estimate the performance of the algorithm with these features, and keep the subsets that perform best. The third approach is to integrate the selection and weighting of features directly into the learning algorithm. I discuss two examples of each approach.

Selecting and Weighting Features by Preprocessing
A simple preprocessing technique is to compute the mutual information (also called the information gain) between each input feature and the class. The mutual information between two random variables is the average reduction in uncertainty about the second variable given a value of the first. For discrete feature j, the mutual information weight wj can be computed as

wj = Σv Σc P(y = c, xj = v) · log [ P(y = c, xj = v) / (P(y = c) P(xj = v)) ],

where P(y = c) is the proportion of training examples in class c, and P(xj = v) is the probability that feature j takes on value v. For real-valued features, the sums become integrals that must be approximated. A good approximation is to apply a discretization algorithm, such as the one advocated by Fayyad and Irani (1993); convert the real-valued feature into a discrete-valued feature; and then apply the previous formula. Wettschereck and Dietterich (1995) have obtained good results with nearest-neighbor algorithms using mutual information weighting.

A problem with mutual information weighting is that it treats each feature independently. For features whose predictive power is only apparent in combination with other features, mutual information assigns a weight of zero. For example, a difficult class of learning problems involves learning parity functions with random irrelevant features. A parity function over n binary features is equal to 1 if and only if an odd number of the features are equal to 1. Suppose we define a learning problem in which there are 4 relevant features and 10 irrelevant (random) binary features, and the class is the 4-parity of the 4 relevant features. The mutual information weights of all features will be approximately zero using the previous formula.

An algorithm that overcomes this problem (and is one of the most successful preprocessing algorithms to date) is the RELIEF-F algorithm (Kononenko 1994), which is an extension of an earlier algorithm called RELIEF (Kira and Rendell 1992). The basic idea of these algorithms is to draw examples at random, compute their nearest neighbors, and adjust a set of feature weights to give more weight to features that discriminate the example from neighbors of different classes. Specifically, let x be a randomly chosen training example, and let xs and xd be the two training examples nearest to x (in Euclidean distance) in the same class and in a different class, respectively. The goal of RELIEF is to set the weight wj on input feature j to be

wj = P(xj ≠ xdj) – P(xj ≠ xsj).

In other words, the weight wj should be maximized when xd has a high probability of taking on a different value for feature j, and xs has a low probability of taking on a different value for feature j. RELIEF-F computes a more reliable estimate of this probability difference by computing the B nearest neighbors of x in each class.
[…]bit vector where a 1 in position j means that feature j is relevant, and a 0 means it is irrelevant. A schema is a vector containing 0s, 1s, and ❀s. A ❀ in position j means that this feature should be selected randomly to be relevant 50 percent of the time. Moore and Lee race pairs of schemas against one another as follows: A training example is randomly selected and temporarily deleted from the training set. The nearest-neighbor algorithm is applied to classify it using each of the two schemas being raced. To classify an example using a schema, features indicated by a 0 are ignored, features indicated by a 1 are selected, and features indicated by a ❀ are selected with probability 0.5.

Suppose, for illustration, that we have five features. Moore and Lee begin by conducting five simultaneous pairwise races:

1❀❀❀❀ races against 0❀❀❀❀
❀1❀❀❀ races against ❀0❀❀❀
❀❀1❀❀ races against ❀❀0❀❀
❀❀❀1❀ races against ❀❀❀0❀
❀❀❀❀1 races against ❀❀❀❀0

All the races are terminated as soon as one of the schemas is found to be better than its opponent. In the next iteration, all single-bit refinements of the winning schema are raced against one another. For example, suppose schema ❀1❀❀❀ was the winner of the first race. Then, the next iteration involves the following four pairwise races:

11❀❀❀ races against 01❀❀❀
❀11❀❀ races against ❀10❀❀
❀1❀1❀ races against ❀1❀0❀
❀1❀❀1 races against ❀1❀❀0

[…]moved from the winning schema. Moore and Lee found that this method never misses important relevant features, even when the features are highly dependent. However, in some rare cases, the races can take a long time to conclude. The problem appears to be that the algorithm can be slow to replace a ❀ with a 0. In contrast, the algorithm is quick to replace a ❀ with a 1. Moore and Lee, therefore, investigated an algorithm, called SCHEMATA+, that terminates each race after 2000 evaluations (in […] to see which works better with various learning algorithms.

Integrating Feature Weighting into the Learning Algorithm
I now discuss two methods that integrate feature selection directly into the learning algorithm. Both of them have been shown to work well experimentally, and the second method, called WINNOW, works extremely well in problems with thousands of potentially relevant input features.

The first algorithm is called the variable-kernel similarity metric, or VSM method (Lowe 1995). VSM is a form of Gaussian radial basis function method. To classify a new data point xt, it defines a multivariate Gaussian probability distribution ϕ centered on xt with standard deviation σ. Each example (xi, yi) in the training data set "votes" for class yi with an amount ϕ(||xi – xt||; σ). The class with the highest vote is assigned to be the class yt = f(xt) of the data point xt.

The key to the effectiveness of VSM is that it learns a weighted distance metric for measuring the distance between the new data point x and each training point xi. VSM also adjusts the size σ of the Gaussian distribution depending on the local density of training examples in the neighborhood of xt.

In detail, VSM is controlled by a set of learned feature weights w1, …, wn, a kernel radius parameter r, and a number of neighbors R. To classify a new data point xt, VSM first computes the weighted distances to the R nearest neighbors (xi, yi):

di = sqrt( Σj wj² (xtj – xij)² ).

It then computes a kernel width σ from the average distance to the R/2 nearest neighbors:

σ = (2r/R) Σ(i=1..R/2) di.

Finally, it computes the probability that the new point xt belongs to each class c (the quantity vi is the vote from training example i):

vi = exp(–di² / 2σ²) […]
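The VSM classification step can be sketched directly from these formulas. The learning of the weights w and the radius r is omitted, and because the final class-probability formula is not given here, normalizing the Gaussian votes into class probabilities is an assumption:

```python
import math
from collections import defaultdict

# Sketch of VSM classification: weighted distances to the R nearest
# neighbors, a kernel width sigma = (2r/R) * sum of the R/2 smallest
# distances, and Gaussian votes v_i = exp(-d_i^2 / (2 sigma^2)).
def vsm_classify(train, x_t, w, r=1.0, R=4):
    dists = sorted(
        (math.sqrt(sum((wj * (xj - tj)) ** 2 for wj, xj, tj in zip(w, xi, x_t))), yi)
        for xi, yi in train
    )[:R]
    sigma = (2.0 * r / R) * sum(d for d, _ in dists[: max(1, R // 2)])
    sigma = max(sigma, 1e-12)  # guard against a zero kernel width
    votes = defaultdict(float)
    for d, yi in dists:
        votes[yi] += math.exp(-d * d / (2.0 * sigma ** 2))
    total = sum(votes.values())
    probs = {c: v / total for c, v in votes.items()}  # assumed normalization
    return max(votes, key=votes.get), probs
```

Because sigma shrinks where training examples are dense and grows where they are sparse, the effective neighborhood adapts to the local density, which is the point of the variable kernel.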
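The WINNOW parameters α, β, and θ that appear below are those of Littlestone's multiplicative-update algorithm; a sketch of its standard form over Boolean features (the training loop and epoch count are illustrative):

```python
# Sketch of WINNOW-style training: weights start at 1; on a mistake,
# the weights of the active features (value 1) are multiplied by alpha
# (promotion, when the true label is 1) or by beta (demotion, when it
# is 0). Inactive features are never touched, which is why rarely
# active features keep their weights.
def winnow_train(examples, n_features, alpha=1.5, beta=0.5, theta=1.0, epochs=10):
    w = [1.0] * n_features
    for _ in range(epochs):
        for x, y in examples:
            score = sum(wj for wj, xj in zip(w, x) if xj)
            pred = 1 if score > theta else 0
            if pred != y:
                factor = alpha if y == 1 else beta
                for j, xj in enumerate(x):
                    if xj:
                        w[j] *= factor
    return w
```

The updates are mistake driven and purely local to the active features, which is what makes the algorithm fast, usable online, and quick to adapt when the target function drifts.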
Figure 10. Comparison of the Percentage of Correct Classifications for a Modified Bayesian Method and WINNOW for 21 Sets of Frequently Confused Words. Trained on 80 percent of the Brown corpus and tested on the remaining 20 percent. Points lying above the line y = x correspond to cases where WINNOW is more accurate. (Scatter plot; x-axis: Modified Bayes Accuracy, 70 to 100 percent.)

[…] 34 input features available, some with many possible values. Blum defined Boolean features corresponding to all possible values of all pairs of the input features. For example, given two input features (event-type and position-of-attendees), he would define a separate Boolean feature for each legal combination, such as event-type = meeting and position-of-attendees = grad-student. There were a total of 59,731 Boolean input features. He then applied WINNOW with α = 3/2 and β = 1/2. He modified WINNOW to prune (set to 0) weights that become very small (less than 0.00001). WINNOW found that 561 of the Boolean features were actually useful for prediction. Its accuracy was better than that of the best previous classifier for this task (which used a greedy forward-selection algorithm to select relevant features for a decision tree-learning algorithm).

Golding and Roth (1996) describe an application of WINNOW to context-sensitive spelling correction. This is the task of identifying spelling errors where one legal word is substituted for another, such as "It's not to late," where to is substituted for too. The Random […]

[…] within 10 words before or after the target word. The second kind are collocation features. These test for a string of two words or part-of-speech tags immediately adjacent to the target word. For example, the sequence target to VERB is a collocation feature that checks whether the target word is immediately followed by the word to and then a word that can potentially be a verb (according to a dictionary lookup). Based on the one-million-word Brown corpus (Kucera and Francis 1967), Golding and Roth defined more than 10,000 potentially relevant features.

Golding and Roth applied WINNOW (with α = 3/2, β varying between 0.5 and 0.9, and θ = 1). They compared its accuracy to the best previous method, which is a modified naive Bayesian algorithm. Figure 10 shows the results for 21 sets of frequently confused words.

An important advantage of WINNOW, in addition to its speed and ability to operate online, is that it can adapt rapidly to changes in the target function. This advantage was shown to be important in the calendar scheduling problem, where the scheduling of different types of meeting can shift when semesters change (or during summer break). Blum's experiments showed that WINNOW was able to respond quickly to such changes. By comparison, to enable the decision tree algorithm to respond to changes, it was necessary to decide which old training examples could be deleted. This task is difficult to do because although some kinds of meeting might change with the change in semesters, other meetings might stay the same. A decision to keep, for example, only the most recent 180 training examples means that examples of rare meetings whose scheduling does not change will be lost, which hurts the performance of the decision tree approach. In contrast, because WINNOW only revises the weights on features when they have the value 1, features describing such rare meetings are likely to retain their weights even as the weights for other features are being modified rapidly.

Summary: Scaling Up Learning Algorithms

This concludes my review of methods for scal-[…]
House dictionary (Flexner 1983) lists many ing up learning algorithms to apply to very
sets of commonly confused words, and Gold- large problems. With the techniques described
ing and Roth developed a separate WINNOW here, problems having one million training ex-
classifier for each of the listed sets (for exam- amples can be solved in reasonable amounts of
112 AI MAGAZINE
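The multiplicative update at the heart of the WINNOW experiments described in this section can be sketched in a few lines. This is a minimal illustration, not Blum's or Golding and Roth's actual code; the toy features and data are invented, with θ = 1 (as in the spelling-correction setup) and α = 3/2, β = 1/2 (as in the calendar experiments):

```python
def winnow_update(w, x, y, alpha=1.5, beta=0.5, theta=1.0):
    """One online WINNOW step: multiplicative promotion/demotion
    applied only to the weights of active (value-1) features."""
    pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0
    if pred == 0 and y == 1:      # false negative: promote active features
        w = [wi * alpha if xi else wi for wi, xi in zip(w, x)]
    elif pred == 1 and y == 0:    # false positive: demote active features
        w = [wi * beta if xi else wi for wi, xi in zip(w, x)]
    return w

# Toy run: only feature 0 actually predicts the class.
w = [0.5, 0.5, 0.5]
examples = [([1, 0, 1], 1), ([0, 1, 1], 0), ([1, 1, 0], 1)] * 20
for x, y in examples:
    w = winnow_update(w, x, y)
```

Because only features with value 1 are ever promoted or demoted, the weights of rarely active features survive unchanged, which is exactly the adaptability property discussed above.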
The previous two sections discussed problems in supervised learning from examples. This section addresses problems of sequential decision making and control that come under the heading of reinforcement learning.

Work in reinforcement learning dates back to the earliest days of AI, when Arthur Samuel (1959) developed his famous checkers program. More recently, there have been several important advances in the practice and theory of reinforcement learning. Perhaps the most famous work is Gerry Tesauro’s (1992) TD-GAMMON program, which has learned to play backgammon better than any other computer program and almost as well as the best human players. Two other interesting applications are the work of Zhang and Dietterich (1995) on job-shop scheduling and Crites and Barto (1995) on real-time scheduling of passenger elevators. Kaelbling, Littman, and Moore (1996) published an excellent survey of reinforcement learning, and Mahadevan and Kaelbling (1996) report on a recent National Science Foundation–sponsored workshop on the subject. Two new books (Barto and Sutton 1997; Bertsekas and Tsitsiklis 1996) describe the newly developed reinforcement learning algorithms and the theory behind them. I summarize these developments here.

where 0 ≤ γ < 1 is a discount factor that controls the relative importance of short-term and long-term rewards.

A procedure or rule for choosing an action a given the state s is called the policy of the robot, and it can be formalized as a function a = π(s). The goal of reinforcement learning algorithms is to compute the optimal policy, denoted π*, which maximizes the cumulative discounted reward.

Researchers in dynamic programming (for example, Bellman [1957]) found it convenient to define a real-valued function fπ(s) called the value function of policy π. The value function fπ(s) gives the expected cumulative discounted reward that will be received by starting in state s and executing policy π. It can be defined recursively by the formula

fπ(s) = ∑s' P(s'|s, π(s)) · [R(s, π(s), s') + γfπ(s')], (1)

where P(s'|s, π(s)) is the probability that the next state will be s' given that the current state is s and we take action π(s). Given a policy π, a reward function R, and the transition probability function P, it is possible to compute the value function fπ by solving the system of linear equations containing one equation of the form of equation 1 for each possible state s. The system of equations
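Because equation 1 is linear in the values fπ(s), one simple way to solve the system is successive approximation: start with f = 0 and repeatedly apply the right-hand side (the discount factor γ < 1 makes this a contraction, so it converges). A small sketch with an invented three-state chain:

```python
def evaluate_policy(P, Rpi, gamma=0.9, sweeps=1000):
    """Solve equation 1 by successive approximation.
    P[s][s2] = P(s2 | s, pi(s)); Rpi[s] = expected immediate reward
    under the fixed policy pi (both invented for illustration)."""
    n = len(Rpi)
    f = [0.0] * n
    for _ in range(sweeps):
        # One sweep: f(s) <- Rpi[s] + gamma * sum_s2 P(s2|s,pi(s)) f(s2)
        f = [Rpi[s] + gamma * sum(P[s][s2] * f[s2] for s2 in range(n))
             for s in range(n)]
    return f

# Toy three-state MDP under a fixed policy.
P = [[0.8, 0.2, 0.0],
     [0.0, 0.5, 0.5],
     [0.1, 0.0, 0.9]]
Rpi = [1.0, 0.0, 2.0]
f = evaluate_policy(P, Rpi)
```

The same system can of course be handed to any linear-equation solver directly; the iteration above is just the shortest self-contained route.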
Table 2. Summary of the Performance of TD-GAMMON against Some of the World’s Best Players.
The results are expressed in net points won (or lost) and in points won per game. Taken from Tesauro (1995).
[Figure 17: left, the value function f(s) backed up from successor-state values f(s') under actions a1, a2, a3; right, the action-value function Q(s, a) backed up from the values Q(s', a') of the resulting states.]
The second approach is a model-free algorithm called Q-LEARNING, developed by Watkins (Watkins 1989; Watkins and Dayan 1992), which is the subject of the next section.

Model-Free Reinforcement Learning (Q-LEARNING)

Q-LEARNING is an online approximation of value iteration. The key to Q-LEARNING is to replace the value function f(s) with an action-value function, Q(s, a). The quantity Q(s, a) gives the expected cumulative discounted reward of performing action a in state s and then pursuing the current policy thereafter. Hence, the value of a state is the maximum of the Q values for that state:

f(s) = maxa Q(s, a).

We can write down the Q version of the Bellman equation as follows:

Q(s, a) = ∑s' P(s'|s, a) [R(s, a, s') + γ maxa' Q(s', a')].

The role of the Q function is illustrated in figure 17, where it is contrasted with the value function f. In the left part of the figure, we see that the Bellman backup updates the value of f(s) by considering the values f(s') of states that result from different possible actions. In the right part of the figure, the analogous backup works by taking the best of the values Q(s', a') and backing them up to compute an updated value for Q(s, a).

The Q function can be learned by an algorithm that exploits the same insight as TD(0)—online sampling of the transition probabilities—and an additional idea—online sampling of the available actions. Specifically, suppose we have visited state s, performed action a, and observed the resulting state s' and immediate reward R(s, a, s'). We can update the Q function as follows:

Q(s, a) := (1 – α)Q(s, a) + α[R(s, a, s') + γ maxa' Q(s', a')].

Suppose that every time we visit state s, we choose the action a uniformly at random. Then, the effect is to approximate a full Bellman backup (see equation 4). Each value Q(s, a) is the expected cumulative discounted reward of executing action a in s, and the maximum of these values is f(s). The random choice of a ensures that we learn Q values for every action available in state s. The online updates ensure that we experience the resulting states s' in proportion to their probabilities P(s'|s, a).

In general, we can choose which actions to perform in any way we like, as long as there is a nonzero probability of performing every action in every state. Hence, a reasonable strategy is to choose the best action (that is, the one whose estimated value Q(s, a) is largest) most of the time but to choose a random action with some small probability. Watkins proves that as long as every action is performed in every state infinitely many times, the Q function converges to the optimal function Q* with probability 1.

Once we have learned Q*, we must convert it into a policy. Although this conversion presented an insurmountable difficulty for TD(0), it is trivial for the Q representation. The optimal policy can be computed as

π*(s) = argmaxa Q*(s, a).

Crites and Barto (1995) applied Q-LEARNING to a problem of controlling 4 elevators in a 10-story office building during peak “down traf-
[Figure 18: two bar charts comparing the policies SECTOR, DLB, HUFF/B, LQF, HUFF, FIM, ESA/nq, ESA, and Q; left axis, squared wait time (0–700); right axis, fraction of passengers waiting more than 60 seconds (0–1).]

Figure 18. Comparison of Learned Elevator Policy Q with Eight Published Heuristic Policies.
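The Q-LEARNING update rule from the previous section can be sketched on a toy problem. The two-state chain, rewards, and parameter settings below are invented for illustration (they are not the elevator task), with uniformly random exploration as described in the text:

```python
import random

def q_learning(P_next, R, n_states, n_actions, gamma=0.9, alpha=0.1,
               steps=2000):
    """Tabular Q-LEARNING on a toy deterministic MDP.
    P_next[s][a] -> next state; R[s][a] -> immediate reward."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(steps):
        a = random.randrange(n_actions)         # explore uniformly at random
        s2 = P_next[s][a]
        target = R[s][a] + gamma * max(Q[s2])   # sampled Bellman backup
        Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
        s = s2
    return Q

# Two-state chain: action 1 pays reward 1 and moves to state 1.
P_next = [[0, 1], [0, 1]]
R = [[0.0, 1.0], [0.0, 1.0]]
random.seed(0)
Q = q_learning(P_next, R, n_states=2, n_actions=2)
policy = [max(range(2), key=lambda a: Q[s][a]) for s in range(2)]
```

With γ = 0.9, the true values are Q*(s, 1) = 10 and Q*(s, 0) = 9 for both states, so the greedy policy learned here chooses action 1 everywhere.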
fic.” People press elevator call buttons on the various floors of the building to call the elevator to the floor. Once inside the elevator, they can also press destination buttons to request that the elevator stop at various floors. The elevator control decisions are (1) to decide, after stopping at a floor, which direction to go next and (2) to decide, when approaching a floor, whether to stop at the floor or skip it. Crites and Barto applied rules to make the first decision; so, the reinforcement learning problem is to learn whether to stop or skip floors. The goal of the controller is to minimize the square of the time that passengers must wait for the elevator to arrive after pressing the call button.

Crites and Barto used a team of four Q-learners, one for each of the four elevator cars. Each Q-function was represented as a neural network with 47 input features, 20 sigmoidal hidden units, and 2 linear output units (to represent Q(s, stop) and Q(s, skip)). The immediate reward was the (negative of the) squared wait time since the previous action. They employed a form of random exploration in which exploratory actions are more likely to be chosen if they have higher estimated Q values.

Figure 18 compares the performance of the learned policy to that of eight heuristic algorithms, including the best nonproprietary algorithms. The left-hand graph shows the squared wait time, and the right-hand graph shows the percentage of passengers that had to wait more than 60 seconds for the elevator. The learned Q policy performs better than all the other methods.

Another interesting application of Q-LEARNING is to the problem of assigning radio channels for cellular telephone traffic. Singh and Bertsekas (1997) showed that Q-LEARNING could find a much better policy than some sophisticated and complex published methods.

Open Problems in Reinforcement Learning

Many important problems remain unsolved in reinforcement learning, which reflects the relative youth of the field. I discuss a few of these problems here.

First, the use of multilayer sigmoidal neural networks for value-function approximation has worked, but there is no reason to believe that such networks are well suited to reinforcement learning. First, they tend to forget episodes (both good and bad) unless they are retrained on the episodes frequently. Second, the need to make small gradient-descent steps makes learning slow, particularly in the early stages. An important open problem is to clarify what properties an ideal value-function approximator would possess and develop function approximators with these properties. Initial research suggests that value-function approximators should be local averagers that compute the value of a new state by interpolating among the values of previously visited states (Gordon 1995).

A second key problem is to develop reinforcement methods for hierarchical problem solving. For very large search spaces, where the distance to the goal and the branching factor are big, no search method can work well. Often such large search spaces have a hierarchical
(or approximately hierarchical) structure that can be exploited to reduce the cost of search. There have been several studies of ideas for hierarchical reinforcement learning (for example, Dayan and Hinton [1993], Kaelbling [1993], and Singh [1992]).

The third key problem is to develop intelligent exploration methods. Weak exploration methods that rely on random or biased random choice of actions cannot be expected to scale well to large, complex spaces. A property of the successful applications shown previously (particularly, backgammon and job-shop scheduling) is that even random search reaches a goal state and receives a reward. In domains where success is contingent on a long sequence of successful choices, random search has a low probability of receiving any reward. More intelligent search methods, such as means-ends analysis, need to be integrated into reinforcement learning systems as they have been integrated into other learning architectures such as SOAR (Laird, Newell, and Rosenbloom 1987) and PRODIGY (Minton et al. 1989).

A fourth problem is that optimizing cumulative discounted reward is not always appropriate. In problems where the system needs to operate continuously, a better goal is to maximize the average reward per unit time. However, algorithms for this criterion are more complex and not as well behaved. Several new methods have been put forward recently (Mahadevan 1996; Ok and Tadepalli 1996; Schwartz 1993).

The fifth, and perhaps most difficult, problem is that existing reinforcement learning algorithms assume that the entire state of the environment is visible at each time step. This assumption is not true in many applications, such as robot navigation or factory control, where the available sensors provide only partial information about the environment. A few algorithms for the solution of hidden-state reinforcement learning problems have been developed (Littman, Cassandra, and Kaelbling 1995; McCallum 1995; Parr and Russell 1995; Cassandra, Kaelbling, and Littman 1994). Exact solution appears to be difficult. The challenge is to find approximate methods that scale well to large hidden-state applications.

Despite these substantial open problems, reinforcement learning methods are already being applied to a wide range of industrial problems where traditional dynamic programming methods are infeasible. Researchers in the area are optimistic that reinforcement learning algorithms can solve many problems that have resisted solution by machine-learning methods in the past. Indeed, the general problem of choosing actions to optimize expected utility is exactly the problem faced by general intelligent agents. Reinforcement learning provides one approach to attacking these problems.

Learning Stochastic Models

The final topic that I discuss is the area of learning stochastic models. Traditionally, researchers in machine learning have sought general-purpose learning algorithms—such as the decision tree, rule, neural network, and nearest-neighbor algorithms—that could efficiently search a large and flexible space of classifiers for a good fit to training data. Although these algorithms are general, they have a major drawback. In a practical problem where there is extensive prior knowledge, it can be difficult to incorporate this prior knowledge into these general algorithms. A secondary problem is that the classifiers constructed by these general learning algorithms are often difficult to interpret—their internal structure might not have any correspondence to the real-world process that is generating the training data.

Over the past five years or so, there has been tremendous interest in a more knowledge-based approach based on stochastic modeling. A stochastic model describes the real-world process by which the observed data are generated. Sometimes, the terms generative stochastic model and causal model are used to emphasize this perspective. The stochastic model is typically represented as a probabilistic network—a graph structure that captures the probabilistic dependencies (and independencies) among a set of random variables. Each node in the graph has an associated probability distribution, and from these individual distributions, the joint distribution of the observed data can be computed. To solve a learning problem, the programmer designs the structure of the graph and chooses the forms of the probability distributions, yielding a stochastic model with many free parameters (that is, the parameters of the node-probability distributions). Given a training sample, learning algorithms can be applied to determine the values of the free parameters, thereby fitting the model to the data. Once a stochastic model has been learned, probabilistic inference can be carried out to support tasks such as classification, diagnosis, and prediction.

More details on probabilistic networks are given in two recent textbooks: Jensen (1996) and Castillo, Gutierrez, and Hadi (1997).

Probabilistic Networks

Figure 19 is an example of a probabilistic net-
Age     P(A)        Preg   P(N)
0–25    ·           0      ·
26–50   ·           1      ·
51–75   ·           >1     ·
>75     ·

P(M|A, N)
Age     Preg   0–50   51–100   >100
0–25    0      ·      ·        ·
0–25    1      ·      ·        ·
0–25    >1     ·      ·        ·
26–50   0      ·      ·        ·
26–50   1      ·      ·        ·
26–50   >1     ·      ·        ·
51–75   0      ·      ·        ·
51–75   1      ·      ·        ·
51–75   >1     ·      ·        ·
>75     0      ·      ·        ·
>75     1      ·      ·        ·
>75     >1     ·      ·        ·

Table 3. Probability Tables for the Age, Preg, and Mass Nodes from Figure 19.
A learning algorithm must fill in the actual probability values based on the observed training data.
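Filling in a table such as P(M|A, N) amounts to counting co-occurrences in the training data. A minimal sketch (the value ranges follow table 3, but the records themselves are made up for illustration):

```python
from collections import Counter

# Hypothetical training records: (age_range, pregnancies, mass_range).
records = [("0-25", "0", "0-50"), ("0-25", "0", "0-50"),
           ("0-25", "0", "51-100"), ("26-50", "1", "51-100")]

counts = Counter(records)                       # joint counts (A, N, M)
parents = Counter((a, n) for a, n, _ in records)  # parent counts (A, N)

def p_mass(mass, age, preg):
    """Estimated table entry P(M = mass | A = age, N = preg)."""
    return counts[(age, preg, mass)] / parents[(age, preg)]
```

Each row of the conditional probability table is just the fraction of matching records among those that share the same parent values; rows with no matching records would need smoothing or a prior in practice.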
most current applications, steps 1 and 2 are performed by a user, and step 3 is performed by a learning algorithm. However, later in this section, I briefly discuss methods for automating step 1—learning the graph structure.

Once we learn the model, how can we apply it to predict whether new patients have diabetes? For a new case, we observe the values of all the variables in the model except for the Diabetes node. Our goal is to compute the probability of the node; that is, we seek the distribution P(D|A, N, M, I, G). We could precompute this distribution offline before observing any of the data, but the resulting conditional probability table would be immense. A better approach is to wait until the values of the variables have been observed and then compute the single corresponding row of the class probability table.

This inference problem has been studied intensively, and a general and elegant algorithm—the junction-tree algorithm—has been developed (Jensen, Lauritzen, and Olesen 1990). In addition, efficient online algorithms have been discovered (D’Ambrosio 1993). In the worst case, these algorithms require exponential time, but if the probabilistic network is sparsely connected, the running time is reasonable.

The Naive Bayes Classifier

A simple approach to stochastic modeling for classification problems is the so-called naive Bayes classifier (Duda and Hart 1973). In this approach, the training examples are assumed to be produced by the probabilistic network shown in figure 20, where the class variable is y, and the features are x1, …, xn. According to this model, the environment generates an example by first choosing (stochastically) which class to generate. Then, once the class is chosen, the features describing the example are generated independently according to their individual distributions P(xj|y).

In most applications, the values of the features are discretized so that each feature takes on only a small number of discrete values. The probability distributions are represented as tables as in the diabetes example, and the network can be learned directly from the training data by counting the fraction of examples in
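The counting scheme for fitting and applying a naive Bayes classifier can be sketched as follows. The class and feature values are invented, and a real implementation would also smooth the counts to avoid zero probabilities:

```python
from collections import Counter, defaultdict

def fit_naive_bayes(X, y):
    """Estimate P(y) and P(x_j = v | y) by counting, as described above."""
    m = len(y)
    class_counts = Counter(y)
    prior = {c: class_counts[c] / m for c in class_counts}
    cond = defaultdict(lambda: defaultdict(Counter))  # cond[c][j][v]
    for xi, c in zip(X, y):
        for j, v in enumerate(xi):
            cond[c][j][v] += 1
    return prior, cond, class_counts

def predict(x, prior, cond, class_counts):
    """Pick the class maximizing P(y) * prod_j P(x_j | y)."""
    best, best_p = None, -1.0
    for c, pc in prior.items():
        p = pc
        for j, v in enumerate(x):
            p *= cond[c][j][v] / class_counts[c]
        if p > best_p:
            best, best_p = c, p
    return best

# Toy data: feature 0 determines the class.
X = [(0, 1), (0, 0), (1, 1), (1, 1)]
y = [0, 0, 1, 1]
model = fit_naive_bayes(X, y)
```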
[Figure: scatterplot; axis labeled C4.5 Accuracy (40–70).]

can be computed by any algorithm for inference in probabilistic networks (including the junction-tree algorithm mentioned earlier). With this formula, the gradient of the parameters of a network can be computed with respect to a training sample S. We can then apply

uous variables. Instead of computing the maximum-likelihood estimates of the parameters, it adopts a prior probability distribution over the parameter values and computes the maximum a posteriori probability (MAP) values. This is easily accomplished by a minor modification of EM. One of the most interesting applications of AUTOCLASS was to the problem of analyzing the infrared spectra of stars. AUTOCLASS discovered a new class of star, and this discovery was subsequently accepted by astronomers (Cheeseman et al. 1988).

Gibbs Sampling  The final algorithm that I discuss is a Monte Carlo technique called Gibbs sampling (Geman and Geman 1984). Gibbs sampling is a method for generating random samples from a joint-probability distribution P(A1, …, An) when sampling directly from the joint distribution is difficult. Suppose we know the conditional distribution of each variable Ai in terms of all the others: P(A1|A2, …, An), P(A2|A1, A3, …, An), …, P(An|A1, …, An–1). The Gibbs sampler works as follows: We start with a set of arbitrary values, a1, …, an, for the random variables. For each value of i, we then sample a new value ai for random variable Ai according to the distribution P(Ai|A1 = a1, …, Ai–1 = ai–1, Ai+1 = ai+1, …, An = an). If we repeat this long enough, then under certain mild conditions, the empirical distribution of these generated points will converge to the joint distribution.

How is this useful for learning in probabilistic networks? Suppose we want to take a full Bayesian approach to learning the unknown parameters in the network. In such cases, we want to compute the posterior probability of the unknown parameters given the data. Let W denote the vector of unknown parameters. In a full Bayesian approach, we treat W as a random variable with a prior probability distribution P(W). Given the training examples {x1, …, xm}, we want to compute P(W|x1, …, xm), which is difficult because of the hidden classes y1, …, ym. However, using the Gibbs sampler, we can generate samples from the distribution P(W, y1, …, ym|x1, …, xm). Then, by simply ignoring the y values, we obtain samples for W: W1, …, WL. These W values constitute an ensemble of learned values for the network parameters. To classify a new data point x, we can apply each of the values of Wl to predict the class of x and have them vote. The voting is with equal weight. If some W values have higher posterior probability than others, then they will appear more often in our sample.

To apply the Gibbs sampler to our unsupervised learning problem, we begin by choosing random initial values for the parameters W of the network and setting l to zero. We then repeat the following loop L times.

First, compute new values for y1, …, ym: From the probabilistic network in figure 20, we can compute P(yi|W, xi) because we know (current guesses for) W and (observed values for) xi. To sample from this distribution, we flip a biased coin with probability of heads P(yi|W, xi).

Second, compute new values for W: Let w0 be the parameter that represents the probability of generating an example from class 0. Suppose the prior distribution, P(w0), is the uniform distribution for all values 0 ≤ w0 ≤ 1. Let m0 be the number of training examples (currently) assigned to class 0. Then, the posterior probability P(w0|y1, …, ym) has a special form known as a beta distribution with parameters m0 + 1 and m – m0 + 1. Algorithms are available for drawing samples from this distribution. Similarly, let wjv0 be the parameter that represents the probability that the jth feature will have the value v when drawn from class 0: P(xj = v|y = 0). Again assuming uniform priors for P(wjv0), this variable also has a beta distribution, with parameters cjv0 + 1 and m0 – cjv0 + 1, where cjv0 is the number of training examples in class 0 having xj = v. As with w0, we can sample from this distribution. We can do the same thing for the parameters concerning class 1: wjv1.

Third, record the value of the parameter vector. Let l := l + 1 and set Wl := W. To allow the Gibbs sampler to converge to a stationary distribution, we should perform some number of iterations before recording L values of the parameter vector. This procedure gives us an ensemble of L probabilistic networks that can be applied to classify new data points.

There is a close resemblance between Gibbs sampling and the EM algorithm. In both algorithms, the training examples are augmented with information about the hidden variables. In EM, this information describes the probability distribution of the hidden variables, whereas in Gibbs sampling, this information consists of random values drawn according to the probability distribution. In EM, the parameters are recomputed to be their maximum-likelihood estimates based on the current values of the hidden variables, whereas in Gibbs sampling, new parameter values are sampled from the posterior distribution given the current values of the hidden variables. Hence, EM can be viewed as a maximum-likelihood approximation (or MAP approximation) to Gibbs sampling.

One potential problem that can arise in Gibbs sampling is caused by symmetries in the stochastic model. In the case we are considering, for example, suppose there are two true
underlying hidden classes, class A and class B. We want the model to generate examples from class A when y = 1 and from class B when y = 2. However, there is nothing to force the model to do this. It could just as easily use y = 1 to represent examples of class B and y = 2 to represent examples of class A. If permitted to run long enough, in fact, the Gibbs sampler should explore both possibilities. If our sample of weight vectors Wl includes weight vectors corresponding to both of these alternatives, then when we combine these weight vectors, we will get bad results. In practice, this problem often does not arise, but in general, steps must be taken to remove symmetries from the model (see Neal [1993]).

Gibbs sampling is a general method; it can often be applied in situations where the EM algorithm cannot. The generality of Gibbs sampling has made it possible to construct a general-purpose programming environment, called BUGS, for learning stochastic models (Gilks, Thomas, and Spiegelhalter 1993). In this environment, the user specifies the graph structure of the stochastic model, the form of the probability distribution at each node, and prior distributions for each parameter. The system then develops a Gibbs sampling algorithm for fitting this model to the training data. BUGS can be downloaded from www.mrc-bsu.cam.ac.uk/bugs/.

More Sophisticated Stochastic Models

This section presents three examples of more sophisticated stochastic models that have been developed: (1) the hierarchical mixture of experts (HME) model, (2) the hidden Markov model (HMM), and (3) the dynamic probabilistic network (DPN). Like the naive model from figure 20, these models can be applied to a wide number of problems without performing the kind of detailed modeling of causal connections that we performed in the diabetes example from figure 19.

The Hierarchical Mixture of Experts  The HME model (Jordan and Jacobs 1994) is intended for supervised learning in situations where one believes the training data are being generated by a mixture of separate experts. For example, in a speech-recognition system, we might face the task of distinguishing the spoken words bee, tree, gate, and mate. This task naturally decomposes into two hard subproblems: (1) distinguishing bee from tree and (2) distinguishing gate from mate.

Figure 22 shows the probabilistic network for a simple mixture-of-experts model for this case. According to this model, a training example (xi, yi) is generated by first generating the data point xi, then choosing an expert ei stochastically (depending on the value of xi), and then choosing the class yi depending on the values of xi and ei.

There are three things to note about this model. First, the direction of causality is reversed from the naive Bayes network of figure 20. Second, if we assume that all the features of each xi will always be observed, we do not need to model the probability distribution P(X) on the bottom node in the graph. Third, unless we make some strong assumptions, the probability distribution P(Y|X, E) is going to be extremely complex because it must specify the probability of the classes as a function of every possible combination of features for X and expert E.

In their development of this general model, Jordan and Jacobs assume that each probability distribution has a simple form (see figure 23, which shows the model as a kind of neural-network classifier). Each value of the random variable E specifies a different expert. The input features x are fed into each of the experts and into a gating network. The output of each expert is a probability distribution over the possible classes. The output of the gate is a probability distribution over the experts. This overall model has the following analytic form:

P(y|x) = ∑e ge(x) pe(y|x),

where the index e varies over the different experts. The value ge(x) is the output of the gating network for expert e. The value pe(y|x) is the probability distribution over the various classes output by expert e.

Jordan and Jacobs have investigated networks where the individual experts and the gating network have simple forms. In a two-class problem, each expert has the form

pe(y|x) = σ(weT x),

where weT is the transpose of a vector of parameters we, x is the vector of input feature values, and σ is the usual logistic sigmoid function σ(a) = 1/(1 + exp(–a)). The gating network was described earlier in Ensembles of Classifiers. The gating values are computed according to

ze = veT x
ge = exp(ze) / ∑u exp(zu).

In other words, ze is the dot product of a weight vector ve and the input features x. The output ge is the soft-max of the ze values. The ge values are all positive and sum to 1. This is known as the multinomial logit model in statistics.

The problem of learning the parameters for
a mixture-of-experts model is similar to the can be the same for all states: P(o|s). In this
problem of unsupervised learning, except that here, the hidden variable is not the class yi of each training example i but, rather, the expert ei that was responsible for generating the training example. If we knew which expert generated each training example, then we could fit the parameters directly from the training data, as we did for the naive Bayes algorithm. Jordan and Jacobs (1994) have applied both gradient descent and the EM algorithm to solve this learning problem. For the particular choice of sigmoid and soft-max functions for the experts and gates, the EM algorithm has a particularly efficient implementation as a sequence of weighted least-squares problems.

The simple one-level hierarchy of experts shown in figure 22 can be extended to deeper hierarchies. Figure 24 shows a three-level hierarchy. All the fitting algorithms apply to this more general case as well.

Figure 22. Probabilistic Model Describing a Mixture of Experts.

The Hidden Markov Model
Figure 25 shows the probabilistic network of a hidden Markov model (HMM) (Rabiner 1989). An HMM generates examples that are strings of length n over some alphabet A of letters. Hence, each example x is a string of letters: o1o2…on (where the o stands for observable). To generate a string, the HMM begins by generating an initial state s1 according to the probability distribution P(s1). From this state, it then generates the first letter of the string according to the distribution P(o1|s1). It then generates the next state s2 according to P(s2|s1). Then, it generates the second letter according to P(o2|s2), and so on.

In some applications, the transition probability distribution is the same for all pairs of adjacent state variables: P(st|st–1). Similarly, the output (or emission) probability distribution is the same for all states: P(ot|st). In this variation, the HMM is equivalent to a stochastic finite-state automaton (FSA). At each time step, the FSA makes a probabilistic state transition and then generates an output letter.

Another common variation on HMMs is to include an absorbing state, or halting state, as one of the values of the state variable. If the HMM makes a transition into this state, it terminates the string being generated, permitting HMMs to model strings of variable length.

HMMs have been applied widely in speech recognition, where the alphabet of letters consists of frames of the speech signal (Rabiner 1989). Each word in the language can be modeled as an HMM. Given a new spoken word, a speech-recognition system computes the likelihood that each of the word HMMs generated this spoken word. The recognizer then predicts the most likely word. A similar analysis can be applied at the level of whole sentences by concatenating word-level HMMs. Then, the goal is to find the sentence most likely to have generated the speech signal.

To learn an HMM, a set of training examples is provided, where each example is a string. The sequence of states that generated the string is hidden. Once again, however, the EM algorithm can be applied. In the context of HMMs, it is known as the Baum-Welch, or forward-backward, algorithm. In the E-step, the algorithm augments each training example with statistics describing the hypothesized string of states that generated the example. In the M-step, the parameters of the probability distributions are reestimated from the augmented training data.

Stolcke and Omohundro (1994) developed algorithms for learning the structure and parameters of HMMs from training examples. They applied their techniques to several problems in speech recognition and incorporated their algorithm into a speaker-independent speech-recognition system.
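The generative procedure and the recognition computation just described can be sketched in a few lines. This is a minimal illustration, not code from the article; it assumes the FSA variant above, in which one transition table and one emission table are shared across all time steps:

```python
import random

class HMM:
    """Discrete HMM with shared transition and emission tables
    (the stochastic finite-state-automaton variant)."""

    def __init__(self, init, trans, emit):
        self.init = init    # P(s1): state -> probability
        self.trans = trans  # P(s_t | s_{t-1}): state -> {state: probability}
        self.emit = emit    # P(o_t | s_t): state -> {letter: probability}

    def sample(self, n):
        """Generate a string of length n by alternating emissions and transitions."""
        s = _draw(self.init)
        letters = []
        for _ in range(n):
            letters.append(_draw(self.emit[s]))
            s = _draw(self.trans[s])
        return "".join(letters)

    def likelihood(self, string):
        """Forward algorithm: P(string | model), summed over all hidden state paths."""
        alpha = {s: self.init[s] * self.emit[s].get(string[0], 0.0)
                 for s in self.init}
        for o in string[1:]:
            alpha = {s2: sum(alpha[s1] * self.trans[s1][s2] for s1 in alpha)
                         * self.emit[s2].get(o, 0.0)
                     for s2 in self.init}
        return sum(alpha.values())

def _draw(dist):
    """Sample a key from a {value: probability} table."""
    r, acc = random.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r < acc:
            return value
    return value  # guard against floating-point round-off
```

A word recognizer in the style described above would build one such model per word and predict the word whose `likelihood` of the observed frame sequence is highest.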
Figure 23. The Mixture-of-Experts Model Viewed as a Specialized Neural Network.

The Dynamic Probabilistic Network
In the HMM, the number of values for the state variable at each point in time can become large. Consequently, the number of parameters in the state-transition probability distribution P(st|st–1) can become intractably large. One solution is to represent the causal structure within each state by a stochastic model. For example, consider a mobile robot with a steerable television camera. The images observed by the camera will be the output alphabet of the HMM. The hidden state will consist of the location of the robot within a room and the direction the camera is pointing. Suppose there are 100 possible locations and 15 possible camera directions.
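A rough count of table entries shows why this example motivates a factored representation. The particular factorization below (next location depends on the previous location and camera direction; next direction depends only on the previous direction) is a hypothetical choice for illustration, not a decomposition taken from the article:

```python
# Monolithic HMM: the hidden state is (location, direction) as one variable.
locations, directions = 100, 15
states = locations * directions        # 1,500 joint state values
monolithic = states * states           # entries in the full table P(s_t | s_{t-1})

# One illustrative factored model (an assumption for this sketch):
#   P(loc_t | loc_{t-1}, dir_{t-1}) and P(dir_t | dir_{t-1})
factored = (locations * directions) * locations + directions * directions

print(monolithic)  # 2250000 entries
print(factored)    # 150225 entries
```

Even this simple two-variable factorization cuts the transition table from millions of entries to roughly 150,000, which is the kind of saving the dynamic-probabilistic-network representation is after.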
128 AI MAGAZINE
Articles
Concluding Remarks
References

Abu-Mostafa, Y. 1990. Learning from Hints in Neural Networks. Journal of Complexity 6:192–198.
Ali, K. M., and Pazzani, M. J. 1996. Error Reduction through Learning Multiple Descriptions. Machine Learning 24(3): 173–202.
Barto, A. G.; Bradtke, S. J.; and Singh, S. P. 1995. Learning to Act Using Real-Time Dynamic Programming. Artificial Intelligence 72:81–138.
Barto, A. G., and Sutton, R. 1997. Introduction to Reinforcement Learning. Cambridge, Mass.: MIT Press.
Bellman, R. E. 1957. Dynamic Programming. Princeton, N.J.: Princeton University Press.
Bertsekas, D. P., and Tsitsiklis, J. N. 1996. Neuro-Dynamic Programming. Belmont, Mass.: Athena Scientific.
Blum, A., and Rivest, R. L. 1988. Training a 3-Node Neural Network Is NP-Complete (Extended Abstract). In Proceedings of the 1988 Workshop on Computational Learning Theory, 9–18. San Francisco, Calif.: Morgan Kaufmann.
Blum, A. 1997. Empirical Support for WINNOW and Weighted-Majority Algorithms: Results on a Calendar Scheduling Domain. Machine Learning 26(1): 5–24.
Breiman, L. 1996a. Bagging Predictors. Machine Learning 24(2): 123–140.
Breiman, L. 1996b. Stacked Regressions. Machine Learning 24:49–64.
Buntine, W. 1996. A Guide to the Literature on Learning Probabilistic Networks from Data. IEEE Transactions on Knowledge and Data Engineering 8:195–210.
Buntine, W. 1994. Operations for Learning with Graphical Models. Journal of Artificial Intelligence Research 2:159–225.
Buntine, W. L. 1990. A Theory of Learning Classification Rules. Ph.D. thesis, School of Computing Science, University of Technology.
Caruana, R. 1996. Algorithms and Applications for Multitask Learning. In Proceedings of the Thirteenth International Conference on Machine Learning, ed. L. Saitta, 87–95. San Francisco, Calif.: Morgan Kaufmann.
Cassandra, A. R.; Kaelbling, L. P.; and Littman, M. L. 1994. Acting Optimally in Partially Observable Stochastic Domains. In Proceedings of the Twelfth National Conference on Artificial Intelligence, 1023–1028. Menlo Park, Calif.: American Association for Artificial Intelligence.
Castillo, E.; Gutierrez, J. M.; and Hadi, A. 1997. Expert Systems and Probabilistic Network Models. New York: Springer-Verlag.
Catlett, J. 1991. On Changing Continuous Attributes into Ordered Discrete Attributes. In Proceedings of the European Working Session on Learning, ed. Y. Kodratoff, 164–178. Berlin: Springer-Verlag.
Chan, P. K., and Stolfo, S. J. 1995. Learning Arbiter and Combiner Trees from Partitioned Data for Scaling Machine Learning. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, 39–44. Menlo Park, Calif.: AAAI Press.
Cheeseman, P.; Self, M.; Kelly, J.; Taylor, W.; Freeman, D.; and Stutz, J. 1988. Bayesian Classification. In Proceedings of the Seventh National Conference on Artificial Intelligence, 607–611. Menlo Park, Calif.: American Association for Artificial Intelligence.
Cherkauer, K. J. 1996. Human Expert-Level Performance on a Scientific Image Analysis Task by a System Using Combined Artificial Neural Networks. In Working Notes of the AAAI Workshop on Integrating Multiple Learned Models, ed. P. Chan, 15–21. Available at www.cs.fit.edu/imlm/.
Chipman, H.; George, E.; and McCulloch, R. 1996. Bayesian CART, Department of Statistics, University of Chicago. Available at gsbrem.uchicago.edu/Papers/cart.ps.
Chow, C., and Liu, C. 1968. Approximating Discrete Probability Distributions with Dependence Trees. IEEE Transactions on Information Theory 14:462–467.
Clemen, R. T. 1989. Combining Forecasts: A Review and Annotated Bibliography. International Journal of Forecasting 5:559–583.
Cohen, W. W. 1995. Fast Effective Rule Induction. In Proceedings of the Twelfth International Conference on Machine Learning, 115–123. San Francisco, Calif.: Morgan Kaufmann.
Cooper, G. F., and Herskovits, E. 1992. A Bayesian Method for the Induction of Probabilistic Networks from Data. Machine Learning 9:309–347.
Craven, M. W., and Shavlik, J. W. 1996. Extracting Tree-Structured Representations from Trained Networks. In Advances in Neural Information Processing Systems 8, eds. D. S. Touretzky, M. C. Mozer, and M. Hasselmo, 24–30. Cambridge, Mass.: MIT Press.
Crites, R. H., and Barto, A. G. 1995. Improving Elevator Performance Using Reinforcement Learning. In Advances in Neural Information Processing Systems 8, 1017–1023. Cambridge, Mass.: MIT Press.
D'Ambrosio, B. 1993. Incremental Probabilistic Inference. In Proceedings of the Ninth Annual Conference on Uncertainty in Artificial Intelligence, eds. D. Heckerman and A. Mamdani, 301–308. San Francisco, Calif.: Morgan Kaufmann.
Dayan, P., and Hinton, G. 1993. Feudal Reinforcement Learning. In Advances in Neural Information Processing Systems 5, 271–278. San Francisco, Calif.: Morgan Kaufmann.
Dean, T., and Kanazawa, K. 1989. A Model for Reasoning about Persistence and Causation. Computational Intelligence 5(3): 142–150.
Dempster, A. P.; Laird, N. M.; and Rubin, D. B. 1976. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society B 39:1–38.
Dietterich, T. G., and Bakiri, G. 1995. Solving Multiclass Learning Problems via Error-Correcting Output Codes. Journal of Artificial Intelligence Research 2:263–286.
Dietterich, T. G., and Kong, E. B. 1995. Machine Learning Bias, Statistical Bias, and Statistical Variance of Decision Tree Algorithms, Department of Computer Science, Oregon State University. Available at ftp.cs.orst.edu/pub/tgd/papers/tr-bias.ps.gz.
Domingos, P., and Pazzani, M. 1996. Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier. In Proceedings of the Thirteenth International Conference on Machine Learning, ed. L. Saitta, 105–112. San Francisco, Calif.: Morgan Kaufmann.
Duda, R. O., and Hart, P. E. 1973. Pattern Classification and Scene Analysis. New York: Wiley.
Fayyad, U. M., and Irani, K. B. 1993. Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, 1022–1027. Menlo Park, Calif.: International Joint Conferences on Artificial Intelligence.
Flexner, S. B. 1983. Random House Unabridged Dictionary, 2d ed. New York: Random House.
Freund, Y., and Schapire, R. E. 1995. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. In Proceedings of the Second European Conference on Computational Learning Theory, 23–37. Berlin: Springer-Verlag.
Freund, Y., and Schapire, R. E. 1996. Experiments with a New Boosting Algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, ed. L. Saitta, 148–156. San Francisco, Calif.: Morgan Kaufmann.
Friedman, N., and Goldszmidt, M. 1996. Building Classifiers Using Bayesian Networks. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, 1277–1284. Menlo Park, Calif.: American Association for Artificial Intelligence.
Furnkranz, J., and Widmer, G. 1994. Incremental Reduced Error Pruning. In Proceedings of the Eleventh International Conference on Machine Learning, 70–77. San Francisco, Calif.: Morgan Kaufmann.
Geman, S., and Geman, D. 1984. Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6:721–741.
Ghahramani, Z., and Jordan, M. I. 1996. Factorial Hidden Markov Models. In Advances in Neural Information Processing Systems 8, eds. D. S. Touretzky, M. C. Mozer, and M. Hasselmo, 472–478. Cambridge, Mass.: MIT Press.
Gilks, W.; Thomas, A.; and Spiegelhalter, D. 1993. A Language and Program for Complex Bayesian Modelling. The Statistician 43:169–178.
Golding, A. R., and Roth, D. 1996. Applying WINNOW to Context-Sensitive Spelling Correction. In Proceedings of the Thirteenth International Conference on Machine Learning, ed. L. Saitta, 182–190. San Francisco, Calif.: Morgan Kaufmann.
Gordon, G. J. 1995. Stable Function Approximation in Dynamic Programming. In Proceedings of the Twelfth International Conference on Machine Learning, 261–268. San Francisco, Calif.: Morgan Kaufmann.
Hansen, L., and Salamon, P. 1990. Neural Network Ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence 12:993–1001.
Hashem, S. 1993. Optimal Linear Combinations of Neural Networks. Ph.D. thesis, School of Industrial Engineering, Purdue University.
Heckerman, D. 1996. A Tutorial on Learning with Bayesian Networks, MSR-TR-95-06, Advanced Technology Division, Microsoft Research, Redmond, Washington.
Heckerman, D.; Geiger, D.; and Chickering, D. M. 1995. Learning Bayesian Networks: The Combination of Knowledge and Statistical Data. Machine Learning 20:197–243.
Hyafil, L., and Rivest, R. L. 1976. Constructing Optimal Binary Decision Trees Is NP-Complete. Information Processing Letters 5(1): 15–17.
Jensen, F. V.; Lauritzen, S. L.; and Olesen, K. G. 1990. Bayesian Updating in Recursive Graphical Models by Local Computations. Computational Statistical Quarterly 4:269–282.
Jensen, F. 1996. An Introduction to Bayesian Networks. New York: Springer.
John, G.; Kohavi, R.; and Pfleger, K. 1994. Irrelevant Features and the Subset Selection Problem. In Proceedings of the Eleventh International Conference on Machine Learning, 121–129. San Francisco, Calif.: Morgan Kaufmann.
Jordan, M. I., and Jacobs, R. A. 1994. Hierarchical Mixtures of Experts and the EM Algorithm. Neural Computation 6(2): 181–214.
Kaelbling, L. P. 1993. Hierarchical Reinforcement Learning: Preliminary Results. In Proceedings of the Tenth International Conference on Machine Learning, 167–173. San Francisco, Calif.: Morgan Kaufmann.
Kaelbling, L. P.; Littman, M. L.; and Moore, A. W. 1996. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research 4:237–285.
Kanazawa, K.; Koller, D.; and Russell, S. 1995. Stochastic Simulation Algorithms for Dynamic Probabilistic Networks. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, 346–351. San Francisco, Calif.: Morgan Kaufmann.
Kira, K., and Rendell, L. A. 1992. A Practical Approach to Feature Selection. In Proceedings of the Ninth International Conference on Machine Learning, 249–256. San Francisco, Calif.: Morgan Kaufmann.
Kohavi, R., and Kunz, C. 1997. Option Decision Trees with Majority Votes. In Proceedings of the Fourteenth International Conference on Machine Learning, 161–169. San Francisco, Calif.: Morgan Kaufmann.
Kohavi, R., and Sahami, M. 1996. Error-Based and Entropy-Based Discretizing of Continuous Features. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 114–119. San Francisco, Calif.: Morgan Kaufmann.
Kolen, J. F., and Pollack, J. B. 1991. Back Propagation Is Sensitive to Initial Conditions. In Advances in Neural Information Processing Systems 3, 860–867. San Francisco, Calif.: Morgan Kaufmann.
Kong, E. B., and Dietterich, T. G. 1995. Error-Correcting Output Coding Corrects Bias and Variance. In Proceedings of the Twelfth International Conference on Machine Learning, eds. A. Prieditis and S. Russell, 313–321. San Francisco, Calif.: Morgan Kaufmann.
Kononenko, I. 1994. Estimating Attributes: Analysis and Extensions of RELIEF. In Proceedings of the 1994 European Conference on Machine Learning, 171–182. Amsterdam: Springer-Verlag.
Kononenko, I.; Simec, E.; and Robnik-Sikonja, M. 1997. Overcoming the Myopia of Inductive Learning Algorithms with RELIEF-F. Applied Intelligence. Forthcoming.
Kucera, H., and Francis, W. N. 1967. Computational Analysis of Present-Day American English. Providence, R.I.: Brown University Press.
Kwok, S. W., and Carter, C. 1990. Multiple Decision Trees. In Uncertainty in Artificial Intelligence 4, eds. R. D. Schachter, T. S. Levitt, L. N. Kannal, and J. F. Lemmer, 327–335. Amsterdam: Elsevier Science.
Laird, J. E.; Newell, A.; and Rosenbloom, P. S. 1987. SOAR: An Architecture for General Intelligence. Artificial Intelligence 33(1): 1–64.
Littlestone, N. 1988. Learning Quickly When Irrelevant Attributes Abound: A New Linear-Threshold Algorithm. Machine Learning 2:285–318.
Littman, M. L.; Cassandra, A.; and Kaelbling, L. P. 1995. Learning Policies for Partially Observable Environments: Scaling Up. In Proceedings of the Twelfth International Conference on Machine Learning, 362–370. San Francisco, Calif.: Morgan Kaufmann.
Lowe, D. G. 1995. Similarity Metric Learning for a Variable-Kernel Classifier. Neural Computation 7(1): 72–85.
MacKay, D. 1992. A Practical Bayesian Framework for Backpropagation Networks. Neural Computation 4(3): 448–472.
Mahadevan, S. 1996. Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results. Machine Learning 22:159–195.
Mahadevan, S., and Kaelbling, L. P. 1996. The National Science Foundation Workshop on Reinforcement Learning. AI Magazine 17(4): 89–97.
McCallum, R. A. 1995. Instance-Based Utile Distinctions for Reinforcement Learning with Hidden State. In Proceedings of the Twelfth International Conference on Machine Learning, 387–396. San Francisco, Calif.: Morgan Kaufmann.
Mehta, M.; Agrawal, R.; and Rissanen, J. 1996. SLIQ: A Fast Scalable Classifier for Data Mining. In Lecture Notes in Computer Science, 18–32. New York: Springer-Verlag.
Merz, C. J., and Murphy, P. M. 1996. UCI Repository of Machine Learning Databases. Available at www.ics.uci.edu/mlearn/MLRepository.html.
Miller, A. J. 1990. Subset Selection in Regression. New York: Chapman and Hall.
Minton, S.; Carbonell, J. G.; Knoblock, C. A.; Kuokka, D. R.; Etzioni, O.; and Gil, Y. 1989. Explanation-Based Learning: A Problem-Solving Perspective. Artificial Intelligence 40:63–118.
Moore, A. W., and Lee, M. S. 1994. Efficient Algorithms for Minimizing Cross-Validation Error. In Proceedings of the Eleventh International Conference on Machine Learning, eds. W. Cohen and H. Hirsh, 190–198. San Francisco, Calif.: Morgan Kaufmann.
Munro, P., and Parmanto, B. 1997. Competition among Networks Improves Committee Performance. In Advances in Neural Information Processing Systems 9, 592–598. Cambridge, Mass.: MIT Press.
Musick, R.; Catlett, J.; and Russell, S. 1992. Decision-Theoretic Subsampling for Induction on Large Databases. In Proceedings of the Tenth International Conference on Machine Learning, ed. P. E. Utgoff, 212–219. San Francisco, Calif.: Morgan Kaufmann.
Neal, R. 1993. Probabilistic Inference Using Markov Chain Monte Carlo Methods, CRG-TR-93-1, Department of Computer Science, University of Toronto.
Ok, D., and Tadepalli, P. 1996. Auto-Exploratory Average Reward Reinforcement Learning. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, 881–887. Menlo Park, Calif.: American Association for Artificial Intelligence.
Opitz, D. W., and Shavlik, J. W. 1996. Generating Accurate and Diverse Members of a Neural-Network Ensemble. In Advances in Neural Information Processing Systems 8, eds. D. S. Touretzky, M. C. Mozer, and M. Hasselmo, 535–541. Cambridge, Mass.: MIT Press.
Parmanto, B.; Munro, P. W.; and Doyle, H. R. 1996. Improving Committee Diagnosis with Resampling Techniques. In Advances in Neural Information Processing Systems 8, eds. D. S. Touretzky, M. C. Mozer, and M. Hasselmo, 882–888. Cambridge, Mass.: MIT Press.
Parmanto, B.; Munro, P. W.; Doyle, H. R.; Doria, C.; Aldrighetti, L.; Marino, I. R.; Mitchel, S.; and Fung, J. J. 1994. Neural Network Classifier for Hepatoma Detection. In Proceedings of the World Congress on Neural Networks 1, 285–290. Hillsdale, N.J.: Lawrence Erlbaum.
Parr, R., and Russell, S. 1995. Approximating Optimal Policies for Partially Observable Stochastic Domains. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, 1088–1094. Menlo Park, Calif.: International Joint Conferences on Artificial Intelligence.
Perrone, M. P., and Cooper, L. N. 1993. When Networks Disagree: Ensemble Methods for Hybrid Neural Networks. In Neural Networks for Speech and Image Processing, ed. R. J. Mammone. New York: Chapman and Hall.
Press, W. H.; Flannery, B. P.; Teukolsky, S. A.; and Vetterling, W. T. 1992. Numerical Recipes in C: The Art of Scientific Computing, 2d ed. Cambridge, U.K.: Cambridge University Press.
Quinlan, J. R. 1996. Bagging, Boosting, and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, 725–730. Menlo Park, Calif.: American Association for Artificial Intelligence.
Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. San Francisco, Calif.: Morgan Kaufmann.
Quinlan, J. R. 1990. Learning Logical Definitions from Relations. Machine Learning 5(3): 239–266.
Quinlan, J. R., and Rivest, R. L. 1989. Inferring Decision Trees Using the Minimum Description Length Principle. Information and Computation 80:227–248.
Rabiner, L. R. 1989. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE 77(2): 257–286.
Raviv, Y., and Intrator, N. 1996. Bootstrapping with Noise: An Effective Regularization Technique. Connection Science 8(3–4): 355–372.
Revow, M.; Williams, C. K. I.; and Hinton, G. E. 1996. Using Generative Models for Handwritten Digit Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(6): 592–606.
Ricci, F., and Aha, D. W. 1997. Extending Local Learners with Error-Correcting Output Codes, Naval Center for Applied Research in Artificial Intelligence, Washington, D.C.
Rosen, B. E. 1996. Ensemble Learning Using Decorrelated Neural Networks. Connection Science 8(3–4): 373–384.
Russell, S.; Binder, J.; Koller, D.; and Kanazawa, K. 1995. Local Learning in Probabilistic Networks with Hidden Variables. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, 1146–1152. Menlo Park, Calif.: International Joint Conferences on Artificial Intelligence.
Samuel, A. L. 1959. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development 3:211–229.
Schapire, R. E. 1997. Using Output Codes to Boost Multiclass Learning Problems. In Proceedings of the Fourteenth International Conference on Machine Learning, 313–321. San Francisco, Calif.: Morgan Kaufmann.
Schwartz, A. 1993. A Reinforcement Learning Method for Maximizing Undiscounted Rewards. In Proceedings of the Tenth International Conference on Machine Learning, ed. P. Utgoff, 298–305. San Francisco, Calif.: Morgan Kaufmann.
Shafer, J.; Agrawal, R.; and Mehta, M. 1996. SPRINT: A Scalable Parallel Classifier for Data Mining. In Proceedings of the Twenty-Second VLDB Conference, 544–555. San Francisco, Calif.: Morgan Kaufmann.
Singh, S. P. 1992. Transfer of Learning by Composing Solutions to Elemental Sequential Tasks. Machine Learning 8(3): 323–340.
Singh, S., and Bertsekas, D. 1997. Reinforcement Learning for Dynamic Channel Allocation in Cellular Telephone Systems. In Advances in Neural Information Processing Systems 10, 974–980. Cambridge, Mass.: MIT Press.
Smyth, P.; Heckerman, D.; and Jordan, M. I. 1997. Probabilistic Independence Networks for Hidden Markov Probability Models. Neural Computation 9(2): 227–270.
Spiegelhalter, D.; Dawid, A.; Lauritzen, S.; and Cowell, R. 1993. Bayesian Analysis in Expert Systems. Statistical Science 8:219–282.
Spirtes, P.; Glymour, C.; and Scheines, R. 1993. Causation, Prediction, and Search. New York: Springer-Verlag. Also available online at hss.cmu.edu/html/departments/philosophy/TETRAD.BOOK/book.html.
Spirtes, P., and Meek, C. 1995. Learning Bayesian Networks with Discrete Variables from Data. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, 294–299. San Francisco, Calif.: Morgan Kaufmann.
Stolcke, A., and Omohundro, S. M. 1994. Best-First Model Merging for Hidden Markov Model Induction, TR-94-003, International Computer Science Institute, Berkeley, California.
Sutton, R. S. 1988. Learning to Predict by the Methods of Temporal Differences. Machine Learning 3(1): 9–44.
Tesauro, G. 1995. Temporal Difference Learning and TD-GAMMON. Communications of the ACM 38(3): 58–68.
Tesauro, G. 1992. Practical Issues in Temporal Difference Learning. Machine Learning 8:257–278.
Tsitsiklis, J. N., and Van Roy, B. 1996. An Analysis of Temporal-Difference Learning with Function Approximation, LIDS-P-2322, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology.
Tumer, K., and Ghosh, J. 1996. Error Correlation and Error Reduction in Ensemble Classifiers. Connection Science 8(3–4): 385–404.
Verma, T., and Pearl, J. 1990. Equivalence and Synthesis of Causal Models. In Proceedings of the Sixth Conference on Uncertainty in Artificial Intelligence, 220–227. San Francisco, Calif.: Morgan Kaufmann.
Watkins, C. J. C. H. 1989. Learning from Delayed Rewards. Ph.D. thesis, King's College, Cambridge.
Watkins, C. J., and Dayan, P. 1992. Technical Note: Q-LEARNING. Machine Learning 8:279–292.
Wettschereck, D.; Aha, D. W.; and Mohri, T. 1997. A Review and Empirical Evaluation of Feature Weighting Methods for a Class of Lazy Learning Algorithms. Artificial Intelligence Review 10:1–37.
Wettschereck, D., and Dietterich, T. G. 1995. An Experimental Comparison of the Nearest-Neighbor and Nearest-Hyperrectangle Algorithms. Machine Learning 19:5–27.
Wolpert, D. 1992. Stacked Generalization. Neural Networks 5(2): 241–260.
Zhang, W., and Dietterich, T. G. 1995. A Reinforcement Learning Approach to Job-Shop Scheduling. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, 1114–1120. Menlo Park, Calif.: International Joint Conferences on Artificial Intelligence.
Zhang, X.; Mesirov, J. P.; and Waltz, D. L. 1992. Hybrid System for Protein Secondary Structure Prediction. Journal of Molecular Biology 225:1049–1063.
Zweben, M.; Daun, B.; and Deale, M. 1994. Scheduling and Rescheduling with Iterative Repair. In Intelligent Scheduling, eds. M. Zweben and M. Fox, 241–255. San Francisco, Calif.: Morgan Kaufmann.

Thomas G. Dietterich (Ph.D., Stanford University) has been a faculty member in computer science at Oregon State University since 1985. He also held a position as senior scientist at Arris Pharmaceutical Corporation from 1991 to 1993, where he applied machine-learning methods to drug design problems. He is editor (with Jude Shavlik) of Readings in Machine Learning and executive editor of the journal Machine Learning. His e-mail address is tgd@cs.orst.edu.