A Decision-Theoretic Generalization of On-Line Learning
120 FREUND AND SCHAPIRE
handle this problem, and we prove a number of bounds on the net loss. For instance, one of our results shows that the net loss of our algorithm can be bounded by $O(\sqrt{T \ln N})$ or, put another way, that the average per trial net loss is decreasing at the rate $O(\sqrt{(\ln N)/T})$. Thus, as $T$ increases, this difference decreases to zero.

Our results for the on-line allocation model can be applied to a wide variety of learning problems, as we describe in Section 3. In particular, we generalize the results of Littlestone and Warmuth [20] and Cesa-Bianchi et al. [4] for the problem of predicting a binary sequence using the advice of a team of "experts." Whereas these authors proved worst-case bounds for making on-line randomized decisions over a binary decision and outcome space with a $\{0, 1\}$-valued discrete loss, we prove (slightly weaker) bounds that are applicable to any bounded loss function over any decision and outcome spaces. Our bounds express explicitly the rate at which the loss of the learning algorithm approaches that of the best expert.

Related generalizations of the expert prediction model were studied by Vovk [25], Kivinen and Warmuth [19], and Haussler et al. [15]. Like us, these authors focused primarily on multiplicative weight-update algorithms. Chung [5] also presented a generalization, giving the problem a game-theoretic treatment.

Boosting

Returning to the horse-racing story, suppose now that the gambler grows weary of choosing among the experts and instead wishes to create a computer program that will accurately predict the winner of a horse race based on the usual information (number of races recently won by each horse, betting odds for each horse, etc.). To create such a program, he asks his favorite expert to explain his betting strategy. Not surprisingly, the expert is unable to articulate a grand set of rules for selecting a horse. On the other hand, when presented with the data for a specific set of races, the expert has no trouble coming up with a "rule-of-thumb" for that set of races (such as, "Bet on the horse that has recently won the most races" or "Bet on the horse with the most favored odds"). Although such a rule-of-thumb, by itself, is obviously very rough and inaccurate, it is not unreasonable to expect it to provide predictions that are at least a little bit better than random guessing. Furthermore, by repeatedly asking the expert's opinion on different collections of races, the gambler is able to extract many rules-of-thumb.

In order to use these rules-of-thumb to maximum advantage, there are two problems faced by the gambler: First, how should he choose the collections of races presented to the expert so as to extract rules-of-thumb from the expert that will be the most useful? Second, once he has collected many rules-of-thumb, how can they be combined into a single, highly accurate prediction rule?

Boosting refers to this general problem of producing a very accurate prediction rule by combining rough and moderately inaccurate rules-of-thumb. In the second part of the paper, we present and analyze a new boosting algorithm inspired by the methods we used for solving the on-line allocation problem.

Formally, boosting proceeds as follows: The booster is provided with a set of labelled training examples $(x_1, y_1), \ldots, (x_N, y_N)$, where $y_i$ is the label associated with instance $x_i$; for instance, in the horse-racing example, $x_i$ might be the observable data associated with a particular horse race, and $y_i$ the outcome (winning horse) of that race. On each round $t = 1, \ldots, T$, the booster devises a distribution $D_t$ over the set of examples, and requests (from an unspecified oracle) a weak hypothesis (or rule-of-thumb) $h_t$ with low error $\epsilon_t$ with respect to $D_t$ (that is, $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \ne y_i]$). Thus, distribution $D_t$ specifies the relative importance of each example for the current round. After $T$ rounds, the booster must combine the weak hypotheses into a single prediction rule.

Unlike the previous boosting algorithms of Freund [10, 11] and Schapire [22], the new algorithm needs no prior knowledge of the accuracies of the weak hypotheses. Rather, it adapts to these accuracies and generates a weighted majority hypothesis in which the weight of each weak hypothesis is a function of its accuracy. For binary prediction problems, we prove in Section 4 that the error of this final hypothesis (with respect to the given set of examples) is bounded by $\exp(-2 \sum_{t=1}^T \gamma_t^2)$ where $\epsilon_t = 1/2 - \gamma_t$ is the error of the $t$th weak hypothesis. Since a hypothesis that makes entirely random guesses has error 1/2, $\gamma_t$ measures the accuracy of the $t$th weak hypothesis relative to random guessing. Thus, this bound shows that if we can consistently find weak hypotheses that are slightly better than random guessing, then the error of the final hypothesis drops exponentially fast.

Note that the bound on the accuracy of the final hypothesis improves when any of the weak hypotheses is improved. This is in contrast with previous boosting algorithms whose performance bound depended only on the accuracy of the least accurate weak hypothesis. At the same time, if the weak hypotheses all have the same accuracy, the performance of the new algorithm is very close to that achieved by the best of the known boosting algorithms.

In Section 5, we give two extensions of our boosting algorithm to multi-class prediction problems in which each example belongs to one of several possible classes (rather than just two). We also give an extension to regression problems in which the goal is to estimate a real-valued function.

2. THE ON-LINE ALLOCATION ALGORITHM AND ITS ANALYSIS

In this section, we present our algorithm, called Hedge($\beta$), for the on-line allocation problem. The algorithm
and its analysis are direct generalizations of Littlestone and Warmuth's weighted majority algorithm [20].

The pseudo-code for Hedge($\beta$) is shown in Fig. 1. The algorithm maintains a weight vector whose value at time $t$ is denoted $w^t = (w_1^t, \ldots, w_N^t)$. At all times, all weights will be nonnegative. All of the weights of the initial weight vector $w^1$ must be nonnegative and sum to one, so that $\sum_{i=1}^N w_i^1 = 1$. Besides these conditions, the initial weight vector may be arbitrary, and may be viewed as a "prior" over the set of strategies. Since our bounds are strongest for those strategies receiving the greatest initial weight, we will want to choose the initial weights so as to give the most weight to those strategies which we expect are most likely to perform the best. Naturally, if we have no reason to favor any of the strategies, the initial weights can be set equally, so that $w_i^1 = 1/N$.

At time $t$, the allocation is obtained by normalizing the current weight vector:

\[ p^t = \frac{w^t}{\sum_{i=1}^N w_i^t}. \tag{1} \]

After the loss vector $\ell^t$ has been received, the weight vector $w^t$ is updated using the multiplicative rule

\[ w_i^{t+1} = w_i^t \cdot \beta^{\ell_i^t}. \tag{2} \]

More generally, it can be shown that our analysis is applicable with only minor modification to an alternative update rule of the form $w_i^{t+1} = w_i^t \cdot U_\beta(\ell_i^t)$, where $U_\beta$ satisfies

\[ \beta^r \le U_\beta(r) \le 1 - (1 - \beta) r \]

for all $r \in [0, 1]$.

Algorithm Hedge($\beta$)
Parameters: $\beta \in [0, 1]$
            initial weight vector $w^1 \in [0, 1]^N$ with $\sum_{i=1}^N w_i^1 = 1$
            number of trials $T$
Do for $t = 1, 2, \ldots, T$
  1. Choose allocation $p^t = w^t / \sum_{i=1}^N w_i^t$.
  2. Receive loss vector $\ell^t \in [0, 1]^N$ from environment.
  3. Suffer loss $p^t \cdot \ell^t$.
  4. Set the new weights vector to be $w_i^{t+1} = w_i^t \beta^{\ell_i^t}$.

2.1. Analysis

The analysis of Hedge($\beta$) mimics directly that given by Littlestone and Warmuth [20]. The main idea is to derive upper and lower bounds on $\sum_{i=1}^N w_i^{T+1}$ which, together, imply an upper bound on the loss of the algorithm. We begin with an upper bound.

Lemma 1. For any sequence of loss vectors $\ell^1, \ldots, \ell^T$,

\[ \ln\left( \sum_{i=1}^N w_i^{T+1} \right) \le -(1 - \beta) L_{\mathrm{Hedge}(\beta)}. \]

Proof. By convexity, $\alpha^r \le 1 - (1 - \alpha) r$ for $\alpha \ge 0$ and $r \in [0, 1]$. Combined with Eqs. (1) and (2), this implies

\[ \sum_{i=1}^N w_i^{t+1} = \sum_{i=1}^N w_i^t \beta^{\ell_i^t} \le \sum_{i=1}^N w_i^t (1 - (1 - \beta) \ell_i^t) = \left( \sum_{i=1}^N w_i^t \right) (1 - (1 - \beta) p^t \cdot \ell^t). \tag{4} \]

Applying this repeatedly for $t = 1, \ldots, T$, and using $\sum_{i=1}^N w_i^1 = 1$, yields

\[ \sum_{i=1}^N w_i^{T+1} \le \prod_{t=1}^T (1 - (1 - \beta) p^t \cdot \ell^t) \le \exp\left( -(1 - \beta) \sum_{t=1}^T p^t \cdot \ell^t \right) \]

since $1 + x \le e^x$ for all $x$. The lemma follows immediately. ∎

Thus,

\[ L_{\mathrm{Hedge}(\beta)} \le \frac{-\ln\left( \sum_{i=1}^N w_i^{T+1} \right)}{1 - \beta}. \tag{5} \]

Note that, from Eq. (2),

\[ w_i^{T+1} = w_i^1 \prod_{t=1}^T \beta^{\ell_i^t} = w_i^1 \beta^{L_i}. \tag{6} \]

This is all that is needed to complete our analysis.
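
The steps of Fig. 1 and the bound of Eq. (5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `hedge`, the uniform default prior, and the toy loss vectors are all hypothetical choices made for this example.

```python
import math

def hedge(beta, losses, initial_weights=None):
    """Run Hedge(beta) over a list of loss vectors with entries in [0, 1].

    Returns the cumulative loss sum_t p^t . l^t and the final weights w^{T+1}.
    """
    n = len(losses[0])
    # Assumption for this sketch: default to the uniform prior w_i^1 = 1/N.
    w = list(initial_weights) if initial_weights is not None else [1.0 / n] * n
    total_loss = 0.0
    for loss in losses:
        s = sum(w)
        p = [wi / s for wi in w]                                # step 1: allocation p^t
        total_loss += sum(pi * li for pi, li in zip(p, loss))   # step 3: suffer p^t . l^t
        w = [wi * beta ** li for wi, li in zip(w, loss)]        # step 4: Eq. (2)
    return total_loss, w

# Hypothetical toy data: three strategies, three trials.
losses = [[0.0, 1.0, 0.5], [0.2, 0.9, 0.4], [0.1, 1.0, 0.6]]
beta = 0.5
total, w_final = hedge(beta, losses)

# Eq. (5): L_Hedge <= -ln(sum_i w_i^{T+1}) / (1 - beta)
bound = -math.log(sum(w_final)) / (1.0 - beta)
print(total, bound)
```

The final weights depend only on each strategy's cumulative loss, as in Eq. (6), so the weight vector is the only state kept between trials.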
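Theorem 2. For any sequence of loss vectors $\ell^1, \ldots, \ell^T$, and for any $i \in \{1, \ldots, N\}$, we have

\[ L_{\mathrm{Hedge}(\beta)} \le \frac{-\ln(w_i^1) - L_i \ln \beta}{1 - \beta}. \tag{7} \]

More generally, for any nonempty set $S \subseteq \{1, \ldots, N\}$, we have

\[ L_{\mathrm{Hedge}(\beta)} \le \frac{-\ln\left( \sum_{i \in S} w_i^1 \right) - (\ln \beta) \max_{i \in S} L_i}{1 - \beta}. \tag{8} \]

Proof. Eq. (7) is the special case $S = \{i\}$ of Eq. (8), so it suffices to prove the latter. From Eq. (6),

\[ \sum_{i=1}^N w_i^{T+1} \ge \sum_{i \in S} w_i^{T+1} = \sum_{i \in S} w_i^1 \beta^{L_i} \ge \beta^{\max_{i \in S} L_i} \sum_{i \in S} w_i^1. \]

The theorem now follows immediately from Eq. (5). ∎

The simpler bound (7) states that Hedge($\beta$) does not perform "too much worse" than the best strategy $i$ for the sequence. The difference in loss depends on our choice of $\beta$ and on the initial weight $w_i^1$ of each strategy. If each weight is set equally so that $w_i^1 = 1/N$, then this bound becomes

\[ L_{\mathrm{Hedge}(\beta)} \le \frac{\min_i L_i \ln(1/\beta) + \ln N}{1 - \beta}. \]

Theorem 3. Let $B$ be an algorithm for the on-line allocation problem with an arbitrary number of strategies. Suppose that there exist positive real numbers $a$ and $c$ such that for any number of strategies $N$ and for any sequence of loss vectors $\ell^1, \ldots, \ell^T$

\[ L_B \le c \min_i L_i + a \ln N. \]

[...]

2.2. How to Choose $\beta$

So far, we have analyzed Hedge($\beta$) for a given choice of $\beta$, and we have proved reasonable bounds for any choice of $\beta$. In practice, we will often want to choose $\beta$ so as to maximally exploit any prior knowledge we may have about the specific problem at hand.

The following lemma will be helpful for choosing $\beta$ using the bounds derived above.

Lemma 4. Suppose $0 \le L \le \tilde{L}$ and $0 < R \le \tilde{R}$. Let $\beta = g(\tilde{L}/\tilde{R})$ where $g(z) = 1/(1 + \sqrt{2/z})$. Then

\[ \frac{-L \ln \beta + R}{1 - \beta} \le L + \sqrt{2 \tilde{L} \tilde{R}} + R. \]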
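
The choice of $\beta$ in Lemma 4 can be checked numerically with a short Python sketch. The helper names `g` and `lemma4_sides`, and the particular values of $N$, $L$, and $\tilde{L}$, are hypothetical and chosen only for illustration; the sketch assumes $\tilde{L} > 0$ so that $g$ is well defined.

```python
import math

def g(z):
    """g(z) = 1 / (1 + sqrt(2 / z)) from Lemma 4; requires z > 0."""
    return 1.0 / (1.0 + math.sqrt(2.0 / z))

def lemma4_sides(L, R, L_tilde, R_tilde):
    """Return beta and both sides of the inequality in Lemma 4."""
    beta = g(L_tilde / R_tilde)
    lhs = (-L * math.log(beta) + R) / (1.0 - beta)
    rhs = L + math.sqrt(2.0 * L_tilde * R_tilde) + R
    return beta, lhs, rhs

# Hypothetical setting: N = 10 strategies with uniform initial weights,
# so R = R_tilde = ln N, and an assumed bound L_tilde = 200 on the best
# strategy's loss, whose actual loss is taken to be L = 150.
N = 10
R = R_tilde = math.log(N)
beta, lhs, rhs = lemma4_sides(150.0, R, 200.0, R_tilde)
print(beta, lhs, rhs)
```

A larger assumed bound $\tilde{L}$ pushes $\beta$ toward 1, which makes the weight updates gentler; a small $\tilde{L}$ gives a more aggressive update.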
Since $\tilde{L} \le T$, this gives a worst case rate of convergence of $O(\sqrt{(\ln N)/T})$. However, if $\tilde{L}$ is close to zero, then the rate of convergence will be much faster, roughly, $O((\ln N)/T)$.

Lemma 4 can also be applied to the other bounds given in Theorem 2 to obtain analogous results.

The bound given in Eq. (11) can be improved in special cases in which the loss is a function of a prediction and an outcome and this function is of a special form (see Example 4 below). However, for the general case, one cannot improve the square-root term $\sqrt{2 \tilde{L} \ln N}$, by more than a constant factor. This is a corollary of the lower bound given by Cesa-Bianchi et al. ([4], Theorem 7) who analyze an on-line prediction problem that can be seen as a special case of the on-line allocation model.

3. APPLICATIONS

The framework described up to this point is quite general and can be applied in a wide variety of learning problems. Consider the following set-up used by Chung [5]. We are given a decision space $\Delta$, a space of outcomes $\Omega$, and a bounded loss function $\lambda : \Delta \times \Omega \to [0, 1]$. (Actually, our results require only that $\lambda$ be bounded, but, by rescaling, we can assume that its range is $[0, 1]$.) At every time step $t$, the learning algorithm selects a decision $\delta^t \in \Delta$, receives an outcome $\omega^t \in \Omega$, and suffers loss $\lambda(\delta^t, \omega^t)$. More generally, we may allow the learner to select a distribution $D^t$ over the space of decisions, in which case it suffers the expected loss of a decision randomly selected according to $D^t$; that is, its expected loss is $\Lambda(D^t, \omega^t)$ where

\[ \Lambda(D, \omega) = E_{\delta \sim D}[\lambda(\delta, \omega)]. \]

To decide on distribution $D^t$, we assume that the learner has access to a set of $N$ experts. At every time step $t$, expert $i$ produces its own distribution $E_i^t$ on $\Delta$, and suffers loss $\Lambda(E_i^t, \omega^t)$.

The goal of the learner is to combine the distributions produced by the experts so as to suffer expected loss "not much worse" than that of the best expert.

The results of Section 2 provide a method for solving this problem. Specifically, we run algorithm Hedge($\beta$), treating each expert as a strategy. At every time step, Hedge($\beta$) produces a distribution $p^t$ on the set of experts which is used to construct the mixture distribution

\[ D^t = \sum_{i=1}^N p_i^t E_i^t. \]

For any outcome $\omega^t$, the loss suffered by Hedge($\beta$) will then be

\[ \Lambda(D^t, \omega^t) = \sum_{i=1}^N p_i^t \Lambda(E_i^t, \omega^t). \]

Thus, if we define $\ell_i^t = \Lambda(E_i^t, \omega^t)$ then the loss suffered by the learner is $p^t \cdot \ell^t$, i.e., exactly the mixture loss that was analyzed in Section 2.

Hence, the bounds of Section 2 can be applied to our current framework. For instance, applying Eq. (11), we obtain the following:

Theorem 5. For any loss function $\lambda$, for any set of experts, and for any sequence of outcomes, the expected loss of Hedge($\beta$) if used as described above is at most

\[ \sum_{t=1}^T \Lambda(D^t, \omega^t) \le \min_i \sum_{t=1}^T \Lambda(E_i^t, \omega^t) + \sqrt{2 \tilde{L} \ln N} + \ln N \]

where $\tilde{L} \le T$ is an assumed bound on the expected loss of the best expert, and $\beta = g(\tilde{L} / \ln N)$.

Example 1. In the $k$-ary prediction problem, $\Delta = \Omega = \{1, 2, \ldots, k\}$, and $\lambda(\delta, \omega)$ is 1 if $\delta \ne \omega$ and 0 otherwise. In other words, the problem is to predict a sequence of letters over an alphabet of size $k$. The loss function $\lambda$ is 1 if a mistake was made, and 0 otherwise. Thus, $\Lambda(D, \omega)$ is the probability (with respect to $D$) of a prediction that disagrees with $\omega$. The cumulative loss of the learner, or of any expert, is therefore the expected number of mistakes on the entire sequence. So, in this case, Theorem 2 states that the expected number of mistakes of the learning algorithm will exceed the expected number of mistakes of the best expert by at most $O(\sqrt{T \ln N})$, or possibly much less if the loss of the best expert can be bounded ahead of time.

Bounds of this type were previously proved in the binary case ($k = 2$) by Littlestone and Warmuth [20] using the same algorithm. Their algorithm was later improved by Vovk [25] and Cesa-Bianchi et al. [4]. The main result of this section is a proof that such bounds can be shown to hold for any bounded loss function.

Example 2. The loss function $\lambda$ may represent an arbitrary matrix game, such as "rock, paper, scissors." Here, $\Delta = \Omega = \{R, P, S\}$, and the loss function is defined by the matrix:

                      $\omega$
                  R      P      S
              R   1/2    1      0
  $\delta$    P   0      1/2    1
              S   1      0      1/2

The decision $\delta$ represents the learner's play, and the outcome $\omega$ is the adversary's play; then $\lambda(\delta, \omega)$, the learner's loss, is 1 if the learner loses the round, 0 if it wins the round, and 1/2 if the round is tied. (For instance, $\lambda(S, P) = 0$ since "scissors cut paper.") So the cumulative loss of the learner
(or an expert) is the expected number of losses in a series of rounds of game play (counting ties as half a loss). Our results show then that, in repeated play, the expected number of rounds lost by our algorithm will converge quickly to the expected number that would have been lost by the best of the experts (for the particular sequence of moves that were actually played by the adversary).

Example 3. Suppose that $\Delta$ and $\Omega$ are finite, and that $\lambda$ represents a game matrix as in the last example. Suppose further that we create one expert for each decision $\delta \in \Delta$ and that expert always recommends playing $\delta$. In game-theoretic terminology such experts would be identified with pure strategies. Von Neumann's classical min-max theorem states that for any fixed game matrix there exists a distribution over the actions, also called a mixed strategy, which achieves the min-max optimal value of the expected loss against any adversarial strategy. This min-max value is also called the value of the game.

Suppose that we use algorithm Hedge($\beta$) to choose distributions over the actions when playing a matrix game repeatedly. In this case, Theorem 2 implies that the learner's average per-round loss can never be much larger than that of the best pure strategy, and that the maximal gap decreases to zero at the rate $O(1/\sqrt{T \log |\Delta|})$. However, the expected loss of the optimal mixed strategy is a fixed convex combination of the losses of the pure strategies, thus it can never be smaller than the loss of the best pure strategy for a particular sequence of events. We conclude that the expected per-trial loss of Hedge($\beta$) is upper bounded by the value of the game plus $O(1/\sqrt{T \log |\Delta|})$. In other words, the algorithm can never perform much worse than an algorithm that uses the optimal mixed strategy for the game, and it can be better if the adversary does not play optimally. Moreover, this holds true even if the learner knows nothing at all about the game that is being played (so that $\lambda$ is unknown to the learner), and even if the adversarial opponent has complete knowledge both of the game that is being played and the algorithm that is being used by the learner. Algorithms with similar properties (but weaker convergence bounds) were first devised by Blackwell [2] and Hannan [14]. For more details see our related paper [13].

Example 4. Suppose that $\Delta = \Omega$ is the unit ball in $\mathbb{R}^n$, and that $\lambda(\delta, \omega) = \|\delta - \omega\|$. Thus, the problem here is to predict the location of a point $\omega$, and the loss suffered is the Euclidean distance between the predicted point $\delta$ and the actual outcome $\omega$. Theorem 2 can be applied if probabilistic predictions are allowed. However, in this setting it is more natural to require that the learner and each expert predict a single point (rather than a measure on the space of possible points). Essentially, this is the problem of "tracking" a sequence of points $\omega^1, \ldots, \omega^T$ where the loss function measures the distance to the predicted point.

To see how to handle the problem of finding deterministic predictions, notice that the loss function $\lambda(\delta, \omega)$ is convex with respect to $\delta$:

\[ \|(a \delta_1 + (1 - a) \delta_2) - \omega\| \le a \|\delta_1 - \omega\| + (1 - a) \|\delta_2 - \omega\| \tag{13} \]

for any $a \in [0, 1]$ and any $\omega \in \Omega$. Thus we can do as follows. At time $t$, the learner predicts with the weighted average of the experts' predictions: $\delta^t = \sum_{i=1}^N p_i^t \epsilon_i^t$ where $\epsilon_i^t \in \mathbb{R}^n$ is the prediction of the $i$th expert at time $t$. Regardless of the outcome $\omega^t$, Eq. (13) implies that

\[ \|\delta^t - \omega^t\| \le \sum_{i=1}^N p_i^t \|\epsilon_i^t - \omega^t\|. \]

Since Theorem 2 provides an upper bound on the right hand side of this inequality, we also obtain upper bounds for the left hand side. Thus, our results in this case give explicit bounds on the total error (i.e., distance between predicted and observed points) for the learner relative to the best of a team of experts.

In the one-dimensional case ($n = 1$), this case was previously analyzed by Littlestone and Warmuth [20], and later improved upon by Kivinen and Warmuth [19].

This result depends only on the convexity and the bounded range of the loss function $\lambda(\delta, \omega)$ with respect to $\delta$. Thus, it can also be applied, for example, to the squared-distance loss function $\lambda(\delta, \omega) = \|\delta - \omega\|^2$, as well as the log loss function $\lambda(\delta, \omega) = -\ln(\delta \cdot \omega)$ used by Cover [6] for the design of "universal" investment portfolios. (In this last case, $\Delta$ is the set of probability vectors on $n$ points, and $\Omega = [1/B, B]^n$ for some constant $B > 1$.)

In many of the cases listed above, superior algorithms or analyses are known. Although weaker in specific cases, it should be emphasized that our results are far more general, and can be applied in settings that exhibit considerably less structure, such as the horse-racing example described in the introduction.

4. BOOSTING

In this section we show how the algorithm presented in Section 2 for the on-line allocation problem can be modified to boost the performance of weak learning algorithms.

We very briefly review the PAC learning model (see, for instance, Kearns and Vazirani [18] for a more detailed description). Let $X$ be a set called the domain. A concept is a Boolean function $c : X \to \{0, 1\}$. A concept class $C$ is a collection of concepts. The learner has access to an oracle which provides labelled examples of the form $(x, c(x))$ where $x$ is chosen randomly according to some fixed but
unknown and arbitrary distribution $D$ on the domain $X$, and $c \in C$ is the target concept. After some amount of time, the learner must output a hypothesis $h : X \to [0, 1]$. The value $h(x)$ can be interpreted as a randomized prediction of the label of $x$ that is 1 with probability $h(x)$ and 0 with probability $1 - h(x)$. (Although we assume here that we have direct access to the bias of this prediction, our results can be extended to the case that $h$ is instead a random mapping into $\{0, 1\}$.) The error of the hypothesis $h$ is the expected value $E_{x \sim D}(|h(x) - c(x)|)$ where $x$ is chosen according to $D$. If $h(x)$ is interpreted as a stochastic prediction, then this is simply the probability of an incorrect prediction.

A strong PAC-learning algorithm is an algorithm that, given $\epsilon, \delta > 0$ and access to random examples, outputs with probability $1 - \delta$ a hypothesis with error at most $\epsilon$. Further, the running time must be polynomial in $1/\epsilon$, $1/\delta$ and other relevant parameters (namely, the "size" of the examples received, and the "size" or "complexity" of the target concept). A weak PAC-learning algorithm satisfies the same conditions but only for $\epsilon \ge 1/2 - \gamma$ where $\gamma > 0$ is either a constant, or decreases as $1/p$ where $p$ is a polynomial in the relevant parameters. We use WeakLearn to denote a generic weak learning algorithm.

Schapire [22] showed that any weak learning algorithm can be efficiently transformed or "boosted" into a strong learning algorithm. Later, Freund [10, 11] presented the "boost-by-majority" algorithm that is considerably more efficient than Schapire's. Both algorithms work by calling a given weak learning algorithm WeakLearn multiple times, each time presenting it with a different distribution over the domain $X$, and finally combining all of the generated hypotheses into a single hypothesis. The intuitive idea is to alter the distribution over the domain $X$ in a way that increases the probability of the "harder" parts of the space, thus forcing the weak learner to generate new hypotheses that make fewer mistakes on these parts.

An important, practical deficiency of the boost-by-majority algorithm is the requirement that the bias $\gamma$ of the weak learning algorithm WeakLearn be known ahead of time. Not only is this worst-case bias usually unknown in practice, but the bias that can be achieved by WeakLearn will typically vary considerably from one distribution to the next. Unfortunately, the boost-by-majority algorithm cannot take advantage of hypotheses computed by WeakLearn with error significantly smaller than the presumed worst-case bias of $1/2 - \gamma$.

In this section, we present a new boosting algorithm which was derived from the on-line allocation algorithm of Section 2. This new algorithm is very nearly as efficient as boost-by-majority. However, unlike boost-by-majority, the accuracy of the final hypothesis produced by the new algorithm depends on the accuracy of all the hypotheses returned by WeakLearn, and so is able to more fully exploit the power of the weak learning algorithm.

Also, this new algorithm gives a clean method for handling real-valued hypotheses which often are produced by neural networks and other learning algorithms.

4.1. The New Boosting Algorithm

Although boosting has its roots in the PAC model, for the remainder of the paper, we adopt a more general learning framework in which the learner receives examples $(x_i, y_i)$ chosen randomly according to some fixed but unknown distribution $P$ on $X \times Y$, where $Y$ is a set of possible labels. As usual, the goal is to learn to predict the label $y$ given an instance $x$.

We start by describing our new boosting algorithm in the simplest case that the label set $Y$ consists of just two possible labels, $Y = \{0, 1\}$. In later sections, we give extensions of the algorithm for more general label sets.

Freund [11] describes two frameworks in which boosting can be applied: boosting by filtering and boosting by sampling. In this paper, we use the boosting by sampling framework, which is the natural framework for analyzing "batch" learning, i.e., learning using a fixed training set which is stored in the computer's memory.

We assume that a sequence of $N$ training examples (labelled instances) $(x_1, y_1), \ldots, (x_N, y_N)$ is drawn randomly from $X \times Y$ according to distribution $P$. We use boosting to find a hypothesis $h_f$ which is consistent with most of the sample (i.e., $h_f(x_i) = y_i$ for most $1 \le i \le N$). In general, a hypothesis which is accurate on the training set might not be accurate on examples outside the training set; this problem is sometimes referred to as "over-fitting." Often, however, overfitting can be avoided by restricting the hypothesis to be simple. We will come back to this problem in Section 4.3.

The new boosting algorithm is described in Fig. 2. The goal of the algorithm is to find a final hypothesis with low error relative to a given distribution $D$ over the training examples. Unlike the distribution $P$ which is over $X \times Y$ and is set by "nature," the distribution $D$ is only over the instances in the training set and is controlled by the learner. Ordinarily, this distribution will be set to be uniform so that $D(i) = 1/N$. The algorithm maintains a set of weights $w^t$ over the training examples. On iteration $t$ a distribution $p^t$ is computed by normalizing these weights. This distribution is fed to the weak learner WeakLearn which generates a hypothesis $h_t$ that (we hope) has small error with respect to the distribution.[1] Using the new hypothesis $h_t$, the boosting

[1] Some learning algorithms can be generalized to use a given distribution directly. For instance, gradient based algorithms can use the probability associated with each example to scale the update step size which is based on the example. If the algorithm cannot be generalized in this way, the training sample can be re-sampled to generate a new set of training examples that is distributed according to the given distribution. The computation required to generate each re-sampled example takes $O(\log N)$ time.
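
The re-sampling route mentioned in the footnote can be sketched in Python as follows. The paper does not say how the $O(\log N)$ cost per example is achieved; this sketch assumes one standard method, a table of cumulative weights searched by bisection. The function name `resample` and the toy data are hypothetical.

```python
import bisect
import itertools
import random

def resample(examples, p, m, rng):
    """Draw m examples independently according to the distribution p."""
    cumulative = list(itertools.accumulate(p))      # O(N) preprocessing, done once
    total = cumulative[-1]
    sample = []
    for _ in range(m):
        u = rng.random() * total                    # uniform in [0, total)
        idx = bisect.bisect_right(cumulative, u)    # O(log N) binary search
        idx = min(idx, len(examples) - 1)           # guard against float round-off
        sample.append(examples[idx])
    return sample

# Hypothetical toy training set with a skewed distribution p^t.
rng = random.Random(0)
examples = ["a", "b", "c", "d"]
p = [0.7, 0.1, 0.1, 0.1]
draws = resample(examples, p, 10000, rng)
print(draws.count("a") / len(draws))
```

Because the weights change on every boosting round, the cumulative table has to be rebuilt once per round, which costs $O(N)$ and does not affect the per-example cost.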
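by $h_f$'s definition, and since $y_i \in \{0, 1\}$. Thus, by Theorem 2,

\[ T \cdot (1/2 + \gamma) \le \sum_{t=1}^T p^t \cdot \ell^t \le \frac{-\ln\left( \sum_{i \in S} D(i) \right) + (\gamma + \gamma^2)(T/2)}{\gamma} \]

since $-\ln(\beta) = -\ln(1 - \gamma) \le \gamma + \gamma^2$ for $\gamma \in [0, 1/2]$. This implies that the error $\epsilon = \sum_{i \in S} D(i)$ of $h_f$ is at most $e^{-T\gamma^2/2}$.

Theorem 6. Suppose the weak learning algorithm WeakLearn, when called by AdaBoost, generates hypotheses with errors $\epsilon_1, \ldots, \epsilon_T$ (as defined in Step 3 of Fig. 2). Then the error $\epsilon = \Pr_{i \sim D}[h_f(x_i) \ne y_i]$ of the final hypothesis $h_f$ output by AdaBoost is bounded above by

\[ \epsilon \le 2^T \prod_{t=1}^T \sqrt{\epsilon_t (1 - \epsilon_t)}. \tag{14} \]

Proof. We adapt the main arguments from Lemma 1 and Theorem 2. We use $p^t$ and $w^t$ as they are defined in Fig. 2.

Similar to Eq. (4), the update rule given in Step 5 in Fig. 2 implies that

\[ \sum_{i=1}^N w_i^{t+1} = \sum_{i=1}^N w_i^t \beta_t^{1 - |h_t(x_i) - y_i|} \le \sum_{i=1}^N w_i^t (1 - (1 - \beta_t)(1 - |h_t(x_i) - y_i|)) = \left( \sum_{i=1}^N w_i^t \right) (1 - (1 - \epsilon_t)(1 - \beta_t)). \tag{15} \]

Combining this inequality over $t = 1, \ldots, T$, we get that

\[ \sum_{i=1}^N w_i^{T+1} \le \prod_{t=1}^T (1 - (1 - \epsilon_t)(1 - \beta_t)). \tag{16} \]

The final hypothesis $h_f$, as defined in Fig. 2, makes a mistake on instance $i$ only if

\[ \prod_{t=1}^T \beta_t^{-|h_t(x_i) - y_i|} \ge \left( \prod_{t=1}^T \beta_t \right)^{-1/2} \tag{17} \]

The final weight of any instance $i$ is

\[ w_i^{T+1} = D(i) \prod_{t=1}^T \beta_t^{1 - |h_t(x_i) - y_i|}. \tag{18} \]

Combining Eqs. (17) and (18), the sum of the final weights is bounded from below by the sum of the final weights of the examples on which $h_f$ is incorrect:

\[ \sum_{i=1}^N w_i^{T+1} \ge \sum_{i : h_f(x_i) \ne y_i} w_i^{T+1} \ge \left( \sum_{i : h_f(x_i) \ne y_i} D(i) \right) \left( \prod_{t=1}^T \beta_t \right)^{1/2} = \epsilon \cdot \left( \prod_{t=1}^T \beta_t \right)^{1/2} \tag{19} \]

where $\epsilon$ is the error of $h_f$. Combining Eqs. (16) and (19), we get that

\[ \epsilon \le \prod_{t=1}^T \frac{1 - (1 - \epsilon_t)(1 - \beta_t)}{\sqrt{\beta_t}}. \tag{20} \]

As all the factors in the product are positive, we can minimize the right hand side by minimizing each factor separately. Setting the derivative of the $t$th factor to zero, we find that the choice of $\beta_t$ which minimizes the right hand side is $\beta_t = \epsilon_t / (1 - \epsilon_t)$. Plugging this choice of $\beta_t$ into Eq. (20) we get Eq. (14), completing the proof. ∎

The bound on the error given in Theorem 6 can also be written in the form

\[ \epsilon \le \prod_{t=1}^T \sqrt{1 - 4 \gamma_t^2} = \exp\left( -\sum_{t=1}^T \mathrm{KL}(1/2 \,\|\, 1/2 - \gamma_t) \right) \le \exp\left( -2 \sum_{t=1}^T \gamma_t^2 \right) \tag{21} \]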
where $\mathrm{KL}(a \,\|\, b) = a \ln(a/b) + (1 - a) \ln((1 - a)/(1 - b))$ is the Kullback-Leibler divergence, and where $\epsilon_t$ has been replaced by $1/2 - \gamma_t$. In the case where the errors of all the hypotheses are equal to $1/2 - \gamma$, Eq. (21) simplifies to

\[ \epsilon \le (1 - 4\gamma^2)^{T/2} = \exp(-T \cdot \mathrm{KL}(1/2 \,\|\, 1/2 - \gamma)) \le \exp(-2T\gamma^2). \tag{22} \]

This is a form of the Chernoff bound for the probability that less than $T/2$ coin flips turn out "heads" in $T$ tosses of a random coin whose probability for "heads" is $1/2 - \gamma$. This bound has the same asymptotic behavior as the bound given for the boost-by-majority algorithm [11]. From Eq. (22) we get that the number of iterations of the boosting algorithm that is sufficient to achieve error $\epsilon$ of $h_f$ is

\[ T = \left\lceil \frac{1}{\mathrm{KL}(1/2 \,\|\, 1/2 - \gamma)} \ln \frac{1}{\epsilon} \right\rceil \le \left\lceil \frac{1}{2\gamma^2} \ln \frac{1}{\epsilon} \right\rceil. \tag{23} \]

Theorem 6 implies that the final error depends on the error of all of the weak hypotheses. Previous bounds on the errors of boosting algorithms depended only on the maximal error of the weakest hypothesis and ignored the advantage that can be gained from the hypotheses whose errors are smaller. This advantage seems to be very relevant to practical applications of boosting, because there one expects the error of the learning algorithm to increase as the distributions fed to WeakLearn shift more and more away from the target distribution.

4.3. Generalization Error

We now come back to discussing the error of the final hypothesis outside the training set. Theorem 6 guarantees that the error of $h_f$ on the sample is small; however, the quantity that interests us is the generalization error of $h_f$, which is the error of $h_f$ over the whole instance space $X$; that is, $\epsilon_g = \Pr_{(x, y) \sim P}[h_f(x) \ne y]$. In order to make $\epsilon_g$ close to the empirical error $\hat{\epsilon}$ on the training set, we have to restrict the choice of $h_f$ in some way. One natural way of doing this in the context of boosting is to restrict the weak learner to [...] on the VC-dimension of the concept class. This method is sometimes called "structural risk minimization." See Vapnik's book [23] for an extensive discussion of the theory of structural risk minimization. For our purposes, we quote Vapnik's Theorem 6.7:

Theorem 7 (Vapnik). Let $H$ be a class of binary functions over some domain $X$. Let $d$ be the VC-dimension of $H$. Let $P$ be a distribution over the pairs $X \times \{0, 1\}$. For $h \in H$, define the (generalization) error of $h$ with respect to $P$ to be

\[ \epsilon_g(h) \doteq \Pr_{(x, y) \sim P}[h(x) \ne y]. \]

Let $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ be a sample (training set) of $N$ independent random examples drawn from $X \times \{0, 1\}$ according to $P$. Define the empirical error of $h$ with respect to the sample $S$ to be

\[ \hat{\epsilon}(h) \doteq \frac{|\{i : h(x_i) \ne y_i\}|}{N}. \]

Then, for any $\delta > 0$ we have that

\[ \Pr\left[ \exists h \in H : |\hat{\epsilon}(h) - \epsilon_g(h)| > 2 \sqrt{\frac{d(\ln(2N/d) + 1) + \ln(9/\delta)}{N}} \right] \le \delta \]

where the probability is computed with respect to the random choice of the sample $S$.

Let $\theta : \mathbb{R} \to \{0, 1\}$ be defined by

\[ \theta(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases} \]

and, for any class $H$ of functions, let $\Theta_T(H)$ be the class of all functions defined as a linear threshold of $T$ functions in $H$:

\[ \Theta_T(H) = \left\{ \theta\left( \sum_{t=1}^T a_t h_t - b \right) : b, a_1, \ldots, a_T \in \mathbb{R};\ h_1, \ldots, h_T \in H \right\}. \]
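
The deviation term of Theorem 7 and a member of $\Theta_T(H)$ can be illustrated with a short Python sketch. The function names, the threshold "stumps" used as the class $H$, and the numeric values of $d$, $N$, and $\delta$ are hypothetical and serve only to show how the quantities fit together.

```python
import math

def vc_deviation_bound(d, n, delta):
    """Deviation term of Theorem 7: 2 * sqrt((d (ln(2n/d) + 1) + ln(9/delta)) / n)."""
    return 2.0 * math.sqrt(
        (d * (math.log(2.0 * n / d) + 1.0) + math.log(9.0 / delta)) / n
    )

def theta(x):
    """theta(x) = 1 if x >= 0, and 0 otherwise."""
    return 1 if x >= 0 else 0

def linear_threshold(hypotheses, a, b):
    """Build x -> theta(sum_t a_t h_t(x) - b), a member of Theta_T(H)."""
    def combined(x):
        return theta(sum(at * ht(x) for at, ht in zip(a, hypotheses)) - b)
    return combined

# Hypothetical weak hypotheses: threshold stumps on a real-valued instance.
stumps = [
    lambda x: 1 if x > 0 else 0,
    lambda x: 1 if x > 2 else 0,
    lambda x: 1 if x > 4 else 0,
]
vote = linear_threshold(stumps, a=[1.0, 1.0, 1.0], b=2.0)  # needs two of three votes

small_sample = vc_deviation_bound(d=10, n=1000, delta=0.05)
large_sample = vc_deviation_bound(d=10, n=100000, delta=0.05)
print(vote(1), vote(3), small_sample, large_sample)
```

The deviation term shrinks roughly as $\sqrt{d \ln N / N}$, so for a fixed VC-dimension the empirical error becomes a tighter estimate of the generalization error as the sample grows.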
Theorem 8. Let $H$ be a class of binary functions of VC-dimension $d \ge 2$. Then the VC-dimension of $\Theta_T(H)$ is at most $2(d+1)(T+1) \log_2(e(T+1))$ (where $e$ is the base of the natural logarithm).

Therefore, if the hypotheses generated by WeakLearn are chosen from a class of VC-dimension $d \ge 2$, then the final hypotheses generated by AdaBoost after $T$ iterations belong to a class of VC-dimension at most $2(d+1)(T+1) \log_2[e(T+1)]$.

Proof. We use a result about the VC-dimension of computation networks proved by Baum and Haussler [1]. We can view the final hypothesis output by AdaBoost as a function that is computed by a two-layer feed-forward network where the computation units of the first layer are the weak hypotheses and the computation unit of the second layer is the linear threshold function which combines the weak hypotheses. The VC-dimension of the set of linear threshold functions over $\mathbb{R}^T$ is $T+1$ [26]. Thus the sum over all computation units of the VC-dimensions of the classes of functions associated with each unit is $Td + (T+1) < (T+1)(d+1)$. Baum and Haussler's Theorem 1 [1] implies that the number of different functions that can be realized by $h \in \Theta_T(H)$ when the domain is restricted to a set of size $m$ is at most $\left( (T+1)em / ((T+1)(d+1)) \right)^{(T+1)(d+1)}$. If $d \ge 2$, $T \ge 1$ and we set $m = \lceil 2(T+1)(d+1) \log_2[e(T+1)] \rceil$, then the number of realizable functions is smaller than $2^m$ which implies that the VC-dimension of $\Theta_T(H)$ is smaller than $m$. ∎

Following the guidelines of structural risk minimization we can do the following (assuming we know a reasonable upper bound on the VC-dimension of the class of weak hypotheses). Let $h_f^T$ be the hypothesis generated by running AdaBoost for $T$ iterations. By combining the observed empirical error of $h_f^T$ with the bounds given in Theorems 7 and 8, we can compute an upper bound on the generaliza- [...]

[...] problems, even after hundreds of rounds of boosting, the generalization error continues to drop, or at least does not increase.

4.4. A Bayesian Interpretation

The final hypothesis generated by AdaBoost is closely related to one suggested by a Bayesian analysis. As usual, we assume that examples $(x, y)$ are being generated according to some distribution $P$ on $X \times \{0, 1\}$; all probabilities in this subsection are taken with respect to $P$. Suppose we are given a set of $\{0, 1\}$-valued hypotheses $h_1, \ldots, h_T$ and that our goal is to combine the predictions of these hypotheses in the optimal way. Then, given an instance $x$ and the hypothesis predictions $h_t(x)$, the Bayes optimal decision rule says that we should predict the label with the highest likelihood, given the hypothesis values, i.e., we should predict 1 if

$$\Pr[y = 1 \mid h_1(x), \ldots, h_T(x)] > \Pr[y = 0 \mid h_1(x), \ldots, h_T(x)],$$

and otherwise we should predict 0.

This rule is especially easy to compute if we assume that the errors of the different hypotheses are independent of one another and of the target concept, that is, if we assume that the event $h_t(x) \ne y$ is conditionally independent of the actual label $y$ and the predictions of all the other hypotheses $h_1(x), \ldots, h_{t-1}(x), h_{t+1}(x), \ldots, h_T(x)$. In this case, by applying Bayes rule, we can rewrite the Bayes optimal decision rule in a particularly simple form in which we predict 1 if

$$\Pr[y = 1] \prod_{t : h_t(x) = 0} \epsilon_t \prod_{t : h_t(x) = 1} (1 - \epsilon_t) > \Pr[y = 0] \prod_{t : h_t(x) = 0} (1 - \epsilon_t) \prod_{t : h_t(x) = 1} \epsilon_t,$$
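The product form of the Bayes rule under the independence assumption can be evaluated in log space, as in the minimal sketch below. It assumes that $\epsilon_t$ denotes the probability that $h_t$ errs and that every $\epsilon_t$ lies strictly between 0 and 1; the function name and the prior argument are hypothetical.

```python
import math

def bayes_vote(predictions, errors, prior_one=0.5):
    """Return 1 if the independence-based Bayes rule favors label 1, else 0.

    predictions: list of 0/1 values h_t(x)
    errors:      list of error probabilities eps_t (assumed in (0, 1))
    prior_one:   assumed prior Pr[y = 1]
    """
    log_one = math.log(prior_one)        # log of the left-hand side
    log_zero = math.log(1 - prior_one)   # log of the right-hand side
    for h, eps in zip(predictions, errors):
        if h == 1:
            log_one += math.log(1 - eps)   # h_t is right if y = 1
            log_zero += math.log(eps)      # h_t is wrong if y = 0
        else:
            log_one += math.log(eps)
            log_zero += math.log(1 - eps)
    return 1 if log_one > log_zero else 0

# One accurate hypothesis voting 1 outweighs two poor ones voting 0.
print(bayes_vote([1, 0, 0], [0.1, 0.4, 0.4]))
```

Taking logarithms turns the rule into a weighted vote in which hypothesis $t$ carries weight $\log((1 - \epsilon_t)/\epsilon_t)$.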
4.5. Improving the Error Bound

[...]

$$r(x) = \frac{\sum_{t=1}^{T} (\log 1/\beta_t)\, h_t(x)}{\sum_{t=1}^{T} \log 1/\beta_t}$$

be a weighted average of the weak hypotheses $h_t$. We will here consider final hypotheses of the form $h_f(x) = F(r(x))$ where $F : [0, 1] \to [0, 1]$. For the version of AdaBoost given in Fig. 2, $F(r)$ is the hard threshold that equals 1 if $r \ge 1/2$ and 0 otherwise. In this section, we will instead use soft threshold functions that take values in $[0, 1]$. As mentioned above, when $h_f(x) \in [0, 1]$, we can interpret $h_f$ as a randomized hypothesis and $h_f(x)$ as the probability of predicting 1. Then the error $E_{i \sim D}[\,|h_f(x_i) - y_i|\,]$ is simply the probability of an incorrect prediction.

Theorem 9. Let $\epsilon_1, \ldots, \epsilon_T$ be as in Theorem 6, and let $r(x_i)$ be as defined above. Let the modified final hypothesis be defined by $h_f(x) = F(r(x))$ where $F$ satisfies the following for $r \in [0, 1]$:

$$F(1 - r) = 1 - F(r); \quad \text{and} \quad F(r) \le \frac{1}{2} \left( \prod_{t=1}^{T} \beta_t \right)^{1/2 - r}.$$

Then the error $\epsilon$ of $h_f$ is bounded above by

$$\epsilon \le 2^{T-1} \prod_{t=1}^{T} \sqrt{\epsilon_t (1 - \epsilon_t)}.$$

For instance, it can be shown that the sigmoid function $F(r) = \left( 1 + \prod_{t=1}^{T} \beta_t^{2r - 1} \right)^{-1}$ satisfies the conditions of the theorem.

Proof. By our assumptions on $F$, the error of $h_f$ is

$$\epsilon = \sum_{i=1}^{N} D(i) \cdot |F(r(x_i)) - y_i| = \sum_{i=1}^{N} D(i)\, F(|r(x_i) - y_i|) \le \frac{1}{2} \sum_{i=1}^{N} D(i) \left( \prod_{t=1}^{T} \beta_t^{1/2 - |r(x_i) - y_i|} \right).$$

Since $y_i \in \{0, 1\}$ and by definition of $r(x_i)$, this implies that

[...]

$$\le \frac{1}{2} \prod_{t=1}^{T} \left( (1 - (1 - \epsilon_t)(1 - \beta_t))\, \beta_t^{-1/2} \right).$$

The last two steps follow from Eqs. (18) and (16), respectively. The theorem now follows from our choice of $\beta_t$. ∎

5. BOOSTING FOR MULTI-CLASS AND REGRESSION PROBLEMS

So far, we have restricted our attention to binary classification problems in which the set of labels $Y$ contains only two elements. In this section, we describe two possible extensions of AdaBoost to the multi-class case in which $Y$ is any finite set of class labels. We also give an extension for a regression problem in which $Y$ is a real bounded interval.

We start with the multiple-label classification problem. Let $Y = \{1, 2, \ldots, k\}$ be the set of possible labels. The boosting algorithms we present output hypotheses $h_f : X \to Y$, and the error of the final hypothesis is measured in the usual way as the probability of an incorrect prediction.

The first extension of AdaBoost, which we call AdaBoost.M1, is the most direct. The weak learner generates hypotheses which assign to each instance one of the $k$ possible labels. We require that each weak hypothesis have prediction error less than 1/2 (with respect to the distribution on which it was trained). Provided this requirement can be met, we are able to prove that the error of the combined final hypothesis decreases exponentially, as in the binary case. Intuitively, however, this requirement on the performance of the weak learner is stronger than might be desired. In the binary case ($k = 2$), a random guess will be correct with probability 1/2, but when $k > 2$, the probability of a correct random prediction is only $1/k < 1/2$. Thus, our requirement that the accuracy of the weak hypothesis be greater than 1/2 is significantly stronger than simply requiring that the weak hypothesis perform better than random guessing.

In fact, when the performance of the weak learner is measured only in terms of error rate, this difficulty is unavoidable as is shown by the following informal example (also presented by Schapire [22]): Consider a learning problem where $Y = \{0, 1, 2\}$ and suppose that it is ``easy'' to predict whether the label is 2 but ``hard'' to predict whether the label is 0 or 1. Then a hypothesis which predicts correctly whenever the label is 2 and otherwise guesses
randomly between 0 and 1 is guaranteed to be correct at least half of the time (significantly beating the 1/3 accuracy achieved by guessing entirely at random). On the other hand, boosting this learner to an arbitrary accuracy is infeasible since we assumed that it is hard to distinguish 0- and 1-labelled instances.

As a more natural example of this problem, consider classification of handwritten digits in an OCR application. It may be easy for the weak learner to tell that a particular image of a ``7'' is not a ``0'' but hard to tell for sure if it is a ``7'' or a ``9''. Part of the problem here is that, although the boosting algorithm can focus the attention of the weak learner on the harder examples, it has no way of forcing the weak learner to discriminate between particular labels that may be especially hard to distinguish.

In our second version of multi-class boosting, we attempt to overcome this difficulty by extending the communication between the boosting algorithm and the weak learner. First, we allow the weak learner to generate more expressive hypotheses whose output is a vector in $[0, 1]^k$, rather than a single label in $Y$. Intuitively, the $y$th component of this vector represents a ``degree of belief'' that the correct label is $y$. The components with large values (close to 1) correspond to those labels considered to be plausible. Likewise, labels considered implausible are assigned a small value (near 0), and questionable labels may be assigned a value near 1/2. If several labels are considered plausible (or implausible), then they all may be assigned large (or small) values.

While we give the weak learning algorithm more expressive power, we also place a more complex requirement on the performance of the weak hypotheses. Rather than using the usual prediction error, we ask that the weak hypotheses do well with respect to a more sophisticated error measure that we call the pseudo-loss. This pseudo-loss varies from example to example, and from one round to the next. On each iteration, the pseudo-loss function is supplied to the weak learner by the boosting algorithm, along with the distribution on the examples. By manipulating the pseudo-loss function, the boosting algorithm can focus the weak learner on the labels that are hardest to discriminate. The boosting algorithm AdaBoost.M2, described in Section 5.2, is based on these ideas and achieves boosting if each weak hypothesis has pseudo-loss slightly better than random guessing (with respect to the pseudo-loss measure that was supplied to the weak learner).

In addition to the two extensions described in this paper, we mention an alternative, standard approach which would be to convert the given multi-class problem into several binary problems, and then to use boosting separately on each of the binary problems. There are several standard ways of making such a conversion, one of the most successful being the error-correcting output coding approach advocated by Dietterich and Bakiri [7].

Finally, in Section 5.3 we extend AdaBoost to boosting regression algorithms. In this case $Y = [0, 1]$, and the error of a hypothesis is defined as $E_{(x, y) \sim P}[(h(x) - y)^2]$. We describe a boosting algorithm AdaBoost.R which, using methods similar to those used in AdaBoost.M2, boosts the performance of a weak regression algorithm.

5.1. First Multi-class Extension

In our first and most direct extension to the multi-class case, the goal of the weak learner is to generate on round $t$ a hypothesis $h_t : X \to Y$ with low classification error $\epsilon_t \doteq \Pr_{i \sim p^t}[h_t(x_i) \ne y_i]$. Our extended boosting algorithm, called AdaBoost.M1, is shown in Fig. 3, and differs only slightly from AdaBoost. The main difference is in the replacement of the error $|h_t(x_i) - y_i|$ for the binary case by $[\![ h_t(x_i) \ne y_i ]\!]$ where, for any predicate $\pi$, we define $[\![ \pi ]\!]$ to be 1 if $\pi$ holds and 0 otherwise. Also, the final hypothesis $h_f$, for a given instance $x$, now outputs the label $y$ that maximizes the sum of the weights of the weak hypotheses predicting that label.

In the case of binary classification ($k = 2$), a weak hypothesis $h$ with error significantly larger than 1/2 is of equal value to one with error significantly less than 1/2 since $h$ can be replaced by $1 - h$. However, for $k > 2$, a hypothesis $h_t$ with error $\epsilon_t \ge 1/2$ is useless to the boosting algorithm. If [...]

Algorithm AdaBoost.M1
Input: sequence of $N$ examples $\langle (x_1, y_1), \ldots, (x_N, y_N) \rangle$ with labels $y_i \in Y = \{1, \ldots, k\}$
    distribution $D$ over the $N$ examples
    weak learning algorithm WeakLearn
    integer $T$ specifying number of iterations
Initialize the weight vector: $w_i^1 = D(i)$ for $i = 1, \ldots, N$.
Do for $t = 1, 2, \ldots, T$
  1. Set $p^t = w^t / \sum_{i=1}^{N} w_i^t$.
  2. Call WeakLearn, providing it with the distribution $p^t$; get back a hypothesis $h_t : X \to Y$.
  3. Calculate the error of $h_t$: $\epsilon_t = \sum_{i=1}^{N} p_i^t [\![ h_t(x_i) \ne y_i ]\!]$. If $\epsilon_t > 1/2$, then set $T = t - 1$ and abort loop.
  4. Set $\beta_t = \epsilon_t / (1 - \epsilon_t)$.
  5. Set the new weights vector to be $w_i^{t+1} = w_i^t \beta_t^{1 - [\![ h_t(x_i) \ne y_i ]\!]}$.
Output the hypothesis
  $$h_f(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} \left( \log \frac{1}{\beta_t} \right) [\![ h_t(x) = y ]\!].$$

FIG. 3. A first multi-class extension of AdaBoost.
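The steps of AdaBoost.M1 in Fig. 3 can be sketched in a few lines of Python. This is only an illustrative sketch: the `weak_learn(xs, ys, p)` interface, the clamping of $\epsilon_t$ away from zero, and the one-dimensional threshold learner `stump_learner` are assumptions introduced here and do not appear in the paper.

```python
import math
from collections import defaultdict

def adaboost_m1(xs, ys, weak_learn, T, D=None):
    """Sketch of AdaBoost.M1; returns the final hypothesis h_f."""
    n = len(xs)
    w = list(D) if D is not None else [1.0 / n] * n      # w_i^1 = D(i)
    hyps, betas = [], []
    for _ in range(T):
        total = sum(w)
        p = [wi / total for wi in w]                       # step 1
        h = weak_learn(xs, ys, p)                          # step 2
        miss = [h(x) != y for x, y in zip(xs, ys)]
        eps = sum(pi for pi, m in zip(p, miss) if m)       # step 3
        if eps > 0.5:
            break                                          # abort loop
        eps = max(eps, 1e-10)   # assumption: keeps log(1/beta) finite
        beta = eps / (1.0 - eps)                           # step 4
        # step 5: exponent is 1 for correct examples, 0 for mistakes
        w = [wi * (1.0 if m else beta) for wi, m in zip(w, miss)]
        hyps.append(h)
        betas.append(beta)

    def h_f(x):
        # Assumes at least one round completed.
        votes = defaultdict(float)
        for h, beta in zip(hyps, betas):
            votes[h(x)] += math.log(1.0 / beta)
        return max(votes, key=votes.get)
    return h_f

def stump_learner(xs, ys, p):
    """Hypothetical weak learner: best single threshold on scalar inputs."""
    labels = sorted(set(ys))
    best = None
    for thr in sorted(set(xs)):
        def majority(side):
            score = {c: 0.0 for c in labels}
            for x, y, pi in zip(xs, ys, p):
                if (x <= thr) == side:
                    score[y] += pi
            return max(labels, key=lambda c: score[c])
        left, right = majority(True), majority(False)
        err = sum(pi for x, y, pi in zip(xs, ys, p)
                  if (left if x <= thr else right) != y)
        if best is None or err < best[0]:
            best = (err, thr, left, right)
    _, thr, left, right = best
    return lambda x: left if x <= thr else right

xs = [0, 1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 1, 1, 2, 2, 2, 3, 3, 3]
h_f = adaboost_m1(xs, ys, stump_learner, T=3)
print([h_f(x) for x in xs])
```

A single threshold can separate at most two of the three label groups in this toy data, so no individual weak hypothesis is perfect; the weighted vote combines several of them.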
In the more general case that $h$ takes values in $[0, 1]$, we interpret $h(x, y)$ as a randomized decision for the procedure above. That is, we first choose a random bit $b(x, y)$ which is 1 with probability $h(x, y)$ and 0 otherwise. We then apply the above procedure to the stochastically chosen binary function $b$. The probability of choosing the incorrect answer $y$ to the question above is

$$\Pr[b(x_i, y_i) = 0 \wedge b(x_i, y) = 1] + \tfrac{1}{2} \Pr[b(x_i, y_i) = b(x_i, y)] = \tfrac{1}{2} (1 - h(x_i, y_i) + h(x_i, y)).$$

If the answers to all $k - 1$ questions are considered equally important, then it is natural to define the loss of the hypothesis to be the average, over all $k - 1$ questions, of the probability of an incorrect answer:

$$\frac{1}{k - 1} \sum_{y \ne y_i} \frac{1}{2} (1 - h(x_i, y_i) + h(x_i, y)) = \frac{1}{2} \left( 1 - h(x_i, y_i) + \frac{1}{k - 1} \sum_{y \ne y_i} h(x_i, y) \right). \qquad (24)$$

However, as was discussed in the introduction to Section 5, different discrimination questions are likely to have different importance in different situations. For example, considering the OCR problem described earlier, it might be that at some point during the boosting process, some example of the digit ``7'' has been recognized as being either a ``7'' or a ``9''. At this stage the question that discriminates between ``7'' (the correct label) and ``9'' is clearly much more important than the other eight questions that discriminate ``7'' from the other digits.

A natural way of attaching different degrees of importance to the different questions is to assign a weight to each question. So, for each instance $x_i$ and incorrect label $y \ne y_i$, we assign a weight $q(i, y)$ which we associate with the question that discriminates label $y$ from the correct label $y_i$. We then replace the average used in Eq. (24) with an average weighted according to $q(i, y)$; the resulting formula is called the pseudo-loss of $h$ on training instance $i$ with respect to $q$:

$$\mathrm{ploss}_q(h, i) \doteq \frac{1}{2} \left( 1 - h(x_i, y_i) + \sum_{y \ne y_i} q(i, y)\, h(x_i, y) \right).$$

The function $q : \{1, \ldots, N\} \times Y \to [0, 1]$, called the label weighting function, assigns to each example $i$ in the training set a probability distribution over the $k - 1$ discrimination problems defined above. So, for all $i$,

$$\sum_{y \ne y_i} q(i, y) = 1.$$

The weak learner's goal is to minimize the expected pseudo-loss for given distribution $D$ and weighting function $q$:

$$\mathrm{ploss}_{D, q}(h) := E_{i \sim D}[\mathrm{ploss}_q(h, i)].$$

As we have seen, by manipulating both the distribution on instances, and the label weighting function $q$, our boosting algorithm effectively forces the weak learner to focus not only on the hard instances, but also on the incorrect class labels that are hardest to eliminate. Conversely, this pseudo-loss measure may make it easier for the weak learner to get a weak advantage. For instance, if the weak learner can simply determine that a particular instance does not belong to a certain class (even if it has no idea which of the remaining classes is the correct one), then, depending on $q$, this may be enough to gain a weak advantage.

Theorem 11, the main result of this section, shows that a weak learner can be boosted if it can consistently produce weak hypotheses with pseudo-losses smaller than 1/2. Note that pseudo-loss 1/2 can be achieved trivially by any uninformative hypothesis. Furthermore, a weak hypothesis $h$ with pseudo-loss $\epsilon > 1/2$ is also beneficial to boosting since it can be replaced by the hypothesis $1 - h$ whose pseudo-loss is $1 - \epsilon < 1/2$.

Example 5. As a simple example illustrating the use of pseudo-loss, suppose we seek an oblivious weak hypothesis, i.e., a weak hypothesis whose value depends only on the class label $y$ so that $h(x, y) = h(y)$ for all $x$. Although oblivious hypotheses per se are generally too weak to be of interest, it may often be appropriate to find the best oblivious hypothesis on a part of the instance space (such as the set of instances covered by a leaf of a decision tree).

Let $D$ be the target distribution, and $q$ the label weighting function. For notational convenience, let us define $q(i, y_i) = -1$ for all $i$ so that

$$\mathrm{ploss}_q(h, i) = \frac{1}{2} \left( 1 + \sum_{y \in Y} q(i, y)\, h(x_i, y) \right).$$

Setting $\delta(y) = \sum_i D(i)\, q(i, y)$, it can be verified that for an oblivious hypothesis $h$,

$$\mathrm{ploss}_{D, q}(h) = \frac{1}{2} \left( 1 + \sum_{y \in Y} h(y)\, \delta(y) \right),$$

which is clearly minimized by the choice

$$h(y) = \begin{cases} 1 & \text{if } \delta(y) < 0 \\ 0 & \text{otherwise.} \end{cases}$$

Suppose now that $q(i, y) = 1/(k - 1)$ for $y \ne y_i$, and let $d(y) = \Pr_{i \sim D}[y_i = y]$ be the proportion of examples with
label $y$. Then it can be verified that $h$ will always have pseudo-loss strictly smaller than 1/2 except in the case of a uniform distribution of labels ($d(y) = 1/k$ for all $y$). In contrast, when the weak learner's goal is minimization of prediction error (as in Section 5.1), it can be shown that an oblivious hypothesis with prediction error strictly less than 1/2 can only be found when one label $y$ covers more than 1/2 the distribution ($d(y) > 1/2$). So in this case, it is much easier to find a hypothesis with small pseudo-loss rather than small prediction error.

On the other hand, if $q(i, y) = 0$ for some values of $y$, then the quality of prediction on these labels is of no consequence. In particular, if $q(i, y) = 0$ for all but one incorrect label for each instance $i$, then in order to make the pseudo-loss smaller than 1/2 the hypothesis has to predict the correct label with probability larger than 1/2, which means that in this case the pseudo-loss criterion is as stringent as the usual prediction error. However, as discussed above, this case is unavoidable because a hard binary classification problem can always be embedded in a multi-class problem.

This example suggests that it may often be significantly easier to find weak hypotheses with small pseudo-loss rather than hypotheses whose prediction error is small. On the other hand, our theoretical bound for boosting using the prediction error (Theorem 10) is stronger than the bound for ploss (Theorem 11). Empirical tests [12] have shown that pseudo-loss is generally more successful when the weak learners use very restricted hypotheses. However, for more powerful weak learners, such as decision-tree learning algorithms, there is little difference between using pseudo-loss and prediction error.

Our algorithm, called AdaBoost.M2, is shown in Fig. 4. Here, we maintain weights $w_{i, y}^t$ for each instance $i$ and each label $y \in Y - \{y_i\}$. The weak learner must be provided both with a distribution $D_t$ and a label weight function $q_t$. Both of these are computed using the weight vector $w^t$ as shown in Step 1. The weak learner's goal then is to minimize the pseudo-loss $\epsilon_t$, as defined in Step 3. The weights are updated as shown in Step 5. The final hypothesis $h_f$ outputs, for a given instance $x$, the label $y$ that maximizes a weighted average of the weak hypothesis values $h_t(x, y)$.

Theorem 11. Suppose the weak learning algorithm WeakLearn, when called by AdaBoost.M2 generates hypotheses with pseudo-losses $\epsilon_1, \ldots, \epsilon_T$, where $\epsilon_t$ is as defined in Fig. 4. Then the error $\epsilon = \Pr_{i \sim D}[h_f(x_i) \ne y_i]$ of the final hypothesis $h_f$ output by AdaBoost.M2 is bounded above by

$$\epsilon \le (k - 1)\, 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t (1 - \epsilon_t)}.$$

Proof. As in the proof of Theorem 10, we reduce to an instance of AdaBoost and apply Theorem 6. As before, we mark AdaBoost variables with a tilde.

For each training instance $(x_i, y_i)$ and for each incorrect label $y \in Y - \{y_i\}$, we define one AdaBoost instance $\tilde{x}_{i, y} = (i, y)$ with associated label $\tilde{y}_{i, y} = 0$. Thus, there are $\tilde{N} = N(k - 1)$ AdaBoost instances, each indexed by a pair $(i, y)$. The distribution over these instances is defined to be $\tilde{D}(i, y) = D(i)/(k - 1)$. The $t$th hypothesis $\tilde{h}_t$ provided to AdaBoost for this reduction is defined by the rule

$$\tilde{h}_t(i, y) = \tfrac{1}{2} (1 - h_t(x_i, y_i) + h_t(x_i, y)).$$

With this setup, it can be verified that the computed distributions and errors will be identical so that $\tilde{w}_{i, y}^t = w_{i, y}^t$, $\tilde{p}_{i, y}^t = p_{i, y}^t$, $\tilde{\epsilon}_t = \epsilon_t$ and $\tilde{\beta}_t = \beta_t$.

Algorithm AdaBoost.M2
Input: sequence of $N$ examples $\langle (x_1, y_1), \ldots, (x_N, y_N) \rangle$ with labels $y_i \in Y = \{1, \ldots, k\}$
    distribution $D$ over the $N$ examples
    weak learning algorithm WeakLearn
    integer $T$ specifying number of iterations
Initialize the weight vector: $w_{i, y}^1 = D(i)/(k - 1)$ for $i = 1, \ldots, N$, $y \in Y - \{y_i\}$.
Do for $t = 1, 2, \ldots, T$
  1. Set $W_i^t = \sum_{y \ne y_i} w_{i, y}^t$; $q_t(i, y) = w_{i, y}^t / W_i^t$ for $y \ne y_i$; and set $D_t(i) = W_i^t / \sum_{i=1}^{N} W_i^t$.
  2. Call WeakLearn, providing it with the distribution $D_t$ and label weighting function $q_t$; get back a hypothesis $h_t : X \times Y \to [0, 1]$.
  3. Calculate the pseudo-loss of $h_t$:
     $$\epsilon_t = \frac{1}{2} \sum_{i=1}^{N} D_t(i) \left( 1 - h_t(x_i, y_i) + \sum_{y \ne y_i} q_t(i, y)\, h_t(x_i, y) \right).$$
  4. Set $\beta_t = \epsilon_t / (1 - \epsilon_t)$.
  5. Set the new weights vector to be $w_{i, y}^{t+1} = w_{i, y}^t\, \beta_t^{(1/2)(1 + h_t(x_i, y_i) - h_t(x_i, y))}$ for $i = 1, \ldots, N$, $y \in Y - \{y_i\}$.
Output the hypothesis
  $$h_f(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} \left( \log \frac{1}{\beta_t} \right) h_t(x, y).$$

FIG. 4. A second multi-class extension of AdaBoost.
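The pseudo-loss computation and the weight bookkeeping of AdaBoost.M2 (Fig. 4) can be sketched as follows. The `weak_learn(xs, ys, D_t, q_t)` interface, the clamping of $\epsilon_t$, and the toy hypotheses `h_a` and `h_b` (each of which only rules out one incorrect label) are illustrative assumptions, not part of the paper.

```python
import math

def pseudo_loss(h, xs, ys, labels, D, q):
    """Expected pseudo-loss of h under distribution D and label weights q."""
    total = 0.0
    for i, (x, yi) in enumerate(zip(xs, ys)):
        wrong = sum(q[i][y] * h(x, y) for y in labels if y != yi)
        total += D[i] * (1.0 - h(x, yi) + wrong)
    return 0.5 * total

def adaboost_m2(xs, ys, labels, weak_learn, T, D=None):
    """Sketch of AdaBoost.M2; returns the final hypothesis h_f."""
    n, k = len(xs), len(labels)
    D = list(D) if D is not None else [1.0 / n] * n
    # w[i][y] = D(i)/(k-1) for every incorrect label y of example i
    w = [{y: D[i] / (k - 1) for y in labels if y != ys[i]} for i in range(n)]
    hyps, alphas = [], []
    for _ in range(T):
        W = [sum(wi.values()) for wi in w]                        # step 1
        q = [{y: wi[y] / Wi for y in wi} for wi, Wi in zip(w, W)]
        total = sum(W)
        Dt = [Wi / total for Wi in W]
        h = weak_learn(xs, ys, Dt, q)                             # step 2
        eps = pseudo_loss(h, xs, ys, labels, Dt, q)               # step 3
        eps = min(max(eps, 1e-10), 1 - 1e-10)  # assumption: avoid 0 or 1
        beta = eps / (1.0 - eps)                                  # step 4
        for i, (x, yi) in enumerate(zip(xs, ys)):                 # step 5
            for y in w[i]:
                w[i][y] *= beta ** (0.5 * (1.0 + h(x, yi) - h(x, y)))
        hyps.append(h)
        alphas.append(math.log(1.0 / beta))

    def h_f(x):
        return max(labels,
                   key=lambda y: sum(a * h(x, y) for a, h in zip(alphas, hyps)))
    return h_f

# Toy problem (hypothetical): the true label of x is x % 3.
labels = [0, 1, 2]
xs = list(range(6))
ys = [x % 3 for x in xs]
h_a = lambda x, y: 0.0 if y == (x + 1) % 3 else 1.0   # rules out one wrong label
h_b = lambda x, y: 0.0 if y == (x + 2) % 3 else 1.0   # rules out the other

def weak_learn(xs, ys, Dt, q):
    return min([h_a, h_b], key=lambda h: pseudo_loss(h, xs, ys, labels, Dt, q))

h_f = adaboost_m2(xs, ys, labels, weak_learn, T=2)
print([h_f(x) for x in xs])
```

Neither toy hypothesis identifies the correct label on its own, yet each has pseudo-loss below 1/2, which is the situation the pseudo-loss is designed to exploit.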
Suppose now that $h_f(x_i) \ne y_i$ for some example $i$. Then, by definition of $h_f$,

$$\sum_{t=1}^{T} \alpha_t h_t(x_i, y_i) \le \sum_{t=1}^{T} \alpha_t h_t(x_i, h_f(x_i)),$$

where $\alpha_t = \ln(1/\beta_t)$. This implies that

$$\sum_{t=1}^{T} \alpha_t \tilde{h}_t(i, h_f(x_i)) = \frac{1}{2} \sum_{t=1}^{T} \alpha_t (1 - h_t(x_i, y_i) + h_t(x_i, h_f(x_i))) \ge \frac{1}{2} \sum_{t=1}^{T} \alpha_t$$

so $\tilde{h}_f(i, h_f(x_i)) = 1$ by definition of $\tilde{h}_f$.

Therefore,

$$\Pr_{i \sim D}[h_f(x_i) \ne y_i] \le \Pr_{i \sim D}[\exists y \ne y_i : \tilde{h}_f(i, y) = 1].$$

Since all AdaBoost instances have a 0-label, and by definition of $\tilde{D}$, the error of $\tilde{h}_f$ is

$$\Pr_{(i, y) \sim \tilde{D}}[\tilde{h}_f(i, y) = 1] \ge \frac{1}{k - 1} \Pr_{i \sim D}[\exists y \ne y_i : \tilde{h}_f(i, y) = 1].$$

Applying Theorem 6 to bound the error of $\tilde{h}_f$, this completes the proof. ∎

Although we omit the details, the bound for AdaBoost.M2 can be improved by a factor of two in a manner similar to that described in Section 4.5.

5.3. Boosting Regression Algorithms

In this section we show how boosting can be used for a regression problem. In this setting, the label space is $Y = [0, 1]$. As before, the learner receives examples $(x, y)$ chosen at random according to some distribution $P$, and its goal is to find a hypothesis $h : X \to Y$ which, given some $x$ value, predicts approximately the value $y$ that is likely to be seen. More precisely, the learner attempts to find an $h$ with small mean squared error (MSE):

$$E_{(x, y) \sim P}[(h(x) - y)^2]. \qquad (25)$$

Our methods can be applied to any reasonable bounded error measure, but, for the sake of concreteness, we concentrate here on the squared error measure.

Following our approach for classification problems, we assume that the learner has been provided with a training set $(x_1, y_1), \ldots, (x_N, y_N)$ of examples distributed according to $P$, and we focus only on the minimization of the empirical MSE:

$$\frac{1}{N} \sum_{i=1}^{N} (h(x_i) - y_i)^2.$$

Using techniques similar to those outlined in Section 4.3, the true MSE given in Eq. (25) can be related to the empirical MSE.

To derive a boosting algorithm in this context, we reduce the given regression problem to a binary classification problem, and then apply AdaBoost. As was done for the reductions used in the proofs of Theorems 10 and 11, we mark with tildes all variables in the reduced (AdaBoost) space. For each example $(x_i, y_i)$ in the training set, we define a continuum of examples indexed by pairs $(i, y)$ for all $y \in [0, 1]$: the associated instance is $\tilde{x}_{i, y} = (x_i, y)$, and the label is $\tilde{y}_{i, y} = [\![ y \ge y_i ]\!]$. (Recall that $[\![ \pi ]\!]$ is 1 if predicate $\pi$ holds and 0 otherwise.) Although it is obviously infeasible to explicitly maintain an infinitely large training set, we will see later how this method can be implemented efficiently. Also, although the results of Section 4 only dealt with finitely large training sets, the extension to infinite training sets is straightforward.

Thus, informally, each instance $(x_i, y_i)$ is mapped to an infinite set of binary questions, one for each $y \in Y$, and each of the form: ``Is the correct label $y_i$ bigger or smaller than $y$?''

In a similar manner, each hypothesis $h : X \to Y$ is reduced to a binary-valued hypothesis $\tilde{h} : X \times Y \to \{0, 1\}$ defined by the rule

$$\tilde{h}(x, y) = [\![ y \ge h(x) ]\!].$$

Thus, $\tilde{h}$ attempts to answer these binary questions in a natural way using the estimated value $h(x)$.

Finally, as was done for classification problems, we assume we are given a distribution $D$ over the training set; ordinarily, this will be uniform so that $D(i) = 1/N$. In our reduction, this distribution is mapped to a density $\tilde{D}$ over pairs $(i, y)$ in such a way that minimization of classification error in the reduced space is equivalent to minimization of MSE for the original problem. To do this, we define

$$\tilde{D}(i, y) = \frac{D(i)\, |y - y_i|}{Z}$$

where $Z$ is a normalization constant:

$$Z = \sum_{i=1}^{N} D(i) \int_0^1 |y - y_i|\, dy.$$

It is straightforward to show that $1/4 \le Z \le 1/2$.
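The claim that classification error in the reduced space tracks the MSE of the original problem can be checked numerically. The sketch below computes $Z$ in closed form, using $\int_0^1 |y - y_i|\, dy = (y_i^2 + (1 - y_i)^2)/2$, and integrates the density $\tilde{D}$ over the region where $\tilde{h}$ and $\tilde{y}$ disagree with a midpoint rule. The function names, the grid size and the sample data are assumptions made for illustration.

```python
def z_normalizer(ys, D):
    """Z = sum_i D(i) * integral_0^1 |y - y_i| dy, in closed form."""
    return sum(d * (yi ** 2 + (1 - yi) ** 2) / 2 for d, yi in zip(D, ys))

def weighted_mse(preds, ys, D):
    """sum_i D(i) (h(x_i) - y_i)^2; the empirical MSE when D is uniform."""
    return sum(d * (p - yi) ** 2 for d, p, yi in zip(D, preds, ys))

def reduced_error(preds, ys, D, grid=2000):
    """Midpoint-rule estimate of the error of h~ under the density D~."""
    Z = z_normalizer(ys, D)
    err = 0.0
    for p, yi, d in zip(preds, ys, D):
        for j in range(grid):
            y = (j + 0.5) / grid
            # The reduced label [[y >= y_i]] and the reduced prediction
            # [[y >= h(x_i)]] disagree exactly between y_i and h(x_i).
            if (y >= yi) != (y >= p):
                err += d * abs(y - yi) / Z / grid
    return err

ys = [0.2, 0.9, 0.5]          # hypothetical labels
preds = [0.4, 0.6, 0.5]       # hypothetical predictions h(x_i)
D = [1.0 / 3] * 3
Z = z_normalizer(ys, D)
print(Z, reduced_error(preds, ys, D), weighted_mse(preds, ys, D) / (2 * Z))
```

The last two printed values should agree up to discretization error, reflecting that the reduced error equals the weighted squared error divided by $2Z$.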
[...]

$$= \frac{1}{Z} \sum_{i=1}^{N} D(i) \left| \int_{y_i}^{h(x_i)} |y - y_i|\, dy \right| = \frac{1}{2Z} \sum_{i=1}^{N} D(i) (h(x_i) - y_i)^2.$$

[...]

$$w_{i, y}^1 = \frac{D(i)\, |y - y_i|}{Z}$$

for $i = 1, \ldots, N$, $y \in Y$, where [...]
the final hypothesis $h_f$ output by AdaBoost.R is bounded above by

$$\epsilon \le 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t (1 - \epsilon_t)}. \qquad (26)$$

An unfortunate property of this setup is that there is no trivial way to generate a hypothesis whose loss is 1/2. This is a similar situation to the one we encountered with algorithm AdaBoost.M1. A remedy to this problem might be to allow weak hypotheses from a more general class of functions. One simple generalization is to allow for weak hypotheses that are defined by two functions: $h : X \to [0, 1]$ as before, and $\kappa : X \to [0, 1]$ which associates a measure of confidence to each prediction of $h$. The reduced hypothesis which we associate with this pair of functions is

$$\tilde{h}(x, y) = \begin{cases} (1 + \kappa(x))/2 & \text{if } h(x) \le y \\ (1 - \kappa(x))/2 & \text{otherwise.} \end{cases}$$

These hypotheses are used in the same way as the ones defined before and a slight variation of algorithm AdaBoost.R can be used to boost the accuracy of these more general weak learners (details omitted). The advantage of this variant is that any hypothesis for which $\kappa(x)$ is identically zero has pseudo-loss exactly 1/2 and slight deviations from this hypothesis can be used to encode very weak predictions.

The method presented in this section for boosting with square loss can be used with any reasonable bounded loss function $L : Y \times Y \to [0, 1]$. Here, $L(y', y)$ is a measure of the ``discrepancy'' between the observed label $y$ and a predicted label $y'$; for instance, above we used $L(y', y) = (y' - y)^2$. The goal of learning is to find a hypothesis $h$ with small average loss $E_{(x, y) \sim P}[L(h(x), y)]$. Assume, for any $y$, that $L(y, y) = 0$ and that $L(y', y)$ is differentiable with respect to $y'$, non-increasing for $y' \le y$ and non-decreasing for $y' \ge y$. Then, to modify AdaBoost.R to handle such a loss function, we need only replace $|y - y_i|$ in the initialization step with $|\partial L(y, y_i) / \partial y|$. The rest of the algorithm is unchanged, and the modifications needed for the analysis are straightforward.

APPENDIX: PROOF OF THEOREM 3

We start with a brief review of a framework used by Vovk [24], which is very similar to the framework used in Section 3. In this framework, an on-line decision problem consists of a decision space $\Delta$, an outcome space $\Omega$ and a loss function $\lambda : \Delta \times \Omega \to [0, \infty]$, which associates a loss to each decision and outcome pair. At each trial $t$ the learning algorithm receives the decisions $\epsilon_1^t, \ldots, \epsilon_N^t \in \Delta$ of $N$ experts, and then generates its own decision $\delta^t \in \Delta$. Upon receiving an outcome $\omega^t \in \Omega$, the learner and each expert $i$ incur loss $\lambda(\delta^t, \omega^t)$ and $\lambda(\epsilon_i^t, \omega^t)$, respectively. The goal of the learning algorithm is to generate decisions in such a way that its cumulative loss will not be much larger than the cumulative loss of the best expert. The following four properties are assumed to hold:

1. $\Delta$ is a compact topological space.
2. For each $\omega$, the function $\delta \mapsto \lambda(\delta, \omega)$ is continuous.
3. There exists $\delta$ such that, for all $\omega$, $\lambda(\delta, \omega) < \infty$.
4. There exists no $\delta$ such that, for all $\omega$, $\lambda(\delta, \omega) = 0$.

We now give Vovk's main result [24]. Let a decision problem defined by $\Omega$, $\Delta$ and $\lambda$ obey Assumptions 1-4. Let $c$ and $a$ be positive real numbers. We say that the decision problem is $(c, a)$-bounded if there exists an algorithm $A$ such that for any finite set of experts and for any finite sequence of trials, the cumulative loss of the algorithm is bounded by

$$\sum_{t=1}^{T} \lambda(\delta^t, \omega^t) \le c \min_i \sum_{t=1}^{T} \lambda(\epsilon_i^t, \omega^t) + a \ln N,$$

where $N$ is the number of experts.

We say that a distribution $D$ is simple if it is non-zero on a finite set denoted $\mathrm{dom}(D)$. Let $S$ be the set of simple distributions over $\Delta$. Vovk defines the following function $c : (0, 1) \to [0, \infty]$ which characterizes the hardness of any decision problem:

$$c(\beta) = \sup_{D \in S} \inf_{\delta \in \Delta} \sup_{\omega \in \Omega} \frac{\lambda(\delta, \omega)}{\log_\beta \sum_{\epsilon \in \mathrm{dom}(D)} \beta^{\lambda(\epsilon, \omega)} D(\epsilon)}. \qquad (27)$$

He then proves the following powerful theorem:

Theorem 13 (Vovk). A decision problem is $(c, a)$-bounded if and only if for all $\beta \in (0, 1)$, $c \ge c(\beta)$ or $a \ge c(\beta) / \ln(1/\beta)$.

Proof of Theorem 3. The proof consists of the following three steps: We first define a decision problem that conforms to Vovk's framework. We then show a lower bound on the function $c(\beta)$ for this problem. Finally, we show how any algorithm $A$ for the on-line allocation problem can be used to generate decisions in the defined problem, and so we get from Theorem 13 a lower bound on the worst case cumulative loss of $A$.

The decision problem is defined as follows. We fix an integer $K > 1$ and set $\Delta = S_K$ where $S_K$ is the $K$ dimensional simplex, i.e., $S_K = \{ x \in [0, 1]^K : \sum_{i=1}^{K} x_i = 1 \}$. We set $\Omega$ to be the set of unit vectors in $\mathbb{R}^K$, i.e., $\Omega = \{ e_1, \ldots, e_K \}$ where $e_i \in \{0, 1\}^K$ has a 1 in the $i$th component, and 0 in all other components. Finally, we define the loss function to be $\lambda(\delta, e_i) \doteq \delta \cdot e_i = \delta_i$. One can easily verify that these definitions conform to Assumptions 1-4.
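To make the decision problem concrete, the following sketch encodes decisions as points of the simplex, outcomes as indices of unit vectors, and checks the $(c, a)$ inequality on one finite sequence of trials. The function names and the sample sequence are hypothetical; checking one sequence only illustrates the inequality and does not establish that a problem is $(c, a)$-bounded.

```python
import math

def loss(decision, outcome_index):
    """lambda(delta, e_i) = delta . e_i = delta_i for delta on the simplex."""
    return decision[outcome_index]

def satisfies_bound(learner_decisions, expert_decisions, outcomes, c, a):
    """Check sum_t loss(learner) <= c * min_i sum_t loss(expert i) + a ln N.

    learner_decisions: one simplex point per trial
    expert_decisions:  per trial, a list with one simplex point per expert
    outcomes:          per trial, the index i of the unit vector e_i
    """
    n_experts = len(expert_decisions[0])
    learner_total = sum(loss(d, o) for d, o in zip(learner_decisions, outcomes))
    expert_totals = [
        sum(loss(trial[i], o) for trial, o in zip(expert_decisions, outcomes))
        for i in range(n_experts)
    ]
    return learner_total <= c * min(expert_totals) + a * math.log(n_experts)

# K = 2, two experts that always play the vertices e_0 and e_1, and a
# learner that always plays the uniform mixture of the two.
outcomes = [0, 0, 1]
experts = [[[1.0, 0.0], [0.0, 1.0]] for _ in outcomes]
learner = [[0.5, 0.5] for _ in outcomes]
print(satisfies_bound(learner, experts, outcomes, c=1.0, a=1.0))
print(satisfies_bound(learner, experts, outcomes, c=1.0, a=0.5))
```

Here the learner's cumulative loss is 1.5 and the best expert's is 1, so the inequality holds with $c = 1$, $a = 1$ but fails with $c = 1$, $a = 0.5$.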
To prove a lower bound on $c(\beta)$ for this decision problem we choose a particular simple distribution over the decision space $\Delta$. Let $D$ be the uniform distribution over the unit vectors, i.e., $\mathrm{dom}(D) = \{e_1, \ldots, e_K\}$. For this distribution, we can explicitly calculate

$$c(\beta) \ge \inf_{\delta \in \Delta} \sup_{\omega \in \Omega} \frac{\lambda(\delta, \omega)}{\log_\beta \sum_{\varepsilon \in \mathrm{dom}(D)} \beta^{\lambda(\varepsilon, \omega)} D(\varepsilon)}. \tag{28}$$

First, it is easy to see that the denominator in Eq. (28) is a constant, since for every outcome $\omega$ exactly one unit vector $\varepsilon$ has $\lambda(\varepsilon, \omega) = 1$ and the other $K - 1$ have loss 0:

$$\sum_{\varepsilon \in \mathrm{dom}(D)} \beta^{\lambda(\varepsilon, \omega)} D(\varepsilon) = \frac{\beta}{K} + \frac{K-1}{K}. \tag{29}$$

For any probability vector $\delta \in \Delta$, there must exist one component $i$ for which $\delta_i \ge 1/K$. Thus

$$\inf_{\delta \in \Delta} \sup_{\omega \in \Omega} \lambda(\delta, \omega) \ge \frac{1}{K}. \tag{30}$$

Combining Eqs. (28), (29), (30), we get that

$$c(\beta) \ge \frac{\ln(1/\beta)}{-K \ln\left(1 - \frac{1-\beta}{K}\right)}. \tag{31}$$

We now show how an on-line allocation algorithm $A$ can be used as a subroutine for solving this decision problem. We match each of the $N$ experts of the decision problem with a strategy of the allocation problem. Each iteration $t$ of the decision problem proceeds as follows.

1. Each of the $N$ experts generates a decision $\varepsilon_i^t \in S_K$.
2. The algorithm $A$ generates a distribution $p^t \in S_N$.
3. The learner chooses the decision $\delta^t = \sum_{i=1}^{N} p_i^t \varepsilon_i^t$.
4. The outcome $\omega^t \in \Omega$ is generated.
5. The learner incurs loss $\delta^t \cdot \omega^t$, and each expert suffers loss $\varepsilon_i^t \cdot \omega^t$.
6. Algorithm $A$ receives the loss vector $\ell^t$ where $\ell_i^t = \varepsilon_i^t \cdot \omega^t$, and incurs loss

$$p^t \cdot \ell^t = \sum_{i=1}^{N} p_i^t \left(\varepsilon_i^t \cdot \omega^t\right) = \left(\sum_{i=1}^{N} p_i^t \varepsilon_i^t\right) \cdot \omega^t = \delta^t \cdot \omega^t.$$

Observe that the loss incurred by the learner in the decision problem is equal to the loss incurred by $A$. Thus, if for algorithm $A$ we have an upper bound of the form

$$L_A \le c \min_i L_i + a \ln N,$$

then the decision problem is $(c, a)$-bounded. On the other hand, using the lower bound given by Theorem 13 and the lower bound on $c(\beta)$ given in Eq. (31), we get that for any $K$ and any $\beta$, either

$$c \ge \frac{\ln(1/\beta)}{-K \ln\left(1 - \frac{1-\beta}{K}\right)} \quad \text{or} \quad a \ge \frac{1}{-K \ln\left(1 - \frac{1-\beta}{K}\right)}. \tag{32}$$

As $K$ is a free parameter we can let $K \to \infty$, and the denominators in Eq. (32) become $1 - \beta$, which gives the statement of the theorem. ∎

ACKNOWLEDGMENTS

Thanks are due to Corinna Cortes, Harris Drucker, David Helmbold, Keith Messer, Volodya Vovk, and Manfred Warmuth for helpful discussions.

REFERENCES

1. E. B. Baum and D. Haussler, What size net gives valid generalization?, in ``Adv. Neural Inform. Process. Systems I,'' pp. 81–90, Morgan Kaufmann, 1989.
2. D. Blackwell, An analog of the minimax theorem for vector payoffs, Pacific J. Math. 6, No. 1 (Spring 1956), 1–8.
3. L. Breiman, Bias, variance, and arcing classifiers, Technical Report 460, Statistics Dept., University of California, 1996; available from ftp://ftp.stat.berkeley.edu/pub/users/breiman/arcall.ps.Z.
4. N. Cesa-Bianchi, Y. Freund, D. P. Helmbold, D. Haussler, R. E. Schapire, and M. K. Warmuth, How to use expert advice, in ``Proceedings of the Twenty-Fifth Annual ACM Symposium on the Theory of Computing, 1993,'' pp. 382–391.
5. T. H. Chung, Approximate methods for sequential decision making using expert advice, in ``Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, 1994,'' pp. 183–189.
6. T. M. Cover, Universal portfolios, Math. Finance 1, No. 1 (Jan. 1991), 1–29.
7. T. G. Dietterich and G. Bakiri, Solving multiclass learning problems via error-correcting output codes, J. Artif. Intell. Res. 2 (January 1995), 263–286.
8. H. Drucker and C. Cortes, Boosting decision trees, Adv. Neural Inform. Process. Systems 8 (1996).
9. H. Drucker, R. Schapire, and P. Simard, Boosting performance in neural networks, Int. J. Pattern Recognition Artif. Intell. 7, No. 4 (1993), 705–719.
10. Y. Freund, ``Data Filtering and Distribution Modeling Algorithms for Machine Learning,'' Ph.D. thesis, University of California at Santa Cruz, 1993; retrievable from ftp.cse.ucsc.edu/pub/tr/ucsc-crl-93-37.ps.Z.
11. Y. Freund, Boosting a weak learning algorithm by majority, Inform. and Comput. 121, No. 2 (September 1995), 256–285; an extended abstract appeared in ``Proceedings of the Third Annual Workshop on Computational Learning Theory, 1990.''
12. Y. Freund and R. E. Schapire, Experiments with a new boosting algorithm, in ``Machine Learning: Proceedings of the Thirteenth International Conference, 1996,'' pp. 148–156.
13. Y. Freund and R. E. Schapire, Game theory, on-line prediction and boosting, in ``Proceedings of the Ninth Annual Conference on Computational Learning Theory, 1996,'' pp. 325–332.
14. J. Hannan, Approximation to Bayes risk in repeated play, in ``Contributions to the Theory of Games,'' Vol. III (M. Dresher, A. W. Tucker, and P. Wolfe, Eds.), pp. 97–139, Princeton Univ. Press, Princeton, NJ, 1957.
15. D. Haussler, J. Kivinen, and M. K. Warmuth, Tight worst-case loss bounds for predicting with expert advice, in ``Computational Learning Theory: Second European Conference, EuroCOLT '95,'' pp. 69–83, Springer-Verlag, New York/Berlin, 1995.
16. J. C. Jackson and M. W. Craven, Learning sparse perceptrons, Adv. Neural Inform. Process. Systems 8 (1996).
17. M. Kearns, Y. Mansour, A. Y. Ng, and D. Ron, An experimental and theoretical comparison of model selection methods, in ``Proceedings of the Eighth Annual Conference on Computational Learning Theory, 1995.''
18. M. J. Kearns and U. V. Vazirani, ``An Introduction to Computational Learning Theory,'' MIT Press, Cambridge, MA, 1994.
19. J. Kivinen and M. K. Warmuth, Using experts for predicting continuous outcomes, in ``Computational Learning Theory: EuroCOLT '93,'' pp. 109–120, Springer-Verlag, New York/Berlin, 1994.
20. N. Littlestone and M. K. Warmuth, The weighted majority algorithm, Inform. and Comput. 108 (1994), 212–261.
21. J. R. Quinlan, Bagging, boosting, and C4.5, in ``Proceedings, Fourteenth National Conference on Artificial Intelligence, 1996.''
22. R. E. Schapire, The strength of weak learnability, Machine Learning 5, No. 2 (1990), 197–227.
23. V. N. Vapnik, ``Estimation of Dependences Based on Empirical Data,'' Springer-Verlag, New York/Berlin, 1982.
24. V. G. Vovk, A game of prediction with expert advice, in ``Proceedings of the Eighth Annual Conference on Computational Learning Theory, 1995.''
25. V. G. Vovk, Aggregating strategies, in ``Proceedings of the Third Annual Workshop on Computational Learning Theory, 1990,'' pp. 321–383.
26. R. S. Wenocur and R. M. Dudley, Some special Vapnik–Chervonenkis classes, Discrete Mathematics 33 (1981), 313–318.