Improvements on Cross-Validation:
The .632 + Bootstrap Method
Bradley EFRON and Robert TIBSHIRANI

A training set of data has been used to construct a rule for predicting future responses. What is the error rate of this rule? This is an important question both for comparing models and for assessing a final selected model. The traditional answer to this question is given by cross-validation. The cross-validation estimate of prediction error is nearly unbiased but can be highly variable. Here we discuss bootstrap estimates of prediction error, which can be thought of as smoothed versions of cross-validation. We show that a particular bootstrap method, the .632+ rule, substantially outperforms cross-validation in a catalog of 24 simulation experiments. Besides providing point estimates, we also consider estimating the variability of an error rate estimate. All of the results here are nonparametric and apply to any possible prediction rule; however, we study only classification problems with 0-1 loss in detail. Our simulations include "smooth" prediction rules like Fisher's linear discriminant function and unsmooth ones like nearest neighbors.

KEY WORDS: Classification; Cross-validation; Bootstrap; Prediction rule.

1. INTRODUCTION

This article concerns estimating the error rate of a prediction rule constructed from a training set of data. The training set x = (x_1, x_2, ..., x_n) consists of n observations x_i = (t_i, y_i), with t_i the predictor or feature vector and y_i the response. On the basis of x, the statistician constructs a prediction rule r_x(t) and wishes to estimate the error rate of this rule when it is used to predict future responses from their predictor vectors.

Cross-validation, the traditional method of choice for this problem, provides a nearly unbiased estimate of the future error rate. However, the low bias of cross-validation is often paid for by high variability. Here we show that suitably defined bootstrap procedures can substantially reduce the variability of error rate predictions. The gain in efficiency in our catalog of simulations is roughly equivalent to a 60% increase in the size of the training set. The bootstrap procedures are nothing more than smoothed versions of cross-validation, with some adjustments made to correct for bias.

We are interested mainly in the situation when the response is dichotomous. This is illustrated in Figure 1, where the n = 20 observations x_i = (t_i, y_i) in the training set x each consist of a bivariate feature vector t_i and a 0-1 response y_i; 12 of the points are labeled 0 and 8 are labeled 1. Two different prediction rules are indicated. Figure 1a shows the prediction rule based on Fisher's linear discriminant function (LDF), following Efron (1983). The rule r_x(t) will predict y = 0 if t lies to the lower left of the LDF boundary and will predict y = 1 if t lies to the upper right of the LDF boundary. Figure 1b shows the nearest-neighbor (NN) rule, in which future t vectors will have y predicted according to the label of the nearest observation in the training set. We wish to estimate the error rates of the two prediction rules.

The data shown in Figure 1 were generated as part of the extensive simulation experiments described in Section 4. In this case the y_i were selected randomly and the t_i were bivariate normal vectors whose means depended on y_i,

    y_i = 0 or 1, each with probability 1/2;    t_i | y_i ~ N_2((y_i - .5, 0), I),    (1)

independently for i = 1, 2, ..., n = 20.

Table 1 shows results from the simulations. Cross-validation is compared to the bootstrap-based estimator .632+ described in Section 3. Cross-validation is nearly unbiased as an estimator of the true error rate for both rules, LDF and NN, but the bootstrap-based estimator has a root mean squared (RMS) error only 80% as large. These results are fairly typical of the 24 simulation experiments reported in Section 4. The bootstrap estimator in these experiments was run with only 50 bootstrap replications per training set, but this turns out to be sufficient for most purposes, as the internal variance calculations of Section 2 show.

The bootstrap has other important advantages besides providing more accurate point estimates for prediction error. The bootstrap replications also provide a direct assessment of variability for estimated parameters in the prediction rule. For example, Efron and Gong (1983) discussed the stability of the "significant" predictor variables chosen by a complicated stepwise logistic regression program. Section 5 provides another use for the bootstrap replications: to estimate the variance of a point estimate of prediction error.

Section 2 begins with a discussion of bootstrap smoothing, a general approach to reducing the variability of nonparametric point estimators. When applied to the prediction problem, bootstrap smoothing gives a smoothed version of cross-validation with considerably reduced variability but with an upward bias. A bias correction discussed in Section 3 results in the .632+ estimates of Table 1. The .632+ estimator is shown to substantially outperform ordinary cross-validation in the catalog of 24 sampling experiments described in Section 4. Section 5 shows how the same bootstrap replications that provide a point estimate of prediction error can also provide an assessment of variability for that estimate. Section 6 presents the distance argument underlying the .632+ rule, along with other bias-correction techniques.

All of the results here are nonparametric and apply to any possible prediction rule; however, we study only classification problems with 0-1 loss in detail. Regression problems may exhibit qualitatively different behavior, and the statistical approach may also differ. "In-sample" prediction error is often the focus in regression, especially for model selection. In contrast, the error that we study here might be called "extra-sample" error. Efron (1986) studied estimates of the in-sample prediction error problem, including generalized cross-validation (Wahba 1980) and the C_p statistic of Mallows (1973). The note at the end of Section 2 clarifies the distinction between extra-sample and in-sample prediction errors.

Considerable work has been done in the literature on cross-validation and the bootstrap for error rate estimation. (A good general discussion can be found in McLachlan 1992; key references for cross-validation are Allen 1974 and Stone 1974, 1977.) Efron (1983) proposed a number of bootstrap estimates of prediction error, including the optimism and .632 estimates. The use of cross-validation and the bootstrap for model selection was studied by Breiman (1992), Breiman and Spector (1992), Shao (1993), and Zhang (1993). Breiman and Spector demonstrated that leave-one-out cross-validation has high variance if the prediction rule is unstable, because the leave-one-out training sets are too similar to the full dataset. Fivefold or tenfold cross-validation displayed lower variance in this case. A study of cross-validation and bootstrap methods for tree-structured models was carried out by Crawford (1989). Substantial work has also been done on the prediction error problem in the machine learning and pattern recognition fields; see, for example, the simulation studies of Chernick, Murthy, and Nealy (1985, 1986) and Jain, Dubes, and Chen (1987). Kohavi (1995) performed a particularly interesting study that renewed our interest in this problem.

Bradley Efron is Professor, Department of Statistics, Stanford University, Stanford, CA 94305. Robert Tibshirani is Professor, Department of Preventive Medicine and Biostatistics, University of Toronto, Toronto, Ontario M5S 1A8, Canada. The authors thank Ronny Kohavi for reviving their interest in this problem, Jerry Friedman, Trevor Hastie, and Richard Olshen for enjoyable and fruitful discussions, and the associate editor and referee for helpful comments. Tibshirani was supported by the Natural Sciences and Engineering Research Council of Canada.

© 1997 American Statistical Association
Journal of the American Statistical Association
June 1997, Vol. 92, No. 438, Theory and Methods
Figure 1. A Training Set Consisting of n = 20 Observations, 12 Labelled 0 and 8 Labelled 1. (a) The linear discriminant function predicts 0 to lower left of the solid line and 1 to upper right; (b) the nearest-neighbor rule predicts 1 in the three indicated islands and 0 elsewhere.

Table 1. Error Rate Estimation for Situation (1)

                LDF              NN
             Exp    RMS       Exp    RMS
   CV1      .362   .123      .419   .123
   .632+    .357   .096      .380   .099
   True     .357             .418

NOTE: CV1 is the cross-validation estimate based on omitting one observation at a time from the training set; .632+ is the bootstrap-based estimator described in Section 3. The table shows the expectation and RMS error of the two estimates for both the LDF and NN prediction rules. In both cases, RMS(.632+)/RMS(CV1) is about 80%.

2. CROSS-VALIDATION AND THE LEAVE-ONE-OUT BOOTSTRAP

This section discusses a bootstrap smoothing of cross-validation that reduces the variability of error-rate estimates. Here the notation Q[y, r] indicates the discrepancy between a predicted value r and the actual response y. We are particularly interested in the dichotomous situation where both y and r are either 0 or 1, with

    Q[y, r] = 0 if r = y,  and  Q[y, r] = 1 if r ≠ y.    (2)

We also use the shorter notation

    Q(x_0, x) = Q[y_0, r_x(t_0)]    (3)

to indicate the discrepancy between the predicted value and response for a test point x_0 = (t_0, y_0) when using the rule r_x based on training set x.

Suppose that the observations x_i = (t_i, y_i) in the training set are a random sample from some distribution F,

    x_i ~ F  independently for  i = 1, 2, ..., n,    (4)

and that x_0 = (t_0, y_0) is another independent draw from F, called a test point. The true error rate of the rule r_x is

    Err = Err(x, F) = E_{0F} Q(x_0, x) = E_{0F} Q[y_0, r_x(t_0)],    (5)

with the notation E_{0F} indicating that only x_0 = (t_0, y_0) is random in (5), with x and r_x being fixed. Thus Err is the conditional error rate, conditional on the training set x.

We compare error rate estimators in terms of their ability to predict Err. Section 4 briefly discusses estimating instead the expected true error,

    μ = μ(F) = E_F{Err} = E_F E_{0F} Q(x_0, x).    (6)

The results in this case are somewhat more favorable to the bootstrap estimator. Note, however, that although the conditional error rate is often what we would like to obtain, none of the methods correlates very well with it on a sample-by-sample basis (see Zhang 1995).

The apparent error rate (or resubstitution rate) is

    err = E_{0F̂} Q(x_0, x) = (1/n) Σ_{i=1}^n Q[y_i, r_x(t_i)],    (7)

with F̂ indicating the empirical distribution that puts probability 1/n on each observation x_1, x_2, ..., x_n; err tends to be biased downward as an estimate of Err because the training set x has been used twice, both to construct the rule and to test it.
Cross-validation (Geisser 1975; Stone 1974) avoids this problem by removing the data point to be predicted from the training set. The ordinary cross-validation estimate of prediction error is

    Err^(cv1) = (1/n) Σ_{i=1}^n Q[y_i, r_{x_(i)}(t_i)],    (8)

where x_(i) is the training set with the ith observation removed. Err^(cv1) is leave-one-out cross-validation; the k-fold version Err^(cvk) partitions the training set into k parts, predicting in turn the observations in each part from the training sample formed from all of the remaining parts.

The statistic Err^(cv1) is a discontinuous function of the training set x when Q[y, r] itself is discontinuous, as in (2). Bootstrap smoothing is a way to reduce the variance of such functions by averaging. Suppose that Z(x) is an unbiased estimate of a parameter of interest, say

    ζ(F) = E_F{Z(x)}.    (9)

By definition, the nonparametric maximum likelihood estimate (MLE) of the same parameter ζ is

    ζ̂ = ζ(F̂) = E_{F̂}{Z(x*)}.    (10)

Here x* is a random sample of size n drawn with replacement from F̂,

    x* = (x*_1, x*_2, ..., x*_n),   x*_i ~ F̂,    (11)

that is, a bootstrap sample. The bootstrap expectation in (10) smooths out discontinuities in Z(x), usually reducing its variability. However, ζ̂ may now be biased as an estimate of ζ. Breiman (1994) introduced a very similar idea under the sobriquet "bagging."

Now consider applying bootstrap smoothing to Z_i(x) = Q(x_i, x_(i)), with x_i fixed. The nonparametric MLE of E_F Z_i(x) is ζ̂_i = E_{F̂_(i)}{Q(x_i, x*_(i))}, where x*_(i) is a bootstrap sample from the empirical distribution on x_(i),

    F̂_(i): probability 1/(n - 1) on each of x_1, x_2, ..., x_{i-1}, x_{i+1}, ..., x_n.    (12)

It might be argued that the bootstrap samples x*_(i) should be of size n - 1 instead of n, but there is no advantage to this. In what follows we take bootstrap samples from F̂_(i) to be of size n and indicate them by x* rather than x*_(i), so

    ζ̂_i = E_{F̂_(i)}{Q(x_i, x*)}.    (13)

Notice that an x* sample drawn from F̂_(i) never contains the point x_i.

Applying (11) and (12) to each case i in turn leads to the leave-one-out bootstrap,

    Err^(1) = (1/n) Σ_{i=1}^n E_{F̂_(i)}{Q(x_i, x*)},    (14)

a smoothed version of Err^(cv1). This estimate predicts the error at point i only from bootstrap samples that do not contain the point i.

The actual calculation of Err^(1) is a straightforward bootstrap exercise. Ordinary bootstrap samples x* = (x*_1, x*_2, ..., x*_n) are generated as in (11), so x* is a random draw of size n, with replacement, from {x_1, x_2, ..., x_n}. A total of B such samples are independently drawn, say x*1, x*2, ..., x*B, with B = 50 in our simulations, as discussed later. Let N_i^b be the number of times that x_i is included in the bth bootstrap sample, and define

    I_i^b = 1 if N_i^b = 0,  and  I_i^b = 0 if N_i^b > 0.    (15)

Also define

    Q_i^b = Q[y_i, r_{x*b}(t_i)],    (16)

the discrepancy for point x_i when it is predicted by the rule based on the bth bootstrap sample. Then

    Err^(1) = (1/n) Σ_{i=1}^n Ê_i,   where   Ê_i = Σ_b I_i^b Q_i^b / Σ_b I_i^b.    (17)

This definition agrees with (14) because a bootstrap sample that has I_i^b = 1 is the same as a bootstrap sample from F̂_(i) (see Efron 1992). A slightly different definition was given by Efron (1983), namely Σ_i Σ_b I_i^b Q_i^b / Σ_i Σ_b I_i^b, but the two definitions agree as B → ∞ and produced nearly the same results in our simulations.

There is another way to view cross-validation and Err^(1): as estimates of the average error μ(F). Direct application of the bootstrap gives the plug-in estimate μ(F̂) = E_{F̂} E_{0F̂} Q(x_0, x). This estimate, discussed in Section 6, tends to be biased downward. The reason is that F̂ is being used twice: as the population, say F_0, from which bootstrap training sets x* are drawn, and as the population F_1 from which test points x_0 are drawn. Let us write μ(F) explicitly as a function of both F_1 and F_0:

    μ(F_1, F_0) = E_{F_1}{E_{0F_0} Q[y_0, r_X(T_0)]} = E_{0F_0}{E_{F_1} Q[y_0, r_X(T_0)]},    (18)

where, for convenience, we have switched the order of expectation in the second expression. We assume that in the unknown true state of affairs, F_1 = F_0 = F. Plugging in F̂ for the test distribution F_0 gives

    μ(F_1, F̂) = (1/n) Σ_{i=1}^n E_{F_1} Q[Y_i, r_X(T_i)].    (19)

The remaining task is to estimate the training sample distribution F_1. Ideally, we would take F_1 = F. Notice that for continuous populations F, the probability of the test point X_0 = x_i appearing in a training sample drawn from F_1 = F is 0. The plug-in estimate μ(F̂) = μ(F̂, F̂) uses F̂ for F_1. With this choice, the probability that X_0 = x_i appears in the training sample is 1 - (1 - 1/n)^n ≈ .632. Hence μ(F̂) uses training samples that are too close to the test points, leading to potential underestimation of the error rate. Cross-validation uses the leave-one-out training samples to ensure that the training samples do not contain the test point; that is, cross-validation estimates E_{F_1} Q[Y_i, r_X(T_i)] by

    Ê_{F_1} Q[Y_i, r_X(T_i)] = Q[Y_i, r_{X_(i)}(T_i)].    (20)

On the other hand, using the estimate F̂_1 = F̂_(i) in each term E_{F_1} Q[Y_i, r_X(T_i)] gives the leave-one-out bootstrap estimate Err^(1).
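The following is a minimal Python sketch of the calculation in (15)-(17), assuming a generic classifier supplied as a function fit_predict(X_train, y_train, X_test) (a stand-in of our own for the rule r_x) and 0-1 loss as in (2).

```python
import numpy as np

def err_loo_bootstrap(X, y, fit_predict, B=50, rng=None):
    """Leave-one-out bootstrap estimate Err^(1), eqs. (15)-(17), for 0-1 loss."""
    rng = np.random.default_rng(rng)
    n = len(y)
    loss_sum = np.zeros(n)    # running sum over b of I_i^b * Q_i^b
    out_count = np.zeros(n)   # running sum over b of I_i^b
    for _ in range(B):
        idx = rng.integers(0, n, size=n)     # bootstrap sample x*^b, as in (11)
        left_out = np.ones(n, dtype=bool)
        left_out[idx] = False                # I_i^b = 1 exactly when N_i^b = 0
        if not left_out.any():
            continue
        pred = fit_predict(X[idx], y[idx], X[left_out])
        loss_sum[left_out] += (pred != y[left_out])   # Q_i^b for the left-out points
        out_count[left_out] += 1
    used = out_count > 0      # points never left out contribute no E_i (rare for B = 50)
    return np.mean(loss_sum[used] / out_count[used])  # Err^(1), eq. (17)
```

With B = 50, each point is left out of roughly e^{-1} ≈ .368 of the samples, so about 18 replications per point contribute to Ê_i.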
Figure 2. Ratio of Standard Deviations and Expectations for the Leave-One-Out Bootstrap Err^(1) Compared to Cross-Validation Err^(cv1), for the 24 sampling experiments described in Section 4, plotted versus expected true error μ, (6). The median SD ratio for the 24 experiments was .79; the median expectation ratio was 1.07.

The efficacy of bootstrap smoothing is shown in Figure 2, where the solid line plots the standard deviation ratio for the 24 sampling experiments of Section 4. The horizontal axis is the expected true error μ for each experiment, (6). We see that Err^(1) always has smaller standard deviation than Err^(cv1), with the median ratio over the 24 experiments being .79. Going from Err^(cv1) to Err^(1) is roughly equivalent to multiplying the size n of the training set by 1/.79² = 1.60. The improvement of the bootstrap estimators over cross-validation is due mainly to the effect of smoothing. Cross-validation and the bootstrap are closely related, as Efron (1983, Sec. 2) has shown. In smoother prediction problems (for example, when y and r are continuous and Q[y, r] = (y - r)²), we would expect to see little difference between Err^(cv1) and Err^(1).

The dotted curve in Figure 2 is the expectation ratio E{Err^(1)}/E{Err^(cv1)}. We see that Err^(1) is biased upward relative to the nearly unbiased estimate Err^(cv1). This is not surprising. Let μ_n indicate the expected true error (6) when x has sample size n. Ordinary cross-validation produces an unbiased estimate of μ_{n-1}, whereas k-fold cross-validation estimates μ_{n-k}. Because smaller training sets produce bigger prediction errors, larger k gives bigger upward bias μ_{n-k} - μ_n. The amount of bias depends on the slope of the error curve μ_n at sample size n. Bootstrap samples are typically supported on about .632n of the original sample points, so we might expect Err^(1) to be estimating μ_{.632n}. The more precise calculations of Efron (1983, Sec. 8) show that Err^(1) closely agrees with half-sample cross-validation (where x is repeatedly split into equal-sized training and test sets), and that the expectation of Err^(1) is μ_{n/2} to second order. The next section concerns a bias-corrected version of Err^(1), called the .632+ rule, that reduces the upward bias.

The choice of B = 50 bootstrap replications in our simulation experiments was based on an assessment of internal error, the Monte Carlo error due to using B instead of infinitely many replications (Efron 1992). The same bootstrap replications that give Err^(1) also give a jackknife estimate of its internal error. Let q_i^b = I_i^b Q_i^b, q_i^+ = Σ_{b=1}^B q_i^b, and I_i^+ = Σ_{b=1}^B I_i^b. Then the estimate of Err^(1) with the bth bootstrap replication removed is

    Err^(1)_(b) = (1/n) Σ_{i=1}^n Ê_{i(b)},   Ê_{i(b)} = (q_i^+ - q_i^b)/(I_i^+ - I_i^b).    (21)

The jackknife estimate of internal standard deviation for Err^(1) is then

    SD_int = [((B - 1)/B) Σ_{b=1}^B (Err^(1)_(b) - Err^(1)_(·))²]^{1/2},   Err^(1)_(·) = Σ_b Err^(1)_(b)/B.    (22)

In our simulations SD_int was typically about .02 for B = 50. The external standard deviation of Err^(1) (i.e., the standard deviation due to randomness of x) was typically .10. (Sec. 5 discusses the estimation of external error, also using the same set of bootstrap replications.) This gives corrected external standard deviation [.10² - .02²]^{1/2} = .098, indicating that B = 50 is sufficient here.
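A minimal sketch of the internal-error calculation just described, assuming the B × n arrays of indicators I_i^b and losses Q_i^b from (15)-(16) have been stored (the array and function names are ours):

```python
import numpy as np

def internal_sd(I, Q):
    """Jackknife estimate of the internal (Monte Carlo) SD of Err^(1), eqs. (21)-(22).
    I and Q are B x n arrays holding I_i^b and Q_i^b."""
    B, n = I.shape
    q = I * Q                                         # q_i^b = I_i^b Q_i^b
    q_plus, I_plus = q.sum(axis=0), I.sum(axis=0)
    err_b = np.empty(B)
    with np.errstate(invalid="ignore", divide="ignore"):
        for b in range(B):
            E_b = (q_plus - q[b]) / (I_plus - I[b])   # E_i with replication b removed
            err_b[b] = np.nanmean(E_b)                # points never left out are skipped
    return np.sqrt((B - 1) / B * np.sum((err_b - err_b.mean()) ** 2))
```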
Note that definition (5) of prediction error, Err = E_{0F} Q[y_0, r_x(t_0)], might be called extra-sample error, because the test point (t_0, y_0) is chosen randomly from F without reference to the training sample x. Efron (1986) investigated a more restrictive definition of prediction error. For dichotomous problems, let π(t) = Pr_F{y = 1 | t} and π_i = π(t_i). The in-sample error of a rule r_x is defined to be

    (1/n) Σ_{i=1}^n E_{0π_i}{Q[y_{0i}, r_x(t_i)]},    (23)

where the notation E_{0π_i} indicates that only y_{0i} ~ binomial(1, π_i) is random, with x and r_x(t_i) being fixed. This situation is similar to a standard regression problem in that the predictors t_i are treated as fixed at their observed values, rather than as random. In-sample error prediction is mathematically simpler than the extra-sample case and leads to quite different solutions for the error rate prediction problem (see Efron 1986).
3. THE .632+ ESTIMATOR

Efron (1983) proposed the .632 estimator,

    Err^(.632) = .368 · err + .632 · Err^(1),    (24)

designed to correct the upward bias in Err^(1) by averaging it with the downwardly biased estimate err. The coefficients .368 = e^{-1} and .632 were suggested by an argument based on the fact that bootstrap samples are supported on approximately .632n of the original data points. In Efron's article Err^(.632) performed better than all competitors, but the simulation studies did not include highly overfit rules like nearest-neighbors, where err = 0. Such statistics make Err^(.632) itself downwardly biased. For example, if y equals 0 or 1 with probability 1/2, independently of the (useless) predictor vector t, then Err = .50 for any prediction rule, but the expected value of Err^(.632) for the nearest-neighbor rule is .632 · .5 = .316. Both Err^(1) and Err^(cv1) have the correct expectation .50 in this case. Breiman, Friedman, Olshen, and Stone (1984) suggested this example.

This section proposes a new estimator Err^(.632+), designed to be a less-biased compromise between err and Err^(1). The .632+ rule puts greater weight on Err^(1) in situations where the amount of overfitting, as measured by Err^(1) - err, is large. To correctly scale the amount of overfitting, we first need to define the no-information error rate, γ, that would apply if t and y were independent, as in the example of the previous paragraph.

Let F_ind be the probability distribution on points x = (t, y) having the same t and y marginals as F, but with y independent of t. As in (5), define

    γ = E_{0F_ind} Q(x_0, x) = E_{0F_ind} Q[y_0, r_x(t_0)],    (25)

the expected prediction error for rule r_x given a test point x_0 = (t_0, y_0) from F_ind. An estimate of γ is obtained by permuting the responses y_i and predictors t_j,

    γ̂ = Σ_{i=1}^n Σ_{j=1}^n Q[y_i, r_x(t_j)] / n².    (26)

For the dichotomous classification problem (2), let p̂_1 be the observed proportion of responses y_i equalling 1, and let q̂_1 be the observed proportion of predictions r_x(t_j) equalling 1. Then

    γ̂ = p̂_1 (1 - q̂_1) + (1 - p̂_1) q̂_1.    (27)

With a rule like nearest-neighbors for which q̂_1 = p̂_1, the value of γ̂ is 2 p̂_1 (1 - p̂_1). The multicategory generalization of (27) is γ̂ = Σ_l p̂_l (1 - q̂_l).

The relative overfitting rate is defined as

    R̂ = (Err^(1) - err) / (γ̂ - err),    (28)

a quantity that ranges from 0 with no overfitting (Err^(1) = err) to 1 with overfitting equal to the no-information value γ̂ - err. The "distance" argument of Section 6 suggests a less-biased version of (24) where the weights on err and Err^(1) depend on R̂,

    Err^(.632+) = (1 - ŵ) · err + ŵ · Err^(1),   ŵ = .632 / (1 - .368 R̂).    (29)

The weight ŵ ranges from .632 if R̂ = 0 to 1 if R̂ = 1, so Err^(.632+) ranges from Err^(.632) to Err^(1). We can also express (29) as

    Err^(.632+) = Err^(.632) + (Err^(1) - err) · (.368 · .632 · R̂) / (1 - .368 R̂),    (30)

emphasizing that Err^(.632+) exceeds Err^(.632) by an amount depending on R̂.

It may happen that γ̂ ≤ err or err < γ̂ ≤ Err^(1), in which case R̂ can fall outside of [0, 1]. To account for this possibility, we modify the definitions of Err^(1) and R̂:

    Err^(1)' = min(Err^(1), γ̂)

and

    R̂' = (Err^(1) - err)/(γ̂ - err) if Err^(1), γ̂ > err;   R̂' = 0 otherwise.    (31)

The .632+ rule used in the simulation experiments of Section 4 was

    Err^(.632+) = Err^(.632) + (Err^(1)' - err) · (.368 · .632 · R̂') / (1 - .368 R̂').    (32)
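Operationally, (24)-(32) can be computed as in the following Python sketch, assuming the resubstitution predictions r_x(t_i) and a previously computed Err^(1) are available; the function name and argument conventions are ours.

```python
import numpy as np

def err_632_plus(y, y_pred_resub, err1):
    """The .632 and .632+ estimators, eqs. (24) and (26)-(32), for 0-1 loss."""
    y = np.asarray(y)
    y_pred = np.asarray(y_pred_resub)
    err_bar = np.mean(y_pred != y)                   # apparent error, eq. (7)
    # no-information error rate gamma-hat: multicategory form of eq. (27)
    classes = np.unique(np.concatenate([y, y_pred]))
    p_hat = np.array([np.mean(y == c) for c in classes])
    q_hat = np.array([np.mean(y_pred == c) for c in classes])
    gamma = np.sum(p_hat * (1 - q_hat))
    err632 = 0.368 * err_bar + 0.632 * err1          # eq. (24)
    err1_prime = min(err1, gamma)                    # eq. (31)
    if err1 > err_bar and gamma > err_bar:
        R = (err1 - err_bar) / (gamma - err_bar)     # relative overfitting rate, eqs. (28), (31)
    else:
        R = 0.0
    # eq. (32)
    return err632 + (err1_prime - err_bar) * 0.368 * 0.632 * R / (1 - 0.368 * R)
```

For a heavily overfit rule with err = 0 and Err^(1) near γ̂, R̂' is near 1 and the weight on Err^(1) approaches 1, which is the behavior motivating the rule.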

Table 2. The 24 Sampling Experiments Described in Text

Experiment    n    p   Class    μ     Rule    F

1            14    5     2     .24    LDF     Normal_5 ± (1, 0, 0, 0, 0)
2            14    5     2     .50    LDF     Normal_5 ± (0, 0, 0, 0, 0)
3*           20    2     2     .34    LDF     Normal_2 ± (.5, 0)
4            20    2     2     .50    LDF     Normal_2 ± (0, 0)
5            14    5     2     .28    NN      Normal_5 ± (1, 0, 0, 0, 0)
6            14    5     2     .50    NN      Normal_5 ± (0, 0, 0, 0, 0)
7*           20    2     2     .41    NN      Normal_2 ± (.5, 0)
8            20    2     2     .50    NN      Normal_2 ± (0, 0)
9            14    5     2     .26    3NN     Normal_5 ± (1, 0, 0, 0, 0)
10           14    5     2     .50    3NN     Normal_5 ± (0, 0, 0, 0, 0)
11           20    2     2     .39    3NN     Normal_2 ± (.5, 0)
12           20    2     2     .50    3NN     Normal_2 ± (0, 0)
13          100   10     2     .15    LDF     N_10(0, I) versus
14                             .18    NN        N_10(√j/2, 1/j)
15                             .17    TREES
16                             .05    QDF
17           20    2     2     .18    LDF     N_2 ± (1, 0)
18           14   12     2     .50    LDF     N_12(0, I)
19           20    2     4     .60    LDF     Normal_2 (±.5, ±.5)
20          100   19     4     .26    LDF     Vehicle
21                             .07    NN        data
22           36   10     2     .07    LDF     Breast cancer
23                             .04    NN        data
24           80   15    15     .47    3NN     Soybean data

* Experiments #3 and #7 appear in Table 1 and Figure 1.

NOTE: Results for experiments #1-4 in Table 3; #5-8 in Table 4; #9-12 in Table 5; #13-16 in Table 6; #17-19 in Table 7; #20-24 in Table 8.

Figure 3. Relative Bias of Err^(.632+) for the 24 Experiments (solid curve), Compared to Err^(1) (top curve) and Err^(.632) (bottom curve). Plotted values are (mean - μ)/μ. Dashes indicate the seven no-information experiments, where Err = .50.

Figure 3 shows that Err^(.632+) was a reasonably successful compromise between the upwardly biased Err^(1) and the downwardly biased Err^(.632). The plotted values are the relative bias in each of the 24 experiments, measured by (mean - μ)/μ. The dashes in Figure 3 indicate the seven "no-information" experiments, where y was independent of t and Err ≡ .50. (The μ values for these seven experiments have been spread out for clarity.) In these cases the definitions in (31) effectively truncate Err^(.632+) at or near .50. This almost always gives a more accurate estimate of Err on a case-by-case basis but yields a downward bias overall. To put things another way, we could also improve the accuracy of Err^(cv1) by truncating it at γ̂, but then Err^(cv1) would no longer be nearly unbiased.

Better bias adjustments of Err^(1) are available, as discussed in Section 6. However, in reducing the bias of Err^(1), they lose about half of the reduction in variance enjoyed by Err^(.632+) and so offer less dramatic improvements over Err^(cv1).

Note that Err^(.632+) requires no additional applications of the prediction rule after computation of Err^(1). Because 50 bootstrap samples are often sufficient, both Err^(1) and Err^(.632+) can be less costly than Err^(cv1) if n is large.
4. SAMPLING EXPERIMENTS

Table 2 describes the 24 sampling experiments performed for this study. Each experiment involved the choice of a training set size n, a probability distribution F giving x as in (4), a dimension p for the prediction vectors t, and a prediction rule r_x. Twenty-one of the experiments were dichotomous, and three involved four or more classes; 0-1 error (2) was used in all 24 experiments. The first 12 experiments each comprised 200 Monte Carlo simulations (i.e., 200 independent choices of the training set x) and the last 12 experiments each comprised 50 simulations. The bootstrap samples were balanced in the sense that the indices of the bootstrap data points were obtained by randomly permuting a string of B copies of the integers 1 to n (see Davison, Hinkley, and Schechtman 1986), but the balancing had little effect on our results.

Table 3. Results for the LDF Classifier

           1: (14, 5)            2: (14, 5, ind)        3: (20, 2)             4: (20, 2, ind)

        Exp    SD    RMS      Exp    SD    RMS      Exp    SD    RMS      Exp    SD    RMS

Err .259 .063 0 .501 .011 0 .357 .051 0 .500 .010 0


Errhat1 .327 .116 .147 .500 .115 .115 .388 .101 .104 .502 .087 .088
632 .232 .095 .117 .393 .106 .150 .343 .093 .093 .448 .081 .097
632+ .286 .116 .133 .416 .086 .121 .357 .092 .096 .443 .073 .094
cv1 .269 .144 .156 .501 .176 .175 .362 .130 .123 .505 .135 .135
cv5f .303 .158 .173 .485 .158 .158 .379 .129 .123 .515 .137 .139
cv5fr .299 .125 .141 .497 .135 .134 .371 .114 .109 .506 .117 .117
bootop .182 .105 .147 .375 .135 .183 .345 .107 .106 .459 .102 .110
bc1 .275 .153 .164 .499 .163 .163 .376 .115 .113 .499 .109 .109
bc2 .180 .231 .250 .498 .259 .258 .355 .151 .147 .493 .161 .161
Errhat2 .256 .118 .136 .458 .142 .147 .358 .109 .107 .472 .103 .107
errbar .070 .075 .214 .209 .104 .310 .267 .090 .128 .355 .084 .168
R .671 .234 .931 .136 .657 .300 .936 .165

This simulation study might seem excessive, but it must be large to investigate the effects of different classifiers, training set sizes, signal-to-noise ratios, number of classes, and number of observations. Here are some explanatory comments concerning the 24 experiments:

• In experiments #1-18, the response y_i equals 0 or 1 with probability .50, and the conditional distribution of t_i | y_i is multivariate normal. For example, experiment #3 is as described in (1), t_i | y_i ~ N_2((y_i - .5, 0), I), but #4 has the no-information form t_i | y_i ~ N_2((0, 0), I), so that y_i is independent of t_i and every prediction rule has Err = .50. In experiment #19 the response y_i equals 1, 2, 3, or 4 with probability .25, and t_i | y_i ~ N_2(ξ_i, I) with ξ_1 = (-.5, -.5), ξ_2 = (-.5, .5), ξ_3 = (.5, -.5), and ξ_4 = (.5, .5).

• Experiments #13-16 are taken from Friedman (1994). There are two classes in 10 dimensions and 100 training observations; predictors in class 1 are independent standard normal, and those in class 2 are independent normal with mean √j/2 and variance 1/j, for j = 1, 2, ..., 10. All predictors are useful here, but the ones with higher index j are more so.

• Experiments #20-24 refer to real datasets taken from the machine learning archive at UC Irvine. The dimensions of these datasets are (846, 19), (683, 10), and (562, 18) (after removing incomplete observations), with four, two, and 15 classes. We followed Kohavi (1995) and chose a random subset of the data to act as the training set, choosing the training set size n so that μ_n still sloped downward. The idea is that if n is so large that the error curve is flat, then the error rate estimation problem is too easy, because the potential biases arising from changing the training set size will be small. We chose training set sizes 100, 36, and 80. The soybean data actually have 35 categorical predictors, many with more than two possible values. To keep the computations manageable, we used only the 15 binary predictors.

• The prediction rules are LDF, Fisher's linear discriminant analysis; 1-NN and 3-NN, one-nearest-neighbor and three-nearest-neighbor classifiers; TREES, a classification tree using the tree function in S-PLUS; and QDF, quadratic discriminant function (i.e., estimating a separate mean and covariance in each class and using the Gaussian log-likelihood for classification).

Tables 3-8 report the performance of several error rate estimators in the 24 sampling experiments. In each table the Exp and SD columns give means and standard deviations, and RMS is the square root of the average squared error for estimating Err(x, F), (5). The bootstrap estimators all used B = 50 bootstrap replications per simulation.

The error rate estimators include Err^(.632), Err^(.632+), Err^(1), and four different cross-validation rules: cv1, cv5f, and cv10f are leave-one-out, fivefold, and tenfold cross-validations, whereas cv5fr is fivefold cross-validation averaged over 10 random choices of the partition (making the total number of recomputations 50, the same as for the bootstrap rules). Also shown are other bias-corrected versions of Err^(1) called bootop, bc1, bc2, and Err^(2); see Section 6. The tables also give statistical summaries for Err, err, and R̂'; see (31).

The results vary considerably from experiment to experiment, but in terms of RMS error the .632+ rule is an overall winner. In Figure 4 the solid line graphs RMS{Err^(.632+)}/RMS{Err^(cv1)} versus the true expected error μ, (6). The median value of the ratio for the 24 experiments was .78. The dotted line is the RMS ratio for estimating μ rather than Err, a measure that is slightly more favorable to the .632+ rule, the median ratio now being .72.

Table 4. Results for the 1-NN Classifier

           5: (14, 5)            6: (14, 5, ind)        7: (20, 2)             8: (20, 2, ind)

        Exp    SD    RMS      Exp    SD    RMS      Exp    SD    RMS      Exp    SD    RMS

Err .293 .056 0 .500 .011 0 .418 .047 0 .500 .011 0


Err 1 .303 .134 .122 .491 .132 .132 .424 .105 .095 .507 .097 .097
Err· 632 .192 .085 .129 .310 .083 .207 .268 .067 .162 .320 .062 .190
Err· 632 + .257 .127 .120 .413 .094 .128 .380 .101 .099 .439 .068 .092
cv1 .287 .161 .151 .496 .169 .168 .419 .133 .123 .513 .136 .136
cv5f .297 .167 .155 .490 .162 .163 .423 .144 .134 .508 .139 .139
cv5fr .297 .144 .133 .496 .138 .138 .420 .122 .110 .509 .117 .117
bootop .107 .047 .194 .172 .046 .331 .150 .037 .271 .180 .035 .322
bc1 .307 .139 .128 .490 .143 .143 .423 .109 .100 .506 .102 .101
bc2 .313 .158 .149 .486 .179 .179 .421 .131 .124 .503 .126 .126
Err 2 .197 .088 .126 .319 .087 .201 .274 .069 .157 .327 .063 .185
err 0 0 .298 0 0 .500 0 0 .420 0 0 .500
R .641 .252 .922 .134 .853 .169 .949 .099

Table 5. Results for the 3-NN Classifier

           9: (14, 5)            10: (14, 5, ind)       11: (20, 2)            12: (20, 2, ind)

        Exp    SD    RMS      Exp    SD    RMS      Exp    SD    RMS      Exp    SD    RMS

Err .273 .065 0 .500 .011 0 .399 .062 0 .501 .011 0


Err1 .314 .116 .131 .494 .112 .113 .427 .097 .091 .507 .083 .083
Err· 632 .245 .099 .110 .400 .100 .142 .346 .084 .093 .412 .074 .115
Err· 632 + .277 .113 .122 .421 .087 .119 .388 .091 .090 .437 .066 .091
cv1 .263 .154 .154 .496 .173 .173 .401 .139 .1.26 .509 .138 .138
cv5f .273 .154 .155 .491 .161 .161 .405 .133 .124 .511 .143 .143
cv5fr .290 .133 .139 .495 .144 .145 .411 .123 .110 .509 .117 .117
bootop .237 .122 .127 .412 .135 .162 .359 .106 .106 .431 .101 .123
bc1 .329 .125 .146 .495 .128 .128 .431 .114 .106 .505 .098 .098
bc2 .355 .187 .213 .499 .210 .209 .438 .181 .175 .502 .177 .176
Err2 .250 .120 .124 .425 .131 .152 .369 .103 .099 .441 .099 .115
err .126 .091 .175 .239 .106 .283 .207 .078 .208 .250 .080 .263
R .604 .265 .943 .121 .814 .218 .946 .119

Simulation results must be viewed with caution, especially in an area as broadly defined as the prediction problem. The smoothing argument of Section 2 strongly suggests that it should be possible to improve on cross-validation. With this in mind, Figure 4 demonstrates that Err^(.632+) has made full use of the decreased standard deviation seen in Figure 2. However, the decrease in RMS is less dependable than the decrease in SD, and part of the RMS decrease is due to the truncation at γ̂ in definitions (31) and (32). The truncation effect is particularly noticeable in the seven no-information experiments.

Figure 4. The RMS{Err^(.632+)}/RMS{Err^(cv1)} Ratio From Tables 3-8, Plotted Versus Expected True Error μ (solid line), and the RMS Ratio for Estimating μ Instead of Err (dotted line). Dashes indicate the seven no-information experiments. The vertical axis is plotted logarithmically.

5. ESTIMATING THE STANDARD ERROR OF Err^(1)

The same set of bootstrap replications that gives a point estimate of prediction error can also be used to assess the variability of that estimate. This can be useful for inference purposes, model selection, or comparing two models. The method presented here, called "delta-method-after-bootstrap" by Efron (1992), works well for estimators like Err^(1) that are smooth functions of x. It is more difficult to obtain standard error estimates for cross-validation or the .632+ estimator, and we do not study those estimators in this section.

We discuss estimating the usual external standard deviation of Err^(1); that is, the variability in Err^(1) caused by the random choice of x. We also discuss the internal variability, due to the random choice of the B bootstrap samples (as at the end of Sec. 2), because it affects the assessment of external variability. Finally, we discuss estimating the standard deviation of the difference Err^(1)(rule 1) - Err^(1)(rule 2), where rule 1 and rule 2 are two different prediction rules applied to the same set of data.

The nonparametric delta method estimate of standard error applies to symmetrically defined statistics, those that are invariant under permutation of the points x_i in x = (x_1, x_2, ..., x_n). In this case we can write the statistic as a function S(F̂) of the empirical distribution. The form of S can depend on n, but S(F̂) must be smoothly defined in the following sense: Let F̂_{ε,i} be a version of F̂ that puts extra probability on x_i,

    F̂_{ε,i}: probability (1 - ε)/n + ε on x_i,  and  (1 - ε)/n on x_j for j ≠ i.    (33)

Then we need the derivatives ∂S(F̂_{ε,i})/∂ε to exist at ε = 0.

Table 6. Results for the (100, 10) Problem

           13: LDF               14: 1-NN               15: TREES              16: QDF

        Exp    SD    RMS      Exp    SD    RMS      Exp    SD    RMS      Exp    SD    RMS

Err .155 .013 0 .188 .017 0 .161 .033 0 .053 .012 0


Err1 .180 .035 .045 .203 .028 .032 .203 .032 .061 .111 .022 .063
Err· 632 .157 .033 .036 .128 .018 .063 .145 .024 .042 .074 .016 .029
Err· 632 + .160 .033 .037 .151 .025 .044 .161 .028 .041 .079 .018 .034
cv-1 .163 .039 .042 .181 .034 .034 .169 .057 .057 .054 .026 .028
cv5f .169 .040 .044 .193 .036 .036 .180 .046 .056 .066 .028 .032
cv10 .164 .035 .038 .185 .034 .034 .171 .042 .046 .060 .026 .029
bootop .157 .037 .039 .073 .010 .116 .116 .025 .058 .049 .015 .019
bc1 .167 .038 .042 .202 .030 .033 .188 .042 .059 .063 .026 .029
bc2 .142 .046 .049 .201 .039 .041 .160 .066 .075 -.024 .044 .088
err .118 .032 .050 0 0 .188 .047 .019 .119 .011 .010 .045
R .162 .047 .406 .056 .346 .065 .204 .037

Table 7. LDF for the (20, 2, +), (14, 12, ind), and (20, 2, 4) Problems

                  17: (20, 2, +)
                Exp      SD      RMS

Err .187 .028 0


Errhat1 .221 .088 .094
632 .191 .082 .082
632+ .199 .087 .088
cv-1 .196 .093 .095
cv5f .207 .104 .106
cv5fr .204 .092 .094
bootop .188 .091 .092
bc1 .203 .093 .094
bc2 .169 .115 .115
errbar .141 .078 .092
R .271 .196

Defining

    D_i = (1/n) ∂S(F̂_{ε,i})/∂ε |_{ε=0},    (34)

the nonparametric delta method standard error estimate for S(F̂) is

    SE_del = [Σ_{i=1}^n D_i²]^{1/2}    (35)

(see Efron 1992, Sec. 5). The vector D = (D_1, D_2, ..., D_n) is 1/n times the empirical influence function of S.

If the prediction rule r_x(t) is a symmetric function of the points x_i in x, as it usually is, then Err^(1) is also symmetrically defined. The expectation in (13) guarantees that Err^(1) will be smoothly defined in x, so we can apply formulas (34)-(35).

We first consider the ideal case where Err^(1) is based on all B = n^n possible bootstrap samples x* = (x_{i_1}, x_{i_2}, ..., x_{i_n}), each i_j ∈ {1, 2, ..., n}. Following the notation in (14)-(16), let

    q_i^b = I_i^b Q_i^b   and   Err^(1) = (1/n) Σ_{i=1}^n Ê_i,    (36)

where

    Ê_i = Σ_{b=1}^B q_i^b / Σ_{b=1}^B I_i^b.    (37)

We also define Ĉ_i to be the bootstrap covariance between N_i^b and q_i^b,

    Ĉ_i = (1/B) Σ_{b=1}^B (N_i^b - 1) q_i^b.    (38)

The following lemma was proven by Efron and Tibshirani (1995).

Lemma. The derivative (34) for S(F̂) = Err^(1) is

    D_i = (2 + 1/(n - 1)) (Ê_i - Err^(1))/n + e_n Ĉ_i,    (39)

where e_n ≡ (1 - 1/n)^{-n}.

A naive estimate of standard error for Err^(1) would be [Σ_i (Ê_i - Err^(1))²/n²]^{1/2}, based on the false assumption that the Ê_i are independent. This amounts to taking D_i = (Ê_i - Err^(1))/n in (35). The actual influences (39) usually result in a larger standard error than the naive estimates.

Table 8. Results for Real Data Examples

           20: veh/LDF           21: veh/1-NN           22: breast/LDF         23: breast/1-NN

        Exp    SD    RMS      Exp    SD    RMS      Exp    SD    RMS      Exp    SD    RMS

Err .262 .022 0 .442 .023 0 .067 .025 0 .050 .018 0


Errhat1 .300 .035 .057 .476 .050 .067 .098 .040 .055 .058 .046 .041
632 .236 .033 .049 .301 .032 .147 .067 .030 .038 .037 .029 .029
632+ .249 .034 .044 .395 .054 .077 .072 .033 .040 .040 .034 .032
cv-1 .262 .041 .047 .446 .058 .065 .066 .050 .051 .054 .048 .042
cv5f .275 .052 .058 .458 .062 .068 .084 .053 .056 .056 .053 .046
cv10 .269 .044 .047 .452 .059 .067 .073 .057 .058 .060 .054 .047
bootop .222 .043 .065 .172 .018 .271 .048 .029 .042 .021 .016 .034
errbar .128 .035 .142 0 0 .442 .013 .019 .062 0 0 .054
R .280 .036 .640 .067 .202 .081 .134 .106
In practice, we have only B bootstrap samples, with B being much smaller than n^n. In that case, we can set

    D̂_i = (2 + 1/(n - 1)) (Ê_i - Err^(1))/n + Σ_{b=1}^B (N_i^b - N̄_i) q_i^b / Σ_{b=1}^B I_i^b,    (40)

where Ê_i and Err^(1) are as given in (37), and N̄_i = Σ_{b=1}^B N_i^b / B. (N̄_i = 1 for a balanced set of bootstrap samples.) The bootstrap expectation of I_i^b equals e_n^{-1}, so (40) goes to (39) as we approach the ideal case B = n^n.

Finally, we can use the jackknife to assess the internal error of the D̂_i estimates, namely the Monte Carlo error that comes from using only a limited number of bootstrap replications. Let D_{i(b)} indicate the value of D̂_i calculated from the B - 1 bootstrap samples not including the bth sample. Simple computational formulas for D_{i(b)} are available along the lines of (17). The internal standard error of D̂_i is given by the jackknife formula

    ŝ_i = [((B - 1)/B) Σ_{b=1}^B (D_{i(b)} - D_{i(·)})²]^{1/2},    (41)

D_{i(·)} ≡ Σ_b D_{i(b)}/B, with the total internal error

    SE_int = [Σ_{i=1}^n ŝ_i²]^{1/2}.    (42)

This leads to an adjusted standard error estimate for Err^(1),

    SE_adj = [SE_del² - SE_int²]^{1/2}.    (43)

These formulas were applied to a single realization of experiment #3, having n = 20 points as in Table 2. B = 1,000 bootstrap replications were generated, and the standard error formulas (35), (42), and (43) were calculated from the first 50, the first 100, and so on. The results appear in Table 9. Using all B = 1,000 replications gave SE_del = .100, nearly the same as SE_adj = .097. This might be compared to the actual standard deviation .110 for Err^(1) in experiment #3, though of course we expect any data-based standard error estimate to vary from sample to sample.

The right side of Table 9 shows SE_del and SE_adj for successive groups of 100 bootstrap replications. The values of SE_del are remarkably stable but biased upward from the answer based on all 1,000 replications; the bias-adjusted values SE_adj are less biased but about twice as variable from group to group. In this example both SE_del and SE_adj gave useful results even for B as small as 100.

Table 9. B = 1,000 Bootstrap Replications From a Single Realization of Experiment #3, Table 2

      B        SE_del   SE_int   SE_adj          B          SE_del   SE_int   SE_adj
   1:50         .187     .095     .161         1:100         .131     .063     .115
   1:100        .131     .063     .115       101:200         .122     .077     .094
   1:200        .119     .049     .109       201:300         .128     .068     .108
   1:400        .110     .034     .105       301:400         .118     .068     .097
   1:1000       .100     .023     .097       401:500         .134     .076     .110
                                             501:600         .116     .085     .079
                                             601:700         .126     .060     .111
                                             701:800         .108     .076     .077
                                             801:900         .109     .084     .068
                                             901:1000        .116     .082     .082

                                             Mean            .121     .074     .0941
                                             (Std. dev.)    (.009)   (.009)   (.0168)

NOTE: Standard error estimates (35), (42), and (43) were based on portions of the 1,000 replications.

The delta method cannot be directly applied to find the standard error of Err^(.632+), because the .632+ rule involves err, an unsmooth function of x. A reasonable estimate of standard error for Err^(.632+) is obtained by multiplying (35) or (43) by Err^(.632+)/Err^(1). This is reasonable, because the coefficient of variation for the two estimators was nearly the same in our experiments.

Finally, suppose that we apply two different prediction rules, r' and r'', to the same training set and wish to assess the significance of the difference D̂iff = Err^(1)' - Err^(1)'' between the error rate estimates. For example, r' and r'' could be LDF and NN, as in Figure 1. The previous theory goes through if we change the definition of Q_i^b in (36) to

    Q_i^b = Q[y_i, r'_{x*b}(t_i)] - Q[y_i, r''_{x*b}(t_i)].    (44)

Then the delta-method estimate of standard error for D̂iff is (Σ_i D̂_i²)^{1/2}, where

    D̂_i = (2 + 1/(n - 1)) (D̂iff_i - D̂iff)/n + e_n Ĉ_i.    (45)

Here D̂iff_i = Σ_b q_i^b / Σ_b I_i^b, q_i^b ≡ I_i^b Q_i^b, and everything else is as defined in (38) and (39).
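As a computational summary of (35) and (40), here is a minimal Python sketch of SE_del for Err^(1), assuming the B × n arrays of bootstrap counts N_i^b and losses Q_i^b are available; array and function names are ours.

```python
import numpy as np

def se_del_err1(N, Q):
    """Delta-method-after-bootstrap standard error for Err^(1), eqs. (35) and (40).
    N and Q are B x n arrays of the counts N_i^b and 0-1 losses Q_i^b."""
    B, n = N.shape
    I = (N == 0)                          # I_i^b, eq. (15)
    q = I * Q                             # q_i^b = I_i^b Q_i^b
    I_plus = I.sum(axis=0)                # assumes every point is left out at least once
    E = q.sum(axis=0) / I_plus            # E_i, eq. (17)
    err1 = E.mean()                       # Err^(1)
    N_bar = N.mean(axis=0)                # N-bar_i; equals 1 for balanced samples
    # empirical influence components, eq. (40)
    D = (2 + 1 / (n - 1)) * (E - err1) / n + ((N - N_bar) * q).sum(axis=0) / I_plus
    return np.sqrt(np.sum(D ** 2))        # SE_del, eq. (35)
```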
6. DISTANCE AND BIAS CORRECTIONS

One way to understand the biases of the various error rate estimators is in terms of the distance of test points from the training set: err, (7), is the error rate for test points that are zero distance away from the training set x, whereas the true value Err, (5), is the error rate for a new test point x_0 that may lie some distance away from x. Because we expect the error rate of a rule r_x(t_0) to increase with distance from x, err underestimates Err. Cross-validation has the test points nearly the right distance away from the training set and so is nearly unbiased. The leave-one-out bootstrap Err^(1), (14), has x_0 too far away from the bootstrap training sets x*, because these are supported on only about .632n of the points in x, producing an upward bias.

A quantitative version of the distance argument leads to the .632+ rule (29). This section presents the argument, which is really quite rough, and then goes on to discuss other more careful bias-correction methods. However, these "better" methods did not produce better estimators in our experiments, with the reduced bias paid for by too great an increase in variance.

Efron (1983) used distance methods to motivate Err^(.632), (24). Here is a brief review of the arguments in that article, which lead up to the motivation for Err^(.632+). Given a system of neighborhoods around points x = (t, y), let S(x, Δ) indicate the neighborhood of x having probability content Δ,

    Pr{X_0 ∈ S(x, Δ)} = Δ.    (46)

(In this section capital letters indicate random quantities distributed according to F [e.g., X_0 in (46)], and lower-case x values are considered fixed.) As Δ → 0, we assume that the neighborhood S(x, Δ) shrinks toward the point x. The distance of test point x_0 from a training set x is defined by its distance from the nearest point in x,

    δ(x_0, x) = inf{Δ: x_0 ∈ ∪_{i=1}^n S(x_i, Δ)}.    (47)

Figure 5. Two Gaussian Classes in Two Dimensions, From the (20, 2) Problem. The circles represent neighborhoods of probability content Δ = .05 around each training point. The dotted line represents the LDF decision boundary.

Figure 5 shows neighborhoods of probability content Δ = .05 for a realization from the (20, 2) problem. Here we have chosen S(x, Δ) to be circles in the planes y = 0 and y = 1. That is, if x_0 = (t_0, y_0), then S(x_0, Δ) = {(t, y): y = y_0 and ||t - t_0|| ≤ r}, where r is chosen so that (46) holds. Notice how the neighborhoods grow smaller near the decision boundary; this occurs because the probability in (46) refers not to the distribution of t but rather to the joint distribution of t and y.

Let

    μ(Δ) = E{Q(X_0, X) | δ(X_0, X) = Δ},    (48)

the expected prediction error for test points distance Δ from the training set. The true expected error (6) is given by

    μ = ∫_0^∞ μ(Δ) g(Δ) dΔ,    (49)

where g(Δ) is the density of δ(X_0, X). Under reasonable conditions, g(Δ) approaches the exponential density

    g(Δ) ≈ n e^{-nΔ}    (Δ ∈ (0, ∞))    (50)

(see Efron 1983, appendix). Another important fact is that

    μ(0) ≡ ν ≡ E_F{err},    (51)

which is just another way of saying that err is the error rate for test points zero distance away from x.

We can also define a bootstrap analog of μ(Δ):

    μ*(Δ) = E{Q(X_0*, X*) | δ(X_0*, X*) = Δ},    (52)

with the expectation in (52) being over the choice of X_1, X_2, ..., X_n ~ F and then X_0*, X_1*, X_2*, ..., X_n* ~ F̂. Notice that if δ* > 0, then X_0* must not equal any of the X_i* points in X*. This and definition (14) give

    ξ ≡ E{Err^(1)} = ∫_0^∞ μ*(Δ) g*(Δ) dΔ,    (53)

where g*(Δ) is the density of δ* = δ(X_0*, X*) given that δ* > 0.

Because bootstrap samples are supported on about .632n of the points in the training set, the same argument that gives g(Δ) ≈ n e^{-nΔ} also shows that

    g*(Δ) ≈ .632 n e^{-.632 n Δ}.    (54)

Efron (1983) further supposed that

    μ*(Δ) ≈ μ(Δ)   and   μ(Δ) ≈ ν + βΔ    (55)

for some value β, with the intercept ν coming from (51). Combining these approximations gives

    μ ≈ ν + β/n   and   ξ ≈ ν + β/(.632 n),    (56)

so

    μ ≈ .368 ν + .632 ξ.    (57)

Substituting err for ν and Err^(1) for ξ results in Err^(.632), (24).
Figure 6. μ(Δ) Curves, (48), for Experiments #3 (a) and #7 (b). The linearity assumption μ(Δ) = ν + βΔ is reasonably accurate for the LDF rule but not for NN. The histogram for distance δ, (47), supports the exponential approximation (50).

Figure 6 shows μ(Δ) for experiments #3 and #7, estimated using all of the data from each set of 200 simulations. The linearity assumption μ(Δ) = ν + βΔ in (55) is reasonably accurate for the LDF rule of experiment #3 but not for the NN rule of #7. The expected apparent error ν = μ(0) equals 0 for NN, producing a sharp bend near 0 in the μ(Δ) curve.

The .632+ rule of Section 3 replaces linearity with an exponential curve, better able to match the form of μ(Δ) seen in Figure 6. Now we assume that

    μ(Δ) = γ - α e^{-βΔ}.    (58)

Here α and β are positive parameters and γ is an upper asymptote for μ(Δ) as Δ gets large. Formulas (50) and (54), taken literally, produce simple expressions for μ, (49), and for ξ, (53):

    μ = (βγ + nν)/(β + n)   and   ξ = (βγ + .632 nν)/(β + .632 n).    (59)

Combined with ν = γ - α from (51), (59) gives

    μ = (1 - w)ν + wξ,    (60)

where

    w = .632/(1 - .368 R)   and   R = (ξ - ν)/(γ - ν).    (61)

Err^(.632+), (29), is obtained by substituting err for ν, Err^(1) for ξ, and γ̂, (27), for γ.

All of this is a reasonable plausibility argument for the .632+ rule, but not much more than that. The assumption μ(Δ) ≈ μ*(Δ) in (55) is particularly vulnerable to criticism, though in the example shown in figure 2 of Efron (1983) it is quite accurate. A theoretically more defensible approach to the bias correction of Err^(1) can be made using Efron's analysis of variance (ANOVA) decomposition arguments (Efron 1983); see the Appendix.

7. CONCLUSIONS

Our studies show that leave-one-out cross-validation is reasonably unbiased but can suffer from high variability in some problems. Fivefold or tenfold cross-validation exhibits lower variance but higher bias when the error rate curve is still sloping at the given training set size. Similarly, the leave-one-out bootstrap has low variance but sometimes has noticeable bias. The new .632+ estimator is the best overall performer, combining low variance with only moderate bias. All in all, we feel that bias was not a major problem for Err^(.632+) in our simulation studies, and that attempts to correct bias were too expensive in terms of added variability. At the same time, it seems possible that further research might succeed in producing a better compromise between the unbiasedness of cross-validation and the reduced variance of the leave-one-out bootstrap.

APPENDIX: SOME ASYMPTOTIC ANALYSIS OF THE ERROR RATE ESTIMATORS

Define

    a = -n E{Q(X_1, X) - μ}   and   b = n² E{Q(X_1, X') - μ},    (A.1)

where

    X = (X_1, X_2, X_3, ..., X_n)   and   X' = (X_2, X_2, X_3, ..., X_n).    (A.2)

(Both a and b will usually be positive.) Then the formal ANOVA decompositions of Efron (1983, Sec. 7) give

    E{err} = μ - a/n

and

    E{Err^(1)} = μ + b/(2n) + O(1/n²).    (A.3)

Also, letting μ̂ ≡ E_{F̂} E_{0F̂} Q(X_0, X*) denote the nonparametric MLE of μ ≡ E Q(X_0, X),

    E{Err^(1) - μ̂} = b/n + O(1/n²).    (A.4)

We can combine (A.3) and (A.4) to obtain a bias-corrected version of Err^(1) that, like Err^(cv1), has bias of order 1/n² rather than 1/n:

    Err^(2) = err + (Err^(1) - μ̂).    (A.5)

Err^(2) can be attractively reexpressed in terms of the bootstrap covariances between I_i^b and Q_i^b. Following the notation in (36) and (39), it turns out that

    Err^(2) = err + (e_n/n) Σ_{i=1}^n côv_i,    (A.6)

where

    côv_i = Σ_{b=1}^B (I_i^b - Î_i) Q_i^b / B,    (A.7)

Î_i = Σ_b I_i^b / B. Formula (A.6) says that we can bias correct the apparent error rate err by adding e_n times the average covariance between I_i^b (15), the absence or presence of x_i in x*, and Q_i^b (16), whether or not r_{x*}(t_i) incorrectly predicts y_i. These covariances will usually be positive.

Despite its attractions, Err^(2) was an underachiever both in the work of Efron (1983), where it appears in table 2 of that article, and in the experiments here. Generally speaking, it gains only about half of the RMS advantage over Err^(cv1) enjoyed by Err^(1). Moreover, its bias-correction powers fail for cases like experiment #7, where the formal expansions (A.3) and (A.4) are also misleading.
The estimator called bootop in Tables 3-8 was defined as

    Err^(bootop) = err - (1/n) Σ_{i=1}^n cov_*(N_i^b, Q_i^b)    (A.8)

in (2.10) of Efron (1983), "bootop" standing for "bootstrap optimism." Here cov_*(N_i^b, Q_i^b) = Σ_b (N_i^b - 1) Q_i^b / B. Efron's (1983) section 7 showed that the bootop rule, like Err^(cv1) and Err^(2), has bias of order only 1/n² instead of 1/n. This does not keep it from being badly biased downward in several of the sampling experiments, particularly for the NN rules.

We also tried to bias correct Err^(1) using a second level of bootstrapping. For each training set x, 50 second-level bootstrap samples were drawn by resampling (one time each) from the 50 original bootstrap samples. The number of distinct original points x_i appearing in a second-level bootstrap sample is approximately .502 n, compared to .632 n for a first-level sample. Let Err^(sec) be the Err^(1) statistic (17) computed using the second-level samples instead of the first-level samples. The rules called bc1 and bc2 in Tables 3-7 are linear combinations of Err^(1) and Err^(sec),

    Err^(bc1) = 2 · Err^(1) - Err^(sec)

and

    Err^(bc2) = 3.83 Err^(1) - 2.83 Err^(sec).    (A.9)

The first of these is suggested by the usual formulas for bootstrap bias correction. The second is based on linearly extrapolating μ̂_{.502n} = Err^(sec) and μ̂_{.632n} = Err^(1) to an estimate for μ_n. The bias correction works reasonably in Tables 3-7, but once again at a substantial price in variability.

[Received May 1995. Revised July 1996.]

REFERENCES

Allen, D. M. (1974), "The Relationship Between Variable Selection and Data Augmentation and a Method of Prediction," Technometrics, 16, 125-127.
Breiman, L. (1994), "Bagging Predictors," technical report, University of California, Berkeley.
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984), Classification and Regression Trees, Pacific Grove, CA: Wadsworth.
Breiman, L., and Spector, P. (1992), "Submodel Selection and Evaluation in Regression: The X-Random Case," International Statistical Review, 60, 291-319.
Chernick, M., Murthy, V., and Nealy, C. (1985), "Application of Bootstrap and Other Resampling Methods: Evaluation of Classifier Performance," Pattern Recognition Letters, 4, 167-178.
--- (1986), Correction to "Application of Bootstrap and Other Resampling Methods: Evaluation of Classifier Performance," Pattern Recognition Letters, 4, 133-142.
Cosman, P., Perlmutter, K., Perlmutter, S., Olshen, R., and Gray, R. (1991), "Training Sequence Size and Vector Quantizer Performance," in 25th Asilomar Conference on Signals, Systems, and Computers, November 4-6, 1991, ed. Ray R. Chen, Los Alamitos, CA: IEEE Computer Society Press, pp. 434-448.
Davison, A. C., Hinkley, D. V., and Schechtman, E. (1986), "Efficient Bootstrap Simulations," Biometrika, 73, 555-566.
Efron, B. (1979), "Bootstrap Methods: Another Look at the Jackknife," The Annals of Statistics, 7, 1-26.
--- (1983), "Estimating the Error Rate of a Prediction Rule: Some Improvements on Cross-Validation," Journal of the American Statistical Association, 78, 316-331.
--- (1986), "How Biased is the Apparent Error Rate of a Prediction Rule?," Journal of the American Statistical Association, 81, 461-470.
--- (1992), "Jackknife-After-Bootstrap Standard Errors and Influence Functions" (with discussion), Journal of the Royal Statistical Society, Ser. B, 83-111.
Efron, B., and Gong, G. (1983), "A Leisurely Look at the Bootstrap, the Jackknife, and Cross-Validation," The American Statistician, 37, 36-48.
Efron, B., and Tibshirani, R. (1993), An Introduction to the Bootstrap, London: Chapman and Hall.
--- (1995), "Cross-Validation and the Bootstrap: Estimating the Error Rate of a Prediction Rule," Technical Report 176, Stanford University, Department of Statistics.
Friedman, J. (1994), "Flexible Metric Nearest-Neighbor Classification," technical report, Stanford University, Department of Statistics.
Geisser, S. (1975), "The Predictive Sample Reuse Method With Applications," Journal of the American Statistical Association, 70, 320-328.
Jain, A., Dubes, R. P., and Chen, C. (1987), "Bootstrap Techniques for Error Estimation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 9, 628-633.
Kohavi, R. (1995), "A Study of Cross-Validation and Bootstrap for Accuracy Assessment and Model Selection," technical report, Stanford University, Department of Computer Science.
Mallows, C. (1973), "Some Comments on Cp," Technometrics, 15, 661-675.
McLachlan, G. (1992), Discriminant Analysis and Statistical Pattern Recognition, New York: Wiley.
Stone, M. (1974), "Cross-Validatory Choice and Assessment of Statistical Predictions," Journal of the Royal Statistical Society, Ser. B, 36, 111-147.
--- (1977), "An Asymptotic Equivalence of Choice of Model by Cross-Validation and Akaike's Criterion," Journal of the Royal Statistical Society, Ser. B, 39, 44-47.
Wahba, G. (1980), "Spline Bases, Regularization, and Generalized Cross-Validation for Solving Approximation Problems With Large Quantities of Noisy Data," in Proceedings of the International Conference on Approximation Theory in Honour of George Lorenz, Austin, TX: Academic Press.
Zhang, P. (1993), "Model Selection via Multifold CV," The Annals of Statistics, 21, 299-311.
--- (1995), "APE and Models for Categorical Panel Data," Scandinavian Journal of Statistics, 22, 83-94.
