Improvements on Cross-Validation: The .632+ Bootstrap Method
To cite this article: Bradley Efron & Robert Tibshirani (1997), "Improvements on Cross-Validation: The .632+ Bootstrap Method," Journal of the American Statistical Association, 92:438, 548-560. DOI: 10.1080/01621459.1997.10474007
A training set of data has been used to construct a rule for predicting future responses. What is the error rate of this rule? This is an important question both for comparing models and for assessing a final selected model. The traditional answer to this question is given by cross-validation. The cross-validation estimate of prediction error is nearly unbiased but can be highly variable. Here we discuss bootstrap estimates of prediction error, which can be thought of as smoothed versions of cross-validation. We show that a particular bootstrap method, the .632+ rule, substantially outperforms cross-validation in a catalog of 24 simulation experiments. Besides providing point estimates, we also consider estimating the variability of an error rate estimate. All of the results here are nonparametric and apply to any possible prediction rule; however, we study only classification problems with 0-1 loss in detail. Our simulations include "smooth" prediction rules like Fisher's linear discriminant function and unsmooth ones like nearest neighbors.

KEY WORDS: Classification; Cross-validation; Bootstrap; Prediction rule.
[...]ments described in Section 4. Section 5 shows how the same bootstrap replications that provide a point estimate of prediction error can also provide an assessment of variability for that estimate. Section 6 presents the distance argument underlying the .632+ rule, along with other bias-correction techniques.

All of the results here are nonparametric and apply to any possible prediction rule; however, we study only classification problems with 0-1 loss in detail. Regression problems may exhibit qualitatively different behavior, and the statistical approach may also differ. "In-sample" prediction error is often the focus in regression, especially for model selection. In contrast, the error that we study here might be called "extra-sample" error. Efron (1986) studied estimates of the in-sample prediction error problem, including generalized cross-validation (Wahba 1980) and the Cp statistic of Mallows (1973). The note at the end of Section 2 clarifies the distinction between extra-sample and in-sample prediction errors.

Considerable work has been done in the literature on cross-validation and the bootstrap for error rate estimation. (A good general discussion can be found in McLachlan 1992; key references for cross-validation are Allen 1974 and Stone 1974, 1977.) Efron (1983) proposed a number of bootstrap estimates of prediction error, including the optimism and .632 estimates. The use of cross-validation and the bootstrap for model selection was studied by Breiman (1992), Breiman and Spector (1992), Shao (1993), and Zhang (1993). Breiman and Spector demonstrated that leave-one-out cross-validation has high variance if the prediction rule is unstable, because the leave-one-out training sets are too similar to the full dataset. Fivefold or tenfold cross-validation displayed lower variance in this case. A study of cross-validation and bootstrap methods for tree-structured models was carried out by Crawford (1989). Substantial work has also been done on the prediction error problem in the machine learning and pattern recognition fields; see, for example, the simulation studies of Chernick, Murthy, and Nealy (1985, 1986) and Jain, Dubes, and Chen (1987). Kohavi (1995) performed a particularly interesting study that renewed our interest in this problem.

Table 1. Error Rate Estimation for Situation (1)

            LDF                 NN
            Exp      RMS        Exp      RMS
  CV1       .362     .123       .419     .123
  .632+     .357     .096       .380     .099
  True      .357                .418

NOTE: CV1 is the cross-validation estimate based on omitting one observation at a time from the training set; .632+ is the bootstrap-based estimator described in Section 3. The table shows the expectation and RMS error of the two estimates for both the LDF and NN prediction rules. In both cases, RMS(.632+)/RMS(CV1) is about 80%.

Figure 1. A Training Set Consisting of n = 20 Observations, 12 Labelled 0 and 8 Labelled 1. (a) The linear discriminant function predicts 0 to the lower left of the solid line and 1 to the upper right; (b) the nearest-neighbor rule predicts 1 in the three indicated islands and 0 elsewhere.

We are particularly interested in the dichotomous situation where both y and r are either 0 or 1, with

    Q[y, r] = 0 if r = y,  and  Q[y, r] = 1 if r ≠ y.    (2)

We also use the shorter notation

    Q(x_0, x) = Q[y_0, r_x(t_0)]    (3)

to indicate the discrepancy between the predicted value and response for a test point x_0 = (t_0, y_0) when using the rule r_x based on training set x. Suppose that the observations x_i = (t_i, y_i) in the training set are a random sample from some distribution F,

    x_i = (t_i, y_i) ~ F independently for i = 1, 2, ..., n,    (4)

and that x_0 = (t_0, y_0) is another independent draw from F, called a test point. The true error rate of the rule r_x is

    Err = Err(x, F) = E_{0F} Q(x_0, x) = E_{0F} Q[y_0, r_x(t_0)],    (5)

with the notation E_{0F} indicating that only x_0 = (t_0, y_0) is random in (5), with x and r_x being fixed. Thus Err is the conditional error rate, conditional on the training set x.

We compare error rate estimators in terms of their ability to predict Err. Section 4 briefly discusses estimating instead the expected true error,

    μ = μ(F) = E_F{Err} = E_F E_{0F} Q(x_0, x).    (6)

The results in this case are somewhat more favorable to the bootstrap estimator. Note, however, that although the conditional error rate is often what we would like to obtain, none of the methods correlates very well with it on a sample-by-sample basis (see Zhang 1995).

The apparent error rate (or resubstitution rate) is

    err = E_{0F̂} Q(x_0, x) = (1/n) Σ_{i=1}^n Q[y_i, r_x(t_i)],    (7)

with F̂ indicating the empirical distribution that puts probability 1/n on each observation x_1, x_2, ..., x_n; err tends to be biased downward as an estimate of Err because the training set x has been used twice, both to construct the rule and to test it.
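To make the distinction between the conditional error rate (5) and the expected error rate (6) concrete, here is a small Monte Carlo sketch. It is ours, not the article's: the function names are illustrative, `fit_predict(t_train, y_train, t_test)` stands for an arbitrary user-supplied prediction rule, and the sampling distribution is the two-class Gaussian setup of experiment #3 described in Section 4 (y = 0 or 1 with probability .50, t | y ~ N_2((y − .5, 0), I)).

```python
import numpy as np

def draw_sample(n, rng):
    """Draw (t, y) from the two-class Gaussian situation of experiment #3."""
    y = rng.integers(0, 2, size=n)
    t = rng.normal(size=(n, 2)) + np.c_[y - 0.5, np.zeros(n)]
    return t, y

def true_err(fit_predict, t, y, rng, n_test=20000):
    """Conditional error rate Err(x, F) in (5): the training set (t, y) is
    fixed; only the test points are random draws from F."""
    t0, y0 = draw_sample(n_test, rng)
    return np.mean(fit_predict(t, y, t0) != y0)

def expected_err(fit_predict, n, rng, n_train_sets=100):
    """Expected error mu(F) in (6): average Err(x, F) over training sets."""
    return np.mean([true_err(fit_predict, *draw_sample(n, rng), rng)
                    for _ in range(n_train_sets)])
```

Error rate estimators such as cross-validation and the bootstrap rules discussed next are judged by how well they track Err(x, F) for the particular training set at hand, with (6) as a secondary target.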
Cross-validation (Geisser 1975; Stone 1974) avoids this problem by removing the data point to be predicted from the training set. The ordinary cross-validation estimate of prediction error is

    Err^(cv1) = (1/n) Σ_{i=1}^n Q[y_i, r_{x_(i)}(t_i)],    (8)

where x_(i) is the training set with the ith observation removed. Err^(cv1) is leave-one-out cross-validation; the k-fold version Err^(cvk) partitions the training set into k parts, predicting in turn the observations in each part from the training sample formed from all of the remaining parts.

The statistic Err^(cv1) is a discontinuous function of the training set x when Q[y, r] itself is discontinuous as in (2). Bootstrap smoothing is a way to reduce the variance of such functions by averaging. Suppose that Z(x) is an unbiased estimate of a parameter of interest, say

    ζ(F) = E_F{Z(x)}.    (9)

By definition, the nonparametric maximum likelihood estimate (MLE) of the same parameter ζ is

    ζ̂ = ζ(F̂) = E_{F̂}{Z(x*)}.    (10)

Here x* is a random sample from F̂,

    x* = (x*_1, x*_2, ..., x*_n), each x*_j drawn independently from F̂;    (11)

that is, a bootstrap sample. The bootstrap expectation in (10) smooths out discontinuities in Z(x), usually reducing its variability. However, ζ̂ may now be biased as an estimate of ζ. Breiman (1994) introduced a very similar idea under the sobriquet "bagging."

Now consider applying bootstrap smoothing to Z_i(x) = Q(x_i, x_(i)), with x_i fixed. The nonparametric MLE of E_F Z_i(x) is ζ̂_i = E_{F̂_(i)}{Q(x_i, x*_(i))}, where x*_(i) is a bootstrap sample from the empirical distribution on x_(i),

    F̂_(i): probability 1/(n − 1) on each of x_1, x_2, ..., x_{i−1}, x_{i+1}, ..., x_n.    (12)

It might be argued that the bootstrap samples x*_(i) should be of size n − 1 instead of n, but there is no advantage to this. In what follows we take bootstrap samples from F̂_(i) to be of size n and indicate them by x* rather than x*_(i), so

    ζ̂_i = E_{F̂_(i)}{Q(x_i, x*)}.    (13)

Notice that an x* sample drawn from F̂_(i) never contains the point x_i.

Applying (11) and (12) to each case i in turn leads to the leave-one-out bootstrap,

    Err^(1) = (1/n) Σ_{i=1}^n E_{F̂_(i)}{Q(x_i, x*)},    (14)

a smoothed version of Err^(cv1). This estimate predicts the error at point i only from bootstrap samples that do not contain the point i.

The actual calculation of Err^(1) is a straightforward bootstrap exercise. Ordinary bootstrap samples x* = (x*_1, x*_2, ..., x*_n) are generated as in (11), so x* is a random draw of size n, with replacement, from {x_1, x_2, ..., x_n}. A total of B such samples are independently drawn, say x*1, x*2, ..., x*B, with B = 50 in our simulations, as discussed later. Let N_i^b be the number of times that x_i is included in the bth bootstrap sample and define

    I_i^b = 1 if N_i^b = 0,  and  I_i^b = 0 if N_i^b > 0.    (15)

Also define

    Q_i^b = Q[y_i, r_{x*b}(t_i)],    (16)

the error that the rule built on the bth bootstrap sample makes at the point x_i. Then

    Err^(1) = (1/n) Σ_{i=1}^n Ê_i,  where  Ê_i = Σ_b I_i^b Q_i^b / Σ_b I_i^b.    (17)

This definition agrees with (14) because a bootstrap sample that has I_i^b = 1 is the same as a bootstrap sample from F̂_(i) (see Efron 1992). A slightly different definition was given by Efron (1983) (where Err^(1) is denoted ε̂^(0)), namely Σ_i Σ_b I_i^b Q_i^b / Σ_i Σ_b I_i^b, but the two definitions agree as B → ∞ and produced nearly the same results in our simulations.

There is another way to view cross-validation and Err^(1): as estimates of the average error μ(F). Direct application of the bootstrap gives the plug-in estimate μ(F̂) = E_{F̂} E_{0F̂} Q(x_0, x*). This estimate, discussed in Section 6, tends to be biased downward. The reason is that F̂ is being used twice: as the population, say F_0, from which bootstrap training sets x* are drawn, and as the population F_1 from which test points X_0 are drawn. Let us write μ(F) explicitly as a function of both F_1 and F_0:

    μ(F_1, F_0) = E_{F_1}{E_{0F_0} Q[Y_0, r_X(T_0)]} = E_{0F_0}{E_{F_1} Q[Y_0, r_X(T_0)]},    (18)

where, for convenience, we have switched the order of expectation in the second expression. We assume that in the unknown true state of affairs, F_1 = F_0 = F. Plugging in F̂ for the test distribution F_0 gives

    μ(F_1, F̂) = (1/n) Σ_{i=1}^n E_{F_1}{Q[y_i, r_X(t_i)]}.    (19)

The remaining task is to estimate the training sample distribution F_1. Ideally, we would take F_1 = F. Notice that for continuous populations F, the probability of the test point X_0 = x_i appearing in a training sample drawn from F_1 = F is 0. The plug-in estimate μ(F̂) = μ(F̂, F̂) uses F̂ for F_1. With this choice, the probability that X_0 = x_i appears in the training sample is 1 − (1 − 1/n)^n ≈ .632. Hence μ(F̂) [...]
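The estimators in (7), (8), and (15)-(17) are simple to compute for any black-box prediction rule. The following sketch is ours, not the article's; `fit_predict(t_train, y_train, t_test)` stands for an arbitrary user-supplied training-and-prediction routine returning class labels, and the function names are illustrative only.

```python
import numpy as np

def err_apparent(fit_predict, t, y):
    """Apparent error rate (7): train and test on the same data."""
    return np.mean(fit_predict(t, y, t) != y)

def err_cv1(fit_predict, t, y):
    """Leave-one-out cross-validation (8)."""
    n = len(y)
    errors = []
    for i in range(n):
        keep = np.arange(n) != i
        pred = fit_predict(t[keep], y[keep], t[i:i + 1])[0]
        errors.append(pred != y[i])
    return np.mean(errors)

def err_boot1(fit_predict, t, y, B=50, rng=None):
    """Leave-one-out bootstrap (15)-(17): each point is scored only by
    rules built on bootstrap samples in which it does not appear."""
    rng = np.random.default_rng(rng)
    n = len(y)
    numer = np.zeros(n)   # running sum of I_i^b * Q_i^b
    denom = np.zeros(n)   # running sum of I_i^b
    for _ in range(B):
        idx = rng.integers(0, n, size=n)          # bootstrap sample indices
        out = np.setdiff1d(np.arange(n), idx)     # points with N_i^b = 0
        if out.size == 0:
            continue
        pred = fit_predict(t[idx], y[idx], t[out])
        numer[out] += (pred != y[out])
        denom[out] += 1
    ok = denom > 0                                # points left out at least once
    E_i = numer[ok] / denom[ok]                   # the Ê_i of (17)
    return E_i.mean()
```

Any classifier (LDF, nearest neighbors, a classification tree) can be plugged in for `fit_predict`; points never excluded from any of the B samples are simply dropped from the average, a minor simplification relative to (17).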
[...]sets), and that the expectation of Err^(1) is μ_{n/2} to second order. [...] between Err^(cv1) and Err^(1).

The dotted curve in Figure 2 is the expectation ratio E{Err^(1)}/E{Err^(cv1)}. We see that Err^(1) is biased upward relative to the nearly unbiased estimate Err^(cv1). This is not surprising. Let μ_n indicate the expected true error (6) when x has sample size n. Ordinary cross-validation produces an unbiased estimate of μ_{n−1}, whereas k-fold cross-validation estimates μ_{n−n/k}. Because smaller training sets produce bigger error rates, [...]

[...] where the notation E_{0π_i} indicates that only Y_{0i} ~ binomial(1, π_i) is random, with x and r_x(t_i) being fixed. This situation is similar to a standard regression problem in that the predictors t_i are treated as fixed at their observed values, rather than as random. In-sample error prediction is mathematically simpler than the extra-sample case and leads to quite different solutions for the error rate prediction problem (see Efron 1986).
    Err^(.632+) = Err^(.632) + (Err^(1)′ − err) · (.368 · .632 · R̂) / (1 − .368 R̂).    (32)
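A small sketch of how (32) might be computed follows. It is ours, not the article's; the no-information error rate γ̂ and the relative overfitting rate R̂ are defined in (29)-(31), which this excerpt does not reproduce, so the code assumes the standard statement of those definitions (γ̂ scores every response against every prediction, and R̂ is truncated to [0, 1]).

```python
import numpy as np

def no_information_rate(fit_predict, t, y):
    """gamma-hat: error rate if responses and predictors were independent,
    estimated by scoring every (prediction at t_j, response y_i) pair."""
    pred = fit_predict(t, y, t)                      # predictions at all t_j
    return np.mean(pred[None, :] != y[:, None])      # average over all (i, j)

def err632plus(err_app, err1, gamma):
    """.632+ estimate (32) from the apparent error rate err_app, the
    leave-one-out bootstrap estimate err1, and gamma-hat (a sketch of one
    standard statement of definitions (29)-(31))."""
    err632 = 0.368 * err_app + 0.632 * err1          # the .632 rule
    err1p = min(err1, gamma)                         # truncated Err^(1)'
    if gamma > err_app and err1p > err_app:
        R = (err1p - err_app) / (gamma - err_app)    # relative overfitting rate
    else:
        R = 0.0                                      # no apparent overfitting
    return err632 + (err1p - err_app) * (0.368 * 0.632 * R) / (1 - 0.368 * R)
```

With the routines sketched earlier, `err632plus(err_apparent(...), err_boot1(...), no_information_rate(...))` produces the .632+ point estimate; when R̂ is truncated to 0 the formula falls back to the plain .632 rule.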
[...] (mean − μ)/μ. The dashes in Figure 3 indicate the seven "no-information" experiments, where y was independent of t and Err ≡ .50. (The μ values for these seven experiments have been spread out for clarity.) In these cases the definitions in (31) effectively truncate Err^(.632+) at or near .50. This almost always gives a more accurate estimate of Err on a case-by-case basis but yields a downward bias overall. To put things another way, we could also improve the accuracy of Err^(cv1) by truncating it at γ̂, but then Err^(cv1) would no longer be nearly unbiased.

Better bias adjustments of Err^(1) are available, as discussed in Section 6. However, in reducing the bias of Err^(1), they lose about half of the reduction in variance enjoyed by Err^(.632+) and so offer less dramatic improvements over Err^(cv1).

Note that Err^(.632+) requires no additional applications of the prediction rule after computation of Err^(1). Because 50 bootstrap samples are often sufficient, both Err^(1) and Err^(.632+) can be less costly than Err^(cv1) if n is large.

4. SAMPLING EXPERIMENTS

Table 2 describes the 24 sampling experiments performed for this study. Each experiment involved the choice of a training set size n, a probability distribution F giving x as in (4), a dimension p for the prediction vectors t, and a prediction rule r_x. Twenty-one of the experiments were dichotomous, and three involved four or more classes; 0-1 error (2) was used in all 24 experiments. The first 12 experiments each comprised 200 Monte Carlo simulations (i.e., 200 independent choices of the training set x), and the last 12 experiments each comprised 50 simulations. The bootstrap samples were balanced, in the sense that the indices of the bootstrap data points were obtained by randomly permuting a string of B copies of the integers 1 to n (see Davison, Hinkley, and Schechtman 1986), but the balancing had little effect on our results.

This simulation study might seem excessive, but it must be large to investigate the effects of different classifiers, training set sizes, signal-to-noise ratio, number of classes, and number of observations. Here are some explanatory comments concerning the 24 experiments:

• In experiments #1-18, the response y_i equals 0 or 1 with probability .50, and the conditional distribution of t_i given y_i is multivariate normal. For example, experiment #3 is as described in (1), t_i | y_i ~ N_2((y_i − .5, 0), I), but #4 has the no-information form t_i | y_i ~ N_2((0, 0), I), so that y_i is independent of t_i and every prediction rule has Err = .50. In experiment #19 the response y_i equals 1, 2, 3, or 4 with probability .25, and t_i | y_i ~ N_2(ξ_i, I) with ξ_1 = (−.5, −.5), ξ_2 = (−.5, .5), ξ_3 = (.5, −.5), or ξ_4 = (.5, .5).

• Experiments #13-16 are taken from Friedman (1994). There are two classes in 10 dimensions and 100 training observations; predictors in class 1 are independent standard normal, and those in class 2 are independent normal with mean √j/2 and variance 1/j, for j = 1, 2, ..., 10. All predictors are useful here, but the ones with higher index j are more so.
• Experiments #20-24 refer to real datasets taken from the machine learning archive at UC Irvine. The dimensions of these datasets are (846, 19), (683, 10), and (562, 18) (after removing incomplete observations), with four, two, and 15 classes. We followed Kohavi (1995) and chose a random subset of the data to act as the training set, choosing the training set size n so that μ_n still sloped downward. The idea is that if n is so large that the error curve is flat, then the error rate estimation problem is too easy, because the potential biases arising from changing the training set size will be small. We chose training set sizes 100, 36, and 80. The soybean data actually have 35 categorical predictors, many with more than two possible values. To keep the computations manageable, we used only the 15 binary predictors.

• The prediction rules are LDF, Fisher's linear discriminant analysis; 1-NN and 3-NN, the one-nearest-neighbor and three-nearest-neighbor classifiers; TREES, a classification tree using the tree function in S-PLUS; and QDF, the quadratic discriminant function (i.e., estimating a separate mean and covariance in each class and using the Gaussian log-likelihood for classification).

Tables 3-8 report the performance of several error rate estimators in the 24 sampling experiments. In each table the Exp and SD columns give means and standard deviations, and RMS is the square root of the average squared error for estimating Err(x, F), (5). The bootstrap estimators all used B = 50 bootstrap replications per simulation.

The error rate estimators include Err^(.632), Err^(.632+), Err^(1), and four different cross-validation rules: cv1, cv5f, and cv10f are leave-one-out, fivefold, and tenfold cross-validation, whereas cv5fr is fivefold cross-validation averaged over 10 random choices of the partition (making the total number of recomputations 50, the same as for the bootstrap rules). Also shown are other bias-corrected versions of Err^(1) called bootop, bc1, bc2, and Err^(2); see Section 6. The tables also give statistical summaries for Err, err, and γ̂; see (31).

The results vary considerably from experiment to experiment, but in terms of RMS error the .632+ rule is an overall winner. In Figure 4 the solid line graphs RMS{Err^(.632+)}/RMS{Err^(cv1)} versus the true expected error μ, (6). The median value of the ratio for the 24 experiments was .78. The dotted line is the RMS ratio for estimating μ rather than Err, a measure that is slightly more favorable to the .632+ rule, the median ratio now being .72.

Figure 4. The RMS{Err^(.632+)}/RMS{Err^(cv1)} Ratio From Tables 3-8, Plotted Versus Expected True Error μ (solid line), and the RMS Ratio for Estimating μ Instead of Err (dotted line). Dashes indicate the seven no-information experiments. The vertical axis is plotted logarithmically.

[Table 7, "LDF for the (20, 2, +), (14, 12, ind), and (20, 2, 4) Problems," appears here; only its caption, the panel heading "17: (20, 2, +)," and the column labels Exp, SD, RMS were recovered.]

Simulation results must be viewed with caution, especially in an area as broadly defined as the prediction problem. The smoothing argument of Section 2 strongly suggests that it should be possible to improve on cross-validation. With this in mind, Figure 4 demonstrates that Err^(.632+) has made full use of the decreased standard deviation seen in Figure 2. However, the decrease in RMS is less dependable than the decrease in SD, and part of the RMS decrease is due to the truncation at γ̂ in definitions (31) and (32). The truncation effect is particularly noticeable in the seven no-information experiments.

5. ESTIMATING THE STANDARD ERROR OF Err^(1)

The same set of bootstrap replications that gives a point estimate of prediction error can also be used to assess the variability of that estimate. This can be useful for inference purposes, model selection, or comparing two models. The method presented here, called "delta-method-after-bootstrap" by Efron (1992), works well for estimators like Err^(1) that are smooth functions of x. It is more difficult to obtain standard error estimates for cross-validation or the .632+ estimator, and we do not study those estimators in this section.

We discuss estimating the usual external standard deviation of Err^(1); that is, the variability in Err^(1) caused by the random choice of x. We also discuss the internal variability, due to the random choice of the B bootstrap samples (as at the end of Sec. 2), because it affects the assessment of external variability. Finally, we discuss estimating the standard deviation of the difference Err^(1)(rule 1) − Err^(1)(rule 2), where rule 1 and rule 2 are two different prediction rules applied to the same set of data.

The nonparametric delta method estimate of standard error applies to symmetrically defined statistics, those that are invariant under permutation of the points x_i in x = (x_1, x_2, ..., x_n). In this case we can write the statistic as a function S(F̂) of the empirical distribution. The form of S can depend on n, but S(F̂) must be smoothly defined in the following sense: Let F̂_{ε,i} be a version of F̂ that puts extra probability on x_i,

    F̂_{ε,i}: probability (1 − ε)/n + ε on x_i, and (1 − ε)/n on x_j for j ≠ i.    (33)

Then we need the derivatives ∂S(F̂_{ε,i})/∂ε to exist at ε = 0. Defining

    D̂_i = (1/n) ∂S(F̂_{ε,i})/∂ε, evaluated at ε = 0,    (34)

the nonparametric delta method standard error estimate for S(F̂) is

    SE_del = [ Σ_{i=1}^n D̂_i² ]^{1/2}    (35)

(see Efron 1992, Sec. 5). The vector D̂ = (D̂_1, D̂_2, ..., D̂_n) is 1/n times the empirical influence function of S.

If the prediction rule r_x(t) is a symmetric function of the points x_i in x, as it usually is, then Err^(1) is also symmetrically defined. The expectation in (13) guarantees that Err^(1) will be smoothly defined in x, so we can apply formulas (34)-(35).

We first consider the ideal case where Err^(1) is based on all B = n^n possible bootstrap samples x* = (x_{i_1}, x_{i_2}, ..., x_{i_n}), each i_j ∈ {1, 2, ..., n}. Following the notation in (14)-(16), let

    q_i^b = I_i^b Q_i^b,    (36)

so that, as in (17),

    Ê_i = Σ_{b=1}^B q_i^b / Σ_{b=1}^B I_i^b.    (37)

We also define Ĉ_i to be the bootstrap covariance between N_i^b and q_i^b,

    Ĉ_i = (1/B) Σ_{b=1}^B (N_i^b − 1) q_i^b.    (38)

The following lemma was proven by Efron and Tibshirani (1995).

Lemma. The derivative (34) for S(F̂) = Err^(1) is

    D̂_i = (2 + 1/(n − 1)) (Ê_i − Err^(1))/n + e_n Ĉ_i,    (39)

where e_n = (1 − 1/n)^{−n}.

A naive estimate of standard error for Err^(1) would be [Σ_i (Ê_i − Err^(1))²/n²]^{1/2}, based on the false assumption that the Ê_i are independent. This amounts to taking D̂_i = (Ê_i − Err^(1))/n in (35). The actual influences (39) usually result in a larger standard error than the naive estimates.

In practice, we have only B bootstrap samples, with B being much smaller than n^n. In that case, we can set

    D̂_i = (2 + 1/(n − 1)) (Ê_i − Err^(1))/n + [ Σ_{b=1}^B (N_i^b − N̄_i) q_i^b ] / Σ_{b=1}^B I_i^b,    (40)

where Ê_i and Err^(1) are as given in (37), and N̄_i = Σ_{b=1}^B N_i^b / B. (N̄_i = 1 for a balanced set of bootstrap samples.) The bootstrap expectation of I_i^b equals e_n^{−1}, so (40) goes to (39) as we approach the ideal case B = n^n.

Finally, we can use the jackknife to assess the internal error of the D̂_i estimates, namely the Monte Carlo error that comes from using only a limited number of bootstrap replications. Let D̂_{i(b)} indicate the value of D̂_i calculated from the B − 1 bootstrap samples not including the bth sample. Simple computational formulas for D̂_{i(b)} are available along the lines of (17). The internal standard error of D̂_i is given by the jackknife formula

    se_i = [ ((B − 1)/B) Σ_{b=1}^B (D̂_{i(b)} − D̂_{i(·)})² ]^{1/2},    (41)

D̂_{i(·)} ≡ Σ_b D̂_{i(b)}/B, with the total internal error

    SE_int = [ Σ_{i=1}^n se_i² ]^{1/2}.    (42)

This leads to an adjusted standard error estimate for Err^(1),

    SE_adj = [ SE_del² − SE_int² ]^{1/2}.    (43)

Table 9. B = 1,000 Bootstrap Replications From a Single Realization of Experiment #3, Table 2

  B          SE_del   SE_int   SE_adj       B            SE_del   SE_int   SE_adj
  1:50        .187     .095     .161        1:100         .131     .063     .115
  1:100       .131     .063     .115        101:200       .122     .077     .094
  1:200       .119     .049     .109        201:300       .128     .068     .108
  1:400       .110     .034     .105        301:400       .118     .068     .097
  1:1000      .100     .023     .097        401:500       .134     .076     .110
                                            501:600       .116     .085     .079
                                            601:700       .126     .060     .111
                                            701:800       .108     .076     .077
                                            801:900       .109     .084     .068
                                            901:1000      .116     .082     .082
                                            Mean          .121     .074     .0941
                                            (Std. dev.)  (.009)   (.009)   (.0168)

NOTE: Standard error estimates (35), (42), and (43) were based on the indicated portions of the 1,000 replications.

The right side of Table 9 shows SE_del and SE_adj for successive groups of 100 bootstrap replications. The values of SE_del are remarkably stable but biased upward from the answer based on all 1,000 replications; the bias-adjusted values SE_adj are less biased but about twice as variable from group to group. In this example both SE_del and SE_adj gave useful results even for B as small as 100.

The delta method cannot be directly applied to find the standard error of Err^(.632+), because the .632+ rule involves err, an unsmooth function of x. A reasonable estimate of standard error for Err^(.632+) is obtained by multiplying (35) or (43) by Err^(.632+)/Err^(1). This is reasonable, because the coefficient of variation for the two estimators was nearly the same in our experiments.

Finally, suppose that we apply two different prediction rules, r′ and r″, to the same training set and wish to assess the significance of the difference D̂iff = Err^(1)′ − Err^(1)″ between the error rate estimates. For example, r′ and r″ could be LDF and NN, as in Figure 1. The previous theory goes through if we change the definition of Q_i^b in (36) to

    Q_i^b = Q[y_i, r′_{x*b}(t_i)] − Q[y_i, r″_{x*b}(t_i)].    (44)

Then the delta-method estimate of standard error for D̂iff is (Σ_i D̂_i²)^{1/2}, where

    D̂_i = (2 + 1/(n − 1)) (D̂iff_i − D̂iff)/n + e_n Ĉ_i.    (45)

Here D̂iff_i = Σ_b q_i^b / Σ_b I_i^b, q_i^b ≡ I_i^b Q_i^b, and everything else is as defined in (38) and (39).
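Formulas (36)-(40) and (35) translate directly into a few lines of code. The sketch below is ours, not the article's; the array and function names are illustrative. It computes SE_del for Err^(1) from the bootstrap count matrix N and per-sample error matrix Q that are already available from computing (17).

```python
import numpy as np

def se_delta_err1(N, Q):
    """Delta-method-after-bootstrap standard error for Err^(1), following
    (36)-(40) and (35).  N[b, i] = number of times x_i appears in bootstrap
    sample b; Q[b, i] = 0/1 error of the rule built on sample b at x_i."""
    B, n = N.shape
    I = (N == 0).astype(float)                 # I_i^b in (15)
    q = I * Q                                  # q_i^b in (36)
    denom = I.sum(axis=0)                      # sum_b I_i^b
    if np.any(denom == 0):
        raise ValueError("every point must be excluded from at least one sample")
    E = q.sum(axis=0) / denom                  # Ê_i in (37)
    err1 = E.mean()                            # Err^(1) in (17)
    Nbar = N.mean(axis=0)                      # N̄_i
    cov_term = ((N - Nbar) * q).sum(axis=0) / denom
    D = (2.0 + 1.0 / (n - 1)) * (E - err1) / n + cov_term   # (40)
    return np.sqrt(np.sum(D ** 2))             # SE_del in (35)
```

For comparing two rules as in (44)-(45), the same routine can be reused with Q replaced by the difference of the two rules' 0/1 error matrices, so that the returned value estimates the standard error of the difference of their Err^(1) values.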
[Figure (caption only partly recovered): "... Δ = .05 around each training point. The dotted line represents the LDF decision boundary." The horizontal axis is labeled x1.]

[...] other more careful bias-correction methods. However, these "better" methods did not produce better estimators in our experiments, with the reduced bias paid for by too great an increase in variance.

Efron (1983) used distance methods to motivate Err^(.632), (24). Here is a brief review of the arguments in that article, which lead up to the motivation for Err^(.632+). Given a system of neighborhoods around points x = (t, y), let S(x, Δ) indicate the neighborhood of x having probability content Δ,

    Pr{X_0 ∈ S(x, Δ)} = Δ.    (46)

(In this section capital letters indicate random quantities distributed according to F [e.g., X_0 in (46)], and lower-case x values are considered fixed.) As Δ → 0, we assume that the neighborhood S(x, Δ) shrinks toward the point x. The distance of test point X_0 from a training set x is defined by its distance from the nearest point in x, [...]

[...] the expected prediction error for test points distance Δ from the training set. The true expected error (6) is given by

    μ = ∫ μ(Δ) g(Δ) dΔ,

where g(Δ) is the density of δ(X_0, x). Under reasonable conditions, g(Δ) approaches the exponential density n e^{−nΔ}. [...] where g*(Δ) is the density of δ* = δ(X_0, X*), given that δ* > 0. Because bootstrap samples are supported on about .632n of the points in the training set, the same argument that gives g(Δ) ≈ n e^{−nΔ} also shows that

    g*(Δ) ≈ .632 n e^{−.632 n Δ}.    (54)

Efron (1983) further supposed that

    μ*(Δ) ≈ μ(Δ)  and  μ(Δ) ≈ ν + β Δ    (55)

for some value β, with the intercept ν coming from (51). Combining these approximations gives

    μ ≈ ν + β/n  and  μ* ≈ ν + β/(.632 n),    (56)

so

    μ ≈ .368 ν + .632 μ*.    (57)
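The step from (56) to (57) is a one-line elimination of β; since the excerpt omits the intermediate algebra, here is the calculation (our rendering, in the ν, β, μ, μ* notation of (55)-(56)).

```latex
% From (56):  \mu \approx \nu + \beta/n  and  \mu^{*} \approx \nu + \beta/(.632\,n),
% so \beta/n \approx .632\,(\mu^{*} - \nu).  Substituting into the first relation:
\mu \;\approx\; \nu + .632\,(\mu^{*} - \nu) \;=\; .368\,\nu + .632\,\mu^{*},
```

which is (57). Identifying ν with the apparent error rate and μ* with the expectation of the leave-one-out bootstrap estimate is what motivates the .632 estimator referred to above as (24), Err^(.632) = .368·err + .632·Err^(1); equation (24) itself is not reproduced in this excerpt.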
[Figure with two panels labeled LDF and NN; only the panel titles survived extraction.]

[...] the .632+ rule, but not much more than that. The assumption μ(Δ) ≈ μ*(Δ) in (55) is particularly vulnerable to criticism, though in the example shown in figure 2 of Efron (1983) it is quite accurate. A theoretically more defensible approach to the bias correction of Err^(1) can be made using Efron's analysis of variance (ANOVA) decomposition arguments (Efron 1983); see the Appendix.

7. CONCLUSIONS

Our studies show that leave-one-out cross-validation is reasonably unbiased but can suffer from high variability in some problems. Fivefold or tenfold cross-validation exhibits lower variance but higher bias when the error rate curve is still sloping at the given training set size. Similarly, the leave-one-out bootstrap has low variance but sometimes has noticeable bias. The new .632+ estimator is the best overall performer, combining low variance with only moderate bias. All in all, we feel that bias was not a major problem for Err^(.632+) in our simulation studies, and that attempts to correct bias were too expensive in terms of [...]

APPENDIX

[...] (A.7), where Î_i = Σ_b I_i^b / B. Formula (A.6) says that we can bias correct the apparent error rate err by adding e_n times the average covariance between I_i^b (15), the absence or presence of x_i in x*, and Q_i^b (16), whether or not r_{x*}(t_i) incorrectly predicts y_i. These covariances will usually be positive.

Despite its attractions, Err^(2) was an underachiever in the work of Efron (1983), where it appeared in table 2 of that article, and also in the experiments here. Generally speaking, it gains only about half of the RMS advantage over Err^(cv1) enjoyed by Err^(1). Moreover, its bias-correction powers fail for cases like experiment #7, where the formal expansions (A.3) and (A.4) are also misleading.
The estimator called bootop in Tables 3-8 was defined as

    Err^(bootop) = err − (1/n) Σ_{i=1}^n cov_*(N_i^b, Q_i^b)    (A.8)

in (2.10) of Efron (1983), "bootop" standing for "bootstrap optimism." Here cov_*(N_i^b, Q_i^b) = Σ_b (N_i^b − 1) Q_i^b / B. Efron's (1983) section 7 showed that the bootop rule, like Err^(cv1) and Err^(2), has bias of order only 1/n² instead of 1/n. This does not keep it from being badly biased downward in several of the sampling experiments, particularly for the NN rules.

We also tried to bias correct Err^(1) using a second level of bootstrapping. For each training set x, 50 second-level bootstrap samples were drawn by resampling (one time each) from the 50 original bootstrap samples. The number of distinct original points x_i appearing in a second-level bootstrap sample is approximately .502·n, compared to .632·n for a first-level sample. Let Err^(sec) be the Err^(1) statistic (17) computed using the second-level samples instead of the first-level samples. The rules called bc1 and bc2 in Tables 3-7 are linear combinations of Err^(1) and Err^(sec),

    Err^(bc1) = 2 · Err^(1) − Err^(sec)

and

    Err^(bc2) = 3.83 · Err^(1) − 2.83 · Err^(sec).    (A.9)

The first of these is suggested by the usual formulas for bootstrap bias correction. The second is based on linearly extrapolating μ̂_{.502n} = Err^(sec) and μ̂_{.632n} = Err^(1) to an estimate for μ_n. The bias correction works reasonably in Tables 3-7, but once again at a substantial price in variability.

[Received May 1995. Revised July 1996.]

REFERENCES

Allen, D. M. (1974), "The Relationship Between Variable Selection and Data Augmentation and a Method of Prediction," Technometrics, 16, 125-127.
Breiman, L. (1994), "Bagging Predictors," technical report, University of California, Berkeley.
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984), Classification and Regression Trees, Pacific Grove, CA: Wadsworth.
Breiman, L., and Spector, P. (1992), "Submodel Selection and Evaluation in Regression: The X-Random Case," International Statistical Review, 60, 291-319.
Chernick, M., Murthy, V., and Nealy, C. (1985), "Application of Bootstrap and Other Resampling Methods: Evaluation of Classifier Performance," Pattern Recognition Letters, 4, 167-178.
--- (1986), Correction to "Application of Bootstrap and Other Resampling Methods: Evaluation of Classifier Performance," Pattern Recognition Letters, 4, 133-142.
Cosman, P., Perlmutter, K., Perlmutter, S., Olshen, R., and Gray, R. (1991), "Training Sequence Size and Vector Quantizer Performance," in 25th Asilomar Conference on Signals, Systems, and Computers, November 4-6, 1991, ed. Ray R. Chen, Los Alamitos, CA: IEEE Computer Society Press, pp. 434-448.
Davison, A. C., Hinkley, D. V., and Schechtman, E. (1986), "Efficient Bootstrap Simulations," Biometrika, 73, 555-566.
Efron, B. (1979), "Bootstrap Methods: Another Look at the Jackknife," The Annals of Statistics, 7, 1-26.
--- (1983), "Estimating the Error Rate of a Prediction Rule: Some Improvements on Cross-Validation," Journal of the American Statistical Association, 78, 316-331.
--- (1986), "How Biased is the Apparent Error Rate of a Prediction Rule?," Journal of the American Statistical Association, 81, 461-470.
--- (1992), "Jackknife-After-Bootstrap Standard Errors and Influence Functions" (with discussion), Journal of the Royal Statistical Society, Ser. B, 83-111.
Efron, B., and Gong, G. (1983), "A Leisurely Look at the Bootstrap, the Jackknife and Cross-Validation," The American Statistician, 37, 36-48.
Efron, B., and Tibshirani, R. (1993), An Introduction to the Bootstrap, London: Chapman and Hall.
--- (1995), "Cross-Validation and the Bootstrap: Estimating the Error Rate of a Prediction Rule," Technical Report 176, Stanford University, Department of Statistics.
Friedman, J. (1994), "Flexible Metric Nearest-Neighbor Classification," technical report, Stanford University, Department of Statistics.
Geisser, S. (1975), "The Predictive Sample Reuse Method With Applications," Journal of the American Statistical Association, 70, 320-328.
Jain, A., Dubes, R. P., and Chen, C. (1987), "Bootstrap Techniques for Error Estimation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 9, 628-633.
Kohavi, R. (1995), "A Study of Cross-Validation and Bootstrap for Accuracy Assessment and Model Selection," technical report, Stanford University, Department of Computer Science.
Mallows, C. (1973), "Some Comments on Cp," Technometrics, 15, 661-675.
McLachlan, G. (1992), Discriminant Analysis and Statistical Pattern Recognition, New York: Wiley.
Stone, M. (1974), "Cross-Validatory Choice and Assessment of Statistical Predictions," Journal of the Royal Statistical Society, Ser. B, 36, 111-147.
--- (1977), "An Asymptotic Equivalence of Choice of Model by Cross-Validation and Akaike's Criterion," Journal of the Royal Statistical Society, Ser. B, 39, 44-47.
Wahba, G. (1980), "Spline Bases, Regularization, and Generalized Cross-Validation for Solving Approximation Problems With Large Quantities of Noisy Data," in Proceedings of the International Conference on Approximation Theory in Honour of George Lorenz, Austin, TX: Academic Press.
Zhang, P. (1993), "Model Selection via Multifold CV," The Annals of Statistics, 21, 299-311.
--- (1995), "APE and Models for Categorical Panel Data," Scandinavian Journal of Statistics, 22, 83-94.