This Content Downloaded From 140.213.190.131 On Tue, 13 Apr 2021 09:26:31 UTC

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 23

Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of

Statistical Accuracy
Author(s): B. Efron and R. Tibshirani
Source: Statistical Science , Feb., 1986, Vol. 1, No. 1 (Feb., 1986), pp. 54-75
Published by: Institute of Mathematical Statistics

Stable URL: https://www.jstor.org/stable/2245500

JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide
range of content in a trusted digital archive. We use information technology and tools to increase productivity and
facilitate new forms of scholarship. For more information about JSTOR, please contact support@jstor.org.

Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at
https://about.jstor.org/terms

Institute of Mathematical Statistics is collaborating with JSTOR to digitize, preserve and


extend access to Statistical Science

This content downloaded from


140.213.190.131 on Tue, 13 Apr 2021 09:26:31 UTC
All use subject to https://about.jstor.org/terms
Statistical Science
1986, Vol. 1, No. 1, 54-77

Bootstrap Methods for Standard Errors,


Confidence Intervals, and Other Measures of
Statistical Accuracy
B. Efron and R. Tibshirani

Abstract. This is a review of bootstrap methods, concentrating on basic


ideas and applications rather than theoretical considerations. It begins with
an exposition of the bootstrap estimate of standard error for one-sample
situations. Several examples, some involving quite complicated statistical
procedures, are given. The bootstrap is then extended to other measures of
statistical accuracy such as bias and prediction error, and to complicated
data structures such as time series, censored data, and regression models.
Several more examples are presented illustrating these ideas. The last third
of the paper deals mainly with bootstrap confidence intervals.

Key words: Bootstrap method, estimated standard errors, approximate


confidence intervals, nonparametric methods.

1. INTRODUCTION particularly Efron (1982a). Some of the discussion


here is abridged from Efron and Gong (1983) and also
A typical problem in applied statistics involves the
from Efron (1984).
estimation of an unknown parameter 0. The two main
Before beginning the main exposition, we will de-
questions asked are (1) what estimator 0 should be
scribe how the bootstrap works in terms of a problem
used? (2) Having chosen to use a particular 0, how
where it is not needed, assessing the accuracy of the
accurate is it as an estimator of 0? The bootstrap is a
sample mean. Suppose that our data consists of a
general methodology for answering the second ques-
random sample from an unknown probability distri-
tion. It is a computer-based method, which substitutes
bution F on the real line,
considerable amounts of computation in place of the-
oretical analysis. As we shall see, the bootstrap can (1.1) Xl X2, * , X.- F.
routinely answer questions which are far too compli-
Having observed X1 = x1, X2 = x2, ... , Xn = xn, we
cated for traditional statistical analysis. Even for rel-
compute the sample mean x = 1 xn/n, and wonder
atively simple problems computer-intensive methods
how accurate it is as an estimate of the true mean
like the bootstrap are an increasingly good data ana-
6 = EFIX}.
lytic bargain in an era of exponentially declining com-
If the second central moment of F is 182(F) EFX2
putational costs.
- (EFX)2, then the standard error a(F; n, x), that is
This paper describes the basis of the bootstrap
the standard deviation of i for a sample of size n from
theory, which is very simple, and gives several exam-
distribution F, is
ples of its use. Related ideas like the jackknife, the
delta method, and Fisher's information bound are also (1.2) o(F) = [,M2(F)/n]112.
discussed. Most of the proofs and technical details are
The shortened notation o(F) -(F; n, i) is allow-
omitted. These can be found in the references given,
able because the sample size n and statistic of interest
x are known, only F being unknown. The standard
B. Efron is Professor of Statistics and Biostatistics, and error is the traditional measure of i's accuracy. Un-
Chairman of the Program in Mathematical and Com- fortunately, we cannot actually use (1.2) to assess the
putational Science at Stanford University. His mailing accuracy of i, since we do not know M2(F), but we can
address is Department of Statistics, Sequoia Hall, Stan- use the estimated standard error
ford University, Stanford, CA 94305. R. Tibshirani is
(1.3) = [#2/n]l/2
a Postdoctoral Fellow in the Department of Preventive
Medicine and Biostatistics, Faculty of Medicine, Uni- where jX2 = Ei (Xi-x)2/(n - 1), the unbiased estimate
versity of Toronto, McMurrick Building, Toronto, of A2(F).
Ontario, M5S 1A8, Canada. There is a more obvious way to estimate o(F). Let

54

This content downloaded from


140.213.190.131 on Tue, 13 Apr 2021 09:26:31 UTC
All use subject to https://about.jstor.org/terms
BOOTSTRAP METHODS FOR MEASURES OF STATISTICAL ACCURACY 55

F indicate the empirical probability distribution, We will see that bootstrap confidence intervals can
automatically incorporate tricks like this, without re-
(1.4) F: probability mass 1/n on x1, x2, ... , xn.
quiring the data analyst to produce special techniques,
Then we can simply replace F by F in (1.2), obtaining like the tanh-1 transformation, for each new situation.
An important theme of what follows is the substitution
(1.5) a - F) = [U2(P)/nl . of raw computing power for theoretical analysis. This
is not an argument against theory, of course, only
as the estimated standard error for i. This is the
against unnecessary theory. Most common statistical
bootstrap estimate. The reason for the name "boot-
methods were developed in the 1920s and 1930s, when
strap" will be apparent in Section 2, when we evaluate
computation was slow and expensive. Now that com-
v(F) for statistics more complicated than x. Since
putation is fast and cheap we can hope for and expect
changes in statistical methodology. This paper dis-
(1.6) /2 a 82(F) = X (
cusses one such potential change, Efron (1979b) dis-
cusses several others.
a is not quite the same as a-, but the difference is too
small to be important in most applications.
2. THE BOOTSTRAP ESTIMATE OF STANDARD
Of course we do not really need an alternative
ERROR
formula to (1.3) in this case. The trouble begins when
we want a standard error for estimators more compli- This section presents a more careful description of
cated than x, for example, a median or a correlation the bootstrap estimate of standard error. For now we
or a slope coefficient from a robust regression. In most will assume that the observed data y = (xl, x2, * *
cases there is no equivalent to formula (1.2), which xn) consists of independent and identically distributed
expresses the standard error a(F) as a simple function
(iid) observations X1, X2- .-, Xn fiid F, as in (1.1).
of the sampling distribution F. As a result, formulas Here F represents an unknown probability distribu-
like (1.3) do not exist for most statistics. tion on r, the common sample space of the observa-
This is where the computer comes in. It turns out tions. We have a statistic of interest, say 0(y), to
that we can always numerically evaluate the bootstrap which we wish to assign an estimated standard error.
estimate a = a(F), without knowing a simple expres- Fig. 1 shows an example. The sample space r is
sion for a(F). The evaluation of a is a straightforward n2+, the positive quadrant of the plane. We have
Monte Carlo exercise described in the next section. In observed n = 15 bivariate data points, each corre-
a good computing environment, as described in the
sponding to an American law school. Each point xi
remarks in Section 2, the bootstrap effectively gives
consists of two summary statistics for the 1973 enter-
the statistician a simple formula like (1.3) for any ing class at law school i
statistic, no matter how complicated.
Standard errors are crude but useful measures of (2.1) xi= (LSATi, GPAi);
statistical accuracy. They are frequently used to give
approximate confidence intervals for an unknown
LSATi is the class' average score on a nationwide
parameter 0
exam called "LSAT"; GPAi is the class' average un-
dergraduate grades. The observed Pearson correlation
(1.7) 0 E 0 ? Sz(a),

where z(a) is the 100 * a percentile point of a standard


3.5-
normal variate, e.g., Z(95) = 1.645. Interval (1.7) is
sometimes good, and sometimes not so good. Sections GPA *1

7 and 8 discuss a more sophisticated use of the boot- 3.3t *2


strap, which gives better approximate confidence in-
tervals than (1.7). 3.1 -10
*6
The standard interval (1.7) is based on taking lit- GPA - 97 *4
erally the large sample normal approximation (f - 2.9 - * 15
@14
0)/S N(0, 1). Applied statisticians use a variety of 03
tricks to improve this approximation. For instance if @13 *12
2.7 I ' - l I 1 l
0 is the correlation coefficient and 0 the sample cor- 540 560 580 600 620 640 660 680
relation, then the transformation 4 = tanh-1(0), = LSAT
tanh-1(0) greatly improves the normal approximation,
FIG. 1. The law school data (Efron, 1979b). The data points, begin-
at least in those cases where the underlying sampling
ning with School 1, are (576, 3.39), (635, 3.30), (558, 2.81),
distribution is bivariate normal. The correct tactic
(578, 3.03), (666, 3.44), (580, 3.07), (555, 3.00), (661, 3.43),
then is to transform, compute the interval (1.7) for , (651, 3.36), (605, 3.13), (653, 3.12), (575, 2.74), (545, 2.76),
and transform this interval back to the 0 scale. (572, 2.88), (594, 2.96).

This content downloaded from


140.213.190.131 on Tue, 13 Apr 2021 09:26:31 UTC
All use subject to https://about.jstor.org/terms
56 B. EFRON AND R. TIBSHIRANI

coefficient for these 15 points is 6 = .776. We wish to Carlo algorithm will not converge to a' if the bootstrap
assign a standard error to this estimate. sample size differs from the true n. Bickel and Freed-
Let o(F) indicate the standard error of 0, as a man (1981) show how to correct the algorithm to give
function of the unknown sampling distribution F, a if in fact the bootstrap sample size is taken different
than n, but so far there does not seem to be any
(2.2) a(F) = [VarF{Ny)
practical advantage to be gained in this way.
Of course a (F) is also a function of the sample size n Fig. 2 shows the histogram of B = 1000 bootstrap
and the form of the statistic 0(y), but since both of replications of the correlation coefficient from the law
these are known they need not be indicated in the school data. For convenient reference the abscissa is
notation. The bootstrap estimate of standard error is plotted in terms of 0* - 0 = 0* - .776. Formula (2.4)
gives 6' = .127 as the bootstrap estimate of standard
(2.3) =
error. This can be compared with the usual normal
where F is the empirical distribution (1.4), putting theory estimate of standard error for 0,
probability 1/n on each observed data point xi. In the
(2.5) TNORM = (1 - )/(n - 3)1/ = .115,
law school example, F is the distribution putting mass
1/15 on each point in Fig. 1, and a is the standard [Johnson and Kotz (1970, p. 229)].
deviation of the correlation coefficient for 15 iid points
drawn from F. REMARK. The Monte Carlo algorithm leading to
In most cases, including that of the correlation U7B (2.4) is simple to program. On the Stanford version
coefficient, there is no simple expression for the func- of the statistical computing language S, Professor
tion a(F) in (2.2). Nevertheless, it is easy to numeri- Arthur Owen has introduced a single command which
cally evaluate a = 0(F) by means of a Monte Carlo bootstraps any statistic in the S catalog. For instance
algorithm, which depends on the following notation: the bootstrap results in Fig. 2 are obtained simply by
= (x4, 4, *, x*) indicates n independent draws typing
from F, called a bootstrap sample. Because F is the
tboot(lawdata, correlation, B = 1000).
empirical distribution of the data, a bootstrap sample
turns out to be the same as a random sample of size The execution time is about a factor of B greater than
n drawn with replacement from the actual sample that for the original computation.
{X1, X2, .. * * Xnl.
The Monte Carlo algorithm proceeds in three steps: There is another way to describe the bootstrap
(i) using a random number generator, independently standard error: F is the nonparametric maximum like-
draw a large number of bootstrap samples, say y*(1), lihood estimate (MLE) of the unknown distribution F
y*(2), ***, y*(B); (ii) for each bootstrap sample y*(b),(Kiefer and Wolfowitz, 1956). This means that the
evaluate the statistic of interest, say 0*(b) = 0(y*(b))gbootstrap estimate af = a(F) is the nonparametric
b = 1, 2, * , B; and (iii) calculate the sample standard MLE of v(F), the true standard error.
deviation of the 0*(b) values In fact there is nothing which says that the boot-
strap must be carried out nonparametrically. Suppose
A Zb=1 {8*(b)--A
*.)}0/
0()2A 1/2 for instance that in the law school example we believe
the true sampling distribution F must be bivariate
(2.4) B-i
normal. Then we could estimate F with its parametric
l*(.)= >20*(b) MLE FNORM, the bivariate normal distribution having
the same mean vector and covariance matrix as the
B~~~~~~
It is easy to see that as B - 60, 5B will approach
a = (F), the bootstrap estimate of standard error.
All we are doing is evaluating a standard deviation
by Monte Carlo sampling. Later, in Section 9, we NORMAL

will discuss how large B need be taken. For most THEORYHITGA


DENSITY HISTOGRAM
situations B in the range 50 to 200 is quite adequate.
In what follows we will usually ignore the difference
HISTOGRAM
between 5B and a, calling both simply "a"
PERCENTILES
Why is each bootstrap sample taken with the same 160/o 50 84
sample size n as the original data set? Remember that
-0.4 -0.3 -0.2 -0.1 0 0.1 0.2
o(F) is actually (F, n, 6), the standard error for the
statistic 0( ) based on a random sample of size n from FIG. 2. Histogram of B = 1000 boots
the unknown distribution F. The bootstrap estimate law school data. The normal theory de
f is actually o(F, n, 0) evaluated at F = F. The Monte but falls off more quickly at the upper tail.

This content downloaded from


140.213.190.131 on Tue, 13 Apr 2021 09:26:31 UTC
All use subject to https://about.jstor.org/terms
BOOTSTRAP METHODS FOR MEASURES OF STATISTICAL ACCURACY 57

data. The bootstrap samples at step (i) of the algo- expression for the standard error of an MLE), and in
rithm could then be drawn from FNORM instead of F, fact all analytic difficulties of any kind. The data
and steps (ii) and (iii) carried out as before. analyst is free to obtain standard errors for enor-
The smooth curve in Fig. 2 shows the results of mously complicated estimators, subject only to the
carrying out this "normal theory bootstrap" on the constraints of computer time. Sections 3 and 6 discuss
law school data. Actually there is no need to do the some interesting applied problems which are far too
bootstrap sampling in this case, because of Fisher's complicated for standard analyses.
formula for the sampling density of a correlation coef- How well does the bootstrap work? Table 1 shows
ficient in the bivariate normal situation (see Chapter the answer in one situation. Here r is the real line,
32 of Johnson and Kotz, 1970). This density can be n = 15, and the statistic 0 of interest is the 25%
thought of as the bootstrap distribution for B = oo. trimmed mean. If the true sampling distribution F is
Expression (2.5) is a close approximation to "1NORM = N(0, 1), then the true standard error is a(F) = .286.
o(FNORM), the parametric bootstrap estimate of stand- The bootstrap estimate 'a is nearly unbiased, averaging
ard error. .287 in a large sampling experiment. The standard
In considering the merits or demerits of the boot- deviation of the bootstrap estimate a' is itself .071 in
strap, it is worth remembering that all of the usual this case, with coefficient of variation .071/.287 = .25.
formulas for estimating standard errors, like g-l/2 (Notice that there are two levels of Monte Carlo
where J is the observed Fisher information, are es- involved in Table 1: first drawing the actual samples
sentially bootstrap estimates carried out in a para- y = (xl, x2, ..., x15) from F, and then drawing boot-
metric framework. This point is carefully explained in strap samples (x4, x2*, * * *, x15) with y held fixed. The
Section 5 of Efron (1982c). The straightforward non- bootstrap samples evaluate a' for a fixed value of y.
parametric algorithm (i)-(iii) has the virtues of avoid- The standard deviation .071 refers to the variability
ing all parametric assumptions, all approximations of a' due to the random choice of y.)
(such as those involved with the Fisher information The jackknife, another common method of assign-
ing nonparametric standard errors, is discussed in
Section 10. The jackknife estimate CJ' is also nearly
TABLE 1
unbiased for a(F), but has higher coefficient of vari-
A sampling experiment comparing the bootstrap and jackknife
estimates of standard error for the 25% trimmed mean,
ation (CV). The minimum possible CV for a scale-
sample size n = 15 invariant estimate of a(F), assuming full knowledge
of the parametric model, is shown in brackets. The
F standard F negative
normal exponential
nonparametric bootstrap is seen to be moderately
efficient in both cases considered in Table 1.
Ave SD CV Ave SD CV
Table 2 returns to the case of 0 the correlation
Bootstrap f .2s87 .071 .25 .242 .078 .32 coefficient. Instead of real data we have a sampling
(B = 200) experiment in which the true F is bivariate normal,
Jackknife 6f .280 .084 .30 .224 .085 .38
true correlation 0 = .50, sample size n = 14. Table 2
True (minimum CV) .286 (.19) .232 (.27)
is abstracted from a larger table in Efron (1981b), in

TABLE 2

Estimates of standard error for the correlation coefficient C and for X = tanh-'6; sample size n 14, distribu
correlation p = .5 (from a larger table in Efron, 1981b)

Summary statistics for 200 trials

Standard error estimates for C Standard error estimates for X

Ave SD CV MSE Ave SD CV -VKfM

1. Bootstrap B = 128 .206 .066 .32 .067 .301 .065 .22 .065
2. Bootstrap B = 512 .206 .063 .31 .064 .301 .062 .21 .062
3. Normal smoothed bootstrap B = 128 .200 .060 .30 .063 .296 .041 .14 .041
4. Uniform smoothed bootstrap B = 128 .205 .061 .30 .062 .298 .058 .19 .058
5. Uniform smoothed bootstrap B = 512 .205 .059 .29 .060 .296 .052 .18 .052

6. Jackknife .223 .085 .38 .085 .314 .090 .29 .091


7. Delta method .175 .058 .33 .072 .244 .052 .21 .076
(Infinitesimal jackknife)

8. Normal theory .217 .056 .26 .056 .302 0 0 .003

True standard error .218 .299

This content downloaded from


140.213.190.131 on Tue, 13 Apr 2021 09:26:31 UTC
All use subject to https://about.jstor.org/terms
58 B. EFRON AND R. TIBSHIRANI

which some of the methods for estimating a standard time (y) in weeks for two groups, treatment (x = 0)
error required the sample size to be even. and control (x = 1), and a 0-1 variable (bi) indicating
The left side of Table 2 refers to 0, while the right whether or not the remission time is censored (0) or
side refers to X = tanh-'(0) = .5 log(1 + 6)/(1 -). complete (1). There are 21 mice in each group.
For each estimator of standard error, the root mean The standard regression model for censored data is
squared error of estimation [E(a - )2]1/2 iS given in Cox's proportional hazards model (Cox, 1972). It as-
the column headed VMi.. sumes that the hazard function h(t I x), the probability
The bootstrap was run with B = 128 and also with of going into remission in next instant given no re-
B = 512, the latter value yielding only slightly better mission up to time t for a mouse with covariate x, is
estimates in accordance with the results of Section 9. of the form
Further increasing B would be pointless. It can be
(3.1) h(t I x) = ho(t)e:x.
shown that B = oo gives Vii-i = .063 for 0, only .001
less than B = 512. The normal theory estimate (2.5), Here ho(t) is an arbitrary unspecified function. Since
which we know to be ideal for this sampling experi- x here is a group indicator, this means simply that the
ment, has ../i-Si = .056. hazard for the control group is e: times the hazard for
We can compromise between the totally nonpara- the treatment group. The regression parameter d is
metric bootstrap estimate a' and the totally parametric estimated independently of ho(t) through maximiza-
bootstrap estimate C7NORM. This is done in lines 3, 4, tion of the so called "partial likelihood"
and 5 of Table 2. Let 2 = Sin-l (xi - )(x- i)'/n be e,3xi
the sample covariance matrix of the observed data. (3.2) PL = 11 e-xi
The normal smoothed bootstrap draws the bootstrap iED EiER, e i'
sample from F (D N2(0, .252), (D indicating convolu-where D is the set of indices of the failure times and
tion. This amounts to estimating F by an equal mix- Ri is the set of indices of those at risk at time yi. This
ture of the n distributions N2(xi, .252), that is bymaximization
a requires an iterative computer search.
normal window estimate. Each point xi* in a smoothed The estimate d for these data turns out to be 1.51.
bootstrap sample is the sum of a randomly selected Taken literally, this says that the hazard rate is e'5'
original data point xj, plus an independent bivariate = 4.33 times higher in the control group than in the
normal point zj - N2(0, .252). Smoothing makes littletreatment group, so the treatment is very effective.
difference on the left side of the table, but is spectac- What is the standard error of fA? The usual asymptotic
ularly effective in the k case. The latter result is maximum likelihood theory, one over the square root
suspect since the true sampling distribution is bivar- of the observed Fisher information, gives an estimate
iate normal, and the function q = tanh- O is specifi- of .41. Despite the complicated nature of the estima-
cally chosen to have nearly constant standard error in tion procedure, we can also estimate the standard error
the bivariate normal family. The uniform smoothed using the bootstrap. We sample with replacement
bootstrap samples from F (D W(0, .252), where from the triples I(y', xi, 5k), * ..*, (Y42, x42, 642)). For
WI(0, .252) is the uniform distribution on a rhombus each bootstrap sample $(y*, x*, 0), .., (y*, x4*2,
selected so VI has mean vector 0 and covariance matrix 6*)) we form the partial likelihood and numerically
.25Z. It yields moderate reductions in vMi-SR for both maximize it to produce the bootstrap estimate A*. A
sides of the table. histogram of 1000 bootstrap values is shown in Fig. 3.
Line 6 of Table 2 refers to the delta method, which The bootstrap estimate of the standard error of A
is the most common method of assigning nonpara- based on these 1000 numbers is .42. Although the
metric standard error. Surprisingly enough, it is badly bootstrap and standard estimates agree, it is interest-
biased downward on both sides of the table. The delta ing to note that the bootstrap distribution is skewed
method, also known as the method of statistical dif- to the right. This leads us to ask: is there other
ferentials, the Taylor series method, and the infinites- information that we can extract from the bootstrap
imal jackknife, is discussed in Section 10. distribution other than a standard error estimate? The
answer is yes-in particular, the bootstrap distribu-
3. EXAMPLES tion can be used to form a confidence interval for fA,
as we will see in Section 9. The shape of the bootstrap
Example 1. Cox's Proportional Hazards Model
distributiion will help determine the shape of the
In this section we apply bootstrap standard error confidence interval.
estimation to some complicated statistics. In this example our resampling unit was the triple
The data for this example come from a study of (yi, xi, bi), and we ignored the unique elements of the
leukemia remission times in mice, taken from Cox problem, i.e., the censoring, and the particular model
(1972). They consist of measurements of remission being used. In fact, there are other ways to bootstrap

This content downloaded from


140.213.190.131 on Tue, 13 Apr 2021 09:26:31 UTC
All use subject to https://about.jstor.org/terms
BOOTSTRAP METHODS FOR MEASURES OF STATISTICAL ACCURACY 59

Notice the relation of the projection pursuit regres-


200 sion model to the standard linear regression model.
When the function sj(-) is forced to be linear and is
estimated by the usual least squares method, a one-
term projection pursuit model is exactly the same as
150
the standard linear regression model. That is to say,
the fitted model s'(a'1 xi) exactly equals the least
squares fit a' + f31 jxi,. This is because the least
100 squares fit, by definition, finds the best direction and
the best linear function of that direction. Note also
that adding another linear term s 2(& * X2) would not
change the fitted model since the sum of two linear
50
functions is another linear function.
Hastie and Tibshirani (1984) applied the bootstrap
to the linear and projection pursuit regression models
01 to assess the variability of the coefficients in each.
0.5 1 1.5 2 2.5 3
The data they considered are taken from Breiman and
FIG. 3. Histogram of 1000 bootstrap replications for the mouse Friedman (1985). The response Y is Upland atmos-
leukemia data. pheric ozone concentration (ppm); the covariates X1
= Sandburg Air Force base temperature (CO), X2 =
this problem. We will see this when we discuss boot- inversion base height (ft), X3 = Daggot pressure gra-
strapping censored data in Section 5. dient (mm Hg), X4 = visibility (miles), and X5 = day
of the year. There are 330 observations. The number
Example 2: Linear and Projection Pursuit of terms (m) in the model (3.4) is taken to be two. The
Regression projection pursuit algorithm chose directions al = (.80,
-.38, .37, -.24, -.14)' and 62 = (.07, .16, .04, -.05,
We illustrate an application of the bootstrap to
-.98)'. These directions consist mostly of Sandburg
standard linear least squares regression as well as to
Air Force temperature and day of the year, respec-
a nonparametric regression technique.
Consider the standard regression setup. We have n
observations on a response Y and covariates (X1, X2,
* * *, X,). Denote the ith observed vector of covariates
by xi = (xil, xi2, ... , xip)'. The usual linear regression
model assumes a5 I
p

(3.3) E(Yi) = a + E /fl1xi.


j=l
a4 A. I
Friedman and Stuetzle (1981) introduced a more gen-
eral model, the projection pursuit regression model

m
a3
(3.4) E(Yi)= X sj(aj - xi).
j=l

The p vectors aj are unit vectors ("directions"), and


a2
the functions sj(.) are unspecified.
Estimation of la,, sl(.)), ..., {am1, Sm(-)} is per-
formed in a forward stepwise manner as follows. Con-
sider {al, s ( - ) }. Given a direction al, s, * () is estimated a, L
by a nonparametric smoother (e.g., running mean) of
y on a, * x. The projection pursuit regression algo-
rithm searches over all unit directions to find the -1 -0.5 0 0.5 1
direction al and associated function sl(-) that mini- bootstrapped coefficients
mize s (y1-sl(i * xa))2. Then residuals are taken
FIG. 4. Smoothed histograms of the bootstrapped coefficients for the
and the next direction and function are determined.
first term in the projection pursuit regression model. Solid histograms
This process is continued until no additional term are for the usual projection pursuit model; the dotted histograms are
significantly reduces the residual sum of squares. for linear 9(*).

This content downloaded from


140.213.190.131 on Tue, 13 Apr 2021 09:26:31 UTC
All use subject to https://about.jstor.org/terms
60 B. EFRON AND R. TIBSHIRANI

(1). There are measurements on 157 patients. A pro-


portional hazards model was fit to these data, with a
quadratic term, i.e, h(t I x) = ho(t)elx+i32x. Both #, and
f2 are highly significant; the broken curve in Fig. 6 is
/3x + f2x2 as a function of x.
For comparison, Fig. 6 shows (solid line) another
a4 estimate. This was computed using local likelihood
estimation (Tibshirani and Hastie, 1984). Given a
general proportional hazards model of the form h(t I x)
= ho(t)es(x), the local likelihood technique assumes
a3
nothing about the parametric form of s(x); instead it
estimates s(x) nonparametrically using a kind of local
averaging. The algorithm is very computationally in-
a2 tensive, and standard maximum likelihood theory can-
not be applied.
A comparison of the two functions reveals an im-
portant qualitative difference: the parametric estimate
al~~ ~ ~~~~~~ p - p suggests that the hazard decreases sharply up to age
34, then rises; the local likelihood estimate stays ap-
-1 -0.5 0 0.5 1 proximately constant up to age 45 then rises. Has the
bootstrapped coefficients forced fitting of a quadratic function produced a mis-
leading result? To answer this question, we can boot-
FIG. 5. Smoothed histograms of the bootstrapped coefficients for the
second term in the projection pursuit model.
strap the local likelihood estimate. We sample with
replacement from the triples I(Yl, x1, 61) ... (Y157i
X157, 6157) and apply the local likelihood algorithm to
tively. (We do not show graphs of the estimated func- each bootstrap sample. Fig. 7 shows estimated curves
tions s(*(.) and s2(. ) although in a full analysis of the from 20 bootstrap samples.
data they would also be of interest.) Forcing s'( *) to Some of the curves are flat up to age 45, others are
be linear results in the direction a' = (.90, -.37, .03, decreasing. Hence the original local likelihood esti-
-.14, -.19)'. These are just the usual least squares mate is highly variable in this region and on the basis
estimates i1, *., * ,, Ascaled so that EP 12 = 1. of these data we cannot determine the true behavior
To assess the variability of the directions, a boot- of the function there. A look back at the original data
strap sample is drawn with replacement from (Yi, x11, shows that while half of the patients were under 45,
. . ., X15), * * *, (Y330, X3301, *- - , X3305) and the projection only 13% of the patients were under 30. Fig. 7 also
pursuit algorithm is applied. Figs. 4 and 5 show his- shows that the estimate is stable near the middle ages
tograms of the directions a'* and a* for 200 bootstrap but unstable for the older patients.
replications. Also shown in Fig. 4 (broken histogram)
are the bootstrap replications of a& with s.(.) forced 3
to be linear.
The first direction of the projection pursuit model
is quite stable and only slightly more variable than
the corresponding linear regression direction. But the 2

second direction is extremely unstable! It is clearly


unwise to put any faith in the second direction of the
original projection pursuit model.
t //~~~~~~~~~~
0 / /
Example 3: Cox's Model and Local Likelihood \L
a /t
//
Estimation

In this example, we return to Cox's proportional


hazards model described in Example 1, but with a few
added twists.
The data that we will discuss come from the Stan-
ford heart transplant program and are given in Miller 10 20 30 40 50 60
and Halpern (1982). The response y is survival time age
in weeks after a heart transplant, the covariate x is
FIG. 6. Estimates of log relative risk for the Stanford heart trans-
age at transplant, and the 0-1 variable 3 indicates plant data. Broken curve: parametric estimate. Solid curve: local
whether the survival time is censored (0) or complete likelihood estimate.

This content downloaded from


140.213.190.131 on Tue, 13 Apr 2021 09:26:31 UTC
All use subject to https://about.jstor.org/terms
BOOTSTRAP METHODS FOR MEASURES OF STATISTICAL ACCURACY 61

TABLE 3
BHCG blood serum levels for 54 patients having metasticized breas
4
cancer in ascending order <

0.1, 0.1, 0.2, 0.4, 0.4, 0.6, 0.8, 0.8, 0.9, 0.9, 1.3, 1.3, 1.4, 1.5, 1.6,
3 1.6, 1.7, 1.7, 1.7, 1.8, 2.0, 2.0, 2.2, 2.2, 2.2, 2.3, 2.3, 2.4, 2.4, 2.4,
2.4, 2.4, 2.4, 2.5, 2.5, 2.5, 2.7, 2.7, 2.8, 2.9, 2.9, 2.9, 3.0, 3.1, 3.1,
3.2, 3.2, 3.3, 3.3, 3.5, 4.4, 4.5, 6.4, 9.4

As an example consider the blood serum data of


Table 3. Suppose we wish to estimate the true mean
A = EF{X I of this population using 0, the 25% trimmed
mean. We calculate j, = ,u(F) = 2.32, the sample mean
of the 54 observations, and 0= 2.24, the trimmed mean.
The trimmed mean is lower because it discounts the
-11,..,.
10 20 30 40 50 60 effect of the large observations 6.4 and 9.4. It looks
age
like the trimmed mean might be more robust for this
type of data, and as a matter of fact a bootstrap
FIG. 7. 20 bootstraps of the local likelihood estimate for the Stanford
analysis, B = 1000, gave estimated standard error
heart transplant data.
a = .16 for 0, compared to .21 for the sample mean.
But what about bias?
The same 1000 bootstrap replications which gave
4. OTHER MEASURES OF STATISTICAL ERROR a = .16 also gave 0*(-) = 2.29, so

So far we have discussed statistical error, or accu- (4.5) = 2.29 - 2.32 = -0.03.
racy, in terms of the standard error. It is easy to assess according to (4.4). (The estimated standard deviation
other measures of statistical error, such as bias or of fB- - due to the limitations of having B = 1000
prediction error, using the bootstrap. bootstraps is only 0.005 in this case, so we can ignore
Consider the estimation of bias. For a given statistic the difference between fB and A3.) Whether or not a
0(y), and a given parameter ,u(F), let bias of magnitude -0.03 is too large depends on the
(4.1) R(y, F) = 0(y) - A(F). context of the problem. If we attempt to remove the
bias by subtraction, we get 0- = 2.24 - (-0.03)-
(It will help keep our notation clear to call the param- 2.27. Removing bias in this way is frequently a bad
eter of interest A rather than 0.) For example, ,u might idea (see Hinkley, 1978), but at least the bootstrap
be the mean of the distribution F, assuming the sample analysis has given us a reasonable picture of the bias
space X is the real line, and 0 the 25% trimmed mean. and standard error of 0.
The bias of 0 for estimating it is Here is another measure of statistical accuracy,
(4.2) A(F) = EFR(Y, F) = EF10(y)) - A(F). different from either bias or standard error. Let 0(y)
be the 25% trimmed mean and M(F) be the mean of
The notation EF indicates expectation with respect to F, as in the serum example, and also let i(y) be the
the probability mechanism appropriate to F, in this interquartile range, the distance between the 25th and
case y = (xl, x2, - - *, xn) a random sample from F. 75th percentiles of the sample y = (x1, x2, *--, xn).
The bootstrap estimate of bias is Define

(4.3), fi = A(F) = EFR(y*, F) = Ep{0(Y*)) - U(F)


(4.6) R(y, F) = (y)- (F)
As in Section 2, y* denotes a random sample (x4, x, I(y)
***, 4) from F, i.e., a bootstrap sample. To numeri-
R is like a Student's t statistic, except that we have
cally evaluate ,3, all we do is change step (iii) of the
substituted the 25% trimmed mean for the sample
bootstrap algorithm in Section 2 to
mean and the interquartile range for the standard
'A - 1 B deviation.
/B - - E R(y*(b), F ). Suppose we know the 5th and 95th percentiles of
Bb=1
R(y, F), say p(-05)(F) and p(.95)(F), where the definition
of p(.05)(F) is
(4.4) - b 0*(b) A -()
(4.7) ProbF,R(y, F) < p(5)(F) I = .05,
and similarly for pf 95)(F). The relationship ProbFpt (.OS)
AsB-*oo,L 3 Xgoesto/F(4.3). s R < (95)} - .90 combines with definition (4.6) to

This content downloaded from


140.213.190.131 on Tue, 13 Apr 2021 09:26:31 UTC
All use subject to https://about.jstor.org/terms
62 B. EFRON AND R. TIBSHIRANI

give a central 90% "t interval" for the mean


5. MORE COMPLICATED DATA SETS
(4.8) y E [6 _i(.95), 6 _ g(.05)] The bootstrap is not restricted to situations where
Of course we do not know p`05)(F) and p(95)(F), the data is a simple random sample from a single
but we can approximate them by their bootstrap distribution. Suppose for instance that the data con-
estimates p(O5)(F) and F (95)(F). A bootstrap samplesists of two independent random samples,
y* gives a bootstrap value of (4.6), R(y*, F) =
(5.1)
(6(y*) - p(F))/i(y*), where i(y*) is the interquartile U1, U2,---,Ur-F and
range of the bootstrap data x, *, * ., *. For V1, V2, ... Vn G,
any fixed number p, the bootstrap estimate of where F and G are possibly different distributions on
ProbF{R < p} based on B bootstrap samples is the real line. Suppose also that the statistic of interest
is the Hodges-Lehmann shift estimate
(4.9) #fR(y*(b), F) < p JB. A

0=
By keeping track of the empirical distribution of (5.2)
R(y*(b), F), we can pick off the values of p which medianIVj - Ui, i = 1, 2, *.. m, j = 1, 2, ..., n}.
make (4.9) equal .05 and .95. These approach F(05)(F) Having observed U1 = ul, U2 = u2, *--, Vn = Vn,
and p(95)(F) as B -* oo. we desire an estimate for a(F, G), the standard error
For the serum data, B = 1000 bootstrap replications of 0.
gave p( 5)(F) = -.303 and p 95)(F) = .078. Substituting The bootstrap estimate of a(F, G) is a = G(F, G),
these values into (4.9), and using the observed esti- where F is the empirical distribution of u1, u2, * . *,
mates 0 = 2.24, i = 1.40, gives um, and G is the empirical distribution of v1, v2, ** ,
vn. It is easy to modify the Monte Carlo algorithm of
(4.10) L E [2.13, 2.66]
Section 2 to numerically evaluate v. Let y = (u1, u2,
as a central 90% "bootstrap t interval" for the true *-, vn) be the observed data vector. A bootstrap
mean ,u(F). This is considerably shorter than the sample y* = (uiu, -* * *, u*4, v , v, ... *, v*) consists
standard t interval for ,t based on 53 degrees of free-of a random sample U*, ..., U* from F and an
independent random sample V*, - - -, V* from G. Wit
dom, i ? 1.67W = [1.97, 2.67]. Here -a = .21 is the usual
estimate of standard error (1.3). only this modification, steps (i) through (iii) of the
Bootstrap confidence intervals are discussed further Monte Carlo algorithm produce fB, (2.4), approaching
in Sections 7 and 8. They require more bootstrap aas B -- oo.
replications than do bootstrap standard errors, on the Table 4 reports on a simulation experiment inves-
order of B = 1000 rather than B = 50 or 100. This tigating how well the bootstrap works on this problem.
point is discussed briefly in Section 9. 100 trials of situation (5.1) were run, with m = 6,
By now it should be clear that we can use any n = 9, F and G both Uniform [0, 1]. For each trial,
random variable R(y, F) to measure accuracy, not just both B = 100 and B = 200 bootstrap replications were
(4.1) or (4.6), and then estimate EFIR(y, F)} by its generated. The bootstrap estimate JB was nearly un-
bootstrap value EpIR(y*, F1) }- b=1 R(y*(b), F)/B. biased for the true standard error u(F, G) = .167 for
Similarly we can estimate EFR(y, F)2 by EpR(y*, F)2, either B = 100 or B = 200, with a quite small standard
etc. Efron (1983) considers the prediction problem, in deviation from trial to trial. The improvement in going
which a training set of data is used to construct a from B = 100 to B = 200 is too small to show up in
prediction rule. A naive estimate of the prediction this experiment.
rule's accuracy is the proportion of correct guesses it In practice, statisticians must often consider quite
makes on its own training set, but this can be greatly complicated data structures: time series models, mul-
over optimistic since the prediction rule is explicitly
constructed to minimize errors on the training set. In TABLE 4
this case, a natural choice of R(y, F) is the over Bootstrap estimate of standard error for the Hodges-Lehmann
optimism, the difference between the naive estimate two-sample shift estimate; 100 trials

and the actual success rate of the prediction rule for Summary statistics for OB
new data. Efron (1983) gives the bootstrap estimate
Ave SD CV
of over optimism, and shows that it is closely related
to cross-validation, the usual method of estimating B = 100 .165 .030 .18
B = 200 .166 .031 .19
over optimism. The paper goes on to show that some
True o .167
modifications of the bootstrap estimate greatly out
perform both cross-validation and the bootstrap. Note: m = 6, n = 9; true distributions F a

This content downloaded from


140.213.190.131 on Tue, 13 Apr 2021 09:26:31 UTC
All use subject to https://about.jstor.org/terms
BOOTSTRAP METHODS FOR MEASURES OF STATISTICAL ACCURACY 63

tifactor layouts, sequential sampling, censored and iates; and g is a known function of / and ti, for inst
missing data, etc. Fig. 8 illustrates how the bootstrap e:'ti. The ei are an iid sample from some unknown
estimation process proceeds in a general situation. distribution F on the real line,
The actual probability mechanism P which generates
(5.4) e1, 2 . . . En F,
the observed data y belongs to some family P of
possible probability mechanism. In the Hodges-Leh- where F is usually assumed to be centered at 0 in some
mann example, P = (F, G), a pair of distributions on sense, perhaps EHe} = 0 or Prob$e < 0} = .5. The
the real line, P equals the family of all such pairs, and probability model is P = (A, F); (5.3) and (5.4) describe
y = (u1, U2, *--, Um, V1, v2, *--, vn) is generated the step P -* y in Fig. 8. The covariates t1, t2, *. . , tn
by random sampling m times from F and n times like the sample size n in the simple problem (1.1), are
from G. considered fixed at their observed values.
We have a random variable of interest R(y, P), For every choice of A we have a vector g(3) =
which depends on both y and the unknown model P, (g(/, t1), g(B, t2), * * , g(/, tn)) of predicted values f
and we wish to estimate some aspect of the dis- y. Having observed y, we estimate A by minimizing
tribution of R. In the Hodges-Lehmann example, some measure of distance between g(/3) and y,
R(y, P) = 0(y) - Ep{0j, and we estimated o(P) =
(5.5) /: min D(y, g(/)).
lEpR(y, p)2}1/2, the standard error of 0. As before, the
notation Ep indicates expectation when y is generated
The most common choice of D is D(y, g) =
according to mechanism P.
We assume that we have some way of estimating = 1$yi - g(/, ti)12.
How accurate is as an estimate of d? Let R(y, P)
the entire probability model P from the data y, pro-
equal the vector / -. A familiar measure of accuracy
ducing the estimate called P in Fig. 8. (In the two-
is the mean square error matrix
sample problem, P = (F, G), the pair of empirical
distributions.) This is the crucial step for the bootstrap.
(5.6) Z(P) = Ep( - )( - /)'
It can be carried out either parametrically or nonpar-
= EpR(y, P)R(y, P)'.
ametrically, by maximum likelihood or by some other
estimation technique. The bootstrap estimate of accuracy Z = AP) is ob-
Once we have P, we can use Monte Carlo methods tained by following through Fig. 8.
to generate bootstrap data sets y*, according to the There is an obvious choice for P = (/3, F) in this
same rules by which y is generated from P. The case. The estimate /3 is obtained from (5.5). Then F is
bootstrap random variable R(y*, P) is observable, the empirical distribution of the residuals,
since we know P as well as y*, so the distribution of
F: mass(1/n) on 90-9g(/, ti),
R(y*, P) can be found by Monte Carlo sampling. The (5.7)
bootstrap estimate of EpR(y, P) is then EfR(y*, P), i = 1, ... , n.
and likewise for estimating any other aspect of
A bootstrap sample y* is obtained by following rules
R(y, P)'s distribution. (5.3) and (5.4),
A regression model is a familiar example of a com-
plicated data structure. We observe y = (Yi, Y2, * (5.8) Y' = g(, ti) + + , i = 1,2, * * * n,
Yn), where where er c* ',2 e * is an iid sample from F. Notice
that the e* are independent bootstrap variates, even
(5-3) yi = g(#, ti) + ei i = 1, 2, *--, n.
though the e'i are not independent variates in the usual
Here A is a vector of unknown parameters we wish to sense.
estimate; for each i, ti is an observed vector of covar- Each bootstrap sample y*(b) gives a bootstrap value
Al*(b),
FAMILY OF

(5.9) MO*(b): min D(y*(b), g(/3)),


POSSIBLE ACTUAL ESTIMATED
PROBABILITY PROBABILITY OBSERVED PROBABILITY BOOTSTRAP
MODELS MODEL DATA MODEL DATA

pe ----------- P y P y
as in (5.5). The estimate

R(y,P) R(y*,P) (5.10) 2B = Eb 1t{3*(b) - 3*( )H}3*(b) -


B
RANDOM VARIABLE OF INTEREST BOOTSTRAP RANDOM VARIABLE

approaches the bootstrap estimate Z as B -- oo. (We


FIG. 8. A schematic illustration of the bootstrap process for a general
could just as well divide by B - 1 in (5.10).)
probability model P. The expectation of R(y, P) is estimated by the
bootstrap expectation of R(y*, P). The double arrow indicates Inthe
the case of ordinary least squares regression,
crucial step in applying the bootstrap. where g(/, ti) = /'ti and D(y, g) = = (y1-

This content downloaded from


140.213.190.131 on Tue, 13 Apr 2021 09:26:31 UTC
All use subject to https://about.jstor.org/terms
64 B. EFRON AND R. TIBSHIRANI

Section 7 of Efron (1979a) shows that the bootstrap The statistician gets to observe
estimate, B = oo, can be calculated without Monte
(5.14) Xi= min{X?, WiJ
Carlo sampling, and is
and

(5.11) (2 tit 2_ 2 1if Xi = X


(5.15) D= X = X W.
This is the usual Gauss-Markov answer, except for
the divisor n in the definition of a2. Note: 1 - S?(t) and 1 -R(t) are the cumulative
There is another, simpler way to bootstrap a regres- distribution functions (cdf) for X? and Wi, respec-
sion problem. We can consider each covariate-re- tively; with censored data it is more convenient to
sponse pair xi = (ti, yi) to be a single data point consider survival curves than cdf.
obtained by simple random sampling from a distribu- Under assumptions (5.12)-(5.15) there is a simple
tion F. If the covariate vector ti is p-dimensional, F is formula for the nonparametric MLE of S?(t), called
a distribution on p + 1 dimensions. Then we apply the Kaplan-Meier estimator (Kaplan and Meier,
the bootstrap as described originally in Section 2 to 1958). For convenience suppose x1 < X2 < X3 < ... <
the data set x1, X2, . ., Xn ~iid F. xn, n = 97. Then the Kaplan-Meier estimate is
The two bootstrap methods for the regression prob-
lem are asymptotically equivalent, but can perform (5.16) SO) = l :)did
quite differently in small sample situations. The class
of possible probability models P is different for the where kt is the value of k such that t E [Xk, Xk*1). In
two methods. The simple method, described last, takes the case of no censoring, S?(t) is equivalent to the
less advantage of the special structure of the regres- observed empirical distribution of x1, x2, * *, xn, but
sion problem. It does not give answer (5.11) in the otherwise (5.16) corrects the empirical distribution to
case of ordinary least squares. On the other hand the account for censoring. Likewise
simple method gives a trustworthy estimate of f's
variability even if the regression model (5.3) is not (5.17) 9 R(t) =A (= -; 1)d
correct. The bootstrap, as outlined in Fig. 5, is very
general, but because of this generality there will often is the Kaplan-Meier estimate of the censoring curve
be more than one bootstrap solution for a given prob- R(t).
le.m
Fig. 9 shows S0(t) for the Channing House men. It
As the final example of this section, we discuss
crosses the 50% survival level at 0 = 1044 months.
censored data. The ages of 97 men at a California
Call this value the observed median lifetime. We can
retirement center, Channing House, were observed use the bootstrap to assign a standard error to the
either at death (an uncensored observation) or at the observed median.
time the study ended (a censored observation). The The probability mechanism is P = (SO, R); P
data set y = {(x1, dl), (x2, d2), * * *, (X97, d97)J, where xi
produces (X?, Di) according to (5.12)-(5.15), and y-
was the age of the ith man observed, and
(xi, di), - (xn, dn)} by n = 97 independent repeti-
1 if xi uncensored tions of this process. An obvious choice of the estimate
{ 0 if xi censored. P in Fig. 8 is (S? R), (5.14), (5.15). The rest of

Thus (777, 1) represents a Channing House man ob-


served to die at age 777 months, while (843, 0) repre-
1.0
sents a man 843 months old when the study ended.
His observation could be written as "843+," and in
0.8 _
fact di is just an indicator for the absence or presence
of "+." A full description of the Channing House data
0.6 _
appears in Hyde (1980).
A typical data point (Xi, Di) can be thought of as O. 4 -

generated in the following way: a real lifetime X? is


selected randomly according to a survival curve 0.2 -

(5.12) S?(t) ProbJX9 > t}, (0 c t < oo)


800 900 1000 1044 1100
and a censoring time Wi is independently selected
according to another survival curve FIG. 9. Kaplan-Meier estimated survival curve for the Channing
House men; t = age in months. The median survival age is estimated
(5.13) R(t) Prob{ W. > t), (0 < t < em). to be 1044 months (87 years).

This content downloaded from


140.213.190.131 on Tue, 13 Apr 2021 09:26:31 UTC
All use subject to https://about.jstor.org/terms
BOOTSTRAP METHODS FOR MEASURES OF STATISTICAL ACCURACY 65

bootstrap process is automatic: S? and R replace S? 250

and R in (5.12) and (5.13); n pairs (Xi*, D:*) are


independently generated according to rules (5.12)-
(5.15), giving the bootstrap data set y* = {xl*, d*, * , 200

(x*, dn*); and finally the bootstrap Kaplan-Meier


curve S* is constructed according to formula (5.16),
and the bootstrap observed median 0* calculated. For 150

the Channing House data, B = 1600 bootstrap repli-


cations of 0* gave estimated standard error a = 14.0
months for 0. An estimated bias of 4.1 months was 100
calculated as at (4.4). Efron (1981b) gives a fuller
description.
Once again there is a simpler way to apply to boot- 50
strap. Consider each pair y- = (xi, di) as an observed
point obtained by simple random sampling from a
bivariate distribution F, and apply the bootstrap as
0.6 0.7 0.8 0.9
described in Section 2 to the data set yl, Y2, , Yn
-iid F. This method makes no use of the special struc-
FIG. 10. Bootstrap
ture (5.12)-(5.15). Surprisingly, it gives exactly the spot data, model (6.1).
same answers as the more complicated bootstrap
method described earlier (Efron, 1981a). This leads to
a surprising conclusion: bootstrap estimates of varia-
that the distribution is skewed to the left, so a
bility for the Kaplan-Meier curve give correct stand- confidence interval for 'k might be asymmetric about
ard errors even when the usual assumptions about the 'k as discussed in Sections 8 and 9.
In bootstrapping the residuals, we have assumed
censoring mechanism, (5.12)-(5.15), fail.
that the first-order autoregressive model is correct.
6. EXAMPLES WITH MORE COMPLICATED (Recall the discussion of regression models in Section
DATA STRUCTURES 5.) In fact, the first-order autoregressive model is far
from adequate for this data. A fit of second-order
Example 1: Autoregressive Time Series Model autoregressive model
This example illustrates an application of the (6.2) z- = azi-- + OZi-2 + e
bootstrap to a famous time series.
The data are the Wolfer annual sunspot numbers gave estimates a' = 1.37, 0 = -.677, both with an
for the years 1770-1889 (taken from Anderson, 1975). estimated standard error of .067, based on Fisher
Let the count for the ith year be zi. After centering information calculations. We applied the bootstrap to
the data (replacing z- by z- - z--), we fit a first-order this model, producing the histograms for a*t, *--,
autoregressive model a*too and O*, *--, o shown in Figs. 11 and 12,
respectively.
(6.1) Z. = kz-i- + E- The bootstrap standard errors were .070 and .068,
where e- iid N(0, a2). The estimate k turned out to respectively, both close to the usual value. Note that
be .815 with an estimated standard error, one over the the additional term has reduced the skewness of the
square root of the Fisher information, of .053. first coefficient.
A bootstrap estimate of the standard error of k can
Example 2: Estimating a Response Transformation
be obtained as follows. Define the residuals ki = Z-
in Regression
ozi-1 for i = 2, 3, -.., 120. A bootstrap sample z*, z*,
.-, *20 is created by sampling 2, 3, *, 0 with Box and Cox (1964) introduced a parametric family
replacement from the residuals, then letting z* =for
zl,estimating a transformation of the response in
and z4* = Oz 1 + si*, i = 2, ..., 120. Finally, after
a regression. Given regression data {(xl, yi), * ,
centering the time series z*, z*, *-, 420, g is the (Xn, Yn)}, their model takes the form
estimate of the autoregressive parameter for this new
(6.3) z(X) = xi * +
time series. (We could, if we wished, sample the *
from a fitted normal distribution.)
where z-(X) = (yi - 1)/X for
A histogram of 1000 such bootstrap values k*f, 2 , X = 0, and e, iid N(0, (2). Estimates of X and d are
* io, /40 is shown in Fig. 10. found by minimizing Ej (Z1 - 1 )2-
The bootstrap estimate of standard error was .055, Breiman and Friedman (1985) proposed a nonpara-
agreeing nicely with the usual formula. Note however metric solution for this problem. Their so called ACE

This content downloaded from


140.213.190.131 on Tue, 13 Apr 2021 09:26:31 UTC
All use subject to https://about.jstor.org/terms
66 B. EFRON AND R. TIBSHIRANI

sponse Y being number of cycles to failure, and the


factors length of test specimen (X1) (250, 300, and 350
250 mm), amplitude of loading cycle (X2) (8, 9, or 10 mm),
and load (X3) (40, 45, or 50 g). As in Box and Cox, we
treat the factors as quantitive and allow only a linear
200
term for each. Box and Cox found that a logarithmic
transformation was appropriate, with their procedure

150
producing a value of -.06 for X with an estimated 95%
confidence interval of (-.18, .06).
Fig. 13 shows the transformation selected by the
100 ACE algorithm. For comparison, the log function is
plotted (normalized) on the same figure.
The similarity is truly remarkable! In order to assess
50
the variability of the ACE curve, we can apply the
bootstrap. Since the X matrix in this problem is fixed
by design, we resampled from the residuals instead of
1.1 1.2 1.3 1.4 1.5
from the (xi, y-) pairs. The bootstrap procedure was
the following:
FIG. 11. Bootstrap histogram of & &,*.,&o for the Wolfer sun-
spot data, model (6.2).
Calculate residuals , = s(y,) - xi. *3, i = 1, 2, ... , n.
Repeat B times
300 Choose a sample l , *,n
with replacement from el, E n
Calculate y* = s-'(x, - + i = 1, 2, n
Compute s*(.) = result of ACE algorithm
applied to (x1, y ), .*., (xn, y*)
200 End

The number of bootstrap replications B was 20.


Note that the residuals are computed on the s(*) scale,
not the y scale, because it is on the s(*) scale that the
true residuals are assumed to be approximately iid.
100
The 20 estimated transformations, s *- -, S20(
are shown in Fig. 14.
The tight clustering of the smooths indicates that
the original estimate s((.) has low variability, especially
for smaller values of Y. This agrees qualitatively with
C -0.8 -0.7 -0.6 -0.5
: ? W } f I I I --T I I I I
FIG. 12. Bootstrap histogram of 0*, ..., 6100 for the Wolfer sun-
2
spot data, model (6.2).

(alternating conditional expectation) model general- Estimated

izes (6.3) to

(6.4) S(y1) = x- ~
0
where s(.) is an unspecified smooth function. (In its
Log Function
most general form, ACE allows for transformations of
the covariates as well.) The function s(.) and param-
eter j3 are estimated in an alternating fashion, utilizing
a nonparametric smoother to estimate s(-).
In the following example, taken from Friedman and
Tibshirani (1984), we compare the Box and Cox pro- -2
, , , , I , I I , . , .
cedure to ACE and use the bootstrap to assess the o 1000 2000 3000
y
variability of ACE.
The data from Box and Cox (1964) consist of a 3 X FIG. 13. Estimated transformation from ACE and the log function
3q X 3q expenriment on the strength of yar-ns, the re- for Box and Cox example.

This content downloaded from


140.213.190.131 on Tue, 13 Apr 2021 09:26:31 UTC
All use subject to https://about.jstor.org/terms
BOOTSTRAP METHODS FOR MEASURES OF STATISTICAL ACCURACY 67

7. BOOTSTRAP CONFIDENCE INTERVALS

This section presents three closely related methods of using the bootstrap to set confidence intervals. The discussion is in terms of simple parametric models, where the logical basis of the bootstrap methods is easiest to see. Section 8 extends the methods to multiparameter and nonparametric models.

We have discussed obtaining σ̂, the estimated standard error of an estimator θ̂. In practice, θ̂ and σ̂ are usually used together to form the approximate confidence interval θ ∈ θ̂ ± σ̂ z^(α), (1.7), where z^(α) is the 100·α percentile point of a standard normal distribution. The interval (1.7) is claimed to have approximate coverage probability 1 - 2α. For the law school example of Section 2, the values θ̂ = .776, σ̂ = .115, z^(.05) = -1.645 give θ ∈ [.587, .965] as an approximate 90% central interval for the true correlation coefficient.

We will call (1.7) the standard interval for θ. When working within parametric families like the bivariate normal, σ̂ in (1.7) is usually obtained by differentiating the log likelihood function, see Section 5a of Rao (1973), although in the context of this paper we might prefer to use the parametric bootstrap estimate of σ, e.g., σ̂_NORM in Section 2.

The standard intervals are an immensely useful statistical tool. They have the great virtue of being automatic: a computer program can be written which produces (1.7) directly from the data y and the form of the density function for y, with no further input required from the statistician. Nevertheless the standard intervals can be quite inaccurate, as Table 5 shows.

TABLE 5
Exact and approximate central 90% confidence intervals for θ, the true correlation coefficient, from the law school data of Fig. 1

    1. Exact (normal theory)            [.496, .898]    R/L = .44
    2. Standard (1.7)                   [.587, .965]    R/L = 1.00
    3. Transformed standard             [.508, .907]    R/L = .49
    4. Parametric bootstrap (BC)        [.488, .900]    R/L = .43
    5. Nonparametric bootstrap (BCa)    [.43, .92]      R/L = .42

Note: R/L = ratio of right side of interval, measured from θ̂ = .776, to left side. The exact interval is strikingly asymmetric about θ̂. Section 8 discusses the nonparametric method of line 5.

The standard interval (1.7), using σ̂_NORM, (2.5), is strikingly different from the exact normal theory interval based on the assumption of a bivariate normal sampling distribution F.

In this case, it is well known that it is better to make the transformation φ̂ = tanh^{-1}(θ̂), φ = tanh^{-1}(θ), apply (1.7) on the φ scale, and then transform back to the θ scale. The resulting interval, line 3 of Table 5, is moved closer to the exact interval. However, there is nothing automatic about the tanh^{-1} transformation. For a different statistic from the correlation coefficient, or a different distributional family from the bivariate normal, we might very well need other tricks to make (1.7) perform satisfactorily.
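(The two intervals just discussed are easily computed. The short sketch below is our own illustration rather than part of the article; it takes the standard error on the φ = tanh^{-1}(θ) scale to be the delta-method value σ̂/(1 - θ̂²), an assumption on our part, and essentially reproduces lines 2 and 3 of Table 5.)

    import numpy as np

    theta_hat, sigma_hat, z = 0.776, 0.115, 1.645        # |z^(.05)| = 1.645

    # line 2 of Table 5: the standard interval (1.7)
    standard = (theta_hat - z * sigma_hat, theta_hat + z * sigma_hat)

    # line 3: apply (1.7) on the phi = tanh^{-1}(theta) scale, then map back
    phi_hat = np.arctanh(theta_hat)
    sigma_phi = sigma_hat / (1.0 - theta_hat ** 2)       # delta-method scale (assumed)
    transformed = (np.tanh(phi_hat - z * sigma_phi), np.tanh(phi_hat + z * sigma_phi))

    print(np.round(standard, 3))      # [.587, .965]
    print(np.round(transformed, 3))   # approximately [.508, .907]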


The bootstrap can be used to produce approximate confidence intervals in an automatic way. The following discussion is abridged from Efron (1984 and 1985) and Efron (1982a, Chapter 10). Line 4 of Table 5 shows that the parametric bootstrap interval for the correlation coefficient θ is nearly identical with the exact interval. "Parametric" in this case means that the bootstrap algorithm begins from the bivariate normal MLE F̂_NORM, as for the normal theory curve of Fig. 2. This good performance is no accident. The bootstrap method used in line 4 in effect transforms θ̂ to the best (most normal) scale, finds the appropriate interval, and transforms this interval back to the θ scale. All of this is done automatically by the bootstrap algorithm, without requiring special intervention from the statistician. The price paid is a large amount of computing, perhaps B = 1000 bootstrap replications, as discussed in Section 9.

Define Ĝ(s) to be the parametric bootstrap cdf of θ̂*,

(7.1)   Ĝ(s) = Prob*{θ̂* ≤ s},

where Prob* indicates probability computed according to the bootstrap distribution of θ̂*. In Fig. 2, Ĝ(s) is obtained by integrating the normal theory curve. We will present three different kinds of bootstrap confidence intervals, in order of increasing generality. All three methods use percentiles of Ĝ to define the confidence interval. They differ in which percentiles are used.

The simplest method is to take θ ∈ [Ĝ^{-1}(α), Ĝ^{-1}(1 - α)] as an approximate 1 - 2α central interval for θ. This is called the percentile method in Section 10.4 of Efron (1982a). The percentile method interval is just the interval between the 100·α and 100·(1 - α) percentiles of the bootstrap distribution of θ̂*.

We will use the notation θ̂[α] for the α level endpoint of an approximate confidence interval for θ, so θ ∈ [θ̂[α], θ̂[1 - α]] is the central 1 - 2α interval. Subscripts will be used to indicate the various different methods. The percentile interval has endpoints

(7.2)   θ̂_P[α] ≡ Ĝ^{-1}(α).

This compares with the standard interval,

(7.3)   θ̂_S[α] = θ̂ + σ̂ z^(α).

Lines 1 and 2 of Table 6 summarize these definitions.

TABLE 6
Four methods of setting approximate confidence intervals for a real valued parameter θ

    Method               α level endpoint                                                 Correct if
    1. Standard          θ̂_S[α] = θ̂ + σ̂ z^(α)                                             θ̂ ~ N(θ, σ²), σ constant

    There exists a monotone transformation φ̂ = g(θ̂), φ = g(θ) such that:
    2. Percentile        θ̂_P[α] = Ĝ^{-1}(α)                                               φ̂ ~ N(φ, τ²), τ constant
    3. Bias-corrected    θ̂_BC[α] = Ĝ^{-1}(Φ(2z₀ + z^(α)))                                 φ̂ ~ N(φ - z₀τ, τ²); z₀, τ constant
    4. BCa               θ̂_BCa[α] = Ĝ^{-1}(Φ(z₀ + (z₀ + z^(α))/(1 - a(z₀ + z^(α)))))      φ̂ ~ N(φ - z₀τ_φ, τ_φ²), τ_φ = 1 + aφ; z₀, a constant

Note: Each method is correct under more general assumptions than its predecessor. Methods 2, 3, and 4 are defined in terms of the percentiles of Ĝ, the bootstrap distribution (7.1).
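(A minimal sketch of the percentile method, our own illustration: given the Monte Carlo bootstrap replications θ̂*_1, ..., θ̂*_B, the endpoints (7.2) are simply empirical quantiles of Ĝ.)

    import numpy as np

    def percentile_interval(theta_star, alpha=0.05):
        """Percentile interval (7.2): [G^{-1}(alpha), G^{-1}(1 - alpha)]."""
        theta_star = np.asarray(theta_star)
        return np.quantile(theta_star, alpha), np.quantile(theta_star, 1.0 - alpha)

    # toy illustration with a skewed bootstrap distribution
    rng = np.random.default_rng(1)
    theta_star = rng.chisquare(19, size=2000) / 19.0
    print(percentile_interval(theta_star))        # central 90% percentile interval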
Suppose the bootstrap cdf Ĝ is perfectly normal, say

(7.4)   Ĝ(s) = Φ((s - θ̂)/σ̂),

where Φ(s) = ∫_{-∞}^{s} (2π)^{-1/2} e^{-t²/2} dt is the standard normal cdf. In other words, suppose that θ̂* has bootstrap distribution N(θ̂, σ̂²). In this case the standard method and the percentile method agree, θ̂_S[α] = θ̂_P[α]. In situations like that of Fig. 2, where Ĝ is markedly non-normal, the standard interval is quite different from (7.2). Which is better?

To answer this question, consider the simplest possible situation, where for all θ

(7.5)   θ̂ ~ N(θ, σ²).

That is, we have a single unknown parameter θ with no nuisance parameters, and a single summary statistic θ̂ normally distributed about θ with constant standard error σ. In this case the parametric bootstrap cdf is given by (7.4), so θ̂_S[α] = θ̂_P[α]. (The bootstrap estimate σ̂ equals σ.)

Suppose though that instead of (7.5) we have, for all θ,

(7.6)   φ̂ ~ N(φ, τ²)

for some monotone transformation φ̂ = g(θ̂), φ = g(θ), where τ is a constant. In the correlation coefficient example the function g was tanh^{-1}. The standard limits (7.3) can now be grossly inaccurate. However it is easy to verify that the percentile limits (7.2) are still correct. "Correct" here means that (7.2) is the mapping of the obvious interval for φ, φ̂ + τ z^(α), back to the θ scale, θ̂_P[α] = g^{-1}(φ̂ + τ z^(α)). It is also correct in the sense of having exactly the claimed coverage probability 1 - 2α.

Another way to state things is that the percentile intervals are transformation invariant,

(7.7)   φ̂_P[α] = g(θ̂_P[α])

for any monotone transformation g. This implies that if the percentile intervals are correct on some transformed scale φ = g(θ), then they must also be correct on the original scale θ. The statistician does not need to know the normalizing transformation g, only that it exists. Definition (7.2) automatically takes care of the bookkeeping involved in the use of normalizing transformations for confidence intervals.

Fisher's theory of maximum likelihood estimation says that we are always in situation (7.5) to a first order of asymptotic approximation. However, we are also in situation (7.6), for any choice of g, to the same order of approximation. Efron (1984 and 1985) uses higher order asymptotic theory to differentiate between the standard and bootstrap intervals. It is the higher order asymptotic terms which often make exact intervals strongly asymmetric about the MLE θ̂, as in Table 5. The bootstrap intervals are effective at capturing this asymmetry.

The percentile method automatically incorporates normalizing transformations, as in going from (7.5) to (7.6). It turns out that there are two other important ways that assumption (7.5) can be misleading, the first of which relates to possible bias in θ̂. For example, consider f_θ(θ̂), the family of densities for the observed correlation coefficient θ̂ when sampling n = 15 times from a bivariate normal distribution with true correlation θ.


In fact it is easy to see that no monotone mapping φ̂ = g(θ̂), φ = g(θ) transforms this family to φ̂ ~ N(φ, τ²), as in (7.6). If there were such a g, then Prob_θ{θ̂ < θ} = Prob_φ{φ̂ < φ} = .50 for every θ, but for θ = .776 integrating the density function f_.776(θ̂) gives Prob_{θ=.776}{θ̂ < .776} = .431.

The bias-corrected percentile method (BC method), line 3 of Table 6, makes an adjustment for this type of bias. Let

(7.8)   z₀ ≡ Φ^{-1}(Ĝ(θ̂)),

where Φ^{-1} is the inverse function of the standard normal cdf. The BC method has α level endpoint

(7.9)   θ̂_BC[α] = Ĝ^{-1}(Φ(2z₀ + z^(α))).

Note: if Ĝ(θ̂) = .50, that is, if half of the bootstrap distribution of θ̂* is less than the observed value θ̂, then z₀ = 0 and θ̂_BC[α] = θ̂_P[α]. Otherwise definition (7.9) makes a bias correction.

Section 10.7 of Efron (1982a) shows that the BC interval for θ is exactly correct if

(7.10)   φ̂ ~ N(φ - z₀τ, τ²)

for some monotone transformation φ̂ = g(θ̂), φ = g(θ) and some constant z₀. It does not look like (7.10) is much more general than (7.6), but in fact the bias correction is often important.

In the example of Table 5, the percentile method (7.2) gives central 90% interval [.536, .911], compared to the BC interval [.488, .900] and the exact interval [.496, .898]. By definition the endpoints of the exact interval satisfy

(7.11)   Prob_{θ=.496}{θ̂ > .776} = .05 = Prob_{θ=.898}{θ̂ < .776}.

The corresponding quantities for the BC endpoints are

(7.12)   Prob_{θ=.488}{θ̂ > .776} = .0465,   Prob_{θ=.900}{θ̂ < .776} = .0475,

compared to

         Prob_{θ=.536}{θ̂ > .776} = .0725,   Prob_{θ=.911}{θ̂ < .776} = .0293

for the percentile endpoints. The bias correction is quite important in equalizing the error probabilities at the two endpoints. If z₀ can be approximated accurately (as mentioned in Section 9), then it is preferable to use the BC intervals.
Table 7 shows a simple example where the BC method is less successful. The data consists of the single observation θ̂ ~ θ(χ²₁₉/19), the notation indicating an unknown scale parameter θ times a random variable with distribution χ²₁₉/19. (This definition makes θ̂ unbiased for θ.) A confidence interval is desired for the scale parameter θ. In this case the BC interval based on θ̂ is a definite improvement over the standard interval (1.7), but goes only about half as far as it should toward achieving the asymmetry of the exact interval.

TABLE 7
Central 90% confidence intervals for θ having observed θ̂ ~ θ(χ²₁₉/19)

    1. Exact                 [.631·θ̂, 1.88·θ̂]    R/L = 2.38
    2. Standard (1.7)        [.466·θ̂, 1.53·θ̂]    R/L = 1.00
    3. BC (7.9)              [.580·θ̂, 1.69·θ̂]    R/L = 1.64
    4. BCa (7.15)            [.630·θ̂, 1.88·θ̂]    R/L = 2.37
    5. Nonparametric BCa     [.640·θ̂, 1.68·θ̂]

Note: The exact interval is sharply skewed to the right of θ̂. The BC method is only a partial improvement over the standard interval. The BCa interval, a = .108, agrees almost perfectly with the exact interval.

It turns out that the parametric family θ̂ ~ θ(χ²₁₉/19) cannot be transformed into (7.10), not even approximately. The results of Efron (1982b) show that there does exist a monotone transformation g such that φ̂ = g(θ̂), φ = g(θ) satisfy, to a high degree of approximation,

(7.14)   φ̂ ~ N(φ - z₀τ_φ, τ_φ²)    (τ_φ = 1 + aφ).

The constants in (7.14) are z₀ = .1082, a = .1077.

The BCa method (Efron, 1984), line 4 of Table 6, is a method of assigning bootstrap confidence intervals which are exactly right for problems which can be mapped into form (7.14). This method has α level endpoint

(7.15)   θ̂_BCa[α] = Ĝ^{-1}(Φ(z₀ + (z₀ + z^(α))/(1 - a(z₀ + z^(α))))).

If a = 0 then θ̂_BCa[α] = θ̂_BC[α], but otherwise the BCa intervals can be a substantial improvement over the BC method, as shown in Table 7.

The constant z₀ in (7.15) is given by z₀ = Φ^{-1}(Ĝ(θ̂)), (7.8), and so can be computed directly from the bootstrap distribution. How do we know a? It turns out that in one-parameter families f_θ(θ̂) a good approximation is

(7.16)   a ≈ (1/6) SKEW_{θ=θ̂}(ℓ̇_θ(t)),

where SKEW_{θ=θ̂}(ℓ̇_θ(t)) is the skewness, at parameter value θ = θ̂, of the score statistic ℓ̇_θ(t) = (∂/∂θ) log f_θ(t). For θ̂ ~ θ(χ²₁₉/19) this gives a ≈ .1081, compared to the actual value a = .1077 derived in Efron (1984). For the normal theory correlation family of Table 5, a ≈ 0, which explains why the BC method, which takes a = 0, works so well there.


The advantage of formula (7.18) is that we need not know the transformation g leading to (7.14) in order to approximate a. In fact θ̂_BCa[α], like θ̂_BC[α] and θ̂_P[α], is transformation invariant, as in (7.7). Like the bootstrap methods, the BCa intervals are computed directly from the form of the density function f_θ(·), for θ near θ̂.

Formula (7.16) applies to the case where θ is the only parameter. Section 8 briefly discusses the more challenging problem of setting confidence intervals for a parameter θ in a multiparameter family, and also in nonparametric situations where the number of nuisance parameters is effectively infinite.

To summarize this section, the progression from the standard intervals to the BCa method is based on a series of increasingly less restrictive assumptions, as shown in Table 6. Each successive method in Table 6 requires the statistician to do a greater amount of computation: first the bootstrap distribution Ĝ, then the bias correction constant z₀, and finally the constant a. However, all of these computations are algorithmic in character, and can be carried out in an automatic fashion.

Chapter 10 of Efron (1982a) discusses several other ways of using the bootstrap to construct approximate confidence intervals, which will not be presented here. One of these methods, the "bootstrap t," was used in the blood serum example of Section 4.

8. NONPARAMETRIC AND MULTIPARAMETER CONFIDENCE INTERVALS

Section 7 focused on the simple case θ̂ ~ f_θ, where we have only a real valued parameter θ and a real valued summary statistic θ̂ from which we are trying to construct a confidence interval for θ. Various favorable properties of the bootstrap confidence intervals were demonstrated in the simple case, but of course the simple case is where we least need a general method like the bootstrap.

Now we will discuss the more common situation where there are nuisance parameters besides the parameter of interest θ; or, even more generally, the nonparametric case, where the number of nuisance parameters is effectively infinite. The discussion is limited to a few brief examples. Efron (1984 and 1985) develops the theoretical basis of bootstrap approximate confidence intervals for complicated situations, and gives many more examples. The word "approximate" is important here, since exact nonparametric confidence intervals do not exist for most parameters (see Bahadur and Savage, 1956).

Example 1. Ratio Estimation

The data consists of y = (y₁, y₂), assumed to come from a bivariate normal distribution with unknown mean vector η and covariance matrix the identity,

(8.1)   y ~ N₂(η, I).

The parameter of interest, for which we desire a confidence interval, is the ratio

(8.2)   θ = η₂/η₁.

Fieller (1954) provided well known exact intervals for θ in this case. The Fieller intervals are based on a clever trick, which seems very special to situation (8.1), (8.2).

TABLE 8
Central 90% confidence intervals for θ = η₂/η₁ and for φ = 1/θ, having observed (y₁, y₂) = (8, 4) from a bivariate normal distribution y ~ N₂(η, I)

                                For θ          For φ
    1. Exact (Fieller)          [.29, .76]     [1.32, 3.50]
    2. Parametric boot (BC)     [.29, .76]     [1.32, 3.50]
    3. Standard (1.7)           [.27, .73]     [1.08, 2.92]

    MLE                         θ̂ = .5         φ̂ = 2

Note: The BC intervals, line 2, are based on the parametric bootstrap distribution of θ̂ = y₂/y₁.

Table 8 shows Fieller's central 90% interval for θ having observed y = (8, 4). Also shown is the Fieller interval for φ = 1/θ = η₁/η₂, which equals [.76^{-1}, .29^{-1}], the obvious transformation of the interval for θ. The standard interval (1.7) is satisfactory for θ, but not for φ. Notice that the standard interval does not transform correctly from θ to φ.

Line 2 shows the BC intervals based on applying definitions (7.8) and (7.9) to the parametric bootstrap distribution of θ̂ = y₂/y₁ (or φ̂ = y₁/y₂). This is the distribution of θ̂* = y₂*/y₁* when sampling y* = (y₁*, y₂*) from F̂_NORM ~ N₂((y₁, y₂), I). The bootstrap intervals transform correctly, and in this case they agree with the exact interval to three decimal places.
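(The parametric bootstrap calculation behind line 2 of Table 8 is short. The sketch below is our own illustration: it draws y* from N₂((8, 4), I), forms θ̂* = y₂*/y₁*, and applies (7.8) and (7.9); with a large number of replications the Monte Carlo interval comes out close to the Fieller interval [.29, .76].)

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(4)
    y1, y2 = 8.0, 4.0
    theta_hat = y2 / y1

    y_star = rng.normal(loc=[y1, y2], scale=1.0, size=(100_000, 2))   # y* ~ N2((8,4), I)
    theta_star = y_star[:, 1] / y_star[:, 0]

    z0 = norm.ppf(np.mean(theta_star < theta_hat))                    # (7.8)
    alpha = 0.05
    lo = np.quantile(theta_star, norm.cdf(2 * z0 + norm.ppf(alpha)))  # (7.9)
    hi = np.quantile(theta_star, norm.cdf(2 * z0 + norm.ppf(1 - alpha)))
    print(round(lo, 2), round(hi, 2))      # close to [.29, .76]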
Example 2. Product of Normal Means

For most multiparameter situations, there do not exist exact confidence intervals for a single parameter of interest. Suppose for instance that (8.2) is changed to

(8.3)   θ = η₁η₂,

still assuming (8.1). Table 9 shows approximate intervals for θ, and also for φ = θ², having observed y = (2, 4). The "almost exact" intervals are based on an analog of Fieller's argument (Efron, 1985), which with suitable care can be carried through to a high degree of accuracy. Once again, the parametric BC intervals are a close match to line 1. The fact that the standard intervals do not transform correctly is particularly obvious here.


TABLE 9
Central 90% confidence intervals for θ = η₁η₂ and φ = θ², having observed y = (2, 4), where y ~ N₂(η, I)

                                For θ             For φ
    1. Almost exact             [1.77, 17.03]     [3.1, 290.0]
    2. Parametric boot (BC)     [1.77, 17.12]     [3.1, 239.1]
    3. Standard (1.7)           [0.64, 15.36]     [-53.7, 181.7]

    MLE                         θ̂ = 8             φ̂ = 64

Note: The almost exact intervals are based on the high order approximation theory of Efron (1985). The BC intervals of line 2 are based on the parametric bootstrap distribution of θ̂ = y₁y₂.

The good performance of the parametric BC intervals is not accidental. The theory developed in Efron (1985) shows that the BC intervals, based on bootstrapping the MLE θ̂, agree to high order with the almost exact intervals in the following class of problems: the data y comes from a multiparameter family of densities f_η(y), both y and η k-dimensional vectors; the real valued parameter of interest θ is a smooth function of η, θ = t(η); and the family f_η(y) can be transformed to multivariate normality, say

(8.4)   g(y) ~ N_k(h(η), I),

by some one-to-one transformations g and h.

Just as in Section 7, it is not necessary for the statistician to know the normalizing transformations g and h, only that they exist. The BC intervals are obtained directly from the original densities f_η: we find η̂ = η̂(y), the MLE of η; sample y* ~ f_η̂; compute θ̂*, the bootstrap MLE of θ; calculate Ĝ, the bootstrap cdf of θ̂*, usually by Monte Carlo sampling; and finally apply definitions (7.8) and (7.9). This process gives the same interval for θ whether or not the transformation to form (8.4) has been made.

Not all problems can be transformed as in (8.4) to a normal distribution with constant covariance. The case considered in Table 7 is a one-dimensional counterexample. As a result the BC intervals do not always work as well as in Tables 8 and 9, although they usually improve on the standard method. However, in order to take advantage of the BCa method, which is based on more general assumptions, we need to be able to calculate the constant a.

Efron (1984) gives expressions for "a" generalizing (7.16) to multiparameter families, and also to nonparametric situations. If (8.4) holds, then "a" will have value zero, and the BCa method reduces to the BC case. Otherwise the two intervals differ.

Here we will discuss only the nonparametric situation: the observed data y = (x₁, x₂, ..., x_n) consists of iid observations X₁, X₂, ..., X_n ~ F, where F can be any distribution on the sample space 𝒳; we want a confidence interval for θ = t(F), some real valued functional of F; and the bootstrap intervals are based on bootstrapping θ̂ = t(F̂), which is the nonparametric MLE of θ. In this case a good approximation to the constant a is given in terms of the empirical influence function U_i⁰, defined in Section 10 at (10.11),

(8.5)   a ≈ (1/6) Σ_{i=1}^n (U_i⁰)³ / [Σ_{i=1}^n (U_i⁰)²]^{3/2}.

This is a convenient formula, since it is easy to numerically evaluate the U_i⁰ by simply substituting a small value of ε into (10.11).

Example 3. The Law School Data

For θ̂ the correlation coefficient, the values of U_i⁰ corresponding to the 15 data points shown in Fig. 1 are -1.507, .168, .273, .004, .525, -.049, -.100, .477, .310, .004, -.526, -.091, .434, .125, -.048. (Notice how influential law school 1 is.) Formula (8.5) gives a ≈ -.0817. B = 100,000 bootstrap replications, about 100 times more than was actually necessary (see Section 9), gave ẑ₀ = -.0927, and the central 90% interval θ ∈ [.43, .92] shown in Table 5. The nonparametric BCa interval is quite reasonable in this example, particularly considering that there is no guarantee that the true law school distribution F is anywhere near bivariate normal.
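(Formula (8.5) applied to the fifteen influence values just listed is a one-line computation; the sketch below, our own illustration, recovers a ≈ -.0817.)

    import numpy as np

    # empirical influence values U_i^0 for the law school data, as listed above
    U = np.array([-1.507, .168, .273, .004, .525, -.049, -.100, .477,
                  .310, .004, -.526, -.091, .434, .125, -.048])

    a = (U ** 3).sum() / (6.0 * (U ** 2).sum() ** 1.5)   # formula (8.5)
    print(round(a, 4))                                   # -0.0817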
Example 4. Mouse Leukemia Data
(the First Example in Section 3)

The standard central 90% interval for β in formula (3.1) is [.835, 2.18]. The bias correction constant is ẑ₀ = .0275, giving BC interval [1.00, 2.39]. This is shifted far right of the standard interval, reflecting the long right tail of the bootstrap histogram seen in Fig. 3. We can calculate "a" from (8.5), considering each of the n = 42 data points to be a triple (y_i, x_i, δ_i): a ≈ -.152. Because a is negative, the BCa interval is shifted back to the left, equaling [.788, 2.10]. This contrasts with the law school example, where a, z₀, and the skewness of the bootstrap distribution added to each other rather than cancelling out, resulting in a BCa interval much different from the standard interval.

Efron (1984) provides some theoretical support for the nonparametric BCa method. However the problem of setting approximate nonparametric confidence intervals is still far from well understood, and all methods should be interpreted with some caution. We end this section with a cautionary example.

Example 5. The Variance

Suppose 𝒳 is the real line, and θ = Var_F X, the variance. Line 5 of Table 2 shows the result of applying the nonparametric BCa method to data sets x₁, x₂, ..., x₂₀ which were actually iid samples from a N(0, 1) distribution. The number .640, for example, is the


average of θ̂_BCa[.05]/θ over 40 such data sets, [...] bootstrap replications per data set. The upper limit 1.68·θ̂ is noticeably small, as pointed out by Schenker (1985). The reason is simple: the nonparametric bootstrap distribution of θ̂* has a short upper tail, compared to the parametric bootstrap distribution, which is a scaled χ²₁₉ random variable. The results of Beran (1984), Bickel and Freedman (1981), and Singh (1981) show that the nonparametric bootstrap distribution is highly accurate asymptotically, but of course that is not a guarantee of good small sample behavior. Bootstrapping from a smoothed version of F̂, as in lines 3, 4, and 5 of Table 2, alleviates the problem in this particular example.

9. BOOTSTRAP SAMPLE SIZES

How many bootstrap replications must we take? Consider the standard error estimate σ̂_B based on B bootstrap replications, (2.4). As B → ∞, σ̂_B approaches σ̂, the bootstrap estimate of standard error as originally defined in (2.3). Because F̂ does not estimate F perfectly, σ̂ = σ(F̂) will have a non-zero coefficient of variation for estimating the true standard error σ = σ(F); σ̂_B will have a larger CV because of the randomness added by the Monte Carlo bootstrap sampling. It is easy to derive the following approximation,

(9.1)   CV(σ̂_B) ≈ { CV(σ̂)² + [E(δ̂) + 2]/(4B) }^{1/2},

where δ̂ is the kurtosis of the bootstrap distribution of θ̂*, given the data y, and E{δ̂} is its expected value averaged over y. For typical situations, CV(σ̂) lies between .10 and .30. For example, if θ̂ = x̄, n = 20, and X_i ~iid N(0, 1), then CV(σ̂) ≈ .16.

Table 10 shows CV(σ̂_B) for various values of B and CV(σ̂), assuming E{δ̂} = 0 in (9.1). For values of CV(σ̂) > .10, there is little improvement past B = 100. In fact B as small as 25 gives reasonable results. Even smaller values of B can be quite informative, as we saw in the Stanford Heart Transplant Data (Fig. 7 of Section 3).

TABLE 10
Coefficient of variation of σ̂_B, the bootstrap estimate of standard error based on B Monte Carlo replications, as a function of B and CV(σ̂), the limiting CV as B → ∞

                          B
                  25     50     100    200    ∞
    CV(σ̂)  .25    .29    .27    .26    .25    .25
           .20    .24    .22    .21    .21    .20
           .15    .21    .18    .17    .16    .15
           .10    .17    .14    .12    .11    .10
           .05    .15    .11    .09    .07    .05
           0      .14    .10    .07    .05    0

Note: Based on (9.1), assuming E{δ̂} = 0.

The situation is quite different for confidence intervals. The calculations of Section 8 show that B = 1000 is a rough minimum for the number of Monte Carlo bootstraps necessary to compute the BC or BCa intervals. Somewhat smaller values, say B = 250, can give a useful percentile interval, the difference being that then the constant z₀ need not be computed. Confidence intervals are a fundamentally more ambitious measure of statistical accuracy than standard errors, so it is not surprising that they require more computational effort.
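(Approximation (9.1) is simple enough to tabulate directly; the sketch below, our own illustration, regenerates the body of Table 10 under the assumption E{δ̂} = 0.)

    import numpy as np

    def cv_sigma_B(cv_limit, B, kurtosis=0.0):
        """Approximation (9.1) for the coefficient of variation of sigma-hat_B."""
        return np.sqrt(cv_limit ** 2 + (kurtosis + 2.0) / (4.0 * B))

    for cv in (0.25, 0.20, 0.15, 0.10, 0.05, 0.0):
        row = [round(cv_sigma_B(cv, B), 2) for B in (25, 50, 100, 200)]
        print(cv, row)         # matches the corresponding row of Table 10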
10. THE JACKKNIFE AND THE DELTA METHOD

This section returns to the simple case of assigning a standard error to θ̂(y), where y = (x₁, ..., x_n) is obtained by random sampling from a single unknown distribution, X₁, ..., X_n ~iid F. We will give another description of the bootstrap estimate σ̂, which illustrates the bootstrap's relationship to older techniques of assigning standard errors, like the jackknife and the delta method.

For a given bootstrap sample y* = (x₁*, x₂*, ..., x_n*), as described in step (i) of the algorithm in Section 2, let P_i* indicate the proportion of the bootstrap sample equal to x_i,

(10.1)   P_i* = #{x_j* = x_i}/n,   i = 1, 2, ..., n,

P* = (P₁*, P₂*, ..., P_n*). The vector P* has a rescaled multinomial distribution,

(10.2)   P* ~ Mult_n(n, P⁰)/n    (P⁰ = (1/n, 1/n, ..., 1/n)),

where the notation indicates the proportions observed from n random draws on n categories, each with probability 1/n.

For n = 3 there are 10 possible bootstrap vectors P*. These are indicated in Fig. 15 along with their multinomial probabilities from (10.2). For example, P* = (1/3, 0, 2/3), corresponding to x* = (x₁, x₃, x₃) or any permutation of these values, has bootstrap probability 1/9.

To make our discussion easier, suppose that the statistic of interest θ̂ is of functional form: θ̂ = θ(F̂), where θ(F) is a functional assigning a real number to any distribution F on the sample space 𝒳. The mean, the correlation coefficient, and the trimmed mean are all of functional form. Statistics of functional form have the same value as a function of F̂ no matter what the sample size n may be, which is convenient for discussing the jackknife and delta method.

For any vector P = (P₁, P₂, ..., P_n) having non-negative weights summing to 1, define the weighted empirical distribution

(10.3)   F̂(P): probability P_i on x_i,   i = 1, ..., n.

For P = P⁰ = (1/n, ..., 1/n), the weighted empirical distribution equals F̂, (1.4).
assuming Ef&} summing
= 0.to 1, define the weighted


[FIG. 15. The bootstrap and jackknife sampling points in the case n = 3. The bootstrap points (·) are shown with their probabilities.]

Corresponding to P is a resampled value of θ̂,

(10.4)   θ̂(P) ≡ θ(F̂(P)).

The shortened notation θ̂(P) assumes that the data (x₁, x₂, ..., x_n) is considered fixed. Notice that θ̂(P⁰) = θ(F̂) is the observed value of the statistic of interest. The bootstrap estimate σ̂, (2.3), can then be written

(10.5)   σ̂ = [var* θ̂(P*)]^{1/2},

where var* indicates variance with respect to distribution (10.2). In terms of Fig. 15, σ̂ is the standard deviation of the ten possible bootstrap values θ̂(P*), weighted as shown.

It looks like we could always calculate σ̂ simply by doing a finite sum. Unfortunately, the number of bootstrap points is (2n - 1 choose n), which is 77,558,760 for n = 15, so straightforward calculation of σ̂ is usually impractical. That is why we have emphasized Monte Carlo approximations to σ̂. Therneau (1983) considers the question of methods more efficient than pure Monte Carlo, but at present there is no generally better method available.

However, there is another approach to approximating (10.5). We can replace the usually complicated function θ̂(P) by an approximation linear in P, and then use the well known formula for the multinomial variance of a linear function. The jackknife approximation θ̂_J(P) is the linear function of P which matches θ̂(P), (10.4), at the n points corresponding to the deletion of a single x_i from the observed data set,

(10.6)   P₍ᵢ₎ = (1, 1, ..., 1, 0, 1, ..., 1)/(n - 1),   the 0 in the ith place,

i = 1, 2, ..., n. Fig. 15 indicates the jackknife points for n = 3; because θ̂ is of functional form, (10.4), it does not matter that the jackknife points correspond to sample size n - 1 rather than n.

The linear function θ̂_J(P) is calculated to be

(10.7)   θ̂_J(P) = θ̂₍·₎ + (P - P⁰) U,

where, in terms of θ̂₍ᵢ₎ ≡ θ̂(P₍ᵢ₎) and θ̂₍·₎ = Σ_{i=1}^n θ̂₍ᵢ₎/n, U is the vector with ith coordinate

(10.8)   U_i = (n - 1)(θ̂₍·₎ - θ̂₍ᵢ₎).

The jackknife estimate of standard error (Tukey, 1958; Miller, 1974) is

(10.9)   σ̂_J = [ ((n - 1)/n) Σ_{i=1}^n (θ̂₍ᵢ₎ - θ̂₍·₎)² ]^{1/2} = [ Σ_{i=1}^n U_i²/(n(n - 1)) ]^{1/2}.

A standard multinomial calculation gives the following theorem (Efron, 1982a).

THEOREM. The jackknife estimate of standard error equals [n/(n - 1)]^{1/2} times the bootstrap estimate of standard error for θ̂_J,

(10.10)   σ̂_J = [ (n/(n - 1)) var* θ̂_J(P*) ]^{1/2}.

In other words, the jackknife estimate is itself almost a bootstrap estimate, applied to a linear approximation of θ̂. The factor [n/(n - 1)]^{1/2} in (10.10) makes σ̂_J² unbiased for σ² in the case where θ̂ = x̄, the sample mean. We could multiply the bootstrap estimate σ̂ by this same factor, and achieve the same unbiasedness, but there does not seem to be any consistent advantage to doing so. The jackknife requires n, rather than B = 50 to 200, resamples, at the expense of adding a linear approximation to the standard error estimate. Tables 1 and 2 indicate that there is some estimating efficiency lost in making this approximation. For statistics like the sample median, which are difficult to approximate linearly, the jackknife is useless (see Section 3.4 of Efron, 1982a).
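(The jackknife recipe (10.6)-(10.9) is easy to program; the sketch below is our own illustration. For the sample mean it reduces to the familiar estimate [Σ(x_i - x̄)²/(n(n - 1))]^{1/2}.)

    import numpy as np

    def jackknife_se(x, stat):
        """Jackknife standard error, formulas (10.8)-(10.9)."""
        x = np.asarray(x)
        n = len(x)
        theta_i = np.array([stat(np.delete(x, i)) for i in range(n)])   # theta-hat_(i)
        theta_dot = theta_i.mean()                                      # theta-hat_(.)
        return np.sqrt((n - 1) / n * ((theta_i - theta_dot) ** 2).sum())

    rng = np.random.default_rng(5)
    x = rng.standard_normal(20)
    print(jackknife_se(x, np.mean))          # equals x.std(ddof=1) / sqrt(20)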
There is a more obvious linear approximation to θ̂(P) than θ̂_J(P). Why not use the first-order Taylor series expansion for θ̂(P) about the point P = P⁰? This is the idea of Jaeckel's infinitesimal jackknife (1972). The Taylor series approximation turns out to be

    θ̂_T(P) ≈ θ̂(P⁰) + (P - P⁰) U⁰,

where

(10.11)   U_i⁰ = lim_{ε→0} [θ̂((1 - ε)P⁰ + ε δ_i) - θ̂(P⁰)]/ε,

δ_i being the ith coordinate vector. This suggests the


infinitesimal jackknife estimate of standard error,

(10.12)   σ̂_IJ = [var* θ̂_T(P*)]^{1/2} = [ Σ_{i=1}^n (U_i⁰)²/n² ]^{1/2},

with var* still indicating variance under (10.2). The ordinary jackknife can be thought of as taking ε = -1/(n - 1) in the definition of U_i⁰, while the infinitesimal jackknife lets ε → 0, thereby earning the name.

The U_i⁰ are values of what Mallows (1974) calls the empirical influence function. Their definition is a nonparametric estimate of the true influence function

    IF(x) = lim_{ε→0} [θ((1 - ε)F + ε Δ_x) - θ(F)]/ε,

Δ_x being the degenerate distribution putting mass 1 on x. The right side of (10.12) is then the obvious estimate of the influence function approximation to the standard error of θ̂ (Hampel, 1974), σ(F) ≈ [∫ IF²(x) dF(x)/n]^{1/2}. The empirical influence function method and the infinitesimal jackknife give identical estimates of standard error.
on x. The right side of (10.12) is then the obvious
estimate of the influence function approximation ANDERSON, 0. D. (1975). Time Series Analysis and Forecasting:
The Box-Jenkins Approach. Butterworth, London.
to the standard error of 0 (Hampel, 1974), v(F)-
BAHADUR, R. and SAVAGE, L. (1956). The nonexistence of certain
[ IF2(x) dF(x)/n]1/2. The empirical influence function
statistical procedures in nonparametric problems. Ann. Math.
method and the infinitesimal jackknife give identical Statist. 27 1115-1122.
estimates of standard error. BERAN, R. (1984). Bootstrap methods in statistics. Jahrb. Math.
How have statisticians gotten along for so many Ver. 86 14-30.
BICKEL, P. J. and FREEDMAN, D. A. (1981). Some asymptotic theory
years without methods like the jackknife and the
for the bootstrap. Ann. Statist. 9 1196-1217.
bootstrap? The answer is the delta method, which is
Box, G. E. P. and Cox, D. R. (1964). An analysis of transforma-
still the most commonly used device for approximating tions. J. R. Statist. Soc. Ser. B 26 211-252.
standard errors. The method applies to statistics of BREIMAN, L. and FRIEDMAN, J. H. (1985). Estimating optimal
the form t(Q1, Q2, . , QA), where t(-, ., -. ., .) is a transformations for multiple regression and correlation. J.
Amer. Statist. Assoc. 80 580-619.
known function and each Qa is an observed average,
Cox, D. R. (1972). Regression models and life tables. J. R. Statist.
Qa = X7n=1 Qa(Xi)/n. For example, the correlation 0 is Soc. Ser. B 34 187-202.
a function of A = 5 such averages; the average of the CRAMER, H. (1946). Mathematical Methods of Statistics. Princeton
first coordinate values, the second coordinates, the University Press, Princeton, New Jersey.
first coordinates squared, the second coordinates EFRON, B. (1979a). Bootstrap methods: another look at the jack-
knife. Ann. Statist. 7 1-26.
squared, and the cross-products.
EFRON, B. (1979b). Computers and the theory of statistics: thinking
In its nonparametric formulation, the delta method
the unthinkable. Soc. Ind. Appl. Math. 21 460-480.
works by (a) expanding t in a linear Taylor series EFRON, B. (1981a). Censored data and the bootstrap. J. Amer.
about the expectations of the Qa; (b) evaluating the Statist. Assoc. 76 312-319.
standard error of the Taylor series using the usual EFRON, B. (1981b). Nonparametric estimates of standard error: the
jackknife, the bootstrap, and other resampling methods, Biom-
expressions for variances and covariances of averages;
etrika 68 589-599.
and (c) substituting -y(F) for any unknown quantity EFRON, B. (1982a). The jackknife, the bootstrap, and other resam-
'y(F) occurring in (b). For example, the nonparametric pling plans. Soc. Ind. Appl. Math. CBMS-Natl. Sci. Found.
delta method estimates the standard error of the cor- Monogr. 38.
relation 0 by EFRON, B. (1982b). Transformation theory: how normal is a one
parameter family of distributions? Ann. Statist. 10 323-339.
[ A240 + 04 + 222 4,22 4231 413 / EFRON, B. (1982c). Maximum likelihood and decision theory. Ann.
,-;- 2 A A + - +A A Statist. 10 340-356.
I4n fL20 L02 L20L02 A11 A11/02 Al11L02 J
EFRON, B. (1983). Estimating the error rate of a prediction rule:
improvements in cross-validation. J. Amer. Statist. Assoc. 78
where, in terms of xi = (yi, zi),
316-331.

-Lg z (Y -) g -)h
EFRON, B. (1984). Better bootstrap confidence intervals. Tech. Rep.
Stanford Univ. Dept. Statist.
(Cramer (1946), p. 359).
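(The μ̂ formula above translates directly into a few lines of code; the sketch below is our own transcription of it, not the authors' program.)

    import numpy as np

    def delta_se_corr(y, z):
        """Nonparametric delta-method standard error of the sample correlation,
        using the mu-hat formula above."""
        y, z = np.asarray(y, float), np.asarray(z, float)
        n = len(y)
        yc, zc = y - y.mean(), z - z.mean()
        mu = lambda g, h: np.mean(yc ** g * zc ** h)
        theta = mu(1, 1) / np.sqrt(mu(2, 0) * mu(0, 2))
        s = (mu(4, 0) / mu(2, 0) ** 2 + mu(0, 4) / mu(0, 2) ** 2
             + 2 * mu(2, 2) / (mu(2, 0) * mu(0, 2)) + 4 * mu(2, 2) / mu(1, 1) ** 2
             - 4 * mu(3, 1) / (mu(1, 1) * mu(2, 0)) - 4 * mu(1, 3) / (mu(1, 1) * mu(0, 2)))
        return np.sqrt(theta ** 2 * s / (4 * n))

    rng = np.random.default_rng(7)
    y = rng.standard_normal(15)
    z = 0.7 * y + 0.7 * rng.standard_normal(15)
    print(delta_se_corr(y, z))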
THEOREM. For statistics of the form θ̂ = t(Q̄₁, ..., Q̄_A), the nonparametric delta method and the infinitesimal jackknife give the same estimate of standard error (Efron, 1982c).

The infinitesimal jackknife, the delta method, and the empirical influence function approach are three names for the same method. Notice that the results reported in line 7 of Table 2 show a severe downward bias. Efron and Stein (1981) show that the ordinary jackknife is always biased upward, in a sense made precise in that paper. In the authors' opinion the ordinary jackknife is the method of choice if one does not want to do the bootstrap computations.

ACKNOWLEDGMENT

This paper is based on a previous review article appearing in Behaviormetrika. The authors and Editor are grateful to that journal for graciously allowing this revision to appear here.

REFERENCES

ANDERSON, O. D. (1975). Time Series Analysis and Forecasting: The Box-Jenkins Approach. Butterworth, London.
BAHADUR, R. and SAVAGE, L. (1956). The nonexistence of certain statistical procedures in nonparametric problems. Ann. Math. Statist. 27 1115-1122.
BERAN, R. (1984). Bootstrap methods in statistics. Jahrb. Math. Ver. 86 14-30.
BICKEL, P. J. and FREEDMAN, D. A. (1981). Some asymptotic theory for the bootstrap. Ann. Statist. 9 1196-1217.
BOX, G. E. P. and COX, D. R. (1964). An analysis of transformations. J. R. Statist. Soc. Ser. B 26 211-252.
BREIMAN, L. and FRIEDMAN, J. H. (1985). Estimating optimal transformations for multiple regression and correlation. J. Amer. Statist. Assoc. 80 580-619.
COX, D. R. (1972). Regression models and life tables. J. R. Statist. Soc. Ser. B 34 187-202.
CRAMER, H. (1946). Mathematical Methods of Statistics. Princeton University Press, Princeton, New Jersey.
EFRON, B. (1979a). Bootstrap methods: another look at the jackknife. Ann. Statist. 7 1-26.
EFRON, B. (1979b). Computers and the theory of statistics: thinking the unthinkable. Soc. Ind. Appl. Math. 21 460-480.
EFRON, B. (1981a). Censored data and the bootstrap. J. Amer. Statist. Assoc. 76 312-319.
EFRON, B. (1981b). Nonparametric estimates of standard error: the jackknife, the bootstrap, and other resampling methods. Biometrika 68 589-599.
EFRON, B. (1982a). The jackknife, the bootstrap, and other resampling plans. Soc. Ind. Appl. Math. CBMS-Natl. Sci. Found. Monogr. 38.
EFRON, B. (1982b). Transformation theory: how normal is a one parameter family of distributions? Ann. Statist. 10 323-339.
EFRON, B. (1982c). Maximum likelihood and decision theory. Ann. Statist. 10 340-356.
EFRON, B. (1983). Estimating the error rate of a prediction rule: improvements in cross-validation. J. Amer. Statist. Assoc. 78 316-331.
EFRON, B. (1984). Better bootstrap confidence intervals. Tech. Rep., Stanford Univ. Dept. Statist.
EFRON, B. (1985). Bootstrap confidence intervals for a class of parametric problems. Biometrika 72 45-58.
EFRON, B. and GONG, G. (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. Amer. Statistician 37 36-48.
EFRON, B. and STEIN, C. (1981). The jackknife estimate of variance. Ann. Statist. 9 586-596.
FIELLER, E. C. (1954). Some problems in interval estimation. J. R. Statist. Soc. Ser. B 16 175-183.
FRIEDMAN, J. H. and STUETZLE, W. (1981). Projection pursuit regression. J. Amer. Statist. Assoc. 76 817-823.


FRIEDMAN, J. H. and TIBSHIRANI, R. J. (1984). The monotone smoothing of scatter-plots. Technometrics 26 243-250.
HAMPEL, F. R. (1974). The influence curve and its role in robust estimation. J. Amer. Statist. Assoc. 69 383-393.
HASTIE, T. J. and TIBSHIRANI, R. J. (1985). Discussion of Peter Huber's "Projection Pursuit." Ann. Statist. 13 502-508.
HINKLEY, D. V. (1978). Improving the jackknife with special reference to correlation estimation. Biometrika 65 13-22.
HYDE, J. (1980). Survival Analysis with Incomplete Observations. Biostatistics Casebook. Wiley, New York.
JAECKEL, L. (1972). The infinitesimal jackknife. Memorandum MM 72-1215-11, Bell Laboratories, Murray Hill, New Jersey.
JOHNSON, N. and KOTZ, S. (1970). Continuous Univariate Distributions, Vol. 2. Houghton Mifflin, Boston.
KAPLAN, E. L. and MEIER, P. (1958). Nonparametric estimation from incomplete samples. J. Amer. Statist. Assoc. 53 457-481.
KIEFER, J. and WOLFOWITZ, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Ann. Math. Statist. 27 887-906.
MALLOWS, C. (1974). On some topics in robustness. Memorandum, Bell Laboratories, Murray Hill, New Jersey.
MILLER, R. G. (1974). The jackknife--a review. Biometrika 61 1-17.
MILLER, R. G. and HALPERN, J. (1982). Regression with censored data. Biometrika 69 521-531.
RAO, C. R. (1973). Linear Statistical Inference and Its Applications. Wiley, New York.
SCHENKER, N. (1985). Qualms about bootstrap confidence intervals. J. Amer. Statist. Assoc. 80 360-361.
SINGH, K. (1981). On the asymptotic accuracy of Efron's bootstrap. Ann. Statist. 9 1187-1195.
THERNEAU, T. (1983). Variance reduction techniques for the bootstrap. Ph.D. thesis, Stanford University, Department of Statistics.
TIBSHIRANI, R. J. and HASTIE, T. J. (1984). Local likelihood estimation. Tech. Rep. 97, Stanford Univ. Dept. Statist.
TUKEY, J. (1958). Bias and confidence in not quite large samples (abstract). Ann. Math. Statist. 29 614.

Comment
J. A. Hartigan

J. A. Hartigan is Eugene Higgins Professor of Statistics, Yale University, Box 2179 Yale Station, New Haven, CT 06520.

Efron and Tibshirani are to be congratulated on a wide-ranging, persuasive survey of the many uses of the bootstrap technology. They are a bit cagey on what is or is not a bootstrap, but the description at the end of Section 4 seems to cover all the cases: some data y comes from an unknown probability distribution F; it is desired to estimate the distribution of some function R(y, F) given F; and this is done by estimating the distribution of R(y*, F̂) given F̂, where F̂ is an estimate of F based on y, and y* is sampled from the known F̂.

There will be three problems in any application of the bootstrap: (1) how to choose the estimate F̂? (2) how much sampling of y* from F̂? and (3) how close is the distribution of R(y*, F̂) given F̂ to that of R(y, F) given F?

Efron and Tibshirani suggest a variety of estimates F̂ for simple random sampling, regression, and autoregression; their remarks about (3) are confined mainly to empirical demonstrations of the bootstrap in specific situations.

I have some general reservations about the bootstrap based on my experiences with subsampling techniques (Hartigan, 1969, 1975). Let X₁, ..., X_n be a random sample from a distribution F, let F_n be the empirical distribution, and suppose that t(F_n) is an estimate of some population parameter t(F). The statistic t(F_n) is computed for several random subsamples (each observation appearing in the subsample with probability 1/2), and the set of t(F_n) values obtained is regarded as a sample from the posterior distribution of t(F). For example, the standard deviation of the t(F_n) is an estimate of the standard error of t(F_n) from t(F); however, the procedure is not restricted to real valued t.

The procedure seems to work not too badly in getting at the first- and second-order behaviors of t(F_n) when t(F_n) is near normal, but it is not effective in handling third-order behavior, bias, and skewness. Thus there is not much point in taking huge samples of t(F_n), since the third-order behavior is not relevant; and if the procedure works only for t(F_n) near normal, there are less fancy procedures for estimating standard error, such as dividing the sample up into 10 subsamples of equal size and computing their standard deviation. (True, this introduces more bias than having random subsamples each containing about half the observations.) Indeed, even if t(F_n) is not normal, we can obtain exact confidence intervals for the median of t(F_{n/10}) using the 10 subsamples. Even five subsamples will give a respectable idea of the standard error.

Transferring back to the bootstrap: (A) is the boot-
