WIREs Computational Stats - 2011 - Hesterberg - Bootstrap
Bootstrap
Tim Hesterberg∗
This article provides an introduction to the bootstrap. The bootstrap provides
statistical inferences—standard error and bias estimates, confidence intervals,
and hypothesis tests—without assumptions such as Normal distributions
or equal variances. As such, bootstrap methods can be remarkably more
accurate than classical inferences based on Normal or t distributions. The
bootstrap uses the same basic procedure regardless of the statistic being
calculated, without requiring the use of application-specific formulae. This
article may provide two big surprises for many readers. The first is
that the bootstrap shows that common t confidence intervals are woefully
inaccurate when populations are skewed, with one-sided coverage levels off
by factors of two or more, even for very large samples. The second is that
the number of bootstrap samples required is much larger than generally
realized. © 2011 John Wiley & Sons, Inc. WIREs Comp Stat 2011, 3, 497–526. DOI: 10.1002/wics.182
Keywords: resampling; permutation tests; inference; standard error; bias
FIGURE 2 | Histogram and Normal quantile plot of the bootstrap distribution for arsenic concentrations.
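A bootstrap distribution like the one in Figure 2 is generated by resampling the observed data with replacement and recomputing the statistic each time. The following is a minimal sketch, using hypothetical skewed data in place of the arsenic measurements (which are not reproduced here); the distribution, sample size, and seed are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the arsenic measurements: any 1-D array of
# positive, skewed values will do for this sketch.
x = rng.lognormal(mean=4.8, sigma=0.5, size=271)

B = 10_000          # number of bootstrap samples
n = len(x)

# Draw B resamples of size n with replacement; record the mean of each.
boot_means = np.array([rng.choice(x, size=n, replace=True).mean()
                       for _ in range(B)])

se_boot = boot_means.std(ddof=1)        # bootstrap standard error
bias    = boot_means.mean() - x.mean()  # bootstrap bias estimate
```

A histogram and normal quantile plot of `boot_means` would then reproduce the kind of display shown in Figure 2.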
FIGURE 4 | Histogram and scatterplot of the bootstrap distribution for relative risk. (Right panel: the x-axis is the proportion in the low-risk group; the slope of a line from the origin equals the relative risk; "Bootstrap CI" and "t CI" endpoints are marked.)
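The two-group resampling behind the bootstrap distribution for relative risk in Figure 4 resamples each group independently, with replacement, and recomputes the ratio of proportions. A sketch follows; the group sizes n1 = 3338 and n2 = 2676 match the text, but the 0/1 outcome counts are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 0/1 outcome vectors; group sizes match the text, event
# counts are made up.
x1 = np.r_[np.ones(63), np.zeros(3338 - 63)]
x2 = np.r_[np.ones(24), np.zeros(2676 - 24)]

theta_hat = x1.mean() / x2.mean()   # observed relative risk

B = 4000
rr = np.empty(B)
for b in range(B):
    # Resample each group independently, with replacement.
    p1 = rng.choice(x1, size=x1.size, replace=True).mean()
    p2 = rng.choice(x2, size=x2.size, replace=True).mean()
    rr[b] = p1 / p2 if p2 > 0 else np.nan   # theta* undefined when p2* = 0

rr = rr[~np.isnan(rr)]              # drop the (rare) undefined replicates
se_boot  = rr.std(ddof=1)
bias_est = rr.mean() - theta_hat    # positive when replicates average high
```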
To bootstrap this, we draw samples of size n1 = 3338 with replacement from the first group, independently draw samples of size n2 = 2676 from the second group, and calculate the relative risk θ̂∗. In addition, we record the individual proportions p̂∗1 and p̂∗2. The bootstrap distribution for relative risk is shown in the left panel of Figure 4. It is highly skewed, with a long right tail caused by divisor values relatively close to zero. The standard error, from a sample of 10⁴ observations, is 0.6188. The theoretical bootstrap standard error is undefined because some of the n1^n1 n2^n2 bootstrap samples have θ̂∗ undefined because the denominator p̂∗2 is zero; this is not important in practice.

The average of the bootstrap replicates is larger than the original relative risk, indicating bias. The estimated bias is 2.206 − 2.100 = 0.106, corresponding to 0.17 standard errors. While the bias does not appear large in the figure, this amount of bias can have a huge impact on inferences; a rough calculation suggests that the actual non-coverage of one side of a two-sided 95% confidence interval would be 1 − Φ(0.17 + 1.96) ≈ 0.017 rather than the nominal 0.025.

…corresponds to calculating the standard error of residuals above and below the central line (the line with slope θ̂), going up and down 1.96 residual standard errors from the central point (the original data) to the circled points; the endpoints of the interval are the slopes of the lines from the origin to the circled points. A t interval would not be appropriate in the example, because of the bias and skewness.

In practice one would normally do a t interval on a transformed statistic, e.g., log of relative risk, or log-odds-ratio log(p̂1(1 − p̂2)/((1 − p̂1)p̂2)). Figure 5 shows a normal quantile plot for the bootstrap distribution of the log of relative risk. The distribution for log relative risk is much less skewed than is the distribution for relative risk, but still noticeably skewed. Even with a log transformation, a t interval would only be adequate for work where accuracy is not required. We discuss confidence intervals further in Section Bootstrap Confidence Intervals.

Linear Regression
The next examples, for linear regression, are based on a dataset from a large pharmaceutical company. The response variable is a pharmacokinetic parameter of interest, and candidate predictors are weight, sex, age, and dose (3 levels—200, 400, and 800). There are 300 observations, one per subject. Our primary interest in this dataset will be to use the bootstrap to investigate the behavior of stepwise regression; however, first we consider some other issues.

A standard linear regression using main effects gives:

…of the middle 95% of heights of regression lines at a given weight.

The right panel shows all 300 observations, and predictions for the PK/weight relationship using (1) all 300 observations, (2) the main-effects model, and (3) predictions for the "base case", males receiving dose = 400, with weight equal to the average weight for all subjects. In effect this uses the full dataset to improve predictions for a subset, "borrowing strength". There is much less variability than in the left panel, particularly for slope, primarily because of the larger sample size, but also because the addition of an important covariate (age) to the model reduces residual variance.

Note that the y values shown are the actual data, not adjusted for differences between the base case and the actual values of sex, age, and dose. The line is …
FIGURE 6 | Bootstrap regression lines. Left panel: 25 males receiving dose = 400. The orange line is the least-squares fit for those 25
observations, and black lines are from bootstrap samples of size 25. Right panel: the orange line is the prediction for males receiving dose = 400,
based on the main-effects linear regression using all 300 subjects, and the black lines are from bootstrap samples.
FIGURE 7 | Histograms of bootstrap distributions for dose and sex coefficients in stepwise regression.
Figure 7 shows the bootstrap distributions for two coefficients: dose, and sex. The dose coefficient is usually zero, though it may be positive or negative. This suggests that dose is not very important in determining the response.

The sex coefficient is bimodal, with the modes on opposite sides of zero. It turns out that the sex coefficient is usually negative when the weight–sex interaction is included, otherwise it is positive. Overall, the bootstrap suggests that the original model is not very stable.

For comparison, repeating the experiment with a more stringent criterion for variable inclusion—a modified Cp statistic with double the penalty—results in a more stable model. The original model has the same six terms. Of the bootstrap samples 154 yield the same model, and on average the number of different terms is 2.15. The average number of terms is 5.93, slightly less than for the original data; this suggests that stepwise regression may now be slightly under-fitting (though one should not read too much into this).

Standard Errors
At the end of the stepwise procedure, the table of coefficients, standard errors, and t values is calculated, ignoring the variable selection process. In particular, the standard errors are calculated under the usual regression assumptions, which assume that the model is fixed from the outset. Call these nominal standard errors.

For each bootstrap sample, we perform stepwise selection and record the coefficients and nominal standard errors. For the main effects the bootstrap standard errors (standard deviation of bootstrap coefficients) and average of the nominal standard errors are:

            boot SE    avg. nominal SE
Intercept   27.9008    14.0734
wgt          0.5122     0.2022
sex          9.9715     5.4250
age          0.3464     0.2137
dose         0.0229     0.0091

The bootstrap standard errors are much larger than the average of the nominal standard errors. The bootstrap standard errors reflect additional variability due to model selection, such as the bimodal distribution for the sex coefficient, factors that the nominal standard errors ignore.

This is not to say that one should use the bootstrap standard errors here. At the end of the stepwise variable selection process, it is appropriate to condition on the model, and do inferences accordingly. For example, a confidence interval for the sex coefficient should be conditional on the weight–sex interaction being included in the model.

But it does suggest that the nominal standard errors are optimistic. In fact they are biased downward, even conditional on the model terms, because they are calculated using a formula that depends on residual standard error, which in turn is biased due to model selection.

Bias
Figure 8 shows bootstrap distributions for R² (unadjusted) and residual standard deviation. Both show very large bias.

The bias is not surprising—optimizing generally gives biased results. Consider ordinary linear
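The comparison of bootstrap standard errors (the standard deviation of bootstrap coefficients) with average nominal standard errors can be sketched as follows. The data here are invented, and for simplicity the model is fixed rather than re-selected in each resample; with a fixed model the two kinds of standard error should roughly agree, and the large discrepancy reported in the text arises precisely because the stepwise selection is repeated on each bootstrap sample.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data standing in for the pharmacokinetic dataset.
n = 300
weight = rng.normal(70, 12, n)
age    = rng.normal(40, 10, n)
y      = 10 + 0.3 * weight + 0.2 * age + rng.normal(0, 5, n)
X = np.column_stack([np.ones(n), weight, age])

def ols(X, y):
    """Return OLS coefficients and nominal (formula-based) standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])
    cov = s2 * np.linalg.inv(X.T @ X)
    return beta, np.sqrt(np.diag(cov))

beta_hat, nominal_se = ols(X, y)

# Resample rows, refit, and compare bootstrap SEs with average nominal SEs.
B = 2000
boot_coefs, boot_nominal = [], []
for _ in range(B):
    idx = rng.integers(0, n, n)
    b, se = ols(X[idx], y[idx])
    boot_coefs.append(b)
    boot_nominal.append(se)

boot_se        = np.std(boot_coefs, axis=0, ddof=1)
avg_nominal_se = np.mean(boot_nominal, axis=0)
```

Replacing the fixed `ols` fit with a full stepwise selection inside the loop would reproduce the kind of inflation shown in the table above.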
FIGURE 8 | Histograms of bootstrap distributions for R 2 and residual standard deviation in stepwise regression.
regression—unadjusted R² is biased. If it were calculated using the true β's instead of estimated β̂'s it would not be biased. Optimizing β̂ to minimize residual squared error (and maximize R²) makes unadjusted R² biased.

In classical linear regression, with the model selected in advance, we commonly use adjusted R² to counteract the bias. Similarly, we use residual variance calculated using a divisor of (n − p − 1) instead of n, where p is the number of terms in the model.

But in this case it is not only the values of the coefficients that are optimized, but which terms are included in the model. This is not reflected in the usual formulae. As a result, the residual standard error obtained from the stepwise procedure is biased downward, even using a divisor of (n − p − 1).

Bootstrapping Rows or Residuals
There are two basic ways to bootstrap linear regression models—to resample rows (observations), or residuals.2,5

To resample residuals, we fit the initial model ŷi = β̂0 + Σj β̂j xij, calculate the residuals ri = yi − ŷi, then create new bootstrap samples as

y∗i = ŷi + r∗i    (5)

for i = 1, . . . , n, where r∗i is sampled with replacement from the observed residuals {r1, . . . , rn}. We keep the original x and ŷ values fixed in order to create new bootstrap y∗ values.

Resampling rows corresponds to a random effects sampling design—in which x and y are both obtained by random sampling from a joint population. Resampling residuals corresponds to a fixed effects model, in which the x's are fixed by the experimental design and y's are obtained conditional on the x's. So at first glance it would appear appropriate to resample rows when the original data collection has random x's.

However, in classical statistics we commonly use inferences derived using the fixed effects model, even when the x's are actually random. We do inferences conditional on the observed x values. Similarly, in bootstrapping we may resample residuals even when the x's were originally random.

In practice the difference matters most when there are factors with rare levels, or interactions of factors with rare combinations. If resampling rows it is possible that a bootstrap sample may have none of the level or combination, in which case the corresponding term cannot be estimated, and the software may give an error. Or, what is worse, there may be one or two rows with the rare level, enough so the software would not crash, but instead quietly give garbage answers, imprecise because they are based on few observations.

Hence with factors with rare levels, or small samples more generally, it may be preferable to resample residuals.

Resampling residuals implicitly assumes that the residual distribution is the same for every x, that there is no heteroskedasticity. A variation on resampling residuals that allows heteroskedasticity is the wild bootstrap or external bootstrap,6 which in its simplest form adds either plus or minus the original residual ri to each fitted value,

y∗i = ŷi ± ri,    (6)

with equal probabilities. Hence the expected value of y∗i is ŷi, and the standard deviation is proportional to ri. For further discussion see Ref 5.
There are other variations on resampling residuals, such as resampling studentized residuals, or weighted error resampling for non-constant variance.5

Prediction Intervals
The idea of resampling residuals provides a way to obtain more accurate prediction intervals. In order to capture both variation in the estimated regression line and residual variation, we may resample both. Variation in the regression line may be obtained by resampling either residuals or rows in order to generate random β̂∗ values and corresponding ŷ∗ = β̂∗0 + Σj β̂∗j x0j, for predictions at x0. Independently we draw random residuals r∗, and add them to the ŷ∗. After repeating this many times, the range of the middle 95% of the (ŷ∗ + r∗) values gives a prediction interval. For further discussion and alternatives see Ref 5.

Logistic Regression
In logistic regression it is straightforward to resample rows of the data, but resampling residuals fails—the y values must be either zero or one, but adding the residual from one observation to the prediction from another yields values anywhere between −1 and 2. Instead, we keep the x's fixed, and generate y values from the estimated conditional distributions given each x. Let p̂i be the predicted probability that yi = 1 given xi. Then

y∗i = 1 with probability p̂i, and y∗i = 0 with probability 1 − p̂i.    (7)

The kyphosis dataset7 contains observations on 81 children who had corrective spinal surgery, on four variables: Kyphosis (a factor indicating whether a postoperative deformity is present), Age (in months), Number (of vertebrae involved in the operation), and Start (beginning of the range of vertebrae involved). A logistic regression using main effects gives coefficients:

suggesting that Start is the most important predictor.

The left panel of Figure 9 shows Kyphosis versus Start, together with the predicted curve for the base case with Age = 87 (the median) and Number = 4 (the median). This is a sunflower plot,8,9 in which a flower with k > 2 petals represents k duplicate values. The right panel of Figure 9 shows predictions from 20 bootstrap curves.

Figure 10 shows the bootstrap distributions for the four regression coefficients. All of the distributions are substantially non-normal. It would not be appropriate to use classical normal-based inferences. Indeed, the printout of regression coefficients above,
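The conditional-distribution bootstrap of Eq. (7) can be sketched as follows. The data are invented stand-ins for the kyphosis example (81 subjects), and the logistic fitter is a plain Newton–Raphson routine written here to keep the sketch self-contained; the small ridge term and clipped linear predictor are stabilizers for resamples with near-separation, not part of the method described in the text.

```python
import numpy as np

rng = np.random.default_rng(4)

def fit_logistic(X, y, iters=25, ridge=1e-6):
    """Logistic regression by Newton-Raphson (no external libraries)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = np.clip(X @ beta, -30, 30)          # stabilizer
        p = 1.0 / (1.0 + np.exp(-eta))
        W = p * (1.0 - p)
        H = X.T @ (W[:, None] * X) + ridge * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    return beta

# Hypothetical data standing in for the kyphosis example (81 children).
n = 81
age   = rng.uniform(1, 200, n)
start = rng.integers(1, 18, n).astype(float)
X = np.column_stack([np.ones(n), age, start])
true_eta = -1.0 + 0.005 * age - 0.2 * start
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_eta))).astype(float)

beta_hat = fit_logistic(X, y)
p_hat = 1.0 / (1.0 + np.exp(-np.clip(X @ beta_hat, -30, 30)))

# Eq. (7): keep the x's fixed; draw y*_i = 1 with probability p-hat_i, else 0.
B = 500
boot = np.array([fit_logistic(X, (rng.random(n) < p_hat).astype(float))
                 for _ in range(B)])
boot_se = boot.std(axis=0, ddof=1)
```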
FIGURE 9 | Bootstrap curves for predicted kyphosis, for Age = 87 and Number = 4.
FIGURE 10 | Normal quantile plots of bootstrap distributions for logistic regression coefficients.
from a standard statistical package (S-Plus) includes t values but omits p values. Yet it would be tempting for a package user to interpret the t coefficients as arising from a t distribution; the bootstrap demonstrates that this would be improper. The distributions are so non-normal as to make the utility of standard errors doubtful.

The numerical bootstrap results are:

The bootstrap standard errors are larger than the classical (asymptotic) standard errors by 20–24%. The distributions are also extremely biased, with absolute bias estimates ranging from 0.22 to 0.28 standard errors.

These results are for the conditional distribution bootstrap, a kind of parametric bootstrap. Repeating the analysis with the nonparametric bootstrap (resampling observations) yields bootstrap distributions that are even longer-tailed, indicating larger biases and standard errors. This reinforces the conclusion that classical normal-based inferences are not appropriate here.
FIGURE 11 | Bootstrap distribution for the mean, n = 50. The left column shows the population and five samples. The middle column shows the sampling distribution, and bootstrap distributions from each sample. The right column shows five more bootstrap distributions from the first sample, with B = 1000 or B = 10⁴.
…poor approximations of the sampling distribution. In contrast, the sampling distribution is continuous, but the bootstrap distributions are discrete, with the only possible values being values in the original sample (here n is odd). The bootstrap distributions are very sensitive to the sizes of gaps among the observations near the center of the sample.

The ordinary bootstrap tends not to work well for statistics such as the median or other quantiles that depend heavily on a small number of observations out of a larger sample.

In the case of the median and other interior quantiles, this can be remedied using a smoothed bootstrap,12,13 drawing samples from a density estimate based on the data, rather than drawing from the data itself. Smoothing is less effective for more extreme quantiles, where the bootstrap distribution would still depend heavily on a small number of observations. In that case it may be necessary to impose additional structure by assuming a parametric family, and perform a parametric bootstrap.

FIGURE 12 | Bootstrap distributions for the mean, n = 9. The left column shows the population and five samples. The middle column shows the sampling distribution, and bootstrap distributions from each sample. The right column shows five more bootstrap distributions from the first sample, with B = 1000 or B = 10⁴.

…to represent the shape of the population; when there is less data you cannot.
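The discreteness problem and the smoothed-bootstrap remedy can both be seen in a few lines. The sketch below uses an invented sample of size 15 (as in Figure 13); it contrasts the ordinary bootstrap of the median, whose only possible values are the original observations, with a smoothed bootstrap that adds Gaussian kernel noise. The bandwidth rule is a standard Silverman-style choice, not one prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(5)

x = rng.gamma(shape=2.0, scale=3.0, size=15)   # hypothetical small sample
n, B = len(x), 10_000

# Ordinary bootstrap of the median: with n odd, each resample's median is
# one of the original observations, so very few distinct values occur.
ordinary = np.median(rng.choice(x, size=(B, n), replace=True), axis=1)

# Smoothed bootstrap: resample, then add Gaussian noise with bandwidth h
# (equivalent to sampling from a kernel density estimate of the data).
h = 1.06 * x.std(ddof=1) * n ** (-1 / 5)       # Silverman-style bandwidth
smoothed = np.median(rng.choice(x, size=(B, n), replace=True)
                     + rng.normal(0, h, size=(B, n)), axis=1)

n_distinct_ordinary = np.unique(ordinary).size  # at most n distinct values
n_distinct_smoothed = np.unique(smoothed).size  # essentially B distinct values
```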
FIGURE 13 | Bootstrap distributions for the median, n = 15. The left column shows the population and five samples. The middle column shows
the sampling distribution, and bootstrap distributions from each sample. The right column shows five more bootstrap distributions from the first
sample.
Second, in many applications there is a relationship between the statistic and its standard error ("acceleration" in the terminology of Ref 14). For example, the standard error of a binomial proportion, √(p̂(1 − p̂)/n), depends on p̂. Similarly, when sampling from a gamma distribution, the variance of the sample mean depends on the underlying mean. More generally when sampling the mean from positively skewed distributions, samples with larger means tend to give larger standard errors.

When there is acceleration, the bootstrap standard error reflects the standard error corresponding to θ̂, not the true standard deviation of the sampling distribution (corresponding to θ). Suppose the relationship is positive; then when θ̂ < θ it tends to be true that the estimated standard error is also less than the true standard deviation of the sampling distribution, …
…typically set to n − 1 (although other values would be better for non-normal distributions).

The bootstrap standard error may be computed using the techniques in Section Bootstrap Distributions Are Too Narrow—bootknife, sampling with reduced size, or smoothed bootstrap. This results in slightly wider intervals that are usually more accurate in practice. These techniques have an O(1/n) effect on one-sided coverage errors, which is unimportant for large samples but is important in small samples. For example, for a sample of independent identically distributed observations from a normal distribution, a nominal 95% t interval for the mean using a bootstrap standard error without these corrections would have one-sided coverage errors:

  n    Non-coverage   Error
 10    0.0302         0.0052
 20    0.0277         0.0027
 40    0.0264         0.0014
100    0.0256         0.0006

…distribution function with n − 1 degrees of freedom. This gives wider intervals. Extensive simulations22 show that this gives smaller coverage errors in practice, in a wide variety of applications. The effect on coverage errors is O(1/n), the same order as the bootknife adjustment, but the magnitude of the effect is larger; for example, the errors caused by using z rather than t quantiles in a standard t interval for a normal population are:

  n    Non-coverage   Error
 10    0.0408         0.0158
 20    0.0324         0.0074
 40    0.0286         0.0036
100    0.0264         0.0014

For a sample size of 20, this effect alone makes intervals tend to miss 0.0074/0.025 = 30% too often!

A third variation relates to how quantiles are calculated for a finite number B of bootstrap samples. Hyndman and Fan23 give a family of definitions of quantiles for finite samples, governed by a parameter 0 ≤ δ ≤ 1. The bth order statistic θ̂∗(b) is the (b − δ)/(B + 1 − 2δ) quantile of the bootstrap distribution, for b = 1, . . . , B. Linear interpolation between adjacent bootstrap statistics is used if the desired quantile is not of the form (b − δ)/(B + 1 − 2δ) for some integer b. For bootstrap confidence intervals δ = 0 is preferred, as other choices result in lower coverage probability. The effect on coverage errors is O(1/B).

Percentile Intervals
In its simplest form, a 95% bootstrap percentile interval is the range of the middle 95% of a bootstrap distribution. More formally, bootstrap percentile intervals are of the form

(Ĝ⁻¹(α/2), Ĝ⁻¹(1 − α/2)).    (9)
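A percentile interval of the form (9), using the δ = 0 quantile convention described above (order statistic b is the b/(B + 1) quantile, with linear interpolation in between), can be sketched as follows; the data are invented.

```python
import numpy as np

rng = np.random.default_rng(6)

def quantile_delta0(theta_star, p):
    """Quantile with the (b - delta)/(B + 1 - 2*delta) rule at delta = 0:
    sorted value b is the b/(B+1) quantile; interpolate linearly between."""
    t = np.sort(theta_star)
    m = len(t)
    h = p * (m + 1)              # fractional order-statistic index
    b = int(np.floor(h))
    if b < 1:
        return t[0]
    if b >= m:
        return t[-1]
    return t[b - 1] + (h - b) * (t[b] - t[b - 1])

x = rng.exponential(scale=2.0, size=50)   # hypothetical skewed sample
B = 10_000
boot = rng.choice(x, size=(B, len(x)), replace=True).mean(axis=1)

alpha = 0.05
lo = quantile_delta0(boot, alpha / 2)      # G-hat^{-1}(alpha/2)
hi = quantile_delta0(boot, 1 - alpha / 2)  # G-hat^{-1}(1 - alpha/2)
```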
FIGURE 14 | Histogram of bootstrap distribution for the t statistic, and relationship between bootstrap means and standard deviations, of
arsenic concentrations.
Figure 14 shows the bootstrap distribution for the t statistic for mean arsenic concentration, where t is the ordinary t statistic (x̄ − µ)/(s/√n). In contrast to Figure 2, where the bootstrap distribution for the mean is positively skewed, the distribution for the t statistic is negatively skewed. The reason is that there is positive correlation between x̄∗ and s∗, as seen in the right panel of Figure 14, so that a negative numerator in (12) tends to occur with a small denominator.

The bootstrap t interval is based on the identity

P(Gt⁻¹(α/2) < (θ̂ − θ)/sθ̂ < Gt⁻¹(1 − α/2)) = 1 − α,    (13)

where Gt is the sampling distribution of t (11). Assuming that t∗ (12) has approximately the same distribution as t, we substitute quantiles of the bootstrap distribution for t∗; then solving for θ yields the bootstrap t interval

(θ̂ − Gt∗⁻¹(1 − α/2)sθ̂, θ̂ − Gt∗⁻¹(α/2)sθ̂).    (14)

Note that the right tail of the bootstrap distribution of t∗ is used in computing the left side of the confidence interval, and conversely.

The bootstrap t and other intervals for the mean arsenic concentration example described in Section Introduction are shown in Table 1.

It is not appropriate to use bootknife or other sampling methods in Section Bootstrap Distributions Are Too Narrow with the bootstrap t. The reason we use those methods with the other intervals is because those intervals are too narrow if the plug-in population is narrower, on average, than the parent population. The sampling distribution of a t statistic, in contrast, is invariant under changes in the scale of the parent population. This gives it an automatic correction for the plug-in population being too narrow, and to add bootknife sampling would over-correct.

Efron and Tibshirani2 note that the bootstrap t is sometimes erratic, and suggest transforming the statistic of interest. Hesterberg22 observes erratic behavior in small samples. We conjecture the following explanation—that the bootstrap t depends not only on skewness, but also on kurtosis, and kurtosis is hard to estimate from small samples. The bootstrap t does not use a t table, but instead estimates the distribution of the t statistic by simulating from the data. This distribution depends not only on asymmetry caused by skewness, but also on the effective degrees of freedom, which depend on kurtosis—larger kurtosis results in greater variability in standard errors and smaller effective degrees of freedom. In contrast, other second-order-correct intervals depend on skewness, but not (or much less so) on kurtosis, so are less erratic for small samples.

BCa Intervals
The bootstrap BCa interval14 uses quantiles of the bootstrap distribution, like the percentile interval, but with the percentiles adjusted depending on a bias parameter z0 and acceleration parameter a. The interval is

(G⁻¹(p(α/2)), G⁻¹(p(1 − α/2))),    (15)
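The bootstrap t construction of Eqs. (13) and (14) is sketched below on an invented skewed sample. For simplicity the standard error is the formula SE for a mean, which avoids the nested (second-level) bootstrap; note how the right tail of t∗ sets the left endpoint, and conversely.

```python
import numpy as np

rng = np.random.default_rng(7)

x = rng.lognormal(0, 1, 50)     # hypothetical skewed sample
n = len(x)
theta_hat = x.mean()
s_theta   = x.std(ddof=1) / np.sqrt(n)   # formula SE (no nested bootstrap)

# Bootstrap the t statistic: t* = (theta* - theta_hat) / s*  (cf. Eq. 12).
B = 10_000
samples = rng.choice(x, size=(B, n), replace=True)
t_star = ((samples.mean(axis=1) - theta_hat)
          / (samples.std(axis=1, ddof=1) / np.sqrt(n)))

# Eq. (14): the RIGHT tail of t* gives the LEFT endpoint, and conversely.
q_lo, q_hi = np.percentile(t_star, [2.5, 97.5])
interval = (theta_hat - q_hi * s_theta, theta_hat - q_lo * s_theta)
```

For positively skewed data, t∗ is negatively skewed (as in Figure 14), so this interval reaches farther to the right than an ordinary t interval.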
Another suitable family is the maximum likelihood family, with probability

pi = c/(1 − τ(xi − x̄))    (21)

on observation i.

Importance Sampling Implementation
Conceptually, finding the right value of τ requires trial and error; for any given τ, we calculate p = (p1, . . . , pn), draw bootstrap samples with those probabilities, calculate the bootstrap statistics, and calculate the fraction of those statistics that are above θ̂, then repeat with a different τ until the fraction is 2.5%. This is expensive, and the fraction varies due to random sampling.

In practice we use an importance sampling implementation. Instead of sampling with unequal probabilities, we sample with equal probabilities, then reweight the bootstrap samples by the relative likelihood of the sample under weighted and ordinary bootstrap sampling. The likelihood for a bootstrap sample is …

Let Fp denote a weighted distribution with probability pi on original data point xi, θ(p) = θ(Fp) be the parameter for the weighted distribution (e.g., weighted mean, or weighted regression coefficient), and p0 = (1/n, . . . , 1/n) correspond to the original equal-probability empirical distribution function. The gradient of θ(p) is

Ui(p) = lim ε→0 ε⁻¹(θ(p + ε(δi − p)) − θ(p)),    (24)

where δi is the vector with 1 in position i and 0 elsewhere. When evaluated at p0 these derivatives are known as the empirical influence function, or infinitesimal jackknife.

Four least-favorable families found in the tilting literature are:

F1: pi = c exp(τ Ui(p0))
F2: pi = c exp(τ Ui(p))
F3: pi = c(1 − τ Ui(p0))⁻¹
F4: pi = c(1 − τ Ui(p))⁻¹,    (25)
TABLE 1 Confidence Intervals for Mean Arsenic Concentration, Based on 100,000 Bootstrap Samples, Using Ordinary Nonparametric and Bootknife Resampling

                      95% Interval     Asymmetry
Formula t             (88.8, 160.2)    ±35.7
Ordinary Bootstrap
  t w boot SE         (88.7, 160.2)    ±35.8
  Percentile          (91.5, 162.4)    (−33.0, 38.0)
  Bootstrap t         (94.4, 172.6)    (−30.1, 48.1)
  BCa                 (95.2, 169.1)    (−29.3, 44.6)
  Tilting             (95.2, 169.4)    (−29.3, 44.9)
Bootknife
  t w boot SE         (88.7, 160.3)    ±35.8
  Percentile          (91.5, 162.6)    (−32.9, 38.1)
  BCa                 (95.4, 169.3)    (−29.1, 44.8)
  Tilting             (95.2, 169.4)    (−29.3, 45.0)

The "asymmetry" column is obtained by subtracting the observed mean. The "t w boot SE" interval is a t interval using a bootstrap standard error.

TABLE 2 Actual Non-Coverage of Nominal 95% t Intervals, as Estimated From Second-Order-Accurate Intervals

Estimated using    Left     Right
Bootstrap t        0.0089   0.062
BCa                0.0061   0.052

A t interval would miss more than twice too often on the right side. The actual non-coverage should be 0.025 on each side.

…t interval—in other words, what the bootstrap t and BCa intervals think is the actual non-coverage of the t intervals. The discrepancies are striking. On the left side, the t interval should miss 2.5% of the time; it actually misses only about a third or a fourth that often, according to the bootstrap t and BCa intervals. On the right side, it should miss 2.5% of the time, but actually misses somewhere between 5.2 and 6.2%, according to the BCa and bootstrap t procedures. This suggests that the t interval is severely biased, with both endpoints systematically lower than they should be.
To obtain reasonable accuracy for smaller replications, typically by a factor of 37 for a 95%
sample sizes requires the use of more accurate confidence interval. The disadvantages of tilting are
confidence intervals, either a second-order-accurate that the small-sample properties of the fixed-derivative
bootstrap interval, or comparable second-order- versions F1 and F3 are not particularly good, while
accurate non-bootstrap interval. Two general second- the more rigorous F2 and F4 are harder to implement
order-accurate procedures that do not require reliably.
sampling are ABC24 and automatic percentile25
intervals, which are approximations for BCa and
tilting intervals, respectively. HYPOTHESIS TESTING
The current practice of statistics, using normal An important point in bootstrap hypothesis testing
and t intervals with skewed data, systematically is that sampling should be done in a way that is
produces confidence intervals with endpoints that are consistent with the null distribution.
too low (for positively skewed data). We describe here three bootstrap hypothesis
Similarly, hypothesis tests are systematically
testing procedures: pooling for two-sample tests,
biased; for positively skewed data they reject H0 :
bootstrap tilting, and bootstrap t.
θ = θ0 too often for cases with θ̂ < θ0 , and too little
The first is for two-sample problems, such
for θ̂ > θ0. The primary reason is acceleration—when θ̂ < θ0 then acceleration makes it likely that s < σ, and the t interval does not correct for this, so it improperly rejects H0.

Comparing Intervals

t intervals and bootstrap percentile intervals are quick-and-dirty intervals, suitable for rough approximations, but should not be used where accuracy is needed.

Among the others, I recommend the BCa in most cases, provided that the number of bootstrap samples B is very large.

In my experience with extensive simulations, the bootstrap t is the most accurate in terms of coverage probabilities. However, it achieves this at high cost—the interval is longer on average than the BCa and tilting intervals, often much longer. Adjusting the nominal coverage level of the BCa and tilting intervals upward gives comparable coverage to bootstrap t with shorter length. And the lengths of bootstrap t intervals vary much more than the others. I conjecture that this is because bootstrap t intervals are sensitive to the kurtosis of the bootstrap distribution, which is hard to estimate accurately from reasonable-sized samples. In contrast, BCa and tilting intervals depend primarily on the mean, standard deviation, and skewness of the bootstrap distribution.

Also, the bootstrap t is computationally expensive if the standard error is obtained by bootstrapping. If s_θ̂ is calculated by bootstrapping, then s_θ̂* is calculated using a second level of bootstrapping—drawing bootstrap samples from each first-level bootstrap sample (requiring a total of B + B·B2 bootstrap samples, if B2 second-level bootstrap samples are drawn from each of B first-level bootstrap samples).

The primary advantage of bootstrap tilting over BCa is that it requires many fewer bootstrap samples.

Consider first a two-sample test, such as comparing two means. Suppose that the null hypothesis is that θ1 = θ2, and that one is willing to assume that if the null hypothesis is true then the two populations are the same. Then one may pool the data, draw samples of size n1 and n2 with replacement from the pooled data, and compute a test statistic such as θ̂1 − θ̂2 or a t statistic. Let T* be the bootstrap test statistic, and T0 the observed value of the test statistic. The P-value is the fraction of the time that T* exceeds T0.

In practice we add 1 to the numerator and denominator when computing the fraction—the one-sided P-value for the one-sided alternative hypothesis θ1 − θ2 > 0 is (#(T* > T0) + 1)/(B + 1). The lower one-sided P-value is (#(T* < T0) + 1)/(B + 1), and the two-sided P-value is two times the smaller of the one-sided P-values.

This procedure is similar to the two-sample permutation test, which pools the data and draws n1 observations without replacement for the first sample and allots the remaining n2 observations to the second sample. The permutation test is preferred. For example, suppose there is one outlier in the combined sample; every pair of permutation samples has exactly one copy of the outlier, while the bootstrap samples may have 0, 1, 2, . . . copies. This adds extra variability not present in the original data, and detracts from the accuracy of the resulting P-values.

Now suppose that one is not willing to assume that the two distributions are the same. Then bootstrap tilting hypothesis testing5,26,27 may be suitable. Tilting may also be used in one-sample and other contexts. The idea is to find a version of the empirical distribution function(s) with unequal probabilities that satisfies the null hypothesis (by maximizing likelihood or minimizing Kullback–Leibler distance subject to the null hypothesis), then draw samples from the unequal-probability empirical distributions, and let
the P-value be the fraction of times the bootstrap test statistic exceeds the observed test statistic. As in the case of confidence intervals, importance sampling may be used in place of sampling with unequal probabilities; see Section Bootstrap Confidence Intervals. There are close connections to empirical likelihood.28

Bootstrap tilting hypothesis tests reject H0 if bootstrap tilting confidence intervals exclude the null hypothesis value.

The third general-purpose bootstrap testing procedure is related to bootstrap t confidence intervals. A t statistic is calculated for the observed data, and the P-value for the statistic is calculated not by reference to the Student's t distribution, but rather by reference to the bootstrap distribution for the t statistic. In this case the bootstrap sampling need not be done consistently with the null hypothesis, because t statistics are approximately pivotal—their distribution is approximately the same independent of θ.

PLANNING CLINICAL TRIALS

The usual bootstrap procedure is to draw samples of size n from the empirical data, or more generally to plug in an estimate for the population and draw samples using the sampling mechanism actually used in practice. In planning clinical trials we may modify this in two ways:

• try other sampling procedures, such as different sample sizes or stratification, and/or
• plug in alternate population estimates.

For example, given training data of size n, to estimate standard errors or confidence interval widths that would result from a possible clinical trial of size N, we may draw bootstrap samples of size N with replacement from the data.

Similarly, we may estimate the effects of different sampling mechanisms, such as stratified sampling, or case–control allocation to arms, even if pilot data were obtained in other ways.

For example, we consider preliminary results from a clinical trial to evaluate the efficacy of maintenance chemotherapy for acute myelogenous leukemia (AML).29,30 After achieving remission through chemotherapy, the patients were assigned to a treatment group receiving maintenance chemotherapy and a control group that did not. The goal was to see if maintenance chemotherapy prolonged the time until relapse. The data are in Table 3. There are 11 subjects in the treatment group and 12 in the control group.

TABLE 3 Leukemia Data

Group          Length of Complete Remission (in Weeks)
Maintained     9, 13, 13+, 18, 23, 28+, 31, 34, 45+, 48, 161+
Nonmaintained  5, 5, 8, 8, 12, 16+, 23, 27, 30, 33, 43, 45

A Cox proportional hazards regression, using Breslow's method of breaking ties, yields a log-hazard ratio of 0.904 and standard error 0.512:

        coef exp(coef) se(coef)    z     p
group  0.904      2.47    0.512 1.77 0.078

An ordinary bootstrap with B = 10^4 results in eleven samples with complete separation—where the minimum observed relapse time in the treatment group exceeds the maximum observed relapse in the control group—giving an infinite estimated hazard ratio. A stratified bootstrap reduces the number of samples with complete separation to three. Here stratification is preferred (even if the original allocation were not stratified) in order to condition on the actual sample sizes, and to prevent imbalance in the bootstrap samples. Omitting the three samples results in a slightly long-tailed bootstrap distribution, with standard error 0.523, slightly larger than the formula standard error.

Drawing 50 observations from each group results in a bootstrap distribution for the log-hazard ratio that is nearly exactly normal with almost no bias, no samples with separation (they are still possible, but unlikely), and a standard error of 0.221. Surprisingly, this is 10% less than obtained by extrapolating the original formula standard error at the rate 1/√n, 0.512/√(100/23) = 0.246, and 12% less than obtained by extrapolating the original bootstrap standard error. Similar results are obtained using Efron's method for handling ties, and from a smoothed bootstrap with a small amount of noise added to the remission times. The fact that the reduction in standard error is 10–12% greater than expected may be because censored observations have a less serious impact with larger sample sizes.

'What if' Analyses—Alternate Population Estimates

In planning clinical trials it is often of interest to do 'what if' analyses, perturbing various inputs. For example, how might the results differ under sampling from populations with a log-hazard ratio of zero, or 0.5?

This should be done by reweighting observations.31,32 This is a version of bootstrap tilting19,21,31,33 and is closely related to empirical likelihood.34
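One concrete way to construct such weights can be sketched with exponential tilting for a one-sample mean constraint. This is a simplified stand-in (my own helper, with hypothetical data) for the maximum-likelihood tilting used in the article:

```python
import numpy as np

def exp_tilt_weights(x, theta0, tol=1e-10):
    """Weights w_i proportional to exp(tau * x_i), with tau chosen by
    bisection so that the weighted mean equals theta0. Requires
    min(x) < theta0 < max(x); the weighted mean is increasing in tau."""
    x = np.asarray(x, float)

    def weights(tau):
        w = np.exp(tau * (x - x.mean()))  # centering for numerical stability
        return w / w.sum()

    lo, hi = -50.0 / x.std(), 50.0 / x.std()
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if weights(mid) @ x < theta0:
            lo = mid
        else:
            hi = mid
    return weights(0.5 * (lo + hi))

# Sampling from the weighted empirical distribution:
# rng.choice(x, size=len(x), replace=True, p=exp_tilt_weights(x, theta0))
```

Bootstrap samples drawn with these probabilities come from a population whose mean is θ0, which is the 'what if' sampling described above; the article's hazard-ratio constraint requires heavier machinery but follows the same pattern.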
Consider first a simple example—sampling the difference in two means, θ̂ = x̄1 − x̄2. In order to sample from populations with different values of θ, it is natural to consider perturbing the data, shifting one or both samples, e.g., adding θ − θ̂ to each value in sample 1.

Perturbing the data does not generalize well to other situations. Furthermore, perturbing the data would often give incorrect answers. Suppose that the observations represent positively skewed observations such as survival times, with a mode at zero. Shifting one of the samples to the left would give negative times; shifting to the right would make the mode nonzero. More subtle, but very important, is that shifting ignores the mean–variance relationships for skewed populations—increasing the mean should also increase the variance. For positive data like survival times, perturbing the data by multiplying one of the samples by a factor avoids the most obvious problems, but assumes a particular mean–variance relationship—that variance is proportional to the square of the mean.

It is also unclear how one would perturb the data in multivariate applications when some variables are categorical.

Instead, we suggest using a weighted version of the empirical data, maximizing the likelihood of the observed data subject to the weighted distributions satisfying desired constraints. To satisfy µ1 − µ2 = θ0, for example, we maximize

    ∏_{i=1}^{n1} w1i · ∏_{i=1}^{n2} w2i    (26)

subject to the weights in each sample summing to 1 and to Σ_{i=1}^{n1} w1i x1i − Σ_{i=1}^{n2} w2i x2i = θ0 (28). For other statistics we replace (28) with the more general

    θ(F̂n,w) = θ0,    (29)

where F̂n,w is the weighted empirical distribution (with obvious generalization to multiple samples or strata). The computational tools used for empirical likelihood34 and bootstrap tilting19,21 are useful in determining the weights.

The bootstrap sampling is from the weighted empirical distributions, i.e., the data are sampled with unequal probabilities.

Figure 15 shows this idea applied to the leukemia data. The top left shows Kaplan–Meier survival curves for the original data, and the top right shows the bootstrap distribution for the log-hazard ratio, using 50 observations in each group. The bottom left shows weights chosen to maximize (26), subject to (28) and a log-hazard ratio equal to 0.5. In order to reduce the ratio from its original value of 0.904, the treatment group gets high weights early and low weights later (the weighted distribution has a higher probability of early events) while the control group gets the converse. Censored observations get roughly the average weight of the remaining noncensored observations in the same group. The middle left shows the resulting weighted survival estimates, and the middle right the corresponding bootstrap distribution. In this case both bootstraps are nearly normal, and the standard errors are very similar: 0.221 for the ordinary bootstrap and 0.212 for the weighted bootstrap, both with 50 observations per group.
FIGURE 15 | Survival curves and bootstrap distribution for log-hazard ratio, original and perturbed (weighted) to a log-hazard ratio of 0.5.
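The first modification above, drawing bootstrap samples of size N rather than n, can be sketched for a simple two-group statistic. This is an illustration under assumptions (a difference in means instead of the article's log-hazard ratio, with hypothetical data, sizes, and B), not the analysis from the article:

```python
import numpy as np

def planned_se(x1, x2, N1, N2, B=2000, rng=None):
    """Bootstrap standard error of a difference in means for a planned
    trial with group sizes N1, N2, resampling with replacement from
    pilot data x1, x2 (stratified: each group resampled separately)."""
    rng = np.random.default_rng(rng)
    x1 = np.asarray(x1, float)
    x2 = np.asarray(x2, float)
    stats = np.empty(B)
    for b in range(B):
        s1 = rng.choice(x1, size=N1, replace=True)  # resample size N1, not len(x1)
        s2 = rng.choice(x2, size=N2, replace=True)
        stats[b] = s1.mean() - s2.mean()
    return stats.std(ddof=1)
```

Comparing planned_se(x1, x2, 50, 50) with planned_se(x1, x2, len(x1), len(x2)) gives a direct estimate of how much a larger trial would shrink the standard error, mirroring the 50-per-group resampling in the AML example.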
errors, and that B = 1000 is enough for confidence intervals.

We argue that larger sizes are appropriate, on two grounds. First, those criteria were developed when computers were much slower; with faster computers it is much easier to take more samples.

Second, those criteria were developed using arguments that combine the random variation due to the original sample with the random variation due to bootstrap sampling. For example, Efron and Tibshirani2 indicate that cv(se_B) ≈ {cv(se_∞)² + (E(δ̂) + 2)/(4B)}^{1/2}, where cv is the coefficient of variation, cv(Y) = σ_Y/E(Y), se_B and se_∞ are bootstrap standard errors using B or ∞ replications, respectively, and δ̂ relates to the kurtosis of the bootstrap distribution; it is zero for normal distributions. Even relatively small values of B make the ratio cv(se_B)/cv(se_∞) not much larger than 1.

We feel that the variation in bootstrap answers conditional on the data is more relevant. This is particularly true in clinical trial applications, where

• reproducibility is important—two people analyzing the same data should get (almost exactly) the same results, with random variation between their answers minimized, and
• the data may be very expensive—there is little point in wasting the value of expensive data by introducing extraneous variation using B too small. Given the choice between reducing variation in the ultimate results by gathering more data or by increasing B, it would be cheaper to increase B, at least until B is quite large.

Conditional on the data, cv(se_B) ≈ √((δ + 2)/(4B)), where δ is the kurtosis of the theoretical bootstrap distribution (conditional on the data). When δ is zero (usually approximately true), this simplifies to cv(se_B) ≈ 1/√(2B).

To determine how large B should be, we consider the effect on confidence intervals. Consider a t interval of the form θ̂ ± t_{α/2} se_B. Suppose that such an interval using se_∞ would be approximately correct, with one-sided non-coverage α/2. Then the actual non-coverage using se_B in place of se_∞ would be F_{t,n−1}((se_B/se_∞) F^{−1}_{t,n−1}(α/2)). For n large and α = 0.05, to have the actual one-sided non-coverage fall within 10% of the desired value (between 0.0225 and 0.0275) requires that se_B/se_∞ be between Φ^{−1}(0.025 × 1.1)/Φ^{−1}(0.025) = 0.979 and Φ^{−1}(0.025 × 0.9)/Φ^{−1}(0.025) = 1.023. To have 95% confidence of no more than 10% error requires that 1.96/√(2B) ≤ 0.022, or B ≥ 0.5(1.96/0.022)² = 3970, or about 4000 bootstrap samples.

To satisfy the more stringent criterion of 95% confidence that the non-coverage error is less than 1% of 0.025 would require approximately 400,000 bootstrap samples. With modern computers this is not unreasonable, unless the statistic is particularly slow to compute.

Consider also bootstrap confidence intervals based on quantiles. The simple bootstrap percentile confidence interval is the range from the α/2 to 1 − α/2 quantiles of the bootstrap distribution. Let G^{−1}_∞(c) be the c quantile of the theoretical bootstrap distribution; the number of bootstrap statistics falling below this quantile is approximately binomial with parameters B and c (the proportion parameter may differ slightly due to the discreteness of the bootstrap distribution). For finite B, the one-sided error has standard error approximately √(c(1 − c)/B). For c = 0.025, to reduce 1.96 standard errors to c/10 requires B ≥ (10/0.025)² × 1.96² × 0.025 × 0.975 = 14980, about 15,000 bootstrap samples. The more stringent criterion of a 1% error would require approximately 1.5 million bootstrap samples.

The bootstrap BCa confidence interval has greater Monte Carlo error, because it requires estimating a bias parameter using the proportion of bootstrap samples falling below the original θ̂ (and the variance of a binomial proportion, p(1 − p)/B, is greatest for p = 0.5). It requires B about twice as large as the bootstrap percentile interval for equivalent Monte Carlo accuracy—30,000 bootstrap samples to satisfy the 10% criterion.

On the other hand, the bootstrap tilting interval requires about 17 times fewer bootstrap samples for the same Monte Carlo accuracy as the simple percentile interval, so that about 1000 bootstrap samples would suffice to satisfy the 10% criterion.

In summary, to have 95% probability that the actual one-sided non-coverage for a 95% bootstrap interval falls within 10% of the desired value, between 0.0225 and 0.0275, conditional on the data, requires about 1000 samples for a bootstrap tilting interval, 4000 for a t interval using a bootstrap standard error, 15,000 for a bootstrap percentile interval, and 30,000 for a bootstrap BCa interval.

Figure 16 shows the Monte Carlo variability of a number of bootstrap confidence interval procedures, for various combinations of sample size, statistic, and underlying data; these are representative of a larger collection of examples in Ref 22. The panels show the variability due to Monte Carlo sampling with a finite bootstrap sample size B, conditional on the data.

Figure 16 is based on 2000 randomly generated datasets for each sample size, distribution, and statistic. For each dataset, and for each value of B, two sets of bootstrap samples are created and intervals calculated using all methods. For each method, a sample variance is calculated using the usual unbiased sample variance (based on two observations). The estimate of Monte Carlo variability is then the average across the 2000 datasets of these unbiased sample variances. The result is the 'within-group' component of variance (due to Monte Carlo variability).
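The sample-size arithmetic above can be collected into two small helpers. This is a sketch under the text's assumptions (the 0.022 tolerance on se_B/se_∞ and the binomial argument at quantile c); the function names are mine:

```python
import math

def b_for_t_interval(tol=0.022, z=1.96):
    """Smallest B with z/sqrt(2B) <= tol, from cv(se_B) ~ 1/sqrt(2B);
    tol = 0.022 is the se_B/se_inf tolerance derived in the text."""
    return math.ceil(0.5 * (z / tol) ** 2)

def b_for_percentile(c=0.025, rel_err=0.10, z=1.96):
    """Smallest B with z*sqrt(c*(1-c)/B) <= rel_err*c, the Monte Carlo
    accuracy criterion for a percentile-interval endpoint at quantile c."""
    return math.ceil((z / (rel_err * c)) ** 2 * c * (1 - c))
```

b_for_t_interval() returns 3969 (the text rounds to 3970, "about 4000") and b_for_percentile() returns 14983 (about 15,000). The text's factors for BCa (about twice the percentile B) and tilting (about 17 times fewer) can be applied on top of the percentile result.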
FIGURE 16 | Monte Carlo variability of a number of bootstrap confidence interval procedures, plotted against bootstrap sample size B; panels include exponential tilting (exp-tilt) and maximum-likelihood tilting (ml-tilt) intervals.
the theoretical bootstrap distribution, and a summary statistic Q (e.g., standard error, bias estimate, or endpoint of a confidence interval), we may draw B2 bootstrap samples of size B from the B observations, and calculate the summary statistics Q*1, Q*2, . . . , Q*B2. The sample standard deviation of the Q*s is the Monte Carlo standard error.

Variance Reduction

There are a number of techniques that can be used to reduce the Monte Carlo variation.

The balanced bootstrap,35 in which each of the n observations is included exactly B times in the B bootstrap samples, is useful for bootstrap bias estimates but of little value otherwise.

Antithetic variates36 are moderately helpful for bias estimation but of little value otherwise.

Importance sampling37,38 is particularly useful for estimating tail quantiles, as for bootstrap percentile and BCa intervals. For nonlinear statistics one should use a defensive mixture distribution.39,40

Control variates36,39,41,42 are moderately to extremely useful for bias and standard error estimation and can be combined with importance sampling.43 They are most effective in large samples for statistics that are approximately linear.

Concomitants42,44 are moderately to extremely useful for quantiles and can be combined with importance sampling.45 They are most effective in large samples for statistics that are approximately linear; linear approximations tailored to a tail of interest can dramatically improve the accuracy.46

Quasi-random sampling47 can be very useful for small n and large B; the convergence rate is O((log B)^n B^{−1}) compared to O(B^{−1/2}) for Monte Carlo methods.

Analytical approximations for bootstrap distributions are available in some situations, including analytical approximations for bootstrap tilting and BCa intervals20,24 and saddlepoint approximations.48–52

ADDITIONAL TOPICS

Some topics that are beyond the scope of this article^a include bootstrapping dependent data (time series, mixed effects models), cross-validation and bootstrap validation (bootstrapping prediction errors and classification errors), the Bayesian bootstrap, and bootstrap likelihoods. Refs 2 and 5 are good starting points for these topics, with the exception of mixed effects models. Ref 2 is an introduction to the bootstrap written for upper-level undergraduate or beginning graduate students. Ref 5 is the best general-purpose reference for the bootstrap for statistical practitioners. Ref 10 looks at asymptotic properties of various bootstrap methods. The author's website http://home.comcast.net/∼timhesterberg/bootstrap has resources for teaching statistics using the bootstrap, and some technical reports, particularly on computational aspects of bootstrapping.

NOTE

a This article is a minor revision of Ref 53.
REFERENCES
1. Efron B. Bootstrap methods: another look at the jackknife (with discussion). Ann Stat 1979, 7:1–26.
2. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Chapman and Hall; 1993.
3. Breiman L. Random forests. Mach Learn 2001, 45:5–32.
4. Efron B. The Jackknife, the Bootstrap and Other Resampling Plans. National Science Foundation–Conference Board of the Mathematical Sciences Monograph 38. Philadelphia: Society for Industrial and Applied Mathematics; 1982.
5. Davison A, Hinkley D. Bootstrap Methods and Their Applications. Cambridge University Press; 1997.
6. Wu CFJ. Jackknife, bootstrap, and other resampling methods in regression analysis (with discussion). Ann Stat 1986, 14:1261–1350.
7. Chambers J, Hastie T. Statistical Models in S. Pacific Grove, CA: Wadsworth; 1992.
8. Chambers JM, Cleveland WS, Kleiner B, Tukey PA. Graphical Methods for Data Analysis. Wadsworth; 1983.
9. Ruckstuhl A, Stahel W, Maechler M, Hesterberg T. Sunflower. Statlib. Available at: http://lib.stat.cmu.edu/S/sunflower. (Accessed 1995).
10. Hall P. The Bootstrap and Edgeworth Expansion. New York: Springer; 1992.
11. Shao J, Tu D. The Jackknife and Bootstrap. New York: Springer-Verlag; 1995.
12. Silverman B, Young G. The bootstrap: to smooth or not to smooth. Biometrika 1987, 74:469–479.
13. Hall P, DiCiccio T, Romano J. On smoothing and the bootstrap. Ann Stat 1989, 17:692–704.
14. Efron B. Better bootstrap confidence intervals (with discussion). J Am Stat Assoc 1987, 82:171–200.
15. Hesterberg TC. Unbiasing the bootstrap—bootknife sampling vs. smoothing. Proceedings of the Section on Statistics & the Environment. American Statistical Association; 2004, 2924–2930.
16. DiCiccio TJ, Romano JP. A review of bootstrap confidence intervals (with discussion). J R Stat Soc B 1988, 50:338–354.
17. Hall P. Theoretical comparison of bootstrap confidence intervals (with discussion). Ann Stat 1988, 16:927–985.
18. DiCiccio T, Efron B. Bootstrap confidence intervals (with discussion). Stat Sci 1996, 11:189–228.
19. Efron B. Nonparametric standard errors and confidence intervals. Can J Stat 1981, 9:139–172.
20. DiCiccio TJ, Romano JP. Nonparametric confidence limits by resampling methods and least favorable families. Int Stat Rev 1990, 58:59–76.
21. Hesterberg TC. Bootstrap tilting confidence intervals and hypothesis tests. In: Berk K, Pourahmadi M, eds. Computer Science and Statistics: Proceedings of the 31st Symposium on the Interface, vol 31. Fairfax Station, VA: Interface Foundation of North America; 1999, 389–393.
22. Hesterberg TC. Bootstrap tilting confidence intervals. Technical Report 84, Research Department, MathSoft, Inc.; 1999.
23. Hyndman RJ, Fan Y. Sample quantiles in statistical packages. Am Stat 1996, 50:361–364.
24. DiCiccio T, Efron B. More accurate confidence intervals in exponential families. Biometrika 1992, 79:231–245.
25. DiCiccio TJ, Martin MA, Young GA. Analytic approximations to bootstrap distribution functions using saddlepoint methods. Technical Report 356, Department of Statistics, Stanford University; 1990.
26. Efron B. Censored data and the bootstrap. J Am Stat Assoc 1981, 76:312–319.
27. Hinkley DV. Bootstrap significance tests. Bull Int Stat Inst 1989, 53:65–74.
28. Owen A. Empirical likelihood ratio confidence intervals for a single functional. Biometrika 1988, 75:237–249.
29. Embury SH, Elias L, Heller PH, Hood CE, Greenberg PL, Schrier SL. Remission maintenance therapy in acute myelogenous leukemia. West J Med 1977, 126:267–272.
30. Insightful. S-PLUS 8 Guide to Statistics. 1700 Westlake Ave N., Suite 500, Seattle; 2007.
31. Hesterberg TC. Bootstrap tilting diagnostics. Proceedings of the Statistical Computing Section; 2001.
32. Hesterberg TC. Resampling for planning clinical trials—using S+Resample. Statistical Methods in Biopharmacy, Paris. Available at: http://home.comcast.net/∼timhesterberg/articles/Paris05-ResampleClinical.pdf. (Accessed 2011).
33. Hall P, Presnell B. Intentionally biased bootstrap methods. J R Stat Soc B 1999, 61:143–158.
34. Owen A. Empirical Likelihood. Chapman & Hall/CRC Press; 2001.
35. Gleason JR. Algorithms for balanced bootstrap simulations. Am Stat 1988, 42:263–266.
36. Therneau TM. Variance reduction techniques for the bootstrap. Technical Report No. 200, PhD thesis, Department of Statistics, Stanford University; 1983.
37. Johns MV. Importance sampling for bootstrap confidence intervals. J Am Stat Assoc 1988, 83:701–714.
38. Davison AC. Discussion of paper by D. V. Hinkley. J R Stat Soc B 1986, 50:356–357.
39. Hesterberg TC. Advances in importance sampling. PhD thesis, Statistics Department, Stanford University; 1988.
40. Hesterberg TC. Weighted average importance sampling and defensive mixture distributions. Technometrics 1995, 37:185–194.
41. Davison AC, Hinkley DV, Schechtman E. Efficient bootstrap simulation. Biometrika 1986, 73:555–566.
42. Efron B. More efficient bootstrap computations. J Am Stat Assoc 1990, 85:79–89.
43. Hesterberg TC. Control variates and importance sampling for efficient bootstrap simulations. Stat Comput 1996, 6:147–157.
44. Do KA, Hall P. Distribution estimation using concomitants of order statistics, with application to Monte Carlo simulations for the bootstrap. J R Stat Soc B 1992, 54:595–607.
45. Hesterberg TC. Fast bootstrapping by combining importance sampling and concomitants. Computing Science and Statistics 1997, 29:72–78.
46. Hesterberg TC. Tail-specific linear approximations for efficient bootstrap simulations. J Comput Graph Stat 1995, 4:113–133.
47. Do KA, Hall P. Quasi-random sampling for the bootstrap. Stat Comput 1991, 1:13–22.
48. Tingley M, Field C. Small-sample confidence intervals. J Am Stat Assoc 1990, 85:427–434.
49. Daniels HE, Young GA. Saddlepoint approximation for the studentized mean, with an application to the bootstrap. Biometrika 1991, 78:169–179.
50. Wang S. General saddlepoint approximations in the bootstrap. Stat Prob Lett 1992, 13:61–66.
51. DiCiccio TJ, Martin MA, Young GA. Analytical approximations to bootstrap distribution functions using saddlepoint methods. Stat Sin 1994, 4:281.
52. Canty AJ, Davison AC. Implementation of saddlepoint approximations to bootstrap distributions. In: Billard L, Fisher NI, eds. Computing Science and Statistics: Proceedings of the 28th Symposium on the Interface, vol 28. Fairfax Station, VA: Interface Foundation of North America; 1997, 248–253.
53. Hesterberg TC. Bootstrap. In: D'Agostino R, Sullivan L, Massaro J, eds. Wiley Encyclopedia of Clinical Trials. John Wiley & Sons; 2007.
FURTHER READING
Chernick MR. Bootstrap Methods: A Practitioner's Guide. New York: John Wiley & Sons; 1999. (An extensive bibliography, with roughly 1700 references related to the bootstrap.)
Hesterberg T, Monaghan S, Moore DS, Clipson A, Epstein R. Bootstrap Methods and Permutation Tests. W. H. Freeman; 2003. Chapter for The Practice of Business Statistics by Moore, McCabe, Duckworth, and Sclove. Available at: http://bcs.whfreeman.com/pbs/cat_160/PBS18.pdf. (Accessed 2011). (An introduction to the bootstrap written for introductory statistics students.)