WIREs Computational Stats - 2011 - Hesterberg - Bootstrap
Bootstrap
Tim Hesterberg∗
This article provides an introduction to the bootstrap. The bootstrap provides
statistical inferences—standard error and bias estimates, confidence intervals,
and hypothesis tests—without assumptions such as Normal distributions
or equal variances. As such, bootstrap methods can be remarkably more
accurate than classical inferences based on Normal or t distributions. The
bootstrap uses the same basic procedure regardless of the statistic being
calculated, without requiring the use of application-specific formulae. This
article may provide two big surprises for many readers. The first is
that the bootstrap shows that common t confidence intervals are woefully
inaccurate when populations are skewed, with one-sided coverage levels off
by factors of two or more, even for very large samples. The second is that
the number of bootstrap samples required is much larger than generally
realized. © 2011 John Wiley & Sons, Inc. WIREs Comp Stat 2011, 3, 497–526. DOI: 10.1002/wics.182
Keywords: resampling; permutation tests; inference; standard error; bias
FIGURE 2 | Histogram and Normal quantile plot of the bootstrap distribution for arsenic concentrations.
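A bootstrap distribution like the one in Figure 2 is generated by resampling the observed data with replacement and recomputing the statistic each time. The following is a minimal sketch, using hypothetical skewed data in place of the arsenic measurements (which are not reproduced here); the distribution, sample size, and seed are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the arsenic measurements: any 1-D array of
# positive, skewed values will do for this sketch.
x = rng.lognormal(mean=4.8, sigma=0.5, size=271)

B = 10_000          # number of bootstrap samples
n = len(x)

# Draw B resamples of size n with replacement; record the mean of each.
boot_means = np.array([rng.choice(x, size=n, replace=True).mean()
                       for _ in range(B)])

se_boot = boot_means.std(ddof=1)        # bootstrap standard error
bias    = boot_means.mean() - x.mean()  # bootstrap bias estimate
```

A histogram and normal quantile plot of `boot_means` would then reproduce the kind of display shown in Figure 2.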
FIGURE 4 | Histogram and scatterplot of the bootstrap distribution for relative risk. (Right panel: the x-axis is the proportion in the low-risk group; the slope of a line from the origin equals the relative risk; "Bootstrap CI" and "t CI" endpoints are marked.)
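The two-group resampling behind the bootstrap distribution for relative risk in Figure 4 resamples each group independently, with replacement, and recomputes the ratio of proportions. A sketch follows; the group sizes n1 = 3338 and n2 = 2676 match the text, but the 0/1 outcome counts are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 0/1 outcome vectors; group sizes match the text, event
# counts are made up.
x1 = np.r_[np.ones(63), np.zeros(3338 - 63)]
x2 = np.r_[np.ones(24), np.zeros(2676 - 24)]

theta_hat = x1.mean() / x2.mean()   # observed relative risk

B = 4000
rr = np.empty(B)
for b in range(B):
    # Resample each group independently, with replacement.
    p1 = rng.choice(x1, size=x1.size, replace=True).mean()
    p2 = rng.choice(x2, size=x2.size, replace=True).mean()
    rr[b] = p1 / p2 if p2 > 0 else np.nan   # theta* undefined when p2* = 0

rr = rr[~np.isnan(rr)]              # drop the (rare) undefined replicates
se_boot  = rr.std(ddof=1)
bias_est = rr.mean() - theta_hat    # positive when replicates average high
```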
To bootstrap this, we draw samples of size n1 = 3338 with replacement from the first group, independently draw samples of size n2 = 2676 from the second group, and calculate the relative risk θ̂∗. In addition, we record the individual proportions p̂∗1 and p̂∗2. The bootstrap distribution for relative risk is shown in the left panel of Figure 4. It is highly skewed, with a long right tail caused by divisor values relatively close to zero. The standard error, from a sample of 10⁴ observations, is 0.6188. The theoretical bootstrap standard error is undefined because some of the n1^n1 n2^n2 bootstrap samples have θ̂∗ undefined because the denominator p̂∗2 is zero; this is not important in practice.

The average of the bootstrap replicates is larger than the original relative risk, indicating bias. The estimated bias is 2.206 − 2.100 = 0.106, corresponding to 0.17 standard errors. While the bias does not appear large in the figure, this amount of bias can have a huge impact on inferences; a rough calculation suggests that the actual non-coverage of one side of a two-sided 95% confidence interval would be 1 − Φ(0.17 + 1.96) ≈ 0.017 rather than the nominal 0.025.

…corresponds to calculating the standard error of residuals above and below the central line (the line with slope θ̂), going up and down 1.96 residual standard errors from the central point (the original data) to the circled points; the endpoints of the interval are the slopes of the lines from the origin to the circled points. A t interval would not be appropriate in the example, because of the bias and skewness.

In practice one would normally do a t interval on a transformed statistic, e.g., log of relative risk, or log-odds-ratio log(p̂1(1 − p̂2)/((1 − p̂1)p̂2)). Figure 5 shows a normal quantile plot for the bootstrap distribution of the log of relative risk. The distribution for log relative risk is much less skewed than is the distribution for relative risk, but still noticeably skewed. Even with a log transformation, a t interval would only be adequate for work where accuracy is not required. We discuss confidence intervals further in Section Bootstrap Confidence Intervals.

Linear Regression
The next examples, for linear regression, are based on a dataset from a large pharmaceutical company. The response variable is a pharmacokinetic parameter of interest, and candidate predictors are weight, sex, age, and dose (3 levels—200, 400, and 800). There are 300 observations, one per subject. Our primary interest in this dataset will be to use the bootstrap to investigate the behavior of stepwise regression; however, first we consider some other issues.

A standard linear regression using main effects gives:

…of the middle 95% of heights of regression lines at a given weight.

The right panel shows all 300 observations, and predictions for the PK/weight relationship using (1) all 300 observations, (2) the main-effects model, and (3) predictions for the "base case", males receiving dose = 400, with weight equal to the average weight for all subjects. In effect this uses the full dataset to improve predictions for a subset, "borrowing strength". There is much less variability than in the left panel, particularly for slope, primarily because of the larger sample size, but also because the addition of an important covariate (age) to the model reduces residual variance.

Note that the y values shown are the actual data, not adjusted for differences between the base case and the actual values of sex, age, and dose. The line is …
FIGURE 6 | Bootstrap regression lines. Left panel: 25 males receiving dose = 400. The orange line is the least-squares fit for those 25
observations, and black lines are from bootstrap samples of size 25. Right panel: the orange line is the prediction for males receiving dose = 400,
based on the main-effects linear regression using all 300 subjects, and the black lines are from bootstrap samples.
FIGURE 7 | Histograms of bootstrap distributions for dose and sex coefficients in stepwise regression.
Figure 7 shows the bootstrap distributions for two coefficients: dose, and sex. The dose coefficient is usually zero, though it may be positive or negative. This suggests that dose is not very important in determining the response.

The sex coefficient is bimodal, with the modes on opposite sides of zero. It turns out that the sex coefficient is usually negative when the weight–sex interaction is included, otherwise it is positive. Overall, the bootstrap suggests that the original model is not very stable.

For comparison, repeating the experiment with a more stringent criterion for variable inclusion—a modified Cp statistic with double the penalty—results in a more stable model. The original model has the same six terms. Of the bootstrap samples 154 yield the same model, and on average the number of different terms is 2.15. The average number of terms is 5.93, slightly less than for the original data; this suggests that stepwise regression may now be slightly under-fitting (though one should not read too much into this).

Standard Errors
At the end of the stepwise procedure, the table of coefficients, standard errors, and t values is calculated, ignoring the variable selection process. In particular, the standard errors are calculated under the usual regression assumptions, which assume that the model is fixed from the outset. Call these nominal standard errors.

For each bootstrap sample, we perform stepwise selection and record the coefficients and nominal standard errors. For the main effects the bootstrap standard errors (standard deviation of bootstrap coefficients) and average of the nominal standard errors are:

            boot SE    avg. nominal SE
Intercept   27.9008    14.0734
wgt          0.5122     0.2022
sex          9.9715     5.4250
age          0.3464     0.2137
dose         0.0229     0.0091

The bootstrap standard errors are much larger than the average of the nominal standard errors. The bootstrap standard errors reflect additional variability due to model selection, such as the bimodal distribution for the sex coefficient, factors that the nominal standard errors ignore.

This is not to say that one should use the bootstrap standard errors here. At the end of the stepwise variable selection process, it is appropriate to condition on the model, and do inferences accordingly. For example, a confidence interval for the sex coefficient should be conditional on the weight–sex interaction being included in the model.

But it does suggest that the nominal standard errors are optimistic. In fact they are biased downward, even conditional on the model terms, because they are calculated using a formula that depends on residual standard error, which in turn is biased due to model selection.

Bias
Figure 8 shows bootstrap distributions for R² (unadjusted) and residual standard deviation. Both show very large bias.

The bias is not surprising—optimizing generally gives biased results. Consider ordinary linear
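The comparison of bootstrap standard errors (the standard deviation of bootstrap coefficients) with average nominal standard errors can be sketched as follows. The data here are invented, and for simplicity the model is fixed rather than re-selected in each resample; with a fixed model the two kinds of standard error should roughly agree, and the large discrepancy reported in the text arises precisely because the stepwise selection is repeated on each bootstrap sample.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data standing in for the pharmacokinetic dataset.
n = 300
weight = rng.normal(70, 12, n)
age    = rng.normal(40, 10, n)
y      = 10 + 0.3 * weight + 0.2 * age + rng.normal(0, 5, n)
X = np.column_stack([np.ones(n), weight, age])

def ols(X, y):
    """Return OLS coefficients and nominal (formula-based) standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])
    cov = s2 * np.linalg.inv(X.T @ X)
    return beta, np.sqrt(np.diag(cov))

beta_hat, nominal_se = ols(X, y)

# Resample rows, refit, and compare bootstrap SEs with average nominal SEs.
B = 2000
boot_coefs, boot_nominal = [], []
for _ in range(B):
    idx = rng.integers(0, n, n)
    b, se = ols(X[idx], y[idx])
    boot_coefs.append(b)
    boot_nominal.append(se)

boot_se        = np.std(boot_coefs, axis=0, ddof=1)
avg_nominal_se = np.mean(boot_nominal, axis=0)
```

Replacing the fixed `ols` fit with a full stepwise selection inside the loop would reproduce the kind of inflation shown in the table above.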
FIGURE 8 | Histograms of bootstrap distributions for R 2 and residual standard deviation in stepwise regression.
regression—unadjusted R² is biased. If it were calculated using the true β's instead of estimated β̂'s it would not be biased. Optimizing β̂ to minimize residual squared error (and maximize R²) makes unadjusted R² biased.

In classical linear regression, with the model selected in advance, we commonly use adjusted R² to counteract the bias. Similarly, we use residual variance calculated using a divisor of (n − p − 1) instead of n, where p is the number of terms in the model.

But in this case it is not only the values of the coefficients that are optimized, but which terms are included in the model. This is not reflected in the usual formulae. As a result, the residual standard error obtained from the stepwise procedure is biased downward, even using a divisor of (n − p − 1).

Bootstrapping Rows or Residuals
There are two basic ways to bootstrap linear regression models—to resample rows (observations), or residuals.2,5

To resample residuals, we fit the initial model ŷi = β̂0 + Σj β̂j xij, calculate the residuals ri = yi − ŷi, then create new bootstrap samples as

y∗i = ŷi + r∗i    (5)

for i = 1, . . . , n, where r∗i is sampled with replacement from the observed residuals {r1, . . . , rn}. We keep the original x and ŷ values fixed in order to create new bootstrap y∗ values.

Resampling rows corresponds to a random effects sampling design—in which x and y are both obtained by random sampling from a joint population. Resampling residuals corresponds to a fixed effects model, in which the x's are fixed by the experimental design and y's are obtained conditional on the x's. So at first glance it would appear appropriate to resample rows when the original data collection has random x's.

However, in classical statistics we commonly use inferences derived using the fixed effects model, even when the x's are actually random. We do inferences conditional on the observed x values. Similarly, in bootstrapping we may resample residuals even when the x's were originally random.

In practice the difference matters most when there are factors with rare levels, or interactions of factors with rare combinations. If resampling rows it is possible that a bootstrap sample may have none of the level or combination, in which case the corresponding term cannot be estimated, and the software may give an error. Or, what is worse, there may be one or two rows with the rare level, enough so the software would not crash, but instead quietly give garbage answers, imprecise because they are based on few observations.

Hence with factors with rare levels, or small samples more generally, it may be preferable to resample residuals.

Resampling residuals implicitly assumes that the residual distribution is the same for every x, that there is no heteroskedasticity. A variation on resampling residuals that allows heteroskedasticity is the wild bootstrap or external bootstrap,6 which in its simplest form adds either plus or minus the original residual ri to each fitted value,

y∗i = ŷi ± ri,    (6)

with equal probabilities. Hence the expected value of y∗i is ŷi, and the standard deviation is proportional to ri. For further discussion see Ref 5.
There are other variations on resampling residuals, such as resampling studentized residuals, or weighted error resampling for non-constant variance.5

Prediction Intervals
The idea of resampling residuals provides a way to obtain more accurate prediction intervals. In order to capture both variation in the estimated regression line and residual variation, we may resample both. Variation in the regression line may be obtained by resampling either residuals or rows in order to generate random β̂∗ values and corresponding ŷ∗ = β̂∗0 + Σj β̂∗j x0j, for predictions at x0. Independently we draw random residuals r∗, and add them to the ŷ∗. After repeating this many times, the range of the middle 95% of the (ŷ∗ + r∗) values gives a prediction interval. For further discussion and alternatives see Ref 5.

Logistic Regression
In logistic regression it is straightforward to resample rows of the data, but resampling residuals fails—the y values must be either zero or one, but adding the residual from one observation to the prediction from another yields values anywhere between −1 and 2. Instead, we keep the x's fixed, and generate y values from the estimated conditional distributions given each x. Let p̂i be the predicted probability that yi = 1 given xi. Then

y∗i = 1 with probability p̂i, and y∗i = 0 with probability 1 − p̂i.    (7)

The kyphosis dataset7 contains observations on 81 children who had corrective spinal surgery, on four variables: Kyphosis (a factor indicating whether a postoperative deformity is present), Age (in months), Number (of vertebrae involved in the operation), and Start (beginning of the range of vertebrae involved). A logistic regression using main effects gives coefficients:

suggesting that Start is the most important predictor.

The left panel of Figure 9 shows Kyphosis versus Start, together with the predicted curve for the base case with Age = 87 (the median) and Number = 4 (the median). This is a sunflower plot,8,9 in which a flower with k > 2 petals represents k duplicate values. The right panel of Figure 9 shows predictions from 20 bootstrap curves.

Figure 10 shows the bootstrap distributions for the four regression coefficients. All of the distributions are substantially non-normal. It would not be appropriate to use classical normal-based inferences. Indeed, the printout of regression coefficients above,
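The conditional-distribution bootstrap of Eq. (7) can be sketched as follows. The data are invented stand-ins for the kyphosis example (81 subjects), and the logistic fitter is a plain Newton–Raphson routine written here to keep the sketch self-contained; the small ridge term and clipped linear predictor are stabilizers for resamples with near-separation, not part of the method described in the text.

```python
import numpy as np

rng = np.random.default_rng(4)

def fit_logistic(X, y, iters=25, ridge=1e-6):
    """Logistic regression by Newton-Raphson (no external libraries)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = np.clip(X @ beta, -30, 30)          # stabilizer
        p = 1.0 / (1.0 + np.exp(-eta))
        W = p * (1.0 - p)
        H = X.T @ (W[:, None] * X) + ridge * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    return beta

# Hypothetical data standing in for the kyphosis example (81 children).
n = 81
age   = rng.uniform(1, 200, n)
start = rng.integers(1, 18, n).astype(float)
X = np.column_stack([np.ones(n), age, start])
true_eta = -1.0 + 0.005 * age - 0.2 * start
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_eta))).astype(float)

beta_hat = fit_logistic(X, y)
p_hat = 1.0 / (1.0 + np.exp(-np.clip(X @ beta_hat, -30, 30)))

# Eq. (7): keep the x's fixed; draw y*_i = 1 with probability p-hat_i, else 0.
B = 500
boot = np.array([fit_logistic(X, (rng.random(n) < p_hat).astype(float))
                 for _ in range(B)])
boot_se = boot.std(axis=0, ddof=1)
```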
FIGURE 9 | Bootstrap curves for predicted kyphosis, for Age = 87 and Number = 4.
FIGURE 10 | Normal quantile plots of bootstrap distributions for logistic regression coefficients.
from a standard statistical package (S-Plus) includes t values but omits p values. Yet it would be tempting for a package user to interpret the t coefficients as arising from a t distribution; the bootstrap demonstrates that this would be improper. The distributions are so non-normal as to make the utility of standard errors doubtful.

The numerical bootstrap results are:

The bootstrap standard errors are larger than the classical (asymptotic) standard errors by 20–24%. The distributions are also extremely biased, with absolute bias estimates ranging from 0.22 to 0.28 standard errors.

These results are for the conditional distribution bootstrap, a kind of parametric bootstrap. Repeating the analysis with the nonparametric bootstrap (resampling observations) yields bootstrap distributions that are even longer-tailed, indicating larger biases and standard errors. This reinforces the conclusion that classical normal-based inferences are not appropriate here.
FIGURE 11 | Bootstrap distribution for the mean, n = 50. The left column shows the population and five samples. The middle column shows the sampling distribution, and bootstrap distributions from each sample. The right column shows five more bootstrap distributions from the first sample, with B = 1000 or B = 10⁴.
…poor approximations of the sampling distribution. In contrast, the sampling distribution is continuous, but the bootstrap distributions are discrete, with the only possible values being values in the original sample (here n is odd). The bootstrap distributions are very sensitive to the sizes of gaps among the observations near the center of the sample.

The ordinary bootstrap tends not to work well for statistics such as the median or other quantiles that depend heavily on a small number of observations out of a larger sample.

In the case of the median and other interior quantiles, this can be remedied using a smoothed bootstrap,12,13 drawing samples from a density estimate based on the data, rather than drawing from the data itself. Smoothing is less effective for more extreme quantiles, where the bootstrap distribution would still depend heavily on a small number of observations. In that case it may be necessary to impose additional structure by assuming a parametric family, and perform a parametric bootstrap.

FIGURE 12 | Bootstrap distributions for the mean, n = 9. The left column shows the population and five samples. The middle column shows the sampling distribution, and bootstrap distributions from each sample. The right column shows five more bootstrap distributions from the first sample, with B = 1000 or B = 10⁴.

…to represent the shape of the population; when there is less data you cannot.
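The discreteness problem and the smoothed-bootstrap remedy can both be seen in a few lines. The sketch below uses an invented sample of size 15 (as in Figure 13); it contrasts the ordinary bootstrap of the median, whose only possible values are the original observations, with a smoothed bootstrap that adds Gaussian kernel noise. The bandwidth rule is a standard Silverman-style choice, not one prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(5)

x = rng.gamma(shape=2.0, scale=3.0, size=15)   # hypothetical small sample
n, B = len(x), 10_000

# Ordinary bootstrap of the median: with n odd, each resample's median is
# one of the original observations, so very few distinct values occur.
ordinary = np.median(rng.choice(x, size=(B, n), replace=True), axis=1)

# Smoothed bootstrap: resample, then add Gaussian noise with bandwidth h
# (equivalent to sampling from a kernel density estimate of the data).
h = 1.06 * x.std(ddof=1) * n ** (-1 / 5)       # Silverman-style bandwidth
smoothed = np.median(rng.choice(x, size=(B, n), replace=True)
                     + rng.normal(0, h, size=(B, n)), axis=1)

n_distinct_ordinary = np.unique(ordinary).size  # at most n distinct values
n_distinct_smoothed = np.unique(smoothed).size  # essentially B distinct values
```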
FIGURE 13 | Bootstrap distributions for the median, n = 15. The left column shows the population and five samples. The middle column shows
the sampling distribution, and bootstrap distributions from each sample. The right column shows five more bootstrap distributions from the first
sample.
Second, in many applications there is a relationship between the statistic and its standard error ("acceleration" in the terminology of Ref 14). For example, the standard error of a binomial proportion, √(p̂(1 − p̂)/n), depends on p̂. Similarly, when sampling from a gamma distribution, the variance of the sample mean depends on the underlying mean. More generally when sampling the mean from positively skewed distributions, samples with larger means tend to give larger standard errors.

When there is acceleration, the bootstrap standard error reflects the standard error corresponding to θ̂, not the true standard deviation of the sampling distribution (corresponding to θ). Suppose the relationship is positive; then when θ̂ < θ it tends to be true that the estimated standard error is also less than the true standard deviation of the sampling distribution, …
…typically set to n − 1 (although other values would be better for non-normal distributions).

The bootstrap standard error may be computed using the techniques in Section Bootstrap Distributions Are Too Narrow—bootknife, sampling with reduced size, or smoothed bootstrap. This results in slightly wider intervals that are usually more accurate in practice. These techniques have an O(1/n) effect on one-sided coverage errors, which is unimportant for large samples but is important in small samples. For example, for a sample of independent identically distributed observations from a normal distribution, a nominal 95% t interval for the mean using a bootstrap standard error without these corrections would have one-sided coverage errors:

  n    Non-coverage   Error
 10    0.0302         0.0052
 20    0.0277         0.0027
 40    0.0264         0.0014
100    0.0256         0.0006

…distribution function with n − 1 degrees of freedom. This gives wider intervals. Extensive simulations22 show that this gives smaller coverage errors in practice, in a wide variety of applications. The effect on coverage errors is O(1/n), the same order as the bootknife adjustment, but the magnitude of the effect is larger; for example, the errors caused by using z rather than t quantiles in a standard t interval for a normal population are:

  n    Non-coverage   Error
 10    0.0408         0.0158
 20    0.0324         0.0074
 40    0.0286         0.0036
100    0.0264         0.0014

For a sample size of 20, this effect alone makes intervals tend to miss 0.0074/0.025 = 30% too often!

A third variation relates to how quantiles are calculated for a finite number B of bootstrap samples. Hyndman and Fan23 give a family of definitions of quantiles for finite samples, governed by a parameter 0 ≤ δ ≤ 1. The bth order statistic θ̂∗(b) is the (b − δ)/(B + 1 − 2δ) quantile of the bootstrap distribution, for b = 1, . . . , B. Linear interpolation between adjacent bootstrap statistics is used if the desired quantile is not of the form (b − δ)/(B + 1 − 2δ) for some integer b. For bootstrap confidence intervals δ = 0 is preferred, as other choices result in lower coverage probability. The effect on coverage errors is O(1/B).

Percentile Intervals
In its simplest form, a 95% bootstrap percentile interval is the range of the middle 95% of a bootstrap distribution. More formally, bootstrap percentile intervals are of the form

(Ĝ⁻¹(α/2), Ĝ⁻¹(1 − α/2)).    (9)
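A percentile interval of the form (9), using the δ = 0 quantile convention described above (order statistic b is the b/(B + 1) quantile, with linear interpolation in between), can be sketched as follows; the data are invented.

```python
import numpy as np

rng = np.random.default_rng(6)

def quantile_delta0(theta_star, p):
    """Quantile with the (b - delta)/(B + 1 - 2*delta) rule at delta = 0:
    sorted value b is the b/(B+1) quantile; interpolate linearly between."""
    t = np.sort(theta_star)
    m = len(t)
    h = p * (m + 1)              # fractional order-statistic index
    b = int(np.floor(h))
    if b < 1:
        return t[0]
    if b >= m:
        return t[-1]
    return t[b - 1] + (h - b) * (t[b] - t[b - 1])

x = rng.exponential(scale=2.0, size=50)   # hypothetical skewed sample
B = 10_000
boot = rng.choice(x, size=(B, len(x)), replace=True).mean(axis=1)

alpha = 0.05
lo = quantile_delta0(boot, alpha / 2)      # G-hat^{-1}(alpha/2)
hi = quantile_delta0(boot, 1 - alpha / 2)  # G-hat^{-1}(1 - alpha/2)
```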
FIGURE 14 | Histogram of bootstrap distribution for the t statistic, and relationship between bootstrap means and standard deviations, of
arsenic concentrations.
Figure 14 shows the bootstrap distribution for the t statistic for mean arsenic concentration, where t is the ordinary t statistic (x̄ − µ)/(s/√n). In contrast to Figure 2, where the bootstrap distribution for the mean is positively skewed, the distribution for the t statistic is negatively skewed. The reason is that there is positive correlation between x̄∗ and s∗, as seen in the right panel of Figure 14, so that a negative numerator in (12) tends to occur with a small denominator.

The bootstrap t interval is based on the identity

P(Gt⁻¹(α/2) < (θ̂ − θ)/sθ̂ < Gt⁻¹(1 − α/2)) = 1 − α,    (13)

where Gt is the sampling distribution of t (11). Assuming that t∗ (12) has approximately the same distribution as t, we substitute quantiles of the bootstrap distribution for t∗; then solving for θ yields the bootstrap t interval

(θ̂ − Gt∗⁻¹(1 − α/2)sθ̂, θ̂ − Gt∗⁻¹(α/2)sθ̂).    (14)

Note that the right tail of the bootstrap distribution of t∗ is used in computing the left side of the confidence interval, and conversely.

The bootstrap t and other intervals for the mean arsenic concentration example described in Section Introduction are shown in Table 1.

It is not appropriate to use bootknife or other sampling methods in Section Bootstrap Distributions Are Too Narrow with the bootstrap t. The reason we use those methods with the other intervals is because those intervals are too narrow if the plug-in population is narrower, on average, than the parent population. The sampling distribution of a t statistic, in contrast, is invariant under changes in the scale of the parent population. This gives it an automatic correction for the plug-in population being too narrow, and to add bootknife sampling would over-correct.

Efron and Tibshirani2 note that the bootstrap t is sometimes erratic, and suggest transforming the statistic of interest. Hesterberg22 observes erratic behavior in small samples. We conjecture the following explanation—that the bootstrap t depends not only on skewness, but also on kurtosis, and kurtosis is hard to estimate from small samples. The bootstrap t does not use a t table, but instead estimates the distribution of the t statistic by simulating from the data. This distribution depends not only on asymmetry caused by skewness, but also on the effective degrees of freedom, which depend on kurtosis—larger kurtosis results in greater variability in standard errors and smaller effective degrees of freedom. In contrast, other second-order-correct intervals depend on skewness, but not (or much less so) on kurtosis, so are less erratic for small samples.

BCa Intervals
The bootstrap BCa interval14 uses quantiles of the bootstrap distribution, like the percentile interval, but with the percentiles adjusted depending on a bias parameter z0 and acceleration parameter a. The interval is

(G⁻¹(p(α/2)), G⁻¹(p(1 − α/2))),    (15)
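The bootstrap t construction of Eqs. (13) and (14) is sketched below on an invented skewed sample. For simplicity the standard error is the formula SE for a mean, which avoids the nested (second-level) bootstrap; note how the right tail of t∗ sets the left endpoint, and conversely.

```python
import numpy as np

rng = np.random.default_rng(7)

x = rng.lognormal(0, 1, 50)     # hypothetical skewed sample
n = len(x)
theta_hat = x.mean()
s_theta   = x.std(ddof=1) / np.sqrt(n)   # formula SE (no nested bootstrap)

# Bootstrap the t statistic: t* = (theta* - theta_hat) / s*  (cf. Eq. 12).
B = 10_000
samples = rng.choice(x, size=(B, n), replace=True)
t_star = ((samples.mean(axis=1) - theta_hat)
          / (samples.std(axis=1, ddof=1) / np.sqrt(n)))

# Eq. (14): the RIGHT tail of t* gives the LEFT endpoint, and conversely.
q_lo, q_hi = np.percentile(t_star, [2.5, 97.5])
interval = (theta_hat - q_hi * s_theta, theta_hat - q_lo * s_theta)
```

For positively skewed data, t∗ is negatively skewed (as in Figure 14), so this interval reaches farther to the right than an ordinary t interval.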
Another suitable family is the maximum likelihood family, with probability

pi = c/(1 − τ(xi − x̄))    (21)

on observation i.

Importance Sampling Implementation
Conceptually, finding the right value of τ requires trial and error; for any given τ, we calculate p = (p1, . . . , pn), draw bootstrap samples with those probabilities, calculate the bootstrap statistics, and calculate the fraction of those statistics that are above θ̂, then repeat with a different τ until the fraction is 2.5%. This is expensive, and the fraction varies due to random sampling.

In practice we use an importance sampling implementation. Instead of sampling with unequal probabilities, we sample with equal probabilities, then reweight the bootstrap samples by the relative likelihood of the sample under weighted and ordinary bootstrap sampling. The likelihood for a bootstrap sample is …

Let Fp denote a weighted distribution with probability pi on original data point xi, θ(p) = θ(Fp) be the parameter for the weighted distribution (e.g., weighted mean, or weighted regression coefficient), and p0 = (1/n, . . . , 1/n) correspond to the original equal-probability empirical distribution function. The gradient of θ(p) is

Ui(p) = lim ε→0 ε⁻¹(θ(p + ε(δi − p)) − θ(p)),    (24)

where δi is the vector with 1 in position i and 0 elsewhere. When evaluated at p0 these derivatives are known as the empirical influence function, or infinitesimal jackknife.

Four least-favorable families found in the tilting literature are:

F1: pi = c exp(τ Ui(p0))
F2: pi = c exp(τ Ui(p))
F3: pi = c(1 − τ Ui(p0))⁻¹
F4: pi = c(1 − τ Ui(p))⁻¹,    (25)
TABLE 1 Confidence Intervals for Mean Arsenic Concentration, Based on 100,000 Bootstrap Samples, Using Ordinary Nonparametric and Bootknife Resampling

                      95% Interval     Asymmetry
Formula t             (88.8, 160.2)    ±35.7
Ordinary Bootstrap
  t w boot SE         (88.7, 160.2)    ±35.8
  Percentile          (91.5, 162.4)    (−33.0, 38.0)
  Bootstrap t         (94.4, 172.6)    (−30.1, 48.1)
  BCa                 (95.2, 169.1)    (−29.3, 44.6)
  Tilting             (95.2, 169.4)    (−29.3, 44.9)
Bootknife
  t w boot SE         (88.7, 160.3)    ±35.8
  Percentile          (91.5, 162.6)    (−32.9, 38.1)
  BCa                 (95.4, 169.3)    (−29.1, 44.8)
  Tilting             (95.2, 169.4)    (−29.3, 45.0)

The "asymmetry" column is obtained by subtracting the observed mean. The "t w boot SE" interval is a t interval using a bootstrap standard error.

TABLE 2 Actual Non-Coverage of Nominal 95% t Intervals, as Estimated From Second-Order-Accurate Intervals

Estimated using    Left     Right
Bootstrap t        0.0089   0.062
BCa                0.0061   0.052

A t interval would miss more than twice too often on the right side. The actual non-coverage should be 0.025 on each side.

…t interval—in other words, what the bootstrap t and BCa intervals think is the actual non-coverage of the t intervals. The discrepancies are striking. On the left side, the t interval should miss 2.5% of the time; it actually misses only about a third or a fourth that often, according to the bootstrap t and BCa intervals. On the right side, it should miss 2.5% of the time, but actually misses somewhere between 5.2 and 6.2%, according to the BCa and bootstrap t procedures. This suggests that the t interval is severely biased, with both endpoints systematically lower than they should be.
To obtain reasonable accuracy for smaller replications, typically by a factor of 37 for a 95%
sample sizes requires the use of more accurate confidence interval. The disadvantages of tilting are
confidence intervals, either a second-order-accurate that the small-sample properties of the fixed-derivative
bootstrap interval, or comparable second-order- versions F1 and F3 are not particularly good, while
accurate non-bootstrap interval. Two general second- the more rigorous F2 and F4 are harder to implement
order-accurate procedures that do not require reliably.
sampling are ABC24 and automatic percentile25
intervals, which are approximations for BCa and
tilting intervals, respectively. HYPOTHESIS TESTING
The current practice of statistics, using normal An important point in bootstrap hypothesis testing
and t intervals with skewed data, systematically is that sampling should be done in a way that is
produces confidence intervals with endpoints that are consistent with the null distribution.
too low (for positively skewed data). We describe here three bootstrap hypothesis
Similarly, hypothesis tests are systematically
testing procedures: pooling for two-sample tests,
biased; for positively skewed data they reject H0 :
bootstrap tilting, and bootstrap t.
θ = θ0 too often for cases with θ̂ < θ0 , and too little
The first is for two-sample problems, such
for θ̂ > θ0. The primary reason is acceleration—when θ̂ < θ0 then acceleration makes it likely that s < σ, and the t interval does not correct for this, so it improperly rejects H0.

Comparing Intervals

t intervals and bootstrap percentile intervals are quick-and-dirty intervals, suitable for rough approximations, but should not be used where accuracy is needed.

Among the others, I recommend the BCa in most cases, provided that the number of bootstrap samples B is very large.

In my experience with extensive simulations, the bootstrap t is the most accurate in terms of coverage probabilities. However, it achieves this at high cost—the interval is longer on average than the BCa and tilting intervals, often much longer. Adjusting the nominal coverage level of the BCa and tilting intervals upward gives comparable coverage to bootstrap t with shorter length. And the lengths of bootstrap t intervals vary much more than the others. I conjecture that this is because bootstrap t intervals are sensitive to the kurtosis of the bootstrap distribution, which is hard to estimate accurately from reasonable-sized samples. In contrast, BCa and tilting intervals depend primarily on the mean, standard deviation, and skewness of the bootstrap distribution.

Also, the bootstrap t is computationally expensive if the standard error is obtained by bootstrapping. If s_θ̂ is calculated by bootstrapping, then s_θ̂* is calculated using a second level of bootstrapping—drawing bootstrap samples from each first-level bootstrap sample (requiring a total of B + B·B2 bootstrap samples, if B2 second-level bootstrap samples are drawn from each of B first-level bootstrap samples).

The primary advantage of bootstrap tilting over BCa is that it requires many fewer bootstrap samples.

Consider first a two-sample test, such as comparing two means. Suppose that the null hypothesis is that θ1 = θ2, and that one is willing to assume that if the null hypothesis is true then the two populations are the same. Then one may pool the data, draw samples of size n1 and n2 with replacement from the pooled data, and compute a test statistic such as θ̂1 − θ̂2 or a t statistic. Let T* be the bootstrap test statistic, and T0 the observed value of the test statistic. The P-value is the fraction of the time that T* exceeds T0.

In practice we add 1 to the numerator and denominator when computing the fraction—the one-sided P-value for the one-sided alternative hypothesis θ1 − θ2 > 0 is (#(T* > T0) + 1)/(B + 1). The lower one-sided P-value is (#(T* < T0) + 1)/(B + 1), and the two-sided P-value is two times the smaller of the one-sided P-values.

This procedure is similar to the two-sample permutation test, which pools the data and draws n1 observations without replacement for the first sample and allots the remaining n2 observations to the second sample. The permutation test is preferred. For example, suppose there is one outlier in the combined sample; every pair of permutation samples has exactly one copy of the outlier, while the bootstrap samples may have 0, 1, 2, . . . copies. This adds extra variability not present in the original data, and detracts from the accuracy of the resulting P-values.

Now suppose that one is not willing to assume that the two distributions are the same. Then bootstrap tilting hypothesis testing5,26,27 may be suitable. Tilting may also be used in one-sample and other contexts. The idea is to find a version of the empirical distribution function(s) with unequal probabilities that satisfies the null hypothesis (by maximizing likelihood or minimizing Kullback–Leibler distance subject to the null hypothesis), then draw samples from the unequal-probability empirical distributions, and let
the P-value be the fraction of times the bootstrap test statistic exceeds the observed test statistic. As in the case of confidence intervals, importance sampling may be used in place of sampling with unequal probabilities; see Section Bootstrap Confidence Intervals. There are close connections to empirical likelihood.28

Bootstrap tilting hypothesis tests reject H0 if bootstrap tilting confidence intervals exclude the null hypothesis value.

The third general-purpose bootstrap testing procedure is related to bootstrap t confidence intervals. A t statistic is calculated for the observed data, and the P-value for the statistic is calculated not by reference to the Student's t distribution, but rather by reference to the bootstrap distribution for the t statistic. In this case the bootstrap sampling need not be done consistently with the null hypothesis, because t statistics are approximately pivotal—their distribution is approximately the same independent of θ.

PLANNING CLINICAL TRIALS

The usual bootstrap procedure is to draw samples of size n from the empirical data, or more generally to plug in an estimate for the population and draw samples using the sampling mechanism actually used in practice. In planning clinical trials we may modify this in two ways:

• try other sampling procedures, such as different sample sizes or stratification, and/or
• plug in alternate population estimates.

For example, given training data of size n, to estimate standard errors or confidence interval widths that would result from a possible clinical trial of size N, we may draw bootstrap samples of size N with replacement from the data.

Similarly, we may estimate the effects of different sampling mechanisms, such as stratified sampling, or case–control allocation to arms, even if pilot data were obtained in other ways.

For example, we consider preliminary results from a clinical trial to evaluate the efficacy of maintenance chemotherapy for acute myelogenous leukemia (AML).29,30 After achieving remission through chemotherapy, the patients were assigned to a treatment group receiving maintenance chemotherapy and a control group that did not. The goal was to see if maintenance chemotherapy prolonged the time until relapse. The data are in Table 3. There are 11 subjects in the treatment group and 12 in the control group.

TABLE 3 Leukemia Data

Group          Length of Complete Remission (in Weeks)
Maintained     9, 13, 13+, 18, 23, 28+, 31, 34, 45+, 48, 161+
Nonmaintained  5, 5, 8, 8, 12, 16+, 23, 27, 30, 33, 43, 45

A Cox proportional hazards regression, using Breslow's method of breaking ties, yields a log-hazard ratio of 0.904 and standard error 0.512:

        coef exp(coef) se(coef)    z     p
group  0.904      2.47    0.512 1.77 0.078

An ordinary bootstrap with B = 10^4 results in eleven samples with complete separation—where the minimum observed relapse time in the treatment group exceeds the maximum observed relapse in the control group—giving an infinite estimated hazard ratio. A stratified bootstrap reduces the number of samples with complete separation to three. Here stratification is preferred (even if the original allocation were not stratified) in order to condition on the actual sample sizes, and to prevent imbalance in the bootstrap samples. Omitting the three samples results in a slightly long-tailed bootstrap distribution, with standard error 0.523, slightly larger than the formula standard error.

Drawing 50 observations from each group results in a bootstrap distribution for the log-hazard ratio that is nearly exactly normal with almost no bias, no samples with separation (they are still possible, but unlikely), and a standard error of 0.221. Surprisingly, this is 10% less than obtained by extrapolating the original formula standard error at the rate 1/√n, 0.512/√(100/23) = 0.246, and 12% less than obtained by extrapolating the original bootstrap standard error. Similar results are obtained using Efron's method for handling ties, and from a smoothed bootstrap with a small amount of noise added to the remission times. The fact that the reduction in standard error is 10–12% greater than expected may be because censored observations have a less serious impact with larger sample sizes.

'What if' Analyses—Alternate Population Estimates

In planning clinical trials it is often of interest to do 'what if' analyses, perturbing various inputs. For example, how might the results differ under sampling from populations with a log-hazard ratio of zero, or 0.5?

This should be done by reweighting observations.31,32 This is a version of bootstrap tilting19,21,31,33 and is closely related to empirical likelihood.34
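One concrete way to construct such weights can be sketched with exponential tilting for a one-sample mean constraint. This is a simplified stand-in (my own helper, with hypothetical data) for the maximum-likelihood tilting used in the article:

```python
import numpy as np

def exp_tilt_weights(x, theta0, tol=1e-10):
    """Weights w_i proportional to exp(tau * x_i), with tau chosen by
    bisection so that the weighted mean equals theta0. Requires
    min(x) < theta0 < max(x); the weighted mean is increasing in tau."""
    x = np.asarray(x, float)

    def weights(tau):
        w = np.exp(tau * (x - x.mean()))  # centering for numerical stability
        return w / w.sum()

    lo, hi = -50.0 / x.std(), 50.0 / x.std()
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if weights(mid) @ x < theta0:
            lo = mid
        else:
            hi = mid
    return weights(0.5 * (lo + hi))

# Sampling from the weighted empirical distribution:
# rng.choice(x, size=len(x), replace=True, p=exp_tilt_weights(x, theta0))
```

Bootstrap samples drawn with these probabilities come from a population whose mean is θ0, which is the 'what if' sampling described above; the article's hazard-ratio constraint requires heavier machinery but follows the same pattern.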
Consider first a simple example—sampling the difference in two means, θ̂ = x̄1 − x̄2. In order to sample from populations with different values of θ, it is natural to consider perturbing the data, shifting one or both samples, e.g., adding θ − θ̂ to each value in sample 1.

Perturbing the data does not generalize well to other situations. Furthermore, perturbing the data would often give incorrect answers. Suppose that the observations represent positively skewed observations such as survival times, with a mode at zero. Shifting one of the samples to the left would give negative times; shifting to the right would make the mode nonzero. More subtle, but very important, is that shifting ignores the mean–variance relationships for skewed populations—increasing the mean should also increase the variance. For positive data like survival times, perturbing the data by multiplying one of the samples by a factor avoids the most obvious problems, but assumes a particular mean–variance relationship—that variance is proportional to the square of the mean.

It is also unclear how one would perturb the data in multivariate applications when some variables are categorical.

Instead, we suggest using a weighted version of the empirical data, maximizing the likelihood of the observed data subject to the weighted distributions satisfying desired constraints. To satisfy µ1 − µ2 = θ0, for example, we maximize

    ∏_{i=1}^{n1} w1i · ∏_{i=1}^{n2} w2i    (26)

subject to the weights in each sample summing to 1 and to Σ_{i=1}^{n1} w1i x1i − Σ_{i=1}^{n2} w2i x2i = θ0 (28). For other statistics we replace (28) with the more general

    θ(F̂n,w) = θ0,    (29)

where F̂n,w is the weighted empirical distribution (with obvious generalization to multiple samples or strata). The computational tools used for empirical likelihood34 and bootstrap tilting19,21 are useful in determining the weights.

The bootstrap sampling is from the weighted empirical distributions, i.e., the data are sampled with unequal probabilities.

Figure 15 shows this idea applied to the leukemia data. The top left shows Kaplan–Meier survival curves for the original data, and the top right shows the bootstrap distribution for the log-hazard ratio, using 50 observations in each group. The bottom left shows weights chosen to maximize (26), subject to (28) and a log-hazard ratio equal to 0.5. In order to reduce the ratio from its original value of 0.904, the treatment group gets high weights early and low weights later (the weighted distribution has a higher probability of early events) while the control group gets the converse. Censored observations get roughly the average weight of the remaining noncensored observations in the same group. The middle left shows the resulting weighted survival estimates, and the middle right the corresponding bootstrap distribution. In this case both bootstraps are nearly normal, and the standard errors are very similar: 0.221 for the ordinary bootstrap and 0.212 for the weighted bootstrap, both with 50 observations per group.
FIGURE 15 | Survival curves and bootstrap distribution for log-hazard ratio, original and perturbed (weighted) to a log-hazard ratio of 0.5.
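The first modification above, drawing bootstrap samples of size N rather than n, can be sketched for a simple two-group statistic. This is an illustration under assumptions (a difference in means instead of the article's log-hazard ratio, with hypothetical data, sizes, and B), not the analysis from the article:

```python
import numpy as np

def planned_se(x1, x2, N1, N2, B=2000, rng=None):
    """Bootstrap standard error of a difference in means for a planned
    trial with group sizes N1, N2, resampling with replacement from
    pilot data x1, x2 (stratified: each group resampled separately)."""
    rng = np.random.default_rng(rng)
    x1 = np.asarray(x1, float)
    x2 = np.asarray(x2, float)
    stats = np.empty(B)
    for b in range(B):
        s1 = rng.choice(x1, size=N1, replace=True)  # resample size N1, not len(x1)
        s2 = rng.choice(x2, size=N2, replace=True)
        stats[b] = s1.mean() - s2.mean()
    return stats.std(ddof=1)
```

Comparing planned_se(x1, x2, 50, 50) with planned_se(x1, x2, len(x1), len(x2)) gives a direct estimate of how much a larger trial would shrink the standard error, mirroring the 50-per-group resampling in the AML example.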
errors, and that B = 1000 is enough for confidence intervals.

We argue that larger sizes are appropriate, on two grounds. First, those criteria were developed when computers were much slower; with faster computers it is much easier to take more samples.

Second, those criteria were developed using arguments that combine the random variation due to the original sample with the random variation due to bootstrap sampling. For example, Efron and Tibshirani2 indicate that cv(se_B) ≈ {cv(se_∞)² + (E(δ̂) + 2)/(4B)}^{1/2}, where cv is the coefficient of variation, cv(Y) = σ_Y/E(Y), se_B and se_∞ are bootstrap standard errors using B or ∞ replications, respectively, and δ̂ relates to the kurtosis of the bootstrap distribution; it is zero for normal distributions. Even relatively small values of B make the ratio cv(se_B)/cv(se_∞) not much larger than 1.

We feel that the variation in bootstrap answers conditional on the data is more relevant. This is particularly true in clinical trial applications, where

• reproducibility is important—two people analyzing the same data should get (almost exactly) the same results, with random variation between their answers minimized, and
• the data may be very expensive—there is little point in wasting the value of expensive data by introducing extraneous variation using B too small. Given the choice between reducing variation in the ultimate results by gathering more data or by increasing B, it would be cheaper to increase B, at least until B is quite large.

Conditional on the data, cv(se_B) ≈ √((δ + 2)/(4B)), where δ is the kurtosis of the theoretical bootstrap distribution (conditional on the data). When δ is zero (usually approximately true), this simplifies to cv(se_B) ≈ 1/√(2B).

To determine how large B should be, we consider the effect on confidence intervals. Consider a t interval of the form θ̂ ± t_{α/2} se_B. Suppose that such an interval using se_∞ would be approximately correct, with one-sided non-coverage α/2. Then the actual non-coverage using se_B in place of se_∞ would be F_{t,n−1}((se_B/se_∞) F^{−1}_{t,n−1}(α/2)). For n large and α = 0.05, to have the actual one-sided non-coverage fall within 10% of the desired value (between 0.0225 and 0.0275) requires that se_B/se_∞ be between Φ^{−1}(0.025 × 1.1)/Φ^{−1}(0.025) = 0.979 and Φ^{−1}(0.025 × 0.9)/Φ^{−1}(0.025) = 1.023. To have 95% confidence of no more than 10% error requires that 1.96/√(2B) ≤ 0.022, or B ≥ 0.5(1.96/0.022)² = 3970, or about 4000 bootstrap samples.

To satisfy the more stringent criterion of 95% confidence that the non-coverage error is less than 1% of 0.025 would require approximately 400,000 bootstrap samples. With modern computers this is not unreasonable, unless the statistic is particularly slow to compute.

Consider also bootstrap confidence intervals based on quantiles. The simple bootstrap percentile confidence interval is the range from the α/2 to 1 − α/2 quantiles of the bootstrap distribution. Let G^{−1}_∞(c) be the c quantile of the theoretical bootstrap distribution; the number of bootstrap statistics falling below this quantile is approximately binomial with parameters B and c (the proportion parameter may differ slightly due to the discreteness of the bootstrap distribution). For finite B, the one-sided error has standard error approximately √(c(1 − c)/B). For c = 0.025, to reduce 1.96 standard errors to c/10 requires B ≥ (10/0.025)² × 1.96² × 0.025 × 0.975 = 14980, about 15,000 bootstrap samples. The more stringent criterion of a 1% error would require approximately 1.5 million bootstrap samples.

The bootstrap BCa confidence interval has greater Monte Carlo error, because it requires estimating a bias parameter using the proportion of bootstrap samples falling below the original θ̂ (and the variance of a binomial proportion, p(1 − p)/B, is greatest for p = 0.5). It requires B about twice as large as the bootstrap percentile interval for equivalent Monte Carlo accuracy—30,000 bootstrap samples to satisfy the 10% criterion.

On the other hand, the bootstrap tilting interval requires about 17 times fewer bootstrap samples for the same Monte Carlo accuracy as the simple percentile interval, so that about 1000 bootstrap samples would suffice to satisfy the 10% criterion.

In summary, to have 95% probability that the actual one-sided non-coverage for a 95% bootstrap interval falls within 10% of the desired value, between 0.0225 and 0.0275, conditional on the data, requires about 1000 samples for a bootstrap tilting interval, 4000 for a t interval using a bootstrap standard error, 15,000 for a bootstrap percentile interval, and 30,000 for a bootstrap BCa interval.

Figure 16 shows the Monte Carlo variability of a number of bootstrap confidence interval procedures, for various combinations of sample size, statistic, and underlying data; these are representative of a larger collection of examples in Ref 22. The panels show the variability due to Monte Carlo sampling with a finite bootstrap sample size B, conditional on the data.

Figure 16 is based on 2000 randomly generated datasets for each sample size, distribution, and statistic. For each dataset, and for each value of B, two sets of bootstrap samples are created and intervals calculated using all methods. For each method, a sample variance is calculated using the usual unbiased sample variance (based on two observations). The estimate of Monte Carlo variability is then the average across the 2000 datasets of these unbiased sample variances. The result is the 'within-group' component of variance (due to Monte Carlo variability).
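The sample-size arithmetic above can be collected into two small helpers. This is a sketch under the text's assumptions (the 0.022 tolerance on se_B/se_∞ and the binomial argument at quantile c); the function names are mine:

```python
import math

def b_for_t_interval(tol=0.022, z=1.96):
    """Smallest B with z/sqrt(2B) <= tol, from cv(se_B) ~ 1/sqrt(2B);
    tol = 0.022 is the se_B/se_inf tolerance derived in the text."""
    return math.ceil(0.5 * (z / tol) ** 2)

def b_for_percentile(c=0.025, rel_err=0.10, z=1.96):
    """Smallest B with z*sqrt(c*(1-c)/B) <= rel_err*c, the Monte Carlo
    accuracy criterion for a percentile-interval endpoint at quantile c."""
    return math.ceil((z / (rel_err * c)) ** 2 * c * (1 - c))
```

b_for_t_interval() returns 3969 (the text rounds to 3970, "about 4000") and b_for_percentile() returns 14983 (about 15,000). The text's factors for BCa (about twice the percentile B) and tilting (about 17 times fewer) can be applied on top of the percentile result.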
FIGURE 16 | Monte Carlo variability of a number of bootstrap confidence interval procedures, plotted against bootstrap sample size B; panels include exponential tilting (exp-tilt) and maximum-likelihood tilting (ml-tilt) intervals.
the theoretical bootstrap distribution, and a summary statistic Q (e.g., standard error, bias estimate, or endpoint of a confidence interval), we may draw B2 bootstrap samples of size B from the B observations, and calculate the summary statistics Q*1, Q*2, . . . , Q*B2. The sample standard deviation of the Q*s is the Monte Carlo standard error.

Variance Reduction

There are a number of techniques that can be used to reduce the Monte Carlo variation.

The balanced bootstrap,35 in which each of the n observations is included exactly B times in the B bootstrap samples, is useful for bootstrap bias estimates but of little value otherwise.

Antithetic variates36 are moderately helpful for bias estimation but of little value otherwise.

Importance sampling37,38 is particularly useful for estimating tail quantiles, as for bootstrap percentile and BCa intervals. For nonlinear statistics one should use a defensive mixture distribution.39,40

Control variates36,39,41,42 are moderately to extremely useful for bias and standard error estimation and can be combined with importance sampling.43 They are most effective in large samples for statistics that are approximately linear.

Concomitants42,44 are moderately to extremely useful for quantiles and can be combined with importance sampling.45 They are most effective in large samples for statistics that are approximately linear; linear approximations tailored to a tail of interest can dramatically improve the accuracy.46

Quasi-random sampling47 can be very useful for small n and large B; the convergence rate is O((log B)^n B^{−1}) compared to O(B^{−1/2}) for Monte Carlo methods.

Analytical approximations for bootstrap distributions are available in some situations, including analytical approximations for bootstrap tilting and BCa intervals20,24 and saddlepoint approximations.48–52

ADDITIONAL TOPICS

Some topics that are beyond the scope of this article^a include bootstrapping dependent data (time series, mixed effects models), cross-validation and bootstrap validation (bootstrapping prediction errors and classification errors), the Bayesian bootstrap, and bootstrap likelihoods. Refs 2 and 5 are good starting points for these topics, with the exception of mixed effects models. Ref 2 is an introduction to the bootstrap written for upper-level undergraduate or beginning graduate students. Ref 5 is the best general-purpose reference for the bootstrap for statistical practitioners. Ref 10 looks at asymptotic properties of various bootstrap methods. The author's website http://home.comcast.net/∼timhesterberg/bootstrap has resources for teaching statistics using the bootstrap, and some technical reports, particularly on computational aspects of bootstrapping.

NOTE

a This article is a minor revision of Ref 53.
REFERENCES
1. Efron B. Bootstrap methods: another look at the jackknife (with discussion). Ann Stat 1979, 7:1–26.
2. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Chapman and Hall; 1993.
3. Breiman L. Random forests. Mach Learn 2001, 45:5–32.
4. Efron B. The Jackknife, the Bootstrap and Other Resampling Plans. National Science Foundation–Conference Board of the Mathematical Sciences Monograph 38. Philadelphia: Society for Industrial and Applied Mathematics; 1982.
5. Davison A, Hinkley D. Bootstrap Methods and Their Applications. Cambridge University Press; 1997.
6. Wu CFJ. Jackknife, bootstrap, and other resampling methods in regression analysis (with discussion). Ann Stat 1986, 14:1261–1350.
7. Chambers J, Hastie T. Statistical Models in S. Pacific Grove, CA: Wadsworth; 1992.
8. Chambers JM, Cleveland WS, Kleiner B, Tukey PA. Graphical Methods for Data Analysis. Wadsworth; 1983.
9. Ruckstuhl A, Stahel W, Maechler M, Hesterberg T. Sunflower. Statlib. Available at: http://lib.stat.cmu.edu/S/sunflower. (Accessed 1995).
10. Hall P. The Bootstrap and Edgeworth Expansion. New York: Springer; 1992.
11. Shao J, Tu D. The Jackknife and Bootstrap. New York: Springer-Verlag; 1995.
12. Silverman B, Young G. The bootstrap: to smooth or not to smooth. Biometrika 1987, 74:469–479.
13. Hall P, DiCiccio T, Romano J. On smoothing and the bootstrap. Ann Stat 1989, 17:692–704.
14. Efron B. Better bootstrap confidence intervals (with discussion). J Am Stat Assoc 1987, 82:171–200.
15. Hesterberg TC. Unbiasing the bootstrap—bootknife sampling vs. smoothing. Proceedings of the Section on Statistics & the Environment. American Statistical Association; 2004, 2924–2930.
16. DiCiccio TJ, Romano JP. A review of bootstrap confidence intervals (with discussion). J R Stat Soc B 1988, 50:338–354.
17. Hall P. Theoretical comparison of bootstrap confidence intervals (with discussion). Ann Stat 1988, 16:927–985.
18. DiCiccio T, Efron B. Bootstrap confidence intervals (with discussion). Stat Sci 1996, 11:189–228.
19. Efron B. Nonparametric standard errors and confidence intervals. Can J Stat 1981, 9:139–172.
20. DiCiccio TJ, Romano JP. Nonparametric confidence limits by resampling methods and least favorable families. Int Stat Rev 1990, 58:59–76.
21. Hesterberg TC. Bootstrap tilting confidence intervals and hypothesis tests. In: Berk K, Pourahmadi M, eds. Computer Science and Statistics: Proceedings of the 31st Symposium on the Interface, vol 31. Fairfax Station, VA: Interface Foundation of North America; 1999, 389–393.
22. Hesterberg TC. Bootstrap tilting confidence intervals. Technical Report 84, Research Department, MathSoft, Inc.; 1999.
23. Hyndman RJ, Fan Y. Sample quantiles in statistical packages. Am Stat 1996, 50:361–364.
24. DiCiccio T, Efron B. More accurate confidence intervals in exponential families. Biometrika 1992, 79:231–245.
25. DiCiccio TJ, Martin MA, Young GA. Analytic approximations to bootstrap distribution functions using saddlepoint methods. Technical Report 356, Department of Statistics, Stanford University; 1990.
26. Efron B. Censored data and the bootstrap. J Am Stat Assoc 1981, 76:312–319.
27. Hinkley DV. Bootstrap significance tests. Bull Int Stat Inst 1989, 53:65–74.
28. Owen A. Empirical likelihood ratio confidence intervals for a single functional. Biometrika 1988, 75:237–249.
29. Embury SH, Elias L, Heller PH, Hood CE, Greenberg PL, Schrier SL. Remission maintenance therapy in acute myelogenous leukemia. West J Med 1977, 126:267–272.
30. Insightful. S-PLUS 8 Guide to Statistics. 1700 Westlake Ave N., Suite 500, Seattle; 2007.
31. Hesterberg TC. Bootstrap tilting diagnostics. Proceedings of the Statistical Computing Section; 2001.
32. Hesterberg TC. Resampling for planning clinical trials—using S+Resample. Statistical Methods in Biopharmacy, Paris. Available at: http://home.comcast.net/∼timhesterberg/articles/Paris05-ResampleClinical.pdf. (Accessed 2011).
33. Hall P, Presnell B. Intentionally biased bootstrap methods. J R Stat Soc B 1999, 61:143–158.
34. Owen A. Empirical Likelihood. Chapman & Hall/CRC Press; 2001.
35. Gleason JR. Algorithms for balanced bootstrap simulations. Am Stat 1988, 42:263–266.
36. Therneau TM. Variance reduction techniques for the bootstrap. Technical Report No. 200, PhD thesis, Department of Statistics, Stanford University; 1983.
37. Johns MV. Importance sampling for bootstrap confidence intervals. J Am Stat Assoc 1988, 83:701–714.
38. Davison AC. Discussion of paper by D. V. Hinkley. J R Stat Soc B 1986, 50:356–357.
39. Hesterberg TC. Advances in importance sampling. PhD thesis, Statistics Department, Stanford University; 1988.
40. Hesterberg TC. Weighted average importance sampling and defensive mixture distributions. Technometrics 1995, 37:185–194.
41. Davison AC, Hinkley DV, Schechtman E. Efficient bootstrap simulation. Biometrika 1986, 73:555–566.
42. Efron B. More efficient bootstrap computations. J Am Stat Assoc 1990, 85:79–89.
43. Hesterberg TC. Control variates and importance sampling for efficient bootstrap simulations. Stat Comput 1996, 6:147–157.
44. Do KA, Hall P. Distribution estimation using concomitants of order statistics, with application to Monte Carlo simulations for the bootstrap. J R Stat Soc B 1992, 54:595–607.
45. Hesterberg TC. Fast bootstrapping by combining importance sampling and concomitants. Computing Science and Statistics 1997, 29:72–78.
46. Hesterberg TC. Tail-specific linear approximations for efficient bootstrap simulations. J Comput Graph Stat 1995, 4:113–133.
47. Do KA, Hall P. Quasi-random sampling for the bootstrap. Stat Comput 1991, 1:13–22.
48. Tingley M, Field C. Small-sample confidence intervals. J Am Stat Assoc 1990, 85:427–434.
49. Daniels HE, Young GA. Saddlepoint approximation for the studentized mean, with an application to the bootstrap. Biometrika 1991, 78:169–179.
50. Wang S. General saddlepoint approximations in the bootstrap. Stat Prob Lett 1992, 13:61–66.
51. DiCiccio TJ, Martin MA, Young GA. Analytical approximations to bootstrap distribution functions using saddlepoint methods. Stat Sin 1994, 4:281.
52. Canty AJ, Davison AC. Implementation of saddlepoint approximations to bootstrap distributions. In: Billard L, Fisher NI, eds. Computing Science and Statistics: Proceedings of the 28th Symposium on the Interface, vol 28. Fairfax Station, VA: Interface Foundation of North America; 1997, 248–253.
53. Hesterberg TC. Bootstrap. In: D'Agostino R, Sullivan L, Massaro J, eds. Wiley Encyclopedia of Clinical Trials. John Wiley & Sons; 2007.
FURTHER READING
Chernick MR. Bootstrap Methods: A Practitioner's Guide. New York: John Wiley & Sons; 1999. (An extensive bibliography, with roughly 1700 references related to the bootstrap.)
Hesterberg T, Monaghan S, Moore DS, Clipson A, Epstein R. Bootstrap Methods and Permutation Tests. W. H. Freeman; 2003. Chapter for The Practice of Business Statistics by Moore, McCabe, Duckworth, and Sclove. Available at: http://bcs.whfreeman.com/pbs/cat_160/PBS18.pdf. (Accessed 2011). (An introduction to the bootstrap written for introductory statistics students.)