
Introduction to the Bootstrap

Machelle D. Wilson
Outline
• Why the Bootstrap?
• Limitations of traditional statistics
• How Does It Work?
• The Empirical Distribution Function and the Plug-in Principle
• Accuracy of an Estimate: Bootstrap Standard Error and Confidence Intervals
• Examples
• How Good Is the Bootstrap?
Limitations of Traditional Statistics:
Problems with Distributional Assumptions
• Often data cannot safely be assumed to come from an identifiable distribution.
• Sometimes the distribution of the statistic is mathematically intractable, even when distributional assumptions can be made.
• Hence, the bootstrap often provides a superior alternative to parametric statistics.
An Example Data Set
[Figure: histogram of 1,000 bootstrapped means of mean dose, with mean concentration and dose rate fixed. Red lines = bootstrap CI; black lines = normal-theory CI.]
An Example Data Set
[Figure: histogram of 1,000 bootstrapped means of mean dose, with mean concentration and dose rate random. Red lines = bootstrap CI; black lines = normal-theory CI.]
Statistics in the Computer Age
• Efron and Tibshirani, 1991, in Science:
"Most of our familiar statistical methods, such as hypothesis testing, linear regression, analysis of variance, and maximum likelihood estimation, were designed to be implemented on mechanical calculators. Modern electronic computation has encouraged a host of new statistical methods that require fewer distributional assumptions than their predecessors and can be applied to more complicated statistical estimators…without the usual concerns for mathematical tractability."
The Bootstrap Solution
• With the advent of cheap, high-powered computing, it has become relatively easy to use resampling techniques such as the bootstrap to estimate the distribution of a sample statistic empirically, rather than by making distributional assumptions.
• The bootstrap resamples the data with equal probability and with replacement, and calculates the statistic of interest for each resample. The resulting histogram, mean, quantiles, and variance of the bootstrapped statistics provide an estimate of the statistic's distribution. A minimal sketch follows.
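As an illustration (added here, not part of the original slides), the resampling loop can be written in a few lines of R/S-Plus-style code. The data vector x, the number of resamples B, and the choice of the mean as the statistic are all assumptions for the sketch:

    # hypothetical data; the statistic of interest here is the mean
    x <- c(12, 15, 9, 22, 17, 11, 14, 19)

    B <- 1000                 # number of bootstrap resamples
    boot.means <- numeric(B)  # storage for the bootstrapped statistics
    for (b in 1:B) {
      # resample n observations with equal probability, with replacement
      x.star <- sample(x, size = length(x), replace = TRUE)
      boot.means[b] <- mean(x.star)
    }

    # the histogram, mean, quantiles, and variance of boot.means
    # estimate the distribution of the sample mean
    hist(boot.means)
    mean(boot.means)
    quantile(boot.means, c(0.025, 0.975))
    var(boot.means)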
Example
• Take the data set {1, 2, 3}. There are 10 possible resamplings, where re-orderings are considered the same sample:
{1,2,3}, {1,1,2}, {1,1,3}, {2,2,1}, {2,2,3}, {3,3,1}, {3,3,2}, {1,1,1}, {2,2,2}, {3,3,3}
[Figure: histogram of the bootstrapped means over the range 1 to 3.]
The Bootstrap Solution
• In general, the number of distinct bootstrap samples is
C_n = \binom{2n-1}{n-1}.
• Table of possible distinct bootstrap resamplings by sample size:

n      5     10      12         15         20          25          30
C_n    126   92,378  1.35x10^6  7.76x10^7  6.89x10^10  6.32x10^13  5.91x10^16
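The table entries can be checked directly from the formula. A one-line loop in the same R/S-Plus-style syntax (choose() is the binomial-coefficient function in R; its availability in S-Plus is assumed):

    for (n in c(5, 10, 12, 15, 20, 25, 30)) {
      # number of distinct resamples of size n: C(2n-1, n-1)
      cat(n, choose(2*n - 1, n - 1), "\n")
    }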

The Empirical Distribution Function
• Having observed a random sample of size n from a probability distribution F,
x = (x_1, x_2, \ldots, x_n),
the empirical distribution function (edf), \hat{F}, assigns to a set A in the sample space of x its empirical probability
\hat{F}(A) = \hat{P}(A) = \#\{x_i \in A\} / n.
Example
• A random sample of 100 throws of a die yields 13 ones, 19 twos, 10 threes, 17 fours, 14 fives, and 27 sixes. Hence the edf is

\hat{F}(1) = 0.13   \hat{F}(4) = 0.17
\hat{F}(2) = 0.19   \hat{F}(5) = 0.14
\hat{F}(3) = 0.10   \hat{F}(6) = 0.27
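In code, the edf of a discrete sample is just the table of relative frequencies. A sketch, where the throws vector is a hypothetical sample consistent with the counts above:

    # 100 die throws matching the example counts:
    # 13 ones, 19 twos, 10 threes, 17 fours, 14 fives, 27 sixes
    throws <- rep(1:6, times = c(13, 19, 10, 17, 14, 27))

    # empirical probability of each face: #(x_i = k) / n
    table(throws) / length(throws)
    #    1    2    3    4    5    6
    # 0.13 0.19 0.10 0.17 0.14 0.27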
The Plug-in Principle
• It can be shown that \hat{F} is a sufficient statistic for F.
• That is, all the information about F contained in x is also contained in \hat{F}.
• The plug-in principle estimates
\theta = T(F)
by
\hat{\theta} = T(\hat{F}).
The Plug-in Principle
• If the only information about F comes from the sample x, then \hat{\theta} = T(\hat{F}) is a minimum variance unbiased estimator of \theta.
• The bootstrap draws B samples from the empirical distribution to compute B realizations of the statistic of interest, \hat{\theta}^*.
• Hence, the bootstrap both samples from an edf (that of the original sample) and generates an edf (that of the statistic). A small illustration follows.
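As a small added illustration of the plug-in principle (not from the original slides), take \theta = T(F) = P(X > c). Plugging in \hat{F} gives the sample proportion of observations exceeding c; the values of x and c below are assumptions:

    # plug-in estimate of theta = P(X > c):
    # T(F-hat) = #(x_i > c) / n, the empirical probability of the set {x > c}
    x <- c(12, 15, 9, 22, 17, 11, 14, 19)
    c0 <- 16
    mean(x > c0)   # 3 of 8 observations exceed 16, so 0.375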
Graphical Representation of the Bootstrap

x = \{x_1, x_2, \ldots, x_n\}

x^{*1}   x^{*2}   x^{*3}   \ldots   x^{*B}

T(x^{*1})   T(x^{*2})   T(x^{*3})   \ldots   T(x^{*B})

\hat{\theta} = T(x), \qquad \bar{t} = \frac{1}{B} \sum_{b=1}^{B} T(x^{*b})

\hat{se}(T(x)) = \sqrt{ \frac{ \sum_{b=1}^{B} [ T(x^{*b}) - \bar{t} ]^2 }{ B - 1 } }
Bootstrap Standard Error and Confidence Intervals
• The bootstrap estimate of the mean is just the empirical average of the statistic over all bootstrap samples.
• The bootstrap estimate of standard error is just the empirical standard deviation of the bootstrap statistic over all bootstrap samples.
Bootstrap Confidence Intervals
• The percentile interval: the bootstrap confidence interval for any statistic runs from the α/2 quantile to the 1−α/2 quantile of the bootstrapped statistics.
• For example, if B = 1000 and α = 0.05, we rank the bootstrapped statistics and take the 25th and 975th values as the endpoints (shown in the sketch below).
• There are other bootstrap CIs, but this one is the easiest and makes the fewest assumptions.
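Using the boot.means vector from the earlier sketch (B = 1000 assumed), the 95% percentile interval can be read off directly:

    # 95% percentile interval: the 25th and 975th of the 1000 ranked statistics
    sort(boot.means)[c(25, 975)]
    # equivalently:
    quantile(boot.means, c(0.025, 0.975))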
Example: Bootstrap of the Median
• Go to Splus (a sketch of such a session appears below).
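The original slide defers to a live S-Plus demonstration. A minimal sketch of what that session might compute, using hypothetical data, is:

    # bootstrap of the median: standard error and 95% percentile interval
    x <- c(12, 15, 9, 22, 17, 11, 14, 19)   # hypothetical data
    B <- 1000
    med.star <- numeric(B)
    for (b in 1:B) {
      med.star[b] <- median(sample(x, size = length(x), replace = TRUE))
    }
    mean(med.star)                        # bootstrap mean of the median
    sqrt(var(med.star))                   # bootstrap standard error
    quantile(med.star, c(0.025, 0.975))   # 95% percentile interval
    hist(med.star)                        # bootstrap distribution of the median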
How Good Is the Bootstrap?
• The bootstrap, in most cases, is as good as the empirical distribution function.
• The bootstrap is not optimal when there is good information about F that did not come from the data, i.e., prior information or strong, valid distributional assumptions.
• The bootstrap does not work well for extreme values, and it needs somewhat difficult modifications for autocorrelated data such as time series.
• When all our information comes from the sample itself, we cannot do better than the bootstrap.
