
Introduction to the Bootstrap

Machelle D. Wilson
Outline
• Why the Bootstrap?
• Limitations of traditional statistics
• How Does It Work?
• The Empirical Distribution Function and the Plug-in Principle
• Accuracy of an Estimate: Bootstrap Standard Error and Confidence Intervals
• Examples
• How Good Is the Bootstrap?
Limitations of Traditional Statistics:
Problems with Distributional Assumptions
• Often data cannot safely be assumed to come from an identifiable distribution.
• Sometimes the distribution of the statistic is mathematically intractable, even when distributional assumptions can be made.
• Hence, the bootstrap often provides a superior alternative to parametric statistics.
An Example Data Set
[Figure: histogram of 1,000 bootstrapped means of mean dose, with mean concentration and dose rate fixed. Red lines = bootstrap CI; black lines = normal-theory CI.]
An Example Data Set
[Figure: histogram of 1,000 bootstrapped means of mean dose, with mean concentration and dose rate random. Red lines = bootstrap CI; black lines = normal-theory CI.]
Statistics in the Computer Age
• Efron and Tibshirani, 1991, in Science:
"Most of our familiar statistical methods, such as hypothesis testing, linear regression, analysis of variance, and maximum likelihood estimation, were designed to be implemented on mechanical calculators. Modern electronic computation has encouraged a host of new statistical methods that require fewer distributional assumptions than their predecessors and can be applied to more complicated statistical estimators…without the usual concerns for mathematical tractability."
The Bootstrap Solution
• With the advent of cheap, high-powered computing, it has become relatively easy to use resampling techniques such as the bootstrap to estimate the distribution of a sample statistic empirically, rather than by making distributional assumptions.
• The bootstrap resamples the data with equal probability and with replacement, and calculates the statistic of interest for each resample. The resulting histogram, mean, quantiles, and variance of the bootstrapped statistics provide an estimate of the statistic's distribution. A minimal sketch follows.
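As an illustration (added here, not part of the original slides), the resampling loop can be written in a few lines of R/S-Plus-style code. The data vector x, the number of resamples B, and the choice of the mean as the statistic are all assumptions for the sketch:

    # hypothetical data; the statistic of interest here is the mean
    x <- c(12, 15, 9, 22, 17, 11, 14, 19)

    B <- 1000                 # number of bootstrap resamples
    boot.means <- numeric(B)  # storage for the bootstrapped statistics
    for (b in 1:B) {
      # resample n observations with equal probability, with replacement
      x.star <- sample(x, size = length(x), replace = TRUE)
      boot.means[b] <- mean(x.star)
    }

    # the histogram, mean, quantiles, and variance of boot.means
    # estimate the distribution of the sample mean
    hist(boot.means)
    mean(boot.means)
    quantile(boot.means, c(0.025, 0.975))
    var(boot.means)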
Example
• Take the data set {1, 2, 3}. There are 10 possible resamplings, where re-orderings are considered the same sample:
{1,2,3}, {1,1,2}, {1,1,3}, {2,2,1}, {2,2,3}, {3,3,1}, {3,3,2}, {1,1,1}, {2,2,2}, {3,3,3}
[Figure: histogram of the bootstrapped means over the range 1 to 3.]
The Bootstrap Solution
• In general, the number of distinct bootstrap samples is
C_n = \binom{2n-1}{n-1}.
• Table of possible distinct bootstrap resamplings by sample size:

n      5     10      12         15         20          25          30
C_n    126   92,378  1.35x10^6  7.76x10^7  6.89x10^10  6.32x10^13  5.91x10^16
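The table entries can be checked directly from the formula. A one-line loop in the same R/S-Plus-style syntax (choose() is the binomial-coefficient function in R; its availability in S-Plus is assumed):

    for (n in c(5, 10, 12, 15, 20, 25, 30)) {
      # number of distinct resamples of size n: C(2n-1, n-1)
      cat(n, choose(2*n - 1, n - 1), "\n")
    }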

The Empirical Distribution Function
• Having observed a random sample of size n from a probability distribution F,
x = (x_1, x_2, \ldots, x_n),
the empirical distribution function (edf), \hat{F}, assigns to a set A in the sample space of x its empirical probability
\hat{F}(A) = \hat{P}(A) = \#\{x_i \in A\} / n.
Example
• A random sample of 100 throws of a die yields 13 ones, 19 twos, 10 threes, 17 fours, 14 fives, and 27 sixes. Hence the edf is

\hat{F}(1) = 0.13   \hat{F}(4) = 0.17
\hat{F}(2) = 0.19   \hat{F}(5) = 0.14
\hat{F}(3) = 0.10   \hat{F}(6) = 0.27
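In code, the edf of a discrete sample is just the table of relative frequencies. A sketch, where the throws vector is a hypothetical sample consistent with the counts above:

    # 100 die throws matching the example counts:
    # 13 ones, 19 twos, 10 threes, 17 fours, 14 fives, 27 sixes
    throws <- rep(1:6, times = c(13, 19, 10, 17, 14, 27))

    # empirical probability of each face: #(x_i = k) / n
    table(throws) / length(throws)
    #    1    2    3    4    5    6
    # 0.13 0.19 0.10 0.17 0.14 0.27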
The Plug-in Principle
• It can be shown that \hat{F} is a sufficient statistic for F.
• That is, all the information about F contained in x is also contained in \hat{F}.
• The plug-in principle estimates
\theta = T(F)
by
\hat{\theta} = T(\hat{F}).
The Plug-in Principle
• If the only information about F comes from the sample x, then \hat{\theta} = T(\hat{F}) is a minimum variance unbiased estimator of \theta.
• The bootstrap draws B samples from the empirical distribution to compute B realizations of the statistic of interest, \hat{\theta}^*.
• Hence, the bootstrap both samples from an edf (that of the original sample) and generates an edf (that of the statistic). A small illustration follows.
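As a small added illustration of the plug-in principle (not from the original slides), take \theta = T(F) = P(X > c). Plugging in \hat{F} gives the sample proportion of observations exceeding c; the values of x and c below are assumptions:

    # plug-in estimate of theta = P(X > c):
    # T(F-hat) = #(x_i > c) / n, the empirical probability of the set {x > c}
    x <- c(12, 15, 9, 22, 17, 11, 14, 19)
    c0 <- 16
    mean(x > c0)   # 3 of 8 observations exceed 16, so 0.375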
Graphical Representation of the Bootstrap

x = \{x_1, x_2, \ldots, x_n\}

x^{*1}   x^{*2}   x^{*3}   \ldots   x^{*B}

T(x^{*1})   T(x^{*2})   T(x^{*3})   \ldots   T(x^{*B})

\hat{\theta} = T(x), \qquad \bar{t} = \frac{1}{B} \sum_{b=1}^{B} T(x^{*b})

\hat{se}(T(x)) = \sqrt{ \frac{ \sum_{b=1}^{B} [ T(x^{*b}) - \bar{t} ]^2 }{ B - 1 } }
Bootstrap Standard Error and Confidence Intervals
• The bootstrap estimate of the mean is just the empirical average of the statistic over all bootstrap samples.
• The bootstrap estimate of standard error is just the empirical standard deviation of the bootstrap statistic over all bootstrap samples.
Bootstrap Confidence Intervals
• The percentile interval: the bootstrap confidence interval for any statistic runs from the α/2 quantile to the 1−α/2 quantile of the bootstrapped statistics.
• For example, if B = 1000 and α = 0.05, we rank the bootstrapped statistics and take the 25th and 975th values as the endpoints (shown in the sketch below).
• There are other bootstrap CIs, but this one is the easiest and makes the fewest assumptions.
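Using the boot.means vector from the earlier sketch (B = 1000 assumed), the 95% percentile interval can be read off directly:

    # 95% percentile interval: the 25th and 975th of the 1000 ranked statistics
    sort(boot.means)[c(25, 975)]
    # equivalently:
    quantile(boot.means, c(0.025, 0.975))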
Example: Bootstrap of the Median
• Go to Splus (a sketch of such a session appears below).
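The original slide defers to a live S-Plus demonstration. A minimal sketch of what that session might compute, using hypothetical data, is:

    # bootstrap of the median: standard error and 95% percentile interval
    x <- c(12, 15, 9, 22, 17, 11, 14, 19)   # hypothetical data
    B <- 1000
    med.star <- numeric(B)
    for (b in 1:B) {
      med.star[b] <- median(sample(x, size = length(x), replace = TRUE))
    }
    mean(med.star)                        # bootstrap mean of the median
    sqrt(var(med.star))                   # bootstrap standard error
    quantile(med.star, c(0.025, 0.975))   # 95% percentile interval
    hist(med.star)                        # bootstrap distribution of the median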
How Good Is the Bootstrap?
• The bootstrap, in most cases, is as good as the empirical distribution function.
• The bootstrap is not optimal when there is good information about F that did not come from the data, i.e., prior information or strong, valid distributional assumptions.
• The bootstrap does not work well for extreme values, and it needs somewhat difficult modifications for autocorrelated data such as time series.
• When all our information comes from the sample itself, we cannot do better than the bootstrap.
