02 Permutation Test

Permutation Test
Kosuke Imai
Harvard University
STAT 186/GOV 2002 Causal Inference
Fall 2019
Kosuke Imai (Harvard) Permutation Test Stat186/Gov2002 Fall 2019 1 / 15

Announcements
TF Section begins this week
First problem set will be posted on Friday
Gov 2002 students: find a project partner!

Randomized Controlled Trials (RCTs)
Why randomize treatment assignment in experiments?

1 makes the treatment and control groups “identical”
Joint distribution of all observed X and unobserved U pretreatment confounders is identical:
P(X, U | T = 1) = P(X, U | T = 0)
U includes potential outcomes {Y(1), Y(0)}
Treatment assignment is statistically independent of X and U:
{X, U} ⊥⊥ T and in particular {Y(1), Y(0)} ⊥⊥ T
Removes selection problem stochastically → controlled experiments
2 enables us to formally quantify the degree of uncertainty
Potential problems of RCTs: sample selection, placebo effects, noncompliance, missing data, spillover/carryover effects → roles of statistical methods

Randomization Inference vs. Model-based Inference
Randomization as the “reasoned basis for inference” (Fisher)

Randomness comes from the physical act of randomization, which
then can be used to make statistical inference
Also called design-based inference
Advantage: design justifies analysis
Contrast this with model-based inference, which assumes a distribution for potential outcomes
Advantage of model-based inference: flexibility

Lady Tasting Tea (Fisher 1935. The Design of Experiments. Oliver and Boyd)
Does tea taste different depending on whether the tea was poured
into the milk or whether the milk was poured into the tea?
8 cups; n = 8
Randomly choose 4 cups into which the tea is poured first (Ti = 1)
Null hypothesis: the lady cannot tell the difference
Sharp null – H0 : Yi (1) = Yi (0) for all i = 1, . . . , 8
Statistic: the number of correctly classified cups
The lady classified all 8 cups correctly!
Did this happen by chance?

Permutation Test
cups               guess  actual  scenarios . . .
1                  M      M       T  T
2                  T      T       T  T
3                  T      T       T  T
4                  M      M       T  M
5                  M      M       M  M
6                  T      T       M  M
7                  T      T       M  T
8                  M      M       M  M
correctly guessed         8       4  6
[Figure: frequency plot (left) and probability distribution (right); x-axis: number of correctly guessed cups (0 to 8)]
₈C₄ = 70 ways to do this, and each arrangement is equally likely

What is the p-value?
No assumption, but the sharp null may be of little interest
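
As a minimal sketch of how the p-value can be obtained, the snippet below enumerates all 70 equally likely assignments under the sharp null and tabulates the number of correctly classified cups. The cups guessed as tea-first (2, 3, 6, 7) follow the slide's table; the variable and function names are illustrative and not part of the original slides.

```python
from collections import Counter
from itertools import combinations

cups = range(1, 9)
guess_tea_first = {2, 3, 6, 7}   # the lady's guess, fixed under the sharp null
actual_tea_first = {2, 3, 6, 7}  # the realized assignment (all 8 cups correct)

def n_correct(tea_first):
    """Number of cups whose guessed label matches the given assignment."""
    return sum((c in guess_tea_first) == (c in tea_first) for c in cups)

# Reference distribution: every choice of 4 tea-first cups is equally likely.
counts = Counter(n_correct(set(a)) for a in combinations(cups, 4))
n_assignments = sum(counts.values())  # 70

observed = n_correct(actual_tea_first)
# One-sided p-value: probability of doing at least as well as observed.
p_value = sum(v for k, v in counts.items() if k >= observed) / n_assignments

print(sorted(counts.items()))
print(p_value)
```

The counts are 1, 16, 36, 16, 1 for 0, 2, 4, 6, 8 correct cups, so the p-value is 1/70, roughly 0.014.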

5% as the Threshold
“It is usual and convenient for experimenters to take 5 per cent. as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard.”
R. A. Fisher (1935). The Design of Experiments. Oliver & Boyd
Publication bias:
p-hacking
file drawer bias
Potential solutions:
pre-registration
statistical control of multiple testing
[Figure 1(a): histogram of z-statistics, APSR & AJPS (two-tailed); x-axis: z-statistic, y-axis: frequency. Width of bars (0.20) approximately represents 10% caliper. Dotted line represents critical z-statistic (1.96).] (Gerber and Malhotra. 2008. Q. J. Political Sci.)

Basic Setup
Units: i = 1, . . . , n
Treatment: Ti ∈ {0, 1}
Outcome: Yi = Yi (Ti )
Complete randomization of the treatment assignment

Exactly n1 units receive the treatment
n0 = n − n1 units are assigned to the control group
This differs from the Bernoulli randomization
We know the distribution of Ti:
$$\Pr(T_i = 1 \mid Y_i(1), Y_i(0)) = \frac{n_1}{n}$$
for all $i = 1, \dots, n$ and $\sum_{i=1}^n T_i = n_1$
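
The contrast between complete and Bernoulli randomization can be sketched as follows. This is an illustrative example, not code from the course: the function names, the seed, and the sample sizes are all assumptions.

```python
import random

def complete_randomization(n, n1, rng):
    """Assign exactly n1 of n units to treatment, all subsets equally likely."""
    treated = set(rng.sample(range(n), n1))
    return [1 if i in treated else 0 for i in range(n)]

def bernoulli_randomization(n, p, rng):
    """Assign each unit to treatment independently with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

rng = random.Random(0)  # seed chosen arbitrarily for reproducibility
n, n1 = 10, 4
draws = [complete_randomization(n, n1, rng) for _ in range(20000)]

# Every complete-randomization draw has exactly n1 treated units ...
all_fixed_size = all(sum(t) == n1 for t in draws)
# ... and each unit is treated with probability close to n1 / n.
share_unit0 = sum(t[0] for t in draws) / len(draws)

# Under Bernoulli randomization the treated-group size varies across draws.
bernoulli_sizes = {sum(bernoulli_randomization(n, n1 / n, rng)) for _ in range(2000)}

print(all_fixed_size, round(share_unit0, 2), sorted(bernoulli_sizes))
```

Both designs give each unit the same marginal treatment probability; only complete randomization fixes the number of treated units at n1.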

Fisher’s Exact Test
2 × 2 table:

                  Treated (T = 1)                  Control (T = 0)
Success (Y = 1)   $\sum_{i=1}^n T_i Y_i(1)$        $\sum_{i=1}^n (1 - T_i) Y_i(0)$
Failure (Y = 0)   $\sum_{i=1}^n T_i (1 - Y_i(1))$  $\sum_{i=1}^n (1 - T_i)(1 - Y_i(0))$
Total             $n_1$                            $n_0$

Test statistic: $S = \sum_{i=1}^n T_i Y_i(1)$
Under complete randomization and the sharp null of no treatment effect, the test statistic follows the hypergeometric distribution:
$$\Pr(S = s \mid O_n) = \frac{\binom{m}{s}\binom{n-m}{n_1-s}}{\binom{n}{n_1}}$$
where $m = \sum_{i=1}^n Y_i$ and $O_n = \{Y_i(0), Y_i(1)\}_{i=1}^n$.

Computation
Exact computation difficult when n is large
Monte Carlo approximation:

1 Fill in missing potential outcomes under the sharp null
2 Sample Ti according to complete randomization
3 Compute the test statistic
Can be made arbitrarily accurate by increasing number of draws
Analytical approximations:
$$E(S \mid O_n) = \frac{n_1 m}{n}, \quad \text{and} \quad V(S \mid O_n) = \frac{m n_0 n_1}{n(n-1)}\left(1 - \frac{m}{n}\right)$$
1 Normal: $\{S - E(S \mid O_n)\}/\sqrt{V(S \mid O_n)} \sim N(0, 1)$
2 Binomial($n_1$, $m/n$)
Becomes accurate as n grows
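
The three computational routes (exact, Monte Carlo, normal approximation) can be compared on a small example. The sketch below uses hypothetical data (10 treated units with 8 successes, 10 controls with 4) and a one-sided upper-tail p-value; the data, seed, and function names are assumptions made for illustration.

```python
import math
import random

def hypergeom_pmf(s, n, m, n1):
    """Pr(S = s | O_n): s successes among n1 treated, m successes among n units."""
    if s < max(0, n1 - (n - m)) or s > min(m, n1):
        return 0.0
    return math.comb(m, s) * math.comb(n - m, n1 - s) / math.comb(n, n1)

# Hypothetical data: 10 treated units with 8 successes, 10 controls with 4.
y = [1] * 8 + [0] * 2 + [1] * 4 + [0] * 6
t = [1] * 10 + [0] * 10
n, n1, m = len(y), sum(t), sum(y)
n0 = n - n1
s_obs = sum(ti * yi for ti, yi in zip(t, y))

# Exact one-sided p-value, Pr(S >= s_obs).
p_exact = sum(hypergeom_pmf(s, n, m, n1) for s in range(s_obs, min(m, n1) + 1))

# Monte Carlo: under the sharp null Y is fixed, so only T is redrawn.
# Sampling n1 outcomes without replacement gives the treated group's successes.
rng = random.Random(1)
n_draws = 20000
hits = sum(sum(rng.sample(y, n1)) >= s_obs for _ in range(n_draws))
p_mc = hits / n_draws

# Normal approximation using the moments of S (no continuity correction).
mean_s = n1 * m / n
var_s = m * n0 * n1 / (n * (n - 1)) * (1 - m / n)
z = (s_obs - mean_s) / math.sqrt(var_s)
p_normal = 0.5 * math.erfc(z / math.sqrt(2))

print(p_exact, p_mc, p_normal)
```

With n = 20 the Monte Carlo estimate tracks the exact value closely, while the normal approximation is noticeably off, which is consistent with the approximation becoming accurate only as n grows.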

The Project STAR (Mosteller. 1997. Bull. Am. Acad. Arts Sci.)
The Student-Teacher Achievement Ratio Project (1985–1989)

More than 10,000 students involved, at a cost of $12 million
Effects of class size in early grade levels
3 arms: Small class, Regular-sized class, Regular class with aide
Long-term impact of class size:

               Small class   Regular-sized class
Graduate       754           892
Not graduate   148           189
Total          902           1081

Exact p-value: 0.28 (one-sided), 0.55 (two-sided)
Asymptotic p-value: 0.26 (one-sided), 0.53 (two-sided)
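
The one-sided exact p-value and both asymptotic p-values can be approximately reproduced from the table counts with the hypergeometric formula and the normal approximation. This is a sketch using only the Python standard library; it assumes the one-sided alternative is the upper tail (more graduates in small classes) and that no continuity correction is applied. The variable names are illustrative.

```python
import math

# Counts from the Project STAR graduation table.
grad_small, grad_regular = 754, 892
n1, n0 = 902, 1081
n, m = n1 + n0, grad_small + grad_regular
s_obs = grad_small

def hypergeom_pmf(s):
    """Pr(S = s | O_n) under complete randomization and the sharp null."""
    return math.comb(m, s) * math.comb(n - m, n1 - s) / math.comb(n, n1)

# Exact one-sided p-value: Pr(S >= s_obs).
p_exact_one_sided = sum(hypergeom_pmf(s) for s in range(s_obs, min(m, n1) + 1))

# Asymptotic p-values from the normal approximation.
mean_s = n1 * m / n
var_s = m * n0 * n1 / (n * (n - 1)) * (1 - m / n)
z = (s_obs - mean_s) / math.sqrt(var_s)
p_normal_one_sided = 0.5 * math.erfc(z / math.sqrt(2))
p_normal_two_sided = math.erfc(abs(z) / math.sqrt(2))

print(round(p_exact_one_sided, 2), round(p_normal_one_sided, 2), round(p_normal_two_sided, 2))
```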

Rank-sum Tests
Fisher’s exact test assumes a binary outcome
Rank-sum tests are often used for continuous outcomes
Rank of the outcome for unit i: $R_i(\mathbf{Y})$
Wilcoxon’s rank-sum statistic:
$$S = \sum_{i=1}^n T_i \cdot R_i(\mathbf{Y})$$
1 symmetric
2 moments (assume no tie):
$$E(S \mid O_n) = \frac{n_1(n+1)}{2}, \quad V(S \mid O_n) = \frac{n_0 n_1 (n+1)}{12}$$
3 reference distribution does not depend on index
Mann-Whitney U test statistic:
$$U = S - \frac{n_1(n+1)}{2}$$

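
A small sketch of the rank-sum statistic and its moments, checked against the full randomization distribution. The outcomes and assignment are hypothetical and have no ties; the helper names are illustrative.

```python
from itertools import combinations

def ranks(values):
    """Ranks 1..n of the values; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

# Hypothetical outcomes and treatment assignment (no ties).
y = [3.1, 1.2, 5.0, 2.2, 4.4, 0.5]
t = [1, 0, 1, 0, 1, 0]
n, n1 = len(y), sum(t)
n0 = n - n1
r = ranks(y)

s = sum(ti * ri for ti, ri in zip(t, r))  # Wilcoxon rank-sum statistic
mean_s = n1 * (n + 1) / 2                 # E(S | O_n)
var_s = n0 * n1 * (n + 1) / 12            # V(S | O_n)

# Reference distribution: S over all complete randomizations of n1 treated units.
ref = [sum(r[i] for i in idx) for idx in combinations(range(n), n1)]
ref_mean = sum(ref) / len(ref)
ref_var = sum((x - ref_mean) ** 2 for x in ref) / len(ref)
p_one_sided = sum(x >= s for x in ref) / len(ref)

print(s, mean_s, var_s, ref_mean, ref_var, p_one_sided)
```

Here the treated units hold ranks 4, 5, and 6, so S = 15, and the enumerated mean and variance (10.5 and 5.25) agree with the closed-form moments.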
The Project STAR Revisited
Effect of kindergarten class size on 8th grade reading score:
[Figure: density plots of 8th grade reading scores for small class (left) and regular class (right); x-axis: 400 to 900, y-axis: density]
Wilcoxon’s rank-sum test (there are some ties):

p-value < 0.001

General Procedure for Permutation Tests
1 Specify sharp null hypothesis
Typically, $H_0: Y_i(1) - Y_i(0) = \tau_{0i}$ where we set $\tau_{0i} = 0$ for all i
No effect implies no heterogeneous effect, no spillover effect, etc.
2 Choose a test statistic $S = f(\{Y_i, T_i, \tau_{0i}\}_{i=1}^n)$
Fisher’s exact test statistic: $S = \sum_{i=1}^n T_i (Y_i - \tau_{0i})$
Other commonly used test statistics include rank sum and difference-in-means
3 Compute the reference distribution and p-value based on the randomization distribution of treatment assignment
Exact distribution in small samples
Large-sample approximation
Monte Carlo approximation as a general strategy
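
The three steps can be sketched for a small sample with exact enumeration and a pluggable test statistic. This example assumes the sharp null of no effect (τ0i = 0), complete randomization, and a one-sided upper-tail p-value; the data and function names are hypothetical.

```python
from itertools import combinations

def diff_in_means(y, treated):
    """Difference in mean outcomes between treated and control units."""
    y1 = [y[i] for i in range(len(y)) if i in treated]
    y0 = [y[i] for i in range(len(y)) if i not in treated]
    return sum(y1) / len(y1) - sum(y0) / len(y0)

def permutation_test(y, t, stat=diff_in_means):
    """Exact one-sided p-value under the sharp null of no effect."""
    n = len(y)
    observed_set = {i for i in range(n) if t[i] == 1}
    # Steps 1 and 2: under the sharp null the outcomes are fixed, so the
    # statistic can be evaluated for any hypothetical assignment.
    s_obs = stat(y, observed_set)
    # Step 3: reference distribution over all complete randomizations.
    ref = [stat(y, set(idx)) for idx in combinations(range(n), len(observed_set))]
    return s_obs, sum(s >= s_obs for s in ref) / len(ref)

y = [51, 63, 48, 72, 39, 41, 50, 35]  # hypothetical outcomes
t = [1, 1, 1, 1, 0, 0, 0, 0]          # hypothetical assignment
s_obs, p_value = permutation_test(y, t)
print(s_obs, p_value)
```

Only 2 of the 70 assignments give a difference in means at least as large as the observed 17.25, so the p-value is 2/70, roughly 0.029. For larger n the enumeration would be replaced by random draws of the assignment.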

Summary
Randomization of treatment assignment as a basis for inference

design-based, assumption-free inference
Inference over repeated (hypothetical) randomization
sample inference rather than population inference
Sharp null hypothesis:

implies no effect for every unit
may not be of interest but serves as a starting point of analysis
Reading: Imbens and Rubin, Chapter 5
