Chapter 3: Randomized Experiments, Advanced Inference
2023-03-19
Sample vs population

In social science, we usually care about the population, not just the sample we happen to have. This means that we have two types of ATE.

Sample average treatment effect (SATE):

$$\text{SATE} = \frac{1}{n}\sum_{i=1}^{n}\left[Y_{1i} - Y_{0i}\right]$$

Population average treatment effect (PATE):

$$\text{PATE} = E[Y_{1i} - Y_{0i}]$$

The difference-in-means estimator of the SATE is

$$\hat{\tau} = \frac{1}{n_1}\sum_{i=1}^{n} D_i Y_i - \frac{1}{n_0}\sum_{i=1}^{n} (1 - D_i) Y_i$$
Proof that τ̂ is an unbiased estimator of SATE
$$E[\hat{\tau} \mid O] = \frac{1}{n_1}\sum_{i=1}^{n} E(D_i \mid O)\, Y_{1i} - \frac{1}{n_0}\sum_{i=1}^{n} \left(1 - E(D_i \mid O)\right) Y_{0i}$$

By definition,

$$E[D_i \mid O] = \frac{n_1}{n}$$

and

$$1 - E(D_i \mid O) = \frac{n_0}{n}$$

So plugging those back in, we get

$$E[\hat{\tau} \mid O] = \frac{1}{n_1}\sum_{i=1}^{n} \frac{n_1}{n} Y_{1i} - \frac{1}{n_0}\sum_{i=1}^{n} \frac{n_0}{n} Y_{0i} = \frac{1}{n}\sum_{i=1}^{n} Y_{1i} - \frac{1}{n}\sum_{i=1}^{n} Y_{0i} = \frac{1}{n}\sum_{i=1}^{n} (Y_{1i} - Y_{0i}) = \text{SATE}$$
Given the law of iterated expectations, $E[Y] = E[E[Y \mid X]] = \sum_x E[Y \mid X = x]\, P_X(x)$.

Based on the above proof, we know that $E[\hat{\tau} \mid O] = \text{SATE}$. If the sample is a random sample from the population, that means

$$E[\hat{\tau}] = E\big[E[\hat{\tau} \mid O]\big] = E[\text{SATE}] = \text{PATE},$$

so $\hat{\tau}$ is unbiased for the PATE as well.
In social science, obtaining a true random sample is often impossible. In such a case, focus on the SATE and interpret the estimate as such: it is internally valid, but not externally valid.
Because of random treatment assignment, we know that PATE = PATT.
Variance for within-sample inference
This is very similar to the population-level variance from the previous lecture.

The exact variance of $\hat{\tau}$ for within-sample inference (i.e., for the SATE) is not identifiable:

$$V(\hat{\tau} \mid O) = \frac{1}{n}\left(\frac{n_0}{n_1} S_1^2 + \frac{n_1}{n_0} S_0^2 + 2 S_{01}\right)$$

where $S_1^2$ is the sample variance of $Y_{1i}$, $S_0^2$ is the sample variance of $Y_{0i}$, and $S_{01}$ is the sample covariance of $Y_{1i}$ and $Y_{0i}$, which is never observed because we never see both potential outcomes for the same unit.

As before, the feasible estimator is biased upward and therefore more conservative:

$$\widehat{V}(\hat{\tau} \mid O) = \frac{S_1^2}{n_1} + \frac{S_0^2}{n_0} \geq V(\hat{\tau} \mid O)$$
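As a concrete illustration, here is a minimal Python sketch (function and variable names are my own, and the simulated numbers are made up) that computes the difference-in-means estimate and the conservative Neyman standard error from the formula above:

```python
import numpy as np

def neyman_estimate(y, d):
    """Difference-in-means estimate of the ATE with the conservative
    (Neyman) variance estimate S1^2/n1 + S0^2/n0."""
    y, d = np.asarray(y, dtype=float), np.asarray(d, dtype=int)
    y1, y0 = y[d == 1], y[d == 0]
    tau_hat = y1.mean() - y0.mean()
    # ddof=1 gives the sample variances S1^2 and S0^2
    var_hat = y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0)
    return tau_hat, np.sqrt(var_hat)

# Hypothetical simulated experiment
rng = np.random.default_rng(0)
d = rng.binomial(1, 0.5, size=200)
y = 1.0 * d + rng.normal(size=200)
tau_hat, se = neyman_estimate(y, d)
print(f"tau_hat = {tau_hat:.3f}, SE = {se:.3f}")
```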
• Because of the LLN:
  – $\text{SATE} \xrightarrow{p} \text{PATE}$
  – $\hat{\tau} \xrightarrow{p} \text{PATE}$
• Because of the CLT:
  – $\hat{\tau}$ is asymptotically normal: $\hat{\tau} \overset{a}{\sim} N(\text{PATE}, V(\hat{\tau}))$
Cluster randomization
Cluster randomization is when treatment is randomly assigned at the level of groups (clusters) rather than individuals: every unit within a cluster shares the same treatment status, while outcomes are still measured at the individual level.
Reasons to use cluster randomization:
• Treatment only makes sense at group level but outcome is measured for individuals (e.g., treatment is
a new curriculum)
• Treatment too costly to implement individually
• SUTVA only plausible if treatment is defined at the group level
Cons
• Much bigger standard error, because of “clustering” (the fact that units within the same cluster are
typically more similar than units in different clusters)
Intracluster correlation
The Law of Total Variance tells us that the total variance is the mean of the "within" variance plus the "between" variance:

$$V(Y) = E[V(Y \mid X)] + V(E[Y \mid X])$$

In this equation:
• E[V(Y |X)] tells us the average of the within-group variance by measuring how much Y varies within
each group of observations defined by the values of X
• V(E[Y |X]) tells us the variance of the conditional means of Y given X by measuring how much the
mean of Y varies across different groups defined by the values of X. In other words, it measures how
much the conditional means of Y deviate from the overall mean.
$$\sum_{j=1}^{G} \sum_{i=1}^{N_j} (Y_{dij} - \bar{\bar{Y}}_d)^2 = \sum_{j=1}^{G} \sum_{i=1}^{N_j} (Y_{dij} - \bar{Y}_{dj})^2 + \sum_{j=1}^{G} N_j (\bar{Y}_{dj} - \bar{\bar{Y}}_d)^2$$

for $d = 0, 1$, where $G$ is the number of clusters, $N_j$ is the number of observations in cluster $j$, $\bar{Y}_{dj}$ is the mean of $Y_{dij}$ in cluster $j$, and $\bar{\bar{Y}}_d$ is the global mean of $Y_{dij}$.
I think the point of this is that there is a difference between within-cluster variation and between-cluster variation.
From this decomposition, we can define the intracluster correlation coefficient:

$$\rho = \frac{\sigma_B^2}{\sigma^2} = 1 - \frac{\sigma_W^2}{\sigma^2}$$

where $\sigma_B^2$ is the between-cluster variance, $\sigma_W^2$ is the within-cluster variance, and $\sigma^2 = \sigma_B^2 + \sigma_W^2$ is the total variance.

When $\rho = 1$, responses are identical within each cluster $j$; when $\rho = 0$, responses within a cluster are no more correlated with each other than with responses from other clusters.
We can show that cluster randomization inflates the sampling variance approximately by the Moulton
factor
$$\frac{V(\hat{\tau}_{\text{clustered}})}{V(\hat{\tau}_{\text{complete}})} \approx 1 + (\bar{N} - 1)\rho$$

where

$$\bar{N} = \frac{1}{G}\sum_{j=1}^{G} N_j$$

is the average cluster size.
So to calculate a clustered SE, you can calculate the usual SE and multiply by the square root of the Moulton factor, though more often people just use cluster-robust SEs.
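A rough sketch of how one might estimate $\rho$ and the Moulton design effect from data; the ANOVA-style plug-in below is one common approximation (function and variable names are my own, and the estimator is only exact for equal cluster sizes):

```python
import numpy as np

def moulton_factor(y, cluster):
    """Approximate Moulton design effect 1 + (Nbar - 1) * rho, with rho
    estimated by a one-way ANOVA plug-in (approximate if cluster sizes differ)."""
    y = np.asarray(y, dtype=float)
    cluster = np.asarray(cluster)
    groups = [y[cluster == g] for g in np.unique(cluster)]
    n_bar = np.mean([len(g) for g in groups])
    msw = np.mean([g.var(ddof=1) for g in groups])            # within mean square
    msb = n_bar * np.var([g.mean() for g in groups], ddof=1)  # between mean square
    rho = (msb - msw) / (msb + (n_bar - 1.0) * msw)
    return 1.0 + (n_bar - 1.0) * rho, rho

# A clustered SE can then be approximated as the usual SE times sqrt(deff):
# deff, rho = moulton_factor(y, cluster_id)
# se_clustered_approx = se_iid * np.sqrt(deff)
```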
Therefore, as with complete randomization, you can do valid inference via regression:
• The sample regression (difference-in-means) estimator is unbiased for τ if clusters are equally sized.
  – Possible bias if cluster sizes vary and are correlated with potential outcomes
• Valid (conservative) standard error estimates can be obtained via cluster-robust standard errors
• When G is small (when there are few clusters), ρ will be poorly estimated, and clustered SEs will be unreliable.
• When given the choice, increase the number of clusters rather than the sample size per cluster; the number of clusters is what drives the asymptotics.
Consider the simple (no-intercept) regression model

$$y_i = \tau D_i + u_i, \qquad E[u_i] = 0$$

where

$$V[\hat{\tau}] = V\left[\sum_{i=1}^{n} D_i u_i\right] \Big/ \left(\sum_{i=1}^{n} D_i^2\right)^2$$

Different sets of assumptions lead us to different estimates of this variance: $V[\hat{\tau}]$, $V_{het}[\hat{\tau}]$, or $V_{clu}[\hat{\tau}]$.

1. If we assume no correlation across errors,

$$\text{Cov}[u_i, u_j] = 0,$$

. . . and homoskedasticity,

$$V[u_i] = \sigma^2,$$

. . . then

$$V[\hat{\tau}] = \frac{\sigma^2}{\sum_{i=1}^{n} D_i^2}$$
2. If we still assume no correlation across errors,

$$\text{Cov}[u_i, u_j] = 0$$

[Even if they share the same cluster? Does this implicitly assume no clustering?]

. . . but assume heteroskedasticity instead of homoskedasticity, then

$$V_{het}[\hat{\tau}] = \frac{\sum_{i=1}^{n} D_i^2\, V[u_i]}{\left(\sum_{i=1}^{n} D_i^2\right)^2}$$
3. If we instead allow errors to be correlated within (but not across) clusters, then

$$V_{clu}[\hat{\tau}] = \frac{\sum_{i=1}^{n} \sum_{j} D_i D_j \, u_i u_j \, \mathbf{1}[i, j \text{ in the same cluster}]}{\left(\sum_{i=1}^{n} D_i^2\right)^2}$$
I'm a little confused, but I think this is just saying that:

1. When you have no correlation between $u_i$ and $u_j$, homoskedasticity, and no clustering, you use $V[\hat{\tau}]$.
2. When you have no correlation between $u_i$ and $u_j$, heteroskedasticity, but still no clustering, you use $V_{het}[\hat{\tau}]$.
3. When you have no correlation between $u_i$ and $u_j$ (except when observations are in the same cluster), heteroskedasticity, and clustering, you use $V_{clu}[\hat{\tau}]$.
The key assumption about errors (aside from the zero conditional mean assumption) is which variance-covariance structure we impose on the $u_i$: homoskedastic and independent, heteroskedastic but independent, or correlated within clusters.
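As a hedged illustration of the practical point, the sketch below simulates a cluster-randomized design and compares the default (i.i.d.) OLS standard error with a cluster-robust one, using statsmodels' `cov_type="cluster"` option; all data-generating numbers are hypothetical:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated cluster-randomized data: treatment is assigned at the cluster
# level and a cluster-level random effect induces rho > 0.
rng = np.random.default_rng(1)
G, n_per = 40, 25
cluster = np.repeat(np.arange(G), n_per)
d = rng.binomial(1, 0.5, size=G)[cluster]                  # cluster-level assignment
y = 0.5 * d + rng.normal(size=G)[cluster] + rng.normal(size=G * n_per)
df = pd.DataFrame({"y": y, "d": d, "cluster": cluster})

fit_iid = smf.ols("y ~ d", data=df).fit()                  # assumes Cov[u_i, u_j] = 0
fit_clu = smf.ols("y ~ d", data=df).fit(cov_type="cluster",
                                        cov_kwds={"groups": df["cluster"]})
print(fit_iid.bse["d"], fit_clu.bse["d"])                  # clustered SE is typically larger
```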
Block Randomization
• Adjustment before randomization is better than covariate adjustment after randomization. Therefore,
you should “Block what you can; randomize what you can’t.”
Setup:
• Units: i = 1, ..., N
• Blocks: j = 1, ..., M
• Nj : number of units in block j
• pj : treatment probability in block j
  – $p_j = N_{1j}/N_j$ for within-block complete randomization.
• If pj = p for all j (identical for each block), then the pooled difference in means is unbiased for the
ATE:
$$\hat{\tau} = \frac{1}{N_1}\sum_{i=1}^{N} D_i Y_i - \frac{1}{N_0}\sum_{i=1}^{N} (1 - D_i) Y_i$$
If pj varies across blocks, then τ̂ will be biased, because Dij is no longer independent of blocks when pj
varies.
Instead, a weighted average of the block-specific differences in means will be unbiased:
$$\hat{\tau}_B = \sum_{j=1}^{M} \frac{N_j}{N}\, \hat{\tau}_j$$

where

$$\hat{\tau}_j = \frac{1}{N_{1j}}\sum_{i=1}^{N_j} D_{ij} Y_{ij} - \frac{1}{N_{0j}}\sum_{i=1}^{N_j} (1 - D_{ij}) Y_{ij}$$

where $N_{1j}$ is the number of treated units in block $j$, and $N_{0j} = N_j - N_{1j}$.
Because the randomizations in each block are independent, the sampling variance of the weighted-average
estimator is simply:
$$V(\hat{\tau}_B) = \sum_{j=1}^{M} \left(\frac{N_j}{N}\right)^2 V(\hat{\tau}_j)$$
Component variance can be estimated conservatively via the Neyman formula for each block:
$$\widehat{V}(\hat{\tau}_j) = \frac{S_{1j}^2}{N_{1j}} + \frac{S_{0j}^2}{N_{0j}}$$
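A minimal Python sketch of the blocked estimator $\hat{\tau}_B$ and its conservative variance as defined above (function and variable names are my own):

```python
import numpy as np

def blocked_estimate(y, d, block):
    """Weighted average of block-specific differences in means,
    tau_B = sum_j (Nj/N) tau_j, with conservative variance
    V(tau_B) = sum_j (Nj/N)^2 (S1j^2/N1j + S0j^2/N0j)."""
    y, d, block = np.asarray(y, float), np.asarray(d, int), np.asarray(block)
    N = len(y)
    tau_b, var_b = 0.0, 0.0
    for j in np.unique(block):
        yj, dj = y[block == j], d[block == j]
        y1, y0 = yj[dj == 1], yj[dj == 0]
        w = len(yj) / N                                  # Nj / N
        tau_b += w * (y1.mean() - y0.mean())
        var_b += w**2 * (y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
    return tau_b, np.sqrt(var_b)
```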
• If pj = p, OLS with block dummies (fixed effects, Bij ) will yield an unbiased estimate for ATE.
$$Y_i = \alpha + \tau D_i + \sum_{j=2}^{M} \beta_j B_{ij} + \epsilon_i$$

where

$$E[\hat{\tau}_{OLS}] = \tau$$
If $p_j$ varies across blocks, observations can instead be weighted by the inverse of their assignment probability. For $i$ in block $j$:

$$w_{ij} = \frac{1}{p_j} \ \text{if}\ D_i = 1, \qquad w_{ij} = \frac{1}{1 - p_j} \ \text{if}\ D_i = 0$$
• One can also use the interacted OLS ("Lin estimator") specification.

Comparing the two regression specifications, complete randomization:

$$Y_i = \alpha + \tau_{CR} D_i + \epsilon_i$$

Block randomization:

$$Y_i = \alpha + \tau_{BR} D_i + \sum_{j=2}^{M} \beta_j B_{ij} + \epsilon_i^*$$
We get:

$$V[\hat{\tau}_{CR}] = \frac{\sigma_\epsilon^2}{\sum_{i=1}^{n} (D_i - \bar{D})^2}, \qquad \hat{\sigma}_\epsilon^2 = \frac{SSR_{\hat{\epsilon}}}{N - 2}$$

and

$$V[\hat{\tau}_{BR}] = \frac{\sigma_{\epsilon^*}^2}{\sum_{i=1}^{n} (D_i - \bar{D})^2 \,(1 - R_j^2)}, \qquad \hat{\sigma}_{\epsilon^*}^2 = \frac{SSR_{\hat{\epsilon}^*}}{N - M - 1}$$

where $R_j^2$ is the $R^2$ from a regression of $D_i$ on the $B_{ij}$ dummies and a constant. If there is a high correlation between the block indicators and the treatment, then $(1 - R_j^2)$ will be very small and the variance will blow up. In the experimental case, your block indicators shouldn't be very correlated with your treatment unless you have really unequal probabilities of treatment across blocks, so $(1 - R_j^2)$ will basically be 1 (ignorable).
Therefore, in the experimental case, the denominators of these two variances will be basically the same, so the comparison comes down to the residual variances. All told, in randomized experiments $\hat{V}[\hat{\tau}_{BR}] < \hat{V}[\hat{\tau}_{CR}]$ if

$$\frac{SSR_{\hat{\epsilon}^*}}{N - M - 1} < \frac{SSR_{\hat{\epsilon}}}{N - 2}$$

The denominators here don't matter that much in large experiments, because the differences in degrees of freedom will be marginal.

Assuming that you block on things that are correlated with Y, the block dummies will soak up a lot of the variance, so the blocked SSR will be smaller than in the case of complete randomization.

In other words, if you design your research well (block on covariates that predict the outcome), then $\hat{V}[\hat{\tau}_{BR}] < \hat{V}[\hat{\tau}_{CR}]$ and blocking buys you precision.
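A small simulated check of this claim (all numbers hypothetical): fit the complete-randomization and block-dummy specifications on data where blocks strongly predict the outcome, and compare the standard errors on $D_i$:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated blocked experiment: blocks strongly predict Y.
rng = np.random.default_rng(2)
M, n_per = 20, 30
block = np.repeat(np.arange(M), n_per)
d = np.concatenate([rng.permutation([0] * (n_per // 2) + [1] * (n_per // 2))
                    for _ in range(M)])                  # complete randomization within blocks
y = 0.3 * d + 2.0 * rng.normal(size=M)[block] + rng.normal(size=M * n_per)
df = pd.DataFrame({"y": y, "d": d, "block": block})

fit_cr = smf.ols("y ~ d", data=df).fit()                 # tau_CR: no block dummies
fit_br = smf.ols("y ~ d + C(block)", data=df).fit()      # tau_BR: block fixed effects
print(fit_cr.bse["d"], fit_br.bse["d"])                  # block dummies soak up variance
```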
Randomization Inference

Instead of the typical test of differences in means that we'd use with large N, randomization inference treats the potential outcomes as fixed and tests the sharp null hypothesis of no effect for any unit ($Y_{1i} = Y_{0i}$ for all $i$): the observed test statistic is compared to its distribution across all possible treatment assignments.

Pros:

Cons:
• The sharp null is almost always false, so is it really that meaningful to reject it?
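A minimal sketch of randomization inference via Monte Carlo permutations of the treatment vector (an approximation to enumerating all possible assignments; function and variable names are my own):

```python
import numpy as np

def ri_pvalue(y, d, n_perm=10_000, seed=0):
    """Monte Carlo randomization-inference p-value for the sharp null
    Y_1i = Y_0i for all i, using the difference in means as the statistic."""
    rng = np.random.default_rng(seed)
    y, d = np.asarray(y, float), np.asarray(d, int)
    obs = y[d == 1].mean() - y[d == 0].mean()
    stats = np.empty(n_perm)
    for b in range(n_perm):
        d_perm = rng.permutation(d)              # re-randomize, holding n1 fixed
        stats[b] = y[d_perm == 1].mean() - y[d_perm == 0].mean()
    return np.mean(np.abs(stats) >= abs(obs))    # two-sided p-value
```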
Power Analysis
$$V(\hat{\tau}) = \frac{\sigma_1^2}{pN} + \frac{\sigma_0^2}{(1 - p)N}$$
We can treat this as a minimization problem. What is the value of $p^*$ that makes the derivative with respect to $p$ equal to zero?

$$-\frac{\sigma_1^2}{p^{*2} N} + \frac{\sigma_0^2}{(1 - p^*)^2 N} = 0$$
Therefore

$$\frac{1 - p^*}{p^*} = \frac{\sigma_0}{\sigma_1}$$

and

$$p^* = \frac{\sigma_1}{\sigma_1 + \sigma_0} = \frac{1}{1 + \sigma_0/\sigma_1}$$
The rule of thumb assuming that σ1 ≈ σ0 is that p* = 0.5.
For practical reasons, it is sometimes better to choose unequal sample sizes, even if σ1 ≈ σ0 .
Suppose that

$$p = 0.5, \ \text{so}\ N_0 = N_1 = \frac{N}{2}, \qquad \delta = \mu_1 - \mu_0$$
Therefore, the test statistic is centered around the true effect, scaled by the sample size and variance (with $\sigma_1 = \sigma_0 = \sigma$):

$$t = \frac{\bar{Y}_1 - \bar{Y}_0}{\sqrt{\frac{\sigma_1^2}{N_1} + \frac{\sigma_0^2}{N_0}}} \sim N\!\left(\frac{\delta\sqrt{N}}{2\sigma},\, 1\right)$$

The power of a two-sided 5% test is then

$$P\!\left(t - \frac{\delta\sqrt{N}}{2\sigma} < -1.96 - \frac{\delta\sqrt{N}}{2\sigma}\right) + P\!\left(t - \frac{\delta\sqrt{N}}{2\sigma} > 1.96 - \frac{\delta\sqrt{N}}{2\sigma}\right) = \Phi\!\left(-1.96 - \frac{\delta\sqrt{N}}{2\sigma}\right) + \left(1 - \Phi\!\left(1.96 - \frac{\delta\sqrt{N}}{2\sigma}\right)\right)$$

More generally, for arbitrary $p$, $\sigma_1$, and $\sigma_0$:

$$P(\text{reject } \mu_1 - \mu_0 = 0 \mid \mu_1 - \mu_0 = \delta) = \Phi\!\left(-1.96 - \frac{\delta}{\sqrt{\frac{\sigma_1^2}{pN} + \frac{\sigma_0^2}{(1-p)N}}}\right) + \left(1 - \Phi\!\left(1.96 - \frac{\delta}{\sqrt{\frac{\sigma_1^2}{pN} + \frac{\sigma_0^2}{(1-p)N}}}\right)\right)$$
We specify:
• $\delta$: effect magnitude
• $1 - \psi$: power (usually 0.8 or higher)
• $\sigma_1^2, \sigma_0^2$: usually assumed to be equal to one another, using previous measures
• $p$: proportion of observations in the treatment group
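A small Python sketch of the power formula above (using scipy's normal CDF; the numbers in the example call are made up):

```python
from scipy.stats import norm

def power(delta, sigma1, sigma0, p, N, alpha=0.05):
    """Power of the two-sided test from the formula above:
    Phi(-z - delta/se) + 1 - Phi(z - delta/se)."""
    se = (sigma1**2 / (p * N) + sigma0**2 / ((1 - p) * N)) ** 0.5
    z = norm.ppf(1 - alpha / 2)                  # 1.96 when alpha = 0.05
    return norm.cdf(-z - delta / se) + 1 - norm.cdf(z - delta / se)

# Hypothetical example: 0.2 SD effect, equal variances, balanced design
print(power(delta=0.2, sigma1=1.0, sigma0=1.0, p=0.5, N=800))  # roughly 0.8
```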
Blocking also improves power: the relevant variance shrinks roughly by a factor of $(1 - R_B^2)$, where $R_B^2$ is the proportion of variation in the outcome explained by the blocks (which you get by regressing Y on all of the block dummies).
• The more similar observations are within blocks, and the more different blocks are from each other, the higher this predictive power is, and the larger the precision gain from blocking.
Designing complex experiments
Pre-treatment covariates
In this scenario. . .
• There are units $i = 1, \ldots, N$, a binary treatment $D_i \in \{0, 1\}$, and potential outcomes $Y_i(d)$ for $d = 0, 1$.
• The idea is that pre-treatment covariates $X_i$ might be correlated with both $D_i$ and $Y_i$, thereby confounding the causal relationship.
• The idea is that within sub-groups for which $X_i = x$, it is "as if" $D_i$ is randomly assigned.
Population regression functions instead of difference in means
Under the conditional ignorability and common support assumptions, τ is nonparametrically identified as:
$$\tau = \int \left(E[Y_i \mid D_i = 1, X_i = x] - E[Y_i \mid D_i = 0, X_i = x]\right) f(x)\, dx$$

because, under conditional ignorability,

$$E[Y_i(1) - Y_i(0) \mid X_i] = E[Y_i(1) \mid X_i, D_i = 1] - E[Y_i(0) \mid X_i, D_i = 0] = E[Y_i \mid X_i, D_i = 1] - E[Y_i \mid X_i, D_i = 0]$$
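As an illustration, here is a minimal subclassification sketch for a discrete $X_i$: estimate the within-stratum differences in means and average them over the empirical distribution of $X_i$ (function and variable names are my own; it assumes common support, so every stratum contains both treated and control units):

```python
import numpy as np

def subclassification_ate(y, d, x):
    """Sample analogue of integrating E[Y|D=1,X=x] - E[Y|D=0,X=x] over f(x)
    for a discrete covariate X: average the within-stratum differences in
    means, weighting each stratum by its share of the sample."""
    y, d, x = np.asarray(y, float), np.asarray(d, int), np.asarray(x)
    tau = 0.0
    for val in np.unique(x):
        stratum = x == val                        # units with X_i = x
        share = stratum.mean()                    # estimate of f(x)
        diff = (y[stratum & (d == 1)].mean()
                - y[stratum & (d == 0)].mean())   # requires common support
        tau += share * diff
    return tau
```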