
Chapter 3: Randomized Experiments: Advanced Inference

MIT 17.802: Quantitative Research Methods II

2023-03-19

Lecture 3: Randomized Experiments - Part 2: Advanced Inference

Sample vs population

IRL, we only have a sample of the population.


Therefore, there are two layers of randomization:

1. n units are randomly sampled from the population


2. n1 units are randomly assigned to the treatment

We usually only care about the second one in social science. This means that we have two types of ATE:
Sample average treatment effect (SATE)

$$\text{SATE} = \frac{1}{n}\sum_{i=1}^{n} (Y_{1i} - Y_{0i})$$

Population average treatment effect (PATE)

$$\text{PATE} = E[Y_{1i} - Y_{0i}]$$

Note that these are estimands, not estimators.


For SATE, our estimator is usually the difference in means.

$$\hat\tau = \frac{1}{n_1}\sum_{i=1}^{n} D_i Y_i - \frac{1}{n_0}\sum_{i=1}^{n} (1 - D_i) Y_i$$

Random assignment implies that Y1i , Y0i ⊥ Di .


Since the quantity of interest (QoI) here is the sample ATE, unbiasedness is defined over repeated random treatment
assignment (not repeated sampling). Here is the proof:

Proof that τ̂ is an unbiased estimator of SATE

Where O is the current sample. . . .


Using the law of iterated expectations:

$$E[\hat\tau \mid O] = \frac{1}{n_1}\sum_{i=1}^{n} E(D_i \mid O)\, Y_{1i} - \frac{1}{n_0}\sum_{i=1}^{n} \big(1 - E(D_i \mid O)\big)\, Y_{0i}$$
By definition,

$$E[D_i \mid O] = \frac{n_1}{n}$$

and

$$1 - E(D_i \mid O) = \frac{n_0}{n}$$
So plugging those back in, we get

$$E[\hat\tau \mid O] = \frac{1}{n_1}\sum_{i=1}^{n} \frac{n_1}{n} Y_{1i} - \frac{1}{n_0}\sum_{i=1}^{n} \frac{n_0}{n} Y_{0i}$$

$$= \frac{1}{n_1}\frac{n_1}{n}\sum_{i=1}^{n} Y_{1i} - \frac{1}{n_0}\frac{n_0}{n}\sum_{i=1}^{n} Y_{0i}$$

$$= \frac{1}{n}\sum_{i=1}^{n} Y_{1i} - \frac{1}{n}\sum_{i=1}^{n} Y_{0i} = \frac{1}{n}\sum_{i=1}^{n} (Y_{1i} - Y_{0i}) = \text{SATE}$$
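As a sanity check, you can hold the sample fixed and average the difference-in-means estimate over many re-randomizations; a minimal Python sketch with made-up potential outcomes (all names and numbers here are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed sample of potential outcomes (the "O" in the proof)
n, n1 = 100, 50
Y0 = rng.normal(0, 1, size=n)
Y1 = Y0 + 2 + rng.normal(0, 1, size=n)   # unit-level treatment effects
sate = np.mean(Y1 - Y0)

# Average the difference-in-means estimator over repeated random assignments
estimates = []
for _ in range(10_000):
    D = np.zeros(n, dtype=int)
    D[rng.choice(n, size=n1, replace=False)] = 1   # complete randomization
    Y_obs = np.where(D == 1, Y1, Y0)               # observed outcomes
    estimates.append(Y_obs[D == 1].mean() - Y_obs[D == 0].mean())

print(sate, np.mean(estimates))   # the two numbers should be very close
```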

Proof that τ̂ is an unbiased estimator of PATE


$$\hat\tau = \frac{1}{n_1}\sum_{i=1}^{n} D_i Y_i - \frac{1}{n_0}\sum_{i=1}^{n} (1 - D_i) Y_i$$

Given the law of iterated expectations, $E[Y] = E[E[Y \mid X]] = \sum_x E[Y \mid X = x]\, P_X(x)$:

E(τ̂ ) = E[E[τ̂ |O]]

Based on the above proof, we know that E[τ̂ |O] = SAT E, that means:

E(τ̂ ) = E[E[τ̂ |O]] = E[SAT E]

Then, in the case of random sampling, we can take it a step further:

$$E[\text{SATE}] = \text{PATE}$$

so τ̂ is also unbiased for the PATE.

In social science, obtaining a true random sample is often impossible. In such cases, focus on the SATE and interpret
estimates as such: consider them internally valid, but not necessarily externally valid.
Because of random assignment, we know that PATE = PATT.

Variance for within-sample inference
Very similar to population level variance from previous lecture.
The exact variance for the SATE is unidentifiable:

$$V(\hat\tau \mid O) = \frac{1}{n}\left(\frac{n_0}{n_1}S_1^2 + \frac{n_1}{n_0}S_0^2 + 2S_{01}\right)$$

where $S_1^2$ and $S_0^2$ are the sample variances of the potential outcomes $Y_{1i}$ and $Y_{0i}$, and $S_{01}$ is their sample covariance, which is never observed (we never see both potential outcomes for the same unit).
As before, the usual (Neyman) estimator is biased upward, i.e., conservative:

$$\widehat{V}(\hat\tau \mid O) = \frac{S_1^2}{n_1} + \frac{S_0^2}{n_0} \ge V(\hat\tau \mid O)$$
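A minimal sketch of computing this conservative variance estimate in Python (array names are illustrative):

```python
import numpy as np

def neyman_variance(y_treated, y_control):
    """Conservative variance estimate S1^2/n1 + S0^2/n0 for the
    difference-in-means estimator (sample variances with ddof=1)."""
    s1_sq = np.var(y_treated, ddof=1)
    s0_sq = np.var(y_control, ddof=1)
    return s1_sq / len(y_treated) + s0_sq / len(y_control)

# Example: the standard error is the square root of the variance estimate
# se = np.sqrt(neyman_variance(Y_obs[D == 1], Y_obs[D == 0]))
```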

Variance for population inference

For population inference, the estimator $\widehat{V}(\hat\tau)$ above is an unbiased estimator of $V(\hat\tau)$, the variance of $\hat\tau$ as an estimator of the PATE.

Intuitively, the formula is too big for $V(\hat\tau \mid O)$, but the population-level variance is itself bigger (it also reflects sampling variability), so the same formula becomes an unbiased estimator.

Asymptotic Inference for SATE and PATE


As n → ∞. . . .

• Because of the LLN:

  – $\text{SATE} \xrightarrow{p} \text{PATE}$

  – $\hat\tau \xrightarrow{p} \text{PATE}$

• Because of the CLT:

  – $\hat\tau$ is asymptotically normal: $\hat\tau \overset{a}{\sim} N(\text{PATE}, V(\hat\tau))$

Cluster randomization
Cluster randomization is when treatment is randomly assigned at the level of clusters (pre-existing sub-groups of units), so
every unit within the same cluster receives the same treatment status.
Reasons to use cluster randomization:

• Treatment only makes sense at group level but outcome is measured for individuals (e.g., treatment is
a new curriculum)
• Treatment too costly to implement individually
• SUTVA only plausible if treatment is defined at the group level

Cons

• Much bigger standard error, because of “clustering” (the fact that units within the same cluster are
typically more similar than units in different clusters)

Intracluster correlation

Law of Total Variance tells us that total variance = mean of “within” variance + “between variance”

V(Y ) = E[V(Y |X)] + V(E[Y |X])

In this equation:

• E[V(Y |X)] tells us the average of the within-group variance by measuring how much Y varies within
each group of observations defined by the values of X
• V(E[Y |X]) tells us the variance of the conditional means of Y given X by measuring how much the
mean of Y varies across different groups defined by the values of X. In other words, it measures how
much the conditional means of Y deviate from the overall mean.

This implies the following decomposition of heterogeneity in the potential outcomes:

$$\sum_{j=1}^{G}\sum_{i=1}^{N_j} (Y_{dij} - \bar{\bar{Y}}_d)^2 = \sum_{j=1}^{G}\sum_{i=1}^{N_j} (Y_{dij} - \bar{Y}_{dj})^2 + \sum_{j=1}^{G} N_j (\bar{Y}_{dj} - \bar{\bar{Y}}_d)^2$$

for $d = 0, 1$, where $G$ is the number of clusters, $N_j$ is the number of observations in cluster $j$, $\bar{Y}_{dj}$ is the mean of
$Y_{dij}$ in cluster $j$, and $\bar{\bar{Y}}_d$ is the global mean of $Y_{dij}$.

Here, the left-hand side $\sum_{j=1}^{G}\sum_{i=1}^{N_j}(Y_{dij} - \bar{\bar{Y}}_d)^2$ is the overall variation, $\sum_{j=1}^{G}\sum_{i=1}^{N_j}(Y_{dij} - \bar{Y}_{dj})^2$ is the within-cluster variation, and $\sum_{j=1}^{G} N_j(\bar{Y}_{dj} - \bar{\bar{Y}}_d)^2$ is the between-cluster variation.
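A quick numerical check of this decomposition on made-up clustered data (everything here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
clusters = np.repeat(np.arange(10), 20)          # 10 clusters of 20 units each
y = rng.normal(0, 1, size=200) + clusters * 0.5  # add a between-cluster component

grand_mean = y.mean()
total = np.sum((y - grand_mean) ** 2)
within = sum(np.sum((y[clusters == j] - y[clusters == j].mean()) ** 2)
             for j in np.unique(clusters))
between = sum((clusters == j).sum() * (y[clusters == j].mean() - grand_mean) ** 2
              for j in np.unique(clusters))

print(np.isclose(total, within + between))   # True: overall = within + between
```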

Slide 13/60
I think that the point of this is that there's a difference between within-cluster variation and between-cluster
variation.
From this, we can define the intracluster correlation:

$$\rho = \frac{\sigma_B^2}{\sigma^2} = 1 - \frac{\sigma_W^2}{\sigma^2}$$

where $\sigma_B^2$ is the between-cluster variance, $\sigma_W^2$ is the within-cluster variance, and $\sigma^2 = \sigma_B^2 + \sigma_W^2$ is the total variance.
When $\rho = 1$, responses are identical within each cluster $j$; when $\rho = 0$, responses are uncorrelated within each cluster.

Inference in cluster-randomized experiments

We can show that cluster randomization inflates the sampling variance approximately by the Moulton
factor:

$$\frac{V(\hat\tau_{\text{clustered}})}{V(\hat\tau_{\text{complete}})} \approx 1 + (\bar N - 1)\rho$$

where $\bar N$ is the average cluster size:

$$\bar N = \frac{1}{G}\sum_{j=1}^{G} N_j$$

So to calculate a clustered SE, you can take the usual SE and multiply it by the square root of the Moulton
factor, though more often people just use cluster-robust SEs.
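A minimal sketch of that back-of-the-envelope adjustment, taking the i.i.d. SE, the average cluster size, and ρ as given (the numbers are illustrative):

```python
import numpy as np

def moulton_adjusted_se(se_iid, avg_cluster_size, rho):
    """Inflate a standard error computed under complete randomization by the
    square root of the approximate Moulton factor 1 + (N_bar - 1) * rho."""
    design_effect = 1 + (avg_cluster_size - 1) * rho
    return se_iid * np.sqrt(design_effect)

# Example: SE of 0.10, clusters of 25 units, ICC of 0.05 -> roughly 0.148
print(moulton_adjusted_se(0.10, 25, 0.05))
```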
Therefore, as with complete randomization, you can do valid inference via regression:

• Sample regression (difference in means) estimator is unbiased for τ if clusters are equally sized.

– Possible bias if cluster sizes vary and are correlated with potential outcomes

• Valid (conservative) standard error estimates can be obtained via cluster-robust standard errors
• When G is small (when there are few clusters), ρ will be poorly estimated, and cluster SEs will be
unreliable.
• When given choice, increase the number of clusters instead of sample size per cluster. That’s what’s
driving the asymptotics.

Cluster Robust SEs

Assume for simplicity a model with no intercept:

$$y_i = \tau D_i + u_i, \qquad E[u_i] = 0$$

where

$$V[\hat\tau] = \frac{V\!\left[\sum_{i=1}^{n} D_i u_i\right]}{\left(\sum_{i=1}^{n} D_i^2\right)^2}$$

Different sets of assumptions lead us to choose to re-estimate V[τ̂ ], Vhet [τ̂ ], or Vclu [τ̂ ]

1. If we assume no covariance between $u_i$ and $u_j$ . . .

$$\text{Cov}[u_i, u_j] = 0$$

. . . and homoskedasticity . . .

$$V[u_i] = \sigma^2$$

. . . then we can use the "regular" variance estimator:

$$V[\hat\tau] = \frac{\sigma^2}{\sum_{i=1}^{n} D_i^2}$$

2. If we assume that there is no covariance between $u_i$ and $u_j$ . . .

$$\text{Cov}[u_i, u_j] = 0$$

[Even if they share the same cluster? Does this implicitly assume no clustering?]

. . . and we assume heteroskedasticity instead of homoskedasticity, then:

$$V_{het}[\hat\tau] = \frac{\sum_{i=1}^{n} D_i^2\, V[u_i]}{\left(\sum_{i=1}^{n} D_i^2\right)^2}$$

3. If we assume $\text{Cov}[u_i, u_j] = 0$ unless observations $i$ and $j$ share the same cluster, then:

$$V_{clu}[\hat\tau] = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} D_i D_j\, u_i u_j\, \mathbf{1}[i, j \text{ in the same cluster}]}{\left(\sum_{i=1}^{n} D_i^2\right)^2}$$

I’m a little confused, but I think that this is just saying that:

1. When you have no correlation between u_i and u_j, homoskedasticity, and NO clustering, then you
use
V[τ̂ ]
.
2. When you have no correlation between u_i and u_j, heteroskedasticity, but still NO clustering, then
you use
Vhet [τ̂ ]
3. When you have no correlation between u_i and u_j (except when observations are in the same cluster)
heteroskedasticity, and clustering, you use
Vclu [τ̂ ]
.

Then, to estimate any of these, replace ui in the equation with ûi
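A rough sketch of all three plug-in estimators for this simple no-intercept model, with D, y, and cluster labels assumed to be numpy arrays (illustrative; no small-sample corrections applied):

```python
import numpy as np

def variance_estimates(D, y, cluster):
    """Regular, heteroskedasticity-robust, and cluster-robust variance
    estimates for tau_hat in the no-intercept model y = tau * D + u."""
    denom = np.sum(D ** 2)
    tau_hat = np.sum(D * y) / denom
    u_hat = y - tau_hat * D                  # plug in residuals for u_i

    # 1. Homoskedastic: sigma^2 / sum(D_i^2), with sigma^2 = SSR / (n - 1)
    v_regular = np.sum(u_hat ** 2) / (len(y) - 1) / denom

    # 2. Heteroskedasticity-robust: sum(D_i^2 u_i^2) / (sum D_i^2)^2
    v_het = np.sum(D ** 2 * u_hat ** 2) / denom ** 2

    # 3. Cluster-robust: sums D_i D_j u_i u_j over pairs within each cluster
    v_clu = 0.0
    for g in np.unique(cluster):
        s = np.sum(D[cluster == g] * u_hat[cluster == g])
        v_clu += s ** 2
    v_clu /= denom ** 2

    return tau_hat, v_regular, v_het, v_clu
```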

Matrix formulation of Cluster Robust SEs


Let i denote the ith of N individuals in the sample and g denote the gth of G clusters. Then the general
regression model is:

yig = x′ig β + uig

The key assumption about errors (aside from zero conditional mean) is

$$E[u_{ig} u_{jg'} \mid x_{ig}, x_{jg'}] = 0, \quad \text{unless } g = g'$$

The variance-covariance matrix is

V[β̂] = (X ′ X)−1 B(X ′ X)−1


where $B_{clu} = \sum_{g=1}^{G} X_g' E[u_g u_g' \mid X_g] X_g$ (the "meat" of the sandwich).

To estimate $B_{clu}$, use $\hat B_{clu} = \sum_{g=1}^{G} X_g' \hat u_g \hat u_g' X_g$, where $\hat u_g = y_g - X_g \hat\beta$.
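A minimal sketch of this plug-in sandwich estimator, assuming X is an N × k design matrix, y the outcome vector, and cluster an array of cluster labels (no finite-sample correction applied):

```python
import numpy as np

def cluster_robust_vcov(X, y, cluster):
    """V[beta_hat] = (X'X)^{-1} B_hat (X'X)^{-1}, with
    B_hat = sum_g X_g' u_hat_g u_hat_g' X_g."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    u_hat = y - X @ beta_hat

    B = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(cluster):
        score_g = X[cluster == g].T @ u_hat[cluster == g]   # X_g' u_hat_g
        B += np.outer(score_g, score_g)                     # cluster g's "meat"

    return XtX_inv @ B @ XtX_inv                            # the sandwich

# Cluster-robust standard errors are the square roots of the diagonal:
# se = np.sqrt(np.diag(cluster_robust_vcov(X, y, cluster)))
```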

Block Randomization

• Adjustment before randomization is better than covariate adjustment after randomization. Therefore,
you should “Block what you can; randomize what you can’t.”

Setup:

• Units: i = 1, ..., N

• Blocks: j = 1, ..., M
• Nj : number of units in block j
• pj : treatment probability in block j
– $p_j = \frac{N_{1j}}{N_j}$ for within-block complete randomization.

• If pj = p for all j (identical for each block), then the pooled difference in means is unbiased for the
ATE:

$$\hat\tau = \frac{1}{N_1}\sum_{i=1}^{N} D_i Y_i - \frac{1}{N_0}\sum_{i=1}^{N} (1 - D_i) Y_i$$

If $p_j$ varies across blocks, then $\hat\tau$ will be biased, because $D_{ij}$ is no longer independent of the blocks when $p_j$
varies.
However, a weighted average of the block-specific differences in means will be unbiased:

$$\hat\tau_B = \sum_{j=1}^{M} \frac{N_j}{N}\, \hat\tau_j$$

where

$$\hat\tau_j = \frac{1}{N_{1j}}\sum_{i=1}^{N_j} D_{ij} Y_{ij} - \frac{1}{N_{0j}}\sum_{i=1}^{N_j} (1 - D_{ij}) Y_{ij}$$

where N1j is the number of treated units in group j, and N0j = Nj − N1j .
Because the randomizations in each block are independent, the sampling variance of the weighted-average
estimator is simply:

$$V(\hat\tau_B) = \sum_{j=1}^{M} \left(\frac{N_j}{N}\right)^2 V(\hat\tau_j)$$
Component variance can be estimated conservatively via the Neyman formula for each block:
$$\widehat{V}(\hat\tau_j) = \frac{S_{1j}^2}{N_{1j}} + \frac{S_{0j}^2}{N_{0j}}$$
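A minimal sketch of the blocked estimator and its conservative variance estimate (Y, D, and block are assumed to be numpy arrays, with at least two treated and two control units per block):

```python
import numpy as np

def blocked_estimate(Y, D, block):
    """Weighted average of block-specific differences in means and its
    conservative (Neyman) variance estimate."""
    N = len(Y)
    tau_B, var_B = 0.0, 0.0
    for j in np.unique(block):
        in_j = block == j
        Y1j, Y0j = Y[in_j & (D == 1)], Y[in_j & (D == 0)]
        w = in_j.sum() / N                                  # block weight N_j / N
        tau_B += w * (Y1j.mean() - Y0j.mean())              # weighted diff-in-means
        var_B += w ** 2 * (np.var(Y1j, ddof=1) / len(Y1j)
                           + np.var(Y0j, ddof=1) / len(Y0j))
    return tau_B, var_B
```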

Regression for block-randomized experiments

• If pj = p, OLS with block dummies (fixed effects, Bij ) will yield an unbiased estimate for ATE.

$$Y_i = \alpha + \tau D_i + \sum_{j=2}^{M} \beta_j B_{ij} + \epsilon_i$$

where

E[τ̂OLS ] = τ

• Use HC2 robust SE or clustered SE if randomization was clustered within blocks


• If $p_j$ varies by block, use weighted least squares instead of OLS, where the weight is the inverse probability
of treatment/control for each unit (see the sketch after this list):

For i in block j:

$$w_{ij} = \frac{1}{p_j} \quad \text{if } D_i = 1$$

$$w_{ij} = \frac{1}{1 - p_j} \quad \text{if } D_i = 0$$
• Can also use the interacted OLS ("Lin estimator") specification
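As referenced above, a minimal sketch of constructing these inverse-probability weights, estimating $p_j$ by the within-block treated share (illustrative):

```python
import numpy as np

def ipw_weights(D, block):
    """Inverse-probability-of-treatment weights for block-randomized designs:
    1/p_j for treated units, 1/(1 - p_j) for control units."""
    w = np.empty(len(D), dtype=float)
    for j in np.unique(block):
        in_j = block == j
        p_j = D[in_j].mean()                 # within-block treatment probability
        w[in_j] = np.where(D[in_j] == 1, 1 / p_j, 1 / (1 - p_j))
    return w

# These weights would then be passed to a weighted least squares routine,
# e.g. statsmodels' WLS(Y, X, weights=w).
```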

Why blocking can usually help

Blocking reduces the variance of the estimator. To see why, compare the two regression specifications:

Complete randomization

$$Y_i = \alpha + \tau_{CR} D_i + \epsilon_i$$

Block randomization

$$Y_i = \alpha + \tau_{BR} D_i + \sum_{j=2}^{M} \beta_j B_{ij} + \epsilon_i^*$$

We get:

$$V[\hat\tau_{CR}] = \frac{\sigma_\epsilon^2}{\sum_{i=1}^{n} (D_i - \bar D)^2}, \qquad \hat\sigma_\epsilon^2 = \frac{SSR_{\hat\epsilon}}{N - 2}$$

and

$$V[\hat\tau_{BR}] = \frac{\sigma_{\epsilon^*}^2}{\sum_{i=1}^{n} (D_i - \bar D)^2 (1 - R_j^2)}, \qquad \hat\sigma_{\epsilon^*}^2 = \frac{SSR_{\hat\epsilon^*}}{N - M - 1}$$

where $R_j^2$ is the $R^2$ from a regression of $D_i$ on the block indicators $B_{ij}$ and a constant. If there is high correlation between
the block indicators and the treatment, then $(1 - R_j^2)$ will be very small and the variance will be huge.
In the experimental case, your block indicators shouldn't be very correlated with your treatment indicator unless
you have really unequal probabilities of treatment across blocks. So in the experimental case, $(1 - R_j^2)$ will basically be 1
(ignorable).

Therefore, in the experimental case, the denominators of these two expressions will be basically the same, so it
comes down to the residual sums of squares. All told, in randomized experiments, $\widehat{V}[\hat\tau_{BR}] < \widehat{V}[\hat\tau_{CR}]$ if

$$\frac{SSR_{\hat\epsilon^*}}{N - M - 1} < \frac{SSR_{\hat\epsilon}}{N - 2}$$

The denominators here don’t matter that much in large experiments, because the differences in df will be
marginal.
Assuming that you block on things that are correlated with Y, that will soak up a lot of the variance, and
so the block SSR will be smaller than in the case of complete randomization.
In other words, if you design your research well, then:

$$SSR_{\hat\epsilon^*} < SSR_{\hat\epsilon}$$

Randomization Inference

[Skipped notes on this]

Fisher’s exact test for small samples

Instead of a typical test of differences in means that we’d use with large N:

H0 : E[Y1i ] = E[Y0i ], H1 : E[Y1i ] ̸= E[Y0i ]

You’ve got a sharp null hypothesis and corresponding alternative hypothesis.

H0 : Y1i = Y0i , H1 : Y1i ̸= Y0i

The sharp null is almost always true.


Fisher’s exact test procedure
Where Ω represents the set of all possible ways to assign treatments.

1. Calculate a statistic α̂ (e.g., difference in means) from the data


2. Obtain the null distribution of the statistic by calculating the same statistic α̂(ω) under the sharp null
hypothesis for every possible ω ∈ Ω.

3. Compare α̂ to the null distribution and see how “extreme” it is
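A minimal sketch of this procedure using Monte Carlo draws over re-randomizations instead of enumerating all of Ω (full enumeration is only feasible for very small samples); the difference in means is used as the test statistic:

```python
import numpy as np

def randomization_test(Y, D, n_draws=10_000, seed=0):
    """Two-sided p-value for the sharp null Y_1i = Y_0i, using the
    difference in means as the test statistic."""
    rng = np.random.default_rng(seed)
    observed = Y[D == 1].mean() - Y[D == 0].mean()

    null_stats = np.empty(n_draws)
    for b in range(n_draws):
        D_perm = rng.permutation(D)          # re-randomize the treatment labels
        # under the sharp null, observed outcomes are unchanged by reassignment
        null_stats[b] = Y[D_perm == 1].mean() - Y[D_perm == 0].mean()

    return np.mean(np.abs(null_stats) >= np.abs(observed))
```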

Pros:

• No assumptions needed, as randomization is seen as a “reasoned basis for causal inference”

Cons:

• Sharp null almost always false, so is it really that meaningful to reject it?

Power Analysis

• For statistical tests

– Type 1 error: rejecting the null if the null is true (α)


– Type 2 error: failing to reject the null if it is false (ψ)

• Usually, α is set to 0.05


• Power is 1 − ψ and represents the probability of rejecting the null if the null is false

We want to choose the optimal treatment allocation $p = \frac{N_1}{N}$ to minimize the variance of the estimator of
the average treatment effect.
Recall that our asymptotically valid variance expression is

$$V(\hat\tau) = \frac{\sigma_1^2}{pN} + \frac{\sigma_0^2}{(1-p)N}$$

We can treat this as a minimization problem. What is the value of $p^*$ that makes the derivative with respect
to $p$ equal to zero?

$$-\frac{\sigma_1^2}{p^{*2} N} + \frac{\sigma_0^2}{(1 - p^*)^2 N} = 0$$

Therefore
$$\frac{1 - p^*}{p^*} = \frac{\sigma_0}{\sigma_1}$$

and

$$p^* = \frac{\sigma_1}{\sigma_1 + \sigma_0} = \frac{1}{1 + \sigma_0/\sigma_1}$$
The rule of thumb assuming that σ1 ≈ σ0 is that p* = 0.5.
For practical reasons, it is sometimes better to choose unequal sample sizes, even if σ1 ≈ σ0 .

Choosing overall sample size

Suppose that

$$Y_0 \sim (\mu_0, \sigma_0^2 = \sigma^2) \quad \text{and} \quad Y_1 \sim (\mu_1, \sigma_1^2 = \sigma^2)$$

$$p = 0.5, \text{ so } N_0 = N_1 = \frac{N}{2}$$

$$\delta = \mu_1 - \mu_0$$

Then, for the t-statistic of the test of equality of means, note that

$$\frac{\bar Y_1 - \bar Y_0 - \delta}{\sqrt{\frac{\sigma_1^2}{N_1} + \frac{\sigma_0^2}{N_0}}} = \frac{\bar Y_1 - \bar Y_0 - \delta}{\sqrt{\frac{2\sigma^2}{N} + \frac{2\sigma^2}{N}}} = \frac{\bar Y_1 - \bar Y_0 - \delta}{2\sigma/\sqrt{N}} \sim N(0, 1)$$

Therefore, the test statistic is centered around the true effect, scaled by the sample size and variance:

$$t = \frac{\bar Y_1 - \bar Y_0}{\sqrt{\frac{\sigma_1^2}{N_1} + \frac{\sigma_0^2}{N_0}}} \sim N\!\left(\frac{\delta\sqrt{N}}{2\sigma},\, 1\right)$$

In this case, power, which means $P(\text{reject } \mu_1 - \mu_0 = 0 \mid \mu_1 - \mu_0 = \delta)$, is:

Note: the $\mid \mu_1 - \mu_0 = \delta$ in that formula basically means "if the null is false and the alternative is true."

$$P(t < -1.96) + P(t > 1.96)$$

$$= P\!\left(t - \frac{\delta\sqrt{N}}{2\sigma} < -1.96 - \frac{\delta\sqrt{N}}{2\sigma}\right) + P\!\left(t - \frac{\delta\sqrt{N}}{2\sigma} > 1.96 - \frac{\delta\sqrt{N}}{2\sigma}\right)$$

$$= \Phi\!\left(-1.96 - \frac{\delta\sqrt{N}}{2\sigma}\right) + \left(1 - \Phi\!\left(1.96 - \frac{\delta\sqrt{N}}{2\sigma}\right)\right)$$
2σ 2σ 2σ

General formula for power function

$$P(\text{reject } \mu_1 - \mu_0 = 0 \mid \mu_1 - \mu_0 = \delta) = \Phi\!\left(-1.96 - \frac{\delta}{\sqrt{\frac{\sigma_1^2}{pN} + \frac{\sigma_0^2}{(1-p)N}}}\right) + \left(1 - \Phi\!\left(1.96 - \frac{\delta}{\sqrt{\frac{\sigma_1^2}{pN} + \frac{\sigma_0^2}{(1-p)N}}}\right)\right)$$

We specify:

• $\delta$: effect magnitude
• $1 - \psi$: power value (usually 0.8 or higher)
• $\sigma_1^2, \sigma_0^2$: usually assumed to be equal to one another, using previous measures
• $p$: proportion of observations in the treatment group

• If σ12 = σ02 , then power is maximized by p = 0.5
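A direct translation of the power formula above into Python, using 1.96 as the 5% two-sided critical value (the example values are illustrative):

```python
from scipy.stats import norm

def power(delta, sigma1_sq, sigma0_sq, p, N, z=1.96):
    """Probability of rejecting H0: mu1 - mu0 = 0 when the true effect is delta."""
    se = (sigma1_sq / (p * N) + sigma0_sq / ((1 - p) * N)) ** 0.5
    return norm.cdf(-z - delta / se) + (1 - norm.cdf(z - delta / se))

# Example: effect of 0.2 SD, unit variances, balanced design, N = 800 -> ~0.8
print(power(0.2, 1.0, 1.0, 0.5, 800))
```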

Minimum detectable effect

Assuming $\sigma^2 = \sigma_1^2 = \sigma_0^2$:

$$MDE_{CR} = M_{N-2}\sqrt{\frac{\sigma^2}{N p(1-p)}}$$

where $M_{N-2} = t_{1-\alpha/2} + t_{1-\psi}$ is called the multiplier.


With blocking:

$$MDE_{BR} = M_{N-M-1}\sqrt{\frac{\sigma^2 \cdot (1 - R_B^2)}{N p(1-p)}}$$

where $R_B^2$ is the proportion of variation in the outcome explained by the blocks (which you get
by regressing Y on all of the block dummies).

• The more similar observations are within blocks and the more different blocks are from each other, the
higher this predictive power is, and the larger precision gain from blocking.

Designing complex experiments

Lecture 4: Observational studies - Identification


These notes are much shorter, and I didn’t do detailed coverage of the concepts.

Pre-treatment covariates

• Think of an observational study as a hidden experiment


• You need the right control variables to reveal the experiment

In this scenario. . .

• There are units indexed by $i$, a binary treatment $D_i \in \{0, 1\}$, and potential outcomes $Y_i(d)$ for $d = 0, 1$.

• The quantities of interest are:

ATE: $\tau_{ATE} = E[Y_i(1) - Y_i(0)]$

ATT: $\tau_{ATT} = E[Y_i(1) - Y_i(0) \mid D_i = 1]$

Can we identify these quantities of interest when $D_i$ is not random?

• Idea is that pre-treatment covariates might be correlated with both Di and Yi , thereby confounding
the causal relationship

Two key assumptions: conditional ignorability and common support

• In random experiments, Yi (0), Yi (1) ⊥ Di

• In observational studies, we instead make two assumptions

1. The conditional ignorability (CI) assumption states that:

$$Y_i(0), Y_i(1) \perp D_i \mid X_i = x \quad \text{for any } x \in \mathcal{X}$$

• Idea is that within sub-groups for which X = x, it is “as if” Di is randomly assigned.

2. The common support (“positivity”) assumption is that

$$0 < P(D_i = 1 \mid X_i = x) < 1 \quad \text{for any } x \in \mathcal{X}$$

• There has to be some possibility of being treated within any X sub-group.

Population regression functions instead of difference in means

• In randomized experiments, we considered identification with population difference in means

$$\tau = E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0]$$

• Here, we use difference in the population regression functions:

$$\tau(x) = E[Y_i \mid D_i = 1, X_i = x] - E[Y_i \mid D_i = 0, X_i = x]$$

Under the conditional ignorability and common support assumptions, $\tau_{ATE}$ is nonparametrically identified as:

$$\tau_{ATE} = E[\tau(X_i)] = \int \big(E[Y_i \mid D_i = 1, X_i = x] - E[Y_i \mid D_i = 0, X_i = x]\big) f(x)\, dx$$

where the outer expectation is taken with respect to the distribution of $X_i$, $f(x)$. The ATT, $E[Y_i(1) - Y_i(0) \mid D_i = 1]$, is identified analogously, integrating over the covariate distribution among the treated, $f(x \mid D_i = 1)$.


Given the conditional ignorability assumption, we have

$$E[Y_i(1) - Y_i(0) \mid X_i] = E[Y_i(1) \mid X_i, D_i = 1] - E[Y_i(0) \mid X_i, D_i = 0] = E[Y_i \mid X_i, D_i = 1] - E[Y_i \mid X_i, D_i = 0]$$

Under the common support assumption, both conditional expectations are well defined for every $x \in \mathcal{X}$, so this difference can be averaged over the distribution of $X_i$ to recover the ATE.
