Matching and Selection On Observables Handout
Harvard University
Department of Economics
EC 2140
Alberto Abadie
Table 1
Death Rates per 1,000 Person-Years
Table 2
Mean Ages, Years
Subclassification
Question
What is the average death rate for Pipe Smokers?
Answer
15 · (11/40) + 35 · (13/40) + 50 · (16/40) = 35.5
Subclassification: Example
Question
What would be the average death rate for Pipe Smokers if they
had the same age distribution as Non-Smokers?
Answer
15 · (29/40) + 35 · (9/40) + 50 · (2/40) = 21.2
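The two answers above can be reproduced with a short script; it simply reweights the stratum death rates by the chosen age distribution (numbers are the stratum counts from the slides):

```python
# Subclassification sketch using the Cochran-style numbers from the slides:
# stratum-specific death rates weighted by an age distribution.
death_rates = [15, 35, 50]        # deaths per 1,000 person-years, by age group
pipe_counts = [11, 13, 16]        # pipe smokers in each age group (N = 40)
nonsmoker_counts = [29, 9, 2]     # non-smokers in each age group (N = 40)

def adjusted_rate(rates, weights):
    """Weighted average of stratum rates using the given age distribution."""
    total = sum(weights)
    return sum(r * w for r, w in zip(rates, weights)) / total

crude = adjusted_rate(death_rates, pipe_counts)          # pipe smokers' own ages
adjusted = adjusted_rate(death_rates, nonsmoker_counts)  # non-smokers' ages
```

The adjusted rate answers the counterfactual question: what the pipe smokers' death rate would be under the non-smokers' age distribution.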
Smoking and Mortality (Cochran, 1968)
Table 3
Adjusted Death Rates using 3 Age groups
Definition (Outcomes)
Those variables, \(Y\), that are (possibly) not predetermined are
called outcomes (for some individual \(i\), \(Y_{0i} \neq Y_{1i}\)).
In general, one should not condition on outcomes, because this
may induce bias.
Adjustment for Observables in Observational Studies
Subclassification
Matching
Regression
Randomization implies:
\[
(Y_1, Y_0) \perp D
\]
Therefore:
\[
E[Y \mid D = 1] - E[Y \mid D = 0] = E[Y_1 \mid D = 1] - E[Y_0 \mid D = 0]
= E[Y_1] - E[Y_0]
= E[Y_1 - Y_0].
\]
Identification Result
Given selection on observables we have
\[
E[Y_1 - Y_0 \mid X] = E[Y_1 - Y_0 \mid X, D = 1]
= E[Y \mid X, D = 1] - E[Y \mid X, D = 0]
\]
Identification Result
Similarly,
\[
\alpha_{ATET} = E[Y_1 - Y_0 \mid D = 1]
= \int \Big( E[Y \mid X, D = 1] - E[Y \mid X, D = 0] \Big) \, dP(X \mid D = 1)
\]
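With a discrete \(X\), the integral becomes a sum over the support of \(X\), weighted by the distribution of \(X\) among the treated. A minimal sketch (the function name and toy data are hypothetical):

```python
import numpy as np

def atet_subclassification(Y, D, X):
    """Sample analogue of the ATET integral above for discrete X:
    average of E[Y|X,D=1] - E[Y|X,D=0] over the distribution of X
    among the treated."""
    treated_x = X[D == 1]
    total = 0.0
    for x in np.unique(treated_x):
        # within-stratum difference in mean outcomes
        diff = Y[(X == x) & (D == 1)].mean() - Y[(X == x) & (D == 0)].mean()
        # weight by the share of treated units in this stratum
        total += diff * np.mean(treated_x == x)
    return total
```

Each stratum must contain both treated and control units for the within-stratum contrast to be defined (the overlap requirement).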
[Figure: age distributions before matching. Panels A: Trainees and B: Non-Trainees; frequency (0–3) against age (20–60). Graphs by group.]
Age Distribution: After Matching
[Figure: age distributions after matching. Panels A: Trainees and B: Non-Trainees; frequency (0–3) against age (20–60). Graphs by group.]
Before matching: the age distributions of trainees and non-trainees differ substantially.
After matching: the two age distributions are closely aligned.
Matching works well when we can find good matches for each treated unit,
so M is usually small (typically, M = 1 or M = 2).
Matching
where
\[
\hat V = \begin{pmatrix}
\hat\sigma_1^2 & 0 & \cdots & 0 \\
0 & \hat\sigma_2^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \hat\sigma_k^2
\end{pmatrix}.
\]
Notice that the normalized Euclidean distance is equal to:
\[
\|X_i - X_j\| = \sqrt{\sum_{n=1}^{k} \frac{(X_{ni} - X_{nj})^2}{\hat\sigma_n^2}}.
\]
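A minimal sketch of this distance and its use for nearest-neighbor matching (the function name and the toy covariates are made up for illustration):

```python
import numpy as np

def normalized_euclidean(x_i, x_j, variances):
    """Distance between covariate vectors with each coordinate scaled
    by its sample variance (the diagonal-V distance defined above)."""
    d = np.asarray(x_i, float) - np.asarray(x_j, float)
    return np.sqrt(np.sum(d**2 / np.asarray(variances, float)))

# Tiny illustration with hypothetical covariates (e.g. age, schooling):
X_treated = np.array([30.0, 12.0])
X_controls = np.array([[25.0, 10.0],
                       [32.0, 16.0],
                       [29.0, 12.0]])
variances = X_controls.var(axis=0, ddof=1)  # sample variances by covariate
dists = [normalized_euclidean(X_treated, c, variances) for c in X_controls]
best = int(np.argmin(dists))                # index of the nearest control
```

Scaling by the variances keeps covariates measured in large units (e.g. earnings) from dominating the distance.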
Then,
\[
\hat\alpha_{ATET} = \frac{1}{N_1} \sum_{D_i = 1} \Big( \big(\mu_1(X_i) + \varepsilon_i\big) - \big(\mu_0(X_{j(i)}) + \varepsilon_{j(i)}\big) \Big)
= \frac{1}{N_1} \sum_{D_i = 1} \big( \mu_1(X_i) - \mu_0(X_{j(i)}) \big)
+ \frac{1}{N_1} \sum_{D_i = 1} \big( \varepsilon_i - \varepsilon_{j(i)} \big).
\]
\[
\hat\alpha_{ATET} - \alpha_{ATET} = \frac{1}{N_1} \sum_{D_i = 1} \big( \mu_1(X_i) - \mu_0(X_{j(i)}) - \alpha_{ATET} \big)
+ \frac{1}{N_1} \sum_{D_i = 1} \big( \varepsilon_i - \varepsilon_{j(i)} \big).
\]
Therefore,
\[
\hat\alpha_{ATET} - \alpha_{ATET} = \frac{1}{N_1} \sum_{D_i = 1} \big( \mu_1(X_i) - \mu_0(X_i) - \alpha_{ATET} \big)
+ \frac{1}{N_1} \sum_{D_i = 1} \big( \varepsilon_i - \varepsilon_{j(i)} \big)
+ \frac{1}{N_1} \sum_{D_i = 1} \big( \mu_0(X_i) - \mu_0(X_{j(i)}) \big).
\]
Matching: Bias Problem
Now, if k is large:
⇒ The difference between \(X_i\) and \(X_{j(i)}\) converges to zero very slowly.
⇒ The difference \(\mu_0(X_i) - \mu_0(X_{j(i)})\) converges to zero very slowly.
⇒ \(E\big[\sqrt{N_1}\,\big(\mu_0(X_i) - \mu_0(X_{j(i)})\big) \,\big|\, D = 1\big]\) may not converge to zero!
⇒ \(E\big[\sqrt{N_1}\,\big(\hat\alpha_{ATET} - \alpha_{ATET}\big)\big]\) may not converge to zero!
The matching discrepancies \(\mu_0(X_i) - \mu_0(X_{j(i)})\) contribute
to the bias.
Bias-corrected matching:
\[
\hat\alpha^{BC}_{ATET} = \frac{1}{N_1} \sum_{D_i = 1} \Big( (Y_i - Y_{j(i)}) - \big(\hat\mu_0(X_i) - \hat\mu_0(X_{j(i)})\big) \Big)
\]
Under some conditions, the bias correction eliminates the bias of the
matching estimator without affecting the variance.
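A sketch of the bias-corrected estimator for one-to-one matching. The helper name, the plain-Euclidean matching rule, and the toy data are illustrative assumptions; in practice \(\hat\mu_0\) would be estimated by regression on the control units:

```python
import numpy as np

def bias_corrected_atet(Y, D, X, mu0_hat):
    """Bias-corrected one-to-one matching estimator of the ATET
    (a sketch of the display above). mu0_hat is an estimate of
    E[Y | X, D = 0], supplied by the user."""
    treated = np.where(D == 1)[0]
    controls = np.where(D == 0)[0]
    terms = []
    for i in treated:
        # nearest control unit in Euclidean covariate distance
        j = controls[np.argmin(np.linalg.norm(X[controls] - X[i], axis=1))]
        # match difference minus the regression-based bias correction
        terms.append((Y[i] - Y[j]) - (mu0_hat(X[i]) - mu0_hat(X[j])))
    return np.mean(terms)

# Toy example: mu0(x) = x, one treated unit with effect 2.0.
Y = np.array([0.0, 1.0, 2.4])
D = np.array([0, 0, 1])
X = np.array([[0.0], [1.0], [0.4]])
est = bias_corrected_atet(Y, D, X, lambda x: x[0])
```

In the toy example the raw match difference (2.4) overstates the effect; subtracting \(\hat\mu_0(X_i) - \hat\mu_0(X_{j(i)})\) removes the discrepancy due to imperfect covariate matching.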
\[
\hat\sigma^2_{ATET} = \frac{1}{N_1} \sum_{D_i = 1} \left( Y_i - \frac{1}{M} \sum_{m=1}^{M} Y_{j_m(i)} - \hat\alpha_{ATET} \right)^2,
\]
is valid.
\[
\hat\sigma^2_{ATET} = \frac{1}{N_1} \sum_{D_i = 1} \left( Y_i - \frac{1}{M} \sum_{m=1}^{M} Y_{j_m(i)} - \hat\alpha_{ATET} \right)^2
+ \frac{1}{N_1} \sum_{D_i = 0} \left( \frac{K_i (K_i - 1)}{M^2} \right) \widehat{\operatorname{var}}(Y_i \mid X_i, D_i = 0),
\]
Because of the Rosenbaum–Rubin result, if \((Y_1, Y_0) \perp D \mid X\), then \((Y_1, Y_0) \perp D \mid p(X)\).
And the results follow from integration over P(X ) and P(X |D = 1).
\[
\alpha_{ATE} = E\left[ Y \, \frac{D - p(X)}{p(X)\big(1 - p(X)\big)} \right]
\]
\[
\alpha_{ATET} = \frac{1}{P(D = 1)} \, E\left[ Y \, \frac{D - p(X)}{1 - p(X)} \right]
\]
\[
\hat\alpha_{ATE} = \frac{1}{N} \sum_{i=1}^{N} Y_i \, \frac{D_i - \hat p(X_i)}{\hat p(X_i)\big(1 - \hat p(X_i)\big)}
\]
\[
\hat\alpha_{ATET} = \frac{1}{N_1} \sum_{i=1}^{N} Y_i \, \frac{D_i - \hat p(X_i)}{1 - \hat p(X_i)}
\]
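The two sample analogues translate directly to code. A sketch, where `pscore` stands for estimated propensity scores \(\hat p(X_i)\) supplied by the user:

```python
import numpy as np

def ipw_ate(Y, D, pscore):
    """Sample analogue of alpha_ATE = E[Y (D - p(X)) / (p(X)(1 - p(X)))]."""
    return np.mean(Y * (D - pscore) / (pscore * (1 - pscore)))

def ipw_atet(Y, D, pscore):
    """Sample analogue of alpha_ATET; note the 1/N1 normalization."""
    N1 = D.sum()
    return np.sum(Y * (D - pscore) / (1 - pscore)) / N1
```

With a constant propensity score (as under pure randomization), both reduce to the simple difference in group means; in practice, scores near 0 or 1 make the weights explode, which is why overlap matters.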
Standard errors:
Regression Estimators
\[
E[Y \mid D = 1, X] = E[Y_1 \mid D = 1, X] = E[Y_1 \mid X],
\]
\[
E[Y \mid D = 0, X] = E[Y_0 \mid D = 0, X] = E[Y_0 \mid X].
\]
As a result, \((\hat\alpha, \hat\beta)\) converge to
\[
(\alpha, \beta) = \operatorname{argmin}_{a,b} \; E\left[ \big( Y - aD - X'b \big)^2 \right]
\]
It can be shown (Handout 2) that the \((\alpha, \beta)\) that solve this
minimization problem also solve:
\[
(\alpha, \beta) = \operatorname{argmin}_{a,b} \; E\left[ \big( E[Y \mid D, X] - (aD + X'b) \big)^2 \right]
\]
⇒ Even if \(E[Y \mid D, X]\) is nonlinear, \(\hat\alpha D + X'\hat\beta\) provides an
estimate of a well-defined approximation to \(E[Y \mid D, X]\).
⇒ The result is also true for nonlinear specifications.
⇒ But the approximation could be a bad one if the regression
is severely misspecified.
Least Squares as a Weighting Scheme
Suppose that the covariates take on a finite number of values:
x 1 , x 2 , . . . , x K . Then, from the subclassification section we know
that:
\[
\alpha_{ATE} = \sum_{k=1}^{K} \Big( E[Y \mid D = 1, X = x^k] - E[Y \mid D = 0, X = x^k] \Big) \Pr(X = x^k)
\]
where
\[
Z^k = \begin{cases} 1 & \text{if } X = x^k \\ 0 & \text{if } X \neq x^k \end{cases}
\]
where
\[
w_k = \frac{\operatorname{var}(D \mid X = x^k) \Pr(X = x^k)}{\sum_{k=1}^{K} \operatorname{var}(D \mid X = x^k) \Pr(X = x^k)}.
\]
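The weighting formula can be checked numerically: with a discrete \(X\), the OLS coefficient on \(D\) from a regression of \(Y\) on \(D\) and the stratum indicators \(Z^k\) equals the \(w_k\)-weighted average of the within-stratum contrasts. The data below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two strata (K = 2) with different treatment shares, so var(D|X) differs.
X = np.array([0]*6 + [1]*6)
D = np.array([1, 1, 1, 0, 0, 0,  1, 0, 0, 0, 0, 0])
Y = rng.normal(size=12) + 2.0 * D + 1.5 * X

# OLS of Y on D and a full set of stratum indicators.
Z = np.column_stack([D, X == 0, X == 1]).astype(float)
alpha_ols = np.linalg.lstsq(Z, Y, rcond=None)[0][0]

# w_k-weighted average of the within-stratum differences in means.
alphas, weights = [], []
for k in (0, 1):
    m = X == k
    alphas.append(Y[m & (D == 1)].mean() - Y[m & (D == 0)].mean())
    weights.append(D[m].var() * m.mean())   # var(D|X = x^k) Pr(X = x^k)
alpha_weighted = np.dot(alphas, weights) / np.sum(weights)
# alpha_ols and alpha_weighted coincide (up to floating-point error)
```

So OLS does not weight strata by \(\Pr(X = x^k)\) alone: strata where treatment status varies more get more weight, which is why \(\alpha_{OLS}\) generally differs from \(\alpha_{ATE}\).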
Excerpt from Abadie and Imbens (2011):

"… statistics for the experimental treatment group. The second pair of columns presents summary statistics for the experimental controls. The third pair of columns presents summary statistics for the nonexperimental comparison group constructed from the PSID. The last two columns present normalized differences between the covariate distributions, between the experimental treated and controls, and between the experimental treated and the PSID comparison group, respectively. These normalized differences are calculated as
\[
\text{nor-dif} = \frac{\overline X_1 - \overline X_0}{\sqrt{(S_0^2 + S_1^2)/2}},
\]
where \(\overline X_w = \sum_{i: W_i = w} X_i / N_w\) and \(S_w^2 = \sum_{i: W_i = w} (X_i - \overline X_w)^2 / (N_w - 1)\). Note that this differs from the t-statistic for the test of the null hypothesis that \(E[X \mid W = 0] = E[X \mid W = 1]\), which will be
\[
\text{t-stat} = \frac{\overline X_1 - \overline X_0}{\sqrt{S_0^2/N_0 + S_1^2/N_1}}.
\]
The normalized difference provides a scale-free measure of the difference in the location of the two distributions, and is useful for assessing the degree of difficulty in adjusting for differences in covariates.

Panel A contains the results for pretreatment variables and Panel B for outcomes. Notice the large differences in background characteristics between the program participants and the PSID sample. This is what makes drawing causal inferences from comparisons between the PSID sample and the treatment group a tenuous task. From Panel B, we can obtain an unbiased estimate of the effect of the NSW program on earnings in 1978 by comparing the averages for the experimental treated and controls, 6.35 − 4.55 = 1.80, with a standard error of 0.67 (earnings are measured in thousands of dollars). Using a normal approximation to the limiting distribution of the effect of the program on earnings in 1978, we obtain a 95% confidence interval, which is [0.49, 3.10].

Table 2 presents estimates of the causal effect of the NSW program on earnings using various matching, regression, and weighting estimators. Panel A reports estimates for the experimental data (treated and controls). Panel B reports estimates based on the experimental treated and the PSID comparison group. The first set of rows in each case reports matching estimates for M equal to 1, 4, 16, 64, and 2490 (the size of the PSID comparison group). The matching estimates include simple matching with no bias adjustment and bias-adjusted matching, first by matching on the covariates and then by matching on the estimated propensity score. The covariate matching estimators use the matrix \(A_{NE}\) (the diagonal matrix with inverse sample variances on the diagonal) as the distance measure. Because we are focused on the average effect for the treated, the bias correction only requires an estimate of \(\mu_0(X_i)\). We estimate this regression function using linear regression on all nine pretreatment covariates in Table 1, panel A, but do not include any higher order terms or interactions, with only the control units that are used as a match [the units j such that \(W_j = 0\) and …"
In the DAG:
[DAG: Z → X → Y, with U → X and U → Y]
U is a parent of X and Y .
X and Y are descendants of Z .
There is a directed path from Z to Y .
There are two paths from Z to U (but no directed path).
X is a collider on the path Z → X ← U.
X is a noncollider on the path Z → X → Y.
Confounding
Confounding arises when the treatment and the outcome have
common causes.
[DAG: D → Y, with X → D and X → Y]
The association between D and Y does not only reflect the
causal effect of D on Y.
Confounding creates backdoor paths, that is, paths starting
with incoming arrows. In the DAG we can see a backdoor
path from D to Y (D ← X → Y).
However, once we "block" the backdoor path by conditioning
on the common cause, X, the association between D and Y is
only reflective of the effect of D on Y.
[DAG: D → Y, with X → D and X → Y; X conditioned on]
Blocked Paths
A path is blocked if and only if:
It contains a noncollider that has been conditioned on,
Or, it contains a collider that has not been conditioned on and
has no descendants that have been conditioned on.
Examples:
1 Conditioning on a noncollider blocks a path:
[DAG: path X – Z – Y, with noncollider Z conditioned on]
2 Conditioning on a collider opens a path:
[DAG: Z → X ← Y, with collider X conditioned on]
3 Not conditioning on a collider (or its descendants) leaves a
path blocked:
[DAG: Z → X ← Y, with X not conditioned on]
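Example 2 can be checked by simulation: Z and Y are drawn independently, but selecting on their common effect X induces an association between them. The linear model below is a hypothetical illustration, not from the slides:

```python
import numpy as np

# Collider bias simulation: Z and Y independent, X = Z + Y + noise,
# so X is a collider on the path Z -> X <- Y.
rng = np.random.default_rng(1)
n = 100_000
Z = rng.normal(size=n)
Y = rng.normal(size=n)
X = Z + Y + 0.5 * rng.normal(size=n)

r_marginal = np.corrcoef(Z, Y)[0, 1]               # near zero: path blocked
sel = X > 1.0                                      # crude conditioning on X
r_conditional = np.corrcoef(Z[sel], Y[sel])[0, 1]  # negative: path opened
```

Among units with large X, a high Z makes a high Y less necessary to explain the selection, so Z and Y become negatively associated even though they are causally unrelated.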
Backdoor Criterion
Suppose that:
D is a treatment,
Y is an outcome,
X1 , . . . , Xk is a set of covariates.
[DAG: D → Y, with backdoor paths through X1 and X2]
Conditioning on X1 and X2 blocks the backdoor paths.
Matching may work even if not all common causes are
observed: U and X1 are common causes.
[DAG: D → Y, with unobserved U and observed X1 as common causes, and X2 on a path]
[DAG: D → Y, with covariates X1 and X2]
Conditioning on X1 blocks the backdoor path. Conditioning
on X2 would open a path!
Matching on all pretreatment covariates is not always
the answer: There is one backdoor path and it is closed.
[DAG: D → Y, with U1 → D, U1 → X, U2 → X, U2 → Y (X is a collider on the backdoor path)]
No confounding. Conditioning on X would open a path!