
Chapter 4: Introduction to Regression

Jonathan Roth

Mathematical Econometrics I
Brown University
Fall 2023
Motivation

We showed that under conditional unconfoundedness we can learn the
conditional average treatment effect (CATE) by comparing outcome
means for the treatment/control group conditional on Xi:

CATE(x) = E[Yi | Di = 1, Xi = x] − E[Yi | Di = 0, Xi = x]

When Xi is discrete and we have many observations per x-value (Nx is
large), we showed how we can use the Central Limit Theorem to
estimate each of these conditional means and “do inference”

But what about when Xi is continuous?

1
Suppose this is our data

We might be willing to assume that college attendance is
as-good-as-random conditional on SAT score

Say we’re interested in the CATE at X = 1350. Theory tells us to
compare average income for Brown/URI students with X = 1350.

We could estimate the average at Brown using our 1 student with
X = 1350. But that estimate is very noisy, & we can’t apply the CLT.
Moreover, we don’t have any URI students with X = 1350!

Clearly, we need to extrapolate from students with other SAT scores.
What would you do if you were eyeballing it?
Probably draw a line through the points to estimate the CEFs!

With these CEF estimates in hand, we can estimate CATE(x) at any x

2
Outline

1. Population Regression

2. Sample Regressions (OLS)

3. Putting Regression into Practice

3
Introduction to Regression
The idea of regression is to formalize the process of estimating the
conditional expectation function (CEF) by extrapolating across units
using a particular functional form (e.g. linear, quadratic, etc.)

There are a few outstanding questions that we need to answer:

How do we approximate the CEF in the sample that we have? (I.e.
how to draw the lines through the data!)

How can we construct confidence intervals / do hypothesis tests for
the estimates of the CEF?

What happens if the real CEF doesn’t take the form we’ve used for
estimation (e.g. isn’t linear)?

We’ll try to answer all of these questions over the next several lectures!

4
Roadmap
What we know how to do: Estimate and test hypotheses about
population means using sample means

What we want to do: Estimate approximations to the CEF and test
hypotheses about them

How can we use what we know to do what we want?

1) Assume the CEF takes a particular form, e.g. linear:
E[Yi | Xi = x] = α + xβ

2) Show that under this assumption, α and β can be represented as
functions of population means.

3) Use our tools for estimating population means using sample means
to estimate α, β and test hypotheses about them

4) Argue that even if our assumption about the form of the CEF is
wrong, the parameters α, β may provide a “good” approximation.
5
The “Least Squares” Problem

Suppose Xi is scalar and the CEF is linear (we’ll relax both later):

E[Yi | Xi = x] = α + xβ

A useful fact which we will exploit is that when this is true, (α, β)
solve the “least squares” problem:

(α, β) = argmin_{a,b} E[(Yi − (a + bXi))²]

Where does this come from?!

6
Starting with a Simpler Problem

To show that (α, β) solve a “least-squares” problem, let’s first
consider a simpler, related problem:

Suppose that we want to find a constant u to minimize

min_u E[(Yi − u)²]

What constant u should we choose? The population mean µ = E[Yi]!

Proof:
The derivative of E[(Yi − u)²] with respect to u is E[−2(Yi − u)].
Setting the derivative to 0, we obtain

E[−2(Yi − u)] = 0 ⇒ 2E[Yi] = 2u ⇒ u = E[Yi].

7
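As a quick numerical illustration (not from the original slides; the simulated distribution below is arbitrary), a minimal Python sketch confirms that the constant minimizing the sample analog of E[(Yi − u)²] is the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=2.0, size=10_000)  # simulated outcomes Y_i (arbitrary)

# Sample analog of E[(Y_i - u)^2] evaluated over a grid of candidate constants u
grid = np.linspace(y.min(), y.max(), 2_001)
mse = np.array([np.mean((y - u) ** 2) for u in grid])

print("grid minimizer:", grid[mse.argmin()])  # ~ the sample mean, up to grid resolution
print("sample mean   :", y.mean())
```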
Now A Slightly Harder Problem
Now suppose we want to choose the function u(x) to minimize

min_{u(·)} E[(Yi − u(Xi))²]

What function u(x) should we choose? The conditional expectation
u(x) = E[Y | X = x].

Proof:
By the law of iterated expectations,

E[(Yi − u(Xi))²] = E[ E[(Yi − u(Xi))² | Xi] ].

Thus, for each value of x, we want to choose u(x) to minimize

E[(Yi − u(x))² | Xi = x].

However, our argument on the previous slide implies that the solution
is u(x) = E[Yi | Xi = x].
8
Looping Back...
We’ve shown that the function u(x) = E[Yi | Xi = x] solves

min_{u(·)} E[(Yi − u(Xi))²]

Thus, if E[Yi | Xi = x] = α + βx, then u(x) = α + βx minimizes

min_{u(·)} E[(Yi − u(Xi))²]

The minimization above was over all functions u(·), including linear
ones of the form a + bx. Hence,

E[(Yi − (α + βXi))²] ≤ E[(Yi − (a + bXi))²] for all a, b.

This implies that (α, β) solve

min_{a,b} E[(Yi − (a + bXi))²],

as we wanted to show.
9
Why This is Useful

So we’ve shown that α, β are the solutions to

min_{a,b} E[(Yi − (a + bXi))²].

How does this help us? By solving the minimization problem, we can
express α, β as functions of population expectations.

Let’s take the derivatives w.r.t. a and b and set them to zero at (α, β):

E[−2(Yi − (α + βXi))] = 0
E[−2Xi(Yi − (α + βXi))] = 0

We now have 2 equations with 2 unknowns, which we can use to solve
for the CEF parameters (α, β)

10
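To bridge to the next slide, here is the intermediate algebra (filled in here; the slides skip it). Dividing each first-order condition by −2 gives E[Yi − (α + βXi)] = 0 and E[Xi(Yi − (α + βXi))] = 0, so

```latex
\begin{aligned}
E[Y_i] &= \alpha + \beta E[X_i]
  \quad\Rightarrow\quad \alpha = E[Y_i] - \beta E[X_i],\\
E[X_i Y_i] &= \alpha E[X_i] + \beta E[X_i^2]
  = \big(E[Y_i] - \beta E[X_i]\big)E[X_i] + \beta E[X_i^2],\\
\Rightarrow\quad E[X_i Y_i] - E[X_i]E[Y_i] &= \beta\big(E[X_i^2] - E[X_i]^2\big)
  \quad\Rightarrow\quad \beta = \frac{\operatorname{Cov}(X_i, Y_i)}{\operatorname{Var}(X_i)}.
\end{aligned}
```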
The Least Squares Solution

The solution to the system of equations is as follows:

β = E[(Xi − E[Xi])(Yi − E[Yi])] / E[(Xi − E[Xi])²] = Cov(Xi, Yi) / Var(Xi)
α = E[Yi] − E[Xi]β

These are continuous functions of population means!

We can therefore use the tools from previous lectures to estimate
them and test hypotheses about the CEF!

11
Roadmap
What we know how to do: Estimate and test hypotheses about
population means using sample means

What we want to do: Estimate approximations to the CEF and test
hypotheses about them

How can we use what we know to do what we want?

1) Assume the CEF takes a particular form, e.g. linear:
E[Yi | Xi = x] = α + xβ ✓

2) Show that under this assumption, α and β can be represented as
functions of population means. ✓

3) Use our tools for estimating population means using sample means
to estimate α, β and test hypotheses about them

4) Argue that even if our assumption about the form of the CEF is
wrong, the parameters α, β may provide a “good” approximation.
12
Outline

1. Population Regression ✓

2. Sample Regressions (OLS)

3. Putting Regression into Practice

13
Estimating Regression Coefficients
We showed that when E[Yi | Xi = x] = α + βx,

β = E[(Xi − E[Xi])(Yi − E[Yi])] / E[(Xi − E[Xi])²] = Cov(Xi, Yi) / Var(Xi)
α = E[Yi] − E[Xi]β

How can we estimate α, β?

Replace population means with sample averages!

β̂ = [(1/N) Σᵢ (Xi − X̄)(Yi − Ȳ)] / [(1/N) Σᵢ (Xi − X̄)²] = Ĉov(Xi, Yi) / V̂ar(Xi)
α̂ = Ȳ − X̄β̂

These α̂, β̂ are called ordinary least squares (OLS) coefficients

They solve the “sample analog” problem, min_{a,b} (1/N) Σᵢ (Yi − (a + bXi))²

14
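As an illustration only (simulated data, not the Brown/URI data; the intercept, slope, and noise level below are made up), a minimal Python sketch of these sample-analog formulas, cross-checked against a standard least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
x = rng.uniform(1000, 1600, size=n)                   # SAT-like scores (placeholder range)
y = 20_000 + 15 * x + rng.normal(0, 5_000, size=n)    # linear CEF with made-up coefficients

# Sample-analog (OLS) formulas from the slide
beta_hat = np.mean((x - x.mean()) * (y - y.mean())) / np.mean((x - x.mean()) ** 2)
alpha_hat = y.mean() - x.mean() * beta_hat

# Cross-check against numpy's least-squares polynomial fit
slope, intercept = np.polyfit(x, y, deg=1)
print(alpha_hat, beta_hat)
print(intercept, slope)   # same values up to floating-point error
```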
What is the estimated value of E[Yi | Di = 1, Xi = 1350]?
α̂ + β̂ · 1350 = 34947 + 15 · 1350 = 55197.
16
Consistency of OLS

We can now use our results for (functions of) sample averages to
show that β̂ is consistent for β, i.e. β̂ →p β.

We have that

β̂ = [(1/N) Σᵢ (Xi − X̄)²]⁻¹ · (1/N) Σᵢ (Xi − X̄)(Yi − Ȳ)
  = [(1/N) Σᵢ Xi² − X̄²]⁻¹ · [(1/N) Σᵢ Xi Yi − X̄Ȳ]
  →p (E[Xi²] − E[Xi]²)⁻¹ (E[Xi Yi] − E[Xi]E[Yi])
  = Var(Xi)⁻¹ Cov(Xi, Yi) = β

Analogously, we can show that α̂ →p α.

17
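A small simulation (an illustrative sketch with made-up parameters, not from the slides) shows β̂ concentrating around β as N grows, as the consistency result predicts:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 1.0, 2.0                     # true coefficients (arbitrary)

def ols_slope(n):
    x = rng.normal(size=n)
    y = alpha + beta * x + rng.normal(size=n)
    return np.mean((x - x.mean()) * (y - y.mean())) / np.var(x)

for n in (100, 10_000, 1_000_000):
    print(n, ols_slope(n))                 # estimates approach beta = 2.0 as n grows
```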
Asymptotic Distribution for OLS
Our OLS estimates α̂, β̂ are continuous functions of sample means.

We can therefore use the Central Limit Theorem and Continuous
Mapping Theorem to show that they are asymptotically normally
distributed

In particular, we will show that

√N(β̂ − β) →d N(0, σ²),

where

σ² = Var((Xi − E[Xi])εi) / Var(Xi)²

This is useful because we can then form CIs for β of the form
β̂ ± 1.96 σ̂/√N, where σ̂ is our estimate of σ.
18
Deriving the Asymptotic Distribution for OLS

Define the regression residual εi = Yi − (α + Xiβ), implying

Yi = α + Xiβ + εi

The first-order conditions we derived for (α, β) imply this residual is
mean-zero and orthogonal to the regressor: E[εi] = E[Xiεi] = 0

Taking means, Ȳ = α + X̄β + ε̄. So Yi − Ȳ = (Xi − X̄)β + (εi − ε̄)

19
Asymptotic Distribution for OLS (cont.)

We just derived that Yi − Ȳ = (Xi − X̄)β + (εi − ε̄).

Thus,

β̂ = [(1/N) Σᵢ (Xi − X̄)²]⁻¹ · (1/N) Σᵢ (Xi − X̄)(Yi − Ȳ)
  = [(1/N) Σᵢ (Xi − X̄)²]⁻¹ · (1/N) Σᵢ (Xi − X̄)[(Xi − X̄)β + (εi − ε̄)]
  = β + [(1/N) Σᵢ (Xi − X̄)²]⁻¹ · (1/N) Σᵢ (Xi − X̄)(εi − ε̄)

20
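The move from the second line to the third uses the expansion below (written out here since the slide skips the step):

```latex
\[
\frac{1}{N}\sum_{i=1}^{N}(X_i-\bar X)\Big[(X_i-\bar X)\beta+(\varepsilon_i-\bar\varepsilon)\Big]
=\beta\,\frac{1}{N}\sum_{i=1}^{N}(X_i-\bar X)^2
+\frac{1}{N}\sum_{i=1}^{N}(X_i-\bar X)(\varepsilon_i-\bar\varepsilon),
\]
```

so multiplying by [(1/N) Σᵢ (Xi − X̄)²]⁻¹ leaves β plus the remainder term on the third line.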
Asymptotic Distribution for OLS (cont.)
Hence,

√N(β̂ − β) = [(1/N) Σᵢ (Xi − X̄)²]⁻¹ · √N (1/N) Σᵢ (Xi − X̄)(εi − ε̄)

           = [(1/N) Σᵢ (Xi − X̄)²]⁻¹ · √N (1/N) Σᵢ (Xi − E[Xi])εi
             − [(1/N) Σᵢ (Xi − X̄)²]⁻¹ · (ε̄ · √N(X̄ − E[Xi]))

By LLN and CMT, [(1/N) Σᵢ (Xi − X̄)²]⁻¹ →p Var(Xi)⁻¹

By CLT, √N (1/N) Σᵢ (Xi − E[Xi])εi →d N(0, Var((Xi − E[Xi])εi)).

By LLN, CLT, and Slutsky, ε̄ · √N(X̄ − E[Xi]) →d 0 × N(0, Var(Xi)) = 0
21
Finishing the Asymptotics (!)
Putting all the pieces together, we see that

√N(β̂ − β) →d N(0, σ²),

where

σ² = Var((Xi − E[Xi])εi) / Var(Xi)²

As before, we can estimate the variance σ² using sample averages,

σ̂² = [(1/N) Σᵢ ((Xi − X̄)ε̂i)²] / [(1/N) Σᵢ (Xi − X̄)²]²,  where ε̂i = Yi − (α̂ + Xiβ̂)

Can do similar steps to show α̂ is asymptotically normally distributed
as well. (We’ll show formulas later!)

22
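Putting the variance formula into code (a hedged sketch on simulated heteroskedastic data; the data-generating process below is made up):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n) * (1 + 0.5 * np.abs(x))   # heteroskedastic errors

beta_hat = np.mean((x - x.mean()) * (y - y.mean())) / np.var(x)
alpha_hat = y.mean() - x.mean() * beta_hat
resid = y - (alpha_hat + x * beta_hat)                            # residuals eps-hat_i

# sigma-hat^2 = [(1/N) sum ((X_i - Xbar) eps-hat_i)^2] / [(1/N) sum (X_i - Xbar)^2]^2
sigma2_hat = np.mean(((x - x.mean()) * resid) ** 2) / np.var(x) ** 2
se = np.sqrt(sigma2_hat / n)
print("beta_hat:", beta_hat, "95% CI:", (beta_hat - 1.96 * se, beta_hat + 1.96 * se))
```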
A CI for β is β̂ ± 1.96 × SE ≈ [5, 25]

23
Aside on notation/terminology

Oftentimes people will say: consider the (population) regression

Yi = α + βDi + εi    (1)

What they mean is: “define (α, β) = argmin_{a,b} E[(Yi − (a + bDi))²]”

(α, β) are referred to as the “population regression coefficients”

Likewise, people will say “We estimate equation (1) by OLS” to mean
that they compute the sample analogs to α, β via OLS, i.e. α̂, β̂.

24
Outline

1. Population Regression ✓

2. Sample Regressions (OLS) ✓

3. Putting Regression into Practice

25
Using Regressions to Analyze RCTs
Recall that when we have an experiment, the average treatment effect
is identified by a difference in means:

τ = E[Yi(1) − Yi(0)] = E[Yi | Di = 1] − E[Yi | Di = 0]

Observe that we can write:

E[Yi | Di = d] = E[Yi | Di = 0] + (E[Yi | Di = 1] − E[Yi | Di = 0]) · d
              = α + βd

Thus, the CEF E[Yi | Di = d] is linear in d, and the slope coefficient β
is exactly the estimand which identifies the ATE in an experiment!

Analogously, the OLS slope coefficient β̂ is the difference in sample
means which estimates the ATE: β̂ = Ȳ1 − Ȳ0 = τ̂.

We can thus use OLS as a convenient tool for estimating the ATE and
getting standard errors
26
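A minimal check of this equivalence on simulated data (the treatment effect of 3 below is invented): the OLS slope on a binary treatment indicator equals the difference in sample means exactly.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
d = rng.integers(0, 2, size=n)              # randomized binary treatment D_i
y = 10.0 + 3.0 * d + rng.normal(size=n)     # outcomes with a true ATE of 3 (made up)

beta_hat = np.mean((d - d.mean()) * (y - y.mean())) / np.var(d)
diff_in_means = y[d == 1].mean() - y[d == 0].mean()
print(beta_hat, diff_in_means)              # identical up to floating-point error
```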
Example - WorkAdvance
Background: gaps between college-educated and non-college-educated
workers have widened over time

Yet not everyone thrives in a traditional college setting

WorkAdvance is a job-training program intended to provide people
with certifiable skills in high-wage industries (e.g. IT, healthcare,
manufacturing)

MDRC conducted a randomized trial that randomized access to the
training program among people who passed the initial screening

27
Estimate the OLS regression:

Yi = α + βDi + εi,  where Yi is earnings 2–3 years later and Di is the treatment indicator

Coefficient   Estimate   SE
α̂             14636      425
β̂             1965       609

What is the estimated treatment effect? β̂ = 1965

What is a CI for the treatment effect?

β̂ ± 1.96 × SEβ = 1965 ± 1.96 × 609 = [771, 3159]

What is the estimated control mean? α̂ = 14636

29
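For completeness, the CI arithmetic in code (values copied from the table above; purely a check of the calculation):

```python
beta_hat, se = 1965, 609
print(beta_hat - 1.96 * se, beta_hat + 1.96 * se)   # approx. 771 and 3159
```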
Roadmap
What we know how to do: Estimate and test hypotheses about
population means using sample means

What we want to do: Estimate approximations to the CEF and test
hypotheses about them

How can we use what we know to do what we want?

1) Assume the CEF takes a particular form, e.g. linear:
E[Yi | Xi = x] = α + xβ ✓

2) Show that under this assumption, α and β can be represented as
functions of population means. ✓

3) Use our tools for estimating population means using sample means
to estimate α, β and test hypotheses about them. ✓

4) Argue that even if our assumption about the form of the CEF is
wrong, the parameters α, β may provide a “good” approximation.
30
Regressions as Approximations

So far we’ve assumed that the conditional expectation is linear:

E[Yi | Xi = x] = α + βx

What if the true CEF is not linear?!

Claim: if the CEF is not linear, then OLS still gives us the “best linear
approximation” to the CEF

What we mean by this is that the α, β of OLS minimize

min_{α,β} E[(E[Y | X] − (α + βX))²]

That is, we get the linear function that’s “closest” to the CEF in
terms of mean-squared error

31
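To see the claim numerically (a sketch under an assumed quadratic CEF; all parameters are invented): the OLS coefficients on (Y, X) essentially match the coefficients from regressing the true CEF E[Y | X] itself on X, i.e. the best linear approximation.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
x = rng.uniform(-1, 1, size=n)
cef = 1.0 + x + 2.0 * x**2                  # a nonlinear CEF E[Y|X] (made up)
y = cef + rng.normal(size=n)                # Y = CEF + mean-zero noise

def lin_fit(outcome):
    b = np.mean((x - x.mean()) * (outcome - outcome.mean())) / np.var(x)
    a = outcome.mean() - x.mean() * b
    return a, b

print(lin_fit(y))     # OLS of Y on X
print(lin_fit(cef))   # best linear approximation to the CEF: (nearly) the same coefficients
```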
Proof of OLS as Best Approximation
We solved for the α, β that minimize E[(Y − (α + βX))²].

Let µ(x) = E[Y | X = x]. Then we have

E[(Y − (α + βX))²] = E[(Y − µ(X) + µ(X) − (α + βX))²]
                   = E[(Y − µ(X))²] + E[(µ(X) − (α + βX))²]
                     + 2E[(Y − µ(X))(µ(X) − (α + βX))]

By the LIE,
E[(Y − µ(X))(µ(X) − (α + βX))] = E[ E[(Y − µ(X))(µ(X) − (α + βX)) | X] ]
                               = E[ (µ(X) − (α + βX)) · E[Y − µ(X) | X] ] = 0,

since µ(X) − (α + βX) is a function of X and E[Y − µ(X) | X] = 0.

Hence, E[(Y − (α + βX))²] = E[(Y − µ(X))²] + E[(µ(X) − (α + βX))²].

But the first term doesn’t depend on (α, β). So minimizing
E[(Y − (α + βX))²] is the same as minimizing E[(µ(X) − (α + βX))²]
36
E[Yi | Di = 1, Xi = x] − E[Yi | Di = 0, Xi = x] ≈ α1 + β1x − (α0 + β0x);
α̂1 + β̂1x − (α̂0 + β̂0x) = (34,947 + 15x) − (13,942 + 21x) = 21,005 − 6x

So w/ conditional ignorability, ATE = E[CATE(Xi)] ≈ 21,005 − 6E[Xi]

37
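If we also had the sample of SAT scores in hand, this last step could be coded in a couple of lines (the score values below are hypothetical; 21,005 − 6x is the estimated CATE from this slide):

```python
import numpy as np

x_scores = np.array([1200, 1280, 1350, 1400, 1480])   # hypothetical SAT scores X_i
cate_hat = 21_005 - 6 * x_scores                       # estimated CATE(X_i) from above
print(cate_hat.mean())                                 # estimate of the ATE = E[CATE(X_i)]
```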
