Chapter 4: Intro to Regression
Jonathan Roth
Mathematical Econometrics I
Brown University
Fall 2023
Motivation
Suppose this is our data. We might be willing to assume that college attendance is as-good-as-random conditional on SAT score.

Say we're interested in the CATE at X = 1350. Theory tells us to compare average income for Brown and URI students with X = 1350.

We could estimate the average at Brown using our 1 student with X = 1350. But that estimate is very noisy, and we can't apply the CLT. Moreover, we don't have any URI students with X = 1350!

Clearly, we need to extrapolate from students with other SAT scores. What would you do if you were eyeballing it? Probably draw a line through the points to estimate the CEFs!

With these CEF estimates in hand, we can estimate CATE(x) at any x.
Outline
1. Population Regression
Introduction to Regression
The idea of regression is to formalize the process of estimating the conditional expectation function (CEF) by extrapolating across units using a particular functional form (e.g. linear, quadratic, etc.)

What happens if the real CEF doesn't take the form we've used for estimation (e.g. isn't linear)?

We'll try to answer these questions over the next several lectures!
Roadmap
What we know how to do: Estimate and test hypotheses about population means using sample means

3) Use our tools for estimating population means using sample means to estimate α, β and test hypotheses about them

4) Argue that even if our assumption about the form of the CEF is wrong, the parameters α, β may provide a "good" approximation.
The "Least Squares" Problem
Suppose Xi is scalar and the CEF is linear (we'll relax both later):
E[Yi | Xi = x] = α + βx

A useful fact which we will exploit is that when this is true, (α, β) solve the "least squares" problem:
(α, β) = arg min_{a,b} E[(Yi − (a + bXi))²]
Starting with a Simpler Problem
Let's start with a simpler problem: choose a single number u to minimize E[(Yi − u)²].

Claim: the solution is u = E[Yi].

Proof:
The derivative of E[(Yi − u)²] with respect to u is E[−2(Yi − u)].
Setting the derivative to 0, we obtain E[Yi − u] = 0, i.e. u = E[Yi].
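As a quick numerical check (simulated data, not from the lecture), the claim that u = E[Yi] minimizes E[(Yi − u)²] can be sketched like this:

```python
import numpy as np

# Illustrative sketch: search a grid of candidate constants u and check
# that the mean-squared error E[(Y - u)^2] is minimized near the mean of Y.
rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=2.0, size=10_000)  # stand-in "population"

candidates = np.linspace(0.0, 10.0, 2001)        # grid of candidate constants u
mse = [np.mean((y - u) ** 2) for u in candidates]
best_u = candidates[int(np.argmin(mse))]

# The grid minimizer should sit right next to the sample mean of y.
print(best_u, y.mean())
```

The grid step is 0.005, so the minimizer can only differ from the sample mean by grid resolution.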
Now A Slightly Harder Problem
Now suppose we want to choose the function u(x) to minimize E[(Yi − u(Xi))²].

Proof:
By the law of iterated expectations,
E[(Yi − u(Xi))²] = E[ E[(Yi − u(Xi))² | Xi] ].
We can minimize this by minimizing E[(Yi − u(Xi))² | Xi = x] separately for each x. Conditional on Xi = x, this is the simpler problem of choosing a constant, so our argument on the previous slide implies that the solution is u(x) = E[Yi | Xi = x].
Looping Back...
We've shown that the function u(x) = E[Yi | Xi = x] solves
min_{u(·)} E[(Yi − u(Xi))²]

The minimization above was over all functions u(·), including linear ones of the form a + bx. Hence, when the CEF is linear, i.e. E[Yi | Xi = x] = α + βx, the pair (α, β) solves
min_{a,b} E[(Yi − (a + bXi))²],
as we wanted to show.
Why This is Useful
How does this help us? By solving the minimization problem, we can express α, β as functions of population expectations.

Let's take the derivatives w.r.t. a and b and set them to zero at (α, β):
E[−2(Yi − (α + βXi))] = 0
E[−2Xi(Yi − (α + βXi))] = 0
The Least Squares Solution
Rearranging the first-order conditions gives
β = Var(Xi)⁻¹ Cov(Xi, Yi) and α = E[Yi] − βE[Xi].
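A minimal numerical sketch of the least-squares solution β = Cov(X, Y)/Var(X), α = E[Y] − βE[X] (the simulated coefficients 2 and 3 are my own illustration, not from the slides):

```python
import numpy as np

# With a linear CEF Y = 2 + 3X + mean-zero noise, the population formulas
# beta = Cov(X, Y)/Var(X) and alpha = E[Y] - beta*E[X] should recover (2, 3)
# on a large simulated sample.
rng = np.random.default_rng(1)
x = rng.normal(size=200_000)
y = 2.0 + 3.0 * x + rng.normal(size=x.size)

beta = np.cov(x, y, ddof=0)[0, 1] / np.var(x)   # Var(X)^{-1} Cov(X, Y)
alpha = y.mean() - beta * x.mean()
print(round(alpha, 2), round(beta, 2))
```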
Roadmap
What we know how to do: Estimate and test hypotheses about
population means using sample means
3) Use our tools for estimating population means using sample means
to estimate α, β and test hypotheses about them
4) Argue that even if our assumption about the form of the CEF is
wrong, the parameters α, β may provide a “good” approximation.
Outline
1. Population Regression ✓
Estimating Regression Coefficients
We showed that when E[Yi | Xi = x] = α + βx,
β = Var(Xi)⁻¹ Cov(Xi, Yi) and α = E[Yi] − βE[Xi].

We can estimate these by replacing the population moments with their sample analogs:
β̂ = [ (1/N) ∑_{i=1}^N (Xi − X̄)² ]⁻¹ (1/N) ∑_{i=1}^N (Xi − X̄)(Yi − Ȳ),  α̂ = Ȳ − β̂X̄.
What is the estimated value of E[Yi | Di = 1, Xi = 1350]?
α̂ + β̂ · 1350 = 34947 + 15 · 1350 = 55197.
Consistency of OLS
We can now use our results for (functions of) sample averages to show that β̂ is consistent for β, i.e. β̂ →p β.

We have that
β̂ = [ (1/N) ∑_{i=1}^N (Xi − X̄)² ]⁻¹ (1/N) ∑_{i=1}^N (Xi − X̄)(Yi − Ȳ)
  = [ (1/N) ∑_{i=1}^N Xi² − X̄² ]⁻¹ [ (1/N) ∑_{i=1}^N XiYi − X̄Ȳ ]
  →p ( E[Xi²] − E[Xi]² )⁻¹ ( E[XiYi] − E[Xi]E[Yi] )
  = Var(Xi)⁻¹ Cov(Xi, Yi) = β
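A small simulation (illustrative numbers, not from the slides) shows the estimation error shrinking as N grows, as consistency suggests:

```python
import numpy as np

# Sketch of consistency: |beta_hat - beta| should (typically) shrink
# as the sample size N grows.
rng = np.random.default_rng(3)
beta_true = 1.5

def beta_hat(n):
    x = rng.normal(size=n)
    y = 0.5 + beta_true * x + rng.normal(size=n)
    return np.mean((x - x.mean()) * (y - y.mean())) / np.mean((x - x.mean()) ** 2)

errors = {n: abs(beta_hat(n) - beta_true) for n in (100, 10_000, 1_000_000)}
print(errors)
```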
Asymptotic Distribution for OLS
Our OLS estimates α̂, β̂ are continuous functions of sample means.

This is useful because we can then form CIs for β of the form β̂ ± 1.96 σ̂/√N, where σ̂ is our estimate of σ.
Deriving the Asymptotic Distribution for OLS
Define εi = Yi − (α + βXi), so that
Yi = α + βXi + εi.
The first-order conditions for (α, β) imply E[εi] = 0 and E[Xiεi] = 0.
Asymptotic Distribution for OLS (cont.)
Thus,
β̂ = [ (1/N) ∑_{i=1}^N (Xi − X̄)² ]⁻¹ (1/N) ∑_{i=1}^N (Xi − X̄)(Yi − Ȳ)
  = [ (1/N) ∑_{i=1}^N (Xi − X̄)² ]⁻¹ (1/N) ∑_{i=1}^N (Xi − X̄)((Xi − X̄)β + (εi − ε̄))
  = β + [ (1/N) ∑_{i=1}^N (Xi − X̄)² ]⁻¹ (1/N) ∑_{i=1}^N (Xi − X̄)(εi − ε̄)
Asymptotic Distribution for OLS (cont.)
Hence,
√N(β̂ − β) = [ (1/N) ∑_{i=1}^N (Xi − X̄)² ]⁻¹ √N (1/N) ∑_{i=1}^N (Xi − X̄)(εi − ε̄)
  = [ (1/N) ∑_{i=1}^N (Xi − X̄)² ]⁻¹ √N (1/N) ∑_{i=1}^N (Xi − E[Xi])εi
    − [ (1/N) ∑_{i=1}^N (Xi − X̄)² ]⁻¹ ε̄ √N(X̄ − E[Xi])

By the LLN and CMT, [ (1/N) ∑_{i=1}^N (Xi − X̄)² ]⁻¹ →p Var(Xi)⁻¹.
By the CLT, √N (1/N) ∑_{i=1}^N (Xi − E[Xi])εi →d N(0, Var((Xi − E[Xi])εi)).
By the LLN, CLT, and Slutsky, ε̄ √N(X̄ − E[Xi]) →d 0 × N(0, Var(Xi)) = 0.
Finishing the Asymptotics (!)
Putting all the pieces together, we see that
√N(β̂ − β) →d N(0, σ²),
where σ² = Var(Xi)⁻² Var((Xi − E[Xi])εi).
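A sketch of the plug-in version of this variance formula (simulated heteroskedastic data; names and numbers are mine, not the slides'):

```python
import numpy as np

# Plug-in standard error from the asymptotic result:
# sigma^2 = Var((X - E[X]) * eps) / Var(X)^2, estimated with sample analogs.
rng = np.random.default_rng(4)
n = 50_000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n) * (1 + 0.5 * np.abs(x))  # heteroskedastic noise

xd = x - x.mean()
beta_hat = np.mean(xd * (y - y.mean())) / np.mean(xd ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()
eps_hat = y - (alpha_hat + beta_hat * x)              # estimated residuals

sigma2_hat = np.mean((xd * eps_hat) ** 2) / np.mean(xd ** 2) ** 2
se = np.sqrt(sigma2_hat / n)                          # sigma_hat / sqrt(N)
ci = (beta_hat - 1.96 * se, beta_hat + 1.96 * se)
print(beta_hat, se, ci)
```

This is the heteroskedasticity-robust standard error; it remains valid even though Var(εi | Xi) varies with Xi here.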
A CI for β is β̂ ± 1.96 × SE ≈ [5, 25]
Aside on notation/terminology
People will often write down a regression equation like
Yi = α + βDi + εi    (1)
What they mean is: "define (α, β) = arg min_{a,b} E[(Yi − (a + bDi))²]"

Likewise, people will say "We estimate equation (1) by OLS" to mean that they compute the sample analogs to α, β via OLS, i.e. α̂, β̂.
Outline
1. Population Regression ✓
Using Regressions to Analyze RCTs
Recall that when we have an experiment, the average treatment effect is identified by a difference in means:
ATE = E[Yi | Di = 1] − E[Yi | Di = 0]

Since Di is binary, E[Yi | Di = d] = E[Yi | Di = 0] + (E[Yi | Di = 1] − E[Yi | Di = 0]) d. Thus, the CEF E[Yi | Di = d] is linear in d, and the slope coefficient β is exactly the estimand which identifies the ATE in an experiment!

We can thus use OLS as a convenient tool for estimating the ATE and getting standard errors.
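The equivalence holds exactly in-sample as well: with a binary regressor, the OLS slope equals the difference in sample means. A quick simulated check (not the WorkAdvance data):

```python
import numpy as np

# With binary D, the OLS slope Cov_hat(D, Y)/Var_hat(D) equals the
# difference in means between the D=1 and D=0 groups, exactly.
rng = np.random.default_rng(5)
d = rng.integers(0, 2, size=1_000).astype(float)    # simulated treatment indicator
y = 10.0 + 3.0 * d + rng.normal(size=d.size)        # simulated outcomes

slope = np.mean((d - d.mean()) * (y - y.mean())) / np.mean((d - d.mean()) ** 2)
diff_in_means = y[d == 1].mean() - y[d == 0].mean()
print(slope, diff_in_means)
```

The two numbers agree up to floating-point error, since this is an algebraic identity, not an approximation.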
Example - WorkAdvance
Background: gaps between college-educated and non-college-educated workers have widened over time
Estimate OLS regression:
Yi = α + βDi + εi,
where Yi is earnings 2-3 years later and Di is a treatment indicator.

Coefficient  Estimate  SE
α̂            14636     425
β̂             1965     609
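From the reported estimates, a 95% confidence interval and t-statistic for β̂ follow from the earlier CI formula β̂ ± 1.96 × SE:

```python
# Using the reported WorkAdvance numbers from the table above.
beta_hat, se = 1965.0, 609.0
ci = (beta_hat - 1.96 * se, beta_hat + 1.96 * se)   # 95% CI for beta
t_stat = beta_hat / se                              # test of H0: beta = 0
print(ci, round(t_stat, 2))
```

The CI is roughly [771, 3159] and the t-statistic is about 3.23, so the estimated earnings effect is statistically significant at conventional levels.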
Roadmap
What we know how to do: Estimate and test hypotheses about population means using sample means

3) Use our tools for estimating population means using sample means to estimate α, β and test hypotheses about them ✓

4) Argue that even if our assumption about the form of the CEF is wrong, the parameters α, β may provide a "good" approximation.
Regressions as Approximations
Claim: if the CEF is not linear, then OLS still gives us the "best linear approximation" to the CEF
Proof of OLS as Best Approximation
We solved for the α, β that minimize E[(Y − (α + βX))²]. Let µ(X) = E[Y | X] denote the CEF. Writing Y − (α + βX) = (Y − µ(X)) + (µ(X) − (α + βX)), the objective decomposes as
E[(Y − (α + βX))²] = E[(Y − µ(X))²] + 2E[(Y − µ(X))(µ(X) − (α + βX))] + E[(µ(X) − (α + βX))²].

By the LIE, the cross term is zero:
E[(Y − µ(X))(µ(X) − (α + βX))] =
E[ E[(Y − µ(X))(µ(X) − (α + βX)) | X] ] =
E[ (µ(X) − (α + βX)) E[Y − µ(X) | X] ] = 0,
since E[Y − µ(X) | X] = 0.

The first term does not depend on (α, β), so minimizing E[(Y − (α + βX))²] is the same as minimizing E[(µ(X) − (α + βX))²], i.e. OLS gives the best linear approximation to the CEF.
E[Yi | Di = 1, Xi = x] − E[Yi | Di = 0, Xi = x] ≈ α1 + β1x − (α0 + β0x);
α̂1 + β̂1x − (α̂0 + β̂0x) = (34947 + 15x) − (13942 + 21x) = 21005 − 6x
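Combining the two fitted lines gives an estimated CATE at any x; for example, at the X = 1350 score from the motivation:

```python
# Estimated CATE from the fitted lines above:
# (alpha1_hat + beta1_hat*x) - (alpha0_hat + beta0_hat*x) = 21005 - 6x.
def cate_hat(x):
    return (34_947 + 15 * x) - (13_942 + 21 * x)

print(cate_hat(1350))  # 21005 - 6*1350 = 12905
```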