University of Amsterdam
Week 3. Lecture 1
February, 2024
Overview
The plan for this week
Recap: Linear model
Last week, we introduced the multiple regression model in vector notation:
$$y_i = x_i'\beta + \varepsilon_i, \qquad i = 1, \ldots, n, \qquad (1)$$
for a $[K \times 1]$ vector $\beta$ of unknown regression coefficients. We illustrated how models with $K > 2$ arise naturally by combining restricted models with $K = 2$.
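To make the recap concrete, here is a minimal numerical sketch (my addition, not from the slides) of fitting this model by OLS via the normal equations; the sample size, design, and coefficient values are assumptions chosen purely for illustration.

```python
# Minimal sketch of the multiple regression model and its OLS fit.
# All parameter values below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, K = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])  # intercept + 2 regressors
beta = np.array([1.0, -0.5, 2.0])                               # assumed true coefficients
y = X @ beta + rng.normal(size=n)                               # y_i = x_i' beta + eps_i

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)                    # OLS: (X'X)^{-1} X'y
print(beta_hat)                                                 # close to [1.0, -0.5, 2.0]
```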
Recap: Projection matrices
1. Total vs. Partial effects
1.1. Empirical problem
Back to Hotels in Vienna
This means that premium hotels, which may offer spa, conference, and better restaurant facilities, are also located closer to the city center.
Your colleague suggests that you also include the $stars_i$ regressor in the model and, using the FWL intuition, control for this potential relationship between $distance_i$ and $stars_i$:
New model. Regression Output.
Original model. Regression Output.
Compare with original model:
Interpretation?
We observe that with the inclusion of $stars_i$ the effect of distance becomes smaller (in absolute value), but is still negative.
Using the FWL theorem, we know that the two estimators (with and without $stars_i$ included) for the effect of $distance_i$ on $price_i$ measure, algebraically, two different types of association:
1. The one with $stars_i$ included in the model measures the partial correlation between $distance_i$ and $price_i$. Intuitively, if we compare two hotels with different prices, we first remove any price differences that come from the different number of $stars_i$ they are awarded.
2. The one without $stars_i$ included in the model measures the total correlation between $distance_i$ and $price_i$. In particular, when measuring the effect of $distance_i$ we now neglect the fact that expensive hotels can also be the ones closer to the city center.
These are two different estimates with two different interpretations.
Graphical illustration
1.2. Ceteris paribus analysis
Simple model
Assume that:
$$\mathbb{E}[u_i^{(1)} \mid distance_i, stars_i] = 0, \qquad \mathbb{E}[u_i^{(2)} \mid distance_i] = 0. \qquad (7)$$
Coefficient interpretation. Large model
Coefficient interpretation. Small model
In this case:
$$\frac{\partial\, \mathbb{E}[price_i \mid distance_i]}{\partial\, distance_i} = \beta_1 + \beta_2 \pi. \qquad (8)$$
Note: this is also the coefficient of the population linear projection of $price_i$ on $distance_i$!
Hence, in this linear model we measure the total (both direct and indirect, through $stars_i$) effect of $distance_i$:
$$\beta_{distance,\mathrm{total}} = \beta_1 + \beta_2 \pi. \qquad (9)$$
Note that we generally expect $\beta_1 < 0$, $\beta_2 > 0$, and $\pi < 0$. Hence the total effect is more negative than the direct effect.
2. Adding and subtracting variables
2.1. Restricted and unrestricted estimators
Two models
Consider the situation where you are not sure whether to model the data using the “small” model:
$$y = X_1\beta_1 + \varepsilon, \qquad (10)$$
or using the “large” model:
$$y = X_1\beta_1 + X_2\beta_2 + \varepsilon. \qquad (11)$$
In both cases you are primarily interested in $\beta_1$; in our case, the coefficient on $distance_i$.
Two estimators
The “small” model can be seen as a restricted version of the “large” model with $\beta_2 = 0$; reflecting this restriction, we add the subscript $R$ and write $\hat\beta_{1,R}$. The two OLS estimators for $\beta_1$:
Two estimators. Important relationship.
In the proof of the FWL theorem, one of the intermediate steps showed that:
$$\hat\beta_1 = (X_1'X_1)^{-1} X_1'(y - X_2\hat\beta_2). \qquad (14)$$
Equivalently:
$$\hat\beta_1 = \hat\beta_{1,R} - (X_1'X_1)^{-1} X_1' X_2 \hat\beta_2. \qquad (15)$$
Hence, the unrestricted estimator $\hat\beta_1$ can be seen as a linear combination of the restricted estimator $\hat\beta_{1,R}$ and $\hat\beta_2$.
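As a sanity check (my addition, not from the slides), the sketch below verifies relationship (15) numerically on simulated data; the design matrices and coefficient values are arbitrary assumptions.

```python
# Numerical check of eq. (15): beta1_hat = beta1R_hat - (X1'X1)^{-1} X1'X2 beta2_hat.
import numpy as np

rng = np.random.default_rng(1)
n = 500
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])     # "small" model regressors
X2 = (0.8 * X1[:, 1] + rng.normal(size=n)).reshape(-1, 1)  # correlated with X1
y = X1 @ np.array([2.0, -1.0]) + 1.5 * X2[:, 0] + rng.normal(size=n)

X = np.hstack([X1, X2])
b = np.linalg.solve(X.T @ X, X.T @ y)                      # unrestricted OLS
b1, b2 = b[:2], b[2:]
b1R = np.linalg.solve(X1.T @ X1, X1.T @ y)                 # restricted OLS (beta2 = 0)
Pi_hat = np.linalg.solve(X1.T @ X1, X1.T @ X2)             # regression of X2 on X1

print(np.allclose(b1, b1R - Pi_hat @ b2))                  # True: identity (15) holds
```

The same check, rearranged, confirms the decomposition of $\hat\beta_{1,R}$ discussed next.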
OLS and model parameters
Hence, from the above we see that the total-effect estimator $\hat\beta_{1,R}$ decomposes as:
$$\hat\beta_{1,R} = \hat\beta_1 + \hat\Pi \hat\beta_2, \qquad (17)$$
where $\hat\Pi = (X_1'X_1)^{-1}X_1'X_2$.
This decomposition is exactly the same as the one in our two-equation model for price, distance, and stars:
2.2. Biased or unbiased?
Decomposition again
We will also use the fact (from previous lectures) that the OLS estimator in the large model is unbiased, i.e.:
$$\mathbb{E}[\hat\beta \mid X] = \beta_0. \qquad (22)$$
In particular:
Unbiased estimation with irrelevant regressors
If $\beta_{2,0} = 0$, taking expectations in (17) gives:
$$\mathbb{E}[\hat\beta_{1,R} \mid X] = \beta_{1,0} + \hat\Pi\,\beta_{2,0} = \beta_{1,0}.$$
Hence both the restricted $\hat\beta_{1,R}$ and the unrestricted $\hat\beta_1$ estimators are unbiased.
Fully conditional bias. Non-irrelevant regressors.
Consider now the situation where $\beta_{2,0} \neq 0$, i.e. there is a direct effect of $stars_i$ on $price_i$. In this situation:
$$\mathbb{E}[\hat\beta_{1,R} \mid X] = \beta_{1,0} + \hat\Pi\,\beta_{2,0}.$$
Hence:
$$\mathbb{E}[\hat\beta_{1,R} \mid X] \neq \beta_{1,0}, \qquad (25)$$
and the restricted estimator is conditionally biased.
The Omitted variable bias
For obvious reasons, this bias is caused by omitting a relevant variable ($X_2$ in this case) from the model. Unless $X_1'X_2 = O$ (almost impossible to satisfy in a given sample), this bias is non-zero.
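A small Monte Carlo sketch (my addition) can illustrate the omitted variable bias; the data-generating process below, with $\pi = -0.7$ and $\beta_2 = 1$, is an assumption chosen to echo the hotels example.

```python
# Monte Carlo sketch of omitted variable bias (illustrative numbers).
# When x2 is relevant (beta2 != 0) and correlated with x1, the restricted
# estimator of beta1 centers on beta1 + pi*beta2, not on beta1.
import numpy as np

rng = np.random.default_rng(2)
n, reps = 200, 2000
beta1, beta2, pi = -0.5, 1.0, -0.7
est_R = np.empty(reps)
for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = pi * x1 + rng.normal(size=n)       # E[x2 | x1] = pi * x1
    y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    X1 = np.column_stack([np.ones(n), x1])
    est_R[r] = np.linalg.solve(X1.T @ X1, X1.T @ y)[1]   # restricted slope

print(est_R.mean())   # approx beta1 + pi*beta2 = -1.2, not beta1 = -0.5
```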
Partially conditional bias (Not in your textbook)
Note that while $X_1'X_2 = O$ might be difficult to satisfy in any given sample, it is possible that it holds in repeated samples. For example, it is possible that $\mathbb{E}[X_2 \mid X_1] = O$. In that case:
$$\mathbb{E}[\hat\beta_{1,R} \mid X_1] = \beta_{1,0}.$$
Hence, there is no omitted variable bias left, because the two blocks of regressors are uncorrelated across repeated samples.
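A companion sketch (same assumed setup as before, but now with $\mathbb{E}[X_2 \mid X_1] = 0$) illustrates that the restricted estimator is then unbiased across repeated samples:

```python
# Sketch: if E[x2 | x1] = 0, the restricted estimator is unbiased in repeated
# samples, even though x1'x2 != 0 in any particular sample (assumed DGP).
import numpy as np

rng = np.random.default_rng(3)
n, reps = 200, 2000
beta1, beta2 = -0.5, 1.0
est_R = np.empty(reps)
for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)                 # independent of x1: E[x2 | x1] = 0
    y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    X1 = np.column_stack([np.ones(n), x1])
    est_R[r] = np.linalg.solve(X1.T @ X1, X1.T @ y)[1]

print(est_R.mean())                         # approx beta1 = -0.5: no OVB on average
```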
Which bias is the relevant one?
So, should we worry more about bias conditional on $X$, or conditional on $X_1$ only? The answer depends on your specific situation, and on whether treating the regressors as fully fixed is justified.
Take the hotels example, and assume that hotels cannot change their locations, so distance is fixed.
1. If you are interested in measuring the effect of distance on prices for a given set of hotels with a given set of stars, then the fully conditional (on $X$) statement is more appropriate.
2. If you are interested in measuring the effect of distance on prices, but you allow for the possibility that hotels might change their star ratings (in a hypothetical world), then conditioning on $X_1$ alone is sufficient.
2.3. Variance
Useful fact
For two positive definite matrices $A$ and $B$:
$$A - B \geq 0 \quad \text{if and only if} \quad B^{-1} - A^{-1} \geq 0. \qquad (28)$$
Here, as in the previous lecture, by $\geq$ we mean no smaller in the positive semi-definite sense.
We will not attempt to prove this fact, and leave it for curious students to check.
Adding Irrelevant variables. Variance Effect.
Let us assume that the true model has $\beta_{2,0} = 0_{K_2}$ (true coefficients). Then $X_2$ is an irrelevant regressor.
Variances:
$$\mathrm{var}(\hat\beta_{1,R} \mid X) = \sigma_0^2 (X_1'X_1)^{-1}, \qquad \mathrm{var}(\hat\beta_1 \mid X) = \sigma_0^2 (X_1' M_{X_2} X_1)^{-1}.$$
We now show that:
$$\mathrm{var}(\hat\beta_{1,R} \mid X) \leq \mathrm{var}(\hat\beta_1 \mid X), \quad \text{i.e.} \quad (X_1'X_1)^{-1} \leq (X_1' M_{X_2} X_1)^{-1}.$$
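This positive semi-definite ordering can also be checked numerically; the sketch below (my addition, on a simulated design chosen for illustration) verifies it by inspecting the eigenvalues of the difference.

```python
# Numerical check that (X1'X1)^{-1} <= (X1' M_{X2} X1)^{-1} in the psd sense,
# i.e. var(beta1R_hat | X) <= var(beta1_hat | X). Simulated design (assumption).
import numpy as np

rng = np.random.default_rng(4)
n = 300
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = (0.6 * X1[:, 1] + rng.normal(size=n)).reshape(-1, 1)

M2 = np.eye(n) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)   # annihilator of X2
V_R = np.linalg.inv(X1.T @ X1)                           # variances up to sigma^2
V_U = np.linalg.inv(X1.T @ M2 @ X1)

eigvals = np.linalg.eigvalsh(V_U - V_R)                  # all should be >= 0
print(eigvals.min() >= -1e-12)                           # True: psd ordering holds
```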
Variance. In the presence of bias.
In this case:
$$\hat\beta_{1,R} - \mathbb{E}[\hat\beta_{1,R} \mid X] = (X_1'X_1)^{-1} X_1'\varepsilon, \qquad (37)$$
such that (following the same steps as before):
Summary. Variance effect.
Note that here I define efficiency in terms of the smaller variance among the two competing linear estimators.
Important! (and always forgotten)
It is not true that the corresponding estimates of the variances (and hence the standard errors) satisfy:
$$\widehat{\mathrm{var}}(\hat\beta_{1,R} \mid X) \leq \widehat{\mathrm{var}}(\hat\beta_1 \mid X). \qquad (42)$$
This is because it is possible that $s_R^2 \geq s^2$, while it always holds that $(X_1'X_1)^{-1} \leq (X_1' M_{X_2} X_1)^{-1}$.
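A Monte Carlo sketch (my addition; the data-generating process is an assumption) shows how often the estimated ordering is reversed when $X_2$ is in fact relevant, so that $s_R^2$ is inflated:

```python
# Sketch: the *estimated* variances need not preserve the ordering, because
# s_R^2 can exceed s^2 when the omitted x2 is relevant (assumed DGP).
import numpy as np

rng = np.random.default_rng(5)
n, reps = 100, 1000
flips = 0
for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = 0.3 * x1 + rng.normal(size=n)
    y = -0.5 * x1 + 1.0 * x2 + rng.normal(size=n)        # x2 relevant
    X1 = np.column_stack([np.ones(n), x1])
    X = np.column_stack([X1, x2])
    eR = y - X1 @ np.linalg.solve(X1.T @ X1, X1.T @ y)   # restricted residuals
    e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)        # unrestricted residuals
    s2R = eR @ eR / (n - 2)
    s2 = e @ e / (n - 3)
    vR = s2R * np.linalg.inv(X1.T @ X1)[1, 1]            # estimated var, restricted
    M2 = np.eye(n) - np.outer(x2, x2) / (x2 @ x2)
    v = s2 * np.linalg.inv(X1.T @ M2 @ X1)[1, 1]         # estimated var, unrestricted
    flips += vR > v

print(flips / reps)   # close to 1 here: the estimated ordering is typically reversed
```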
2.4. (Root) Mean Squared Error (self-reading at home)
Summary from previous discussion
From the previous discussion it is clear that neither $\hat\beta_1$ nor $\hat\beta_{1,R}$ dominates the other in both bias and variance when it is uncertain whether $\beta_{2,0} = 0$.
Combining two. Definition.
Instead of considering bias and variance separately, we can look at a measure that combines the two. In particular, the Mean Squared Error (MSE) is defined as:
$$MSE(\hat\beta_1 \mid X) = \mathbb{E}\big[(\hat\beta_1 - \beta_{1,0})(\hat\beta_1 - \beta_{1,0})' \mid X\big].$$
Hence, unlike the variance, which is centered at the mean, the MSE is centered at the true value $\beta_{1,0}$.
It is easy to see that:
$$MSE(\hat\beta_1 \mid X) = \mathrm{var}(\hat\beta_1 \mid X) + \mathrm{bias}(\hat\beta_1 \mid X)\,\mathrm{bias}(\hat\beta_1 \mid X)'.$$
Hence, whether $MSE(\hat\beta_1 \mid X) > MSE(\hat\beta_{1,R} \mid X)$ (or the other way around) depends on many ingredients: $\sigma_0^2$, $\beta_{2,0}$, as well as the relationship between $X_1$ and $X_2$.
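A simulation sketch (my addition; the highly collinear design and coefficient values are assumptions) illustrates the resulting trade-off: for small $\beta_{2,0}$ the biased restricted estimator can still win on MSE.

```python
# Monte Carlo sketch of the bias-variance trade-off via MSE (assumed DGP).
import numpy as np

rng = np.random.default_rng(6)
n, reps = 50, 5000
beta1 = -0.5
for beta2 in (0.0, 0.1, 1.0):
    errR = np.empty(reps)
    errU = np.empty(reps)
    for r in range(reps):
        x1 = rng.normal(size=n)
        x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)   # highly collinear with x1
        y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)
        X1 = np.column_stack([np.ones(n), x1])
        X = np.column_stack([X1, x2])
        errR[r] = np.linalg.solve(X1.T @ X1, X1.T @ y)[1] - beta1  # restricted error
        errU[r] = np.linalg.solve(X.T @ X, X.T @ y)[1] - beta1     # unrestricted error
    # For beta2 in {0, 0.1} the restricted MSE is smaller; for beta2 = 1 it is larger.
    print(beta2, (errR**2).mean(), (errU**2).mean())
```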
3. Implications for model fit
3.1. The $R^2$
Look! My $R^2$ increased
$R^2$ always increases...
Consider the two residual sums of squares (in both cases we include an intercept):
$$SSR_R = y'M_{X_1}y, \qquad SSR = y'M_X y.$$
There are several ways this can be shown. I will present two: i) a projection-based argument; ii) an objective-function-based argument.
$R^2$ always increases. Projection-based argument.
Consider the definition of $SSR_R$ and use the fact that $y = P_X y + M_X y$:
$$\begin{aligned}
SSR_R &= y'M_{X_1}y \\
&= y'M_{X_1}P_X y + y'M_{X_1}M_X y \\
&= y'M_{X_1}P_X y + y'M_X y \\
&= y'P_X'M_{X_1}P_X y + y'M_X M_{X_1}P_X y + y'M_X y \\
&= y'P_X'M_{X_1}P_X y + y'M_X P_X y + y'M_X y \\
&= y'P_X'M_{X_1}P_X y + y'M_X y \\
&= y'P_X'M_{X_1}P_X y + SSR.
\end{aligned}$$
Since $y'P_X'M_{X_1}P_X y \geq 0$, it follows that $SSR_R \geq SSR$.
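A quick numerical check of the inequality $SSR_R \geq SSR$ (my addition, on simulated data chosen purely for illustration):

```python
# Check that SSR never increases when regressors are added (so R^2 never falls).
import numpy as np

rng = np.random.default_rng(7)
n = 100
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = rng.normal(size=(n, 2))
X = np.hstack([X1, X2])
y = rng.normal(size=n)          # even pure noise obeys the inequality

def ssr(Z, y):
    e = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]   # residuals after OLS on Z
    return e @ e

print(ssr(X1, y) >= ssr(X, y))  # True
```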
$R^2$ always increases. Objective-function-based argument.
Consider the vector $\tilde\beta = (\hat\beta_{1,R}',\ 0_{K_2}')'$.
In other words:
$$LS_n(\hat\beta) \leq LS_n(\beta), \qquad (53)$$
for any choice of $\beta$. In particular, since it holds for any $\beta$, it also holds for the $\tilde\beta$ we defined above, thus:
$$LS_n(\hat\beta) \leq LS_n(\tilde\beta). \qquad (54)$$
3.2. Adjusted $R^2$
The Adjusted $R^2$
The two coincide when $K = 1$, i.e. when only the constant term is included in the model, but in other cases:
$$\bar{R}^2 \leq R^2. \qquad (56)$$
Adjusted $R^2$ for model selection
In principle, $\bar{R}^2$ can be used for model selection; for example, to choose between a small model with $K_1$ regressors and a large model with $K_1 + K_2$ regressors.
You just need to select the model that produces the largest $\bar{R}^2$.
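A sketch of this selection rule on simulated data (my addition; the data-generating process and the helper `r2_adj` are assumptions, not from the slides):

```python
# Adjusted-R^2 model selection: pick the model with the larger adjusted R^2.
import numpy as np

def r2_adj(y, Z):
    n, K = Z.shape
    e = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]   # OLS residuals
    tss = ((y - y.mean())**2).sum()
    r2 = 1 - (e @ e) / tss
    return 1 - (1 - r2) * (n - 1) / (n - K)            # coincides with R^2 when K = 1

rng = np.random.default_rng(8)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                          # irrelevant regressor
y = 2.0 - 0.5 * x1 + rng.normal(size=n)
small = np.column_stack([np.ones(n), x1])
large = np.column_stack([small, x2])

# The small model usually wins here, since x2 is irrelevant.
print(r2_adj(y, small), r2_adj(y, large))
```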
Illustration. Model 1.
Illustration. Model 2.
Conclusions?
The larger model, with the interaction term between $D_i$ and $distance_i$, produces not only a larger $R^2$ but also a larger $\bar{R}^2$.
Illustration. Alternative model with two binary variables.
The example below shows that $\bar{R}^2$ can decrease if an additional variable only marginally increases $R^2$:
$$price_i = \alpha + \beta_1 D_i + \beta_2 B_i + \varepsilon_i. \qquad (58)$$
Results
Results. Original setup.
Higher $R^2$, but lower $\bar{R}^2$!
Other methods for model selection
4. Summary
Summary today
In this lecture:
- We discussed the concepts of total and partial effects.
- We related these concepts to restricted and unrestricted OLS estimators.
- We used the FWL theorem to study the properties of the restricted OLS estimator.
- We introduced the concept of the omitted variable bias.
On Friday