Week 3, Lecture 1

Econometrics 1 (6012B0374Y)

dr. Artūras Juodis

University of Amsterdam

Week 3. Lecture 1

February, 2024

1 / 61
Overview

1 Total vs. Partial effects
    Empirical problem
    Ceteris paribus analysis
2 Adding and subtracting variables
    Restricted and unrestricted estimators
    Biased or unbiased?
    Variance
    (Root) Mean Squared Error (self reading at home)
3 Implications for model fit
    The R^2
    Adjusted R^2
4 Summary

2 / 61
The plan for this week

▶ We use the Frisch-Waugh-Lovell theorem to study the properties of OLS after adding and subtracting regressors from the model.
▶ We analyze under which conditions smaller models can be better than larger models, and vice versa.
▶ We show that R^2 is not a good measure for determining the size of the model.
▶ We analyze the finite-sample distribution of the OLS estimator under the normality assumption (Friday).
▶ We show how to test a simple statistical hypothesis H_0: κ'β_0 = c using simple t-tests (Friday).

3 / 61
Recap: Linear model

Last week, we introduced the multiple regression model using vector notation:

y_i = x_i'β + ε_i,    i = 1, . . . , n,    (1)

for a [K × 1]-dimensional vector β of unknown regression coefficients. We illustrated how models with K > 2 arise naturally if one combines restricted models with K = 2.

In particular, we showed how two different measures of the impact of distance can be combined into a richer model that better explains the variation of hotel prices in Vienna.

4 / 61
Recap: Projection matrices

We used these projection matrices to prove the Frisch-Waugh-Lovell (FWL) theorem, which states that, if we partition X = (X_1, X_2) and β̂ = (β̂_1', β̂_2')', then:

β̂_1 = (X_1' M_{X_2} X_1)^{-1} (X_1' M_{X_2} y),    (2)
β̂_2 = (X_2' M_{X_1} X_2)^{-1} (X_2' M_{X_1} y).    (3)
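
As a quick sanity check, the FWL equalities can be verified numerically: partialling X_2 out of both y and X_1 and then running OLS reproduces the β̂_1 block of the full regression. The sketch below uses simulated data; all names and numbers are illustrative assumptions, not the lecture's dataset.

```python
# A minimal numerical check of the FWL theorem on simulated data.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + a distance-like regressor
X2 = 0.5 * X1[:, [1]] + rng.normal(size=(n, 1))          # a correlated second block (stars-like)
X = np.hstack([X1, X2])
y = X @ np.array([1.0, -2.0, 3.0]) + rng.normal(size=n)

# OLS on the full model
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# FWL: annihilate X2, then regress the residualized y on the residualized X1
M2 = np.eye(n) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)
beta1_fwl = np.linalg.solve(X1.T @ M2 @ X1, X1.T @ M2 @ y)

print(beta_hat[:2])   # coefficients on X1 from the full regression
print(beta1_fwl)      # identical up to rounding, as equation (2) states
```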

5 / 61
1. Total vs. Partial effects

6 / 61
1.1. Empirical problem

7 / 61
Back to Hotels in Vienna

Your colleague is worried that, while measuring the effect of distance on price, you miss one important aspect: hotels that are closer to the city center also tend to be the ones with more stars.

This means that premium hotels, which might offer SPA, conference, and better restaurant facilities, are also located closer to the city center.

Your colleague suggests that you also include the stars_i regressor in the model and, using the FWL intuition, control for this potential relationship between distance_i and stars_i:

price_i = α + β_1 distance_i + β_2 stars_i + ε_i.    (4)
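
In practice this regression is a one-liner. A minimal sketch in Python, assuming a hypothetical dataframe with columns price, distance, and stars (the file name and column names are illustrative, not the course's dataset):

```python
# Sketch: estimating the "small" and "large" models for the hotel-price example.
import pandas as pd
import statsmodels.formula.api as smf

hotels = pd.read_csv("hotels_vienna.csv")   # hypothetical file with price, distance, stars columns

small = smf.ols("price ~ distance", data=hotels).fit()          # total effect of distance
large = smf.ols("price ~ distance + stars", data=hotels).fit()  # partial effect of distance

# With stars included, the distance coefficient typically shrinks in absolute value.
print(small.params["distance"], large.params["distance"])
```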

8 / 61
New model. Regression Output.

Figure: Regression of price_i on distance_i and stars_i.

9 / 61
Original model. Regression Output.
Compare with the original model:

Figure: Regression of price_i on distance_i only.

10 / 61
Interpretation?
We observe that with the inclusion of stars_i the effect of distance becomes smaller (in absolute value), but it is still negative.

Using the FWL theorem, we know that the two estimators of the effect of distance_i on price_i (with and without stars_i included) measure, algebraically, two different types of association:
1. The one with stars_i included in the model measures the partial correlation between distance_i and price_i. Intuitively, when comparing two hotels with different prices, we first remove any price differences that come from the different number of stars they are awarded.
2. The one without stars_i included in the model measures the total correlation between distance_i and price_i. In particular, when measuring the effect of distance_i we neglect the fact that expensive hotels can also be the ones closer to the city center.
These are two different estimates with two different interpretations.

11 / 61
Graphical illustration

Figure: X = distancei ; D = starsi ; Y = pricei .

12 / 61
1.2. Ceteris paribus analysis

13 / 61
Simple model

Econometricians like to think about relationships between variables in terms of models. For example, consider the following two-equation model for price, distance, and stars:

price_i = c^{(1)} + β_1 distance_i + β_2 stars_i + u_i^{(1)},    (5)
stars_i = c^{(2)} + π distance_i + u_i^{(2)}.    (6)

Assume that:

E[u_i^{(1)} | distance_i, stars_i] = 0,    E[u_i^{(2)} | distance_i] = 0.    (7)

14 / 61
Coefficient interpretation. Large model

Note that in the large model β_1 and β_2 can be interpreted as:

∂E[price_i | distance_i, stars_i] / ∂distance_i = β_1,
∂E[price_i | distance_i, stars_i] / ∂stars_i = β_2.

Hence, both coefficients measure what is usually called the ceteris paribus effect of changing either distance_i or stars_i marginally, while keeping the other characteristics as they are.

(Here, for simplicity, I pretend that stars_i is a continuous variable.)

15 / 61
Coefficient interpretation. Small model

What about the interpretation in the model where only distance_i is included? For this we combine the two equations:

price_i = c^{(1)} + β_1 distance_i + β_2 (c^{(2)} + π distance_i + u_i^{(2)}) + u_i^{(1)}
        = (c^{(1)} + β_2 c^{(2)}) + (β_1 + β_2 π) distance_i + (u_i^{(1)} + β_2 u_i^{(2)}).

In this case:

∂E[price_i | distance_i] / ∂distance_i = β_1 + β_2 π.    (8)

Note: this is also the coefficient of the population linear projection of price_i on distance_i!

16 / 61
Hence, in this linear model we measure the total (both direct and indirect, through stars_i) effect of distance_i:

β_{distance,total} = β_1 + β_2 π.    (9)

Note that we generally expect β_1 < 0, β_2 > 0 and π < 0. Hence the total effect is more negative than the direct effect.

This is also what we found in our application.
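
A short simulation illustrates decomposition (9): regressing price on distance alone recovers (approximately) β_1 + β_2 π. The true values below are illustrative assumptions, not estimates from the Vienna data.

```python
# Sketch: total effect = direct + indirect, on data simulated from equations (5)-(6).
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
beta1, beta2, pi = -20.0, 30.0, -0.5      # assumed values with beta1 < 0, beta2 > 0, pi < 0

distance = rng.uniform(0, 10, size=n)
stars = 4.0 + pi * distance + rng.normal(size=n)                                # equation (6)
price = 100 + beta1 * distance + beta2 * stars + rng.normal(scale=10, size=n)   # equation (5)

# OLS slope of the "small" model: price on distance only
slope_small = np.polyfit(distance, price, 1)[0]
print(slope_small, beta1 + beta2 * pi)    # both approximately -35: the total effect
```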

17 / 61
2. Adding and subtracting variables

18 / 61
2.1. Restricted and unrestricted estimators

19 / 61
Two models

Consider the situation where you are not sure if you want to model the data using the “small” model:

y = X_1 β_1 + ε,    (10)

or using the “large” model:

y = X_1 β_1 + X_2 β_2 + ε.    (11)

In both cases you are primarily interested in β_1; in our case, the coefficient on distance_i.

20 / 61
Two estimators

The “small” model can be seen as a restricted version of the “large” model with β_2 = 0. Using this notion, we add the subscript R and write β̂_{1,R} for the restricted estimator. The two OLS estimators of β_1 are:

β̂_{1,R} = (X_1' X_1)^{-1} (X_1' y).    (restricted)    (12)

Using the FWL theorem, the unrestricted estimator of β_1 is given by:

β̂_1 = (X_1' M_{X_2} X_1)^{-1} (X_1' M_{X_2} y).    (unrestricted)    (13)

21 / 61
Two estimators. Important relationship.

In the proof of the FWL theorem, in one of the intermediate steps we showed that:

β̂_1 = (X_1' X_1)^{-1} X_1' (y − X_2 β̂_2).    (14)

Equivalently:

β̂_1 = β̂_{1,R} − (X_1' X_1)^{-1} X_1' X_2 β̂_2.    (15)

Hence, the unrestricted estimator β̂_1 can be seen as a linear combination of the restricted estimator β̂_{1,R} and β̂_2.

22 / 61
OLS and model parameters

Hence, from the above we see that the total effect β̂_{1,R} decomposes as:

β̂_{1,R} = β̂_1 + (X_1' X_1)^{-1} X_1' X_2 β̂_2    (16)
(total effect = direct effect + indirect effect)
         = β̂_1 + Π̂ β̂_2.    (17)

Here Π̂ ≡ (X_1' X_1)^{-1} X_1' X_2 is the OLS estimator from the regression of X_2 on X_1 (i.e., of the number of stars on distance).
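
The decomposition (16)-(17) is purely algebraic, so it holds exactly in any sample. A minimal sketch on simulated data (names and numbers are illustrative):

```python
# Sketch: beta1_hat_R = beta1_hat + Pi_hat @ beta2_hat holds exactly, by OLS algebra.
import numpy as np

rng = np.random.default_rng(2)
n = 500
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])   # e.g. intercept + distance
X2 = 0.7 * X1[:, [1]] + rng.normal(size=(n, 1))          # e.g. stars, correlated with distance
y = 1.0 - 2.0 * X1[:, 1] + 3.0 * X2[:, 0] + rng.normal(size=n)

X = np.hstack([X1, X2])
beta_full = np.linalg.solve(X.T @ X, X.T @ y)            # unrestricted OLS
beta1_hat, beta2_hat = beta_full[:2], beta_full[2:]

beta1_R = np.linalg.solve(X1.T @ X1, X1.T @ y)           # restricted OLS (X2 omitted)
Pi_hat = np.linalg.solve(X1.T @ X1, X1.T @ X2)           # regression of X2 on X1

print(beta1_R)
print(beta1_hat + Pi_hat @ beta2_hat)                    # identical up to rounding
```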

23 / 61
This decomposition is exactly the same as the one in our two-equation model for price, distance, and stars:

β_{1,R} = β_1 + π β_2,    (18)

where (as before):

price_i = c^{(1)} + β_1 distance_i + β_2 stars_i + u_i^{(1)},    (19)
stars_i = c^{(2)} + π distance_i + u_i^{(2)}.    (20)

24 / 61
2.2. Biased or unbiased?

25 / 61
Decomposition again

In what follows we will use the decomposition:

β̂_{1,R} = β̂_1 + (X_1' X_1)^{-1} X_1' X_2 β̂_2.    (21)

We will also use the fact (from previous lectures) that the OLS estimator in the large model is unbiased, i.e.:

E[β̂ | X] = β_0.    (22)

In particular:

E[β̂_1 | X] = β_{1,0},    (23)
E[β̂_2 | X] = β_{2,0}.    (24)

26 / 61
Unbiased estimation with irrelevant regressors

First, consider the situation where β_{2,0} = 0, i.e. the regressors X_2 do not contribute directly to the model for y. In this case:

E[β̂_{1,R} | X] = E[β̂_1 | X] + E[(X_1' X_1)^{-1} X_1' X_2 β̂_2 | X]
               = β_{1,0} + (X_1' X_1)^{-1} X_1' X_2 E[β̂_2 | X]
               = β_{1,0}.

Hence both the restricted estimator β̂_{1,R} and the unrestricted estimator β̂_1 are unbiased.

In this respect, they are equivalent.

27 / 61
Fully conditional bias. Non-irrelevant regressors.

Consider now the situation where β_{2,0} ≠ 0, e.g. there is a direct effect of stars_i on price_i. In this situation:

E[β̂_{1,R} | X] = E[β̂_1 | X] + E[(X_1' X_1)^{-1} X_1' X_2 β̂_2 | X]
               = β_{1,0} + (X_1' X_1)^{-1} X_1' X_2 E[β̂_2 | X]
               = β_{1,0} + (X_1' X_1)^{-1} X_1' X_2 β_{2,0}.

Hence:

E[β̂_{1,R} | X] ≠ β_{1,0},    (25)

and the restricted estimator is conditionally biased.

28 / 61
The Omitted variable bias

The difference between E[β̂_{1,R} | X] and β_{1,0}, i.e.:

E[β̂_{1,R} | X] − β_{1,0} = (X_1' X_1)^{-1} X_1' X_2 β_{2,0},    (26)

is called the omitted variable bias.

For obvious reasons, this bias is caused by omitting a relevant variable (X_2 in this case) from the model. Unless X_1' X_2 = O (almost impossible to satisfy), this bias is non-zero.
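
A small Monte Carlo sketch makes the contrast concrete: with β_{2,0} = 0 the restricted estimator is centered at β_{1,0}, while with β_{2,0} ≠ 0 it is shifted by the omitted variable bias. Everything below is simulated for illustration only.

```python
# Sketch: omitted variable bias in repeated samples (regressors held fixed, new errors each draw).
import numpy as np

rng = np.random.default_rng(3)
n, reps = 200, 5_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)      # x2 correlated with x1
beta1_true = 1.0

def average_restricted_estimate(beta2_true):
    estimates = []
    for _ in range(reps):
        y = beta1_true * x1 + beta2_true * x2 + rng.normal(size=n)
        estimates.append(x1 @ y / (x1 @ x1))     # restricted OLS (x2 omitted, no intercept)
    return np.mean(estimates)

print(average_restricted_estimate(0.0))   # close to 1.0: no bias when beta_{2,0} = 0
print(average_restricted_estimate(2.0))   # shifted by (x1'x2 / x1'x1) * 2: omitted variable bias
```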

29 / 61
Partially conditional bias (Not in your textbook)

Note that while X_1' X_2 = O might be difficult to satisfy in any given sample, it is possible that it holds in repeated samples. For example, it is possible that E[X_2 | X_1] = O. In that case:

E[β̂_{1,R} | X_1] = E[ E[β̂_{1,R} | X_1, X_2] | X_1 ]
                 = β_{1,0} + E[(X_1' X_1)^{-1} X_1' X_2 β_{2,0} | X_1]
                 = β_{1,0} + (X_1' X_1)^{-1} X_1' E[X_2 | X_1] β_{2,0}
                 = β_{1,0}.

Hence, there is no omitted variable bias left, because the two sets of regressors are uncorrelated across repeated samples.
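
The sketch below illustrates this partially conditional statement: X_2 is redrawn in every replication independently of X_1, β_{2,0} ≠ 0, and yet the restricted estimator is centered at β_{1,0} on average. Simulated, illustrative numbers only.

```python
# Sketch: no omitted variable bias when E[X2 | X1] = 0, even though beta_{2,0} != 0.
import numpy as np

rng = np.random.default_rng(4)
n, reps = 200, 5_000
x1 = rng.normal(size=n)                      # X1 held fixed across replications
beta1_true, beta2_true = 1.0, 2.0

estimates = []
for _ in range(reps):
    x2 = rng.normal(size=n)                  # redrawn each time, independent of x1
    y = beta1_true * x1 + beta2_true * x2 + rng.normal(size=n)
    estimates.append(x1 @ y / (x1 @ x1))     # restricted OLS, x2 omitted

print(np.mean(estimates))                    # close to 1.0: unbiased conditionally on X1 alone
```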

30 / 61
Which bias is the relevant one?
Hence, should we worry more about the bias conditional on X, or conditional on X_1 only? The answer depends on your specific situation, and on whether treating the regressors as fully fixed is justified.

Note that unbiasedness is a fictional concept, where we imagine what would have happened if we observed different realizations of the same probabilistic model many times.

Take the hotels example, and assume that hotels cannot change their locations, so distance is fixed.
1. If you are interested in measuring the effect of distance on prices for a given set of hotels with a given set of stars, then looking at the fully conditional (on X) statements is more appropriate.
2. If you are interested in measuring the effect of distance on prices, but you allow for the possibility that hotels might change their star ratings (in this imagined world), then looking at statements conditional on X_1 alone is sufficient.

31 / 61
2.3. Variance

32 / 61
Useful fact

Let A and B be two positive definite matrices of dimension [q × q]. Then:

A − B ≥ 0,    (27)

if and only if

B^{-1} − A^{-1} ≥ 0.    (28)

Here, as in the previous lecture, by ≥ we mean "no smaller" in the positive semi-definite sense.

We will not attempt to prove this fact, and leave it for curious students to check.

33 / 61
Adding Irrelevant variables. Variance Effect.

The two OLS estimators of β_1:

β̂_1 = (X_1' M_{X_2} X_1)^{-1} (X_1' M_{X_2} y),    (29)
β̂_{1,R} = (X_1' X_1)^{-1} (X_1' y).    (30)

Let us assume that the true model has β_{2,0} = 0_{K_2} (true coefficients). Then X_2 is an irrelevant regressor.

Variances:

var(β̂_1 | X) = σ_0^2 (X_1' M_{X_2} X_1)^{-1},    (31)
var(β̂_{1,R} | X) = σ_0^2 (X_1' X_1)^{-1}.    (32)

34 / 61
We now show that:

(X_1' M_{X_2} X_1)^{-1} ≥ (X_1' X_1)^{-1},    (33)

so that the variance of the unrestricted estimator (with the irrelevant X_2 included) is no smaller than that of the restricted estimator.

Using the fact above, the previous statement can be equivalently established by proving that:

X_1' X_1 − X_1' M_{X_2} X_1 ≥ 0.    (34)

But this is trivial, because:

X_1' X_1 − X_1' M_{X_2} X_1 = X_1' P_{X_2} X_1.    (35)

By the definition of the projection matrix P_{X_2}, it follows immediately that X_1' P_{X_2} X_1 ≥ 0.
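
A quick numerical illustration of (33): for any simulated design, the difference of the two (conditional variance) matrices is positive semi-definite, i.e. all its eigenvalues are non-negative. Names and numbers below are illustrative.

```python
# Sketch: (X1' M_{X2} X1)^{-1} - (X1' X1)^{-1} is positive semi-definite.
import numpy as np

rng = np.random.default_rng(5)
n = 300
X1 = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
X2 = 0.5 * X1[:, [1]] + rng.normal(size=(n, 2))

M2 = np.eye(n) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)
V_unrestricted = np.linalg.inv(X1.T @ M2 @ X1)     # variance of beta1_hat, up to sigma_0^2
V_restricted = np.linalg.inv(X1.T @ X1)            # variance of beta1_hat_R, up to sigma_0^2

diff = V_unrestricted - V_restricted
print(np.linalg.eigvalsh(diff))                    # all eigenvalues >= 0 (up to rounding)
```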

35 / 61
Variance. In the presence of bias.

Consider the (fully conditional) variance of the restricted estimator when β_{2,0} ≠ 0. By definition:

var(β̂_{1,R} | X) = E[(β̂_{1,R} − E[β̂_{1,R} | X])(β̂_{1,R} − E[β̂_{1,R} | X])' | X].    (36)

In this case:

β̂_{1,R} − E[β̂_{1,R} | X] = (X_1' X_1)^{-1} X_1' ε,    (37)

so that (following the same steps as before):

var(β̂_{1,R} | X) = σ_0^2 (X_1' X_1)^{-1}.    (38)

Conclusion: irrespective of the value of β_{2,0}, the fully conditional variance of β̂_{1,R} remains the same.

36 / 61
Summary. Variance effect.

Table: The effects of omitting potentially relevant variables

Estimator | β_{2,0} = 0                                   | β_{2,0} ≠ 0
β̂_{1,R}   | unbiased, efficient                           | biased, efficient
          | E[β̂_{1,R} | X] = β_{1,0}                      | E[β̂_{1,R} | X] ≠ β_{1,0}
          | var(β̂_{1,R} | X) = σ_0^2 (X_1' X_1)^{-1}      | var(β̂_{1,R} | X) = σ_0^2 (X_1' X_1)^{-1}
β̂_1       | unbiased, not efficient                       | unbiased, not efficient
          | E[β̂_1 | X] = β_{1,0}                          | E[β̂_1 | X] = β_{1,0}
          | var(β̂_1 | X) = σ_0^2 (X_1' M_{X_2} X_1)^{-1}  | var(β̂_1 | X) = σ_0^2 (X_1' M_{X_2} X_1)^{-1}

Note that here I define efficiency in terms of the smallest variance among the two competing linear estimators.

37 / 61
Important! (and always forgotten)

Note that while it is always true that:

var(β̂_{1,R} | X) ≤ var(β̂_1 | X),    (39)

it is not true that the corresponding estimates of the variances (and hence also the standard errors),

v̂ar(β̂_{1,R} | X) = s_R^2 (X_1' X_1)^{-1},    (40)
v̂ar(β̂_1 | X) = s^2 (X_1' M_{X_2} X_1)^{-1},    (41)

always satisfy:

v̂ar(β̂_{1,R} | X) ≤ v̂ar(β̂_1 | X).    (42)

This is because it is possible that s_R^2 ≥ s^2, while it is always the case that (X_1' X_1)^{-1} ≤ (X_1' M_{X_2} X_1)^{-1}.
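
The sketch below computes both estimated variances on a single simulated sample in the scalar case. Depending on the draw and on how strongly X_2 matters, the ordering in (42) can hold or be reversed, which is exactly the point. All numbers are illustrative.

```python
# Sketch: estimated variances of the restricted vs. unrestricted slope estimator (scalar case).
import numpy as np

rng = np.random.default_rng(6)
n = 200
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)     # here x2 is a relevant regressor

# Unrestricted: regress y on (x1, x2); s^2 uses n - 2 degrees of freedom
X = np.column_stack([x1, x2])
beta = np.linalg.solve(X.T @ X, X.T @ y)
s2 = np.sum((y - X @ beta) ** 2) / (n - 2)
M2 = np.eye(n) - np.outer(x2, x2) / (x2 @ x2)
var_unrestricted = s2 / (x1 @ M2 @ x1)

# Restricted: regress y on x1 only; s_R^2 uses n - 1 degrees of freedom
b_R = x1 @ y / (x1 @ x1)
s2_R = np.sum((y - b_R * x1) ** 2) / (n - 1)
var_restricted = s2_R / (x1 @ x1)

print(var_restricted, var_unrestricted)   # the ordering can go either way across samples
```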

38 / 61
2.4. (Root) Mean Squared Error (self reading at home)

39 / 61
Summary from the previous discussion

From the previous discussion it is clear that neither β̂_1 nor β̂_{1,R} dominates the other in both having lower bias and lower variance when it is uncertain whether β_{2,0} = 0 or not.

Hence, which approach should be used? In our setup, the answer depends on whether you value unbiased estimation more, or more certain estimation (i.e. estimators with lower variance).

Is there a middle ground?

40 / 61
Combining two. Definition.

Note that instead of considering bias and variance separately, we can look at a measure that combines the two. In particular, the Mean Squared Error (MSE) is defined as:

MSE(β̂_1 | X) = E[(β̂_1 − β_{1,0})(β̂_1 − β_{1,0})' | X],    (43)
MSE(β̂_{1,R} | X) = E[(β̂_{1,R} − β_{1,0})(β̂_{1,R} − β_{1,0})' | X].    (44)

Hence, unlike the variance, which is centered at the mean, the MSE is centered at the true value β_{1,0}.

41 / 61
It is easy to see that:

MSE = Bias Bias' + Variance.    (45)

Hence, in our case:

MSE(β̂_1 | X) = σ_0^2 (X_1' M_{X_2} X_1)^{-1},    (46)
MSE(β̂_{1,R} | X) = σ_0^2 (X_1' X_1)^{-1} + (X_1' X_1)^{-1} X_1' X_2 β_{2,0} β_{2,0}' X_2' X_1 (X_1' X_1)^{-1}.    (47)

Hence, whether MSE(β̂_1 | X) > MSE(β̂_{1,R} | X) (or the other way around) now depends on many ingredients: σ_0^2, β_{2,0}, as well as the relationship between X_1 and X_2.
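
The trade-off in (46)-(47) can be computed directly for a given design. The sketch below compares the two (scalar) MSEs over a range of values of β_{2,0}; all numbers are illustrative assumptions.

```python
# Sketch: MSE of the restricted vs. unrestricted slope estimator as beta_{2,0} varies.
import numpy as np

rng = np.random.default_rng(7)
n, sigma2 = 100, 1.0
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)

M2 = np.eye(n) - np.outer(x2, x2) / (x2 @ x2)
mse_unrestricted = sigma2 / (x1 @ M2 @ x1)            # equation (46), scalar case

for beta2 in [0.0, 0.05, 0.2, 1.0]:
    bias = (x1 @ x2) / (x1 @ x1) * beta2              # omitted variable bias
    mse_restricted = sigma2 / (x1 @ x1) + bias ** 2   # equation (47), scalar case
    print(beta2, mse_restricted, mse_unrestricted)

# For beta_{2,0} close to zero the restricted estimator tends to have the lower MSE;
# for larger beta_{2,0} the unrestricted estimator wins.
```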

42 / 61
3. Implications for model fit

43 / 61
3.1. The R^2

44 / 61
Look! My R^2 increased

In the empirical examples we considered so far, we always noticed that by adding additional regressors, e.g. by adding X_2, the R^2 of the model went up. Always. This might suggest that X_2 helps to better explain the variation in y, and hence that X_2 is a useful explanatory variable.

45 / 61
R^2 always increases...

Below we show that this reasoning is wrong: R^2 always increases when additional regressors are added to the model. Hence, R^2 does not say much about the quality of the model; it is just algebra.

Consider the two residual sums of squares (in both cases we include an intercept):

SSR_R = y' M_{X_1} y = (y − X_1 β̂_{1,R})'(y − X_1 β̂_{1,R}),    (48)
SSR = y' M_X y = (y − X β̂)'(y − X β̂).    (49)

We want to show that always:

SSR_R ≥ SSR.    (50)

There are several ways to show this. I will present two: i) a projection-based argument; ii) an objective-function-based argument.
46 / 61
R^2 always increases. Projection-based argument.
Consider the definition of SSR_R and use the fact that y = P_X y + M_X y:

SSR_R = y' M_{X_1} y
      = y' M_{X_1} P_X y + y' M_{X_1} M_X y
      = y' M_{X_1} P_X y + y' M_X y
      = y' P_X' M_{X_1} P_X y + y' M_X M_{X_1} P_X y + y' M_X y
      = y' P_X' M_{X_1} P_X y + y' M_X P_X y + y' M_X y
      = y' P_X' M_{X_1} P_X y + y' M_X y
      = y' P_X' M_{X_1} P_X y + SSR.

Here we used the fact that M_X M_{X_1} = M_X (as we showed last week), and that M_X P_X = O. Furthermore, y' P_X' M_{X_1} P_X y ≥ 0, as it is a quadratic form in the positive semi-definite matrix M_{X_1}. So:

SSR_R ≥ SSR.    (51)

47 / 61
R^2 always increases. Objective-function-based argument.

Consider the vector β̃ = (β̂_{1,R}', 0_{K_2}')'.

Recall that last week we proved that β̂ is a minimizer of the least-squares objective function, i.e.:

β̂ = arg min_β LS_n(β) = arg min_β (y − Xβ)'(y − Xβ).    (52)

In other words:

LS_n(β̂) ≤ LS_n(β),    (53)

for any choice of β. In particular, since it holds for any β, it also holds for the β̃ defined above; thus:

LS_n(β̂) ≤ LS_n(β̃).    (54)

Hence SSR ≤ SSR_R, by the definitions of SSR and SSR_R (note that LS_n(β̂) = SSR and LS_n(β̃) = SSR_R).
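
A numerical sketch of (50): adding a regressor, even pure noise, never increases the SSR and never decreases R^2. Simulated, illustrative data.

```python
# Sketch: SSR never increases (and R^2 never decreases) when a regressor is added.
import numpy as np

rng = np.random.default_rng(8)
n = 150
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X1 @ np.array([1.0, 2.0]) + rng.normal(size=n)

def ssr_and_r2(X, y):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    ssr = resid @ resid
    tss = np.sum((y - y.mean()) ** 2)
    return ssr, 1.0 - ssr / tss

noise = rng.normal(size=(n, 1))                    # a completely irrelevant extra regressor
print(ssr_and_r2(X1, y))                           # (SSR_R, R^2) of the small model
print(ssr_and_r2(np.hstack([X1, noise]), y))       # SSR weakly lower, R^2 weakly higher
```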

48 / 61
3.2. Adjusted R^2

49 / 61
The Adjusted R^2

When confronted with the pattern that additional regressors always improve the (in-sample) fit, one common practice is to penalize models with large K. The oldest approach uses a degrees-of-freedom adjustment in the so-called Adjusted R^2:

R̄^2 = 1 − [ (1/(n−K)) Σ_{i=1}^{n} ê_i^2 ] / [ (1/(n−1)) Σ_{i=1}^{n} (y_i − ȳ)^2 ].    (55)

The two coincide when K = 1, e.g. when only the constant term is included in the model, but in other cases:

R̄^2 ≤ R^2.    (56)

It is not difficult to see that:

R̄^2 = 1 − (n−1)/(n−K) (1 − R^2).    (57)
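
A short sketch of (55), cross-checked against (57), computed directly from OLS residuals on illustrative data:

```python
# Sketch: R^2 vs. adjusted R^2, computed as in (55) and cross-checked via (57).
import numpy as np

rng = np.random.default_rng(9)
n, K = 120, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e_hat = y - X @ beta

ssr = e_hat @ e_hat
tss = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ssr / tss
adj_r2 = 1.0 - (ssr / (n - K)) / (tss / (n - 1))   # definition (55)

print(r2, adj_r2)
print(1.0 - (n - 1) / (n - K) * (1.0 - r2))        # identical to adj_r2, as in (57)
```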

50 / 61
Adjusted R^2 for model selection

In principle, R̄^2 can be used for model selection: for example, to choose between a small model with K_1 regressors and a large model with K_1 + K_2 regressors.

You simply select the model that produces the largest R̄^2.

51 / 61
Illustration. Model 1.

Figure: Regression of price_i on distance_i and the dummy variable D_i = 1(distance_i < 2km).

52 / 61
Illustration. Model 2.

Figure: Regression of price_i on distance_i, the dummy variable D_i = 1(distance_i < 2km), and their interaction.

53 / 61
Conclusions?

The larger model, with the interaction term between D_i and distance_i, produces not only a larger R^2, but also a larger R̄^2.

This is not surprising, as the increase in R^2 was substantial, at the cost of only one added regressor.

54 / 61
Illustration. Alternative model with two binary variables.

The example below shows that R̄^2 can decrease if an additional variable only marginally increases R^2.

Consider an alternative multiple regression model for hotel prices in Vienna:

price_i = α + β_1 D_i + β_2 B_i + ε_i.    (58)

Here D_i is defined as before, while B_i = 1(2km < distance_i < 4km).

55 / 61
Results

Figure: Regression of price_i on the dummy variable D_i = 1(distance_i < 2km) and the dummy variable B_i = 1(2km < distance_i < 4km).

56 / 61
Results. Original setup.

Figure: Regression of price_i on the dummy variable D_i = 1(distance_i < 2km).

Higher R^2, but lower R̄^2!

57 / 61
Other methods for model selection

Other approaches for model selection:

▶ Information criteria: AIC, BIC.
▶ Sample splitting into training and validation sets.
▶ Leave-one-out cross-validation.
▶ etc.

Only AIC/BIC will be discussed, at the end of this course.
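
For concreteness, a minimal sketch of how AIC and BIC could be computed from an OLS fit under Gaussian errors; the exact constants differ across textbooks, so treat this as one illustrative convention rather than the course's definition.

```python
# Sketch: AIC/BIC for an OLS model under Gaussian errors (one common convention; lower is better).
import numpy as np

def aic_bic(y, X):
    n, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2_hat = resid @ resid / n                           # ML estimate of the error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)
    return -2 * loglik + 2 * k, -2 * loglik + np.log(n) * k  # (AIC, BIC)

# Illustrative comparison of a small and a large model on simulated data.
rng = np.random.default_rng(10)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
X_small = np.column_stack([np.ones(n), x])
X_large = np.column_stack([X_small, rng.normal(size=(n, 1))])   # adds an irrelevant regressor
print(aic_bic(y, X_small), aic_bic(y, X_large))                 # compare: lower values indicate the preferred model
```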

58 / 61
4. Summary

59 / 61
Summary today

In this lecture:
▶ We discussed the concepts of total and partial effects.
▶ We related these concepts to the restricted and unrestricted OLS estimators.
▶ We used the FWL theorem to study the properties of the restricted OLS estimator.
▶ We introduced the concept of the omitted variable bias.

60 / 61
On Friday

▶ We discuss the concept of multicollinearity.
▶ We look at the distributional properties of the OLS estimator under joint normality of ε.
▶ We show how t-statistics can be constructed to test simple linear hypotheses on β.
▶ We show how to do formal statistical testing using t-statistics.

61 / 61
