Week 3, Lecture 1

Econometrics 1 (6012B0374Y)

dr. Artūras Juodis

University of Amsterdam

Week 3. Lecture 1

February, 2024

1 / 61
Overview

1 Total vs. Partial effects
    Empirical problem
    Ceteris paribus analysis
2 Adding and subtracting variables
    Restricted and unrestricted estimators
    Biased or unbiased?
    Variance
    (Root) Mean Squared Error (self reading at home)
3 Implications for model fit
    The R^2
    Adjusted R^2
4 Summary

2 / 61
The plan for this week

▶ We use the Frisch-Waugh-Lovell theorem to study the properties of OLS after adding and subtracting regressors from the model.
▶ We analyze under which conditions smaller models can be better than larger models, and vice versa.
▶ We show that R^2 is not a good measure for determining the size of the model.
▶ We analyze the finite-sample distribution of the OLS estimator under the normality assumption (Friday).
▶ We show how to test a simple statistical hypothesis H_0: κ'β_0 = c using simple t-tests (Friday).

3 / 61
Recap: Linear model

Last week, we introduced the multiple regression model using vector notation:

y_i = x_i'β + ε_i,    i = 1, . . . , n,    (1)

for a [K × 1]-dimensional vector β of unknown regression coefficients. We illustrated how models with K > 2 arise naturally if one combines restricted models with K = 2.

In particular, we showed how two different measures of the impact of distance can be combined into a richer model that better explains the variation of hotel prices in Vienna.

4 / 61
Recap: Projection matrices

We used these projection matrices to prove the Frisch-Waugh-Lovell (FWL) theorem, which states that, if we partition X = (X_1, X_2) and β̂ = (β̂_1', β̂_2')', then:

β̂_1 = (X_1' M_{X_2} X_1)^{-1} (X_1' M_{X_2} y),    (2)
β̂_2 = (X_2' M_{X_1} X_2)^{-1} (X_2' M_{X_1} y).    (3)
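
As a quick sanity check, the FWL equalities can be verified numerically: partialling X_2 out of both y and X_1 and then running OLS reproduces the β̂_1 block of the full regression. The sketch below uses simulated data; all names and numbers are illustrative assumptions, not the lecture's dataset.

```python
# A minimal numerical check of the FWL theorem on simulated data.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + a distance-like regressor
X2 = 0.5 * X1[:, [1]] + rng.normal(size=(n, 1))          # a correlated second block (stars-like)
X = np.hstack([X1, X2])
y = X @ np.array([1.0, -2.0, 3.0]) + rng.normal(size=n)

# OLS on the full model
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# FWL: annihilate X2, then regress the residualized y on the residualized X1
M2 = np.eye(n) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)
beta1_fwl = np.linalg.solve(X1.T @ M2 @ X1, X1.T @ M2 @ y)

print(beta_hat[:2])   # coefficients on X1 from the full regression
print(beta1_fwl)      # identical up to rounding, as equation (2) states
```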

5 / 61
1. Total vs. Partial effects

6 / 61
1.1. Empirical problem

7 / 61
Back to Hotels in Vienna

Your colleague is worried that, while measuring the effect of distance on price, you miss one important aspect: hotels that are closer to the city center also tend to be the ones with more stars.

This means that premium hotels, which might offer SPA, conference, and better restaurant facilities, are also located closer to the city center.

Your colleague suggests that you also include the stars_i regressor in the model and, using the FWL intuition, control for this potential relationship between distance_i and stars_i:

price_i = α + β_1 distance_i + β_2 stars_i + ε_i.    (4)
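
In practice this regression is a one-liner. A minimal sketch in Python, assuming a hypothetical dataframe with columns price, distance, and stars (the file name and column names are illustrative, not the course's dataset):

```python
# Sketch: estimating the "small" and "large" models for the hotel-price example.
import pandas as pd
import statsmodels.formula.api as smf

hotels = pd.read_csv("hotels_vienna.csv")   # hypothetical file with price, distance, stars columns

small = smf.ols("price ~ distance", data=hotels).fit()          # total effect of distance
large = smf.ols("price ~ distance + stars", data=hotels).fit()  # partial effect of distance

# With stars included, the distance coefficient typically shrinks in absolute value.
print(small.params["distance"], large.params["distance"])
```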

8 / 61
New model. Regression Output.

Figure: Regression of price_i on distance_i and stars_i.

9 / 61
Original model. Regression Output.
Compare with the original model:

Figure: Regression of price_i on distance_i only.

10 / 61
Interpretation?
We observe that with the inclusion of stars_i the effect of distance becomes smaller (in absolute value), but it is still negative.

Using the FWL theorem, we know that the two estimators of the effect of distance_i on price_i (with and without stars_i included) measure, algebraically, two different types of association:
1. The one with stars_i included in the model measures the partial correlation between distance_i and price_i. Intuitively, when comparing two hotels with different prices, we first remove any price differences that come from the different number of stars they are awarded.
2. The one without stars_i included in the model measures the total correlation between distance_i and price_i. In particular, when measuring the effect of distance_i we neglect the fact that expensive hotels can also be the ones closer to the city center.
These are two different estimates with two different interpretations.

11 / 61
Graphical illustration

Figure: X = distancei ; D = starsi ; Y = pricei .

12 / 61
1.2. Ceteris paribus analysis

13 / 61
Simple model

Econometricians like to think about relationships between variables in terms of models. For example, consider the following two-equation model for price, distance, and stars:

price_i = c^{(1)} + β_1 distance_i + β_2 stars_i + u_i^{(1)},    (5)
stars_i = c^{(2)} + π distance_i + u_i^{(2)}.    (6)

Assume that:

E[u_i^{(1)} | distance_i, stars_i] = 0,    E[u_i^{(2)} | distance_i] = 0.    (7)

14 / 61
Coefficient interpretation. Large model

Note that in the large model β_1 and β_2 can be interpreted as:

∂E[price_i | distance_i, stars_i] / ∂distance_i = β_1,
∂E[price_i | distance_i, stars_i] / ∂stars_i = β_2.

Hence, both coefficients measure what is usually called the ceteris paribus effect of changing either distance_i or stars_i marginally, while keeping the other characteristics as they are.

(Here, for simplicity, I pretend that stars_i is a continuous variable.)

15 / 61
Coefficient interpretation. Small model

What about the interpretation in the model where only distance_i is included? For this we combine the two equations:

price_i = c^{(1)} + β_1 distance_i + β_2 (c^{(2)} + π distance_i + u_i^{(2)}) + u_i^{(1)}
        = (c^{(1)} + β_2 c^{(2)}) + (β_1 + β_2 π) distance_i + (u_i^{(1)} + β_2 u_i^{(2)}).

In this case:

∂E[price_i | distance_i] / ∂distance_i = β_1 + β_2 π.    (8)

Note: this is also the coefficient of the population linear projection of price_i on distance_i!

16 / 61
Hence, in this linear model we measure the total (both direct and indirect, through stars_i) effect of distance_i:

β_{distance,total} = β_1 + β_2 π.    (9)

Note that we generally expect β_1 < 0, β_2 > 0 and π < 0. Hence the total effect is more negative than the direct effect.

This is also what we found in our application.
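
A short simulation illustrates decomposition (9): regressing price on distance alone recovers (approximately) β_1 + β_2 π. The true values below are illustrative assumptions, not estimates from the Vienna data.

```python
# Sketch: total effect = direct + indirect, on data simulated from equations (5)-(6).
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
beta1, beta2, pi = -20.0, 30.0, -0.5      # assumed values with beta1 < 0, beta2 > 0, pi < 0

distance = rng.uniform(0, 10, size=n)
stars = 4.0 + pi * distance + rng.normal(size=n)                                # equation (6)
price = 100 + beta1 * distance + beta2 * stars + rng.normal(scale=10, size=n)   # equation (5)

# OLS slope of the "small" model: price on distance only
slope_small = np.polyfit(distance, price, 1)[0]
print(slope_small, beta1 + beta2 * pi)    # both approximately -35: the total effect
```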

17 / 61
2. Adding and subtracting variables

18 / 61
2.1. Restricted and unrestricted estimators

19 / 61
Two models

Consider the situation where you are not sure if you want to model the data using the “small” model:

y = X_1 β_1 + ε,    (10)

or using the “large” model:

y = X_1 β_1 + X_2 β_2 + ε.    (11)

In both cases you are primarily interested in β_1; in our case, the coefficient on distance_i.

20 / 61
Two estimators

The “small” model can be seen as a restricted version of the “large” model with β_2 = 0. Using this notion, we add the subscript R and write β̂_{1,R} for the restricted estimator. The two OLS estimators of β_1 are:

β̂_{1,R} = (X_1' X_1)^{-1} (X_1' y).    (restricted)    (12)

Using the FWL theorem, the unrestricted estimator of β_1 is given by:

β̂_1 = (X_1' M_{X_2} X_1)^{-1} (X_1' M_{X_2} y).    (unrestricted)    (13)

21 / 61
Two estimators. Important relationship.

In the proof of the FWL theorem, in one of the intermediate steps we showed that:

β̂_1 = (X_1' X_1)^{-1} X_1' (y − X_2 β̂_2).    (14)

Equivalently:

β̂_1 = β̂_{1,R} − (X_1' X_1)^{-1} X_1' X_2 β̂_2.    (15)

Hence, the unrestricted estimator β̂_1 can be seen as a linear combination of the restricted estimator β̂_{1,R} and β̂_2.

22 / 61
OLS and model parameters

Hence, from the above we see that the total effect β̂_{1,R} decomposes as:

β̂_{1,R} = β̂_1 + (X_1' X_1)^{-1} X_1' X_2 β̂_2    (16)
(total effect = direct effect + indirect effect)
         = β̂_1 + Π̂ β̂_2.    (17)

Here Π̂ ≡ (X_1' X_1)^{-1} X_1' X_2 is the OLS estimator from the regression of X_2 on X_1 (i.e., of the number of stars on distance).
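
The decomposition (16)-(17) is purely algebraic, so it holds exactly in any sample. A minimal sketch on simulated data (names and numbers are illustrative):

```python
# Sketch: beta1_hat_R = beta1_hat + Pi_hat @ beta2_hat holds exactly, by OLS algebra.
import numpy as np

rng = np.random.default_rng(2)
n = 500
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])   # e.g. intercept + distance
X2 = 0.7 * X1[:, [1]] + rng.normal(size=(n, 1))          # e.g. stars, correlated with distance
y = 1.0 - 2.0 * X1[:, 1] + 3.0 * X2[:, 0] + rng.normal(size=n)

X = np.hstack([X1, X2])
beta_full = np.linalg.solve(X.T @ X, X.T @ y)            # unrestricted OLS
beta1_hat, beta2_hat = beta_full[:2], beta_full[2:]

beta1_R = np.linalg.solve(X1.T @ X1, X1.T @ y)           # restricted OLS (X2 omitted)
Pi_hat = np.linalg.solve(X1.T @ X1, X1.T @ X2)           # regression of X2 on X1

print(beta1_R)
print(beta1_hat + Pi_hat @ beta2_hat)                    # identical up to rounding
```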

23 / 61
This decomposition is exactly the same as the one in our two-equation model for price, distance, and stars:

β_{1,R} = β_1 + π β_2,    (18)

where (as before):

price_i = c^{(1)} + β_1 distance_i + β_2 stars_i + u_i^{(1)},    (19)
stars_i = c^{(2)} + π distance_i + u_i^{(2)}.    (20)

24 / 61
2.2. Biased or unbiased?

25 / 61
Decomposition again

In what follows we will use the decomposition:

β̂_{1,R} = β̂_1 + (X_1' X_1)^{-1} X_1' X_2 β̂_2.    (21)

We will also use the fact (from previous lectures) that the OLS estimator in the large model is unbiased, i.e.:

E[β̂ | X] = β_0.    (22)

In particular:

E[β̂_1 | X] = β_{1,0},    (23)
E[β̂_2 | X] = β_{2,0}.    (24)

26 / 61
Unbiased estimation with irrelevant regressors

First, consider the situation where β_{2,0} = 0, i.e. the regressors X_2 do not contribute directly to the model for y. In this case:

E[β̂_{1,R} | X] = E[β̂_1 | X] + E[(X_1' X_1)^{-1} X_1' X_2 β̂_2 | X]
               = β_{1,0} + (X_1' X_1)^{-1} X_1' X_2 E[β̂_2 | X]
               = β_{1,0}.

Hence both the restricted estimator β̂_{1,R} and the unrestricted estimator β̂_1 are unbiased.

In this respect, they are equivalent.

27 / 61
Fully conditional bias. Non-irrelevant regressors.

Consider now the situation where β_{2,0} ≠ 0, e.g. there is a direct effect of stars_i on price_i. In this situation:

E[β̂_{1,R} | X] = E[β̂_1 | X] + E[(X_1' X_1)^{-1} X_1' X_2 β̂_2 | X]
               = β_{1,0} + (X_1' X_1)^{-1} X_1' X_2 E[β̂_2 | X]
               = β_{1,0} + (X_1' X_1)^{-1} X_1' X_2 β_{2,0}.

Hence:

E[β̂_{1,R} | X] ≠ β_{1,0},    (25)

and the restricted estimator is conditionally biased.

28 / 61
The Omitted variable bias

The difference between E[β̂_{1,R} | X] and β_{1,0}, i.e.:

E[β̂_{1,R} | X] − β_{1,0} = (X_1' X_1)^{-1} X_1' X_2 β_{2,0},    (26)

is called the omitted variable bias.

For obvious reasons, this bias is caused by omitting a relevant variable (X_2 in this case) from the model. Unless X_1' X_2 = O (almost impossible to satisfy), this bias is non-zero.
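
A small Monte Carlo sketch makes the contrast concrete: with β_{2,0} = 0 the restricted estimator is centered at β_{1,0}, while with β_{2,0} ≠ 0 it is shifted by the omitted variable bias. Everything below is simulated for illustration only.

```python
# Sketch: omitted variable bias in repeated samples (regressors held fixed, new errors each draw).
import numpy as np

rng = np.random.default_rng(3)
n, reps = 200, 5_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)      # x2 correlated with x1
beta1_true = 1.0

def average_restricted_estimate(beta2_true):
    estimates = []
    for _ in range(reps):
        y = beta1_true * x1 + beta2_true * x2 + rng.normal(size=n)
        estimates.append(x1 @ y / (x1 @ x1))     # restricted OLS (x2 omitted, no intercept)
    return np.mean(estimates)

print(average_restricted_estimate(0.0))   # close to 1.0: no bias when beta_{2,0} = 0
print(average_restricted_estimate(2.0))   # shifted by (x1'x2 / x1'x1) * 2: omitted variable bias
```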

29 / 61
Partially conditional bias (Not in your textbook)

Note that while X_1' X_2 = O might be difficult to satisfy in any given sample, it is possible that it holds in repeated samples. For example, it is possible that E[X_2 | X_1] = O. In that case:

E[β̂_{1,R} | X_1] = E[ E[β̂_{1,R} | X_1, X_2] | X_1 ]
                 = β_{1,0} + E[(X_1' X_1)^{-1} X_1' X_2 β_{2,0} | X_1]
                 = β_{1,0} + (X_1' X_1)^{-1} X_1' E[X_2 | X_1] β_{2,0}
                 = β_{1,0}.

Hence, there is no omitted variable bias left, because the two sets of regressors are uncorrelated across repeated samples.
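
The sketch below illustrates this partially conditional statement: X_2 is redrawn in every replication independently of X_1, β_{2,0} ≠ 0, and yet the restricted estimator is centered at β_{1,0} on average. Simulated, illustrative numbers only.

```python
# Sketch: no omitted variable bias when E[X2 | X1] = 0, even though beta_{2,0} != 0.
import numpy as np

rng = np.random.default_rng(4)
n, reps = 200, 5_000
x1 = rng.normal(size=n)                      # X1 held fixed across replications
beta1_true, beta2_true = 1.0, 2.0

estimates = []
for _ in range(reps):
    x2 = rng.normal(size=n)                  # redrawn each time, independent of x1
    y = beta1_true * x1 + beta2_true * x2 + rng.normal(size=n)
    estimates.append(x1 @ y / (x1 @ x1))     # restricted OLS, x2 omitted

print(np.mean(estimates))                    # close to 1.0: unbiased conditionally on X1 alone
```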

30 / 61
Which bias is the relevant one?
Hence, should we worry more about the bias conditional on X, or conditional on X_1 only? The answer depends on your specific situation, and on whether treating the regressors as fully fixed is justified.

Note that unbiasedness is a fictional concept, where we imagine what would have happened if we observed different realizations of the same probabilistic model many times.

Take the hotels example, and assume that hotels cannot change their locations, so distance is fixed.
1. If you are interested in measuring the effect of distance on prices for a given set of hotels with a given set of stars, then looking at the fully conditional (on X) statements is more appropriate.
2. If you are interested in measuring the effect of distance on prices, but you allow for the possibility that hotels might change their star ratings (in this imagined world), then looking at statements conditional on X_1 alone is sufficient.

31 / 61
2.3. Variance

32 / 61
Useful fact

Let A and B be two positive definite matrices of dimension [q × q]. Then:

A − B ≥ 0,    (27)

if and only if

B^{-1} − A^{-1} ≥ 0.    (28)

Here, as in the previous lecture, by ≥ we mean "no smaller" in the positive semi-definite sense.

We will not attempt to prove this fact, and leave it for curious students to check.

33 / 61
Adding Irrelevant variables. Variance Effect.

The two OLS estimators of β_1:

β̂_1 = (X_1' M_{X_2} X_1)^{-1} (X_1' M_{X_2} y),    (29)
β̂_{1,R} = (X_1' X_1)^{-1} (X_1' y).    (30)

Let us assume that the true model has β_{2,0} = 0_{K_2} (true coefficients). Then X_2 is an irrelevant regressor.

Variances:

var(β̂_1 | X) = σ_0^2 (X_1' M_{X_2} X_1)^{-1},    (31)
var(β̂_{1,R} | X) = σ_0^2 (X_1' X_1)^{-1}.    (32)

34 / 61
We now show that:

(X_1' M_{X_2} X_1)^{-1} ≥ (X_1' X_1)^{-1},    (33)

so that the variance of the unrestricted estimator (with the irrelevant X_2 included) is no smaller than that of the restricted estimator.

Using the fact above, the previous statement can be equivalently established by proving that:

X_1' X_1 − X_1' M_{X_2} X_1 ≥ 0.    (34)

But this is trivial, because:

X_1' X_1 − X_1' M_{X_2} X_1 = X_1' P_{X_2} X_1.    (35)

By the definition of the projection matrix P_{X_2}, it follows immediately that X_1' P_{X_2} X_1 ≥ 0.
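
A quick numerical illustration of (33): for any simulated design, the difference of the two (conditional variance) matrices is positive semi-definite, i.e. all its eigenvalues are non-negative. Names and numbers below are illustrative.

```python
# Sketch: (X1' M_{X2} X1)^{-1} - (X1' X1)^{-1} is positive semi-definite.
import numpy as np

rng = np.random.default_rng(5)
n = 300
X1 = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
X2 = 0.5 * X1[:, [1]] + rng.normal(size=(n, 2))

M2 = np.eye(n) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)
V_unrestricted = np.linalg.inv(X1.T @ M2 @ X1)     # variance of beta1_hat, up to sigma_0^2
V_restricted = np.linalg.inv(X1.T @ X1)            # variance of beta1_hat_R, up to sigma_0^2

diff = V_unrestricted - V_restricted
print(np.linalg.eigvalsh(diff))                    # all eigenvalues >= 0 (up to rounding)
```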

35 / 61
Variance. In the presence of bias.

Consider the (fully conditional) variance of the restricted estimator when β_{2,0} ≠ 0. By definition:

var(β̂_{1,R} | X) = E[(β̂_{1,R} − E[β̂_{1,R} | X])(β̂_{1,R} − E[β̂_{1,R} | X])' | X].    (36)

In this case:

β̂_{1,R} − E[β̂_{1,R} | X] = (X_1' X_1)^{-1} X_1' ε,    (37)

so that (following the same steps as before):

var(β̂_{1,R} | X) = σ_0^2 (X_1' X_1)^{-1}.    (38)

Conclusion: irrespective of the value of β_{2,0}, the fully conditional variance of β̂_{1,R} remains the same.

36 / 61
Summary. Variance effect.

Table: The effects of omitting potentially relevant variables

Estimator | β_{2,0} = 0                                   | β_{2,0} ≠ 0
β̂_{1,R}   | unbiased, efficient                           | biased, efficient
          | E[β̂_{1,R} | X] = β_{1,0}                      | E[β̂_{1,R} | X] ≠ β_{1,0}
          | var(β̂_{1,R} | X) = σ_0^2 (X_1' X_1)^{-1}      | var(β̂_{1,R} | X) = σ_0^2 (X_1' X_1)^{-1}
β̂_1       | unbiased, not efficient                       | unbiased, not efficient
          | E[β̂_1 | X] = β_{1,0}                          | E[β̂_1 | X] = β_{1,0}
          | var(β̂_1 | X) = σ_0^2 (X_1' M_{X_2} X_1)^{-1}  | var(β̂_1 | X) = σ_0^2 (X_1' M_{X_2} X_1)^{-1}

Note that here I define efficiency in terms of the smallest variance among the two competing linear estimators.

37 / 61
Important! (and always forgotten)

Note that while it is always true that:

var(β̂_{1,R} | X) ≤ var(β̂_1 | X),    (39)

it is not true that the corresponding estimates of the variances (and hence also the standard errors),

v̂ar(β̂_{1,R} | X) = s_R^2 (X_1' X_1)^{-1},    (40)
v̂ar(β̂_1 | X) = s^2 (X_1' M_{X_2} X_1)^{-1},    (41)

always satisfy:

v̂ar(β̂_{1,R} | X) ≤ v̂ar(β̂_1 | X).    (42)

This is because it is possible that s_R^2 ≥ s^2, while it is always the case that (X_1' X_1)^{-1} ≤ (X_1' M_{X_2} X_1)^{-1}.
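
The sketch below computes both estimated variances on a single simulated sample in the scalar case. Depending on the draw and on how strongly X_2 matters, the ordering in (42) can hold or be reversed, which is exactly the point. All numbers are illustrative.

```python
# Sketch: estimated variances of the restricted vs. unrestricted slope estimator (scalar case).
import numpy as np

rng = np.random.default_rng(6)
n = 200
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)     # here x2 is a relevant regressor

# Unrestricted: regress y on (x1, x2); s^2 uses n - 2 degrees of freedom
X = np.column_stack([x1, x2])
beta = np.linalg.solve(X.T @ X, X.T @ y)
s2 = np.sum((y - X @ beta) ** 2) / (n - 2)
M2 = np.eye(n) - np.outer(x2, x2) / (x2 @ x2)
var_unrestricted = s2 / (x1 @ M2 @ x1)

# Restricted: regress y on x1 only; s_R^2 uses n - 1 degrees of freedom
b_R = x1 @ y / (x1 @ x1)
s2_R = np.sum((y - b_R * x1) ** 2) / (n - 1)
var_restricted = s2_R / (x1 @ x1)

print(var_restricted, var_unrestricted)   # the ordering can go either way across samples
```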

38 / 61
2.4. (Root) Mean Squared Error (self reading at home)

39 / 61
Summary from the previous discussion

From the previous discussion it is clear that neither β̂_1 nor β̂_{1,R} dominates the other in both having lower bias and lower variance when it is uncertain whether β_{2,0} = 0 or not.

Hence, which approach should be used? In our setup, the answer depends on whether you value unbiased estimation more, or more certain estimation (i.e. estimators with lower variance).

Is there a middle ground?

40 / 61
Combining two. Definition.

Note that instead of considering bias and variance separately, we can look at a measure that combines the two. In particular, the Mean Squared Error (MSE) is defined as:

MSE(β̂_1 | X) = E[(β̂_1 − β_{1,0})(β̂_1 − β_{1,0})' | X],    (43)
MSE(β̂_{1,R} | X) = E[(β̂_{1,R} − β_{1,0})(β̂_{1,R} − β_{1,0})' | X].    (44)

Hence, unlike the variance, which is centered at the mean, the MSE is centered at the true value β_{1,0}.

41 / 61
It is easy to see that:

MSE = Bias Bias' + Variance.    (45)

Hence, in our case:

MSE(β̂_1 | X) = σ_0^2 (X_1' M_{X_2} X_1)^{-1},    (46)
MSE(β̂_{1,R} | X) = σ_0^2 (X_1' X_1)^{-1} + (X_1' X_1)^{-1} X_1' X_2 β_{2,0} β_{2,0}' X_2' X_1 (X_1' X_1)^{-1}.    (47)

Hence, whether MSE(β̂_1 | X) > MSE(β̂_{1,R} | X) (or the other way around) now depends on many ingredients: σ_0^2, β_{2,0}, as well as the relationship between X_1 and X_2.
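
The trade-off in (46)-(47) can be computed directly for a given design. The sketch below compares the two (scalar) MSEs over a range of values of β_{2,0}; all numbers are illustrative assumptions.

```python
# Sketch: MSE of the restricted vs. unrestricted slope estimator as beta_{2,0} varies.
import numpy as np

rng = np.random.default_rng(7)
n, sigma2 = 100, 1.0
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)

M2 = np.eye(n) - np.outer(x2, x2) / (x2 @ x2)
mse_unrestricted = sigma2 / (x1 @ M2 @ x1)            # equation (46), scalar case

for beta2 in [0.0, 0.05, 0.2, 1.0]:
    bias = (x1 @ x2) / (x1 @ x1) * beta2              # omitted variable bias
    mse_restricted = sigma2 / (x1 @ x1) + bias ** 2   # equation (47), scalar case
    print(beta2, mse_restricted, mse_unrestricted)

# For beta_{2,0} close to zero the restricted estimator tends to have the lower MSE;
# for larger beta_{2,0} the unrestricted estimator wins.
```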

42 / 61
3. Implications for model fit

43 / 61
3.1. The R^2

44 / 61
Look! My R^2 increased

In the empirical examples we considered so far, we always noticed that by adding additional regressors, e.g. by adding X_2, the R^2 of the model went up. Always. This might suggest that X_2 helps to better explain the variation in y, and hence that X_2 is a useful explanatory variable.

45 / 61
R^2 always increases...

Below we show that this reasoning is wrong: R^2 always increases when additional regressors are added to the model. Hence, R^2 does not say much about the quality of the model; it is just algebra.

Consider the two residual sums of squares (in both cases we include an intercept):

SSR_R = y' M_{X_1} y = (y − X_1 β̂_{1,R})'(y − X_1 β̂_{1,R}),    (48)
SSR = y' M_X y = (y − X β̂)'(y − X β̂).    (49)

We want to show that always:

SSR_R ≥ SSR.    (50)

There are several ways to show this. I will present two: i) a projection-based argument; ii) an objective-function-based argument.
46 / 61
R^2 always increases. Projection-based argument.
Consider the definition of SSR_R and use the fact that y = P_X y + M_X y:

SSR_R = y' M_{X_1} y
      = y' M_{X_1} P_X y + y' M_{X_1} M_X y
      = y' M_{X_1} P_X y + y' M_X y
      = y' P_X' M_{X_1} P_X y + y' M_X M_{X_1} P_X y + y' M_X y
      = y' P_X' M_{X_1} P_X y + y' M_X P_X y + y' M_X y
      = y' P_X' M_{X_1} P_X y + y' M_X y
      = y' P_X' M_{X_1} P_X y + SSR.

Here we used the fact that M_X M_{X_1} = M_X (as we showed last week), and that M_X P_X = O. Furthermore, y' P_X' M_{X_1} P_X y ≥ 0, as it is a quadratic form in the positive semi-definite matrix M_{X_1}. So:

SSR_R ≥ SSR.    (51)

47 / 61
R^2 always increases. Objective-function-based argument.

Consider the vector β̃ = (β̂_{1,R}', 0_{K_2}')'.

Recall that last week we proved that β̂ is a minimizer of the least-squares objective function, i.e.:

β̂ = arg min_β LS_n(β) = arg min_β (y − Xβ)'(y − Xβ).    (52)

In other words:

LS_n(β̂) ≤ LS_n(β),    (53)

for any choice of β. In particular, since it holds for any β, it also holds for the β̃ defined above; thus:

LS_n(β̂) ≤ LS_n(β̃).    (54)

Hence SSR ≤ SSR_R, by the definitions of SSR and SSR_R (note that LS_n(β̂) = SSR and LS_n(β̃) = SSR_R).
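
A numerical sketch of (50): adding a regressor, even pure noise, never increases the SSR and never decreases R^2. Simulated, illustrative data.

```python
# Sketch: SSR never increases (and R^2 never decreases) when a regressor is added.
import numpy as np

rng = np.random.default_rng(8)
n = 150
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X1 @ np.array([1.0, 2.0]) + rng.normal(size=n)

def ssr_and_r2(X, y):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    ssr = resid @ resid
    tss = np.sum((y - y.mean()) ** 2)
    return ssr, 1.0 - ssr / tss

noise = rng.normal(size=(n, 1))                    # a completely irrelevant extra regressor
print(ssr_and_r2(X1, y))                           # (SSR_R, R^2) of the small model
print(ssr_and_r2(np.hstack([X1, noise]), y))       # SSR weakly lower, R^2 weakly higher
```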

48 / 61
3.2. Adjusted R^2

49 / 61
The Adjusted R^2

When confronted with the pattern that additional regressors always improve the (in-sample) fit, one common practice is to penalize models with large K. The oldest approach uses a degrees-of-freedom adjustment in the so-called Adjusted R^2:

R̄^2 = 1 − [ (1/(n−K)) Σ_{i=1}^{n} ê_i^2 ] / [ (1/(n−1)) Σ_{i=1}^{n} (y_i − ȳ)^2 ].    (55)

The two coincide when K = 1, e.g. when only the constant term is included in the model, but in other cases:

R̄^2 ≤ R^2.    (56)

It is not difficult to see that:

R̄^2 = 1 − (n−1)/(n−K) (1 − R^2).    (57)
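
A short sketch of (55), cross-checked against (57), computed directly from OLS residuals on illustrative data:

```python
# Sketch: R^2 vs. adjusted R^2, computed as in (55) and cross-checked via (57).
import numpy as np

rng = np.random.default_rng(9)
n, K = 120, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e_hat = y - X @ beta

ssr = e_hat @ e_hat
tss = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ssr / tss
adj_r2 = 1.0 - (ssr / (n - K)) / (tss / (n - 1))   # definition (55)

print(r2, adj_r2)
print(1.0 - (n - 1) / (n - K) * (1.0 - r2))        # identical to adj_r2, as in (57)
```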

50 / 61
Adjusted R^2 for model selection

In principle, R̄^2 can be used for model selection: for example, to choose between a small model with K_1 regressors and a large model with K_1 + K_2 regressors.

You simply select the model that produces the largest R̄^2.

51 / 61
Illustration. Model 1.

Figure: Regression of price_i on distance_i and the dummy variable D_i = 1(distance_i < 2km).

52 / 61
Illustration. Model 2.

Figure: Regression of price_i on distance_i, the dummy variable D_i = 1(distance_i < 2km), and their interaction.

53 / 61
Conclusions?

The larger model, with the interaction term between D_i and distance_i, produces not only a larger R^2, but also a larger R̄^2.

This is not surprising, as the increase in R^2 was substantial, at the cost of only one added regressor.

54 / 61
Illustration. Alternative model with two binary variables.

The example below shows that R̄^2 can decrease if an additional variable only marginally increases R^2.

Consider an alternative multiple regression model for hotel prices in Vienna:

price_i = α + β_1 D_i + β_2 B_i + ε_i.    (58)

Here D_i is defined as before, while B_i = 1(2km < distance_i < 4km).

55 / 61
Results

Figure: Regression of price_i on the dummy variable D_i = 1(distance_i < 2km) and the dummy variable B_i = 1(2km < distance_i < 4km).

56 / 61
Results. Original setup.

Figure: Regression of price_i on the dummy variable D_i = 1(distance_i < 2km).

Higher R^2, but lower R̄^2!

57 / 61
Other methods for model selection

Other approaches for model selection:

▶ Information criteria: AIC, BIC.
▶ Sample splitting into training and validation sets.
▶ Leave-one-out cross-validation.
▶ etc.

Only AIC/BIC will be discussed, at the end of this course.
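
For concreteness, a minimal sketch of how AIC and BIC could be computed from an OLS fit under Gaussian errors; the exact constants differ across textbooks, so treat this as one illustrative convention rather than the course's definition.

```python
# Sketch: AIC/BIC for an OLS model under Gaussian errors (one common convention; lower is better).
import numpy as np

def aic_bic(y, X):
    n, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2_hat = resid @ resid / n                           # ML estimate of the error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)
    return -2 * loglik + 2 * k, -2 * loglik + np.log(n) * k  # (AIC, BIC)

# Illustrative comparison of a small and a large model on simulated data.
rng = np.random.default_rng(10)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
X_small = np.column_stack([np.ones(n), x])
X_large = np.column_stack([X_small, rng.normal(size=(n, 1))])   # adds an irrelevant regressor
print(aic_bic(y, X_small), aic_bic(y, X_large))                 # compare: lower values indicate the preferred model
```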

58 / 61
4. Summary

59 / 61
Summary today

In this lecture:
▶ We discussed the concepts of total and partial effects.
▶ We related these concepts to the restricted and unrestricted OLS estimators.
▶ We used the FWL theorem to study the properties of the restricted OLS estimator.
▶ We introduced the concept of the omitted variable bias.

60 / 61
On Friday

▶ We discuss the concept of multicollinearity.
▶ We look at the distributional properties of the OLS estimator under joint normality of ε.
▶ We show how t-statistics can be constructed to test simple linear hypotheses on β.
▶ We show how to do formal statistical testing using t-statistics.

61 / 61
