
Econometrics 1 (6012B0374Y)

dr. Artūras Juodis

University of Amsterdam

Week 2. Lecture 1

February, 2024

1 / 54
Overview

1 Multiple regression setting
    Motivation
    The model
    Empirical results
2 Linear model and OLS using matrix notation
    Multiple regression model
    OLS estimator
3 Geometry of OLS
    Fitted values and residuals
    Projection matrices
    The R²
4 Summary

2 / 54
The plan for this week

- We motivate the use of the multiple regression model.
- We introduce matrix and vector notation in this context.
- We use matrix notation to derive and study the OLS estimator.
- We provide a geometrical interpretation of the OLS estimator.
- We prove the Gauss-Markov theorem (Friday).
- We prove the Frisch-Waugh-Lovell theorem (Friday).

3 / 54
Recap: Linear model

Last week, we considered the simple linear model with a single regressor x_i:

    y_i = α + β x_i + ε_i,   i = 1, . . . , n.   (1)

We used this model to understand the determinants of hotel prices in


Vienna. Unfortunately, empirical results for two different choices of x_i were
not sufficiently convincing. This motivates the need to study models with
multiple regressors.

4 / 54
Recap: OLS estimator

We used sample data {(y_i, x_i)}_{i=1}^n to construct statistics that can be used
as estimates of (α, β). For this purpose, we considered the Ordinary Least
Squares (OLS) objective function.

The OLS estimators:

    β̂ = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)²,   (2)

and

    α̂ = ȳ − x̄ β̂.   (3)
Today, we show how to derive expressions for (α̂, β̂) using unified
matrix/vector notation. We will do it for the general setting with multiple
regressors.

5 / 54
1. Multiple regression setting

6 / 54
1.1. Motivation

7 / 54
Scatter plot
Figure: Scatter plot of price against distance_km

8 / 54
Two models we considered last week

In order to explain/fit these patterns we considered two possible linear


models with a single regressor:

    price_i = α + β distance_i + ε_i,   or   (4)

    price_i = α + β D_i + ε_i.   (5)

Problem: Both of these models were able to explain some of the features in
the above scatter plot, e.g. that the most expensive hotels are the ones
closer to the city center. However, neither of the models was able to
explain/predict/fit what happens for hotels that are far away from the
center.

Can we do better? YES!

9 / 54
Combination of the two

If we believe that distance is more important for hotels closer to the city
center than for those that are outside of the city center, then it is natural to
consider models that combine features from the separate individual models.

Two natural extensions:

    price_i = α + β_1 distance_i + β_2 D_i + ε_i,   (6)

    price_i = α + β_1 distance_i + β_2 D_i + β_3 (distance_i × D_i) + ε_i,   (7)

where distance_i × D_i is the interaction term, interaction_i.

While the first model simply adds two regressors linearly to the model, the
second one allows for the interaction effect between the distance variable
and whether the hotel is within 2km or not.

Models of the second type are very common in empirical work. More about
that in Week 5.
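To make the construction of model (7) concrete, here is a minimal Python sketch (the distances used are the ones that appear in the data excerpt later in this lecture; the array names are illustrative):

    import numpy as np

    # Illustrative distances to the city center (km) for a few hotels.
    distance = np.array([2.737, 2.254, 2.737, 1.932, 1.449])
    D = (distance < 2.0).astype(float)   # dummy: 1 if within 2 km of the center
    interaction = distance * D           # interaction term distance_i * D_i

    # Design matrix for model (7): intercept, distance, D, interaction.
    X = np.column_stack([np.ones_like(distance), distance, D, interaction])
    print(X)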

10 / 54
1.2. The model

11 / 54
Multiple regression model

In what follows we consider the linear regression model with K regressors:


    y_i = Σ_{k=1}^K β_k x_{k,i} + ε_i,   i = 1, . . . , n.   (8)

Here we slightly deviate from the convention used before (for a reason that
will become obvious soon) and include the intercept as the first regressor,
i.e. α = β_1 and x_{1,i} = 1, i = 1, . . . , n.

For the follow-up results, the distinction between regressors that vary between
units (e.g. distance) and the ones that do not (the intercept) is immaterial.

12 / 54
Interpretation

The coefficient β_k under (a modified version of) the classical assumptions can be
interpreted as the partial/marginal effect:

    ∂E[y_i | (x_{1,i}, . . . , x_{K,i})] / ∂x_{k,i} = β_k,   k = 2, . . . , K,   (9)

provided all regressors are continuous. Hence, β_k measures the effect of a marginal
change in x_{k,i} on the conditional expectation of y_i (given all regressors).

13 / 54
Interpretation. Hotels example.

In this model (using the new notation)

    price_i = β_1 + β_2 distance_i + β_3 D_i + β_4 (distance_i × D_i) + ε_i,   (10)

where distance_i × D_i is the interaction term interaction_i, the coefficients do not
have a direct marginal-effect interpretation because of the interaction term. Instead:

    ∂E[price_i | distance_i] / ∂distance_i = β_2 + β_4   if distance < 2km,
                                             β_2          if distance ≥ 2km.   (11)

14 / 54
Not all regressors are equal!
Note that while I use the same notation (x_{1,i}, . . . , x_{K,i}) for all regressors (so
they are all mathematically equal), in reality some regressors are of greater
interest to economists than others.

We would usually split all regressors into:


- Policy/treatment/primary variables, e.g. variables that contain information
  about the exposure of units to treatments/policy changes, etc.
- Control variables, e.g. demographic characteristics of units. The intercept.
This distinction is fairly new in econometrics, and is mostly driven by the
credibility revolution where more and more studies try to evaluate effects of
policy interventions.

In that case, the effects of variables that can actually be
manipulated/intervened upon are more important than those of control variables
that policy makers cannot change.

15 / 54
1.3. Empirical results

16 / 54
Empirical results. Model 1.

Figure: Regression of Price on distance_i and the dummy variable
D_i = 1(distance_i < 2km).

17 / 54
Empirical results. Model 2.

Figure: Regression of Price on distance_i, the dummy variable D_i = 1(distance_i < 2km),
and their interaction_i.

18 / 54
How should we interpret Model 2? Fitted curves

Translating the regression output, we can consider the corresponding fitted lines:

    ŷ(distance) = 172.61 − 43.10 × distance   if distance < 2km,
    ŷ(distance) =  88.72 +  0.72 × distance   if distance ≥ 2km.   (12)
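As a small illustration, the fitted lines in (12) can be coded directly as a piecewise function, using the rounded coefficients reported above:

    import numpy as np

    def y_hat(distance):
        """Piecewise fitted price from (12), using the rounded coefficients above."""
        distance = np.asarray(distance, dtype=float)
        return np.where(distance < 2.0,
                        172.61 - 43.10 * distance,   # within 2 km of the center
                        88.72 + 0.72 * distance)     # 2 km or further away

    print(y_hat([0.5, 1.9, 2.0, 10.0]))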

19 / 54
Conclusions?
On Model 1.
- Just adding two regressors additively does not improve the fit of the model
  substantially, i.e. R² goes up only marginally. Later this week we explain why
  R² should always go up.
- From Model 1 it is clear that distance does not matter that much for the
  variation of prices. It matters more whether the distance is <2km or not.
  So the relationship is not linear after all.

On Model 2.
- Adding the interaction term dramatically improves R². Hence, distance matters!
- But it is mostly important (and can be explained by the model) only if you are
  within the 2km radius of the city center.
- For observations outside the 2km radius of the city center, the effect of
  distance is even positive!
- This illustrates why models with interacted explanatory variables are so
  popular with applied econometricians and economists.

20 / 54
2. Linear model and OLS using matrix notation

21 / 54
2.1. Multiple regression model

22 / 54
How is OLS calculated with multiple regressors?

OLS coefficients in the multiple regression setting are obtained identically to


the setting with one regressor, i.e. by minimizing the Least Squares
objective function (or, equivalently, the sum of squared errors):

    (β̂_1, . . . , β̂_K) = arg min_{β_1, . . . , β_K} Σ_{i=1}^n ( y_i − Σ_{k=1}^K β_k x_{k,i} )².   (13)

Or as in the first lecture:

    (β̂_1, . . . , β̂_K) = arg min_{β_1, . . . , β_K} LS_n(β_1, . . . , β_K).   (14)

23 / 54
Derivatives

We look at the first partial derivatives of that objective function:


    ∂LS_n(β_1, . . . , β_K) / ∂β_k = −2 Σ_{i=1}^n x_{k,i} ( y_i − Σ_{k=1}^K β_k x_{k,i} ),

for all k = 1, . . . , K. Hence, the minimizer of the objective function
(β̂_1, . . . , β̂_K) should be a zero of the above set of equations (K equations
with K unknowns), i.e.:

    Σ_{i=1}^n x_{k,i} ( y_i − Σ_{k=1}^K β̂_k x_{k,i} ) = 0,   (15)

for all k = 1, . . . , K. Hence, a set of K equations in K unknowns!

24 / 54
Not convenient

While the previous equations are correct, they are generally inconvenient to
work with. Given your knowledge of (Advanced) Linear Algebra, it is much more
convenient to derive all results using vector/matrix notation and the language
of systems of linear equations.

Note that for any [S × 1] vectors a = (a_1, . . . , a_S)′ and b = (b_1, . . . , b_S)′:

    a′b = b′a = Σ_{s=1}^S a_s b_s.   (16)

Here I use the convention that ′ denotes the transpose of a vector, and all
bold quantities are column vectors.
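A quick numerical illustration of convention (16), with two arbitrary [3 × 1] vectors:

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([4.0, 5.0, 6.0])

    # a'b = b'a = sum_s a_s b_s
    print(a @ b, b @ a, np.sum(a * b))   # all three equal 32.0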

25 / 54
Matrix preliminaries for OLS

Let us introduce the following notation:

    y = (y_1, . . . , y_n)′,                  [n × 1]
    ε = (ε_1, . . . , ε_n)′,                  [n × 1]
    x_i = (x_{1,i}, . . . , x_{K,i})′,        [K × 1]
    x^{(k)} = (x_{k,1}, . . . , x_{k,n})′,    [n × 1]
    X = (x_1, . . . , x_n)′,                  [n × K]
    β = (β_1, . . . , β_K)′,                  [K × 1].

With this notation the linear model just reads as:

    y_i = x_i′ β + ε_i,   (17)

for all i = 1, . . . , n.

26 / 54
Matrix preliminaries for OLS

We can also collect all these n individual models into a system of n


equations:

    y_1 = x_1′ β + ε_1,
    y_2 = x_2′ β + ε_2,
    . . .
    y_{n−1} = x_{n−1}′ β + ε_{n−1},
    y_n = x_n′ β + ε_n.

Or simply as:
y = X β + ε. (18)

27 / 54
Example. Vienna Hotels. Model with interaction.

The first five rows of y and X are given by (without any specific sorting):

        ⎡  81 ⎤        ⎡ 1  2.737  0  0     ⎤
        ⎢  85 ⎥        ⎢ 1  2.254  0  0     ⎥
    y = ⎢  83 ⎥ ,  X = ⎢ 1  2.737  0  0     ⎥   (19)
        ⎢  82 ⎥        ⎢ 1  1.932  1  1.932 ⎥
        ⎢ 103 ⎥        ⎢ 1  1.449  1  1.449 ⎥
        ⎣  ⋮  ⎦        ⎣ ⋮    ⋮    ⋮    ⋮   ⎦

Here the first column of X is the vector of ones (the intercept), the second
column is the distance in kilometres, the third column is a binary variable
that indicates whether the hotel is <2km from the city center, and the final
column is the product of the latter two.
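For illustration, these first five rows can be entered directly in Python to verify that the last column of X is indeed the product of the distance and dummy columns:

    import numpy as np

    y = np.array([81.0, 85.0, 83.0, 82.0, 103.0])
    X = np.array([
        [1.0, 2.737, 0.0, 0.0],
        [1.0, 2.254, 0.0, 0.0],
        [1.0, 2.737, 0.0, 0.0],
        [1.0, 1.932, 1.0, 1.932],
        [1.0, 1.449, 1.0, 1.449],
    ])

    # Column 4 is the interaction: distance * D.
    assert np.allclose(X[:, 3], X[:, 1] * X[:, 2])
    print(X.shape)   # (5, 4): n = 5 rows shown, K = 4 regressors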

28 / 54
2.2. OLS estimator

29 / 54
OLS using matrix notation

Using matrix notation:


    LS_n(β_1, . . . , β_K) = LS_n(β) = Σ_{i=1}^n (y_i − x_i′β)² = (y − Xβ)′(y − Xβ).   (20)

Note that for any value of β the LS_n(β) objective function is a scalar!

Given that β is a [K × 1] vector, the derivative of LS_n(β) with respect to β
is a [K × 1] vector.

30 / 54
Derivatives

We showed that the derivatives are given by:

    ∂LS_n(β_1, . . . , β_K) / ∂β_k = −2 Σ_{i=1}^n x_{k,i} ( y_i − Σ_{k=1}^K β_k x_{k,i} ).

Or alternatively, using our new notation:

    ∂LS_n(β) / ∂β_k = −2 Σ_{i=1}^n x_{k,i} (y_i − x_i′β) = −2 (x^{(k)})′(y − Xβ).

Collecting all such K equations together:

    ∂LS_n(β) / ∂β = −2 X′(y − Xβ).
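As a sanity check (on simulated data), the gradient formula ∂LS_n(β)/∂β = −2X′(y − Xβ) can be compared with a finite-difference approximation:

    import numpy as np

    rng = np.random.default_rng(0)
    n, K = 50, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
    beta_true = np.array([1.0, 2.0, -0.5])
    y = X @ beta_true + rng.normal(size=n)

    def LS(beta):
        r = y - X @ beta
        return r @ r

    beta = np.array([0.3, -1.0, 0.7])              # arbitrary evaluation point
    grad_analytic = -2.0 * X.T @ (y - X @ beta)

    # Central finite differences, one coordinate at a time.
    h = 1e-6
    grad_numeric = np.array([
        (LS(beta + h * np.eye(K)[k]) - LS(beta - h * np.eye(K)[k])) / (2 * h)
        for k in range(K)
    ])
    print(np.max(np.abs(grad_analytic - grad_numeric)))   # small, up to floating-point error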

31 / 54
The OLS estimator

From the above we conclude that the OLS estimator β̂ is the solution to the
following set of K equations in K unknowns:

    X′(y − Xβ̂) = 0_K.   (21)

If X′X is of full rank K, i.e. rank(X′X) = K, then the above system of
equations has a unique solution:

    β̂ = (X′X)^{-1}(X′y) = ( Σ_{i=1}^n x_i x_i′ )^{-1} ( Σ_{i=1}^n x_i y_i ).   (22)

Here (·)^{-1} is the usual matrix inverse.
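As an illustration (on simulated data), β̂ in (22) can be computed by solving the normal equations (21); in practice a least-squares routine is numerically preferable to forming (X′X)^{-1} explicitly:

    import numpy as np

    rng = np.random.default_rng(1)
    n, K = 200, 4
    X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
    beta_true = np.array([1.0, 0.5, -2.0, 0.0])
    y = X @ beta_true + rng.normal(size=n)

    # beta_hat = (X'X)^{-1} X'y, computed by solving the normal equations (21).
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # Equivalent, numerically more stable, least-squares routine.
    beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta_hat)
    print(np.allclose(beta_hat, beta_hat_lstsq))   # True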

32 / 54
The OLS estimator. Special case

For the special case we analyzed in the previous week, x_i = (1, x_i)′ and
β̂ = (α̂, β̂)′; then:

    ⎡ α̂ ⎤   ⎡ Σ_{i=1}^n 1     Σ_{i=1}^n x_i  ⎤⁻¹ ⎡ Σ_{i=1}^n y_i     ⎤
    ⎣ β̂ ⎦ = ⎣ Σ_{i=1}^n x_i   Σ_{i=1}^n x_i² ⎦   ⎣ Σ_{i=1}^n x_i y_i ⎦   (23)

We arrive at the expressions derived previously in the course upon using the
exact formula for the inverse of a [2 × 2] matrix.
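A quick numerical check (simulated data) that for K = 2 the matrix formula (23) reproduces the Week 1 expressions (2)-(3):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 100
    x = rng.normal(size=n)
    y = 1.5 + 0.8 * x + rng.normal(size=n)

    # Matrix formula (23).
    X = np.column_stack([np.ones(n), x])
    alpha_hat, beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # Week 1 formulas (2)-(3).
    beta_simple = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    alpha_simple = y.mean() - x.mean() * beta_simple

    print(np.allclose([alpha_hat, beta_hat], [alpha_simple, beta_simple]))   # True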

33 / 54
3. Geometry of OLS

34 / 54
3.1. Fitted values and residuals

35 / 54
LS objective function decomposition

Observe that:

    LS_n(β) = (y − Xβ̂ − X(β − β̂))′ (y − Xβ̂ − X(β − β̂))
            = (y − Xβ̂)′(y − Xβ̂) + (β − β̂)′X′X(β − β̂)
              − (y − Xβ̂)′X(β − β̂) − (β − β̂)′X′(y − Xβ̂).

36 / 54
LS objective function decomposition

Observe that:

    (y − Xβ̂)′X = (y − X(X′X)^{-1}X′y)′X = y′X − y′X = 0_K′.   (24)

Hence:

    LS_n(β) = LS_n(β̂) + (β − β̂)′X′X(β − β̂) ≥ LS_n(β̂).   (25)

Why the above inequality? Observe that (β − β̂)′X′X(β − β̂) is a quadratic
form. Hence, it is non-negative by construction.

Conclusion? The OLS estimator β̂ is indeed a minimizer of the objective
function LS_n(β).

37 / 54
Decomposition

Consider the decomposition (using vector notation) of y into the explained/fitted
part and the residual:

    y = ŷ + ê.   (26)

Note that:

    ŷ = Xβ̂ = X(X′X)^{-1}X′y,   (27)
    ê = y − ŷ = (I_n − X(X′X)^{-1}X′)y.   (28)

Hence both the fitted values and the residuals are certain (linear)
transformations of the original data y .

38 / 54
3.2. Projection matrices

39 / 54
Decomposition

Let:

    P_X = X(X′X)^{-1}X′,
    M_X = I_n − X(X′X)^{-1}X′.

Then:

    M_X + P_X = I_n,   (29)

and also:

    M_X P_X = O_{n×n}.   (30)

These two matrices (M_X and P_X) are very special and known to be
projection matrices. Also, M_X is known as the residual maker matrix, for
an obvious reason.
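A short numerical illustration (with a simulated X) of the identities (29)-(30) and of the projection properties used below:

    import numpy as np

    rng = np.random.default_rng(3)
    n, K = 30, 4
    X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])

    P = X @ np.linalg.inv(X.T @ X) @ X.T         # P_X
    M = np.eye(n) - P                            # M_X

    print(np.allclose(M + P, np.eye(n)))         # M_X + P_X = I_n
    print(np.allclose(M @ P, np.zeros((n, n))))  # M_X P_X = O
    print(np.allclose(P @ P, P), np.allclose(P, P.T))  # P_X idempotent and symmetric
    print(np.allclose(M @ M, M), np.allclose(M, M.T))  # M_X idempotent and symmetric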

40 / 54
M_X and P_X are projection matrices

General definition. A matrix V is an orthogonal projection matrix if:

    V = V² = V′.   (31)

It is easy to see that P_X is indeed a projection matrix. Next, consider M_X:

    M_X M_X = (I_n − X(X′X)^{-1}X′)M_X = M_X − P_X M_X = M_X,   (32)

and obviously M_X = M_X′. Hence M_X is also an orthogonal projection
matrix.

41 / 54
Projection matrix P_X

What exactly do these matrices project onto?

P_X is a projection matrix onto the space spanned by the columns of X (K of
those). In particular, take any vector z = Xγ (hence z is in the span of X);
then:

    P_X z = X(X′X)^{-1}X′Xγ = Xγ = z.   (33)

This means that if you project something that already lies in the span of X,
nothing changes.

42 / 54
Projection matrix M_X

Note that dim(X) = [n × K]; hence if rank(X) = K, then the dimension of
the corresponding null space is n − K.

Indeed, M_X projects off the space spanned by the columns of X. In particular,
take any vector z = Xγ (hence z is in the span of X); then:

    M_X z = Xγ − X(X′X)^{-1}X′Xγ = Xγ − Xγ = 0_n.   (34)

43 / 54
OLS geometrically

Hence, geometrically, OLS simply projects y onto two spaces that are
orthogonal to each other:
- ŷ, the fitted values, which lie in the K-dimensional space spanned by the
  columns of X;
- ê, the residuals, which lie in the corresponding orthogonal complement.

From this definition it is not surprising that:

    ŷ′ê = y′P_X M_X y = 0.   (35)

44 / 54
Implication. Projection matrices.

One of the most obvious implications of M_X X = O is that the residuals ê sum
to 0, i.e.:

    Σ_{i=1}^n ê_i = ı_n′ê = 0.   (36)

Here ı_n = (1, . . . , 1)′ is an [n × 1] vector of ones. In particular, this
result follows from the fact that:

    ı_n = X e_1,   (37)

where e_1 = (1, 0, . . . , 0)′ is a [K × 1] selection vector (i.e. the vector that
selects the first column of X). Hence:

    ı_n′ê = e_1′X′M_X y = e_1′X′M_X′y = e_1′(M_X X)′y = 0.   (38)
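A small numerical illustration (simulated data, intercept included) of (35) and (36): the fitted values are orthogonal to the residuals, and the residuals sum to zero:

    import numpy as np

    rng = np.random.default_rng(4)
    n, K = 60, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])   # first column: intercept
    y = X @ np.array([2.0, 1.0, -1.0]) + rng.normal(size=n)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    y_fit = X @ beta_hat
    resid = y - y_fit

    print(np.isclose(y_fit @ resid, 0.0))   # fitted values orthogonal to residuals
    print(np.isclose(resid.sum(), 0.0))     # residuals sum to zero (intercept included)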

45 / 54
3.3. The R²

46 / 54
Some preliminaries

Consider the decomposition

SST = SSE + SSR. (39)

Recall that we defined R² as a function of the SSE (Explained Sum of
Squares):

    R² ≡ SSE/SST = 1 − SSR/SST ∈ [0; 1].   (40)
In what follows we derive (again) the SST = SSE + SSR decomposition
using matrix algebra and some additional projection matrices.

47 / 54
Demeaning projection matrix

In the definition of SST we considered the demeaned y_i, i.e. y_i − ȳ.
Consider the stacked version of this (i.e. the [n × 1] vector) using the
vector notation:

    ỹ ≡ y − ı_n ȳ = y − ı_n ı_n′y / n = y − ı_n(ı_n′ı_n)^{-1}ı_n′y.   (41)

Note that the above can be expressed as:

    ỹ = M_1 y,   (42)

where M_1 = I_n − ı_n(ı_n′ı_n)^{-1}ı_n′ is an orthogonal projection matrix!
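A tiny numerical illustration that M_1 indeed demeans y:

    import numpy as np

    rng = np.random.default_rng(5)
    n = 10
    y = rng.normal(size=n)

    ones = np.ones(n)
    M1 = np.eye(n) - np.outer(ones, ones) / n   # I_n - i_n (i_n' i_n)^{-1} i_n'

    print(np.allclose(M1 @ y, y - y.mean()))    # True: M_1 y = y - y_bar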

48 / 54
SST = SSE + SSR decomposition

From y = P_X y + M_X y we can obtain:

    ỹ = M_1 P_X y + M_1 M_X y.   (43)

This looks complicated! However, notice that:

    M_1 M_X = M_X − ı_n(ı_n′ı_n)^{-1}ı_n′M_X,   (44)

but we showed previously that ı_n′M_X = 0_n′. Hence:

    M_1 M_X = M_X.   (45)

49 / 54
SST = SSE + SSR decomposition

Using the above result:

    ỹ = M_1 P_X y + M_X y,   (46)

such that:

    ỹ′ỹ = y′P_X′M_1′M_1 P_X y + y′M_X′M_X y.   (47)

Here we used the fact that, because M_X = M_X′ = M_X² (and the same for
M_1):

    M_X′M_1′P_X = (M_1 M_X)′P_X = M_X P_X = O.   (48)

50 / 54
The R²

This implies again that:

    ỹ′ỹ  =  ŷ′M_1 ŷ  +  ê′ê.   (49)
    (SST)    (SSE)      (SSR)

Hence:

    R² ≡ SSE/SST = 1 − (y′M_X y) / (y′M_1 y).   (50)

Here we used the fact that ê′ê = y′M_X y and ỹ′ỹ = y′M_1 y.

Hence R² is a function of two different quadratic forms in the y vector.
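A closing numerical illustration (simulated data): R² computed from the quadratic forms in (50) coincides with 1 − SSR/SST computed directly:

    import numpy as np

    rng = np.random.default_rng(6)
    n, K = 80, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
    y = X @ np.array([1.0, 0.7, -0.3]) + rng.normal(size=n)

    P = X @ np.linalg.inv(X.T @ X) @ X.T
    M_X = np.eye(n) - P
    ones = np.ones(n)
    M_1 = np.eye(n) - np.outer(ones, ones) / n

    # R^2 from the two quadratic forms in (50).
    r2_quadratic = 1.0 - (y @ M_X @ y) / (y @ M_1 @ y)

    # R^2 from the residuals and the demeaned y directly.
    resid = M_X @ y
    r2_direct = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

    print(np.isclose(r2_quadratic, r2_direct))   # True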

51 / 54
4. Summary

52 / 54
Summary today

In this lecture
- We introduced the multiple regression framework.
- We introduced the vector/matrix notation for this framework.
- We showed how the OLS estimator can be derived using this new notation.
- We provided a geometrical interpretation of the OLS estimator.
- We gave interpretations of the residuals and fitted values in terms of the
  corresponding orthogonal projections.

53 / 54
On Friday

- We study the statistical properties of the OLS estimator.
- We prove the Gauss-Markov theorem, which implies that OLS is the BLUE
  estimator.
- We prove the Frisch-Waugh-Lovell theorem.

54 / 54
