
Econometrics I

An Introductory Course to Econometric Modeling, WS 2023/24

S. Adhikari, E. Flonner, G. Malsiner-Walli


Acknowledgements to: S. Frühwirth-Schnatter, T. Fissler

Institute for Statistics and Mathematics, Department of Finance, Accounting and Statistics
Introductory Course to Econometric Modeling

The course – in particular this set of slides – is based on and coherent with previous
econometrics courses held by Sylvia Frühwirth-Schnatter. It is aimed at being consistent
with other courses from this lecture series (“Econometrics II” / “Applied
Econometrics”).

Throughout the next few months, we strive for competence in the following. . .
Major Milestones
▶ Part I: Basic Concepts of Econometric Modeling
▶ Part II: OLS Estimation
▶ Part III: Multiple Regression Model

2 / 288
Literature
Introductory and largely non-mathematical:
▶ Gary Koop: Analysis of Economic Data. Wiley, 4th edition, 2013.

The “Classics”:
▶ James H. Stock and Mark W. Watson: Introduction to Econometrics. Prentice
Hall, 3rd international edition, 2011.
▶ Jeffrey M. Wooldridge: Introductory Econometrics: A Modern Approach. Cengage,
5th international edition, 2013.

In German:
▶ Herbert Stocker: Methoden der Empirischen Wirtschaftsforschung.
https://www.hsto.info/econometrics/.
▶ Peter Hackl: Einführung in die Ökonometrie. Pearson, 2. Auflage, 2013.

Older editions are good enough.


3 / 288
Workload

▶ 60 ECTS credits are the equivalent of a full year of study.


▶ Workload of Econometrics I: 4 ECTS
▶ Workload in hours: 4 x 25 hours = 100 hours
▶ Workload per week: 100/12 ≈ 8 hours.

4 / 288
Part I
Basic Concepts of Econometric Modeling
Outline

Part I: Basic Concepts of Econometric Modeling


▶ What is econometric modeling?

▶ First steps in R

▶ Common data structures

▶ The simple regression model


▶ Model formulation and basic assumptions
▶ The log-linear regression model

Part I: Basic Concepts of Econometric Modeling What is econometric modeling? 6 / 288


Econometric Modeling

Econometrics deals with learning about an economic phenomenon (e.g. status of the
economy, influence of product attributes, volatility on financial markets, wage mobility)
from data.
▶ Econometric model: description of the phenomenon involving quantities that are
observable
▶ Data: collected for the observable variables
▶ Econometric inference: draw conclusions from the data about the phenomenon
of interest

Part I: Basic Concepts of Econometric Modeling What is econometric modeling? 7 / 288


Econometric Modeling - Stochastic/Deterministic
Models?

Example: Relationship between price and demand:


▶ Description of the phenomenon involving quantities that are observable.
▶ Economic model: simplified description of the process behind the data based on
a deterministic, mathematical model.
▶ Sometimes the deterministic model is based on some economic theory,
sometimes it’s simply a “convenient choice”.
▶ A stochastic model is used rather than a deterministic model (mainly) because the economic model is a simplification.

Part I: Basic Concepts of Econometric Modeling What is econometric modeling? 8 / 288


Deterministic Model

The exact quantitative relationship between the variables of interest is assumed to be known.
Example: Deterministic Relationship between Demand and Price
D = f (p),
where D is the demand and p is the price.

Linear model:
D = β0 + β1 p
Non-linear model:
D = β0 · p^β1

Part I: Basic Concepts of Econometric Modeling What is econometric modeling? 9 / 288


Stochastic model

Exact quantitative relationship between the variables of interest is NOT known, but
disturbed by a (stochastic) error term.

Example: Stochastic Relationship between Demand and Price


D = f (p, u)
where D is the demand, p is the price, and u is an unobservable error.

Linear model:
D = β0 + β1 p + u
Non-linear model:
D = β0 · p^β1 · u

Part I: Basic Concepts of Econometric Modeling What is econometric modeling? 10 / 288


Econometric Model

[Figure: demand vs. price under the deterministic model (left column) and the stochastic model (right column), for the linear specification (top row) and the non-linear specification (bottom row).]
Part I: Basic Concepts of Econometric Modeling What is econometric modeling? 11 / 288
Where Does the Error Come From?

▶ u aggregates variables that are not included into the model because
▶ their influence is not known a priori;
▶ these variables are unobservable or difficult to quantify.
▶ u aggregates measurement errors which are caused by quantifying economic
variables.
▶ u captures the unpredictable randomness in the left-hand-side variable of the model.

Part I: Basic Concepts of Econometric Modeling What is econometric modeling? 12 / 288


Econometric Inference - Example

Example: Relationship between Demand and Price


Estimate β0 and β1 from the linear model:

D = β0 + β1 p + u

or from the non-linear model:


D = β0 · p^β1 · u
from data.
Remark: For the second model, β1 is the price elasticity.

Part I: Basic Concepts of Econometric Modeling What is econometric modeling? 13 / 288


Econometric Inference - Definition

Econometric inference is, in general, concerned with drawing conclusions from observed
data about quantities that are not directly observable.

For instance, we might be interested in


▶ model parameters of economic interest (e.g. price elasticities)
▶ hypotheses about these parameters
▶ prediction (e.g. the expected demand for a particular price)

Part I: Basic Concepts of Econometric Modeling What is econometric modeling? 14 / 288


Econometric Inference - Uncertainty

Since these quantities of interest cannot be observed directly, any statement about them will be uncertain. There are two ways of dealing with this uncertainty:
▶ Classical inference: parameter estimation, hypothesis testing, and prediction as discussed in the PI Statistik.
▶ Bayesian inference: is based on the concept that the state of knowledge about
any unknown quantity is best expressed in terms of a probability distribution which
is updated in the light of new knowledge.

Part I: Basic Concepts of Econometric Modeling What is econometric modeling? 15 / 288


Applied Econometrics - Steps

▶ Model formulation
▶ Model estimation
▶ Econometric inference: parameter estimation, hypothesis testing,
forecasting/prediction
▶ Model choice
▶ Model checking

Part I: Basic Concepts of Econometric Modeling What is econometric modeling? 16 / 288


Outline

Part I: Basic Concepts of Econometric Modeling


▶ What is econometric modeling?

▶ First steps in R

▶ Common data structures

▶ The simple regression model


▶ Model formulation and basic assumptions
▶ The log-linear regression model

Part I: Basic Concepts of Econometric Modeling First steps in R 17 / 288


Software

▶ A software package is needed for practical econometric inference.


▶ We will use R.
▶ Detailed instructions on how to use EViews are given in the tutorial.
▶ We already assume a certain familiarity with R. Still, help and instructions can be
found in the tutorial.

Part I: Basic Concepts of Econometric Modeling First steps in R 18 / 288


Outline

Part I: Basic Concepts of Econometric Modeling


▶ What is econometric modeling?

▶ First steps in R

▶ Common data structures

▶ The simple regression model


▶ Model formulation and basic assumptions
▶ The log-linear regression model

Part I: Basic Concepts of Econometric Modeling Common data structures 19 / 288


Data Structure - Types

Experimental data: data obtained through a designed experiment (medicine, marketing, etc.). In experiments, a lot of important variables can be explicitly controlled (age, gender, etc.).
This is a rare situation in economics (and many other areas without laboratories).

(Socio-)Economists mostly deal with observational (non-experimental) data:


▶ Cross-sectional Data
▶ Time series data
▶ Panel data

Part I: Basic Concepts of Econometric Modeling Common data structures 20 / 288


Cross-sectional Data

▶ We are interested in variables (Y, X) (e.g. the relationship between demand D and price P) or a set of variables (Y, X1, . . . , XK).
▶ We are observing these variables simultaneously for N subjects drawn randomly from a population (e.g., for various individuals, firms, supermarkets, countries).
Typically, cross-sectional data are indexed as follows:

(yi, xi), or (yi, x1i, . . . , xKi),   i = 1, . . . , N

If the data set is not a (simple) random sample, there is a sample-selection problem.

Part I: Basic Concepts of Econometric Modeling Common data structures 21 / 288


Time Series Data

▶ We are (traditionally) interested in a single variable Y (e.g. the return of a financial asset).
▶ We are observing this variable over time (e.g. every month).
▶ The data cannot be regarded as a random sample. It is important to account for trends, seasonality, . . .
Typically, time series data are indexed as follows:

yt , t = 1, . . . , T .

Part I: Basic Concepts of Econometric Modeling Common data structures 22 / 288


Panel Data

▶ Panel data or longitudinal data: The same (randomly drawn) individuals are followed over time, i.e., we have a time series for each cross-section unit.
Typically, panel data are indexed as follows:

yit , i = 1, . . . , N, t = 1, . . . , T

Part I: Basic Concepts of Econometric Modeling Common data structures 23 / 288


R Homework

Have a look at how data are organized. Files and R code are available on learn@wu:
▶ Case Study Marketing, workfile marketing
▶ Case Study Profit, workfile profit
▶ Case Study Vienna Stocks, workfile viennastocks
▶ Case Study Yields, workfile yieldus
▶ Case Study Chicken, workfile chicken
▶ Case Study Labor Force, workfile change
The code file is called code_eco_I.R.

Part I: Basic Concepts of Econometric Modeling Common data structures 24 / 288


Outline

Part I: Basic Concepts of Econometric Modeling


▶ What is econometric modeling?

▶ First steps in R

▶ Common data structures

▶ The simple regression model


▶ Model formulation and basic assumptions
▶ The log-linear regression model

Part I: Basic Concepts of Econometric Modeling The simple regression model 25 / 288
Question and Data

▶ We are interested in a
▶ dependent variable Y (left-hand side, explained, response), which is supposed to depend on an
▶ explanatory variable X (right-hand side, independent, control, predictor).
▶ Examples:
▶ demand is a response variable and price is a predictor variable;
▶ wage is a response and years of education is a predictor.
▶ Data: We observe the pair of variables (Y , X ) for N subjects drawn randomly from
a population (e.g. for various supermarkets, for various individuals): (yi , xi ),
i = 1, . . . , N.

Part I: Basic Concepts of Econometric Modeling The simple regression model 26 / 288
Model Formulation

The simple linear regression model describes the dependence between the variables X
and Y as:
Simple Linear Regression Model

Y = β0 + β1 X + u. (1)

The parameters β0 and β1 need to be estimated:


▶ β0 is referred to as the constant or intercept
▶ β1 is referred to as the slope parameter

Part I: Basic Concepts of Econometric Modeling The simple regression model 27 / 288
Impact of the Error Term

[Figure: simulated demand vs. price for error variances σ² = 0.2, σ² = 1, σ² = 0.01, and σ² = 0; the larger the error variance, the more the simulated points scatter around the true regression line.]

Part I: Basic Concepts of Econometric Modeling The simple regression model 28 / 288
Basic Assumptions

▶ The average value of the error term u in the population is 0 (not restrictive, we can
always use β0 to normalize E(u) to 0):

Assumption About the Unconditional Mean Error

E(u) = 0 (2)
▶ A crucial assumption is that u and X are uncorrelated. This means that the
conditional mean of u is zero, i.e., knowing X does not give us any information
about u.
Assumption About the Conditional Mean Error

E(u|X ) = E(u) (3)

Part I: Basic Concepts of Econometric Modeling The simple regression model 29 / 288
Main Assumption

The linear model given in (1) and assumptions (2) and (3) imply that E(Y |X ) (i.e., the
conditional mean of Y given X ) is a linear function of X :

Modeling Assumption on the Conditional Mean

E(Y |X ) = β0 + β1 X (4)

Loosely speaking: For a fixed value of X = x , on average over the population, the linear
prediction β0 + β1 x is correct.

Part I: Basic Concepts of Econometric Modeling The simple regression model 30 / 288
Understanding the Regression Model - error term

▶ Simulate data from a simple regression model with β0 = 0.2 and β1 = −1.8:

Y = 0.2 − 1.8X + u (5)

▶ Specification of the error term:


 
u independent of X and u ∼ N(0, σ²)   (6)

▶ Demonstration ⇒ R-code code_eco_I.R

Part I: Basic Concepts of Econometric Modeling The simple regression model 31 / 288
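The demonstration refers to the course file code_eco_I.R, which is not reproduced in these slides. A minimal sketch of such a simulation (the grid of error variances and all variable names are my own choices, not the course code) could look like this in R:

# simulate model (5)-(6) for several error variances and fit the line by OLS
set.seed(1)
N     <- 100
price <- runif(N, 1, 2)                     # predictor X
for (sigma2 in c(0.2, 1, 0.01, 0)) {
  u      <- rnorm(N, mean = 0, sd = sqrt(sigma2))
  demand <- 0.2 - 1.8 * price + u           # true model: beta0 = 0.2, beta1 = -1.8
  fit    <- lm(demand ~ price)
  cat("sigma^2 =", sigma2, " estimates:", round(coef(fit), 3), "\n")
}

The estimated coefficients settle closer to (0.2, −1.8) as σ² shrinks; for σ² = 0 the fit is exact.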
Understanding the Regression Model

[Figure: simulated data from model (5) for error variances σ² = 0.2, σ² = 1, σ² = 0.01, and σ² = 0; for σ² = 0 the points lie exactly on the line determined by β0 and β1.]

Part I: Basic Concepts of Econometric Modeling The simple regression model 32 / 288
Understanding the Parameters - interpretation of β1

Expected value of Y , given X = x :

E(Y |X = x ) = β0 + β1 x

Expected value of Y , if the predictor X is changed by 1:

E(Y |X = x + 1) = β0 + β1 (x + 1)

Thus, β1 is the expected absolute change of the response variable Y , if the predictor X
is increased by 1:

E(Y |X = x + 1) − E(Y |X = x ) = β1

Part I: Basic Concepts of Econometric Modeling The simple regression model 33 / 288
Understanding the Parameters

▶ The effect of changing X is independent of the level of X .


▶ The sign shows the direction of the expected change:
▶ If β1 > 0, then the changes of X and Y go into the same direction.
▶ If β1 < 0, then the changes of X and Y go into opposite directions.
▶ If β1 = 0, then a change in X has no influence on Y .

Part I: Basic Concepts of Econometric Modeling The simple regression model 34 / 288
The Log-Linear Regression Model

▶ Log-linear regression model assumes a (specific) nonlinear relation btw Y and X :


Log-linear regression model
Suppose Y , X > 0,

Y = β̃0 · X^β1 · ũ,   β̃0, ũ > 0. (7)


▶ By taking the natural logarithm on both sides we obtain a linear (in the
parameters) regression model for the transformed variables log Y and log X :
Log-linear regression model with assumption on error
log Y = β0 + β1 log X + u, E(u|X ) = 0. (8)

where β0 = log β̃0 and u = log ũ.


This model is sometimes called the “log-log” model, because logarithms are taken w.r.t. X and Y .
Part I: Basic Concepts of Econometric Modeling The simple regression model 35 / 288
Visualizing the Log-Linear Regression Model

▶ Simulate data from a simple log-linear regression model with β̃0 = 0.2 and
β1 = −1.8:

Y = 0.2 · X^(−1.8) · e^u

▶ Specification of the error term:


 
u independent of X and u ∼ N(0, σ²)

▶ Demonstration ⇒ R-code regsimlog.R

Part I: Basic Concepts of Econometric Modeling The simple regression model 36 / 288
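The file regsimlog.R is not included here; a rough sketch of the same idea (simulated numbers of my own) shows how the log transformation turns (7) into the linear model (8):

# simulate Y = 0.2 * X^(-1.8) * exp(u) and estimate it on the log-log scale
set.seed(1)
N      <- 200
price  <- runif(N, 1, 2)
u      <- rnorm(N, mean = 0, sd = sqrt(0.1))
demand <- 0.2 * price^(-1.8) * exp(u)
fit <- lm(log(demand) ~ log(price))
coef(fit)            # intercept estimates log(0.2) = -1.61, slope estimates -1.8
exp(coef(fit)[1])    # back-transformed intercept, roughly 0.2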
Visualizing the Log-Linear Regression Model

[Figure: simulated demand vs. price (left column) and log(demand) vs. log(price) (right column) for error variances σ² = 0.01 (top row) and σ² = 0.1 (bottom row); on the log-log scale the relationship is linear.]

Part I: Basic Concepts of Econometric Modeling The simple regression model 37 / 288
Part II
OLS Estimation
Understanding the Parameters I

In economics, elasticity measures how changing one variable affects other variables in
relative terms. If y = f (x ), then the elasticity is the ratio of the percentage change
%∆y in y and the percentage change %∆x in the variable x :
\frac{\%\Delta y}{\%\Delta x} \approx \frac{\partial y}{y}\Big/\frac{\partial x}{x} = \frac{\partial \log y}{\partial \log x}
From equation (8) we obtain the following expected value of log Y , if the predictor X is
equal to x :

E(log Y |X = x ) = β0 + β1 log x .

Therefore:
E\!\left(\frac{\%\Delta Y}{\%\Delta X}\right) \approx E\!\left(\frac{\partial \log y}{\partial \log x}\right) = \frac{\partial\, E(\log y)}{\partial \log x} = \beta_1

Part II: OLS Estimation 39 / 288


Understanding the Parameters II

▶ The parameter β1 is approximately the expected change in % of the response


variable Y , if the predictor X is increased by 1% (elasticity).
▶ The sign of β1 shows the direction the of expected relative change of the response
variable Y . If β1 = 0, then a change in X has no influence in Y .
▶ If X is increased by p%, then the expected change of Y is equal to β1 p% for
small p.

Part II: OLS Estimation 40 / 288


Understanding the Parameters with log variables

Summary of functional forms involving logarithms


Dependent variable    Independent variable    Interpretation of β1
y                     x                       ∆y = β1 ∆x
y                     log(x)                  ∆y = (β1/100) %∆x
log(y)                x                       %∆y = (β1 · 100) ∆x
log(y)                log(x)                  %∆y = β1 %∆x

Part II: OLS Estimation 41 / 288


Ordinary Least Squares (OLS) Estimation -
Estimation Problem

▶ Let (yi , xi ), i = 1, . . . , N, denote a random sample of size N from the population.


Hence, for each i

yi = β0 + β1 xi + ui . (9)

▶ The population parameters β0 and β1 are estimated from a sample.

▶ The parameter estimates are typically denoted by a hat: βˆ0 and βˆ1 .

▶ Estimation problem: How to choose the unknown parameters β0 and β1 ?

Part II: OLS Estimation 42 / 288


OLS-Estimation - black box?

▶ Estimation as Black Box? Very conveniently, the estimation problem is solved by


software packages like R or EViews. It helps, however, to have a deeper
understanding of what is going on.
▶ The commonly used method to estimate the parameters in a simple (mean)
regression model is ordinary least square (OLS) estimation.

Part II: OLS Estimation 43 / 288


OLS-Estimation

▶ Let (γ0 , γ1 ) denote a candidate for (β0 , β1 ).


▶ For each observation xi , the prediction ŷi of yi depends on the candidate choice
(γ0 , γ1 ):

ŷi (γ0 , γ1 ) = γ0 + γ1 xi (10)

▶ For each observation xi define the residual ui (prediction error) as:

ui = yi − ŷi = yi − (γ0 + γ1 xi ) (11)

▶ For each candidate (γ0 , γ1 ), an overall measure of fit is obtained by aggregating


these prediction errors.

Part II: OLS Estimation 44 / 288


OLS-Estimation

▶ The aggregated squared prediction errors:

Sum of Squared Residuals (SSR)

SSR(\gamma_0,\gamma_1) = \sum_{i=1}^{N} u_i(\gamma_0,\gamma_1)^2 = \sum_{i=1}^{N} \big(y_i - \hat y_i(\gamma_0,\gamma_1)\big)^2 = \sum_{i=1}^{N} (y_i - \gamma_0 - \gamma_1 x_i)^2 \qquad (12)

Then:  \hat\beta = (\hat\beta_0, \hat\beta_1) = \arg\min_{\gamma_0,\gamma_1} SSR(\gamma_0,\gamma_1) \qquad (13)

▶ Intuitively, OLS is fitting a line through the sample points such that the sum of
squared residuals is as small as possible.
▶ The OLS-estimator β̂ = (βˆ0 , βˆ1 ) is the parameter that minimizes the sum of
squared residuals

Part II: OLS Estimation 45 / 288


OLS-Estimation

[Figure sequence (slides 46–59): a scatter plot of sample data together with a sequence of candidate regression lines; as the candidate line is adjusted, the residual sum of squares decreases step by step (RSS = 26.80, 22.19, 18.03, 14.33, 11.09, 8.30, 5.97, 4.09, 2.67, 1.11) until the line with the smallest sum of squared residuals — the OLS line — is reached.]
Part II: OLS Estimation 59 / 288
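A small sketch of the idea behind this figure sequence (made-up data, not the slide's example): evaluate the sum of squared residuals for a grid of candidate slopes, setting the intercept for each candidate to the value that is optimal for that slope (γ0 = ȳ − γ1 x̄, cf. equation (17) later in this part), and pick the candidate with the smallest SSR.

set.seed(1)
x <- runif(50, 0, 12)
y <- 6 + 1.0 * x + rnorm(50, sd = 1.5)
ssr <- function(g0, g1) sum((y - g0 - g1 * x)^2)   # sum of squared residuals (12)
cand_slopes <- seq(0, 2, by = 0.05)
ssr_vals <- sapply(cand_slopes, function(g1) ssr(mean(y) - g1 * mean(x), g1))
cand_slopes[which.min(ssr_vals)]                   # slope of the best candidate line
coef(lm(y ~ x))[2]                                 # exact OLS slope for comparison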


How to Compute the OLS Estimator?

OLS estimates for the Simple linear regression model


\hat\beta_0 = \bar y - \hat\beta_1 \bar x, \qquad \hat\beta_1 = r_{xy}\,\frac{s_y}{s_x} \qquad (14)

where
▶ x̄ is the mean of x1, . . . , xN
▶ ȳ is the mean of y1, . . . , yN
▶ sx is the standard deviation of x1, . . . , xN
▶ sy is the standard deviation of y1, . . . , yN
▶ rxy is the linear correlation coefficient between x and y
The only requirement is that we have sample variation in X, i.e. sx² > 0.

Part II: OLS Estimation 60 / 288
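A minimal check of formula (14) in R (simulated data, not from the case studies):

set.seed(2)
N <- 100
x <- rnorm(N)
y <- 0.2 - 1.8 * x + rnorm(N)
b1 <- cor(x, y) * sd(y) / sd(x)   # beta1_hat = r_xy * s_y / s_x
b0 <- mean(y) - b1 * mean(x)      # beta0_hat = ybar - beta1_hat * xbar
c(b0, b1)
coef(lm(y ~ x))                   # lm() returns the same values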


Proof

The OLS estimator is obtained as solution to the following minimization problem:


\arg\min_{\gamma_0,\gamma_1} \sum_{i=1}^{N} (y_i - \gamma_0 - \gamma_1 x_i)^2

Taking derivatives with respect to γ0 and γ1 , the first-order conditions are:


-2 \sum_{i=1}^{N} (y_i - \gamma_0 - \gamma_1 x_i) = 0, \quad \text{and} \qquad (15)

-2 \sum_{i=1}^{N} x_i (y_i - \gamma_0 - \gamma_1 x_i) = 0 \qquad (16)

Part II: OLS Estimation 61 / 288


Proof

From (15) we have ȳ − γ0 − γ1 x̄ = 0. Thus:

ȳ = β̂0 + β̂1 x̄  =⇒  β̂0 = ȳ − β̂1 x̄   (17)

Implications (algebraic properties of OLS):


▶ The regression line passes through the sample mean.
▶ The sum (and thus also the average) of the OLS residuals

ûi = yi − β̂0 − β̂1 xi

is equal to zero. Follows directly from (15):


\frac{1}{N}\sum_{i=1}^{N} \hat u_i = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0

Part II: OLS Estimation 62 / 288


Proof

Substituting β̂0 = ȳ − β̂1 x̄ into (16) and solving for β̂1, we obtain:

\hat\beta_1 = \frac{\sum_{i=1}^{N}(x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{N}(x_i - \bar x)^2} = r_{xy}\,\frac{s_y}{s_x} \qquad (18)

provided that \sum_{i=1}^{N}(x_i - \bar x)^2 > 0 (or sx² > 0).

Implications (algebraic properties of OLS):


▶ The slope estimate is the sample covariance between X and Y , divided by the
sample variance of X .
▶ If X and Y are positively (negatively) correlated, the slope will be positive
(negative).

Part II: OLS Estimation 63 / 288


Proof

▶ The sample covariance between the regressor and the OLS residuals is zero.
Follows from (16):
\frac{1}{N}\sum_{i=1}^{N} x_i \hat u_i = \frac{1}{N}\sum_{i=1}^{N} x_i (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0.

Use OLS estimator to estimate Y for a given X = x :


Predicted Values

ŷ = β̂0 + β̂1 x .

Part II: OLS Estimation 64 / 288


Statistical Properties of OLS Estimation

Econometric inference: learning from the data about the unknown parameter

β = (β0 , β1 ) in the regression model.
▶ Use the OLS estimator β̂ to learn about the regression parameter.
▶ Is this estimator equal to the true value?
▶ How large is the difference between the OLS estimator and the true parameter?
▶ Is there a better estimator than the OLS estimator?

Part II: OLS Estimation 65 / 288


Understanding the Estimation Problem

1. Simulate data from a simple regression model:


 
Yi = 0.2 − 1.8Xi + ui,   ui|Xi ∼ N(0, σ²)   (19)

2. Run OLS estimation to obtain (β̂0 , β̂1 ) and compare the estimated values with the
true values β0 = 0.2 and β1 = −1.8,
3. Repeat this experiment several times.

Part II: OLS Estimation 66 / 288
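A compact sketch of this repeated-sampling experiment (my own seed and settings):

set.seed(3)
R <- 100; N <- 100
est <- t(replicate(R, {
  x <- rnorm(N)
  y <- 0.2 - 1.8 * x + rnorm(N)     # error variance sigma^2 = 1
  coef(lm(y ~ x))
}))
colMeans(est)       # averages are close to the true values (0.2, -1.8)
apply(est, 2, sd)   # sampling spread; it shrinks with larger N or smaller sigma^2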


Small vs. Large Sample Size (100 Experiments)

[Figure: simulated data (x, y) with σ² = 1, σ²_X = 1, µ_X = 0 for sample size N = 100 (left) and N = 1000 (right).]

Part II: OLS Estimation 67 / 288


Small vs. Large Sample Size (100 Experiments)

[Figure: OLS estimates (β̂0, β̂1) from 100 simulated samples for N = 100 (left) and N = 1000 (right), with σ² = 1, σ²_X = 1, µ_X = 0; the estimates scatter around the true values (0.2, −1.8) and are much more concentrated for the larger sample size.]

Part II: OLS Estimation 68 / 288


Small vs. Large Error Variance (100 Experiments)

[Figure: simulated data (x, y) with N = 100, σ²_X = 1, µ_X = 0 for error variance σ² = 1 (left) and σ² = 3 (right).]

Part II: OLS Estimation 69 / 288


Small vs. Large Error Variance (100 Experiments)

[Figure: OLS estimates (β̂0, β̂1) from 100 simulated samples for σ² = 1 (left) and σ² = 3 (right), N = 100; the larger error variance produces a wider scatter of the estimates around (0.2, −1.8).]

Part II: OLS Estimation 70 / 288


Small vs. Large Spread of X (100 Experiments)

[Figure: simulated data (x, y) with N = 100, σ² = 1, µ_X = 0 for regressor variance σ²_X = 1 (left) and σ²_X = 10 (right).]

Part II: OLS Estimation 71 / 288


Small vs. Large Spread of X (100 Experiments)

[Figure: OLS estimates (β̂0, β̂1) from 100 simulated samples for σ²_X = 1 (left) and σ²_X = 10 (right); a larger spread of the regressor gives noticeably more precise slope estimates.]

Part II: OLS Estimation 72 / 288


Effect of “De-Centering” (100 Experiments)
[Figure: simulated data (x, y) with N = 100, σ² = 1, σ²_X = 1 for regressor mean µ_X = 0 (left) and µ_X = 2 (right).]

Part II: OLS Estimation 73 / 288


Effect of “De-Centering” (100 Experiments)

[Figure: OLS estimates (β̂0, β̂1) from 100 simulated samples for µ_X = 0 (left) and µ_X = 2 (right); de-centering the regressor leaves the slope estimates essentially unchanged but makes the intercept estimates more variable and visibly correlated with the slope estimates.]

Part II: OLS Estimation 74 / 288


Understanding the Estimation Problem

▶ Although we are estimating the true model (no model misspecification), the OLS
estimates differ from the true value.
▶ Many different data sets of size N may be generated by the same regression
model due to the stochastic error term.
▶ The estimated parameters differ, as the sample mean, sample variance and
correlation coefficient are different for each data set:

OLS Estimates (Recap)


β̂0 = ȳ − β̂1 x̄,   β̂1 = r_xy · (s_y / s_x)

Part II: OLS Estimation 75 / 288


About the Expected Error of the OLS Estimator

Obviously, the estimator is a random variable. Hence it makes sense to study the
statistical properties of OLS estimation.
Questions
▶ Are the OLS estimates unbiased, i.e., is the expected difference between the OLS
estimator and the true parameter equal to 0?
▶ How precise are these parameter estimates, i.e., how large is the variance of the
two estimators?
▶ Are the OLS coefficients correlated?
▶ How are the OLS coefficients distributed?

Part II: OLS Estimation 76 / 288


About the Expected Error of the OLS Estimator

▶ For fixed values of x1, . . . , xN the sampling properties of ȳ, sy² and rxy determine
the estimation error:
Properties of the Estimation Error
The estimation error. . .
▶ decreases with increasing number of observations N

▶ increases with increasing error variance σ 2

▶ depends on the predictor variable through sx2

[Note: The estimation error of β̂0 also depends on x̄]

Part II: OLS Estimation 77 / 288


Unbiasedness

OLS is Unbiased
Under assumption (4), the OLS estimator is unbiased, i.e., on average the estimated
value is equal to the true one:
   
E(β̂1) = β1,   E(β̂0) = β0

Relation between β̂1 and β1 :


\hat\beta_1 = \beta_1 + \frac{\sum_{i=1}^{N}(X_i - \bar X)\,u_i}{N \cdot s_X^2}. \qquad (20)

Proof of (20) as exercise.

Part II: OLS Estimation 78 / 288


Unbiasedness
 
Since E(ui|Xi) = 0, we obtain from (20) that E(β̂1) = β1, because

E(\hat\beta_1) = E(\beta_1) + \frac{1}{N s_X^2}\sum_{i=1}^{N}(X_i - \bar X)\,E(u_i|X_i) = \beta_1.

Hence E(β1 − β̂1) = 0. Furthermore:

\hat\beta_0 = \bar Y - \hat\beta_1 \bar X = \beta_0 + \beta_1 \bar X + \bar u - \hat\beta_1 \bar X = \beta_0 + (\beta_1 - \hat\beta_1)\,\bar X + \frac{1}{N}\sum_{i=1}^{N} u_i,

which implies

E(\hat\beta_0) = E(\beta_0) + E(\beta_1 - \hat\beta_1)\,\bar X + \frac{1}{N}\sum_{i=1}^{N} E(u_i|X_i) = \beta_0.

Part II: OLS Estimation 79 / 288


Homoskedasticity

▶ How big is the difference between the OLS estimator and the true parameter?
▶ To answer this question, we make an additional assumption on the conditional
variance:
Assumption of Homoskedasticity

V(u|X ) = σ 2 (21)
▶ This means that the variance of the error term u is the same, regardless of the
value of the predictor variable X .

▶ Note: If assumption (21) is violated, e.g. if V(u|X ) = σ 2 h(X ), then we say the
error term is heteroskedastic.

Part II: OLS Estimation 80 / 288


Homoskedasticity

▶ Assumption (21) certainly holds if u and X are assumed to be independent.


However, (21) is a weaker assumption.
▶ Assumption (21) implies that σ 2 is also the unconditional variance of u, referred to
as error variance V(u):
 
V(u) = E(u²) − E(u)² = σ²

Its square root σ is the standard deviation of the error.


▶ It follows that V(Y |X ) = σ 2 .

Part II: OLS Estimation 81 / 288


Variance of the OLS Estimator

▶ How large is the variation of the OLS estimator around the true parameter?
 
▶ We know that E β̂1 − β1 = 0
▶ We measure the variation of the OLS estimator around the true parameter through
the expected squared difference, i.e. the variance:
  
E[(β̂1 − β1)²] = V(β̂1)   (22)

▶ Similarly for β̂0: V(β̂0) = E[(β̂0 − β0)²]

Part II: OLS Estimation 82 / 288


Variance of the OLS Estimator

Variance of the slope estimator β̂1 follows from (20):


V(\hat\beta_1) = \frac{1}{N^2 (s_X^2)^2}\sum_{i=1}^{N}(X_i - \bar X)^2\, V(u_i) = \frac{\sigma^2}{N^2 (s_X^2)^2}\sum_{i=1}^{N}(X_i - \bar X)^2 = \frac{\sigma^2}{N s_X^2} \qquad (23)

▶ The variance of the slope estimator is larger, the smaller the number of observations N (and smaller, the larger N). Doubling the sample size N halves the variance of β̂1.
▶ The variance of the slope estimator is larger, the larger the error variance σ². Doubling the error variance σ² doubles the variance of β̂1.
▶ The variance of the slope estimator is larger, the smaller the variation in X. Doubling sX² halves the variance of β̂1.

Part II: OLS Estimation 83 / 288
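Formula (23) can be checked by simulation; a sketch (with s_X² computed as Σ(Xi − X̄)²/N, as in the slides, and my own simulation settings) could look like this:

set.seed(4)
R <- 2000; N <- 100; sigma2 <- 1
x <- rnorm(N)                                  # keep the regressor values fixed
b1 <- replicate(R, {
  y <- 0.2 - 1.8 * x + rnorm(N, sd = sqrt(sigma2))
  coef(lm(y ~ x))[2]
})
var(b1)                                        # Monte Carlo variance of beta1_hat
sigma2 / (N * mean((x - mean(x))^2))           # sigma^2 / (N * s_X^2), cf. (23)

The two numbers agree up to simulation noise.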


Variance of the OLS Estimator

▶ The variance is in general different for the two parameters in the simple regression
model. V(β̂0) is given by (without proof):

V(\hat\beta_0) = \frac{\sigma^2}{N s_X^2}\cdot\frac{1}{N}\sum_{i=1}^{N} X_i^2 \qquad (24)

▶ The standard deviations sd(βˆ0 ) and sd(βˆ1 ) of the OLS estimators are defined as:
sd(\hat\beta_0) = \sqrt{V(\hat\beta_0)}, \qquad sd(\hat\beta_1) = \sqrt{V(\hat\beta_1)}

Part II: OLS Estimation 84 / 288


Checking the Assumptions

▶ We present checks for correct model specification (4), that is,

E(u|X ) = 0

and for homoskedasticity (21), meaning

V(u|X ) = σ 2 .

Part II: OLS Estimation 85 / 288


Checking Correct Model Specification

▶ Since we don’t observe the error terms ui directly, we take the residuals ûi as
proxies.
▶ We usually don’t have enough observations of the regressor X for any possible
value x . That’s why we check E(u|X ) = 0 not for X = x , but for a ≤ X ≤ b.
[Figure: top row ('OLS − true model'): fitted regression of log(y) on log(x) and the corresponding OLS residuals plotted against log(x) — the residuals scatter around zero without a pattern. Bottom row ('OLS − misspecification'): fit and residuals for a misspecified model — the residuals show a pronounced systematic pattern in x.]

Part II: OLS Estimation 86 / 288


Checking Homoskedasticity

▶ Again, we take the residuals ûi as proxies for the errors.


▶ As before, we check V(u|X ) = σ 2 not for X = x , but for a ≤ X ≤ b to see
whether the variability of the residuals change with the level of X .
[Figure: top row ('OLS − homoskedasticity'): data and OLS residuals whose spread stays roughly constant over the range of x. Bottom row ('OLS − heteroskedasticity'): data and OLS residuals whose spread increases with x.]

Part II: OLS Estimation 87 / 288
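A small sketch of such residual plots (simulated data; the two error specifications are my own choices):

set.seed(5)
N <- 300
x <- runif(N, 0, 1000)
y_hom <- 2 + 3 * x + rnorm(N, sd = 50)        # constant error variance
y_het <- 2 + 3 * x + rnorm(N, sd = 0.1 * x)   # error sd grows with x
par(mfrow = c(1, 2))
plot(x, resid(lm(y_hom ~ x)), main = "homoskedastic",   ylab = "OLS residuals")
plot(x, resid(lm(y_het ~ x)), main = "heteroskedastic", ylab = "OLS residuals")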


Part III
Multiple Regression Model
Outline

Part III: Multiple Regression Model

▶ Model Formulation

▶ OLS Estimation

Part III: Multiple Regression Model Model Formulation 89 / 288


Data

▶ We are interested in a
▶ dependent (left-hand side, explained, response) variable Y , which is supposed to
depend on
▶ K explanatory (right-hand side, independent, control, predictor) variables
X1 , . . . , XK .
▶ Example: wage is a response; education, gender, and experience are predictor
variables.
▶ Sample: We observe these variables for N subjects drawn randomly from a
population (e.g. for various supermarkets, for various individuals):

(yi , x1,i , . . . , xK ,i ) for i = 1, . . . , N

Part III: Multiple Regression Model Model Formulation 90 / 288


Model Formulation

▶ The multiple regression model describes the relation between the response variable
Y and the predictor variables X1 , . . . , XK as:
Multiple Linear Regression Model

Y = β0 + β1 X1 + . . . + βK XK + u (25)

where β0 , β1 , . . . , βK are unknown parameters.

▶ Key Assumption:

E(u|X1 , . . . , XK ) = E(u) = 0 (26)

Part III: Multiple Regression Model Model Formulation 91 / 288


Model Formulation

▶ Assumption (26) implies:

Linearity

E(Y |X1 , . . . , XK ) = β0 + β1 X1 + . . . + βK XK (27)

▶ Note that E(Y |X1 , . . . , XK ) is a linear function


▶ in the parameters β0, β1, . . . , βK (important for “easy” OLS estimation)
▶ in the predictor variables X1 , . . . , XK (important for the correct interpretation of the
parameters)

Part III: Multiple Regression Model Model Formulation 92 / 288


Understanding the Parameters

▶ The parameter βk is the expected absolute change of the response Y , if the


predictor variable Xk is increased by 1, and all other predictors remain the same
(“ceteris paribus”)
▶ If Xk is increased by c, the expected absolute change of Y is βk c, ceteris paribus.

Proof:

E(Y |Xk = x + c) − E(Y |Xk = x )


= β0 + β1 X1 + . . . + βk (x + c) + . . . + βK XK
− (β0 + β1 X1 + . . . + βk x + . . . + βK XK )
= βk c

Part III: Multiple Regression Model Model Formulation 93 / 288


Understanding the Parameters

The sign shows the direction of the expected change:


▶ If βk > 0, then larger Xk implies larger Y ceteris paribus (and vice versa).
▶ If βk < 0, then larger Xk implies smaller Y ceteris paribus (and vice versa).
▶ If βk = 0, then a change in Xk has no influence on Y .

Part III: Multiple Regression Model Model Formulation 94 / 288


The Multiple Log-Linear Model

▶ The multiple log-linear model (also called the multiple “log-log” model) reads:

Y = e^β0 · X1^β1 · · · XK^βK · e^u   (28)

▶ The log transformation of all variables yields a model that is linear in the
parameters β0 , β1 , . . . , βK ,

log Y = β0 + β1 log X1 + . . . + βK log XK + u, (29)

but is nonlinear in the predictor variables X1 , . . . , XK .


▶ This is important for the correct interpretation of the parameters!

Part III: Multiple Regression Model Model Formulation 95 / 288


The Multiple Log-Linear Model

Interpretation of the parameters:


▶ The coefficient βk is the elasticity of the response variable Y with respect to the
variable Xk , i.e. the expected relative percentage change of Y , if the predictor
variable Xk is increased by 1% and all other predictor variables remain the same
(ceteris paribus).
▶ If Xk is increased by p%, then the expected relative change of Y is approximately
equal to βk p%, ceteris paribus (for small p).

Part III: Multiple Regression Model Model Formulation 96 / 288


R Homework

Homework:
Have a look in R how to define a multiple regression model and discuss the meaning of
the estimated parameters:
▶ Case Study Chicken, work file chicken
▶ Case Study Marketing, work file marketing
▶ Case Study Profit, work file profit
⇒ R-code code_eco_I.R

Part III: Multiple Regression Model Model Formulation 97 / 288
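The case-study workfiles themselves are not reproduced here; as a placeholder, a multiple regression in R is specified with a formula listing all predictors (the variable names below are made up, not the case-study columns):

set.seed(6)
df   <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
df$y <- 1 + 2 * df$x1 - 0.5 * df$x2 + rnorm(50)
fit  <- lm(y ~ x1 + x2, data = df)
summary(fit)    # each slope is the ceteris-paribus effect of its predictor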


Outline

Part III: Multiple Regression Model

▶ Model Formulation

▶ OLS Estimation

Part III: Multiple Regression Model OLS Estimation 98 / 288


OLS Estimation

▶ Let (yi , x1,i , . . . , xK ,i ), i = 1, . . . , N denote a random sample of size N from the


population. Hence, for each i:

yi = β0 + β1 x1,i + . . . + βk xk,i + . . . + βK xK ,i + ui (30)

▶ The population parameters β0 , β1 , . . . , βK are estimated from a sample.


▶ The parameter estimates (coefficients) are typically denoted by βˆ0 , βˆ1 , . . . , βˆK . We
will use the following vector notation:
β = (β0, . . . , βK)′,   β̂ = (β̂0, β̂1, . . . , β̂K)′   (31)

Part III: Multiple Regression Model OLS Estimation 99 / 288


OLS Estimation

The commonly used method to estimate the parameters in a multiple regression model
is, again, OLS estimation:
▶ Denote the candidate choice by γ = (γ0 , . . . , γK )′ .
▶ For each observation yi , the prediction ŷi (γ) of yi depends on γ.
▶ For each yi , define the regression residuals (prediction error) ui (γ) as:

ui (γ) = yi − ŷi (γ) = yi − (γ0 + γ1 x1,i + . . . + γK xK ,i ) (32)

Part III: Multiple Regression Model OLS Estimation 100 / 288


OLS Estimation

▶ For each candidate value γ, an overall measure of fit is obtained by aggregating


these prediction errors:

Sum of Squared Residuals (SSR)

SSR(\gamma) = \sum_{i=1}^{N} u_i(\gamma)^2 = \sum_{i=1}^{N} (y_i - \gamma_0 - \gamma_1 x_{1,i} - \ldots - \gamma_K x_{K,i})^2 \qquad (33)

\hat\beta = \arg\min_{\gamma} SSR(\gamma) \qquad (34)

▶ The OLS-estimator β̂ = (βˆ0 , βˆ1 , . . . , βˆK ) is the parameter that minimizes the sum
of squared residuals.

Part III: Multiple Regression Model OLS Estimation 101 / 288


How to Compute the OLS Estimator?

For a multiple regression model, the estimation problem is solved by software packages
like EViews or R.

Some mathematical details:


▶ Take the first partial derivative of (34) with respect to each candidate parameter
γk , k = 0, . . . , K
▶ This yields a system of K + 1 linear equations in γ0 , . . . , γK , which has a unique
solution under certain conditions on the matrix X, having N rows and K + 1
columns, containing in each row i the predictor values (1 x1,i . . . xK ,i )

Part III: Multiple Regression Model OLS Estimation 102 / 288


Matrix Notation of the Multiple Regression Model

Matrix notation for the observed data:


X = \begin{pmatrix}
1 & x_{1,1} & \cdots & x_{K,1} \\
1 & x_{1,2} & \cdots & x_{K,2} \\
\vdots & \vdots & & \vdots \\
1 & x_{1,N-1} & \cdots & x_{K,N-1} \\
1 & x_{1,N} & \cdots & x_{K,N}
\end{pmatrix}, \qquad
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{N-1} \\ y_N \end{pmatrix}

▶ X is an N × (K + 1) matrix (often called design matrix)
▶ y is an N × 1 vector
For those who want to really understand matrices and vectors, I highly recommend the
video series “Essence of linear algebra” to be found here: 3Blue1Brown.com

Part III: Multiple Regression Model OLS Estimation 103 / 288


Matrix Notation of the Multiple Regression Model

In matrix notation, the N equations given in (30) for i = 1, . . . , N, may be written as:

y = Xβ + u

where
 
u = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_N \end{pmatrix}, \qquad
\beta = \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_K \end{pmatrix}

Part III: Multiple Regression Model OLS Estimation 104 / 288


The OLS Estimator

▶ Note that X′X is a square matrix with (K + 1) rows and columns


▶ (X ′ X)−1 denotes the inverse of X ′ X (if it exists)
▶ The OLS estimator β̂ has an explicit form, depending on X and the vector y; it is
given by:
OLS Estimator
β̂ = (X′X)⁻¹X′y   (35)

▶ Note: The matrix X ′ X has to be invertible in order to obtain a unique


estimator for β.

Part III: Multiple Regression Model OLS Estimation 105 / 288
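A sketch of (35) computed "by hand" and compared with lm() (simulated data; solve(A, b) is used instead of explicitly inverting X′X):

set.seed(7)
N  <- 100
x1 <- rnorm(N); x2 <- rnorm(N)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(N)
X  <- cbind(1, x1, x2)                      # design matrix with a column of ones
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # solves (X'X) beta = X'y
cbind(beta_hat, coef(lm(y ~ x1 + x2)))      # the two columns agree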


Proof Using Matrix Differentiation

First, note that SSR(γ) can be written as

u′u = (y − Xγ)′(y − Xγ) = y′y − γ′X′y − y′Xγ + γ′X′Xγ = y′y − 2γ′X′y + γ′X′Xγ

(the terms γ′X′y and y′Xγ are scalars and equal, so they can be combined)

Now, find β̂ := arg minγ SSR(γ) which can be done by finding γ such that the FOC is
satisfied:
\frac{\partial\, u'u}{\partial \gamma} = -2X'y + 2X'X\gamma = 0

In other words,

β̂ = (X′X)⁻¹X′y

Part III: Multiple Regression Model OLS Estimation 106 / 288


The OLS Estimator


Necessary conditions for X′X being invertible:
▶ We have to observe sample variation for each predictor Xk , i.e., the sample
variances of xk,1 , . . . , xk,N are positive for all k = 1, . . . , K
▶ No exact linear relation between any predictors Xk and Xl may be present,
i.e., the empirical correlation coefficients of all pairwise data sets (xk,i , xl,i ),
i = 1, . . . , N are different from 1 and −1 for l ̸= k.
Note: EViews produces an error if X′X is not invertible, whereas R tries to make X′X invertible by removing predictors.

Part III: Multiple Regression Model OLS Estimation 107 / 288


The OLS Estimator

It is sufficient to make the following assumption about the predictors X1 , . . . , XK in a


multiple regression model:
No Perfect Multicollinearity
The predictors X1 , . . . , XK are not linearly dependent, i.e., no predictor Xj may be
expressed as a linear function of the remaining predictors X1 , . . . , Xj−1 , Xj+1 , . . . , XK

If this assumption is violated. . .


▶ . . . the (unique) OLS estimator does not exist, as the matrix X ′ X is not invertible
▶ . . . there are infinitely many parameter values γ having the same minimal sum of
squared residuals (SSR(γ)).
▶ . . . the parameters in the regression model are not identified.

Part III: Multiple Regression Model OLS Estimation 108 / 288


Case Study Yields

Homework in EViews / R, yieldus

yi = β0 + β1 x1,i + β2 x2,i + β3 x3,i + ui ,


yi . . . yield with maturity 3 months
x1,i . . . yield with maturity 1 month
x2,i . . . yield with maturity 60 months
x3,i . . . spread between these yields, i.e. x2,i − x1,i

x3,i is a linear combination of x1,i and x2,i .

⇒ R-code code_eco_I.R

Part III: Multiple Regression Model OLS Estimation 109 / 288


Case Study Yields

Let β = (β0 , β1 , β2 , β3 ) be a certain parameter.

Any parameter β ⋆ = (β0 , β1⋆ , β2⋆ , β3⋆ ), where β3⋆ may be arbitrarily chosen and

β2⋆ = β2 + β3 − β3⋆
β1⋆ = β1 − β3 + β3⋆

will lead to the same sum of squared residuals as β. The OLS estimator is not unique!

Part III: Multiple Regression Model OLS Estimation 110 / 288
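A sketch of this identification problem with made-up numbers (not the actual yieldus data): x3 is an exact linear combination of x1 and x2, and R reacts by dropping it.

set.seed(8)
N  <- 100
x1 <- rnorm(N)
x2 <- x1 + rnorm(N)
x3 <- x2 - x1                       # the 'spread': an exact linear combination
y  <- 1 + 0.5 * x1 + 0.3 * x2 + rnorm(N)
coef(lm(y ~ x1 + x2 + x3))          # the coefficient on x3 is reported as NA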


Part IV
Expected Value and Variance of the OLS Estimator
Outline

Part IV: Expected Value and Variance of the OLS Estimator

▶ Econometric Inference

▶ OLS Residuals

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 112 / 288
Understanding Econometric Inference

Econometric Inference
Learning from data about the unknown parameter β in the regression model:
▶ Use the OLS estimator β̂ to learn about the regression parameter.
▶ Is this estimator equal to the true value?
▶ How large is the difference between the OLS estimator and the true parameter?
▶ Is there a better estimator than the OLS estimator?

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 113 / 288
Unbiasedness of the OLS Estimator

OLS is Unbiased
Under the assumptions (26), the OLS estimator (if it exists) is unbiased, i.e. the
estimated values are on average equal to the true values:
 
E(β̂j) = βj,   j = 0, . . . , K

In matrix notation:

E(β̂) = β,   E(β̂ − β) = 0   (36)

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 114 / 288
Unbiasedness of the OLS Estimator (Proof)

The OLS estimator may be expressed as:


β̂ = (X′X)⁻¹X′y = (X′X)⁻¹X′(Xβ + u) = β + (X′X)⁻¹X′u

Then, the estimation error may be expressed as:


β̂ − β = (X′X)⁻¹X′u   (37)

Result (36) follows immediately:


E(β̂ − β | X) = (X′X)⁻¹X′ E(u | X) = 0

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 115 / 288
Covariance Matrix of the OLS Estimator

 
▶ Due to unbiasedness, the expected value E(β̂j) of the OLS estimator is equal to βj for j = 0, . . . , K.
▶ Hence, the variance V(β̂j) measures the variation of the OLS estimator β̂j around the true value βj:

V(\hat\beta_j) = E\Big[\big(\hat\beta_j - E(\hat\beta_j)\big)^2\Big] = E\Big[\big(\hat\beta_j - \beta_j\big)^2\Big]

▶ Are the deviations of the estimator from the true value correlated across the different coefficients of the OLS estimator?

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 116 / 288
Recap: Effect of “De-Centering” (100 Experiments)

[Figure (recap): OLS estimates (β̂0, β̂1) from 100 simulated samples for µ_X = 0 (left) and µ_X = 2 (right); with the de-centered regressor the deviations of β̂0 and β̂1 from their true values are clearly correlated.]

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 117 / 288
Covariance Matrix of the OLS Estimator
 
▶ The covariance Cov β̂j , β̂k of different coefficients of the OLS estimators
measures how strongly deviations between the estimator and the true value are
correlated:
    
Cov(β̂j, β̂k) = E[(β̂j − βj)(β̂k − βk)]

▶ This information is summarized for all possible pairs of coefficients in the


covariance matrix of the OLS estimator.

Note that
   
Cov(β̂) = E[(β̂ − β)(β̂ − β)′]

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 118 / 288
Covariance Matrix of the OLS Estimator

The covariance matrix of a random vector is a square matrix, containing


▶ in the diagonal the variances of the various elements of the random vector, and
▶ in the off-diagonal elements the covariances.

       
Cov(\hat\beta) = \begin{pmatrix}
V(\hat\beta_0) & Cov(\hat\beta_0,\hat\beta_1) & \cdots & Cov(\hat\beta_0,\hat\beta_K) \\
Cov(\hat\beta_0,\hat\beta_1) & V(\hat\beta_1) & \cdots & Cov(\hat\beta_1,\hat\beta_K) \\
\vdots & \vdots & \ddots & \vdots \\
Cov(\hat\beta_0,\hat\beta_K) & \cdots & Cov(\hat\beta_{K-1},\hat\beta_K) & V(\hat\beta_K)
\end{pmatrix}

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 119 / 288
Homoskedasticity

 
To derive Cov(β̂), we make an additional assumption:

Homoskedasticity

V(u | X1 , . . . , Xk ) = σ 2 (38)

This means that the variance of the error term u is the same, regardless of the
predictor variables X1 , . . . , XK .

Analogously to the univariate case, it follows that

V(Y | X1 , . . . , Xk ) = σ 2

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 120 / 288
Covariance Matrix of Error Vector

▶ Because the observations are (by assumption) a random sample from the
population, any two observations yi and yl are uncorrelated. Hence also the errors
ui and ul are uncorrelated.
▶ Together with homoskedasticity (38) we obtain the following covariance matrix
of the error vector u:

Cov(u | X1 , . . . , Xk ) = σ 2 I

with I being the identity matrix.

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 121 / 288
Covariance Matrix of the OLS Estimator

Under assumption (26) and (38), the covariance matrix of the OLS estimator β̂ is given
by:


 
Cov(β̂) = σ²(X′X)⁻¹   (39)


Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 122 / 288
Covariance Matrix of the OLS Estimator

Proof of (39)
Using (37), we obtain:
β̂ − β = Au   with   A = (X′X)⁻¹X′

The following holds:

Cov(β̂) = E[(β̂ − β)(β̂ − β)′] = E(Auu′A′) = A E(uu′) A′ = A Cov(u) A′

Therefore:

Cov(β̂) = σ²AA′ = σ²(X′X)⁻¹X′X(X′X)⁻¹ = σ²(X′X)⁻¹

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 123 / 288
Covariance Matrix of the OLS Estimator


 
The diagonal elements of the matrix σ²(X′X)⁻¹ define the variance V(β̂j) of the OLS estimator for each component.

The standard deviation sd(β̂j) of each OLS estimator is defined as:

Standard Deviations of the “Betas”

sd(\hat\beta_j) = \sqrt{V(\hat\beta_j)} = \sigma\,\sqrt{\big[(X'X)^{-1}\big]_{j+1,\,j+1}} \qquad (40)

It measures the estimation error in the same unit as βj .

Evidently, the standard deviation is larger for larger error variances σ 2 . What other
factors influence the standard deviation?

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 124 / 288
Multicollinearity

In practical regression analysis very often high (but not perfect) multicollinearity is
present.

How well may Xj be explained by the other regressors?

Consider Xj as left-hand variable in the following regression model, whereas all the
remaining predictors remain on the right hand side:

Xj = β̃0 + β̃1 X1 + . . . + β̃j−1 Xj−1 + β̃j+1 Xj+1 + . . . + β̃K XK + ũ

Use OLS estimation to estimate the parameters and let x̂j,i be the values predicted from
this (OLS) regression:

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 125 / 288
Multicollinearity

▶ Define Rj as the correlation between the observed values xj,i and the predicted
values x̂j,i in this regression.
▶ If Rj2 is close to 0, then Xj cannot be predicted from the other regressors.
⇒ Xj contains additional, “independent” information.
▶ The closer Rj2 is to 1, the better Xj is predicted from the other regressors and
multicollinearity is present.
⇒ Xj does not contain much “independent” information.

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 126 / 288
The Variance of the OLS Estimator

 
Using Rj, the variance V(β̂j) of the OLS estimator of the coefficient βj corresponding to Xj may be expressed in the following way for j = 1, . . . , K:

V(\hat\beta_j) = \frac{\sigma^2}{N s_{x_j}^2 (1 - R_j^2)}

The variance V(β̂j) (and consequently the standard deviation) of the estimate β̂j is
large if
⇒ the regressor Xj is highly redundant given the other regressors,
⇒ Rj2 close to 1, almost multicollinearity present.

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 127 / 288
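A sketch relating this formula to the auxiliary regression (simulated data with strongly correlated regressors; the settings are my own):

set.seed(9)
N  <- 200
x2 <- rnorm(N)
x1 <- 0.9 * x2 + rnorm(N, sd = sqrt(1 - 0.81))    # x1 strongly correlated with x2
y  <- 1 + x1 + x2 + rnorm(N)
R2_1 <- summary(lm(x1 ~ x2))$r.squared            # R_1^2 from the auxiliary regression
fit  <- lm(y ~ x1 + x2)
sig2 <- summary(fit)$sigma^2                      # estimate of sigma^2
sig2 / (N * mean((x1 - mean(x1))^2) * (1 - R2_1)) # formula for V(beta1_hat)
vcov(fit)["x1", "x1"]                             # variance reported by R: same value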
The Variance of the OLS Estimator

All other factors are the same as in the simple regression model, i.e.:
 
The variance V β̂j , j = 1, . . . , K , of the estimator β̂j is large, if
▶ the variance σ 2 of the error term u is large;
▶ the sampling variation in the regressor Xj , i.e. the variance sx2j , is small;
▶ the sample size N is small.

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 128 / 288
Outline

Part IV: Expected Value and Variance of the OLS Estimator

▶ Econometric Inference

▶ OLS Residuals

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 129 / 288
OLS Residuals

Consider the (OLS-)estimated regression model:

yi = β̂0 + β̂1 x1,i + . . . + β̂K xK ,i + ûi = ŷi + ûi

where
▶ ŷi = β̂0 + β̂1 x1,i + . . . + β̂K xK ,i is called the fitted value
▶ ûi is called the OLS residual

OLS residuals are useful:


▶ to estimate the variance σ 2 of the error term
▶ to quantify the quality of the fitted regression model
▶ for residual diagnostics

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 130 / 288
R / EViews Class Exercise

Homework:
Have a look in R / EViews how to obtain the OLS residuals and the fitted regression:
▶ Case Study profit, workfile profit
▶ Case Study Chicken, workfile chicken
▶ Case Study Marketing, workfile marketing
⇒ R-code code_eco_I.R

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 131 / 288
OLS Residuals as Proxies for the Error

Compare the underlying regression model

Y = β0 + β1 X1 + . . . + βK XK + u (41)

with the estimated model for i = 1, . . . , N:

yi = β̂0 + β̂1 x1,i + . . . + β̂K xK ,i + ûi

▶ The OLS residuals û1, . . . , ûN may be considered as a “sample” of the unobservable error u
▶ Use the OLS residuals û1, . . . , ûN to estimate σ² = V(u | X1, . . . , XK)

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 132 / 288
Algebraic Properties of the OLS residuals

The OLS residuals û1 , . . . , ûN obey K + 1 linear equations and have the following
algebraic properties:
▶ The sum (and thus also the mean) of the OLS residuals ûi is equal to zero:
\frac{1}{N}\sum_{i=1}^{N} \hat u_i = 0 \qquad (42)

▶ The sample covariance between xk,i and ûi is zero:

\frac{1}{N}\sum_{i=1}^{N} x_{k,i}\,\hat u_i = 0, \quad \forall k = 1, \ldots, K \qquad (43)

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 133 / 288
Estimating σ 2 - naive estimator

A naive estimator of σ 2 would be the sample variance of the OLS residuals û1 , . . . , ûN :
\tilde\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\Big(\hat u_i - \frac{1}{N}\sum_{i=1}^{N}\hat u_i\Big)^2 = \frac{1}{N}\sum_{i=1}^{N}\hat u_i^2 = \frac{SSR}{N}

where we used (42) and SSR = \sum_{i=1}^{N}\hat u_i^2 is the sum of squared residuals.

However, due to the linear dependence between the OLS residuals,


▶ û1, . . . , ûN is not a sample of independent random variables, thus
▶ σ̃² is a biased estimator of σ².

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 134 / 288
Estimating σ 2

▶ Due to the linear dependence between the OLS residuals, only (N − K − 1)


residuals can be chosen independently.
▶ This number is often abbreviated as df and referred to as the degrees of freedom.

▶ An unbiased estimator of the error variance σ 2 in a homoskedastic multiple


regression model is given by:

\hat\sigma^2 = \frac{SSR}{df} \qquad (44)

where
▶ SSR = \sum_{i=1}^{N}\hat u_i^2 is the sum of squared OLS residuals,
▶ df = (N − K − 1), N is the number of observations, and
▶ K is the number of predictors X1 , . . . , XK .
Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 135 / 288
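A quick check of (44) against what R reports (simulated data):

set.seed(10)
N <- 100; K <- 2
x1 <- rnorm(N); x2 <- rnorm(N)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(N, sd = 2)
fit <- lm(y ~ x1 + x2)
SSR <- sum(resid(fit)^2)
SSR / (N - K - 1)        # sigma_hat^2 from (44)
summary(fit)$sigma^2     # the same value (squared residual standard error)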
Standard Errors of the OLS Estimator

 
▶ The standard deviation sd(β̂j) of the OLS estimator given in (40) depends on σ = √σ².

▶ To evaluate the estimation error for a given data set in practical regression analysis,
σ 2 is substituted by the estimator σ̂ 2 given in (44).

▶ This yields the so-called standard error of the OLS estimator:

Standard Error of the OLS Estimator


se(\hat\beta_j) = \sqrt{\hat\sigma^2}\,\sqrt{\big[(X'X)^{-1}\big]_{j+1,\,j+1}} \qquad (45)

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 136 / 288
R / Eviews Class Exercise

R / EViews (and other software packages) report for each predictor the OLS estimator
together with the standard errors:
▶ Case Study profit, work file profit
▶ Case Study Chicken, work file chicken
▶ Case Study Marketing, work file marketing
⇒ R-code code_eco_I.R

Note: the standard errors computed by R / EViews (and other software packages) are
valid only under the assumptions made above, in particular, homoskedasticity.

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 137 / 288
Quantifying the model fit - simplest model

▶ How well does the multiple regression model (41) explain the variation in Y ?
▶ Compare it with the following simple model without any predictors:

Y = β0 + ũ (46)

▶ The OLS estimator β̂0, which minimizes the sum of squared residuals

\sum_{i=1}^{N} (y_i - \gamma_0)^2

over all candidate values γ0, is given by β̂0 = ȳ.

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 138 / 288
Coefficient of Determination - TSS

▶ In the model without any predictors (46), the sum of squared residuals is called the total sum of squares (TSS):

TSS = \sum_{i=1}^{N} (y_i - \bar y)^2

(Note that TSS = N · sy²)

▶ Is it possible to reduce the sum of squared residuals of the simple model (46), i.e.
TSS, by including the predictor variables X1 , . . . , XK as in (41)?

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 139 / 288
Coefficient of Determination R 2

1. The sum of squared residuals SSR of the multiple regression model (41) is never larger than the sum of squared residuals TSS of the simple model (46):

SSR ≤ TSS (47)

2. The coefficient of determination R² of the multiple regression model (41) is defined as:

R^2 = \frac{TSS - SSR}{TSS} = 1 - \frac{SSR}{TSS} \qquad (48)

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 140 / 288
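A quick check of (48) (simulated data):

set.seed(11)
N <- 100
x <- rnorm(N)
y <- 1 + 0.5 * x + rnorm(N)
fit <- lm(y ~ x)
TSS <- sum((y - mean(y))^2)
SSR <- sum(resid(fit)^2)
1 - SSR / TSS             # R^2 from (48)
summary(fit)$r.squared    # identical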
Coefficient of Determination - Proof

Proof of (47)
The following variance decomposition holds:
TSS = \sum_{i=1}^{N}(y_i - \hat y_i + \hat y_i - \bar y)^2 = \sum_{i=1}^{N}\hat u_i^2 + 2\sum_{i=1}^{N}\hat u_i(\hat y_i - \bar y) + \sum_{i=1}^{N}(\hat y_i - \bar y)^2

Using the algebraic properties (42) and (43) of the OLS residuals, we obtain:

\sum_{i=1}^{N}\hat u_i(\hat y_i - \bar y) = \hat\beta_0\sum_{i=1}^{N}\hat u_i + \hat\beta_1\sum_{i=1}^{N}\hat u_i x_{1,i} + \ldots + \hat\beta_K\sum_{i=1}^{N}\hat u_i x_{K,i} - \bar y\sum_{i=1}^{N}\hat u_i = 0

Therefore:

TSS = SSR + \sum_{i=1}^{N}(\hat y_i - \bar y)^2 \ge SSR

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 141 / 288
Coefficient of Determination - Interpretation

The coefficient of determination R2 is a measure of goodness-of-fit:


▶ If SSR ≈ TSS, then little is gained by including the predictors;
⇒ R2 is close to 0;
⇒ The multiple regression model explains the variation in Y hardly better than the
simple model (46).
▶ If SSR ≪ TSS, then much is gained by including all predictors;
⇒ R2 is close to 1;
⇒ The multiple regression model explains the variation in Y much better than the
simple model (46).

Software packages like R / EViews report SSR and R2 .

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 142 / 288
Coefficient of Determination - examples

[Figure: two example data sets with the fitted regression using price as predictor and the no-predictor model. Left: SSR = 9.53, TSS = 120.05, R² = 0.92 — the predictor explains most of the variation. Right: SSR = 8.36, TSS = 8.66, R² = 0.035 — the predictor adds almost nothing beyond the simple model.]

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 143 / 288
Part V
Testing Hypotheses (One Coefficient)
Outline

Part V: Testing Hypotheses (One Coefficient)

▶ Testing Hypotheses - One Coefficient

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 145 / 288
Testing Hypothesis

▶ Multiple regression model:

Y = β0 + β1 X1 + . . . + βj Xj + . . . + βK XK + u, (49)

▶ Does the predictor variable Xj exert an influence on the conditional mean E(Y |X1, . . . , XK) of the response variable Y if we control for all other variables X1, . . . , Xj−1, Xj+1, . . . , XK?

Formally,

βj = 0 ?

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 146 / 288
Understanding the Testing Problem

▶ Simulate data from a multiple regression model:

      Y = 0.2 − 1.8X1 + 0X2 + u,    u | X1, X2 ∼ N(0, σ²)

▶ Run OLS estimation for this model:

      Y = β0 + β1X1 + β2X2 + u,    u | X1, X2 ∼ N(0, σ²)

  to obtain (β̂0, β̂1, β̂2).

▶ Is β̂2 different from 0?

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 147 / 288
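A small simulation sketch of this experiment (assumed settings: N = 100, σ² = 0.1, unit variances and correlation 0.7 between X1 and X2, as on the next slide):

# Simulate one data set with a redundant predictor X2 and run OLS
set.seed(1)
N  <- 100
x1 <- rnorm(N)
x2 <- 0.7 * x1 + sqrt(1 - 0.7^2) * rnorm(N)   # Cor(X1, X2) = 0.7, Var(X2) = 1
u  <- rnorm(N, sd = sqrt(0.1))
y  <- 0.2 - 1.8 * x1 + 0 * x2 + u             # true beta2 = 0

coef(lm(y ~ x1 + x2))   # beta2-hat is not exactly 0 in a single sample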
Understanding the Testing Problem

[Scatter plot of the OLS estimates across simulated data sets with N = 100, σ² = 0.1, σ²X1 = σ²X2 = 1, µX1 = µX2 = 0, σX1X2 = 0.7; β̂1 (important variable) on the horizontal axis, β̂2 (redundant variable) on the vertical axis.]
The OLS estimator β̂2 of β2 = 0 differs from 0 for a single data set, but is 0 on average.

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 148 / 288
Understanding the Testing Problem

▶ OLS estimation for the true model in comparison to estimating a model with a
redundant predictor:
⇒ including the redundant predictor X2 increases the estimation error for the
other parameters!
[Scatter plots of (β̂0, β̂1) across simulated data sets: left panel for the model with one predictor (true model), right panel for the model with two predictors (including the redundant X2).]

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 149 / 288
Testing of Hypotheses

Several issues arise:


▶ What can we learn from the data about hypotheses concerning the unknown
parameters in the regression model, especially about the hypothesis that βj = 0?
▶ May we reject the hypothesis βj = 0 given data?
▶ Testing if βj = 0 is not only of importance for the substantive scientist, but also
from an econometric point of view: Excluding redundant variables may increase
efficiency of estimation of non-zero parameters!

It is possible to answer these questions if we make additional assumptions about the


error term u in a multiple regression model.

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 150 / 288
The Classical Regression Model

Model Assumption in the Classical Linear Regression Model


The error u in the multiple regression model (49) is independent of X1, . . . , XK and follows a normal distribution:

      u ∼ N(0, σ²)     (50)

This assumption implies the more general assumptions (26) and (38):

      E(u|X1, . . . , XK) = E(u) = 0
      V(u|X1, . . . , XK) = V(u) = σ²

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 151 / 288
The Classical Regression Model

▶ It follows that the conditional distribution of Y given X1, . . . , XK is a normal distribution:

      Y |X1, . . . , XK ∼ N(β0 + β1X1 + . . . + βjXj + . . . + βKXK, σ²)

▶ Furthermore, because the observations are a random sample, the error vector u has a multivariate normal distribution with independent components:

      u ∼ NN(0, σ²I)

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 152 / 288
Multivariate Normal Distributions with Independent
Components
Density of the bivariate normal distribution N2 (0, σ 2 I) with σ 2 = 0.5:

[Surface plot and circular contour plot of the density over (x1, x2).]

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 153 / 288
Multivariate Normal Distributions with Independent
Components
1000 observations from N2 (0, 0.5I) in comparison to 100α%-confidence region (from
the left to the right: α = 0.25, α = 0.5, α = 0.95)
[Scatter plots with circular confidence regions; observed relative frequencies inside the regions: 0.242 (α = 0.25), 0.48 (α = 0.5), 0.954 (α = 0.95).]

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 154 / 288
Multivariate Normal Distributions with Dependent
Components
Density of the bivariate normal distribution N2 (µ, Σ) with
      µ = (2, −3)′,    Σ = ( 4    3.2
                             3.2    7 )

[Surface plot and contour plot of the density over (x1, x2); the contours are tilted ellipses, reflecting the positive correlation.]
Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 155 / 288
Multivariate Normal Distributions with Dependent
Components

1000 observations from N2 (µ, Σ) in comparison to 100α%-confidence region (from the


left to the right: α = 0.25, α = 0.5, α = 0.95)
[Scatter plots with elliptical confidence regions; observed relative frequencies inside the regions: 0.234 (α = 0.25), 0.472 (α = 0.5), 0.94 (α = 0.95).]

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 156 / 288
Distribution of the OLS Estimator
Using (37), we obtain:

      β̂ − β ∼ NK+1(0, Cov(β̂)),    Cov(β̂) = σ²(X′X)⁻¹

All marginal distributions are normal:

      β̂j − βj ∼ N(0, sd(β̂j)²),

thus

      (β̂j − βj) / sd(β̂j) ∼ N(0, 1)     (51)

Note: Deviations between the true value and the OLS estimator are usually correlated:

      β̂ − β ∼ NK+1(0, σ²(X′X)⁻¹)
Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 157 / 288
Testing a Single Coefficient: t-Test

▶ If the null hypothesis βj = 0 is valid, then possible differences between the OLS estimator β̂j and 0 may be quantified using the following inequality:

      |β̂j| / sd(β̂j) ≤ cα     (52)

  where cα is equal to the (1 − α/2)-quantile of the standard normal distribution.

▶ This can be used to construct a test statistic:

      tj = β̂j / sd(β̂j)     (53)

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 158 / 288
Testing a Single Coefficient: t-Test

If (50) holds and σ 2 is known, then tj follows a standard normal distribution under the
null hypothesis:
▶ Choose a significance level α
▶ Determine the corresponding critical value cα
▶ If |tj| > cα: reject the null hypothesis (the risk of rejecting the null hypothesis although it is true is at most α)
▶ If |tj| ≤ cα: do not reject the null hypothesis (the risk of "not rejecting" a wrong null hypothesis may be arbitrarily large)

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 159 / 288
Choice of cα when σ 2 is unknown

 
▶ If σ² is unknown and estimated as described above, then sd(β̂j) is substituted by se(β̂j), yielding the test statistic:

      tj = β̂j / se(β̂j)     (54)

▶ Choosing the quantiles of the normal distribution would lead to a test which rejects the true null hypothesis more often than desired, e.g. for α = 0.05 and K = 3:
N 10 20 30 40 50 100
P(reject H0 ) 0.09 0.07 0.06 0.06 0.06 0.05

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 160 / 288
Choice of cα when σ 2 is unknown

▶ The reason for this phenomenon is that tj no longer follows a normal distribution,
but a tdf - distribution where df = (N − K − 1).
▶ The critical values tdf,1−α/2 depend on df and are equal to the quantiles of the tdf
distribution.
E.g., for α = 0.05 and for a regression model with 3 parameters, these values are
approximately:

df = N − 3 7 17 27 37 47 97 ∞
tdf,0.975 2.36 2.11 2.05 2.03 2.01 1.98 1.96

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 161 / 288
The Student t distribution

95% region for t distribution with 2 degrees of freedom


t2,0.975 ≈ 4.30
Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 162 / 288
The Student t distribution

95% region for t distribution with 3 degrees of freedom



t3,0.975 ≈ 3.18
Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 163 / 288
The Student t distribution

95% region for t distribution with 5 degrees of freedom



t5,0.975 ≈ 2.57
Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 164 / 288
The Student t distribution

95% region for t distribution with 10 degrees of freedom



t10,0.975 ≈ 2.23
Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 165 / 288
The Student t distribution

95% region for t distribution with 30 degrees of freedom



t30,0.975 ≈ 2.04
Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 166 / 288
The Student t distribution

95% region for the standard normal distribution



The tdf distribution converges to the standard normal for large df: t∞,0.975 ≈ 1.96
Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 167 / 288
The p-value

The p-value is derived from the distribution of the t-statistic under the null
hypothesis and is easier to interpret than the t-statistic which has to be compared
to the correct quantiles:
▶ Choose a significance level α
▶ If p < α: reject the null hypothesis (the risk of rejecting the null hypothesis although it is true is at most α)
▶ If p ≥ α: do not reject the null hypothesis (the risk of "not rejecting" a wrong null hypothesis may be arbitrarily large)

An Old Saying. . .
If the p is low, the null must go
If the p is high, the null will fly

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 168 / 288
R / EViews Class Exercise

Have a look in R how to formulate sensible null hypotheses and how to test them using
the t-statistic and the p-value:
▶ Case Study Profit, workfile profit
▶ Case Study Chicken, workfile chicken
▶ Case Study Marketing, workfile marketing
⇒ R-code code_eco_I.R

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 169 / 288
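For orientation, a hedged sketch of what such a test looks like in R (illustrative only, not the contents of code_eco_I.R); dat, y, x1, x2 are hypothetical names.

# t-statistic and p-value for a single coefficient, computed by hand
fit <- lm(y ~ x1 + x2, data = dat)

est <- coef(fit)["x1"]
se  <- sqrt(diag(vcov(fit)))["x1"]
df  <- df.residual(fit)                 # df = N - K - 1

t_stat <- est / se
p_val  <- 2 * pt(-abs(t_stat), df = df)
cbind(t_stat, p_val)                    # matches the x1 row of summary(fit)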
Case Study Chicken

The t-statistic for the variable income is equal to 1.024, p-value: 0.319 (rounded)
[Density of the t-statistic under the null hypothesis with the observed value 1.024 marked; the two shaded tail areas of 0.16 each sum to the p-value.]

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 170 / 288
Case Study Chicken

The t-statistic for the variable ppork is equal to 3.114, p-value: 0.006
[Density of the t-statistic under the null hypothesis with the observed value 3.114 marked; the two shaded tail areas of 0.003 each sum to the p-value.]

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 171 / 288
Understanding p-Values

▶ A small p-value shows that the value observed for the t-statistic (or an even more
“extreme” value) is unlikely under the null hypothesis, thus we reject the null
hypothesis for small p-values.
⇒ There is substantial evidence in the data that βj ̸= 0.
▶ A p-value considerably larger than 0 shows that the observed value (or an even more "extreme" value) of the t-statistic is plausible under the null hypothesis, thus we do not reject the null hypothesis for large p-values.
⇒ There is little evidence in the data that βj ̸= 0.
Note that not rejecting the null does not necessarily mean that βj = 0, because the risk of accepting a wrong null hypothesis is not controlled!

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 172 / 288
Confidence Intervals for the Unknown Coefficients

The marginal distribution (51) is also useful for obtaining 100(1 − α)% confidence
regions for the unknown regression coefficients (e.g., α = 0.05 leads to a 95%
confidence region)

Two-Sided Confidence Regions

      P( −c1−α/2 ≤ (β̂j − βj)/sd(β̂j) ≤ c1−α/2 ) = 1 − α     (55)

where cp is the p-quantile of the standard normal distribution

The confidence interval reads:

      [ β̂j − c1−α/2 sd(β̂j),  β̂j + c1−α/2 sd(β̂j) ]

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 173 / 288
Confidence Intervals for the Unknown Coefficients

One-Sided Confidence Regions

      P( (β̂j − βj)/sd(β̂j) ≤ c1−α ) = 1 − α
      P( −c1−α ≤ (β̂j − βj)/sd(β̂j) ) = 1 − α

This yields (with probability 1 − α):

▶ β̂j − c1−α sd(β̂j) is a lower bound for βj
▶ β̂j + c1−α sd(β̂j) is an upper bound for βj

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 174 / 288
Confidence Intervals for the Unknown Coefficients

   
If σ² is unknown, then sd(β̂j) is substituted by se(β̂j). Instead of (51), we obtain with df = (N − K − 1):

      (β̂j − βj) / se(β̂j) ∼ tdf

If tdf,p is the p-quantile of the tdf-distribution, this yields:

▶ βj lies in [ β̂j − tdf,1−α/2 se(β̂j),  β̂j + tdf,1−α/2 se(β̂j) ]
▶ β̂j + tdf,1−α se(β̂j) is an upper bound for βj
▶ β̂j − tdf,1−α se(β̂j) is a lower bound for βj

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 175 / 288
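A minimal R sketch of these intervals (hypothetical model fit as above); confint() uses the same t-quantiles internally.

# 95% confidence interval for a single coefficient based on the t_df quantile
df  <- df.residual(fit)
est <- coef(fit)["x1"]
se  <- sqrt(diag(vcov(fit)))["x1"]

c(lower = est - qt(0.975, df) * se,
  upper = est + qt(0.975, df) * se)

confint(fit, "x1", level = 0.95)        # same interval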
More about the distribution of the OLS estimator

▶ For any subset of coefficients β̃ = (βj1, . . . , βjq)′, the OLS estimator β̂̃ = (β̂j1, . . . , β̂jq)′ follows a multivariate normal distribution:

      β̂̃ − β̃ ∼ Nq(0, Cov(β̂̃))     (56)

  where Cov(β̂̃) is obtained from the rows and columns j1, . . . , jq of Cov(β̂)

▶ This result may be used to construct 95%-confidence ellipsoids for all pairs of parameters (βj1, βj2)

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 176 / 288
Part VI
Testing Hypotheses (More Coefficients)
Testing More Than One Coefficient

▶ Testing the null hypothesis βj = 0 based on tj is only valid if all other parameters
remain in the model.
▶ Often, we want to test joint hypotheses about our parameters.
▶ E.g., if the tj -statistic is not significant for more than one parameter j1 , . . . , jq , then
one needs to test, if βj1 = 0, βj2 = 0, . . . , βjq = 0 simultaneously.
▶ We cannot simply check each tj -statistic separately, as it is possible for jointly
insignificant regressors to be individually significant (and vice versa).

Part VI: Testing Hypotheses (More Coefficients) 178 / 288


Testing More Than One Coefficient

Joint Hypothesis Testing


Given the data, is it possible to reject the null hypothesis βj1 = 0, βj2 = 0, . . . , βjq = 0?

Reject the null hypothesis, if the distance between the OLS estimator β̂̃ = (β̂j1, . . . , β̂jq)′ and 0 is "large" (one-sided test).

The corresponding test statistic has to take into account that


▶ the standard deviations of the various OLS estimators are different;
▶ deviations of the OLS estimators from the true value are likely to be correlated.

Part VI: Testing Hypotheses (More Coefficients) 179 / 288


Testing More Than One Coefficient

 
▶ Aggregate tjl = β̂jl / sd(β̂jl) for l = 1, . . . , q, e.g., by taking the sum of squared t-statistics?

▶ If the deviations of the OLS estimators β̂j1, . . . , β̂jq from the true values are uncorrelated, then the aggregated test statistic

      Σ_{l=1}^q β̂jl² / sd(β̂jl)²

  is the sum of q independent squared standard normal random variables.

▶ Such a random variable follows a χ²q-distribution with q degrees of freedom.

Part VI: Testing Hypotheses (More Coefficients) 180 / 288


The χ2q distribution

Left hand side: density of the χ2q -distribution; right hand side: density of the random
variable X /q, where X ∼ χ2q ; degrees of freedom q = ν ∈ {2, 5, 10, 20}.
Part VI: Testing Hypotheses (More Coefficients) 181 / 288
Testing More Than One Coefficient

Usually, the deviations of the OLS estimators β̂j1, . . . , β̂jq from the true values are correlated:
▶ Transform the deviations to a coordinate system with independent standard normal random variables. In this new coordinate system, the sum of squared deviations follows a χ²q-distribution with q degrees of freedom. The appropriate transformation reads:

      β̂̃′ Cov(β̂̃)⁻¹ β̂̃ ∼ χ²q

▶ Note: The χ2q -distribution results only if σ 2 is known.

Part VI: Testing Hypotheses (More Coefficients) 182 / 288


The F-Test

▶ The F -statistic is obtained by substituting the unknown variance σ 2 by σ̂ 2 and


dividing by q.
▶ If the null hypothesis βj1 = 0, βj2 = 0, . . . , βjq = 0 is true, then the F -statistic
follows a Fq,df -distribution with parameters q (number of tested coefficients) and
df = N − K − 1.
▶ Remark 1: For q = 1, F = tj2 , where tj is the t-statistic.
▶ Remark 2: The F-statistic is the ratio of two (independent) sums of squares, each divided by its degrees of freedom, i.e. of a χ²q/q and a χ²df/df variable, where df = N − K − 1.

Part VI: Testing Hypotheses (More Coefficients) 183 / 288


The F-Distribution

Density of the Fq,df -distribution with parameters df = 100 and q = 1, . . . , 5


Part VI: Testing Hypotheses (More Coefficients) 184 / 288
The F-Test

Reject the null hypothesis, if


▶ the F -statistic is larger than the critical value from the corresponding
Fq,df -distribution (one-sided test);
▶ the corresponding p-value is smaller than the significance level. A p-value close to
0 shows that the value observed for the F -statistic (or an even larger value) is
unlikely under the null hypothesis.
⇒ At least one of the coefficients βj1 , . . . , βjq is different from 0.

Part VI: Testing Hypotheses (More Coefficients) 185 / 288


The F-Test

Do not reject the null-hypothesis, if


▶ the F -statistic is smaller than the critical value from the corresponding
Fq,df -distribution (one-sided test);
▶ the corresponding p-value is larger than the significance level. A p-value
considerably larger than 0 shows that the observed value for the F -statistic (or an
even larger value) is plausible under the null hypothesis.
⇒ There is little evidence in the data that we should reject the null hypothesis that all
coefficients βj1 , . . . , βjq are equal to 0.

Part VI: Testing Hypotheses (More Coefficients) 186 / 288


Case Study Marketing

The F-statistic for testing the joint hypothesis βgender = βage = 0 is equal to 2.086, p-value: 0.124 (rounded)

Part VI: Testing Hypotheses (More Coefficients) 187 / 288


Case Study Marketing

The F-statistic for testing the joint hypothesis βgender = βage = βprice = 0 is equal to 451.572, p-value: 0.000 (rounded)

Part VI: Testing Hypotheses (More Coefficients) 188 / 288


An Alternative Form of the F-Statistic

Equivalent forms of the F -statistic show that the F-statistic measures the loss of fit
from imposing the q restrictions on the model:

      F = [(SSRr − SSR)/q] / [SSR/df] = [(R² − R²r)/q] / [(1 − R²)/df]

Here,
▶ SSR is the minimum sum of squared residuals and R2 is the coefficient of
determination for the unrestricted regression model.
▶ SSRr is the minimum sum of squared residuals and R2r is the coefficient of
determination for the restricted regression model.

Note that SSRr ≥ SSR and R2r ≤ R2 .

Part VI: Testing Hypotheses (More Coefficients) 189 / 288
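A sketch of this "loss of fit" form of the F-statistic in R, assuming a hypothetical unrestricted model with predictors x1, x2, x3 and the joint null hypothesis β2 = β3 = 0:

# F-statistic from the restricted and unrestricted sums of squared residuals
fit_u <- lm(y ~ x1 + x2 + x3, data = dat)   # unrestricted model
fit_r <- lm(y ~ x1, data = dat)             # restricted model (q = 2 restrictions)

ssr_u <- sum(resid(fit_u)^2)
ssr_r <- sum(resid(fit_r)^2)
q     <- 2
df    <- df.residual(fit_u)                 # N - K - 1

F_stat <- ((ssr_r - ssr_u) / q) / (ssr_u / df)
p_val  <- pf(F_stat, q, df, lower.tail = FALSE)

anova(fit_r, fit_u)                         # reports the same F-statistic and p-value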


Testing the whole regression model

▶ In the standard regression output of R, an F -statistic is available by default.


▶ This F -statistic tests the hypothesis that none of the predictor variables influences
the response variable:

β1 = 0, β2 = 0, . . . , βK = 0

▶ In this case, R²r = 0, and the F-statistic reads:

      F = (R²/K) / [(1 − R²)/df]
▶ Under the null hypothesis, F follows a FK ,df -distribution. Hopefully, the
corresponding p-value is close to 0. Otherwise, the usefulness of the whole
regression model is somewhat doubtful!

Part VI: Testing Hypotheses (More Coefficients) 190 / 288


Linear Combinations of Parameters

Suppose we want to test the hypothesis that two regression coefficients are equal,
e.g. β1 = β2 . This is equivalent to testing the following linear constraint (null
hypothesis):

β1 − β2 = 0 (57)

Test statistic based on the difference of the OLS estimators β̂1 − β̂2 :
▶ If |β̂1 − β̂2 | is small, then the hypothesis (57) is not rejected.
▶ If |β̂1 − β̂2 | is large, then the hypothesis (57) is rejected.

What is the distribution of β̂1 − β̂2 under the null hypothesis?

Part VI: Testing Hypotheses (More Coefficients) 191 / 288


Testing Linear Combinations of Parameters

Testing the linear constraint β1 − β2 = 0 for β = (β0, β1, . . . , βK) is equivalent to testing

      Lβ = 0    where    L = [ 0  1  −1  0  · · ·  0 ]     (58)

What is the distribution of Lβ̂?

Using β̂ − β ∼ NK+1(0, Cov(β̂)), we obtain the following:

      Lβ̂ − Lβ ∼ Nq(0, Cov(β̂̃))

where

      Cov(β̂̃) = L Cov(β̂) L′

Part VI: Testing Hypotheses (More Coefficients) 192 / 288
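A sketch of the corresponding Wald-type F-test in R for the single constraint β1 − β2 = 0 in a hypothetical model y ~ x1 + x2 + x3 (q = 1 here; the same construction works for several rows of L):

# F-test of the linear constraint L beta = 0
fit <- lm(y ~ x1 + x2 + x3, data = dat)

L <- matrix(c(0, 1, -1, 0), nrow = 1)   # picks out beta1 - beta2
b <- coef(fit)
V <- vcov(fit)                          # sigma^2-hat * (X'X)^(-1)
q <- nrow(L)

F_stat <- t(L %*% b) %*% solve(L %*% V %*% t(L)) %*% (L %*% b) / q
p_val  <- pf(as.numeric(F_stat), q, df.residual(fit), lower.tail = FALSE)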


Testing Linear Constraints

The F -statistic may also be used to test more than one linear constraint on the
coefficients, i.e. Lβ = 0, where L is a q × (K + 1)-matrix with q > 1.

We have seen that the OLS estimator β̂̃ = Lβ̂ follows the multivariate normal distribution Nq(0, Cov(β̂̃)) under the null hypothesis.

The F -statistic is constructed as above and follows an Fq,df -distribution, where q is the
number of linear constraints.

Part VI: Testing Hypotheses (More Coefficients) 193 / 288


Part VII
Further Properties of the OLS Estimator and
Dummy Variables
Outline

Part VII: Further Properties of the OLS Estimator and


Dummy Variables

▶ Further Properties of the OLS Estimator

▶ Dummy Variables

Part VII: Further Properties of the OLS Estimator and Dummy Variables Further Properties of the OLS Estimator 195 / 288
Further Properties of the OLS Estimator

The Gauss Markov Theorem


Under the assumptions (26) and (38), the OLS estimator is BLUE, i.e. the
▶ Best
▶ Linear
▶ Unbiased
▶ Estimator
▶ Here, “best” means that any other linear unbiased estimator β̃ results in a larger variance than the OLS estimator β̂:
  ▶ sd(β̃j) ≥ sd(β̂j)
  ▶ Cov(β̃) − Cov(β̂) is positive semi-definite
▶ “Linear” means that β̃ = Cy, where y = (y1, . . . , yN)′ and where C is some matrix independent of β, but possibly dependent on the design matrix X.
▶ “Unbiased” means that E(β̂) = β.
Part VII: Further Properties of the OLS Estimator and Dummy Variables Further Properties of the OLS Estimator 196 / 288
Efficiency of OLS Estimation

Under the normality assumption (50) about the error term, the OLS estimator is not
only BLUE. A stronger optimality result holds:
Efficiency of OLS estimation
Under assumption (50), the OLS estimator β̂ is the minimum variance unbiased
estimator.
Any other unbiased estimator β̃ (which need not be a linear estimator) has larger standard deviations than the OLS estimator:
▶ sd(β̃j) ≥ sd(β̂j)
▶ Cov(β̃) − Cov(β̂) is positive semi-definite
However, if assumption (50) is violated, other (nonlinear) estimation methods may be
more efficient.
Part VII: Further Properties of the OLS Estimator and Dummy Variables Further Properties of the OLS Estimator 197 / 288
Consistency of OLS Estimation

Let β̂N be an estimator for β, based on sample size N.


Then, β̂N is a consistent estimator for β, if for every ϵ > 0 the following holds:
 
      P( |β̂N − β| ≥ ϵ ) → 0   as N → ∞

or, equivalently,

      P( |β̂N − β| < ϵ ) → 1   as N → ∞.

Note that ϵ may be arbitrarily small!

Part VII: Further Properties of the OLS Estimator and Dummy Variables Further Properties of the OLS Estimator 198 / 288
Consistency of OLS Estimation

▶ Consistency means that the OLS estimator converges “in probability” to the true
value with increasing number of observations N.  
▶ A sufficient condition for this convergence in probability is that E(β̂N) → β and sd(β̂N) → 0 as N → ∞.
▶ Under the Gauss Markov assumptions, the OLS estimator is a consistent estimator
of β.
▶ Note that consistency also holds if the normality assumption (50) is violated.

Part VII: Further Properties of the OLS Estimator and Dummy Variables Further Properties of the OLS Estimator 199 / 288
Consistency of OLS Estimation

"Proof"
For each j = 1, . . . , K:
▶ The OLS estimator is unbiased, i.e. E(β̂j) = βj.
▶ The standard deviation sd(β̂j) goes to 0 for N → ∞:

      sd(β̂j) = σ / √( N s²xj (1 − R²j) )  → 0   as N → ∞

Part VII: Further Properties of the OLS Estimator and Dummy Variables Further Properties of the OLS Estimator 200 / 288
Outline

Part VII: Further Properties of the OLS Estimator and


Dummy Variables

▶ Further Properties of the OLS Estimator

▶ Dummy Variables

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 201 / 288
Regression Models with Dummy Variables as
Predictors

▶ A dummy variable (binary variable) D is a variable that assumes two values only: 0
or 1
▶ Examples: EU member (D = 1 if EU member, 0 otherwise), brand (D = 1 if
product has a particular brand, 0 otherwise), gender (D = 1 if male, 0 otherwise)
▶ Note that the labelling is not unique, a dummy variable could be labeled in two
ways, i.e. for variable gender:
▶ D = 1 if male, D = 0 if female
▶ D = 1 if female, D = 0 if male

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 202 / 288
Regression Models with Dummy Variables as
Predictors

Consider a regression model with one continuous variable X and one dummy variable D:

      Y = β0 + β1D + β2X + u

If D = 0, then:

      Y = β0 + β2X + u,    with intercept β0

If D = 1, then:

      Y = (β0 + β1) + β2X + u,    with intercept β0 + β1

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 203 / 288
Regression Models with Dummy Variables as
Predictors
Example: Y = 20 + 3.2D − 2.5X
[Plot of the two parallel regression lines: the line for D = 1 lies 3.2 units above the line for D = 0 for every value of X.]

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 204 / 288
Regression Models with Dummy Variables as
Predictors

Interpretation:
▶ The observed units are split into 2 groups according to D (e.g. into men and
women).
▶ The group with D = 0 is called the baseline (e.g. men).
▶ The regression coefficient β1 of D quantifies the expected difference in the dependent variable Y between the other group (e.g. women) and the baseline, while holding all other variables (e.g. X) fixed.
▶ The null hypothesis β1 = 0 corresponds to the assumption that the conditional average value of Y given all remaining regressors is the same for both groups.

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 205 / 288
Regression Models with Dummy Variables as
Predictors

Consider model
Y = 20 + 3.2D − 2.5X + u
where D = 1 if female. Assume that X = 4:
▶ expected value of Y for a man: E(Y |X = 4, D = 0) = 20 − 2.5 · 4 = 10
▶ expected value of Y for a woman: E(Y |X = 4, D = 1) = 20 + 3.2 − 2.5 · 4 = 13.2
▶ expected difference between women and men is equal to β1 = 3.2

The expected difference between women and men is equal to β1 = 3.2 for all values of
X!

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 206 / 288
Combining More Dummy Variables

Estimate a model where D1 is the gender (1: female, 0: male), D2 is the brand (1:
specific brand, 0: no-name), and P is the price:

Y = β0 + β1 D1 + β2 D2 + β3 P + u

▶ β0 corresponds to the baseline (male, no-name product).


▶ β1 corresponds to the difference in the expected rating between male and female
consumers (same brand, same price).
▶ β2 corresponds to the difference in the expected rating between the specific brand
and a no-name product (same person, same price).

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 207 / 288
Categorical Variables

We can use dummy variables to control for characteristics with multiple categories (K
categories ⇒ K − 1 dummies)

Suppose one of the predictors is the highest level of education. Such variables are often
coded in the following way:
edu
1 high school dropout
2 high school degree
3 college degree

What is the expected effect of education on a variable Y , e.g. hourly wages?

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 208 / 288
Categorical Variables

Including edu directly into a linear regression model would mean that the effect of a
high school degree compared to a drop out is the same as the effect of a college degree
compared to a high school degree.

To include the highest level of education as predictor in a regression model, define 2


dummy variables D1 and D2 :
edu D1 D2
1 high school dropout 0 0
2 high school degree 1 0
3 college degree 0 1

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 209 / 288
Categorical Variables

This yields:
▶ Baseline (all dummies 0): high school dropout
▶ D1 = 1, if highest degree from high school, 0 otherwise
▶ D2 = 1, if college degree, 0 otherwise

Include D1 and D2 as dummy predictors in a regression model:

Y = β0 + β1 D1 + β2 D2 + β3 X + u

The intercept β0 corresponds to the baseline (D1 = 0, D2 = 0).

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 210 / 288
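In R, the K − 1 dummies are created automatically once the categorical predictor is stored as a factor; a hedged sketch with hypothetical variable names:

# Dummy coding via factor(): the first level is the baseline
dat$edu <- factor(dat$edu, levels = c(1, 2, 3),
                  labels = c("dropout", "highschool", "college"))

fit <- lm(y ~ edu + x, data = dat)
coef(fit)   # coefficients of eduhighschool and educollege are the effects
            # relative to the baseline "dropout"; the intercept is the baseline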
Categorical Variables

In other words:
▶ β1 is the effect of a high school degree compared to a drop out.
▶ β2 is the effect of a college degree compared to a drop out.

Testing hypothesis:
▶ Is the effect of a high school degree compared to a drop out the same as the effect
of a college degree compared to a high school degree?
▶ Test if 2β1 = β2 , or equivalently, test the linear hypothesis 2β1 − β2 = 0.

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 211 / 288
Case Study Marketing

There are 5 different brands of mineral water (KR, RO, VO, JU, A):
▶ Select one mineral water as baseline, e.g. KR.
▶ Introduce 4 dummy variables D1 , . . . , D4 , and assign each of them to the remaining
brands, e.g. D1 = 1, if brand is equal to RO and D1 = 0, otherwise; D2 = 1, if
brand is equal to VO and D2 = 0, otherwise; etc.

The model reads:

Y = β0 + β1 D1 + . . . + β4 D4 + β5 P + u (59)

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 212 / 288
Case Study Marketing

Interpretation of the coefficients in model (59) for a given price level P:


▶ The expected rating for the brand corresponding to the baseline is given by
β0 + β5 P.
▶ The expected rating for the brand corresponding to Dj is given by β0 + βj + β5 P.
▶ The coefficient βj measures the effect of the brand Dj in comparison to the brand
corresponding to the baseline

∆E(Y |P) = β0 + βj + β5 P − (β0 + β5 P) = βj .

▶ The difference in the expected average rating between two arbitrary brands Dj and
Dk is equal to βj − βk .
▶ Is the rating different for the brands Dj and Dk ? Test the linear hypothesis
βj − βk = 0!

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 213 / 288
Case Study Marketing

Including an additional dummy variable D5 , where D5 = 1 if brand is KR, i.e.

Y = β0 + β1 D1 + . . . + β5 D5 + β6 P + u

leads to a model which is not identified because:

D1 + D2 + . . . + D5 = 1

Hence, the set of regressors D1 , . . . , D5 is perfectly correlated with the regressor ’1’
corresponding to the intercept ⇒ EViews produces an error message indicating
difficulties with estimating the model.

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 214 / 288
Case Study Marketing

It is possible to include all 5 regressors if no constant is included in the model, with a


slightly different interpretation of the coefficients:

Y = β1 D1 + . . . + β5 D5 + β6 P + u

▶ βj is a brand specific intercept of the regression model for the brand corresponding
to Dj .
▶ For a given price level P, the expected rating for the brand corresponding to Dj is
given by βj + β6 P.
▶ The difference in the expected average rating between two arbitrary brands Dj and
Dk is (again) equal to βj − βk .

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 215 / 288
Part VIII
Residual Diagnosis
Outline

Part VIII: Residual Diagnosis

▶ Residual Diagnostics

▶ Model Evaluation and Model Comparison

Part VIII: Residual Diagnosis Residual Diagnostics 217 / 288


Checking model assumptions

Hypothetical model:

Y = β0 + β1 X1 + . . . + βK XK + u

Estimated model:

yi = β̂0 + β̂1 x1,i + . . . + β̂K xK ,i + ûi

where ûi is the OLS residual.


▶ Due to consistency, the OLS residual ûi approaches the unobservable error ui as N
increases.
▶ Use OLS residuals ûi to test assumptions about ui .

Part VIII: Residual Diagnosis Residual Diagnostics 218 / 288


R / EViews Class Exercise

Discuss in R / EViews how to obtain the OLS residuals


▶ Case Study Profit, workfile profit
▶ Case Study Chicken, workfile chicken
▶ Case Study Marketing, workfile marketing
⇒ R-code code_eco_I.R

Part VIII: Residual Diagnosis Residual Diagnostics 219 / 288


Testing Normality

The error follows a normal distribution:


 
      u | X1, . . . , XK ∼ N(0, σ²)

▶ Roughly 95% of the OLS residuals lie between [−2σ̂, 2σ̂]; only 5% lie outside.
▶ Assumption often violated if outliers are present.
▶ Normality often improved through transformations.

Part VIII: Residual Diagnosis Residual Diagnostics 220 / 288


Testing Normality

To test normality of u, check normality of the OLS residuals ûi :


▶ Histogram
▶ Q-Q plot
▶ Skewness coefficient m3 close to 0?
▶ Kurtosis coefficient m4 close to 3?

      m3 = (1/σ̂³) · (1/N) Σ_{i=1}^N ûi³        m4 = (1/σ̂⁴) · (1/N) Σ_{i=1}^N ûi⁴

Part VIII: Residual Diagnosis Residual Diagnostics 221 / 288


Testing Normality

Jarque-Bera-Statistic:

      J = (N − K)/6 · [ m3² + (1/4)(m4 − 3)² ]     (60)

▶ Null hypothesis H0 : the errors follow a normal distribution


▶ Under H0 , J asymptotically (i.e. for N large) follows a χ22 -distribution with 2
degrees of freedom (95%-quantile χ22,0.95 = 5.9915)
▶ Reject H0 if J > χ22,0.95 (or p-value of J smaller than 0.05).

Part VIII: Residual Diagnosis Residual Diagnostics 222 / 288


Case Study Yields

yi = β0 + β1 x1,i + β2 x2,i + ui (61)

where

yi . . . yield with maturity 3 months


x1,i . . . yield with maturity 1 month
x2,i . . . yield with maturity 60 months

Demonstration in R / EViews, data yieldus.csv, see R-code code_eco_I.R.

Part VIII: Residual Diagnosis Residual Diagnostics 223 / 288


Case Study Yields

Jarque-Bera test statistic J = 910.094, p-value: 0.000 (rounded)

[Left panel: histogram of the residuals. Right panel: distribution of J under normality with the observed value 910.094 marked far in the right tail.]

Part VIII: Residual Diagnosis Residual Diagnostics 224 / 288


Case Study Profit

yi = β0 + β1 x1,i + β2 x2,i + ui (62)

where

yi . . . profit 1994
x1,i . . . profit 1993
x2,i . . . turnover 1994

Consider only large firms (i = 1, . . . , 20).

Part VIII: Residual Diagnosis Residual Diagnostics 225 / 288


Case Study Profit

Jarque-Bera test statistic J = 2.811, p-value: 0.245


[Left panel: histogram of the residuals. Right panel: distribution of J under normality with the observed value 2.811 marked; shaded tail area 0.245.]

Part VIII: Residual Diagnosis Residual Diagnostics 226 / 288


Checking Homoskedasticity

Assumption (38) claims that the variance of ui is homoskedastic, i.e.

V(u|X1 , . . . , XK ) = σ 2

▶ If this assumption is violated, the model is said to have heteroskedastic errors.


▶ This assumption is often violated because the variance of u depends on a predictor
variable.

First informal check: residual plot; more about formal tests later.

Part VIII: Residual Diagnosis Residual Diagnostics 227 / 288
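A minimal sketch of such an informal residual plot in R (hypothetical model fit and predictor x1):

# Plot the OLS residuals against the fitted values and against a predictor
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "OLS residuals")
abline(h = 0, lty = 2)

plot(dat$x1, resid(fit), xlab = "x1", ylab = "OLS residuals")
abline(h = 0, lty = 2)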


Case Study Yields
[Residuals plotted against the yield with maturity 1 month, against the yield with maturity 60 months, and over time (together with actual and fitted values).]

Variability of residuals seems to depend on X1 , X2 (and on time)! ⇝ Heteroskedasticity


Part VIII: Residual Diagnosis Residual Diagnostics 228 / 288
Checking for Assumption (26)

Assumption (26)
The model does not contain any systematic error, i.e.

E(u|X1 , . . . , XK ) = 0

▶ If assumption (26) is violated, the model is said to have a specification error:


▶ the true value of yi will be underrated, if E(ui |·) > 0
▶ the true value of yi will be overrated, if E(ui |·) < 0
▶ This assumption is often violated, when an important predictor variable has been
omitted (“omitted variables bias”) or the functional form is misspecified

Part VIII: Residual Diagnosis Residual Diagnostics 229 / 288


Checking for Assumption (26)

Example: Simulate data from a simple log-linear regression model with β̃1 = 0.2 and β2 = −1.8:

      yi = 0.2 · xi^(−1.8) · e^(ui)     (63)

▶ Residual plot for the log-linear regression model (true model).


▶ Residual plot for the linear regression model (misspecified model).

Part VIII: Residual Diagnosis Residual Diagnostics 230 / 288


Checking for Assumption (26)

[Top row ("OLS – true model, σ² = 0.01"): log(demand) against log(price) and the OLS errors against log(price). Bottom row ("OLS – misspecification"): demand against price and the OLS errors against price.]
Part VIII: Residual Diagnosis Residual Diagnostics 231 / 288
Case Study Profit

Model average profit 1994 only as a function of profit 1993:


[Left panel: GEW94 against GEW93 with the fitted regression line. Right panel: residuals against GEW93.]

Assumption (26) seems to be violated!


Part VIII: Residual Diagnosis Residual Diagnostics 232 / 288
Outline

Part VIII: Residual Diagnosis

▶ Residual Diagnostics

▶ Model Evaluation and Model Comparison

Part VIII: Residual Diagnosis Model Evaluation and Model Comparison 233 / 288
Model Comparison Using R2 and AIC/BIC

▶ Model evaluation using the coefficient of determination R2 .


▶ Problems with R2 : R2 increases with increasing number of variables, because SSR
decreases ⇒ may lead to overfitting.
▶ Model comparison using AIC and SC (BIC): Penalize the ever decreasing SSR by
including the number of parameters.

Part VIII: Residual Diagnosis Model Evaluation and Model Comparison 234 / 288
Coefficient of Determination R2

Recall: Coefficient of determination R2 can be written as follows:


      R² = (TSS − SSR)/TSS = 1 − SSR/TSS
▶ SSR is the sum of squared residuals.
▶ TSS is the total sum of squares, i.e. the sum of squared residuals of the simple
model without predictor.
▶ SSR is always smaller than TSS.
▶ If SSR is much smaller than TSS, then the regression model M1 is much better
than the simple model M0 .
▶ R2 is close to 1 if SSR ≪ TSS and close to 0 if SSR ≈ TSS, thus R2 can be used
for model selection.

Part VIII: Residual Diagnosis Model Evaluation and Model Comparison 235 / 288
R / EViews Class Exercise

Discuss in R / EViews where to find SSR and R2 ; discuss how SSR and R2 change when
number of predictors is increased
▶ Case Study Profit, workfile profit
▶ Case Study Chicken, workfile chicken
▶ Case Study Marketing, workfile marketing
⇒ R-code code_eco_I.R

Part VIII: Residual Diagnosis Model Evaluation and Model Comparison 236 / 288
Case Study Chicken (Log-Linear Model)

Predictors SSR R2
pchick 0.273487 0.647001
income 0.041986 0.945807
income, pchick 0.015437 0.980074
income, pchick, ppork 0.014326 0.981509
income, pchick, ppork, pbeef 0.013703 0.982313

Part VIII: Residual Diagnosis Model Evaluation and Model Comparison 237 / 288
Problems with R2

▶ Choosing the model with the smallest SSR (largest R2 ) leads to overfitting: R2
“automatically” increases when the number of variables increases.
▶ R2 is 1 for K = N − 1 because SSR = 0 if we include as many predictors as
observations (even if the predictors are useless!).
▶ However, the increase is small when a useless predictor is added. ⇒ penalize the
ever decreasing SSR by incorporating the number of parameters used for
estimation!

Part VIII: Residual Diagnosis Model Evaluation and Model Comparison 238 / 288
Adjusted R2

A very simple way out is to adjust R2 to cater for the number of parameters:

Adjusted R²

      R²adj = 1 − (N − 1)/(N − K − 1) · (1 − R²) = 1 − (N − 1)/(N − K − 1) · SSR/TSS = 1 − s²û / s²y

Choose the model that maximizes R2adj .

Alternatively (or better), use so-called “information criteria” AIC and SC (BIC).

Part VIII: Residual Diagnosis Model Evaluation and Model Comparison 239 / 288
Information Criteria

Information Criteria: Model Fit + Penalty

N · log(SSR/N) + N + N · log(2π) + m · (K + 1) (64)

▶ SSR: Sum of squared residuals


▶ K + 1: The number of estimable parameters in the model
▶ m = 2: AIC (Akaike Information Criterion)
▶ m = log N: SC (Schwarz Criterion), also called BIC (Bayesian IC)

Choose the model that minimizes a particular criterion.


Caveat: Implementations in R and EViews differ slightly. Hence, you may only compare
the numbers stemming from the same software. However, the implied ranking is the
same in R and in EViews.
Part VIII: Residual Diagnosis Model Evaluation and Model Comparison 240 / 288
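A sketch of criterion (64) in R for a hypothetical fitted model fit; the built-in AIC()/BIC() additionally count the error variance as a parameter, so the absolute values differ by a constant while the implied ranking is unchanged.

# Information criterion (64): model fit + penalty
ic <- function(fit, m) {
  ssr <- sum(resid(fit)^2)
  N   <- nobs(fit)
  k1  <- length(coef(fit))            # K + 1 estimable coefficients
  N * log(ssr / N) + N + N * log(2 * pi) + m * k1
}

ic(fit, m = 2)                        # AIC-type criterion
ic(fit, m = log(nobs(fit)))           # SC/BIC-type criterion
AIC(fit); BIC(fit)                    # built-in versions for comparison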
Information Criteria
The various criteria may lead to different choices; SC has a larger penalty for the model size K if the number of observations is N ≥ 8 (N > e²).

AIC and SC penalty for N = 100 as a function of the number of parameters K .


Part VIII: Residual Diagnosis Model Evaluation and Model Comparison 241 / 288
R / EViews Class Exercise

Discuss in EViews where to find R2adj , AIC, and Schwarz criterion; discuss how to choose
predictors based on these model choice criteria.
▶ Case Study Profit, workfile profit
▶ Case Study Chicken, workfile chicken
▶ Case Study Marketing, workfile marketing
⇒ R-code code_eco_I.R

Part VIII: Residual Diagnosis Model Evaluation and Model Comparison 242 / 288
Case Study Chicken (Log-Linear Model)

Predictors                       SSR       R²        AIC        BIC

pchick                           0.273487  0.647001  -30.66474  -27.25826
income                           0.041986  0.945807  -73.76486  -70.35838
income, pchick                   0.015437  0.980074  -94.77735  -90.23538
income, pchick, ppork            0.014326  0.981509  -94.49623  -88.81876
income, pchick, ppork, pbeef     0.013703  0.982313  -93.51870  -86.70573

Caveat: Mind the different implementations in R and EViews (but the result of the best
model remains the same).

Part VIII: Residual Diagnosis Model Evaluation and Model Comparison 243 / 288
Comparing Linear and Log-Linear Models

▶ The residual sum of squares SSR depends on the scale of yi , therefore AIC and SC
are scale dependent.
▶ AIC and SC cannot be used directly to compare a linear and a log-linear model.
▶ AIC and SC of the log-linear model could be matched back to the original scale by
adding 2 times the mean (EViews) or 2 times the sum (R) of the log-values of yi .

Correction Formula for AIC and SC


      R:       C = C⋆ + 2 Σ_{i=1}^N log(yi)          (65)
      EViews:  C = C⋆ + (2/N) Σ_{i=1}^N log(yi)      (66)

where C ⋆ is the model choice criterion for the log-linear model.


Part VIII: Residual Diagnosis Model Evaluation and Model Comparison 244 / 288
EViews Class Exercise: Case Study Chicken

Predictor SSR R2 AIC BIC


income, pchick (log-linear) 0.015437 0.980074 -94.77735 -90.23538
income, pchick (linear) 106.65 0.9108 108.5549 113.0969

Transform AIC and SC of the log-linear model:

AIC = −94.77735 + 2 × 84.26939 = 73.76143


SC = −90.23538 + 2 × 84.26939 = 78.3034

⇒ log-linear model is preferred.

Part VIII: Residual Diagnosis Model Evaluation and Model Comparison 245 / 288
Part IX
Advanced Multiple Regression Models
Outline

Part IX: Advanced Multiple Regression Models


▶ Quadratic Terms

▶ Interaction Terms

▶ Dummy Variables with Interaction Terms

Part IX: Advanced Multiple Regression Models Quadratic Terms 247 / 288
Models with Quadratic Terms

Sometimes there is an interest in modeling increasing or decreasing marginal effects of


certain variables. Why not capture such effects by including quadratic terms?

      Y = β0 + β1X + β2X² + u     (67)

Implications:
▶ OLS estimation of β0, β1, and β2 proceeds as discussed above, based on the predictors X1 = X and X2 = X²
▶ Although the relationship between X1 and X2 is deterministic (note that X2 = X1²), the predictors X1 and X2 are not linearly dependent; hence, OLS estimation is feasible
▶ Note that the relationship between X and E(Y |X) is non-linear

Part IX: Advanced Multiple Regression Models Quadratic Terms 248 / 288
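A sketch of how such a model is fitted in R and how the vertex is located (hypothetical variable names):

# Quadratic term via I(x^2) and position of the vertex
fit_q <- lm(y ~ x + I(x^2), data = dat)

b1 <- coef(fit_q)["x"]
b2 <- coef(fit_q)["I(x^2)"]

vertex <- -b1 / (2 * b2)    # the effect of x switches sign at this point
vertex
range(dat$x)                # effect is monotone if the vertex lies outside this range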
Models with Quadratic Terms

Two examples where X varies over all real numbers:



Left hand side: E (Y |X ) = 1 + 2X − 0.1X 2 ; right hand side: E (Y |X ) = 1 − 3X + 0.1X 2


Part IX: Advanced Multiple Regression Models Quadratic Terms 249 / 288
Models with Quadratic Terms

Same examples with X being limited to the range [0, 10].



Left hand side: E (Y |X ) = 1 + 2X − 0.1X 2 ; right hand side: E (Y |X ) = 1 − 3X + 0.1X 2


Part IX: Advanced Multiple Regression Models Quadratic Terms 250 / 288
Models with Quadratic Terms

▶ The parabola corresponding to the quadratic function (67) opens up iff β2 > 0 and
opens down iff β2 < 0
▶ The vertex (Scheitel) is obtained by setting the first derivative of E(Y |X = x) with respect to x equal to 0:

      ∂E(Y |X = x)/∂x = β1 + 2β2x = 0

  This yields that the vertex lies at x0 = −β1/(2β2); note that x0 is negative if β1 and β2 have the same sign, and positive otherwise.

Part IX: Advanced Multiple Regression Models Quadratic Terms 251 / 288
Monotonic Behavior

Often, only part of the parabola is used to describe a monotonic behavior over a certain range of X, e.g. between the smallest and the largest observed value of X.

The position of the vertex is important in this respect:


▶ Vertex outside the relevant range of X : effect is monotone
▶ Vertex within the relevant range of X : effect is not monotone

Part IX: Advanced Multiple Regression Models Quadratic Terms 252 / 288
Testing for Non-Linearity

▶ Model (67) reduces to a model which is linear in X if β2 = 0 ⇒ test the null


hypothesis H0 : β2 = 0 to test for the presence of non-linear effects.
▶ If β2 ̸= 0, non-linearity is present in model (67). In this case, β1 does not measure
the expected change in Y with respect to X , since X2 = X 2 cannot be held
constant, while X1 = X changes. Changing X changes both predictors X1 and X2 .

Part IX: Advanced Multiple Regression Models Quadratic Terms 253 / 288
Understanding the Coefficients

The instantaneous change of E(Y |X = x) is equal to the first derivative with respect to x:

      ∂E(Y |X = x)/∂x = β1 + 2β2x     (68)

Part IX: Advanced Multiple Regression Models Quadratic Terms 254 / 288
Understanding the Coefficients

▶ If β2 = 0, the expected change of E(Y |X = x ) is equal to β1 .


▶ For β2 ̸= 0, the expected change of E(Y |X = x ) depends not only on β1 , but also
on β2 and the current value x of X .
▶ The expected change of E(Y |X = x ) switches sign at the vertex / apex (=
Scheitel), i.e. at the point X0 = −β1 /(2β2 )
▶ The model describes a monotonic behaviour if only values of x are considered
which lie on one side of the vertex.

Part IX: Advanced Multiple Regression Models Quadratic Terms 255 / 288
Understanding the Coefficients

Suppose that β1 is positive while β2 is negative. Then according to the first term in
(68), increasing x will increase E(Y |X = x ), however, this positive effect becomes
smaller with increasing x . It remains positive as long as x is smaller than the vertex x0 :
      x < −β1/(2β2)
If x is larger than the vertex x0 , there is a negative effect of increasing x , which gets
larger with increasing x .

Part IX: Advanced Multiple Regression Models Quadratic Terms 256 / 288
Monotonic Behavior

Vertex smaller than the relevant range of x:

▶ β2 > 0: positive effect; β2 < 0: negative effect.
  In either case, the effect of increasing x gets bigger (in absolute value) the larger x is.

Vertex larger than the relevant range of x:

▶ β2 > 0: negative effect; β2 < 0: positive effect.
  In either case, the effect of increasing x gets smaller (in absolute value) the closer x is to the vertex.

Part IX: Advanced Multiple Regression Models Quadratic Terms 257 / 288
Monotonic Behavior
Example: E(Y |X = x) = 20 + 0.005x − 0.2x²,  1 ≤ x ≤ 5
▶ Parabola opens down because β2 = −0.2 < 0
▶ Vertex: 0.005 − 0.4x0 = 0 ⇒ x0 = 0.0125
▶ Range of x restricted to the right hand side ⇒ monotonically decreasing function

Part IX: Advanced Multiple Regression Models Quadratic Terms 258 / 288
Case Study Chicken

Estimate the model:

      Y = β0 + β1X1 + β2P1 + β3P1² + β4P2 + β5P2² + u

with X1 the income, P1 the price of chicken, and P2 the price of pork. This model
outperforms a model without quadratic terms according to AIC and SC.

β2 is negative, but the negative effect decreases as the price increases, since β3 is positive. The vertex is equal to

      −β2/(2β3) = −(−1.69)/(2 × 0.014) ≈ 60.

This value lies in the range of observed prices, hence the chicken price effect changes sign within the range of observations.
Part IX: Advanced Multiple Regression Models Quadratic Terms 259 / 288
Case Study Chicken

β4 is positive, but the positive effect decreases as the price of pork increases, since β5 is negative. The vertex is equal to

      −β4/(2β5) = −0.542/(2 × (−0.0024)) ≈ 113.

This value lies in the range of observed prices, hence the pork price effect changes sign within the range of observations.

Part IX: Advanced Multiple Regression Models Quadratic Terms 260 / 288
Outline

Part IX: Advanced Multiple Regression Models


▶ Quadratic Terms

▶ Interaction Terms

▶ Dummy Variables with Interaction Terms

Part IX: Advanced Multiple Regression Models Interaction Terms 261 / 288
Models with Interaction Terms

▶ In some cases it makes sense to make the effect of a variable X1 on Y dependent


on another regressor X2 .
▶ One way to capture such effects is to include interaction terms:

      Y = β0 + β1X1 + β2X2 + β3X1X2 + u     (69)

▶ OLS estimation of β0 , . . . , β3 proceeds as discussed above, based on the predictors


X1 , X2 , and X3 = X1 × X2
▶ Note that the relationship between X1 and E(Y |X1 , X2 ) is non-linear as is the
relationship between X2 and E(Y |X1 , X2 ).

Part IX: Advanced Multiple Regression Models Interaction Terms 262 / 288
Models with Interaction Terms

▶ The first derivative of E(Y |X1 = x1, X2 = x2) with respect to x1 is given by:

      ∂E(Y |X1 = x1, X2 = x2)/∂x1 = β1 + β3x2

  and depends on the actual value of x2.

▶ The first derivative of E(Y |X1 = x1, X2 = x2) with respect to x2 is given by:

      ∂E(Y |X1 = x1, X2 = x2)/∂x2 = β2 + β3x1

  and depends on the actual value of x1.

Part IX: Advanced Multiple Regression Models Interaction Terms 263 / 288
Understanding the Coefficients

Therefore, β1 is the effect of X1 on E(Y |X1, X2) only for X2 = 0, which is not necessarily a reasonable value of X2. The average effect δ1 of X1 on E(Y |X1, X2) can be evaluated at the sample mean X̄2 of X2:

      δ1 = β1 + β3X̄2

Similarly, the average effect δ2 of X2 on E(Y |X1, X2) can be evaluated at the sample mean X̄1 of X1:

      δ2 = β2 + β3X̄1

Part IX: Advanced Multiple Regression Models Interaction Terms 264 / 288
Centering the Predictors

An alternative parameterization of the model is

      Y = δ0 + δ1X1 + δ2X2 + δ3(X1 − X̄1)(X2 − X̄2) + u

where the interaction term involves the centered predictors X1 − X̄1 and X2 − X̄2.

Thus,
▶ δ1 is the average effect of X1 on E(Y |X1, X2) at the mean of X2
▶ δ2 is the average effect of X2 on E(Y |X1, X2) at the mean of X1

Part IX: Advanced Multiple Regression Models Interaction Terms 265 / 288
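A sketch of the centered parameterization in R (hypothetical variable names); the uncentered interaction model gives the same δ1 via the correction β1 + β3X̄2:

# Centered interaction term: delta1 and delta2 are average effects at the means
dat$x1c <- dat$x1 - mean(dat$x1)
dat$x2c <- dat$x2 - mean(dat$x2)

fit_c <- lm(y ~ x1 + x2 + I(x1c * x2c), data = dat)
coef(fit_c)                                   # coefficients of x1 and x2 are delta1, delta2

fit_i <- lm(y ~ x1 * x2, data = dat)          # uncentered model (69)
coef(fit_i)["x1"] + coef(fit_i)["x1:x2"] * mean(dat$x2)   # equals delta1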
Case Study Chicken

Estimate a model with income X1 , price of chicken P1 and price of pork P2 :

Y = β0 + β1 X1 + β2 P1 + β3 P2 + β4 X1 P2 + u

This model outperforms a model without the interaction term according to AIC and BIC.

β3 is positive, but the positive effect of increasing the price of pork decreases as the income X1 increases, since β4 is negative:

      ∂E(Y |X1 = x1, P1 = p1, P2 = p2)/∂p2 = β3 + β4x1

Part IX: Advanced Multiple Regression Models Interaction Terms 266 / 288
Case Study Chicken

The average income is equal to X̄1 = 1035.065, hence the average effect of the price of pork is equal to:

      δ3 = β3 + β4X̄1 = 0.162937 + 1035.065 × (−8.62 × 10⁻⁵) = 0.0737

This value is considerably smaller than the effect obtained from the model without an interaction term (0.174).

The average effect of the price of pork is obtained immediately from OLS estimation if the following model is fit to the data:

      Y = δ0 + δ1X1 + δ2P1 + δ3P2 + δ4(X1 − X̄1)(P2 − P̄2) + u

and is equal to δ3 . [EViews-Hint: Use @mean() to obtain the mean of a variable.]

Part IX: Advanced Multiple Regression Models Interaction Terms 267 / 288
Outline

Part IX: Advanced Multiple Regression Models


▶ Quadratic Terms

▶ Interaction Terms

▶ Dummy Variables with Interaction Terms

Part IX: Advanced Multiple Regression Models Dummy Variables with Interaction Terms 268 / 288
Interaction with Dummy Variables

Consider interacting a dummy variable D with a continuous variable X :

Y = β0 + β1 D + β2 X + β3 XD + u (70)

If D = 0, then:

Y = β0 + β2 X + u

If D = 1, then:

Y = (β0 + β1 ) + (β2 + β3 )X + u

Part IX: Advanced Multiple Regression Models Dummy Variables with Interaction Terms 269 / 288
Interaction with Dummy Variables

Example: Y = 20 + 3.2D − 2.5X + 1.5DX


[Plot of the two regression lines: the line for D = 1 has a higher intercept and a flatter slope than the line for D = 0.]

Part IX: Advanced Multiple Regression Models Dummy Variables with Interaction Terms 270 / 288
Interaction with Dummy Variables

Interpretation:
▶ The observed units are split into 2 groups according to D (e.g. into men and
women)
▶ The coefficient β3 models the difference in the marginal effect of X between the two groups. A change ∆x in X leads to an expected change in Y equal to
  ▶ E(Y |X = x + ∆x, D = 0) − E(Y |X = x, D = 0) = β2∆x,
  ▶ E(Y |X = x + ∆x, D = 1) − E(Y |X = x, D = 1) = (β2 + β3)∆x.
▶ The difference in the expected value of Y between the two groups for a given value
of X is equal to:

E(Y |X , D = 1) − E(Y |X , D = 0) = β1 + β3 X

Part IX: Advanced Multiple Regression Models Dummy Variables with Interaction Terms 271 / 288
Interaction with Dummy Variables

Testing hypotheses:
▶ The null hypothesis β3 = 0 corresponds to the assumption that the effect of X is
the same for both groups (interaction effect is not significant)
▶ The joint null hypothesis β2 = 0, β2 + β3 = 0 corresponds to the assumption that
the effect of X is zero for both groups
▶ The joint null hypothesis β1 = 0, β3 = 0 corresponds to the assumption that the
regression model is the same for both groups

Part IX: Advanced Multiple Regression Models Dummy Variables with Interaction Terms 272 / 288
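A sketch of model (70) and of the joint test of identical regressions in both groups (hypothetical data with a 0/1 dummy d):

# Dummy-by-continuous interaction: y ~ d + x + d:x (same as y ~ d * x)
fit <- lm(y ~ d + x + d:x, data = dat)

coef(fit)["x"]                       # slope in the group D = 0
coef(fit)["x"] + coef(fit)["d:x"]    # slope in the group D = 1

fit_r <- lm(y ~ x, data = dat)       # model without group differences
anova(fit_r, fit)                    # joint test of beta1 = 0 and beta3 = 0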
Case Study Marketing

Estimate a model with a specific brand (KR, D = 1) and price P:

Y = β0 + β1 D + β2 P + β3 PD + u

Results:
▶ There is a very significant price effect for the specific brand
▶ Increasing the price for an ordinary brand by one unit leads to an expected decrease
in the rating by β2 , i.e. around 0.31 points
▶ For the KR brand, the price effect is equal to β2 + β3 , i.e. increasing the price for
the specific brand by one unit leads to an expected decrease in the rating by 0.26
points

Part IX: Advanced Multiple Regression Models Dummy Variables with Interaction Terms 273 / 288
Part X
Regression with Heteroscedastic Errors
Outline

Part X: Regression with Heteroscedastic Errors

▶ Regression Models with Heteroskedastic Errors

Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 275 / 288
Regression Models with Heteroskedastic Errors

If assumption (38) (homoskedastic errors) is violated, one has to deal with
heteroskedastic errors, i.e. the variance differs among the observations:

Heteroskedastic Errors

V(ui | X1,i , . . . , XK,i ) = σi²    (71)

▶ Standard errors of OLS estimation are no longer valid.


▶ OLS estimator is no longer BLUE, better estimation methods exist.

Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 276 / 288
Case Study Profit

Demonstration in EViews, workfile profit.wf1:

yi = β0 + β1 x1,i + β2 x2,i + ui
yi . . . profit 1994
x1,i . . . profit 1993
x2,i . . . turnover 1994

The variance increases with the size of the firm.

Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 277 / 288
OLS Estimation under Heteroskedasticity

Simulate data from a regression model with β0 = 0.2 and β1 = −1.8 and
heteroskedastic errors:
 
yi = 0.2 − 1.8 xi + ui ,    ui ∼ N(0, σi²)
σi² = σ² (0.2 + xi)²

[Figure: scatter plot of one simulated data set, Y against X, for X between −0.5 and 0.5.]
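
This simulation design can be replicated in R as sketched below; σ² = 0.1 is taken from the following slide, while the range of x is an assumption read off the plot.

# Minimal sketch: one data set from the heteroskedastic design above.
set.seed(123)
N      <- 50
sigma2 <- 0.1
x      <- runif(N, -0.5, 0.5)
u      <- rnorm(N, mean = 0, sd = sqrt(sigma2 * (0.2 + x)^2))
y      <- 0.2 - 1.8 * x + u
fit_het <- lm(y ~ x)       # OLS is still unbiased here ...
summary(fit_het)           # ... but its reported standard errors are not valid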
Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 278 / 288
OLS Estimation under Heteroskedasticity
[Figure: two panels, each titled N = 50, σ² = 0.1, Design 2, with β1 (constant) on the horizontal axis and β2 (price) on the vertical axis.]

Left-hand side: estimation errors obtained from a simulation study with 200 data sets
(each with N = 50 observations); right-hand side: contours show the estimation error
according to OLS estimation
Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 279 / 288
Weighted Least Squares Estimation

If the variance increases with an observed variable Zi ,


Observed Heteroskedasticity
V(ui | X1,i , . . . , XK,i ) = σi² ,    σi² = σ² Zi

a simple transformation leads to a model with homoskedastic variances:


ui⋆ = ui / √Zi ,    V(ui⋆ | X1,i , . . . , XK,i ) = σ²

Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 280 / 288
Weighted Least Squares Estimation

Therefore a simple transformation of the original regression model

yi = β0 + β1 x1,i + . . . + βK xK ,i + ui

leads to a model with homoskedastic variances:


yi / √Zi = β0 (1/√Zi) + β1 (x1,i / √Zi) + . . . + βK (xK,i / √Zi) + ui⋆    (72)

Regression model (72) has the same parameters as the original model, but a transformed
response variable as well as transformed predictors.
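
A minimal R sketch of this transformation; the data frame d below is purely illustrative (generated in the spirit of the profit case study, with Zi = x2,i), not the original workfile.

# Illustrative heteroskedastic data (names and numbers are assumptions):
set.seed(42)
N <- 200
d <- data.frame(x1 = rnorm(N), x2 = runif(N, 1, 10))
d$z <- d$x2                                       # variance proportional to x2
d$y <- 1 + 0.5 * d$x1 + 0.3 * d$x2 + rnorm(N, sd = sqrt(d$z))

# Transform all variables as in (72) and run OLS without a separate intercept:
w       <- 1 / sqrt(d$z)
y_star  <- d$y  * w
x0_star <- w                                      # transformed "intercept" column 1/sqrt(z)
x1_star <- d$x1 * w
x2_star <- d$x2 * w
fit_tr  <- lm(y_star ~ 0 + x0_star + x1_star + x2_star)
coef(fit_tr)                                      # estimates of beta_0, beta_1, beta_2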

Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 281 / 288
Weighted Least Squares Estimation

Rewrite model (72) as



yi⋆ = β0 x0,i⋆ + β1 x1,i⋆ + . . . + βK xK,i⋆ + ui⋆    (73)

where

yi⋆ = yi / √Zi ,    x0,i⋆ = 1 / √Zi ,    xj,i⋆ = xj,i / √Zi ,    ∀ j = 1, . . . , K

Note that model (73) fulfills assumption (38), i.e. it is a model with homoskedastic
errors.

Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 282 / 288
Weighted Least Squares Estimation

Use OLS estimation for the transformed model (73):



yi⋆ = β0 x0,i⋆ + β1 x1,i⋆ + . . . + βK xK,i⋆ + ui⋆

and minimize the sum of squared residuals in the transformed model:

SSR = Σ_{i=1}^{N} (ûi⋆)²

Due to the relation

ui⋆ = ui / √Zi

the OLS estimator of the transformed model is equal to a weighted least squares
estimator in the original model:

SSR = Σ_{i=1}^{N} (ûi⋆)² = Σ_{i=1}^{N} (1/Zi) ûi²
Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 283 / 288
Weighted Least Squares Estimation

Residuals for observations with large variances are down-weighted, while residuals for
observations with small variances receive a higher weight; hence the name weighted
least squares estimation.
There is no “intercept” in the model (73), only covariates. Using the matrix formulation
of the multiple regression model (73), we obtain the following matrix of predictors and
observation vector:

X ⋆ = Diag (w1 , . . . , wN ) X, y ⋆ = Diag (w1 , . . . , wN ) y

where
wi = 1 / √Zi ,    i = 1, . . . , N

Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 284 / 288
Weighted Least Squares Estimation

The OLS estimator is computed for the transformed model, i.e.


β̂ = ((X⋆)′ X⋆)⁻¹ (X⋆)′ y⋆

This is equal to the following WLS estimator, which is expressed entirely in terms of the
original variables:
β̂ = (X′ W X)⁻¹ X′ W y    (74)

where W = Diag(w1², . . . , wN²).
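
As a small numerical check (continuing the illustrative data frame d from the previous sketch), the matrix formula (74) and R's built-in weighted least squares in lm() give the same estimates; lm() minimises the weighted residual sum of squares when a weights argument is supplied, so weights = 1/z corresponds to wi² = 1/Zi.

Xmat <- cbind(1, d$x1, d$x2)              # design matrix of the original model
W    <- diag(1 / d$z)                     # W = Diag(w_1^2, ..., w_N^2), w_i = 1/sqrt(Z_i)
beta_wls <- solve(t(Xmat) %*% W %*% Xmat, t(Xmat) %*% W %*% d$y)

fit_wls <- lm(y ~ x1 + x2, data = d, weights = 1 / z)   # built-in WLS
cbind(beta_wls, coef(fit_wls))            # both columns coincide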

Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 285 / 288
Testing for Heteroskedasticity

▶ Classical tests for heteroskedasticity are based on the squared OLS residuals ûi²,
e.g. the White or the Breusch-Pagan heteroskedasticity test. The idea is to test for
dependence of the squared residuals on any of the predictor variables using a
regression type model:

ûi² = α0 + α1 x1,i + . . . + αK xK,i + ξi

and test whether α1 = . . . = αK = 0 using an F-test.


▶ Problem: the test is not fully reliable, as the errors ξi of this auxiliary regression are
not normally distributed!
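
A hedged R illustration of this auxiliary regression, again using the illustrative data frame d from the WLS sketches; the lmtest package additionally provides bptest() as a ready-made (studentized) Breusch-Pagan test.

fit_ols <- lm(y ~ x1 + x2, data = d)    # original, possibly heteroskedastic, fit
u2  <- residuals(fit_ols)^2
aux <- lm(u2 ~ x1 + x2, data = d)
summary(aux)        # overall F-statistic tests alpha_1 = ... = alpha_K = 0

library(lmtest)     # install.packages("lmtest") if necessary
bptest(fit_ols)     # studentized Breusch-Pagan test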

Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 286 / 288
Case Study Profit

Demonstration in EViews, workfile profit.wf1:

yi = β0 + β1 x1,i + β2 x2,i + ui

▶ Discuss classical tests for heteroskedasticity [View → Residual Diagnostics]


▶ Possible choice for Zi : Zi = x2,i [Est.Eq. → Options → um94 as Std.dev.]
▶ Show how to estimate the transformed model [Divide everything by um94]
▶ Perform residual diagnostics for the transformed model [View → Res.Diag.]

Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 287 / 288
Some Final Words. . .

http://xkcd.com/552/

Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 288 / 288
