
Econometrics I

An Introductory Course to Econometric Modeling, WS 2023/24

S. Adhikari, E. Flonner, G. Malsiner-Walli


Acknowledgements to: S. Frühwirth-Schnatter, T. Fissler

Institute for Statistics and Mathematics, Department of Finance, Accounting and Statistics
Introductory Course to Econometric Modeling

The course – in particular this set of slides – is based on and coherent with previous
econometrics courses held by Sylvia Frühwirth-Schnatter. It is aimed at being consistent
with other courses from this lecture series (“Econometrics II” / “Applied
Econometrics”).

Throughout the next few months, we strive for competence in the following. . .
Major Milestones
▶ Part I: Basic Concepts of Econometric Modeling
▶ Part II: OLS Estimation
▶ Part III: Multiple Regression Model

2 / 288
Literature
Introductory and largely non-mathematical:
▶ Gary Koop: Analysis of Economic Data. Wiley, 4th edition, 2013.

The “Classics”:
▶ James H. Stock and Mark W. Watson: Introduction to Econometrics. Prentice
Hall, 3rd international edition, 2011.
▶ Jeffrey M. Wooldridge: Introductory Econometrics: A Modern Approach. Cengage,
5th international edition, 2013.

In German:
▶ Herbert Stocker: Methoden der Empirischen Wirtschaftsforschung.
https://www.hsto.info/econometrics/.
▶ Peter Hackl: Einführung in die Ökonometrie. Pearson, 2. Auflage, 2013.

Older editions are good enough.


3 / 288
Workload

▶ 60 ECTS credits are the equivalent of a full year of study.


▶ Workload of Econometrics I: 4 ECTS
▶ Workload in hours: 4 x 25 hours = 100 hours
▶ Workload per week: 100/12 ≈ 8 hours.

4 / 288
Part I
Basic Concepts of Econometric Modeling
Outline

Part I: Basic Concepts of Econometric Modeling


▶ What is econometric modeling?

▶ First steps in R

▶ Common data structures

▶ The simple regression model


▶ Model formulation and basic assumptions
▶ The log-linear regression model

Part I: Basic Concepts of Econometric Modeling What is econometric modeling? 6 / 288


Econometric Modeling

Econometrics deals with learning about an economic phenomenon (e.g. status of the
economy, influence of product attributes, volatility on financial markets, wage mobility)
from data.
▶ Econometric model: description of the phenomenon involving quantities that are
observable
▶ Data: collected for the observable variables
▶ Econometric inference: draw conclusions from the data about the phenomenon
of interest

Part I: Basic Concepts of Econometric Modeling What is econometric modeling? 7 / 288


Econometric Modeling - Stochastic/Deterministic
Models?

Example: Relationship between price and demand:


▶ Description of the phenomenon involving quantities that are observable.
▶ Economic model: simplified description of the process behind the data based on
a deterministic, mathematical model.
▶ Sometimes the deterministic model is based on some economic theory,
sometimes it’s simply a “convenient choice”.
▶ A stochastic model is used rather than a deterministic model (mainly) because the economic model is a simplification.

Part I: Basic Concepts of Econometric Modeling What is econometric modeling? 8 / 288


Deterministic Model

The exact quantitative relationship between the variables of interest is assumed to be known.
Example: Deterministic Relationship between Demand and Price
D = f (p),
where D is the demand and p is the price.

Linear model:
D = β0 + β1 p
Non-linear model:
D = β0 · p^β1

Part I: Basic Concepts of Econometric Modeling What is econometric modeling? 9 / 288


Stochastic model

Exact quantitative relationship between the variables of interest is NOT known, but
disturbed by a (stochastic) error term.

Example: Stochastic Relationship between Demand and Price


D = f (p, u)
where D is the demand, p is the price, and u is an unobservable error.

Linear model:
D = β0 + β1 p + u
Non-linear model:
D = β0 · p^β1 · u

Part I: Basic Concepts of Econometric Modeling What is econometric modeling? 10 / 288


Econometric Model

[Figure: demand vs. price under the deterministic model (left column) and the stochastic model (right column), for the linear specification (top row) and the non-linear specification (bottom row).]
Part I: Basic Concepts of Econometric Modeling What is econometric modeling? 11 / 288
Where Does the Error Come From?

▶ u aggregates variables that are not included into the model because
▶ their influence is not known a priori;
▶ these variables are unobservable or difficult to quantify.
▶ u aggregates measurement errors which are caused by quantifying economic
variables.
▶ u captures the unpredictable randomness in the left-hand-side variable of the model.

Part I: Basic Concepts of Econometric Modeling What is econometric modeling? 12 / 288


Econometric Inference - Example

Example: Relationship between Demand and Price


Estimate β0 and β1 from the linear model:

D = β0 + β1 p + u

or from the non-linear model:


D = β0 · p^β1 · u
from data.
Remark: For the second model, β1 is the price elasticity.

Part I: Basic Concepts of Econometric Modeling What is econometric modeling? 13 / 288


Econometric Inference - Definition

Econometric inference is, in general, concerned with drawing conclusions from observed
data about quantities that are not directly observable.

For instance, we might be interested in


▶ model parameters of economic interest (e.g. price elasticities)
▶ hypotheses about these parameters
▶ prediction (e.g. the expected demand for a particular price)

Part I: Basic Concepts of Econometric Modeling What is econometric modeling? 14 / 288


Econometric Inference - Uncertainty

Since these quantities of interest cannot be observed directly, any statement about them will be uncertain. There are two ways of dealing with this uncertainty:
▶ Classical inference: parameter estimation, hypothesis testing, and prediction as discussed in the PI Statistik.
▶ Bayesian inference: is based on the concept that the state of knowledge about
any unknown quantity is best expressed in terms of a probability distribution which
is updated in the light of new knowledge.

Part I: Basic Concepts of Econometric Modeling What is econometric modeling? 15 / 288


Applied Econometrics - Steps

▶ Model formulation
▶ Model estimation
▶ Econometric inference: parameter estimation, hypothesis testing,
forecasting/prediction
▶ Model choice
▶ Model checking

Part I: Basic Concepts of Econometric Modeling What is econometric modeling? 16 / 288


Outline

Part I: Basic Concepts of Econometric Modeling


▶ What is econometric modeling?

▶ First steps in R

▶ Common data structures

▶ The simple regression model


▶ Model formulation and basic assumptions
▶ The log-linear regression model

Part I: Basic Concepts of Econometric Modeling First steps in R 17 / 288


Software

▶ A software package is needed for practical econometric inference.


▶ We will use R.
▶ Detailed instructions on how to use EViews are given in the tutorial.
▶ We already assume a certain familiarity with R. Still, help and instructions can be
found in the tutorial.

Part I: Basic Concepts of Econometric Modeling First steps in R 18 / 288


Outline

Part I: Basic Concepts of Econometric Modeling


▶ What is econometric modeling?

▶ First steps in R

▶ Common data structures

▶ The simple regression model


▶ Model formulation and basic assumptions
▶ The log-linear regression model

Part I: Basic Concepts of Econometric Modeling Common data structures 19 / 288


Data Structure - Types

Experimental data: data obtained through a designed experiment (medicine, marketing, etc.). In experiments, a lot of important variables can be explicitly controlled (age, gender, etc.).
This is a rare situation in economics (and many other areas without laboratories).

(Socio-)Economists mostly deal with observational (non-experimental) data:


▶ Cross-sectional Data
▶ Time series data
▶ Panel data

Part I: Basic Concepts of Econometric Modeling Common data structures 20 / 288


Cross-sectional Data

▶ We are interested in variables (Y, X) (e.g. the relationship between demand D and price P) or a set of variables (Y, X1, . . . , XK).
▶ We are observing these variables simultaneously for N subjects drawn randomly from a population (e.g., for various individuals, firms, supermarkets, countries).
Typically, cross-sectional data are indexed as follows:

(yi, xi), or (yi, x1i, . . . , xKi),   i = 1, . . . , N

If the data set is not a (simple) random sample, there is a sample-selection problem.

Part I: Basic Concepts of Econometric Modeling Common data structures 21 / 288


Time Series Data

▶ We are (traditionally) interested in a single variable Y (e.g. the return of a financial asset).
▶ We are observing this variable over time (e.g. every month).
▶ The data cannot be regarded as a random sample. It is important to account for trends, seasonality, . . .
Typically, time series data are indexed as follows:

yt , t = 1, . . . , T .

Part I: Basic Concepts of Econometric Modeling Common data structures 22 / 288


Panel Data

▶ Panel data or longitudinal data: The same (randomly drawn) individuals are followed over time, i.e., we have a time series for each cross-section unit.
Typically, panel data are indexed as follows:

yit , i = 1, . . . , N, t = 1, . . . , T

Part I: Basic Concepts of Econometric Modeling Common data structures 23 / 288


R Homework

Have a look at how data are organized. Files and R code are available on learn@wu:
▶ Case Study Marketing, workfile marketing
▶ Case Study Profit, workfile profit
▶ Case Study Vienna Stocks, workfile viennastocks
▶ Case Study Yields, workfile yieldus
▶ Case Study Chicken, workfile chicken
▶ Case Study Labor Force, workfile change
The code file is called code_eco_I.R.

Part I: Basic Concepts of Econometric Modeling Common data structures 24 / 288


Outline

Part I: Basic Concepts of Econometric Modeling


▶ What is econometric modeling?

▶ First steps in R

▶ Common data structures

▶ The simple regression model


▶ Model formulation and basic assumptions
▶ The log-linear regression model

Part I: Basic Concepts of Econometric Modeling The simple regression model 25 / 288
Question and Data

▶ We are interested in a
▶ dependent variable Y (left-hand side, explained, response), which is supposed to depend on an
▶ explanatory variable X (right-hand side, independent, control, predictor).
▶ Examples:
▶ demand is a response variable and price is a predictor variable;
▶ wage is a response and years of education is a predictor.
▶ Data: We observe the pair of variables (Y , X ) for N subjects drawn randomly from
a population (e.g. for various supermarkets, for various individuals): (yi , xi ),
i = 1, . . . , N.

Part I: Basic Concepts of Econometric Modeling The simple regression model 26 / 288
Model Formulation

The simple linear regression model describes the dependence between the variables X
and Y as:
Simple Linear Regression Model

Y = β0 + β1 X + u. (1)

The parameters β0 and β1 need to be estimated:


▶ β0 is referred to as the constant or intercept
▶ β1 is referred to as the slope parameter

Part I: Basic Concepts of Econometric Modeling The simple regression model 27 / 288
Impact of the Error Term

[Figure: simulated demand vs. price for error variances σ² = 0.2, σ² = 1, σ² = 0.01, and σ² = 0; the larger the error variance, the more the simulated points scatter around the true regression line.]

Part I: Basic Concepts of Econometric Modeling The simple regression model 28 / 288
Basic Assumptions

▶ The average value of the error term u in the population is 0 (not restrictive, we can
always use β0 to normalize E(u) to 0):

Assumption About the Unconditional Mean Error

E(u) = 0 (2)
▶ A crucial assumption is that u and X are uncorrelated. This means that the
conditional mean of u is zero, i.e., knowing X does not give us any information
about u.
Assumption About the Conditional Mean Error

E(u|X ) = E(u) (3)

Part I: Basic Concepts of Econometric Modeling The simple regression model 29 / 288
Main Assumption

The linear model given in (1) and assumptions (2) and (3) imply that E(Y |X ) (i.e., the
conditional mean of Y given X ) is a linear function of X :

Modeling Assumption on the Conditional Mean

E(Y |X ) = β0 + β1 X (4)

Loosely speaking: For a fixed value of X = x , on average over the population, the linear
prediction β0 + β1 x is correct.

Part I: Basic Concepts of Econometric Modeling The simple regression model 30 / 288
Understanding the Regression Model - error term

▶ Simulate data from a simple regression model with β0 = 0.2 and β1 = −1.8:

Y = 0.2 − 1.8X + u (5)

▶ Specification of the error term:


 
u independent of X and u ∼ N(0, σ²)   (6)

▶ Demonstration ⇒ R-code code_eco_I.R

Part I: Basic Concepts of Econometric Modeling The simple regression model 31 / 288
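The demonstration refers to the course file code_eco_I.R, which is not reproduced in these slides. A minimal sketch of such a simulation (the grid of error variances and all variable names are my own choices, not the course code) could look like this in R:

# simulate model (5)-(6) for several error variances and fit the line by OLS
set.seed(1)
N     <- 100
price <- runif(N, 1, 2)                     # predictor X
for (sigma2 in c(0.2, 1, 0.01, 0)) {
  u      <- rnorm(N, mean = 0, sd = sqrt(sigma2))
  demand <- 0.2 - 1.8 * price + u           # true model: beta0 = 0.2, beta1 = -1.8
  fit    <- lm(demand ~ price)
  cat("sigma^2 =", sigma2, " estimates:", round(coef(fit), 3), "\n")
}

The estimated coefficients settle closer to (0.2, −1.8) as σ² shrinks; for σ² = 0 the fit is exact.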
Understanding the Regression Model

[Figure: simulated data from model (5) for error variances σ² = 0.2, σ² = 1, σ² = 0.01, and σ² = 0; for σ² = 0 the points lie exactly on the line determined by β0 and β1.]

Part I: Basic Concepts of Econometric Modeling The simple regression model 32 / 288
Understanding the Parameters - interpretation of β1

Expected value of Y , given X = x :

E(Y |X = x ) = β0 + β1 x

Expected value of Y , if the predictor X is changed by 1:

E(Y |X = x + 1) = β0 + β1 (x + 1)

Thus, β1 is the expected absolute change of the response variable Y , if the predictor X
is increased by 1:

E(Y |X = x + 1) − E(Y |X = x ) = β1

Part I: Basic Concepts of Econometric Modeling The simple regression model 33 / 288
Understanding the Parameters

▶ The effect of changing X is independent of the level of X .


▶ The sign shows the direction of the expected change:
▶ If β1 > 0, then the changes of X and Y go into the same direction.
▶ If β1 < 0, then the changes of X and Y go into opposite directions.
▶ If β1 = 0, then a change in X has no influence on Y .

Part I: Basic Concepts of Econometric Modeling The simple regression model 34 / 288
The Log-Linear Regression Model

▶ Log-linear regression model assumes a (specific) nonlinear relation btw Y and X :


Log-linear regression model
Suppose Y , X > 0,

Y = β̃0 · X^β1 · ũ,   β̃0, ũ > 0. (7)


▶ By taking the natural logarithm on both sides we obtain a linear (in the
parameters) regression model for the transformed variables log Y and log X :
Log-linear regression model with assumption on error
log Y = β0 + β1 log X + u, E(u|X ) = 0. (8)

where β0 = log β̃0 and u = log ũ.


This model is sometimes called the “log-log” model, because logarithms are taken w.r.t. X and Y .
Part I: Basic Concepts of Econometric Modeling The simple regression model 35 / 288
Visualizing the Log-Linear Regression Model

▶ Simulate data from a simple log-linear regression model with β̃0 = 0.2 and
β1 = −1.8:

Y = 0.2 · X^(−1.8) · e^u

▶ Specification of the error term:


 
u independent of X and u ∼ N(0, σ²)

▶ Demonstration ⇒ R-code regsimlog.R

Part I: Basic Concepts of Econometric Modeling The simple regression model 36 / 288
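The file regsimlog.R is not included here; a rough sketch of the same idea (simulated numbers of my own) shows how the log transformation turns (7) into the linear model (8):

# simulate Y = 0.2 * X^(-1.8) * exp(u) and estimate it on the log-log scale
set.seed(1)
N      <- 200
price  <- runif(N, 1, 2)
u      <- rnorm(N, mean = 0, sd = sqrt(0.1))
demand <- 0.2 * price^(-1.8) * exp(u)
fit <- lm(log(demand) ~ log(price))
coef(fit)            # intercept estimates log(0.2) = -1.61, slope estimates -1.8
exp(coef(fit)[1])    # back-transformed intercept, roughly 0.2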
Visualizing the Log-Linear Regression Model

[Figure: simulated demand vs. price (left column) and log(demand) vs. log(price) (right column) for error variances σ² = 0.01 (top row) and σ² = 0.1 (bottom row); on the log-log scale the relationship is linear.]

Part I: Basic Concepts of Econometric Modeling The simple regression model 37 / 288
Part II
OLS Estimation
Understanding the Parameters I

In economics, elasticity measures how changing one variable affects other variables in
relative terms. If y = f (x ), then the elasticity is the ratio of the percentage change
%∆y in y and the percentage change %∆x in the variable x :
\frac{\%\Delta y}{\%\Delta x} \approx \frac{\partial y}{y}\Big/\frac{\partial x}{x} = \frac{\partial \log y}{\partial \log x}
From equation (8) we obtain the following expected value of log Y , if the predictor X is
equal to x :

E(log Y |X = x ) = β0 + β1 log x .

Therefore:
E\!\left(\frac{\%\Delta Y}{\%\Delta X}\right) \approx E\!\left(\frac{\partial \log y}{\partial \log x}\right) = \frac{\partial\, E(\log y)}{\partial \log x} = \beta_1

Part II: OLS Estimation 39 / 288


Understanding the Parameters II

▶ The parameter β1 is approximately the expected change in % of the response


variable Y , if the predictor X is increased by 1% (elasticity).
▶ The sign of β1 shows the direction the of expected relative change of the response
variable Y . If β1 = 0, then a change in X has no influence in Y .
▶ If X is increased by p%, then the expected change of Y is equal to β1 p% for
small p.

Part II: OLS Estimation 40 / 288


Understanding the Parameters with log variables

Summary of functional forms involving logarithms


Dependent variable    Independent variable    Interpretation of β1
y                     x                       ∆y = β1 ∆x
y                     log(x)                  ∆y = (β1/100) %∆x
log(y)                x                       %∆y = (β1 · 100) ∆x
log(y)                log(x)                  %∆y = β1 %∆x

Part II: OLS Estimation 41 / 288


Ordinary Least Squares (OLS) Estimation -
Estimation Problem

▶ Let (yi , xi ), i = 1, . . . , N, denote a random sample of size N from the population.


Hence, for each i

yi = β0 + β1 xi + ui . (9)

▶ The population parameters β0 and β1 are estimated from a sample.

▶ The parameter estimates are typically denoted by a hat: βˆ0 and βˆ1 .

▶ Estimation problem: How to choose the unknown parameters β0 and β1 ?

Part II: OLS Estimation 42 / 288


OLS-Estimation - black box?

▶ Estimation as Black Box? Very conveniently, the estimation problem is solved by


software packages like R or EViews. It helps, however, to have a deeper
understanding of what is going on.
▶ The commonly used method to estimate the parameters in a simple (mean)
regression model is ordinary least square (OLS) estimation.

Part II: OLS Estimation 43 / 288


OLS-Estimation

▶ Let (γ0 , γ1 ) denote a candidate for (β0 , β1 ).


▶ For each observation xi , the prediction ŷi of yi depends on the candidate choice
(γ0 , γ1 ):

ŷi (γ0 , γ1 ) = γ0 + γ1 xi (10)

▶ For each observation xi define the residual ui (prediction error) as:

ui = yi − ŷi = yi − (γ0 + γ1 xi ) (11)

▶ For each candidate (γ0 , γ1 ), an overall measure of fit is obtained by aggregating


these prediction errors.

Part II: OLS Estimation 44 / 288


OLS-Estimation

▶ The aggregated squared prediction errors:

Sum of Squared Residuals (SSR)

SSR(\gamma_0,\gamma_1) = \sum_{i=1}^{N} u_i(\gamma_0,\gamma_1)^2 = \sum_{i=1}^{N} \big(y_i - \hat y_i(\gamma_0,\gamma_1)\big)^2 = \sum_{i=1}^{N} (y_i - \gamma_0 - \gamma_1 x_i)^2 \qquad (12)

Then:  \hat\beta = (\hat\beta_0, \hat\beta_1) = \arg\min_{\gamma_0,\gamma_1} SSR(\gamma_0,\gamma_1) \qquad (13)

▶ Intuitively, OLS is fitting a line through the sample points such that the sum of
squared residuals is as small as possible.
▶ The OLS-estimator β̂ = (βˆ0 , βˆ1 ) is the parameter that minimizes the sum of
squared residuals

Part II: OLS Estimation 45 / 288


OLS-Estimation

[Figure sequence (slides 46–59): a scatter plot of sample data together with a sequence of candidate regression lines; as the candidate line is adjusted, the residual sum of squares decreases step by step (RSS = 26.80, 22.19, 18.03, 14.33, 11.09, 8.30, 5.97, 4.09, 2.67, 1.11) until the line with the smallest sum of squared residuals — the OLS line — is reached.]
Part II: OLS Estimation 59 / 288
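A small sketch of the idea behind this figure sequence (made-up data, not the slide's example): evaluate the sum of squared residuals for a grid of candidate slopes, setting the intercept for each candidate to the value that is optimal for that slope (γ0 = ȳ − γ1 x̄, cf. equation (17) later in this part), and pick the candidate with the smallest SSR.

set.seed(1)
x <- runif(50, 0, 12)
y <- 6 + 1.0 * x + rnorm(50, sd = 1.5)
ssr <- function(g0, g1) sum((y - g0 - g1 * x)^2)   # sum of squared residuals (12)
cand_slopes <- seq(0, 2, by = 0.05)
ssr_vals <- sapply(cand_slopes, function(g1) ssr(mean(y) - g1 * mean(x), g1))
cand_slopes[which.min(ssr_vals)]                   # slope of the best candidate line
coef(lm(y ~ x))[2]                                 # exact OLS slope for comparison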


How to Compute the OLS Estimator?

OLS estimates for the Simple linear regression model


\hat\beta_0 = \bar y - \hat\beta_1 \bar x, \qquad \hat\beta_1 = r_{xy}\,\frac{s_y}{s_x} \qquad (14)

where
▶ x̄ is the mean of x1, . . . , xN
▶ ȳ is the mean of y1, . . . , yN
▶ sx is the standard deviation of x1, . . . , xN
▶ sy is the standard deviation of y1, . . . , yN
▶ rxy is the linear correlation coefficient between x and y
The only requirement is that we have sample variation in X, i.e. sx² > 0.

Part II: OLS Estimation 60 / 288
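A minimal check of formula (14) in R (simulated data, not from the case studies):

set.seed(2)
N <- 100
x <- rnorm(N)
y <- 0.2 - 1.8 * x + rnorm(N)
b1 <- cor(x, y) * sd(y) / sd(x)   # beta1_hat = r_xy * s_y / s_x
b0 <- mean(y) - b1 * mean(x)      # beta0_hat = ybar - beta1_hat * xbar
c(b0, b1)
coef(lm(y ~ x))                   # lm() returns the same values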


Proof

The OLS estimator is obtained as solution to the following minimization problem:


\arg\min_{\gamma_0,\gamma_1} \sum_{i=1}^{N} (y_i - \gamma_0 - \gamma_1 x_i)^2

Taking derivatives with respect to γ0 and γ1 , the first-order conditions are:


-2 \sum_{i=1}^{N} (y_i - \gamma_0 - \gamma_1 x_i) = 0, \quad \text{and} \qquad (15)

-2 \sum_{i=1}^{N} x_i (y_i - \gamma_0 - \gamma_1 x_i) = 0 \qquad (16)

Part II: OLS Estimation 61 / 288


Proof

From (15) we have ȳ − γ0 − γ1 x̄ = 0. Thus:

ȳ = β̂0 + β̂1 x̄  =⇒  β̂0 = ȳ − β̂1 x̄   (17)

Implications (algebraic properties of OLS):


▶ The regression line passes through the sample mean.
▶ The sum (and thus also the average) of the OLS residuals

ûi = yi − β̂0 − β̂1 xi

is equal to zero. Follows directly from (15):


\frac{1}{N}\sum_{i=1}^{N} \hat u_i = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0

Part II: OLS Estimation 62 / 288


Proof

Substituting β̂0 = ȳ − β̂1 x̄ into (16) and solving for β̂1, we obtain:

\hat\beta_1 = \frac{\sum_{i=1}^{N}(x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{N}(x_i - \bar x)^2} = r_{xy}\,\frac{s_y}{s_x} \qquad (18)

provided that \sum_{i=1}^{N}(x_i - \bar x)^2 > 0 (or sx² > 0).

Implications (algebraic properties of OLS):


▶ The slope estimate is the sample covariance between X and Y , divided by the
sample variance of X .
▶ If X and Y are positively (negatively) correlated, the slope will be positive
(negative).

Part II: OLS Estimation 63 / 288


Proof

▶ The sample covariance between the regressor and the OLS residuals is zero.
Follows from (16):
\frac{1}{N}\sum_{i=1}^{N} x_i \hat u_i = \frac{1}{N}\sum_{i=1}^{N} x_i (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0.

Use OLS estimator to estimate Y for a given X = x :


Predicted Values

ŷ = β̂0 + β̂1 x .

Part II: OLS Estimation 64 / 288


Statistical Properties of OLS Estimation

Econometric inference: learning from the data about the unknown parameter

β = (β0 , β1 ) in the regression model.
▶ Use the OLS estimator β̂ to learn about the regression parameter.
▶ Is this estimator equal to the true value?
▶ How large is the difference between the OLS estimator and the true parameter?
▶ Is there a better estimator than the OLS estimator?

Part II: OLS Estimation 65 / 288


Understanding the Estimation Problem

1. Simulate data from a simple regression model:


 
Yi = 0.2 − 1.8Xi + ui,   ui|Xi ∼ N(0, σ²)   (19)

2. Run OLS estimation to obtain (β̂0 , β̂1 ) and compare the estimated values with the
true values β0 = 0.2 and β1 = −1.8,
3. Repeat this experiment several times.

Part II: OLS Estimation 66 / 288
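A compact sketch of this repeated-sampling experiment (my own seed and settings):

set.seed(3)
R <- 100; N <- 100
est <- t(replicate(R, {
  x <- rnorm(N)
  y <- 0.2 - 1.8 * x + rnorm(N)     # error variance sigma^2 = 1
  coef(lm(y ~ x))
}))
colMeans(est)       # averages are close to the true values (0.2, -1.8)
apply(est, 2, sd)   # sampling spread; it shrinks with larger N or smaller sigma^2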


Small vs. Large Sample Size (100 Experiments)

[Figure: simulated data (x, y) with σ² = 1, σ²_X = 1, µ_X = 0 for sample size N = 100 (left) and N = 1000 (right).]

Part II: OLS Estimation 67 / 288


Small vs. Large Sample Size (100 Experiments)

[Figure: OLS estimates (β̂0, β̂1) from 100 simulated samples for N = 100 (left) and N = 1000 (right), with σ² = 1, σ²_X = 1, µ_X = 0; the estimates scatter around the true values (0.2, −1.8) and are much more concentrated for the larger sample size.]

Part II: OLS Estimation 68 / 288


Small vs. Large Error Variance (100 Experiments)

[Figure: simulated data (x, y) with N = 100, σ²_X = 1, µ_X = 0 for error variance σ² = 1 (left) and σ² = 3 (right).]

Part II: OLS Estimation 69 / 288


Small vs. Large Error Variance (100 Experiments)

[Figure: OLS estimates (β̂0, β̂1) from 100 simulated samples for σ² = 1 (left) and σ² = 3 (right), N = 100; the larger error variance produces a wider scatter of the estimates around (0.2, −1.8).]

Part II: OLS Estimation 70 / 288


Small vs. Large Spread of X (100 Experiments)

[Figure: simulated data (x, y) with N = 100, σ² = 1, µ_X = 0 for regressor variance σ²_X = 1 (left) and σ²_X = 10 (right).]

Part II: OLS Estimation 71 / 288


Small vs. Large Spread of X (100 Experiments)

[Figure: OLS estimates (β̂0, β̂1) from 100 simulated samples for σ²_X = 1 (left) and σ²_X = 10 (right); a larger spread of the regressor gives noticeably more precise slope estimates.]

Part II: OLS Estimation 72 / 288


Effect of “De-Centering” (100 Experiments)
[Figure: simulated data (x, y) with N = 100, σ² = 1, σ²_X = 1 for regressor mean µ_X = 0 (left) and µ_X = 2 (right).]

Part II: OLS Estimation 73 / 288


Effect of “De-Centering” (100 Experiments)

[Figure: OLS estimates (β̂0, β̂1) from 100 simulated samples for µ_X = 0 (left) and µ_X = 2 (right); de-centering the regressor leaves the slope estimates essentially unchanged but makes the intercept estimates more variable and visibly correlated with the slope estimates.]

Part II: OLS Estimation 74 / 288


Understanding the Estimation Problem

▶ Although we are estimating the true model (no model misspecification), the OLS
estimates differ from the true value.
▶ Many different data sets of size N may be generated by the same regression
model due to the stochastic error term.
▶ The estimated parameters differ, as the sample mean, sample variance and
correlation coefficient are different for each data set:

OLS Estimates (Recap)


β̂0 = ȳ − β̂1 x̄,   β̂1 = r_xy · (s_y / s_x)

Part II: OLS Estimation 75 / 288


About the Expected Error of the OLS Estimator

Obviously, the estimator is a random variable. Hence it makes sense to study the
statistical properties of OLS estimation.
Questions
▶ Are the OLS estimates unbiased, i.e., is the expected difference between the OLS
estimator and the true parameter equal to 0?
▶ How precise are these parameter estimates, i.e., how large is the variance of the
two estimators?
▶ Are the OLS coefficients correlated?
▶ How are the OLS coefficients distributed?

Part II: OLS Estimation 76 / 288


About the Expected Error of the OLS Estimator

▶ For fixed values of x1, . . . , xN the sampling properties of ȳ, sy² and rxy determine
the estimation error:
Properties of the Estimation Error
The estimation error. . .
▶ decreases with increasing number of observations N

▶ increases with increasing error variance σ 2

▶ depends on the predictor variable through sx2

[Note: The estimation error of β̂0 also depends on x̄]

Part II: OLS Estimation 77 / 288


Unbiasedness

OLS is Unbiased
Under assumption (4), the OLS estimator is unbiased, i.e., on average the estimated
value is equal to the true one:
   
E(β̂1) = β1,   E(β̂0) = β0

Relation between β̂1 and β1 :


\hat\beta_1 = \beta_1 + \frac{\sum_{i=1}^{N}(X_i - \bar X)\,u_i}{N \cdot s_X^2}. \qquad (20)

Proof of (20) as exercise.

Part II: OLS Estimation 78 / 288


Unbiasedness
 
Since E(ui|Xi) = 0, we obtain from (20) that E(β̂1) = β1, because

E(\hat\beta_1) = E(\beta_1) + \frac{1}{N s_X^2}\sum_{i=1}^{N}(X_i - \bar X)\,E(u_i|X_i) = \beta_1.

Hence E(β1 − β̂1) = 0. Furthermore:

\hat\beta_0 = \bar Y - \hat\beta_1 \bar X = \beta_0 + \beta_1 \bar X + \bar u - \hat\beta_1 \bar X = \beta_0 + (\beta_1 - \hat\beta_1)\,\bar X + \frac{1}{N}\sum_{i=1}^{N} u_i,

which implies

E(\hat\beta_0) = E(\beta_0) + E(\beta_1 - \hat\beta_1)\,\bar X + \frac{1}{N}\sum_{i=1}^{N} E(u_i|X_i) = \beta_0.

Part II: OLS Estimation 79 / 288


Homoskedasticity

▶ How big is the difference between the OLS estimator and the true parameter?
▶ To answer this question, we make an additional assumption on the conditional
variance:
Assumption of Homoskedasticity

V(u|X ) = σ 2 (21)
▶ This means that the variance of the error term u is the same, regardless of the
value of the predictor variable X .

▶ Note: If assumption (21) is violated, e.g. if V(u|X ) = σ 2 h(X ), then we say the
error term is heteroskedastic.

Part II: OLS Estimation 80 / 288


Homoskedasticity

▶ Assumption (21) certainly holds if u and X are assumed to be independent.


However, (21) is a weaker assumption.
▶ Assumption (21) implies that σ 2 is also the unconditional variance of u, referred to
as error variance V(u):
 
V(u) = E(u²) − E(u)² = σ²

Its square root σ is the standard deviation of the error.


▶ It follows that V(Y |X ) = σ 2 .

Part II: OLS Estimation 81 / 288


Variance of the OLS Estimator

▶ How large is the variation of the OLS estimator around the true parameter?
 
▶ We know that E β̂1 − β1 = 0
▶ We measure the variation of the OLS estimator around the true parameter through
the expected squared difference, i.e. the variance:
  
E[(β̂1 − β1)²] = V(β̂1)   (22)

▶ Similarly for β̂0: V(β̂0) = E[(β̂0 − β0)²]

Part II: OLS Estimation 82 / 288


Variance of the OLS Estimator

Variance of the slope estimator β̂1 follows from (20):


V(\hat\beta_1) = \frac{1}{N^2 (s_X^2)^2}\sum_{i=1}^{N}(X_i - \bar X)^2\, V(u_i) = \frac{\sigma^2}{N^2 (s_X^2)^2}\sum_{i=1}^{N}(X_i - \bar X)^2 = \frac{\sigma^2}{N s_X^2} \qquad (23)

▶ The variance of the slope estimator is larger, the smaller the number of observations N (and smaller, the larger N). Doubling the sample size N halves the variance of β̂1.
▶ The variance of the slope estimator is larger, the larger the error variance σ². Doubling the error variance σ² doubles the variance of β̂1.
▶ The variance of the slope estimator is larger, the smaller the variation in X. Doubling sX² halves the variance of β̂1.

Part II: OLS Estimation 83 / 288
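Formula (23) can be checked by simulation; a sketch (with s_X² computed as Σ(Xi − X̄)²/N, as in the slides, and my own simulation settings) could look like this:

set.seed(4)
R <- 2000; N <- 100; sigma2 <- 1
x <- rnorm(N)                                  # keep the regressor values fixed
b1 <- replicate(R, {
  y <- 0.2 - 1.8 * x + rnorm(N, sd = sqrt(sigma2))
  coef(lm(y ~ x))[2]
})
var(b1)                                        # Monte Carlo variance of beta1_hat
sigma2 / (N * mean((x - mean(x))^2))           # sigma^2 / (N * s_X^2), cf. (23)

The two numbers agree up to simulation noise.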


Variance of the OLS Estimator

▶ The variance is in general different for the two parameters in the simple regression
model. V(β̂0) is given by (without proof):

V(\hat\beta_0) = \frac{\sigma^2}{N s_X^2}\cdot\frac{1}{N}\sum_{i=1}^{N} X_i^2 \qquad (24)

▶ The standard deviations sd(βˆ0 ) and sd(βˆ1 ) of the OLS estimators are defined as:
sd(\hat\beta_0) = \sqrt{V(\hat\beta_0)}, \qquad sd(\hat\beta_1) = \sqrt{V(\hat\beta_1)}

Part II: OLS Estimation 84 / 288


Checking the Assumptions

▶ We present checks for correct model specification (4), that is,

E(u|X ) = 0

and for homoskedasticity (21), meaning

V(u|X ) = σ 2 .

Part II: OLS Estimation 85 / 288


Checking Correct Model Specification

▶ Since we don’t observe the error terms ui directly, we take the residuals ûi as
proxies.
▶ We usually don’t have enough observations of the regressor X for any possible
value x . That’s why we check E(u|X ) = 0 not for X = x , but for a ≤ X ≤ b.
[Figure: top row ('OLS − true model'): fitted regression of log(y) on log(x) and the corresponding OLS residuals plotted against log(x) — the residuals scatter around zero without a pattern. Bottom row ('OLS − misspecification'): fit and residuals for a misspecified model — the residuals show a pronounced systematic pattern in x.]

Part II: OLS Estimation 86 / 288


Checking Homoskedasticity

▶ Again, we take the residuals ûi as proxies for the errors.


▶ As before, we check V(u|X ) = σ 2 not for X = x , but for a ≤ X ≤ b to see
whether the variability of the residuals change with the level of X .
[Figure: top row ('OLS − homoskedasticity'): data and OLS residuals whose spread stays roughly constant over the range of x. Bottom row ('OLS − heteroskedasticity'): data and OLS residuals whose spread increases with x.]

Part II: OLS Estimation 87 / 288
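A small sketch of such residual plots (simulated data; the two error specifications are my own choices):

set.seed(5)
N <- 300
x <- runif(N, 0, 1000)
y_hom <- 2 + 3 * x + rnorm(N, sd = 50)        # constant error variance
y_het <- 2 + 3 * x + rnorm(N, sd = 0.1 * x)   # error sd grows with x
par(mfrow = c(1, 2))
plot(x, resid(lm(y_hom ~ x)), main = "homoskedastic",   ylab = "OLS residuals")
plot(x, resid(lm(y_het ~ x)), main = "heteroskedastic", ylab = "OLS residuals")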


Part III
Multiple Regression Model
Outline

Part III: Multiple Regression Model

▶ Model Formulation

▶ OLS Estimation

Part III: Multiple Regression Model Model Formulation 89 / 288


Data

▶ We are interested in a
▶ dependent (left-hand side, explained, response) variable Y , which is supposed to
depend on
▶ K explanatory (right-hand side, independent, control, predictor) variables
X1 , . . . , XK .
▶ Example: wage is a response; education, gender, and experience are predictor
variables.
▶ Sample: We observe these variables for N subjects drawn randomly from a
population (e.g. for various supermarkets, for various individuals):

(yi , x1,i , . . . , xK ,i ) for i = 1, . . . , N

Part III: Multiple Regression Model Model Formulation 90 / 288


Model Formulation

▶ The multiple regression model describes the relation between the response variable
Y and the predictor variables X1 , . . . , XK as:
Multiple Linear Regression Model

Y = β0 + β1 X1 + . . . + βK XK + u (25)

where β0 , β1 , . . . , βK are unknown parameters.

▶ Key Assumption:

E(u|X1 , . . . , XK ) = E(u) = 0 (26)

Part III: Multiple Regression Model Model Formulation 91 / 288


Model Formulation

▶ Assumption (26) implies:

Linearity

E(Y |X1 , . . . , XK ) = β0 + β1 X1 + . . . + βK XK (27)

▶ Note that E(Y |X1 , . . . , XK ) is a linear function


▶ in the parameters β0, β1, . . . , βK (important for “easy” OLS estimation)
▶ in the predictor variables X1 , . . . , XK (important for the correct interpretation of the
parameters)

Part III: Multiple Regression Model Model Formulation 92 / 288


Understanding the Parameters

▶ The parameter βk is the expected absolute change of the response Y , if the


predictor variable Xk is increased by 1, and all other predictors remain the same
(“ceteris paribus”)
▶ If Xk is increased by c, the expected absolute change of Y is βk c, ceteris paribus.

Proof:

E(Y |Xk = x + c) − E(Y |Xk = x )


= β0 + β1 X1 + . . . + βk (x + c) + . . . + βK XK
− (β0 + β1 X1 + . . . + βk x + . . . + βK XK )
= βk c

Part III: Multiple Regression Model Model Formulation 93 / 288


Understanding the Parameters

The sign shows the direction of the expected change:


▶ If βk > 0, then larger Xk implies larger Y ceteris paribus (and vice versa).
▶ If βk < 0, then larger Xk implies smaller Y ceteris paribus (and vice versa).
▶ If βk = 0, then a change in Xk has no influence on Y .

Part III: Multiple Regression Model Model Formulation 94 / 288


The Multiple Log-Linear Model

▶ The multiple log-linear model (also called the multiple “log-log” model) reads:

Y = e^β0 · X1^β1 · · · XK^βK · e^u   (28)

▶ The log transformation of all variables yields a model that is linear in the
parameters β0 , β1 , . . . , βK ,

log Y = β0 + β1 log X1 + . . . + βK log XK + u, (29)

but is nonlinear in the predictor variables X1 , . . . , XK .


▶ This is important for the correct interpretation of the parameters!

Part III: Multiple Regression Model Model Formulation 95 / 288


The Multiple Log-Linear Model

Interpretation of the parameters:


▶ The coefficient βk is the elasticity of the response variable Y with respect to the
variable Xk , i.e. the expected relative percentage change of Y , if the predictor
variable Xk is increased by 1% and all other predictor variables remain the same
(ceteris paribus).
▶ If Xk is increased by p%, then the expected relative change of Y is approximately
equal to βk p%, ceteris paribus (for small p).

Part III: Multiple Regression Model Model Formulation 96 / 288


R Homework

Homework:
Have a look in R how to define a multiple regression model and discuss the meaning of
the estimated parameters:
▶ Case Study Chicken, work file chicken
▶ Case Study Marketing, work file marketing
▶ Case Study Profit, work file profit
⇒ R-code code_eco_I.R

Part III: Multiple Regression Model Model Formulation 97 / 288
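The case-study workfiles themselves are not reproduced here; as a placeholder, a multiple regression in R is specified with a formula listing all predictors (the variable names below are made up, not the case-study columns):

set.seed(6)
df   <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
df$y <- 1 + 2 * df$x1 - 0.5 * df$x2 + rnorm(50)
fit  <- lm(y ~ x1 + x2, data = df)
summary(fit)    # each slope is the ceteris-paribus effect of its predictor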


Outline

Part III: Multiple Regression Model

▶ Model Formulation

▶ OLS Estimation

Part III: Multiple Regression Model OLS Estimation 98 / 288


OLS Estimation

▶ Let (yi , x1,i , . . . , xK ,i ), i = 1, . . . , N denote a random sample of size N from the


population. Hence, for each i:

yi = β0 + β1 x1,i + . . . + βk xk,i + . . . + βK xK ,i + ui (30)

▶ The population parameters β0 , β1 , . . . , βK are estimated from a sample.


▶ The parameter estimates (coefficients) are typically denoted by βˆ0 , βˆ1 , . . . , βˆK . We
will use the following vector notation:
β = (β0, . . . , βK)′,   β̂ = (β̂0, β̂1, . . . , β̂K)′   (31)

Part III: Multiple Regression Model OLS Estimation 99 / 288


OLS Estimation

The commonly used method to estimate the parameters in a multiple regression model
is, again, OLS estimation:
▶ Denote the candidate choice by γ = (γ0 , . . . , γK )′ .
▶ For each observation yi , the prediction ŷi (γ) of yi depends on γ.
▶ For each yi , define the regression residuals (prediction error) ui (γ) as:

ui (γ) = yi − ŷi (γ) = yi − (γ0 + γ1 x1,i + . . . + γK xK ,i ) (32)

Part III: Multiple Regression Model OLS Estimation 100 / 288


OLS Estimation

▶ For each candidate value γ, an overall measure of fit is obtained by aggregating


these prediction errors:

Sum of Squared Residuals (SSR)

SSR(\gamma) = \sum_{i=1}^{N} u_i(\gamma)^2 = \sum_{i=1}^{N} (y_i - \gamma_0 - \gamma_1 x_{1,i} - \ldots - \gamma_K x_{K,i})^2 \qquad (33)

\hat\beta = \arg\min_{\gamma} SSR(\gamma) \qquad (34)

▶ The OLS-estimator β̂ = (βˆ0 , βˆ1 , . . . , βˆK ) is the parameter that minimizes the sum
of squared residuals.

Part III: Multiple Regression Model OLS Estimation 101 / 288


How to Compute the OLS Estimator?

For a multiple regression model, the estimation problem is solved by software packages
like EViews or R.

Some mathematical details:


▶ Take the first partial derivative of (34) with respect to each candidate parameter
γk , k = 0, . . . , K
▶ This yields a system of K + 1 linear equations in γ0 , . . . , γK , which has a unique
solution under certain conditions on the matrix X, having N rows and K + 1
columns, containing in each row i the predictor values (1 x1,i . . . xK ,i )

Part III: Multiple Regression Model OLS Estimation 102 / 288


Matrix Notation of the Multiple Regression Model

Matrix notation for the observed data:


X = \begin{pmatrix}
1 & x_{1,1} & \cdots & x_{K,1} \\
1 & x_{1,2} & \cdots & x_{K,2} \\
\vdots & \vdots & & \vdots \\
1 & x_{1,N-1} & \cdots & x_{K,N-1} \\
1 & x_{1,N} & \cdots & x_{K,N}
\end{pmatrix}, \qquad
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{N-1} \\ y_N \end{pmatrix}

▶ X is an N × (K + 1) matrix (often called design matrix)
▶ y is an N × 1 vector
For those who want to really understand matrices and vectors, I highly recommend the
video series “Essence of linear algebra” to be found here: 3Blue1Brown.com

Part III: Multiple Regression Model OLS Estimation 103 / 288


Matrix Notation of the Multiple Regression Model

In matrix notation, the N equations given in (30) for i = 1, . . . , N, may be written as:

y = Xβ + u

where
 
u = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_N \end{pmatrix}, \qquad
\beta = \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_K \end{pmatrix}

Part III: Multiple Regression Model OLS Estimation 104 / 288


The OLS Estimator

▶ Note that X′X is a square matrix with (K + 1) rows and columns


▶ (X ′ X)−1 denotes the inverse of X ′ X (if it exists)
▶ The OLS estimator β̂ has an explicit form, depending on X and the vector y; it is
given by:
OLS Estimator
β̂ = (X′X)⁻¹X′y   (35)

▶ Note: The matrix X ′ X has to be invertible in order to obtain a unique


estimator for β.

Part III: Multiple Regression Model OLS Estimation 105 / 288
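A sketch of (35) computed "by hand" and compared with lm() (simulated data; solve(A, b) is used instead of explicitly inverting X′X):

set.seed(7)
N  <- 100
x1 <- rnorm(N); x2 <- rnorm(N)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(N)
X  <- cbind(1, x1, x2)                      # design matrix with a column of ones
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # solves (X'X) beta = X'y
cbind(beta_hat, coef(lm(y ~ x1 + x2)))      # the two columns agree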


Proof Using Matrix Differentiation

First, note that SSR(γ) can be written as

u′u = (y − Xγ)′(y − Xγ) = y′y − γ′X′y − y′Xγ + γ′X′Xγ = y′y − 2γ′X′y + γ′X′Xγ

(the terms γ′X′y and y′Xγ are scalars and equal, so they can be combined)

Now, find β̂ := arg minγ SSR(γ) which can be done by finding γ such that the FOC is
satisfied:
\frac{\partial\, u'u}{\partial \gamma} = -2X'y + 2X'X\gamma = 0

In other words,

β̂ = (X′X)⁻¹X′y

Part III: Multiple Regression Model OLS Estimation 106 / 288


The OLS Estimator


Necessary conditions for X′X being invertible:
▶ We have to observe sample variation for each predictor Xk , i.e., the sample
variances of xk,1 , . . . , xk,N are positive for all k = 1, . . . , K
▶ No exact linear relation between any predictors Xk and Xl may be present,
i.e., the empirical correlation coefficients of all pairwise data sets (xk,i , xl,i ),
i = 1, . . . , N are different from 1 and −1 for l ̸= k.
Note: EViews produces an error if X′X is not invertible, whereas R tries to make X′X invertible by removing predictors.

Part III: Multiple Regression Model OLS Estimation 107 / 288


The OLS Estimator

It is sufficient to make the following assumption about the predictors X1 , . . . , XK in a


multiple regression model:
No Perfect Multicollinearity
The predictors X1 , . . . , XK are not linearly dependent, i.e., no predictor Xj may be
expressed as a linear function of the remaining predictors X1 , . . . , Xj−1 , Xj+1 , . . . , XK

If this assumption is violated. . .


▶ . . . the (unique) OLS estimator does not exist, as the matrix X ′ X is not invertible
▶ . . . there are infinitely many parameter values γ having the same minimal sum of
squared residuals (SSR(γ)).
▶ . . . the parameters in the regression model are not identified.

Part III: Multiple Regression Model OLS Estimation 108 / 288


Case Study Yields

Homework in EViews / R, yieldus

yi = β0 + β1 x1,i + β2 x2,i + β3 x3,i + ui ,


yi . . . yield with maturity 3 months
x1,i . . . yield with maturity 1 month
x2,i . . . yield with maturity 60 months
x3,i . . . spread between these yields, i.e. x2,i − x1,i

x3,i is a linear combination of x1,i and x2,i .

⇒ R-code code_eco_I.R

Part III: Multiple Regression Model OLS Estimation 109 / 288


Case Study Yields

Let β = (β0 , β1 , β2 , β3 ) be a certain parameter.

Any parameter β ⋆ = (β0 , β1⋆ , β2⋆ , β3⋆ ), where β3⋆ may be arbitrarily chosen and

β2⋆ = β2 + β3 − β3⋆
β1⋆ = β1 − β3 + β3⋆

will lead to the same sum of squared residuals as β. The OLS estimator is not unique!

Part III: Multiple Regression Model OLS Estimation 110 / 288
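A sketch of this identification problem with made-up numbers (not the actual yieldus data): x3 is an exact linear combination of x1 and x2, and R reacts by dropping it.

set.seed(8)
N  <- 100
x1 <- rnorm(N)
x2 <- x1 + rnorm(N)
x3 <- x2 - x1                       # the 'spread': an exact linear combination
y  <- 1 + 0.5 * x1 + 0.3 * x2 + rnorm(N)
coef(lm(y ~ x1 + x2 + x3))          # the coefficient on x3 is reported as NA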


Part IV
Expected Value and Variance of the OLS Estimator
Outline

Part IV: Expected Value and Variance of the OLS Estimator

▶ Econometric Inference

▶ OLS Residuals

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 112 / 288
Understanding Econometric Inference

Econometric Inference
Learning from data about the unknown parameter β in the regression model:
▶ Use the OLS estimator β̂ to learn about the regression parameter.
▶ Is this estimator equal to the true value?
▶ How large is the difference between the OLS estimator and the true parameter?
▶ Is there a better estimator than the OLS estimator?

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 113 / 288
Unbiasedness of the OLS Estimator

OLS is Unbiased
Under the assumptions (26), the OLS estimator (if it exists) is unbiased, i.e. the
estimated values are on average equal to the true values:
 
E(β̂j) = βj,   j = 0, . . . , K

In matrix notation:

E(β̂) = β,   E(β̂ − β) = 0   (36)

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 114 / 288
Unbiasedness of the OLS Estimator (Proof)

The OLS estimator may be expressed as:


β̂ = (X′X)⁻¹X′y = (X′X)⁻¹X′(Xβ + u) = β + (X′X)⁻¹X′u

Then, the estimation error may be expressed as:


β̂ − β = (X′X)⁻¹X′u   (37)

Result (36) follows immediately:


E(β̂ − β | X) = (X′X)⁻¹X′ E(u | X) = 0

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 115 / 288
Covariance Matrix of the OLS Estimator

 
▶ Due to unbiasedness, the expected value E(β̂j) of the OLS estimator is equal to βj for j = 0, . . . , K.
▶ Hence, the variance V(β̂j) measures the variation of the OLS estimator β̂j around the true value βj:

V(\hat\beta_j) = E\Big[\big(\hat\beta_j - E(\hat\beta_j)\big)^2\Big] = E\Big[\big(\hat\beta_j - \beta_j\big)^2\Big]

▶ Are the deviations of the estimator from the true value correlated across the different coefficients of the OLS estimator?

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 116 / 288
Recap: Effect of “De-Centering” (100 Experiments)

[Figure (recap): OLS estimates (β̂0, β̂1) from 100 simulated samples for µ_X = 0 (left) and µ_X = 2 (right); with the de-centered regressor the deviations of β̂0 and β̂1 from their true values are clearly correlated.]

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 117 / 288
Covariance Matrix of the OLS Estimator
 
▶ The covariance Cov β̂j , β̂k of different coefficients of the OLS estimators
measures how strongly deviations between the estimator and the true value are
correlated:
    
Cov(β̂j, β̂k) = E[(β̂j − βj)(β̂k − βk)]

▶ This information is summarized for all possible pairs of coefficients in the


covariance matrix of the OLS estimator.

Note that
   
Cov(β̂) = E[(β̂ − β)(β̂ − β)′]

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 118 / 288
Covariance Matrix of the OLS Estimator

The covariance matrix of a random vector is a square matrix, containing


▶ in the diagonal the variances of the various elements of the random vector, and
▶ in the off-diagonal elements the covariances.

       
Cov(\hat\beta) = \begin{pmatrix}
V(\hat\beta_0) & Cov(\hat\beta_0,\hat\beta_1) & \cdots & Cov(\hat\beta_0,\hat\beta_K) \\
Cov(\hat\beta_0,\hat\beta_1) & V(\hat\beta_1) & \cdots & Cov(\hat\beta_1,\hat\beta_K) \\
\vdots & \vdots & \ddots & \vdots \\
Cov(\hat\beta_0,\hat\beta_K) & \cdots & Cov(\hat\beta_{K-1},\hat\beta_K) & V(\hat\beta_K)
\end{pmatrix}

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 119 / 288
Homoskedasticity

 
To derive Cov(β̂), we make an additional assumption:

Homoskedasticity

V(u | X1 , . . . , Xk ) = σ 2 (38)

This means that the variance of the error term u is the same, regardless of the
predictor variables X1 , . . . , XK .

Analogously to the univariate case, it follows that

V(Y | X1 , . . . , Xk ) = σ 2

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 120 / 288
Covariance Matrix of Error Vector

▶ Because the observations are (by assumption) a random sample from the
population, any two observations yi and yl are uncorrelated. Hence also the errors
ui and ul are uncorrelated.
▶ Together with homoskedasticity (38) we obtain the following covariance matrix
of the error vector u:

Cov(u | X1 , . . . , Xk ) = σ 2 I

with I being the identity matrix.

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 121 / 288
Covariance Matrix of the OLS Estimator

Under assumption (26) and (38), the covariance matrix of the OLS estimator β̂ is given
by:


 
Cov(β̂) = σ²(X′X)⁻¹   (39)


Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 122 / 288
Covariance Matrix of the OLS Estimator

Proof of (39)
Using (37), we obtain:
β̂ − β = Au   with   A = (X′X)⁻¹X′

The following holds:

Cov(β̂) = E[(β̂ − β)(β̂ − β)′] = E(Auu′A′) = A E(uu′) A′ = A Cov(u) A′

Therefore:

Cov(β̂) = σ²AA′ = σ²(X′X)⁻¹X′X(X′X)⁻¹ = σ²(X′X)⁻¹

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 123 / 288
Covariance Matrix of the OLS Estimator


 
The diagonal elements of the matrix σ²(X′X)⁻¹ define the variance V(β̂j) of the OLS estimator for each component.

The standard deviation sd(β̂j) of each OLS estimator is defined as:

Standard Deviations of the “Betas”

sd(\hat\beta_j) = \sqrt{V(\hat\beta_j)} = \sigma\,\sqrt{\big[(X'X)^{-1}\big]_{j+1,\,j+1}} \qquad (40)

It measures the estimation error in the same unit as βj .

Evidently, the standard deviation is larger for larger error variances σ 2 . What other
factors influence the standard deviation?

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 124 / 288
Multicollinearity

In practical regression analysis very often high (but not perfect) multicollinearity is
present.

How well may Xj be explained by the other regressors?

Consider Xj as left-hand variable in the following regression model, whereas all the
remaining predictors remain on the right hand side:

Xj = β̃0 + β̃1 X1 + . . . + β̃j−1 Xj−1 + β̃j+1 Xj+1 + . . . + β̃K XK + ũ

Use OLS estimation to estimate the parameters and let x̂j,i be the values predicted from
this (OLS) regression:

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 125 / 288
Multicollinearity

▶ Define Rj as the correlation between the observed values xj,i and the predicted
values x̂j,i in this regression.
▶ If Rj2 is close to 0, then Xj cannot be predicted from the other regressors.
⇒ Xj contains additional, “independent” information.
▶ The closer Rj2 is to 1, the better Xj is predicted from the other regressors and
multicollinearity is present.
⇒ Xj does not contain much “independent” information.

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 126 / 288
The Variance of the OLS Estimator

 
Using Rj, the variance V(β̂j) of the OLS estimator of the coefficient βj corresponding to Xj may be expressed in the following way for j = 1, . . . , K:

V(\hat\beta_j) = \frac{\sigma^2}{N s_{x_j}^2 (1 - R_j^2)}

The variance V(β̂j) (and consequently the standard deviation) of the estimate β̂j is
large if
⇒ the regressor Xj is highly redundant given the other regressors,
⇒ Rj2 close to 1, almost multicollinearity present.

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 127 / 288
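A sketch relating this formula to the auxiliary regression (simulated data with strongly correlated regressors; the settings are my own):

set.seed(9)
N  <- 200
x2 <- rnorm(N)
x1 <- 0.9 * x2 + rnorm(N, sd = sqrt(1 - 0.81))    # x1 strongly correlated with x2
y  <- 1 + x1 + x2 + rnorm(N)
R2_1 <- summary(lm(x1 ~ x2))$r.squared            # R_1^2 from the auxiliary regression
fit  <- lm(y ~ x1 + x2)
sig2 <- summary(fit)$sigma^2                      # estimate of sigma^2
sig2 / (N * mean((x1 - mean(x1))^2) * (1 - R2_1)) # formula for V(beta1_hat)
vcov(fit)["x1", "x1"]                             # variance reported by R: same value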
The Variance of the OLS Estimator

All other factors are the same as in the simple regression model, i.e.:
 
The variance V β̂j , j = 1, . . . , K , of the estimator β̂j is large, if
▶ the variance σ 2 of the error term u is large;
▶ the sampling variation in the regressor Xj , i.e. the variance sx2j , is small;
▶ the sample size N is small.

Part IV: Expected Value and Variance of the OLS Estimator Econometric Inference 128 / 288
Outline

Part IV: Expected Value and Variance of the OLS Estimator

▶ Econometric Inference

▶ OLS Residuals

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 129 / 288
OLS Residuals

Consider the (OLS-)estimated regression model:

yi = β̂0 + β̂1 x1,i + . . . + β̂K xK ,i + ûi = ŷi + ûi

where
▶ ŷi = β̂0 + β̂1 x1,i + . . . + β̂K xK ,i is called the fitted value
▶ ûi is called the OLS residual

OLS residuals are useful:


▶ to estimate the variance σ 2 of the error term
▶ to quantify the quality of the fitted regression model
▶ for residual diagnostics

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 130 / 288
R / EViews Class Exercise

Homework:
Have a look in R / EViews how to obtain the OLS residuals and the fitted regression:
▶ Case Study profit, workfile profit
▶ Case Study Chicken, workfile chicken
▶ Case Study Marketing, workfile marketing
⇒ R-code code_eco_I.R

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 131 / 288
OLS Residuals as Proxies for the Error

Compare the underlying regression model

Y = β0 + β1 X1 + . . . + βK XK + u (41)

with the estimated model for i = 1, . . . , N:

yi = β̂0 + β̂1 x1,i + . . . + β̂K xK ,i + ûi

▶ The OLS residuals û1, . . . , ûN may be considered as a “sample” of the unobservable error u
▶ Use the OLS residuals û1, . . . , ûN to estimate σ² = V(u | X1, . . . , XK)

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 132 / 288
Algebraic Properties of the OLS residuals

The OLS residuals û1 , . . . , ûN obey K + 1 linear equations and have the following
algebraic properties:
▶ The sum (and thus also the mean) of the OLS residuals ûi is equal to zero:
\frac{1}{N}\sum_{i=1}^{N} \hat u_i = 0 \qquad (42)

▶ The sample covariance between xk,i and ûi is zero:

\frac{1}{N}\sum_{i=1}^{N} x_{k,i}\,\hat u_i = 0, \quad \forall k = 1, \ldots, K \qquad (43)

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 133 / 288
Estimating σ 2 - naive estimator

A naive estimator of σ 2 would be the sample variance of the OLS residuals û1 , . . . , ûN :
\tilde\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\Big(\hat u_i - \frac{1}{N}\sum_{i=1}^{N}\hat u_i\Big)^2 = \frac{1}{N}\sum_{i=1}^{N}\hat u_i^2 = \frac{SSR}{N}

where we used (42) and SSR = \sum_{i=1}^{N}\hat u_i^2 is the sum of squared residuals.

However, due to the linear dependence between the OLS residuals,


▶ û1, . . . , ûN is not a sample of independent random variables, thus
▶ σ̃² is a biased estimator of σ².

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 134 / 288
Estimating σ 2

▶ Due to the linear dependence between the OLS residuals, only (N − K − 1)


residuals can be chosen independently.
▶ This number is often abbreviated as df and referred to as the degrees of freedom.

▶ An unbiased estimator of the error variance σ 2 in a homoskedastic multiple


regression model is given by:

\hat\sigma^2 = \frac{SSR}{df} \qquad (44)

where
▶ SSR = \sum_{i=1}^{N}\hat u_i^2 is the sum of squared OLS residuals,
▶ df = (N − K − 1), N is the number of observations, and
▶ K is the number of predictors X1 , . . . , XK .
Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 135 / 288
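A quick check of (44) against what R reports (simulated data):

set.seed(10)
N <- 100; K <- 2
x1 <- rnorm(N); x2 <- rnorm(N)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(N, sd = 2)
fit <- lm(y ~ x1 + x2)
SSR <- sum(resid(fit)^2)
SSR / (N - K - 1)        # sigma_hat^2 from (44)
summary(fit)$sigma^2     # the same value (squared residual standard error)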
Standard Errors of the OLS Estimator

 
▶ The standard deviation sd(β̂j) of the OLS estimator given in (40) depends on σ = √σ².

▶ To evaluate the estimation error for a given data set in practical regression analysis,
σ 2 is substituted by the estimator σ̂ 2 given in (44).

▶ This yields the so-called standard error of the OLS estimator:

Standard Error of the OLS Estimator


se(\hat\beta_j) = \sqrt{\hat\sigma^2}\,\sqrt{\big[(X'X)^{-1}\big]_{j+1,\,j+1}} \qquad (45)

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 136 / 288
R / Eviews Class Exercise

R / EViews (and other software packages) report for each predictor the OLS estimator
together with the standard errors:
▶ Case Study profit, work file profit
▶ Case Study Chicken, work file chicken
▶ Case Study Marketing, work file marketing
⇒ R-code code_eco_I.R

Note: the standard errors computed by R / EViews (and other software packages) are
valid only under the assumptions made above, in particular, homoskedasticity.

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 137 / 288
Quantifying the model fit - simplest model

▶ How well does the multiple regression model (41) explain the variation in Y ?
▶ Compare it with the following simple model without any predictors:

Y = β0 + ũ (46)

▶ The OLS estimator β̂0, which minimizes the sum of squared residuals

\sum_{i=1}^{N} (y_i - \gamma_0)^2

over all candidate values γ0, is given by β̂0 = ȳ.

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 138 / 288
Coefficient of Determination - TSS

▶ In the model without any predictors (46), the sum of squared residuals is called the total sum of squares (TSS):

TSS = \sum_{i=1}^{N} (y_i - \bar y)^2

(Note that TSS = N · sy²)

▶ Is it possible to reduce the sum of squared residuals of the simple model (46), i.e.
TSS, by including the predictor variables X1 , . . . , XK as in (41)?

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 139 / 288
Coefficient of Determination R 2

1. The sum of squared residuals SSR of the multiple regression model (41) is never larger than the sum of squared residuals TSS of the simple model (46):

SSR ≤ TSS (47)

2. The coefficient of determination R² of the multiple regression model (41) is defined as:

R^2 = \frac{TSS - SSR}{TSS} = 1 - \frac{SSR}{TSS} \qquad (48)

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 140 / 288
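A quick check of (48) (simulated data):

set.seed(11)
N <- 100
x <- rnorm(N)
y <- 1 + 0.5 * x + rnorm(N)
fit <- lm(y ~ x)
TSS <- sum((y - mean(y))^2)
SSR <- sum(resid(fit)^2)
1 - SSR / TSS             # R^2 from (48)
summary(fit)$r.squared    # identical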
Coefficient of Determination - Proof

Proof of (47)
The following variance decomposition holds:
TSS = \sum_{i=1}^{N}(y_i - \hat y_i + \hat y_i - \bar y)^2 = \sum_{i=1}^{N}\hat u_i^2 + 2\sum_{i=1}^{N}\hat u_i(\hat y_i - \bar y) + \sum_{i=1}^{N}(\hat y_i - \bar y)^2

Using the algebraic properties (42) and (43) of the OLS residuals, we obtain:

\sum_{i=1}^{N}\hat u_i(\hat y_i - \bar y) = \hat\beta_0\sum_{i=1}^{N}\hat u_i + \hat\beta_1\sum_{i=1}^{N}\hat u_i x_{1,i} + \ldots + \hat\beta_K\sum_{i=1}^{N}\hat u_i x_{K,i} - \bar y\sum_{i=1}^{N}\hat u_i = 0

Therefore:

TSS = SSR + \sum_{i=1}^{N}(\hat y_i - \bar y)^2 \ge SSR

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 141 / 288
Coefficient of Determination - Interpretation

The coefficient of determination R2 is a measure of goodness-of-fit:


▶ If SSR ≈ TSS, then little is gained by including the predictors;
⇒ R2 is close to 0;
⇒ The multiple regression model explains the variation in Y hardly better than the
simple model (46).
▶ If SSR ≪ TSS, then much is gained by including all predictors;
⇒ R2 is close to 1;
⇒ The multiple regression model explains the variation in Y much better than the
simple model (46).

Software packages like R / EViews report SSR and R2 .

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 142 / 288
Coefficient of Determination - examples

[Figure: two example data sets with the fitted regression using price as predictor and the no-predictor model. Left: SSR = 9.53, TSS = 120.05, R² = 0.92 — the predictor explains most of the variation. Right: SSR = 8.36, TSS = 8.66, R² = 0.035 — the predictor adds almost nothing beyond the simple model.]

Part IV: Expected Value and Variance of the OLS Estimator OLS Residuals 143 / 288
Part V
Testing Hypotheses (One Coefficient)
Outline

Part V: Testing Hypotheses (One Coefficient)

▶ Testing Hypotheses - One Coefficient

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 145 / 288
Testing Hypothesis

▶ Multiple regression model:

Y = β0 + β1 X1 + . . . + βj Xj + . . . + βK XK + u, (49)

▶ Does the predictor variable Xj exert an influence on the conditional mean E(Y |X1, . . . , XK) of the response variable Y if we control for all other variables X1, . . . , Xj−1, Xj+1, . . . , XK?

Formally,

βj = 0 ?

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 146 / 288
Understanding the Testing Problem

▶ Simulate data from a multiple regression model:

      Y = 0.2 − 1.8X1 + 0X2 + u,    u | X1, X2 ∼ N(0, σ²)

▶ Run OLS estimation for this model:

      Y = β0 + β1X1 + β2X2 + u,    u | X1, X2 ∼ N(0, σ²)

  to obtain (β̂0, β̂1, β̂2).

▶ Is β̂2 different from 0?

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 147 / 288
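A small simulation sketch of this experiment (assumed settings: N = 100, σ² = 0.1, unit variances and correlation 0.7 between X1 and X2, as on the next slide):

# Simulate one data set with a redundant predictor X2 and run OLS
set.seed(1)
N  <- 100
x1 <- rnorm(N)
x2 <- 0.7 * x1 + sqrt(1 - 0.7^2) * rnorm(N)   # Cor(X1, X2) = 0.7, Var(X2) = 1
u  <- rnorm(N, sd = sqrt(0.1))
y  <- 0.2 - 1.8 * x1 + 0 * x2 + u             # true beta2 = 0

coef(lm(y ~ x1 + x2))   # beta2-hat is not exactly 0 in a single sample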
Understanding the Testing Problem

[Scatter plot of the OLS estimates across simulated data sets with N = 100, σ² = 0.1, σ²X1 = σ²X2 = 1, µX1 = µX2 = 0, σX1X2 = 0.7; β̂1 (important variable) on the horizontal axis, β̂2 (redundant variable) on the vertical axis.]
The OLS estimator β̂2 of β2 = 0 differs from 0 for a single data set, but is 0 on average.

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 148 / 288
Understanding the Testing Problem

▶ OLS estimation for the true model in comparison to estimating a model with a
redundant predictor:
⇒ including the redundant predictor X2 increases the estimation error for the
other parameters!
[Scatter plots of (β̂0, β̂1) across simulated data sets: left panel for the model with one predictor (true model), right panel for the model with two predictors (including the redundant X2).]

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 149 / 288
Testing of Hypotheses

Several issues arise:


▶ What can we learn from the data about hypotheses concerning the unknown
parameters in the regression model, especially about the hypothesis that βj = 0?
▶ May we reject the hypothesis βj = 0 given data?
▶ Testing if βj = 0 is not only of importance for the substantive scientist, but also
from an econometric point of view: Excluding redundant variables may increase
efficiency of estimation of non-zero parameters!

It is possible to answer these questions if we make additional assumptions about the


error term u in a multiple regression model.

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 150 / 288
The Classical Regression Model

Model Assumption in the Classical Linear Regression Model


The error u in the multiple regression model (49) is independent of X1, . . . , XK and follows a normal distribution:

      u ∼ N(0, σ²)     (50)

This assumption implies the more general assumptions (26) and (38):

      E(u|X1, . . . , XK) = E(u) = 0
      V(u|X1, . . . , XK) = V(u) = σ²

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 151 / 288
The Classical Regression Model

▶ It follows that the conditional distribution of Y given X1, . . . , XK is a normal distribution:

      Y |X1, . . . , XK ∼ N(β0 + β1X1 + . . . + βjXj + . . . + βKXK, σ²)

▶ Furthermore, because the observations are a random sample, the error vector u has a multivariate normal distribution with independent components:

      u ∼ NN(0, σ²I)

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 152 / 288
Multivariate Normal Distributions with Independent
Components
Density of the bivariate normal distribution N2 (0, σ 2 I) with σ 2 = 0.5:

[Surface plot and circular contour plot of the density over (x1, x2).]

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 153 / 288
Multivariate Normal Distributions with Independent
Components
1000 observations from N2 (0, 0.5I) in comparison to 100α%-confidence region (from
the left to the right: α = 0.25, α = 0.5, α = 0.95)
[Scatter plots with circular confidence regions; observed relative frequencies inside the regions: 0.242 (α = 0.25), 0.48 (α = 0.5), 0.954 (α = 0.95).]

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 154 / 288
Multivariate Normal Distributions with Dependent
Components
Density of the bivariate normal distribution N2 (µ, Σ) with
      µ = (2, −3)′,    Σ = ( 4    3.2
                             3.2    7 )

[Surface plot and contour plot of the density over (x1, x2); the contours are tilted ellipses, reflecting the positive correlation.]
Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 155 / 288
Multivariate Normal Distributions with Dependent
Components

1000 observations from N2 (µ, Σ) in comparison to 100α%-confidence region (from the


left to the right: α = 0.25, α = 0.5, α = 0.95)
[Scatter plots with elliptical confidence regions; observed relative frequencies inside the regions: 0.234 (α = 0.25), 0.472 (α = 0.5), 0.94 (α = 0.95).]

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 156 / 288
Distribution of the OLS Estimator
Using (37), we obtain:

      β̂ − β ∼ NK+1(0, Cov(β̂)),    Cov(β̂) = σ²(X′X)⁻¹

All marginal distributions are normal:

      β̂j − βj ∼ N(0, sd(β̂j)²),

thus

      (β̂j − βj) / sd(β̂j) ∼ N(0, 1)     (51)

Note: Deviations between the true value and the OLS estimator are usually correlated:

      β̂ − β ∼ NK+1(0, σ²(X′X)⁻¹)
Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 157 / 288
Testing a Single Coefficient: t-Test

▶ If the null hypothesis βj = 0 is valid, then possible differences between the OLS estimator β̂j and 0 may be quantified using the following inequality:

      |β̂j| / sd(β̂j) ≤ cα     (52)

  where cα is equal to the (1 − α/2)-quantile of the standard normal distribution.

▶ This can be used to construct a test statistic:

      tj = β̂j / sd(β̂j)     (53)

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 158 / 288
Testing a Single Coefficient: t-Test

If (50) holds and σ 2 is known, then tj follows a standard normal distribution under the
null hypothesis:
▶ Choose a significance level α
▶ Determine the corresponding critical value cα
▶ If |tj| > cα: reject the null hypothesis (the risk of rejecting the null hypothesis although it is true is at most α)
▶ If |tj| ≤ cα: do not reject the null hypothesis (the risk of "not rejecting" a wrong null hypothesis may be arbitrarily large)

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 159 / 288
Choice of cα when σ 2 is unknown

 
▶ If σ² is unknown and estimated as described above, then sd(β̂j) is substituted by se(β̂j), yielding the test statistic:

      tj = β̂j / se(β̂j)     (54)

▶ Choosing the quantiles of the normal distribution would lead to a test which rejects the true null hypothesis more often than desired, e.g. for α = 0.05 and K = 3:
N 10 20 30 40 50 100
P(reject H0 ) 0.09 0.07 0.06 0.06 0.06 0.05

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 160 / 288
Choice of cα when σ 2 is unknown

▶ The reason for this phenomenon is that tj no longer follows a normal distribution,
but a tdf - distribution where df = (N − K − 1).
▶ The critical values tdf,1−α/2 depend on df and are equal to the quantiles of the tdf
distribution.
E.g., for α = 0.05 and for a regression model with 3 parameters, these values are
approximately:

df = N − 3 7 17 27 37 47 97 ∞
tdf,0.975 2.36 2.11 2.05 2.03 2.01 1.98 1.96

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 161 / 288
The Student t distribution

95% region for t distribution with 2 degrees of freedom


t2,0.975 ≈ 4.30
Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 162 / 288
The Student t distribution

95% region for t distribution with 3 degrees of freedom



t3,0.975 ≈ 3.18
Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 163 / 288
The Student t distribution

95% region for t distribution with 5 degrees of freedom



t5,0.975 ≈ 2.57
Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 164 / 288
The Student t distribution

95% region for t distribution with 10 degrees of freedom



t10,0.975 ≈ 2.23
Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 165 / 288
The Student t distribution

95% region for t distribution with 30 degrees of freedom



t30,0.975 ≈ 2.04
Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 166 / 288
The Student t distribution

95% region for the standard normal distribution



The tdf distribution converges to the standard normal for large df: t∞,0.975 ≈ 1.96
Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 167 / 288
The p-value

The p-value is derived from the distribution of the t-statistic under the null
hypothesis and is easier to interpret than the t-statistic which has to be compared
to the correct quantiles:
▶ Choose a significance level α
▶ If p < α: reject the null hypothesis (the risk of rejecting the null hypothesis although it is true is at most α)
▶ If p ≥ α: do not reject the null hypothesis (the risk of "not rejecting" a wrong null hypothesis may be arbitrarily large)

An Old Saying. . .
If the p is low, the null must go
If the p is high, the null will fly

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 168 / 288
R / EViews Class Exercise

Have a look in R how to formulate sensible null hypotheses and how to test them using
the t-statistic and the p-value:
▶ Case Study Profit, workfile profit
▶ Case Study Chicken, workfile chicken
▶ Case Study Marketing, workfile marketing
⇒ R-code code_eco_I.R

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 169 / 288
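For orientation, a hedged sketch of what such a test looks like in R (illustrative only, not the contents of code_eco_I.R); dat, y, x1, x2 are hypothetical names.

# t-statistic and p-value for a single coefficient, computed by hand
fit <- lm(y ~ x1 + x2, data = dat)

est <- coef(fit)["x1"]
se  <- sqrt(diag(vcov(fit)))["x1"]
df  <- df.residual(fit)                 # df = N - K - 1

t_stat <- est / se
p_val  <- 2 * pt(-abs(t_stat), df = df)
cbind(t_stat, p_val)                    # matches the x1 row of summary(fit)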
Case Study Chicken

The t-statistic for the variable income is equal to 1.024, p-value: 0.319 (rounded)
[Density of the t-statistic under the null hypothesis with the observed value 1.024 marked; the two shaded tail areas of 0.16 each sum to the p-value.]

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 170 / 288
Case Study Chicken

The t-statistic for the variable ppork is equal to 3.114, p-value: 0.006
[Density of the t-statistic under the null hypothesis with the observed value 3.114 marked; the two shaded tail areas of 0.003 each sum to the p-value.]

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 171 / 288
Understanding p-Values

▶ A small p-value shows that the value observed for the t-statistic (or an even more
“extreme” value) is unlikely under the null hypothesis, thus we reject the null
hypothesis for small p-values.
⇒ There is substantial evidence in the data that βj ̸= 0.
▶ A p-value considerably larger than 0 shows that the observed value (or an even more "extreme" value) of the t-statistic is plausible under the null hypothesis, thus we do not reject the null hypothesis for large p-values.
⇒ There is little evidence in the data that βj ̸= 0.
Note that not rejecting the null does not necessarily mean that βj = 0, because the risk of accepting a wrong null hypothesis is not controlled!

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 172 / 288
Confidence Intervals for the Unknown Coefficients

The marginal distribution (51) is also useful for obtaining 100(1 − α)% confidence
regions for the unknown regression coefficients (e.g., α = 0.05 leads to a 95%
confidence region)

Two-Sided Confidence Regions

      P( −c1−α/2 ≤ (β̂j − βj)/sd(β̂j) ≤ c1−α/2 ) = 1 − α     (55)

where cp is the p-quantile of the standard normal distribution

The confidence interval reads:

      [ β̂j − c1−α/2 sd(β̂j),  β̂j + c1−α/2 sd(β̂j) ]

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 173 / 288
Confidence Intervals for the Unknown Coefficients

One-Sided Confidence Regions

      P( (β̂j − βj)/sd(β̂j) ≤ c1−α ) = 1 − α
      P( −c1−α ≤ (β̂j − βj)/sd(β̂j) ) = 1 − α

This yields (with probability 1 − α):

▶ β̂j − c1−α sd(β̂j) is a lower bound for βj
▶ β̂j + c1−α sd(β̂j) is an upper bound for βj

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 174 / 288
Confidence Intervals for the Unknown Coefficients

   
If σ² is unknown, then sd(β̂j) is substituted by se(β̂j). Instead of (51), we obtain with df = (N − K − 1):

      (β̂j − βj) / se(β̂j) ∼ tdf

If tdf,p is the p-quantile of the tdf-distribution, this yields:

▶ βj lies in [ β̂j − tdf,1−α/2 se(β̂j),  β̂j + tdf,1−α/2 se(β̂j) ]
▶ β̂j + tdf,1−α se(β̂j) is an upper bound for βj
▶ β̂j − tdf,1−α se(β̂j) is a lower bound for βj

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 175 / 288
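A minimal R sketch of these intervals (hypothetical model fit as above); confint() uses the same t-quantiles internally.

# 95% confidence interval for a single coefficient based on the t_df quantile
df  <- df.residual(fit)
est <- coef(fit)["x1"]
se  <- sqrt(diag(vcov(fit)))["x1"]

c(lower = est - qt(0.975, df) * se,
  upper = est + qt(0.975, df) * se)

confint(fit, "x1", level = 0.95)        # same interval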
More about the distribution of the OLS estimator

▶ For any subset of coefficients β̃ = (βj1, . . . , βjq)′, the OLS estimator β̂̃ = (β̂j1, . . . , β̂jq)′ follows a multivariate normal distribution:

      β̂̃ − β̃ ∼ Nq(0, Cov(β̂̃))     (56)

  where Cov(β̂̃) is obtained from the rows and columns j1, . . . , jq of Cov(β̂)

▶ This result may be used to construct 95%-confidence ellipsoids for all pairs of parameters (βj1, βj2)

Part V: Testing Hypotheses (One Coefficient) Testing Hypotheses - One Coefficient 176 / 288
Part VI
Testing Hypotheses (More Coefficients)
Testing More Than One Coefficient

▶ Testing the null hypothesis βj = 0 based on tj is only valid if all other parameters
remain in the model.
▶ Often, we want to test joint hypotheses about our parameters.
▶ E.g., if the tj -statistic is not significant for more than one parameter j1 , . . . , jq , then
one needs to test, if βj1 = 0, βj2 = 0, . . . , βjq = 0 simultaneously.
▶ We cannot simply check each tj -statistic separately, as it is possible for jointly
insignificant regressors to be individually significant (and vice versa).

Part VI: Testing Hypotheses (More Coefficients) 178 / 288


Testing More Than One Coefficient

Joint Hypothesis Testing


Given the data, is it possible to reject the null hypothesis βj1 = 0, βj2 = 0, . . . , βjq = 0?

Reject the null hypothesis, if the distance between the OLS estimator β̂̃ = (β̂j1, . . . , β̂jq)′ and 0 is "large" (one-sided test).

The corresponding test statistic has to take into account that


▶ the standard deviations of the various OLS estimators are different;
▶ deviations of the OLS estimators from the true value are likely to be correlated.

Part VI: Testing Hypotheses (More Coefficients) 179 / 288


Testing More Than One Coefficient

 
▶ Aggregate tjl = β̂jl / sd(β̂jl) for l = 1, . . . , q, e.g., by taking the sum of squared t-statistics?

▶ If the deviations of the OLS estimators β̂j1, . . . , β̂jq from the true values are uncorrelated, then the aggregated test statistic

      Σ_{l=1}^q β̂jl² / sd(β̂jl)²

  is the sum of q independent squared standard normal random variables.

▶ Such a random variable follows a χ²q-distribution with q degrees of freedom.

Part VI: Testing Hypotheses (More Coefficients) 180 / 288


The χ2q distribution

Left hand side: density of the χ2q -distribution; right hand side: density of the random
variable X /q, where X ∼ χ2q ; degrees of freedom q = ν ∈ {2, 5, 10, 20}.
Part VI: Testing Hypotheses (More Coefficients) 181 / 288
Testing More Than One Coefficient

Usually, the deviations of the OLS estimators β̂j1, . . . , β̂jq from the true values are correlated:
▶ Transform the deviations to a coordinate system with independent standard normal random variables. In this new coordinate system, the sum of squared deviations follows a χ²q-distribution with q degrees of freedom. The appropriate transformation reads:

      β̂̃′ Cov(β̂̃)⁻¹ β̂̃ ∼ χ²q

▶ Note: The χ2q -distribution results only if σ 2 is known.

Part VI: Testing Hypotheses (More Coefficients) 182 / 288


The F-Test

▶ The F -statistic is obtained by substituting the unknown variance σ 2 by σ̂ 2 and


dividing by q.
▶ If the null hypothesis βj1 = 0, βj2 = 0, . . . , βjq = 0 is true, then the F -statistic
follows a Fq,df -distribution with parameters q (number of tested coefficients) and
df = N − K − 1.
▶ Remark 1: For q = 1, F = tj2 , where tj is the t-statistic.
▶ Remark 2: The F-statistic is the ratio of two (independent) sums of squares, each divided by its degrees of freedom, i.e. of a χ²q/q and a χ²df/df variable, where df = N − K − 1.

Part VI: Testing Hypotheses (More Coefficients) 183 / 288


The F-Distribution

Density of the Fq,df -distribution with parameters df = 100 and q = 1, . . . , 5


Part VI: Testing Hypotheses (More Coefficients) 184 / 288
The F-Test

Reject the null hypothesis, if


▶ the F -statistic is larger than the critical value from the corresponding
Fq,df -distribution (one-sided test);
▶ the corresponding p-value is smaller than the significance level. A p-value close to
0 shows that the value observed for the F -statistic (or an even larger value) is
unlikely under the null hypothesis.
⇒ At least one of the coefficients βj1 , . . . , βjq is different from 0.

Part VI: Testing Hypotheses (More Coefficients) 185 / 288


The F-Test

Do not reject the null-hypothesis, if


▶ the F -statistic is smaller than the critical value from the corresponding
Fq,df -distribution (one-sided test);
▶ the corresponding p-value is larger than the significance level. A p-value
considerably larger than 0 shows that the observed value for the F -statistic (or an
even larger value) is plausible under the null hypothesis.
⇒ There is little evidence in the data that we should reject the null hypothesis that all
coefficients βj1 , . . . , βjq are equal to 0.

Part VI: Testing Hypotheses (More Coefficients) 186 / 288


Case Study Marketing

The F-statistic for testing the joint hypothesis βgender = βage = 0 is equal to 2.086, p-value: 0.124 (rounded)

Part VI: Testing Hypotheses (More Coefficients) 187 / 288


Case Study Marketing

The F-statistic for testing the joint hypothesis βgender = βage = βprice = 0 is equal to 451.572, p-value: 0.000 (rounded)

Part VI: Testing Hypotheses (More Coefficients) 188 / 288


An Alternative Form of the F-Statistic

Equivalent forms of the F -statistic show that the F-statistic measures the loss of fit
from imposing the q restrictions on the model:

      F = [(SSRr − SSR)/q] / [SSR/df] = [(R² − R²r)/q] / [(1 − R²)/df]

Here,
▶ SSR is the minimum sum of squared residuals and R2 is the coefficient of
determination for the unrestricted regression model.
▶ SSRr is the minimum sum of squared residuals and R2r is the coefficient of
determination for the restricted regression model.

Note that SSRr ≥ SSR and R2r ≤ R2 .

Part VI: Testing Hypotheses (More Coefficients) 189 / 288
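A sketch of this "loss of fit" form of the F-statistic in R, assuming a hypothetical unrestricted model with predictors x1, x2, x3 and the joint null hypothesis β2 = β3 = 0:

# F-statistic from the restricted and unrestricted sums of squared residuals
fit_u <- lm(y ~ x1 + x2 + x3, data = dat)   # unrestricted model
fit_r <- lm(y ~ x1, data = dat)             # restricted model (q = 2 restrictions)

ssr_u <- sum(resid(fit_u)^2)
ssr_r <- sum(resid(fit_r)^2)
q     <- 2
df    <- df.residual(fit_u)                 # N - K - 1

F_stat <- ((ssr_r - ssr_u) / q) / (ssr_u / df)
p_val  <- pf(F_stat, q, df, lower.tail = FALSE)

anova(fit_r, fit_u)                         # reports the same F-statistic and p-value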


Testing the whole regression model

▶ In the standard regression output of R, an F -statistic is available by default.


▶ This F -statistic tests the hypothesis that none of the predictor variables influences
the response variable:

β1 = 0, β2 = 0, . . . , βK = 0

▶ In this case, R²r = 0, and the F-statistic reads:

      F = (R²/K) / [(1 − R²)/df]
▶ Under the null hypothesis, F follows a FK ,df -distribution. Hopefully, the
corresponding p-value is close to 0. Otherwise, the usefulness of the whole
regression model is somewhat doubtful!

Part VI: Testing Hypotheses (More Coefficients) 190 / 288


Linear Combinations of Parameters

Suppose we want to test the hypothesis that two regression coefficients are equal,
e.g. β1 = β2 . This is equivalent to testing the following linear constraint (null
hypothesis):

β1 − β2 = 0 (57)

Test statistic based on the difference of the OLS estimators β̂1 − β̂2 :
▶ If |β̂1 − β̂2 | is small, then the hypothesis (57) is not rejected.
▶ If |β̂1 − β̂2 | is large, then the hypothesis (57) is rejected.

What is the distribution of β̂1 − β̂2 under the null hypothesis?

Part VI: Testing Hypotheses (More Coefficients) 191 / 288


Testing Linear Combinations of Parameters

Testing the linear constraint β1 − β2 = 0 for β = (β0, β1, . . . , βK) is equivalent to testing

      Lβ = 0    where    L = [ 0  1  −1  0  · · ·  0 ]     (58)

What is the distribution of Lβ̂?

Using β̂ − β ∼ NK+1(0, Cov(β̂)), we obtain the following:

      Lβ̂ − Lβ ∼ Nq(0, Cov(β̂̃))

where

      Cov(β̂̃) = L Cov(β̂) L′

Part VI: Testing Hypotheses (More Coefficients) 192 / 288
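A sketch of the corresponding Wald-type F-test in R for the single constraint β1 − β2 = 0 in a hypothetical model y ~ x1 + x2 + x3 (q = 1 here; the same construction works for several rows of L):

# F-test of the linear constraint L beta = 0
fit <- lm(y ~ x1 + x2 + x3, data = dat)

L <- matrix(c(0, 1, -1, 0), nrow = 1)   # picks out beta1 - beta2
b <- coef(fit)
V <- vcov(fit)                          # sigma^2-hat * (X'X)^(-1)
q <- nrow(L)

F_stat <- t(L %*% b) %*% solve(L %*% V %*% t(L)) %*% (L %*% b) / q
p_val  <- pf(as.numeric(F_stat), q, df.residual(fit), lower.tail = FALSE)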


Testing Linear Constraints

The F -statistic may also be used to test more than one linear constraint on the
coefficients, i.e. Lβ = 0, where L is a q × (K + 1)-matrix with q > 1.

We have seen that the OLS estimator β̂̃ = Lβ̂ follows the multivariate normal distribution Nq(0, Cov(β̂̃)) under the null hypothesis.

The F -statistic is constructed as above and follows an Fq,df -distribution, where q is the
number of linear constraints.

Part VI: Testing Hypotheses (More Coefficients) 193 / 288


Part VII
Further Properties of the OLS Estimator and
Dummy Variables
Outline

Part VII: Further Properties of the OLS Estimator and


Dummy Variables

▶ Further Properties of the OLS Estimator

▶ Dummy Variables

Part VII: Further Properties of the OLS Estimator and Dummy Variables Further Properties of the OLS Estimator 195 / 288
Further Properties of the OLS Estimator

The Gauss Markov Theorem


Under the assumptions (26) and (38), the OLS estimator is BLUE, i.e. the
▶ Best
▶ Linear
▶ Unbiased
▶ Estimator
▶ Here, “best” means that any other linear unbiased estimator β̃ results in a larger variance than the OLS estimator β̂:
  ▶ sd(β̃j) ≥ sd(β̂j)
  ▶ Cov(β̃) − Cov(β̂) is positive semi-definite
▶ “Linear” means that β̃ = Cy, where y = (y1, . . . , yN)′ and where C is some matrix independent of β, but possibly dependent on the design matrix X.
▶ “Unbiased” means that E(β̂) = β.
Part VII: Further Properties of the OLS Estimator and Dummy Variables Further Properties of the OLS Estimator 196 / 288
Efficiency of OLS Estimation

Under the normality assumption (50) about the error term, the OLS estimator is not
only BLUE. A stronger optimality result holds:
Efficiency of OLS estimation
Under assumption (50), the OLS estimator β̂ is the minimum variance unbiased
estimator.
Any other unbiased estimator β̃ (which need not be a linear estimator) has larger standard deviations than the OLS estimator:
▶ sd(β̃j) ≥ sd(β̂j)
▶ Cov(β̃) − Cov(β̂) is positive semi-definite
However, if assumption (50) is violated, other (nonlinear) estimation methods may be
more efficient.
Part VII: Further Properties of the OLS Estimator and Dummy Variables Further Properties of the OLS Estimator 197 / 288
Consistency of OLS Estimation

Let β̂N be an estimator for β, based on sample size N.


Then, β̂N is a consistent estimator for β, if for every ϵ > 0 the following holds:
 
      P( |β̂N − β| ≥ ϵ ) → 0   as N → ∞

or, equivalently,

      P( |β̂N − β| < ϵ ) → 1   as N → ∞.

Note that ϵ may be arbitrarily small!

Part VII: Further Properties of the OLS Estimator and Dummy Variables Further Properties of the OLS Estimator 198 / 288
Consistency of OLS Estimation

▶ Consistency means that the OLS estimator converges “in probability” to the true
value with increasing number of observations N.  
▶ A sufficient condition for this convergence in probability is that E(β̂N) → β and sd(β̂N) → 0 as N → ∞.
▶ Under the Gauss Markov assumptions, the OLS estimator is a consistent estimator
of β.
▶ Note that consistency also holds if the normality assumption (50) is violated.

Part VII: Further Properties of the OLS Estimator and Dummy Variables Further Properties of the OLS Estimator 199 / 288
Consistency of OLS Estimation

"Proof"
For each j = 1, . . . , K:
▶ The OLS estimator is unbiased, i.e. E(β̂j) = βj.
▶ The standard deviation sd(β̂j) goes to 0 for N → ∞:

      sd(β̂j) = σ / √( N s²xj (1 − R²j) )  → 0   as N → ∞

Part VII: Further Properties of the OLS Estimator and Dummy Variables Further Properties of the OLS Estimator 200 / 288
Outline

Part VII: Further Properties of the OLS Estimator and


Dummy Variables

▶ Further Properties of the OLS Estimator

▶ Dummy Variables

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 201 / 288
Regression Models with Dummy Variables as
Predictors

▶ A dummy variable (binary variable) D is a variable that assumes two values only: 0
or 1
▶ Examples: EU member (D = 1 if EU member, 0 otherwise), brand (D = 1 if
product has a particular brand, 0 otherwise), gender (D = 1 if male, 0 otherwise)
▶ Note that the labelling is not unique, a dummy variable could be labeled in two
ways, i.e. for variable gender:
▶ D = 1 if male, D = 0 if female
▶ D = 1 if female, D = 0 if male

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 202 / 288
Regression Models with Dummy Variables as
Predictors

Consider a regression model with one continuous variable X and one dummy variable D:

      Y = β0 + β1D + β2X + u

If D = 0, then:

      Y = β0 + β2X + u,    with intercept β0

If D = 1, then:

      Y = (β0 + β1) + β2X + u,    with intercept β0 + β1

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 203 / 288
Regression Models with Dummy Variables as
Predictors
Example: Y = 20 + 3.2D − 2.5X
[Plot of the two parallel regression lines: the line for D = 1 lies 3.2 units above the line for D = 0 for every value of X.]

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 204 / 288
Regression Models with Dummy Variables as
Predictors

Interpretation:
▶ The observed units are split into 2 groups according to D (e.g. into men and
women).
▶ The group with D = 0 is called the baseline (e.g. men).
▶ The regression coefficient β1 of D quantifies the expected difference in the dependent variable Y between the other group (e.g. women) and the baseline, while holding all other variables (e.g. X) fixed.
▶ The null hypothesis β1 = 0 corresponds to the assumption that the conditional average value of Y given all remaining regressors is the same for both groups.

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 205 / 288
Regression Models with Dummy Variables as
Predictors

Consider model
Y = 20 + 3.2D − 2.5X + u
where D = 1 if female. Assume that X = 4:
▶ expected value of Y for a man: E(Y |X = 4, D = 0) = 20 − 2.5 · 4 = 10
▶ expected value of Y for a woman: E(Y |X = 4, D = 1) = 20 + 3.2 − 2.5 · 4 = 13.2
▶ expected difference between women and men is equal to β1 = 3.2

The expected difference between women and men is equal to β1 = 3.2 for all values of
X!

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 206 / 288
Combining More Dummy Variables

Estimate a model where D1 is the gender (1: female, 0: male), D2 is the brand (1:
specific brand, 0: no-name), and P is the price:

Y = β0 + β1 D1 + β2 D2 + β3 P + u

▶ β0 corresponds to the baseline (male, no-name product).


▶ β1 corresponds to the difference in the expected rating between male and female
consumers (same brand, same price).
▶ β2 corresponds to the difference in the expected rating between the specific brand
and a no-name product (same person, same price).

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 207 / 288
Categorical Variables

We can use dummy variables to control for characteristics with multiple categories (K
categories ⇒ K − 1 dummies)

Suppose one of the predictors is the highest level of education. Such variables are often
coded in the following way:
edu
1 high school dropout
2 high school degree
3 college degree

What is the expected effect of education on a variable Y , e.g. hourly wages?

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 208 / 288
Categorical Variables

Including edu directly into a linear regression model would mean that the effect of a
high school degree compared to a drop out is the same as the effect of a college degree
compared to a high school degree.

To include the highest level of education as predictor in a regression model, define 2


dummy variables D1 and D2 :
edu D1 D2
1 high school dropout 0 0
2 high school degree 1 0
3 college degree 0 1

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 209 / 288
Categorical Variables

This yields:
▶ Baseline (all dummies 0): high school dropout
▶ D1 = 1, if highest degree from high school, 0 otherwise
▶ D2 = 1, if college degree, 0 otherwise

Include D1 and D2 as dummy predictors in a regression model:

Y = β0 + β1 D1 + β2 D2 + β3 X + u

The intercept β0 corresponds to the baseline (D1 = 0, D2 = 0).

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 210 / 288
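In R, the K − 1 dummies are created automatically once the categorical predictor is stored as a factor; a hedged sketch with hypothetical variable names:

# Dummy coding via factor(): the first level is the baseline
dat$edu <- factor(dat$edu, levels = c(1, 2, 3),
                  labels = c("dropout", "highschool", "college"))

fit <- lm(y ~ edu + x, data = dat)
coef(fit)   # coefficients of eduhighschool and educollege are the effects
            # relative to the baseline "dropout"; the intercept is the baseline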
Categorical Variables

In other words:
▶ β1 is the effect of a high school degree compared to a drop out.
▶ β2 is the effect of a college degree compared to a drop out.

Testing hypothesis:
▶ Is the effect of a high school degree compared to a drop out the same as the effect
of a college degree compared to a high school degree?
▶ Test if 2β1 = β2 , or equivalently, test the linear hypothesis 2β1 − β2 = 0.

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 211 / 288
Case Study Marketing

There are 5 different brands of mineral water (KR, RO, VO, JU, A):
▶ Select one mineral water as baseline, e.g. KR.
▶ Introduce 4 dummy variables D1 , . . . , D4 , and assign each of them to the remaining
brands, e.g. D1 = 1, if brand is equal to RO and D1 = 0, otherwise; D2 = 1, if
brand is equal to VO and D2 = 0, otherwise; etc.

The model reads:

Y = β0 + β1 D1 + . . . + β4 D4 + β5 P + u (59)

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 212 / 288
Case Study Marketing

Interpretation of the coefficients in model (59) for a given price level P:


▶ The expected rating for the brand corresponding to the baseline is given by
β0 + β5 P.
▶ The expected rating for the brand corresponding to Dj is given by β0 + βj + β5 P.
▶ The coefficient βj measures the effect of the brand Dj in comparison to the brand
corresponding to the baseline

∆E(Y |P) = β0 + βj + β5 P − (β0 + β5 P) = βj .

▶ The difference in the expected average rating between two arbitrary brands Dj and
Dk is equal to βj − βk .
▶ Is the rating different for the brands Dj and Dk ? Test the linear hypothesis
βj − βk = 0!

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 213 / 288
Case Study Marketing

Including an additional dummy variable D5 , where D5 = 1 if brand is KR, i.e.

Y = β0 + β1 D1 + . . . + β5 D5 + β6 P + u

leads to a model which is not identified because:

D1 + D2 + . . . + D5 = 1

Hence, the set of regressors D1 , . . . , D5 is perfectly correlated with the regressor ’1’
corresponding to the intercept ⇒ EViews produces an error message indicating
difficulties with estimating the model.

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 214 / 288
Case Study Marketing

It is possible to include all 5 regressors if no constant is included in the model, with a


slightly different interpretation of the coefficients:

Y = β1 D1 + . . . + β5 D5 + β6 P + u

▶ βj is a brand specific intercept of the regression model for the brand corresponding
to Dj .
▶ For a given price level P, the expected rating for the brand corresponding to Dj is
given by βj + β6 P.
▶ The difference in the expected average rating between two arbitrary brands Dj and
Dk is (again) equal to βj − βk .

Part VII: Further Properties of the OLS Estimator and Dummy Variables Dummy Variables 215 / 288
Part VIII
Residual Diagnosis
Outline

Part VIII: Residual Diagnosis

▶ Residual Diagnostics

▶ Model Evaluation and Model Comparison

Part VIII: Residual Diagnosis Residual Diagnostics 217 / 288


Checking model assumptions

Hypothetical model:

Y = β0 + β1 X1 + . . . + βK XK + u

Estimated model:

yi = β̂0 + β̂1 x1,i + . . . + β̂K xK ,i + ûi

where ûi is the OLS residual.


▶ Due to consistency, the OLS residual ûi approaches the unobservable error ui as N
increases.
▶ Use OLS residuals ûi to test assumptions about ui .

Part VIII: Residual Diagnosis Residual Diagnostics 218 / 288


R / EViews Class Exercise

Discuss in R / EViews how to obtain the OLS residuals


▶ Case Study Profit, workfile profit
▶ Case Study Chicken, workfile chicken
▶ Case Study Marketing, workfile marketing
⇒ R-code code_eco_I.R

Part VIII: Residual Diagnosis Residual Diagnostics 219 / 288


Testing Normality

The error follows a normal distribution:


 
      u | X1, . . . , XK ∼ N(0, σ²)

▶ Roughly 95% of the OLS residuals lie between [−2σ̂, 2σ̂]; only 5% lie outside.
▶ Assumption often violated if outliers are present.
▶ Normality often improved through transformations.

Part VIII: Residual Diagnosis Residual Diagnostics 220 / 288


Testing Normality

To test normality of u, check normality of the OLS residuals ûi :


▶ Histogram
▶ Q-Q plot
▶ Skewness coefficient m3 close to 0?
▶ Kurtosis coefficient m4 close to 3?

      m3 = (1/σ̂³) · (1/N) Σ_{i=1}^N ûi³        m4 = (1/σ̂⁴) · (1/N) Σ_{i=1}^N ûi⁴

Part VIII: Residual Diagnosis Residual Diagnostics 221 / 288


Testing Normality

Jarque-Bera-Statistic:

      J = (N − K)/6 · [ m3² + (1/4)(m4 − 3)² ]     (60)

▶ Null hypothesis H0 : the errors follow a normal distribution


▶ Under H0 , J asymptotically (i.e. for N large) follows a χ22 -distribution with 2
degrees of freedom (95%-quantile χ22,0.95 = 5.9915)
▶ Reject H0 if J > χ22,0.95 (or p-value of J smaller than 0.05).

Part VIII: Residual Diagnosis Residual Diagnostics 222 / 288


Case Study Yields

yi = β0 + β1 x1,i + β2 x2,i + ui (61)

where

yi . . . yield with maturity 3 months


x1,i . . . yield with maturity 1 month
x2,i . . . yield with maturity 60 months

Demonstration in R / EViews, data yieldus.csv, see R-code code_eco_I.R.

Part VIII: Residual Diagnosis Residual Diagnostics 223 / 288


Case Study Yields

Jarque-Bera test statistic J = 910.094, p-value: 0.000 (rounded)

[Left panel: histogram of the residuals. Right panel: distribution of J under normality with the observed value 910.094 marked far in the right tail.]

Part VIII: Residual Diagnosis Residual Diagnostics 224 / 288


Case Study Profit

yi = β0 + β1 x1,i + β2 x2,i + ui (62)

where

yi . . . profit 1994
x1,i . . . profit 1993
x2,i . . . turnover 1994

Consider only large firms (i = 1, . . . , 20).

Part VIII: Residual Diagnosis Residual Diagnostics 225 / 288


Case Study Profit

Jarque-Bera test statistic J = 2.811, p-value: 0.245


[Left panel: histogram of the residuals. Right panel: distribution of J under normality with the observed value 2.811 marked; shaded tail area 0.245.]

Part VIII: Residual Diagnosis Residual Diagnostics 226 / 288


Checking Homoskedasticity

Assumption (38) claims that the variance of ui is homoskedastic, i.e.

V(u|X1 , . . . , XK ) = σ 2

▶ If this assumption is violated, the model is said to have heteroskedastic errors.


▶ This assumption is often violated because the variance of u depends on a predictor
variable.

First informal check: residual plot; more about formal tests later.

Part VIII: Residual Diagnosis Residual Diagnostics 227 / 288
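A minimal sketch of such an informal residual plot in R (hypothetical model fit and predictor x1):

# Plot the OLS residuals against the fitted values and against a predictor
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "OLS residuals")
abline(h = 0, lty = 2)

plot(dat$x1, resid(fit), xlab = "x1", ylab = "OLS residuals")
abline(h = 0, lty = 2)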


Case Study Yields
[Residuals plotted against the yield with maturity 1 month, against the yield with maturity 60 months, and over time (together with actual and fitted values).]

Variability of residuals seems to depend on X1 , X2 (and on time)! ⇝ Heteroskedasticity


Part VIII: Residual Diagnosis Residual Diagnostics 228 / 288
Checking for Assumption (26)

Assumption (26)
The model does not contain any systematic error, i.e.

E(u|X1 , . . . , XK ) = 0

▶ If assumption (26) is violated, the model is said to have a specification error:


▶ the true value of yi will be underrated, if E(ui |·) > 0
▶ the true value of yi will be overrated, if E(ui |·) < 0
▶ This assumption is often violated, when an important predictor variable has been
omitted (“omitted variables bias”) or the functional form is misspecified

Part VIII: Residual Diagnosis Residual Diagnostics 229 / 288


Checking for Assumption (26)

Example: Simulate data from a simple log-linear regression model with β̃1 = 0.2 and β2 = −1.8:

      yi = 0.2 · xi^(−1.8) · e^(ui)     (63)

▶ Residual plot for the log-linear regression model (true model).


▶ Residual plot for the linear regression model (misspecified model).

Part VIII: Residual Diagnosis Residual Diagnostics 230 / 288


Checking for Assumption (26)

[Top row ("OLS – true model, σ² = 0.01"): log(demand) against log(price) and the OLS errors against log(price). Bottom row ("OLS – misspecification"): demand against price and the OLS errors against price.]
Part VIII: Residual Diagnosis Residual Diagnostics 231 / 288
Case Study Profit

Model average profit 1994 only as a function of profit 1993:


[Left panel: GEW94 against GEW93 with the fitted regression line. Right panel: residuals against GEW93.]

Assumption (26) seems to be violated!


Part VIII: Residual Diagnosis Residual Diagnostics 232 / 288
Outline

Part VIII: Residual Diagnosis

▶ Residual Diagnostics

▶ Model Evaluation and Model Comparison

Part VIII: Residual Diagnosis Model Evaluation and Model Comparison 233 / 288
Model Comparison Using R2 and AIC/BIC

▶ Model evaluation using the coefficient of determination R2 .


▶ Problems with R2 : R2 increases with increasing number of variables, because SSR
decreases ⇒ may lead to overfitting.
▶ Model comparison using AIC and SC (BIC): Penalize the ever decreasing SSR by
including the number of parameters.

Part VIII: Residual Diagnosis Model Evaluation and Model Comparison 234 / 288
Coefficient of Determination R2

Recall: Coefficient of determination R2 can be written as follows:


      R² = (TSS − SSR)/TSS = 1 − SSR/TSS
▶ SSR is the sum of squared residuals.
▶ TSS is the total sum of squares, i.e. the sum of squared residuals of the simple
model without predictor.
▶ SSR is always smaller than TSS.
▶ If SSR is much smaller than TSS, then the regression model M1 is much better
than the simple model M0 .
▶ R2 is close to 1 if SSR ≪ TSS and close to 0 if SSR ≈ TSS, thus R2 can be used
for model selection.

Part VIII: Residual Diagnosis Model Evaluation and Model Comparison 235 / 288
R / EViews Class Exercise

Discuss in R / EViews where to find SSR and R2 ; discuss how SSR and R2 change when
number of predictors is increased
▶ Case Study Profit, workfile profit
▶ Case Study Chicken, workfile chicken
▶ Case Study Marketing, workfile marketing
⇒ R-code code_eco_I.R

Part VIII: Residual Diagnosis Model Evaluation and Model Comparison 236 / 288
Case Study Chicken (Log-Linear Model)

Predictors SSR R2
pchick 0.273487 0.647001
income 0.041986 0.945807
income, pchick 0.015437 0.980074
income, pchick, ppork 0.014326 0.981509
income, pchick, ppork, pbeef 0.013703 0.982313

Part VIII: Residual Diagnosis Model Evaluation and Model Comparison 237 / 288
Problems with R2

▶ Choosing the model with the smallest SSR (largest R2 ) leads to overfitting: R2
“automatically” increases when the number of variables increases.
▶ R2 is 1 for K = N − 1 because SSR = 0 if we include as many predictors as
observations (even if the predictors are useless!).
▶ However, the increase is small when a useless predictor is added. ⇒ penalize the
ever decreasing SSR by incorporating the number of parameters used for
estimation!

Part VIII: Residual Diagnosis Model Evaluation and Model Comparison 238 / 288
Adjusted R2

A very simple way out is to adjust R2 to cater for the number of parameters:

Adjusted R²

      R²adj = 1 − (N − 1)/(N − K − 1) · (1 − R²) = 1 − (N − 1)/(N − K − 1) · SSR/TSS = 1 − s²û / s²y

Choose the model that maximizes R2adj .

Alternatively (or better), use so-called “information criteria” AIC and SC (BIC).

Part VIII: Residual Diagnosis Model Evaluation and Model Comparison 239 / 288
Information Criteria

Information Criteria: Model Fit + Penalty

N · log(SSR/N) + N + N · log(2π) + m · (K + 1) (64)

▶ SSR: Sum of squared residuals


▶ K + 1: The number of estimable parameters in the model
▶ m = 2: AIC (Akaike Information Criterion)
▶ m = log N: SC (Schwarz Criterion), also called BIC (Bayesian IC)

Choose the model that minimizes a particular criterion.


Caveat: Implementations in R and EViews differ slightly. Hence, you may only compare
the numbers stemming from the same software. However, the implied ranking is the
same in R and in EViews.
Part VIII: Residual Diagnosis Model Evaluation and Model Comparison 240 / 288
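A sketch of criterion (64) in R for a hypothetical fitted model fit; the built-in AIC()/BIC() additionally count the error variance as a parameter, so the absolute values differ by a constant while the implied ranking is unchanged.

# Information criterion (64): model fit + penalty
ic <- function(fit, m) {
  ssr <- sum(resid(fit)^2)
  N   <- nobs(fit)
  k1  <- length(coef(fit))            # K + 1 estimable coefficients
  N * log(ssr / N) + N + N * log(2 * pi) + m * k1
}

ic(fit, m = 2)                        # AIC-type criterion
ic(fit, m = log(nobs(fit)))           # SC/BIC-type criterion
AIC(fit); BIC(fit)                    # built-in versions for comparison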
Information Criteria
The various criteria may lead to different choices; SC has a larger penalty for the model size K if the number of observations is N ≥ 8 (N > e²).

AIC and SC penalty for N = 100 as a function of the number of parameters K .


Part VIII: Residual Diagnosis Model Evaluation and Model Comparison 241 / 288
R / EViews Class Exercise

Discuss in EViews where to find R2adj , AIC, and Schwarz criterion; discuss how to choose
predictors based on these model choice criteria.
▶ Case Study Profit, workfile profit
▶ Case Study Chicken, workfile chicken
▶ Case Study Marketing, workfile marketing
⇒ R-code code_eco_I.R

Part VIII: Residual Diagnosis Model Evaluation and Model Comparison 242 / 288
Case Study Chicken (Log-Linear Model)

Predictors                       SSR       R²        AIC        BIC

pchick                           0.273487  0.647001  -30.66474  -27.25826
income                           0.041986  0.945807  -73.76486  -70.35838
income, pchick                   0.015437  0.980074  -94.77735  -90.23538
income, pchick, ppork            0.014326  0.981509  -94.49623  -88.81876
income, pchick, ppork, pbeef     0.013703  0.982313  -93.51870  -86.70573

Caveat: Mind the different implementations in R and EViews (but the result of the best
model remains the same).

Part VIII: Residual Diagnosis Model Evaluation and Model Comparison 243 / 288
Comparing Linear and Log-Linear Models

▶ The residual sum of squares SSR depends on the scale of yi , therefore AIC and SC
are scale dependent.
▶ AIC and SC cannot be used directly to compare a linear and a log-linear model.
▶ AIC and SC of the log-linear model could be matched back to the original scale by
adding 2 times the mean (EViews) or 2 times the sum (R) of the log-values of yi .

Correction Formula for AIC and SC


      R:       C = C⋆ + 2 Σ_{i=1}^N log(yi)          (65)
      EViews:  C = C⋆ + (2/N) Σ_{i=1}^N log(yi)      (66)

where C ⋆ is the model choice criterion for the log-linear model.


Part VIII: Residual Diagnosis Model Evaluation and Model Comparison 244 / 288
EViews Class Exercise: Case Study Chicken

Predictor SSR R2 AIC BIC


income, pchick (log-linear) 0.015437 0.980074 -94.77735 -90.23538
income, pchick (linear) 106.65 0.9108 108.5549 113.0969

Transform AIC and SC of the log-linear model:

AIC = −94.77735 + 2 × 84.26939 = 73.76143


SC = −90.23538 + 2 × 84.26939 = 78.3034

⇒ log-linear model is preferred.

Part VIII: Residual Diagnosis Model Evaluation and Model Comparison 245 / 288
Part IX
Advanced Multiple Regression Models
Outline

Part IX: Advanced Multiple Regression Models


▶ Quadratic Terms

▶ Interaction Terms

▶ Dummy Variables with Interaction Terms

Part IX: Advanced Multiple Regression Models Quadratic Terms 247 / 288
Models with Quadratic Terms

Sometimes there is an interest in modeling increasing or decreasing marginal effects of


certain variables. Why not capture such effects by including quadratic terms?

      Y = β0 + β1X + β2X² + u     (67)

Implications:
▶ OLS estimation of β0, β1, and β2 proceeds as discussed above, based on the predictors X1 = X and X2 = X²
▶ Although the relationship between X1 and X2 is deterministic (note that X2 = X1²), the predictors X1 and X2 are not linearly dependent; hence, OLS estimation is feasible
▶ Note that the relationship between X and E(Y |X) is non-linear

Part IX: Advanced Multiple Regression Models Quadratic Terms 248 / 288
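A sketch of how such a model is fitted in R and how the vertex is located (hypothetical variable names):

# Quadratic term via I(x^2) and position of the vertex
fit_q <- lm(y ~ x + I(x^2), data = dat)

b1 <- coef(fit_q)["x"]
b2 <- coef(fit_q)["I(x^2)"]

vertex <- -b1 / (2 * b2)    # the effect of x switches sign at this point
vertex
range(dat$x)                # effect is monotone if the vertex lies outside this range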
Models with Quadratic Terms

Two examples where X varies over all real numbers:



Left hand side: E (Y |X ) = 1 + 2X − 0.1X 2 ; right hand side: E (Y |X ) = 1 − 3X + 0.1X 2


Part IX: Advanced Multiple Regression Models Quadratic Terms 249 / 288
Models with Quadratic Terms

Same examples with X being limited to the range [0, 10].



Left hand side: E (Y |X ) = 1 + 2X − 0.1X 2 ; right hand side: E (Y |X ) = 1 − 3X + 0.1X 2


Part IX: Advanced Multiple Regression Models Quadratic Terms 250 / 288
Models with Quadratic Terms

▶ The parabola corresponding to the quadratic function (67) opens up iff β2 > 0 and
opens down iff β2 < 0
▶ The vertex (Scheitel) is obtained by setting the first derivative of E(Y |X = x) with respect to x equal to 0:

      ∂E(Y |X = x)/∂x = β1 + 2β2x = 0

  This yields that the vertex lies at x0 = −β1/(2β2); note that x0 is negative if β1 and β2 have the same sign, and positive otherwise.

Part IX: Advanced Multiple Regression Models Quadratic Terms 251 / 288
Monotonic Behavior

Often, only part of the parabola is used to describe a monotonic behavior over a certain range of X, e.g. between the smallest and the largest observed value of X.

The position of the vertex is important in this respect:


▶ Vertex outside the relevant range of X : effect is monotone
▶ Vertex within the relevant range of X : effect is not monotone

Part IX: Advanced Multiple Regression Models Quadratic Terms 252 / 288
Testing for Non-Linearity

▶ Model (67) reduces to a model which is linear in X if β2 = 0 ⇒ test the null


hypothesis H0 : β2 = 0 to test for the presence of non-linear effects.
▶ If β2 ̸= 0, non-linearity is present in model (67). In this case, β1 does not measure
the expected change in Y with respect to X , since X2 = X 2 cannot be held
constant, while X1 = X changes. Changing X changes both predictors X1 and X2 .

Part IX: Advanced Multiple Regression Models Quadratic Terms 253 / 288
Understanding the Coefficients

The instantaneous change of E(Y |X = x) is equal to the first derivative with respect to x:

      ∂E(Y |X = x)/∂x = β1 + 2β2x     (68)

Part IX: Advanced Multiple Regression Models Quadratic Terms 254 / 288
Understanding the Coefficients

▶ If β2 = 0, the expected change of E(Y |X = x ) is equal to β1 .


▶ For β2 ̸= 0, the expected change of E(Y |X = x ) depends not only on β1 , but also
on β2 and the current value x of X .
▶ The expected change of E(Y |X = x ) switches sign at the vertex / apex (=
Scheitel), i.e. at the point X0 = −β1 /(2β2 )
▶ The model describes a monotonic behaviour if only values of x are considered
which lie on one side of the vertex.

Part IX: Advanced Multiple Regression Models Quadratic Terms 255 / 288
Understanding the Coefficients

Suppose that β1 is positive while β2 is negative. Then according to the first term in
(68), increasing x will increase E(Y |X = x ), however, this positive effect becomes
smaller with increasing x . It remains positive as long as x is smaller than the vertex x0 :
      x < −β1/(2β2)
If x is larger than the vertex x0 , there is a negative effect of increasing x , which gets
larger with increasing x .

Part IX: Advanced Multiple Regression Models Quadratic Terms 256 / 288
Monotonic Behavior

Vertex smaller than the relevant range of x:

▶ β2 > 0: positive effect; β2 < 0: negative effect.
  In either case, the effect of increasing x gets bigger (in absolute value) the larger x is.

Vertex larger than the relevant range of x:

▶ β2 > 0: negative effect; β2 < 0: positive effect.
  In either case, the effect of increasing x gets smaller (in absolute value) the closer x is to the vertex.

Part IX: Advanced Multiple Regression Models Quadratic Terms 257 / 288
Monotonic Behavior
Example: E(Y |X = x) = 20 + 0.005x − 0.2x²,  1 ≤ x ≤ 5
▶ Parabola opens down because β2 = −0.2 < 0
▶ Vertex: 0.005 − 0.4x0 = 0 ⇒ x0 = 0.0125
▶ Range of x restricted to the right hand side ⇒ monotonically decreasing function

Part IX: Advanced Multiple Regression Models Quadratic Terms 258 / 288
Case Study Chicken

Estimate the model:

      Y = β0 + β1X1 + β2P1 + β3P1² + β4P2 + β5P2² + u

with X1 the income, P1 the price of chicken, and P2 the price of pork. This model
outperforms a model without quadratic terms according to AIC and SC.

β2 is negative, but the negative effect decreases as the price increases, since β3 is positive. The vertex is equal to

      −β2/(2β3) = −(−1.69)/(2 × 0.014) ≈ 60.

This value lies in the range of observed prices, hence the chicken price effect changes sign within the range of observations.
Part IX: Advanced Multiple Regression Models Quadratic Terms 259 / 288
Case Study Chicken

β4 is positive, but the positive effect decreases as the price of pork increases, since β5 is negative. The vertex is equal to

      −β4/(2β5) = −0.542/(2 × (−0.0024)) ≈ 113.

This value lies in the range of observed prices, hence the pork price effect changes sign within the range of observations.

Part IX: Advanced Multiple Regression Models Quadratic Terms 260 / 288
Outline

Part IX: Advanced Multiple Regression Models


▶ Quadratic Terms

▶ Interaction Terms

▶ Dummy Variables with Interaction Terms

Part IX: Advanced Multiple Regression Models Interaction Terms 261 / 288
Models with Interaction Terms

▶ In some cases it makes sense to make the effect of a variable X1 on Y dependent


on another regressor X2 .
▶ One way to capture such effects is to include interaction terms:

      Y = β0 + β1X1 + β2X2 + β3X1X2 + u     (69)

▶ OLS estimation of β0 , . . . , β3 proceeds as discussed above, based on the predictors


X1 , X2 , and X3 = X1 × X2
▶ Note that the relationship between X1 and E(Y |X1 , X2 ) is non-linear as is the
relationship between X2 and E(Y |X1 , X2 ).

Part IX: Advanced Multiple Regression Models Interaction Terms 262 / 288
Models with Interaction Terms

▶ The first derivative of E(Y |X1 = x1, X2 = x2) with respect to x1 is given by:

      ∂E(Y |X1 = x1, X2 = x2)/∂x1 = β1 + β3x2

  and depends on the actual value of x2.

▶ The first derivative of E(Y |X1 = x1, X2 = x2) with respect to x2 is given by:

      ∂E(Y |X1 = x1, X2 = x2)/∂x2 = β2 + β3x1

  and depends on the actual value of x1.

Part IX: Advanced Multiple Regression Models Interaction Terms 263 / 288
Understanding the Coefficients

Therefore, β1 is the effect of X1 on E(Y |X1, X2) only for X2 = 0, which is not necessarily a reasonable value of X2. The average effect δ1 of X1 on E(Y |X1, X2) can be evaluated at the sample mean X̄2 of X2:

      δ1 = β1 + β3X̄2

Similarly, the average effect δ2 of X2 on E(Y |X1, X2) can be evaluated at the sample mean X̄1 of X1:

      δ2 = β2 + β3X̄1

Part IX: Advanced Multiple Regression Models Interaction Terms 264 / 288
Centering the Predictors

An alternative parameterization of the model is

      Y = δ0 + δ1X1 + δ2X2 + δ3(X1 − X̄1)(X2 − X̄2) + u

where the interaction term involves the centered predictors X1 − X̄1 and X2 − X̄2.

Thus,
▶ δ1 is the average effect of X1 on E(Y |X1, X2) at the mean of X2
▶ δ2 is the average effect of X2 on E(Y |X1, X2) at the mean of X1

Part IX: Advanced Multiple Regression Models Interaction Terms 265 / 288
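A sketch of the centered parameterization in R (hypothetical variable names); the uncentered interaction model gives the same δ1 via the correction β1 + β3X̄2:

# Centered interaction term: delta1 and delta2 are average effects at the means
dat$x1c <- dat$x1 - mean(dat$x1)
dat$x2c <- dat$x2 - mean(dat$x2)

fit_c <- lm(y ~ x1 + x2 + I(x1c * x2c), data = dat)
coef(fit_c)                                   # coefficients of x1 and x2 are delta1, delta2

fit_i <- lm(y ~ x1 * x2, data = dat)          # uncentered model (69)
coef(fit_i)["x1"] + coef(fit_i)["x1:x2"] * mean(dat$x2)   # equals delta1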
Case Study Chicken

Estimate a model with income X1 , price of chicken P1 and price of pork P2 :

Y = β0 + β1 X1 + β2 P1 + β3 P2 + β4 X1 P2 + u

This model outperforms a model without the interaction term according to AIC and BIC.

β3 is positive, but the positive effect of increasing the price of pork decreases as the income X1 increases, since β4 is negative:

      ∂E(Y |X1 = x1, P1 = p1, P2 = p2)/∂p2 = β3 + β4x1

Part IX: Advanced Multiple Regression Models Interaction Terms 266 / 288
Case Study Chicken

The average income is equal to X̄1 = 1035.065, hence the average effect of the price of pork is equal to:

      δ3 = β3 + β4X̄1 = 0.162937 + 1035.065 × (−8.62 × 10⁻⁵) = 0.0737

This value is considerably smaller than the effect obtained from the model without an interaction term (0.174).

The average effect of the price of pork is obtained immediately from OLS estimation if the following model is fit to the data:

      Y = δ0 + δ1X1 + δ2P1 + δ3P2 + δ4(X1 − X̄1)(P2 − P̄2) + u

and is equal to δ3 . [EViews-Hint: Use @mean() to obtain the mean of a variable.]

Part IX: Advanced Multiple Regression Models Interaction Terms 267 / 288
Outline

Part IX: Advanced Multiple Regression Models


▶ Quadratic Terms

▶ Interaction Terms

▶ Dummy Variables with Interaction Terms

Part IX: Advanced Multiple Regression Models Dummy Variables with Interaction Terms 268 / 288
Interaction with Dummy Variables

Consider interacting a dummy variable D with a continuous variable X :

Y = β0 + β1 D + β2 X + β3 XD + u (70)

If D = 0, then:

Y = β0 + β2 X + u

If D = 1, then:

Y = (β0 + β1 ) + (β2 + β3 )X + u

Part IX: Advanced Multiple Regression Models Dummy Variables with Interaction Terms 269 / 288
Interaction with Dummy Variables

Example: Y = 20 + 3.2D − 2.5X + 1.5DX


[Plot of the two regression lines: the line for D = 1 has a higher intercept and a flatter slope than the line for D = 0.]

Part IX: Advanced Multiple Regression Models Dummy Variables with Interaction Terms 270 / 288
Interaction with Dummy Variables

Interpretation:
▶ The observed units are split into 2 groups according to D (e.g. into men and
women)
▶ The coefficient β3 models the difference in the marginal effect of X between the two groups. A change ∆x in X leads to an expected change in Y equal to
  ▶ E(Y |X = x + ∆x, D = 0) − E(Y |X = x, D = 0) = β2∆x,
  ▶ E(Y |X = x + ∆x, D = 1) − E(Y |X = x, D = 1) = (β2 + β3)∆x.
▶ The difference in the expected value of Y between the two groups for a given value
of X is equal to:

E(Y |X , D = 1) − E(Y |X , D = 0) = β1 + β3 X

Part IX: Advanced Multiple Regression Models Dummy Variables with Interaction Terms 271 / 288
Interaction with Dummy Variables

Testing hypotheses:
▶ The null hypothesis β3 = 0 corresponds to the assumption that the effect of X is
the same for both groups (interaction effect is not significant)
▶ The joint null hypothesis β2 = 0, β2 + β3 = 0 corresponds to the assumption that
the effect of X is zero for both groups
▶ The joint null hypothesis β1 = 0, β3 = 0 corresponds to the assumption that the
regression model is the same for both groups

Part IX: Advanced Multiple Regression Models Dummy Variables with Interaction Terms 272 / 288
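A sketch of model (70) and of the joint test of identical regressions in both groups (hypothetical data with a 0/1 dummy d):

# Dummy-by-continuous interaction: y ~ d + x + d:x (same as y ~ d * x)
fit <- lm(y ~ d + x + d:x, data = dat)

coef(fit)["x"]                       # slope in the group D = 0
coef(fit)["x"] + coef(fit)["d:x"]    # slope in the group D = 1

fit_r <- lm(y ~ x, data = dat)       # model without group differences
anova(fit_r, fit)                    # joint test of beta1 = 0 and beta3 = 0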
Case Study Marketing

Estimate a model with a specific brand (KR, D = 1) and price P:

Y = β0 + β1 D + β2 P + β3 PD + u

Results:
▶ There is a very significant price effect for the specific brand
▶ Increasing the price for an ordinary brand by one unit leads to an expected decrease
in the rating by β2 , i.e. around 0.31 points
▶ For the KR brand, the price effect is equal to β2 + β3 , i.e. increasing the price for
the specific brand by one unit leads to an expected decrease in the rating by 0.26
points

Part IX: Advanced Multiple Regression Models Dummy Variables with Interaction Terms 273 / 288
Part X
Regression with Heteroscedastic Errors
Outline

Part X: Regression with Heteroscedastic Errors

▶ Regression Models with Heteroskedastic Errors

Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 275 / 288
Regression Models with Heteroskedastic Errors

If assumption (38) (homoskedastic errors) is violated, one has to deal with
heteroskedastic errors, i.e. the variance differs among the observations:

Heteroskedastic Errors

V(ui | X1,i , . . . , XK,i ) = σi²    (71)

▶ Standard errors of OLS estimation are no longer valid.


▶ OLS estimator is no longer BLUE, better estimation methods exist.

Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 276 / 288
Case Study Profit

Demonstration in EViews, workfile profit.wf1:

yi = β0 + β1 x1,i + β2 x2,i + ui
yi . . . profit 1994
x1,i . . . profit 1993
x2,i . . . turnover 1994

The variance increases with the size of the firm.

Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 277 / 288
OLS Estimation under Heteroskedasticity

Simulate data from a regression model with β0 = 0.2 and β1 = −1.8 and
heteroskedastic errors:
 
yi = 0.2 − 1.8 xi + ui ,    ui ∼ N(0, σi²)
σi² = σ² (0.2 + xi)²

[Figure: scatter plot of one simulated data set, Y against X, for X between −0.5 and 0.5.]
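
This simulation design can be replicated in R as sketched below; σ² = 0.1 is taken from the following slide, while the range of x is an assumption read off the plot.

# Minimal sketch: one data set from the heteroskedastic design above.
set.seed(123)
N      <- 50
sigma2 <- 0.1
x      <- runif(N, -0.5, 0.5)
u      <- rnorm(N, mean = 0, sd = sqrt(sigma2 * (0.2 + x)^2))
y      <- 0.2 - 1.8 * x + u
fit_het <- lm(y ~ x)       # OLS is still unbiased here ...
summary(fit_het)           # ... but its reported standard errors are not valid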
Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 278 / 288
OLS Estimation under Heteroskedasticity
[Figure: two panels, each titled N = 50, σ² = 0.1, Design 2, with β1 (constant) on the horizontal axis and β2 (price) on the vertical axis.]

Left-hand side: estimation errors obtained from a simulation study with 200 data sets
(each with N = 50 observations); right-hand side: contours show the estimation error
according to OLS estimation
Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 279 / 288
Weighted Least Squares Estimation

If the variance increases with an observed variable Zi ,


Observed Heteroskedasticity
V(ui | X1,i , . . . , XK,i ) = σi² ,    σi² = σ² Zi

a simple transformation leads to a model with homoskedastic variances:


ui⋆ = ui / √Zi ,    V(ui⋆ | X1,i , . . . , XK,i ) = σ²

Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 280 / 288
Weighted Least Squares Estimation

Therefore a simple transformation of the original regression model

yi = β0 + β1 x1,i + . . . + βK xK ,i + ui

leads to a model with homoskedastic variances:


yi / √Zi = β0 (1/√Zi) + β1 (x1,i / √Zi) + . . . + βK (xK,i / √Zi) + ui⋆    (72)

Regression model (72) has the same parameters as the original model, but a transformed
response variable as well as transformed predictors.
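
A minimal R sketch of this transformation; the data frame d below is purely illustrative (generated in the spirit of the profit case study, with Zi = x2,i), not the original workfile.

# Illustrative heteroskedastic data (names and numbers are assumptions):
set.seed(42)
N <- 200
d <- data.frame(x1 = rnorm(N), x2 = runif(N, 1, 10))
d$z <- d$x2                                       # variance proportional to x2
d$y <- 1 + 0.5 * d$x1 + 0.3 * d$x2 + rnorm(N, sd = sqrt(d$z))

# Transform all variables as in (72) and run OLS without a separate intercept:
w       <- 1 / sqrt(d$z)
y_star  <- d$y  * w
x0_star <- w                                      # transformed "intercept" column 1/sqrt(z)
x1_star <- d$x1 * w
x2_star <- d$x2 * w
fit_tr  <- lm(y_star ~ 0 + x0_star + x1_star + x2_star)
coef(fit_tr)                                      # estimates of beta_0, beta_1, beta_2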

Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 281 / 288
Weighted Least Squares Estimation

Rewrite model (72) as



yi⋆ = β0 x0,i⋆ + β1 x1,i⋆ + . . . + βK xK,i⋆ + ui⋆    (73)

where

yi⋆ = yi / √Zi ,    x0,i⋆ = 1 / √Zi ,    xj,i⋆ = xj,i / √Zi ,    ∀ j = 1, . . . , K

Note that model (73) fulfills assumption (38), i.e. it is a model with homoskedastic
errors.

Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 282 / 288
Weighted Least Squares Estimation

Use OLS estimation for the transformed model (73):



yi⋆ = β0 x0,i⋆ + β1 x1,i⋆ + . . . + βK xK,i⋆ + ui⋆

and minimize the sum of squared residuals in the transformed model:

SSR = Σ_{i=1}^{N} (ûi⋆)²

Due to the relation

ui⋆ = ui / √Zi

the OLS estimator of the transformed model is equal to a weighted least squares
estimator in the original model:

SSR = Σ_{i=1}^{N} (ûi⋆)² = Σ_{i=1}^{N} (1/Zi) ûi²
Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 283 / 288
Weighted Least Squares Estimation

Residuals for observations with large variances are down-weighted, while residuals for
observations with small variances receive a higher weight; hence the name weighted
least squares estimation.
There is no “intercept” in the model (73), only covariates. Using the matrix formulation
of the multiple regression model (73), we obtain the following matrix of predictors and
observation vector:

X ⋆ = Diag (w1 , . . . , wN ) X, y ⋆ = Diag (w1 , . . . , wN ) y

where
wi = 1 / √Zi ,    i = 1, . . . , N

Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 284 / 288
Weighted Least Squares Estimation

The OLS estimator is computed for the transformed model, i.e.


β̂ = ((X⋆)′ X⋆)⁻¹ (X⋆)′ y⋆

This is equal to the following WLS estimator, which is expressed entirely in terms of the
original variables:
β̂ = (X′ W X)⁻¹ X′ W y    (74)

where W = Diag(w1², . . . , wN²).
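
As a small numerical check (continuing the illustrative data frame d from the previous sketch), the matrix formula (74) and R's built-in weighted least squares in lm() give the same estimates; lm() minimises the weighted residual sum of squares when a weights argument is supplied, so weights = 1/z corresponds to wi² = 1/Zi.

Xmat <- cbind(1, d$x1, d$x2)              # design matrix of the original model
W    <- diag(1 / d$z)                     # W = Diag(w_1^2, ..., w_N^2), w_i = 1/sqrt(Z_i)
beta_wls <- solve(t(Xmat) %*% W %*% Xmat, t(Xmat) %*% W %*% d$y)

fit_wls <- lm(y ~ x1 + x2, data = d, weights = 1 / z)   # built-in WLS
cbind(beta_wls, coef(fit_wls))            # both columns coincide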

Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 285 / 288
Testing for Heteroskedasticity

▶ Classical tests for heteroskedasticity are based on the squared OLS residuals ûi²,
e.g. the White or the Breusch-Pagan heteroskedasticity test. The idea is to test for
dependence of the squared residuals on any of the predictor variables using a
regression type model:

ûi² = α0 + α1 x1,i + . . . + αK xK,i + ξi

and test whether α1 = . . . = αK = 0 using an F-test.


▶ Problem: the test is not fully reliable, as the errors ξi of this auxiliary regression are
not normally distributed!
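
A hedged R illustration of this auxiliary regression, again using the illustrative data frame d from the WLS sketches; the lmtest package additionally provides bptest() as a ready-made (studentized) Breusch-Pagan test.

fit_ols <- lm(y ~ x1 + x2, data = d)    # original, possibly heteroskedastic, fit
u2  <- residuals(fit_ols)^2
aux <- lm(u2 ~ x1 + x2, data = d)
summary(aux)        # overall F-statistic tests alpha_1 = ... = alpha_K = 0

library(lmtest)     # install.packages("lmtest") if necessary
bptest(fit_ols)     # studentized Breusch-Pagan test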

Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 286 / 288
Case Study Profit

Demonstration in EViews, workfile profit.wf1:

yi = β0 + β1 x1,i + β2 x2,i + ui

▶ Discuss classical tests for heteroskedasticity [View → Residual Diagnostics]


▶ Possible choice for Zi : Zi = x2,i [Est.Eq. → Options → um94 as Std.dev.]
▶ Show how to estimate the transformed model [Divide everything by um94]
▶ Perform residual diagnostics for the transformed model [View → Res.Diag.]

Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 287 / 288
Some Final Words. . .

http://xkcd.com/552/

Part X: Regression with Heteroscedastic Errors Regression Models with Heteroskedastic Errors 288 / 288
