1 EMF univariateOLS

Simple Linear Regression
Empirical Methods for Finance
Prof. Robert Hill
Nova SBE
2022
Robert Hill Empirical Methods for Finance 1 / 34

Outline
1 The Linear Regression Model
2 Population Regression Function and Fitted Line
3 Ordinary Least Squares (OLS)
4 Goodness of Fit
5 Exercise

The Linear Regression Model

Simple Linear Regression Model
y = β0 + β1 x + u
This describes the data generating process of y in the population
▶ y and x are linearly related

▶ The relationship is not exact
▶ The model implies that u captures everything that determines y that is
not x. Many times, this includes a lot of stuff!

Simple Linear Regression Model: Variables and Parameters
y = β0 + β1 x + u
(y , x, u) are random variables

(y , x) are observable (we have a sample from the population)
u is (always!) unobservable
(β0 , β1 ) are unobservable population parameters. This is what we
want to estimate.
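To make the roles of (y, x, u) and (β0, β1) concrete, the sketch below simulates a small sample from the model. It is only an illustration: the parameter values, the normal distributions, and the function name `simulate_sample` are assumptions made here, not part of the course material. In real data only the (y, x) pairs would be available; u is visible here only because we generate it ourselves.

```python
import random

def simulate_sample(n, beta0, beta1, seed=0):
    """Draw n observations from y = beta0 + beta1 * x + u (hypothetical DGP)."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)   # observable regressor
        u = rng.gauss(0.0, 1.0)   # unobservable error, drawn independently of x
        y = beta0 + beta1 * x + u
        sample.append((y, x, u))
    return sample

# Assumed "population" parameters, chosen only for illustration
draws = simulate_sample(n=5, beta0=1.0, beta1=2.0)

# What the researcher actually sees: (y, x) pairs, never u
observed = [(y, x) for y, x, _ in draws]
for y, x in observed:
    print(f"y = {y: .3f}, x = {x: .3f}")
```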

Simple Linear Regression Model: Terminology
y = β0 + β1 x + u
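y: dependent variable (explained variable, regressand)
x: independent variable (explanatory variable, regressor)
u: error term (disturbance)
β0 : intercept (constant)
β1 : slope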

Example: Does Pedigree Predict Performance?
RET = β0 + β1 SAT + u
RET is the return of a fund above the return on a benchmark portfolio
SAT is the average SAT score of students at the undergraduate institution of the fund’s manager
What about u? All other factors that determine funds’ performance
Does the quality of managers’ education explain differences in performance across mutual fund managers?

Ceteris Paribus: everything else held constant
Definition of the causal effect of x on y: how y changes when only x changes
This means that all other factors that possibly affect y are held unchanged (ceteris paribus = everything else equal)
Causation is different from correlation (correlation does not imply causation)
▶ Correlation: x moves with y
▶ Causation: x moves y
Most interesting questions are ceteris paribus questions

Correlation vs Causation
An empirical observation:
Spurious correlation! Due to coincidence or to the variation of an omitted factor that is driving both variables (“confounding factor”)

Correlation vs Causation
Another example of spurious correlation
Is Facebook driving the Greek debt crisis?
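The confounding-factor story can be sketched with simulated data. In the hypothetical example below, an omitted factor z drives both x and y, while x has no effect on y at all. The variable names, sample size, and distributions are assumptions chosen for illustration. The two series still come out clearly correlated.

```python
import random

def mean(values):
    return sum(values) / len(values)

def correlation(a, b):
    """Sample correlation coefficient of two equally long lists."""
    a_bar, b_bar = mean(a), mean(b)
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5

rng = random.Random(42)
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]    # omitted confounding factor
x = [zi + rng.gauss(0, 1) for zi in z]     # x depends on z, not on y
y = [zi + rng.gauss(0, 1) for zi in z]     # y depends on z, not on x

corr_xy = correlation(x, y)
print(f"corr(x, y) = {corr_xy:.2f}")       # positive although x does not cause y
```

In this setup the population correlation is 0.5 by construction, so a sizeable sample correlation says nothing about a causal link between x and y.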

Ceteris Paribus Interpretation of the Linear Regression Model
y = β0 + β1 x + u
β1 measures the (linear) causal effect of a change in x on y :
∆y = β1 ∆x
when ∆u = 0
β1 is the ceteris paribus effect of x on y, i.e. keeping everything else constant, a change in x by 1 unit will cause y to change by β1 units
Since β1 is unobservable, we need to estimate it using data about x
and y
How can we hope to learn about the effect of x on y holding other
factors fixed, when we are ignoring all those other factors?

Example: Does Pedigree Predict Performance?
RET = β0 + β1 SAT + u
Another empirical observation: managers who attended higher-SAT undergraduate institutions have systematically higher excess returns
Interpretations
▶ Causal:
▶ Non-causal:
Important distinction: is the cost of an Ivy League education worth it?

Key assumption for causality
Zero conditional mean assumption
E (u|x) = E (u) = 0
Makes two assumptions:

(1) Mean independence of the error term
E (u|x) = E (u), for all values x
(2) Zero mean:

E (u) = 0

(1) Mean independence of the error term
E (u|x) = E (u), for all values x
▶ The average value of u does not depend on the value of x

▶ This is the key assumption. A very strong assumption!
▶ In the example RET = β0 + β1 SAT + u, u contains innate ability
among other things
▶ Mean independence of u means that E (ability | SAT) = E (ability), i.e. that the average level of ability is the same across people from different institutions
▶ This implies E (ability | SAT = 1500, Princeton) = E (ability | SAT = 1177, U. of Alabama). Realistic?
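A small simulation can illustrate why mean independence matters. The sketch below is a toy world built on assumptions made here: SAT has no causal effect on returns (the true β1 is set to zero), ability sits in the error term, and all variables are standardized normal draws. The helper `ols_slope` and every number are hypothetical. When SAT is unrelated to ability, the estimated slope is close to zero. When SAT rises with ability, the slope picks up the effect of ability.

```python
import random

def ols_slope(x, y):
    """Slope of a simple regression of y on x (sample covariance / sample variance)."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = sum((xi - x_bar) ** 2 for xi in x)
    return num / den

rng = random.Random(1)
n = 20000
true_beta1 = 0.0   # assumption: schooling has no causal effect in this toy world

ability = [rng.gauss(0, 1) for _ in range(n)]
noise = [rng.gauss(0, 1) for _ in range(n)]

# Case 1: SAT unrelated to ability, so E(u | SAT) = E(u)
sat_indep = [rng.gauss(0, 1) for _ in range(n)]
# Case 2: SAT rises with ability, so E(u | SAT) varies with SAT
sat_dep = [a + rng.gauss(0, 1) for a in ability]

def returns(sat):
    # u = ability + noise is the unobserved error term
    return [true_beta1 * s + a + e for s, a, e in zip(sat, ability, noise)]

slope_indep = ols_slope(sat_indep, returns(sat_indep))
slope_dep = ols_slope(sat_dep, returns(sat_dep))
print(f"mean independence holds:  slope = {slope_indep:.3f}")
print(f"mean independence fails:  slope = {slope_dep:.3f}")
```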

(2) Also assumes that u is zero in expectation
E (u) = 0
▶ Harmless assumption (normalization) as long as there is an intercept

▶ The constant (intercept) will absorb any non-zero mean of u
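One way to see why the zero-mean assumption is a normalization: suppose, hypothetically, that E(u) = α0 ≠ 0 (the symbol α0 is introduced here only for this argument). Adding and subtracting α0 gives

```latex
y = \beta_0 + \beta_1 x + u
  = (\beta_0 + \alpha_0) + \beta_1 x + (u - \alpha_0),
\qquad E(u - \alpha_0) = 0 .
```

The redefined error has mean zero, the intercept becomes β0 + α0, and the slope β1 is unchanged.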

Population Regression Function and Fitted Line

The Population Regression Function (PRF)
Under the zero conditional mean assumption E (u|x) = E (u) = 0
E (y |x) = β0 + β1 x
▶ The PRF gives us a relationship between the average level of y at different levels of x. Whether the actual y is above or below the PRF depends on the unobserved factors in u
▶ β1 now tells us how the average value of y changes with x
▶ y can be decomposed into a systematic and an idiosyncratic part
y = E (y |x) + u
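The step from the model to the PRF can be written out as follows, taking the expectation conditional on x and using only the zero conditional mean assumption:

```latex
E(y \mid x) = E(\beta_0 + \beta_1 x + u \mid x)
            = \beta_0 + \beta_1 x + E(u \mid x)
            = \beta_0 + \beta_1 x .
```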

Expected Values and Errors
For a sample from the population {yi , xi }, i = 1, . . . , N
yi = E (y |xi ) + ui
For a given value of x, we observe different values of y because of the randomness in u

Fitted Values and Residuals
Given a sample {yi , xi }, i = 1, . . . , N, we estimate
ŷ = β̂0 + β̂1 x
Regression residuals are defined as
ûi = yi − ŷi ⇔ yi = ŷi + ûi

Ordinary Least Squares (OLS)

Ordinary Least Squares (OLS)
The most common estimator is known as OLS (ordinary least squares)

Choose β̂0 and β̂1 such that, collectively, the differences between the observed values yi and the fitted values ŷi are minimized
This is achieved by minimizing the sum of the squared residuals:
\[
\min_{\hat\beta_0, \hat\beta_1} SSR
= \min_{\hat\beta_0, \hat\beta_1} \sum_{i=1}^{N} \hat{u}_i^2
= \min_{\hat\beta_0, \hat\beta_1} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
= \min_{\hat\beta_0, \hat\beta_1} \sum_{i=1}^{N} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2
\]

OLS Estimators
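\[
\hat\beta_1 = \frac{\sum_{i=1}^{N} (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}
= \frac{\text{sample covariance}(x, y)}{\text{sample variance}(x)}
\]
\[
\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}
\]
where \( \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i \) and \( \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i \)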
(Derivation: video)
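For readers without access to the video, a standard sketch of the derivation follows. It is the usual textbook argument and may differ in detail from the one presented there. Setting the partial derivatives of SSR with respect to the two coefficients to zero gives the first-order conditions:

```latex
\frac{\partial SSR}{\partial \hat\beta_0}
  = -2 \sum_{i=1}^{N} (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0
  \;\Rightarrow\; \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}

\frac{\partial SSR}{\partial \hat\beta_1}
  = -2 \sum_{i=1}^{N} x_i (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0
  \;\Rightarrow\; \sum_{i=1}^{N} x_i \bigl( (y_i - \bar{y}) - \hat\beta_1 (x_i - \bar{x}) \bigr) = 0

\hat\beta_1
  = \frac{\sum_{i=1}^{N} x_i (y_i - \bar{y})}{\sum_{i=1}^{N} x_i (x_i - \bar{x})}
  = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}
```

The last equality uses the fact that deviations from the mean sum to zero, so subtracting x̄ inside each sum changes nothing.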
You can compute them manually. In practice, this is done by econometric packages (e.g. Stata)
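As an illustration of the manual computation, the sketch below applies the closed-form formulas to a tiny made-up data set using plain Python. The function name `ols_fit` and the five data points are invented for this example. An econometric package would report the same coefficients.

```python
def ols_fit(x, y):
    """Return (beta0_hat, beta1_hat) from the closed-form OLS formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = s_xy / s_xx                  # sample covariance / sample variance
    beta0_hat = y_bar - beta1_hat * x_bar    # line passes through (x_bar, y_bar)
    return beta0_hat, beta1_hat

# Made-up data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

beta0_hat, beta1_hat = ols_fit(x, y)
fitted = [beta0_hat + beta1_hat * xi for xi in x]     # fitted values
residuals = [yi - fi for yi, fi in zip(y, fitted)]    # regression residuals

print(beta0_hat, beta1_hat)   # roughly 2.2 and 0.6 for this data
```

The common factor 1/(N − 1) in the sample covariance and sample variance cancels in the ratio, which is why the code works with the raw sums.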

Algebraic Properties of OLS
Properties of the OLS estimators that follow directly from algebra and are therefore always true. In other words, the OLS estimators β̂0 and β̂1 are chosen such that:
1 \( \sum_{i=1}^{N} \hat{u}_i = 0 \)
The sum (and the sample average) of the OLS residuals is zero
2 \( \sum_{i=1}^{N} x_i \hat{u}_i = 0 \)
The sample covariance between the regressor(s) and the OLS residuals is zero¹
3 The point (x̄, ȳ) always lies on the regression line
¹ \[
\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(\hat{u}_i - \bar{\hat{u}})
= \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x}) \hat{u}_i
= \frac{1}{N-1} \sum_{i=1}^{N} x_i \hat{u}_i - \bar{x} \, \frac{1}{N-1} \sum_{i=1}^{N} \hat{u}_i
= \frac{1}{N-1} \sum_{i=1}^{N} x_i \hat{u}_i = 0
\iff \sum_{i=1}^{N} x_i \hat{u}_i = 0
\]

Errors vs. Residuals
Errors ui
▶ all other factors that affect y
▶ the vertical distances between observations and the PRF
▶ never observed
▶ assumptions of the model are built around u
Residuals ûi
▶ computed from the data
▶ the vertical distances between observations and the estimated
regression function
▶ have several important algebraic properties
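The distinction between errors and residuals, and the algebraic properties of the residuals, can be checked numerically. The sketch below simulates data from an assumed model (the parameter values and normal draws are illustrative choices), so the errors u are known here even though they never would be in practice. The residuals satisfy the algebraic properties up to floating-point rounding, while the errors do not.

```python
import random

rng = random.Random(7)
n = 200
beta0, beta1 = 1.0, 2.0                       # assumed population parameters
x = [rng.gauss(0, 1) for _ in range(n)]
u = [rng.gauss(0, 1) for _ in range(n)]       # errors: known only because we simulate
y = [beta0 + beta1 * xi + ui for xi, ui in zip(x, u)]

# OLS estimates from the closed-form formulas
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
u_hat = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]       # residuals

sum_resid = sum(u_hat)                                    # property 1: zero
sum_x_resid = sum(xi * ri for xi, ri in zip(x, u_hat))    # property 2: zero
line_at_x_bar = b0 + b1 * x_bar                           # property 3: equals y_bar
sum_errors = sum(u)                                       # generally not zero

print(f"sum of residuals:     {sum_resid:.2e}")
print(f"sum of x * residuals: {sum_x_resid:.2e}")
print(f"sum of errors:        {sum_errors:.2f}")
```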

Goodness of Fit

Goodness of Fit: Some definitions
Sum of Squares Total (SST): measures the total sample variation in the yi
\[ SST = \sum_{i=1}^{N} (y_i - \bar{y})^2 \]
Sum of Squares Explained (SSE): measures the sample variation in the ŷi
\[ SSE = \sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})^2 \]
Sum of Squares Residual (SSR): measures the sample variation in the ûi
\[ SSR = \sum_{i=1}^{N} (\hat{u}_i - \bar{\hat{u}})^2 = \sum_{i=1}^{N} \hat{u}_i^2 \]

Goodness of Fit: R-squared
The total variation can be decomposed into explained variation and residual (unexplained) variation:
SST = SSE + SSR
Intuitively, a good measure of the regression fit is how much of the total variation the model can explain. This is the definition of R²
R² = SSE/SST = 1 − SSR/SST
Some remarks:
▶ R²: proportion of the variation in y explained by variation in x
▶ R² is always between 0 and 1
▶ A higher R² means that a higher proportion of the variation in yi is explained by the variation in xi
▶ A low R² is not uncommon, especially for cross-sectional data
▶ A high R² is useless if the correlation is spurious
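The decomposition and the two equivalent expressions for R-squared can be verified on a small example. The five data points below are made up for illustration, and the variable names are choices made for this sketch.

```python
# Made-up data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

# OLS fit from the closed-form formulas
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]                  # fitted values
u_hat = [yi - fi for yi, fi in zip(y, y_hat)]       # residuals

y_hat_bar = sum(y_hat) / n                          # equals y_bar up to rounding
sst = sum((yi - y_bar) ** 2 for yi in y)            # total variation
sse = sum((fi - y_hat_bar) ** 2 for fi in y_hat)    # explained variation
ssr = sum(ri ** 2 for ri in u_hat)                  # residual variation

r2 = sse / sst
print(f"SST = {sst:.2f}, SSE = {sse:.2f}, SSR = {ssr:.2f}")   # 6.00, 3.60, 2.40
print(f"R2 = {r2:.2f}, 1 - SSR/SST = {1 - ssr / sst:.2f}")    # both 0.60
```

For this data the fitted line explains 60% of the sample variation in y. That number alone says nothing about whether the relationship is causal.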
Exercise

Can you help my friends?
Some friends recently started a start-up that produces vegan salmon
The company is doing great, but they have no quantitative background
They are collecting data about the production activity but they cannot figure out how to get the information they need out of it...

Can you help my friends?
Each week they record
1 how many units of product 1 they produced
2 how many units of product 2 they produced
3 the total number of hours worked
For each product, they need to know how many units they produce in one hour
