Simple Linear Regression

Empirical Methods for Finance

Prof. Robert Hill

Nova SBE


1 The Linear Regression Model

2 Population Regression Function and Fitted Line

3 Ordinary Least Squares (OLS)

4 Goodness of Fit

5 Exercise

The Linear Regression Model

Robert Hill Empirical Methods for Finance 3 / 34

Simple Linear Regression Model

y = β0 + β1 x + u

This describes the data generating process of y in the population

▶ y and x are linearly related

▶ The relationship is not exact
▶ The model implies that u captures everything that determines y that is
not x. Many times, this includes a lot of stuff!
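
To make the idea of a data generating process concrete, here is a minimal simulation sketch. The parameter values (β0 = 1, β1 = 0.5), the uniform distribution of x, and the normal distribution of u are illustrative assumptions, not part of the model above.

```python
import random


def simulate_sample(beta0, beta1, n, seed=0):
    """Draw n observations (x_i, y_i) from y = beta0 + beta1 * x + u."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)  # observable regressor (assumed uniform)
        u = rng.gauss(0.0, 1.0)     # unobservable error (assumed normal)
        y = beta0 + beta1 * x + u   # linear relationship, but not exact
        sample.append((x, y))
    return sample


# Hypothetical population parameters, chosen only for illustration.
sample = simulate_sample(beta0=1.0, beta1=0.5, n=5)
for x, y in sample:
    print(f"x = {x:.2f}  y = {y:.2f}  beta0 + beta1*x = {1.0 + 0.5 * x:.2f}")
```

Each simulated y differs from β0 + β1 x by the draw of u, which is what "the relationship is not exact" means.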

Simple Linear Regression Model: Variables and Parameters

y = β0 + β1 x + u

(y , x, u) are random variables

(y , x) are observable (we have a sample from the population)
u is (always!) unobservable
(β0 , β1 ) are unobservable population parameters. This is what we
want to estimate.

Simple Linear Regression Model: Terminology

y = β0 + β1 x + u

β0 : intercept parameter (constant term)
β1 : slope parameter

Example: Does Pedigree Predict Performance?

RET = β0 + β1 SAT + u

RET is the return of a fund above the return on a benchmark portfolio

SAT is the average SAT score of students at the undergraduate institution
of the fund’s manager
What about u? All other factors that determine a fund’s performance
Does the quality of a manager’s education explain differences in performance
across mutual fund managers?

Ceteris Paribus: everything else held constant

Definition of the causal effect of x on y :

How y changes when only x changes
This means, when all other factors that possibly affect y are held
unchanged (ceteris paribus = everything else equal)
Causation is different from correlation (correlation does not
imply causation)
▶ Correlation: x moves with y
▶ Causation: x moves y

Most interesting questions are ceteris paribus questions

Correlation vs Causation
An empirical observation:

Spurious correlation! Due to coincidence or to the variation of an omitted
factor that is driving both variables (“confounding factor”)

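
A small simulation can illustrate how a confounding factor produces correlation without causation. This is only a sketch: the variable names, the sample size, and the normal distributions are assumptions made for illustration. Here z drives both x and y, while x and y have no direct effect on each other.

```python
import random


def correlation(a, b):
    """Sample correlation coefficient between two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5


rng = random.Random(42)
n = 5000
z = [rng.gauss(0.0, 1.0) for _ in range(n)]  # omitted confounding factor
x = [zi + rng.gauss(0.0, 1.0) for zi in z]   # x depends on z, not on y
y = [zi + rng.gauss(0.0, 1.0) for zi in z]   # y depends on z, not on x

corr_xy = correlation(x, y)
# Removing the confounder's contribution leaves two unrelated noise series.
corr_net = correlation([xi - zi for xi, zi in zip(x, z)],
                       [yi - zi for yi, zi in zip(y, z)])
print(f"corr(x, y) = {corr_xy:.2f}, after removing z = {corr_net:.2f}")
```

In this setup x and y are clearly correlated, yet changing x would not move y at all.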
Correlation vs Causation
Another example of spurious correlation
Is Facebook driving the Greek debt crisis?

Ceteris Paribus Interpretation of the Linear Regression Model

y = β0 + β1 x + u
β1 measures the (linear) causal effect of a change in x on y :

∆y = β1 ∆x
when ∆u = 0
β1 is the ceteris paribus effect of x on y , i.e. keeping everything else
constant, a change in x by 1 unit will cause y to change by β1 units
Since β1 is unobservable, we need to estimate it using data about x
and y
How can we hope to learn about the effect of x on y holding other
factors fixed, when we are ignoring all those other factors?

Example: Does Pedigree Predict Performance?

RET = β0 + β1 SAT + u

Another empirical observation: managers who attended higher-SAT
undergraduate institutions have systematically higher excess returns


▶ Causal:

▶ Non-causal:

Important distinction: is the cost of an Ivy League education worth it?

Key assumption for causality

Zero conditional mean assumption

E (u|x) = E (u) = 0

Makes two assumptions:

(1) Mean independence of the error term

E (u|x) = E (u), for all values x

(2) Zero mean:

E (u) = 0

(1) Mean independence of the error term

E (u|x) = E (u), for all values x

▶ The average value of u does not depend on the value of x

▶ This is the key assumption. A very strong assumption!
▶ In the example RET = β0 + β1 SAT + u, u contains innate ability
among other things
▶ Mean independence of u means that E (ability |SAT ) = E (ability ), i.e.
that the average level of ability is the same across people from institutions
with different average SAT scores
▶ This implies E (ability |SAT = 1500, Princeton) = E (ability |SAT =
1177, U. of Alabama). Realistic?
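
The consequence of violating mean independence can be sketched with a simulation. All numbers below are hypothetical: the true slope is set to 0.5, and "ability" is an unobserved factor that is correlated with x. In the first case ability is part of u, so E(u|x) varies with x; in the second case u is pure noise.

```python
import random


def ols_slope(x, y):
    """Slope of the least-squares line of y on x (with an intercept)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


rng = random.Random(0)
n = 20000
beta0, beta1 = 1.0, 0.5  # hypothetical true parameters

ability = [rng.gauss(0.0, 1.0) for _ in range(n)]
x = [a + rng.gauss(0.0, 1.0) for a in ability]  # regressor related to ability

# Case 1: ability is part of u, so the average of u changes with x.
u_bad = [a + rng.gauss(0.0, 1.0) for a in ability]
y_bad = [beta0 + beta1 * xi + ui for xi, ui in zip(x, u_bad)]

# Case 2: u is unrelated to x, so mean independence holds.
u_ok = [rng.gauss(0.0, 1.0) for _ in range(n)]
y_ok = [beta0 + beta1 * xi + ui for xi, ui in zip(x, u_ok)]

slope_bad = ols_slope(x, y_bad)
slope_ok = ols_slope(x, y_ok)
print(f"true slope = {beta1}, mean independence holds: {slope_ok:.2f}, "
      f"violated: {slope_bad:.2f}")
```

Under these assumptions the estimated slope is close to 0.5 only when u is unrelated to x; when ability sits in u, the estimate also picks up the effect of ability.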

(2) Also assumes that u is zero in expectation

E (u) = 0

▶ Harmless assumption (normalization) as long as there is an intercept

▶ The constant (intercept) will absorb any non-zero mean of u
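
A short worked step shows why the zero-mean assumption is only a normalization. Suppose, hypothetically, that E(u) = α0 ≠ 0 (the symbol α0 is introduced here only for illustration). Adding and subtracting α0 gives

$$y = \beta_0 + \beta_1 x + u = (\beta_0 + \alpha_0) + \beta_1 x + (u - \alpha_0)$$

The new error term u − α0 has mean zero, the slope β1 is unchanged, and only the intercept is redefined as β0 + α0.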

Population Regression Function and Fitted Line

The Population Regression Function (PRF)

Under the zero conditional mean assumption E (u|x) = E (u) = 0

E (y |x) = β0 + β1 x

▶ The PRF gives us a relationship between the average level of y at
different levels of x. Whether the actual y is above or below the PRF
depends on the unobserved factors in u

▶ β1 now tells us how the average value of y changes with x

▶ y can be decomposed into a systematic and an idiosyncratic part

y = E (y |x) + u
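
The step from the model to the PRF can be written out explicitly. Taking the expectation of y conditional on x and using E(u|x) = 0:

$$E(y \mid x) = E(\beta_0 + \beta_1 x + u \mid x) = \beta_0 + \beta_1 x + E(u \mid x) = \beta_0 + \beta_1 x$$

Conditional on x, the term β0 + β1 x is a constant, so only the error term remains inside the expectation.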

Expected Values and Errors

For a sample of the population {yi , xi }, i = 1 . . . n

yi = E (y |xi ) + ui

For a given value of x, we observe different values of y because of the
randomness in u
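
As an illustration, the sketch below holds x fixed and draws many values of y. The parameter values and the normal distribution of u are assumptions made for the example. The individual draws scatter around the PRF, while their average is close to E(y|x).

```python
import random

rng = random.Random(1)
beta0, beta1 = 1.0, 0.5  # hypothetical population parameters
x_fixed = 2.0            # hold x at a single value
draws = 10000

# Each draw of y differs only because of the draw of u.
ys = [beta0 + beta1 * x_fixed + rng.gauss(0.0, 1.0) for _ in range(draws)]

mean_y = sum(ys) / draws
prf_value = beta0 + beta1 * x_fixed  # E(y | x = x_fixed) under E(u|x) = 0
print(f"E(y|x) = {prf_value:.2f}, average of simulated y = {mean_y:.2f}")
print(f"smallest y = {min(ys):.2f}, largest y = {max(ys):.2f}")
```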

Fitted Values and Residuals

Given a sample {yi , xi }, i = 1 . . . n we estimate

ŷ = β̂0 + β̂1 x

Regression residuals are defined as

ûi = yi − ŷi ⇔ yi = ŷi + ûi

Ordinary Least Squares (OLS)

Ordinary Least Squares (OLS)

The most common estimator is known as OLS (ordinary least squares)

Choose β̂0 and β̂1 such that, collectively, the differences between the observed
values yi and the fitted values ŷi are minimized
This is achieved by minimizing the sum of the squared residuals:

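$$\min_{\hat\beta_0,\hat\beta_1} SSR = \min_{\hat\beta_0,\hat\beta_1} \sum_{i=1}^{N} \hat u_i^2 = \min_{\hat\beta_0,\hat\beta_1} \sum_{i=1}^{N} (y_i - \hat y_i)^2 = \min_{\hat\beta_0,\hat\beta_1} \sum_{i=1}^{N} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2$$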

OLS Estimators

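$$\hat\beta_1 = \frac{\sum_{i=1}^{N} (y_i - \bar y)(x_i - \bar x)}{\sum_{i=1}^{N} (x_i - \bar x)^2} = \frac{\text{sample covariance}(x, y)}{\text{sample variance}(x)}$$

$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x$$

where $\bar x = \frac{1}{N}\sum_{i=1}^{N} x_i$ and $\bar y = \frac{1}{N}\sum_{i=1}^{N} y_i$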

(Derivation: video)
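
For reference, here is a sketch of the standard first-order-condition argument. The video may follow a different route. Setting the partial derivatives of the sum of squared residuals to zero gives

$$\frac{\partial SSR}{\partial \hat\beta_0} = -2\sum_{i=1}^{N} (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0, \qquad \frac{\partial SSR}{\partial \hat\beta_1} = -2\sum_{i=1}^{N} x_i (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0$$

Dividing the first condition by −2N yields $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$. Substituting this into the second condition:

$$\sum_{i=1}^{N} x_i \left[(y_i - \bar y) - \hat\beta_1 (x_i - \bar x)\right] = 0 \;\Rightarrow\; \hat\beta_1 = \frac{\sum_{i=1}^{N} x_i (y_i - \bar y)}{\sum_{i=1}^{N} x_i (x_i - \bar x)} = \frac{\sum_{i=1}^{N} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{N} (x_i - \bar x)^2}$$

The last equality uses $\sum_{i=1}^{N} \bar x (y_i - \bar y) = 0$ and $\sum_{i=1}^{N} \bar x (x_i - \bar x) = 0$. The formula requires some variation in x, i.e. $\sum_{i=1}^{N} (x_i - \bar x)^2 > 0$.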

You can compute them manually. In practice, this is done by
econometric packages (e.g. STATA)
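
As an illustration of the manual computation, here is a minimal sketch in plain Python. The five data points are made up for the example, and the function name ols_fit is a hypothetical helper, not part of any package.

```python
def ols_fit(x, y):
    """Return (beta0_hat, beta1_hat) for a simple regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variation, so the slope is not defined")
    beta1_hat = sxy / sxx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


# Made-up sample, used only to show the arithmetic.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

beta0_hat, beta1_hat = ols_fit(x, y)
print(f"beta0_hat = {beta0_hat:.3f}, beta1_hat = {beta1_hat:.3f}")
# prints: beta0_hat = 2.200, beta1_hat = 0.600
```

In this sample the mean of x is 3 and the mean of y is 4. The slope numerator is 6 and the denominator is 10, so the slope is 0.6 and the intercept is 4 − 0.6 · 3 = 2.2.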

Algebraic Properties of OLS

Properties of OLS estimators that follow directly from algebra and are
therefore always true. In other words, OLS estimators β̂0 and β̂1
are chosen such that:

1 $\sum_{i=1}^{N} \hat u_i = 0$
The sum (and the sample average) of the OLS residuals is zero
2 $\sum_{i=1}^{N} x_i \hat u_i = 0$
The sample covariance between the regressor(s) and the OLS residuals
is zero (see footnote 1)
3 The point (x̄, ȳ) always lies on the regression line

Footnote 1:
$$\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar x)(\hat u_i - \bar{\hat u}) = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar x)\hat u_i = \frac{1}{N-1}\sum_{i=1}^{N} x_i \hat u_i - \bar x \, \frac{1}{N-1}\sum_{i=1}^{N} \hat u_i = \frac{1}{N-1}\sum_{i=1}^{N} x_i \hat u_i = 0 \;\Leftrightarrow\; \sum_{i=1}^{N} x_i \hat u_i = 0$$

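
The three algebraic properties can be checked numerically. The sketch below uses a made-up five-point sample and a hypothetical helper ols_fit. Any sample with variation in x should give the same result, up to floating-point rounding.

```python
def ols_fit(x, y):
    """Return (beta0_hat, beta1_hat) for a simple regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = sxy / sxx
    return y_bar - beta1_hat * x_bar, beta1_hat


# Made-up sample, used only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

beta0_hat, beta1_hat = ols_fit(x, y)
fitted = [beta0_hat + beta1_hat * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

sum_resid = sum(residuals)                                 # property 1
sum_x_resid = sum(xi * ui for xi, ui in zip(x, residuals))  # property 2
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
line_at_x_bar = beta0_hat + beta1_hat * x_bar              # property 3

print(f"sum of residuals     = {sum_resid:.10f}")
print(f"sum of x * residuals = {sum_x_resid:.10f}")
print(f"line at x_bar = {line_at_x_bar:.3f}, y_bar = {y_bar:.3f}")
```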
Errors vs. Residuals
Errors ui
▶ all other factors that affect y
▶ the vertical distances between observations and the PRF
▶ never observed
▶ assumptions of the model are built around u

Residuals ûi
▶ computed from the data
▶ the vertical distances between observations and the estimated
regression function
▶ have several important algebraic properties

Goodness of Fit


Goodness of Fit: Some definitions

Sum of Squares Total (SST): measures the total sample variation in the yi
$$SST = \sum_{i=1}^{N} (y_i - \bar y)^2$$

Sum of Squares Explained (SSE): measures the sample variation in the ŷi
$$SSE = \sum_{i=1}^{N} (\hat y_i - \bar{\hat y})^2$$

Sum of Squares Residual (SSR): measures the sample variation in the ûi
$$SSR = \sum_{i=1}^{N} (\hat u_i - \bar{\hat u})^2 = \sum_{i=1}^{N} \hat u_i^2$$

Goodness of Fit: R-squared

The total variation can be decomposed into explained variation and residual
(unexplained) variation:

SST = SSE + SSR

Intuitively, a good measure of the regression fit is how much of the total
variation the model can explain. This is the definition of R²

R² = SSE /SST = 1 − SSR/SST

Some remarks:
▶ R²: proportion of the variation in y explained by variation in x
▶ R² is always between 0 and 1
▶ Higher R² means that a higher proportion of variation in yi is explained
by the variation in xi
▶ Low R² values are not uncommon, especially for cross-sectional data
▶ High R² is useless if correlation is spurious

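
The definitions above can be traced through on a small example. This sketch uses a made-up five-point sample and a hypothetical helper ols_fit. It computes SST, SSE and SSR and then R² in both of its equivalent forms.

```python
def ols_fit(x, y):
    """Return (beta0_hat, beta1_hat) for a simple regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = sxy / sxx
    return y_bar - beta1_hat * x_bar, beta1_hat


# Made-up sample, used only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

beta0_hat, beta1_hat = ols_fit(x, y)
fitted = [beta0_hat + beta1_hat * xi for xi in x]
y_bar = sum(y) / len(y)

sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
sse = sum((fi - y_bar) ** 2 for fi in fitted)           # explained variation
ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # residual variation

r2_explained = sse / sst
r2_residual = 1.0 - ssr / sst
print(f"SST = {sst:.2f}, SSE = {sse:.2f}, SSR = {ssr:.2f}")
print(f"R^2 = {r2_explained:.3f} (SSE/SST) = {r2_residual:.3f} (1 - SSR/SST)")
```

The code uses the sample mean of y inside SSE. This matches the definition because the fitted values have the same sample mean as y when the regression includes an intercept. In this sample SST = 6, SSE = 3.6 and SSR = 2.4, so R² = 0.6.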
Can you help my friends?

Some friends recently started a

start-up that produces vegan
The company is doing great, but
they have no quantitative
They are collecting data about
the production activity but they
cannot figure how to get the
information they need out of

Can you help my friends?

Each week they record

1 how many units of product 1 they produced
2 how many units of product 2 they produced
3 the total number of hours worked
For each product, they need to know how many units they produce in
one hour

