Financial Econometrics Lecture 4


FINANCIAL ECONOMETRICS

Lecturer: EDMUND TAMUKE

Lecture 4
Week 5: July 12, 2024
Last week
We studied the desirable (small sample) properties of the OLS
estimators under Gauss-Markov assumptions, i.e. unbiasedness
and efficiency, as well as learnt how to do hypothesis testing

Today, we will first briefly look at deriving unbiasedness and efficiency of the OLS estimators using matrix form

Then, we will move on to discussing asymptotic or large sample properties of OLS estimators

Finally, we will focus on various specification problems and begin the discussion on violation of OLS assumptions
Readings
Wooldridge Chapters 5, 6 and 8

Verbeek Chapters 2, 3 and 4


Brief Recap of Properties of OLS Estimator
Let’s consider the linear regression model:
y = Xβ + ε
where y is (N×1), X is (N×K), β is (K×1) and ε is (N×1)
Under the Gauss-Markov assumptions (A1)-(A4), the OLS estimator β̂ is unbiased and efficient

Unbiasedness implies E(β̂) = β. In matrix form, remember β̂ = (X′X)⁻¹X′y. So, taking expectations, we get

E(β̂) = E[(X′X)⁻¹X′y] = E[(X′X)⁻¹X′(Xβ + ε)]
      = E[(X′X)⁻¹(X′X)β + (X′X)⁻¹X′ε]
      = β + (X′X)⁻¹X′E(ε) = β

(since we are conditioning on X, the regressors are fixed and, by (A2), E(ε|X) = 0, i.e. zero conditional mean of errors)
Efficiency: The variance-covariance matrix V(β̂) is a K×K matrix consisting of the variances of β̂₁, β̂₂, …, β̂_K on the diagonal and their covariances as the off-diagonal elements

V(β̂) = E[(β̂ − β)(β̂ − β)′] = E[(X′X)⁻¹X′εε′X(X′X)⁻¹]
      = (X′X)⁻¹X′E(εε′)X(X′X)⁻¹
      = (X′X)⁻¹X′(σ²I_N)X(X′X)⁻¹
      = σ²(X′X)⁻¹(X′X)(X′X)⁻¹
      = σ²(X′X)⁻¹

Reminder: the Gauss-Markov Theorem says that under assumptions (A1)-(A4), the OLS estimator β̂ is the Best Linear Unbiased Estimator (BLUE) for β
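To make the matrix formulas concrete, here is a minimal numpy sketch (not part of the original slides) that computes β̂ = (X′X)⁻¹X′y and the estimated V(β̂) = σ̂²(X′X)⁻¹; the sample size, regressors and "true" parameter values are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: N observations, K = 3 columns in X (constant, x1, x2)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N), rng.normal(size=N)])
beta_true = np.array([1.0, 2.0, -0.5])          # assumed "true" parameters
y = X @ beta_true + rng.normal(size=N)

# OLS estimator: beta_hat = (X'X)^(-1) X'y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Estimated error variance and variance-covariance matrix s^2 (X'X)^(-1)
resid = y - X @ beta_hat
K = X.shape[1]
s2 = resid @ resid / (N - K)
V_beta_hat = s2 * XtX_inv

print("beta_hat:", beta_hat)
print("standard errors:", np.sqrt(np.diag(V_beta_hat)))
```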
Asymptotic Properties of the OLS Estimator
In many cases, the small sample properties of the OLS estimator may deviate from the ones discussed so far

E.g. when the Gauss-Markov assumptions are relaxed, properties of estimators are very difficult to ascertain

In such cases we use an alternative approach to evaluate the quality of our estimators, which is based on "asymptotic theory"

Asymptotic (or large sample) theory asks what happens if, hypothetically, the sample size becomes very large (technically speaking, infinitely large)

Asymptotically, estimators usually have nice properties, and we use these asymptotic properties to approximate the properties in the finite sample that we happen to have

The asymptotic properties of the OLS estimator we will now discuss are:

1. Consistency
2. Asymptotic Normality
3. Asymptotic Efficiency
Consistency
Unbiasedness of estimators, though important, cannot always be achieved

So economists agree that consistency is a minimal requirement that a useful estimator must fulfil

Let us first try to understand it intuitively with the help of a diagram
Let β̂₁ be the OLS estimator of β₁ (equally applicable for any β̂ₖ)

For each n, β̂₁ has a probability distribution, representing its possible values in different random samples of size n

If this estimator is consistent, then the distribution of β̂₁ becomes more and more tightly concentrated around β₁ as the sample grows

• As n tends to infinity (n → ∞), the distribution of β̂₁ collapses to the single point β₁

• This means that if we collect more and more data, then we can get our estimator very close to β₁
[Figure: sampling distributions of β̂₁ tightening around β₁ as n increases; from Wooldridge p. 168]
Confusion may arise because in reality, we only have a fixed
sample size when we are doing applied work!

Best to think of consistency as a "thought experiment" as follows: let's say we could obtain numerous random samples for a given sample size n (just like the repeated sampling discussed in previous lectures). This will give us the sampling distribution of β̂₁ for that n. Now, ask what would happen as n keeps getting larger and larger

The main idea is that obtaining more and more data should get us
very close to the true parameter value of interest
Consistency is based on the Law of Large Numbers (LLN), which
we encountered last week

Remember that the LLN says that sample averages converge (in probability) to population means. Thus the LLN helps us show that the sampling distribution of the OLS estimator is centered around the true parameter value. This is true asymptotically, i.e. it holds when n → ∞

Technical jargon of consistency: we say β̂ₖ is a consistent estimator of βₖ if

lim_{n→∞} P(|β̂ₖ − βₖ| > δ) = 0 for all δ > 0

In words, this means that asymptotically (i.e. as n → ∞), the probability that the OLS estimator deviates more than δ from the true parameter value is zero. Also referred to as:

plim β̂ₖ = βₖ

Thus, consistency is a large sample property that (loosely speaking) implies that if we obtain more and more data, the probability that our estimator is some positive distance away from the true parameter value becomes smaller and smaller, i.e. our estimator approaches the true parameter value
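A small Monte Carlo sketch of this "thought experiment" (illustrative only; the data-generating process and true slope below are assumptions, not from the slides): for each n we re-draw many samples and watch the sampling distribution of β̂₁ tighten around β₁ as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)
beta1_true = 2.0   # assumed true slope, for illustration only

def ols_slope(n):
    """Draw one random sample of size n and return the OLS slope estimate."""
    x = rng.normal(size=n)
    y = 1.0 + beta1_true * x + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# Sampling distribution of beta1_hat for increasing n: it collapses towards beta1_true
for n in (25, 100, 1000, 10000):
    draws = np.array([ols_slope(n) for _ in range(2000)])
    print(f"n={n:6d}  mean={draws.mean():.3f}  std={draws.std():.4f}")
```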
Asymptotic Normality
Now, we need more than consistency to do inference

We need to know the sampling distribution of the OLS estimators (remember this from last week's discussion of Hypothesis Testing?)

In large samples, under the Gauss-Markov assumptions, the distribution of the OLS estimators is approximately normal

Remember we discussed the Central Limit Theorem last week regarding the shape of the sampling distribution of the OLS estimator?

The Central Limit Theorem is also an asymptotic concept, i.e. it says that if we take repeated independent random samples of size n from a population, then as n → ∞, the distribution of the sample means approaches a normal distribution

This is true even if the distribution in the parent population is not normal

Asymptotic normality of our OLS estimator implies that our t and F statistics have approximate t and F distributions in large samples
• For the purposes of hypothesis testing, therefore, nothing
changes from what we have done before
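A quick illustrative sketch of the CLT itself (not from the slides; the exponential parent population and sample sizes are arbitrary choices): standardized sample means from a clearly non-normal population behave more and more like a standard normal as n grows.

```python
import numpy as np

rng = np.random.default_rng(2)

# Parent population is exponential (mean 1, variance 1) -- clearly non-normal.
# Yet the standardized sample mean looks more and more normal as n grows.
for n in (5, 50, 500):
    means = rng.exponential(scale=1.0, size=(20000, n)).mean(axis=1)
    z = (means - 1.0) * np.sqrt(n)              # standardize using true mean and sd
    print(f"n={n:4d}  skewness of z ~ {np.mean(z**3):.3f}  "
          f"P(|z| < 1.96) ~ {np.mean(np.abs(z) < 1.96):.3f}")
```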
Asymptotic Efficiency
Under the Gauss-Markov assumptions, the OLS estimators β̂ₖ have the smallest asymptotic variance among the class of all consistent estimators of βₖ
Specification Problems
Multicollinearity
If the correlation between two explanatory variables is too high, it becomes difficult to tell which one is influencing the outcome variable y

For example, in a wage equation, we may want to include both age and experience. You may expect these two to be correlated (older people have more experience), but if they are very highly correlated, then it becomes hard to identify the individual impact of these variables

In the extreme case that the two variables move exactly together (i.e. one explanatory variable is an exact linear combination of one or more other explanatory variables), it becomes impossible to identify the individual effects
This is known as perfect collinearity. Technically, this implies that the (X′X) matrix is not invertible

But multicollinearity (i.e. explanatory variables are highly but not perfectly correlated) can still be tolerated

Multicollinearity is not a violation of the G-M assumptions, so the OLS estimator β̂ is still unbiased, but Var(β̂) may be large

Hence, t-statistics may be quite small and you will think the variable is not significant when in fact it is. F-statistics may be very large though, as is R²
All models contain some multicollinearity – the question is when it becomes a problem

Multicollinearity is a problem of degree, ranging from perfect to zero. It is a serious problem only when it is close to the former (perfect) end

Multicollinearity problems might be mitigated by either collecting more data (sometimes infeasible) or dropping some of the explanatory variables (be careful though!)
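The variance-inflation point can be illustrated with a short simulation (the data-generating process and coefficient values are assumptions for the example, not from the slides): as the correlation between x₁ and x₂ rises, the sampling variability of β̂₁ grows even though the estimator remains unbiased.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200

def slope_sd(corr):
    """Std. dev. of beta1_hat across simulated samples when corr(x1, x2) = corr."""
    draws = []
    for _ in range(2000):
        x1 = rng.normal(size=N)
        x2 = corr * x1 + np.sqrt(1 - corr**2) * rng.normal(size=N)
        y = 1.0 + 0.5 * x1 + 0.5 * x2 + rng.normal(size=N)
        X = np.column_stack([np.ones(N), x1, x2])
        draws.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
    return np.std(draws)

# Higher correlation between x1 and x2 inflates the sampling variance of beta1_hat
for corr in (0.0, 0.5, 0.9, 0.99):
    print(f"corr={corr:.2f}  sd(beta1_hat)={slope_sd(corr):.3f}")
```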
Including Irrelevant Regressors
Inclusion of an irrelevant regressor implies that one (or more)
explanatory variable is included in the model even though it has no
partial effect on y in the population

So, let's say the true model is

y = β₀ + β₁x₁ + ε

But we estimate

y = β₀ + β₁x₁ + β₂x₂ + ε

where x₂ is the irrelevant variable such that β₂ = 0

What is its impact on β̂₁?

Well, there is no impact on the unbiasedness of β̂₁

However, there will be some loss in efficiency, i.e. the sampling variance of β̂₁ will be larger. Intuitively, this is because if x₂ has no partial effect on y, then including it in the model only increases the multicollinearity problem (assuming x₂ and x₁ are correlated), which, as we just discussed a few slides ago, will increase the variance of β̂₁
Hence, including too many explanatory variables in a model is not
a great idea
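A simulation sketch of this point (the true model, coefficient values and correlation between the regressors are assumed purely for illustration): β̂₁ stays centred on β₁ whether or not the irrelevant x₂ is included, but its sampling variance is larger when x₂ is in the model.

```python
import numpy as np

rng = np.random.default_rng(4)
N, reps = 100, 3000
b1_short, b1_long = [], []

for _ in range(reps):
    x1 = rng.normal(size=N)
    x2 = 0.8 * x1 + 0.6 * rng.normal(size=N)   # correlated with x1, but beta2 = 0
    y = 1.0 + 2.0 * x1 + rng.normal(size=N)    # true model excludes x2
    Xs = np.column_stack([np.ones(N), x1])
    Xl = np.column_stack([np.ones(N), x1, x2])
    b1_short.append(np.linalg.lstsq(Xs, y, rcond=None)[0][1])
    b1_long.append(np.linalg.lstsq(Xl, y, rcond=None)[0][1])

# Both are (roughly) unbiased for beta1 = 2, but the over-specified model is noisier
print(f"short model: mean={np.mean(b1_short):.3f}  sd={np.std(b1_short):.3f}")
print(f"long  model: mean={np.mean(b1_long):.3f}  sd={np.std(b1_long):.3f}")
```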

But notice that inclusion of irrelevant variables is not fatal to the OLS estimators (i.e. they are still unbiased), but exclusion of relevant variables might be, leading to biased estimators, and hence may pose a problem! The latter is the omitted variable bias (OVB) problem

…that we will discuss in detail next week

For now, let's continue with our discussion of a few more specification problems
Misspecifying the Functional Form
So far, we have assumed the model is linear (primarily due to its convenience)

What if this is not appropriate? i.e. what if the true model is a non-
linear one?

Non-linearities can arise in two different ways:

1. The model is still linear in parameters but non-linear in its explanatory variables, i.e. we include non-linear functions of the xᵢ's as additional explanatory variables
   E.g. including (age²) in an individual wage equation
   Such a model can still be estimated using OLS
2. The model is non-linear in parameters, e.g.
   y = β₀ + β₁xᵢ^β₂
   Estimation of such a model is less easy, and uses a non-linear version of least squares that does not easily yield analytical solutions for estimators (a sketch of such an estimation follows below)
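As an illustrative sketch of case 2 (not from the slides), non-linear least squares can be run with scipy's curve_fit; the functional form, parameter values and starting values below are assumptions for the example.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(5)

# Model that is non-linear in parameters: y = b0 + b1 * x**b2 + error
def model(x, b0, b1, b2):
    return b0 + b1 * x**b2

x = rng.uniform(1.0, 10.0, size=300)
y = model(x, 1.0, 2.0, 0.7) + rng.normal(scale=0.5, size=300)

# No closed-form solution: the estimates are found iteratively from starting values p0
params, cov = curve_fit(model, x, y, p0=[0.5, 1.0, 1.0])
print("estimates:", params)
```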

How do we test if the linear model is the appropriate one?

We use Ramsey's RESET test (Regression Equation Specification Error Test)
The basic idea is that we add powers of the X variables (multivariate case) to the estimating equation. But, instead of using powers of X, we use powers of the fitted values ŷ, as the null hypothesis of the test is that non-linear functions of the fitted values do not explain y

Hence, the procedure is:

• Estimate y = β₀ + β₁x₁ + ⋯ + β_K x_K + ε
• Obtain fitted values ŷ
• Estimate y = β₀ + β₁x₁ + ⋯ + β_K x_K + c·ŷ² + ε
• Test for significance of c. If it is significantly different from zero, reject the linear specification (a numerical sketch of this procedure follows below)
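A minimal numerical sketch of the RESET steps above, using simulated data in which the true relationship is assumed non-linear (numpy/scipy; not part of the original slides):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
N = 400

# Assumed truth is non-linear in x, but we first fit a linear model
x = rng.uniform(0, 5, size=N)
y = 1.0 + 0.5 * x**2 + rng.normal(size=N)

X = np.column_stack([np.ones(N), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
y_fit = X @ b                                   # step 2: fitted values

# Step 3: re-estimate with y_fit**2 added as an extra regressor
X_aug = np.column_stack([X, y_fit**2])
b_aug = np.linalg.lstsq(X_aug, y, rcond=None)[0]

# Step 4: F-test that the coefficient on y_fit**2 is zero (one restriction)
rss_r = np.sum((y - X @ b) ** 2)
rss_u = np.sum((y - X_aug @ b_aug) ** 2)
F = (rss_r - rss_u) / (rss_u / (N - X_aug.shape[1]))
p_value = stats.f.sf(F, 1, N - X_aug.shape[1])
print(f"RESET F = {F:.2f}, p-value = {p_value:.4f}")   # small p-value: reject linearity
```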
Stability Testing
So far, we have assumed that the functional form of the model is the same for all observations in the sample

However, sometimes this might not be the case

Effects may be different across two or more sub-samples. In a cross-sectional sample, we can think of sub-samples made up of males and females. In a time series context, sub-samples are typically defined by time. E.g. the regression coefficients may be different before and after a major change in macroeconomic policy
Such a change in regression coefficients is referred to as a structural break

In order to test for a structural break, we use a Chow Test

Formally, we wish to test for a break after the first n₁ observations. So,
• Estimate y = β₀ + β₁x₁ + ⋯ + β_K x_K + ε over the first n₁ observations. Obtain RSS₁
• Then, estimate y = β₀′ + β₁′x₁ + ⋯ + β_K′x_K + ε over the remaining n₂ observations. Obtain RSS₂
• Calculate RSS_U = RSS₁ + RSS₂
• Then estimate the restricted regression for the whole sample: y = β₀ + β₁x₁ + ⋯ + β_K x_K + ε over n₁ + n₂ observations and obtain RSS_R
• Use the F-test: F = [(RSS_R − RSS_U)/K] / [RSS_U/(N − 2K)] ~ F(K, N − 2K)
• This F-test is the Chow Test for structural break
• Null hypothesis H₀: coefficients are stable
• If we reject the null, then we can say there is a structural break
• Intuitively, RSS_R must be larger than RSS_U if there is a break. So if we find a big enough difference, we reject the null of stability (a numerical sketch of the Chow Test follows below)
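A minimal numerical sketch of the Chow Test above on simulated data with an assumed break (the break point, sample sizes and coefficients are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n1, n2, K = 120, 80, 2                      # K = number of estimated coefficients

def rss(X, y):
    """Residual sum of squares from an OLS fit of y on X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((y - X @ b) ** 2)

# Simulated break: the slope changes after the first n1 observations
x = rng.normal(size=n1 + n2)
y = np.where(np.arange(n1 + n2) < n1, 1.0 + 0.5 * x, 1.0 + 1.5 * x) + rng.normal(size=n1 + n2)
X = np.column_stack([np.ones(n1 + n2), x])

rss1 = rss(X[:n1], y[:n1])                  # sub-sample regressions
rss2 = rss(X[n1:], y[n1:])
rss_u = rss1 + rss2
rss_r = rss(X, y)                           # pooled (restricted) regression

N = n1 + n2
F = ((rss_r - rss_u) / K) / (rss_u / (N - 2 * K))
print(f"Chow F = {F:.2f}, p-value = {stats.f.sf(F, K, N - 2 * K):.4f}")
```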
Violation of OLS Assumptions
So far, we have seen that the Gauss-Markov Assumptions (A1)-
(A4) may not always be satisfied

But that need not be fatal for the OLS estimator, because it is still
unbiased

Now we will discuss heteroscedasticity and autocorrelation, which represent violations of (A3) and (A4)

Remember that (A3) implies homoscedasticity or constant variance of the error term, and (A4) implies no autocorrelation, i.e. the error terms are uncorrelated
In matrix form, therefore, these relate to the variance-covariance matrix of the errors, that is Var(ε) = E(εε′) = σ²I_N

where I_N is an N×N identity matrix with 1s on the diagonal and zeroes elsewhere. Hence, the variance-covariance matrix has σ² on the diagonal and zeroes elsewhere

• Heteroscedasticity implies the diagonal elements of the matrix are no longer the same, i.e. σᵢ² ≠ σⱼ²
• Autocorrelation implies that this matrix has non-zero elements off the diagonal
Heteroscedasticity
The variance of the error term now differs across observations

Often occurs in cross-sectional data over the range of the 𝑋 variable

E.g. holiday expenditure and income; we may expect that higher income is associated with higher holiday expenditure (i.e. the relationship between these two variables is positive), but also that the variation in holiday expenditure among high-income households is larger than the variation among low-income households
In practice, this looks like the scatter plot is ‘fanning out’
Hence, our regression line is still correct, i.e. it still minimizes the sum of squared residuals, but the scatter of points around the line fans out over parts of its range

Consequences of heteroscedasticity:
The OLS estimator is still unbiased and consistent, but inefficient (no longer BLUE) and no longer asymptotically efficient
• Unbiased because the X's are still exogenous (A2 still holds)
• Estimators of the error variance are biased. This invalidates inference because we can no longer trust the usual OLS standard errors, since some values of X are associated with bigger errors
• Inefficiency is harder to demonstrate, but simply put, OLS gives equal weight to all observations; if some have higher variance than others, it would be better to attach lower weights to the ones with higher variance

How to deal with heteroscedasticity

The most common approach is to use robust standard errors

Since it is only the standard errors rather than the coefficients that are biased due to heteroscedasticity, one cure is to find alternative estimates of the standard errors

White (1980) suggested replacing σᵢ² with ε̂ᵢ², i.e. the squared OLS residual, in the standard formula for calculating the OLS variance-covariance matrix (details not necessary)

Use of such robust standard errors has now become standard practice in most applied economic work

Implemented in STATA using the 'robust' option after the main command

There may be a bit of heteroscedasticity in most empirical settings, hence it is best to use robust standard errors
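A sketch of White's idea in numpy (not from the slides; the heteroscedastic data-generating process is an assumption for the example): the middle of the "sandwich" replaces σ²I_N with a diagonal matrix of squared OLS residuals.

```python
import numpy as np

rng = np.random.default_rng(8)
N = 1000

# Simulated heteroscedastic data: the error standard deviation grows with x
x = rng.uniform(1, 10, size=N)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3 * x, size=N)
X = np.column_stack([np.ones(N), x])

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b                                        # OLS residuals

# Conventional (homoscedastic) standard errors
s2 = e @ e / (N - X.shape[1])
se_ols = np.sqrt(np.diag(s2 * XtX_inv))

# White (HC0) robust standard errors: replace sigma^2 I with diag(e_i^2)
meat = X.T @ (X * (e**2)[:, None])
se_white = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("conventional SE:", se_ols)
print("White robust SE:", se_white)
```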
Sometimes, we also ‘cluster’ our standard errors. This is similar to
‘robust’ but we give STATA some hint regarding what we think the
pattern of heteroscedasticity may look like

For example, if you are working with a dataset of firms, you would
cluster your standard errors by firm (the unit of analysis)

The intuition is that observations from the same firm have errors drawn from the same error distribution. Different firms will then have different error distributions, and this is where the heteroscedasticity comes from
Estimation in the presence of heteroscedasticity
Use of Generalized Least Squares (GLS)

As we know, OLS assumes E(εε′) = σ²I_N, a very restrictive form of error structure

GLS allows the assumption that E(εε′) = σ²φ, where φ is any N×N (positive definite) matrix

Taking account of φ enables us to get estimators that are BLUE

The GLS approach works by applying a transformation to the variables in the model

Let's first assume we can decompose the φ matrix as follows:

φ⁻¹ = P′P

So, we can rewrite the original φ as

φ = (P′P)⁻¹ = P⁻¹(P′)⁻¹
PφP′ = PP⁻¹(P′)⁻¹P′ = I

The point of this is to show that there is some matrix P that we can use to pre- and post-multiply the φ matrix to get an identity matrix I, which represents a homoscedastic error structure
Next, we pre-multiply the whole model with the P matrix:
𝑃𝑦 = 𝑃𝑋𝛽 + 𝑃𝜀
which we can then re-write in different notation to represent a
‘transformed model’:
𝑦* = 𝑋*𝛽 + 𝜀* (*)
Intuitively, the P matrix is some external piece of information that we feed into the regression model, containing information about the structure of heteroscedasticity

The starred model (*) satisfies the classical assumptions about the error term and we can apply OLS to it to obtain estimators that are BLUE (for more details, see Wooldridge p. 277)
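A minimal sketch of the GLS transformation, assuming (purely for illustration) that we know the heteroscedasticity structure Var(εᵢ) = σ²xᵢ²; then P = diag(1/xᵢ), and OLS on the transformed (starred) variables is the GLS estimator:

```python
import numpy as np

rng = np.random.default_rng(9)
N = 500

# Assumed known heteroscedasticity structure: Var(eps_i) = sigma^2 * x_i^2
x = rng.uniform(1, 10, size=N)
y = 1.0 + 0.5 * x + x * rng.normal(size=N)
X = np.column_stack([np.ones(N), x])

# P = diag(1 / x_i) transforms the model so the transformed errors are homoscedastic
y_star = y / x
X_star = X / x[:, None]

# OLS on the transformed model = GLS on the original model
b_gls = np.linalg.lstsq(X_star, y_star, rcond=None)[0]
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("OLS:", b_ols, " GLS:", b_gls)
```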
Final point: the robust standard errors approach is used much more in practice. This is because GLS requires us to know something about the structure of heteroscedasticity, whereas the robust approach estimates the required correction from the data
Conclusion
Today, we studied asymptotic properties of OLS estimators

Then we moved on to a discussion of situations where OLS assumptions are relaxed or violated. So far, none have led to the OLS estimators being biased, i.e. none have been fatal!

Next week, we will discuss autocorrelation, the remaining case where an OLS assumption is violated without leading to bias, before moving on to the important cases where bias does arise, making OLS estimation very difficult
