Financial Econometrics Lecture 4


FINANCIAL ECONOMETRICS

Lecturer: EDMUND TAMUKE

Lecture 4
Week 5: July 12, 2024
Last week
We studied the desirable (small sample) properties of the OLS
estimators under Gauss-Markov assumptions, i.e. unbiasedness
and efficiency, as well as learnt how to do hypothesis testing

Today, we will first briefly look at deriving unbiasedness and efficiency of the OLS estimators using matrix form

Then, we will move on to discussing asymptotic or large sample properties of OLS estimators

Finally, we will focus on various specification problems and begin the discussion on violation of OLS assumptions
Readings
Wooldridge Chapters 5, 6 and 8

Verbeek Chapters 2, 3 and 4


Brief Recap of Properties of OLS Estimator
Let’s consider the linear regression model:
y = Xβ + ε
where y is (N×1), X is (N×K), β is (K×1) and ε is (N×1)
Under the Gauss-Markov assumptions (A1)-(A4), the OLS estimator β̂ is unbiased and efficient

Unbiasedness implies E(β̂) = β. In matrix form, remember β̂ = (X′X)⁻¹X′y. So, taking expectations, we get

E(β̂) = E[(X′X)⁻¹X′y] = E[(X′X)⁻¹X′(Xβ + ε)]
      = E[(X′X)⁻¹(X′X)β + (X′X)⁻¹X′ε]
      = β + (X′X)⁻¹X′E(ε) = β

(since we are conditioning on X, the regressors are fixed and, by (A2), E(ε|X) = 0, i.e. zero conditional mean of errors)
Efficiency: The variance-covariance matrix V(β̂) is a K×K matrix consisting of the variances of β̂₁, β̂₂, …, β̂_K on the diagonal and their covariances as the off-diagonal elements

V(β̂) = E[(β̂ − β)(β̂ − β)′] = E[(X′X)⁻¹X′εε′X(X′X)⁻¹]
      = (X′X)⁻¹X′E(εε′)X(X′X)⁻¹
      = (X′X)⁻¹X′(σ²I_N)X(X′X)⁻¹
      = σ²(X′X)⁻¹(X′X)(X′X)⁻¹
      = σ²(X′X)⁻¹

Reminder: the Gauss-Markov Theorem says that under assumptions (A1)-(A4), the OLS estimator β̂ is the Best Linear Unbiased Estimator (BLUE) for β
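To make the matrix formulas concrete, here is a minimal numpy sketch (not part of the original slides) that computes β̂ = (X′X)⁻¹X′y and the estimated V(β̂) = σ̂²(X′X)⁻¹; the sample size, regressors and "true" parameter values are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: N observations, K = 3 columns in X (constant, x1, x2)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N), rng.normal(size=N)])
beta_true = np.array([1.0, 2.0, -0.5])          # assumed "true" parameters
y = X @ beta_true + rng.normal(size=N)

# OLS estimator: beta_hat = (X'X)^(-1) X'y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Estimated error variance and variance-covariance matrix s^2 (X'X)^(-1)
resid = y - X @ beta_hat
K = X.shape[1]
s2 = resid @ resid / (N - K)
V_beta_hat = s2 * XtX_inv

print("beta_hat:", beta_hat)
print("standard errors:", np.sqrt(np.diag(V_beta_hat)))
```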
Asymptotic Properties of the OLS Estimator
In many cases, the small sample properties of the OLS estimator may deviate from the ones discussed so far

E.g. when the Gauss-Markov assumptions are relaxed, properties of estimators are very difficult to ascertain

In such cases we use an alternative approach to evaluate the quality of our estimators, which is based on "asymptotic theory"

Asymptotic (or large sample) theory asks what happens if, hypothetically, the sample size becomes very large (technically speaking, infinitely large)

Asymptotically, estimators usually have nice properties, and we use these asymptotic properties to approximate the properties in the finite sample that we happen to have

The asymptotic properties of the OLS estimator we will now discuss are:

1. Consistency
2. Asymptotic Normality
3. Asymptotic Efficiency
Consistency
Unbiasedness of estimators, though important, cannot always be achieved

So economists agree that consistency is a minimal requirement that a useful estimator must fulfil

Let us first try to understand it intuitively with the help of a diagram
Let β̂₁ be the OLS estimator of β₁ (equally applicable for any β̂ₖ)

For each n, β̂₁ has a probability distribution, representing its possible values in different random samples of size n

If this estimator is consistent, then the distribution of β̂₁ becomes more and more tightly concentrated around β₁ as the sample grows

• As n tends to infinity (n → ∞), the distribution of β̂₁ collapses to the single point β₁

• This means that if we collect more and more data, then we can get our estimator very close to β₁
[Figure: sampling distributions of β̂₁ tightening around β₁ as n increases; from Wooldridge p. 168]
Confusion may arise because in reality, we only have a fixed
sample size when we are doing applied work!

Best to think of consistency as a "thought experiment" as follows: let's say we could obtain numerous random samples for a given sample size n (just like the repeated sampling discussed in previous lectures). This will give us the sampling distribution of β̂₁ for that n. Now, ask what would happen as n keeps getting larger and larger

The main idea is that obtaining more and more data should get us
very close to the true parameter value of interest
Consistency is based on the Law of Large Numbers (LLN), which
we encountered last week

Remember that the LLN says that sample averages converge (in probability) to population means. Thus the LLN helps us show that the sampling distribution of the OLS estimator is centered around the true parameter value. This is true asymptotically, i.e. it holds when n → ∞

Technical jargon of consistency: we say β̂ₖ is a consistent estimator of βₖ if

lim_{n→∞} P(|β̂ₖ − βₖ| > δ) = 0 for all δ > 0

In words, this means that asymptotically (i.e. as n → ∞), the probability that the OLS estimator deviates more than δ from the true parameter value is zero. Also referred to as:

plim β̂ₖ = βₖ

Thus, consistency is a large sample property that (loosely speaking) implies that if we obtain more and more data, the probability that our estimator is some positive distance away from the true parameter value becomes smaller and smaller, i.e. our estimator approaches the true parameter value
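A small Monte Carlo sketch of this "thought experiment" (illustrative only; the data-generating process and true slope below are assumptions, not from the slides): for each n we re-draw many samples and watch the sampling distribution of β̂₁ tighten around β₁ as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)
beta1_true = 2.0   # assumed true slope, for illustration only

def ols_slope(n):
    """Draw one random sample of size n and return the OLS slope estimate."""
    x = rng.normal(size=n)
    y = 1.0 + beta1_true * x + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# Sampling distribution of beta1_hat for increasing n: it collapses towards beta1_true
for n in (25, 100, 1000, 10000):
    draws = np.array([ols_slope(n) for _ in range(2000)])
    print(f"n={n:6d}  mean={draws.mean():.3f}  std={draws.std():.4f}")
```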
Asymptotic Normality
Now, we need more than consistency to do inference

We need to know the sampling distribution of the OLS estimators (remember this from last week's discussion of Hypothesis Testing?)

In large samples, under the Gauss-Markov assumptions, the distribution of the OLS estimators is approximately normal

Remember we discussed the Central Limit Theorem last week regarding the shape of the sampling distribution of the OLS estimator?

The Central Limit Theorem is also an asymptotic concept, i.e. it says that if we take repeated independent random samples of size n from a population, then as n → ∞, the distribution of the sample means approaches a normal distribution

This is true even if the distribution in the parent population is not normal

Asymptotic normality of our OLS estimator implies that our t and F statistics have approximate t and F distributions in large samples
• For the purposes of hypothesis testing, therefore, nothing
changes from what we have done before
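A quick illustrative sketch of the CLT itself (not from the slides; the exponential parent population and sample sizes are arbitrary choices): standardized sample means from a clearly non-normal population behave more and more like a standard normal as n grows.

```python
import numpy as np

rng = np.random.default_rng(2)

# Parent population is exponential (mean 1, variance 1) -- clearly non-normal.
# Yet the standardized sample mean looks more and more normal as n grows.
for n in (5, 50, 500):
    means = rng.exponential(scale=1.0, size=(20000, n)).mean(axis=1)
    z = (means - 1.0) * np.sqrt(n)              # standardize using true mean and sd
    print(f"n={n:4d}  skewness of z ~ {np.mean(z**3):.3f}  "
          f"P(|z| < 1.96) ~ {np.mean(np.abs(z) < 1.96):.3f}")
```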
Asymptotic Efficiency
Under the Gauss-Markov assumptions, the OLS estimators β̂ₖ have the smallest asymptotic variance among the class of all consistent estimators of βₖ
Specification Problems
Multicollinearity
If the correlation between two explanatory variables is too high, it becomes difficult to tell which one is influencing the outcome variable y

For example, in a wage equation, we may want to include both age and experience. You may expect these two to be correlated (older people have more experience), but if they are very highly correlated, then it becomes hard to identify the individual impact of these variables

In the extreme case that the two variables move exactly together (i.e. one explanatory variable is an exact linear combination of one or more other explanatory variables), it becomes impossible to identify the individual effects
This is known as perfect collinearity. Technically, this implies that the (X′X) matrix is not invertible

But multicollinearity (i.e. explanatory variables are highly but not perfectly correlated) can still be tolerated

Multicollinearity is not a violation of the G-M assumptions, so the OLS estimator β̂ is still unbiased, but Var(β̂) may be large

Hence, t-statistics may be quite small and you will think the variable is not significant when in fact it is. F-statistics may be very large though, as is R²
All models contain some multicollinearity – the question is when it becomes a problem

Multicollinearity is a problem of degree, ranging from perfect to zero. It is a serious problem only when it is close to the former (perfect) end

Multicollinearity problems might be mitigated by either collecting more data (sometimes infeasible) or dropping some of the explanatory variables (be careful though!)
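The variance-inflation point can be illustrated with a short simulation (the data-generating process and coefficient values are assumptions for the example, not from the slides): as the correlation between x₁ and x₂ rises, the sampling variability of β̂₁ grows even though the estimator remains unbiased.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200

def slope_sd(corr):
    """Std. dev. of beta1_hat across simulated samples when corr(x1, x2) = corr."""
    draws = []
    for _ in range(2000):
        x1 = rng.normal(size=N)
        x2 = corr * x1 + np.sqrt(1 - corr**2) * rng.normal(size=N)
        y = 1.0 + 0.5 * x1 + 0.5 * x2 + rng.normal(size=N)
        X = np.column_stack([np.ones(N), x1, x2])
        draws.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
    return np.std(draws)

# Higher correlation between x1 and x2 inflates the sampling variance of beta1_hat
for corr in (0.0, 0.5, 0.9, 0.99):
    print(f"corr={corr:.2f}  sd(beta1_hat)={slope_sd(corr):.3f}")
```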
Including Irrelevant Regressors
Inclusion of an irrelevant regressor implies that one (or more)
explanatory variable is included in the model even though it has no
partial effect on y in the population

So, let's say the true model is

y = β₀ + β₁x₁ + ε

But we estimate

y = β₀ + β₁x₁ + β₂x₂ + ε

where x₂ is the irrelevant variable such that β₂ = 0

What is its impact on β̂₁?

Well, there is no impact on the unbiasedness of β̂₁

However, there will be some loss in efficiency, i.e. the sampling variance of β̂₁ will be larger. Intuitively, this is because if x₂ has no partial effect on y, then including it in the model only increases the multicollinearity problem (assuming x₂ and x₁ are correlated), which, as we just discussed a few slides ago, will increase the variance of β̂₁
Hence, including too many explanatory variables in a model is not
a great idea
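A simulation sketch of this point (the true model, coefficient values and correlation between the regressors are assumed purely for illustration): β̂₁ stays centred on β₁ whether or not the irrelevant x₂ is included, but its sampling variance is larger when x₂ is in the model.

```python
import numpy as np

rng = np.random.default_rng(4)
N, reps = 100, 3000
b1_short, b1_long = [], []

for _ in range(reps):
    x1 = rng.normal(size=N)
    x2 = 0.8 * x1 + 0.6 * rng.normal(size=N)   # correlated with x1, but beta2 = 0
    y = 1.0 + 2.0 * x1 + rng.normal(size=N)    # true model excludes x2
    Xs = np.column_stack([np.ones(N), x1])
    Xl = np.column_stack([np.ones(N), x1, x2])
    b1_short.append(np.linalg.lstsq(Xs, y, rcond=None)[0][1])
    b1_long.append(np.linalg.lstsq(Xl, y, rcond=None)[0][1])

# Both are (roughly) unbiased for beta1 = 2, but the over-specified model is noisier
print(f"short model: mean={np.mean(b1_short):.3f}  sd={np.std(b1_short):.3f}")
print(f"long  model: mean={np.mean(b1_long):.3f}  sd={np.std(b1_long):.3f}")
```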

But notice that inclusion of irrelevant variables is not fatal to the OLS estimators (i.e. they are still unbiased), but exclusion of relevant variables might be, leading to biased estimators, and hence may pose a problem! The latter is the omitted variable bias (OVB) problem

…that we will discuss in detail next week

For now, let's continue with our discussion of a few more specification problems
Misspecifying the Functional Form
So far, we have assumed the model is linear (primarily due to its convenience)

What if this is not appropriate? i.e. what if the true model is a non-
linear one?

Non-linearities can arise in two different ways:

1. The model is still linear in parameters but non-linear in its explanatory variables, i.e. we include non-linear functions of the xᵢ's as additional explanatory variables
   E.g. including (age²) in an individual wage equation
   Such a model can still be estimated using OLS
2. The model is non-linear in parameters, e.g.
   y = β₀ + β₁xᵢ^β₂
   Estimation of such a model is less easy, and uses a non-linear version of least squares that does not easily yield analytical solutions for estimators (a sketch of such an estimation follows below)
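As an illustrative sketch of case 2 (not from the slides), non-linear least squares can be run with scipy's curve_fit; the functional form, parameter values and starting values below are assumptions for the example.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(5)

# Model that is non-linear in parameters: y = b0 + b1 * x**b2 + error
def model(x, b0, b1, b2):
    return b0 + b1 * x**b2

x = rng.uniform(1.0, 10.0, size=300)
y = model(x, 1.0, 2.0, 0.7) + rng.normal(scale=0.5, size=300)

# No closed-form solution: the estimates are found iteratively from starting values p0
params, cov = curve_fit(model, x, y, p0=[0.5, 1.0, 1.0])
print("estimates:", params)
```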

How do we test if the linear model is the appropriate one?

We use Ramsey's RESET test (Regression Equation Specification Error Test)
The basic idea is that we add powers of the X variables (multivariate case) to the estimating equation. But, instead of using powers of X, we use powers of the fitted values ŷ, as the null hypothesis of the test is that non-linear functions of the fitted values do not explain y

Hence, the procedure is:

• Estimate y = β₀ + β₁x₁ + ⋯ + β_K x_K + ε
• Obtain fitted values ŷ
• Estimate y = β₀ + β₁x₁ + ⋯ + β_K x_K + c·ŷ² + ε
• Test for significance of c. If it is significantly different from zero, reject the linear specification (a numerical sketch of this procedure follows below)
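A minimal numerical sketch of the RESET steps above, using simulated data in which the true relationship is assumed non-linear (numpy/scipy; not part of the original slides):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
N = 400

# Assumed truth is non-linear in x, but we first fit a linear model
x = rng.uniform(0, 5, size=N)
y = 1.0 + 0.5 * x**2 + rng.normal(size=N)

X = np.column_stack([np.ones(N), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
y_fit = X @ b                                   # step 2: fitted values

# Step 3: re-estimate with y_fit**2 added as an extra regressor
X_aug = np.column_stack([X, y_fit**2])
b_aug = np.linalg.lstsq(X_aug, y, rcond=None)[0]

# Step 4: F-test that the coefficient on y_fit**2 is zero (one restriction)
rss_r = np.sum((y - X @ b) ** 2)
rss_u = np.sum((y - X_aug @ b_aug) ** 2)
F = (rss_r - rss_u) / (rss_u / (N - X_aug.shape[1]))
p_value = stats.f.sf(F, 1, N - X_aug.shape[1])
print(f"RESET F = {F:.2f}, p-value = {p_value:.4f}")   # small p-value: reject linearity
```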
Stability Testing
So far, we have assumed that the functional form of the model is the same for all observations in the sample

However, sometimes this might not be the case

Effects may be different across two or more sub-samples. In a cross-sectional sample, we can think of sub-samples made up of males and females. In a time series context, sub-samples are typically defined by time. E.g. the regression coefficients may be different before and after a major change in macroeconomic policy
Such a change in regression coefficients is referred to as a structural break

In order to test for a structural break, we use a Chow Test

Formally, we wish to test for a break after the first n₁ observations. So,
• Estimate y = β₀ + β₁x₁ + ⋯ + β_K x_K + ε over the first n₁ observations. Obtain RSS₁
• Then, estimate y = β₀′ + β₁′x₁ + ⋯ + β_K′x_K + ε over the remaining n₂ observations. Obtain RSS₂
• Calculate RSS_U = RSS₁ + RSS₂
• Then estimate the restricted regression for the whole sample: y = β₀ + β₁x₁ + ⋯ + β_K x_K + ε over n₁ + n₂ observations and obtain RSS_R
• Use the F-test: F = [(RSS_R − RSS_U)/K] / [RSS_U/(N − 2K)] ~ F(K, N − 2K)
• This F-test is the Chow Test for structural break
• Null hypothesis H₀: coefficients are stable
• If we reject the null, then we can say there is a structural break
• Intuitively, RSS_R must be larger than RSS_U if there is a break. So if we find a big enough difference, we reject the null of stability (a numerical sketch of the Chow Test follows below)
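A minimal numerical sketch of the Chow Test above on simulated data with an assumed break (the break point, sample sizes and coefficients are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n1, n2, K = 120, 80, 2                      # K = number of estimated coefficients

def rss(X, y):
    """Residual sum of squares from an OLS fit of y on X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((y - X @ b) ** 2)

# Simulated break: the slope changes after the first n1 observations
x = rng.normal(size=n1 + n2)
y = np.where(np.arange(n1 + n2) < n1, 1.0 + 0.5 * x, 1.0 + 1.5 * x) + rng.normal(size=n1 + n2)
X = np.column_stack([np.ones(n1 + n2), x])

rss1 = rss(X[:n1], y[:n1])                  # sub-sample regressions
rss2 = rss(X[n1:], y[n1:])
rss_u = rss1 + rss2
rss_r = rss(X, y)                           # pooled (restricted) regression

N = n1 + n2
F = ((rss_r - rss_u) / K) / (rss_u / (N - 2 * K))
print(f"Chow F = {F:.2f}, p-value = {stats.f.sf(F, K, N - 2 * K):.4f}")
```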
Violation of OLS Assumptions
So far, we have seen that the Gauss-Markov Assumptions (A1)-
(A4) may not always be satisfied

But that need not be fatal for the OLS estimator, because it is still
unbiased

Now we will discuss heteroscedasticity and autocorrelation, which represent violations of (A3) and (A4)

Remember that (A3) implies homoscedasticity or constant variance of the error term, and (A4) implies no autocorrelation, i.e. the error terms are uncorrelated
In matrix form, therefore, these relate to the variance-covariance matrix of the errors, that is Var(ε) = E(εε′) = σ²I_N

where I_N is an N×N identity matrix with 1s on the diagonal and zeroes elsewhere. Hence, the variance-covariance matrix has σ² on the diagonal and zeroes elsewhere

• Heteroscedasticity implies the diagonal elements of the matrix are no longer the same, i.e. σᵢ² ≠ σⱼ²
• Autocorrelation implies that this matrix has non-zero elements off the diagonal
Heteroscedasticity
The variance of the error term now differs across observations

Often occurs in cross-sectional data over the range of the 𝑋 variable

E.g. holiday expenditure and income; we may expect that higher income is associated with higher holiday expenditure (i.e. the relationship between these two variables is positive), but also that the variation in holiday expenditure among high-income households is larger than the variation among low-income households
In practice, this looks like the scatter plot is ‘fanning out’
Hence, our regression line is still correct, i.e. it still minimizes the sum of squared residuals, but the scatter of points around the line fans out over parts of its range

Consequences of heteroscedasticity:
The OLS estimator is still unbiased and consistent, but inefficient (no longer BLUE) and no longer asymptotically efficient
• Unbiased because the X's are still exogenous (A2 still holds)
• Estimators of the error variance are biased. This invalidates inference because we can no longer trust the usual OLS standard errors, since some values of X are associated with bigger errors
• Inefficiency is harder to demonstrate, but simply put, OLS gives equal weight to all observations; if some have higher variance than others, it would be better to attach lower weights to the ones with higher variance

How to deal with heteroscedasticity

The most common approach is to use robust standard errors

Since it is only the standard errors rather than the coefficients that are biased due to heteroscedasticity, one cure is to find alternative estimates of the standard errors

White (1980) suggested replacing σᵢ² with ε̂ᵢ², i.e. the squared OLS residual, in the standard formula for calculating the OLS variance-covariance matrix (details not necessary)

Use of such robust standard errors has now become standard practice in most applied economic work

Implemented in STATA using the 'robust' option after the main command

There may be a bit of heteroscedasticity in most empirical settings, hence it is best to use robust standard errors
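A sketch of White's idea in numpy (not from the slides; the heteroscedastic data-generating process is an assumption for the example): the middle of the "sandwich" replaces σ²I_N with a diagonal matrix of squared OLS residuals.

```python
import numpy as np

rng = np.random.default_rng(8)
N = 1000

# Simulated heteroscedastic data: the error standard deviation grows with x
x = rng.uniform(1, 10, size=N)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3 * x, size=N)
X = np.column_stack([np.ones(N), x])

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b                                        # OLS residuals

# Conventional (homoscedastic) standard errors
s2 = e @ e / (N - X.shape[1])
se_ols = np.sqrt(np.diag(s2 * XtX_inv))

# White (HC0) robust standard errors: replace sigma^2 I with diag(e_i^2)
meat = X.T @ (X * (e**2)[:, None])
se_white = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("conventional SE:", se_ols)
print("White robust SE:", se_white)
```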
Sometimes, we also ‘cluster’ our standard errors. This is similar to
‘robust’ but we give STATA some hint regarding what we think the
pattern of heteroscedasticity may look like

For example, if you are working with a dataset of firms, you would
cluster your standard errors by firm (the unit of analysis)

The intuition is that observations from the same firm have errors drawn from the same error distribution. Different firms will then have different error distributions, and this is where the heteroscedasticity comes from
Estimation in the presence of heteroscedasticity
Use of Generalized Least Squares (GLS)

As we know, OLS assumes E(εε′) = σ²I_N, a very restrictive form of error structure

GLS allows the assumption that E(εε′) = σ²φ, where φ is any N×N (positive definite) matrix

Taking account of φ enables us to get estimators that are BLUE

The GLS approach works by applying a transformation to the variables in the model

Let's first assume we can decompose the φ matrix as follows:

φ⁻¹ = P′P

So, we can rewrite the original φ as

φ = (P′P)⁻¹ = P⁻¹(P′)⁻¹
PφP′ = PP⁻¹(P′)⁻¹P′ = I

The point of this is to show that there is some matrix P that we can use to pre- and post-multiply the φ matrix to get an identity matrix I, which represents a homoscedastic error structure
Next, we pre-multiply the whole model with the P matrix:
𝑃𝑦 = 𝑃𝑋𝛽 + 𝑃𝜀
which we can then re-write in different notation to represent a
‘transformed model’:
𝑦* = 𝑋*𝛽 + 𝜀* (*)
Intuitively, the P matrix is some external piece of information that we feed into the regression model, containing information about the structure of heteroscedasticity

The starred model (*) satisfies the classical assumptions about the error term and we can apply OLS to it to obtain estimators that are BLUE (for more details, see Wooldridge p. 277)
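A minimal sketch of the GLS transformation, assuming (purely for illustration) that we know the heteroscedasticity structure Var(εᵢ) = σ²xᵢ²; then P = diag(1/xᵢ), and OLS on the transformed (starred) variables is the GLS estimator:

```python
import numpy as np

rng = np.random.default_rng(9)
N = 500

# Assumed known heteroscedasticity structure: Var(eps_i) = sigma^2 * x_i^2
x = rng.uniform(1, 10, size=N)
y = 1.0 + 0.5 * x + x * rng.normal(size=N)
X = np.column_stack([np.ones(N), x])

# P = diag(1 / x_i) transforms the model so the transformed errors are homoscedastic
y_star = y / x
X_star = X / x[:, None]

# OLS on the transformed model = GLS on the original model
b_gls = np.linalg.lstsq(X_star, y_star, rcond=None)[0]
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("OLS:", b_ols, " GLS:", b_gls)
```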
Final point: the robust standard errors approach is used much more in practice. This is because GLS requires us to know something about the structure of heteroscedasticity, whereas the robust approach estimates the required correction from the data
Conclusion
Today, we studied asymptotic properties of OLS estimators

Then we moved on to a discussion of situations where OLS assumptions are relaxed or violated. So far, none have led to the OLS estimators being biased, i.e. none have been fatal!

Next week, we will discuss autocorrelation, the remaining case where an OLS assumption is violated without leading to bias, before moving on to the important cases where bias does arise, making OLS estimation very difficult
