
OLS ESTIMATION OF SINGLE EQUATION MODELS

Structural Model:

y = β_1 x_1 + β_2 x_2 + ... + β_K x_K + u = xβ + u,

where x_{1×K} = (x_1  x_2  ...  x_K) with x_1 = 1 (intercept), and β_{K×1} = (β_1, β_2, ..., β_K)'.

Assumptions
1. We can obtain a random sample from the population, where the sample observations {(x_i, y_i) : i = 1, 2, ..., N} are iid.
2. The population error has zero mean and is uncorrelated with the regressors:

E(u) = 0,  cov(x_j, u) = 0,  j = 1, 2, ..., K.    (1)

Sufficient for (1) is the assumption [HW: show it]

E(u | x_1, x_2, ..., x_K) = E(u | x) = 0.    (2)

[Note:

i. An explanatory variable is called endogenous if it is correlated with the population error. An econometric model with endogenous explanatory variables is said to suffer from endogeneity.

ii. Usually endogeneity arises in one of three ways:

(a) Omitted Variables: Suppose we cannot control for some explanatory variables or regressors in the structural model because we do not have data on them (they may not be enumerable at all). Let E(y | x, q) be the true population regression function that is linear in all the x's and q. If we do not have data on q, we may estimate E(y | x) (an estimable equation), where q becomes part of u. Now if q and x_j are correlated for any j = 1, 2, ..., K, this leads to endogeneity in the estimable model (a small simulation illustrating this case appears after this note).
(b) Measurement Error: Suppose we want to include a regressor x*_K in the true structural model, but the data allow us to observe only an imperfect measure of x*_K, namely x_K [e.g. true income versus reported income], where x_K = x*_K + e_K, e_K being the measurement error. Depending on how the measurement error e_K is correlated with x*_K (and hence with x_K), x_K and u may be correlated if we use x_K in place of x*_K in the estimable model, leading to endogeneity.
(c) Simultaneity: Simultaneity arises when at least one explanatory variable is determined simultaneously with the dependent variable of the equation. Let x_K be determined partly as a function of y. Then x_K and u are generally correlated. E.g., if quantity supplied is the dependent variable and price is an explanatory variable, then the market-clearing (or equilibrium) mechanism codetermines the values of quantity supplied and price as observed in the data. In this case x_K (price) is likely to be endogenous.

]
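To make case (a) concrete, here is a minimal simulation sketch (not part of the original notes; the coefficients, sample size and variable names are illustrative assumptions) in which an omitted variable q is correlated with an included regressor x, so that OLS on the estimable equation misses the structural coefficient:

import numpy as np

# Hypothetical structural model: y = 1 + 0.5*x + 0.8*q + e, with corr(x, q) > 0.
# If q is omitted, it becomes part of u = 0.8*q + e, so cov(x, u) != 0 (endogeneity).
rng = np.random.default_rng(0)
N = 100_000
q = rng.normal(size=N)
x = 0.6 * q + rng.normal(size=N)              # x correlated with the omitted q
e = rng.normal(size=N)
y = 1.0 + 0.5 * x + 0.8 * q + e

X_short = np.column_stack([np.ones(N), x])    # estimable model omits q
b_short = np.linalg.solve(X_short.T @ X_short, X_short.T @ y)

X_long = np.column_stack([np.ones(N), x, q])  # infeasible "true" regression
b_long = np.linalg.solve(X_long.T @ X_long, X_long.T @ y)

print("coefficient on x omitting q :", b_short[1])   # biased away from 0.5
print("coefficient on x including q:", b_long[1])    # close to 0.5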
Assumption (2) is stronger than what is required for deriving the asymptotic properties of OLS. So we stick to (1):

E(x'u) = 0    [OLS 1]

i.e.,

E(x'u) = E[(x_1 u, x_2 u, ..., x_K u)'] = (0, 0, ..., 0)',

so that E(x_j u) = 0 for all j = 1, ..., K.
Also, as x_1 = 1, E(x_1 u) = 0 implies E(u) = 0. Thus E[x_j − E(x_j)][u − E(u)] = E(x_j u) − E(x_j)E(u) = 0 (remember E(x_j) is a population moment, hence a constant), i.e., cov(x_j, u) = 0, j = 1, 2, ..., K.
3.

rank E(x'x) = K    [OLS 2]

E(x'x) = E[(x_1, x_2, ..., x_K)'(x_1  x_2  ...  x_K)]

       [ E(x_1²)     E(x_1x_2)   ...  E(x_1x_K) ]
     = [ E(x_2x_1)   E(x_2²)     ...  E(x_2x_K) ]
       [    ...         ...      ...     ...    ]
       [ E(x_Kx_1)   E(x_Kx_2)   ...  E(x_K²)   ]   (K×K)

Since E(x'x) is symmetric and K×K, full rank is equivalent to E(x'x) being positive definite.
This condition implies that we are not replicating regressors. If, for instance, x_1 ≡ x_K, then the first and last columns of E(x'x) are identical, so rank(E(x'x)) < K. Also, if the regressors are linearly dependent [for instance, if you include dummy variables for all categories], then rank(E(x'x)) < K. Assumption OLS 2 precludes that possibility.

Identification of β in the Structural Model

In the context of linear models, identification of the parameters of a model means that the parameters can be expressed in terms of population moments of observable variables.
Our model is y = xβ + u. Premultiplying by x' and taking expectations,
x'y = x'xβ + x'u
E(x'y) = E(x'x)β + E(x'u).
By OLS 1, E(x'u) = 0, so
β = [E(x'x)]⁻¹ E(x'y), where OLS 2 ensures that [E(x'x)]⁻¹ exists.

Method of Moments
Replace the population moments E(x'x) and E(x'y) with the corresponding sample moments to obtain the sample counterpart (or estimator) of the population parameter.
So,

β̂_OLS = (N⁻¹ Σ_{i=1}^N x_i'x_i)⁻¹ (N⁻¹ Σ_{i=1}^N x_i'y_i)

        [ Σ x_1i²      Σ x_1i x_2i  ...  Σ x_1i x_Ki ]⁻¹ [ Σ x_1i y_i ]
      = [ Σ x_2i x_1i  Σ x_2i²      ...  Σ x_2i x_Ki ]   [ Σ x_2i y_i ]
        [    ...          ...       ...      ...     ]   [    ...     ]
        [ Σ x_Ki x_1i  Σ x_Ki x_2i  ...  Σ x_Ki²     ]   [ Σ x_Ki y_i ]

(all sums running over i = 1, ..., N).

The full data matrix analysis yields the same result.


Suppose the full data matrices are as follows. Let

X_{N×K} = [ x_11  x_21  ...  x_K1 ]
          [ x_12  x_22  ...  x_K2 ]
          [  ...   ...  ...   ... ]
          [ x_1N  x_2N  ...  x_KN ]

be the N-observation data matrix on the regressors x_1, x_2, ..., x_K (so that row i of X is x_i), and

y_{N×1} = (y_1, y_2, ..., y_N)'

be the N-observation data vector on the dependent variable y.

Then β̂_OLS = (N⁻¹ Σ_{i=1}^N x_i'x_i)⁻¹ (N⁻¹ Σ_{i=1}^N x_i'y_i) = (X'X)⁻¹X'y.

Under assumption OLS 2, X'X is non-singular with probability approaching 1. This is because, as N → ∞, (1/N) Σ_{i=1}^N x_i'x_i →p E(x'x) (the sample of size N approaches the entire population, causing the sample moment to converge in probability to the corresponding population moment). But E(x'x) is non-singular.
Hence P[Σ_{i=1}^N x_i'x_i = X'X is non-singular] → 1 as N → ∞.
Hence, by Corollary 1 of Asymptotic Theory, plim [(N⁻¹ Σ_{i=1}^N x_i'x_i)⁻¹] = A⁻¹, where A = E(x'x).
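The sample-moment formula translates directly into a few lines of linear algebra. The following minimal numpy sketch (not from the notes; the data-generating process and variable names are illustrative assumptions) shows that the moment form (N⁻¹ Σ x_i'x_i)⁻¹(N⁻¹ Σ x_i'y_i) and the data-matrix form (X'X)⁻¹X'y give the same β̂:

import numpy as np

rng = np.random.default_rng(1)
N, K = 500, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])  # x_1 = 1 (intercept)
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(size=N)

# Data-matrix form: (X'X)^{-1} X'y
b_matrix = np.linalg.solve(X.T @ X, X.T @ y)

# Method-of-moments form: average the outer products x_i'x_i and x_i'y_i
Sxx = sum(np.outer(X[i], X[i]) for i in range(N)) / N
Sxy = sum(X[i] * y[i] for i in range(N)) / N
b_moments = np.linalg.solve(Sxx, Sxy)

print(np.allclose(b_matrix, b_moments))  # True: the two computations coincide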

We have used the method of moments to derive this estimator. Why are we calling it Ordinary Least Squares then?
The answer may be found by looking at property 8 of the Conditional Expectation Operator: if μ(x) ≡ E(y|x), then μ is a solution to

min_{m ∈ M} E[(y − m(x))²].

The least-squares exercise, as we commonly understand it, is the sample counterpart of this population problem:

min_b (1/N) Σ_{i=1}^N (y_i − x_i b)².

In the method of moments we did exactly the same thing: first we expressed β in terms of population moments of x and y, and then we replaced the population moments with the corresponding sample moments. Thus in effect we found the sample counterpart of E(y|x), or β_1x_1 + β_2x_2 + ... + β_Kx_K, as β̂_1x_1 + β̂_2x_2 + ... + β̂_Kx_K.
Hence β̂_{K×1} is called the least-squares estimator of β.

Consistency of OLS
β̂_OLS = (N⁻¹ Σ_{i=1}^N x_i'x_i)⁻¹ (N⁻¹ Σ_{i=1}^N x_i'y_i)
       = β + (N⁻¹ Σ_{i=1}^N x_i'x_i)⁻¹ (N⁻¹ Σ_{i=1}^N x_i'u_i).

By the weak law of large numbers (Theorem 1),

N⁻¹ Σ_{i=1}^N x_i'u_i →p E(x'u) = 0, by OLS 1.

Hence, by Slutsky's Theorem (Lemma 4),

plim β̂_OLS = β + A⁻¹ · 0 = β

(remember, plim [(N⁻¹ Σ_{i=1}^N x_i'x_i)⁻¹] = A⁻¹ and plim [N⁻¹ Σ_{i=1}^N x_i'u_i] = 0).

Note that if OLS 1 or OLS 2 fails, β is not identified.
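A quick way to see consistency at work is to compute β̂_OLS on samples of increasing size and watch it settle near β. The sketch below is illustrative only (the data-generating process and seed are arbitrary assumptions, not from the notes):

import numpy as np

rng = np.random.default_rng(2)
beta_true = np.array([1.0, 0.7])

def ols(N):
    # x_1 = 1, x_2 standard normal, u independent of x (so OLS 1 holds)
    X = np.column_stack([np.ones(N), rng.normal(size=N)])
    y = X @ beta_true + rng.normal(size=N)
    return np.linalg.solve(X.T @ X, X.T @ y)

for N in (50, 500, 5_000, 50_000):
    print(N, ols(N))   # the estimates approach (1.0, 0.7) as N grows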


β̂_OLS is not necessarily unbiased under OLS 1 and OLS 2.

β̂_OLS = β + (N⁻¹ Σ_{i=1}^N x_i'x_i)⁻¹ (N⁻¹ Σ_{i=1}^N x_i'u_i) = β + (X'X)⁻¹X'u,

where u_{N×1} = (u_1, u_2, ..., u_N)'.
But E(β̂_OLS) = β + E[(X'X)⁻¹X'u] ≠ β in general.
However, if we use the stronger assumption (2) instead of OLS 1, i.e. E(u|x) = 0, then unbiasedness of β̂_OLS may be retrieved:
β̂_OLS = β + (X'X)⁻¹X'u.
Now E(β̂_OLS | X) = β + (X'X)⁻¹X'E(u|X) = β + 0 = β.
But then E(β̂_OLS) = E[E(β̂_OLS | X)] = E(β) = β.

Asymptotic Inference Using OLS


Note that

√N (β̂_OLS − β) = (N⁻¹ Σ_{i=1}^N x_i'x_i)⁻¹ (N^{−1/2} Σ_{i=1}^N x_i'u_i).

We know that (N⁻¹ Σ_{i=1}^N x_i'x_i)⁻¹ →p A⁻¹, so that

(N⁻¹ Σ_{i=1}^N x_i'x_i)⁻¹ − A⁻¹ = o_p(1).

Again, E(x_i'u_i) = 0, i = 1, 2, ..., by OLS 1.
Also, {x_i'u_i : i = 1, 2, ...} is an iid sequence with zero mean, and each term is assumed to have a finite variance. Then, by the Central Limit Theorem,

N^{−1/2} Σ_{i=1}^N x_i'u_i →d N(0, B), where B_{K×K} = var(x_i'u_i) = E(x_i'u_i u_i x_i) = E(u_i² x_i'x_i) = E(u² x'x) for any i.

This means N^{−1/2} Σ_{i=1}^N x_i'u_i = O_p(1), by Lemma 5.
Then

√N (β̂_OLS − β) = [A⁻¹ + o_p(1)] (N^{−1/2} Σ_{i=1}^N x_i'u_i)
                = A⁻¹ (N^{−1/2} Σ_{i=1}^N x_i'u_i) + o_p(1)O_p(1)
                = A⁻¹ (N^{−1/2} Σ_{i=1}^N x_i'u_i) + o_p(1), by Lemma 2.

Assumption 4
4.

E(u²x'x) = σ²E(x'x), where σ² ≡ E(u²)    [OLS 3]

Since E(u) = 0, σ² ≡ E(u²) = E(u²) − [E(u)]² = var(u). In other words, OLS 3 states that the variance of u, viz. E(u²) = σ², is constant and hence independent of x, and thus can be taken out of E(u²x'x).
OLS 3 is the weak homoskedasticity assumption. It means that u² is uncorrelated with x_j, x_j² and x_jx_k, j, k = 1, ..., K.
Sufficient for OLS 3 is the assumption E(u²|x) = σ², which is equivalent to var(u|x) = σ² when E(u|x) = 0.
Asymptotic Normality of OLS

From OLS 1 - OLS 3, and the fact that √N (β̂_OLS − β) = A⁻¹ (N^{−1/2} Σ_{i=1}^N x_i'u_i) + o_p(1), it follows that

√N (β̂_OLS − β) ~a N(0, σ²A⁻¹).

Proof: N^{−1/2} Σ_{i=1}^N x_i'u_i →d N(0, B).
Hence, by Corollary 2, A⁻¹ (N^{−1/2} Σ_{i=1}^N x_i'u_i) →d N(0, A⁻¹BA⁻¹).
Now, √N (β̂_OLS − β) − A⁻¹ (N^{−1/2} Σ_{i=1}^N x_i'u_i) = o_p(1), i.e.,

√N (β̂_OLS − β) − A⁻¹ (N^{−1/2} Σ_{i=1}^N x_i'u_i) →p 0.

Hence, by Lemma 7 (Asymptotic Equivalence),

√N (β̂_OLS − β) →d N(0, A⁻¹BA⁻¹).

But under OLS 3, B = σ²E(x'x) = σ²A. Thus,

√N (β̂_OLS − β) ~a N(0, σ²A⁻¹).

The above result allows us to treat β̂_OLS as approximately normal with mean β and variance-covariance matrix σ²A⁻¹/N, i.e., β̂_OLS ~a N(β, σ²[E(x'x)]⁻¹/N).

The usual estimator of σ² is σ̂² = RSS/(N − K), where RSS = Σ_{i=1}^N û_i² (the sum of squared OLS residuals, with û_i = y_i − x_iβ̂).

It can be shown that σ̂² is consistent. [H.W.: show it]
Replace σ² with σ̂² and E(x'x) with the sample average N⁻¹ Σ_{i=1}^N x_i'x_i = N⁻¹(X'X).
Thus Avar̂(β̂_OLS) = σ̂² [N⁻¹(X'X)]⁻¹ N⁻¹ = σ̂² N (X'X)⁻¹ N⁻¹ = σ̂²(X'X)⁻¹.

Hence, under OLS 1 - OLS 3, the usual OLS standard errors, t-statistics and F-statistics are also asymptotically valid (the F-statistic being a degrees-of-freedom-adjusted Wald statistic for testing linear restrictions of the form Rβ = r).
[See undergraduate notes for derivation of the t and F statistics by distributions of quadratic forms.]
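As an illustration (not part of the notes; the data-generating process is an assumption), σ̂²(X'X)⁻¹ and the corresponding standard errors and t-statistics can be computed as follows:

import numpy as np

rng = np.random.default_rng(3)
N, K = 400, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
beta_true = np.array([1.0, 0.5, -0.3])
y = X @ beta_true + rng.normal(size=N)     # homoskedastic errors, so OLS 3 holds

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                      # beta-hat
resid = y - X @ b                          # u-hat
sigma2_hat = resid @ resid / (N - K)       # RSS / (N - K)

avar_hat = sigma2_hat * XtX_inv            # sigma^2-hat * (X'X)^{-1}
se = np.sqrt(np.diag(avar_hat))            # usual OLS standard errors
t_stats = b / se                           # t-statistics for H0: beta_j = 0

print(b, se, t_stats, sep="\n")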

Violation of CLRM Assumptions


Suppose OLS 3 does not hold (Heteroskedasticity)
We have already shown that

√N (β̂_OLS − β) →d N(0, A⁻¹BA⁻¹).

In other words, the asymptotic variance of β̂_OLS is

Avar(β̂_OLS) = A⁻¹BA⁻¹/N,

where A_{K×K} = E(x'x) and B_{K×K} = var(x'u) = E(u²x'x).

A consistent estimator of A is N⁻¹ Σ_{i=1}^N x_i'x_i.

What is a consistent estimator of B?

By the Law of Large Numbers (Theorem 1), N⁻¹ Σ_{i=1}^N u_i² x_i'x_i →p E(u²x'x) = B.
As u_i cannot be observed, replace u_i by the OLS residual û_i = y_i − x_iβ̂_OLS.
White (1980, Econometrica) proves the following.

White (1980): A consistent estimator of B is B̂ = N⁻¹ Σ_{i=1}^N û_i² x_i'x_i.

Proof: The proof consists of several parts.

Part I:

û_i = u_i − x_i(β̂ − β)

x_i'û_i = x_i'u_i − x_i'x_i(β̂ − β)    (1)

Transposing,

û_i x_i = u_i x_i − (β̂ − β)'x_i'x_i    (2)

Multiplying (1) by (2),

û_i² x_i'x_i = u_i² x_i'x_i − u_i x_i'(β̂ − β)'x_i'x_i − u_i x_i'x_i(β̂ − β)x_i + x_i'x_i(β̂ − β)(β̂ − β)'x_i'x_i.

Averaging over i,

N⁻¹ Σ_{i=1}^N û_i² x_i'x_i = N⁻¹ Σ_{i=1}^N u_i² x_i'x_i − N⁻¹ Σ_{i=1}^N u_i x_i'(β̂ − β)'x_i'x_i
                             − N⁻¹ Σ_{i=1}^N u_i x_i'x_i(β̂ − β)x_i + N⁻¹ Σ_{i=1}^N x_i'x_i(β̂ − β)(β̂ − β)'x_i'x_i.    (3)

Part II: A digression on matrix algebra

The vec operator: stacking the columns of a matrix to form a vector. Thus,

vec(A) = vec [ a_11  a_12 ] = (a_11, a_21, a_12, a_22)'.
             [ a_21  a_22 ]

vec(ABC) = (C' ⊗ A) vec(B), where ⊗ is the Kronecker or direct product:

A_{K×L} ⊗ B_{M×N} = C_{KM×LN} = [ a_11 B  a_12 B  ...  a_1L B ]
                                [ a_21 B  a_22 B  ...  a_2L B ]
                                [  ...     ...    ...   ...   ]
                                [ a_K1 B  a_K2 B  ...  a_KL B ]

To prove: vec(ABC) = (C' ⊗ A) vec(B). We prove it using an example. Let

A_{2×3} = [ a_11  a_12  a_13 ],  B_{3×1} = (b_1, b_2, b_3)',  C_{1×2} = (c_1  c_2).
          [ a_21  a_22  a_23 ]

Then

ABC = [ c_1(a_11b_1 + a_12b_2 + a_13b_3)   c_2(a_11b_1 + a_12b_2 + a_13b_3) ]
      [ c_1(a_21b_1 + a_22b_2 + a_23b_3)   c_2(a_21b_1 + a_22b_2 + a_23b_3) ]

Therefore,

LHS = vec(ABC) = ( c_1(a_11b_1 + a_12b_2 + a_13b_3),
                   c_1(a_21b_1 + a_22b_2 + a_23b_3),
                   c_2(a_11b_1 + a_12b_2 + a_13b_3),
                   c_2(a_21b_1 + a_22b_2 + a_23b_3) )'.

Now

RHS = (C' ⊗ A) vec(B) = [ c_1a_11  c_1a_12  c_1a_13 ] [ b_1 ]
                        [ c_1a_21  c_1a_22  c_1a_23 ] [ b_2 ]
                        [ c_2a_11  c_2a_12  c_2a_13 ] [ b_3 ]
                        [ c_2a_21  c_2a_22  c_2a_23 ]

    = ( c_1(a_11b_1 + a_12b_2 + a_13b_3),
        c_1(a_21b_1 + a_22b_2 + a_23b_3),
        c_2(a_11b_1 + a_12b_2 + a_13b_3),
        c_2(a_21b_1 + a_22b_2 + a_23b_3) )'.

Hence LHS = RHS.
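The identity vec(ABC) = (C' ⊗ A) vec(B) is also easy to check numerically. The short sketch below (illustrative, using random matrices) verifies it with numpy, where vec means column-major stacking (order="F"):

import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 1))
C = rng.normal(size=(1, 2))

vec = lambda M: M.reshape(-1, 1, order="F")   # stack columns into a vector

lhs = vec(A @ B @ C)
rhs = np.kron(C.T, A) @ vec(B)
print(np.allclose(lhs, rhs))   # True: vec(ABC) = (C' kron A) vec(B)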

Part III:
Consider the 3rd term on the RHS of eqn (3):

N⁻¹ Σ_{i=1}^N vec(u_i x_i'x_i (β̂ − β) x_i) = N⁻¹ Σ_{i=1}^N (u_i x_i' ⊗ x_i'x_i) vec(β̂ − β).

[u_i is a scalar. In vec(ABC) = (C' ⊗ A)vec(B), treat (x_i'x_i)_{K×K} = A, (β̂ − β)_{K×1} = B and (x_i)_{1×K} = C.]

Now, clearly (β̂ − β) →p 0, so vec(β̂ − β) →p 0.

Again, u_i x_i' ⊗ x_i'x_i is the K²×K matrix obtained by stacking the blocks u_i x_1i (x_i'x_i), u_i x_2i (x_i'x_i), ..., u_i x_Ki (x_i'x_i), where x_i'x_i is the K×K matrix with (j, k) element x_ji x_ki.

The terms of the final N⁻¹ Σ_{i=1}^N (u_i x_i' ⊗ x_i'x_i) matrix are therefore of the form N⁻¹ Σ_i u_i x_ki³, or N⁻¹ Σ_i u_i x_ki² x_ji, or N⁻¹ Σ_i u_i x_ki x_ji x_li, j, k, l = 1, ..., K.
Assume the corresponding population moments, i.e., E(x_j³ u), E(x_k² x_j u), E(x_j x_k x_l u), exist and are finite.
Then, by the WLLN, N⁻¹ Σ_{i=1}^N (u_i x_i' ⊗ x_i'x_i) →p the corresponding population moment matrix, and is hence O_p(1).

Consequently, by Lemma 1 of Asymptotic Theory, the 3rd term on the RHS of (3) is O_p(1) o_p(1) = o_p(1). The 2nd term can be treated similarly, as it is only a transpose of the 3rd term.

Now consider the 4th term on the RHS of equation (3):

N⁻¹ Σ_{i=1}^N vec(x_i'x_i (β̂ − β)(β̂ − β)' x_i'x_i) = N⁻¹ Σ_{i=1}^N (x_i'x_i ⊗ x_i'x_i) vec((β̂ − β)(β̂ − β)').

Again, (β̂ − β) →p 0, thus vec((β̂ − β)(β̂ − β)') →p 0, by Lemma 2.

Also, x_i'x_i ⊗ x_i'x_i is the K²×K² matrix whose (j, k) block is x_ji x_ki (x_i'x_i), so its elements are products of four regressors. The terms of the final N⁻¹ Σ_{i=1}^N (x_i'x_i ⊗ x_i'x_i) matrix are of the form N⁻¹ Σ_i x_ji⁴, N⁻¹ Σ_i x_ji³ x_ki, N⁻¹ Σ_i x_ji² x_ki², N⁻¹ Σ_i x_ji² x_ki x_li, or N⁻¹ Σ_i x_ji x_ki x_li x_mi, j, k, l, m = 1, ..., K.
Assume the corresponding population moments, i.e., E(x_j⁴), E(x_j³ x_k), E(x_j² x_k²), E(x_j² x_k x_l), E(x_j x_k x_l x_m), exist and are finite.
Hence, by the WLLN, N⁻¹ Σ_{i=1}^N (x_i'x_i ⊗ x_i'x_i) →p the corresponding population moment matrix, and is hence O_p(1).

Consequently, by Lemma 1, the 4th term is also O_p(1) o_p(1) = o_p(1).

Note also: if N⁻¹ Σ_{i=1}^N vec(u_i x_i'x_i (β̂ − β) x_i) or N⁻¹ Σ_{i=1}^N vec(x_i'x_i (β̂ − β)(β̂ − β)' x_i'x_i) are o_p(1), then so also are the original expressions on the RHS of equation (3), viz. N⁻¹ Σ_{i=1}^N u_i x_i'x_i (β̂ − β) x_i and N⁻¹ Σ_{i=1}^N x_i'x_i (β̂ − β)(β̂ − β)' x_i'x_i, since the vec operator does nothing but stack the columns of the original matrices into vectors.
PART IV: Finally, from equation (3),

N⁻¹ Σ_{i=1}^N û_i² x_i'x_i = N⁻¹ Σ_{i=1}^N u_i² x_i'x_i + o_p(1).

So

N⁻¹ Σ_{i=1}^N û_i² x_i'x_i − N⁻¹ Σ_{i=1}^N u_i² x_i'x_i →p 0.

We already know that

N⁻¹ Σ_{i=1}^N u_i² x_i'x_i →p E(u²x'x) = B, by the WLLN.

Thus,

N⁻¹ Σ_{i=1}^N û_i² x_i'x_i →p B.

Hence the heteroskedasticity-robust variance-covariance matrix of β̂_OLS is:

Avar̂(β̂_OLS)_robust = Â⁻¹ B̂ Â⁻¹ / N
  = (1/N) (N⁻¹ Σ_{i=1}^N x_i'x_i)⁻¹ (N⁻¹ Σ_{i=1}^N û_i² x_i'x_i) (N⁻¹ Σ_{i=1}^N x_i'x_i)⁻¹
  = (X'X)⁻¹ (Σ_{i=1}^N û_i² x_i'x_i) (X'X)⁻¹, where Σ_{i=1}^N x_i'x_i = X'X.

This matrix is also often called the sandwich matrix because of its form.
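A minimal numpy sketch of the sandwich computation (illustrative; the heteroskedastic data-generating process is an assumption, not from the notes) is given below; it produces the form (X'X)⁻¹(Σ û_i² x_i'x_i)(X'X)⁻¹ and compares the resulting standard errors with the usual ones:

import numpy as np

rng = np.random.default_rng(5)
N = 1_000
x2 = rng.normal(size=N)
X = np.column_stack([np.ones(N), x2])
u = rng.normal(size=N) * np.sqrt(0.5 + x2**2)     # error variance depends on x: OLS 3 fails
y = X @ np.array([1.0, 0.5]) + u

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
uhat = y - X @ b

meat = (X * uhat[:, None]**2).T @ X               # sum of uhat_i^2 * x_i'x_i
avar_robust = XtX_inv @ meat @ XtX_inv            # sandwich: (X'X)^{-1} meat (X'X)^{-1}

se_robust = np.sqrt(np.diag(avar_robust))                         # robust standard errors
se_usual = np.sqrt(np.diag((uhat @ uhat / (N - 2)) * XtX_inv))    # usual (non-robust) standard errors
print(se_robust, se_usual)                        # the two sets differ under heteroskedasticity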
Once the heteroskedasticity-consistent varcov(β̂_OLS) is obtained, we can also get the heteroskedasticity-robust standard errors of β̂_OLS by taking square roots of the diagonal terms of Avar̂(β̂_OLS)_robust.
Once robust standard errors are obtained, t and F statistics can be computed in the usual way (robust t or F stats).
However, under heteroskedasticity, F-stats are usually not valid even asymptotically. So instead Wald stats should be employed.

H_0: R_{Q×K} β_{K×1} = r_{Q×1}, where rank(R) = Q ≤ K.

The heteroskedasticity-robust Wald stat is W = (Rβ̂ − r)'(RV̂R')⁻¹(Rβ̂ − r), where V̂ = Avar̂(β̂)_robust. W ~a χ²_Q.
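Continuing the previous sketch in a self-contained form (again illustrative; the restriction tested is a hypothetical example, and scipy is used only for the chi-square p-value), the robust Wald statistic for H_0: Rβ = r can be formed as:

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
N = 1_000
x2 = rng.normal(size=N)
X = np.column_stack([np.ones(N), x2])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=N) * np.sqrt(0.5 + x2**2)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
uhat = y - X @ b
V_hat = XtX_inv @ ((X * uhat[:, None]**2).T @ X) @ XtX_inv   # robust Avar of beta-hat

# Hypothetical restriction H0: beta_2 = 0.5, i.e. R = [0 1], r = (0.5)
R = np.array([[0.0, 1.0]])
r = np.array([0.5])
diff = R @ b - r
W = float(diff @ np.linalg.inv(R @ V_hat @ R.T) @ diff)      # robust Wald statistic
p_value = stats.chi2.sf(W, df=R.shape[0])                    # asymptotic chi^2_Q p-value
print(W, p_value)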

So far we have been proceeding with the weak exogeneity assumption, OLS 1. If we had assumed strong exogeneity, i.e., E(u|x) = 0, then there exists another solution to the violation of OLS 3 (heteroskedasticity).
In this case, i.e., when E(u|x) = 0, if OLS 3 fails we can specify a model for var(y|x), estimate that model, and apply generalized least squares (GLS).
For observation i, y_i and each element of x_i are divided by an estimate of the conditional standard deviation [var(y_i|x_i)]^{1/2}. OLS applied to this transformed (weighted) data gives β̂_GLS. GLS is a special form of weighted least squares (WLS). A minimal sketch of this weighting step is given after this discussion.
In modern econometrics, however, the more popular approach is to stick to the OLS estimate of β and apply a heteroskedasticity correction to the estimated variance-covariance matrix of β̂, viz., Avar̂(β̂_OLS)_robust. This latter matrix and the consequent standard errors are then used for testing.
Note that robust standard errors are valid even when OLS 3 holds (only then does Avar̂(β̂_OLS) simplify to σ̂²(X'X)⁻¹). So this is an easier approach.
GLS may be avoided for other reasons:
1. GLS leads to an efficiency gain only when the model for var(y|x) is correct. So it requires a lot of information.
2. Finite sample properties of feasible GLS are usually not known except for the simplest cases.
3. Under the weak exogeneity assumption [OLS 1], GLS is generally inconsistent if E(u|x) ≠ 0.
[See undergraduate notes for treatment of GLS.]
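Below is a minimal weighted least squares sketch (illustrative only; it assumes the skedastic function var(y|x) is known up to scale, which is rarely true in practice): each observation is divided by the conditional standard deviation and OLS is run on the transformed data.

import numpy as np

rng = np.random.default_rng(6)
N = 1_000
x2 = rng.normal(size=N)
X = np.column_stack([np.ones(N), x2])
h = np.exp(x2)                                   # assumed (known) skedastic function: var(y|x) proportional to h(x)
y = X @ np.array([1.0, 0.5]) + np.sqrt(h) * rng.normal(size=N)

w = 1.0 / np.sqrt(h)                             # divide each observation by the conditional sd
Xw, yw = X * w[:, None], y * w
b_wls = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)    # OLS on the weighted data = WLS/GLS estimate

b_ols = np.linalg.solve(X.T @ X, X.T @ y)        # plain OLS for comparison
print(b_wls, b_ols)                              # both consistent here; WLS is more efficient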
