Econometrics

© Michael Creel

Version 0.70, September, 2005

Dept. of Economics and Economic History, Universitat Autònoma de Barcelona
michael.creel@uab.es, http://pareto.uab.es/mcreel

Contents

List of Figures
List of Tables

Chapter 1. About this document
1.1. License
1.2. Obtaining the materials
1.3. An easy way to use LyX and Octave today
1.4. Known Bugs

Chapter 2. Introduction: Economic and econometric models

Chapter 3. Ordinary Least Squares
3.1. The Linear Model
3.2. Estimation by least squares
3.3. Geometric interpretation of least squares estimation
3.4. Influential observations and outliers
3.5. Goodness of fit
3.6. The classical linear regression model
3.7. Small sample statistical properties of the least squares estimator
3.8. Example: The Nerlove model
Exercises

Chapter 4. Maximum likelihood estimation
4.1. The likelihood function
4.2. Consistency of MLE
4.3. The score function
4.4. Asymptotic normality of MLE
4.6. The information matrix equality
4.7. The Cramér-Rao lower bound
Exercises

Chapter 5. Asymptotic properties of the least squares estimator
5.1. Consistency
5.2. Asymptotic normality
5.3. Asymptotic efficiency

Chapter 6. Restrictions and hypothesis tests
6.1. Exact linear restrictions
6.2. Testing
6.3. The asymptotic equivalence of the LR, Wald and score tests
6.4. Interpretation of test statistics
6.5. Confidence intervals
6.6. Bootstrapping
6.7. Testing nonlinear restrictions, and the Delta Method
6.8. Example: the Nerlove data

Chapter 7. Generalized least squares
7.1. Effects of nonspherical disturbances on the OLS estimator
7.2. The GLS estimator
7.3. Feasible GLS
7.4. Heteroscedasticity
7.5. Autocorrelation
Exercises
Exercises

Chapter 8. Stochastic regressors
8.1. Case 1
8.2. Case 2
8.3. Case 3
8.4. When are the assumptions reasonable?
Exercises

Chapter 9. Data problems
9.1. Collinearity
9.2. Measurement error
9.3. Missing observations
Exercises
Exercises
Exercises

Chapter 10. Functional form and nonnested tests
10.1. Flexible functional forms
10.2. Testing nonnested hypotheses

Chapter 11. Exogeneity and simultaneity
11.1. Simultaneous equations
11.2. Exogeneity
11.3. Reduced form
11.4. IV estimation
11.5. Identification by exclusion restrictions
11.6. 2SLS
11.7. Testing the overidentifying restrictions
11.8. System methods of estimation
11.9. Example: 2SLS and Klein's Model 1

Chapter 12. Introduction to the second half

Chapter 13. Numeric optimization methods
13.1. Search
13.2. Derivative-based methods
13.3. Simulated Annealing
13.4. Examples
13.5. Duration data and the Weibull model
13.6. Numeric optimization: pitfalls
Exercises

Chapter 14. Asymptotic properties of extremum estimators
14.1. Extremum estimators
14.2. Consistency
14.3. Example: Consistency of Least Squares
14.4. Asymptotic Normality
14.5. Examples
14.6. Example: Linearization of a nonlinear model
Exercises

Chapter 15. Generalized method of moments (GMM)
15.1. Definition
15.2. Consistency
15.3. Asymptotic normality
15.4. Choosing the weighting matrix
15.5. Estimation of the variance-covariance matrix
15.6. Estimation using conditional moments
15.7. Estimation using dynamic moment conditions
15.8. A specification test
15.9. Other estimators interpreted as GMM estimators
15.10. Example: The Hausman Test
15.11. Application: Nonlinear rational expectations
15.12. Empirical example: a portfolio model
Exercises

Chapter 16. Quasi-ML

Chapter 17. Nonlinear least squares (NLS)
17.1. Introduction and definition
17.2. Identification
17.3. Consistency
17.4. Asymptotic normality
17.5. Example: The Poisson model for count data
17.6. The Gauss-Newton algorithm
17.7. Application: Limited dependent variables and sample selection

Chapter 18. Nonparametric inference
18.1. Possible pitfalls of parametric inference: estimation
18.2. Possible pitfalls of parametric inference: hypothesis testing
18.3. The Fourier functional form
18.4. Kernel regression estimators
18.5. Kernel density estimation
18.6. Semi-nonparametric maximum likelihood
18.7. Examples

Chapter 19. Simulation-based estimation
19.1. Motivation
19.2. Simulated maximum likelihood (SML)
19.3. Method of simulated moments (MSM)
19.4. Efficient method of moments (EMM)
19.5. Example: estimation of stochastic differential equations

Chapter 20. Parallel programming for econometrics

Chapter 21. Introduction to Octave
21.1. Getting started
21.2. A short introduction
21.3. If you're running a Linux installation...

Chapter 22. Notation and Review
22.1. Notation for differentiation of vectors and matrices
22.2. Convergence modes
22.3. Rates of convergence and asymptotic equality
Exercises

Chapter 23. The GPL

Chapter 24. The attic
24.1. MEPS data: more on count models
24.2. Hurdle models
24.3. Models for time series data

Bibliography
Index

List of Figures

1.2.1 LyX
1.2.2 Octave
3.2.1 Typical data, Classical Model
3.3.1 Example OLS Fit
3.3.2 The fit in observation space
3.4.1 Detection of influential observations
3.5.1 Uncentered R²
3.7.1 Unbiasedness of OLS under classical assumptions
3.7.2 Biasedness of OLS when an assumption fails
3.7.3 Gauss-Markov Result: The OLS estimator
3.7.4 Gauss-Markov Result: The split sample estimator
6.5.1 Joint and Individual Confidence Regions
6.8.1 RTS as a function of firm size
7.4.1 Residuals, Nerlove model, sorted by firm size
7.5.1 Autocorrelation induced by misspecification
7.5.2 Durbin-Watson critical values
7.6.1 Residuals of simple Nerlove model
7.6.2 OLS residuals, Klein consumption equation
9.1.1 When there is no collinearity
9.1.2 When there is collinearity
9.3.1 Sample selection bias
13.1.1 The search method
13.2.1 Increasing directions of search
13.2.2 Newton-Raphson method
13.2.3 Using MuPAD to get analytic derivatives
13.5.1 Life expectancy of mongooses, Weibull model
13.5.2 Life expectancy of mongooses, mixed Weibull model
13.6.1 A foggy mountain
15.10.1 OLS and IV estimators when regressors and errors are correlated
21.2.1 Running an Octave program

List of Tables

1 Marginal Variances, Sample and Estimated (Poisson)
2 Marginal Variances, Sample and Estimated (NB-II)
3 Actual and Poisson fitted frequencies
4 Actual and Hurdle Poisson fitted frequencies
5 Information Criteria, OBDV

CHAPTER 1

About this document


This document integrates lecture notes for a one year graduate level course with computer programs that illustrate and apply the methods that are studied. The immediate availability of executable (and modifiable) example programs when using the PDF¹ version of the document is one of the advantages of the system that has been used. On the other hand, when viewed in printed form, the document is a somewhat terse approximation to a textbook. These notes are not intended to be a perfect substitute for a printed textbook. If you are a student of mine, please note that last sentence carefully. There are many good textbooks available. A few of my favorites are listed in the bibliography.

With respect to contents, the emphasis is on estimation and inference within the world of stationary data, with a bias toward microeconometrics. The second half is somewhat more polished than the first half, since I have taught that course more often. If you take a moment to read the licensing information in the next section, you'll see that you are free to copy and modify the document. If anyone would like to contribute material that expands the contents, it would be very welcome. Error corrections and other additions are also welcome. As an example of a project that has made use of these notes, see these very nice lecture slides.

¹It is possible to have the program links open up in an editor, ready to run using keyboard macros. To do this with the PDF version you need to do some setup work. See the bootable CD described below.


1.1. License

All materials are copyrighted by Michael Creel with the date that appears above. They are provided under the terms of the GNU General Public License, which forms Section 23 of the notes. The main thing you need to know is that you are free to modify and distribute these materials in any way you like, as long as you do so under the terms of the GPL. In particular, you must make available the source files, in editable form, for your modified version of the materials.

1.2. Obtaining the materials

The materials are available on my web page, in a variety of forms including PDF and the editable sources, at pareto.uab.es/mcreel/Econometrics/. In addition to the final product, which you're looking at in some form now, you can obtain the editable sources, which will allow you to create your own version, if you like, or send error corrections and contributions. The main document was prepared using LyX (www.lyx.org) and Octave (www.octave.org). LyX is a free² "what you see is what you mean" word processor, basically working as a graphical frontend to LaTeX. It (with help from other applications) can export your work in LaTeX, HTML, PDF and several other forms. It will run on Linux, Windows, and MacOS systems. Figure 1.2.1 shows LyX editing this document.

GNU Octave has been used for the example programs, which are scattered through the document. This choice is motivated by several factors. The first is the high quality of the Octave environment for doing applied econometrics. The fundamental tools exist and are implemented in a way that make extending

²Free is used in the sense of freedom, but LyX is also free of charge.


FIGURE 1.2.1. LyX

them fairly easy. The example programs included here may convince you of this point. Secondly, Octave's licensing philosophy fits in with the goals of this project. Thirdly, it runs on Linux, Windows and MacOS. Figure 1.2.2 shows an Octave program being edited by NEdit, and the result of running the program in a shell window.

1.3. An easy way to use LyX and Octave today


The example programs are available as links to files on my web page in the PDF version, and here. Support files needed to run these are available here. The files won't run properly from your browser, since there are dependencies


FIGURE 1.2.2. Octave

between files - they are only illustrative when browsing. To see how to use these files (edit and run them), you should go to the home page of this document, since you will probably want to download the pdf version together with all the support files and examples. Then set the base URL of the PDF file to point to wherever the Octave files are installed. All of this may sound a bit complicated, because it is. An easier solution is available:

The file pareto.uab.es/mcreel/Econometrics/econometrics.iso is an ISO image file that may be burnt to CDROM. It contains a bootable-from-CD Gnu/Linux


system that has all of the tools needed to edit this document, run the Octave example programs, etcetera. In particular, it will allow you to cut out small portions of the notes and edit them, and send them to me as LyX (or TeX) files for inclusion in future versions. Think error corrections, additions, etc.! The CD automatically detects the hardware of your computer, and will not touch your hard disk unless you explicitly tell it to do so. It is based upon the Knoppix GNU/Linux distribution, with some material removed and other material added. Additionally, you can use it to install Debian GNU/Linux on your computer (run knoppix-installer as the root user). The versions of programs on the CD may be quite out of date, possibly with security problems that have not been fixed. So if you do a hard disk installation you should do apt-get update, apt-get upgrade toot sweet. See the Knoppix web page for more information.

1.4. Known Bugs

This section is a reminder to myself to try to fix a few things.
The PDF version has hyperlinks to figures that jump to the wrong figure. The numbers are correct, but the links are not. ps2pdf bugs?

CHAPTER 2

Introduction: Economic and econometric models


Economic theory tells us that the demand function for a good is something like:

x = x(p, m, z)

• x is the quantity demanded
• p is a vector of prices of the good and its substitutes and complements
• m is income
• z is a vector of other variables such as individual characteristics that affect preferences

Suppose we have a sample consisting of one observation on n individuals' demands at time period t (this is a cross section, where i = 1, 2, ..., n indexes the individuals in the sample). The individual demand functions are

xᵢ = xᵢ(pᵢ, mᵢ, zᵢ)

The model is not estimable as it stands, since:

• The form of the demand function is different for all i.
• Some components of zᵢ may not be observable to an outside modeler. For example, people don't eat the same lunch every day, and you can't tell what they will order just by looking at them. Suppose we can break zᵢ into the observable components wᵢ and a single unobservable component εᵢ.

A step toward an estimable econometric model is to suppose that the model may be written as

xᵢ = β₁ + pᵢ′βₚ + mᵢβₘ + wᵢ′β_w + εᵢ

We have imposed a number of restrictions on the theoretical model:

• The functions xᵢ(·), which in principle may differ for all i, have been restricted to all belong to the same parametric family.
• Of all parametric families of functions, we have restricted the model to the class of linear in the variables functions.
• The parameters are constant across individuals.
• There is a single unobservable component, and we assume it is additive.

If we assume nothing about the error term ε, we can always write the last equation. But in order for the β coefficients to have an economic meaning, and in order to be able to estimate them from sample data, we need to make additional assumptions. These additional assumptions have no theoretical basis: they are assumptions on top of those needed to prove the existence of a demand function. The validity of any results we obtain using this model will be contingent on these additional restrictions being at least approximately correct. For this reason, specification testing will be needed, to check that the model seems to be reasonable. Only when we are convinced that the model is at least approximately correct should we use it for economic analysis.


When testing a hypothesis using an econometric model, three factors can cause a statistical test to reject the null hypothesis:

(1) the hypothesis is false
(2) a type I error has occurred
(3) the econometric model is not correctly specified so the test does not have the assumed distribution

We would like to ensure that the third reason is not contributing to rejections, so that rejection will be due to either the first or second reasons. Hopefully the above example makes it clear that there are many possible sources of misspecification of econometric models. In the next few sections we will obtain results supposing that the econometric model is entirely correctly specified. Later we will examine the consequences of misspecification and see some methods for determining if a model is correctly specified. Later on, econometric methods that seek to minimize maintained assumptions are introduced.

CHAPTER 3

Ordinary Least Squares


3.1. The Linear Model

Consider approximating a variable y using the variables x₁, x₂, ..., x_k. We can consider a model that is a linear approximation:

Linearity: the model is a linear function of the parameter vector β⁰:

y = β⁰₁x₁ + β⁰₂x₂ + ... + β⁰_k x_k + ε

or, using vector notation:

y = x′β⁰ + ε

The dependent variable y is a scalar random variable, x = (x₁ x₂ ... x_k)′ is a k-vector of explanatory variables, and β⁰ = (β⁰₁ β⁰₂ ... β⁰_k)′. The superscript 0 in β⁰ means this is the "true value" of the unknown parameter. It will be defined more precisely later, and usually suppressed when it's not necessary for clarity.

Suppose that we want to use data to try to determine the best linear approximation to y using the variables x. The data {(yₜ, xₜ)}, t = 1, 2, ..., n, are obtained by some form of sampling¹. An individual observation is thus

yₜ = xₜ′β + εₜ

The n observations can be written in matrix form as

(3.1.1)   y = Xβ + ε,

where y = (y₁ y₂ ... yₙ)′ is n × 1 and X = (x₁ x₂ ... xₙ)′.

Linear models are more general than they might first appear, since one can employ nonlinear transformations of the variables:

φ₀(z) = [φ₁(w) φ₂(w) ⋯ φ_p(w)] β + ε

where the φᵢ() are known functions. Defining y = φ₀(z), x₁ = φ₁(w), etc. leads to a model in the form of equation 3.6.1. For example, the Cobb-Douglas model

z = A w₂^β₂ w₃^β₃ exp(ε)

can be transformed logarithmically to obtain

ln z = ln A + β₂ ln w₂ + β₃ ln w₃ + ε.

If we define y = ln z, β₁ = ln A, etc., we can put the model in the form needed. The approximation is linear in the parameters, but not necessarily linear in the variables.

¹For example, cross-sectional data may be obtained by random sampling. Time series data accumulate historically.
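As a quick illustration of this point, the following Octave fragment (our own sketch, not one of the example programs distributed with these notes; the parameter values are arbitrary) generates data from a Cobb-Douglas relationship and puts it in linear-in-parameters form by taking logs:

% A model nonlinear in the variables but linear in the parameters:
% generate z = A * w2^b2 * w3^b3 * exp(e), then regress ln z on a
% constant, ln w2 and ln w3.
n = 100;
A = 2; b2 = 0.6; b3 = 0.3;
w2 = exp(randn(n,1)); w3 = exp(randn(n,1));
e = 0.1*randn(n,1);
z = A .* w2.^b2 .* w3.^b3 .* exp(e);
y = log(z);                        % transformed dependent variable
X = [ones(n,1) log(w2) log(w3)];   % transformed regressors
betahat = X \ y;                   % OLS recovers [ln A; b2; b3] approximately
disp(betahat')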

3.2. Estimation by least squares


Figure 3.2.1, obtained by running TypicalData.m, shows some data that follow the linear model yₜ = β₁ + β₂xₜ₂ + εₜ. The green line is the true regression line β₁ + β₂xₜ₂, and the red crosses are the data points (xₜ₂, yₜ), where εₜ is a random error that has mean zero and is independent of xₜ₂. Exactly how the green line is defined will become clear later. In practice, we only have the data, and we don't know where the green line lies. We need to gain information about the straight line that best fits the data points.

FIGURE 3.2.1. Typical data, Classical Model (data points and the true regression line)

The ordinary least squares (OLS) estimator is defined as the value that minimizes the sum of the squared errors:

β̂ = argmin s(β)

where

s(β) = Σₜ₌₁ⁿ (yₜ − xₜ′β)²
     = (y − Xβ)′(y − Xβ)
     = y′y − 2y′Xβ + β′X′Xβ
     = ‖ y − Xβ ‖²

This last expression makes it clear how the OLS estimator is defined: it minimizes the Euclidean distance between y and Xβ. The fitted OLS coefficients β̂ will define the best linear approximation to y using x as basis functions, where "best" means minimum Euclidean distance. One could think of other estimators based upon other metrics. For example, the minimum absolute distance (MAD) estimator minimizes Σₜ|yₜ − xₜ′β|. Later, we will see that which estimator is "best" in terms of its statistical properties, rather than in terms of the metrics that define them, depends upon the properties of ε, about which we have as yet made no assumptions.

To minimize the criterion s(β), find the derivative with respect to β and set it to zero:

D_β s(β̂) = −2X′y + 2X′Xβ̂ ≡ 0

so

β̂ = (X′X)⁻¹X′y.

To verify that this is a minimum, check the second order sufficient condition:

D²_β s(β̂) = 2X′X.

Since ρ(X) = K, this matrix is positive definite, since it's a quadratic form in a p.d. matrix (the identity matrix of order n), so β̂ is in fact a minimizer.

• The fitted values are in the vector ŷ = Xβ̂.
• The residuals are in the vector ε̂ = y − Xβ̂.
• Note that

y = Xβ + ε = Xβ̂ + ε̂.

• Also, the first order conditions can be written as

X′y − X′Xβ̂ = 0
X′(y − Xβ̂) = 0
X′ε̂ = 0,

which is to say, the OLS residuals are orthogonal to X. Let's look at this more carefully.
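Before turning to the geometry, here is a minimal Octave sketch of these formulas (our own illustration with simulated data, not one of the programs referenced in the text):

% OLS via the first order conditions: betahat = (X'X)^(-1) X'y.
n = 50; K = 3;
X = [ones(n,1) randn(n,K-1)];
beta = [1; 2; -1];
y = X*beta + randn(n,1);
betahat = (X'*X) \ (X'*y);   % numerically, X \ y is equivalent and preferred
yhat = X*betahat;            % fitted values
ehat = y - yhat;             % residuals
disp(X'*ehat)                % first order conditions: X'ehat is (numerically) zero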

3.3. Geometric interpretation of least squares estimation


3.3.1. In X, Y Space. Figure 3.3.1 shows a typical fit to data, along with the true regression line. Note that the true line and the estimated line are different. This figure was created by running the Octave program OlsFit.m. You can experiment with changing the parameter values to see how this affects the fit, and to see how the fitted line will sometimes be close to the true line, and sometimes rather far away.

FIGURE 3.3.1. Example OLS Fit (data points, fitted line and true line)

3.3.2. In Observation Space. If we want to plot in observation space, we'll need to use only two or three observations, or we'll encounter some limitations of the blackboard. Let's use two. With only two observations, we can't have K > 1.

FIGURE 3.3.2. The fit in observation space

• We can decompose y into two components: the orthogonal projection onto the K-dimensional space spanned by X, Xβ̂, and the component that is the orthogonal projection onto the n − K dimensional subspace that is orthogonal to the span of X, ε̂.
• Since β̂ is chosen to make ε̂ as short as possible, ε̂ will be orthogonal to the space spanned by X. Since X is in this space, X′ε̂ = 0. Note that the f.o.c. that define the least squares estimator imply that this is so.

3.3.3. Projection Matrices. Xβ̂ is the projection of y onto the span of X, or

Xβ̂ = X(X′X)⁻¹X′y.

Therefore, the matrix that projects y onto the span of X is

P_X = X(X′X)⁻¹X′,

since

Xβ̂ = P_X y.

ε̂ is the projection of y onto the n − K dimensional space that is orthogonal to the span of X. We have that

ε̂ = y − Xβ̂ = y − X(X′X)⁻¹X′y = [Iₙ − X(X′X)⁻¹X′] y.

So the matrix that projects y onto the space orthogonal to the span of X is

M_X = Iₙ − X(X′X)⁻¹X′ = Iₙ − P_X.

We have

ε̂ = M_X y.

Therefore

y = P_X y + M_X y = Xβ̂ + ε̂.

These two projection matrices decompose the n-dimensional vector y into two orthogonal components - the portion that lies in the K-dimensional space defined by X, and the portion that lies in the orthogonal n − K dimensional space.

• Note that both P_X and M_X are symmetric and idempotent.
  – A symmetric matrix A is one such that A = A′.
  – An idempotent matrix A is one such that A = AA.
  – The only nonsingular idempotent matrix is the identity matrix.
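The following Octave fragment (ours, for illustration only) verifies these properties numerically:

% Projection onto the span of X and onto its orthogonal complement.
n = 20; K = 2;
X = [ones(n,1) randn(n,1)];
y = randn(n,1);
PX = X * inv(X'*X) * X';   % projects onto span(X)
MX = eye(n) - PX;          % projects onto the orthogonal complement
% Symmetry and idempotency (all of these should be numerically zero):
disp(norm(PX - PX'))
disp(norm(PX*PX - PX))
disp(norm(MX*MX - MX))
% Decomposition y = PX*y + MX*y, and the two parts are orthogonal:
disp(norm(y - (PX*y + MX*y)))
disp((PX*y)' * (MX*y))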

3.4. Influential observations and outliers

The OLS estimator of the i-th element of the vector β₀ is simply

β̂ᵢ = [(X′X)⁻¹X′]ᵢ· y ≡ cᵢ′y.

This is how we define a linear estimator - it's a linear function of the dependent variable. Since it's a linear combination of the observations on the dependent variable, where the weights are determined by the observations on the regressors, some observations may have more influence than others. Define

hₜ = (P_X)ₜₜ,

the t-th element on the main diagonal of P_X; equivalently, hₜ = eₜ′P_X eₜ, where eₜ is an n-vector of zeros with a 1 in the t-th position, so 0 ≤ hₜ ≤ 1. Since tr P_X = K, the average of the hₜ is K/n. So, on average, the weight on the yₜ's is K/n. If the weight is much higher, then the observation has the potential to affect the fit importantly. The weight hₜ is referred to as the leverage of the observation. However, an observation may also be influential due to the value of yₜ, rather than the weight it is multiplied by, which only depends on the xₜ's.

To account for this, consider estimation of β without using the t-th observation (designate this estimator as β̂⁽ᵗ⁾). One can show (see Davidson and MacKinnon, pp. 32-5 for proof) that

β̂⁽ᵗ⁾ = β̂ − (1/(1 − hₜ)) (X′X)⁻¹xₜ ε̂ₜ,

so the change in the t-th observation's fitted value is

xₜ′β̂ − xₜ′β̂⁽ᵗ⁾ = (hₜ/(1 − hₜ)) ε̂ₜ.

While an observation may be influential if it doesn't affect its own fitted value, it certainly is influential if it does. A fast means of identifying influential observations is to plot (hₜ/(1 − hₜ)) ε̂ₜ (which I will refer to as the own influence of the observation) as a function of t. Figure 3.4.1 gives an example plot of data, fit, leverage and influence. The Octave program is InfluentialObservation.m. If you re-run the program you will see that the leverage of the last observation (an outlying value of x) is always high, and the influence is sometimes high.

FIGURE 3.4.1. Detection of influential observations (data points, fitted line, leverage and influence)

After influential observations are detected, one needs to determine why they are influential. Possible causes include:

• data entry error, which can easily be corrected once detected. Data entry errors are very common.
• special economic factors that affect some observations. These would need to be identified and incorporated in the model. This is the idea behind structural change: the parameters may not be constant across all observations.
• pure randomness may have caused us to sample a low-probability observation.

There exist robust estimation methods that downweight outliers.
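A minimal Octave sketch of the leverage and own-influence calculations, written directly from the formulas above (the simulated data and the artificial outlier are our own choices; this is not the InfluentialObservation.m distributed with the notes):

% Leverage h_t (diagonal of P_X) and own influence: the change in the
% t-th fitted value when observation t is dropped.
n = 30;
x = [randn(n-1,1); 5];            % last observation is an outlier in x
X = [ones(n,1) x];
y = X*[1; 2] + randn(n,1);
betahat = X \ y;
ehat = y - X*betahat;
h = diag(X * inv(X'*X) * X');     % leverages; they average to K/n
own_influence = (h ./ (1 - h)) .* ehat;
[h(end) own_influence(end)]       % the outlying point has high leverage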


3.5. Goodness of fit

The fitted model is

y = Xβ̂ + ε̂.

Take the inner product:

y′y = β̂′X′Xβ̂ + 2β̂′X′ε̂ + ε̂′ε̂.

But the middle term of the RHS is zero since X′ε̂ = 0, so

(3.5.1)   y′y = β̂′X′Xβ̂ + ε̂′ε̂.

The uncentered R²ᵤ is defined as

R²ᵤ = 1 − ε̂′ε̂/y′y = β̂′X′Xβ̂/y′y = ‖P_X y‖²/‖y‖² = cos²(φ),

where φ is the angle between y and the span of X.

• The uncentered R² changes if we add a constant to y, since this changes φ (see Figure 3.5.1, the yellow vector is a constant, since it's on the 45 degree line in observation space). Another, more common definition measures the contribution of the variables, other than the constant term, to explaining the variation in y. Thus it measures the ability of the model to explain the variation of y about its unconditional sample mean.

FIGURE 3.5.1. Uncentered R²

Let ι = (1, 1, ..., 1)′, an n-vector. So

M_ι = Iₙ − ι(ι′ι)⁻¹ι′ = Iₙ − ιι′/n.

M_ι y just returns the vector of deviations from the mean. In terms of deviations from the mean, equation 3.5.1 becomes

y′M_ι y = β̂′X′M_ι Xβ̂ + ε̂′M_ι ε̂.

The centered R²_c is defined as

R²_c = 1 − ε̂′ε̂/(y′M_ι y).

Supposing that X contains a column of ones (i.e., there is a constant term), then X′ε̂ = 0 implies ι′ε̂ = Σₜ ε̂ₜ = 0, so M_ι ε̂ = ε̂. In this case

y′M_ι y = β̂′X′M_ι Xβ̂ + ε̂′ε̂.

• Supposing that a column of ones is in the space spanned by X, one can show that 0 ≤ R²_c ≤ 1.
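A small Octave fragment (our own sketch with simulated data) that computes both versions after an OLS fit:

% Uncentered and centered R^2 (X includes a constant column).
n = 40;
X = [ones(n,1) randn(n,2)];
y = X*[1; 0.5; -0.5] + randn(n,1);
betahat = X \ y;
ehat = y - X*betahat;
R2u = 1 - (ehat'*ehat) / (y'*y);              % uncentered
R2c = 1 - (ehat'*ehat) / sum((y - mean(y)).^2); % centered
[R2u R2c]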


3.6. The classical linear regression model


Up to this point the model is empty of content beyond the definition of a best linear approximation to y and some geometrical properties. There is no economic content to the model, and the regression parameters have no economic interpretation. For example, what is the partial derivative of y with respect to x_j? The linear approximation is

y = β₁x₁ + β₂x₂ + ... + β_k x_k + ε.

The partial derivative is

∂y/∂x_j = β_j + ∂ε/∂x_j.

Up to now, there's no guarantee that ∂ε/∂x_j = 0. For the β to have an economic meaning, we need to make additional assumptions. The assumptions that are appropriate to make depend on the data under consideration. We'll start with the classical linear regression model, which incorporates some assumptions that are clearly not realistic for economic data. This is to be able to explain some concepts with a minimum of confusion and notational clutter. Later we'll adapt the results to what we can get with more realistic assumptions.

Linearity: the model is a linear function of the parameter vector β⁰:

(3.6.1)   y = β⁰₁x₁ + β⁰₂x₂ + ... + β⁰_k x_k + ε,

or, using vector notation:

y = x′β⁰ + ε.

Nonstochastic linearly independent regressors: X is an n × K matrix of fixed constants, it has rank K, its number of columns, and

(3.6.2)   lim_{n→∞} (1/n)X′X = Q_X,

where Q_X is a finite positive definite matrix. This is needed to be able to identify the individual effects of the explanatory variables.

Independently and identically distributed errors:

(3.6.3)   ε ~ IID(0, σ²Iₙ):

ε is jointly distributed IID. This implies the following two properties:

Homoscedastic errors:

(3.6.4)   V(εₜ) = σ²₀, ∀t.

Nonautocorrelated errors:

(3.6.5)   E(εₜεₛ) = 0, ∀t ≠ s.

Optionally, we will sometimes assume that the errors are normally distributed.

Normally distributed errors:

(3.6.6)   ε ~ N(0, σ²Iₙ).


3.7. Small sample statistical properties of the least squares estimator


Up to now, we have only examined numeric properties of the OLS estimator, which always hold. Now we will examine statistical properties. The statistical properties depend upon the assumptions we can make.

3.7.1. Unbiasedness. We have β̂ = (X′X)⁻¹X′y. By linearity,

β̂ = (X′X)⁻¹X′(Xβ + ε) = β + (X′X)⁻¹X′ε.

By 3.6.2 and 3.6.3,

E[(X′X)⁻¹X′ε] = (X′X)⁻¹X′E(ε) = 0,

so the OLS estimator is unbiased under the assumptions of the classical model.

Figure 3.7.1 shows the results of a small Monte Carlo experiment where the OLS estimator was calculated for 10000 samples from the classical model y = β₁ + β₂x + ε, where x is fixed across samples. We can see that β₂ appears to be estimated without bias. The program that generates the plot is Unbiased.m, if you would like to experiment with this.

FIGURE 3.7.1. Unbiasedness of OLS under classical assumptions (histogram of β̂₂ − β₂ over the Monte Carlo replications)

With time series data, the OLS estimator will often be biased. Figure 3.7.2 shows the results of a small Monte Carlo experiment where the OLS estimator was calculated for 1000 samples from the AR(1) model yₜ = β₁ + β₂yₜ₋₁ + εₜ. In this case, assumption 3.6.2 does not hold: the regressors are stochastic. We can see that the bias in the estimation of β₂ is about -0.2.

The program that generates the plot is Biased.m, if you would like to experiment with this.
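A condensed sketch of the kind of Monte Carlo loop used by Unbiased.m (our own simplified version; the sample size, parameter values and number of replications here are arbitrary choices, not those used to produce the figures):

% Monte Carlo check of unbiasedness: X fixed across samples, normal errors.
n = 20; reps = 10000;
X = [ones(n,1) randn(n,1)];     % fixed regressors
beta = [1; 2];
b2 = zeros(reps,1);
for i = 1:reps
  y = X*beta + 3*randn(n,1);
  bhat = X \ y;
  b2(i) = bhat(2);
end
printf("mean of (beta2hat - beta2) over %d samples: %f\n", reps, mean(b2) - beta(2));
hist(b2 - beta(2), 30);          % histogram is centered near zero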

3.7.2. Normality. With the linearity assumption, we have β̂ = β + (X′X)⁻¹X′ε. This is a linear function of ε. Adding the assumption of normality (3.6.6, which implies strong exogeneity), then

β̂ ~ N(β, (X′X)⁻¹σ²₀),

since a linear function of a normal random vector is also normally distributed. In Figure 3.7.1 you can see that the estimator appears to be normally distributed. It in fact is normally distributed, since the DGP (see the Octave program) has normal errors. Even when the data may be taken to be IID, the assumption of normality is often questionable or simply untenable. For example, if the dependent variable is the number of automobile trips per week, it is a count variable with a discrete distribution, and is thus not normally distributed. Many variables in economics can take on only nonnegative values, which, strictly speaking, rules out normality.²

FIGURE 3.7.2. Biasedness of OLS when an assumption fails (histogram of β̂₂ − β₂ for the AR(1) Monte Carlo)

²Normality may be a good model nonetheless, as long as the probability of a negative value occurring is negligible under the model. This depends upon the mean being large enough in relation to the variance.

3.7.3. The variance of the OLS estimator and the Gauss-Markov theorem. Now let's make all the classical assumptions except the assumption of normality. We have β̂ = β + (X′X)⁻¹X′ε and we know that E(β̂) = β. So

V(β̂) = E[(β̂ − β)(β̂ − β)′] = E[(X′X)⁻¹X′εε′X(X′X)⁻¹] = (X′X)⁻¹σ²₀.

The OLS estimator is a linear estimator, which means that it is a linear function of the dependent variable, y:

β̂ = [(X′X)⁻¹X′] y,

where the weighting matrix is a function of the explanatory variables only, not the dependent variable. It is also unbiased under the present assumptions, as we proved above.

One could consider other weights W that are a function of X and that define some other linear estimator. We'll still insist upon unbiasedness. Consider β̃ = Wy, where W = W(X) is some K × n matrix function of X. Note that since W is a function of X, it is nonstochastic, too. If the estimator is unbiased, then we must have WX = I_K:

E(Wy) = E(WXβ₀ + Wε) = WXβ₀ = β₀  ⟹  WX = I_K.

The variance of β̃ is

V(β̃) = WW′σ²₀.

Define

D = W − (X′X)⁻¹X′,

so

W = D + (X′X)⁻¹X′.

Since WX = I_K, DX = 0, so

V(β̃) = [D + (X′X)⁻¹X′][D + (X′X)⁻¹X′]′σ²₀ = [DD′ + (X′X)⁻¹]σ²₀.

So

V(β̃) ≥ V(β̂).

The inequality is a shorthand means of expressing, more formally, that V(β̃) − V(β̂) is a positive semi-definite matrix. This is a proof of the Gauss-Markov Theorem: the OLS estimator is the "best linear unbiased estimator" (BLUE).

• It is worth emphasizing again that we have not used the normality assumption in any way to prove the Gauss-Markov theorem, so it is valid if the errors are not normally distributed, as long as the other assumptions hold.

To illustrate the Gauss-Markov result, consider the estimator that results from splitting the sample into p equally-sized parts, estimating using each part of the data separately by OLS, then averaging the p resulting estimators. You should be able to show that this estimator is unbiased, but inefficient with respect to the OLS estimator. The program Efficiency.m illustrates this using a small Monte Carlo experiment, which compares the OLS estimator and a 3-way split sample estimator. The data generating process follows the classical model. In Figures 3.7.3 and 3.7.4 we can see that the OLS estimator is more efficient, since the tails of its histogram are more narrow.

FIGURE 3.7.3. Gauss-Markov Result: The OLS estimator (histogram of β̂₂)

FIGURE 3.7.4. Gauss-Markov Result: The split sample estimator (histogram of β̂₂)

We have that E(β̂) = β and V(β̂) = (X′X)⁻¹σ²₀, but we still need to estimate the variance of ε, σ²₀, in order to have an idea of the precision of the estimates of β. A commonly used estimator of σ²₀ is

σ̂²₀ = (1/(n − K)) ε̂′ε̂.

This estimator is unbiased:

σ̂²₀ = (1/(n − K)) ε̂′ε̂ = (1/(n − K)) ε′M_X ε,

E(σ̂²₀) = (1/(n − K)) E(tr ε′M_X ε)
        = (1/(n − K)) E(tr M_X εε′)
        = (1/(n − K)) tr E(M_X εε′)
        = (1/(n − K)) σ²₀ tr M_X
        = (1/(n − K)) σ²₀ (n − K)
        = σ²₀,

where we use the fact that tr(AB) = tr(BA) when both products are conformable. Thus, this estimator is also unbiased under these assumptions.
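In Octave, the estimated error variance and the corresponding estimated covariance matrix of β̂ can be computed as in the following sketch (our own illustration with simulated data, not one of the referenced programs):

% Unbiased estimator of sigma^2 and the estimated var-cov of betahat.
n = 60; K = 3;
X = [ones(n,1) randn(n,K-1)];
y = X*[1; 2; -1] + 2*randn(n,1);
betahat = X \ y;
ehat = y - X*betahat;
sig2hat = (ehat'*ehat) / (n - K);   % unbiased under the classical assumptions
Vhat = sig2hat * inv(X'*X);         % estimated variance of betahat
se = sqrt(diag(Vhat));              % standard errors
[betahat se]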

3.8. Example: The Nerlove model


3.8.1. Theoretical background. For a firm that takes input prices w and the output level q as given, the cost minimization problem is to choose the quantities of inputs x to solve the problem

min_x w′x

subject to the restriction f(x) = q. The solution is the vector of factor demands x(w, q). The cost function is obtained by substituting the factor demands into the criterion function:

C(w, q) = w′x(w, q).

• Monotonicity. Increasing factor prices cannot decrease cost, so ∂C(w, q)/∂w ≥ 0. Remember that these derivatives give the conditional factor demands (Shephard's Lemma).
• Homogeneity. The cost function is homogeneous of degree 1 in input prices: C(tw, q) = tC(w, q), where t is a scalar constant. This is because the factor demands are homogeneous of degree zero in factor prices - they only depend upon relative prices.
• Returns to scale. The returns to scale parameter γ is defined as the inverse of the elasticity of cost with respect to output:

γ = [ (∂C(w, q)/∂q) (q/C(w, q)) ]⁻¹.

Constant returns to scale is the case where increasing production q implies that cost increases in the proportion 1:1. If this is the case, then γ = 1.


3.8.2. Cobb-Douglas functional form. The Cobb-Douglas functional form is linear in the logarithms of the regressors and the dependent variable. For a cost function, if there are g factors, the Cobb-Douglas cost function has the form

C = A w₁^β₁ ⋯ w_g^β_g q^β_q e^ε.

What is the elasticity of C with respect to w_j?

e^C_{w_j} = (∂C/∂w_j)(w_j/C)
          = β_j A w₁^β₁ ⋯ w_j^{β_j−1} ⋯ w_g^β_g q^β_q e^ε · w_j / (A w₁^β₁ ⋯ w_g^β_g q^β_q e^ε)
          = β_j.

This is one of the reasons the Cobb-Douglas form is popular - the coefficients are easy to interpret, since they are the elasticities of the dependent variable with respect to the explanatory variables. Note that in this case,

e^C_{w_j} = (∂C/∂w_j)(w_j/C) = x_j(w, q) w_j/C ≡ s_j(w, q),

the cost share of the j-th input. So with a Cobb-Douglas cost function, β_j = s_j(w, q): the cost shares are constants.

Note that after a logarithmic transformation we obtain

ln C = α + β₁ ln w₁ + ⋯ + β_g ln w_g + β_q ln q + ε,

where α = ln A. So we see that the transformed model is linear in the logs of the data.

One can verify that the property of HOD1 implies that

Σᵢ₌₁^g βᵢ = 1.

In other words, the cost shares add up to 1.

The hypothesis that the technology exhibits CRTS implies that γ = 1/β_q = 1, so β_q = 1. Likewise, monotonicity implies that the coefficients βᵢ ≥ 0, i = 1, ..., g.

3.8.3. The Nerlove data and OLS. The file nerlove.data contains data on 145 electric utility companies' cost of production, output and input prices. The data are for the U.S., and were collected by M. Nerlove. The observations are by row, and the columns are COMPANY, COST (C), OUTPUT (Q), PRICE OF LABOR (P_L), PRICE OF FUEL (P_F) and PRICE OF CAPITAL (P_K). Note that the data are sorted by output level (the third column).

We will estimate the Cobb-Douglas model

(3.8.1)   ln C = β₁ + β₂ ln Q + β₃ ln P_L + β₄ ln P_F + β₅ ln P_K + ε

using OLS. To do this yourself, you need the data file mentioned above, as well as Nerlove.m (the estimation program), and the library of Octave functions mentioned in the introduction to Octave that forms section 21 of this document.³
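For orientation, a bare-bones version of the estimation might look like the following Octave sketch (our own simplification, assuming nerlove.data is in the working directory and that its six columns are ordered as described above; the actual Nerlove.m in the distributed materials uses the supporting function library and produces the formatted output shown below):

% OLS estimation of the Cobb-Douglas cost function, equation 3.8.1.
data = load("nerlove.data");
cost     = data(:,2);   % columns: company, cost, output, P_labor, P_fuel, P_capital
output   = data(:,3);
plabor   = data(:,4);
pfuel    = data(:,5);
pcapital = data(:,6);
y = log(cost);
X = [ones(rows(data),1) log(output) log(plabor) log(pfuel) log(pcapital)];
[n, K] = size(X);
betahat = X \ y;
ehat = y - X*betahat;
sig2hat = (ehat'*ehat) / (n - K);
se = sqrt(diag(sig2hat * inv(X'*X)));
[betahat se]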
The results are

*********************************************************
OLS estimation results
Observations 145
R-squared 0.925955
Sigma-squared 0.153943

Results (Ordinary var-cov estimator)

            estimate   st.err.   t-stat.   p-value
constant      -3.527     1.774    -1.987     0.049
output         0.720     0.017    41.244     0.000
labor          0.436     0.291     1.499     0.136
fuel           0.427     0.100     4.249     0.000
capital       -0.220     0.339    -0.648     0.518

*********************************************************

• Do the theoretical restrictions hold?
• What do you think about RTS?
• Does the model fit well?


³If you are running the bootable CD, you have all of this installed and ready to run.


While we will use Octave programs as examples in this document, since following the programming statements is a useful way of learning how theory is put into practice, you may be interested in a more user-friendly environment for doing econometrics. I heartily recommend Gretl, the Gnu Regression, Econometrics, and Time-Series Library. This is an easy to use program, available in English, French, and Spanish, and it comes with a lot of data ready to use. It even has an option to save output as LaTeX fragments, so that I can just include the results into this document, no muss, no fuss. Here are the results of the Nerlove model from GRETL:

Model 2: OLS estimates using the 145 observations 1-145
Dependent variable: l_cost

(GRETL's coefficient table reports, for const, l_output, l_labor, l_fuel and l_capita, the coefficient, standard error, t-statistic and p-value, followed by summary statistics: the mean and S.D. of the dependent variable, the sum of squared residuals, the standard error of the residuals σ̂, the unadjusted and adjusted R², and the Akaike and Schwarz Bayesian information criteria.)

Fortunately, Gretl and my OLS program agree upon the results. Gretl is included in the bootable CD mentioned in the introduction. I recommend using GRETL to repeat the examples that are done using Octave.

The previous properties hold for finite sample sizes. Before considering the asymptotic properties of the OLS estimator it is useful to review the MLE estimator, since under the assumption of normal errors the two estimators coincide.


Exercises
(1) Prove that the split sample estimator used to generate figure 3.7.4 is unbiased.
(2) Calculate the OLS estimates of the Nerlove model using Octave and GRETL, and provide printouts of the results. Interpret the results.
(3) Do an analysis of whether or not there are influential observations for OLS estimation of the Nerlove model. Discuss.
(4) Using GRETL, examine the residuals after OLS estimation and tell me whether or not you believe that the assumption of independent identically distributed normal errors is warranted. No need to do formal tests, just look at the plots. Print out any that you think are relevant, and interpret them.
(5) For a random vector X ~ N(μ_x, Σ), what is the distribution of AX + b, where A and b are conformable matrices of constants?
(6) Using Octave, write a little program that verifies that tr(AB) = tr(BA) for A and B 4x4 matrices of random numbers. Note: there is an Octave function trace.
(7) For the model with a constant and a single regressor, yₜ = β₁ + β₂xₜ + εₜ, which satisfies the classical assumptions, prove that the variance of the OLS estimator declines to zero as the sample size increases.

CHAPTER 4

Maximum likelihood estimation


The maximum likelihood estimator is important since it is asymptotically efficient, as is shown below. For the classical linear model with normal errors, the ML and OLS estimators of β are the same, so the following theory is presented without examples. In the second half of the course, nonlinear models with nonnormal errors are introduced, and examples may be found there.
4.1. The likelihood function
Suppose we have a sample of size n of the random vectors y and z. Suppose the joint density of Y = (y₁ ... yₙ) and Z = (z₁ ... zₙ) is characterized by a parameter vector ψ₀:

f_{YZ}(Y, Z, ψ₀).

This is the joint density of the sample. This density can be factored as

f_{YZ}(Y, Z, ψ₀) = f_{Y|Z}(Y|Z, θ₀) f_Z(Z, ρ₀).

The likelihood function is just this density evaluated at other values ψ:

L(Y, Z, ψ) = f(Y, Z, ψ), ψ ∈ Ψ,

where Ψ is a parameter space. The maximum likelihood estimator of ψ₀ is the value of ψ that maximizes the likelihood function.

Note that if θ₀ and ρ₀ share no elements, then the maximizer of the conditional likelihood function f_{Y|Z}(Y|Z, θ) with respect to θ is the same as the maximizer of the overall likelihood function f_{YZ}(Y, Z, ψ) = f_{Y|Z}(Y|Z, θ) f_Z(Z, ρ), for the elements of ψ that correspond to θ. In this case, the variables Z are said to be exogenous for estimation of θ, and we may more conveniently work with the conditional likelihood function f_{Y|Z}(Y|Z, θ) for the purposes of estimating θ₀.

DEFINITION 4.1.1. The maximum likelihood estimator of θ₀ is θ̂ = argmax f_{Y|Z}(Y|Z, θ).

• If the n observations are independent, the likelihood function can be written as

L(Y|Z, θ) = ∏ₜ₌₁ⁿ f(yₜ|zₜ, θ),

where the fₜ are possibly of different form.
• If this is not possible, we can always factor the likelihood into contributions of observations, by using the fact that a joint density can be factored into the product of a marginal and conditional (doing this iteratively):

L(Y|Z, θ) = f(y₁|z₁, θ) f(y₂|y₁, z₂, θ) f(y₃|y₁, y₂, z₃, θ) ⋯ f(yₙ|y₁, ..., yₙ₋₁, zₙ, θ).

To simplify notation, define

xₜ = {y₁, ..., yₜ₋₁, zₜ},

so x₁ = z₁, x₂ = {y₁, z₂}, etc. - it contains exogenous and predetermined endogenous variables. Now the likelihood function can be written as

L(Y, θ) = ∏ₜ₌₁ⁿ f(yₜ|xₜ, θ).

The criterion function can be defined as the average log-likelihood function:

sₙ(θ) = (1/n) ln L(Y, θ) = (1/n) Σₜ₌₁ⁿ ln f(yₜ|xₜ, θ).

The maximum likelihood estimator may thus be defined equivalently as

θ̂ = argmax sₙ(θ),

where the set maximized over is defined below. Since ln(·) is a monotonic increasing function, ln L and sₙ maximize at the same value of θ. Dividing by n has no effect on θ̂.

4.1.1. Example: Bernoulli trial. Suppose that we are flipping a coin that may be biased, so that the probability of a heads may not be 0.5. Maybe we're interested in estimating the probability of a heads. Let y = 1(heads) be a binary variable that indicates whether or not a heads is observed. The outcome of a toss is a Bernoulli random variable:

f_Y(y, p₀) = p₀^y (1 − p₀)^{1−y}, y ∈ {0, 1}
           = 0, otherwise.

So a representative term that enters the likelihood function is

f_Y(y, p) = p^y (1 − p)^{1−y}

and

ln f_Y(y, p) = y ln p + (1 − y) ln(1 − p).

The derivative of this is

∂ ln f_Y(y, p)/∂p = y/p − (1 − y)/(1 − p).

Averaging this over a sample of size n gives

∂sₙ(p)/∂p = (1/n) Σᵢ₌₁ⁿ [ yᵢ/p − (1 − yᵢ)/(1 − p) ].

Setting to zero and solving gives

p̂ = ȳ.

So it's easy to calculate the MLE of p₀ in this case.

Now imagine that we had a bag full of bent coins, each bent around a sphere of a different radius (with the head pointing to the outside of the sphere). We might suspect that the probability of a heads could depend upon the radius. Suppose that pᵢ ≡ p(xᵢ, β) = (1 + exp(−xᵢ′β))⁻¹, where xᵢ = (1 rᵢ)′, so that β is a 2 × 1 vector. Now

∂pᵢ(β)/∂β = pᵢ(1 − pᵢ) xᵢ,

so

∂ ln f_Y(y, β)/∂β = [ yᵢ/pᵢ − (1 − yᵢ)/(1 − pᵢ) ] pᵢ(1 − pᵢ) xᵢ = (yᵢ − p(xᵢ, β)) xᵢ.

So the derivative of the average log likelihood function is now

∂sₙ(β)/∂β = (1/n) Σᵢ₌₁ⁿ (yᵢ − p(xᵢ, β)) xᵢ.

This is a set of 2 nonlinear equations in the two unknown elements in β. There is no explicit solution for the two elements that set the equations to zero. This is common with ML estimators: they are often nonlinear, and finding their values often requires use of numeric methods to find solutions to the first order conditions.
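To make the point concrete, here is a small Octave sketch (our own; the data generating values are arbitrary) that solves the first order conditions for the bent-coin example numerically with a few Newton iterations:

% ML estimation of p(x,beta) = 1/(1+exp(-x*beta)) by Newton's method
% applied to the score (1/n) sum (y_i - p_i) x_i.
n = 500;
x = [ones(n,1) rand(n,1)];          % second column plays the role of the radius
beta_true = [0.5; -1.0];
p = 1 ./ (1 + exp(-x*beta_true));
y = rand(n,1) < p;                   % Bernoulli outcomes
beta = zeros(2,1);                   % starting values
for iter = 1:20
  p = 1 ./ (1 + exp(-x*beta));
  g = x' * (y - p) / n;              % score
  H = -(x' * (x .* (p .* (1 - p)))) / n;   % Hessian of the average log-likelihood
  beta = beta - H \ g;               % Newton step
end
[beta beta_true]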

4.2. Consistency of MLE


To show consistency of the MLE, we need to make explicit some assumptions.

x g
pwd


x

This implies that is an interior point of the parameter space

which is compact.

imixation is over

an open bounded subset of

Compact parameter space:

Max-

Uniform convergence:

8
x

 d d
g p 7ea

g d
wc

d f
4RC!4 A f 4eS
d f

We have suppressed

here for simplicity. This requires that almost sure con-

vergence holds for all possible parameter values. For a given parameter value,
an ordinary Law of Large Numbers will usually imply almost sure convergence to the limit of the expectation. Convergence for a single element of
the parameter space, combined with the assumption of a compact parameter
space, ensures uniform convergence.

4.2. CONSISTENCY OF MLE


7

is

has a unique maximum in its rst argument.

We will use these assumptions to show that

f
d

First,

This implies that

8 p d d
f
T

p 7RaG
d d

87d
d f
eS9

Identication:

g d
 d

continuous in

8
x

is continuous in

p eaG
d  d

Continuity:

55

certainly exists, since a continuous function has a maximum on a

compact set.


b
p e
d
p e
d
d
4R f f
d
4R f f A
p d d


Second, for any

by Jensens inequality (

is a concave function).

Now, the expectation on the RHS is

p
4)  p e f 4eRdd
ft d

f
f

'

p e
d
d
e f f

is the density function of the observations, and since the integral of

any density is 1 Therefore, since

  t
)

p e
d

since

d
p p e


d
4R f f A
Taking limits, this is

8 E p Ry4R
d f d f

or

pd pd pd d
 e Guq 7e G
except on a set of zero probability (by the uniform convergence assumption).

4.3. THE SCORE FUNCTION

56

By the identication assumption there is a unique maximizer, the inequal-

is a limit point of

at least one limit point). Since

(any sequence from a compact set has

is a maximizer, independent of

we must

have

a.s.


q1

Cd

Suppose that

f
d

f
d

 p d cc p  p Ra uq p 7Ra
 d
d
d
d d

p d d


ity is strict if

p  p Ra uq p  Ra
d
d
d d

These last two inequalities imply that

a.s.

pd 

Thus there is only one limit point, and it is equal to the true parameter value
with probability one. In other words,

8 8  p d  d f

as

This completes the proof of strong consistency of the MLE. One can use weaker

assumptions to prove weak consistency (convergence in probability to

) of

the MLE. This is omitted here. Note that almost sure convergence implies
convergence in probability.

4.3. The score function

Differentiability: Assume that sₙ(θ) is twice continuously differentiable in a neighborhood N(θ₀) of θ₀, at least when n is large enough.

To maximize the log-likelihood function, take derivatives:

gₙ(Y, θ) = D_θ sₙ(θ) = (1/n) Σₜ₌₁ⁿ D_θ ln f(yₜ|xₜ, θ) ≡ (1/n) Σₜ₌₁ⁿ gₜ(θ).

This is the score vector (with dim K × 1). Note that the score function has Y as an argument, which implies that it is a random function. Y (and any exogenous variables) will often be suppressed for clarity, but one should not forget that they are still there.

The ML estimator θ̂ sets the derivatives to zero:

gₙ(θ̂) = (1/n) Σₜ₌₁ⁿ gₜ(θ̂) ≡ 0.

We will show that E_θ[gₜ(θ)] = 0, ∀t. This is the expectation taken with respect to the density f(θ), not necessarily f(θ₀):

E_θ[gₜ(θ)] = ∫ [D_θ ln f(yₜ|xₜ, θ)] f(yₜ|xₜ, θ) dyₜ
           = ∫ [1/f(yₜ|xₜ, θ)] [D_θ f(yₜ|xₜ, θ)] f(yₜ|xₜ, θ) dyₜ
           = ∫ D_θ f(yₜ|xₜ, θ) dyₜ.

Given some regularity conditions on boundedness of D_θ f, we can switch the order of integration and differentiation, by the dominated convergence theorem. This gives

E_θ[gₜ(θ)] = D_θ ∫ f(yₜ|xₜ, θ) dyₜ = D_θ 1 = 0,

where we use the fact that the integral of the density is 1.

• So E_θ[gₜ(θ)] = 0: the expectation of the score vector is zero.
• This holds for all t, so it implies that E_θ[gₙ(Y, θ)] = 0 (a quick simulation check of this property follows).
8  4S)
d f 

r  4Rx
d 

So

4.4. Asymptotic normality of MLE


is twice continuously differentiable. Take
d

about the true value

r p

d f
4RC9

a rst order Taylors series expansion of

d H


Recall that we assume that

h p d E eH y  p RD
d
d P d


 4 d H

or with appropriate denitions


d

Assume

is invertible (well justify

p RHb1 I 6 R
d 
d

 h p ipd 1
d

) P
Ycyd

e
f

this in a minute). So

e
d
8 e
4) "g  p
d
 d 
g p eHd  h p d R
d
d

where

4.4. ASYMPTOTIC NORMALITY OF MLE

This is
d

d
 e

8 d
e




where the notation

R
8 d d
d f
eS $

I
@
eE
d
f

eS
d f
RD y
d

Now consider

59

d f
4R 9

Given that this is an average of terms, it should usually be the case that this
satises a strong law of large numbers (SLLN). Regularity conditions are a set
of assumptions that guarantee that this will happen. There are different sets
of assumptions that can be used to justify appeal to different SLLNs. For

d
6IeE A

example, the

must not be too strongly dependent over time, and

their variances must not become innite. We dont assume any particular set
here, since the appropriate assumptions will depend upon the particularities
of a given model. However, we assume that a SLLN applies.
d

e
j

. Also, by the above differentiability assumtion,

converges to the limit of its expectation:

p d
R

d f
f
d
 eS Q A  R

This matrix converges to a nite limit.

k
l

e
d

continuous in . Given this,

we have that

is consistent, and since

d
4R
 p e cP
d
)

Also, since we know that

is

4.4. ASYMPTOTIC NORMALITY OF MLE

60

Re-arranging orders of limits and differentiation, which is legitimate given


regularity conditions, we get
d

p  p ea
d
d
d f
E p eS A f

p d
 R


Weve already seen that

p  p RaG p 7R
d
d
d  d
maximizes the limiting objective function. Since there is a unique maxd

i.e.,

imizer, and by the assumption that

p ea
d
d f
4RC9

(which holds in the limit), then

is twice continuously differentiable

must be negative denite, and there-

fore of full rank. Therefore the previous inversion is justied, asymptotically,

h p d 1
d
T


p d f 
 RCb1

1
1

1h
)


8  eE5)
d

by consistency. To avoid this collapse to a degenerate

r.v. (a constant vector) we need to scale by

8
1

 p RS
d f
T

Note that

As such, it is reasonable to assume that


h

a CLT applies.

I
@
p e
d
f
I
@
p c f
d 
 gE
f
d f
eS

8 d 
g p RDb1 h

Weve already seen that

This is

8 d 
g p eHb1 I 6 p ea
d

Now consider

(4.4.1)

and we have

A generic CLT states that, for

4.4. ASYMPTOTIC NORMALITY OF MLE

61

a random vector that satises certain conditions,

f
E$a 

4f a
f

will be of the form of an average, scaled by


h

1
I
a @f 1

f
 a
h

For example, if the

8
c
p RDb1
d 

the properties of the

for example. Then the properties of

depend on

have nite variances and are


h

This is the case for

ally,

must satisfy depend on the case at hand. Usu-

f
a

The certain conditions that

not too strongly dependent, then a CLT for dependent processes will apply.


9  q
  p eSgn1
d f 

Supposing that a CLT applies, and noting that


h

p e gb1 eI ! p ea
d f 
d
p

where

p e fb1 er f

p d
 ea


e4

p R o  q
d

(4.4.2)

d f
d f
R p RCD p RCh 1

This can also be written as

p RCb1
d f 

is known as the information matrix.


d

p ea o 
d

Combining [4.4.1] and [4.4.2], we get


d

8
I p ea p Ra o I p R  d u h p ssd 1
d
d
d

The MLE estimator is asymptotically normally distributed.

we get

and asymptotically normally distributed if


h

h p d 1
d

where

-consistent


n6

(4.4.3)

is

of a parameter
d

D EFINITION 1 (CAN). An estimator

62

4.4. ASYMPTOTIC NORMALITY OF MLE

is a nite positive denite matrix.

There do exist, in special cases, estimators that are consistent such that

1
8 h p ssd 1
d
T
h

mally,

These are known as superconsistent estimators, since nor-

is the highest factor that we can multiply by an still get convergence

to a stable limiting distribution.

D EFINITION 2 (Asymptotic unbiasedness). An estimator

of a parameter

is asymptotically unbiased if

8
7d  d t tA  f

pd

(4.4.4)

Estimators that are CAN are asymptotically unbiased, though not all consistent
estimators are asymptotically unbiased. Such cases are unusual, though. An
example is

E XERCISE 4.5. Consider an estimator

with density

r
1  d If
I

d
p

 d  I f )
d

Show that this estimator is consistent but asymptotically biased. Also ask
yourself how you could dene an estimator that would have this density.

R s4RD "s4RD 1 4 s4e


d d

 d


14

allows us to write
d

appear to be correlated one may question the specication of the model). This
(This forms the basis for a specication test proposed by White: if the scores

I f!A8@@889If g $ 2
d 
 f
 
4 2
I
@ 1

v
w4
d
w
HR 4dRus4dREg
 s4RE
f )

has conditioned on prior information, so what was random in

4
I@ 1

f )

are uncorrelated for

is xed in .

since for

If

and

and multiply by

d d P d
R s4eE ue5Vs4RE
y d

d
eE

d
ueE 5VP 4eE
A
f4RE s4RE y us4RE A 4eE
t d
d
d
P d
f t d
$4RE y e $eE eE A
d
P f t d d

g


54

d 4

The scores

Now sum over


(4.6.1)





Now differentiate again:

f$4RE E4ec
t d
d
f t d
$eE
so
be short for

d
4RE

 f t d
$eE

Let

d f
c  g

 )

8 d
g4Ra  4R
d

We will show that

4.6. The information matrix equality


d

4.6. THE INFORMATION MATRIX EQUALITY

63

p Rac
d

p Ra p ea o p e
d
d
d
I
I
p ea x d o
d
x d
x
I
p ea x
d
I
x d

p d
 Ran
p d
 Ranx
p d
 Ranx
x

all valid. These include

From this we see that there are alternative ways to estimate

that are

to estimate the information matrix. Why not?

p d
f f
R r d Sg n r d  n 1  Ra
8
4d

I
@
R 
$d d 
d1

f
p Ra
d
o

and

p R
d
d

p d
 ea

Note, one cant use


x

p d
 ea x o d
x

We can use

To estimate the asymptotic variance, we need estimators of

I 6 p Ra o  d h p ipd 1
d
d


(4.6.3)
h

p ea o I 6 p e
d
d
Using this,

I 6 p e
d

8p

dd

 h p d 1
d


in particular, for

simplies to


7d

This holds for all

8 d
4Ra o  e
d

(4.6.2)
d

limits, we get

since all cross products between different periods expect to zero. Finally take
4.6. THE INFORMATION MATRIX EQUALITY

64

4.7. THE CRAMR-RAO LOWER BOUND

65

These are known as the inverse Hessian, outer product of the gradient (OPG) and
sandwich estimators, respectively. The sandwich form is the most robust, since
it coincides with the covariance estimator of the quasi-ML estimator.
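Given the matrix of score contributions and the Hessian evaluated at θ̂, the three estimators can be formed as in the following Octave sketch (our own code, reusing the bent-coin logit example of Section 4.1.1; it is not part of the distributed example programs):

% Inverse Hessian, OPG and sandwich estimators of the asymptotic variance.
n = 500;
x = [ones(n,1) rand(n,1)];
beta_true = [0.5; -1.0];
p = 1 ./ (1 + exp(-x*beta_true));
y = rand(n,1) < p;
beta = zeros(2,1);
for iter = 1:20                            % get the MLE by Newton's method
  p = 1 ./ (1 + exp(-x*beta));
  g = x' * (y - p) / n;
  H = -(x' * (x .* (p .* (1 - p)))) / n;
  beta = beta - H \ g;
end
p = 1 ./ (1 + exp(-x*beta));               % evaluate everything at the MLE
H = -(x' * (x .* (p .* (1 - p)))) / n;     % Hessian of the average log-likelihood
G = x .* (y - p);                          % n x K matrix of score contributions g_t'
I_hat = (G'*G) / n;                        % estimate of the information matrix
V_invhess = -inv(H);                       % inverse Hessian form
V_opg     = inv(I_hat);                    % outer product of the gradient form
V_sand    = inv(H) * I_hat * inv(H);       % sandwich form
% Estimated var-cov of betahat itself is V/n in each case:
[diag(V_invhess) diag(V_opg) diag(V_sand)] / n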

4.7. The Cramér-Rao lower bound


THEOREM 3. [Cramér-Rao Lower Bound] The limiting variance of a CAN estimator of θ₀, say θ̃, minus the inverse of the information matrix is a positive semidefinite matrix.

Proof: Since the estimator is CAN, it is asymptotically unbiased, so

lim_{n→∞} E_θ(θ̃ − θ) = 0.

Differentiate wrt θ′:

D_θ′ lim_{n→∞} E_θ(θ̃ − θ) = lim_{n→∞} ∫ D_θ′ [f(Y, θ)(θ̃ − θ)] dy = 0

(this is a K × K matrix of zeros). Noting that D_θ′ f(Y, θ) = f(θ) D_θ′ ln f(θ), we can write

lim_{n→∞} ∫ (θ̃ − θ) f(θ) D_θ′ ln f(θ) dy + lim_{n→∞} ∫ f(Y, θ) D_θ′(θ̃ − θ) dy = 0.

Now note that D_θ′(θ̃ − θ) = −I_K, and ∫ f(Y, θ)(−I_K) dy = −I_K. With this we have

lim_{n→∞} ∫ (θ̃ − θ) f(θ) D_θ′ ln f(θ) dy = I_K.

Playing with powers of n we get

lim_{n→∞} ∫ √n(θ̃ − θ) (1/√n) [n D_θ′ ln f(θ)] f(θ) dy = I_K.

Note that the bracketed part is just the transpose of the score vector, g(θ), so we can write

lim_{n→∞} E_θ[ √n(θ̃ − θ) √n g(θ)′ ] = I_K.

This means that the covariance of the score function with √n(θ̃ − θ), for θ̃ any CAN estimator, is an identity matrix. Using this, suppose the variance of √n(θ̃ − θ) tends to V_∞(θ̃). Therefore,

(4.7.1)   V_∞ [ √n(θ̃ − θ) ; √n g(θ) ] = [ V_∞(θ̃)  I_K ; I_K  I_∞(θ) ].

Since this is a covariance matrix, it is positive semi-definite. Therefore, for any K-vector α,

[ α′  −α′I_∞⁻¹(θ) ] [ V_∞(θ̃)  I_K ; I_K  I_∞(θ) ] [ α ; −I_∞(θ)⁻¹α ] ≥ 0.

This simplifies to

α′ [ V_∞(θ̃) − I_∞⁻¹(θ) ] α ≥ 0.

Since α is arbitrary, V_∞(θ̃) − I_∞⁻¹(θ) is positive semidefinite. This concludes the proof.

This means that I_∞⁻¹(θ) is a lower bound for the asymptotic variance of a CAN estimator.

DEFINITION 4.7.1. (Asymptotic efficiency) Given two CAN estimators of a parameter θ₀, say θ̃ and θ̂, θ̂ is asymptotically efficient with respect to θ̃ if V_∞(θ̃) − V_∞(θ̂) is a positive semidefinite matrix.

A direct proof of asymptotic efficiency of an estimator is infeasible, but if one can show that the asymptotic variance is equal to the inverse of the information matrix, then the estimator is asymptotically efficient. In particular, the MLE is asymptotically efficient.

Summary of MLE

• Consistent
• Asymptotically normal (CAN)
• Asymptotically unbiased
• Asymptotically efficient
• This is for general MLE: we haven't specified the distribution or the linearity/nonlinearity of the estimator

EXERCISES

68

Exercises
(1) Consider coin tossing with a single possibly biased coin. The density function for the random variable $y$ is
\[ f_Y(y, p_0) = p_0^{\,y}(1 - p_0)^{1-y}, \qquad y \in \{0, 1\}. \]
Suppose that we have a sample of size $n$. We know from above that the ML estimator is $\hat p_0 = \bar y$. We also know from the theory above that
\[ \sqrt{n}\left(\bar y - p_0\right) \overset{d}{\to} N\left[0,\ \mathcal{J}_\infty(p_0)^{-1}\mathcal{I}_\infty(p_0)\mathcal{J}_\infty(p_0)^{-1}\right]. \]
a) Find the analytical expressions for $\mathcal{J}_\infty(p_0)$ and $\mathcal{I}_\infty(p_0)$ for this problem.
b) Write an Octave program that does a Monte Carlo study that shows that $\sqrt{n}(\bar y - p_0)$ is approximately normally distributed when $n$ is large. Please give me histograms that show the sampling frequency of $\sqrt{n}(\bar y - p_0)$ for several values of $n$.
(2) Consider the model $y_t = x_t'\beta + \alpha\epsilon_t$ where the errors follow the Cauchy (Student-t with 1 degree of freedom) density. So
\[ f(\epsilon_t) = \frac{1}{\pi\left(1 + \epsilon_t^2\right)}, \qquad -\infty < \epsilon_t < \infty. \]
The Cauchy density has a shape similar to a normal density, but with much thicker tails. Thus, extremely small and large errors occur much more frequently with this density than would happen if the errors were normally distributed. Find the score function $g_n(\theta)$ where $\theta = (\beta'\ \alpha)'$.
(3) Consider the classical linear regression model $y_t = x_t'\beta + \sigma\epsilon_t$ where $\epsilon_t \sim IIN(0,1)$. Find the score function $g_n(\theta)$ where $\theta = (\beta'\ \sigma)'$.
(4) Compare the first order conditions that define the ML estimators of problems 2 and 3 and interpret the differences. Why are the first order conditions that define an efficient estimator different in the two cases?
CHAPTER 5

Asymptotic properties of the least squares estimator


The OLS estimator under the classical assumptions is unbiased and BLUE, for all sample sizes. Now let's see what happens when the sample size tends to infinity.

5.1. Consistency

\[ \hat\beta = (X'X)^{-1}X'y = (X'X)^{-1}X'\left(X\beta_0 + \varepsilon\right) = \beta_0 + (X'X)^{-1}X'\varepsilon
   = \beta_0 + \left(\frac{X'X}{n}\right)^{-1}\frac{X'\varepsilon}{n}. \]
Consider the last two terms. By assumption $\lim_{n\to\infty}\left(\frac{X'X}{n}\right) = Q_X$, which implies that $\lim_{n\to\infty}\left(\frac{X'X}{n}\right)^{-1} = Q_X^{-1}$, since the inverse of a nonsingular matrix is a continuous function of the elements of the matrix. Considering $\frac{X'\varepsilon}{n}$,
\[ \frac{X'\varepsilon}{n} = \frac{1}{n}\sum_{t=1}^n x_t\varepsilon_t. \]
Each $x_t\varepsilon_t$ has expectation zero, so
\[ E\left(\frac{X'\varepsilon}{n}\right) = 0. \]
The variance of each term is
\[ V\left(x_t\varepsilon_t\right) = x_t x_t'\sigma^2. \]
As long as these are finite, and given a technical condition (for application of LLNs and CLTs, of which there are very many to choose from, I'm going to avoid the technicalities; basically, as long as terms of an average have finite variances and are not too strongly dependent, one will be able to find a LLN or CLT to apply), the Kolmogorov SLLN applies, so
\[ \frac{1}{n}\sum_{t=1}^n x_t\varepsilon_t \overset{a.s.}{\to} 0. \]
This implies that
\[ \hat\beta \overset{a.s.}{\to} \beta_0. \]
This is the property of strong consistency: the estimator converges almost surely to the true value.
- The consistency proof does not use the normality assumption.
- Remember that almost sure convergence implies convergence in probability.

5.2. Asymptotic normality

We've seen that the OLS estimator is normally distributed under the assumption of normal errors. If the error distribution is unknown, we of course don't know the distribution of the estimator. However, we can get asymptotic results. Assuming the distribution of $\varepsilon$ is unknown, but the other classical assumptions hold:
\[ \hat\beta = \beta_0 + (X'X)^{-1}X'\varepsilon \]
\[ \hat\beta - \beta_0 = (X'X)^{-1}X'\varepsilon \]
\[ \sqrt{n}\left(\hat\beta - \beta_0\right) = \left(\frac{X'X}{n}\right)^{-1}\frac{X'\varepsilon}{\sqrt{n}}. \]
- Now as before, $\left(\frac{X'X}{n}\right)^{-1} \to Q_X^{-1}$.
- Considering $\frac{X'\varepsilon}{\sqrt{n}}$, the limit of the variance is
\[ \lim_{n\to\infty} E\left(\frac{X'\varepsilon\varepsilon'X}{n}\right) = \lim_{n\to\infty}\frac{\sigma_0^2 X'X}{n} = \sigma_0^2 Q_X. \]
The mean is of course zero. To get asymptotic normality, we need to apply a CLT. We assume one (for instance, the Lindeberg-Feller CLT) holds, so
\[ \frac{X'\varepsilon}{\sqrt{n}} \overset{d}{\to} N\left(0, \sigma_0^2 Q_X\right). \]
Therefore,
\[ \sqrt{n}\left(\hat\beta - \beta_0\right) \overset{d}{\to} N\left(0, \sigma_0^2 Q_X^{-1}\right). \]
- In summary, the OLS estimator is normally distributed in small and large samples if $\varepsilon$ is normally distributed. If $\varepsilon$ is not normally distributed, $\hat\beta$ is asymptotically normally distributed when a CLT can be applied.
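The asymptotic normality result can be checked by simulation. The following Octave sketch (added for illustration; the sample size, number of replications and uniform error distribution are arbitrary choices) draws many samples with nonnormal errors and plots the distribution of the standardized slope estimate, which should look approximately Gaussian.

  % Minimal Monte Carlo sketch: OLS with nonnormal (uniform) errors.
  n = 200; reps = 1000; beta0 = [1; 0.5];
  z = zeros(reps, 1);
  for j = 1:reps
    x = [ones(n,1), rand(n,1)];
    e = sqrt(12)*(rand(n,1) - 0.5);      % uniform errors with mean 0, variance 1
    y = x*beta0 + e;
    b = (x'*x)\(x'*y);                   % OLS estimate
    z(j) = sqrt(n)*(b(2) - beta0(2));    % standardized slope coefficient
  end
  hist(z, 30);                           % roughly bell-shaped despite nonnormal errors
  printf("mean %f, std %f\n", mean(z), std(z));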

5.3. Asymptotic efficiency

The least squares objective function is
\[ s(\beta) = \sum_{t=1}^n \left(y_t - x_t'\beta\right)^2. \]
Supposing that $\varepsilon$ is normally distributed, the model is
\[ y = X\beta_0 + \varepsilon, \qquad \varepsilon \sim N\left(0, \sigma_0^2 I_n\right), \]
so
\[ f(\varepsilon) = \prod_{t=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{\varepsilon_t^2}{2\sigma^2}\right). \]
The joint density for $y$ can be constructed using a change of variables. We have $\varepsilon = y - X\beta$, so $\frac{\partial\varepsilon}{\partial y'} = I_n$ and $\left|\frac{\partial\varepsilon}{\partial y'}\right| = 1$, so
\[ f(y) = \prod_{t=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{\left(y_t - x_t'\beta\right)^2}{2\sigma^2}\right). \]
Taking logs,
\[ \ln L(\beta, \sigma) = -n\ln\sqrt{2\pi} - n\ln\sigma - \sum_{t=1}^n \frac{\left(y_t - x_t'\beta\right)^2}{2\sigma^2}. \]
It's clear that the fonc for the MLE of $\beta_0$ are the same as the fonc for OLS (up to multiplication by a constant), so the estimators are the same, under the present assumptions. Therefore, their properties are the same. In particular, under the classical assumptions with normality, the OLS estimator $\hat\beta$ is asymptotically efficient.

As we'll see later, it will be possible to use (iterated) linear estimation methods and still achieve asymptotic efficiency even if $Var(\varepsilon) \neq \sigma^2 I_n$, as long as $\varepsilon$ is still normally distributed. This is not the case if $\varepsilon$ is nonnormal. In general with nonnormal errors it will be necessary to use nonlinear estimation methods to achieve asymptotically efficient estimation.

CHAPTER 6

Restrictions and hypothesis tests


6.1. Exact linear restrictions
In many cases, economic theory suggests restrictions on the parameters of a model. For example, a demand function is supposed to be homogeneous of degree zero in prices and income. If we have a Cobb-Douglas (log-linear) model,
\[ \ln q = \beta_0 + \beta_1\ln p_1 + \beta_2\ln p_2 + \beta_3\ln m + \varepsilon, \]
then we need that
\[ k^0\ln q = \beta_0 + \beta_1\ln kp_1 + \beta_2\ln kp_2 + \beta_3\ln km + \varepsilon, \]
so
\[ \beta_1\ln p_1 + \beta_2\ln p_2 + \beta_3\ln m
   = \beta_1\ln kp_1 + \beta_2\ln kp_2 + \beta_3\ln km
   = \left(\ln k\right)\left(\beta_1 + \beta_2 + \beta_3\right) + \beta_1\ln p_1 + \beta_2\ln p_2 + \beta_3\ln m. \]
The only way to guarantee this for arbitrary $k$ is to set
\[ \beta_1 + \beta_2 + \beta_3 = 0, \]
which is a parameter restriction. In particular, this is a linear equality restriction, which is probably the most commonly encountered case.
6.1.1. Imposition. The general formulation of linear equality restrictions is the model
\[ y = X\beta + \varepsilon, \qquad R\beta = r, \]
where $R$ is a $Q \times K$ matrix, $Q < K$, and $r$ is a $Q \times 1$ vector of constants.
- We assume $R$ is of rank $Q$, so that there are no redundant restrictions.
- We also assume that there exists a $\beta$ that satisfies the restrictions: they aren't infeasible.

Let's consider how to estimate $\beta$ subject to the restrictions $R\beta = r$. The most obvious approach is to set up the Lagrangean
\[ \min_\beta\ s(\beta) = \frac{1}{n}\left(y - X\beta\right)'\left(y - X\beta\right) + 2\lambda'\left(R\beta - r\right). \]
The Lagrange multipliers are scaled by 2, which makes things less messy. The fonc are
\[ D_\beta s(\hat\beta_R, \hat\lambda) = -2X'y + 2X'X\hat\beta_R + 2R'\hat\lambda \equiv 0 \]
\[ D_\lambda s(\hat\beta_R, \hat\lambda) = R\hat\beta_R - r \equiv 0, \]
which can be written as
\[ \begin{bmatrix} X'X & R' \\ R & 0 \end{bmatrix}\begin{bmatrix} \hat\beta_R \\ \hat\lambda \end{bmatrix}
   = \begin{bmatrix} X'y \\ r \end{bmatrix}. \]
We get
\[ \begin{bmatrix} \hat\beta_R \\ \hat\lambda \end{bmatrix}
   = \begin{bmatrix} X'X & R' \\ R & 0 \end{bmatrix}^{-1}\begin{bmatrix} X'y \\ r \end{bmatrix}. \]
For the masochists: stepwise inversion. Define $P = R(X'X)^{-1}R'$. Working through the partitioned inverse, one obtains
\[ \begin{bmatrix} X'X & R' \\ R & 0 \end{bmatrix}^{-1}
   = \begin{bmatrix} (X'X)^{-1} - (X'X)^{-1}R'P^{-1}R(X'X)^{-1} & (X'X)^{-1}R'P^{-1} \\ P^{-1}R(X'X)^{-1} & -P^{-1} \end{bmatrix}, \]
so (everyone should start paying attention again)
\[ \hat\beta_R = \hat\beta - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}\left(R\hat\beta - r\right) \]
\[ \hat\lambda = \left[R(X'X)^{-1}R'\right]^{-1}\left(R\hat\beta - r\right). \]
The fact that $\hat\beta_R$ and $\hat\lambda$ are linear functions of $\hat\beta$ makes it easy to determine their distributions, since the distribution of $\hat\beta$ is already known. Recall that for $x$ a random vector, and for $A$ and $b$ a matrix and vector of constants, respectively, $Var(Ax + b) = A\,Var(x)\,A'$.

Though this is the obvious way to go about finding the restricted estimator, an easier way, if the number of restrictions is small, is to impose them by substitution. Write
\[ y = X_1\beta_1 + X_2\beta_2 + \varepsilon \]
\[ \begin{bmatrix} R_1 & R_2 \end{bmatrix}\begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} = r, \]
where $R_1$ is $Q \times Q$ nonsingular. Supposing the $Q$ restrictions are linearly independent, one can always make $R_1$ nonsingular by reorganizing the columns of $X$. Then
\[ \beta_1 = R_1^{-1}\left(r - R_2\beta_2\right). \]
Substitute this into the model
\[ y = X_1 R_1^{-1}\left(r - R_2\beta_2\right) + X_2\beta_2 + \varepsilon \]
\[ y - X_1 R_1^{-1}r = \left[X_2 - X_1 R_1^{-1}R_2\right]\beta_2 + \varepsilon, \]
or with the appropriate definitions,
\[ y_R = X_R\beta_2 + \varepsilon. \]
This model satisfies the classical assumptions, supposing the restriction is true. One can estimate by OLS. The variance of $\hat\beta_2$ is as before
\[ V(\hat\beta_2) = \left(X_R'X_R\right)^{-1}\sigma_0^2, \]
and the estimator is
\[ \hat V(\hat\beta_2) = \left(X_R'X_R\right)^{-1}\hat\sigma^2, \]
where one estimates $\sigma_0^2$ in the normal way, using the restricted model, i.e.,
\[ \hat\sigma_0^2 = \frac{\left(y_R - X_R\hat\beta_2\right)'\left(y_R - X_R\hat\beta_2\right)}{n - \left(K - Q\right)}. \]
To recover $\hat\beta_1$, use the restriction. To find the variance of $\hat\beta_1$, use the fact that it is a linear function of $\hat\beta_2$, so
\[ V(\hat\beta_1) = R_1^{-1}R_2\, V(\hat\beta_2)\, R_2'\left(R_1^{-1}\right)'. \]

6.1.2. Properties of the restricted estimator. We have that
\[ \hat\beta_R = \hat\beta - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}\left(R\hat\beta - r\right) \]
\[ = \beta + (X'X)^{-1}X'\varepsilon
   + (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}\left(r - R\beta\right)
   - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}X'\varepsilon, \]
so
\[ \hat\beta_R - \beta = (X'X)^{-1}X'\varepsilon
   + (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}\left(r - R\beta\right)
   - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}X'\varepsilon. \]
Mean squared error is
\[ MSE(\hat\beta_R) = E\left(\hat\beta_R - \beta\right)\left(\hat\beta_R - \beta\right)'. \]
Noting that the crosses between the second term and the other terms expect to zero, and that the cross of the first and third has a cancellation with the square of the third, we obtain
\[ MSE(\hat\beta_R) = (X'X)^{-1}\sigma^2
   + (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}\left(r - R\beta\right)\left(r - R\beta\right)'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}
   - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}\sigma^2. \]
So, the first term is the OLS covariance. The second term is PSD, and the third term is NSD.
- If the restriction is true, the second term is 0, so we are better off. True restrictions improve efficiency of estimation.
- If the restriction is false, we may be better or worse off, in terms of MSE, depending on the magnitudes of $r - R\beta$ and $\sigma^2$.
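To make the formulas of section 6.1.1 concrete, here is a small Octave sketch (added for illustration; the simulated data and the particular restriction are arbitrary assumptions) that computes the restricted estimator from the closed-form expression and verifies that the restriction holds exactly.

  % Minimal sketch: restricted OLS via
  % beta_R = beta_hat - (X'X)^{-1} R' [R (X'X)^{-1} R']^{-1} (R beta_hat - r)
  n = 100;
  x = [ones(n,1), randn(n,2)];
  beta0 = [1; 2; -1];
  y = x*beta0 + randn(n,1);
  R = [0 1 1];  r = 1;                       % restriction: beta_2 + beta_3 = 1
  b = (x'*x)\(x'*y);                         % unrestricted OLS
  xxi = inv(x'*x);
  b_r = b - xxi*R'*((R*xxi*R')\(R*b - r));   % restricted estimator, closed form
  printf("R*b_r - r = %g (should be zero)\n", R*b_r - r);

The same estimate could be obtained by the substitution method of the text; the closed form is convenient when the restrictions are easier to state as $R\beta = r$ than to solve out by hand.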
6.2. Testing
In many cases, one wishes to test economic theories. If theory suggests parameter restrictions, as in the above homogeneity example, one can test theory
by testing parameter restrictions. A number of tests are available.
6.2.1. t-test. Suppose one has the model
\[ y = X\beta + \varepsilon, \]
and one wishes to test the single restriction $H_0: R\beta = r$ vs. $H_A: R\beta \neq r$. Under $H_0$, with normality of the errors,
\[ R\hat\beta - r \sim N\left(0,\ R(X'X)^{-1}R'\sigma_0^2\right), \]
so
\[ \frac{R\hat\beta - r}{\sqrt{R(X'X)^{-1}R'\sigma_0^2}} \sim N(0,1). \]
The problem is that $\sigma_0^2$ is unknown. One could use the consistent estimator $\hat\sigma_0^2$ in place of $\sigma_0^2$, but the test would only be valid asymptotically in this case.

PROPOSITION 4.
(6.2.1) \[ \frac{N(0,1)}{\sqrt{\chi^2(q)/q}} \sim t(q), \]
as long as the $N(0,1)$ and the $\chi^2(q)$ are independent.

We need a few results on the $\chi^2$ distribution.

PROPOSITION 5. If $x \sim N(\mu, I_n)$ is a vector of $n$ independent r.v.'s, then
(6.2.2) \[ x'x \sim \chi^2(n, \lambda), \]
where $\lambda = \sum_i \mu_i^2 = \mu'\mu$ is the noncentrality parameter.

When a $\chi^2$ r.v. has the noncentrality parameter equal to zero, it is referred to as a central $\chi^2$ r.v., and its distribution is written as $\chi^2(n)$, suppressing the noncentrality parameter.

PROPOSITION 6. If the $n$ dimensional random vector $x \sim N(0, V)$, then
\[ x'V^{-1}x \sim \chi^2(n). \]
We'll prove this one as an indication of how the following unproven propositions could be proved. Proof: Factor $V^{-1}$ as $P'P$ (this is the Cholesky factorization). Then consider $y = Px$. We have
\[ y \sim N\left(0, PVP'\right), \]
but $P'P = V^{-1}$ implies $V = P^{-1}(P')^{-1}$, so $PVP' = I_n$ and thus $y \sim N(0, I_n)$. Thus $y'y \sim \chi^2(n)$, but
\[ y'y = x'P'Px = x'V^{-1}x, \]
and we get the result we wanted.

A more general proposition which implies this result is

PROPOSITION 7. If the $n$ dimensional random vector $x \sim N(0, V)$, then
(6.2.3) \[ x'Bx \sim \chi^2\left(\rho(B)\right) \]
if and only if $BV$ is idempotent.

An immediate consequence is

PROPOSITION 8. If the random vector (of dimension $n$) $x \sim N(0, I)$, and $B$ is idempotent with rank $r$, then
(6.2.4) \[ x'Bx \sim \chi^2(r). \]

Consider the random variable
\[ \frac{\hat\varepsilon'\hat\varepsilon}{\sigma_0^2} = \frac{\varepsilon'M_X\varepsilon}{\sigma_0^2}
   = \left(\frac{\varepsilon}{\sigma_0}\right)'M_X\left(\frac{\varepsilon}{\sigma_0}\right) \sim \chi^2(n - K). \]

PROPOSITION 9. If the random vector (of dimension $n$) $x \sim N(0, I)$, then $Ax$ and $x'Bx$ are independent if $AB = 0$.

Now consider (remember that we have only one restriction in this case)
\[ t = \frac{\dfrac{R\hat\beta - r}{\sqrt{\sigma_0^2 R(X'X)^{-1}R'}}}{\sqrt{\dfrac{\hat\varepsilon'\hat\varepsilon}{(n-K)\sigma_0^2}}}
     = \frac{R\hat\beta - r}{\sqrt{\hat\sigma_0^2 R(X'X)^{-1}R'}}. \]
This will have the $t(n-K)$ distribution if $\hat\beta$ and $\hat\varepsilon'\hat\varepsilon$ are independent. But $\hat\beta = \beta + (X'X)^{-1}X'\varepsilon$ and
\[ (X'X)^{-1}X'M_X = 0, \]
so
\[ t = \frac{R\hat\beta - r}{\hat\sigma_{R\hat\beta}} \sim t(n-K). \]
In particular, for the commonly encountered test of significance of an individual coefficient, for which $H_0: \beta_i = 0$ vs. $H_0: \beta_i \neq 0$, the test statistic is
\[ t = \frac{\hat\beta_i}{\hat\sigma_{\hat\beta_i}} \sim t(n-K). \]
- Note: the t-test is strictly valid only if the errors are actually normally distributed. If one has nonnormal errors, one could use the above asymptotic result to justify taking critical values from the $N(0,1)$ distribution, since $t(n-K) \overset{d}{\to} N(0,1)$ as $n \to \infty$. In practice, a conservative procedure is to take critical values from the $t$ distribution if nonnormality is suspected. This will reject $H_0$ less often since the $t$ distribution is fatter-tailed than is the normal.

6.2.2. F test. The F test allows testing multiple restrictions jointly.

PROPOSITION 10. If $x \sim \chi^2(r)$ and $y \sim \chi^2(s)$, then
(6.2.5) \[ \frac{x/r}{y/s} \sim F(r, s), \]
provided that $x$ and $y$ are independent.

PROPOSITION 11. If the random vector (of dimension $n$) $x \sim N(0, I)$, then $x'Ax$ and $x'Bx$ are independent if $AB = 0$.

Using these results, and previous results on the $\chi^2$ distribution, it is simple to show that the following statistic has the $F$ distribution:
\[ F = \frac{\left(R\hat\beta - r\right)'\left[R(X'X)^{-1}R'\right]^{-1}\left(R\hat\beta - r\right)}{q\,\hat\sigma_0^2} \sim F(q, n-K). \]
A numerically equivalent expression is
\[ \frac{\left(ESS_R - ESS_U\right)/q}{ESS_U/(n-K)} \sim F(q, n-K). \]
- Note: The F test is strictly valid only if the errors are truly normally distributed. The following tests will be appropriate when one cannot assume normally distributed errors.
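For concreteness, the following Octave sketch (added for illustration; the simulated regression and the particular restrictions are arbitrary assumptions) computes the t statistic for a single coefficient and the F statistic for a joint restriction, using the formulas above.

  % Minimal sketch: t test of H0: beta_2 = 0 and F test of H0: R*beta = r.
  n = 100;
  x = [ones(n,1), randn(n,2)];
  y = x*[1; 0; 0.5] + randn(n,1);
  [n, K] = size(x);
  b = (x'*x)\(x'*y);
  e = y - x*b;
  sig2 = (e'*e)/(n - K);                   % estimate of sigma^2
  vb = sig2*inv(x'*x);                     % estimated covariance of b
  tstat = b(2)/sqrt(vb(2,2));              % t(n-K) under H0: beta_2 = 0
  R = [0 1 0; 0 0 1];  r = [0; 0.5];  q = rows(R);
  F = (R*b - r)'*inv(R*inv(x'*x)*R')*(R*b - r)/(q*sig2);   % F(q, n-K) under H0
  printf("t = %f, F = %f\n", tstat, F);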
6.2.3. Wald-type tests. The Wald principle is based on the idea that if a restriction is true, the unrestricted model should approximately satisfy the restriction. Given that the least squares estimator is asymptotically normally distributed:
\[ \sqrt{n}\left(\hat\beta - \beta_0\right) \overset{d}{\to} N\left(0, \sigma_0^2 Q_X^{-1}\right), \]
then under $H_0: R\beta_0 = r$ we have
\[ \sqrt{n}\left(R\hat\beta - r\right) \overset{d}{\to} N\left(0, \sigma_0^2 R Q_X^{-1}R'\right), \]
so by Proposition [6]
\[ n\left(R\hat\beta - r\right)'\left[\sigma_0^2 R Q_X^{-1}R'\right]^{-1}\left(R\hat\beta - r\right) \overset{d}{\to} \chi^2(q). \]
Note that $Q_X^{-1}$ and $\sigma_0^2$ are not observable. The test statistic we use substitutes the consistent estimators. Use $(X'X/n)^{-1}$ as the consistent estimator of $Q_X^{-1}$. With this, there is a cancellation of $n$'s, and the statistic to use is
\[ \left(R\hat\beta - r\right)'\left[\hat\sigma_0^2 R(X'X)^{-1}R'\right]^{-1}\left(R\hat\beta - r\right) \overset{d}{\to} \chi^2(q). \]
- The Wald test is a simple way to test restrictions without having to estimate the restricted model.
- Note that this formula is similar to one of the formulae provided for the F test.
6.2.4. Score-type tests (Rao tests, Lagrange multiplier tests). In some cases, an unrestricted model may be nonlinear in the parameters, but the model is linear in the parameters under the null hypothesis. For example, the model
\[ y = \left(X\beta\right)^\gamma + \varepsilon \]
is nonlinear in $\beta$ and $\gamma$, but is linear in $\beta$ under $H_0: \gamma = 1$. Estimation of nonlinear models is a bit more complicated, so one might prefer to have a test based upon the restricted, linear model. The score test is useful in this situation.
- Score-type tests are based upon the general principle that the gradient vector of the unrestricted model, evaluated at the restricted estimate, should be asymptotically normally distributed with mean zero, if the restrictions are true. The original development was for ML estimation, but the principle is valid for a wide variety of estimation methods.

We have seen that
\[ \hat\lambda = \left[R(X'X)^{-1}R'\right]^{-1}\left(R\hat\beta - r\right), \]
so
\[ \frac{\hat\lambda}{\sqrt{n}} = \left[R\left(\frac{X'X}{n}\right)^{-1}R'\right]^{-1}\sqrt{n}\left(R\hat\beta - r\right). \]
Given that $\sqrt{n}(R\hat\beta - r) \overset{d}{\to} N(0, \sigma_0^2 R Q_X^{-1}R')$ under the null hypothesis, we obtain
\[ \frac{\hat\lambda}{\sqrt{n}} \overset{d}{\to} N\left(0,\ \sigma_0^2\left[R Q_X^{-1}R'\right]^{-1}\right), \]
so
\[ \left(\frac{\hat\lambda}{\sqrt{n}}\right)'\frac{R Q_X^{-1}R'}{\sigma_0^2}\left(\frac{\hat\lambda}{\sqrt{n}}\right) \overset{d}{\to} \chi^2(q). \]
However,
\[ \frac{1}{n}R Q_X^{-1}R' \approx R(X'X)^{-1}R', \]
so there is a cancellation and we get
\[ \frac{\hat\lambda' R(X'X)^{-1}R'\hat\lambda}{\sigma_0^2} \overset{d}{\to} \chi^2(q), \]
since the powers of $n$ cancel. To get a usable test statistic, substitute a consistent estimator of $\sigma_0^2$.
- This makes it clear why the test is sometimes referred to as a Lagrange multiplier test. It may seem that one needs the actual Lagrange multipliers to calculate this. If we impose the restrictions by substitution, these are not available. Note that the test can be written as
\[ \frac{\left(R'\hat\lambda\right)'(X'X)^{-1}R'\hat\lambda}{\sigma_0^2} \overset{d}{\to} \chi^2(q). \]
However, we can use the fonc for the restricted estimator:
\[ -X'y + X'X\hat\beta_R + R'\hat\lambda \equiv 0 \]
to get that
\[ R'\hat\lambda = X'y - X'X\hat\beta_R = X'\hat\varepsilon_R. \]
Substituting this into the above, we get
\[ \frac{\hat\varepsilon_R'X(X'X)^{-1}X'\hat\varepsilon_R}{\sigma_0^2} \overset{d}{\to} \chi^2(q), \]
but this is simply
\[ \hat\varepsilon_R'\frac{P_X}{\sigma_0^2}\hat\varepsilon_R \overset{d}{\to} \chi^2(q). \]
To see why the test is also known as a score test, note that the fonc for restricted least squares
\[ -X'y + X'X\hat\beta_R + R'\hat\lambda \equiv 0 \]
give us
\[ R'\hat\lambda = X'y - X'X\hat\beta_R, \]
and the rhs is simply the gradient (score) of the unrestricted model, evaluated at the restricted estimator. The scores evaluated at the unrestricted estimate are identically zero. The logic behind the score test is that the scores evaluated at the restricted estimate should be approximately zero, if the restriction is true. The test is also known as a Rao test, since P. Rao first proposed it in 1948.
6.2.5. Likelihood ratio-type tests. The Wald test can be calculated using the unrestricted model. The score test can be calculated using only the restricted model. The likelihood ratio test, on the other hand, uses both the restricted and the unrestricted estimators. The test statistic is
\[ LR = 2\left(\ln L(\hat\theta) - \ln L(\tilde\theta)\right), \]
where $\hat\theta$ is the unrestricted estimate and $\tilde\theta$ is the restricted estimate. To show that it is asymptotically $\chi^2$, take a second order Taylor's series expansion of $\ln L(\tilde\theta)$ about $\hat\theta$:
\[ \ln L(\tilde\theta) \simeq \ln L(\hat\theta) + \frac{n}{2}\left(\tilde\theta - \hat\theta\right)'H(\hat\theta)\left(\tilde\theta - \hat\theta\right) \]
(note, the first order term drops out since $D_\theta\ln L(\hat\theta) \equiv 0$ by the fonc, and we need to multiply the second-order term by $n$ since $H(\theta)$ is defined in terms of $\frac{1}{n}\ln L(\theta)$), so
\[ LR \simeq -n\left(\tilde\theta - \hat\theta\right)'H(\hat\theta)\left(\tilde\theta - \hat\theta\right). \]
As $n \to \infty$, $H(\hat\theta) \to H_\infty(\theta_0) = -\mathcal{I}_\infty(\theta_0)$, by the information matrix equality. So
\[ LR \simeq n\left(\tilde\theta - \hat\theta\right)'\mathcal{I}_\infty(\theta_0)\left(\tilde\theta - \hat\theta\right). \]
We also have that, from [??],
\[ \sqrt{n}\left(\hat\theta - \theta_0\right) \simeq \mathcal{I}_\infty(\theta_0)^{-1}\, n^{1/2}g(\theta_0). \]
An analogous result for the restricted estimator is (this is unproven here; to prove this set up the Lagrangean for MLE subject to $R\beta = r$ and manipulate the first order conditions):
\[ \sqrt{n}\left(\tilde\theta - \theta_0\right) \simeq \mathcal{I}_\infty(\theta_0)^{-1}
   \left(I_n - R'\left[R\mathcal{I}_\infty(\theta_0)^{-1}R'\right]^{-1}R\,\mathcal{I}_\infty(\theta_0)^{-1}\right)n^{1/2}g(\theta_0). \]
Combining the last two equations,
\[ \sqrt{n}\left(\tilde\theta - \hat\theta\right) \simeq -n^{1/2}\mathcal{I}_\infty(\theta_0)^{-1}R'\left[R\mathcal{I}_\infty(\theta_0)^{-1}R'\right]^{-1}R\,\mathcal{I}_\infty(\theta_0)^{-1}g(\theta_0), \]
so, substituting into [??],
\[ LR \simeq \left[n^{1/2}g(\theta_0)'\mathcal{I}_\infty(\theta_0)^{-1}R'\right]\left[R\mathcal{I}_\infty(\theta_0)^{-1}R'\right]^{-1}\left[R\mathcal{I}_\infty(\theta_0)^{-1}n^{1/2}g(\theta_0)\right]. \]
But since
\[ n^{1/2}g(\theta_0) \overset{d}{\to} N\left(0, \mathcal{I}_\infty(\theta_0)\right), \]
the linear function
\[ R\mathcal{I}_\infty(\theta_0)^{-1}n^{1/2}g(\theta_0) \overset{d}{\to} N\left(0, R\mathcal{I}_\infty(\theta_0)^{-1}R'\right). \]
We can see that LR is a quadratic form of this rv, with the inverse of its variance in the middle, so
\[ LR \overset{d}{\to} \chi^2(q). \]
6.3. The asymptotic equivalence of the LR, Wald and score tests

We have seen that the three tests all converge to $\chi^2$ random variables. In fact, they all converge to the same $\chi^2(q)$ rv, under the null hypothesis. We'll show that the Wald and LR tests are asymptotically equivalent. We have seen that the Wald test is asymptotically equivalent to
\[ W \simeq n\left(R\hat\beta - r\right)'\left[\sigma_0^2 R Q_X^{-1}R'\right]^{-1}\left(R\hat\beta - r\right) \overset{d}{\to} \chi^2(q). \]
Using $\hat\beta - \beta_0 = (X'X)^{-1}X'\varepsilon$ and $R\hat\beta - r = R(\hat\beta - \beta_0)$, we get
\[ \sqrt{n}R\left(\hat\beta - \beta_0\right) = \sqrt{n}R(X'X)^{-1}X'\varepsilon = R\left(\frac{X'X}{n}\right)^{-1}n^{-1/2}X'\varepsilon. \]
Substitute this into [??] to get
\[ W \simeq n^{-1}\varepsilon'X Q_X^{-1}R'\left[\sigma_0^2 R Q_X^{-1}R'\right]^{-1}R Q_X^{-1}X'\varepsilon
     \simeq \varepsilon'X(X'X)^{-1}R'\left[\sigma_0^2 R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}X'\varepsilon
     \simeq \frac{\varepsilon'A(A'A)^{-1}A'\varepsilon}{\sigma_0^2}
     \simeq \frac{\varepsilon'P_R\varepsilon}{\sigma_0^2}, \]
where $P_R$ is the projection matrix formed by the matrix $A = X(X'X)^{-1}R'$.
- Note that this matrix is idempotent and has $q$ columns, so the projection matrix has rank $q$.

Now consider the likelihood ratio statistic
\[ LR \simeq n^{1/2}g(\theta_0)'\mathcal{I}(\theta_0)^{-1}R'\left[R\mathcal{I}(\theta_0)^{-1}R'\right]^{-1}R\mathcal{I}(\theta_0)^{-1}n^{1/2}g(\theta_0). \]
Under normality, we have seen that the likelihood function is
\[ \ln L(\beta, \sigma) = -n\ln\sqrt{2\pi} - n\ln\sigma - \frac{1}{2}\frac{\left(y - X\beta\right)'\left(y - X\beta\right)}{\sigma^2}. \]
Using this,
\[ g(\beta_0) \equiv D_\beta\frac{1}{n}\ln L(\beta, \sigma) = \frac{X'\left(y - X\beta_0\right)}{n\sigma^2} = \frac{X'\varepsilon}{n\sigma^2}. \]
Also, by the information matrix equality:
\[ \mathcal{I}(\beta_0) = -H_\infty(\beta_0) = \lim -D_{\beta'}g(\beta_0) = \lim\frac{X'X}{n\sigma^2} = \frac{Q_X}{\sigma^2}, \]
so
\[ \mathcal{I}(\theta_0)^{-1} = \sigma^2 Q_X^{-1}. \]
Substituting these last expressions into [??], we get
\[ LR \simeq \varepsilon'X(X'X)^{-1}R'\left[\sigma_0^2 R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}X'\varepsilon
   = \frac{\varepsilon'P_R\varepsilon}{\sigma_0^2} \simeq W. \]
This completes the proof that the Wald and LR tests are asymptotically equivalent. Similarly, one can show that, under the null hypothesis,
\[ qF \simeq W \simeq LM \simeq LR. \]
- The proof for the statistics except for LR does not depend upon normality of the errors, as can be verified by examining the expressions for the statistics.
- The LR statistic is based upon distributional assumptions, since one can't write the likelihood function without them.
- However, due to the close relationship between the statistics qF and LR, supposing normality, the qF statistic can be thought of as a pseudo-LR statistic, in that it's like a LR statistic in that it uses the value of the objective functions of the restricted and unrestricted models, but it doesn't require distributional assumptions.
- The presentation of the score and Wald tests has been done in the context of the linear model. This is readily generalizable to nonlinear models and/or other estimation methods.

Though the four statistics are asymptotically equivalent, they are numerically different in small samples. The numeric values of the tests also depend upon how $\sigma^2$ is estimated, and we've already seen that there are several ways to do this. For example, all of the following are consistent for $\sigma^2$ under $H_0$:
\[ \frac{\hat\varepsilon'\hat\varepsilon}{n - K}, \qquad \frac{\hat\varepsilon'\hat\varepsilon}{n}, \qquad
   \frac{\hat\varepsilon_R'\hat\varepsilon_R}{n - K + Q}, \qquad \frac{\hat\varepsilon_R'\hat\varepsilon_R}{n}, \]
and in general the denominator can be replaced with any quantity $a$ such that $\lim a/n = 1$.

It can be shown, for linear regression models subject to linear restrictions, and if $\hat\varepsilon'\hat\varepsilon/n$ is used to calculate the Wald test and $\hat\varepsilon_R'\hat\varepsilon_R/n$ is used for the score test, that
\[ W > LR > LM. \]
For this reason, the Wald test will always reject if the LR test rejects, and in turn the LR test rejects if the LM test rejects. This is a bit problematic: there is the possibility that by careful choice of the statistic used, one can manipulate reported results to favor or disfavor a hypothesis. A conservative/honest approach would be to report all three test statistics when they are available. In the case of linear models with normal errors the F test is to be preferred, since asymptotic approximations are not an issue.

The small sample behavior of the tests can be quite different. The true size (probability of rejection of the null when the null is true) of the Wald test is often dramatically higher than the nominal size associated with the asymptotic distribution. Likewise, the true size of the score test is often smaller than the nominal size.
6.4. Interpretation of test statistics
Now that we have a menu of test statistics, we need to know how to use
them.
6.5. Confidence intervals

Confidence intervals for single coefficients are generated in the normal manner. Given the $t$ statistic
\[ t(\beta) = \frac{\hat\beta - \beta}{\hat\sigma_{\hat\beta}}, \]
a $100(1-\alpha)\%$ confidence interval for $\beta_0$ is defined by the bounds of the set of $\beta$ such that $t(\beta)$ does not reject $H_0: \beta_0 = \beta$, using an $\alpha$ significance level:
\[ C(\alpha) = \left\{\beta : -c_{\alpha/2} < \frac{\hat\beta - \beta}{\hat\sigma_{\hat\beta}} < c_{\alpha/2}\right\}. \]
The set of such $\beta$ is the interval
\[ \hat\beta \pm \hat\sigma_{\hat\beta}\, c_{\alpha/2}. \]
A confidence ellipse for two coefficients jointly would be, analogously, the set of $\{\beta_1, \beta_2\}$ such that the $F$ (or some other test statistic) doesn't reject at the specified critical value. This generates an ellipse, if the estimators are correlated.
- The region is an ellipse, since the CI for an individual coefficient defines an (infinitely long) rectangle with total prob. mass $1-\alpha$, since the other coefficient is marginalized (e.g., can take on any value). Since the ellipse is bounded in both dimensions but also contains mass $1-\alpha$, it must extend beyond the bounds of the individual CI.

FIGURE 6.5.1. Joint and Individual Confidence Regions.

From the picture we can see that:
- Rejection of hypotheses individually does not imply that the joint test will reject.
- Joint rejection does not imply individual tests will reject.
6.6. Bootstrapping
When we rely on asymptotic theory to use the normal distribution-based tests and confidence intervals, we're often at serious risk of making important errors. If the sample size is small and errors are highly nonnormal, the small sample distribution of $\sqrt{n}(\hat\beta - \beta_0)$ may be very different than its large sample distribution. Also, the distributions of test statistics may not resemble their limiting distributions at all. A means of trying to gain information on the small sample distribution of test statistics and estimators is the bootstrap. We'll consider a simple example, just to get the main idea.

Suppose that
\[ y = X\beta_0 + \varepsilon, \qquad \varepsilon \sim IID(0, \sigma_0^2), \qquad X \text{ is nonstochastic.} \]
Given that the distribution of $\varepsilon$ is unknown, the distribution of $\hat\beta$ will be unknown in small samples. However, since we have random sampling, we could generate artificial data. The steps are:
(1) Draw $n$ observations from $\hat\varepsilon$ with replacement. Call this vector $\tilde\varepsilon^j$ (it's an $n \times 1$ vector).
(2) Then generate the data by $\tilde y^j = X\hat\beta + \tilde\varepsilon^j$.
(3) Now take this and estimate $\tilde\beta^j = (X'X)^{-1}X'\tilde y^j$.
(4) Save $\tilde\beta^j$.
(5) Repeat steps 1-4, until we have a large number, $J$, of $\tilde\beta^j$.

With this, we can use the replications to calculate the empirical distribution of $\tilde\beta^j$. One way to form a $100(1-\alpha)\%$ confidence interval for $\beta_0$ would be to order the $\tilde\beta^j$ from smallest to largest, and drop the first and last $J\alpha/2$ of the replications, and use the remaining endpoints as the limits of the CI. Note that this will not give the shortest CI if the empirical distribution is skewed.
- Suppose one was interested in the distribution of some function of $\hat\beta$, for example a test statistic. Simple: just calculate the transformation for each $j$, and work with the empirical distribution of the transformation.
- If the assumption of iid errors is too strong (for example if there is heteroscedasticity or autocorrelation, see below) one can work with a bootstrap defined by sampling from $(y, x)$ with replacement.
- How to choose $J$: $J$ should be large enough that the results don't change with repetition of the entire bootstrap. This is easy to check. If you find the results change a lot, increase $J$ and try again.
- The bootstrap is based fundamentally on the idea that the empirical distribution of the sample data converges to the actual sampling distribution as $n$ becomes large, so statistics based on sampling from the empirical distribution should converge in distribution to statistics based on sampling from the actual sampling distribution.
- In finite samples, this doesn't hold. At a minimum, the bootstrap is a good way to check if asymptotic theory results offer a decent approximation to the small sample distribution.
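A minimal Octave sketch of steps (1)-(5) follows. It is added here for illustration only; it is not one of the course's bootstrap scripts, and the sample size, number of replications and skewed error distribution are arbitrary assumptions.

  % Minimal residual-bootstrap sketch for the percentile CI described above.
  n = 50;  J = 999;
  x = [ones(n,1), randn(n,1)];
  y = x*[1; 2] + (randn(n,1).^2 - 1);      % skewed, nonnormal errors
  b = (x'*x)\(x'*y);  e = y - x*b;
  bstar = zeros(J, columns(x));
  for j = 1:J
    idx = ceil(n*rand(n,1));               % draw residuals with replacement
    ystar = x*b + e(idx);
    bstar(j,:) = ((x'*x)\(x'*ystar))';
  end
  alpha = 0.05;
  s = sort(bstar(:,2));
  ci = [s(floor(J*alpha/2)+1), s(ceil(J*(1-alpha/2)))];   % percentile CI for beta_2
  printf("bootstrap CI for beta_2: [%f, %f]\n", ci(1), ci(2));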

6.7. Testing nonlinear restrictions, and the Delta Method
Testing nonlinear restrictions of a linear model is not much more difficult, at least when the model is linear. Since estimation subject to nonlinear restrictions requires nonlinear estimation methods, which are beyond the scope of this course, we'll just consider the Wald test for nonlinear restrictions on a linear model.

Consider the nonlinear restrictions
\[ r(\beta_0) = 0, \]
where $r(\cdot)$ is a $q$-vector valued function. Write the derivative of the restriction evaluated at $\beta$ as
\[ D_{\beta'}r(\beta)\big|_\beta = R(\beta). \]
We suppose that the restrictions are not redundant in a neighborhood of $\beta_0$, so that $\rho\left(R(\beta)\right) = q$ in a neighborhood of $\beta_0$. Take a first order Taylor's series expansion of $r(\hat\beta)$ about $\beta_0$:
\[ r(\hat\beta) = r(\beta_0) + R(\beta^*)\left(\hat\beta - \beta_0\right), \]
where $\beta^*$ is a convex combination of $\hat\beta$ and $\beta_0$. Under the null hypothesis we have
\[ r(\hat\beta) = R(\beta^*)\left(\hat\beta - \beta_0\right). \]
Due to consistency of $\hat\beta$ we can replace $\beta^*$ by $\beta_0$, asymptotically, so
\[ \sqrt{n}\,r(\hat\beta) \simeq \sqrt{n}\,R(\beta_0)\left(\hat\beta - \beta_0\right). \]
We've already seen the distribution of $\sqrt{n}(\hat\beta - \beta_0)$. Using this we get
\[ \sqrt{n}\,r(\hat\beta) \overset{d}{\to} N\left(0,\ R(\beta_0)Q_X^{-1}R(\beta_0)'\sigma_0^2\right). \]
Considering the quadratic form
\[ \frac{n\,r(\hat\beta)'\left[R(\beta_0)Q_X^{-1}R(\beta_0)'\right]^{-1}r(\hat\beta)}{\sigma_0^2} \overset{d}{\to} \chi^2(q) \]
under the null hypothesis. Substituting consistent estimators for $\beta_0$, $Q_X$ and $\sigma_0^2$, the resulting statistic is
\[ \frac{r(\hat\beta)'\left[R(\hat\beta)(X'X)^{-1}R(\hat\beta)'\right]^{-1}r(\hat\beta)}{\hat\sigma_0^2} \overset{d}{\to} \chi^2(q) \]
under the null hypothesis.
- This is known in the literature as the Delta method, or as Klein's approximation.
- Since this is a Wald test, it will tend to over-reject in finite samples. The score and LR tests are also possibilities, but they require estimation methods for nonlinear models, which aren't in the scope of this course.

Note that this also gives a convenient way to estimate nonlinear functions and associated asymptotic confidence intervals. If the nonlinear function $r(\beta_0)$ is not hypothesized to be zero, we just have
\[ \sqrt{n}\left(r(\hat\beta) - r(\beta_0)\right) \overset{d}{\to} N\left(0,\ R(\beta_0)Q_X^{-1}R(\beta_0)'\sigma_0^2\right), \]
so an approximation to the distribution of the function of the estimator is
\[ r(\hat\beta) \approx N\left(r(\beta_0),\ R(\beta_0)(X'X)^{-1}R(\beta_0)'\sigma_0^2\right). \]
For example, the vector of elasticities of a function $f(x)$ is
\[ \eta(x) = \frac{\partial f(x)}{\partial x} \odot \frac{x}{f(x)}, \]
where $\odot$ means element-by-element multiplication. Suppose we estimate a linear function
\[ y = x'\beta + \varepsilon. \]
The elasticities of $y$ w.r.t. $x$ are
\[ \eta(x) = \beta \odot \frac{x}{x'\beta} \]
(note that this is the entire vector of elasticities). The estimated elasticities are
\[ \hat\eta(x) = \hat\beta \odot \frac{x}{x'\hat\beta}. \]
To calculate the estimated standard errors of all five elasticities, use
\[ R(\beta) = \frac{\partial\eta(x)}{\partial\beta'} = \frac{\mathrm{diag}(x)}{x'\beta} - \frac{\left(\beta \odot x\right)x'}{\left(x'\beta\right)^2}. \]
To get a consistent estimator just substitute in $\hat\beta$. Note that the elasticity and the standard error are functions of $x$. The program ExampleDeltaMethod.m shows how this can be done.

In many cases, nonlinear restrictions can also involve the data, not just the parameters. For example, consider a model of expenditure shares. Let $x(p, m)$ be a demand function, where $p$ is prices and $m$ is income. An expenditure share system for $G$ goods is
\[ s_i(p, m) = \frac{p_i x_i(p, m)}{m}, \qquad i = 1, 2, \dots, G. \]
Now demand must be positive, and we assume that expenditures sum to income, so we have the restrictions
\[ 0 \leq s_i(p, m) \leq 1, \ \forall i, \qquad \sum_{i=1}^G s_i(p, m) = 1. \]
Suppose we postulate a linear model for the expenditure shares:
\[ s_i(p, m) = \alpha_i + p'\beta_i + m\gamma_i + \varepsilon_i. \]
It is fairly easy to write restrictions such that the shares sum to one, but the restriction that the shares lie in the $[0, 1]$ interval depends on both the parameters and the values of $p$ and $m$. It is impossible to impose the restriction that $0 \leq s_i(p, m) \leq 1$ for all possible $p$ and $m$. In such cases, one might consider whether or not a linear model is a reasonable specification.
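The following Octave sketch illustrates the delta method calculation. It is an illustration added here, not the ExampleDeltaMethod.m program referred to above; the simulated regression and the choice of the function $1/\beta_2$ are assumptions made only for the example.

  % Minimal delta-method sketch: standard error of a nonlinear function of beta_hat.
  n = 200;
  x = [ones(n,1), rand(n,1) + 0.5];
  y = x*[1; 0.8] + 0.1*randn(n,1);
  [n, K] = size(x);
  b = (x'*x)\(x'*y);
  e = y - x*b;
  vb = ((e'*e)/(n - K))*inv(x'*x);     % estimated covariance of beta_hat
  f  = 1/b(2);                          % the nonlinear function, e.g. RTS = 1/beta_2
  Rb = [0, -1/b(2)^2];                  % gradient of the function w.r.t. beta'
  se = sqrt(Rb*vb*Rb');                 % delta-method standard error
  printf("estimate %f, delta-method s.e. %f\n", f, se);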

6.8. Example: the Nerlove data
Remember from a previous example (section 3.8.3) that the OLS results for the Nerlove model are

*********************************************************
OLS estimation results
Observations 145
R-squared 0.925955
Sigma-squared 0.153943

Results (Ordinary var-cov estimator)

            estimate   st.err.   t-stat.   p-value
constant      -3.527     1.774    -1.987     0.049
output         0.720     0.017    41.244     0.000
labor          0.436     0.291     1.499     0.136
fuel           0.427     0.100     4.249     0.000
capital       -0.220     0.339    -0.648     0.518
*********************************************************

Remember that if we have constant returns to scale, then $\beta_Q = 1$, and if there is homogeneity of degree 1 then $\beta_L + \beta_F + \beta_K = 1$. We can test these hypotheses either separately or jointly. NerloveRestrictions.m imposes and tests CRTS and then HOD1. From it we obtain the results that follow:

Imposing and testing HOD1

*******************************************************
Restricted LS estimation results
Observations 145
R-squared 0.925652
Sigma-squared 0.155686

            estimate   st.err.   t-stat.   p-value
constant      -4.691     0.891    -5.263     0.000
output         0.721     0.018    41.040     0.000
labor          0.593     0.206     2.878     0.005
fuel           0.414     0.100     4.159     0.000
capital       -0.007     0.192    -0.038     0.969
*******************************************************

            Value     p-value
F            0.574     0.450
Wald         0.594     0.441
LR           0.593     0.441
Score        0.592     0.442

Imposing and testing CRTS

*******************************************************
Restricted LS estimation results
Observations 145
R-squared 0.790420
Sigma-squared 0.438861

            estimate   st.err.   t-stat.   p-value
constant      -7.530     2.966    -2.539     0.012
output         1.000     0.000       Inf     0.000
labor          0.020     0.489     0.040     0.968
fuel           0.715     0.167     4.289     0.000
capital        0.076     0.572     0.132     0.895
*******************************************************

            Value     p-value
F          256.262     0.000
Wald       265.414     0.000
LR         150.863     0.000
Score       93.771     0.000

Notice that the input price coefficients in fact sum to 1 when HOD1 is imposed. HOD1 is not rejected at usual significance levels (e.g., $\alpha = 0.10$). Also, $R^2$ does not drop much when the restriction is imposed, compared to the unrestricted results. For CRTS, you should note that $\hat\beta_Q = 1$, so the restriction is satisfied. Also note that the hypothesis that $\beta_Q = 1$ is rejected by the test statistics at all reasonable significance levels. Note that $R^2$ drops quite a bit when imposing CRTS. If you look at the unrestricted estimation results, you can see that a t-test for $\beta_Q = 1$ also rejects, and that a confidence interval for $\beta_Q$ does not overlap 1.

From the point of view of neoclassical economic theory, these results are not anomalous: HOD1 is an implication of the theory, but CRTS is not.

EXERCISE 12. Modify the NerloveRestrictions.m program to impose and test the restrictions jointly.
The Chow test. Since CRTS is rejected, let's examine the possibilities more carefully. Recall that the data is sorted by output (the third column). Define 5 subsamples of firms, with the first group being the 29 firms with the lowest output levels, then the next 29 firms, etc. The five subsamples can be indexed by $j = 1, 2, \dots, 5$, where $j = 1$ for $t = 1, 2, \dots, 29$, $j = 2$ for $t = 30, 31, \dots, 58$, etc. Define a piecewise linear model
(6.8.1)
\[ \ln C_t = \beta_1^j + \beta_2^j\ln Q_t + \beta_3^j\ln P_{Lt} + \beta_4^j\ln P_{Ft} + \beta_5^j\ln P_{Kt} + \varepsilon_t, \]
where $j$ is a superscript (not a power) that indicates that the coefficients may be different according to the subsample in which the observation falls. That is, the coefficients depend upon $j$, which in turn depends upon $t$. Note that the first column of nerlove.data indicates this way of breaking up the sample. The new model may be written as
(6.8.2)
\[ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_5 \end{bmatrix}
   = \begin{bmatrix} X_1 & 0 & \cdots & 0 \\ 0 & X_2 & & \\ & & \ddots & \\ 0 & & & X_5 \end{bmatrix}
     \begin{bmatrix} \beta^1 \\ \beta^2 \\ \vdots \\ \beta^5 \end{bmatrix}
   + \begin{bmatrix} \varepsilon^1 \\ \varepsilon^2 \\ \vdots \\ \varepsilon^5 \end{bmatrix}, \]
where $y_1$ is $29 \times 1$, $X_1$ is $29 \times 5$, $\beta^j$ is the $5 \times 1$ vector of coefficients for the $j$-th subsample, and $\varepsilon^j$ is the $29 \times 1$ vector of errors for the $j$-th subsample.

The Octave program Restrictions/ChowTest.m estimates the above model. It also tests the hypothesis that the five subsamples share the same parameter vector, or in other words, that there is coefficient stability across the five subsamples. The null to test is that the parameter vectors for the separate groups are all the same, that is,
\[ \beta^1 = \beta^2 = \beta^3 = \beta^4 = \beta^5. \]
This type of test, that parameters are constant across different sets of data, is sometimes referred to as a Chow test.
- There are 20 restrictions. If that's not clear to you, look at the Octave program.
- The restrictions are rejected at all conventional significance levels.

Since the restrictions are rejected, we should probably use the unrestricted model for analysis. What is the pattern of RTS as a function of the output group (small to large)? Figure 6.8.1 plots RTS. We can see that there is increasing RTS for small firms, but that RTS is approximately constant for large firms.

FIGURE 6.8.1. RTS as a function of firm size (RTS plotted against output group).

(1) Using the Chow test on the Nerlove model, we reject that there is coefficient stability across the 5 groups. But perhaps we could restrict the input price coefficients to be the same but let the constant and output coefficients vary by group size. This new model is
(6.8.3)
\[ \ln C_i = \beta_1^j + \beta_2^j\ln Q_i + \beta_3\ln P_{Li} + \beta_4\ln P_{Fi} + \beta_5\ln P_{Ki} + \varepsilon_i. \]
(a) Estimate this model by OLS, giving $R^2$, estimated standard errors for coefficients, t-statistics for tests of significance, and the associated p-values. Interpret the results in detail.
(b) Test the restrictions implied by this model using the F, Wald, score and likelihood ratio tests. Comment on the results.
(c) Plot the estimated RTS parameters as a function of firm size. Compare the plot to that given in the notes for the unrestricted model. Comment on the results.
(2) For the simple Nerlove model, estimated returns to scale is $\widehat{RTS} = 1/\hat\beta_q$. Apply the delta method to calculate the estimated standard error for estimated RTS. Directly test $H_0: RTS = 1$ versus $H_A: RTS \neq 1$, rather than testing $H_0: \beta_q = 1$ versus $H_A: \beta_q \neq 1$. Comment on the results.
(3) Perform a Monte Carlo study that generates data from the model
\[ y = -2 + 1x_2 + 1x_3 + \varepsilon, \]
where the sample size is 30, $x_2$ and $x_3$ are independently uniformly distributed on $[0, 1]$ and $\varepsilon \sim IIN(0, 1)$.
(a) Compare the means and standard errors of the estimated coefficients using OLS and restricted OLS, imposing the restriction that $\beta_2 + \beta_3 = 2$.
(b) Compare the means and standard errors of the estimated coefficients using OLS and restricted OLS, imposing the restriction that $\beta_2 + \beta_3 = 1$.
(c) Discuss the results.
(4) Get the Octave scripts bootstrap_example1.m, bootstrap.m, bootstrap_resample_iid.m and myols.m, figure out what they do, run them, and interpret the results.
CHAPTER 7

Generalized least squares

One of the assumptions we've made up to now is that
\[ \varepsilon_t \sim IID(0, \sigma^2), \]
or occasionally
\[ \varepsilon_t \sim IIN(0, \sigma^2). \]
Now we'll investigate the consequences of nonidentically and/or dependently distributed errors. We'll assume fixed regressors for now, relaxing this admittedly unrealistic assumption later. The model is
\[ y = X\beta + \varepsilon, \qquad E(\varepsilon) = 0, \qquad V(\varepsilon) = \Sigma, \]
where $\Sigma$ is a general symmetric positive definite matrix (we'll write $\Sigma$ in place of $\Sigma_0$ to simplify the typing of these notes).
- The case where $\Sigma$ is a diagonal matrix gives uncorrelated, nonidentically distributed errors. This is known as heteroscedasticity.
- The case where $\Sigma$ has the same number on the main diagonal but nonzero elements off the main diagonal gives identically (assuming higher moments are also the same) dependently distributed errors. This is known as autocorrelation.
- The general case combines heteroscedasticity and autocorrelation. This is known as "nonspherical" disturbances, though why this term is used, I have no idea. Perhaps it's because under the classical assumptions, a joint confidence region for $\varepsilon$ would be an $n$-dimensional hypersphere.
7.1. Effects of nonspherical disturbances on the OLS estimator
The least squares estimator is
\[ \hat\beta = (X'X)^{-1}X'y = \beta_0 + (X'X)^{-1}X'\varepsilon. \]
- We have unbiasedness, as before.
- The variance of $\hat\beta$ is
(7.1.1)
\[ E\left[\left(\hat\beta - \beta_0\right)\left(\hat\beta - \beta_0\right)'\right]
   = E\left[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}\right]
   = (X'X)^{-1}X'\Sigma X(X'X)^{-1}. \]
Due to this, any test statistic that is based upon $\hat\sigma^2$ or the probability limit $\sigma^2$ of $\hat\sigma^2$ is invalid. In particular, the formulas for the $t$, $F$ and $\chi^2$ based tests given above do not lead to statistics with these distributions.
- $\hat\beta$ is still consistent, following exactly the same argument given before.
- If $\varepsilon$ is normally distributed, then
\[ \hat\beta \sim N\left(\beta,\ (X'X)^{-1}X'\Sigma X(X'X)^{-1}\right). \]
The problem is that $\Sigma$ is unknown in general, so this distribution won't be useful for testing hypotheses.
- Without normality, and unconditional on $X$, we still have
\[ \sqrt{n}\left(\hat\beta - \beta_0\right) = \sqrt{n}(X'X)^{-1}X'\varepsilon = \left(\frac{X'X}{n}\right)^{-1}n^{-1/2}X'\varepsilon. \]
Define the limiting variance of $n^{-1/2}X'\varepsilon$ (supposing a CLT applies) as
\[ \lim_{n\to\infty}E\left(\frac{X'\varepsilon\varepsilon'X}{n}\right) = \Omega, \]
so we obtain
\[ \sqrt{n}\left(\hat\beta - \beta_0\right) \overset{d}{\to} N\left(0,\ Q_X^{-1}\Omega Q_X^{-1}\right). \]

Summary: OLS with heteroscedasticity and/or autocorrelation is:
- unbiased in the same circumstances in which the estimator is unbiased with iid errors
- has a different variance than before, so the previous test statistics aren't valid
- is consistent
- is asymptotically normally distributed, but with a different limiting covariance matrix. Previous test statistics aren't valid in this case for this reason.
- is inefficient, as is shown below.
7.2. The GLS estimator
Suppose $\Sigma$ were known. Then one could form the Cholesky decomposition
\[ P'P = \Sigma^{-1}. \]
We have
\[ P'P\Sigma = I_n, \]
so
\[ P'P\Sigma P' = P', \]
which implies that
\[ P\Sigma P' = I_n. \]
Consider the model
\[ Py = PX\beta + P\varepsilon, \]
or, making the obvious definitions,
\[ y^* = X^*\beta + \varepsilon^*. \]
This variance of $\varepsilon^* = P\varepsilon$ is
\[ E\left(P\varepsilon\varepsilon'P'\right) = P\Sigma P' = I_n. \]
Therefore, the model
\[ y^* = X^*\beta + \varepsilon^*, \qquad E(\varepsilon^*) = 0, \qquad V(\varepsilon^*) = I_n \]
satisfies the classical assumptions. The GLS estimator is simply OLS applied to the transformed model:
\[ \hat\beta_{GLS} = \left(X^{*\prime}X^*\right)^{-1}X^{*\prime}y^*
   = \left(X'P'PX\right)^{-1}X'P'Py
   = \left(X'\Sigma^{-1}X\right)^{-1}X'\Sigma^{-1}y. \]
The GLS estimator is unbiased in the same circumstances under which the OLS estimator is unbiased. For example, assuming $X$ is nonstochastic,
\[ E\left(\hat\beta_{GLS}\right) = E\left[\left(X'\Sigma^{-1}X\right)^{-1}X'\Sigma^{-1}\left(X\beta + \varepsilon\right)\right] = \beta. \]
The variance of the estimator, conditional on $X$, can be calculated using
\[ \hat\beta_{GLS} = \left(X^{*\prime}X^*\right)^{-1}X^{*\prime}y^* = \beta + \left(X^{*\prime}X^*\right)^{-1}X^{*\prime}\varepsilon^*, \]
so
\[ E\left[\left(\hat\beta_{GLS} - \beta\right)\left(\hat\beta_{GLS} - \beta\right)'\right]
   = \left(X^{*\prime}X^*\right)^{-1} = \left(X'\Sigma^{-1}X\right)^{-1}. \]
Either of these last formulas can be used.
- All the previous results regarding the desirable properties of the least squares estimator hold, when dealing with the transformed model, since the transformed model satisfies the classical assumptions.
- Tests are valid, using the previous formulas, as long as we substitute $X^*$ in place of $X$. Furthermore, any test that involves $\sigma^2$ can set it to 1. This is preferable to re-deriving the appropriate formulas.
- The GLS estimator is more efficient than the OLS estimator. This is a consequence of the Gauss-Markov theorem, since the GLS estimator is based on a model that satisfies the classical assumptions but the OLS estimator is not. To see this directly, note that (the following needs to be completed)
\[ Var(\hat\beta) - Var(\hat\beta_{GLS})
   = (X'X)^{-1}X'\Sigma X(X'X)^{-1} - \left(X'\Sigma^{-1}X\right)^{-1} = A\Sigma A', \]
where $A = \left[(X'X)^{-1}X' - \left(X'\Sigma^{-1}X\right)^{-1}X'\Sigma^{-1}\right]$. This may not seem obvious, but it is true, as you can verify for yourself. Then noting that $A\Sigma A'$ is a quadratic form in a positive definite matrix, we conclude that $A\Sigma A'$ is positive semi-definite, and that GLS is efficient relative to OLS.
- As one can verify by calculating fonc, the GLS estimator is the solution to the minimization problem
\[ \hat\beta_{GLS} = \arg\min_\beta \left(y - X\beta\right)'\Sigma^{-1}\left(y - X\beta\right), \]
so the metric $\Sigma^{-1}$ is used to weight the residuals.
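As a quick numerical check of the equivalence between the two forms (added as an illustration; the diagonal Σ used here is made up), the following Octave sketch computes the GLS estimator directly and via the transformed model:

  % Minimal GLS sketch with a known diagonal Sigma.
  n = 100;
  x = [ones(n,1), randn(n,1)];
  sig = 0.5 + (1:n)'/n;                          % known standard deviations (made up)
  y = x*[1; 2] + sig.*randn(n,1);
  Sigma_inv = diag(1./sig.^2);
  b_gls = (x'*Sigma_inv*x)\(x'*Sigma_inv*y);     % (X'Sigma^-1 X)^-1 X'Sigma^-1 y
  P = diag(1./sig);                              % here P'P = Sigma^-1
  b_star = ((P*x)'*(P*x))\((P*x)'*(P*y));        % OLS on the transformed model
  printf("max difference between the two forms: %g\n", max(abs(b_gls - b_star)));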
7.3. Feasible GLS
The problem is that $\Sigma$ isn't known usually, so this estimator isn't available.
- Consider the dimension of $\Sigma$: it's an $n \times n$ matrix with $\left(n^2 + n\right)/2$ unique elements.
- The number of parameters to estimate is larger than $n$ and increases faster than $n$. There's no way to devise an estimator that satisfies a LLN without adding restrictions.
- The feasible GLS estimator is based upon making sufficient assumptions regarding the form of $\Sigma$ so that a consistent estimator can be devised.

Suppose that we parameterize $\Sigma$ as a function of $X$ and $\theta$, where $\theta$ may include $\beta$ as well as other parameters, so that
\[ \Sigma = \Sigma(X, \theta), \]
where $\theta$ is of fixed dimension. If we can consistently estimate $\theta$, we can consistently estimate $\Sigma$, as long as $\Sigma(X, \theta)$ is a continuous function of $\theta$ (by the Slutsky theorem). In this case,
\[ \hat\Sigma = \Sigma(X, \hat\theta) \overset{p}{\to} \Sigma(X, \theta). \]
If we replace $\Sigma$ in the formulas for the GLS estimator with $\hat\Sigma$, we obtain the FGLS estimator. The FGLS estimator shares the same asymptotic properties as GLS. These are
(1) Consistency
(2) Asymptotic normality
(3) Asymptotic efficiency if the errors are normally distributed. (Cramér-Rao).
(4) Test procedures are asymptotically valid.

In practice, the usual way to proceed is
(1) Define a consistent estimator of $\theta$. This is a case-by-case proposition, depending on the parameterization $\Sigma(\theta)$. We'll see examples below.
(2) Form $\hat\Sigma = \Sigma(X, \hat\theta)$.
(3) Calculate the Cholesky factorization $\hat P = \mathrm{Chol}(\hat\Sigma^{-1})$.
(4) Transform the model using
\[ \hat P y = \hat P X\beta + \hat P\varepsilon. \]
(5) Estimate using OLS on the transformed model.
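To fix ideas, here is a hedged Octave sketch of steps (1)-(5) for one simple, assumed parameterization in which the error variance is proportional to an observed variable z_t. The parameterization and all settings are illustrative assumptions, not one of the cases treated below.

  % Minimal FGLS sketch assuming V(e_t) = theta * z_t (illustrative parameterization).
  n = 200;
  z = 1 + rand(n,1);
  x = [ones(n,1), randn(n,1)];
  y = x*[1; 2] + sqrt(z).*randn(n,1);
  b_ols = (x'*x)\(x'*y);                    % first-stage consistent estimate of beta
  e2 = (y - x*b_ols).^2;
  theta = (z'*e2)/(z'*z);                   % step 1: estimate theta by LS of e2 on z
  Sigma_hat = diag(theta*z);                % step 2: form Sigma(theta_hat)
  P = chol(inv(Sigma_hat));                 % step 3: P'P = Sigma_hat^{-1}
  ystar = P*y;  xstar = P*x;                % step 4: transform the model
  b_fgls = (xstar'*xstar)\(xstar'*ystar);   % step 5: OLS on the transformed model
  disp(b_fgls');

For a diagonal Σ the Cholesky step amounts to dividing each observation by its estimated standard deviation, which is how the corrections below are implemented.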
7.4. Heteroscedasticity
Heteroscedasticity is the case where
\[ E\left(\varepsilon\varepsilon'\right) = \Sigma \]
is a diagonal matrix, so that the errors are uncorrelated, but have different variances. Heteroscedasticity is usually thought of as associated with cross sectional data, though there is absolutely no reason why time series data cannot also be heteroscedastic. Actually, the popular ARCH (autoregressive conditionally heteroscedastic) models explicitly assume that a time series is heteroscedastic.

Consider a supply function
\[ q_i = \beta_1 + \beta_p P_i + \beta_s S_i + \varepsilon_i, \]
where $P_i$ is price and $S_i$ is some measure of size of the $i$-th firm. One might suppose that unobservable factors (e.g., talent of managers, degree of coordination between production units, etc.) account for the error term $\varepsilon_i$. If there is more variability in these factors for large firms than for small firms, then $\varepsilon_i$ may have a higher variance when $S_i$ is high than when it is low.

Another example, individual demand:
\[ q_i = \beta_1 + \beta_p P_i + \beta_m M_i + \varepsilon_i, \]
where $P$ is price and $M$ is income. In this case, $\varepsilon_i$ can reflect variations in preferences. There are more possibilities for expression of preferences when one is rich, so it is possible that the variance of $\varepsilon_i$ could be higher when $M$ is high.

Add example of group means.
7.4.1. OLS with heteroscedastic consistent varcov estimation. Eicker (1967) and White (1980) showed how to modify test statistics to account for heteroscedasticity of unknown form. The OLS estimator has asymptotic distribution
\[ \sqrt{n}\left(\hat\beta - \beta_0\right) \overset{d}{\to} N\left(0,\ Q_X^{-1}\Omega Q_X^{-1}\right), \]
as we've already seen. Recall that we defined
\[ \lim_{n\to\infty}E\left(\frac{X'\varepsilon\varepsilon'X}{n}\right) = \Omega. \]
This matrix has dimension $K \times K$ and can be consistently estimated, even if we can't estimate $\Sigma$ consistently. The consistent estimator, under heteroscedasticity but no autocorrelation, is
\[ \hat\Omega = \frac{1}{n}\sum_{t=1}^n x_t x_t'\hat\varepsilon_t^2. \]
One can then modify the previous test statistics to obtain tests that are valid when there is heteroscedasticity of unknown form. For example, the Wald test for $H_0: R\beta - r = 0$ would be
\[ n\left(R\hat\beta - r\right)'\left[R\left(\frac{X'X}{n}\right)^{-1}\hat\Omega\left(\frac{X'X}{n}\right)^{-1}R'\right]^{-1}\left(R\hat\beta - r\right) \overset{a}{\sim} \chi^2(q). \]
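A minimal Octave sketch of this heteroscedasticity-consistent covariance estimator and the corresponding robust Wald test follows (an illustration only; the data-generating choices are arbitrary assumptions).

  % Minimal sketch: Eicker-White covariance and a robust Wald test of H0: beta_2 = 0.
  n = 200;
  x = [ones(n,1), randn(n,1)];
  e = (0.5 + abs(x(:,2))).*randn(n,1);           % heteroscedastic errors
  y = x*[1; 0] + e;
  b = (x'*x)\(x'*y);
  ehat = y - x*b;
  Omega_hat = (x.*(ehat.^2*ones(1,2)))'*x/n;     % (1/n) sum of x_t x_t' * ehat_t^2
  Qinv = inv(x'*x/n);
  V_robust = Qinv*Omega_hat*Qinv/n;              % estimated Var(beta_hat)
  W = b(2)^2/V_robust(2,2);                      % robust Wald statistic, chi^2(1) under H0
  printf("robust Wald = %f\n", W);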
7.4.2. Detection. There exist many tests for the presence of heteroscedasticity. We'll discuss three methods.

Goldfeld-Quandt. The sample is divided in to three parts, with $n_1$, $n_2$ and $n_3$ observations, where $n_1 + n_2 + n_3 = n$. The model is estimated using the first and third parts of the sample, separately, so that $\hat\beta^1$ and $\hat\beta^3$ will be independent. Then we have
\[ \frac{\hat\varepsilon^{1\prime}\hat\varepsilon^1}{\sigma^2} = \frac{\varepsilon^{1\prime}M^1\varepsilon^1}{\sigma^2} \overset{d}{\to} \chi^2(n_1 - K) \]
and
\[ \frac{\hat\varepsilon^{3\prime}\hat\varepsilon^3}{\sigma^2} = \frac{\varepsilon^{3\prime}M^3\varepsilon^3}{\sigma^2} \overset{d}{\to} \chi^2(n_3 - K), \]
so
\[ \frac{\hat\varepsilon^{1\prime}\hat\varepsilon^1/(n_1 - K)}{\hat\varepsilon^{3\prime}\hat\varepsilon^3/(n_3 - K)} \overset{d}{\to} F(n_1 - K,\ n_3 - K). \]
The distributional result is exact if the errors are normally distributed. This test is a two-tailed test. Alternatively, and probably more conventionally, if one has prior ideas about the possible magnitudes of the variances of the observations, one could order the observations accordingly, from largest to smallest. In this case, one would use a conventional one-tailed F-test. Draw picture.
- Ordering the observations is an important step if the test is to have any power.
- The motive for dropping the middle observations is to increase the difference between the average variance in the subsamples, supposing that there exists heteroscedasticity. This can increase the power of the test. On the other hand, dropping too many observations will substantially increase the variance of the statistics $\hat\varepsilon^{1\prime}\hat\varepsilon^1$ and $\hat\varepsilon^{3\prime}\hat\varepsilon^3$. A rule of thumb, based on Monte Carlo experiments, is to drop around 25% of the observations.
- If one doesn't have any ideas about the form of the het. the test will probably have low power since a sensible data ordering isn't available.

White's test. When one has little idea if there exists heteroscedasticity, and no idea of its potential form, the White test is a possibility. The idea is that if there is homoscedasticity, then
\[ E\left(\varepsilon_t^2|x_t\right) = \sigma^2\quad \forall t, \]
so that $x_t$ or functions of $x_t$ shouldn't help to explain $E(\varepsilon_t^2)$. The test works as follows:
(1) Since $\varepsilon_t$ isn't available, use the consistent estimator $\hat\varepsilon_t$ instead.
(2) Regress
\[ \hat\varepsilon_t^2 = \sigma^2 + z_t'\gamma + v_t, \]
where $z_t$ is a $P$-vector. $z_t$ may include some or all of the variables in $x_t$, as well as other variables. White's original suggestion was to use $x_t$, plus the set of all unique squares and cross products of variables in $x_t$.
(3) Test the hypothesis that $\gamma = 0$. The $qF$ statistic in this case is
\[ qF = \frac{P\left(ESS_R - ESS_U\right)/P}{ESS_U/\left(n - P - 1\right)}. \]
Note that $ESS_R = TSS_U$, so dividing both numerator and denominator by this we get
\[ qF = \left(n - P - 1\right)\frac{R^2}{1 - R^2}. \]
Note that this is the $R^2$ of the artificial regression used to test for heteroscedasticity, not the $R^2$ of the original model.

An asymptotically equivalent statistic, under the null of no heteroscedasticity (so that $R^2$ should tend to zero), is
\[ nR^2 \overset{a}{\sim} \chi^2(P). \]
This doesn't require normality of the errors, though it does assume that the fourth moment of $\varepsilon_t$ is constant, under the null. Question: why is this necessary?
- The White test has the disadvantage that it may not be very powerful unless the $z_t$ vector is chosen well, and this is hard to do without knowledge of the form of heteroscedasticity.
- It also has the problem that specification errors other than heteroscedasticity may lead to rejection.
- Note: the null hypothesis of this test may be interpreted as $\theta = 0$ for the variance model $V(\varepsilon_t^2) = h(z_t'\theta)$, where $h(\cdot)$ is an arbitrary function of unknown form. The test is more general than is may appear from the regression that is used.

Plotting the residuals. A very simple method is to simply plot the residuals (or their squares). Draw pictures here. Like the Goldfeld-Quandt test, this will be more informative if the observations are ordered according to the suspected form of the heteroscedasticity.
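A sketch of the White test in Octave follows (added for illustration; here z_t is simply the nonconstant regressors, their squares and the cross product, and the nR² form of the statistic is used; all simulation settings are assumptions).

  % Minimal sketch of White's test using the nR^2 statistic.
  n = 200;
  x = [ones(n,1), randn(n,2)];
  e = (0.5 + abs(x(:,2))).*randn(n,1);            % heteroscedastic errors
  y = x*[1; 1; 1] + e;
  b = (x'*x)\(x'*y);  ehat2 = (y - x*b).^2;
  z = [x(:,2:3), x(:,2:3).^2, x(:,2).*x(:,3)];    % regressors, squares, cross product
  Z = [ones(n,1), z];
  g = (Z'*Z)\(Z'*ehat2);                          % auxiliary regression of ehat^2 on z
  u = ehat2 - Z*g;
  R2 = 1 - (u'*u)/sum((ehat2 - mean(ehat2)).^2);
  stat = n*R2;                                    % approx. chi^2(columns(z)) under H0
  printf("White nR^2 = %f, df = %d\n", stat, columns(z));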
7.4.3. Correction. Correcting for heteroscedasticity requires that a parametric form for $\Sigma(\theta)$ be supplied, and that a means for estimating $\theta$ consistently be determined. The estimation method will be specific to the form supplied for $\Sigma(\theta)$. We'll consider two examples. Before this, let's consider the general nature of GLS when there is heteroscedasticity.

Multiplicative heteroscedasticity
Suppose the model is
\[ y_t = x_t'\beta + \varepsilon_t, \qquad \sigma_t^2 = E\left(\varepsilon_t^2\right) = \left(z_t'\gamma\right)^\delta, \]
but the other classical assumptions hold. In this case
\[ \varepsilon_t^2 = \left(z_t'\gamma\right)^\delta + v_t, \]
and $v_t$ has mean zero. Nonlinear least squares could be used to estimate $\gamma$ and $\delta$ consistently, were $\varepsilon_t$ observable. The solution is to substitute the squared OLS residuals $\hat\varepsilon_t^2$ in place of $\varepsilon_t^2$, since it is consistent by the Slutsky theorem. Once we have $\hat\gamma$ and $\hat\delta$, we can estimate $\sigma_t^2$ consistently using
\[ \hat\sigma_t^2 = \left(z_t'\hat\gamma\right)^{\hat\delta} \overset{p}{\to} \sigma_t^2. \]
In the second step, we transform the model by dividing by the standard deviation:
\[ \frac{y_t}{\hat\sigma_t} = \frac{x_t'\beta}{\hat\sigma_t} + \frac{\varepsilon_t}{\hat\sigma_t}, \]
or
\[ y_t^* = x_t^{*\prime}\beta + \varepsilon_t^*. \]
Asymptotically, this model satisfies the classical assumptions.
- This model is a bit complex in that NLS is required to estimate the model of the variance. A simpler version would be
\[ y_t = x_t'\beta + \varepsilon_t, \qquad \sigma_t^2 = E\left(\varepsilon_t^2\right) = \sigma^2 z_t^\delta, \]
where $z_t$ is a single variable. There are still two parameters to be estimated, and the model of the variance is still nonlinear in the parameters. However, the search method can be used in this case to reduce the estimation problem to repeated applications of OLS.
- First, we define an interval of reasonable values for $\delta$, e.g., $\delta \in [0, 3]$.
- Partition this interval into $M$ equally spaced values, e.g., $\{0, .1, .2, \dots, 2.9, 3\}$.
- For each of these values, calculate the variable $z_t^{\delta_m}$.
- The regression
\[ \hat\varepsilon_t^2 = \sigma^2 z_t^{\delta_m} + v_t \]
is linear in the parameters, conditional on $\delta_m$, so one can estimate $\sigma^2$ by OLS.
- Save the pairs $(\sigma_m^2, \delta_m)$, and the corresponding $ESS_m$. Choose the pair with the minimum $ESS_m$ as the estimate.
- Next, divide the model by the estimated standard deviations.
- Can refine. Draw picture.
- Works well when the parameter to be searched over is low dimensional, as in this case.
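A hedged Octave sketch of the search method just described, for the simpler variance model $\sigma_t^2 = \sigma^2 z_t^\delta$, is given below. The grid, sample size and true parameter values are illustrative assumptions only.

  % Minimal sketch of the grid search over delta in sigma_t^2 = sigma^2 * z_t^delta.
  n = 300;
  z = 1 + rand(n,1);
  x = [ones(n,1), randn(n,1)];
  y = x*[1; 2] + sqrt(2*z.^1.5).*randn(n,1);      % true sigma^2 = 2, delta = 1.5
  b = (x'*x)\(x'*y);  e2 = (y - x*b).^2;          % squared OLS residuals
  grid = 0:0.1:3;  ess = zeros(size(grid));  s2 = ess;
  for m = 1:length(grid)
    w = z.^grid(m);
    s2(m) = (w'*e2)/(w'*w);                       % OLS of e2 on z^delta_m (no constant)
    ess(m) = sum((e2 - s2(m)*w).^2);              % save the sum of squared errors
  end
  [dummy, best] = min(ess);
  sighat = sqrt(s2(best)*z.^grid(best));          % fitted standard deviations
  xs = x./(sighat*ones(1,2));  ys = y./sighat;    % divide the model through by sigma_t_hat
  b_fgls = (xs'*xs)\(xs'*ys);
  printf("delta_hat = %.1f\n", grid(best));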

Groupwise heteroscedasticity
A common case is where we have repeated observations on each of a number of economic agents: e.g., 10 years of macroeconomic data on each of a set of countries or regions, or daily observations of transactions of 200 banks. This sort of data is a pooled cross-section time-series model. It may be reasonable to presume that the variance is constant over time within the cross-sectional units, but that it differs across them (e.g., firms or countries of different sizes...). The model is
\[ y_{it} = x_{it}'\beta + \varepsilon_{it}, \qquad E\left(\varepsilon_{it}^2\right) = \sigma_i^2\ \forall t, \]
where $i = 1, 2, \dots, G$ are the agents, and $t = 1, 2, \dots, n$ are the observations on each agent.
- The other classical assumptions are presumed to hold.
- In this case, the variance $\sigma_i^2$ is specific to each agent, but constant over the $n$ observations for that agent.
- In this model, we assume that $E\left(\varepsilon_{it}\varepsilon_{is}\right) = 0$, $t \neq s$. This is a strong assumption that we'll relax later.

To correct for heteroscedasticity, just estimate each $\sigma_i^2$ using the natural estimator:
\[ \hat\sigma_i^2 = \frac{1}{n}\sum_{t=1}^n \hat\varepsilon_{it}^2. \]
- Note that we use $1/n$ here since it's possible that there are more than $n$ regressors, so $n - K$ could be negative. Asymptotically the difference is unimportant.

With each of these, transform the model as usual:
\[ \frac{y_{it}}{\hat\sigma_i} = \frac{x_{it}'\beta}{\hat\sigma_i} + \frac{\varepsilon_{it}}{\hat\sigma_i}. \]
Do this for each cross-sectional group. This transformed model satisfies the classical assumptions, asymptotically.

FIGURE 7.4.1. Residuals, Nerlove model, sorted by firm size (plot of the regression residuals against observation number).
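The following Octave sketch of the groupwise correction is added for illustration; the group variances, the two regressors and the use of 5 equal-sized groups of 29 observations (matching the Nerlove grouping used below) are assumptions made only for the example.

  % Minimal sketch: groupwise heteroscedasticity correction.
  G = 5;  n = 29;  N = G*n;
  group = kron((1:G)', ones(n,1));          % group index for each observation
  x = [ones(N,1), randn(N,1)];
  sig = [0.8; 0.6; 0.4; 0.3; 0.2];          % made-up group standard deviations
  y = x*[1; 1] + sig(group).*randn(N,1);
  b = (x'*x)\(x'*y);  ehat = y - x*b;
  s = zeros(N,1);
  for i = 1:G
    s(group == i) = sqrt(mean(ehat(group == i).^2));   % sigma_i_hat with divisor 1/n
  end
  xs = x./(s*ones(1,2));  ys = y./s;        % transform each group by its own sigma_i_hat
  b_fgls = (xs'*xs)\(xs'*ys);
  disp(b_fgls');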
7.4.4. Example: the Nerlove model (again!) Let's check the Nerlove data for evidence of heteroscedasticity. In what follows, we're going to use the model with the constant and output coefficient varying across 5 groups, but with the input price coefficients fixed (see Equation 6.8.3 for the rationale behind this). Figure 7.4.1, which is generated by the Octave program GLS/NerloveResiduals.m, plots the residuals. We can see pretty clearly that the error variance is larger for small firms than for larger firms.

Now let's try out some tests to formally check for heteroscedasticity. The Octave program GLS/HetTests.m performs the White and Goldfeld-Quandt tests, using the above model. The results are

                 Value     p-value
GQ test         61.903     0.000
White's test    10.886     0.000

All in all, it is very clear that the data are heteroscedastic. That means that OLS estimation is not efficient, and tests of restrictions that ignore heteroscedasticity are not valid. The previous tests (CRTS, HOD1 and the Chow test) were calculated assuming homoscedasticity. The Octave program GLS/NerloveRestrictions-Het.m uses the Wald test to check for CRTS and HOD1, but using a heteroscedastic-consistent covariance estimator.^1 The results are

Testing HOD1
                 Value     p-value
Wald test        6.161     0.013

Testing CRTS
                 Value     p-value
Wald test       20.169     0.001

^1 By the way, notice that GLS/NerloveResiduals.m and GLS/HetTests.m use the restricted LS estimator directly to restrict the fully general model with all coefficients varying to the model with only the constant and the output coefficient varying. But GLS/NerloveRestrictions-Het.m estimates the model by substituting the restrictions into the model. The methods are equivalent, but the second is more convenient and easier to understand.

We see that the previous conclusions are altered - both CRTS and HOD1 are rejected at the 5% level. Maybe the rejection of HOD1 is due to the Wald test's tendency to over-reject?

From the previous plot, it seems that the variance of $\varepsilon$ is a decreasing function of output. Suppose that the 5 size groups have different error variances (heteroscedasticity by groups):
\[ Var\left(\varepsilon_i\right) = \sigma_j^2, \qquad \text{where } i \in \text{group } j,\ j = 1, 2, \dots, 5, \]
as before. The Octave program GLS/NerloveGLS.m estimates the model using GLS (through a transformation of the model so that OLS can be applied). The estimation results are
*********************************************************
OLS estimation results
Observations 145
R-squared 0.958822
Sigma-squared 0.090800

Results (Het. consistent var-cov estimator)

            estimate   st.err.   t-stat.   p-value
constant1     -1.046     1.276    -0.820     0.414
constant2     -1.977     1.364    -1.450     0.149
constant3     -3.616     1.656    -2.184     0.031
constant4     -4.052     1.462    -2.771     0.006
constant5     -5.308     1.586    -3.346     0.001
output1        0.391     0.090     4.363     0.000
output2        0.649     0.090     7.184     0.000
output3        0.897     0.134     6.688     0.000
output4        0.962     0.112     8.612     0.000
output5        1.101     0.090    12.237     0.000
labor          0.007     0.208     0.032     0.975
fuel           0.498     0.081     6.149     0.000
capital       -0.460     0.253    -1.818     0.071
*********************************************************

*********************************************************
OLS estimation results
Observations 145
R-squared 0.987429
Sigma-squared 1.092393

Results (Het. consistent var-cov estimator)

             estimate   st.err.   t-stat.   p-value
constant1      -1.580     0.917    -1.723     0.087
constant2      -2.497     0.988    -2.528     0.013
constant3      -4.108     1.327    -3.097     0.002
constant4      -4.494     1.180    -3.808     0.000
constant5      -5.765     1.274    -4.525     0.000
output1         0.392     0.090     4.346     0.000
output2         0.648     0.094     6.917     0.000
output3         0.892     0.138     6.474     0.000
output4         0.951     0.109     8.755     0.000
output5         1.093     0.086    12.684     0.000
labor           0.103     0.141     0.733     0.465
fuel            0.492     0.044    11.294     0.000
capital        -0.366     0.165    -2.217     0.028
*********************************************************
Testing HOD1
                  Value    p-value
Wald test         9.312      0.002
The first panel of output gives the OLS estimation results, which are used to consistently estimate the σ_j². The second panel gives the GLS estimation results. Some comments:

- The R² measures are not comparable - the dependent variables are not the same. The R² measure for the GLS results uses the transformed dependent variable. One could calculate a comparable R² measure, but I have not done so.
- The differences in estimated standard errors (smaller in general for GLS) can be interpreted as evidence of improved efficiency of GLS, since the OLS standard errors are calculated using the Huber-White estimator. They would not be comparable if the ordinary (inconsistent) estimator had been used.
- Note that the previously noted pattern in the output coefficients persists. The nonconstant CRTS result is robust.
- The coefficient on capital is now negative and significant at the 3% level. That seems to indicate some kind of problem with the model or the data, or economic theory.
- Note that HOD1 is now rejected. Problem of the Wald test over-rejecting? Specification error in the model?
7.5. Autocorrelation
Autocorrelation, which is the serial correlation of the error term, is a problem that is usually associated with time series data, but can also affect cross-sectional data. For example, a shock to oil prices will simultaneously affect all countries, so one could expect contemporaneous correlation of macroeconomic variables across countries.
7.5.1. Causes. Autocorrelation is the existence of correlation across the error terms of different observations:

    E(ε_t ε_s) ≠ 0,  t ≠ s.
Why might this occur? Plausible explanations include

(1) Lags in adjustment to shocks. In a model such as

    y_t = x_t'β + ε_t,

one could interpret x_t'β as the equilibrium value. Suppose x_t is constant over a number of observations. One can interpret ε_t as a shock that moves the system away from equilibrium. If the time needed to return to equilibrium is long with respect to the observation frequency, one could expect ε_{t+1} to be positive, conditional on ε_t positive, which induces a correlation.
(2) Unobserved factors that are correlated over time. The error term is
often assumed to correspond to unobservable factors. If these factors
are correlated, there will be autocorrelation.
(3) Misspecification of the model. Suppose that the DGP is

    y_t = β_0 + β_1 x_t + β_2 x_t² + ε_t

but we estimate

    y_t = β_0 + β_1 x_t + ε_t.

The effects are illustrated in Figure 7.5.1.

FIGURE 7.5.1. Autocorrelation induced by misspecification
7.5.2. Effects on the OLS estimator. The variance of the OLS estimator is the same as in the case of heteroscedasticity - the standard formula does not apply. The correct formula is given in equation 7.1.1. Next we discuss two GLS corrections for OLS. These will potentially induce inconsistency when the regressors are stochastic (see Chapter 8) and should either not be used in that case (which is usually the relevant case) or used with caution. The more recommended procedure is discussed in section 7.5.5.
7.5.3. AR(1). There are many types of autocorrelation. We'll consider two examples. The first is the most commonly encountered case: autoregressive order 1 (AR(1)) errors. The model is

    y_t = x_t'β + ε_t
    ε_t = ρ ε_{t−1} + u_t
    u_t ~ iid(0, σ_u²)
    E(ε_t u_s) = 0, t < s.

We assume that the model satisfies the other classical assumptions.

- We need a stationarity assumption: |ρ| < 1. Otherwise the variance of ε_t explodes as t increases, so standard asymptotics will not apply.
- By recursive substitution we obtain

    ε_t = ρ ε_{t−1} + u_t
        = ρ(ρ ε_{t−2} + u_{t−1}) + u_t
        = ρ² ε_{t−2} + ρ u_{t−1} + u_t
        = ρ²(ρ ε_{t−3} + u_{t−2}) + ρ u_{t−1} + u_t

  In the limit the lagged ε drops out, since |ρ| < 1, so we obtain

    ε_t = Σ_{m=0}^∞ ρ^m u_{t−m}.

  With this, the variance of ε_t is found as

    V(ε_t) = σ_u² Σ_{m=0}^∞ ρ^{2m} = σ_u² / (1 − ρ²).

- If we had directly assumed that ε_t were covariance stationary, we could obtain this using

    V(ε_t) = ρ² E(ε_{t−1}²) + 2ρ E(ε_{t−1} u_t) + E(u_t²) = ρ² V(ε_t) + σ_u²,

  so V(ε_t) = σ_u² / (1 − ρ²).
- The variance is the 0-order autocovariance: γ_0 = V(ε_t).
- Note that the variance does not depend on t.

Likewise, the first order autocovariance γ_1 is

    cov(ε_t, ε_{t−1}) = γ_1 = E[(ρ ε_{t−1} + u_t) ε_{t−1}] = ρ V(ε_t) = ρ σ_u² / (1 − ρ²).

- Using the same method, we find that for s < t

    cov(ε_t, ε_{t−s}) = γ_s = ρ^s σ_u² / (1 − ρ²).

- The autocovariances don't depend on t: the process {ε_t} is covariance stationary.

The correlation (in general, for r.v.'s x and y) is defined as

    corr(x, y) = cov(x, y) / (se(x) se(y)),

but in this case the two standard errors are the same, so the s-order autocorrelation ρ_s is ρ_s = ρ^s.

- All this means that the overall covariance matrix Σ has the form

    Σ = (σ_u² / (1 − ρ²)) ·  [ 1         ρ        ρ²    ···   ρ^{n−1}
                               ρ         1        ρ     ···   ρ^{n−2}
                               ⋮                   ⋱           ⋮
                               ρ^{n−1}   ···             ρ     1       ]

  where the leading scalar is the variance and the matrix is the correlation matrix.
So we have homoscedasticity, but elements off the main diagonal are not zero. All of this depends only on two parameters, ρ and σ_u². If we can estimate these consistently, we can apply FGLS.
It turns out that it's easy to estimate these consistently. The steps are

(1) Estimate the model y_t = x_t'β + ε_t by OLS.
(2) Take the residuals, and estimate the model

    ε̂_t = ρ ε̂_{t−1} + v_t.

    Since ε̂_t →p ε_t, this regression is asymptotically equivalent to the regression

    ε_t = ρ ε_{t−1} + u_t,

    which satisfies the classical assumptions. Therefore, ρ̂ obtained by applying OLS to ε̂_t = ρ ε̂_{t−1} + v_t is consistent. Also, since v_t →p u_t, the estimator

    σ̂_u² = (1/n) Σ_t v̂_t²  →p  σ_u².

(3) With the consistent estimators σ̂_u² and ρ̂, form Σ̂ = Σ(σ̂_u², ρ̂) using the previous structure of Σ, and estimate by FGLS. Actually, one can omit the factor σ̂_u²/(1 − ρ²), since it cancels out in the formula

    β̂_FGLS = (X'Σ̂⁻¹X)⁻¹ X'Σ̂⁻¹ y.

- One can iterate the process, by taking the first FGLS estimator of β, re-estimating ρ and σ_u², etc. If one iterates to convergence it's equivalent to MLE (supposing normal errors).
- An asymptotically equivalent approach is to simply estimate the transformed model

    y_t − ρ̂ y_{t−1} = (x_t − ρ̂ x_{t−1})'β + u_t*

  using n − 1 observations (since y_0 and x_0 aren't available). This is the method of Cochrane and Orcutt. Dropping the first observation is asymptotically irrelevant, but it can be very important in small samples. One can recuperate the first observation by putting

    y_1* = y_1 √(1 − ρ̂²)
    x_1* = x_1 √(1 − ρ̂²).

  This somewhat odd-looking result is related to the Cholesky factorization of Σ⁻¹. See Davidson and MacKinnon, pg. 348-49 for more discussion. Note that the variance of y_1* is σ_u², asymptotically, so we see that the transformed model will be homoscedastic (and nonautocorrelated, since the u's are uncorrelated with the y's in different time periods).
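As an illustration (a minimal sketch, not one of the text's programs), the AR(1) correction with the first observation recuperated, assuming hypothetical variables y (n x 1) and x (n x k):

% sketch: AR(1) FGLS via quasi-differencing (hypothetical variable names)
n = rows(y);
b_ols = (x'*x)\(x'*y);
e = y - x*b_ols;
rho = (e(1:n-1)'*e(2:n)) / (e(1:n-1)'*e(1:n-1));   % OLS of e_t on e_(t-1)
ys = y(2:n)   - rho*y(1:n-1);                       % quasi-differenced data
xs = x(2:n,:) - rho*x(1:n-1,:);
ys = [sqrt(1-rho^2)*y(1);   ys];                    % recuperate first observation
xs = [sqrt(1-rho^2)*x(1,:); xs];
b_fgls = (xs'*xs)\(xs'*ys);                         % OLS on transformed model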

7.5.4. MA(1). The linear regression model with moving average order 1 errors is

    y_t = x_t'β + ε_t
    ε_t = u_t + φ u_{t−1}
    u_t ~ iid(0, σ_u²)
    E(ε_t u_s) = 0, t < s.

In this case,

    V(ε_t) = γ_0 = E[(u_t + φ u_{t−1})²] = σ_u² + φ² σ_u² = σ_u²(1 + φ²).

Similarly,

    γ_1 = E[(u_t + φ u_{t−1})(u_{t−1} + φ u_{t−2})] = φ σ_u²

and

    γ_s = 0,  s > 1,

so in this case

    Σ = σ_u² [ 1+φ²    φ       0     ···    0
               φ       1+φ²    φ            ⋮
               0       φ       ⋱      ⋱     0
               ⋮               ⋱      ⋱     φ
               0       ···     0      φ     1+φ² ]

Note that the first order autocorrelation is

    ρ_1 = γ_1/γ_0 = φ σ_u² / (σ_u²(1 + φ²)) = φ / (1 + φ²).

- This achieves a maximum at φ = 1 and a minimum at φ = −1, and the maximal and minimal autocorrelations are 1/2 and -1/2. Therefore, series that are more strongly autocorrelated can't be MA(1) processes.
Again the covariance matrix has a simple structure that depends on only two parameters. The problem in this case is that one can't estimate φ using OLS on

    ε̂_t = u_t + φ u_{t−1}

because the u_t are unobservable and they can't be estimated consistently. However, there is a simple way to estimate the parameters.

- Since the model is homoscedastic, we can estimate

    V(ε_t) = σ_ε² = σ_u²(1 + φ²)

  using the typical estimator:

    σ̂_ε² = (1/n) Σ_{t=1}^n ε̂_t².

- By the Slutsky theorem, we can interpret this as defining an (unidentified) estimator of both σ_u² and φ, e.g., use this as

    σ̂_u²(1 + φ̂²) = (1/n) Σ_{t=1}^n ε̂_t².

  However, this isn't sufficient to define consistent estimators of the parameters, since it's unidentified.
- To solve this problem, estimate the covariance of ε_t and ε_{t−1} using

    Ĉov(ε_t, ε_{t−1}) = φ̂ σ̂_u² = (1/n) Σ_{t=2}^n ε̂_t ε̂_{t−1}.

  This is a consistent estimator, following a LLN (and given that the epsilon hats are consistent for the epsilons). As above, this can be interpreted as defining an unidentified estimator of the two parameters.
- Now solve these two equations to obtain identified (and therefore consistent) estimators of both φ and σ_u². Define the consistent estimator

    Σ̂ = Σ(φ̂, σ̂_u²)

  following the form we've seen above, and transform the model using the Cholesky decomposition. The transformed model satisfies the classical assumptions asymptotically.
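A minimal sketch of the moment-matching step (not from the text; it assumes a hypothetical vector e of OLS residuals and that the estimated first-order autocorrelation is nonzero and no larger than 1/2 in absolute value):

% sketch: solving the two moment equations of the MA(1) model
n  = rows(e);
g0 = e'*e/n;                        % estimate of sig_u^2 * (1 + phi^2)
g1 = e(2:n)'*e(1:n-1)/n;            % estimate of phi * sig_u^2
r  = g1/g0;                         % first-order autocorrelation, phi/(1+phi^2)
phi  = (1 - sqrt(1 - 4*r^2))/(2*r); % invertible root (needs 0 < |r| <= 1/2)
sigu = g1/phi;                      % implied innovation variance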

7.5.5. Asymptotically valid inferences with autocorrelation of unknown form. See Hamilton Ch. 10, pp. 261-2 and 280-84.

When the form of autocorrelation is unknown, one may decide to use the OLS estimator, without correction. We've seen that this estimator has the limiting distribution

    √n (β̂ − β_0) →d N(0, Q_X⁻¹ Ω Q_X⁻¹)

where, as before, Ω is

    Ω = lim E[(1/n) X'εε'X].

We need a consistent estimate of Ω. Define m_t = x_t ε_t (recall that x_t is defined as a K×1 vector). Note that

    X'ε = Σ_{t=1}^n x_t ε_t = Σ_{t=1}^n m_t,

so that

    Ω = lim (1/n) E[(Σ_{t=1}^n m_t)(Σ_{t=1}^n m_t)'].

We assume that m_t is covariance stationary (so that the covariance between m_t and m_{t−s} does not depend on t). Define the v-th autocovariance of m_t as

    Γ_v = E(m_t m_{t−v}').

Note that E(m_t m_{t+v}') = Γ_v'. (Show this with an example.) In general, we expect that:

- m_t will be autocorrelated, since ε_t is potentially autocorrelated:

    Γ_v = E(m_t m_{t−v}') ≠ 0.

  Note that this autocovariance does not depend on t, due to covariance stationarity.
- m_t will be contemporaneously correlated (E(m_{it} m_{jt}) ≠ 0), since the regressors in m_t will in general be correlated (more on this later).
- m_t will be heteroscedastic (E(m_{it}²) = σ_i², which depends upon i), again since the regressors will have different variances.

While one could estimate Ω parametrically, we in general have little information upon which to base a parametric specification. Recent research has focused on consistent nonparametric estimators of Ω.

Now define

    Ω_n = E[(1/n)(Σ_{t=1}^n m_t)(Σ_{t=1}^n m_t)'].

We have (show that the following is true, by expanding the sum and shifting rows to the left)

    Ω_n = Γ_0 + ((n−1)/n)(Γ_1 + Γ_1') + ((n−2)/n)(Γ_2 + Γ_2') + ··· + (1/n)(Γ_{n−1} + Γ_{n−1}').

The natural, consistent estimator of Γ_v is

    Γ̂_v = (1/n) Σ_{t=v+1}^n m̂_t m̂_{t−v}',

where m̂_t = x_t ε̂_t (note: one could put 1/(n−v) instead of 1/n here). So, a natural, but inconsistent, estimator of Ω_n would be

    Ω̂_n = Γ̂_0 + ((n−1)/n)(Γ̂_1 + Γ̂_1') + ··· + (1/n)(Γ̂_{n−1} + Γ̂_{n−1}').

This estimator is inconsistent in general, since the number of parameters to estimate is more than the number of observations, and increases more rapidly than n, so information does not build up as n → ∞.

On the other hand, supposing that Γ_v tends to zero sufficiently rapidly as v tends to ∞, a modified estimator

    Ω̂_n = Γ̂_0 + Σ_{v=1}^{q(n)} (Γ̂_v + Γ̂_v')

will be consistent, provided q(n) grows sufficiently slowly.

- The assumption that autocorrelations die off is reasonable in many cases. For example, the AR(1) model with |ρ| < 1 has autocorrelations that die off.
- The term (n−v)/n can be dropped because it tends to one for v < q(n), given that q(n) increases slowly relative to n.
- A disadvantage of this estimator is that it may not be positive definite. This could cause one to calculate a negative χ² statistic, for example!
- Newey and West proposed an estimator (Econometrica, 1987) that solves the problem of possible nonpositive definiteness of the above estimator. Their estimator is

    Ω̂_n = Γ̂_0 + Σ_{v=1}^{q(n)} [1 − v/(q+1)] (Γ̂_v + Γ̂_v').

  This estimator is p.d. by construction. The condition for consistency is that q grows slowly enough that n^{−1/4} q → 0, which is a very slow rate of growth for q. This estimator is nonparametric - we've placed no parametric restrictions on the form of Ω. It is an example of a kernel estimator.

Finally, since Ω_n has Ω as its limit, Ω̂_n →p Ω. We can now use Ω̂_n and Q̂_X = (1/n) X'X to consistently estimate the limiting distribution of the OLS estimator under heteroscedasticity and autocorrelation of unknown form. With this, asymptotically valid tests are constructed in the usual way.
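A minimal sketch of the Newey-West calculation (not from the text; it assumes hypothetical variables x (n x k), OLS residuals e (n x 1), and a chosen truncation lag q):

% sketch: Newey-West estimate of the OLS limiting covariance
n = rows(x);
m = x .* repmat(e, 1, columns(x));        % m_t = x_t * eps_t, row by row
omega = m'*m/n;                            % Gamma_0
for v = 1:q
    gamma_v = (m(v+1:n,:)' * m(1:n-v,:)) / n;       % v-th autocovariance
    omega  += (1 - v/(q+1)) * (gamma_v + gamma_v'); % Bartlett (Newey-West) weight
endfor
xx_inv = inv(x'*x/n);
V = xx_inv * omega * xx_inv / n;           % estimated var-cov of beta_hat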

7.5.6. Testing for autocorrelation. Durbin-Watson test

The Durbin-Watson test statistic is

    DW = Σ_{t=2}^n (ε̂_t − ε̂_{t−1})² / Σ_{t=1}^n ε̂_t²
       = Σ_{t=2}^n (ε̂_t² − 2 ε̂_t ε̂_{t−1} + ε̂_{t−1}²) / Σ_{t=1}^n ε̂_t².

- The null hypothesis is that the first order autocorrelation of the errors is zero: H_0: ρ_1 = 0. The alternative is of course H_A: ρ_1 ≠ 0. Note that the alternative is not that the errors are AR(1), since many general patterns of autocorrelation will have the first order autocorrelation different than zero. For this reason the test is useful for detecting autocorrelation in general. For the same reason, one shouldn't just assume that an AR(1) model is appropriate when the DW test rejects the null.
- Under the null, the middle term tends to zero, and the other two tend to one, so DW →p 2.
- Supposing that we had an AR(1) error process with ρ = 1: in this case the middle term tends to −2, so DW →p 0.
- Supposing that we had an AR(1) error process with ρ = −1: in this case the middle term tends to 2, so DW →p 4.
- These are the extremes: DW always lies between 0 and 4.
FIGURE 7.5.2. Durbin-Watson critical values
- The distribution of the test statistic depends on the matrix of regressors, X, so tables can't give exact critical values. They give upper and lower bounds, which correspond to the extremes that are possible. See Figure 7.5.2. There are means of determining exact critical values conditional on X.
- Note that DW can be used to test for nonlinearity (add discussion).
- The DW test is based upon the assumption that the matrix X is fixed in repeated samples. This is often unreasonable in the context of economic time series, which is precisely the context where the test would have application. It is possible to relate the DW test to other test statistics which are valid without strict exogeneity.
Breusch-Godfrey test

This test uses an auxiliary regression, as does the White test for heteroscedasticity. The regression is

    ε̂_t = x_t'δ + γ_1 ε̂_{t−1} + γ_2 ε̂_{t−2} + ··· + γ_P ε̂_{t−P} + v_t

and the test statistic is the nR² statistic, just as in the White test. There are P restrictions, so the test statistic is asymptotically distributed as a χ²(P).

- The intuition is that the lagged errors shouldn't contribute to explaining the current error if there is no autocorrelation.
- x_t is included as a regressor to account for the fact that the ε̂_t are not independent even if the ε_t are. This is a technicality that we won't go into here.
- This test is valid even if the regressors are stochastic and contain lagged dependent variables, so it is considerably more useful than the DW test for typical time series data.
- The alternative is not that the model is an AR(P), following the argument above. The alternative is simply that some or all of the first P autocorrelations are different from zero. This is compatible with many specific forms of autocorrelation.
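A minimal sketch of the Breusch-Godfrey calculation (not the text's GLS/HetTests.m or GLS/NerloveAR.m; hypothetical variables y (n x 1), x (n x k, including a constant), and P lags):

% sketch: Breusch-Godfrey test with P lagged residuals
n = rows(y);
b = (x'*x)\(x'*y);
e = y - x*b;
elags = zeros(n, P);
for j = 1:P
    elags(j+1:n, j) = e(1:n-j);          % lagged residuals, zeros at the start
endfor
z  = [x, elags];                          % auxiliary regression
g  = (z'*z)\(z'*e);
u  = e - z*g;
R2 = 1 - (u'*u)/(e'*e);                   % R^2 of the auxiliary regression
BG = n*R2                                 % compare to a chi^2(P) critical value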

7.5.7. Lagged dependent variables and autocorrelation. We've seen that the OLS estimator is consistent under autocorrelation, as long as plim X'ε/n = 0. This will be the case when E(X'ε) = 0, following a LLN. An important exception is the case where X contains lagged dependent variables and the errors are autocorrelated. A simple example is the case of a single lag of the dependent variable with AR(1) errors. The model is

    y_t = x_t'β + γ y_{t−1} + ε_t
    ε_t = ρ ε_{t−1} + u_t.

Now we can write

    E(y_{t−1} ε_t) = E[(x_{t−1}'β + γ y_{t−2} + ε_{t−1})(ρ ε_{t−1} + u_t)],

which is clearly nonzero, since one of the terms is E(ρ ε_{t−1}²) = ρ σ_ε² ≠ 0. In this case E(X'ε) ≠ 0, and therefore plim X'ε/n ≠ 0. Since

    plim β̂ = β + plim (X'X/n)⁻¹ (X'ε/n),

the OLS estimator is inconsistent in this case. One needs to estimate by instrumental variables (IV), which we'll get to later.
7.5.8. Examples.

Nerlove model, yet again. The Nerlove model uses cross-sectional data, so one may not think of performing tests for autocorrelation. However, specification error can induce autocorrelated errors. Consider the simple Nerlove model

    ln C = β_1 + β_2 ln Q + β_3 ln P_L + β_4 ln P_F + β_5 ln P_K + ε

and the extended Nerlove model used earlier in this chapter, in which the constant and the output coefficient are allowed to differ across the five output groups.
FIGURE 7.6.1. Residuals of simple Nerlove model (legend: Residuals; Quadratic fit to Residuals)
We have seen evidence that the extended model is preferred. So if it is in fact the proper model, the simple model is misspecified. Let's check if this misspecification might induce autocorrelated errors.

The Octave program GLS/NerloveAR.m estimates the simple Nerlove model, plots the residuals as a function of ln Q, and calculates a Breusch-Godfrey test statistic. The residual plot is in Figure 7.6.1, and the test results are:

                          Value    p-value
Breusch-Godfrey test     34.930      0.000
Clearly, there is a problem of autocorrelated residuals.


EXERCISE 7.6. Repeat the autocorrelation tests using the extended Nerlove model (Equation ??) to see if the problem is solved.
Klein model. Klein's Model 1 is a simple macroeconometric model. One of the equations in the model explains consumption (C) as a function of profits (P), both current and lagged, as well as the sum of wages in the private sector (W^p) and wages in the government sector (W^g). Have a look at the README file for this data set. This gives the variable names and other information.

Consider the model

    C_t = α_0 + α_1 P_t + α_2 P_{t−1} + α_3 (W_t^p + W_t^g) + ε_t.

The Octave program GLS/Klein.m estimates this model by OLS, plots the residuals, and performs the Breusch-Godfrey test, using 1 lag of the residuals. The estimation and test results are:
*********************************************************
OLS estimation results
Observations 21
R-squared 0.981008
Sigma-squared 1.051732

Results (Ordinary var-cov estimator)

                  estimate   st.err.   t-stat.   p-value
Constant            16.237     1.303    12.464     0.000
Profits              0.193     0.091     2.115     0.049
Lagged Profits       0.090     0.091     0.992     0.335
Wages                0.796     0.040    19.933     0.000
*********************************************************

                          Value    p-value
Breusch-Godfrey test      1.539      0.215

FIGURE 7.6.2. OLS residuals, Klein consumption equation (legend: Regression residuals; Residuals)
and the residual plot is in Figure 7.6.2. The test does not reject the null of non-autocorrelated errors, but we should remember that we have only 21 observations, so power is likely to be fairly low. The residual plot leads me to suspect that there may be autocorrelation - there are some significant runs below and above the x-axis. Your opinion may differ.

Since it seems that there may be autocorrelation, let's try an AR(1) correction. The Octave program GLS/KleinAR1.m estimates the Klein consumption equation assuming that the errors follow the AR(1) pattern. The results, with the Breusch-Godfrey test for remaining autocorrelation, are:
*********************************************************
OLS estimation results
Observations 21
R-squared 0.967090
Sigma-squared 0.983171

Results (Ordinary var-cov estimator)

                  estimate   st.err.   t-stat.   p-value
Constant            16.992     1.492    11.388     0.000
Profits              0.215     0.096     2.232     0.039
Lagged Profits       0.076     0.094     0.806     0.431
Wages                0.774     0.048    16.234     0.000
*********************************************************

                          Value    p-value
Breusch-Godfrey test      2.129      0.345
- The test is farther away from the rejection region than before, and the residual plot is a bit more favorable for the hypothesis of nonautocorrelated residuals, IMHO. For this reason, it seems that the AR(1) correction might have improved the estimation.
- Nevertheless, there has not been much of an effect on the estimated coefficients nor on their estimated standard errors. This is probably because the estimated AR(1) coefficient is not very large (around 0.2).
The existence or not of autocorrelation in this model will be important


later, in the section on simultaneous equations.

Exercises

(1) Comparing the variances of the OLS and GLS estimators, I claimed that the following holds:

    Var(β̃) − Var(β̂_GLS) = A Σ A',  where A = (X'X)⁻¹X' − (X'Σ⁻¹X)⁻¹X'Σ⁻¹.

(2) Verify that this is true.
(3) Show that the GLS estimator can be defined as

    β̂_GLS = argmin_β (y − Xβ)' Σ⁻¹ (y − Xβ).

(4) The limiting distribution of the OLS estimator with heteroscedasticity of unknown form is

    √n (β̂ − β_0) →d N(0, Q_X⁻¹ Ω Q_X⁻¹),

    where

    Ω = lim E[(1/n) X'εε'X].

    Explain why

    Ω̂ = (1/n) Σ_{t=1}^n x_t x_t' ε̂_t²

    is a consistent estimator of this matrix.
(5) Define the v-th autocovariance of a covariance stationary process m_t as

    Γ_v = E(m_t m_{t−v}').

    Show that E(m_t m_{t+v}') = Γ_v'.
(6) For the Nerlove model

    ln C = β_1 + β_2 ln Q + β_3 ln P_L + β_4 ln P_F + β_5 ln P_K + ε,

    assume that the error variance differs across the five output groups (groupwise heteroscedasticity, as in the example above).
    (a) Calculate the FGLS estimator and interpret the estimation results.
    (b) Test the transformed model to check whether it appears to satisfy homoscedasticity.
CHAPTER 8

Stochastic regressors

Up to now we have treated the regressors as fixed, which is clearly unrealistic. Now we will assume they are random. There are several ways to think of the problem. First, if we are interested in an analysis conditional on the explanatory variables, then it is irrelevant if they are stochastic or not, since conditional on the values the regressors take on, they are nonstochastic, which is the case already considered.

- In cross-sectional analysis it is usually reasonable to make the analysis conditional on the regressors.
- In dynamic models, where y_t may depend on y_{t−1}, a conditional analysis is not sufficiently general, since we may want to predict into the future many periods out, so we need to consider the behavior of β̂ and the relevant test statistics unconditional on X.

The model we'll deal with involves a combination of the following assumptions.

Linearity: the model is a linear function of the parameter vector β_0:

    y_t = x_t'β_0 + ε_t,

or in matrix form,

    y = Xβ_0 + ε,

where y is n×1, X = (x_1, x_2, ..., x_n)' with x_t a K×1 vector, and β_0 and ε are conformable.
Stochastic, linearly independent regressors:
X has rank K with probability 1; X is stochastic; and

    lim_{n→∞} Pr((1/n) X'X = Q_X) = 1,

where Q_X is a finite positive definite matrix.

Central limit theorem:

    n^{−1/2} X'ε →d N(0, Q_X σ_0²).

Normality (Optional): ε is normally distributed.

Strongly exogenous regressors:

    E(ε_t | X) = 0, ∀t.                                    (8.0.1)

Weakly exogenous regressors:

    E(ε_t | x_t) = 0, ∀t.                                  (8.0.2)

In both cases, x_t'β is the conditional mean of y_t given x_t: E(y_t | x_t) = x_t'β.
8.1. Case 1

Normality of ε, strongly exogenous regressors

In this case,

    β̂ = β_0 + (X'X)⁻¹X'ε
    E(β̂ | X) = β_0 + (X'X)⁻¹X'E(ε | X) = β_0,

and since this holds for all X, E(β̂) = β_0, unconditional on X. Likewise,

    β̂ | X ~ N(β_0, (X'X)⁻¹ σ_0²).

- If the density of X is dμ(X), the marginal density of β̂ is obtained by multiplying the conditional density by dμ(X) and integrating over X. Doing this leads to a nonnormal density for β̂, in small samples.
- However, conditional on X, the usual test statistics have the t, F and χ² distributions. Importantly, these distributions don't depend on X, so when marginalizing to obtain the unconditional distribution, nothing changes. The tests are valid in small samples.
- Summary: When X is stochastic but strongly exogenous and ε is normally distributed:
  (1) β̂ is unbiased
  (2) β̂ is nonnormally distributed
  (3) The usual test statistics have the same distribution as with nonstochastic X.
  (4) The Gauss-Markov theorem still holds, since it holds conditionally on X, and this is true for all X.
  (5) Asymptotic properties are treated in the next section.
8.2. Case 2

ε nonnormally distributed, strongly exogenous regressors

The unbiasedness of β̂ carries through as before. However, the argument regarding test statistics doesn't hold, due to nonnormality of ε. Still, we have

    β̂ = β_0 + (X'X)⁻¹X'ε = β_0 + (X'X/n)⁻¹ (X'ε/n).

Now

    (X'X/n)⁻¹ →p Q_X⁻¹

by assumption, and

    X'ε/n = (n^{−1/2} X'ε)/√n →p 0,

since the numerator converges to a N(0, Q_X σ_0²) r.v. and the denominator still goes to infinity. We have unbiasedness and the variance disappearing, so the estimator is consistent:

    β̂ →p β_0.

Considering the asymptotic distribution,

    √n (β̂ − β_0) = (X'X/n)⁻¹ n^{−1/2} X'ε,

so

    √n (β̂ − β_0) →d N(0, Q_X⁻¹ σ_0²),

directly following the assumptions. Asymptotic normality of the estimator still holds. Since the asymptotic results on all test statistics only require this, all the previous asymptotic results on test statistics are also valid in this case.

- Summary: Under strongly exogenous regressors, with ε normal or nonnormal, β̂ has the properties:
  (1) Unbiasedness
  (2) Consistency
  (3) Gauss-Markov theorem holds, since it holds in the previous case and doesn't depend on normality.
  (4) Asymptotic normality
  (5) Tests are asymptotically valid, but are not valid in small samples.
8.3. Case 3

Weakly exogenous regressors

An important class of models are dynamic models, where lagged dependent variables have an impact on the current value. A simple version of these models that captures the important points is

    y_t = z_t'α + γ y_{t−1} + ε_t = x_t'β + ε_t,

where now x_t contains lagged dependent variables. Clearly, even with E(ε_t | x_t) = 0, X and ε are not uncorrelated, so one can't show unbiasedness. For example,

    E(x_t ε_{t−1}) ≠ 0,

since x_t contains y_{t−1} (which is a function of ε_{t−1}) as an element.

- This fact implies that all of the small sample properties such as unbiasedness, the Gauss-Markov theorem, and small sample validity of test statistics do not hold in this case. Recall Figure 3.7.2. This is a case of weakly exogenous regressors, and we see that the OLS estimator is biased in this case.
- Nevertheless, under the above assumptions, all asymptotic properties continue to hold, using the same arguments as before.
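A minimal simulation sketch (not from the text) illustrating the small-sample bias of OLS with a lagged dependent variable and AR(1) errors; all names and parameter values are hypothetical:

% sketch: OLS bias with lagged dependent variable and AR(1) errors
n = 1000; rho = 0.9; gam = 0.5;
e = zeros(n,1); y = zeros(n,1);
for t = 2:n
    e(t) = rho*e(t-1) + randn;    % AR(1) error
    y(t) = gam*y(t-1) + e(t);     % true coefficient on lagged y is 0.5
endfor
x = y(1:n-1);                     % regressor: lagged y
b = (x'*x)\(x'*y(2:n))            % OLS estimate; tends to exceed 0.5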

8.4. When are the assumptions reasonable?

The two assumptions we've added are

(1) lim_{n→∞} Pr((1/n) X'X = Q_X) = 1, with Q_X a finite positive definite matrix.
(2) n^{−1/2} X'ε →d N(0, Q_X σ_0²).

The most complicated case is that of dynamic models, since the other cases can be treated as nested in this case. There exist a number of central limit theorems for dependent processes, many of which are fairly technical. We won't enter into details (see Hamilton, Chapter 7 if you're interested). A main requirement for use of standard asymptotics for a dependent sequence {z_t}, i.e., for

    z̄_n = (1/n) Σ_{t=1}^n z_t

to converge in probability to a finite limit, is that {z_t} be stationary, in some sense.

- Strong stationarity requires that the joint distribution of the set {z_t, z_{t+s}, z_{t+q}, ...} not depend on t.
- Covariance (weak) stationarity requires that the first and second moments of this set not depend on t.
- An example of a sequence that doesn't satisfy this is an AR(1) process with a unit root (a random walk):

    x_t = x_{t−1} + ε_t,  ε_t ~ iid(0, σ²).

  One can show that the variance of x_t depends upon t in this case.
- Stationarity prevents the process from trending off to plus or minus infinity, and prevents cyclical behavior which would allow correlations between far removed z_t and z_s to be high. Draw a picture here.
- In summary, the assumptions are reasonable when the stochastic conditioning variables have variances that are finite, and are not too strongly dependent. The AR(1) model with unit root is an example of a case where the dependence is too strong for standard asymptotics to apply.
- The econometrics of nonstationary processes has been an active area of research in the last two decades. The standard asymptotics don't apply in this case. This isn't in the scope of this course.
Exercises

(1) Show that for two random variables A and B, if E(A|B) = 0, then E(A f(B)) = 0 for any function f(B). How is this used in the Gauss-Markov theorem?
(2) Is it possible for an AR(1) model for time series data, e.g., y_t = β_0 + β_1 y_{t−1} + ε_t, to satisfy weak exogeneity? Strong exogeneity? Discuss.
CHAPTER 9

Data problems

In this section we'll consider problems associated with the regressor matrix: collinearity, missing observations and measurement error.

9.1. Collinearity

Collinearity is the existence of linear relationships amongst the regressors. We can always write

    λ_1 x_1 + λ_2 x_2 + ··· + λ_K x_K + v = 0,

where x_i is the i-th column of the regressor matrix X, and v is an n×1 vector. In the case that there exists collinearity, the variation in v is relatively small, so that there is an approximately exact linear relation between the regressors.

- "Relative" and "approximate" are imprecise, so it's difficult to define when collinearity exists.

In the extreme, if there are exact linear relationships (every element of v equal to zero), then ρ(X) < K, so ρ(X'X) < K, so X'X is not invertible and the OLS estimator is not uniquely defined. For example, if the model is

    y_t = β_1 + β_2 x_{2t} + β_3 x_{3t} + ε_t
    x_{2t} = α_1 + α_2 x_{3t},
then we can write

    y_t = β_1 + β_2(α_1 + α_2 x_{3t}) + β_3 x_{3t} + ε_t
        = (β_1 + β_2 α_1) + (β_2 α_2 + β_3) x_{3t} + ε_t
        ≡ γ_1 + γ_2 x_{3t} + ε_t.

- The γ's can be consistently estimated, but since the γ's define two equations in three β's, the β's can't be consistently estimated (there are multiple values of β that solve the fonc). The β's are unidentified in the case of perfect collinearity.
- Perfect collinearity is unusual, except in the case of an error in construction of the regressor matrix, such as including the same regressor twice.
Another case where perfect collinearity may be encountered is with models with dummy variables, if one is not careful. Consider a model of the rental price (y_i) of an apartment. This could depend on factors such as size, quality, etc., collected in x_i, as well as on the location of the apartment. Let B_i = 1 if the i-th apartment is in Barcelona, B_i = 0 otherwise. Similarly, define G_i, T_i and L_i for Girona, Tarragona and Lleida. One could use a model such as

    y_i = β_1 + β_2 B_i + β_3 G_i + β_4 T_i + β_5 L_i + x_i'γ + ε_i.

In this model, B_i + G_i + T_i + L_i = 1, ∀i, so there is an exact relationship between these variables and the column of ones corresponding to the constant. One must either drop the constant, or one of the qualitative variables.
FIGURE 9.1.1. The OLS objective function when there is no collinearity
9.1.1. A brief aside on dummy variables. Introduce a brief discussion of dummy variables here.

9.1.2. Back to collinearity. The more common case, if one doesn't make mistakes such as these, is the existence of inexact linear relationships, i.e., correlations between the regressors that are less than one in absolute value, but not zero. The basic problem is that when two (or more) variables move together, it is difficult to determine their separate influences. This is reflected in imprecise estimates, i.e., estimates with high variances. With economic data, collinearity is commonly encountered, and is often a severe problem.

When there is collinearity, the minimizing point of the objective function that defines the OLS estimator (the sum of squared errors) is relatively poorly defined. This is seen in Figures 9.1.1 and 9.1.2.
FIGURE 9.1.2. The OLS objective function when there is collinearity
To see the effect of collinearity on variances, partition the regressor matrix as

    X = [x  W],

where x is the first column of X (note: we can interchange the columns of X if we like, so there's no loss of generality in considering the first column). Now, the variance of β̂, under the classical assumptions, is

    V(β̂) = (X'X)⁻¹ σ².

Using the partition,

    X'X = [ x'x   x'W
            W'x   W'W ],

and following a rule for partitioned inversion,

    (X'X)⁻¹_{1,1} = (x'x − x'W(W'W)⁻¹W'x)⁻¹
                  = (x'(I_n − W(W'W)⁻¹W')x)⁻¹
                  = (ESS_{x|W})⁻¹,

where by ESS_{x|W} we mean the error sum of squares obtained from the regression

    x = Wλ + v.

Since R² = 1 − ESS/TSS, we have ESS = TSS(1 − R²), so the variance of the coefficient corresponding to x is

    V(β̂_x) = σ² / (TSS_x (1 − R²_{x|W})).

We see three factors influence the variance of this coefficient. It will be high if

(1) σ² is large
(2) There is little variation in x. Draw a picture here.
(3) There is a strong linear relationship between x and the other regressors, so that W can explain the movement in x well. In this case, R²_{x|W} will be close to 1. As R²_{x|W} → 1, V(β̂_x) → ∞.

The last of these cases is collinearity.
Intuitively, when there are strong linear relations between the regressors, it is difficult to determine the separate influence of the regressors on the dependent variable. This can be seen by comparing the OLS objective function in the case of no correlation between regressors with the objective function with correlation between the regressors. See the figures nocollin.ps (no correlation) and collin.ps (correlation), available on the web site.
9.1.3. Detection of collinearity. The best way is simply to regress each explanatory variable in turn on the remaining regressors. If any of these auxiliary regressions has a high R², there is a problem of collinearity. Furthermore, this procedure identifies which parameters are affected.

- Sometimes, we're only interested in certain parameters. Collinearity isn't a problem if it doesn't affect what we're interested in estimating.

An alternative is to examine the matrix of correlations between the regressors.

- High correlations are sufficient but not necessary for severe collinearity.

Also indicative of collinearity is that the model fits well (high R²), but none of the variables is significantly different from zero (e.g., their separate influences aren't well determined).

In summary, the artificial regressions are the best approach if one wants to be careful.
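A minimal sketch of the auxiliary regressions (not from the text; it assumes a hypothetical n x k regressor matrix x whose first column is the constant):

% sketch: R^2 of each regressor on the remaining regressors
k = columns(x);
R2 = zeros(k,1);
for j = 2:k                                % skip the constant in column 1
    xj = x(:,j);
    W  = x(:, [1:j-1, j+1:k]);             % all other regressors
    g  = (W'*W)\(W'*xj);
    u  = xj - W*g;
    R2(j) = 1 - (u'*u)/sum((xj - mean(xj)).^2);
endfor
R2(2:k)   % values close to 1 signal a collinearity problem for that regressor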
9.1.4. Dealing with collinearity. More information

Collinearity is a problem of an uninformative sample. The first question is: is all the available information being used? Is more data available? Are there coefficient restrictions that have been neglected? Picture illustrating how a restriction can solve the problem of perfect collinearity.

Stochastic restrictions and ridge regression
Supposing that there is no more data or neglected restrictions, one possibility is to change perspectives, to Bayesian econometrics. One can express prior beliefs regarding the coefficients using stochastic restrictions. A stochastic linear restriction would be something of the form

    Rβ = r + v,

where R and r are as in the case of exact linear restrictions, but v is a random vector. For example, the model could be

    y = Xβ + ε
    Rβ = r + v
    ε ~ N(0, σ_ε² I_n)
    v ~ N(0, σ_v² I_q).

This sort of model isn't in line with the classical interpretation of parameters as constants: according to this interpretation the left hand side of Rβ = r + v is constant but the right is random. This model does fit the Bayesian perspective: we combine information coming from the model and the data, summarized in

    y = Xβ + ε,  ε ~ N(0, σ_ε² I_n),

with prior beliefs regarding the distribution of the parameter, summarized in

    Rβ ~ N(r, σ_v² I_q).

Since the sample is random it is reasonable to suppose that E(εv') = 0, which is the last piece of information in the specification. How can you estimate using this model? The solution is to treat the restrictions as artificial data. Write

    [ y ]   [ X ]       [ ε ]
    [ r ] = [ R ] β  +  [−v ].

This model is heteroscedastic, since σ_ε² ≠ σ_v². Define the prior precision k = σ_ε/σ_v. This expresses the degree of belief in the restriction relative to the variability of the data. Supposing that we specify k, then the model

    [ y  ]   [ X  ]       [ ε  ]
    [ kr ] = [ kR ] β  +  [−kv ]

is homoscedastic and can be estimated by OLS. Note that this estimator is biased. It is consistent, however, given that k is a fixed constant, even if the restriction is false (this is in contrast to the case of false exact restrictions). To see this, note that there are Q restrictions, where Q is the number of rows of R. As n → ∞, these Q artificial observations have no weight in the objective function, so the estimator has the same limiting objective function as the OLS estimator, and is therefore consistent.
To motivate the use of stochastic restrictions, consider the expectation of the squared length of β̂:

    E(β̂'β̂) = E[(β + (X'X)⁻¹X'ε)'(β + (X'X)⁻¹X'ε)]
            = β'β + E[ε'X(X'X)⁻¹(X'X)⁻¹X'ε]
            = β'β + σ² Tr[(X'X)⁻¹]
            = β'β + σ² Σ_{i=1}^K λ_i       (the trace is the sum of eigenvalues)
            ≥ β'β + σ² λ_max[(X'X)⁻¹]      (the eigenvalues are all positive, since X'X is p.d.),

so

    E(β̂'β̂) ≥ β'β + σ² / λ_min(X'X),

where λ_min(X'X) is the minimum eigenvalue of X'X (which is the inverse of the maximum eigenvalue of (X'X)⁻¹). As collinearity becomes worse and worse, X'X becomes more nearly singular, so λ_min(X'X) tends to zero (recall that the determinant is the product of the eigenvalues) and E(β̂'β̂) tends to infinity. On the other hand, β'β is finite.

Now considering the restriction I_K β = 0 + v. With this restriction the model becomes

    [ y ]   [ X    ]       [ ε  ]
    [ 0 ] = [ kI_K ] β  +  [−kv ],

and the estimator is

    β̂_ridge = (X'X + k² I_K)⁻¹ X'y.

This is the ordinary ridge regression estimator. The ridge regression estimator can be seen to add k² I_K, which is nonsingular, to X'X, which is more and more nearly singular as collinearity becomes worse and worse. As k → ∞, the restrictions tend to β = 0, that is, the coefficients are shrunken toward zero. Also, the estimator tends to

    β̂_ridge = (X'X + k² I_K)⁻¹ X'y → (k² I_K)⁻¹ X'y = X'y/k² → 0,

so β̂_ridge → 0. This is clearly a false restriction in the limit, if our original model is at all sensible.
There should be some amount of shrinkage that is in fact a true restriction. The problem is to determine the k such that the restriction is correct. The interest in ridge regression centers on the fact that it can be shown that there exists a k such that MSE(β̂_ridge) < MSE(β̂_OLS). The problem is that this k depends on β and σ², which are unknown.

The ridge trace method plots the squared length β̂_ridge'β̂_ridge as a function of k, and chooses the value of k that artistically seems appropriate (e.g., where the effect of increasing k dies off). Draw picture here. This means of choosing k is obviously subjective. This is not a problem from the Bayesian perspective: the choice of k reflects prior beliefs about the length of β.

In summary, the ridge estimator offers some hope, but it is impossible to guarantee that it will outperform the OLS estimator. Collinearity is a fact of life in econometrics, and there is no clear solution to the problem.
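A minimal sketch of the ridge estimator and a ridge trace (not from the text; hypothetical names y (n x 1) and x (n x k), and an arbitrary grid for k):

% sketch: ordinary ridge regression and the ridge trace
k = columns(x);
kvals = linspace(0, 10, 50);              % grid of shrinkage parameters
len = zeros(50,1);
for i = 1:50
    b_r = (x'*x + kvals(i)^2*eye(k)) \ (x'*y);   % ridge estimator
    len(i) = sqrt(b_r'*b_r);                      % length of the estimate
endfor
plot(kvals, len);                          % ridge trace: length versus k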

9.2. Measurement error

Measurement error is exactly what it says: either the dependent variable or the regressors are measured with error. Thinking about the way economic data are reported, measurement error is probably quite prevalent. For example, estimates of growth of GDP, inflation, etc. are commonly revised several times. Why should the last revision necessarily be correct?
9.2.1. Error of measurement of the dependent variable. Measurement errors in the dependent variable and the regressors have important differences.

First consider error in measurement of the dependent variable. The data generating process is presumed to be

    y* = Xβ + ε
    y = y* + v
    v_t ~ iid(0, σ_v²),

where y* is the unobservable true dependent variable, and y is what is observed. We assume that ε and v are independent and that y* = Xβ + ε satisfies the classical assumptions. Given this, we have

    y = Xβ + ε + v
      = Xβ + ω,
    ω_t ~ iid(0, σ_ω²),  σ_ω² = σ_ε² + σ_v².

- As long as v is uncorrelated with X, this model satisfies the classical assumptions and can be estimated by OLS. This type of measurement error isn't a problem, then.
9.2.2. Error of measurement of the regressors. The situation isn't so good in this case. The DGP is

    y_t = x̃_t'β + ε_t
    x_t = x̃_t + v_t
    v_t ~ iid(0, Σ_v),

where Σ_v is a K×K matrix. Now x̃_t contains the true, unobserved regressors, and x_t is what is observed. Again assume that v is independent of ε, and that the model y = X̃β + ε satisfies the classical assumptions. Now we have

    y_t = (x_t − v_t)'β + ε_t
        = x_t'β − v_t'β + ε_t
        = x_t'β + ω_t.

The problem is that now there is a correlation between x_t and ω_t, since

    E(x_t ω_t) = E[(x̃_t + v_t)(−v_t'β + ε_t)] = −Σ_v β,

where Σ_v = E(v_t v_t'). Because of this correlation, the OLS estimator is biased and inconsistent, just as in the case of autocorrelated errors with lagged dependent variables. In matrix notation, write the estimated model as

    y = Xβ + ω.
We have that

    β̂ = (X'X/n)⁻¹ (X'y/n),

and

    plim (X'X/n) = plim ((X̃ + V)'(X̃ + V)/n) = Q_X̃ + Σ_v,

since X̃ and V are independent, and

    plim (V'V/n) = lim E[(1/n) Σ_t v_t v_t'] = Σ_v.

Likewise,

    plim (X'y/n) = plim ((X̃ + V)'(X̃β + ε)/n) = Q_X̃ β,

so

    plim β̂ = (Q_X̃ + Σ_v)⁻¹ Q_X̃ β.

So we see that the least squares estimator is inconsistent when the regressors are measured with error.

- A potential solution to this problem is the instrumental variables (IV) estimator, which we'll discuss shortly.
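A minimal simulation sketch (not from the text) of the attenuation implied by the plim formula above; the variances used are arbitrary illustrative values:

% sketch: attenuation bias from measurement error in a single regressor
n = 10000;
xtrue = randn(n,1);                    % true regressor, variance 1
x     = xtrue + sqrt(0.5)*randn(n,1);  % observed with error, Var(v) = 0.5
y     = xtrue + randn(n,1);            % true beta = 1
b     = (x'*x)\(x'*y)                  % plim is 1/(1+0.5) = 2/3, not 1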

9.3. Missing observations

Missing observations occur quite frequently: time series data may not be gathered in a certain year, or respondents to a survey may not answer all questions. We'll consider two cases: missing observations on the dependent variable and missing observations on the regressors.

9.3.1. Missing observations on the dependent variable. In this case, we have

    [ y_1 ]   [ X_1 ]       [ ε_1 ]
    [ y_2 ] = [ X_2 ] β  +  [ ε_2 ],

or y = Xβ + ε, where y_2 is not observed. Otherwise, we assume the classical assumptions hold.

- A clear alternative is to simply estimate using the complete observations

    y_1 = X_1 β + ε_1.

  Since these observations satisfy the classical assumptions, one could estimate by OLS.
- The question remains whether or not one could somehow replace the unobserved y_2 by a predictor, and improve over OLS in some sense.
- Let ŷ_2 be the predictor of y_2. Now

    β̂ = ( [X_1' X_2'] [ X_1 ; X_2 ] )⁻¹ [X_1' X_2'] [ y_1 ; ŷ_2 ]
      = (X_1'X_1 + X_2'X_2)⁻¹ (X_1'y_1 + X_2'ŷ_2).

Recall that the OLS fonc are

    X'Xβ̂ = X'y,

so if we regressed using only the first (complete) observations, we would have

    X_1'X_1 β̂_1 = X_1'y_1.

Likewise, an OLS regression using only the second (filled in) observations would give

    X_2'X_2 β̂_2 = X_2'ŷ_2.

Substituting these into the equation for the overall combined estimator gives

    β̂ = (X_1'X_1 + X_2'X_2)⁻¹ [X_1'X_1 β̂_1 + X_2'X_2 β̂_2]
      = (X_1'X_1 + X_2'X_2)⁻¹ X_1'X_1 β̂_1 + (X_1'X_1 + X_2'X_2)⁻¹ X_2'X_2 β̂_2
      ≡ A β̂_1 + (I_K − A) β̂_2,

where A ≡ (X_1'X_1 + X_2'X_2)⁻¹ X_1'X_1, and we use

    (X_1'X_1 + X_2'X_2)⁻¹ X_2'X_2 = (X_1'X_1 + X_2'X_2)⁻¹ (X_1'X_1 + X_2'X_2 − X_1'X_1)
                                  = I_K − (X_1'X_1 + X_2'X_2)⁻¹ X_1'X_1
                                  = I_K − A.

Now,

    E(β̂) = A β + (I_K − A) E(β̂_2),

and this will be unbiased only if E(β̂_2) = β.
- The conclusion is that the filled-in observations alone would need to define an unbiased estimator. This will be the case only if

    ŷ_2 = X_2 β + ε̂_2,

  where ε̂_2 has mean zero. Clearly, it is difficult to satisfy this condition without knowledge of β.
- Note that putting ŷ_2 = ȳ_1 does not satisfy the condition and therefore leads to a biased estimator.

EXERCISE 13. Formally prove this last statement.
- One possibility that has been suggested (see Greene, page 275) is to estimate β using a first round estimation using only the complete observations

    β̂_1 = (X_1'X_1)⁻¹ X_1'y_1,

  then use this estimate to predict y_2:

    ŷ_2 = X_2 β̂_1 = X_2 (X_1'X_1)⁻¹ X_1'y_1.

  Now, the overall estimate is a weighted average of β̂_1 and β̂_2, just as above, but we have

    β̂_2 = (X_2'X_2)⁻¹ X_2'ŷ_2 = (X_2'X_2)⁻¹ X_2'X_2 β̂_1 = β̂_1.

  This shows that this suggestion is completely empty of content: the final estimator is the same as the OLS estimator using only the complete observations.
9.3.2. The sample selection problem. In the above discussion we assumed that the missing observations are random. The sample selection problem is a case where the missing observations are not random. Consider the model

    y_t* = x_t'β + ε_t,

which is assumed to satisfy the classical assumptions. However, y_t* is not always observed. What is observed is y_t, defined as

    y_t = y_t*  if  y_t* ≥ 0.

Or, in other words, y_t* is missing when it is less than zero.

- The difference in this case is that the missing values are not random: they are correlated with the x_t. Consider the case

    y* = x + ε,

  with V(ε) = 25, but using only the observations for which y* > 0 to estimate. Figure 9.3.1 illustrates the bias. The Octave program is sampsel.m.

FIGURE 9.3.1. Sample selection bias (legend: Data; True Line; Fitted Line)
9.3.3. Missing observations on the regressors. Again the model is

    [ y_1 ]   [ X_1 ]       [ ε_1 ]
    [ y_2 ] = [ X_2 ] β  +  [ ε_2 ],

but we assume now that each row of X_2 has an unobserved component(s). Again, one could just estimate using the complete observations, but it may seem frustrating to have to drop observations simply because of a single missing variable. In general, if the unobserved X_2 is replaced by some prediction, X_2*, then we are in the case of errors of observation. As before, this means that the OLS estimator is biased when X_2* is used instead of X_2. Consistency is salvaged, however, as long as the number of missing observations doesn't increase with n.

- Including observations that have missing values replaced by ad hoc values can be interpreted as introducing false stochastic restrictions. In general, this introduces bias. It is difficult to determine whether MSE increases or decreases. Monte Carlo studies suggest that it is dangerous to simply substitute the mean, for example.
- In the case that there is only one regressor other than the constant, substitution of x̄ for the missing x_t does not lead to bias. This is a special case that doesn't hold for K > 2.

EXERCISE 14. Prove this last statement.
- In summary, if one is strongly concerned with bias, it is best to drop observations that have missing components. There is potential for reduction of MSE through filling in missing elements with intelligent guesses, but this could also increase MSE.
Exercises

(1) Consider the Nerlove model

    ln C = β_1 + β_2 ln Q + β_3 ln P_L + β_4 ln P_F + β_5 ln P_K + ε.

When this model is estimated by OLS, some coefficients are not significant. This may be due to collinearity.
(a) Calculate the correlation matrix of the regressors.
(b) Perform artificial regressions to see if collinearity is a problem.
(c) Apply the ridge regression estimator.
    (i) Plot the ridge trace diagram.
    (ii) Check what happens as k goes to zero, and as k becomes very large.
CHAPTER 10

Functional form and nonnested tests

Though theory often suggests which conditioning variables should be included, and suggests the signs of certain derivatives, it is usually silent regarding the functional form of the relationship between the dependent variable and the regressors. For example, considering a cost function, one could have a Cobb-Douglas model

    C = A w_1^{β_1} w_2^{β_2} q^{β_q} e^ε.

This model, after taking logarithms, gives

    ln C = β_0 + β_1 ln w_1 + β_2 ln w_2 + β_q ln q + ε,

where β_0 = ln A. Theory suggests that β_1, β_2, β_q > 0. This model isn't compatible with a fixed cost of production since C = 0 when q = 0. Homogeneity of degree one in input prices suggests that β_1 + β_2 = 1, while constant returns to scale implies β_q = 1.

While this model may be reasonable in some cases, an alternative

    √C = β_0 + β_1 √w_1 + β_2 √w_2 + β_q √q + ε

may be just as plausible. Note that √x and ln(x) look quite alike, for certain values of the regressors, and up to a linear transform, so it may be difficult to choose between these models.
The basic point is that many functional forms are compatible with the linear-in-parameters model, since this model can incorporate a wide variety of nonlinear transformations of the dependent variable and the regressors. For example, suppose that g(·) is a real valued function and that x(·) is a K-vector-valued function. The following model is linear in the parameters but nonlinear in the variables:

    x_t = x(z_t)
    y_t = x_t'β + ε_t.

There may be P fundamental conditioning variables z_t, but there may be K regressors, where K may be smaller than, equal to or larger than P. For example, x_t could include squares and cross products of the conditioning variables in z_t.
10.1. Flexible functional forms

Given that the functional form of the relationship between the dependent variable and the regressors is in general unknown, one might wonder if there exist parametric models that can closely approximate a wide variety of functional relationships. A "Diewert-flexible" functional form is defined as one such that the function, the vector of first derivatives and the matrix of second derivatives can take on an arbitrary value at a single data point. Flexibility in this sense clearly requires that there be at least

    1 + P + (P² + P)/2

free parameters: one for each independent effect that we wish to model.
Suppose that the model is

    y = g(x) + ε.

A second-order Taylor's series expansion (with remainder term) of the function g(x) about the point x = 0 is

    g(x) = g(0) + x'D_x g(0) + (1/2) x'D²_x g(0) x + R.

Use the approximation, which simply drops the remainder term, as an approximation to g(x):

    g(x) ≈ g_K(x) = g(0) + x'D_x g(0) + (1/2) x'D²_x g(0) x.

As x → 0, the approximation becomes more and more exact, in the sense that g_K(x) → g(x), D_x g_K(x) → D_x g(x) and D²_x g_K(x) → D²_x g(x). For x = 0, the approximation is exact, up to the second order. The idea behind many flexible functional forms is to note that g(0), D_x g(0) and D²_x g(0) are all constants. If we treat them as parameters, the approximation will have exactly enough free parameters to approximate the function g(x), which is of unknown form, exactly, up to second order, at the point x = 0. The model is

    g_K(x) = α + x'β + (1/2) x'Γ x,

so the regression model to fit is

    y = α + x'β + (1/2) x'Γ x + ε.
- While the regression model has enough free parameters to be Diewert-flexible, the question remains: is α̂ →p g(0)? Is β̂ →p D_x g(0)? Is Γ̂ →p D²_x g(0)?
- The answer is no, in general. The reason is that if we treat the true values of the parameters as these derivatives, then ε is forced to play the part of the remainder term, which is a function of x, so that x and ε are correlated in this case. As before, the estimator is biased in this case.
- A simpler example would be to consider a first-order T.S. approximation to a quadratic function. Draw picture.
- The conclusion is that "flexible functional forms" aren't really flexible in a useful statistical sense, in that neither the function itself nor its derivatives are consistently estimated, unless the function belongs to the parametric family of the specified functional form. In order to lead to consistent inferences, the regression model must be correctly specified.
10.1.1. The translog form. In spite of the fact that FFFs aren't really as flexible as they were originally claimed to be, they are useful, and they are certainly subject to less bias due to misspecification of the functional form than are many popular forms, such as the Cobb-Douglas or the simple linear in the variables model. The translog model is probably the most widely used FFF. This model is as above, except that the variables are subjected to a logarithmic transformation. Also, the expansion point is usually taken to be the sample mean of the data, after the logarithmic transformation. The model is defined
by

    ln y = α + x'β + (1/2) x'Γ x + ε,
    x = ( ln(z_1/z̄_1), ln(z_2/z̄_2), ..., ln(z_P/z̄_P) )'.

In this presentation, the t subscript that distinguishes observations is suppressed for simplicity. Note that

    ∂ ln y / ∂ x_i = β_i + (Γ x)_i     (the other part of x is constant)
                   = ∂ ln y / ∂ ln z_i,

which is the elasticity of y with respect to z_i. This is a convenient feature of the translog model. Note that at the means of the conditioning variables, z̄, we have x = 0, so the β's are the first-order elasticities, at the means of the data.

To illustrate, consider that y is cost of production:

    y = c(w, q),

where w is a vector of input prices and q is output. We could add other variables by extending q in the obvious manner, but this is suppressed for simplicity.
By Shephard's lemma, the conditional factor demands are

    x(w, q) = ∂c(w, q)/∂w,

and the cost shares of the factors are therefore

    s = w ⊙ x / c = (∂c(w, q)/∂w) ⊙ w / c,

which is simply the vector of elasticities of cost with respect to input prices. If the cost function is modeled using a translog function, we have

    ln c = α + x'β + z δ + (1/2) [ x'  z ] [ Γ_11   Γ_12 ] [ x ]
                                           [ Γ_12'  Γ_22 ] [ z ]
         = α + x'β + z δ + (1/2) x'Γ_11 x + x'Γ_12 z + (1/2) z² Γ_22,

where x = ln(w/w̄) (element-by-element division), z = ln(q/q̄), and

    Γ_11 = [ γ_11  γ_12 ],   Γ_12 = [ γ_13 ],   Γ_22 = γ_33.
           [ γ_12  γ_22 ]           [ γ_23 ]

Note that symmetry of the second derivatives has been imposed.

Then the share equations are just

    s = β + [ Γ_11  Γ_12 ] [ x ]
                           [ z ].
Therefore, the share equations and the cost equation have parameters in common. By pooling the equations together and imposing the (true) restriction that the parameters of the equations be the same, we can gain efficiency.

To illustrate in more detail, consider the case of two inputs, so x = (x_1, x_2)'. In this case the translog model of the logarithmic cost function is

    ln c = α + β_1 x_1 + β_2 x_2 + δ z + (γ_11/2) x_1² + (γ_22/2) x_2² + (γ_33/2) z² + γ_12 x_1 x_2 + γ_13 x_1 z + γ_23 x_2 z.

The two cost shares of the inputs are the derivatives of ln c with respect to x_1 and x_2:

    s_1 = β_1 + γ_11 x_1 + γ_12 x_2 + γ_13 z
    s_2 = β_2 + γ_12 x_1 + γ_22 x_2 + γ_23 z.

Note that the share equations and the cost equation have parameters in common. One can do a pooled estimation of the three equations at once, imposing that the parameters are the same. In this way we're using more observations and therefore more information, which will lead to improved efficiency. Note that this does assume that the cost equation is correctly specified (i.e., not an approximation), since otherwise the derivatives would not be the true derivatives of the log cost function, and would then be misspecified for the shares. To pool the equations, write the model in matrix form (adding in error terms). With the appropriate notation, a single observation can be written as

    y_t = X_t θ + ε_t,

which is one observation on the three equations, where

    y_t = ( ln c_t,  s_{1t},  s_{2t} )',
    θ   = ( α, β_1, β_2, δ, γ_11, γ_22, γ_33, γ_12, γ_13, γ_23 )',

and X_t stacks the regressors of the cost equation and the two share equations:

    X_t = [ 1  x_1  x_2  z  x_1²/2  x_2²/2  z²/2  x_1 x_2  x_1 z  x_2 z
            0   1    0   0   x_1      0      0      x_2      z     0
            0   0    1   0    0      x_2     0      x_1      0     z   ].

The overall model would stack observations on the three equations, for a total of 3n observations:

    [ y_1 ]   [ X_1 ]       [ ε_1 ]
    [ y_2 ] = [ X_2 ] θ  +  [ ε_2 ]
    [  ⋮  ]   [  ⋮  ]       [  ⋮  ]
    [ y_n ]   [ X_n ]       [ ε_n ].
Next we need to consider the errors. For observation t the errors can be placed in a vector

    ε_t = ( ε_{1t}, ε_{2t}, ε_{3t} )'.

First consider the covariance matrix of this vector: the shares are certainly correlated since they must sum to one. (In fact, with 2 shares the variances are equal and the covariance is -1 times the variance. General notation is used to allow easy extension to the case of more than 2 inputs.) Also, it's likely that the shares and the cost equation have different variances. Supposing that the model is covariance stationary, the variance of ε_t won't depend upon t:

    Var(ε_t) = Σ_0 = [ σ_11  σ_12  σ_13
                        ·    σ_22  σ_23
                        ·     ·    σ_33 ].

Note that this matrix is singular, since the shares sum to 1. Assuming that there is no autocorrelation, the overall covariance matrix has the seemingly unrelated regressions (SUR) structure:

    Var[ ε_1 ]   [ Σ_0   0   ···   0  ]
       [ ε_2 ] = [  0   Σ_0        ⋮  ]  =  I_n ⊗ Σ_0,
       [  ⋮  ]   [  ⋮         ⋱       ]
       [ ε_n ]   [  0   ···        Σ_0 ]

where the symbol ⊗ indicates the Kronecker product. The Kronecker product of two matrices A and B is

    A ⊗ B = [ a_11 B   a_12 B   ···   a_1q B
                ⋮                        ⋮
              a_p1 B    ···           a_pq B ].

- Personally, I can never keep straight the roles of A and B.
10.1.2. FGLS estimation of a translog model. So, this model has heteroscedasticity and autocorrelation, so OLS won't be efficient. The next question is: how do we estimate efficiently using FGLS? FGLS is based upon inverting the estimated error covariance Σ̂. So we need to estimate Σ.

An asymptotically efficient procedure is (supposing normality of the errors):
(1) Estimate each equation by OLS.
(2) Estimate Σ_0 using

    Σ̂_0 = (1/n) Σ_{t=1}^n ε̂_t ε̂_t'.

(3) Next we need to account for the singularity of Σ_0. It can be shown that Σ̂_0 will be singular when the shares sum to one, so FGLS won't work. The solution is to drop one of the share equations, for example the second. The model becomes, with the appropriate rows of y_t and X_t retained,

    y_t* = X_t* θ + ε_t*,

which is the matrix notation for one observation on the two retained equations.
In stacked notation for all observations we have the 2n observations:

    [ y_1* ]   [ X_1* ]       [ ε_1* ]
    [ y_2* ] = [ X_2* ] θ  +  [ ε_2* ]
    [   ⋮  ]   [   ⋮  ]       [   ⋮  ]
    [ y_n* ]   [ X_n* ]       [ ε_n* ],

or, finally, in matrix notation for all observations:

    y* = X* θ + ε*.

Considering the error covariance, we can define

    Σ_0* = Var(ε_t*),
    Σ*   = I_n ⊗ Σ_0*.

Define Σ̂_0* as the leading 2×2 block of Σ̂_0, and form

    Σ̂* = I_n ⊗ Σ̂_0*.

This is a consistent estimator, following the consistency of OLS and applying a LLN.

(4) Next compute the Cholesky factorization

    P̂_0 = Chol(Σ̂_0*)⁻¹

(defined so that P̂_0 Σ̂_0* P̂_0' = I), and the Cholesky factorization of the overall covariance matrix of the 2 equation model, which can be calculated as

    P̂ = Chol(Σ̂*) = I_n ⊗ P̂_0.
(5) Finally the FGLS estimator can be calculated by applying OLS to the transformed model

    P̂ y* = P̂ X* θ + P̂ ε*,

or by directly using the GLS formula

    θ̂_FGLS = (X*' Σ̂*⁻¹ X*)⁻¹ X*' Σ̂*⁻¹ y*.

It is equivalent to transform each observation individually:

    P̂_0 y_t* = P̂_0 X_t* θ + P̂_0 ε_t*,

and then apply OLS. This is probably the simplest approach.

A few last comments.

(1) We have assumed no autocorrelation across time. This is clearly restrictive. It is relatively simple to relax this, but we wont go into it
here.
(2) Also, we have only imposed symmetry of the second derivatives. Another restriction that the model should satisfy is that the estimated
shares should sum to 1. This can be accomplished by imposing

)
8 7 5
774)  
%

I
B

q B 
|
VH
PI

These are linear parameter restrictions, so they are easy to impose and
will improve efciency if they are true.

10.2. TESTING NONNESTED HYPOTHESES

195

(3) The estimation procedure outlined above can be iterated. That is, esti-

as above, then re-estimate

using errors calculated as

@0

C0

mate


jf  G

These might be expected to lead to a better estimate than the es-

@0

since FGLS is asymptotically more efcient.

using the new estimated error covariance. It can


d

Then re-estimate

timator based on

be shown that if this is repeated until the estimates dont change (i.e.,
iterated to convergence) then the resulting estimator is the MLE. At
any rate, the asymptotic properties of the iterated and uniterated estimators are the same, since both are based upon a consistent estimator
of the error covariance.

10.2. Testing nonnested hypotheses


Given that the choice of functional form isnt perfectly clear, in that many
possibilities exist, how can one choose between forms? When one form is a
parametric restriction of another, the previously studied tests such as Wald,
P

LR, score or

are all possibilities. For example, the Cobb-Douglas model is a

parametric restriction of the translog: The translog is

G P
VS  R  5  R  P
) P

 gf

&
g

where the variables are in logarithms, while the Cobb-Douglas is

G P
0 R  P
8 

 f

&

so a test of the Cobb-Douglas versus the translog is simply a test that

10.2. TESTING NONNESTED HYPOTHESES

196

The situation is more complicated when we want to test non-nested hypotheses. If the two functional forms are linear in the parameters, and use the same
transformation of the dependent variable, then they may be written as

! 3
t3
G P
Vi
P


W

! d 3
t3


ktG


f
 $r

wd

p
B r

85
74)  3

is misspecied, for

fr
 $UI

We wish to test hypotheses of the form:

is correctly specied versus

One could account for non-iid errors, but well suppress this for simplicity.
test, pro

There are a number of ways to proceed. Well consider the

posed by Davidson and MacKinnon, Econometrica (1981). The idea is


to articially nest the two models, e.g.,

P W
Ht

is zero.

P
"c

&

)
 f

On the other hand, if the second model is correctly specied then


X&

&

&

If the rst model is correctly specied, then the true value of

The problem is that this model is not identied in general. For


example, if the models share some regressors, as in

C3  | iS  iDC
P  P
 P I
G P
0C |  | VC  VDH
P
PI

fr
 $I
f
 $r

8
4)


B r


10.2. TESTING NONNESTED HYPOTHESES

197

then the composite model is

P
C3  |


P
C 
&

&


PI
DC

P
S |  | c

&

&

) P
C  c

&

) P I
DH

)
c  gf
&

Combining terms we get

P
i |  | c

) P
i 
&

&

8


in place of

8 Q&


&


 gf

P
si3  3 C |  | i  I
P
P
P
P & EC & Hc & c
) P I P I )

test is to substitute

&

is not, since we have four equa-

tions in 7 unknowns, so one cant test the hypothesis that

The idea of the

are consistently estimable, but

P
C3  |

The four

This is a consistent

estimator supposing that the second model is correctly specied. It will tend
to a nite probability limit even if the second model is misspecied. Then
estimate the model

P W
H  e

&

P
f

is consistently estimable, and

P d
p 
8
f a  f R W I pW R e
W

 f

is asymptotically normal:

T 2


)
x9
&  2

If the second model is correctly specied, then

since

&


Q&

and that the ordinary -statistic for

&

)
 f

one can show that, under the hypothesis that the rst model is correct,

&

In this model,
&

&

P
"

where

tends in

probability to 1, while its estimated standard error tends to zero. Thus


the test will always reject the false null model, asymptotically, since the
statistic will eventually exceed any critical value with probability one.

10.2. TESTING NONNESTED HYPOTHESES

198

We can reverse the roles of the models, testing the second against the

It may be the case that neither model is correctly specied. In this case,

rst.

the test will still reject the null hypothesis, asymptotically, if we use
distribution, since as long as


T 2

)
t9 j

something different from zero,

&

critical values from the

tends to

Of course, when we switch

the roles of the models the other will also be rejected asymptotically.

In summary, there are 4 possible outcomes when we test two models,


each against the other. Both may be rejected, neither may be rejected,
or one of the two may be rejected.

There are other tests available for non-nested models. The

test is

simple to apply when both models are linear in the parameters. The

-test is similar, but easier to apply when

is nonlinear.

The above presentation assumes that the same transformation of the

dependent variable is used by both models. MacKinnon, White and


Davidson, Journal of Econometrics, (1983) shows how to deal with the
case of different transformations.
Monte-Carlo evidence shows that these tests often over-reject a correctly specied model. Can use bootstrap critical values to get betterperforming tests.

CHAPTER 11

Exogeneity and simultaneity


Several times weve encountered cases where correlation between regressors and the error term lead to biasedness and inconsistency of the OLS estimator. Cases include autocorrelation with lagged dependent variables and
measurement error in the regressors. Another important case is that of simultaneous equations. The cause is different, but the effect is the same.

11.1. Simultaneous equations


Up until now our model is

G P
V"  f

not interested in conditioning on

as xed. This means that

When analyzing dynamic models, were


as we saw in the section on stochastic

regressors. Nevertheless, the OLS estimator obtained by treating

we condition on

%
8
i

when estimating

where, for purposes of estimation we can treat

continues to have desirable asymptotic properties even in that case.


199

as xed

11.1. SIMULTANEOUS EQUATIONS

200

Simultaneous equations is a different prospect. An example of a simultaneous equation system is a simple supply-demand system:

 v

 v

r G I
$G n

are jointly determined at the same time by

S

the intersection of these equations. Well assume that

4f

I
G

and

2 
S

H H 
I II
G0P S uDH
PI
P
S  & I &
P

The presumption is that

&

Supply:

I G
$VP f |

Demand:

is determined by some

unrelated process. Its easy to see that we have correlation between regressors

C


&
H0H
I II


&
&
I G& I
$G j$G P gf | & P HjqI &

I
r I
qG

S

&
&
&

G$G P gf | & P H0qI &  S


I

I
GI$VP gf | & DH0qI &  S & S
G
PI

VP S VsH  $VP f | &


G P I
I G

and errors. Solving for

P
CS

&

P
sI

&

Now consider whether

is uncorrelated with

I
 $G U


Because of this correlation, OLS estimation of the demand equation will be


biased and inconsistent. The same applies to the supply equation, for the same
reason.

11.1. SIMULTANEOUS EQUATIONS

and

are the endogenous varibles (endogs), that are deter-

S

In this model,

201

mined within the system.

is an exogenous variable (exogs). These concepts

gf

are a bit tricky, and well return to it in a minute. First, some notation. Suppose
endogs,

is

8
c

Group current and lagged exogs, as well as lagged endogs in the vector
equations into the error vector

&

Stack the errors of the

8)
4'

The model, with additional assumtions, can be written as


 R U


 2

observations and write the model as

sT j
i

Rf

 

RI


Rf

8 & '
aiX1

 

is

.
.
.

RI
.
.
.


 R

Rf

and

.
.
.

R
RI

where

 Q1
'

is

2T
 S
R P

We can stack all

8
c

a
8) '
4"&

, which is



 & '
aiX1

is

&

If there are

we group together current endogs in the vector

This system is complete, in that there are as many equations as endogs.


There is a normality assumption. This isnt necessary, but allows us to

consider the relationship between least squares and ML estimators.

11.2. EXOGENEITY

202

Since there is no autocorrelation of the

s, and since the columns of

are individually homoscedastic, then

f
f Huxxb f H f H
I bb
I
II

.
.
.
.. .
. .
.

may contain lagged endogenous and exogenous variables. These

variables are predetermined.

We need to dene what is meant by endogenous and exogenous

when classifying the current period variables.

11.2. Exogeneity
The model denes a data generating process. The model involves two sets of
as well as a parameter vector

S
R r R x

R x

R x n 

In general, without additional restrictions, is a

P &
45 V aP P &
&
&

and

variables,

dimensional vector. This is the parameter vector that were inter-

&

ested in estimating.

8
t

depends on a parameter vector

Write this density as

and


ca

In principle, there exists a joint density function for

which

 
o t EiE

11.2. EXOGENEITY

s of course. This can be factored into the density of


times the marginal density of

conditional on

This includes lagged

and lagged

8
2

is the information set in period

where

203

  t 
o st aE o CE  o st acC
 

This is a general factorization, but is may very well be the case that not
to indicate elements

may share elements,

that enter into the conditional density and write

of course. We have

I
st

that enter into the marginal. In general,

and

of

affect both factors. So use

I
st

all parameters in

for parameters

o  t aE o 6c C  o st aCE
 I t
 

Recall that the model is


 2
2T j
 S
R P i
R

R
s


 R U

Normality and lack of correlation over time imply that the observations are
independent of one another, so we can write the log-likelihood function as the

11.2. EXOGENEITY

204

sum of likelihood contributions of each observation:

I
@
P I t
 6Dca CE
o
f

I
@
!ca iE
I t
o
f

I
@
st aCE

f

is weakly exogeneous for

cannot share elements if

Supposing that

g
tI t

I
t

arbitrary combinations of

(the

that is invariant

is weakly exoge-

changes, which prevents consideration of

is weakly exogenous, then the MLE of

I
st

would change as

I
t

nous, since

and

8t

This implies that

8EISd  SIg 
t t d  tI t
t

More formally, for an arbitrary


to

to
d

original parameter vector) if there is a mapping from

I
@

 o  t a
f

o  t aE


 o 7d f

D EFINITION 15 (Weak Exogeneity).

using the joint

density is the same as the MLE using only the conditional density

8 t
I
@
 
6Ita iE

d
A f  o i f

In other words, the joint

and conditional log-likelihoods maximize at the same value of

Since the DGP of

as xed in inference.

is irrelevant, we can treat

By the invariance property of MLE, the MLE of is

It Sd

8
d

I
6st

rameter of interest,

is irrelevant for

is sufcient to recover the pa-

and knowledge of

I
t

inference on

With weak exogeneity, knowledge of the DGP of

8I
st

since the conditional likelihood doesnt depend on

and this map-

ping is assumed to exist in the denition of weak exogeneity.

11.3. REDUCED FORM

205

Of course, well need to gure out just what this mapping is to recover

With lack of weak exogeneity, the joint and conditional likelihood func-

This is the famous identication problem.

tions maximize in different places. For this reason, we cant treat

8I
Dt

from

as

xed in inference. The joint MLE is valid, but the conditional MLE is
not.
to be weakly exogenous if

we are to be able to treat them as xed in estimation. Lagged

In resume, we require the variables in

sat-

isfy the denition, since they are in the conditioning information set,
Lagged

arent exogenous in the normal usage of the

8
c

I i

e.g.,

word, since their values are determined within the model, just earlier
on. Weakly exogenous variables include exogenous (in the normal sense)
variables as well as all predetermined variables.

11.3. Reduced form


Recall that the model is

R P i
R
S

R
s

This is the model in structural form.

D EFINITION 16 (Structural form). An equation is in structural form when


more than one current period endogenous variable is included.

Sgf P I
P

&
&
P I &
GjG P gf |
I

HjqI &
&
j$VSgf | & HjqI &
G I G P
PI
I G P
0Cf |
&

P
SS

&

P
I
&

 S

 S & S
G PI
 VP S uDH
Similarly, the rf for price is

&
G &
I
G


$G
I

uP gf |

I G
$VP gf |
&

I f
HP I I
P I


P gf& P H & qI
| &
I & &
us 0sH qI
P G P I & &
&

P qH jv & I &
P
G I

 g

 g

 g

&

demand:
tity is obtained by solving the supply equation for price and substituting into
An example is our supply/demand system. The reduced form for quancurrent period endog is included.
D EFINITION 17 (Reduced form). An equation is in reduced form if only one
reduced form.
Now only one current period endog appears in each equation. This is the

R R
 P ei

R
I R P I i


R


The solution for the current period endogs is easy to nd. It is


11.3. REDUCED FORM

206

11.3. REDUCED FORM

207

The interesting thing about the rf is that the equations individually satisfy the
by assumption,

&
&

I
H

  B
f

is

82

P H & g0H
I
5 II
&
&



&
I
H$G G & $G
I

I
D

The variance of

and

The errors of the rf are

zz gz x
z gx
z
z
z z  gx

i=1,2,

I
G

and therefore

is uncorrelated with

classical assumptions, since

I
 H


This is constant over time, so the rst rf equation is homoscedastic.


are independent over time, so are the

8
ci

Likewise, since the




The variance of the second rf error is


I 5 II
VP & HVH
&
&
G I
jH$G jH$G
G I





and the contemporaneous covariance of the errors across equations is


I
VP H & P qH
II
&

&
&
jG G & $G
G I
I


I
 HU

In summary the rf equations individually satisfy the classical assumptions, under the assumtions weve made, but they are contemporaneously correlated.

11.4. IV ESTIMATION

208

The general form of the rf is

R R
P ei

R
I R P I i

R



so we have that

R I $ x u R I  C

if the

2
 h I

and that the

are timewise independent (note that this wouldnt be the case

were autocorrelated).
11.4. IV estimation

The IV estimator may appear a bit unusual at rst, but it will grow on you
over time.
The simultaneous equations model is

Considering the rst equation (this is without loss of generality, since we can
matrix as

r D f n 
I

is the rst column

always reorder the equations) we can partition the

are the other endogenous variables that enter the rst equation
are endogs that are excluded from this equation
as

r e n 
I


I 
f 

Similarly, partition

11.4. IV ESTIMATION

are the excluded exogs.

are the included exogs, and

209

I
 

Finally, partition the error matrix as

r I G n 

Assume that

has ones on the main diagonal. These are normalization

restrictions that simply scale the remaining coefcients on each equation, and
which scale the variances of the error terms.
Given this scaling and our partitioning, the coefcient matrices can be written as

I
C

I
H




With this, the rst equation can be written as

 f

G P I I P I I
VDHVDCsD

7G

is correlated with

since


endogs.

G
VP

The problem, as weve seen is that

is formed of

11.4. IV ESTIMATION

210

Now, lets consider the general problem of a linear regression model with
correlation between regressors and the error term:

8  RG U

gf  3 G
t3
G P
0"  f
The present case of a structural equation from a system of equations ts into
this notation, but so do other problems, such as measurement error or lagged

dependent variables with autocorrelated errors. Consider some matrix

which is formed of variables uncorrelated with . This matrix denes a projec-

tion matrix

R I R 

by the denition of


G

projection matrix we get

correlated with

so that anything that is projected onto the space spanned by

Transforming the model with this

G
P
" !  f !
or

G P
V  f

are uncorrelated, since this is simply

G R U

G R R U

!
!

 RG U

and

Now we have that

will be un-

11.4. IV ESTIMATION

211

and

R I R 
!
This is a linear combination of

so it must be uncorrelated with


X

OLS to the model

8
G

the columns of

on

is the tted value from a regression of

This implies that applying

G P
V  f

will lead to a consistent estimator, given a few more assumptions. This is the

ments. The estimator is

generalized instrumental variables estimator.

is known as the matrix of instru-

f R
R
i I !X ! i 

from which we obtain

P
R I X ! R
!
G P R
V" i I !X i
R
!
!


so

G I 6RRi I I !ddi
R

R R R
G R
R
% I 6X ! i
1


 0)


Now we can introduce factors of

to get

1
1
1

p 1

1
1

RG I R
R
R I R
R Y  j)
I

so that

Assuming that each of the terms with a

in the denominator satises a LLN,

11.4. IV ESTIMATION

212

!
!

, a nite pd matrix

(= cols

a nite matrix with rank

T y 
f!
T y 
T !yf 

then the plim of the rhs is zero. This last term has plim 0 since we assume that
and are uncorrelated, e.g.,

  tG R

Given these assumtions the IV estimator is consistent

8
T


)
h

we have
h

1
1
1
GR I R R
I


q1

Furthermore, scaling by
h

1
1
1
R I R R


 h 0) 1

Assuming that the far right term saties a CLT, so that


!
!


m

are the obvious ones. An estimator for

is

8 h

!
!


!
h 0) 1

and

The estimators for


I ! ! R ! ! t
I
!

then we get

jf R h ) f ) 1 

This estimator is consistent following the proof of consistency of the OLS esti-

mator of

when the classical assumptions hold.

11.4. IV ESTIMATION

213

The formula used to estimate the variance of

is

h Q I dD%  )
R R R


I

The IV estimator is

(1) Consistent
(2) Asymptotically normally distributed
!

I X ! R
R
G

!
R I X ! R U6  G ! R

(3) Biased in general, since even though

are not independent.

then the IV estimator using

8I
gj

as the estimator that used

I
0

When we have two sets of instruments,

and

!
!

ments inuences the efciency of the estimator.

The choice of instru-

such that

 j
I

and these depend upon the choice of

depends upon

and

An important point is that the asymptotic distribution of

and

may not be zero, since

is at least as efciently asymptotically

More instruments leads to more asymp-

totically efcient estimation, in general.


There are special cases where there is no gain (simultaneous equations

is an example of this, as well see).

The penalty for indiscriminant use of instruments is that the small


sample bias of the IV estimator rises as the number of instruments

to

increases. The reason for this is that

becomes closer and closer

itself as the number of instruments increases.

IV estimation can clearly be used in the case of simultaneous equations. The only issue is which instruments to use.

11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS

214

11.5. Identication by exclusion restrictions


The identication problem in simultaneous equations is in fact of the same
nature as the identication problem in any estimation setting: does the limiting objective function have the proper curvature so that there is a unique
global minimum or maximum at the true parameter value? In the context of
IV estimation, this is the case if the limiting covariance of the IV estimator is

I 6 ! R !! !  ) an
I

 RG I f 3 

positive denite and

. This matrix is

The necessary and sufcient condition for identication is simply that

this matrix be positive denite, and that the instruments be (asymptotically) uncorrelated with .

For this matrix to be positive denite, we need that the conditions

).

!
!

of full rank (

must be positive denite and

noted above hold:

must be

These identication conditions are not that intuitive nor is it very ob-

vious how to check them.

11.5.1. Necessary conditions. If we use IV estimation for a single equation


of the system, the equation can be written as

G
VP

 f
where
W

Let

r dD n 
I I

Notation:

be the total numer of weakly exogenous variables.

I
 ` 

be the number of excluded exogs (in this equation).

&V&

 s&
)PEID I  &



Let

be the number of included exogs, and let

Let

215

11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS

be the total number of included endogs, and let

be the number of excluded endogs.

Using this notation, consider the selection of instruments.

exhausts the set of possible instruments, in that

I


if the variables in

It turns out that

ments.

are weakly exogenous and can serve as their own instru-

Now the

dont lead to an identied model then no other

instruments will identify the model either. Assuming this is true (well
prove it in a moment), then a necessary condition for identication is
since if not then at least one instrument must

I
ED I

will not have full column rank:

) wP
&

!
a

&
) wP

be used twice, so

that

This is the order condition for identication in a set of simultaneous


equations. When the only identifying information is exclusion restrictions on the variables that enter an equation, then the number of excluded exogs must be greater than or equal to the number of included
endogs, minus 1 (the normalized lhs endog), e.g.,

) &

To show that this is in fact a necessary condition consider some arbi-

trary set of instruments

A necessary condition for identication is

r I
VP I  n R 1 3   W R 1 3 

I
)
)
I
D
I
s
r I I
eDwP VP I n R 1  W R 1

I
)
)

and

converges in probability to zero, so

s are uncorrelated with the

s, by assumption, the cross

I
HP VP I   D

I
I
r D
I

n P

|I

I
I
I

between

Because the
so
so we have

r n  r D f n
I
I

we can write the reduced form using the same partition

r e n 
I
r D f n 
I
P

Given the reduced form


as

Recall that weve partitioned the model

r dD n 
I I

) wP
&


R ) 1 3 $

where

that
11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS

216

Since the far rhs term is formed only of linear combinations of columns of

columns, then it is not of full column rank.

columns we have

P
) &

 

or noting that

regardless of the choice of

has more than

When

has more than

instruments. If

the rank of this matrix can never be greater than

217


%

11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS

) &

In this case, the limiting matrix is not of full column rank, and the identication
condition fails.

11.5.2. Sufcient conditions. Identication essentially requires that the structural parameters be recoverable from the data. This wont be the case, in general, unless the structural model is subject to some restrictions. Weve already
identied necessary conditions. Turning to sufcient conditions (again, were
only considering identication through zero restricitions on the parameters,
for the moment).
The model is

P i
R
S

R
s

11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS

218

This leads to the reduced form

I SR I
R
SP ei

R
I P I i

R




 S


The reduced form parameters are consistently estimable, but none of them are
known a priori, and there are no restrictions on their values. The problem is
that more than one structural form has the same reduced form, so knowledge
of the reduced form parameters alone isnt enough to determine the structural
parameters. To see this, consider the model

matrix. The rf of this new model

R
R

&
i' &

 P

CwP

I P I

I I P P I I P
P
P
P P I P
P

is

is some arbirary nonsingular

where

R
i

R
i

 R

11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS

219

Likewise, the covariance of the rf of the transformed model is

 I

Since the two structural forms lead to the same rf, and the rf is all that is directly estimable, the models are said to be observationally equivalent. What we

ble

and

such that the only admissi

need for identication are restrictions on

is an identity matrix (if all of the equations are to be identied). Take the

coefcient matrices as partitioned before:

I
C

I
H

The coefcients of the rst equation of the transformed model are simply these
. This gives

I
I

I
C

coefcients multiplied by the rst column of

I
H

I
I

11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS

220

For identication of the rst equation we need that there be enough restrictions
so that the only admissible

I
I


P

be the leading column of an identity matrix, so that

)


I
Ct

I
H

I
I


I
i

I
H

Note that the third and fth rows are




Supposing that the leading matrix is of full column rank, e.g.,

) & 

` 



then the only way this can hold, without additional restrictions on the models
is a vector of zeros, then

)  I
I

) 

I
I

) & 

r I


) n
Therefore, as long as

the rst equation

is a vector of zeros. Given that


P

parameters, is if

11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS

221

then

I
I



P

The rst equation is identied in this case, so the condition is sufcient for
identication. It is also necessary, since the condition implies that this submarows. Since this matrix has

&
P V& 

) &

trix must have at least

&
rows, we obtain

) &

&
P 0&
or

) &

which is the previously derived necessary condition.


The above result is fairly intuitive (draw picture here). The necessary condition ensures that there are enough variables not in the equation of interest to
potentially move the other equations, so as to trace out the equation of interest. The sufcient condition ensures that those other equations in fact do move
around as the variables change their values. Some points:

)
4(s& 

When an equation has

is is exactly identied, in that

omission of an identiying restriction is not possible without loosing


consistency.

)
4

&

When

the equation is overidentied, since one could

drop a restriction and still retain consistency. Overidentifying restrictions are therefore testable. When an equation is overidentied we

11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS

222

have more instruments than are strictly necessary for consistent estimation. Since estimation by IV with more instruments is more efcient
asymptotically, one should employ overidentifying restrictions if one
is condent that theyre true.
We can repeat this partition for each equation in the system, to see

which equations are identied and which arent.


These results are valid assuming that the only identifying informa-

tion comes from knowing which variables appear in which equations,


e.g., by exclusion restrictions, and through the use of a normalization. There are other sorts of identifying information that can be used.
These include
(1) Cross equation restrictions
(2) Additional restrictions on parameters within equations (as in the
Klein model discussed below)
(3) Restrictions on the covariance matrix of the errors
(4) Nonlinearities in variables
When these sorts of information are available, the above conditions

arent necessary for identication, though they are of course still sufcient.

To give an example of how other information can be used, consider the model

where

is an upper triangular matrix with 1s on the main diagonal. This is a

triangular system of equations. In this case, the rst equation is

I I  f
P
I

11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS

223

Since only exogs appear on the rhs, this equation is identied.


The second equation is

5  &


P 0DI t f

P I f  

This equation has

excluded exogs, and

included endogs, so it

fails the order (necessary) condition for identication.

  I

However, suppose that we have the restriction

so that the rst

and second structural errors are uncorrelated. In this case

I f
G I G P R
 0DI i7  GU
S

so theres no problem of simultaneity. If the entire

matrix is diago-

nal, then following the same logic, all of the equations are identied.
This is known as a fully recursive model.

11.5.3. Example: Kleins Model 1. To give an example of determining identication status, consider the following macro model (this is the widely known

8 I a I I w s&  ) n  R
and the predetermined variables are all others:

r
a T o

n R
The endogenous variables are the
government


c





} }




8
c w

and a time trend,

taxes,

|
|

|
H H
I II
I
| H


s&

lhs variables,
nonwage spending,

The other variables are the government wage bill,


I
P

|e
e
I
ge


T V Ha
& P P
wS S
| VS w | sI a CaCiP p 
G P  P  P I 
0sI | uDI VC HuP p
G P
P
P I
I G P
$V P T | & I & C I & P p &

P
P


 a

 T



Capital Stock:
Prots:
Output:
Private Wages:

Investment:
Consumption:
Kleins Model 1)

11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS

224

11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS

225

The model assumes that the errors of the equations are contemporaneously

correlated, by nonautocorrelated. The model written as

gives

)
| &



|

&
|


| &
p p
&



)


)

HY I &
) Ct
I

and

)

)

To check this identication of the consumption equation, we need to extract

the submatrices of coefcients of endogs and exogs that dont appear

in this equation. These are the rows that have zeros in the rst column, and

11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS

226

we need to drop the rst column. We get

example, selecting rows 3,4,5,6, and 7 we obtain the matrix

'


|
|
)


)



) qC
) I
)

)

We need to nd a set of 5 rows of this matrix gives a full-rank 5

matrix. For

) |
|
)


w

This matrix is of full rank, so the sufcient condition for identication is met.

and counting excluded exogs,

 

) 7 

) & 


77  &

Counting included endogs,

so

11.6. 2SLS

227

The equation is over-identied by three restrictions, according to the

counting rules, which are correct when the only identifying information are the exclusion restrictions. However, there is additional infor-

and

mation in this case. Both

enter the consumption equation,

and their coefcients are restricted to be the same. For this reason the
consumption equation is in fact overidentied by four restrictions.

11.6. 2SLS
When we have no information regarding cross-equation restrictions or the
structure of the error covariance matrix, one can estimate the parameters of a
single equation of the system without regard to the other equations.
This isnt always efcient, as well see, but it has the advantage that

misspecications in other equations will not affect the consistency of


the estimator of the parameters of the equation of interest.
Also, estimation of the equation wont be affected by identication
problems in other equations.
is re-

gressed on all the weakly exogenous variables in the system, e.g., the entire
matrix. The tted values are

I
D
I R R
i I !Xiu
on the space spanned by

I
s




and since any vector in this space is uncorrelated with

by assumption,

ID

I
 

Since these tted values are the projection of


i

The 2SLS estimator is very simple: in the rst stage, each column of

is

11.6. 2SLS

8
G

I


related with

Since

I


uncorrelated with

228

is simply the reduced-form prediction, it is cor-

The only other requirement is that the instruments be linearly

independent. This should be the case when the order condition is satised,
in this case.

in place of

I


original model is

I

The second stage substitutes

than in

I
D

since there are more columns in

and estimates by OLS. This

 f

G P I I P I I
VDHVDCsD
G
VP


W

and the second stage model is

8 G P I I P I I
VDHVDCsD  f

is in the space spanned by

I


stage model as

I
  %
I 

Since

so we can write the second

G
VP W s

G P I I P I I
VDH DCsD s

 f
$

The OLS estimator applied to this model is

f s R W I pW s R e 
W

which is exactly what we get if we estimate using IV, with the reduced form
predictions of the endogs used as instruments. Note that if we dene

 W

r I I
dD n

11.6. 2SLS

are the instruments for


YW

so that

229

then we can write

fyW I !pWyW 
R R

Important note: OLS on the transformed model can be used to calcusince we see that its equivalent to IV using

late the 2SLS estimate of

a particular set of instruments. However the OLS covariance formula is


not valid. We need to apply the IV covariance formula already seen
above.

Actually, there is also a simplication of the general IV variance formula. Dene

 W

s


The IV covariance estimator would ordinarily be

R
R
R

h WW h W yW h W yW d
I
I

However, looking at the last term in brackets

I
RI
I
 RI
I
c R I DE R I
I


I I
I I
 r eD n R r d n  W R W

11.6. 2SLS

we can write

I
RI
I
R I

r eD
I I

is idempotent and since

I I
n R r dD
I
D s RI
ID sUs R I


i 

but since

230

r I I
r I I
 eD n R  D n


R
yW

Therefore, the second and last term in the variance formula cancel, so the 2SLS
varcov estimator simplies to

h W yW 

I

which, following some algebra similar to the above, can also be written as

h W yW d
I

Finally, recall that though this is presented in terms of the rst equation, it is
general since any equation can be placed rst.
Properties of 2SLS:

(1) Consistent
(2) Asymptotically normal
(3) Biased when the mean esists (the existence of moments is a technical
issue we wont go into here).
(4) Asymptotically inefcient, except in special circumstances (more on
this later).

11.7. TESTING THE OVERIDENTIFYING RESTRICTIONS

231

11.7. Testing the overidentifying restrictions


The selection of which variables are endogs and which are exogs is part of
the specication of the model. As such, there is room for error here: one might
erroneously classify a variable as exog when it is in fact correlated with the
error term. A general test for the specication on the model can be formulated
as follows:
The IV estimator can be calculated by applying OLS to the transformed
model, so the IV objective function at the minimized value is

 h ) jf

R
h

jf l )

but


)
!

GVP"

R I X !
!
f R I X

!
!
f i I !X
R

jf

R
iujf





R uj

R uj

G P
0" w


where

R
R
% I 6X ! iuj
$

w
so

GVPi

R R PR

w R w diugtG  s)
!

. Substituting a consistent


w R w ' )

!


w R w ' )

!

)
t9 j


G
w R w R G  )
!

random variable with an idempotent matrix in

are normally distributed, with variance


xs

estimator,

This isnt available, since we need to estimate


the middle, so
is a quadratic form of a

variable
Supposing the

then the random

G
R
w R w tG  s)
!
so


l
R I Q R u


!
!



w


is orthogonal to

8 i I 6X
R

! R

R
Ri I !X iu i I !X
!
!
!
!
R !
R !
R
% I 6X ! iuj ! i I 6X
!

R
iu !

R%u ! !
R
iu !

Furthermore,

w R w
!


w R w
!

Moreover,

is idempotent, as can be veried by multiplication:


11.7. TESTING THE OVERIDENTIFYING RESTRICTIONS

232

11.7. TESTING THE OVERIDENTIFYING RESTRICTIONS

Even if the

233

arent normally distributed, the asymptotic result still

holds. The last thing we need to determine is the rank of the idempotent matrix. We have
!

R
R
i I !X ! iu !


w ! R w
so

R
I 6X iu
R !
i I 6X !

R I R y


R
! i y ! y

!
R
iu ! ! y

I R R y

and






is the number of columns of

 w R w
!
8
%

columns of

where

is the number of

The degrees of freedom of the test is simply the number

of overidentifying restrictions: the number of instruments we have


beyond the number that is strictly necessary for consistent estimation.

This test is an overall specication test: the joint null hypothesis is that

the model is correctly specied and that the

form valid instruments

(e.g., that the variables classied as exogs really are uncorrelated with
W

8
G
 f

and

8
7G

or that there is correlation between

G
XP

Rejection can mean that either the model

is misspecied,

This is a particular case of the GMM criterion test, which is covered in

the second half of the course. See Section 15.8.


Note that since

G 
w

11.7. TESTING THE OVERIDENTIFYING RESTRICTIONS

234

and

G w R w R G  s)

!
we can write



)

D1

! e 1
1
GR G

G R I R  R I R R G

test statistic.

on all of the instruments

is the uncentered

from a regression of the

where

residuals

. This is a convenient way to calculate the

On an aside, consider IV estimation of a just-identied model, using the standard notation

G P
V"  f
is the matrix of instruments. If we have exact identication then


y

, so

is a square matrix. The transformed model is

X I

G
P
" !  f !
and the fonc are

f R
 s) j %
!
The IV estimator is


 `

and

f R

R
i I X ! i 

h ) R Vf R

P
!
!
)

R R P
i s f i
R
!

!
h ) jf

R
i
!

h ) jf

!

) jf

!

) jf

!
) jf

R h

R
 h

R
 h

jf

R f


R f


R
f


R
f



 )

The objective function for the generalized IV estimator is

f I 6Q
R R



Rf I R R I R  R I Q R
f R I R  R I Q R


!
we obtain


f R
!

R R R
I @i I !X
R R
R
I I !ddi I !X
R R R
I I 6di





Now multiplying this by




R
 I X i
!
Considering the inverse here

11.7. TESTING THE OVERIDENTIFYING RESTRICTIONS

235

11.8. SYSTEM METHODS OF ESTIMATION

236

by the fonc for generalized IV. However, when were in the just indentied
case, this is
!


 )

f R I X R u R I R 0 R I R R f

f I 6Qu f
R R
R
R R ! R
f I 6QVjf f




The value of the objective function of the IV estimator is zero in the just identied
case. This makes sense, since weve already shown that the objective function
with degrees of freedom equal to the

rv, which has mean 0 and variance 0,

is asymptotically

after dividing by

number of overidentifying restrictions. In the present case, there are no overi

dentifying restrictions, so we have a

e.g., its simply 0. This means were not able to test the identifying restrictions
in the case of exact identication.

11.8. System methods of estimation


2SLS is a single equation method of estimation, as noted above. The advantage of a single equation method is that its unaffected by the other equations
of the system, so they dont need to be specied (except for dening what are
the exogs, so 2SLS can use the complete set of instruments). The disadvantage
of 2SLS is that its inefcient, in general.
Recall that overidentication improves efciency of estimation, since

an overidentied equation can use more instruments than are necessary for consistent estimation.
Secondly, the assumption is that

11.8. SYSTEM METHODS OF ESTIMATION

237

sT j
i

R
 i

Since there is no autocorrelation of the

s, and since the columns of

are individually homoscedastic, then

f
f Huxxb f H f H
I bb
I
II

.
.
.
.. .
. .
.

This means that the structural equations are heteroscedastic and correlated with one another
In general, ignoring this will lead to inefcient estimation, following

the section on GLS. When equations are correlated with one another
estimation should account for the correlation in order to obtain efciency.
Also, since the equations are correlated, information about one equa-

tion is implicitly information about all equations. Therefore, overidentication restrictions in any equation improve efciency for all equations, even the just identied equations.
Single equation methods cant use these types of information, and are
therefore inefcient (in general).

11.8. SYSTEM METHODS OF ESTIMATION

238

11.8.1. 3SLS. Note: It is easier and more practical to treat the 3SLS estimator as a generalized method of moments estimator (see Chapter 15). I no
longer teach the following section, but it is retained for its possible historical
interest. Another alternative is to use FIML (Subsection 11.8.2), if you are willing to make distributional assumptions on the errors. This is computationally
feasible with modern computers.
Following our above notation, each structural equation can be written as

G
B VP B B W
G PI PI
B 0H B 0Di B


B f


.
.
.

I
UW

bb
xxb

G
VP

I
f


 f
W

where we already have that

.
.
.

..

bb
xxb

.
.
.

.
.
.

.
.
.

I
G

or

equations together we get

&

Grouping the

RG
 tG
i

11.8. SYSTEM METHODS OF ESTIMATION

239

The 3SLS estimator is just 2SLS combined with a GLS correction that takes
as

R I X R u

bb
xxb

..

 W

bb
xxb
.
.
.

bb
xxb

I I
d

.
.
.

..

.
.
.

8
i

W R X R u

I
I

UW R I X R u

bb
xxb

.
.
.

Dene

advantage of the structure of

These instruments are simply the unrestricted rf predicitions of the endogs,


combined with the exogs. The distinction is that if the model is overidentied,
then

'

does not impose these restrictions. Also, note that

and

and

may be subject to some zero restrictions, depending on the restrictions on

is calculated

using OLS equation by equation. More on this later.


The 2SLS estimator would be

fyW I !pWyW 
R R

as can be veried by simple multiplication, and noting that the inverse of a


block-diagonal matrix is just the matrix with the inverses of the blocks on the
main diagonal. This IV estimator still ignores the covariance information. The
natural extension is to add the GLS transformation, putting the inverse of the

11.8. SYSTEM METHODS OF ESTIMATION

240

error covariance into the formula, which gives the 3SLS estimator

f f I yS yW h W f I yS yW
R
R
I

f I $f egyW h W I 4f yW
S R
S R
I

The solution is to dene a feasible


The obvious solution is to use an

8
S

B f HB G


Then the element

1

G BR G B


3

Substitute

is estimated by

B

B B

of

not
W

(IMPORTANT NOTE: this is calculated using

estimator based on the 2SLS residuals:

0
@`

0
CG

estimator using a consistent estimator of

8
g B

8
S

This estimator requires knowledge of

into the formula above to get the feasible 3SLS estimator.

Analogously to what we did in the case of 2SLS, the asymptotic distribution


of the 3SLS estimator can be shown to be
h

$f
I I

S
R

 f 

0
@`

| 1

A formula for estimating the variance of the 3SLS estimator in nite samples
is

h W hf
I

1

R
I y S W  h

(cancelling out the powers of


0
@`

This is analogous to the 2SLS formula in equation (??), combined with


the GLS correction.

11.8. SYSTEM METHODS OF ESTIMATION

241

In the case that all equations are just identied, 3SLS is numerically
equivalent to 2SLS. Proving this is easiest if we use a GMM interpre-

tation of 2SLS and 3SLS. GMM is presented in the next econometrics


course. For now, take it on faith.

equation by equation using OLS:

The 3SLS estimator is based upon the rf parameter estimator

calculated

d% I 6Xi 
R R

which is simply

r xxb f f n R I X R 
f bb
I

that is, OLS equation by equation using all the exogs in the estimation of each

column of

It may seem odd that we use OLS on the reduced form, since the rf equations are correlated:

R P R

R
I R P I i

R



and

2
 h I

R I  u R I  C

Let this var-cov matrix be indicated by

R I 

11.8. SYSTEM METHODS OF ESTIMATION

242

OLS equation by equation to get the rf is equivalent to

is the

 3

 3

endog,

is the entire
column of

and

bb
xxb

column of

I
f


 3

) '
w1

Bf

'
j1

Use the notation

is the

.
.
.

vector of observations of the

matrix of exogs,

.
.
.

bb
xxb

is the

..

where

.
.
.

I
$

.
.
.

.
.
.

 f

to indicate the pooled model. Following this notation, the error covariance
matrix is


 7R

This is a special case of a type of model known as a set of seemingly


unrelated equations (SUR) since the parameter vector of each equation

is different. The equations are contemporanously correlated, however.

The general case would have a different

for each equation.

Note that each equation of the system individually satises the classi-

cal assumptions.
However, pooled estimation using the GLS correction is more efcient,

since equation-by-equation estimation is equivalent to pooled estimais block diagonal, but ignoring the covariance informa-

The model is estimated by GLS, where

tion.

tion, since

is estimated using the OLS

residuals from equation-by-equation estimation, which are consistent.

11.8. SYSTEM METHODS OF ESTIMATION

are the same, which is true in the

v
 R
I

Using the rules

and

we get


I 
f R I X R u

f iw I I 6Xiu
R
R
f iw I I Xwtf iw I t
R

R

R Xwf Xwtf I 4f R Xwtf



.
.
.

w


(3)

w 
R R w
I w 

(2)

(1)

I
8 
%wtf

this note that in this case

OLS. To show

f I $f

present case of estimation of the rf parameters, SUR


$

In the special case that all the

243






So the unrestricted rf coefcients can be estimated efciently (assum-

We have ignored any potential zeros in the matrix

ing normality) by OLS, even if the equations are correlated.


which if they

exist could potentially increase the efciency of estimation of the rf.


Another example where SUR OLS is in estimation of vector autore$

gressions. See two sections ahead.

11.8.2. FIML. Full information maximum likelihood is an alternative estimation method. FIML will be asymptotically efcient, since ML estimators based on a given information set are asymptotically efcient w.r.t. all
other estimators that use the same information set, and in the case of the

11.8. SYSTEM METHODS OF ESTIMATION

244

full-information ML estimator we use the entire information set. The 2SLS


and 3SLS estimators dont require distributional assumptions, while FIML of
course does. Our model is, recall

R
s


 R U

means that the density for

requires the Jacobian

R t

}  t }

5 ~ p
R R j R I ) R R )  7g} eI I

is

} p 

so the density for

to

is the multivariate nor-

I S R )5  } eI I
~ p

The transformation from

mal, which is


 2
2T j
 S
R P i
R

The joint normality of

5
} p

Given the assumption of independence over time, the joint log-likelihood function is

R R j R I 4 R j R

I@ 5
I
f )

5
} A 1


5
S
1 P 5
} p A D1  T  f
&

do this in the next section.

of this can be done using iterative numeric methods. Well see how to

This is a nonlinear in the parameters objective function. Maximixation

11.9. EXAMPLE: 2SLS AND KLEINS MODEL 1

245

It turns out that the asymptotic distribution of 3SLS and FIML are the

same, assuming normality of the errors.


One can calculate the FIML estimator by iterating the 3SLS estimator,

thus avoiding the use of a nonlinear optimizer. The steps are

8 CGI0 | CG | 

|
|
0
@`

as normal.
This is new, we didnt estimate

in

(2) Calculate

and

0
CG

(1) Calculate

this way before. This estimator may have some zeros in it. When
Greene says iterated 3SLS doesnt lead to FIML, he means this for

and

and

you do converge to FIML.

and calculate

using

(3) Calculate the instruments

and

If you update

but only updates

a procedure that doesnt update

to get the estimated errors, applying the usual estimator.

(5) Repeat steps 2-4 until there is no change in the parameters.

8
S

(4) Apply 3SLS using these new instruments and the estimate of

FIML is fully efcient, since its an ML estimator that uses all informa-

tion. This implies that 3SLS is fully efcient when the errors are normally
distributed. Also, if each equation is just identied and the errors are
normal, then 2SLS will be fully efcient, since in this case 2SLS 3SLS.
$

When the errors arent normally distributed, the likelihood function is


of course different than whats written above.

11.9. Example: 2SLS and Kleins Model 1


The Octave program Simeq/Klein.m performs 2SLS estimation for the 3
equations of Kleins model 1, assuming nonautocorrelated errors, so that lagged
endogenous variables can be used as instruments. The results are:
CONSUMPTION EQUATION

11.9. EXAMPLE: 2SLS AND KLEINS MODEL 1

246

*******************************************************
2SLS estimation results
Observations 21
R-squared 0.976711
Sigma-squared 1.044059

estimate

st.err.

t-stat.

p-value

16.555

1.321

12.534

0.000

Profits

0.017

0.118

0.147

0.885

Lagged Profits

0.216

0.107

2.016

0.060

Wages

0.810

0.040

20.129

0.000

Constant

*******************************************************
INVESTMENT EQUATION

*******************************************************
2SLS estimation results
Observations 21
R-squared 0.884884
Sigma-squared 1.383184

estimate

st.err.

t-stat.

p-value

20.278

7.543

2.688

0.016

Profits

0.150

0.173

0.867

0.398

Lagged Profits

0.616

0.163

3.784

0.001

Constant

11.9. EXAMPLE: 2SLS AND KLEINS MODEL 1

Lagged Capital

-0.158

0.036

247

-4.368

0.000

*******************************************************
WAGES EQUATION

*******************************************************
2SLS estimation results
Observations 21
R-squared 0.987414
Sigma-squared 0.476427

estimate

st.err.

t-stat.

p-value

Constant

1.500

1.148

1.307

0.209

Output

0.439

0.036

12.316

0.000

Lagged Output

0.147

0.039

3.777

0.002

Trend

0.130

0.029

4.475

0.000

*******************************************************

The above results are not valid (specically, they are inconsistent) if the errors are autocorrelated, since lagged endogenous variables will not be valid
instruments in that case. You might consider eliminating the lagged endogenous variables as instruments, and re-estimating by 2SLS, to obtain consistent
parameter estimates in this more complex case. Standard errors will still be
estimated inconsistently, unless use a Newey-West type covariance estimator.
Food for thought...

CHAPTER 12

Introduction to the second half

over a set
x

d f f
47C9

optimizing element of an objective function

D EFINITION 12.0.1. [Extremum estimator] An extremum estimator

available data, based on a sample of size .

be the

Well begin with study of extremum estimators in general. Let

is the

Well usually write the objective function suppressing the dependence on

Example: Least squares, linear model


Stacking observations

8 R h f  bxbxb  I  d f
 t"HCf  
fG Ppd f

d  1
8 5 2 G
p r!A@8@8974)  SctjP p d R  gf
v
x
pg

vertically,

where

8f
7

Let the d.g.p. be

The least squares

estimator is dened as

8
R I R  d

f D R `@f sD 1  eS9 E
f d
f ) d f

d
`@

We readily nd that

Example: Maximum likelihood

5
7~g} I
p
e

fq
I
@

d
 4eSf

~

$

Because the logarithmic function is strictly increasing on

The maxi-

, maximization

k
5

d f
4ig

mum likelihood estimator is dened as

8 )  p d
t99ej Yf

Suppose that the continuous random variable

of the average logarithm of the likelihood function is achieved at the same as


d

248

12. INTRODUCTION TO THE SECOND HALF

249

for the likelihood function:

8
 d

I
@
5
)
D1 5 5  4R f D1  4eS ~ g
)
d
) d f
d f
i f

Solution of the f.o.c. leads to the familiar result that

MLE estimators are asymptotically efcient (Cramr-Rao lower bound,

Theorem3), supposing the strong distributional assumptions upon which


they are based are true.
One can investigate the properties of an ML estimator supposing

that the distributional assumptions are incorrect. This gives a quasiML estimator, which well study later.
The strong distributional assumptions of MLE may be questionable

in many cases. It is possible to estimate using weaker distributional


assumptions based only on some of the moments of a random variable(s).

Example: Method of moments

from the

pd
ge

Suppose we draw a random sample of

distribution. Here,

is the parameter of interest. The rst moment (expectation),

I
Q

of a random

In this example, the relationship is the identity function

p
xgd  eQ
p d I

variable will in general be a function of the parameters of the distribution, i.e.,


.
is a moment-parameter equation.

sample rst moment is

8
q1 f

I@
I
 Q
f

p RvQ
d I

p d I
ggRvQ  Q 
I

though in general the relationship may be more complicated. The

12. INTRODUCTION TO THE SECOND HALF

250

Dene

I dI
Q q4egQ  4R%
d I
The method of moments principle is to choose the estimator of the

parameter to set the estimate of the population moment equal to the

I
d i

sample moment, i.e.,

. Then the moment-parameter equation

is inverted to solve for the parameter estimate.

I@

pd  4d %
I
f

p
gd

1 f @
I
T f

Since

8  1 gf

In this case,

by the LLN, the estimator is consistent.

More on the method of moments

r.v. is

8 p @5  p ig  g
d
d f
f
1
d d
p@5  4e
f f I
7ug @f

Dene

pd
ge

Continuing with the above example, the variance of a

The MM estimator would set

8
$

pd 5 4d

f f I
7g @f 

Again, by the LLN, the sample variance is consistent for the true vari-

 f f1 5 I
d
7g @f 

So,

1
8 p d@5
f f I
T 7ug @f

ance, that is,

12. INTRODUCTION TO THE SECOND HALF

251

which is obtained by inverting the moment-parameter equation, is


consistent.
Example: Generalized method of moments (GMM)

The previous two examples give two estimators of

which are both con-

sistent. With a given sample, the estimators will be different in general.


With two moment-parameter equations and only one parameter, we

have overidentication, which means that we have more information


than is strictly necessary for consistent estimation of the parameter.
The GMM combines information from the two moment-parameter equations to form a new estimator which will be more efcient, in general

(proof of this below).

i.e.,

I
@
81

gf pd
f
I
@
eEI%
d
1)
f


.

d I
 e%

p d I
 ge%

and

p

 dI
4REi

Clearly, when evaluated at the true parameter value

both

From the second example we dene additional moment conditions

f f d d
7gp@5  eE
8 f f1 I
d d
p@5  4R
7ug @f

and

is

dI
 s p eE%

8 f
gwd  4REi
dI

the sample average of

We already have that

d I
evi

From the rst example, dene

12. INTRODUCTION TO THE SECOND HALF

to set either

I
k d i

chose

or

The MM estimator would

In general, no single value of


d

8  4d

8 p e
d


Again, it is clear from the LLN that

252

will solve the two equations simultaneously.

and choosing

8HE4eXpt  4deS  d
d
f

 R E4e g4R%  eX
d  d I
d

" wR


 t

An example would be to choose

where

where

 d
gE4eXt

The GMM estimator is based on dening a measure of distance

is a positive denite

matrix. While its clear that the MM gives consistent estimates if there is a oneto-one relationship between parameters and moments, its not immediately
obvious that the GMM estimator is consistent. (Well see later that it is.)
These examples show that these widely used estimators may all be interpreted as the solution of an optimization problem. For this reason, the study
of extremum estimators is useful for its generality. We will see that the general
results extend smoothly to the more specialized results available for specic
estimators. After studying extremum estimators in general, we will study the
GMM estimator, then QML and NLS. The reason we study GMM rst is that
LS, IV, NLS, MLE, QML and other well-known parametric estimators may all
be interpreted as special cases of the GMM estimator, so the general results on
GMM can simplify and unify the treatment of these other estimators. Nevertheless, there are some special results on QML and NLS, and both are important in empirical research, which makes focus on them useful.

One of the focal points of the course will be nonlinear models. This is not to
suggest that linear models arent useful. Linear models are more general than

12. INTRODUCTION TO THE SECOND HALF

253

they might rst appear, since one can employ nonlinear transformations of the
variables:

r bb

G
t0P p d  $T m xxb  m  vI m n  g p m
f

 f

P
SI  uP

&


i

ts this form.

G P
tVC  I  P I 

For example,

The important point is that the model is linear in the parameters but not
necessarily linear in the variables.

In spite of this generality, situations often arise which simply can not be convincingly represented by linear in the parameters models. Also, theory that
applies to nonlinear models also applies to linear models, so one may as well
start off with the general case.
Example: Expenditure shares

f
!d  B
f
!d DC

goods is

 B
f CAB 

) B B
 I

)
} 'g B

or

B
C

for

and

8 f
B

so necessarily

 3

An expenditure share is

of

&

Roys Identity states that the quantity demanded of the

. No linear in the parameters model

with a parameter space that is dened independent of the data can

guarantee that either of these conditions holds. These constraints will often be
violated by estimated linear models, which calls into question their appropriateness in cases of this sort.
Example: Binary limited dependent variable

12. INTRODUCTION TO THE SECOND HALF

254

The referendum contingent valuation (CV) method of inferring the social
value of a project provides a simple example. This example is a special case
of more general discrete choice (or binary response) models. Individuals are
asked if they would pay an amount A for provision of a project. Indirect utility in the base case (no project) is v^0(m, z) + ε^0, where m is income and z is a
vector of other variables such as prices, personal characteristics, etc. After provision, utility is v^1(m, z) + ε^1. The random terms ε^j, j = 0, 1, reflect variations
of preferences in the population. With this, an individual agrees(1) to pay A if

ε^0 − ε^1 < v^1(m − A, z) − v^0(m, z).

Define ε = ε^0 − ε^1, let w collect m and z, and let Δv(w, A) = v^1(m − A, z) − v^0(m, z).
Define y = 1 if the consumer agrees to pay A for the change, y = 0 otherwise.
The probability of agreement is

(12.0.1)    Pr(y = 1) = F_ε[Δv(w, A)].

To make the example specific, suppose that

v^1(m, z) = α − βm
v^0(m, z) = −βm

and that ε^0 and ε^1 are i.i.d. extreme value random variables. That is, utility depends only on income, preferences in both states are homothetic, and a specific
distributional assumption is made on the distribution of preferences in
the population. With these assumptions (the details are unimportant here, see
articles by D. McFadden if you're interested) it can be shown that

Pr(y = 1) = Λ(α + βA),

where Λ(·) is the logistic distribution function,

Λ(z) = (1 + exp(−z))^{−1}.

(1) We assume here that responses are truthful, that is, there is no strategic behavior and that
individuals are able to order their preferences in this hypothetical situation.
This is the simple logit model: the choice probability is the logit function of a
linear in parameters function.
Now, y is either 0 or 1, and the expected value of y is Λ(α + βA). Thus, we
can write

y = Λ(α + βA) + η

where E(η) = 0. One could estimate this by (nonlinear) least squares,

(α̂, β̂) = arg min (1/n) Σ_t [ y_t − Λ(α + βA_t) ]².

The main point is that it is impossible that Λ(α + βA) can be written as a linear
in the parameters model, in the sense that, for arbitrary A, there are no θ, φ(A)
such that

Λ(α + βA) = φ(A)'θ, ∀A,

where φ(A) is a p-vector valued function of A and θ is a p-dimensional parameter. This is because for any θ, we can always find an A such that φ(A)'θ will be
negative or greater than 1, which is illogical, since it is the expectation of a 0/1
binary random variable. Since this sort of problem occurs often in empirical
work, it is useful to study NLS and other nonlinear models.


After discussing these estimation methods for parametric models we'll briefly
introduce nonparametric estimation methods. These methods allow one, for example, to estimate f(x_t) consistently when we are not willing to assume that a
model of the form

y_t = f(x_t) + ε_t

can be restricted to a parametric form

y_t = f(x_t, θ) + ε_t,  θ ∈ Θ,

where f(·, θ) and perhaps the distribution of ε_t are of known functional form. This is important since economic theory gives us general information about functions
and the signs of their derivatives, but not about their specific form.
Then we'll look at simulation-based methods in econometrics. These methods allow us to substitute computer power for mental power. Since computer
power is becoming relatively cheap compared to mental effort, any econometrician who lives by the principles of economic theory should be interested in
these techniques.
Finally, we'll look at how econometric computations can be done in parallel on a cluster of computers. This allows us to harness more computational
power to work with more complex models than can be dealt with using a desktop computer.

CHAPTER 13

Numeric optimization methods


Readings: Hamilton, ch. 5, section 7 (pp. 133-139); Gourieroux and Monfort, Vol. 1, ch. 13, pp. 443-60; Goffe, et al. (1994).

If we're going to be applying extremum estimators, we'll need to know
how to find an extremum. This section gives a very brief introduction to what
is a large literature on numeric optimization methods. We'll consider a few
well-known techniques, and one fairly new technique that may allow one to
solve difficult problems. The main objective is to become familiar with the
issues, and to learn how to use the BFGS algorithm at the practical level.

The general problem we consider is how to find the maximizing element θ̂
(a K-vector) of a function s(θ). This function may not be continuous, and
it may not be differentiable. Even if it is twice continuously differentiable, it
may not be globally concave, so local maxima, minima and saddlepoints may
all exist. Supposing s(θ) were a quadratic function of θ, e.g.,

s(θ) = a + b'θ + (1/2) θ'Cθ,

the first order conditions would be linear:

D_θ s(θ) = b + Cθ,

so the maximizing (minimizing) element would be θ̂ = −C^{−1}b. This is the sort
of problem we have with linear models estimated by OLS. It's also the case for

feasible GLS, since conditional on the estimate of the varcov matrix, we have
a quadratic objective function in the remaining parameters.
More general problems will not have linear f.o.c., and we will not be able
to solve for the maximizer analytically. This is when we need a numeric optimization method.
13.1. Search
The idea is to create a grid over the parameter space and evaluate the function at each point on the grid. Select the best point. Then refine the grid in
the neighborhood of the best point, and continue until the accuracy is good
enough. See Figure 13.1.1. One has to be careful that the grid is fine enough
in relationship to the irregularity of the function to ensure that sharp peaks are
not missed entirely.

To check q values in each dimension of a K-dimensional parameter space,
we need to check q^K points. For example, if q = 100 and K = 10, there would
be 100^{10} = 10^{20} points to check. If 1000 points can be checked in a second, it would
take roughly 3.2 × 10^9 years to perform the calculations, which is approximately the
age of the earth. The search method is a very reasonable choice if K is small,
but it quickly becomes infeasible if K is moderate or large.
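To fix ideas, here is a minimal Octave sketch of grid search with refinement for a scalar parameter (the objective function, grid sizes and tolerances are illustrative assumptions, not taken from the course programs):

% grid search with refinement, for a scalar parameter
f = @(theta) -(theta - 0.3).^2 + 0.01*cos(50*theta);   % illustrative objective
lo = -1; hi = 1;
for iter = 1:20
  grid = linspace(lo, hi, 1000);            % evaluate on a grid
  [fbest, i] = max(f(grid));
  best = grid(i);
  width = (hi - lo)/1000;
  lo = best - 5*width; hi = best + 5*width; % refine around the best point so far
endfor
printf("approximate maximizer: %f, value: %f\n", best, fbest);

If the grid is too coarse relative to the wiggles of the objective, a sharp peak can be missed entirely, which is the caveat noted above.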

13.2. Derivative-based methods


13.2.1. Introduction. Derivative-based methods are defined by

(1) the method for choosing the initial value, θ^1;
(2) the iteration method for choosing θ^{k+1} given θ^k (based upon derivatives);
(3) the stopping criterion.


Figure 13.1.1: The search method

The iteration method can be broken into two problems: choosing the stepsize
a_k (a scalar) and choosing the direction of movement, d_k, which is of the same
dimension as θ, so that

θ^{k+1} = θ^k + a_k d_k.

A locally increasing direction of search d is a direction such that

∂s(θ + ad)/∂a > 0

for a positive but small. That is, if we go in direction d, we will improve on the
objective function, at least if we don't go too far in that direction.


As long as the gradient at θ is not zero there exist increasing directions,
and they can all be represented as Q_k g(θ^k), where Q_k is a symmetric positive definite
matrix and g(θ) = D_θ s(θ) is the gradient at θ. To see this, take a Taylor series
expansion around a^0 = 0:

s(θ + ad) = s(θ + 0d) + a g(θ + 0d)'d + o(1) = s(θ) + a g(θ)'d + o(1).

For small enough a the o(1) term can be ignored. If d is to be an increasing
direction, we need g(θ)'d > 0. Defining d = Qg(θ), where Q is positive definite,
we guarantee that

g(θ)'d = g(θ)'Qg(θ) > 0

unless g(θ) = 0. Every increasing direction can be represented in this
way (p.d. matrices are those such that the angle between g(θ) and Qg(θ)
is less than 90 degrees). See Figure 13.2.1.

With this, the iteration rule becomes

θ^{k+1} = θ^k + a_k Q_k g(θ^k)

and we keep going until the gradient becomes zero, so that there is no increasing direction. The problem is how to choose a and Q.

- Conditional on Q, choosing a is fairly straightforward. A simple line search is an attractive possibility, since a is a scalar.
- The remaining problem is how to choose Q.
- Note also that this gives no guarantees to find a global maximum.


Figure 13.2.1: Increasing directions of search

13.2.2. Steepest descent. Steepest descent (ascent if we're maximizing) just
sets Q to an identity matrix, since the gradient provides the direction of maximum rate of change of the objective function.

- Advantages: fast - doesn't require anything more than first derivatives.
- Disadvantages: this doesn't always work too well, however (draw a picture of a banana-shaped function).
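As a concrete illustration, a bare-bones steepest ascent iteration with a step-halving line search might look as follows in Octave (the quadratic objective is purely illustrative):

1;
function [s, g] = obj(theta)   % illustrative objective and its analytic gradient
  s = -(theta(1) - 1)^2 - 10*(theta(2) + 0.5)^2;
  g = [-2*(theta(1) - 1); -20*(theta(2) + 0.5)];
endfunction

theta = [0; 0];
for k = 1:100
  [s, g] = obj(theta);
  if norm(g) < 1e-6, break; endif
  a = 1;                         % Q = I: the search direction is the gradient itself
  while obj(theta + a*g) < s     % halve the stepsize until the objective improves
    a = a/2;
  endwhile
  theta = theta + a*g;
endfor
theta

On an elongated (banana-shaped) objective the same code zig-zags and converges slowly, which is the disadvantage mentioned above.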


13.2.3. Newton-Raphson. The Newton-Raphson method uses information
about the slope and curvature of the objective function to determine which direction and how far to move from an initial point. Supposing we're trying
to maximize s_n(θ), take a second order Taylor's series approximation of s_n(θ)
about θ^k (an initial guess):

s_n(θ) ≈ s_n(θ^k) + g(θ^k)'(θ − θ^k) + (1/2)(θ − θ^k)'H(θ^k)(θ − θ^k).

To attempt to maximize s_n(θ), we can maximize the portion of the right-hand
side that depends on θ, i.e., we can maximize

s̃(θ) = g(θ^k)'θ + (1/2)(θ − θ^k)'H(θ^k)(θ − θ^k)

with respect to θ. This is a much easier problem, since it is a quadratic function
in θ, so it has linear first order conditions. These are

D_θ s̃(θ) = g(θ^k) + H(θ^k)(θ − θ^k).

So the solution for the next round estimate is

θ^{k+1} = θ^k − H(θ^k)^{−1} g(θ^k).

This is illustrated in Figure 13.2.2.

Figure 13.2.2: Newton-Raphson method

However, it's good to include a stepsize, since the approximation to s_n(θ)
may be bad far away from the maximizer θ̂, so the actual iteration formula is

θ^{k+1} = θ^k − a_k H(θ^k)^{−1} g(θ^k).

- A potential problem is that the Hessian may not be negative definite
when we're far from the maximizing point. So −H(θ^k)^{−1} may not be
positive definite, and −H(θ^k)^{−1}g(θ^k) may not define an increasing direction of search. This can happen when the objective function has flat
regions, in which case the Hessian matrix is very ill-conditioned (e.g.,
is nearly singular), or when we're in the vicinity of a local minimum,
where H(θ^k) is positive definite, and our direction is a decreasing direction
of search. Matrix inverses computed by computers are subject to large errors
when the matrix is ill-conditioned. Also, we certainly don't want to
go in the direction of a minimum when we're maximizing. To solve
this problem, quasi-Newton methods simply add a positive definite
component to H(θ) to ensure that the resulting matrix is positive definite, e.g., Q = −H(θ) + bI, where b is chosen large enough so that Q
is well-conditioned and positive definite. This has the benefit that
improvement in the objective function is guaranteed.

- Another variation of quasi-Newton methods is to approximate the
Hessian by using successive gradient evaluations. This avoids actual
calculation of the Hessian, which is an order of magnitude (in the dimension of the parameter vector) more costly than calculation of the
gradient. They can be done to ensure that the approximation is p.d.
DFP and BFGS are two well-known examples.
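A sketch of the Newton-Raphson iteration with a stepsize, for an illustrative quadratic objective with analytic derivatives (assumed here for simplicity; this is not the course's bfgsmin):

1;
function [g, H] = derivs(theta)  % gradient and Hessian of s(theta) = -theta(1)^2 - 3*theta(2)^2 + 2*theta(1)*theta(2)
  g = [-2*theta(1) + 2*theta(2); 2*theta(1) - 6*theta(2)];
  H = [-2 2; 2 -6];
endfunction

theta = [1; 1];
for k = 1:50
  [g, H] = derivs(theta);
  if norm(g) < 1e-8, break; endif
  a = 1;                          % stepsize; in practice chosen by a line search
  theta = theta - a * (H \ g);    % Newton step: theta(k+1) = theta(k) - a*inv(H)*g
endfor
theta

Since this objective is exactly quadratic with a negative definite Hessian, the first Newton step already lands on the maximizer; on non-quadratic objectives the safeguards discussed above (stepsizes, positive definite corrections) become important.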

Stopping criteria

The last thing we need is to decide when to stop. A digital computer is
subject to limited machine precision and round-off errors. For these reasons,
it is unreasonable to hope that a program can exactly find the point that maximizes a function. We need to define acceptable tolerances. Some stopping
criteria are:

- Negligible change in parameters: |θ_j^k − θ_j^{k−1}| < ε_1, ∀j
- Negligible relative change: |(θ_j^k − θ_j^{k−1}) / θ_j^{k−1}| < ε_2, ∀j
- Negligible change of function: |s(θ^k) − s(θ^{k−1})| < ε_3
- Gradient negligibly different from zero: |g_j(θ^k)| < ε_4, ∀j


- Or, even better, check all of these.
- Also, if we're maximizing, it's good to check that the last round (real,
not approximate) Hessian is negative definite.

Starting values

The Newton-Raphson and related algorithms work well if the objective
function is concave (when maximizing), but not so well if there are convex
regions and local minima or multiple local maxima. The algorithm may converge to a local minimum or to a local maximum that is not optimal. The
algorithm may also have difficulties converging at all.

- The usual way to ensure that a global maximum has been found
is to use many different starting values, and choose the solution that
returns the highest objective function value. THIS IS IMPORTANT
in practice. More on this later.

Calculating derivatives

The Newton-Raphson algorithm requires first and second derivatives. It
is often difficult to calculate derivatives (especially the Hessian) analytically if
the function s_n(θ) is complicated. Possible solutions are to calculate derivatives
numerically, or to use programs such as MuPAD or Mathematica to calculate
analytic derivatives. For example, Figure 13.2.3 shows MuPAD(1) calculating a
derivative that I didn't know off the top of my head, and one that I did know.

- Numeric derivatives are less accurate than analytic derivatives, and
are usually more costly to evaluate. Both factors usually cause optimization programs to be less successful when numeric derivatives are
used.
(1) MuPAD is not a freely distributable program, so it's not on the CD. You can download it from
http://www.mupad.de/download.shtml


Figure 13.2.3: Using MuPAD to get analytic derivatives

- One advantage of numeric derivatives is that you don't have to worry
about having made an error in calculating the analytic derivative. When
programming analytic derivatives it's a good idea to check that they
are correct by using numeric derivatives. This is a lesson I learned the
hard way when writing my thesis.
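For instance, a central-difference numeric gradient (a sketch; the function handle, the point and the perturbation size below are assumptions) can be used to check an analytic gradient:

1;
function g = numgrad(f, theta)
  h = 1e-6;                                 % perturbation; should match the scale of theta
  g = zeros(size(theta));
  for j = 1:numel(theta)
    e = zeros(size(theta)); e(j) = h;
    g(j) = (f(theta + e) - f(theta - e)) / (2*h);   % central difference
  endfor
endfunction

% example: compare an analytic gradient with the numeric one
f = @(t) -(t(1)^2 + 2*t(2)^2);
analytic = @(t) [-2*t(1); -4*t(2)];
theta = [0.5; -1];
max(abs(numgrad(f, theta) - analytic(theta)))       % should be close to zero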
- Numeric second derivatives are much more accurate if the data are
scaled so that the elements of the gradient are of the same order of
magnitude. Example: if the model is y_t = h(αx_t + βz_t) + ε_t, and estimation is by NLS, suppose that D_α s_n(·) = 1000 and D_β s_n(·) = 0.001.
One could define α* = α/1000, x_t* = 1000 x_t, β* = 1000 β and z_t* = z_t/1000.
In this case, the gradients D_{α*} s_n(·) and D_{β*} s_n(·) will both be 1.

- In general, estimation programs always work better if data is scaled
in this way, since roundoff errors are less likely to become important.
This is important in practice.

- There are algorithms (such as BFGS and DFP) that use the sequential gradient evaluations to build up an approximation to the Hessian.
The iterations are faster for this reason since the actual Hessian isn't
calculated, but more iterations usually are required for convergence.

- Switching between algorithms during iterations is sometimes useful.

13.3. Simulated Annealing


Simulated annealing is an algorithm which can find an optimum in the
presence of nonconcavities, discontinuities and multiple local minima/maxima.
Basically, the algorithm randomly selects evaluation points, accepts all points
that yield an increase in the objective function, but also accepts some points
that decrease the objective function. This allows the algorithm to escape from
local minima. As more and more points are tried, periodically the algorithm
focuses on the best point so far, and reduces the range over which random
points are generated. Also, the probability that a negative move is accepted
is reduced. The algorithm relies on many evaluations, as in the search method,
but focuses in on promising areas, which reduces function evaluations with
respect to the search method. It does not require derivatives to be evaluated. I
have a program to do this if you're interested.


13.4. Examples
This section gives a few examples of how some nonlinear models may be
estimated using maximum likelihood.
13.4.1. Discrete Choice: The logit model. In this section we will consider
maximum likelihood estimation of the logit model for binary 0/1 dependent
variables. We will use the BFGS algorithm to find the MLE.
We saw an example of a binary choice model in equation 12.0.1. A more
general representation is

y* = g(x) − ε
y = 1(y* > 0)
Pr(y = 1) = F_ε[g(x)] ≡ p(x, θ).

The log-likelihood function is

s_n(θ) = (1/n) Σ_{i=1}^n [ y_i ln p(x_i, θ) + (1 − y_i) ln(1 − p(x_i, θ)) ].

For the logit model (see the contingent valuation example above), the probability has the specific form

p(x, θ) = 1 / (1 + exp(−x'θ)).

You should download and examine LogitDGP.m, which generates data
according to the logit model, logit.m, which calculates the loglikelihood, and
EstimateLogit.m, which sets things up and calls the estimation routine, which
uses the BFGS algorithm.
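A minimal sketch of the kind of computation logit.m performs is the following (this is illustrative code, not the actual file, and the variable layout and true parameter values are assumptions):

1;
function lnL = logit_loglik(theta, y, x)
  % average log-likelihood of the logit model; y is n x 1 (0/1), x is n x K
  p = 1 ./ (1 + exp(-x*theta));                     % choice probabilities
  lnL = mean(y .* log(p) + (1 - y) .* log(1 - p));
endfunction

% example usage with simulated data
n = 100; x = [ones(n,1) randn(n,1)]; theta0 = [0.5; 1];
y = rand(n,1) < 1 ./ (1 + exp(-x*theta0));          % draws from a logit DGP
logit_loglik(theta0, y, x)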

Here are some estimation results with n = 100 and the true parameter values set by LogitDGP.m:

***********************************************
Trial of MLE estimation of Logit model

MLE Estimation Results
BFGS convergence: Normal convergence

Average Log-L: 0.607063
Observations: 100

           estimate   st. err   t-stat    p-value
constant   0.5400     0.2229    2.4224    0.0154
slope      0.7566     0.2374    3.1863    0.0014

Information Criteria
CAIC : 132.6230
BIC  : 130.6230
AIC  : 125.4127
***********************************************

The estimation program is calling mle_results(), which in turn calls
a number of other routines. These functions are part of the octave-forge
repository.
13.4.2. Count Data: The Poisson model. Demand for health care is usually thought of as a derived demand: health care is an input to a home production function that produces health, and health is an argument of the utility
function. Grossman (1972), for example, models health as a capital stock that
is subject to depreciation (e.g., the effects of ageing). Health care visits restore
the stock. Under the home production framework, individuals decide when to
make health care visits to maintain their health stock, or to deal with negative
shocks to the stock in the form of accidents or illnesses. As such, individual


demand will be a function of the parameters of the individuals' utility functions.

The MEPS health data file, meps1996.data, contains 4564 observations
on six measures of health care usage. The data is from the 1996 Medical Expenditure Panel Survey (MEPS). You can get more information at http://www.meps.ahrq.gov/.
The six measures of use are office-based visits (OBDV), outpatient visits (OPV), inpatient visits (IPV), emergency room visits (ERV), dental visits
(VDV), and number of prescription drugs taken (PRESCR). These form columns
1 - 6 of meps1996.data. The conditioning variables are public insurance
(PUBLIC), private insurance (PRIV), sex (SEX), age (AGE), years of education
(EDUC), and income (INCOME). These form columns 7 - 12 of the file, in the
order given here. PRIV and PUBLIC are 0/1 binary variables, where a 1 indicates that the person has access to public or private insurance coverage. SEX
is also 0/1, where 1 indicates that the person is female. This data will be used
in examples fairly extensively in what follows.
The program ExploreMEPS.m shows how the data may be read in, and
gives some descriptive information about variables, which follows:
All of the measures of use are count data, which means that they take on
the values {0, 1, 2, ...}. It might be reasonable to try to use this information by
specifying the density as a count data density. One of the simplest count data
densities is the Poisson density, which is

f_Y(y) = exp(−λ) λ^y / y!.

The Poisson average log-likelihood function is

s_n(θ) = (1/n) Σ_{i=1}^n [ −λ_i + y_i ln λ_i − ln y_i! ].


We will parameterize the model as

λ_i = exp(x_i'β)
x_i = [1, PUBLIC_i, PRIV_i, SEX_i, AGE_i, EDUC_i, INC_i]'.

This ensures that the mean is positive, as is required for the Poisson model.
Note that for this parameterization

∂E(y|x)/∂x_j · x_j / E(y|x) = β_j x_j,

so β_j x_j is the elasticity of the conditional mean of y with respect to the j-th conditioning
variable.
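A sketch of the corresponding average log-likelihood function (hypothetical code, not the actual EstimatePoisson.m):

1;
function lnL = poisson_loglik(theta, y, x)
  % average log-likelihood for the Poisson model with lambda_i = exp(x_i'theta)
  lambda = exp(x*theta);
  lnL = mean(-lambda + y .* log(lambda) - gammaln(y + 1));   % gammaln(y+1) = log(y!)
endfunction

Maximizing this function over theta with an algorithm such as BFGS gives the ML estimates.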

The program EstimatePoisson.m estimates a Poisson model using the full
data set. The results of the estimation, using OBDV as the dependent variable,
are here:
MPITB extensions found

OBDV

******************************************************
Poisson model, MEPS 1996 full data set

MLE Estimation Results


BFGS convergence: Normal convergence

Average Log-L: -3.671090
Observations: 4564

             estimate   st. err   t-stat     p-value
constant     -0.791     0.149     -5.290     0.000
pub. ins.     0.848     0.076     11.093     0.000
priv. ins.    0.294     0.071      4.137     0.000
sex           0.487     0.055      8.797     0.000
age           0.024     0.002     11.471     0.000
edu           0.029     0.010      3.061     0.002
inc          -0.000     0.000     -0.978     0.328

Information Criteria
CAIC : 33575.6881    Avg. CAIC: 7.3566
BIC  : 33568.6881    Avg. BIC:  7.3551
AIC  : 33523.7064    Avg. AIC:  7.3452

******************************************************

13.5. Duration data and the Weibull model


In some cases the dependent variable may be the time that passes between
the occurrence of two events. For example, it may be the duration of a strike,
or the time needed to find a job once one is unemployed. Such variables take
on values on the positive real line, and are referred to as duration data.

A spell is the period of time between the occurrence of the initial event and the
concluding event. For example, the initial event could be the loss of a job, and
the final event is the finding of a new job. The spell is the period of unemployment.
Let t_0 be the time the initial event occurs, and t_1 be the time the concluding event occurs. For simplicity, assume that time is measured in years. The
random variable D is the duration of the spell, D = t_1 − t_0. Define the density
function of D, f_D(t), with distribution function F_D(t) = Pr(D < t).

Several questions may be of interest. For example, one might wish to know
the expected time one has to wait to find a job given that one has already
waited s years. The probability that a spell lasts more than s years is

Pr(D > s) = 1 − Pr(D ≤ s) = 1 − F_D(s).

The density of D conditional on the spell already having lasted s years is

f_D(t | D > s) = f_D(t) / [1 − F_D(s)].

The expected additional time required for the spell to end given that it has
already lasted s years is the expectation of D with respect to this density, minus s:

E = E(D | D > s) − s = [ ∫_s^∞ z f_D(z) / (1 − F_D(s)) dz ] − s.

To estimate this function, one needs to specify the density f_D(t) as a parametric density, then estimate by maximum likelihood. There are a number of
possibilities including the exponential density, the lognormal, etc. A reasonably flexible model that is a generalization of the exponential density is the
Weibull density

f_D(t|θ) = exp[−(λt)^γ] λγ(λt)^{γ−1}.

According to this model, E(D) = λ^{−1} Γ(1 + 1/γ). The log-likelihood is just the sum of
the log densities.
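A sketch of the Weibull average log-likelihood under this parameterization (illustrative only; the ordering of the parameters in theta is an assumption):

1;
function lnL = weibull_loglik(theta, t)
  % theta = [lambda; gamma], t is the n x 1 vector of observed durations
  lambda = theta(1); gam = theta(2);
  lnd = -(lambda*t).^gam + log(lambda) + log(gam) + (gam - 1)*log(lambda*t);
  lnL = mean(lnd);      % average log density
endfunction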

To illustrate application of this model, 402 observations on the lifespan of
mongooses in Serengeti National Park (Tanzania) were used to fit a Weibull
model. The spell in this case is the lifetime of an individual mongoose. The
estimated parameters λ̂ and γ̂ (with their standard errors) are reported in the
original output, and the log-likelihood value is -659.3. Figure 13.5.1 presents fitted
life expectancy (expected additional years of life) as a function of age, with 95%
confidence bands. The plot is accompanied by a nonparametric Kaplan-Meier
estimate of life-expectancy. This nonparametric estimator simply averages all
spell lengths greater than age, and then subtracts age. This is consistent by the
LLN.

In the figure one can see that the model doesn't fit the data well, in that it
predicts life expectancy quite differently than does the nonparametric model.
For ages 4-6, the nonparametric estimate is outside the confidence interval that
results from the parametric model, which casts doubt upon the parametric
model. Mongooses that are between 2-6 years old seem to have a lower life
expectancy than is predicted by the Weibull model, whereas young mongooses
that survive beyond infancy have a higher life expectancy, up to a bit beyond
2 years. Due to the dramatic change in the death rate as a function of t, one
might specify f_D(t) as a mixture of two Weibull densities,

f_D(t|θ) = δ { exp[−(λ_1 t)^{γ_1}] λ_1 γ_1 (λ_1 t)^{γ_1−1} } + (1 − δ) { exp[−(λ_2 t)^{γ_2}] λ_2 γ_2 (λ_2 t)^{γ_2−1} }.

Figure 13.5.1: Life expectancy of mongooses, Weibull model

The parameters γ_i and λ_i, i = 1, 2, are the parameters of the two Weibull densities, and δ is the parameter that mixes the two.

With the same data, θ can be estimated using the mixed model. The result
is a log-likelihood of -623.17. Note that a standard likelihood ratio test cannot be used to choose between the two models, since under the null that δ = 1
(single density), the two parameters λ_2 and γ_2 are not identified. It is possible to take this into account, but this topic is out of the scope of this course.
Nevertheless, the improvement in the likelihood function is considerable. The
parameter estimates are


Parameter   Estimate   St. Error
δ̂           0.233      0.016
λ̂_1         1.722      0.166
γ̂_1         1.731      0.101
λ̂_2         1.522      0.096
γ̂_2         0.428      0.035

Note that the mixture parameter is highly significant. This model leads to
the fit in Figure 13.5.2. Note that the parametric and nonparametric fits are
quite close to one another, up to around 6 years. The disagreement after this
point is not too important, since less than 5% of mongooses live more than 6
years, which implies that the Kaplan-Meier nonparametric estimate has a high
variance (since it's an average of a small number of observations).
Mixture models are often an effective way to model complex responses,
though they can suffer from overparameterization. Alternatives will be discussed later.
13.6. Numeric optimization: pitfalls
In this section we'll examine two common problems that can be encountered when doing numeric optimization of nonlinear models, and some solutions.
13.6.1. Poor scaling of the data. When the data is scaled so that the magnitudes of the first and second derivatives are of different orders, problems can
easily result. If we uncomment the appropriate line in EstimatePoisson.m, the
data will not be scaled, and the estimation program will have difficulty converging (it seems to take an infinite amount of time). With unscaled data, the
elements of the score vector have very different magnitudes at the initial value
of θ (all zeros).

Figure 13.5.2: Life expectancy of mongooses, mixed Weibull model

To see this, run CheckScore.m. With unscaled data, one element
of the gradient is very large, and the maximum and minimum elements are 5
orders of magnitude apart. This causes convergence problems due to serious
numerical inaccuracy when doing inversions to calculate the BFGS direction
of search. With scaled data, none of the elements of the gradient are very
large, and the maximum difference in orders of magnitude is 3. Convergence
is quick.
13.6.2. Multiple optima. Multiple optima (one global, others local) can
complicate life, since we have limited means of determining if there is a higher
maximum than the one we're at. Think of climbing a mountain in an unknown
range, in a very foggy place (Figure 13.6.1). You can go up until there's nowhere
else to go up, but since you're in the fog you don't know if the true summit
is across the gap that's at your feet. Do you claim victory and go home, or do
you trudge down the gap and explore the other side?

Figure 13.6.1: A foggy mountain

The best way to avoid stopping at a local maximum is to use many starting
values, for example on a grid, or randomly generated. Or perhaps one might
have priors about possible values for the parameters (e.g., from previous studies of similar data).


Let's try to find the true minimizer of minus 1 times the foggy mountain
function (since the algorithms are set up to minimize). From the picture, you
can see it's close to (0.037, 0), but let's pretend there is fog, and that we don't know
that. The program FoggyMountain.m shows that poor start values can lead to
problems. It uses SA, which finds the true global minimum, and it shows that
BFGS using a battery of random start values can also find the global minimum.
The output of one run is here:
MPITB extensions found

======================================================
BFGSMIN final results

Used numeric gradient

------------------------------------------------------
STRONG CONVERGENCE
Function conv 1  Param conv 1  Gradient conv 1
------------------------------------------------------
Objective function value -0.0130329
Stepsize 0.102833
43 iterations
------------------------------------------------------

    param    gradient     change
  15.9999     -0.0000     0.0000
 -28.8119      0.0000     0.0000


The result with poor start values is

ans =

   16.000  -28.812

================================================
SAMIN final results
NORMAL CONVERGENCE

Func. tol. 1.000000e-10  Param. tol. 1.000000e-03
Obj. fn. value -0.100023

   parameter   search width
    0.037419       0.000018
   -0.000000       0.000051
================================================

Now try a battery of random start values and
a short BFGS on each, then iterate to convergence.
The result using 20 random start values:

ans =

   3.7417e-02   2.7628e-07

The true maximizer is near (0.037, 0).


In that run, the single BFGS run with bad start values converged to a point far
from the true minimizer, while simulated annealing and BFGS using a battery
of random start values both found the true maximizer. The moral of the story is:
be cautious and don't publish your results too quickly.


Exercises
(1) In Octave, type help bfgsmin_example, to find out the location of the
file. Edit the file to examine it and learn how to call bfgsmin. Run it, and
examine the output.
(2) In Octave, type help samin_example, to find out the location of the
file. Edit the file to examine it and learn how to call samin. Run it, and
examine the output.
(3) Using logit.m and EstimateLogit.m as templates, write a function to calculate the probit loglikelihood, and a script to estimate a probit model. Run
it using data that actually follows a logit model (you can generate it in the
same way that is done in the logit example).
(4) Study mle_results.m to see what it does. Examine the functions that
mle_results.m calls, and in turn the functions that those functions call.
Write a complete description of how the whole chain works.
(5) Look at the Poisson estimation results for the OBDV measure of health care
use and give an economic interpretation. Estimate Poisson models for the
other 5 measures of health care usage.

CHAPTER 14

Asymptotic properties of extremum estimators


Readings: Gourieroux and Monfort (1995), Vol. 2, Ch. 24; Amemiya, Ch. 4, section 4.1; Davidson and MacKinnon, pp. 591-96; Gallant, Ch. 3; Newey and McFadden (1994), "Large Sample Estimation and Hypothesis Testing," in Handbook of Econometrics, Vol. 4, Ch. 36.
Handbook of Econometrics, Vol. 4, Ch. 36.

14.1. Extremum estimators

In Definition 12.0.1 we defined an extremum estimator θ̂ as the optimizing
element of an objective function s_n(θ) over a set Θ. Let the objective function
s_n(Z_n, θ) depend upon an n × p random matrix Z_n = [z_1 z_2 ⋯ z_n]', where the z_t
are p-vectors and p is finite.

EXAMPLE 18. Given the model y_i = x_i'θ + ε_i, with n observations, define
z_i = (y_i, x_i')'. The OLS estimator minimizes

s_n(Z_n, θ) = (1/n) Σ_{i=1}^n (y_i − x_i'θ)² = (1/n) ‖Y − Xθ‖²,

where Y and X are defined similarly to Z_n.


14.2. Consistency

The following theorem is patterned on a proof in Gallant (1987) (the article,
ref. later), which we'll see in its original form later in the course. It is interesting to compare the following proof with Amemiya's Theorem 4.1.1, which is
done in terms of convergence in probability.

THEOREM 19. [Consistency of e.e.] Suppose that θ̂_n is obtained by maximizing s_n(θ) over Θ. Assume
(1) Compactness: The parameter space Θ is an open bounded subset of Euclidean
space ℝ^K. The closure of Θ, Θ̄, is compact.
(2) Uniform Convergence: There is a nonstochastic function s_∞(θ) that is
continuous in θ on Θ̄ such that

lim_{n→∞} sup_{θ∈Θ̄} |s_n(θ) − s_∞(θ)| = 0, a.s.

(3) Identification: s_∞(·) has a unique global maximum at θ^0 ∈ Θ, i.e.,
s_∞(θ^0) > s_∞(θ), ∀θ ≠ θ^0, θ ∈ Θ̄.
Then θ̂_n → θ^0, a.s.

Proof: Select a ω ∈ Ω and hold it fixed. Then {s_n(ω, θ)} is a fixed sequence
of functions. Suppose that ω is such that s_n(ω, θ) converges uniformly to s_∞(θ).
This happens with probability one by assumption (2). The sequence {θ̂_n} lies
in the compact set Θ̄, by assumption (1) and the fact that maximization is over
Θ̄. Since every sequence from a compact set has at least one limit point (Davidson, Thm. 2.12), say that θ̂ is a limit point of {θ̂_n}. There is a subsequence {θ̂_{n_m}}
({n_m} is simply a sequence of increasing integers) with lim_{m→∞} θ̂_{n_m} = θ̂. By
uniform convergence and continuity,

lim_{m→∞} s_{n_m}(θ̂_{n_m}) = s_∞(θ̂).

To see this, first of all, select an element θ̂_t from the sequence {θ̂_{n_m}}. Then
uniform convergence implies

lim_{m→∞} s_{n_m}(θ̂_t) = s_∞(θ̂_t).

Continuity of s_∞(·) implies that

lim_{t→∞} s_∞(θ̂_t) = s_∞(θ̂),

since the limit as t → ∞ of {θ̂_t} is θ̂. So the above claim is true.
Next, by maximization,

s_{n_m}(θ̂_{n_m}) ≥ s_{n_m}(θ^0),

which holds in the limit, so

lim_{m→∞} s_{n_m}(θ̂_{n_m}) ≥ lim_{m→∞} s_{n_m}(θ^0).

However,

lim_{m→∞} s_{n_m}(θ̂_{n_m}) = s_∞(θ̂),

as seen above, and

lim_{m→∞} s_{n_m}(θ^0) = s_∞(θ^0)

by uniform convergence, so

s_∞(θ̂) ≥ s_∞(θ^0).

But by assumption (3), there is a unique global maximum of s_∞(θ) at θ^0, so
we must have s_∞(θ̂) = s_∞(θ^0) and θ̂ = θ^0. Finally, all of the above limits hold
almost surely, since so far we have held ω fixed, but now we need to consider
all ω ∈ Ω. Therefore {θ̂_n} has only one limit point, θ^0, except on a set C ⊂ Ω
with P(C) = 0.
Discussion of the proof:

- This proof relies on the identification assumption of a unique global
maximum at θ^0. An equivalent way to state this is
(c) Identification: Any point θ in Θ̄ with s_∞(θ) ≥ s_∞(θ^0) must have
|θ − θ^0| = 0,
which matches the way we will write the assumption in the section on nonparametric inference.
- We assume that θ̂_n is in fact a global maximum of s_n(θ). It is not required to be unique for n finite, though the identification assumption
requires that the limiting objective function have a unique maximizing argument. The next section on numeric optimization methods will
show that actually finding the global maximum of s_n(θ) may be a non-trivial problem.
- See Amemiya's Example 4.1.4 for a case where discontinuity leads to
breakdown of consistency.
- The assumption that θ^0 is in the interior of Θ (part of the identification assumption) has not been used to prove consistency, so we could
directly assume that θ^0 is simply an element of a compact set Θ. The
reason that we assume it's in the interior here is that this is necessary
for subsequent proof of asymptotic normality, and I'd like to maintain
a minimal set of simple assumptions, for clarity. Parameters on the
boundary of the parameter set cause theoretical difficulties that we
will not deal with in this course. Just note that conventional hypothesis testing methods do not apply in this case.
- Note that s_n(θ) is not required to be continuous, though s_∞(θ) is.
- The following figures illustrate why uniform convergence is important.

With uniform convergence, the maximum of the sample
objective function eventually must be in the neighborhood
of the maximum of the limiting objective function.

With pointwise convergence, the sample objective function
may have its maximum far away from that of the limiting
objective function.

We need a uniform strong law of large numbers in order to verify assumption (2) of Theorem 19. The following theorem is from Davidson, pg. 337.

THEOREM 20. [Uniform Strong LLN] Let {G_n(θ)} be a sequence of stochastic
real-valued functions on a totally-bounded metric space (Θ, ρ). Then

sup_{θ∈Θ} |G_n(θ)| → 0, a.s.,

if and only if
(a) G_n(θ) → 0, a.s., for each θ ∈ Θ_0, where Θ_0 is a dense subset of Θ, and
(b) {G_n(θ)} is strongly stochastically equicontinuous.

- The metric space we are interested in now is simply Θ ⊂ ℝ^K, using
the Euclidean norm.
- The pointwise almost sure convergence needed for assumption (a) comes
from one of the usual SLLNs.


Stronger assumptions that imply those of the theorem are:

- the parameter space is compact (this has already been assumed);
- the objective function is continuous and bounded with probability one on the entire parameter space;
- a standard SLLN can be shown to apply to some point in the parameter space.

These are reasonable conditions in many cases, and henceforth when
dealing with specific estimators we'll simply assume that pointwise
almost sure convergence can be extended to uniform almost sure convergence in this way.

- The more general theorem is useful in the case that the limiting objective function can be continuous in θ even if s_n(θ) is discontinuous.
This can happen because discontinuities may be smoothed out as we
take expectations over the data. In the section on simulation-based estimation we will see a case of a discontinuous objective function.

14.3. Example: Consistency of Least Squares

We suppose that data is generated by random sampling of (y, w), where
y_t = α^0 + β^0 w_t + ε_t. (w_t, ε_t) has the common distribution function μ_w μ_ε (w and
ε are independent) with support W × E. Suppose that the variances σ²_w and σ²_ε
are finite. Let θ^0 = (α^0, β^0)' ∈ Θ, for which Θ is compact. Let x_t = (1, w_t)', so
we can write y_t = x_t'θ^0 + ε_t.

The sample objective function for a sample size n is

s_n(θ) = (1/n) Σ_{t=1}^n (y_t − x_t'θ)² = (1/n) Σ_{t=1}^n (x_t'θ^0 + ε_t − x_t'θ)²
       = (1/n) Σ_{t=1}^n [x_t'(θ^0 − θ)]² + (2/n) Σ_{t=1}^n x_t'(θ^0 − θ) ε_t + (1/n) Σ_{t=1}^n ε_t².

- Considering the last term, by the SLLN,

(1/n) Σ_{t=1}^n ε_t² → ∫_W ∫_E ε² dμ_W dμ_E = σ²_ε, a.s.

- Considering the second term, since E(ε) = 0 and w and ε are independent, the SLLN implies that it converges to zero.
- Finally, for the first term, for a given θ, we assume that a SLLN applies
so that

(14.3.1)
(1/n) Σ_{t=1}^n [x_t'(θ^0 − θ)]² → ∫_W [x'(θ^0 − θ)]² dμ_W
  = (α^0 − α)² + 2(α^0 − α)(β^0 − β) ∫_W w dμ_W + (β^0 − β)² ∫_W w² dμ_W
  = (α^0 − α)² + 2(α^0 − α)(β^0 − β) E(w) + (β^0 − β)² E(w²), a.s.

Finally, the objective function is clearly continuous, and the parameter space
is assumed to be compact, so the convergence is also uniform. Thus,

s_∞(θ) = (α^0 − α)² + 2(α^0 − α)(β^0 − β) E(w) + (β^0 − β)² E(w²) + σ²_ε.

A minimizer of this is clearly α = α^0, β = β^0.

EXERCISE 21. Show that in order for the above solution to be unique it is
necessary that E(w²) ≠ 0. Discuss the relationship between this condition and
the problem of colinearity of regressors.


This example shows that Theorem 19 can be used to prove strong consistency of the OLS estimator. There are easier ways to show this, of course - this
is only an example of application of the theorem.
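To see the result at work, one can simulate from the model and watch the OLS estimates settle down as n grows (the true values and the distributions of w and ε below are illustrative assumptions):

% simulation check: OLS estimates approach (alpha0, beta0) as n grows
alpha0 = 1; beta0 = 2;                   % illustrative true values
for n = [100 10000 1000000]
  w = randn(n,1);                        % a distribution with E(w^2) > 0
  e = randn(n,1);
  y = alpha0 + beta0*w + e;
  x = [ones(n,1) w];
  thetahat = x \ y;                      % OLS
  printf("n = %8d  alphahat = %8.4f  betahat = %8.4f\n", n, thetahat(1), thetahat(2));
endfor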

14.4. Asymptotic Normality

A consistent estimator is oftentimes not very useful unless we know how
fast it is likely to be converging to the true value, and the probability that it
is far away from the true value. Establishment of asymptotic normality with
a known scaling factor solves these two problems. The following theorem is
similar to Amemiya's Theorem 4.1.3 (pg. 111).

THEOREM 22. [Asymptotic normality of e.e.] In addition to the assumptions
of Theorem 19, assume
(a) J_n(θ) ≡ D²_θ s_n(θ) exists and is continuous in an open, convex neighborhood of θ^0.
(b) {J_n(θ_n)} → J_∞(θ^0), a.s., a finite negative definite matrix, for any sequence
{θ_n} that converges almost surely to θ^0.
(c) √n D_θ s_n(θ^0) →^d N[0, I_∞(θ^0)], where I_∞(θ^0) = lim_{n→∞} Var √n D_θ s_n(θ^0).
Then √n (θ̂ − θ^0) →^d N[0, J_∞(θ^0)^{−1} I_∞(θ^0) J_∞(θ^0)^{−1}].

Proof: By Taylor expansion:

D_θ s_n(θ̂_n) = D_θ s_n(θ^0) + D²_θ s_n(θ*)(θ̂ − θ^0),

where θ* = λθ̂ + (1 − λ)θ^0, 0 ≤ λ ≤ 1.

- Note that θ̂ will be in the neighborhood where D²_θ s_n(θ) exists with
probability one as n becomes large, by consistency.
- Now the l.h.s. of this equation is zero, at least asymptotically, since θ̂_n
is a maximizer and the f.o.c. must hold exactly since the limiting
objective function is strictly concave in a neighborhood of θ^0.
- Also, since θ* is between θ̂_n and θ^0, and since θ̂_n → θ^0, a.s., assumption (b)
gives

D²_θ s_n(θ*) → J_∞(θ^0), a.s.

So

0 = D_θ s_n(θ^0) + [J_∞(θ^0) + o_s(1)] (θ̂ − θ^0).

And

0 = √n D_θ s_n(θ^0) + [J_∞(θ^0) + o_s(1)] √n (θ̂ − θ^0).

Now J_∞(θ^0) is a finite negative definite matrix, so the o_s(1) term is asymptotically irrelevant next to J_∞(θ^0), so we can write

0 ≈ √n D_θ s_n(θ^0) + J_∞(θ^0) √n (θ̂ − θ^0),
√n (θ̂ − θ^0) ≈ −J_∞(θ^0)^{−1} √n D_θ s_n(θ^0).

Because of assumption (c), and the formula for the variance of a linear combination of r.v.'s,

√n (θ̂ − θ^0) →^d N[0, J_∞(θ^0)^{−1} I_∞(θ^0) J_∞(θ^0)^{−1}].

- Assumption (b) is not implied by the Slutsky theorem. The Slutsky
theorem says that g(x_n) → g(x), a.s., if x_n → x and g(·) is continuous at x.
However, in our case J_n(θ_n) is a function of n. A theorem which applies (Amemiya,
Ch. 4) is

THEOREM 23. If g_n(θ) converges uniformly almost surely to a nonstochastic
function g_∞(θ) uniformly on an open neighborhood of θ^0, then g_n(θ̂) → g_∞(θ^0), a.s.,
if g_∞(θ^0) is continuous at θ^0 and θ̂ → θ^0, a.s.

- To apply this to the second derivatives, sufficient conditions would
be that the second derivatives be strongly stochastically equicontinuous on a neighborhood of θ^0, and that an ordinary LLN applies to the
derivatives when evaluated at θ^0.
- Stronger conditions that imply this are as above: continuous and bounded
second derivatives in a neighborhood of θ^0.
- Skip this in lecture. A note on the order of these matrices: Supposing
that s_n(θ) is representable as an average of n terms, which is the case
for all estimators we consider, D²_θ s_n(θ) is also an average of n matrices,
the elements of which are not centered (they do not have zero expectation). Supposing a SLLN applies, the almost sure limit of D²_θ s_n(θ^0)
is J_∞(θ^0) = O(1), as we saw in Example 51. On the other hand, assumption (c):

√n D_θ s_n(θ^0) →^d N[0, I_∞(θ^0)]

means that √n D_θ s_n(θ^0) = O_p(1), where we use the result of Example 49. If we were to omit the
√n, we'd have

D_θ s_n(θ^0) = n^{−1/2} O_p(1) = O_p(n^{−1/2}),

where we use the fact that O_p(n^r) O_p(n^q) = O_p(n^{r+q}). The sequence
D_θ s_n(θ^0) is centered, so we need to scale by √n to avoid convergence
to zero.

14.5. Examples

14.5.1. Binary response models. Binary response models arise in a variety
of contexts. We've already seen a logit model. Another simple example is a
probit threshold-crossing model. Assume that

y* = x'β − ε
y = 1(y* > 0),

where ε ~ N(0,1). Here, y* is an unobserved (latent) continuous variable, and y is a binary variable that indicates whether y* is negative or positive. Then

Pr(y = 1) = Pr(ε < x'β) = Φ(x'β),

where Φ(·) is the standard normal distribution function.

In general, a binary response model will require that the choice probability
be parameterized in some form. For a vector of explanatory variables x, the
response probability will be parameterized in some manner

Pr(y = 1|x) = p(x, θ).

If p(x, θ) = Λ(x'θ), where Λ(·) is the logistic distribution function,
we have a logit model. If p(x, θ) = Φ(x'θ), where Φ(·) is the
standard normal distribution function, then we have a probit model.

Regardless of the parameterization, we are dealing with a Bernoulli density,

f_{Y_i}(y_i|x_i) = p(x_i, θ)^{y_i} [1 − p(x_i, θ)]^{1−y_i},

so as long as the observations are independent, the maximum likelihood (ML)
estimator, θ̂, is the maximizer of

(14.5.1)
s_n(θ) = (1/n) Σ_{i=1}^n { y_i ln p(x_i, θ) + (1 − y_i) ln[1 − p(x_i, θ)] } ≡ (1/n) Σ_{i=1}^n s(y_i, x_i, θ).

Following the above theoretical results, θ̂ tends in probability to the θ^0 that
maximizes the uniform almost sure limit of s_n(θ). Noting that E(y_i|x_i) = p(x_i, θ^0),
and following a SLLN for i.i.d. processes, s_n(θ) converges almost surely to the
expectation of a representative term s(y, x, θ). First one can take the expectation
conditional on x to get

E_{y|x}{ y ln p(x, θ) + (1 − y) ln[1 − p(x, θ)] } = p(x, θ^0) ln p(x, θ) + [1 − p(x, θ^0)] ln[1 − p(x, θ)].

Next taking expectation over x we get the limiting objective function

(14.5.2)
s_∞(θ) = ∫_X { p(x, θ^0) ln p(x, θ) + [1 − p(x, θ^0)] ln[1 − p(x, θ)] } μ(x) dx,

where μ(x) is the (joint - the integral is understood to be multiple, and X is the
support of x) density function of the explanatory variables x. This is clearly
continuous in θ, as long as p(x, θ) is continuous, and if the parameter space is
compact we therefore have uniform almost sure convergence. Note that p(x, θ) is
continuous for the logit and probit models, for example. The maximizing
element of s_∞(θ), θ*, solves the first order conditions

∫_X { [p(x, θ^0)/p(x, θ*)] ∂p(x, θ*)/∂θ − [ (1 − p(x, θ^0)) / (1 − p(x, θ*)) ] ∂p(x, θ*)/∂θ } μ(x) dx = 0.

This is clearly solved by θ* = θ^0. Provided the solution is unique, θ̂ is consistent. Question: what's needed to ensure that the solution is unique?

The asymptotic normality theorem tells us that

√n (θ̂ − θ^0) →^d N[0, J_∞(θ^0)^{−1} I_∞(θ^0) J_∞(θ^0)^{−1}].

In the case of i.i.d. observations I_∞(θ^0) = lim_{n→∞} Var √n D_θ s_n(θ^0) is simply
the expectation of a typical element of the outer product of the gradient.

- There's no need to subtract the mean, since it's zero, following the
f.o.c. in the consistency proof above and the fact that observations are
i.i.d.
- The terms in n also drop out by the same argument:

lim Var √n D_θ s_n(θ^0) = lim Var √n D_θ (1/n) Σ_t s(θ^0)
  = lim Var (1/√n) Σ_t D_θ s(θ^0)
  = lim (1/n) Var Σ_t D_θ s(θ^0)
  = lim Var D_θ s(θ^0) = Var D_θ s(θ^0).

So we get

I_∞(θ^0) = E{ [∂s(y, x, θ^0)/∂θ] [∂s(y, x, θ^0)/∂θ'] }.

Likewise,

J_∞(θ^0) = E[ ∂²s(y, x, θ^0)/∂θ∂θ' ].

Expectations are jointly over y and x, or equivalently, first over y conditional
on x, then over x. From above, a typical element of the objective function is

s(y, x, θ^0) = y ln p(x, θ^0) + (1 − y) ln[1 − p(x, θ^0)].

Now suppose that we are dealing with a correctly specified logit model:

p(x, θ) = (1 + exp(−x'θ))^{−1}.

We can simplify the above results in this case. We have that

∂p(x, θ)/∂θ = (1 + exp(−x'θ))^{−2} exp(−x'θ) x
            = (1 + exp(−x'θ))^{−1} [exp(−x'θ)/(1 + exp(−x'θ))] x
            = p(x, θ)[1 − p(x, θ)] x
            = [p(x, θ) − p(x, θ)²] x.

So

(14.5.3)    ∂s(y, x, θ^0)/∂θ = [y − p(x, θ^0)] x
(14.5.4)    ∂²s(y, x, θ^0)/∂θ∂θ' = −[p(x, θ^0) − p(x, θ^0)²] x x'.

Taking expectations over y then x gives

(14.5.5)    I_∞(θ^0) = ∫ E_Y[ y² − 2 y p(x, θ^0) + p(x, θ^0)² ] x x' μ(x) dx
(14.5.6)             = ∫ [ p(x, θ^0) − p(x, θ^0)² ] x x' μ(x) dx,

where we use the fact that E_Y(y) = E_Y(y²) = p(x, θ^0). Likewise,

J_∞(θ^0) = −∫ [ p(x, θ^0) − p(x, θ^0)² ] x x' μ(x) dx.

Note that we arrive at the expected result: the information matrix equality
holds (that is, J_∞(θ^0) = −I_∞(θ^0)). With this,

√n (θ̂ − θ^0) →^d N[0, J_∞(θ^0)^{−1} I_∞(θ^0) J_∞(θ^0)^{−1}]

simplifies to

√n (θ̂ − θ^0) →^d N[0, −J_∞(θ^0)^{−1}],

which can also be expressed as

√n (θ̂ − θ^0) →^d N[0, I_∞(θ^0)^{−1}].

On a final note, the logit and standard normal CDFs are very similar - the
logit distribution is a bit more fat-tailed. While coefficients will vary slightly
between the two models, functions of interest such as estimated probabilities
p(x, θ̂) will be virtually identical for the two models.

14.6. Example: Linearization of a nonlinear model

Ref. Gourieroux and Monfort, section 8.3.4. White, Int'l Econ. Rev. 1980 is
an earlier reference.

Suppose we have a nonlinear model

y_i = h(x_i, θ^0) + ε_i,

where ε_i ~ iid(0, σ²). The nonlinear least squares estimator solves

θ̂_n = arg min (1/n) Σ_{i=1}^n [y_i − h(x_i, θ)]².

We'll study this more later, but for now it is clear that the foc for minimization
will require solving a set of nonlinear equations. A common approach to the
problem seeks to avoid this difficulty by linearizing the model. A first order
Taylor's series expansion about the point x^0 with remainder gives

y_i = h(x^0, θ^0) + (x_i − x^0)' ∂h(x^0, θ^0)/∂x + ν_i,

where ν_i encompasses both ε_i and the Taylor's series remainder. Note that ν_i
is no longer a classical error - its mean is not zero. We should expect problems.

Define

α* = h(x^0, θ^0) − x^0' ∂h(x^0, θ^0)/∂x
β* = ∂h(x^0, θ^0)/∂x.

Given this, one might try to estimate α* and β* by applying OLS to

y_i = α + β x_i + ν_i.

- Question: will α̂ and β̂ be consistent for α* and β*?
- The answer is no, as one can see by interpreting α̂ and β̂ as extremum
estimators. Let γ = (α, β')'. Then

γ̂ = arg min s_n(γ) = (1/n) Σ_{i=1}^n (y_i − α − β x_i)².

The objective function converges to its expectation

s_n(γ) → s_∞(γ) = E_X E_{Y|X} (y − α − βx)², u.a.s.,

and γ̂ converges a.s. to the γ^0 that minimizes s_∞(γ):

γ^0 = arg min E_X E_{Y|X} (y − α − βx)².

Noting that

E_X E_{Y|X} (y − α − βx)² = E_X E_{Y|X} [h(x, θ^0) + ε − α − βx]² = σ² + E_X [h(x, θ^0) − α − βx]²,

since cross products involving ε drop out, α^0 and β^0 correspond to the hyperplane that is closest to the true regression function h(x, θ^0) according to the
mean squared error criterion. This depends on both the shape of h(·) and the
density function of the conditioning variables.

Figure: Inconsistency of the linear approximation, even at the approximation point x^0 (the tangent line at x^0 and the fitted OLS line both differ from h(x, θ)).

- It is clear that the tangent line does not minimize MSE, since, for example, if h(x, θ^0) is concave, all errors between the tangent line and
the true function are negative.
- Note that the true underlying parameter θ^0 is not estimated consistently, either (it may be of a different dimension than the dimension
of the parameter of the approximating model, which is 2 in this example).
- Second order and higher-order approximations suffer from exactly
the same problem, though to a less severe degree, of course. For
this reason, translog, Generalized Leontief and other flexible functional forms based upon second-order approximations in general suffer from bias and inconsistency. The bias may not be too important for
analysis of conditional means, but it can be very important for analyzing first and second derivatives. In production and consumer analysis,
first and second derivatives (e.g., elasticities of substitution) are often
of interest, so in this case, one should be cautious of unthinking application of models that impose strong restrictions on second derivatives.
- This sort of linearization about a long run equilibrium is a common
practice in dynamic macroeconomic models. It is justified for the purposes of theoretical analysis of a model given the model's parameters,
but it is not justifiable for the estimation of the parameters of the model
using data. The section on simulation-based methods offers a means
of obtaining consistent estimators of the parameters of dynamic macro
models that are too complex for standard methods of analysis.


Exercises

(1) Suppose that x_t ~ uniform(0,1), and y_t = 1 − x_t² + ε_t, where ε_t is iid(0, σ²).
Suppose we estimate the misspecified model y_t = α + β x_t + η_t by OLS. Find
the numeric values of α^0 and β^0 that are the probability limits of α̂ and β̂.
(2) Verify your results using Octave by generating data that follows the above
model, and calculating the OLS estimator. When the sample size is very
large the estimator should be very close to the analytical results you obtained in question 1.
(3) Use the asymptotic normality theorem to find the asymptotic distribution
of the ML estimator of β^0 for the model y = xβ^0 + ε, where ε ~ N(0,1)
and is independent of x. This means finding D²_β s_n(β), J_∞(β^0), D_β s_n(β)
and I_∞(β^0).

CHAPTER 15

Generalized method of moments (GMM)


Readings: Hamilton Ch. 14 ; Davidson and MacKinnon, Ch. 17 (see pg.

587 for refs. to applications); Newey and McFadden (1994), Large Sample
Estimation and Hypothesis Testing, in Handbook of Econometrics, Vol. 4, Ch.
36.

15.1. Definition

We've already seen one example of GMM in the introduction, based upon
the χ² distribution. Consider the following example based upon the t-distribution.
The density function of a t-distributed r.v. Y_t is

f_{Y_t}(y_t, θ^0) = Γ[(θ^0 + 1)/2] / { (πθ^0)^{1/2} Γ(θ^0/2) } · [1 + y_t²/θ^0]^{−(θ^0+1)/2}.

Given an iid sample of size n, one could estimate θ^0 by maximizing the log-likelihood function

θ̂ ≡ arg max_Θ ln L_n(θ) = Σ_{t=1}^n ln f_{Y_t}(y_t, θ).

- This approach is attractive since ML estimators are asymptotically efficient. This is because the ML estimator uses all of the available information (e.g., the distribution is fully specified up to a parameter). Recalling that a distribution is completely characterized by its moments,
the ML estimator is interpretable as a GMM estimator that uses all of
the moments. The method of moments estimator uses only K moments to estimate a K-dimensional parameter. Since information is
discarded, in general, by the MM estimator, efficiency is lost relative
to the ML estimator.
- Continuing with the example, a t-distributed r.v. with density f_{Y_t}(y_t, θ^0)
has mean zero and variance V(y_t) = θ^0/(θ^0 − 2) (for θ^0 > 2).
- Using the notation introduced previously, define a moment condition
m_{1t}(θ) = θ/(θ − 2) − y_t² and m_1(θ) = (1/n) Σ_{t=1}^n m_{1t}(θ) = θ/(θ − 2) − (1/n) Σ_{t=1}^n y_t².
As before, when evaluated at the true parameter value θ^0, both
E_{θ^0}[m_{1t}(θ^0)] = 0 and E_{θ^0}[m_1(θ^0)] = 0.
- Choosing θ̂ to set m_1(θ̂) = 0 yields a MM estimator:

(15.1.1)    θ̂ = 2 / [ 1 − n / Σ_t y_t² ].

This estimator is based on only one moment of the distribution - it uses less
information than the ML estimator, so it is intuitively clear that the MM estimator will be inefficient relative to the ML estimator.

- An alternative MM estimator could be based upon the fourth moment
of the t-distribution. The fourth moment of a t-distributed r.v. is

μ_4 ≡ E(y_t^4) = 3(θ^0)² / [(θ^0 − 2)(θ^0 − 4)],

provided θ^0 > 4. We can define a second moment condition

m_2(θ) = 3θ² / [(θ − 2)(θ − 4)] − (1/n) Σ_{t=1}^n y_t^4.

- A second, different MM estimator chooses θ̂ to set m_2(θ̂) = 0. If you
solve this you'll see that the estimate is different from that in equation
15.1.1.

This estimator isn't efficient either, since it uses only one moment. A GMM estimator would use the two moment conditions together to estimate the single
parameter. The GMM estimator is overidentified, which leads to an estimator which is efficient relative to the just identified MM estimators (more on
efficiency later).

As before, set m_n(θ) = (m_1(θ), m_2(θ))', stacking the averaged moment conditions. The n
subscript is used to indicate the sample size. Note that m_n(θ^0) = O_p(n^{−1/2}), since it is an
average of centered random variables, whereas m_n(θ) = O_p(1), θ ≠ θ^0,
where expectations are taken using the true distribution with parameter θ^0. This is the fundamental reason that GMM is consistent.

- A GMM estimator requires defining a measure of distance, d(m_n(θ)). A
popular choice (for reasons noted below) is to set d(m_n(θ)) = m_n'W_n m_n,
and we minimize s_n(θ) = m_n(θ)'W_n m_n(θ). We assume W_n converges to a
finite positive definite matrix.
- In general, assume we have g moment conditions, so m_n(θ) is a g-vector
and W_n is a g × g matrix.

For the purposes of this course, the following definition of the GMM estimator
is sufficiently general:

DEFINITION 24. The GMM estimator of the K-dimensional parameter vector θ^0 is θ̂ ≡ arg min_Θ s_n(θ) ≡ m_n(θ)'W_n m_n(θ), where m_n(θ) = (1/n) Σ_{t=1}^n m_t(θ)
is a g-vector, g ≥ K, with E_θ m(θ) = 0, and W_n converges almost surely to a finite
g × g symmetric positive definite matrix W_∞.
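A sketch of the overidentified GMM objective for the t-distribution example, using both moment conditions and, for illustration, an identity weighting matrix (this is assumed code, not from the course materials):

1;
function s = gmm_obj(theta, y, W)
  % stacked moment conditions for the t-distribution example
  m1 = theta/(theta - 2) - mean(y.^2);                     % second moment condition
  m2 = 3*theta^2/((theta - 2)*(theta - 4)) - mean(y.^4);   % fourth moment condition
  m = [m1; m2];
  s = m' * W * m;                                          % GMM objective m_n' W_n m_n
endfunction

% given data y (theta must exceed 4 for both moments to exist), one could
% minimize the objective over a grid or with a numeric optimizer, e.g.:
%   grid = linspace(4.1, 30, 1000);
%   [smin, i] = min(arrayfun(@(t) gmm_obj(t, y, eye(2)), grid));  thetahat = grid(i)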

What's the reason for using GMM if MLE is asymptotically efficient?

- Robustness: GMM is based upon a limited set of moment conditions.
For consistency, only these moment conditions need to be correctly
specified, whereas MLE in effect requires correct specification of every
conceivable moment condition. GMM is robust with respect to distributional misspecification. The price for robustness is loss of efficiency with
respect to the MLE estimator. Keep in mind that the true distribution
is not known, so if we erroneously specify a distribution and estimate
by MLE, the estimator will be inconsistent in general (not always).
- Feasibility: in some cases the MLE estimator is not available, because we are not able to deduce the likelihood function. More
on this in the section on simulation-based estimation. The GMM
estimator may still be feasible even though MLE is not possible.

15.2. Consistency
We simply assume that the assumptions of Theorem 19 hold, so the GMM
estimator is strongly consistent. The only assumption that warrants additional comments is that of identification. In Theorem 19, the third assumption reads: (c) Identification: s_∞(·) has a unique global maximum at θ^0, i.e.,
s_∞(θ^0) > s_∞(θ), ∀θ ≠ θ^0. Taking the case of a quadratic objective function
s_n(θ) = m_n(θ)'W_n m_n(θ), first consider m_n(θ).

- Applying a uniform law of large numbers, we get m_n(θ) → m_∞(θ), a.s.
- Since E_{θ^0} m_n(θ^0) = 0 by assumption, m_∞(θ^0) = 0.
- Since s_∞(θ^0) = m_∞(θ^0)'W_∞ m_∞(θ^0) = 0, in order for asymptotic identification, we need that m_∞(θ) ≠ 0 for θ ≠ θ^0, for at least some element
of the vector. This and the assumption that W_n → W_∞, a.s., a finite positive
definite g × g matrix, guarantee that θ^0 is asymptotically identified.
- Note that asymptotic identification does not rule out the possibility
of lack of identification for a given data set - there may be multiple
minimizing solutions in finite samples.

15.3. Asymptotic normality


We also simply assume that the conditions of Theorem 22 hold, so we will
have asymptotic normality. However, we do need to find the structure of the
asymptotic variance-covariance matrix of the estimator. From Theorem 22, we
have

√n (θ̂ − θ^0) →^d N[0, J_∞(θ^0)^{−1} I_∞(θ^0) J_∞(θ^0)^{−1}],

where J_∞(θ^0) is the almost sure limit of ∂²s_n(θ)/∂θ∂θ' evaluated at θ^0, and
I_∞(θ^0) = lim_{n→∞} Var √n ∂s_n(θ^0)/∂θ.

We need to determine the form of these matrices given the objective function
s_n(θ) = m_n(θ)'W_n m_n(θ).

Now using the product rule from the introduction,

(15.3.1)    ∂s_n(θ)/∂θ = 2 [∂m_n'(θ)/∂θ] W_n m_n(θ).

Define the K × g matrix

D_n(θ) ≡ ∂m_n'(θ)/∂θ,

so:

∂s(θ)/∂θ = 2 D(θ) W m(θ).

(Note that s_n(θ), D_n(θ), W_n and m_n(θ) all depend on the sample size n, but it
is omitted to unclutter the notation.)

To take second derivatives, let D_i be the i-th row of D(θ). Using the product rule,

∂²s(θ)/∂θ'∂θ_i = 2 D_i(θ) W D(θ)' + 2 m(θ)'W [∂D_i(θ)'/∂θ'].

When evaluating the term 2 m_n(θ)'W [∂D_i(θ)'/∂θ'] at θ^0, assume that ∂D_i(θ^0)'/∂θ'
satisfies a LLN, so that it converges almost surely to a finite limit. In this case, we have

2 m_n(θ^0)'W [∂D_i(θ^0)'/∂θ'] → 0, a.s.,

since m_n(θ^0) = o_p(1) and W → W_∞, a.s.

Stacking these results over the K rows of D, we get

lim ∂²s_n(θ^0)/∂θ∂θ' = J_∞(θ^0) = 2 D_∞ W_∞ D_∞', a.s.,

where we define lim D_n = D_∞, a.s., and lim W_n = W_∞, a.s. (we assume a LLN
holds).

With regard to I_∞(θ^0), following equation 15.3.1, and noting that the scores
D_n W_n m_n(θ^0) have mean zero at θ^0 (since E m_n(θ^0) = 0 by assumption), we have

I_∞(θ^0) = lim Var √n ∂s_n(θ^0)/∂θ = lim E[ 4n D_n W_n m_n(θ^0) m_n(θ^0)'W_n D_n' ].

Now, given that m_n(θ^0) is an average of centered (mean-zero) quantities, it is
reasonable to expect a CLT to apply, after multiplication by √n. Assuming
this,

√n m_n(θ^0) →^d N(0, Ω_∞),

where Ω_∞ = lim E[ n m_n(θ^0) m_n(θ^0)' ].

Using this, and the last equation, we get

I_∞(θ^0) = 4 D_∞ W_∞ Ω_∞ W_∞ D_∞'.

Using these results, the asymptotic normality theorem gives us

√n (θ̂ − θ^0) →^d N[ 0, (D_∞ W_∞ D_∞')^{−1} D_∞ W_∞ Ω_∞ W_∞ D_∞' (D_∞ W_∞ D_∞')^{−1} ],

the asymptotic distribution of the GMM estimator for arbitrary weighting matrix W_n. Note that for J_∞ to be positive definite, D_∞ must have full row rank,
ρ(D_∞) = k.

15.4. Choosing the weighting matrix


$W$ is a weighting matrix, which determines the relative importance of violations of the individual moment conditions. For example, if we are much more sure of the first moment condition, which is based upon the variance, than of the second, which is based upon the fourth moment, we could set
\[
W = \begin{bmatrix} a & 0 \\ 0 & b \end{bmatrix}
\]
with $a$ much larger than $b$. In this case, errors in the second moment condition have less weight in the objective function.

• Since moments are not independent, in general, we should expect that there be a correlation between the moment conditions, so it may not be desirable to set the off-diagonal elements to 0. $W$ may be a random, data dependent matrix.
• We have already seen that the choice of $W$ will influence the asymptotic distribution of the GMM estimator. Since the GMM estimator is already inefficient w.r.t. MLE, we might like to choose the $W$ matrix to make the GMM estimator efficient within the class of GMM estimators defined by $m_n(\theta)$.
• To provide a little intuition, consider the linear model $y = X\beta + \varepsilon$, where $\varepsilon \sim N(0, \Omega)$. That is, we have heteroscedasticity and autocorrelation.
• Let $P$ be the Cholesky factorization of $\Omega^{-1}$, e.g., $P'P = \Omega^{-1}$.
• Then the model $Py = PX\beta + P\varepsilon$ satisfies the classical assumptions of homoscedasticity and nonautocorrelation, since $V(P\varepsilon) = P V(\varepsilon) P' = P\Omega P' = P(P'P)^{-1}P' = PP^{-1}(P')^{-1}P' = I_n$. (Note: we use $(AB)^{-1} = B^{-1}A^{-1}$ for $A$, $B$ both nonsingular). This means that the transformed model is efficient.
• The OLS estimator of the model $Py = PX\beta + P\varepsilon$ minimizes the objective function $(y - X\beta)'\Omega^{-1}(y - X\beta)$. Interpreting $(y - X\beta) = \varepsilon(\beta)$ as moment conditions (note that they do have zero expectation when evaluated at $\beta_0$), the optimal weighting matrix is seen to be the inverse of the covariance matrix of the moment conditions. This result carries over to GMM estimation. (Note: this presentation of GLS is not a GMM estimator, because the number of moment conditions here is equal to the sample size, $n$. Later we'll see that GLS can be put into the GMM framework defined above).

THEOREM 25. If $\hat{\theta}$ is a GMM estimator that minimizes $m_n(\theta)' W_n m_n(\theta)$, the asymptotic variance of $\hat{\theta}$ will be minimized by choosing $W_n$ so that $W_n \stackrel{a.s.}{\longrightarrow} W_\infty = \Omega_\infty^{-1}$, where $\Omega_\infty = \lim E\left[n\, m_n(\theta_0)\, m_n(\theta_0)'\right]$.

Proof: For $W_\infty = \Omega_\infty^{-1}$, the asymptotic variance
\[
\left(D_\infty W_\infty D_\infty'\right)^{-1} D_\infty W_\infty \Omega_\infty W_\infty D_\infty'\left(D_\infty W_\infty D_\infty'\right)^{-1}
\]
simplifies to $\left(D_\infty \Omega_\infty^{-1} D_\infty'\right)^{-1}$. Now, for any choice such that $W_\infty \neq \Omega_\infty^{-1}$, consider the difference of the inverses of the variances when $W = \Omega^{-1}$ versus when $W$ is some arbitrary positive definite matrix:
\[
D_\infty \Omega_\infty^{-1} D_\infty' - D_\infty W_\infty D_\infty'\left(D_\infty W_\infty \Omega_\infty W_\infty D_\infty'\right)^{-1} D_\infty W_\infty D_\infty'
= D_\infty \Omega_\infty^{-1/2}\left[I - \Omega_\infty^{1/2} W_\infty D_\infty'\left(D_\infty W_\infty \Omega_\infty W_\infty D_\infty'\right)^{-1} D_\infty W_\infty \Omega_\infty^{1/2}\right]\Omega_\infty^{-1/2} D_\infty',
\]
as can be verified by multiplication. The term in brackets is idempotent, which is also easy to check by multiplication, and is therefore positive semidefinite. A quadratic form in a positive semidefinite matrix is also positive semidefinite. The difference of the inverses of the variances is positive semidefinite, which implies that the difference of the variances is negative semidefinite, which proves the theorem.

The result
\[
(15.4.1)\qquad \sqrt{n}\left(\hat{\theta} - \theta_0\right) \stackrel{d}{\longrightarrow} N\left[0,\, \left(D_\infty \Omega_\infty^{-1} D_\infty'\right)^{-1}\right]
\]
allows us to treat
\[
\hat{\theta} \approx N\left[\theta_0,\, \frac{\left(D_\infty \Omega_\infty^{-1} D_\infty'\right)^{-1}}{n}\right],
\]
where $\approx$ means "approximately distributed as." To operationalize this we need estimators of $D_\infty$ and $\Omega_\infty$.

• The obvious estimator of $\widehat{D_\infty}$ is simply $\frac{\partial}{\partial\theta} m_n'\left(\hat{\theta}\right)$, which is consistent by the consistency of $\hat{\theta}$, assuming that $\frac{\partial}{\partial\theta} m_n'$ is continuous in $\theta$. Stochastic equicontinuity results can give us this result even if $\frac{\partial}{\partial\theta} m_n'$ is not continuous. We now turn to estimation of $\Omega_\infty$.
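As a rough illustration of the estimators just described, the following Octave sketch computes $\widehat{D_\infty}$ by central finite differences of the sample moments and then forms the estimated asymptotic variance of the efficient GMM estimator. The interface (the moments handle introduced earlier) and the step size h are assumptions for illustration; an estimate of $\Omega_\infty$ is taken as given here and is discussed in the next section.

% Sketch (hypothetical interface): Dhat = d m_n'(theta)/d theta by central differences,
% then Vhat = inv(Dhat * inv(Omegahat) * Dhat') / n, the estimated variance of thetahat.
function Vhat = gmm_variance(thetahat, moments, data, Omegahat)
  M0 = moments(thetahat, data);
  n = rows(M0);
  g = columns(M0);
  K = length(thetahat);
  Dhat = zeros(K, g);
  h = 1e-6;                                   % finite-difference step (assumed)
  for k = 1:K
    tp = thetahat; tm = thetahat;
    tp(k) += h; tm(k) -= h;
    Dhat(k,:) = (mean(moments(tp, data)) - mean(moments(tm, data))) / (2*h);
  endfor
  Vhat = inv(Dhat * inv(Omegahat) * Dhat') / n; % (15.4.1), divided by n
endfunction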

15.5. Estimation of the variance-covariance matrix


(See Hamilton Ch. 10, pp. 261-2 and 280-84.)

In the case that we wish to use the optimal weighting matrix, we need an estimate of $\Omega_\infty$, the limiting variance-covariance matrix of $\sqrt{n}\, m_n(\theta_0)$. While one could estimate $\Omega_\infty$ parametrically, we in general have little information upon which to base a parametric specification. In general, we expect that:

• $m_t$ will be autocorrelated ($\Gamma_{ts} = E(m_t m_{t-s}') \neq 0$). Note that this autocovariance will not depend on $t$ if the moment conditions are covariance stationary.
• contemporaneously correlated, since the individual moment conditions will not in general be independent of one another ($E(m_{it} m_{jt}) \neq 0$).
• and have different variances ($E(m_{it}^2) = \sigma_{it}^2$).

Since we need to estimate so many components if we are to take the parametric approach, it is unlikely that we would arrive at a correct parametric specification. For this reason, research has focused on consistent nonparametric estimators of $\Omega_\infty$.

Henceforth we assume that $m_t$ is covariance stationary (the covariance between $m_t$ and $m_{t-s}$ does not depend on $t$). Define the $v$-th autocovariance of the moment conditions $\Gamma_v = E(m_t m_{t-v}')$. Note that $E(m_t m_{t+v}') = \Gamma_v'$. Recall that $m_t$ and $m_n$ are functions of $\theta$, so for now assume that we have some consistent estimator of $\theta_0$, so that $\hat{m}_t = m_t(\hat{\theta})$. Now
\[
\Omega_n = E\left[n\, m_n(\theta_0)\, m_n(\theta_0)'\right]
= E\left[\frac{1}{n}\left(\sum_{t=1}^{n} m_t\right)\left(\sum_{t=1}^{n} m_t'\right)\right]
= \Gamma_0 + \frac{n-1}{n}\left(\Gamma_1 + \Gamma_1'\right) + \frac{n-2}{n}\left(\Gamma_2 + \Gamma_2'\right) + \cdots + \frac{1}{n}\left(\Gamma_{n-1} + \Gamma_{n-1}'\right).
\]
A natural, consistent estimator of $\Gamma_v$ is
\[
\hat{\Gamma}_v = \frac{1}{n}\sum_{t=v+1}^{n} \hat{m}_t \hat{m}_{t-v}'
\]
(you might use $n - v$ in the denominator instead). So, a natural, but inconsistent, estimator of $\Omega_\infty$ would be
\[
\hat{\Omega} = \hat{\Gamma}_0 + \sum_{v=1}^{n-1}\frac{n-v}{n}\left(\hat{\Gamma}_v + \hat{\Gamma}_v'\right).
\]
This estimator is inconsistent in general, since the number of parameters to estimate is more than the number of observations, and increases more rapidly than $n$, so information does not build up as $n \to \infty$.

On the other hand, supposing that $\Gamma_v$ tends to zero sufficiently rapidly as $v$ tends to $\infty$, a modified estimator
\[
\hat{\Omega} = \hat{\Gamma}_0 + \sum_{v=1}^{q(n)}\left(\hat{\Gamma}_v + \hat{\Gamma}_v'\right),
\]
where $q(n) \to \infty$ as $n \to \infty$, will be consistent, provided $q(n)$ grows sufficiently slowly. The term $\frac{n-v}{n}$ can be dropped because $q(n)$ must be $o(n)$. This allows information to accumulate at a rate that satisfies a LLN. A disadvantage of this estimator is that it may not be positive definite. This could cause one to calculate a negative $\chi^2$ statistic, for example!

• Note: the formula for $\hat{\Omega}$ requires an estimate of $m_n(\theta_0)$, which in turn requires an estimate of $\theta$, which is based upon an estimate of $\Omega$! The solution to this circularity is to set the weighting matrix $W$ arbitrarily (for example to an identity matrix), obtain a first consistent but inefficient estimate of $\theta_0$, then use this estimate to form $\hat{\Omega}$, then re-estimate $\theta_0$. The process can be iterated until neither $\hat{\Omega}$ nor $\hat{\theta}$ change appreciably between iterations.
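The two-step (or iterated) procedure just described can be outlined in Octave as follows. This is only a sketch under the same assumed moments interface used above: gmm_obj is the objective sketched earlier, estimate_omega stands for whichever covariance estimator is chosen (e.g., the Newey-West estimator of the next subsection), and theta_start, maxiter, g and data are user-supplied.

% Sketch of two-step / iterated efficient GMM (hypothetical helper names).
theta = theta_start;                    % initial value supplied by the user
W = eye(g);                             % arbitrary first-round weighting matrix
for iter = 1:maxiter
  obj = @(t) gmm_obj(t, @moments, data, W);
  theta_new = fminunc(obj, theta);      % minimize s_n(theta) for the current W
  Omegahat = estimate_omega(theta_new, @moments, data);  % e.g., Newey-West
  W_new = inv(Omegahat);                % efficient weighting matrix estimate
  if norm(theta_new - theta) < 1e-6     % stop when the estimates settle down
    theta = theta_new; W = W_new; break;
  endif
  theta = theta_new; W = W_new;
endfor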

15.5.1. Newey-West covariance estimator. The Newey-West estimator (Econometrica, 1987) solves the problem of possible nonpositive definiteness of the above estimator. Their estimator is
\[
\hat{\Omega} = \hat{\Gamma}_0 + \sum_{v=1}^{q(n)}\left[1 - \frac{v}{q+1}\right]\left(\hat{\Gamma}_v + \hat{\Gamma}_v'\right).
\]
This estimator is p.d. by construction. The condition for consistency is that $q = o(n^{1/4})$. Note that this is a very slow rate of growth for $q$. This estimator is nonparametric - we've placed no parametric restrictions on the form of $\Omega$. It is an example of a kernel estimator.

In a more recent paper, Newey and West (Review of Economic Studies, 1994) use pre-whitening before applying the kernel estimator. The idea is to fit a VAR model to the moment conditions. It is expected that the residuals of the VAR model will be more nearly white noise, so that the Newey-West covariance estimator might perform better with short lag lengths. The VAR model is
\[
\hat{m}_t = \Theta_1 \hat{m}_{t-1} + \cdots + \Theta_p \hat{m}_{t-p} + u_t.
\]
This is estimated, giving the residuals $\hat{u}_t$. Then the Newey-West covariance estimator is applied to these pre-whitened residuals, and the covariance $\Omega$ is estimated combining the fitted VAR with the kernel estimate of the covariance of the $u_t$. See Newey-West for details.

• I have a program that does this if you're interested.
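A minimal Octave sketch of the Newey-West estimator (without pre-whitening) follows. It assumes M is the $n \times g$ matrix whose $t$-th row is $\hat{m}_t'$, and q is the chosen lag truncation; both are user-supplied assumptions here.

% Sketch: Newey-West covariance estimator of Omega from an n x g matrix M of
% moment contributions (rows are mhat_t'), with Bartlett weights and q lags.
function Omegahat = newey_west(M, q)
  [n, g] = size(M);
  Omegahat = (M' * M) / n;                    % Gamma_0 hat
  for v = 1:q
    Gv = (M(v+1:n, :)' * M(1:n-v, :)) / n;    % Gamma_v hat = (1/n) sum_t mhat_t mhat_{t-v}'
    w = 1 - v / (q + 1);                      % Bartlett kernel weight
    Omegahat += w * (Gv + Gv');
  endfor
endfunction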

15.6. Estimation using conditional moments


If the above VAR model does succeed in removing unmodeled heteroscedasticity and autocorrelation, might this imply that this information is not being used efficiently in estimation? In other words, since the performance of GMM depends on which moment conditions are used, if the set of selected moments exhibits heteroscedasticity and autocorrelation, can't we use this information, a la GLS, to guide us in selecting a better set of moment conditions to improve efficiency? The answer to this may not be so clear when moments are defined unconditionally, but it can be analyzed more carefully when the moments used in estimation are derived from conditional moments.

So far, the moment conditions have been presented as unconditional expectations. One common way of defining unconditional moment conditions is based upon conditional moment conditions.

Suppose that a random variable $Y$ has zero expectation conditional on the random variable $X$:
\[
E_{Y|X}\, Y = \int Y f_{Y|X}(Y|X)\, dY = 0.
\]
Then the unconditional expectation of the product of $Y$ and a function $g(X)$ of $X$ is also zero. The unconditional expectation is
\[
E\left[Y g(X)\right] = \int_X\left(\int_Y Y g(X) f(Y, X)\, dY\right) dX.
\]
This can be factored into a conditional expectation and an expectation w.r.t. the marginal density of $X$:
\[
E\left[Y g(X)\right] = \int_X\left(\int_Y Y g(X) f(Y|X)\, dY\right) f(X)\, dX.
\]
Since $g(X)$ doesn't depend on $Y$ it can be pulled out of the integral
\[
E\left[Y g(X)\right] = \int_X\left(\int_Y Y f(Y|X)\, dY\right) g(X) f(X)\, dX.
\]
But the term in parentheses on the rhs is zero by assumption, so
\[
E\left[Y g(X)\right] = 0
\]
as claimed.

This is important econometrically, since models often imply restrictions on conditional moments. Suppose a model tells us that the function $K(y_t, x_t)$ has expectation, conditional on the information set $I_t$, equal to $k(x_t, \theta)$:
\[
E_\theta\left[K(y_t, x_t)|I_t\right] = k(x_t, \theta).
\]
For example, in the context of the classical linear model $y_t = x_t'\beta + \varepsilon_t$, we can set $K(y_t, x_t) = y_t$ so that $k(x_t, \theta) = x_t'\beta$. With this, the function
\[
h_t(\theta) = K(y_t, x_t) - k(x_t, \theta)
\]
has conditional expectation equal to zero
\[
E_\theta\left[h_t(\theta)|I_t\right] = 0.
\]
This is a scalar moment condition, which wouldn't be sufficient to identify a $K$-dimensional parameter $\theta$ ($K > 1$). However, the above result allows us to form various unconditional expectations
\[
m_t(\theta) = Z(w_t)\, h_t(\theta),
\]
where $Z(w_t)$ is a $g \times 1$-vector valued function of $w_t$, and $w_t$ is a set of variables drawn from the information set $I_t$. The $Z(w_t)$ are instrumental variables. We now have $g$ moment conditions, so as long as $g \geq K$ the necessary condition for identification holds.

One can form the $n \times g$ matrix
\[
Z_n = \begin{bmatrix} Z_1(w_1) & Z_2(w_1) & \cdots & Z_g(w_1)\\ Z_1(w_2) & Z_2(w_2) & \cdots & Z_g(w_2)\\ \vdots & & & \vdots\\ Z_1(w_n) & Z_2(w_n) & \cdots & Z_g(w_n)\end{bmatrix}
= \begin{bmatrix} Z_1'\\ Z_2'\\ \vdots\\ Z_n'\end{bmatrix},
\]
where $Z_t'$ is the $t$-th row of $Z_n$. With this we can form the $g$ moment conditions
\[
m_n(\theta) = \frac{1}{n} Z_n'\, h_n(\theta),\qquad h_n(\theta) = \begin{bmatrix} h_1(\theta)\\ \vdots\\ h_n(\theta)\end{bmatrix}.
\]
This fits the previous treatment. An interesting question that arises is how one should choose the instrumental variables $Z(w_t)$ to achieve maximum efficiency.

Note that with this choice of moment conditions, we have that $D_n \equiv \frac{\partial}{\partial\theta} m_n'(\theta)$ (a $K \times g$ matrix) is
\[
D_n(\theta) = \frac{\partial}{\partial\theta}\frac{1}{n} h_n'(\theta) Z_n = \frac{1}{n} H_n(\theta) Z_n,
\]
which we can define to be
\[
D_n(\theta) = \frac{1}{n} H_n Z_n,
\]
where $H_n$ is a $K \times n$ matrix that has the derivatives of the individual moment conditions as its columns. Likewise, define the var-cov. of the moment conditions
\[
\Omega_n = E\left[n\, m_n(\theta_0)\, m_n(\theta_0)'\right]
= E\left[\frac{1}{n} Z_n'\, h_n(\theta_0)\, h_n(\theta_0)'\, Z_n\right]
= Z_n' E\left[\frac{1}{n} h_n(\theta_0) h_n(\theta_0)'\right] Z_n
\equiv Z_n'\frac{\Phi_n}{n} Z_n,
\]
where we have defined $\Phi_n = \operatorname{Var} h_n(\theta_0)$. Note that $\Phi_n$ is growing with the sample size and is not consistently estimable without additional assumptions.

The asymptotic normality theorem above says that the GMM estimator using the optimal weighting matrix is distributed as
\[
\sqrt{n}\left(\hat{\theta} - \theta_0\right) \stackrel{d}{\longrightarrow} N(0, V_\infty),
\]
where
\[
(15.6.1)\qquad V_\infty = \lim\left[\left(\frac{H_n Z_n}{n}\right)\left(\frac{Z_n'\Phi_n Z_n}{n}\right)^{-1}\left(\frac{Z_n' H_n'}{n}\right)\right]^{-1}.
\]
Using an argument similar to that used to prove that $\Omega_\infty^{-1}$ is the efficient weighting matrix, we can show that putting
\[
Z_n = \Phi_n^{-1} H_n'
\]
causes the above var-cov matrix to simplify to
\[
(15.6.2)\qquad V_\infty = \lim\left[\frac{H_n \Phi_n^{-1} H_n'}{n}\right]^{-1},
\]
and furthermore, this matrix is smaller than the limiting var-cov for any other choice of instrumental variables. (To prove this, examine the difference of the inverses of the var-cov matrices with the optimal instruments and with non-optimal instruments. As above, you can show that the difference is positive semi-definite).

• Note that both $H_n$, which we should write more properly as $H_n(\theta_0)$ since it depends on $\theta_0$, and $\Phi$ must be consistently estimated to apply this.
• Usually, estimation of $H_n$ is straightforward - one just uses $\hat{H} = \frac{\partial}{\partial\theta} h_n'(\tilde{\theta})$, where $\tilde{\theta}$ is some initial consistent estimator based on non-optimal instruments.
• Estimation of $\Phi_n$ may not be possible. It is an $n \times n$ matrix, so it has more unique elements than $n$, the sample size, so without restrictions on the parameters it can't be estimated consistently. Basically, you need to provide a parametric specification of the covariances of the $h_t(\theta)$ in order to be able to use optimal instruments. A solution is to approximate this matrix parametrically to define the instruments. Note that the simplified var-cov matrix in equation 15.6.2 will not apply if approximately optimal instruments are used - it will be necessary to use an estimator based upon equation 15.6.1, where the term $\frac{Z_n'\Phi_n Z_n}{n}$ must be estimated consistently apart, for example by the Newey-West procedure.

15.7. Estimation using dynamic moment conditions


Note that dynamic moment conditions simplify the var-cov matrix, but are often harder to formulate. They will be added in future editions. For now, the Hansen application below is enough.

15.8. A specification test

The first order conditions for minimization, using an estimate of the optimal weighting matrix, are
\[
\frac{\partial}{\partial\theta} s(\hat{\theta}) = 2\left[\frac{\partial}{\partial\theta} m_n'(\hat{\theta})\right]\hat{\Omega}^{-1} m_n(\hat{\theta}) \equiv 0
\]
or
\[
D(\hat{\theta})\,\hat{\Omega}^{-1} m_n(\hat{\theta}) \equiv 0.
\]
Consider a Taylor expansion of $m(\hat{\theta})$:
\[
(15.8.1)\qquad m(\hat{\theta}) = m_n(\theta_0) + D_n'(\theta_0)\left(\hat{\theta} - \theta_0\right) + o_p(1).
\]
Multiplying by $D\hat{\Omega}^{-1}$ we obtain
\[
D(\hat{\theta})\,\hat{\Omega}^{-1} m(\hat{\theta}) = D(\hat{\theta})\,\hat{\Omega}^{-1} m_n(\theta_0) + D(\hat{\theta})\,\hat{\Omega}^{-1} D(\theta_0)'\left(\hat{\theta} - \theta_0\right) + o_p(1).
\]
The lhs is zero, and since $\hat{\theta}$ tends to $\theta_0$ and $\hat{\Omega}$ tends to $\Omega_\infty$, we can write
\[
D_\infty \Omega_\infty^{-1} m_n(\theta_0) \stackrel{a}{=} -D_\infty \Omega_\infty^{-1} D_\infty'\left(\hat{\theta} - \theta_0\right),
\]
or
\[
\sqrt{n}\left(\hat{\theta} - \theta_0\right) \stackrel{a}{=} -\left(D_\infty \Omega_\infty^{-1} D_\infty'\right)^{-1} D_\infty \Omega_\infty^{-1}\sqrt{n}\, m_n(\theta_0).
\]
With this, and taking into account the original expansion (equation 15.8.1), we get
\[
\sqrt{n}\, m(\hat{\theta}) \stackrel{a}{=} \sqrt{n}\, m_n(\theta_0) - D_\infty'\left(D_\infty \Omega_\infty^{-1} D_\infty'\right)^{-1} D_\infty \Omega_\infty^{-1}\sqrt{n}\, m_n(\theta_0).
\]
This last can be written as
\[
\Omega_\infty^{-1/2}\sqrt{n}\, m(\hat{\theta}) \stackrel{a}{=} \left[I_g - \Omega_\infty^{-1/2} D_\infty'\left(D_\infty \Omega_\infty^{-1} D_\infty'\right)^{-1} D_\infty \Omega_\infty^{-1/2}\right]\Omega_\infty^{-1/2}\sqrt{n}\, m_n(\theta_0).
\]
Now $\Omega_\infty^{-1/2}\sqrt{n}\, m_n(\theta_0) \stackrel{d}{\longrightarrow} N(0, I_g)$, and one can easily verify that
\[
P = \left[I_g - \Omega_\infty^{-1/2} D_\infty'\left(D_\infty \Omega_\infty^{-1} D_\infty'\right)^{-1} D_\infty \Omega_\infty^{-1/2}\right]
\]
is idempotent of rank $g - K$ (recall that the rank of an idempotent matrix is equal to its trace), so
\[
\left(\Omega_\infty^{-1/2}\sqrt{n}\, m(\hat{\theta})\right)'\left(\Omega_\infty^{-1/2}\sqrt{n}\, m(\hat{\theta})\right) = n\, m(\hat{\theta})'\Omega_\infty^{-1} m(\hat{\theta}) \stackrel{d}{\longrightarrow} \chi^2(g - K).
\]
Since $\hat{\Omega}$ converges to $\Omega_\infty$, we also have
\[
n\, m(\hat{\theta})'\hat{\Omega}^{-1} m(\hat{\theta}) \stackrel{d}{\longrightarrow} \chi^2(g - K),
\]
or
\[
n \cdot s_n(\hat{\theta}) \stackrel{d}{\longrightarrow} \chi^2(g - K),
\]
supposing the model is correctly specified. This is a convenient test since we just multiply the optimized value of the objective function by $n$, and compare with a $\chi^2(g - K)$ critical value. The test is a general test of whether or not the moments used to estimate are correctly specified.

• This won't work when the estimator is just identified. The f.o.c. are
\[
D\,\hat{\Omega}^{-1} m(\hat{\theta}) \equiv 0.
\]
But with exact identification, both $D$ and $\hat{\Omega}$ are square and invertible (at least asymptotically, assuming that asymptotic normality holds), so
\[
m(\hat{\theta}) \equiv 0.
\]
So the moment conditions are zero regardless of the weighting matrix used. As such, we might as well use an identity matrix and save trouble. Also $s_n(\hat{\theta}) = 0$, so the test breaks down.
• A note: this sort of test often over-rejects in finite samples. If the sample size is small, it might be better to use bootstrap critical values. That is, draw artificial samples of size $n$ by sampling from the data with replacement. For $R$ bootstrap samples, optimize and calculate the test statistic $n \cdot s_j(\hat{\theta}_j)$, $j = 1, 2, \ldots, R$. Define the bootstrap critical value $C_b$ such that $100\alpha$ percent of the $n \cdot s_j(\hat{\theta}_j)$ exceed the value. Of course, $R$ must be a very large number if $\alpha$ is small, in order to determine the critical value with precision. This sort of test has been found to have quite good small sample properties.
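As an illustration, once the efficient-GMM objective has been minimized the test statistic is a one-liner. The sketch below assumes the iterated estimation loop shown earlier has already produced theta and W (the inverse of the estimated moment covariance), with n, g and K the sample size and the numbers of moments and parameters.

% Sketch: Hansen's J test of the overidentifying restrictions.
J = n * gmm_obj(theta, @moments, data, W);  % n times the optimized objective value
df = g - K;                                 % degrees of freedom
pvalue = 1 - gammainc(J/2, df/2);           % chi-square(df) upper tail probability
printf("J = %f, df = %d, p-value = %f\n", J, df, pvalue);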

15.9. Other estimators interpreted as GMM estimators


15.9.1. OLS with heteroscedasticity of unknown form.

EXAMPLE 26. White's heteroscedastic consistent varcov estimator for OLS.

Suppose $y = X\beta_0 + \varepsilon$, where $\varepsilon \sim N(0, \Sigma)$, $\Sigma$ a diagonal matrix.

• The typical approach is to parameterize $\Sigma = \Sigma(\sigma)$, where $\sigma$ is a finite dimensional parameter vector, and to estimate $\beta$ and $\sigma$ jointly (feasible GLS). This will work well if the parameterization of $\Sigma$ is correct.
• If we're not confident about parameterizing $\Sigma$, we can still estimate $\beta$ consistently by OLS. However, the typical covariance estimator $V(\hat{\beta}) = (X'X)^{-1}\hat{\sigma}^2$ will be biased and inconsistent, and will lead to invalid inferences.

By exogeneity of the regressors $x_t$ (a $K \times 1$ column vector) we have $E(x_t\varepsilon_t) = 0$, which suggests the moment condition
\[
m_t(\beta) = x_t\left(y_t - x_t'\beta\right).
\]
In this case, we have exact identification ($K$ parameters and $K$ moment conditions). We have
\[
m(\beta) = \frac{1}{n}\sum_t m_t = \frac{1}{n}\sum_t x_t y_t - \frac{1}{n}\sum_t x_t x_t'\beta.
\]
For any choice of $W$, $m(\beta)$ will be identically zero at the minimum, due to exact identification. That is, since the number of moment conditions is identical to the number of parameters, the foc imply that $m(\hat{\beta}) \equiv 0$ regardless of $W$. There is no need to use the "optimal" weighting matrix in this case, an identity matrix works just as well for the purpose of estimation. Therefore
\[
\hat{\beta} = \left(\sum_t x_t x_t'\right)^{-1}\sum_t x_t y_t = (X'X)^{-1}X'y,
\]
which is the usual OLS estimator.

The GMM estimator of the asymptotic varcov matrix is $\left(\widehat{D_\infty}\hat{\Omega}^{-1}\widehat{D_\infty}'\right)^{-1}$. Recall that $\widehat{D_\infty}$ is simply $\frac{\partial}{\partial\theta} m_n'(\hat{\theta})$. In this case
\[
\widehat{D_\infty} = -\frac{1}{n}\sum_t x_t x_t' = -\frac{X'X}{n}.
\]
Recall that a possible estimator of $\Omega$ is
\[
\hat{\Omega} = \hat{\Gamma}_0 + \sum_{v=1}^{n-1}\left(\hat{\Gamma}_v + \hat{\Gamma}_v'\right).
\]
This is in general inconsistent, but in the present case of nonautocorrelation, it simplifies to
\[
\hat{\Omega} = \hat{\Gamma}_0,
\]
which has a constant number of elements to estimate, so information will accumulate, and consistency obtains. In the present case
\[
\hat{\Omega} = \hat{\Gamma}_0 = \frac{1}{n}\sum_t \hat{m}_t \hat{m}_t' = \frac{1}{n}\sum_t x_t x_t'\hat{\varepsilon}_t^2 = \frac{X'\hat{E}X}{n},
\]
where $\hat{E}$ is an $n \times n$ diagonal matrix with $\hat{\varepsilon}_t^2$ in position $t, t$.

Therefore, the GMM varcov estimator, which is consistent, is
\[
\hat{V}\left[\sqrt{n}\left(\hat{\beta} - \beta\right)\right]
= \left[\left(-\frac{X'X}{n}\right)\left(\frac{X'\hat{E}X}{n}\right)^{-1}\left(-\frac{X'X}{n}\right)\right]^{-1}
= \left(\frac{X'X}{n}\right)^{-1}\left(\frac{X'\hat{E}X}{n}\right)\left(\frac{X'X}{n}\right)^{-1}.
\]
This is the varcov estimator that White (1980) arrived at in an influential article. This estimator is consistent under heteroscedasticity of an unknown form. If there is autocorrelation, the Newey-West estimator can be used to estimate $\Omega$ - the rest is the same.
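A short Octave sketch of the White estimator, written directly from the GMM varcov formula above and assuming y and X hold the data, might look as follows.

% Sketch: OLS with White's heteroscedasticity-consistent covariance estimator.
betahat = (X' * X) \ (X' * y);          % usual OLS estimator
ehat = y - X * betahat;                 % residuals
n = rows(X);
XEX = (X .* (ehat.^2))' * X / n;        % (1/n) sum_t x_t x_t' ehat_t^2
XX = X' * X / n;
V = inv(XX) * XEX * inv(XX);            % est. asy. var of sqrt(n)(betahat - beta)
se = sqrt(diag(V) / n);                 % standard errors of betahat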

15.9.2. Weighted Least Squares. Consider the previous example of a linear model with heteroscedasticity of unknown form:
\[
y = X\beta_0 + \varepsilon,\qquad \varepsilon \sim N(0, \Sigma),
\]
where $\Sigma$ is a diagonal matrix.

Now, suppose that the form of $\Sigma$ is known, so that $\Sigma(\theta_0)$ is a correct parametric specification (which may also depend upon $X$). In this case, the GLS estimator is
\[
\tilde{\beta} = \left(X'\Sigma^{-1}X\right)^{-1}X'\Sigma^{-1}y.
\]
This estimator can be interpreted as the solution to the $K$ moment conditions
\[
m(\tilde{\beta}) = \frac{1}{n}\sum_t \frac{x_t y_t}{\sigma_t(\theta_0)} - \frac{1}{n}\sum_t \frac{x_t x_t'}{\sigma_t(\theta_0)}\tilde{\beta} \equiv 0,
\]
where $\sigma_t(\theta_0)$ is the $t$-th diagonal element of $\Sigma(\theta_0)$. That is, the GLS estimator in this case has an obvious representation as a GMM estimator. With autocorrelation, the representation exists but it is a little more complicated. Nevertheless, the idea is the same. There are a few points:

• The (feasible) GLS estimator is known to be asymptotically efficient in the class of linear asymptotically unbiased estimators (Gauss-Markov).
• This means that it is more efficient than the above example of OLS with White's heteroscedastic consistent covariance, which is an alternative GMM estimator.
• This means that the choice of the moment conditions is important to achieve efficiency.

15.9.3. 2SLS. Consider the linear model
\[
y_t = z_t'\beta + \varepsilon_t,
\]
or
\[
y = Z\beta + \varepsilon
\]
using the usual construction, where $\beta$ is $K \times 1$ and $\varepsilon_t$ is i.i.d. Suppose that this equation is one of a system of simultaneous equations, so that $z_t$ contains both endogenous and exogenous variables. Suppose that $x_t$ is the vector of all exogenous and predetermined variables that are uncorrelated with $\varepsilon_t$ (suppose that $x_t$ is $r \times 1$).

Define $\hat{Z}$ as the vector of predictions of $Z$ when regressed upon $X$, e.g.,
\[
\hat{Z} = X\left(X'X\right)^{-1}X'Z.
\]
Since $\hat{Z}$ is a linear combination of the exogenous variables $x$, $\hat{z}_t$ must be uncorrelated with $\varepsilon$. This suggests the $K$-dimensional moment condition $m_t(\beta) = \hat{z}_t\left(y_t - z_t'\beta\right)$, and so
\[
m(\beta) = \frac{1}{n}\sum_t \hat{z}_t\left(y_t - z_t'\beta\right).
\]
Since we have $K$ parameters and $K$ moment conditions, the GMM estimator will set $m$ identically equal to zero, regardless of $W$, so we have
\[
\hat{\beta} = \left(\sum_t \hat{z}_t z_t'\right)^{-1}\sum_t \hat{z}_t y_t = \left(\hat{Z}'Z\right)^{-1}\hat{Z}'y.
\]
This is the standard formula for 2SLS. We use the exogenous variables and the reduced form predictions of the endogenous variables as instruments, and apply IV estimation. See Hamilton pp. 420-21 for the varcov formula (which is the standard formula for 2SLS), and for how to deal with $\varepsilon_t$ heterogeneous and dependent (basically, just use the Newey-West or some other consistent estimator of $\Omega$, and apply the usual formula). Note that $\varepsilon_t$ dependent causes lagged endogenous variables to lose their status as legitimate instruments.
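Here is a compact Octave sketch of the formula, assuming y, Z (regressors, possibly endogenous) and X (instruments) are data matrices:

% Sketch: 2SLS / generalized IV estimator written as in the text.
Zhat = X * ((X' * X) \ (X' * Z));          % reduced-form predictions of Z
betahat_2sls = (Zhat' * Z) \ (Zhat' * y);  % (Zhat'Z)^{-1} Zhat'y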


15.9.4. Nonlinear simultaneous equations. GMM provides a convenient way to estimate nonlinear systems of simultaneous equations. We have a system of equations of the form
\[
\begin{aligned}
y_{1t} &= f_1(z_t, \theta_1^0) + \varepsilon_{1t}\\
y_{2t} &= f_2(z_t, \theta_2^0) + \varepsilon_{2t}\\
&\ \ \vdots\\
y_{Gt} &= f_G(z_t, \theta_G^0) + \varepsilon_{Gt},
\end{aligned}
\]
or in compact notation
\[
y_t = f(z_t, \theta^0) + \varepsilon_t,
\]
where $f(\cdot)$ is a $G$-vector valued function, and $\theta^0 = \left(\theta_1^{0\prime}, \theta_2^{0\prime}, \ldots, \theta_G^{0\prime}\right)'$.

We need to find an $l_i \times 1$ vector of instruments $x_{it}$, for each equation, that are uncorrelated with $\varepsilon_{it}$. Typical instruments would be low order monomials in the exogenous variables in $z_t$, with their lagged values. Then we can define the $\left(\sum_{i=1}^{G} l_i\right) \times 1$ orthogonality conditions
\[
m_t(\theta) = \begin{bmatrix}
\left(y_{1t} - f_1(z_t, \theta_1)\right)x_{1t}\\
\left(y_{2t} - f_2(z_t, \theta_2)\right)x_{2t}\\
\vdots\\
\left(y_{Gt} - f_G(z_t, \theta_G)\right)x_{Gt}
\end{bmatrix}.
\]
• A note on identification: selection of instruments that ensure identification is a non-trivial problem.
• A note on efficiency: the selected set of instruments has important effects on the efficiency of estimation. Unfortunately there is little theory offering guidance on what is the optimal set. More on this later.


15.9.5. Maximum likelihood. In the introduction we argued that ML will in general be more efficient than GMM since ML implicitly uses all of the moments of the distribution while GMM uses a limited number of moments. Actually, a distribution with $P$ parameters can be uniquely characterized by $P$ moment conditions. However, some sets of $P$ moment conditions may contain more information than others, since the moment conditions could be highly correlated. A GMM estimator that chose an optimal set of $P$ moment conditions would be fully efficient. Here we'll see that the optimal moment conditions are simply the scores of the ML estimator.

Let $y_t$ be a $G$-vector of variables, and let $Y_t = (y_1', y_2', \ldots, y_t')'$. Then at time $t$, $Y_{t-1}$ has been observed (refer to it as the information set, since we assume the conditioning variables have been selected to take advantage of all useful information). The likelihood function is the joint density of the sample:
\[
L(\theta) = f(y_1, y_2, \ldots, y_n, \theta),
\]
which can be factored as
\[
L(\theta) = f(y_n|Y_{n-1}, \theta)\cdot f(Y_{n-1}, \theta),
\]
and we can repeat this to get
\[
L(\theta) = f(y_n|Y_{n-1}, \theta)\cdot f(y_{n-1}|Y_{n-2}, \theta)\cdot\ldots\cdot f(y_1).
\]
The log-likelihood function is therefore
\[
\ln L(\theta) = \sum_{t=1}^{n}\ln f(y_t|Y_{t-1}, \theta).
\]
Define
\[
m_t(Y_t, \theta) \equiv D_\theta \ln f(y_t|Y_{t-1}, \theta)
\]
as the score of the $t$-th observation. It can be shown, under the regularity conditions, that the scores have conditional mean zero when evaluated at $\theta_0$ (see notes to Introduction to Econometrics):
\[
E\left\{m_t(Y_t, \theta_0)|Y_{t-1}\right\} = 0,
\]
so one could interpret these as moment conditions to use to define a just-identified GMM estimator (if there are $K$ parameters there are $K$ score equations). The GMM estimator sets
\[
\frac{1}{n}\sum_{t=1}^{n} m_t(Y_t, \hat{\theta}) = \frac{1}{n}\sum_{t=1}^{n} D_\theta \ln f(y_t|Y_{t-1}, \hat{\theta}) = 0,
\]
which are precisely the first order conditions of MLE. Therefore, MLE can be interpreted as a GMM estimator. The GMM varcov formula is $V_\infty = \left(D_\infty \Omega^{-1} D_\infty'\right)^{-1}$.

Consistent estimates of the variance components are as follows.

• $D_\infty$:
\[
\widehat{D_\infty} = \frac{\partial}{\partial\theta} m_n'(Y_t, \hat{\theta}) = \frac{1}{n}\sum_{t=1}^{n} D_\theta^2 \ln f(y_t|Y_{t-1}, \hat{\theta}).
\]
• $\Omega$: It is important to note that $m_t$ and $m_{t-s}$, $s > 0$, are both conditionally and unconditionally uncorrelated. Conditional uncorrelation follows from the fact that $m_{t-s}$ is a function of $Y_{t-s}$, which is in the information set at time $t$. Unconditional uncorrelation follows from the fact that conditional uncorrelation holds regardless of the realization of $Y_{t-1}$, so marginalizing with respect to $Y_{t-1}$ preserves uncorrelation (see the section on ML estimation, above). The fact that the scores are serially uncorrelated implies that $\Omega$ can be estimated by the estimator of the 0-th autocovariance of the moment conditions:
\[
\hat{\Omega} = \frac{1}{n}\sum_{t=1}^{n} m_t(Y_t, \hat{\theta})\, m_t(Y_t, \hat{\theta})'
= \frac{1}{n}\sum_{t=1}^{n}\left[D_\theta \ln f(y_t|Y_{t-1}, \hat{\theta})\right]\left[D_\theta \ln f(y_t|Y_{t-1}, \hat{\theta})\right]'.
\]
Recall from the study of ML estimation that the information matrix equality (equation 4.6.2) states that
\[
E\left\{\left[D_\theta \ln f(y_t|Y_{t-1}, \theta_0)\right]\left[D_\theta \ln f(y_t|Y_{t-1}, \theta_0)\right]'\right\} = -E\left\{D_\theta^2 \ln f(y_t|Y_{t-1}, \theta_0)\right\}.
\]
This result implies the well known (and already seen) result that we can estimate $V_\infty$ in any of three ways. Define
\[
\hat{J} = \frac{1}{n}\sum_{t=1}^{n} D_\theta^2 \ln f(y_t|Y_{t-1}, \hat{\theta}),\qquad
\hat{I} = \frac{1}{n}\sum_{t=1}^{n}\left[D_\theta \ln f(y_t|Y_{t-1}, \hat{\theta})\right]\left[D_\theta \ln f(y_t|Y_{t-1}, \hat{\theta})\right]'.
\]
• The sandwich version:
\[
\widehat{V_\infty} = \left[\hat{J}\,\hat{I}^{-1}\hat{J}\right]^{-1}.
\]
• or the inverse of the negative of the Hessian (since the middle and last term cancel, except for a minus sign):
\[
\widehat{V_\infty} = \left[-\hat{J}\right]^{-1}.
\]
• or the inverse of the outer product of the gradient (since the middle and last cancel except for a minus sign, and the first term converges to minus the inverse of the middle term, which is still inside the overall inverse):
\[
\widehat{V_\infty} = \hat{I}^{-1}.
\]
This simplification is a special result for the MLE estimator - it doesn't apply to GMM estimators in general.

Asymptotically, if the model is correctly specified, all of these forms converge to the same limit. In small samples they will differ. In particular, there is evidence that the outer product of the gradient formula does not perform very well in small samples (see Davidson and MacKinnon, pg. 477). White's Information matrix test (Econometrica, 1982) is based upon comparing the two ways to estimate the information matrix: outer product of gradient or negative of the Hessian. If they differ by too much, this is evidence of misspecification of the model.

15.10. Example: The Hausman Test


This section discusses the Hausman test, which was originally presented in Hausman, J.A. (1978), "Specification tests in econometrics", Econometrica, 46, 1251-71.

Consider the simple linear regression model $y_t = x_t'\beta + \epsilon_t$. We assume that the functional form and the choice of regressors is correct, but that some of the regressors may be correlated with the error term, which as you know will produce inconsistency of $\hat{\beta}$. For example, this will be a problem if

• some regressors are endogenous
• some regressors are measured with error
• lagged values of the dependent variable are used as regressors and $\epsilon_t$ is autocorrelated.

FIGURE 15.10.1. OLS and IV estimators when regressors and errors are correlated

To illustrate, the Octave program biased.m performs a Monte Carlo experiment where errors are correlated with regressors, and estimation is by OLS and IV. Figure 15.10.1 shows that the OLS estimator is quite biased, while the IV estimator is on average much closer to the true value. If you play with the program, increasing the sample size, you can see evidence that the OLS estimator is asymptotically biased, while the IV estimator is consistent.

We have seen that an inconsistent estimator and a consistent estimator converge to different probability limits. This is the idea behind the Hausman test - a pair of consistent estimators converge to the same probability limit, while if one is consistent and the other is not they converge to different limits. If we accept that one is consistent (e.g., the IV estimator), but we are doubting if the other is consistent (e.g., the OLS estimator), we might try to check if the difference between the estimators is significantly different from zero.

• If we're doubting about the consistency of OLS (or QML, etc.), why should we be interested in testing - why not just use the IV estimator? Because the OLS estimator is more efficient when the regressors are exogenous and the other classical assumptions (including normality of the errors) hold. When we have a more efficient estimator that relies on stronger assumptions (such as exogeneity) than the IV estimator, we might prefer to use it, unless we have evidence that the assumptions are false.

So, let's consider the covariance between the MLE estimator $\hat{\theta}$ (or any other fully efficient estimator) and some other CAN estimator, say $\tilde{\theta}$. Now, let's recall some results from MLE. Equation 4.4.1 is
\[
\sqrt{n}\left(\hat{\theta} - \theta_0\right) \stackrel{a.s.}{\longrightarrow} -\mathcal{H}_\infty(\theta_0)^{-1}\sqrt{n}\, g(\theta_0).
\]
Equation 4.6.2 is
\[
\mathcal{H}_\infty(\theta) = -\mathcal{I}_\infty(\theta).
\]
Combining these two equations, we get
\[
\sqrt{n}\left(\hat{\theta} - \theta_0\right) \stackrel{a.s.}{\longrightarrow} \mathcal{I}_\infty(\theta_0)^{-1}\sqrt{n}\, g(\theta_0).
\]
Also, equation 4.7.1 tells us that the asymptotic covariance between any CAN estimator and the MLE score vector is
\[
V_\infty\begin{bmatrix}\sqrt{n}\left(\tilde{\theta} - \theta\right)\\ \sqrt{n}\, g(\theta)\end{bmatrix}
= \begin{bmatrix}V_\infty(\tilde{\theta}) & I_K\\ I_K & \mathcal{I}_\infty(\theta)\end{bmatrix}.
\]
Now, consider
\[
\begin{bmatrix}I_K & 0_K\\ 0_K & \mathcal{I}_\infty(\theta)^{-1}\end{bmatrix}
\begin{bmatrix}\sqrt{n}\left(\tilde{\theta} - \theta\right)\\ \sqrt{n}\, g(\theta)\end{bmatrix}
\stackrel{a.s.}{\longrightarrow}
\begin{bmatrix}\sqrt{n}\left(\tilde{\theta} - \theta\right)\\ \sqrt{n}\left(\hat{\theta} - \theta\right)\end{bmatrix}.
\]
The asymptotic covariance of this is
\[
V_\infty\begin{bmatrix}\sqrt{n}\left(\tilde{\theta} - \theta\right)\\ \sqrt{n}\left(\hat{\theta} - \theta\right)\end{bmatrix}
= \begin{bmatrix}I_K & 0_K\\ 0_K & \mathcal{I}_\infty(\theta)^{-1}\end{bmatrix}
\begin{bmatrix}V_\infty(\tilde{\theta}) & I_K\\ I_K & \mathcal{I}_\infty(\theta)\end{bmatrix}
\begin{bmatrix}I_K & 0_K\\ 0_K & \mathcal{I}_\infty(\theta)^{-1}\end{bmatrix}
= \begin{bmatrix}V_\infty(\tilde{\theta}) & \mathcal{I}_\infty(\theta)^{-1}\\ \mathcal{I}_\infty(\theta)^{-1} & \mathcal{I}_\infty(\theta)^{-1}\end{bmatrix},
\]
which, for clarity in what follows, we might write as
\[
V_\infty\begin{bmatrix}\sqrt{n}\left(\tilde{\theta} - \theta\right)\\ \sqrt{n}\left(\hat{\theta} - \theta\right)\end{bmatrix}
= \begin{bmatrix}V_\infty(\tilde{\theta}) & \mathcal{I}_\infty(\theta)^{-1}\\ \mathcal{I}_\infty(\theta)^{-1} & V_\infty(\hat{\theta})\end{bmatrix}.
\]
So, the asymptotic covariance between the MLE and any other CAN estimator is equal to the MLE asymptotic variance (the inverse of the information matrix).

Now, suppose we wish to test whether the two estimators are in fact both converging to $\theta_0$, versus the alternative hypothesis that the "MLE" estimator is not in fact consistent (the consistency of $\tilde{\theta}$ is a maintained hypothesis). Under the null hypothesis that they are, we have
\[
\begin{bmatrix}I_K & -I_K\end{bmatrix}
\begin{bmatrix}\sqrt{n}\left(\tilde{\theta} - \theta_0\right)\\ \sqrt{n}\left(\hat{\theta} - \theta_0\right)\end{bmatrix}
= \sqrt{n}\left(\tilde{\theta} - \hat{\theta}\right),
\]
which will be asymptotically normally distributed as
\[
\sqrt{n}\left(\tilde{\theta} - \hat{\theta}\right) \stackrel{d}{\longrightarrow} N\left(0,\, V_\infty(\tilde{\theta}) - V_\infty(\hat{\theta})\right).
\]
So,
\[
n\left(\tilde{\theta} - \hat{\theta}\right)'\left(V_\infty(\tilde{\theta}) - V_\infty(\hat{\theta})\right)^{-1}\left(\tilde{\theta} - \hat{\theta}\right) \stackrel{d}{\longrightarrow} \chi^2(\rho),
\]
where $\rho$ is the rank of the difference of the asymptotic variances. A statistic that has the same asymptotic distribution is
\[
\left(\tilde{\theta} - \hat{\theta}\right)'\left(\hat{V}(\tilde{\theta}) - \hat{V}(\hat{\theta})\right)^{-1}\left(\tilde{\theta} - \hat{\theta}\right) \stackrel{d}{\longrightarrow} \chi^2(\rho).
\]
This is the Hausman test statistic, in its original form. The reason that this test has power under the alternative hypothesis is that in that case the "MLE" estimator will not be consistent, and will converge to $\theta_A$, say, where $\theta_A \neq \theta_0$. Then the mean of the asymptotic distribution of the vector $\tilde{\theta} - \hat{\theta}$ will be $\theta_0 - \theta_A$, a non-zero vector, so the test statistic will eventually reject, regardless of how small a significance level is used.

• Note: if the test is based on a sub-vector of the entire parameter vector of the MLE, it is possible that the inconsistency of the MLE will not show up in the portion of the vector that has been used. If this is the case, the test may not have power to detect the inconsistency. This may occur, for example, when the consistent but inefficient estimator is not identified for all the parameters of the model.

Some things to note:

• The rank, $\rho$, of the difference of the asymptotic variances is often less than the dimension of the matrices, and it may be difficult to determine what the true rank is. If the true rank is lower than what is taken to be true, the test will be biased against rejection of the null hypothesis. The contrary holds if we underestimate the rank.
• A solution to this problem is to use a rank 1 test, by comparing only a single coefficient. For example, if a variable is suspected of possibly being endogenous, that variable's coefficients may be compared.
• This simple formula only holds when the estimator that is being tested for consistency is fully efficient under the null hypothesis. This means that it must be a ML estimator or a fully efficient estimator that has the same asymptotic distribution as the ML estimator. This is quite restrictive since modern estimators such as GMM and QML are not in general fully efficient.
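As a hedged illustration under the classical conditions (OLS efficient under the null, IV consistent under both hypotheses), the original form of the test can be sketched as below. betahat_ols, V_ols, betahat_iv and V_iv are assumed to be the estimates and their estimated covariance matrices (of the estimators themselves, not scaled by n) produced beforehand; pinv is used since the variance difference may be singular, with the degrees of freedom set to its rank.

% Sketch: Hausman test comparing an efficient estimator (OLS under the null)
% with a consistent but inefficient one (IV).
d = betahat_iv - betahat_ols;          % difference of the two estimators
Vdiff = V_iv - V_ols;                  % estimated variance of the difference
rho = rank(Vdiff);                     % degrees of freedom
H = d' * pinv(Vdiff) * d;              % Hausman statistic
pvalue = 1 - gammainc(H/2, rho/2);     % chi-square(rho) upper tail probability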

Following up on this last point, let's think of two not necessarily efficient estimators, $\hat{\theta}_1$ and $\hat{\theta}_2$, where one is assumed to be consistent, but the other may not be. We assume for expositional simplicity that both $\hat{\theta}_1$ and $\hat{\theta}_2$ belong to the same parameter space, and that they can be expressed as generalized method of moments (GMM) estimators. The estimators are defined (suppressing the dependence upon data) by
\[
\hat{\theta}_i = \arg\min_{\theta_i} m_i(\theta_i)'\, W_i\, m_i(\theta_i),
\]
where $m_i(\theta_i)$ is a $g_i \times 1$ vector of moment conditions, and $W_i$ is a $g_i \times g_i$ positive definite weighting matrix, $i = 1, 2$. Consider the omnibus GMM estimator
\[
(15.10.1)\qquad \left(\hat{\theta}_1, \hat{\theta}_2\right) = \arg\min_{\Theta \times \Theta}
\begin{bmatrix} m_1(\theta_1)' & m_2(\theta_2)' \end{bmatrix}
\begin{bmatrix} W_1 & 0_{(g_1 \times g_2)}\\ 0_{(g_2 \times g_1)} & W_2 \end{bmatrix}
\begin{bmatrix} m_1(\theta_1)\\ m_2(\theta_2) \end{bmatrix}.
\]
Suppose that the asymptotic covariance of the omnibus moment vector is
\[
(15.10.2)\qquad \Sigma = \lim_{n\to\infty}\operatorname{Var}\left\{\sqrt{n}\begin{bmatrix} m_1(\theta_1)\\ m_2(\theta_2)\end{bmatrix}\right\}
\equiv \begin{bmatrix} \Sigma_1 & \Sigma_{12}\\ \cdot & \Sigma_2\end{bmatrix}.
\]
The standard Hausman test is equivalent to a Wald test of the equality of $\theta_1$ and $\theta_2$ (or subvectors of the two) applied to the omnibus GMM estimator, but with the covariance of the moment conditions estimated as
\[
\hat{\Sigma} = \begin{bmatrix} \hat{\Sigma}_1 & 0_{(g_1 \times g_2)}\\ 0_{(g_2 \times g_1)} & \hat{\Sigma}_2\end{bmatrix}.
\]
While this is clearly an inconsistent estimator in general, the omitted $\Sigma_{12}$ term cancels out of the test statistic when one of the estimators is asymptotically efficient, as we have seen above, and thus it need not be estimated.

The general solution when neither of the estimators is efficient is clear: the entire $\Sigma$ matrix must be estimated consistently, since the $\Sigma_{12}$ term will not cancel out. Methods for consistently estimating the asymptotic covariance of a vector of moment conditions are well-known, e.g., the Newey-West estimator discussed previously. The Hausman test using a proper estimator of the overall covariance matrix will now have an asymptotic $\chi^2$ distribution when neither estimator is efficient. However, the test suffers from a loss of power due to the fact that the omnibus GMM estimator of equation 15.10.1 is defined using an inefficient weight matrix. A new test can be defined by using an alternative omnibus GMM estimator
\[
(15.10.3)\qquad \left(\hat{\theta}_1, \hat{\theta}_2\right) = \arg\min_{\Theta \times \Theta}
\begin{bmatrix} m_1(\theta_1)' & m_2(\theta_2)' \end{bmatrix}
\left(\tilde{\Sigma}\right)^{-1}
\begin{bmatrix} m_1(\theta_1)\\ m_2(\theta_2)\end{bmatrix},
\]
where $\tilde{\Sigma}$ is a consistent estimator of the overall covariance matrix $\Sigma$ of equation 15.10.2. By standard arguments, this is a more efficient estimator than that defined by equation 15.10.1, so the Wald test using this alternative is more powerful. See my article in Applied Economics, 2004, for more details, including simulation results.

15.11. Application: Nonlinear rational expectations


Readings: Hansen and Singleton, 1982; Tauchen, 1986.

Though GMM estimation has many applications, application to rational expectations models is elegant, since theory directly suggests the moment conditions. Hansen and Singleton's 1982 paper is also a classic worth studying in itself. Though I strongly recommend reading the paper, I'll use a simplified model with similar notation to Hamilton's.

We assume a representative consumer maximizes expected discounted utility over an infinite horizon. Utility is temporally additive, and the expected utility hypothesis holds. The future consumption stream is the stochastic sequence $\{c_t\}_{t=0}^{\infty}$. The objective function at time $t$ is the discounted expected utility
\[
(15.11.1)\qquad \sum_{s=0}^{\infty}\beta^s E\left(u(c_{t+s})|I_t\right).
\]
• The parameter $\beta$ is between 0 and 1, and reflects discounting.
• $I_t$ is the information set at time $t$, and includes all realizations of random variables indexed $t$ and earlier.
• The choice variable is $c_t$ - current consumption, which is constrained to be less than or equal to current wealth $w_t$.
• Suppose the consumer can invest in a risky asset. A dollar invested in the asset yields a gross return
\[
(1 + r_{t+1}) = \frac{p_{t+1} + d_{t+1}}{p_t},
\]
where $p_t$ is the price and $d_t$ is the dividend in period $t$. The price of $c_t$ is normalized to 1.
• Current wealth $w_t = (1 + r_t) i_{t-1}$, where $i_{t-1}$ is investment in period $t - 1$. So the problem is to allocate current wealth between current consumption and investment to finance future consumption: $w_t = c_t + i_t$.
• Future net rates of return $r_{t+s}$, $s > 0$ are not known in period $t$: the asset is risky.

A partial set of necessary conditions for utility maximization have the form:
\[
(15.11.2)\qquad u'(c_t) = \beta E\left\{u'(c_{t+1})(1 + r_{t+1})|I_t\right\}.
\]
To see that the condition is necessary, suppose that the lhs < rhs. Then reducing current consumption marginally would cause equation 15.11.1 to drop by $u'(c_t)$, since there is no discounting of the current period. At the same time, the marginal reduction in consumption finances investment, which has gross return $(1 + r_{t+1})$, which could finance consumption in period $t + 1$. This increase in consumption would cause the objective function to increase by $\beta E\left\{u'(c_{t+1})(1 + r_{t+1})|I_t\right\}$. Therefore, unless the condition holds, the expected discounted utility function is not maximized.

• To use this we need to choose the functional form of utility. A constant relative risk aversion form is
\[
u(c_t) = \frac{c_t^{1-\gamma} - 1}{1 - \gamma},
\]
where $\gamma$ is the coefficient of relative risk aversion ($\gamma \neq 1$). With this form,
\[
u'(c_t) = c_t^{-\gamma},
\]
so the foc are
\[
c_t^{-\gamma} = \beta E\left\{c_{t+1}^{-\gamma}(1 + r_{t+1})|I_t\right\}.
\]
While it is true that
\[
E\left[\left(c_t^{-\gamma} - \beta c_{t+1}^{-\gamma}(1 + r_{t+1})\right)|I_t\right] = 0,
\]
so that we could use this to define moment conditions, it is unlikely that $c_t$ is stationary, even though it is in real terms, and our theory requires stationarity. To solve this, divide through by $c_t^{-\gamma}$:
\[
E\left[\left(1 - \beta\left(\frac{c_{t+1}}{c_t}\right)^{-\gamma}(1 + r_{t+1})\right)\Big|I_t\right] = 0
\]
(note that $c_t^{-\gamma}$ can be passed through the conditional expectation since $c_t$ is chosen based only upon information available in time $t$).

Suppose that $x_t$ is a vector of variables drawn from the information set $I_t$. We can use the necessary conditions to form the expressions
\[
\left[1 - \beta\left(\frac{c_{t+1}}{c_t}\right)^{-\gamma}(1 + r_{t+1})\right]x_t \equiv m_t(\theta),
\]
where $\theta$ represents $\beta$ and $\gamma$.

• Therefore, the above expression may be interpreted as a moment condition which can be used for GMM estimation of the parameters $\theta_0$.

Note that at time $t$, $m_{t-s}$ has been observed, and is therefore an element of the information set. By rational expectations, the autocovariances of the moment conditions other than $\Gamma_0$ should be zero. The optimal weighting matrix is therefore the inverse of the variance of the moment conditions:
\[
\Omega_\infty = \lim E\left[n\, m_n(\theta_0)\, m_n(\theta_0)'\right],
\]
which can be consistently estimated by
\[
\hat{\Omega} = \frac{1}{n}\sum_{t=1}^{n} m_t(\hat{\theta})\, m_t(\hat{\theta})'.
\]
As before, this estimate depends on an initial consistent estimate of $\theta$, which can be obtained by setting the weighting matrix $W$ arbitrarily (to an identity matrix, for example). After obtaining $\hat{\theta}$, we then minimize
\[
s(\theta) = m_n(\theta)'\hat{\Omega}^{-1} m_n(\theta).
\]
This process can be iterated, e.g., use the new estimate to re-estimate $\Omega$, use this to estimate $\theta_0$, and repeat until the estimates don't change.

• This whole approach relies on the very strong assumption that equation 15.11.2 holds without error. Supposing agents were heterogeneous, this wouldn't be reasonable. If there were an error term here, it could potentially be autocorrelated, which would no longer allow any variable in the information set to be used as an instrument.
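The moment conditions just derived are easy to code. The sketch below is a hypothetical moment-contribution routine in the spirit of the moments functions sketched earlier: it takes $\theta = (\beta, \gamma)'$, consumption growth cgrowth ($c_{t+1}/c_t$), gross returns R ($1 + r_{t+1}$), and an $n \times l$ instrument matrix Xinst, all aligned so that row $t$ of the instruments is in the time-$t$ information set. Names and layout are assumptions for illustration.

% Sketch: moment contributions for the CRRA Euler equation,
% m_t(theta) = [1 - beta * (c_{t+1}/c_t)^(-gamma) * (1 + r_{t+1})] * x_t.
function M = euler_moments(theta, cgrowth, R, Xinst)
  beta = theta(1); gamma = theta(2);
  e = 1 - beta * (cgrowth .^ (-gamma)) .* R;  % n x 1 Euler equation "errors"
  M = Xinst .* e;                             % n x l matrix: each instrument times e_t
endfunction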

• In principle, we could use a very large number of moment conditions in estimation, since any current or lagged variable could be used in $x_t$. Since use of more moment conditions will lead to a more (asymptotically) efficient estimator, one might be tempted to use many instrumental variables. We will do a computer lab that will show that this may not be a good idea with finite samples. This issue has been studied using Monte Carlos (Tauchen, JBES, 1986). The reason for poor performance when using many instruments is that the estimate of $\Omega$ becomes very imprecise.
• Empirical papers that use this approach often have serious problems in obtaining precise estimates of the parameters. Note that we are basing everything on a single partial first order condition. Probably this f.o.c. is simply not informative enough. Simulation-based estimation methods (discussed below) are one means of trying to use more informative moment conditions to estimate this sort of model.
15.12. Empirical example: a portfolio model
The Octave program portfolio.m performs GMM estimation of a portfolio model, using the data file tauchen.data. The columns of this data file are $c$, $p$, and $d$ in that order. There are 95 observations (source: Tauchen, 1986). As instruments we use 2 lags of $c$ and $r$. The estimation results are

***********************************************
Example of GMM estimation of rational expectations model

GMM Estimation Results
BFGS convergence: Normal convergence

Objective function value: 0.071872
Observations: 93

              Value      df   p-value
X^2 test     6.6841  5.0000    0.2452

         estimate   st. err    t-stat   p-value
beta       0.8723    0.0220   39.6079    0.0000
gamma      3.1555    0.2854   11.0580    0.0000
***********************************************

• Iterate the estimation of $\hat{\theta} = (\hat{\beta}, \hat{\gamma})$ and $\hat{\Omega}$ to convergence.
• Comment on the results. Are the results sensitive to the set of instruments used? (Look at $\hat{\Omega}$ as well as $\hat{\theta}$. Experiment with the program using lags of 1, 3 and 4 periods to define instruments.) Are these good instruments? Are the instruments highly correlated with one another?


Exercises
(1) Show how to cast the generalized IV estimator presented in section 11.4 as a GMM estimator. Identify what are the moment conditions $m_t(\theta)$, what is the form of the matrix $D_n$, what is the efficient weight matrix, and show that the covariance matrix formula given previously corresponds to the GMM covariance matrix formula.
(2) Using Octave, generate data from the logit dgp. Recall that $E(y_t|x_t) = p(x_t, \theta) = \left[1 + \exp(-x_t'\theta)\right]^{-1}$. Consider the moment conditions (exactly identified): $m_t(\theta) = \left[y_t - p(x_t, \theta)\right]x_t$.
   (a) Estimate by GMM, using these moments. Estimate by MLE.
   (b) The two estimators should coincide. Prove analytically that the estimators coincide.
(3) Verify the missing steps needed to show that $n \cdot m(\hat{\theta})'\hat{\Omega}^{-1}m(\hat{\theta})$ has a $\chi^2(g - K)$ distribution. That is, show that the monster matrix is idempotent and has trace equal to $g - K$.

CHAPTER 16

Quasi-ML
Quasi-ML is the estimator one obtains when a misspecified probability model is used to calculate an "ML" estimator.

Given a sample of size $n$ of a random vector $\mathbf{y}$ and a vector of conditioning variables $\mathbf{x}$, suppose the joint density of $\mathbf{Y} = (\mathbf{y}_1, \ldots, \mathbf{y}_n)$ conditional on $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$ is a member of the parametric family $p_Y(\mathbf{Y}|\mathbf{X}, \rho)$, $\rho \in \Xi$. The true joint density is associated with the vector $\rho_0$:
\[
p_Y(\mathbf{Y}|\mathbf{X}, \rho_0).
\]
As long as the marginal density of $\mathbf{X}$ doesn't depend on $\rho_0$, this conditional density fully characterizes the random characteristics of samples: i.e., it fully describes the probabilistically important features of the d.g.p. The likelihood function is just this density evaluated at other values $\rho$:
\[
L(\mathbf{Y}|\mathbf{X}, \rho) = p_Y(\mathbf{Y}|\mathbf{X}, \rho),\qquad \rho \in \Xi.
\]
• Let $\mathbf{Y}_{t-1} = (\mathbf{y}_1, \ldots, \mathbf{y}_{t-1})$, $\mathbf{Y}_0 = 0$, and let $\mathbf{X}_t = (\mathbf{x}_1, \ldots, \mathbf{x}_t)$. The likelihood function, taking into account possible dependence of observations, can be written as
\[
L(\mathbf{Y}|\mathbf{X}, \rho) = \prod_{t=1}^{n} p_t(\mathbf{y}_t|\mathbf{Y}_{t-1}, \mathbf{X}_t, \rho) \equiv \prod_{t=1}^{n} p_t(\rho).
\]
• The average log-likelihood function is:
\[
s_n(\rho) = \frac{1}{n}\ln L(\mathbf{Y}|\mathbf{X}, \rho) = \frac{1}{n}\sum_{t=1}^{n}\ln p_t(\rho).
\]
• Suppose that we do not have knowledge of the family of densities $p_t(\rho)$. Mistakenly, we may assume that the conditional density of $\mathbf{y}_t$ is a member of the family $f_t(\mathbf{y}_t|\mathbf{Y}_{t-1}, \mathbf{X}_t, \theta)$, $\theta \in \Theta$, where there is no $\theta_0$ such that $f_t(\mathbf{y}_t|\mathbf{Y}_{t-1}, \mathbf{X}_t, \theta_0) = p_t(\mathbf{y}_t|\mathbf{Y}_{t-1}, \mathbf{X}_t, \rho_0)$, $\forall t$ (this is what we mean by "misspecified").
• This setup allows for heterogeneous time series data, with dynamic misspecification.

The QML estimator is the argument that maximizes the misspecified average log likelihood, which we refer to as the quasi-log likelihood function. This objective function is
\[
s_n(\theta) = \frac{1}{n}\sum_{t=1}^{n}\ln f_t(\mathbf{y}_t|\mathbf{Y}_{t-1}, \mathbf{X}_t, \theta) \equiv \frac{1}{n}\sum_{t=1}^{n}\ln f_t(\theta)
\]
and the QML is
\[
\hat{\theta}_n = \arg\max_{\Theta} s_n(\theta).
\]
A SLLN for dependent sequences applies (we assume), so that
\[
s_n(\theta) \stackrel{a.s.}{\longrightarrow} \lim_{n\to\infty} E\frac{1}{n}\sum_{t=1}^{n}\ln f_t(\theta) \equiv s_\infty(\theta).
\]
We assume that this can be strengthened to uniform convergence, a.s., following the previous arguments. The "pseudo-true" value of $\theta$ is the value that maximizes $s_\infty(\theta)$:
\[
\theta_0 = \arg\max_{\Theta} s_\infty(\theta).
\]
Given assumptions so that theorem 19 is applicable, we obtain
\[
\lim_{n\to\infty}\hat{\theta}_n = \theta_0,\ \text{a.s.}
\]
• An example of sufficient conditions for consistency are: $\Theta$ is compact; $s_n(\theta)$ is continuous and converges pointwise almost surely to $s_\infty(\theta)$ (this means that $s_\infty(\theta)$ will be continuous, and this combined with compactness of $\Theta$ means $s_\infty(\theta)$ is uniformly continuous); and $\theta_0$ is a unique global maximizer. A stronger version of this assumption that allows for asymptotic normality is that $D_\theta^2 s_\infty(\theta)$ exists and is negative definite in a neighborhood of $\theta_0$.

Applying the asymptotic normality theorem,
\[
\sqrt{n}\left(\hat{\theta} - \theta_0\right) \stackrel{d}{\longrightarrow} N\left[0,\, J_\infty(\theta_0)^{-1} I_\infty(\theta_0) J_\infty(\theta_0)^{-1}\right],
\]
where
\[
J_\infty(\theta_0) = \lim_{n\to\infty} E\, D_\theta^2 s_n(\theta_0)
\]
and
\[
I_\infty(\theta_0) = \lim_{n\to\infty}\operatorname{Var}\sqrt{n}\, D_\theta s_n(\theta_0).
\]
• Note that asymptotic normality only requires that the additional assumptions regarding $J$ and $I$ hold in a neighborhood of $\theta_0$ for $J$ and at $\theta_0$ for $I$, not throughout $\Theta$. In this sense, asymptotic normality is a local property.

16.0.1. Consistent Estimation of Variance Components. Consistent estimation of $J_\infty(\theta_0)$ is straightforward. Assumption (b) of Theorem 22 implies that
\[
J_n(\hat{\theta}_n) = \frac{1}{n}\sum_{t=1}^{n} D_\theta^2 \ln f_t(\hat{\theta}_n)
\stackrel{a.s.}{\longrightarrow} \lim_{n\to\infty} E\frac{1}{n}\sum_{t=1}^{n} D_\theta^2 \ln f_t(\theta_0) = J_\infty(\theta_0).
\]
That is, just calculate the Hessian using the estimate $\hat{\theta}_n$ in place of $\theta_0$.

Consistent estimation of $I_\infty(\theta_0)$ is more difficult, and may be impossible.

• Notation: Let $g_t \equiv D_\theta f_t(\theta_0)$.

We need to estimate
\[
I_\infty(\theta_0) = \lim_{n\to\infty}\operatorname{Var}\sqrt{n}\,\frac{1}{n}\sum_{t=1}^{n} D_\theta \ln f_t(\theta_0)
= \lim_{n\to\infty}\frac{1}{n}\operatorname{Var}\sum_{t=1}^{n} g_t
= \lim_{n\to\infty}\frac{1}{n} E\left\{\left[\sum_{t=1}^{n}\left(g_t - E g_t\right)\right]\left[\sum_{t=1}^{n}\left(g_t - E g_t\right)\right]'\right\}.
\]
This is going to contain a term
\[
\lim_{n\to\infty}\frac{1}{n}\sum_{t=1}^{n}\left(E g_t\right)\left(E g_t\right)',
\]
which will not tend to zero, in general. This term is not consistently estimable in general, since it requires calculating an expectation using the true density under the d.g.p., which is unknown.

• There are important cases where $I_\infty(\theta_0)$ is consistently estimable. For example, suppose that the data come from a random sample (i.e., they are iid). This would be the case with cross sectional data, for example. (Note: we have that the joint distribution of $(y_t, x_t)$ is identical. This does not imply that the conditional density $f(y_t|x_t)$ is identical).
• With random sampling, the limiting objective function is simply
\[
s_\infty(\theta_0) = E_X E_0 \ln f(y|x, \theta_0),
\]
where $E_0$ means expectation of $y|x$ and $E_X$ means expectation with respect to the marginal density of $x$.
• By the requirement that the limiting objective function be maximized at $\theta_0$ we have
\[
D_\theta E_X E_0 \ln f(y|x, \theta_0) = D_\theta s_\infty(\theta_0) = 0.
\]
• The dominated convergence theorem allows switching the order of expectation and differentiation, so
\[
D_\theta s_\infty(\theta_0) = E_X E_0 D_\theta \ln f(y|x, \theta_0) = 0.
\]
The CLT implies that
\[
\frac{1}{\sqrt{n}}\sum_{t=1}^{n} D_\theta \ln f(y_t|x_t, \theta_0) \stackrel{d}{\longrightarrow} N\left(0, I_\infty(\theta_0)\right).
\]
• That is, it's not necessary to subtract the individual means, since they are zero. Given this, and due to independent observations, a consistent estimator is
\[
\hat{I} = \frac{1}{n}\sum_{t=1}^{n}\left[D_\theta \ln f_t(\hat{\theta})\right]\left[D_\theta \ln f_t(\hat{\theta})\right]'.
\]
This is an important case where consistent estimation of the covariance matrix is possible. Other cases exist, even for dynamically misspecified time series models.
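For the iid case just described, a sandwich covariance estimate can be computed directly from the per-observation score and Hessian. The Octave sketch below assumes hypothetical user-written functions score_t(theta, t) and hessian_t(theta, t) returning $D_\theta \ln f_t(\theta)$ (a $K \times 1$ vector) and $D_\theta^2 \ln f_t(\theta)$ (a $K \times K$ matrix) for observation $t$, with thetahat and n already available.

% Sketch: QML sandwich estimator of the asymptotic variance, iid case.
K = length(thetahat);
Jhat = zeros(K, K);      % average Hessian estimate of J_infinity
Ihat = zeros(K, K);      % outer-product-of-gradient estimate of I_infinity
for t = 1:n
  gt = score_t(thetahat, t);
  Jhat += hessian_t(thetahat, t) / n;
  Ihat += (gt * gt') / n;
endfor
Vhat = inv(Jhat) * Ihat * inv(Jhat);   % asy. var of sqrt(n)(thetahat - theta0)
se = sqrt(diag(Vhat) / n);             % standard errors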

CHAPTER 17

Nonlinear least squares (NLS)


Readings: Davidson and MacKinnon, Ch. 2 and 5; Gallant, Ch. 1.

17.1. Introduction and definition

Nonlinear least squares (NLS) is a means of estimating the parameter of the model
\[
y_t = f(x_t, \theta_0) + \varepsilon_t.
\]
• In general, $\varepsilon_t$ will be heteroscedastic and autocorrelated, and possibly nonnormally distributed. However, dealing with this is exactly as in the case of linear models, so we'll just treat the iid case here,
\[
\varepsilon_t \sim iid(0, \sigma^2).
\]
If we stack the observations vertically, defining
\[
\mathbf{y} = (y_1, y_2, \ldots, y_n)',\qquad
\mathbf{f} = \left(f(x_1, \theta), f(x_2, \theta), \ldots, f(x_n, \theta)\right)' \equiv f(\theta),
\]
and
\[
\boldsymbol{\varepsilon} = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)',
\]
we can write the $n$ observations as
\[
\mathbf{y} = f(\theta) + \boldsymbol{\varepsilon}.
\]
Using this notation, the NLS estimator can be defined as
\[
\hat{\theta} \equiv \arg\min_{\Theta} s_n(\theta) = \frac{1}{n}\left[\mathbf{y} - f(\theta)\right]'\left[\mathbf{y} - f(\theta)\right] = \frac{1}{n}\left\Vert \mathbf{y} - f(\theta)\right\Vert^2.
\]
• The estimator minimizes the weighted sum of squared errors, which is the same as minimizing the Euclidean distance between $\mathbf{y}$ and $f(\theta)$.

The objective function can be written as
\[
s_n(\theta) = \frac{1}{n}\left[\mathbf{y}'\mathbf{y} - 2\mathbf{y}'f(\theta) + f(\theta)'f(\theta)\right],
\]
which gives the first order conditions
\[
-\left[\frac{\partial}{\partial\theta}f(\hat{\theta})'\right]\mathbf{y} + \left[\frac{\partial}{\partial\theta}f(\hat{\theta})'\right]f(\hat{\theta}) \equiv 0.
\]
Define the $n \times K$ matrix
\[
(17.1.1)\qquad \mathbf{F}(\hat{\theta}) \equiv D_{\theta'}f(\hat{\theta}).
\]
In shorthand, use $\hat{\mathbf{F}}$ in place of $\mathbf{F}(\hat{\theta})$. Using this, the first order conditions can be written as
\[
-\hat{\mathbf{F}}'\mathbf{y} + \hat{\mathbf{F}}'f(\hat{\theta}) \equiv 0,
\]
or
\[
(17.1.2)\qquad \hat{\mathbf{F}}'\left[\mathbf{y} - f(\hat{\theta})\right] \equiv 0.
\]
This bears a good deal of similarity to the f.o.c. for the linear model - the derivative of the prediction is orthogonal to the prediction error. If $f(\theta) = \mathbf{X}\theta$, then $\hat{\mathbf{F}}$ is simply $\mathbf{X}$, so the f.o.c. (with spherical errors) simplify to
\[
\mathbf{X}'\mathbf{y} - \mathbf{X}'\mathbf{X}\beta = 0,
\]
the usual OLS f.o.c.

We can interpret this geometrically: INSERT drawings of geometrical depiction of OLS and NLS (see Davidson and MacKinnon, pgs. 8, 13 and 46).

• Note that the nonlinearity of the manifold leads to potential multiple local maxima, minima and saddlepoints: the objective function $s_n(\theta)$ is not necessarily well-behaved and may be difficult to minimize.

17.2. Identification

As before, identification can be considered conditional on the sample, and asymptotically. The condition for asymptotic identification is that $s_n(\theta)$ tend to a limiting function $s_\infty(\theta)$ such that $s_\infty(\theta_0) < s_\infty(\theta)$, $\forall \theta \neq \theta_0$. This will be the case if $s_\infty(\theta_0)$ is strictly convex at $\theta_0$, which requires that $D_\theta^2 s_\infty(\theta_0)$ be positive definite. Consider the objective function:
\[
s_n(\theta) = \frac{1}{n}\sum_{t=1}^{n}\left[y_t - f(x_t, \theta)\right]^2
= \frac{1}{n}\sum_{t=1}^{n}\left[f(x_t, \theta_0) + \varepsilon_t - f(x_t, \theta)\right]^2
\]
\[
= \frac{1}{n}\sum_{t=1}^{n}\left[f(x_t, \theta_0) - f(x_t, \theta)\right]^2
+ \frac{1}{n}\sum_{t=1}^{n}\varepsilon_t^2
+ \frac{2}{n}\sum_{t=1}^{n}\left[f(x_t, \theta_0) - f(x_t, \theta)\right]\varepsilon_t.
\]
• As in example 14.3, which illustrated the consistency of extremum estimators using OLS, we conclude that the second term will converge to a constant which does not depend upon $\theta$.
• A LLN can be applied to the third term to conclude that it converges pointwise to 0, as long as $f(\theta)$ and $\varepsilon$ are uncorrelated.
• Next, pointwise convergence needs to be strengthened to uniform almost sure convergence. There are a number of possible assumptions one could use. Here, we'll just assume it holds.
• Turning to the first term, we'll assume a pointwise law of large numbers applies, so
\[
(17.2.1)\qquad \frac{1}{n}\sum_{t=1}^{n}\left[f(x_t, \theta_0) - f(x_t, \theta)\right]^2 \stackrel{a.s.}{\longrightarrow} \int\left[f(z, \theta_0) - f(z, \theta)\right]^2 d\mu(z),
\]
where $\mu(z)$ is the distribution function of $z$. In many cases, $f(z, \theta)$ will be bounded and continuous, for all $\theta \in \Theta$, so strengthening to uniform almost sure convergence is immediate. For example if $f(z, \theta) = \left[1 + \exp(-z'\theta)\right]^{-1}$, $f: \Re^K \to (0, 1)$, a bounded range, and the function is continuous in $\theta$.

Given these results, it is clear that a minimizer is $\theta_0$. When considering identification (asymptotic), the question is whether or not there may be some other minimizer. A local condition for identification is that
\[
\frac{\partial^2}{\partial\theta\partial\theta'} s_\infty(\theta) = \frac{\partial^2}{\partial\theta\partial\theta'}\int\left[f(z, \theta_0) - f(z, \theta)\right]^2 d\mu(z)
\]
be positive definite at $\theta_0$. Evaluating this derivative, we obtain (after a little work)
\[
\left.\frac{\partial^2}{\partial\theta\partial\theta'}\int\left[f(z, \theta_0) - f(z, \theta)\right]^2 d\mu(z)\right|_{\theta_0}
= 2\int\left[D_\theta f(z, \theta_0)\right]\left[D_{\theta'} f(z, \theta_0)\right] d\mu(z),
\]
the expectation of the outer product of the gradient of the regression function evaluated at $\theta_0$. (Note: the uniform boundedness we have already assumed allows passing the derivative through the integral, by the dominated convergence theorem.) This matrix will be positive definite (wp1) as long as the gradient vector is of full rank (wp1). The tangent space to the regression manifold must span a $K$-dimensional space if we are to consistently estimate a $K$-dimensional parameter vector. This is analogous to the requirement that there be no perfect colinearity in a linear model. This is a necessary condition for identification. Note that the LLN implies that the above expectation is equal to
\[
J_\infty(\theta_0) = 2\lim E\frac{\mathbf{F}'\mathbf{F}}{n}.
\]

17.3. Consistency
We simply assume that the conditions of Theorem 19 hold, so the estimator is consistent. Given that the strong stochastic equicontinuity conditions hold, as discussed above, and given the above identification conditions and a compact estimation space (the closure of the parameter space $\Theta$), the consistency proof's assumptions are satisfied.

17.4. Asymptotic normality


As in the case of GMM, we also simply assume that the conditions for asymptotic normality as in Theorem 22 hold. The only remaining problem is to determine the form of the asymptotic variance-covariance matrix. Recall that the result of the asymptotic normality theorem is
\[
\sqrt{n}\left(\hat{\theta} - \theta_0\right) \stackrel{d}{\longrightarrow} N\left[0,\, J_\infty(\theta_0)^{-1} I_\infty(\theta_0) J_\infty(\theta_0)^{-1}\right],
\]
where $J_\infty(\theta_0)$ is the almost sure limit of $\frac{\partial^2}{\partial\theta\partial\theta'} s_n(\theta)$ evaluated at $\theta_0$, and
\[
I_\infty(\theta_0) = \lim \operatorname{Var}\sqrt{n}\, D_\theta s_n(\theta_0).
\]
The objective function is
\[
s_n(\theta) = \frac{1}{n}\sum_{t=1}^{n}\left[y_t - f(x_t, \theta)\right]^2,
\]
so
\[
D_\theta s_n(\theta) = -\frac{2}{n}\sum_{t=1}^{n}\left[y_t - f(x_t, \theta)\right] D_\theta f(x_t, \theta).
\]
Evaluating at $\theta_0$,
\[
D_\theta s_n(\theta_0) = -\frac{2}{n}\sum_{t=1}^{n}\varepsilon_t\, D_\theta f(x_t, \theta_0).
\]
With this we obtain
\[
n\, D_\theta s_n(\theta_0)\, D_{\theta'} s_n(\theta_0)
= \frac{4}{n}\left[\sum_{t=1}^{n}\varepsilon_t\, D_\theta f(x_t, \theta_0)\right]\left[\sum_{t=1}^{n}\varepsilon_t\, D_{\theta'} f(x_t, \theta_0)\right].
\]
Noting that $\sum_t \varepsilon_t\, D_\theta f(x_t, \theta_0) = \mathbf{F}'\boldsymbol{\varepsilon}$, where $\mathbf{F}$ is evaluated at $\theta_0$, we can write the above as
\[
n\, D_\theta s_n(\theta_0)\, D_{\theta'} s_n(\theta_0) = 4\,\frac{\mathbf{F}'\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}'\mathbf{F}}{n}.
\]
This converges almost surely to its expectation, following a LLN, so
\[
I_\infty(\theta_0) = 4\sigma^2 \lim E\frac{\mathbf{F}'\mathbf{F}}{n},
\]
where the expectation is with respect to the joint density of $x$ and $\varepsilon$. We've already seen that
\[
J_\infty(\theta_0) = 2\lim E\frac{\mathbf{F}'\mathbf{F}}{n}.
\]
Combining these expressions for $J_\infty(\theta_0)$ and $I_\infty(\theta_0)$, and the result of the asymptotic normality theorem, we get
\[
\sqrt{n}\left(\hat{\theta} - \theta_0\right) \stackrel{d}{\longrightarrow} N\left[0,\, \left(\lim E\frac{\mathbf{F}'\mathbf{F}}{n}\right)^{-1}\sigma^2\right].
\]
We can consistently estimate the variance covariance matrix using
\[
(17.4.1)\qquad \left(\frac{\hat{\mathbf{F}}'\hat{\mathbf{F}}}{n}\right)^{-1}\hat{\sigma}^2,
\]
where $\hat{\mathbf{F}}$ is defined as in equation 17.1.1 and
\[
\hat{\sigma}^2 = \frac{\left[\mathbf{y} - f(\hat{\theta})\right]'\left[\mathbf{y} - f(\hat{\theta})\right]}{n},
\]
the obvious estimator. Note the close correspondence to the results for the linear model.
17.5. Example: The Poisson model for count data
Suppose that $y_t$ conditional on $x_t$ is independently distributed Poisson. A Poisson random variable is a count data variable, which means it can take the values {0, 1, 2, ...}. This sort of model has been used to study visits to doctors per year, number of patents registered by businesses per year, etc.

The Poisson density is
\[
f(y_t) = \frac{\exp(-\lambda_t)\lambda_t^{y_t}}{y_t!},\qquad y_t \in \{0, 1, 2, \ldots\}.
\]
The mean of $y_t$ is $\lambda_t$, as is the variance. Note that $\lambda_t$ must be positive. Suppose that the true mean is
\[
\lambda_t^0 = \exp(x_t'\beta_0),
\]
which enforces the positivity of $\lambda_t$. Suppose we estimate $\beta_0$ by nonlinear least squares:
\[
\hat{\beta} = \arg\min s_n(\beta) = \frac{1}{n}\sum_{t=1}^{n}\left[y_t - \exp(x_t'\beta)\right]^2.
\]
We can write
\[
s_n(\beta) = \frac{1}{n}\sum_{t=1}^{n}\left[\exp(x_t'\beta_0) + \varepsilon_t - \exp(x_t'\beta)\right]^2
\]
\[
= \frac{1}{n}\sum_{t=1}^{n}\left[\exp(x_t'\beta_0) - \exp(x_t'\beta)\right]^2
+ \frac{1}{n}\sum_{t=1}^{n}\varepsilon_t^2
+ \frac{2}{n}\sum_{t=1}^{n}\varepsilon_t\left[\exp(x_t'\beta_0) - \exp(x_t'\beta)\right].
\]
The last term has expectation zero since the assumption that $E(y_t|x_t) = \exp(x_t'\beta_0)$ implies that $E(\varepsilon_t|x_t) = 0$, which in turn implies that functions of $x_t$ are uncorrelated with $\varepsilon_t$. Applying a strong LLN, and noting that the objective function is continuous on a compact parameter space, we get
\[
s_\infty(\beta) = E_x\left[\exp(x'\beta_0) - \exp(x'\beta)\right]^2 + E_x \exp(x'\beta_0),
\]
where the last term comes from the fact that the conditional variance of $\varepsilon$ is the same as the variance of $y$. This function is clearly minimized at $\beta = \beta_0$, so the NLS estimator is consistent as long as identification holds.

EXERCISE 27. Determine the limiting distribution of $\sqrt{n}\left(\hat{\beta} - \beta_0\right)$. This means finding the specific forms of $\frac{\partial^2}{\partial\beta\partial\beta'} s_\infty(\beta_0)$, $J(\beta_0)$, $\left.\frac{\partial}{\partial\beta} s_n(\beta)\right|_{\beta_0}$, and $I(\beta_0)$. Again, use a CLT as needed, no need to verify that it can be applied.
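A brief Octave sketch of the estimation just described, assuming y (counts) and x (an n x K regressor matrix) are loaded and using fminunc as a generic minimizer, with a zero starting value assumed for illustration:

% Sketch: NLS for the Poisson regression mean lambda_t = exp(x_t'beta).
ssr = @(beta) sumsq(y - exp(x * beta));   % objective: sum of squared residuals
beta0 = zeros(columns(x), 1);             % starting value (assumed)
betahat = fminunc(ssr, beta0);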

17.6. The Gauss-Newton algorithm


Readings: Davidson and MacKinnon, Chapter 6, pgs. 201-207 .

The Gauss-Newton optimization technique is specifically designed for nonlinear least squares. The idea is to linearize the nonlinear model, rather than the objective function. The model is
\[
y=f(\theta^{0})+\varepsilon.
\]
At some $\theta$ in the parameter space, not equal to $\theta^{0}$, we have
\[
y=f(\theta)+\nu,
\]
where $\nu$ is a combination of the fundamental error term $\varepsilon$ and the error due to evaluating the regression function at $\theta$ rather than the true value $\theta^{0}$. Take a first order Taylor's series approximation around a point $\theta^{1}$:
\[
y=f(\theta^{1})+\left[D_{\theta^{\prime}}f(\theta^{1})\right]\left(\theta-\theta^{1}\right)+\nu+\text{approximation error}.
\]
Define $z\equiv y-f(\theta^{1})$ and $b\equiv(\theta-\theta^{1})$. Then this can be written as
\[
z=\mathbf{F}(\theta^{1})b+\omega,
\]
where, as above, $\mathbf{F}(\theta^{1})\equiv D_{\theta^{\prime}}f(\theta^{1})$ is the $n\times K$ matrix of derivatives of the regression function, evaluated at $\theta^{1}$, and $\omega$ is $\nu$ plus approximation error from the truncated Taylor's series.
• Note that $\mathbf{F}$ is known, given $\theta^{1}$. The other new element here is $z$, which is also known, given $\theta^{1}$.
• Note that one could estimate $b$ simply by performing OLS on the above equation.
• Given $\hat{b}$, we calculate a new round estimate of $\theta^{0}$ as $\theta^{2}=\hat{b}+\theta^{1}$. With this, take a new Taylor's series expansion around $\theta^{2}$ and repeat the process. Stop when $\hat{b}=0$ (to within a specified tolerance).

To see why this might work, consider the above approximation, but evaluated at the NLS estimator:
\[
y=f(\hat{\theta})+\mathbf{F}(\hat{\theta})\left(\theta-\hat{\theta}\right)+\omega.
\]
The OLS estimate of $b\equiv\theta-\hat{\theta}$ is
\[
\hat{b}=\left(\hat{\mathbf{F}}^{\prime}\hat{\mathbf{F}}\right)^{-1}\hat{\mathbf{F}}^{\prime}\left[y-f(\hat{\theta})\right].
\]
This must be zero, since
\[
\hat{\mathbf{F}}^{\prime}\left[y-f(\hat{\theta})\right]\equiv0
\]
by definition of the NLS estimator (these are the normal equations as in equation 17.1.2). Since $\hat{b}\equiv0$ when we evaluate at $\hat{\theta}$, updating would stop.
• The Gauss-Newton method doesn't require second derivatives, as does the Newton-Raphson method, so it's faster.
• The varcov estimator, as in equation 17.4.1, is simple to calculate, since we have $\hat{\mathbf{F}}$ as a by-product of the estimation process (i.e., it's just the last round regressor matrix). In fact, a normal OLS program will give the NLS varcov estimator directly, since it's just the OLS varcov estimator from the last iteration.
• The method can suffer from convergence problems since $\mathbf{F}(\theta)^{\prime}\mathbf{F}(\theta)$ may be very nearly singular, even with an asymptotically identified model, especially if $\theta$ is very far from $\hat{\theta}$. Consider the example
\[
y=\beta_{1}+\beta_{2}x^{\beta_{3}}+\varepsilon.
\]
When evaluated at a point where $\beta_{2}$ is close to zero, $\beta_{3}$ has virtually no effect on the NLS objective function, so $\mathbf{F}$ will have rank that is essentially 2, rather than 3. In this case, $\mathbf{F}^{\prime}\mathbf{F}$ will be nearly singular, so $\left(\mathbf{F}^{\prime}\mathbf{F}\right)^{-1}$ will be subject to large roundoff errors.
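Below is a small Python sketch of the Gauss-Newton iteration just described, applied to an illustrative exponential-mean model (the model and variable names are assumptions, not the text's example). The last-round regressor matrix is reused for the varcov estimator, as noted above.

import numpy as np

def gauss_newton(y, x, beta, tol=1e-8, maxit=100):
    for _ in range(maxit):
        f = np.exp(x @ beta)                      # regression function f(x, beta)
        F = f[:, None] * x                        # n x K derivative matrix
        z = y - f                                 # current residual
        b, *_ = np.linalg.lstsq(F, z, rcond=None) # OLS of z on F gives the update
        beta = beta + b
        if np.max(np.abs(b)) < tol:
            break
    # last-round OLS varcov is the NLS varcov estimator
    F = np.exp(x @ beta)[:, None] * x
    sigma2 = np.mean((y - np.exp(x @ beta)) ** 2)
    vcov = sigma2 * np.linalg.inv(F.T @ F)
    return beta, vcov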

17.7. Application: Limited dependent variables and sample selection


Readings: Davidson and MacKinnon, Ch. 15 (a quick reading is sufficient); J. Heckman, "Sample Selection Bias as a Specification Error", Econometrica, 1979. (This is a classic article, not required reading, and which is a bit out-dated. Nevertheless it's a good place to start if you encounter sample selection problems in your research.)
Sample selection is a common problem in applied research. The problem
occurs when observations used in estimation are sampled non-randomly, according to some selection scheme.

17.7.1. Example: Labor Supply. Labor supply of a person is a positive number of hours per unit time supposing the offer wage is higher than the reservation wage, which is the wage at which the person prefers not to work. The model (very simple, with subscripts suppressed):
• Characteristics of individual: $x$
• Latent labor supply: $s^{*}=x^{\prime}\beta+\omega$
• Offer wage: $w^{o}=z^{\prime}\gamma+\nu$
• Reservation wage: $w^{r}=q^{\prime}\delta+\eta$

Write the wage differential as
\[
w^{*}=\left(z^{\prime}\gamma+\nu\right)-\left(q^{\prime}\delta+\eta\right)\equiv r^{\prime}\theta+\varepsilon.
\]
We have the set of equations
\begin{align*}
s^{*} & =x^{\prime}\beta+\omega\\
w^{*} & =r^{\prime}\theta+\varepsilon.
\end{align*}
Assume that
\[
\left[\begin{array}{c}\omega\\ \varepsilon\end{array}\right]\sim N\left(\left[\begin{array}{c}0\\ 0\end{array}\right],\left[\begin{array}{cc}\sigma^{2} & \rho\sigma\\ \rho\sigma & 1\end{array}\right]\right).
\]
We assume that the offer wage and the reservation wage, as well as the latent variable $s^{*}$, are unobservable. What is observed is
\begin{align*}
w & =1\left[w^{*}>0\right]\\
s & =ws^{*}.
\end{align*}
In other words, we observe whether or not a person is working. If the person is working, we observe labor supply, which is equal to latent labor supply, $s^{*}$. Otherwise, $s=0\neq s^{*}$. Note that we are using a simplifying assumption that individuals can freely choose their weekly hours of work.

Suppose we estimated the model
\[
s^{*}=x^{\prime}\beta+\text{residual}
\]
using only observations for which $s>0$. The problem is that these observations are those for which $w^{*}>0$, or equivalently, $-\varepsilon<r^{\prime}\theta$, and
\[
E\left[\omega\,|-\varepsilon<r^{\prime}\theta\right]\neq0,
\]
since $\varepsilon$ and $\omega$ are dependent. Furthermore, this expectation will in general depend on $x$, since elements of $x$ can enter in $r$. Because of these two facts, least squares estimation is biased and inconsistent.

Consider more carefully $E\left[\omega\,|-\varepsilon<r^{\prime}\theta\right]$. Given the joint normality of $\omega$ and $\varepsilon$, we can write (see for example Spanos, Statistical Foundations of Econometric Modelling, pg. 122)
\[
\omega=\rho\sigma\varepsilon+\eta,
\]
where $\eta$ has mean zero and is independent of $\varepsilon$. With this we can write
\[
s^{*}=x^{\prime}\beta+\rho\sigma\varepsilon+\eta.
\]
If we condition this equation on $-\varepsilon<r^{\prime}\theta$ we get
\[
s=x^{\prime}\beta+\rho\sigma E\left(\varepsilon\,|-\varepsilon<r^{\prime}\theta\right)+\eta.
\]
A useful result is that for $z\sim N(0,1)$
\[
E\left(z\,|\,z>z^{*}\right)=\frac{\phi(z^{*})}{1-\Phi(z^{*})},
\]
where $\phi(\cdot)$ and $\Phi(\cdot)$ are the standard normal density and distribution function, respectively. The quantity on the RHS above is known as the inverse Mill's ratio:
\[
IMR(z^{*})=\frac{\phi(z^{*})}{1-\Phi(z^{*})}.
\]
With this we can write
\begin{align}
s & =x^{\prime}\beta+\rho\sigma\frac{\phi(r^{\prime}\theta)}{\Phi(r^{\prime}\theta)}+\eta\tag{17.7.1}\\
 & \equiv\left[\begin{array}{cc}x^{\prime} & \frac{\phi(r^{\prime}\theta)}{\Phi(r^{\prime}\theta)}\end{array}\right]\left[\begin{array}{c}\beta\\ \zeta\end{array}\right]+\eta,\tag{17.7.2}
\end{align}
where $\zeta=\rho\sigma$. The error term $\eta$ has conditional mean zero, and is uncorrelated with the regressors $x^{\prime}$, $\frac{\phi(r^{\prime}\theta)}{\Phi(r^{\prime}\theta)}$. At this point, we can estimate the equation by NLS.
• Heckman showed how one can estimate this in a two step procedure where first $\theta$ is estimated, then equation 17.7.2 is estimated by least squares using the estimated value of $\theta$ to form the regressors. This is inefficient and estimation of the covariance is a tricky issue. It is probably easier (and more efficient) just to do MLE.
• The model presented above depends strongly on joint normality. There exist many alternative models which weaken the maintained assumptions. It is possible to estimate consistently without distributional assumptions. See Ahn and Powell, Journal of Econometrics, 1994.
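As a rough illustration of the two-step procedure, here is a hedged Python sketch (the text itself does not provide code for this). It assumes numpy, scipy and statsmodels are available; all variable names are illustrative.

import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

def heckman_two_step(s, w, X, R):
    """s: observed hours (0 when not working), w: 0/1 work indicator,
    X: hours-equation regressors, R: selection-equation regressors."""
    # Step 1: probit of the participation decision on R
    theta_hat = sm.Probit(w, R).fit(disp=0).params
    idx = R @ theta_hat
    imr = norm.pdf(idx) / norm.cdf(idx)       # phi(r'theta)/Phi(r'theta)
    # Step 2: OLS on the selected sample, adding the inverse Mills ratio
    sel = w == 1
    Xaug = np.column_stack([X[sel], imr[sel]])
    beta_aug = sm.OLS(s[sel], Xaug).fit().params
    return beta_aug                            # last element estimates rho*sigma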

CHAPTER 18

Nonparametric inference
18.1. Possible pitfalls of parametric inference: estimation
Readings: H. White (1980) Using Least Squares to Approximate Unknown
Regression Functions, International Economic Review, pp. 149-70.
In this section we consider a simple example, which illustrates both why
nonparametric methods may in some cases be preferred to parametric methods.

We suppose that data is generated by random sampling of $(y,x)$, where
\[
y=f(x)+\varepsilon,
\]
$x$ is uniformly distributed on $(0,2\pi)$, and $\varepsilon$ is a classical error. The true regression function $f(x)$ is a particular smooth function with curvature (a quadratic in $x$ in this example). The problem of interest is to estimate the elasticity of $f(x)$ with respect to $x$, throughout the range of $x$.

In general, the functional form of $f(x)$ is unknown. One idea is to take a Taylor's series approximation to $f(x)$ about some point $x_{0}$. Flexible functional forms such as the transcendental logarithmic (usually known as the translog) can be interpreted as second order Taylor's series approximations. We'll work with a first order approximation, for simplicity. Approximating about $x_{0}$:
\[
h(x)=f(x_{0})+D_{x}f(x_{0})\left(x-x_{0}\right).
\]
If the approximation point is $x_{0}=0$, we can write
\[
h(x)=a+bx.
\]
The coefficient $a$ is the value of the function at $x=0$, and the slope is the value of the derivative at $x=0$. These are of course not known. One might try estimation by ordinary least squares. The objective function is
\[
s_{n}(a,b)=\frac{1}{n}\sum_{t=1}^{n}\left(y_{t}-h(x_{t})\right)^{2}.
\]
The limiting objective function, following the argument we used to get equations 14.3.1 and 17.2.1, is
\[
s_{\infty}(a,b)=\int_{0}^{2\pi}\left(f(x)-h(x)\right)^{2}dx.
\]
The theorem regarding the consistency of extremum estimators (Theorem 19) tells us that $\hat{a}$ and $\hat{b}$ will converge almost surely to the values that minimize the limiting objective function. Solving the first order conditions¹ reveals the minimizing values $(a^{0},b^{0})$, so the estimated approximating function $\hat{h}(x)$ tends almost surely to
\[
h_{\infty}(x)=a^{0}+b^{0}x.
\]
We may plot the true function and the limit of the approximation to see the asymptotic bias as a function of $x$:

[Figure: the approximating model is the straight line, the true model has curvature.]

Note that the approximating model is in general inconsistent, even at the approximation point. This shows that flexible functional forms based upon Taylor's series approximations do not in general allow consistent estimation. The mathematical properties of the Taylor's series do not carry over when coefficients are estimated.

The approximating model seems to fit the true model fairly well, asymptotically. However, we are interested in the elasticity of the function. Recall that an elasticity is the marginal function divided by the average function:
\[
\varepsilon(x)=x\frac{\phi^{\prime}(x)}{\phi(x)}.
\]
Good approximation of the elasticity over the range of $x$ will require a good approximation of both $f(x)$ and $f^{\prime}(x)$ over the range of $x$. The approximating elasticity is
\[
\eta(x)=x\frac{h^{\prime}(x)}{h(x)}.
\]
Plotting the true elasticity and the elasticity obtained from the limiting approximating model:

[Figure: the true elasticity is the line that has negative slope for large $x$.]

Visually we see that the elasticity is not approximated so well. Root mean squared error in the approximation of the elasticity,
\[
\left(\int_{0}^{2\pi}\left(\varepsilon(x)-\eta(x)\right)^{2}dx\right)^{1/2},
\]
is sizeable in this example.

Now suppose we use the leading terms of a trigonometric series as the approximating model. The reason for using a trigonometric series as an approximating model is motivated by the asymptotic properties of the Fourier flexible functional form (Gallant, 1981, 1982), which we will study in more detail below. Normally with this type of model the number of basis functions is an increasing function of the sample size. Here we hold the set of basis functions fixed. We will consider the asymptotic behavior of a fixed model, which we interpret as an approximation to the estimator's behavior in finite samples.

Consider a set of basis functions consisting of a constant, a linear term, and the leading sine and cosine terms,
\[
Z(x)=\left[\begin{array}{cccccc}1 & x & \cos(x) & \sin(x) & \cos(2x) & \sin(2x)\end{array}\right].
\]
The approximating model is
\[
g_{K}(x)=Z(x)\alpha.
\]
Maintaining these basis functions as the sample size increases, we find that the limiting objective function is minimized at a particular coefficient vector $\alpha^{0}$. Substituting these values into $g_{K}(x)$ we obtain the almost sure limit of the approximation

(18.1.1)
\[
g_{\infty}(x)=Z(x)\alpha^{0}.
\]
Plotting the approximation and the true function:

[Figure: true function and limiting trigonometric approximation.]

Clearly the truncated trigonometric series model offers a better approximation, asymptotically, than does the linear model. Plotting elasticities:

[Figure: true and approximating elasticities.]

On average, the fit is better, though there is some implausible wavyness in the estimate. Root mean squared error in the approximation of the elasticity is about half that of the RMSE when the first order approximation is used. If the trigonometric series contained infinite terms, this error measure would be driven to zero, as we shall see.

¹All calculations were done using Scientific Workplace.
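To make the asymptotic-bias point concrete, here is a small Python sketch that computes the limiting linear approximation and the RMSE of the implied elasticity by simulation. The quadratic $f(x)$ used here is a hypothetical stand-in, not the function used in the text.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 2 * np.pi, size=200_000)      # large n mimics the limit
f = 1 + 0.5 * x - 0.05 * x ** 2                  # hypothetical true function
X = np.column_stack([np.ones_like(x), x])
a, b = np.linalg.lstsq(X, f, rcond=None)[0]      # limiting OLS coefficients

grid = np.linspace(0.1, 2 * np.pi, 400)
true_elas = grid * (0.5 - 0.1 * grid) / (1 + 0.5 * grid - 0.05 * grid ** 2)
approx_elas = grid * b / (a + b * grid)
rmse = np.sqrt(np.mean((true_elas - approx_elas) ** 2))
print(a, b, rmse)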

18.2. Possible pitfalls of parametric inference: hypothesis testing


What do we mean by the term "nonparametric inference"? Simply, this means inferences that are possible without restricting the functions of interest to belong to a parametric family.
• Consider means of testing for the hypothesis that consumers maximize utility. A consequence of utility maximization is that the Slutsky matrix $D_{p}^{2}h(p,U)$, where $h(p,U)$ are the set of compensated demand functions, must be negative semi-definite. One approach to testing for utility maximization would estimate a set of normal demand functions $x(p,m)$.
• Estimation of these functions by normal parametric methods requires specification of the functional form of demand, for example
\[
x(p,m)=x(p,m,\theta^{0})+\varepsilon,\qquad\theta^{0}\in\Theta_{0},
\]
where $x(p,m,\theta^{0})$ is a function of known form and $\Theta_{0}$ is a finite dimensional parameter.
• After estimation, we could use $\hat{x}=x(p,m,\hat{\theta})$ to calculate (by solving the integrability problem, which is non-trivial) $\hat{D}_{p}^{2}h(p,U)$. If we can statistically reject that the matrix is negative semi-definite, we might conclude that consumers don't maximize utility.
• The problem with this is that the reason for rejection of the theoretical proposition may be that our choice of functional form is incorrect. In the introductory section we saw that functional form misspecification leads to inconsistent estimation of the function and its derivatives.
• Testing using parametric models always means we are testing a compound hypothesis. The hypothesis that is tested is 1) the economic proposition we wish to test, and 2) the model is correctly specified. Failure of either 1) or 2) can lead to rejection. This is known as the "model-induced augmenting hypothesis."
• Varian's WARP allows one to test for utility maximization without specifying the form of the demand functions. The only assumptions used in the test are those directly implied by theory, so rejection of the hypothesis calls into question the theory.
• Nonparametric inference allows direct testing of economic propositions, without the model-induced augmenting hypothesis.

18.3. The Fourier functional form


Readings: Gallant, 1987, "Identification and consistency in semi-nonparametric regression," in Advances in Econometrics, Fifth World Congress, V. 1, Truman Bewley, ed., Cambridge.

Suppose we have a multivariate model
\[
y=f(x)+\varepsilon,
\]
where $f(x)$ is of unknown form and $x$ is a $P$-dimensional vector. For simplicity, assume that $\varepsilon$ is a classical error. Let us take the estimation of the vector of elasticities with typical element
\[
\xi_{x_{i}}=\frac{x_{i}}{f(x)}\frac{\partial f(x)}{\partial x_{i}},
\]
at an arbitrary point $x_{i}$.

The Fourier form, following Gallant (1982), but with a somewhat different parameterization, may be written as

(18.3.1)
\[
g_{K}(x\,|\,\theta_{K})=\alpha+x^{\prime}\beta+\frac{1}{2}x^{\prime}Cx+\sum_{\alpha=1}^{A}\sum_{j=1}^{J}\left(u_{j\alpha}\cos(jk_{\alpha}^{\prime}x)-v_{j\alpha}\sin(jk_{\alpha}^{\prime}x)\right),
\]
where the $K$-dimensional parameter vector is

(18.3.2)
\[
\theta_{K}=\left\{\alpha,\beta^{\prime},vec^{*}(C)^{\prime},u_{11},v_{11},\ldots,u_{JA},v_{JA}\right\}^{\prime}.
\]
• We assume that the conditioning variables $x$ have each been transformed to lie in an interval that is shorter than $2\pi$. This is required to avoid periodic behavior of the approximation, which is desirable since economic functions aren't periodic. For example, subtract sample means, divide by the maxima of the conditioning variables, and multiply by $2\pi-\text{eps}$, where eps is some positive number less than $2\pi$ in value.
• The $k_{\alpha}$ are "elementary multi-indices," which are simply $P$-vectors formed of integers (negative, positive and zero). The $k_{\alpha}$, $\alpha=1,2,\ldots,A$, are required to be linearly independent, and we follow the convention that the first non-zero element be positive. For example, a vector whose first nonzero element is positive is a potential multi-index to be used, but one whose first nonzero element is negative is not, nor is any scalar multiple of a multi-index already in the set.
• We parameterize the matrix $C$ differently than does Gallant because it simplifies things in practice. The cost of this is that we are no longer able to test a quadratic specification using nested testing.

The vector of first partial derivatives is

(18.3.3)
\[
D_{x}g_{K}(x\,|\,\theta_{K})=\beta+Cx+\sum_{\alpha=1}^{A}\sum_{j=1}^{J}\left[\left(-u_{j\alpha}\sin(jk_{\alpha}^{\prime}x)-v_{j\alpha}\cos(jk_{\alpha}^{\prime}x)\right)jk_{\alpha}\right],
\]
and the matrix of second partial derivatives is

(18.3.4)
\[
D_{x}^{2}g_{K}(x\,|\,\theta_{K})=C+\sum_{\alpha=1}^{A}\sum_{j=1}^{J}\left[\left(-u_{j\alpha}\cos(jk_{\alpha}^{\prime}x)+v_{j\alpha}\sin(jk_{\alpha}^{\prime}x)\right)j^{2}k_{\alpha}k_{\alpha}^{\prime}\right].
\]
To define a compact notation for partial derivatives, let $\lambda$ be an $N$-dimensional multi-index with no negative elements. Define $|\lambda^{*}|$ as the sum of the elements of $\lambda$. If we have $N$ arguments $x$ of the (arbitrary) function $h(x)$, use $D^{\lambda}h(x)$ to indicate a certain partial derivative:
\[
D^{\lambda}h(x)\equiv\frac{\partial^{|\lambda^{*}|}}{\partial x_{1}^{\lambda_{1}}\partial x_{2}^{\lambda_{2}}\cdots\partial x_{N}^{\lambda_{N}}}h(x).
\]
When $\lambda$ is the zero vector, $D^{\lambda}h(x)\equiv h(x)$. Taking this definition and the last few equations into account, we see that it is possible to define a $(1\times K)$ vector $z^{\lambda}(x)$ so that

(18.3.5)
\[
D^{\lambda}g_{K}(x\,|\,\theta_{K})=z^{\lambda}(x)^{\prime}\theta_{K}.
\]
• Both the approximating model and the derivatives of the approximating model are linear in the parameters.
• For the approximating model to the function (not derivatives), write $g_{K}(x\,|\,\theta_{K})=z^{\prime}\theta_{K}$ for simplicity.
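As an illustration of this linearity in the parameters, here is a Python sketch (the course code itself is in Ox) that builds a Fourier-form regressor matrix for a single conditioning variable, so the multi-indices are trivial; the rescaling constant and the exact set of columns are illustrative assumptions.

import numpy as np

def fourier_regressors(x, J):
    # rescale x into an interval shorter than 2*pi, as discussed above
    eps = 0.1
    s = (x - x.min()) / (x.max() - x.min()) * (2 * np.pi - eps)
    cols = [np.ones_like(s), s, 0.5 * s ** 2]     # constant, linear, quadratic parts
    for j in range(1, J + 1):
        cols.append(np.cos(j * s))
        cols.append(-np.sin(j * s))
    return np.column_stack(cols)

OLS of y on this matrix gives an estimate of the parameter vector; the derivatives of the approximation are linear in the same parameters, with the corresponding derivative regressors.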

The following theorem can be used to prove the consistency of the Fourier
form.

THEOREM 28. [Gallant and Nychka, 1987] Suppose that $\hat{h}_{n}$ is obtained by maximizing a sample objective function $s_{n}(h)$ over $\mathcal{H}_{K_{n}}$, where $\mathcal{H}_{K}$ is a subset of some function space $\mathcal{H}$ on which is defined a norm $\Vert h\Vert$. Consider the following conditions:
(a) Compactness: The closure of $\mathcal{H}$ with respect to $\Vert h\Vert$ is compact in the relative topology defined by $\Vert h\Vert$.
(b) Denseness: $\cup_{K}\mathcal{H}_{K}$, $K=1,2,3,\ldots$, is a dense subset of the closure of $\mathcal{H}$ with respect to $\Vert h\Vert$, and $\mathcal{H}_{K}\subset\mathcal{H}_{K+1}$.
(c) Uniform convergence: There is a point $h^{*}$ in $\mathcal{H}$, and there is a function $s_{\infty}(h,h^{*})$ that is continuous in $h$ with respect to $\Vert h\Vert$ such that
\[
\lim_{n\rightarrow\infty}\sup_{\mathcal{H}}\left|s_{n}(h)-s_{\infty}(h,h^{*})\right|=0
\]
almost surely.
(d) Identification: Any point $h$ in the closure of $\mathcal{H}$ with $s_{\infty}(h,h^{*})\geq s_{\infty}(h^{*},h^{*})$ must have $\Vert h-h^{*}\Vert=0$.
Under these conditions $\lim_{n\rightarrow\infty}\Vert h^{*}-\hat{h}_{n}\Vert=0$ almost surely, provided that $\lim_{n\rightarrow\infty}K_{n}=\infty$ almost surely.

The modification of the original statement of the theorem that has been made is to set the parameter space $\Theta$ in Gallant and Nychka's (1987) Theorem 0 to a single point, and to state the theorem in terms of maximization rather than minimization.

This theorem is very similar in form to Theorem 19. The main differences are:
(1) A generic norm $\Vert h\Vert$ is used in place of the Euclidean norm. This norm may be stronger than the Euclidean norm, so that convergence with respect to $\Vert h\Vert$ implies convergence w.r.t the Euclidean norm. Typically we will want to make sure that the norm is strong enough to imply convergence of all functions of interest.
(2) The "estimation space" $\mathcal{H}$ is a function space. It plays the role of the parameter space $\Theta$ in our discussion of parametric estimators. There is no restriction to a parametric family, only a restriction to a space of functions that satisfy certain conditions. This formulation is much less restrictive than the restriction to a parametric family.
(3) There is a denseness assumption that was not present in the other theorem.

We will not prove this theorem (the proof is quite similar to the proof of theorem [19], see Gallant, 1987) but we will discuss its assumptions, in relation to the Fourier form as the approximating model.

18.3.1. Sobolev norm. Since all of the assumptions involve the norm $\Vert h\Vert$, we need to make explicit what norm we wish to use. We need a norm that guarantees that the errors in approximation of the functions we are interested in are accounted for. Since we are interested in first-order elasticities in the present case, we need close approximation of both the function $f(x)$ and its first derivative $f^{\prime}(x)$, throughout the range of $x$. Let $\mathcal{X}$ be an open set that contains all values of $x$ that we're interested in. The Sobolev norm is appropriate in this case. It is defined, making use of our notation for partial derivatives, as:
\[
\Vert h\Vert_{m,\mathcal{X}}=\max_{|\lambda^{*}|\leq m}\sup_{\mathcal{X}}\left|D^{\lambda}h(x)\right|.
\]
To see whether or not the function $f(x)$ is well approximated by an approximating model $g_{K}(x\,|\,\theta_{K})$, we would evaluate
\[
\Vert f(x)-g_{K}(x\,|\,\theta_{K})\Vert_{m,\mathcal{X}}.
\]
We see that this norm takes into account errors in approximating the function and partial derivatives up to order $m$. If we want to estimate first order elasticities, as is the case in this example, the relevant $m$ would be $m=1$. Furthermore, since we examine the sup over $\mathcal{X}$, convergence w.r.t. the Sobolev norm means uniform convergence, so that we obtain consistent estimates for all values of $x$.

18.3.2. Compactness. Verifying compactness with respect to this norm is quite technical and unenlightening. It is proven by Elbadawi, Gallant and Souza, Econometrica, 1983. The basic requirement is that if we need consistency w.r.t. $\Vert h\Vert_{m,\mathcal{X}}$, then the functions of interest must belong to a Sobolev space which takes into account derivatives of order $m+1$. A Sobolev space is the set of functions
\[
\mathcal{W}_{m+1,\mathcal{X}}(D)=\left\{h(x):\Vert h(x)\Vert_{m+1,\mathcal{X}}<D\right\},
\]
where $D$ is a finite constant. In plain words, the functions must have bounded partial derivatives of one order higher than the derivatives we seek to estimate.
18.3.3. The estimation space and the estimation subspace. Since in our case we're interested in consistent estimation of first-order elasticities, we'll define the estimation space as follows:

DEFINITION 29. [Estimation space] The estimation space $\mathcal{H}=\mathcal{W}_{2,\mathcal{X}}(D)$. The estimation space is an open set, and we presume that $h^{*}\in\mathcal{H}$.

So we are assuming that the function to be estimated has bounded second derivatives throughout $\mathcal{X}$.

With seminonparametric estimators, we don't actually optimize over the estimation space. Rather, we optimize over a subspace, $\mathcal{H}_{K_{n}}$, defined as:

DEFINITION 30. [Estimation subspace] The estimation subspace $\mathcal{H}_{K}$ is defined as
\[
\mathcal{H}_{K}=\left\{g_{K}(x\,|\,\theta_{K}):g_{K}(x\,|\,\theta_{K})\in\mathcal{W}_{2,\mathcal{X}}(D),\;\theta_{K}\in\mathbb{R}^{K}\right\},
\]
where $g_{K}(x\,|\,\theta_{K})$ is the Fourier form approximation as defined in Equation 18.3.1.

18.3.4. Denseness. The important point here is that $\mathcal{H}_{K}$ is a space of functions that is indexed by a finite dimensional parameter ($\theta_{K}$ has $K$ elements, as in equation 18.3.2). With $n$ observations, $n>K$, this parameter is estimable. Note that the true function $h^{*}$ is not necessarily an element of $\mathcal{H}_{K}$, so optimization over $\mathcal{H}_{K}$ may not lead to a consistent estimator. In order for optimization over $\mathcal{H}_{K}$ to be equivalent to optimization over $\mathcal{H}$, at least asymptotically, we need that:
(1) The dimension of the parameter vector, $\dim\theta_{K_{n}}\rightarrow\infty$ as $n\rightarrow\infty$. This is achieved by making $A$ and $J$ in equation 18.3.1 increasing functions of $n$, the sample size. It is clear that $K$ will have to grow more slowly than $n$. The second requirement is:
(2) We need that the $\mathcal{H}_{K}$ be dense subsets of $\mathcal{H}$.

The estimation subspace $\mathcal{H}_{K}$, defined above, is a subset of the closure of the estimation space, $\mathcal{H}$. A set of subsets $\mathcal{A}_{a}$ of a set $\mathcal{A}$ is "dense" if the closure of the countable union of the subsets is equal to the closure of $\mathcal{A}$. Use a picture here. The rest of the discussion of denseness is provided just for completeness: there's no need to study it in detail. To show that $\mathcal{H}_{K}$ is a dense subset of $\mathcal{H}$ with respect to $\Vert h\Vert_{1,\mathcal{X}}$, it is useful to apply Theorem 1 of Gallant (1982), who in turn cites Edmunds and Moscatelli (1977). We reproduce the theorem as presented by Gallant, with minor notational changes, for convenience of reference:

THEOREM 31. [Edmunds and Moscatelli, 1977] Let the real-valued function $h^{*}(x)$ be continuously differentiable up to order $m$ on an open set containing the closure of $\mathcal{X}$. Then it is possible to choose a triangular array of coefficients $\theta_{1},\theta_{2},\ldots,\theta_{K},\ldots$, such that for every $q$ with $0\leq q<m$, and every $\varepsilon>0$, $\Vert h^{*}(x)-h_{K}(x\,|\,\theta_{K})\Vert_{q,\mathcal{X}}=o(K^{-m+q+\varepsilon})$ as $K\rightarrow\infty$.

In the present application, $q=1$ and $m=2$. By definition of the estimation space, the elements of $\mathcal{H}$ are once continuously differentiable on $\mathcal{X}$, which is open and contains the closure of $\mathcal{X}$, so the theorem is applicable. Closely following Gallant and Nychka (1987), $\cup_{\infty}\mathcal{H}_{K}$ is the countable union of the $\mathcal{H}_{K}$. The implication of Theorem 31 is that there is a sequence of $\{h_{K}\}$ from $\cup_{\infty}\mathcal{H}_{K}$ such that
\[
\lim_{K\rightarrow\infty}\Vert h^{*}-h_{K}\Vert_{1,\mathcal{X}}=0
\]
for all $h^{*}\in\mathcal{H}$. Therefore, the closure of $\cup_{\infty}\mathcal{H}_{K}$ contains $\mathcal{H}$. However, $\cup_{\infty}\mathcal{H}_{K}\subset\mathcal{H}$, so its closure is also contained in the closure of $\mathcal{H}$. Therefore
\[
\overline{\cup_{\infty}\mathcal{H}_{K}}=\bar{\mathcal{H}},
\]
so $\cup_{\infty}\mathcal{H}_{K}$ is a dense subset of $\mathcal{H}$, with respect to the norm $\Vert h\Vert_{1,\mathcal{X}}$.

18.3.5. Uniform convergence. We now turn to the limiting objective function. We estimate by OLS. The sample objective function stated in terms of maximization is
\[
s_{n}(\theta_{K})=-\frac{1}{n}\sum_{t=1}^{n}\left(y_{t}-g_{K}(x_{t}\,|\,\theta_{K})\right)^{2}.
\]
With random sampling, as in the case of Equations 14.3.1 and 17.2.1, the limiting objective function is

(18.3.6)
\[
s_{\infty}(g,f)=-\int_{\mathcal{X}}\left(f(x)-g(x)\right)^{2}d\mu(x)-\sigma_{\varepsilon}^{2},
\]
where the true function $f(x)$ takes the place of the generic function $h^{*}$ in the presentation of the theorem. Both $g(x)$ and $f(x)$ are elements of $\overline{\cup_{\infty}\mathcal{H}_{K}}$.

The pointwise convergence of the objective function needs to be strengthened to uniform convergence. We will simply assume that this holds, since the way to verify this depends upon the specific application. We also have continuity of the objective function in $g$, with respect to the norm $\Vert h\Vert_{1,\mathcal{X}}$, since
\[
\lim_{\Vert g^{1}-g^{0}\Vert_{1,\mathcal{X}}\rightarrow0}\left[s_{\infty}\left(g^{1},f\right)-s_{\infty}\left(g^{0},f\right)\right]=\lim_{\Vert g^{1}-g^{0}\Vert_{1,\mathcal{X}}\rightarrow0}\int_{\mathcal{X}}\left[\left(g^{1}(x)-f(x)\right)^{2}-\left(g^{0}(x)-f(x)\right)^{2}\right]d\mu(x)=0.
\]
By the dominated convergence theorem (which applies since the finite bound $D$ used to define $\mathcal{W}_{2,\mathcal{X}}(D)$ is dominated by an integrable function), the limit and the integral can be interchanged, so by inspection, the limit is zero.

18.3.6. Identification. The identification condition requires that for any point $(g,f)$ in $\mathcal{H}\times\mathcal{H}$, $s_{\infty}(g,f)\geq s_{\infty}(f,f)\Rightarrow\Vert f-g\Vert_{1,\mathcal{X}}=0$. This condition is clearly satisfied given that $g$ and $f$ are once continuously differentiable (by the assumption that defines the estimation space).

18.3.7. Review of concepts. For the example of estimation of first-order elasticities, the relevant concepts are:
• Estimation space $\mathcal{H}=\mathcal{W}_{2,\mathcal{X}}(D)$: the function space in the closure of which the true function must lie.
• Consistency norm $\Vert h\Vert_{1,\mathcal{X}}$. The closure of $\mathcal{H}$ is compact with respect to this norm.
• Estimation subspace $\mathcal{H}_{K}$. The estimation subspace is the subset of $\mathcal{H}$ that is representable by a Fourier form with parameter $\theta_{K}$. These are dense subsets of $\mathcal{H}$.
• Sample objective function $s_{n}(\theta_{K})$, the negative of the sum of squares. By standard arguments this converges uniformly to the
• Limiting objective function $s_{\infty}(g,f)$, which is continuous in $g$ and has a global maximum in its first argument, over the closure of the infinite union of the estimation subspaces, at $g=f$.
• As a result of this, first order elasticities
\[
\frac{x_{i}}{f(x)}\frac{\partial f(x)}{\partial x_{i}}
\]
are consistently estimated for all $x\in\mathcal{X}$.

18.3.8. Discussion. Consistency requires that the number of parameters used in the expansion increase with the sample size, tending to infinity. If parameters are added at a high rate, the bias tends relatively rapidly to zero. A basic problem is that a high rate of inclusion of additional parameters causes the variance to tend more slowly to zero. The issue of how to choose the rate at which parameters are added and which to add first is fairly complex. A problem is that the allowable rates for asymptotic normality to obtain (Andrews 1991; Gallant and Souza, 1991) are very strict. Supposing we stick to these rates, our approximating model is:
\[
y=g_{K}(x\,|\,\theta_{K})+\varepsilon.
\]
Define $Z_{K}$ as the $n\times K$ matrix of regressors obtained by stacking observations. The LS estimator is
\[
\hat{\theta}_{K}=\left(Z_{K}^{\prime}Z_{K}\right)^{+}Z_{K}^{\prime}y,
\]
where $(\cdot)^{+}$ is the Moore-Penrose generalized inverse. This is used since $Z_{K}^{\prime}Z_{K}$ may be singular, as would be the case for $K(n)$ large enough when some dummy variables are included. The prediction, $z^{\prime}\hat{\theta}_{K}$, of the unknown function $f(x)$ is asymptotically normally distributed:
\[
\sqrt{n}\left(z^{\prime}\hat{\theta}_{K}-f(x)\right)\overset{d}{\rightarrow}N(0,A_{V}),
\]
where
\[
A_{V}=\lim_{n\rightarrow\infty}E\left[z^{\prime}\left(\frac{Z_{K}^{\prime}Z_{K}}{n}\right)^{+}z\,\hat{\sigma}^{2}\right].
\]
Formally, this is exactly the same as if we were dealing with a parametric linear model. I emphasize, though, that this is only valid if $K$ grows very slowly as $n$ grows. If we can't stick to acceptable rates, we should probably use some other method of approximating the small sample distribution. Bootstrapping is a possibility. We'll discuss this in the section on simulation.
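A corresponding Python sketch of the least squares fit with the Moore-Penrose inverse, continuing the fourier_regressors sketch given earlier; names are illustrative.

import numpy as np

def series_fit(y, Z):
    theta_hat = np.linalg.pinv(Z.T @ Z) @ (Z.T @ y)   # (Z'Z)^+ Z'y
    yhat = Z @ theta_hat                              # prediction of f(x) at sample points
    return theta_hat, yhat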


18.4. Kernel regression estimators


Readings: Bierens, 1987, "Kernel estimators of regression functions," in Advances in Econometrics, Fifth World Congress, V. 1, Truman Bewley, ed., Cambridge.

An alternative method to the semi-nonparametric method is a fully nonparametric method of estimation. Kernel regression estimation is an example (others are splines, nearest neighbor, etc.). We'll consider the Nadaraya-Watson kernel regression estimator in a simple case. Suppose we have an iid sample from the joint density $f(x,y)$, where $x$ is $k$-dimensional. The model is
\[
y_{t}=g(x_{t})+\varepsilon_{t},
\]
where
\[
E(\varepsilon_{t}|x_{t})=0.
\]
The conditional expectation of $y$ given $x$ is $g(x)$. By definition of the conditional expectation, we have
\[
g(x)=\int y\frac{f(x,y)}{h(x)}dy=\frac{1}{h(x)}\int yf(x,y)dy,
\]
where $h(x)$ is the marginal density of $x$:
\[
h(x)=\int f(x,y)dy.
\]
This suggests that we could estimate $g(x)$ by estimating $h(x)$ and $\int yf(x,y)dy$.

18.4.1. Estimation of the denominator. A kernel estimator for $h(x)$ has the form
\[
\hat{h}(x)=\frac{1}{n\gamma_{n}^{k}}\sum_{t=1}^{n}K\left[\left(x-x_{t}\right)/\gamma_{n}\right],
\]
where $n$ is the sample size and $k$ is the dimension of $x$.
• The function $K(\cdot)$ (the kernel) is absolutely integrable,
\[
\int\left|K(x)\right|dx<\infty,
\]
and $K(\cdot)$ integrates to 1:
\[
\int K(x)dx=1.
\]
In this respect, $K(\cdot)$ is like a density function, but we do not necessarily restrict $K(\cdot)$ to be nonnegative.
• The window width parameter, $\gamma_{n}$, is a sequence of positive numbers that satisfies
\[
\lim_{n\rightarrow\infty}\gamma_{n}=0,\qquad\lim_{n\rightarrow\infty}n\gamma_{n}^{k}=\infty.
\]
So, the window width must tend to zero, but not too quickly.
• To show pointwise consistency of $\hat{h}(x)$ for $h(x)$, first consider the expectation of the estimator (since the estimator is an average of iid terms, we only need to consider the expectation of a representative term):
\[
E\left[\hat{h}(x)\right]=\int\gamma_{n}^{-k}K\left[\left(x-z\right)/\gamma_{n}\right]h(z)dz.
\]
Change variables as $z^{*}=(x-z)/\gamma_{n}$, so $z=x-\gamma_{n}z^{*}$ and $\left|dz/dz^{*\prime}\right|=\gamma_{n}^{k}$; we obtain
\[
E\left[\hat{h}(x)\right]=\int K\left(z^{*}\right)h(x-\gamma_{n}z^{*})dz^{*}.
\]
Now, asymptotically,
\[
\lim_{n\rightarrow\infty}E\left[\hat{h}(x)\right]=\int K\left(z^{*}\right)h(x)dz^{*}=h(x)\int K\left(z^{*}\right)dz^{*}=h(x),
\]
since $\gamma_{n}\rightarrow0$ and $\int K(z^{*})dz^{*}=1$ by assumption. (Note: that we can pass the limit through the integral is a result of the dominated convergence theorem. For this to hold we need that $h(\cdot)$ be dominated by an absolutely integrable function.)
• Next, considering the variance of $\hat{h}(x)$, we have, due to the iid assumption and the representative term argument,
\[
n\gamma_{n}^{k}\mathrm{Var}\left[\hat{h}(x)\right]=\gamma_{n}^{-k}\mathrm{Var}\left\{K\left[\left(x-z\right)/\gamma_{n}\right]\right\}.
\]
Also, since $\mathrm{Var}(u)=E(u^{2})-\left[E(u)\right]^{2}$, we have
\[
n\gamma_{n}^{k}\mathrm{Var}\left[\hat{h}(x)\right]=\gamma_{n}^{-k}E\left\{K\left[\left(x-z\right)/\gamma_{n}\right]^{2}\right\}-\gamma_{n}^{-k}\left\{E\left(K\left[\left(x-z\right)/\gamma_{n}\right]\right)\right\}^{2}.
\]
The second term converges to zero:
\[
\gamma_{n}^{-k}\left\{E\left(K\left[\left(x-z\right)/\gamma_{n}\right]\right)\right\}^{2}=\gamma_{n}^{k}\left\{E\left[\hat{h}(x)\right]\right\}^{2}\rightarrow0,
\]
by the previous result regarding the expectation and the fact that $\gamma_{n}\rightarrow0$. Therefore,
\[
\lim_{n\rightarrow\infty}n\gamma_{n}^{k}\mathrm{Var}\left[\hat{h}(x)\right]=\lim_{n\rightarrow\infty}\gamma_{n}^{-k}E\left\{K\left[\left(x-z\right)/\gamma_{n}\right]^{2}\right\}.
\]
Using exactly the same change of variables as before, this can be shown to be
\[
\lim_{n\rightarrow\infty}n\gamma_{n}^{k}\mathrm{Var}\left[\hat{h}(x)\right]=h(x)\int\left[K(z^{*})\right]^{2}dz^{*}.
\]
Since both $h(x)$ and $\int\left[K(z^{*})\right]^{2}dz^{*}$ are bounded, this is bounded, and since $n\gamma_{n}^{k}\rightarrow\infty$ by assumption, we have that
\[
\mathrm{Var}\left[\hat{h}(x)\right]\rightarrow0.
\]
• Since the bias and the variance both go to zero, we have pointwise consistency (convergence in quadratic mean implies convergence in probability).

18.4.2. Estimation of the numerator. To estimate $\int yf(x,y)dy$, we need an estimator of $f(x,y)$. The estimator has the same form as the estimator for $h(x)$, only with one dimension more:
\[
\hat{f}(x,y)=\frac{1}{n\gamma_{n}^{k+1}}\sum_{t=1}^{n}K_{*}\left[\left(y-y_{t}\right)/\gamma_{n},\left(x-x_{t}\right)/\gamma_{n}\right].
\]
The kernel $K_{*}(\cdot)$ is required to have mean zero,
\[
\int yK_{*}\left(y,x\right)dy=0,
\]
and to marginalize to the previous kernel for $h(x)$:
\[
\int K_{*}\left(y,x\right)dy=K(x).
\]
With this kernel, we have
\[
\int y\hat{f}(x,y)dy=\frac{1}{n\gamma_{n}^{k}}\sum_{t=1}^{n}y_{t}K\left[\left(x-x_{t}\right)/\gamma_{n}\right]
\]
by marginalization of the kernel, so we obtain
\[
\hat{g}(x)=\frac{1}{\hat{h}(x)}\int y\hat{f}(x,y)dy=\frac{\sum_{t=1}^{n}y_{t}K\left[\left(x-x_{t}\right)/\gamma_{n}\right]}{\sum_{t=1}^{n}K\left[\left(x-x_{t}\right)/\gamma_{n}\right]}.
\]
This is the Nadaraya-Watson kernel regression estimator (a code sketch follows the discussion below).

18.4.3. Discussion.
• The kernel regression estimator for $g(x_{t})$ is a weighted average of the $y_{j}$, $j=1,2,\ldots,n$, where higher weights are associated with points that are closer to $x_{t}$. The weights sum to 1.
• The window width parameter $\gamma_{n}$ imposes smoothness. The estimator is increasingly flat as $\gamma_{n}\rightarrow\infty$, since in this case each weight tends to $1/n$.
• A large window width reduces the variance (strong imposition of flatness), but increases the bias.
• A small window width reduces the bias, but makes very little use of information except points that are in a small neighborhood of $x_{t}$. Since relatively little information is used, the variance is large when the window width is small.
• The standard normal density is a popular choice for $K(\cdot)$ and $K_{*}(y,x)$, though there are possibly better alternatives.
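Here is a minimal Python sketch of the Nadaraya-Watson estimator with a Gaussian kernel (the course programs use Ox); it is illustrative only, with a univariate regressor for simplicity.

import numpy as np
from scipy.stats import norm

def nw_regression(x_eval, x, y, h):
    # weights K((x_eval - x_t)/h); the common 1/(n h) factors cancel in the ratio
    w = norm.pdf((x_eval[:, None] - x[None, :]) / h)
    return (w @ y) / w.sum(axis=1)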

18.4.4. Choice of the window width: Cross-validation. The selection of an appropriate window width is important. One popular method is cross validation. This consists of splitting the sample into two parts (e.g., 50%-50%). The first part is the "in sample" data, which is used for estimation, and the second part is the "out of sample" data, used for evaluation of the fit through RMSE or some other criterion. The steps are:
(1) Split the data. The out of sample data is $y^{out}$ and $x^{out}$.
(2) Choose a window width $\gamma$.
(3) With the in sample data, fit $\hat{y}_{t}^{out}$ corresponding to each $x_{t}^{out}$. This fitted value is a function of the in sample data, as well as the evaluation point $x_{t}^{out}$, but it does not involve $y_{t}^{out}$.
(4) Repeat for all out of sample points.
(5) Calculate RMSE($\gamma$).
(6) Go to step (2), or to the next step if enough window widths have been tried.
(7) Select the $\gamma$ that minimizes RMSE($\gamma$) (verify that a minimum has been found, for example by plotting RMSE as a function of $\gamma$).
(8) Re-estimate using the best $\gamma$ and all of the data.

This same principle can be used to choose $A$ and $J$ in a Fourier form model. A code sketch of the closely related leave-one-out variant appears below.
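A Python sketch of the leave-one-out variant referred to above, reusing the nw_regression sketch; illustrative only.

import numpy as np

def cv_score(h, x, y):
    n = len(y)
    errs = np.empty(n)
    for t in range(n):
        keep = np.arange(n) != t
        yhat_t = nw_regression(x[t:t + 1], x[keep], y[keep], h)[0]
        errs[t] = y[t] - yhat_t
    return np.sqrt(np.mean(errs ** 2))           # RMSE(h)

# best_h = min(candidate_widths, key=lambda h: cv_score(h, x, y))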

18.5. Kernel density estimation


The previous discussion suggests that a kernel density estimator may easily
be constructed. We have already seen how joint densities may be estimated.
conditional on ,

then the kernel estimate of the conditional density is simply

If were interested in a conditional density, for example of

f
I
f

  A I@
G
f   t j f I@f )

f f f

b
I@ f

%b p
f I

 b
I@ f

b p
r  b p u u f I

! 
f

u


where we obtain the expressions for the joint and marginal densities from the
section on kernel regression.
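A minimal Python sketch of this conditional density estimator, using Gaussian kernels for both the joint and marginal estimates; illustrative only, with scalar $y$ and $x$.

import numpy as np
from scipy.stats import norm

def cond_density(y0, x0, y, x, h):
    joint = np.mean(norm.pdf((y0 - y) / h) * norm.pdf((x0 - x) / h)) / h ** 2
    marg = np.mean(norm.pdf((x0 - x) / h)) / h
    return joint / marg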
18.6. Semi-nonparametric maximum likelihood
Readings: Gallant and Nychka, Econometrica, 1987. For a Fortran program to do this and a useful discussion in the user's guide, see this link. See also Cameron and Johansson, Journal of Applied Econometrics, V. 12, 1997.
MLE is the estimation method of choice when we are confident about specifying the density. Is it possible to obtain the benefits of MLE when we're not so confident about the specification? In part, yes.

Suppose we're interested in the density of $y$ conditional on $x$ (both may be vectors). Suppose that the density $f(y|x,\phi)$ is a reasonable starting approximation to the true density. This density can be reshaped by multiplying it by a squared polynomial. The new density is
\[
g_{p}(y|x,\phi,\gamma)=\frac{h_{p}^{2}(y|\gamma)f(y|x,\phi)}{\eta_{p}(x,\phi,\gamma)},
\]
where
\[
h_{p}(y|\gamma)=\sum_{k=0}^{p}\gamma_{k}y^{k}
\]
and $\eta_{p}(x,\phi,\gamma)$ is a normalizing factor to make the density integrate (sum) to one. Because $h_{p}^{2}(y|\gamma)/\eta_{p}(x,\phi,\gamma)$ is a homogenous function of $\gamma$, it is necessary to impose a normalization: $\gamma_{0}$ is set to 1.

The normalization factor $\eta_{p}(\phi,\gamma)$ is calculated (following Cameron and Johansson) using the raw moments of the baseline density,
\[
m_{r}=\sum_{y=0}^{\infty}y^{r}f(y|\phi).
\]
Since
\[
\sum_{y=0}^{\infty}h_{p}^{2}(y|\gamma)f(y|\phi)=\sum_{y=0}^{\infty}\sum_{k=0}^{p}\sum_{l=0}^{p}\gamma_{k}\gamma_{l}y^{k+l}f(y|\phi)=\sum_{k=0}^{p}\sum_{l=0}^{p}\gamma_{k}\gamma_{l}m_{k+l},
\]
we get that the normalizing factor is

(18.6.1)
\[
\eta_{p}(\phi,\gamma)=\sum_{k=0}^{p}\sum_{l=0}^{p}\gamma_{k}\gamma_{l}m_{k+l}.
\]
Recall that $\gamma_{0}$ is set to 1 to achieve identification. The $m_{r}$ in equation 18.6.1 are the raw moments of the baseline density. Gallant and Nychka (1987) give conditions under which such a density may be treated as correctly specified, asymptotically. Basically, the order of the polynomial must increase as the sample size increases. However, there are technicalities.
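To fix ideas, here is a hedged Python sketch of the reshaping calculation; raw_moment and base_density are hypothetical callables standing in for the baseline density's raw moments and pmf, and are not from the text.

import numpy as np

def normalizing_factor(gam, raw_moment):
    # eta_p = sum_k sum_l gam_k gam_l m_{k+l}, with gam[0] fixed at 1
    p = len(gam) - 1
    return sum(gam[k] * gam[l] * raw_moment(k + l)
               for k in range(p + 1) for l in range(p + 1))

def reshaped_density(y, base_density, gam, raw_moment):
    h = np.polyval(gam[::-1], y)                 # h_p(y) = sum_k gam_k y^k
    return (h ** 2) * base_density(y) / normalizing_factor(gam, raw_moment)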
Similarly to Cameron and Johansson (1997), we may develop a negative binomial polynomial (NBP) density for count data. The negative binomial baseline density may be written as
\[
f_{Y}(y|\phi)=\frac{\Gamma(y+\psi)}{\Gamma(y+1)\Gamma(\psi)}\left(\frac{\psi}{\psi+\lambda}\right)^{\psi}\left(\frac{\lambda}{\psi+\lambda}\right)^{y},
\]
where $\phi=\{\lambda,\psi\}$, $\lambda>0$ and $\psi>0$. The usual means of incorporating conditioning variables $x$ is the parameterization $\lambda=e^{x^{\prime}\beta}$. When $\psi=\lambda/\alpha$ we have the negative binomial-I model (NB-I). When $\psi=1/\alpha$ we have the negative binomial-II (NB-II) model. For both forms, $E(Y)=\lambda$.

The reshaped density, with normalization to sum to one, is

(18.6.2)
\[
f_{Y}(y|\phi,\gamma)=\frac{\left[h_{p}(y|\gamma)\right]^{2}}{\eta_{p}(\phi,\gamma)}\frac{\Gamma(y+\psi)}{\Gamma(y+1)\Gamma(\psi)}\left(\frac{\psi}{\psi+\lambda}\right)^{\psi}\left(\frac{\lambda}{\psi+\lambda}\right)^{y}.
\]
To get the normalization factor, we need the moment generating function:

(18.6.3)
\[
M_{Y}(t)=\psi^{\psi}\left(\lambda-e^{t}\lambda+\psi\right)^{-\psi}.
\]

To illustrate, here are the first through fourth raw moments of the NB density, calculated using MuPAD, which is a Computer Algebra System that is free for personal use, and then programmed in Ox. These are the moments you would need to use a second order polynomial ($p=2$):

if(k_gam >= 1)
{
    m[][0] = lambda;
    m[][1] = (lambda .* (lambda + psi + lambda .* psi)) ./ psi;
}
if(k_gam >= 2)
{
    m[][2] = (lambda .* (psi .^ 2 + 3 .* lambda .* psi .* (1 + psi)
             + lambda .^ 2 .* (2 + 3 .* psi + psi .^ 2))) ./ psi .^ 2;
    m[][3] = (lambda .* (psi .^ 3 + 7 .* lambda .* psi .^ 2 .* (1 + psi)
             + 6 .* lambda .^ 2 .* psi .* (2 + 3 .* psi + psi .^ 2)
             + lambda .^ 3 .* (6 + 11 .* psi + 6 .* psi .^ 2 + psi .^ 3))) ./ psi .^ 3;
}
After calculating the raw moments, the normalization factor is calculated using equation 18.6.1, again with the help of MuPAD.

if(k_gam == 1)
{
    norm_factor = 1 + gam[0][] .* (2 .* m[][0] + gam[0][] .* m[][1]);
}
else
if(k_gam == 2)
{
    norm_factor = 1 + gam[0][] .^ 2 .* m[][1]
                + 2 .* gam[0][] .* (m[][0] + gam[1][] .* m[][2])
                + gam[1][] .* (2 .* m[][1] + gam[1][] .* m[][3]);
}

For higher-order polynomials the analogous formulae are impressively (i.e., several pages) long. This is an example of a model that would be difficult to formulate without the help of a program like MuPAD.

It is possible that there is conditional heterogeneity such that the appropriate reshaping should be more local. This can be accommodated by allowing the $\gamma_{k}$ parameters to depend upon the conditioning variables, for example using polynomials.

Gallant and Nychka, Econometrica, 1987 prove that this sort of density can approximate a wide variety of densities arbitrarily well as the degree of the polynomial increases with the sample size. This approach is not without its drawbacks: the sample objective function can have an extremely large number of local maxima that can lead to numeric difficulties. If someone could figure out how to do this in a way such that the sample objective function was nice and smooth, they would probably get the paper published in a good journal. Any ideas?

Here's a plot of the true and the limiting SNP approximations (with the order of the polynomial fixed) to four different count data densities, which variously exhibit over- and underdispersion, as well as excess zeros. The baseline model is a negative binomial density.

[Figure: Figures/SNP.eps — true densities and limiting SNP approximations for four count data densities.]


18.7. Examples
18.7.1. Fourier form estimation. You need to get the file FFF.ox, which sets up the data matrix for Fourier form estimation. The first DGP generates data with a nonlinear mean and additive errors (with the mean subtracted out). Then the program fourierform.ox allows you to experiment with different sample sizes and numbers of basis functions. There is no need to specify multi-indices with a univariate regressor (as is the case here, to keep the graphics simple). For a given sample size, here are several plots with different numbers of basis functions.

This first plot shows an underparameterized fit.

[Figure: Nonparametric-I/fff_2.eps]

This next one looks pretty good.

[Figure: Nonparametric-I/fff_4.eps]

Here's an example of an overfitted model — we are starting to chase the error term too much.

[Figure: Nonparametric-I/fff_10.eps]

18.7.2. Kernel regression estimation. You need to get the file KernelLib.ox, which contains the routines for kernel regression and density estimation.

18.7.3. Kernel regression. We will use the same data generating process as for the above examples of Fourier form models. The program kernelreg1.ox allows you to experiment with different sample sizes and window widths. For a given sample size, here are several plots with different window widths. Note that too small a window width (ww = 0.1) leads to a very irregular fit, while setting the window width too high leads to too flat a fit.

[Figure: Nonparametric-I/undersmoothed.eps]

[Figure: Nonparametric-I/oversmoothed.eps]

[Figure: Nonparametric-I/justright.eps]

Cross Validation. The leave-one-out method of cross validation consists of doing an out-of-sample fit to each data point in turn, and calculating the MSE. This is repeated for various window widths. The minimum MSE window width may then be chosen. The program kernelreg2.ox does this. The results are:

[Figure: Nonparametric-I/cvscores.eps]

[Figure: Nonparametric-I/crossvalidated.eps]

18.7.4. Kernel density estimation. The second DGP generates draws of a random variable, then estimates their density using kernel density estimation. The program kerneldens.ox allows you to experiment using different sample sizes, kernels, and window widths. The following figure shows an Epanechnikov kernel fit using different window widths. To change kernels you need to selectively (un)comment lines in the KernelLib.ox file.

[Figure: Nonparametric-I/kerneldensfit.eps]

18.7.5. Seminonparametric density estimation and MuPAD. Following the lecture notes, an SNP density for count data may be obtained by reshaping a negative binomial density using a squared polynomial:

(18.7.1)
\[
f_{Y}(y|\phi,\gamma)=\frac{\left[h_{p}(y|\gamma)\right]^{2}}{\eta_{p}(\phi,\gamma)}\frac{\Gamma(y+\psi)}{\Gamma(y+1)\Gamma(\psi)}\left(\frac{\psi}{\psi+\lambda}\right)^{\psi}\left(\frac{\lambda}{\psi+\lambda}\right)^{y},
\]
where

(18.7.2)
\[
h_{p}(y|\gamma)=\sum_{k=0}^{p}\gamma_{k}y^{k}.
\]
The normalization factor is

(18.7.3)
\[
\eta_{p}(\phi,\gamma)=\sum_{k=0}^{p}\sum_{l=0}^{p}\gamma_{k}\gamma_{l}m_{k+l}.
\]
To implement this using a polynomial of order $p$, we need the raw moments of the negative binomial density up to order $2p$. I couldn't find the NB moment generating function anywhere, so a solution is to calculate it using a Computer Algebra System (CAS). Rather than using one of the expensive alternatives, we can try out MuPAD, which can be downloaded and is free (in the sense of free beer) for personal use. It is installed on the Linux machines in the computer room, and if you like you can install the Windows version, too.

The file negbinSNP.mpd, if run using the command mupad negbinSNP.mpd, will give you the output that follows:

*----*   MuPAD 2.5.1 -- The Open Computer Algebra System
|        Copyright (c) 1997 - 2002 by SciFace Software
*----*   All rights reserved.
Licensed to:   Dr. Michael Creel

Negative Binomial SNP Density

First define the NB density

    gamma(a + y) (a/(a + b))^a (b/(a + b))^y
    ----------------------------------------
              gamma(a) gamma(y + 1)

Verify that it sums to 1

Define the MGF

         (a/(a + b))^a
    -------------------------------
    ((a + b - b exp(t))/(a + b))^a

Print the MGF in TeX format

"\\frac{\\frac{a}{\\left(a + b\\right)}^a}{\\frac{\\left(a + b - b\\, \
ox{exp}\\left(t\\right)\\right)}{\\left(a + b\\right)}^a}"

Find the first moment (which we know is b (lambda))

Find the fifth moment (which we probably don't know)

[long polynomial in a and b; see the Fortran and TeX forms below]

Print the fifth moment in fortran form, to program ln L

"   t3 = a**-4*(b**5*24.0D0+60.0D0*a*b**4+a**4*b+50.0D0*a*b**5+50.0D
  ~(a*a)*b**3+15.0D0*a**3*(b*b)+110.0D0*(a*a)*b**4+75.0D0*a**3*b**3
  ~5.0D0*a**4*(b*b)+35.0D0*(a*a)*b**5+60.0D0*a**3*b**4+25.0D0*a**4*
  ~*3+10.0D0*a**3*b**5+10.0D0*a**4*b**4+a**4*b**5)"

Print the fifth moment in TeX form

"\\frac{24\\, b^5 + 60\\, a\\, b^4 + a^4\\, b + 50\\, a\\, b^5 + 50\\,
\\, b^3 + 15\\, a^3\\, b^2 + 110\\, a^2\\, b^4 + 75\\, a^3\\, b^3 + 15\
a^4\\, b^2 + 35\\, a^2\\, b^5 + 60\\, a^3\\, b^4 + 25\\, a^4\\, b^3 + 1
, a^3\\, b^5 + 10\\, a^4\\, b^4 + a^4\\, b^5}{a^4}"

To get the normalizing factor, we need expressions of the form of the following

a(0) b(0) m(0) + a(0) b(1) m(1) + b(0) a(1) m(1) + a(0) b(2) m(2) +
b(0) a(2) m(2) + a(1) b(1) m(2) + a(0) b(3) m(3) + b(0) a(3) m(3) +
a(1) b(2) m(3) + a(2) b(1) m(3) + a(1) b(3) m(4) + a(2) b(2) m(4) +
b(1) a(3) m(4) + a(2) b(3) m(5) + a(3) b(2) m(5) + a(3) b(3) m(6)

>> quit
Once you get expressions for the moments and the double sums, you can use these to program a loglikelihood function in Ox, without too much trouble. The file NegBinSNP.ox implements this. The file EstimateNBSNP.ox will let you estimate NegBinSNP models for the MEPS data. The estimation results for OBDV, using a second order polynomial and a NB-I baseline model, are

Ox version 3.20 (Linux) (C) J.A. Doornik, 1994-2002

***********************************************************************
MEPS data, OBDV
negbin_snp_obj results
Strong convergence
Observations = 500
Avg. Log Likelihood   -2.2426

Standard Errors
             params      se(OPG)     se(Sand.)   se(Hess)
constant     1.5340      0.13289     0.12645     0.12593
pub_ins      0.16113     0.053100    0.056824    0.054144
priv_ins     0.090624    0.062689    0.065619    0.063835
sex          0.16863     0.047614    0.050720    0.048707
age          0.17950     0.048407    0.045060    0.046301
educ         0.039692    0.047968    0.058794    0.052521
inc          0.032581    0.064384    0.043708    0.051033
ln_alpha     1.8138      0.18466     0.17398     0.17378
gam1        -0.052710    0.0089429   0.0078799   0.0083419
gam2         0.013382    0.0042349   0.0039745   0.0040547

t-Stats
             params      t(OPG)      t(Sand.)    t(Hess)
constant     1.5340      11.543      12.132      12.181
pub_ins      0.16113     3.0344      2.8356      2.9759
priv_ins     0.090624    1.4456      1.3811      1.4197
sex          0.16863     3.5416      3.3248      3.4621
age          0.17950     3.7082      3.9837      3.8769
educ         0.039692    0.82746     0.67509     0.75573
inc          0.032581    0.50603     0.74541     0.63842
ln_alpha     1.8138      9.8226      10.425      10.438
gam1        -0.052710   -5.8941     -6.6892     -6.3188
gam2         0.013382    3.1599      3.3669      3.3003

Information Criteria
CAIC        BIC         AIC
2314.7      2304.7      2262.6
***********************************************************************

Note that the CAIC and BIC are lower for this model than for the ordinary NB-I model. NOTE: density functions formed in this way may have MANY local maxima, so you need to be careful before accepting the results of a casual run. To guard against having converged to a local maximum, one can try using multiple starting values, or one could try simulated annealing as an optimization method. To do this, copy maxsa.ox and maxsa.h into your working directory, and then use the program EstimateNBSNP2.ox to see how to implement SA estimation of the reshaped negative binomial model. For more details on the Ox implementation of SA, see Charles Bos' page. Note - in my own experience, using a gradient-based method such as BFGS with many starting values is as successful as SA, and is usually faster. Perhaps I'm not using SA as well as is possible... YMMV.

CHAPTER 19

Simulation-based estimation
Readings: In addition to the book mentioned previously, articles include Gallant and Tauchen (1996), "Which Moments to Match?", ECONOMETRIC THEORY, Vol. 12, 1996, pages 657-681; Gourieroux, Monfort and Renault (1993), "Indirect Inference," J. Apl. Econometrics; Pakes and Pollard (1989), Econometrica; McFadden (1989), Econometrica.

19.1. Motivation
Simulation methods are of interest when the DGP is fully characterized by
a parameter vector, but the likelihood function is not calculable. If it were
available, we would simply estimate by MLE, which is asymptotically fully
efcient.

19.1.1. Example: Multinomial and/or dynamic discrete response models.


Let $y_{i}^{*}$ be a latent random vector of dimension $m$. Suppose that
\[
y_{i}^{*}=X_{i}\beta+\varepsilon_{i},
\]
where $X_{i}$ is $m\times K$. Suppose that

(19.1.1)
\[
\varepsilon_{i}\sim N(0,\Omega).
\]
Henceforth drop the $i$ subscript when it is not needed for clarity.
• Suppose that $y^{*}$ is not observed. Rather, we observe a many-to-one mapping
\[
y=\tau(y^{*}).
\]
This mapping is such that each element of $y$ is either zero or one (in some cases only one element will be one).
• Define
\[
A_{i}=A(y_{i})=\left\{y^{*}|y_{i}=\tau(y^{*})\right\}.
\]
Suppose random sampling of $(y_{i},X_{i})$. In this case the elements of $y_{i}$ may not be independent of one another (and clearly are not if $\Omega$ is not diagonal). However, $y_{i}$ is independent of $y_{j}$, $i\neq j$.
• Let $\theta=(\beta^{\prime},(vec^{*}\Omega)^{\prime})^{\prime}$ be the vector of parameters of the model. The contribution of the $i^{th}$ observation to the likelihood function is
\[
p_{i}(\theta)=\int_{A_{i}}n(y_{i}^{*}-X_{i}\beta,\Omega)dy_{i}^{*},
\]
where
\[
n(\varepsilon,\Omega)=(2\pi)^{-m/2}\left|\Omega\right|^{-1/2}\exp\left(-\frac{\varepsilon^{\prime}\Omega^{-1}\varepsilon}{2}\right)
\]
is the multivariate normal density of an $m$-dimensional random vector. The log-likelihood function is
\[
\ln L(\theta)=\frac{1}{n}\sum_{i=1}^{n}\ln p_{i}(\theta),
\]
and the MLE $\hat{\theta}$ solves the score equations
\[
\frac{1}{n}\sum_{i=1}^{n}g_{i}(\hat{\theta})=\frac{1}{n}\sum_{i=1}^{n}\frac{D_{\theta}p_{i}(\hat{\theta})}{p_{i}(\hat{\theta})}\equiv0.
\]
• The problem is that evaluation of $p_{i}(\theta)$ and its derivative w.r.t. $\theta$ by standard methods of numeric integration such as quadrature is computationally infeasible when $m$ (the dimension of $y$) is higher than 3 or 4 (as long as there are no restrictions on $\Omega$).
• The mapping $\tau(y^{*})$ has not been made specific so far. This setup is quite general: for different choices of $\tau(y^{*})$ it nests the case of dynamic binary discrete choice models as well as the case of multinomial discrete choice (the choice of one out of a finite set of alternatives).
– Multinomial discrete choice is illustrated by a (very simple) job search model. We have cross sectional data on individuals' matching to a set of $m$ jobs that are available (one of which is unemployment). The utility of alternative $j$ is
\[
u_{j}=X_{j}\beta+\varepsilon_{j}.
\]
Utilities of jobs, stacked in the vector $u_{i}$, are not observed. Rather, we observe the vector formed of elements
\[
y_{j}=1\left[u_{j}>u_{k},\forall k\in m,k\neq j\right].
\]
Only one of these elements is different than zero.
– Dynamic discrete choice is illustrated by repeated choices over time between two alternatives. Let alternative $j$ have utility
\[
u_{jt}=W_{jt}\beta-\varepsilon_{jt},\qquad j\in\{1,2\},\; t\in\{1,2,\ldots,m\}.
\]
Then
\[
y^{*}=u_{2}-u_{1}=(W_{2}-W_{1})\beta+\varepsilon_{2}-\varepsilon_{1}\equiv X\beta+\varepsilon.
\]
Now the mapping is (element-by-element)
\[
y_{t}=1\left[y_{t}^{*}>0\right],
\]
that is, $y_{it}=1$ if individual $i$ chooses the second alternative in period $t$, zero otherwise.

19.1.2. Example: Marginalization of latent variables. Economic data often presents substantial heterogeneity that may be difcult to model. A possibility is to introduce latent random variables. This can cause the problem that
there may be no known closed form for the distribution of observable variables after marginalizing out the unobservable latent variables. For example,
count data (that takes values $\{0,1,2,\ldots\}$) is often modeled using the Poisson distribution
\[
\Pr(y=i)=\frac{\exp(-\lambda)\lambda^{i}}{i!}.
\]
The mean and variance of the Poisson distribution are both equal to $\lambda$. Often, one parameterizes the conditional mean as
\[
\lambda_{i}=\exp(x_{i}^{\prime}\beta).
\]
This ensures that the mean is positive (as it must be). Estimation by ML is straightforward.

Often, count data exhibits "overdispersion," which simply means that
\[
V(y|x)>E(y|x).
\]
If this is the case, a solution is to use the negative binomial distribution rather than the Poisson. An alternative is to introduce a latent variable that reflects heterogeneity into the specification:
\[
\lambda_{i}=\exp(x_{i}^{\prime}\beta+\eta_{i}),
\]
where $\eta_{i}$ has some specified density with support $S$ (this density may depend on additional parameters). Let $d\mu(\eta_{i})$ be the density of $\eta_{i}$. In some cases, the marginal density of $y$,
\[
\Pr(y=y_{i})=\int_{S}\frac{\exp\left[-\exp(x_{i}^{\prime}\beta+\eta_{i})\right]\left[\exp(x_{i}^{\prime}\beta+\eta_{i})\right]^{y_{i}}}{y_{i}!}d\mu(\eta_{i}),
\]
will have a closed-form solution (one can derive the negative binomial distribution in this way if $\eta$ has an exponential distribution), but often this will not be possible. In this case, simulation is a means of calculating $\Pr(y=y_{i})$, which is then used to do ML estimation. This would be an example of Simulated Maximum Likelihood (SML) estimation.
• In this case, since there is only one latent variable, quadrature is probably a better choice. However, a more flexible model with heterogeneity would allow all parameters (not just the constant) to vary. For example
\[
\Pr(y=y_{i})=\int_{S}\frac{\exp\left[-\exp(x_{i}^{\prime}\beta_{i})\right]\left[\exp(x_{i}^{\prime}\beta_{i})\right]^{y_{i}}}{y_{i}!}d\mu(\beta_{i})
\]
entails a $K=\dim\beta_{i}$-dimensional integral, which will not be evaluable by quadrature when $K$ gets large.

19.1.3. Estimation of models specified in terms of stochastic differential equations. It is often convenient to formulate models in terms of continuous time using differential equations. A realistic model should account for exogenous shocks to the system, which can be done by assuming a random component. This leads to a model that is expressed as a system of stochastic differential equations. Consider the process
\[
dy_{t}=g(\theta,y_{t})dt+h(\theta,y_{t})dW_{t},
\]
which is assumed to be stationary. $\{W_{t}\}$ is a standard Brownian motion (Wiener process), such that
\[
W(T)=\int_{0}^{T}dW_{t}\sim N(0,T).
\]
Brownian motion is a continuous-time stochastic process such that
• $W(0)=0$
• $\left[W(s)-W(t)\right]\sim N(0,s-t)$
• $\left[W(s)-W(t)\right]$ and $\left[W(j)-W(k)\right]$ are independent for $s>t>j>k$. That is, non-overlapping segments are independent.
One can think of Brownian motion as the accumulation of independent normally distributed shocks with infinitesimal variance.
• The function $g(\theta,y_{t})$ is the deterministic part.
• $h(\theta,y_{t})$ determines the variance of the shocks.
To estimate a model of this sort, we typically have data that are assumed to be observations of $y_{t}$ in discrete points $y_{1},y_{2},\ldots,y_{T}$. That is, though $y_{t}$ is a continuous process it is observed in discrete time.

To perform inference on $\theta$, direct ML or GMM estimation is not usually feasible, because one cannot, in general, deduce the transition density $f(y_{t}|y_{t-1},\theta)$. This density is necessary to evaluate the likelihood function or to evaluate moment conditions (which are based upon expectations with respect to this density).
• A typical solution is to "discretize" the model, by which we mean to find a discrete time approximation to the model. The discretized version of the model is
\begin{align*}
y_{t}-y_{t-1} & =g(\phi,y_{t-1})+h(\phi,y_{t-1})\varepsilon_{t}\\
\varepsilon_{t} & \sim N(0,1).
\end{align*}
The discretization induces a new parameter, $\phi$ (that is, the $\phi^{0}$ which defines the best approximation of the discretization to the actual (unknown) discrete time version of the model is not equal to $\theta^{0}$, which is the true parameter value). This is an approximation, and as such "ML" estimation of $\phi$ (which is actually quasi-maximum likelihood, QML) based upon this equation is in general biased and inconsistent for the original parameter, $\theta$. Nevertheless, the approximation shouldn't be too bad, which will be useful, as we will see. A simulation sketch of this discretized model appears just below.
• The important point about these three examples is that computational difficulties prevent direct application of ML, GMM, etc. Nevertheless the model is fully specified in probabilistic terms up to a parameter vector. This means that the model is simulable, conditional on the parameter vector.
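A minimal Python sketch of simulating the discretized (Euler) version of the model, under the assumption that the drift g and diffusion h are supplied as functions; illustrative only.

import numpy as np

def simulate_discretized(y0, g, h, phi, T, rng):
    y = np.empty(T)
    y[0] = y0
    eps = rng.standard_normal(T)
    for t in range(1, T):
        y[t] = y[t - 1] + g(phi, y[t - 1]) + h(phi, y[t - 1]) * eps[t]
    return y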

19.2. Simulated maximum likelihood (SML)


For simplicity, consider cross-sectional data. An ML estimator solves
0

is the density function of the

is an infeasible estimator. However,

d f
4 g

does not have a known closed form,

observation. When

d f
a 

 2
I
@
1 d f
dca f


A f )  4RC ~ E

where

it may be possible to dene a random function such that

d f d  f a
a gs  4cag!p

is known. If this is the case, the simulator

d  f
4ag 
) 
8 d f
g4a gs

d  f
4a 

The SML simply substitutes

in place of

log-likelihood function, that is

d f
ca g

is unbiased for

d  f a
4 Eg!@t

where the density of

in the

B 1 d f
daf 
 

f )  eS ~ 

19.2.1. Example: multinomial probit. Recall that the utility of alternative


is

G P
V


19.2. SIMULATED MAXIMUM LIKELIHOOD (SML)

is formed of elements

  g
""i h


Y ) f

cant be calculated when

d ) U
 f

The problem is that

C%

and the vector

416

is larger than 4 or

5. However, it is easy to simulate this probability.

) U B f


times and dene


I
B f g U B
B

-vector formed of the

. Each element of

tween 0 and 1, and the elements sum to one.

is be-

d f
B } RB f   B  B 

Now

as the

Dene

Repeat this

is the matrix formed by stacking the

  g
Q"iD hB

Dene

(where

P
B G B B


Calculate

C%

B G

from the distribution

B
 j

Draw




The SML multinomial probit log-likelihood function is

4d f RB B 1

B  B  A f ) 
f

This is to be maximized w.r.t.

are draw only once and are used repeatedly during

The draws are different for each

are re-drawn at every iteration the estimator will not converge.

B G

If the

and

the iterations used to nd

draws of

8
E3

The

B G

Notes:

and

The log-likelihood function with this simulator is a discontinuous func

and

tion of

This does not cause problems from a theoretical point

19.2. SIMULATED MAXIMUM LIKELIHOOD (SML)

A


of view since it can be shown that

417

is stochastically equicon-

tinuous. However, it does cause problems if one attempts to use a


d

gradient-based optimization method such as Newton-Raphson.

are zero. If the corresponding element of

equal to 1, there will be a

some elements of

, are used, that

problem.

Bf

It may be the case, particularly if few simulations,

is

• Solutions to discontinuity:
  1) Use an estimation method that doesn't require a continuous and differentiable objective function, for example, simulated annealing. This is computationally costly.
  2) Smooth the simulated probabilities so that they are continuous functions of the parameters. For example, apply a kernel transformation such as

\tilde{y}_{ij} = \frac{\exp\left[A\,(\tilde{U}_{ij} - \max_k \tilde{U}_{ik})\right]}{\sum_{l=1}^m \exp\left[A\,(\tilde{U}_{il} - \max_k \tilde{U}_{ik})\right]}

  where A is a large positive number. This approximates a step function: ỹ_ij is very close to zero if Ũ_ij is not the maximum, and very close to one if it is the maximum. This makes ỹ_ij, and therefore π̃_ij and ln L̃(β), a continuous and differentiable function of β. Consistency requires that A(n) → ∞ as the sample size increases, so that the approximation to a step function becomes arbitrarily close. There are alternative methods (e.g., Gibbs sampling) that may work better, but this is too technical to discuss here.
• To solve the log(0) problem, one possibility is to search the web for the slog function. Also, increase H if this is a serious problem.
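To make the frequency simulator above concrete, here is a minimal Octave sketch. All names and the interface are illustrative assumptions (this is not code distributed with the document); it implements the simple, unsmoothed frequency simulator, with the error draws generated once and reused at every trial value of β.

% freq_sim.m: simple frequency simulator for multinomial choice probabilities
% beta: k x 1 parameters; X: cell array where X{j} is the n x k regressor matrix
%       of alternative j; E: n x m x H array of error draws, made ONCE and reused
%       at every trial value of beta (re-drawing would prevent convergence)
function P = freq_sim(beta, X, E)
  [n, m, H] = size(E);
  P = zeros(n, m);
  for h = 1:H
    U = zeros(n, m);
    for j = 1:m
      U(:, j) = X{j}*beta + E(:, j, h);    % simulated utility of alternative j
    endfor
    [dummy, best] = max(U, [], 2);         % alternative with the highest utility
    for j = 1:m
      P(:, j) += (best == j);              % count how often j is chosen
    endfor
  endfor
  P = P/H;   % each row lies in [0,1] and sums to one
endfunction

With y_i coded as an m-vector of zeros and a single one, observation i contributes log(P(i,:)*y_i) to the simulated log-likelihood.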


19.2.2. Properties. The properties of the SML estimator depend on how H is set. The following is taken from Lee (1995), "Asymptotic Bias in Simulated Maximum Likelihood Estimation of Discrete Choice Models," Econometric Theory, 11, pp. 437-483.

Theorem 32. [Lee] 1) If lim_{n→∞} n^{1/2}/H = 0, then

\sqrt{n}\left(\hat{\theta}_{SML} - \theta^0\right) \overset{d}{\to} N\left(0, I^{-1}(\theta^0)\right)

2) If lim_{n→∞} n^{1/2}/H = λ, λ a finite constant, then

\sqrt{n}\left(\hat{\theta}_{SML} - \theta^0\right) \overset{d}{\to} N\left(B, I^{-1}(\theta^0)\right)

where B is a finite vector of constants.

• This means that the SML estimator is asymptotically biased if H doesn't grow faster than n^{1/2}.
• The varcov is the typical inverse of the information matrix, so that as long as H grows fast enough the estimator is consistent and fully asymptotically efficient.
19.3. Method of simulated moments (MSM)

Suppose we have a DGP(y|x, θ) which is simulable given θ, but is such that the density of y is not calculable.

One could, in principle, base a GMM estimator upon the moment conditions

m_t(\theta) = \left[K(y_t, x_t) - k(x_t, \theta)\right] z_t

where

k(x_t, \theta) = \int K(y_t, x_t)\, p(y|x_t, \theta)\, dy,

z_t is a vector of instruments in the information set and p(y|x_t, θ) is the density of y conditional on x_t. The problem is that this density is not available.

• However k(x_t, θ) is readily simulated using

\tilde{k}(x_t, \theta) = \frac{1}{H}\sum_{h=1}^H K(\tilde{y}_t^h, x_t)

• By the law of large numbers, k̃(x_t, θ) →a.s. k(x_t, θ) as H → ∞, which provides a clear intuitive basis for the estimator, though in fact we obtain consistency even for H finite, since a law of large numbers is also operating across the n observations of real data, so errors introduced by simulation cancel themselves out.
• This allows us to form the moment conditions

(19.3.1)   \tilde{m}_t(\theta) = \left[K(y_t, x_t) - \tilde{k}(x_t, \theta)\right] z_t

where z_t is drawn from the information set. As before, form

(19.3.2)   \tilde{m}(\theta) = \frac{1}{n}\sum_{t=1}^n \tilde{m}_t(\theta) = \frac{1}{n}\sum_{t=1}^n \left[K(y_t, x_t) - \frac{1}{H}\sum_{h=1}^H K(\tilde{y}_t^h, x_t)\right] z_t

with which we form the GMM criterion and estimate as usual. Note that the unbiased simulator K(ỹ_t^h, x_t) appears linearly within the sums.
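As an illustration of how (19.3.2) is computed, here is a minimal Octave sketch. The functions simulate_y (a draw from the model given x and θ) and Kfn (the function K(y, x) above) are hypothetical placeholders for user-supplied code, and in practice the random draws behind simulate_y should be held fixed across iterations, e.g. by fixing the seed or passing the draws in.

% msm_moments.m: averaged simulated moment conditions, equation (19.3.2)
% y: n x 1 data; x: n x p conditioning variables; z: n x g instruments
% theta: trial parameter value; H: number of simulations per observation
function mbar = msm_moments(theta, y, x, z, H)
  n = rows(y);
  ktilde = zeros(n, 1);
  for h = 1:H
    ysim = simulate_y(theta, x);          % hypothetical: one simulated y for each x_t
    ktilde += Kfn(ysim, x)/H;             % unbiased simulator of k(x_t, theta)
  endfor
  e = Kfn(y, x) - ktilde;                 % K(y_t, x_t) - k~(x_t, theta)
  m = z .* repmat(e, 1, columns(z));      % each row is m~_t(theta)'
  mbar = mean(m)';                        % g x 1 vector of averaged moment conditions
endfunction

The GMM criterion is then formed from mbar and a weighting matrix in the usual way, and minimized over θ.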

19.3.1. Properties. Suppose that the optimal weighting matrix is used. McFadden (ref. above) and Pakes and Pollard (refs. above) show that the asymptotic distribution of the MSM estimator is very similar to that of the infeasible GMM estimator. In particular, assuming that the optimal weighting matrix is used, and for H finite,

(19.3.3)   \sqrt{n}\left(\hat{\theta}_{MSM} - \theta^0\right) \overset{d}{\to} N\left[0, \left(1 + \frac{1}{H}\right)\left(D_\infty \Omega^{-1} D_\infty'\right)^{-1}\right]

where (D_∞ Ω⁻¹ D_∞')⁻¹ is the asymptotic variance of the infeasible GMM estimator. That is, the asymptotic variance is inflated by a factor 1 + 1/H. For this reason the MSM estimator is not fully asymptotically efficient relative to the infeasible GMM estimator, for H finite, but the efficiency loss is small and controllable, by setting H reasonably large.
• The estimator is asymptotically unbiased even for H = 1. This is an advantage relative to SML.
• If one doesn't use the optimal weighting matrix, the asymptotic varcov is just the ordinary GMM varcov, inflated by 1 + 1/H.
• The above presentation is in terms of a specific moment condition based upon the conditional mean. Simulated GMM can be applied to moment conditions of any form.
19.3.2. Comments. Why is SML inconsistent if H is finite, while MSM is consistent? The reason is that SML is based upon an average of logarithms of an unbiased simulator (the densities of the observations). To use the multinomial probit model as an example, the log-likelihood function is

\ln L(\beta) = \frac{1}{n}\sum_{i=1}^n y_i' \ln p_i(\beta)

The SML version is

\ln \tilde{L}(\beta) = \frac{1}{n}\sum_{i=1}^n y_i' \ln \tilde{p}_i(\beta)

The problem is that

E \ln\left(\tilde{p}_i(\beta)\right) \neq \ln\left(E\, \tilde{p}_i(\beta)\right)

in spite of the fact that

E\, \tilde{p}_i(\beta) = p_i(\beta),

due to the fact that ln(·) is a nonlinear transformation. The only way for the two to be equal (in the limit) is if H tends to infinity, so that p̃(·) tends to p(·).

The reason that MSM does not suffer from this problem is that in this case the unbiased simulator appears linearly within every sum of terms, and it appears within a sum over n (see equation 19.3.2). Therefore the SLLN applies to cancel out simulation errors, from which we get consistency. That is, using simple notation for the random sampling case, the moment conditions

(19.3.4)   \tilde{m}(\theta) = \frac{1}{n}\sum_{t=1}^n \left[K(y_t, x_t) - \frac{1}{H}\sum_{h=1}^H K(\tilde{y}_t^h, x_t)\right] z_t

(19.3.5)   \phantom{\tilde{m}(\theta)} = \frac{1}{n}\sum_{t=1}^n \left[k(x_t, \theta^0) + \varepsilon_t - \frac{1}{H}\sum_{h=1}^H \left(k(x_t, \theta) + \tilde{\varepsilon}_{ht}\right)\right] z_t

(where ε_t and ε̃_{ht} are the mean-zero deviations of K(y_t, x_t) and K(ỹ_t^h, x_t) from their respective conditional means) converge almost surely to

\tilde{m}_\infty(\theta) = \int \left[k(x, \theta^0) - k(x, \theta)\right] z(x)\, d\mu(x).

The objective function converges to

s_\infty(\theta) = \tilde{m}_\infty(\theta)' \Omega_\infty^{-1} \tilde{m}_\infty(\theta)

which obviously has a minimum at θ⁰; hence, consistency.
• If you look at equation 19.3.5 a bit, you will see why the variance inflation factor is (1 + 1/H).
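To spell out that last remark, here is a minimal sketch of the calculation, under the simplifying assumption that the simulation errors ε̃_{ht} in (19.3.5) are i.i.d., independent of ε_t, and have the same variance σ²:

\mathrm{Var}\left(\varepsilon_t - \frac{1}{H}\sum_{h=1}^H \tilde{\varepsilon}_{ht}\right)
= \mathrm{Var}(\varepsilon_t) + \frac{1}{H^2}\sum_{h=1}^H \mathrm{Var}(\tilde{\varepsilon}_{ht})
= \sigma^2 + \frac{\sigma^2}{H}
= \left(1 + \frac{1}{H}\right)\sigma^2

so the variance of each moment condition, and therefore the asymptotic variance of the estimator, is scaled up by the factor (1 + 1/H), which shrinks toward one as H grows.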

19.4. Efficient method of moments (EMM)


The choice of which moments upon which to base a GMM estimator can
have very pronounced effects upon the efciency of the estimator.

A poor choice of moment conditions may lead to very inefcient es-

timators, and can even cause identication problems (as weve seen
with the GMM problem set).
The drawback of the above approach MSM is that the moment condi-

tions used in estimation are selected arbitrarily. The asymptotic efciency of the estimator may be low.
The asymptotically optimal choice of moments would be the score vector of the likelihood function,

d
eE   4RE
d
As before, this choice is unavailable.

The efcient method of moments (EMM) (see Gallant and Tauchen (1996),
Which Moments to Match?, ECONOMETRIC THEORY, Vol. 12, 1996, pages
657-681) seeks to provide moment conditions that closely mimic the score vector. If the approximation is very good, the resulting estimator will be very
nearly fully efcient.
The DGP is characterized by random sampling from the density

p eE 
d

p   g
d 
f

19.4. EFFICIENT METHOD OF MOMENTS (EMM)

423

We can define an auxiliary model, called the "score generator", which simply provides a (misspecified) parametric density

f(y \mid x_t, \lambda) \equiv f_t(\lambda)

• This density is known up to a parameter λ. We assume that this density function is calculable. Therefore quasi-ML estimation is possible. Specifically,

\hat{\lambda} = \arg\max_{\Lambda} \; s_n(\lambda) = \frac{1}{n}\sum_{t=1}^n \ln f_t(\lambda).

• After determining λ̂ we can calculate the score functions D_λ ln f(y_t | x_t, λ̂).
• The important point is that even if the density is misspecified, there is a pseudo-true λ⁰ for which the true expectation, taken with respect to the true but unknown density of y, p(y | x_t, θ⁰), and then marginalized over x, is zero:

\exists \lambda^0 : \; E_X E_{Y|X}\left[ D_\lambda \ln f(y \mid x, \lambda^0) \right] = \int_X \int_{Y|X} D_\lambda \ln f(y \mid x, \lambda^0)\, p(y \mid x, \theta^0)\, dy\, d\mu(x) = 0

• We have seen in the section on QML that λ̂ →p λ⁰; this suggests using the moment conditions

(19.4.1)   m_n(\theta, \hat{\lambda}) = \frac{1}{n}\sum_{t=1}^n \int D_\lambda \ln f_t(\hat{\lambda})\, p_t(\theta)\, dy

• These moment conditions are not calculable, since p_t(θ) is not available, but they are simulable using

\tilde{m}_n(\theta, \hat{\lambda}) = \frac{1}{n}\sum_{t=1}^n \frac{1}{H}\sum_{h=1}^H D_\lambda \ln f(\tilde{y}_t^h \mid x_t, \hat{\lambda})

where ỹ_t^h is a draw from DGP(θ), holding x_t fixed. By the LLN and the fact that λ̂ converges to λ⁰,

\tilde{m}_\infty(\theta^0, \lambda^0) = 0.

This is not the case for other values of θ, assuming that λ⁰ is identified.
• The advantage of this procedure is that if f(y_t | x_t, λ) closely approximates p(y | x_t, θ), then m̃_n(θ, λ̂) will closely approximate the optimal moment conditions which characterize maximum likelihood estimation, which is fully efficient.
• If one has prior information that a certain density approximates the data well, it would be a good choice for f(·).
• If one has no density in mind, there exist good ways of approximating unknown distributions parametrically: Philips' ERA's (Econometrica, 1983) and Gallant and Nychka's (Econometrica, 1987) SNP density estimator which we saw before. Since the SNP density is consistent, the efficiency of the indirect estimator is the same as the infeasible ML estimator.
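The structure of the simulated moment conditions m̃_n(θ, λ̂) can be sketched in a few lines of Octave. The functions simulate_y (a simulator of the model at θ, holding x fixed) and score_f (the auxiliary score D_λ ln f(y|x, λ), returned as an n × dim(λ) matrix) are hypothetical placeholders for user-supplied code, as is the function name itself.

% emm_moments.m: simulated EMM moment conditions
% theta: model parameters; lambda_hat: QML estimate of the score generator's parameter
% x: n x p conditioning variables; H: number of simulations per observation
function mbar = emm_moments(theta, lambda_hat, x, H)
  scores = 0;
  for h = 1:H
    ysim = simulate_y(theta, x);                 % hypothetical: draws from the model at theta
    scores += score_f(ysim, x, lambda_hat)/H;    % average the auxiliary scores over simulations
  endfor
  mbar = mean(scores)';                          % dim(lambda) x 1 moment conditions
endfunction

At θ = θ⁰ these averaged scores converge to zero by construction of the pseudo-true λ⁰, which is what identifies θ.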

19.4.1. Optimal weighting matrix. I will present the theory for H finite, and possibly small. This is done because it is sometimes impractical to estimate with H very large. Gallant and Tauchen give the theory for the case of H so large that it may be treated as infinite (the difference being irrelevant given the numerical precision of a computer). The theory for the case of H infinite follows directly from the results presented here.

The moment condition m̃(θ, λ̂) depends on the pseudo-ML estimate λ̂. We can apply Theorem 22 to conclude that

(19.4.2)   \sqrt{n}\left(\hat{\lambda} - \lambda^0\right) \overset{d}{\to} N\left[0,\; J(\lambda^0)^{-1} I(\lambda^0) J(\lambda^0)^{-1}\right]

If the density f(y_t | x_t, λ̂) were in fact the true density p(y | x_t, θ), then λ̂ would be the maximum likelihood estimator, and J(λ⁰)⁻¹I(λ⁰) would be an identity matrix, due to the information matrix equality. However, in the present case we assume that f(y_t | x_t, λ̂) is only an approximation to p(y | x_t, θ), so there is no cancellation.

Recall that J(λ⁰) ≡ plim ∂²s_n(λ⁰)/∂λ∂λ'. Comparing the definition of s_n(λ) with the definition of the moment condition in Equation 19.4.1, we see that

J(\lambda^0) = D_{\lambda'}\, m(\theta^0, \lambda^0).

As in Theorem 22,

I(\lambda^0) = \lim_{n\to\infty} E\left[ n\, \frac{\partial s_n(\lambda)}{\partial \lambda}\Big|_{\lambda^0} \frac{\partial s_n(\lambda)}{\partial \lambda'}\Big|_{\lambda^0} \right].

In this case, this is simply the asymptotic variance covariance matrix of the moment conditions, Ω. Now take a first order Taylor's series approximation to √n m_n(θ⁰, λ̂) about λ⁰:

\sqrt{n}\, \tilde{m}_n(\theta^0, \hat{\lambda}) = \sqrt{n}\, \tilde{m}_n(\theta^0, \lambda^0) + \sqrt{n}\, D_{\lambda'} \tilde{m}(\theta^0, \lambda^0)\left(\hat{\lambda} - \lambda^0\right) + o_p(1)

First consider √n m̃_n(θ⁰, λ⁰). It is straightforward but somewhat tedious to show that the asymptotic variance of this term is (1/H) I_∞(λ⁰).

Next consider the second term √n D_{λ'}m̃(θ⁰, λ⁰)(λ̂ − λ⁰). Note that D_{λ'}m̃_n(θ⁰, λ⁰) →a.s. J(λ⁰), so we have

\sqrt{n}\, D_{\lambda'} \tilde{m}(\theta^0, \lambda^0)\left(\hat{\lambda} - \lambda^0\right) = \sqrt{n}\, J(\lambda^0)\left(\hat{\lambda} - \lambda^0\right), \; a.s.

But noting equation 19.4.2,

\sqrt{n}\, J(\lambda^0)\left(\hat{\lambda} - \lambda^0\right) \overset{a}{\sim} N\left[0, I(\lambda^0)\right].

Now, combining the results for the first and second terms,

\sqrt{n}\, \tilde{m}_n(\theta^0, \hat{\lambda}) \overset{a}{\sim} N\left[0, \left(1 + \frac{1}{H}\right) I(\lambda^0)\right]

Suppose that Î(λ) is a consistent estimator of the asymptotic variance-covariance matrix of the moment conditions. This may be complicated if the score generator is a poor approximator, since the individual score contributions may not have mean zero in this case (see the section on QML). Even if this is the case, the individual means can be calculated by simulation, so it is always possible to consistently estimate I(λ⁰) when the model is simulable. On the other hand, if the score generator is taken to be correctly specified, the ordinary estimator of the information matrix is consistent. Combining this with the result on the efficient GMM weighting matrix in Theorem 25, we see that defining θ̂ as

\hat{\theta} = \arg\min_{\Theta} \; m_n(\theta, \hat{\lambda})' \left[\left(1 + \frac{1}{H}\right)\hat{I}(\hat{\lambda})\right]^{-1} m_n(\theta, \hat{\lambda})

is the GMM estimator with the efficient choice of weighting matrix.
• If one has used the Gallant-Nychka ML estimator as the auxiliary model, the appropriate weighting matrix is simply the information matrix of the auxiliary model, since the scores are uncorrelated. (E.g., it really is ML estimation asymptotically, since the score generator can approximate the unknown density arbitrarily well.)

19.4.2. Asymptotic distribution. Since we use the optimal weighting matrix, the asymptotic distribution is as in Equation 15.4.1, so we have (using the result in Equation 19.4.2):

\sqrt{n}\left(\hat{\theta} - \theta^0\right) \overset{d}{\to} N\left[0, \left(D_\infty \left[\left(1 + \frac{1}{H}\right) I(\lambda^0)\right]^{-1} D_\infty'\right)^{-1}\right]

where

D_\infty = \lim_{n\to\infty} E\left[ D_\theta\, m_n'(\theta^0, \lambda^0) \right].

This can be consistently estimated using

\hat{D} = D_\theta\, m_n'(\hat{\theta}, \hat{\lambda}).

19.4.3. Diagnostic testing. The fact that

\sqrt{n}\, m_n(\theta^0, \lambda^0) \overset{d}{\to} N\left[0, \left(1 + \frac{1}{H}\right) I(\lambda^0)\right]

implies that

n\, m_n(\hat{\theta}, \hat{\lambda})' \left[\left(1 + \frac{1}{H}\right)\hat{I}(\hat{\lambda})\right]^{-1} m_n(\hat{\theta}, \hat{\lambda}) \overset{d}{\to} \chi^2(q)

where q = dim(λ) − dim(θ), since without at least dim(θ) moment conditions the model is not identified, so testing is impossible. One test of the model is simply based on this statistic: if it exceeds the χ²(q) critical point, something may be wrong (the small sample performance of this sort of test would be a topic worth investigating).
• Information about what is wrong can be gotten from the pseudo-t-statistics:

\left(\mathrm{diag}\left[\left(1 + \frac{1}{H}\right)\hat{I}(\hat{\lambda})\right]\right)^{-1/2} \sqrt{n}\, m_n(\hat{\theta}, \hat{\lambda})

can be used to test which moments are not well modeled. Since these moments are related to parameters of the score generator, which are usually related to certain features of the model, this information can be used to revise the model. These aren't actually distributed as N(0,1), since √n m_n(θ⁰, λ̂) and √n m_n(θ̂, λ̂) have different distributions (that of √n m_n(θ̂, λ̂) is somewhat more complicated). It can be shown that the pseudo-t statistics are biased toward nonrejection. See Gourieroux et. al. or Gallant and Long, 1995, for more details.
19.5. Example: estimation of stochastic differential equations
It is often convenient to formulate theoretical models in terms of differential equations, and when the observation frequency is high (e.g., weekly, daily, hourly or real-time) it may be more natural to adopt this framework for econometric models of time series.

The most common approach to estimation of stochastic differential equations is to discretize the model, as above, and estimate using the discretized version. However, since the discretization is only an approximation to the true discrete-time version of the model (which is not calculable), the resulting estimator is in general biased and inconsistent.

An alternative is to use indirect inference: the discretized model is used as the score generator. That is, one estimates by QML to obtain the scores of the discretized approximation:

y_t = y_{t-1} + g(\phi, y_{t-1}) + h(\phi, y_{t-1})\, \varepsilon_t, \qquad \varepsilon_t \sim N(0, 1)

Indicate these scores by m_n(φ̂). Then the system of stochastic differential equations

dy_t = g(\theta, y_t)\, dt + h(\theta, y_t)\, dW_t

is simulated over θ, and the scores are calculated and averaged over the simulations

\tilde{m}_n(\theta, \hat{\phi}) = \frac{1}{H}\sum_{h=1}^H m_n^h(\theta, \hat{\phi})

θ̂ is chosen to set the simulated scores to zero

\tilde{m}_n(\hat{\theta}, \hat{\phi}) \equiv 0

(since θ and φ are of the same dimension).

This method requires simulating the stochastic differential equation. There are many ways of doing this. Basically, they involve doing very fine discretizations:

y_{t+\tau} = y_t + g(\theta, y_t)\, \tau + h(\theta, y_t)\, \sqrt{\tau}\, \nu_t, \qquad \nu_t \sim N(0, 1)

By setting the time step τ very small, the sequence of simulated increments approximates a Brownian motion fairly well.

This is only one method of using indirect inference for estimation of differential equations. There are others (see Gallant and Long, 1995 and Gourieroux et. al.). Use of a series approximation to the transitional density as in Gallant and Long is an interesting possibility since the score generator may have a higher dimensional parameter than the model, which allows for diagnostic testing. In the method described above the score generator's parameter φ is of the same dimension as is θ, so diagnostic testing is not possible.
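As an illustration of the fine discretization just described, here is a minimal Octave sketch of simulating a path of the SDE by the Euler scheme. The drift function g(theta, y) and diffusion function h(theta, y), like the function name itself, are hypothetical placeholders to be supplied by the user.

% euler_sim.m: simulate an SDE path by a fine Euler discretization
% theta: parameters; y0: initial condition; T: number of unit-length periods
% steps: fine sub-steps per period, so tau = 1/steps
function y = euler_sim(theta, y0, T, steps)
  tau = 1/steps;
  y = zeros(T, 1);
  ycurrent = y0;
  for t = 1:T
    for s = 1:steps
      nu = randn;   % standard normal shock
      ycurrent = ycurrent + g(theta, ycurrent)*tau + h(theta, ycurrent)*sqrt(tau)*nu;
    endfor
    y(t) = ycurrent;   % record the value at the end of each unit period
  endfor
endfunction

Each simulated path generated this way is then passed to the discretized model's score function, and the scores are averaged over the H simulations as above.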

CHAPTER 20

Parallel programming for econometrics


In this chapter we'll see how commonly used computations in econometrics can be done in parallel on a cluster of computers.

CHAPTER 21

Introduction to Octave
Why is Octave being used here, since its not that well-known by econometricians? Well, because it is a high quality environment that is easily extensible,
uses well-tested and high performance numerical libraries, it is licensed under
the GNU GPL, so you can get it for free and modify it if you like, and it runs
on both GNU/Linux, Mac OSX and Windows systems. Its also quite easy to
learn.
21.1. Getting started
Get the bootable CD, as was described in Section 1.3. Then burn the image, and boot your computer with it. This will give you this same PDF file, but with all of the example programs ready to run. The editor is configured with a macro to execute the programs using Octave, which is of course installed. From this point, I assume you are running the CD (or sitting in the computer room across the hall from my office), or that you have configured your computer to be able to run the *.m files mentioned below.
21.2. A short introduction
The objective of this introduction is to learn just the basics of Octave. There
are other ways to use Octave, which I encourage you to explore. These are just
some rudiments. After this, you can look at the example programs scattered
throughout the document (and edit them, and run them) to learn more about
how Octave can be used to do econometrics. Students of mine: your problem sets will include exercises that can be done by modifying the example programs in relatively minor ways. So study the examples!

Figure 21.2.1. Running an Octave program
Octave can be used interactively, or it can be used to run programs that are written using a text editor. We'll use this second method, preparing programs with NEdit, and calling Octave from within the editor. The program first.m gets us started. To run this, open it up with NEdit (by finding the correct file inside the /home/knoppix/Desktop/Econometrics folder and clicking on the icon) and then type CTRL-ALT-o, or use the Octave item in the Shell menu (see Figure 21.2.1).


Note that the output is not formatted in a pleasing way. That's because printf() doesn't automatically start a new line. Edit first.m so that the 8th line reads printf("hello world\n"); and re-run the program.
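The difference is easy to see interactively (a two-line sketch, not the actual contents of first.m):

printf("hello world");     % no newline: the next output continues on the same line
printf("hello world\n");   % the \n character ends the line, giving cleaner output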
We need to know how to load and save data. The program second.m shows how. Once you have run this, you will find the file x in the directory Econometrics/Include/OctaveIntro/. You might have a look at it with NEdit to see Octave's default format for saving data. Basically, if you have data in an ASCII text file, named for example myfile.data, formed of numbers separated by spaces, just use the command load myfile.data. After having done so, the matrix myfile (without extension) will contain the data.
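A minimal sketch of the round trip (the file name here is just an example, not one of the data sets distributed with the document):

x = rand(5, 3);             % some data to play with
save -ascii mydata.data x   % write it to disk as plain ASCII text
clear x
load mydata.data            % creates a matrix named "mydata" (extension is dropped)
disp(mydata)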
Please have a look at CommonOperations.m for examples of how to do some basic things in Octave. Now that we're done with the basics, have a look at the Octave programs that are included as examples. If you are looking at the browsable PDF version of this document, then you should be able to click on links to open them. If not, the example programs are available here and the support files needed to run these are available here. Those pages will allow you to examine individual files, out of context. To actually use these files (edit and run them), you should go to the home page of this document, since you will probably want to download the pdf version together with all the support files and examples. Or get the bootable CD.

There are some other resources for doing econometrics with Octave. You might like to check the article Econometrics with Octave and the Econometrics Toolbox, which is for Matlab, but much of which could be easily used with Octave.


21.3. If you're running a Linux installation...

Then to get the same behavior as found on the CD, you need to:
• Get the collection of support programs and the examples, from the document home page.
• Put them somewhere, and tell Octave how to find them, e.g., by putting a link to the MyOctaveFiles directory in /usr/local/share/octave/site-m
• Make sure nedit is installed and configured to run Octave and use syntax highlighting. Copy the file /home/econometrics/.nedit from the CD to do this. Or, get the file NeditConfiguration and save it in your $HOME directory with the name .nedit. Not to put too fine a point on it, please note that there is a period in that name.
• Associate *.m files with NEdit so that they open up in the editor when you click on them. That should do it.

CHAPTER 22

Notation and Review


All vectors will be column vectors, unless they have a transpose symbol (or I forget to apply this rule - your help catching typos and er0rors
vector,

) '
V"

vector. When I refer to a -vector, I mean a column vector.

R 

is a

is a


' )

is much appreciated). For example, if

22.1. Notation for differentiation of vectors and matrices


[3, Chapter 1]

Let s(·) : ℜ^K → ℜ be a real valued function of the K-vector θ. Then ∂s(θ)/∂θ is organized as a K-vector,

\frac{\partial s(\theta)}{\partial \theta} = \left[ \frac{\partial s(\theta)}{\partial \theta_1}, \; \frac{\partial s(\theta)}{\partial \theta_2}, \; \ldots, \; \frac{\partial s(\theta)}{\partial \theta_K} \right]'

Following this convention, ∂s(θ)/∂θ' is a 1 × K matrix, and ∂²s(θ)/∂θ∂θ' is a K × K matrix.

Exercise 33. For a and x both K-vectors, show that ∂a'x/∂x = a.

Let f(θ) : ℜ^K → ℜ^n be an n-vector valued function of the K-vector θ. Let f(θ)' be the 1 × n valued transpose of f. Then (∂f(θ)'/∂θ)' = ∂f(θ)/∂θ'.

Product rule: Let f(θ) : ℜ^K → ℜ^n and h(θ) : ℜ^K → ℜ^n be n-vector valued functions of the K-vector θ. Then

\frac{\partial}{\partial \theta'} \left[ h(\theta)' f(\theta) \right] = h' \frac{\partial f}{\partial \theta'} + f' \frac{\partial h}{\partial \theta'}

has dimension 1 × K. Applying the transposition rule we get

\frac{\partial}{\partial \theta} \left[ h(\theta)' f(\theta) \right] = \frac{\partial f'}{\partial \theta}\, h + \frac{\partial h'}{\partial \theta}\, f

which has dimension K × 1.

Exercise 34. For A a K × K matrix and x a K × 1 vector, show that ∂x'Ax/∂x = (A + A')x.

Chain rule: Let f(·) : ℜ^K → ℜ^n be an n-vector valued function of a K-vector argument, and let g() : ℜ^M → ℜ^K be a K-vector valued function of an M-vector valued argument ρ. Then

\frac{\partial}{\partial \rho'} f\left[g(\rho)\right] = \frac{\partial f(\theta)}{\partial \theta'}\Big|_{\theta = g(\rho)} \; \frac{\partial g(\rho)}{\partial \rho'}

has dimension n × M.

Exercise 35. For x and β both K × 1 vectors, show that ∂exp(x'β)/∂β = exp(x'β) x.
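As a quick illustration of these conventions, the rule in Exercise 33 can be checked numerically in a few lines of Octave (a self-contained sketch using crude forward differences; the numgradient function mentioned in the exercises at the end of this chapter can be used instead):

K = 4;
a = randn(K, 1);  x = randn(K, 1);
s = @(x) a'*x;                    % s(x) = a'x, a scalar-valued function of the K-vector x
h = 1e-6;  g = zeros(K, 1);
for i = 1:K
  e = zeros(K, 1);  e(i) = h;
  g(i) = (s(x + e) - s(x)) / h;   % forward-difference approximation to ds/dx_i
endfor
disp([g a])                       % the two columns should agree: ds/dx = a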

22.2. Convergence modes


Readings: [1, Chapter 4]; [4, Chapter 4].

We will consider several modes of convergence. The first three modes discussed are simply for background. The stochastic modes are those which will be used later in the course.

Definition 36. A sequence is a mapping from the natural numbers {1, 2, ...} = {n}_{n=1}^∞ = {n} to some other set, so that the set is ordered according to the natural numbers associated with its elements.

Real-valued sequences:

Definition 37. [Convergence] A real-valued sequence of vectors {a_n} converges to the vector a if for any ε > 0 there exists an integer N_ε such that for all n > N_ε, ||a_n − a|| < ε. a is the limit of a_n, written a_n → a.

Deterministic real-valued functions. Consider a sequence of functions {f_n(ω)} where f_n : Ω → T ⊆ ℜ. Ω may be an arbitrary set.

Definition 38. [Pointwise convergence] A sequence of functions {f_n(ω)} converges pointwise on Ω to the function f(ω) if for all ε > 0 and ω ∈ Ω there exists an integer N_{εω} such that |f_n(ω) − f(ω)| < ε for all n > N_{εω}.

It's important to note that N_{εω} depends upon ω, so that convergence may be much more rapid for certain ω than for others. Uniform convergence requires a similar rate of convergence throughout Ω.

Definition 39. [Uniform convergence] A sequence of functions {f_n(ω)} converges uniformly on Ω to the function f(ω) if for any ε > 0 there exists an integer N such that

\sup_{\omega \in \Omega} |f_n(\omega) - f(\omega)| < \varepsilon, \quad \text{for all } n > N.

(insert a diagram here showing the envelope around f(ω) in which f_n(ω) must lie)

Stochastic sequences. In econometrics, we typically deal with stochastic sequences. Given a probability space (Ω, F, P), recall that a random variable maps the sample space to the real line, i.e., X(ω) : Ω → ℜ. A sequence of random variables {X_n(ω)} is a collection of such mappings, i.e., each X_n(ω) is a random variable with respect to the probability space (Ω, F, P). For example, given the model y = Xβ⁰ + ε, the OLS estimator β̂_n = (X'X)⁻¹X'y, where n is the sample size, can be used to form a sequence of random vectors {β̂_n}.

A number of modes of convergence are in use when dealing with sequences of random variables. Several such modes of convergence should already be familiar:

Definition 40. [Convergence in probability] Let X_n(ω) be a sequence of random variables, and let X(ω) be a random variable. Let A_n(ε) = {ω : |X_n(ω) − X(ω)| > ε}. Then {X_n(ω)} converges in probability to X(ω) if lim_{n→∞} P(A_n(ε)) = 0 for any ε > 0. Convergence in probability is written as X_n →p X, or plim X_n = X.

Definition 41. [Almost sure convergence] Let X_n(ω) be a sequence of random variables, and let X(ω) be a random variable. Let A = {ω : lim_{n→∞} X_n(ω) = X(ω)}. Then {X_n(ω)} converges almost surely to X(ω) if P(A) = 1. In other words, X_n(ω) → X(ω) (ordinary convergence of the two functions) except on a set C = Ω − A such that P(C) = 0. Almost sure convergence is written as X_n →a.s. X, or X_n → X, a.s. One can show that almost sure convergence implies convergence in probability.

Definition 42. [Convergence in distribution] Let the r.v. X_n have distribution function F_n and the r.v. X have distribution function F. If F_n → F at every continuity point of F, then X_n converges in distribution to X. Convergence in distribution is written as X_n →d X. It can be shown that convergence in probability implies convergence in distribution.

Stochastic functions. Simple laws of large numbers (LLN's) allow us to directly conclude that β̂_n →a.s. β⁰ in the OLS example, since

\hat{\beta}_n = \beta^0 + \left(\frac{X'X}{n}\right)^{-1} \frac{X'\varepsilon}{n}

and X'ε/n →a.s. 0 by a SLLN. Note that this term is not a function of the parameter β. This easy proof is a result of the linearity of the model, which allows us to express the estimator in a way that separates parameters from random functions. In general, this is not possible. We often deal with the more complicated situation where the stochastic sequence depends on parameters in a manner that is not reducible to a simple sequence of random variables. In this case, we have a sequence of random functions that depend on θ: {X_n(ω, θ)}, where each X_n(ω, θ) is a random variable with respect to a probability space (Ω, F, P) and the parameter θ belongs to a parameter space Θ.

Definition 43. [Uniform almost sure convergence] {X_n(ω, θ)} converges uniformly almost surely in Θ to X(ω, θ) if

\lim_{n \to \infty} \sup_{\theta \in \Theta} \left\| X_n(\omega, \theta) - X(\omega, \theta) \right\| = 0 \quad (a.s.)

Implicit is the assumption that all X_n(ω, θ) and X(ω, θ) are random variables w.r.t. (Ω, F, P) for all θ ∈ Θ. We'll indicate uniform almost sure convergence by →u.a.s. and uniform convergence in probability by →u.p.

• An equivalent definition, based on the fact that "almost sure" means "with probability one", is

\Pr\left( \lim_{n \to \infty} \sup_{\theta \in \Theta} \left\| X_n(\omega, \theta) - X(\omega, \theta) \right\| = 0 \right) = 1

This has a form similar to that of the definition of a.s. convergence - the essential difference is the addition of the sup.

22.3. Rates of convergence and asymptotic equality


Its often useful to have notation for the relative magnitudes of quantities.
Quantities that are small relative to others can often be ignored, which simplies analysis.

be two real-valued functions.

1 
DH

is a nite constant.

means there exists some

xf

9f 

1
 

1
 

x
 9ff 


This denition doesnt require that

edly).

and

such that for

1


1
ED

where

be two real-valued functions.

1
H

means

D EFINITION 45. [Big-O] Let


The notation

The notation

and

8  9f

xf  f
1
DH
1
D

D EFINITION 44. [Little-o] Let

have a limit (it may uctuate bound-

22.3. RATES OF CONVERGENCE AND ASYMPTOTIC EQUALITY

are sequences of random variables analogous denitions

1


D EFINITION 46. The notation

)
tcT
)
xT  G R I X R

 (


 GVPpgdp R I X R  R I X R  d

x
8 T 9ff 
1 
HT 

are

and

f
g

If

442

means

E XAMPLE 47. The least squares estimator

8 ) P p
gtcT gd  d

8
G R I Q R gd

P p

and

I
y  y

Since plim

we can write

Asymptotically, the term

is negligible. This is just a

way of indicating that the LS estimator is consistent.

1
 D

 1

H
1
G)
D
1

8Gj) f
f
)
t f
 a
S

such that

)
gtcT

then

since, given


G

E XAMPLE 49. If

is a nite constant.

always some

S
d

where

and all

1 
EHT

such that for

means there exists some

D EFINITION 48. The notation

there is

Useful rules:

( T T  ( T T T 
1
1
1

( T $T T ( T S T T 
1 S
1
1 S

E XAMPLE 50. Consider a random sample of iid r.v.s with mean 0 and vari

 )
gtcT S d eI 1 D! jk d eI 1
8
p
p
B I
C fB 1 )  d

. The estimator of the mean

distributed, e.g.,

)
gxT  d

we had

is asymptotically normally

So

so

8 p
g eI 1$T S d

ance

Before

now we have have the stronger result that relates the rate of

convergence to the sample size.

443

E XAMPLE 51. Now consider a random sample of iid r.v.s with mean


gt)cT S h Qpd p 1 g  h
8
eI
I
BC fB 1 )  d

. The estimator of the mean

8) S
gxT T d

p 1
Qspd
eI

and variance

normally distributed, e.g.,


so

22.3. RATES OF CONVERGENCE AND ASYMPTOTIC EQUALITY

is asymptotically

So

so

 p 1
gg eI T

S
T

pd

These two examples show that averages of centered (mean zero) quantities typically have plim 0, while averages of uncentered quantities have nite

1
D

does not mean that

and

are of the same order. Asymptotic equality ensures that this is the case.

1
) DH
 D 3 
1

vious way.

and
S

f
4  f

Finally, analogous almost sure versions of

if

7f

asymptotically equal (written

and

f
g}

D EFINITION 52. Two sequences of random variables

1
D

nonzero plims. Note that the denition of

are

are dened in the ob-


Exercises
(1) For a and x both K × 1 vectors, show that ∂a'x/∂x = a.
(2) For A a K × K matrix and x a K × 1 vector, show that ∂x'Ax/∂x = (A + A')x.
(3) For x and β both K × 1 vectors, show that ∂exp(x'β)/∂β = exp(x'β) x.
(4) For x and β both K × 1 vectors, find the analytic expression for ∂²exp(x'β)/∂β∂β'.
(5) Write an Octave program that verifies each of the previous results by taking numeric derivatives. For a hint, type help numgradient and help numhessian inside octave.

CHAPTER 23

The GPL
This document and the associated examples and materials are copyright
Michael Creel, under the terms of the GNU General Public License. This license follows:
GNU GENERAL PUBLIC LICENSE Version 2, June 1991
Copyright (C) 1989, 1991 Free Software Foundation, Inc. 59 Temple Place,
Suite 330, Boston, MA 02111-1307 USA Everyone is permitted to copy and
distribute verbatim copies of this license document, but changing it is not allowed.
Preamble
The licenses for most software are designed to take away your freedom to
share and change it. By contrast, the GNU General Public License is intended
to guarantee your freedom to share and change free software--to make sure the
software is free for all its users. This General Public License applies to most
of the Free Software Foundation's software and to any other program whose
authors commit to using it. (Some other Free Software Foundation software is
covered by the GNU Library General Public License instead.) You can apply it
to your programs, too.
When we speak of free software, we are referring to freedom, not price. Our
General Public Licenses are designed to make sure that you have the freedom
to distribute copies of free software (and charge for this service if you wish),
that you receive source code or can get it if you want it, that you can change

the software or use pieces of it in new free programs; and that you know you
can do these things.
To protect your rights, we need to make restrictions that forbid anyone to
deny you these rights or to ask you to surrender the rights. These restrictions
translate to certain responsibilities for you if you distribute copies of the software, or if you modify it.
For example, if you distribute copies of such a program, whether gratis or
for a fee, you must give the recipients all the rights that you have. You must
make sure that they, too, receive or can get the source code. And you must
show them these terms so they know their rights.
We protect your rights with two steps: (1) copyright the software, and
(2) offer you this license which gives you legal permission to copy, distribute
and/or modify the software.
Also, for each author's protection and ours, we want to make certain that
everyone understands that there is no warranty for this free software. If the
software is modied by someone else and passed on, we want its recipients to
know that what they have is not the original, so that any problems introduced
by others will not reflect on the original authors' reputations.
Finally, any free program is threatened constantly by software patents. We
wish to avoid the danger that redistributors of a free program will individually
obtain patent licenses, in effect making the program proprietary. To prevent
this, we have made it clear that any patent must be licensed for everyone's
free use or not licensed at all.
The precise terms and conditions for copying, distribution and modication follow.


GNU GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
0. This License applies to any program or other work which contains a
notice placed by the copyright holder saying it may be distributed under the
terms of this General Public License. The "Program", below, refers to any such
program or work, and a "work based on the Program" means either the Program or any derivative work under copyright law: that is to say, a work containing the Program or a portion of it, either verbatim or with modications
and/or translated into another language. (Hereinafter, translation is included
without limitation in the term "modication".) Each licensee is addressed as
"you".
Activities other than copying, distribution and modication are not covered by this License; they are outside its scope. The act of running the Program is not restricted, and the output from the Program is covered only if its
contents constitute a work based on the Program (independent of having been
made by running the Program). Whether that is true depends on what the
Program does.
1. You may copy and distribute verbatim copies of the Program's source
code as you receive it, in any medium, provided that you conspicuously and
appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to
the absence of any warranty; and give any other recipients of the Program a
copy of this License along with the Program.
You may charge a fee for the physical act of transferring a copy, and you
may at your option offer warranty protection in exchange for a fee.


2. You may modify your copy or copies of the Program or any portion of
it, thus forming a work based on the Program, and copy and distribute such
modications or work under the terms of Section 1 above, provided that you
also meet all of these conditions:
a) You must cause the modied les to carry prominent notices stating that
you changed the les and the date of any change.
b) You must cause any work that you distribute or publish, that in whole
or in part contains or is derived from the Program or any part thereof, to be
licensed as a whole at no charge to all third parties under the terms of this
License.
c) If the modied program normally reads commands interactively when
run, you must cause it, when started running for such interactive use in the
most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying
that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License.
(Exception: if the Program itself is interactive but does not normally print such
an announcement, your work based on the Program is not required to print an
announcement.)
These requirements apply to the modied work as a whole. If identiable
sections of that work are not derived from the Program, and can be reasonably
considered independent and separate works in themselves, then this License,
and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole
which is a work based on the Program, the distribution of the whole must be


on the terms of this License, whose permissions for other licensees extend to
the entire whole, and thus to each and every part regardless of who wrote it.
Thus, it is not the intent of this section to claim rights or contest your rights
to work written entirely by you; rather, the intent is to exercise the right to
control the distribution of derivative or collective works based on the Program.
In addition, mere aggregation of another work not based on the Program
with the Program (or with a work based on the Program) on a volume of a
storage or distribution medium does not bring the other work under the scope
of this License.
3. You may copy and distribute the Program (or a work based on it, under
Section 2) in object code or executable form under the terms of Sections 1 and
2 above provided that you also do one of the following:
a) Accompany it with the complete corresponding machine-readable source
code, which must be distributed under the terms of Sections 1 and 2 above on
a medium customarily used for software interchange; or,
b) Accompany it with a written offer, valid for at least three years, to give
any third party, for a charge no more than your cost of physically performing
source distribution, a complete machine-readable copy of the corresponding
source code, to be distributed under the terms of Sections 1 and 2 above on a
medium customarily used for software interchange; or,
c) Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code
or executable form with such an offer, in accord with Subsection b above.)
The source code for a work means the preferred form of the work for making modications to it. For an executable work, complete source code means


all the source code for all modules it contains, plus any associated interface
denition les, plus the scripts used to control compilation and installation of
the executable. However, as a special exception, the source code distributed
need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the
operating system on which the executable runs, unless that component itself
accompanies the executable.
If distribution of executable or object code is made by offering access to
copy from a designated place, then offering equivalent access to copy the
source code from the same place counts as distribution of the source code,
even though third parties are not compelled to copy the source along with the
object code.
4. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy,
modify, sublicense or distribute the Program is void, and will automatically
terminate your rights under this License. However, parties who have received
copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance.
5. You are not required to accept this License, since you have not signed
it. However, nothing else grants you permission to modify or distribute the
Program or its derivative works. These actions are prohibited by law if you do
not accept this License. Therefore, by modifying or distributing the Program
(or any work based on the Program), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or
modifying the Program or works based on it.


6. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor
to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients exercise
of the rights granted herein. You are not responsible for enforcing compliance
by third parties to this License.
7. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict
the conditions of this License, they do not excuse you from the conditions of
this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Program at all. For example, if a patent
license would not permit royalty-free redistribution of the Program by all those
who receive copies directly or indirectly through you, then the only way you
could satisfy both it and this License would be to refrain entirely from distribution of the Program.
If any portion of this section is held invalid or unenforceable under any
particular circumstance, the balance of the section is intended to apply and the
section as a whole is intended to apply in other circumstances.
It is not the purpose of this section to induce you to infringe any patents or
other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system, which is implemented by public license practices. Many people
have made generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that system; it is


up to the author/donor to decide if he or she is willing to distribute software


through any other system and a licensee cannot impose that choice.
This section is intended to make thoroughly clear what is believed to be a
consequence of the rest of this License. 8. If the distribution and/or use of the
Program is restricted in certain countries either by patents or by copyrighted
interfaces, the original copyright holder who places the Program under this License may add an explicit geographical distribution limitation excluding those
countries, so that distribution is permitted only in or among countries not thus
excluded. In such case, this License incorporates the limitation as if written in
the body of this License.
9. The Free Software Foundation may publish revised and/or new versions
of the General Public License from time to time. Such new versions will be
similar in spirit to the present version, but may differ in detail to address new
problems or concerns.
Each version is given a distinguishing version number. If the Program
species a version number of this License which applies to it and "any later
version", you have the option of following the terms and conditions either of
that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of this License, you
may choose any version ever published by the Free Software Foundation.
10. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask
for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions
for this. Our decision will be guided by the two goals of preserving the free


status of all derivatives of our free software and of promoting the sharing and
reuse of software generally.
NO WARRANTY
11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE
IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED
BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING
THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE
PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE
DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
REPAIR OR CORRECTION.
12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED
TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY
WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED
INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR
A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED
OF THE POSSIBILITY OF SUCH DAMAGES.
END OF TERMS AND CONDITIONS


How to Apply These Terms to Your New Programs


If you develop a new program, and you want it to be of the greatest possible
use to the public, the best way to achieve this is to make it free software which
everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest to attach
them to the start of each source le to most effectively convey the exclusion of
warranty; and each le should have at least the "copyright" line and a pointer
to where the full notice is found.
<one line to give the program's name and a brief idea of what it does.>
Copyright (C) 19yy <name of author>
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your option) any
later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public
License for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
Also add information on how to contact you by electronic and paper mail.
If the program is interactive, make it output a short notice like this when it
starts in an interactive mode:
Gnomovision version 69, Copyright (C) 19yy name of author Gnomovision
comes with ABSOLUTELY NO WARRANTY; for details type show w. This is


free software, and you are welcome to redistribute it under certain conditions;
type show c for details.
The hypothetical commands show w and show c should show the appropriate parts of the General Public License. Of course, the commands you
use may be called something other than show w and show c; they could
even be mouse-clicks or menu itemswhatever suits your program.
You should also get your employer (if you work as a programmer) or your
school, if any, to sign a "copyright disclaimer" for the program, if necessary.
Here is a sample; alter the names:
Yoyodyne, Inc., hereby disclaims all copyright interest in the program Gnomovision (which makes passes at compilers) written by James Hacker.
<signature of Ty Coon>, 1 April 1989 Ty Coon, President of Vice
This General Public License does not permit incorporating your program
into proprietary programs. If your program is a subroutine library, you may
consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Library General Public
License instead of this License.

CHAPTER 24

The attic

This holds material that is not really ready to be incorporated into the main body, but that I don't want to lose. Basically, ignore it, unless you'd like to help get it ready for inclusion.

The GMM estimator, briefly

The OLS estimator can be thought of as a method of moments estimator. With weak exogeneity, E(x_t ε_t) = 0, so E[x_t (y_t − x_t'β)] = 0. The idea of the MM estimator is to choose the estimator to make the sample counterpart hold:

\frac{1}{n}\sum_{t=1}^n x_t \left(y_t - x_t'\hat{\beta}\right) = \frac{1}{n} X'\left(y - X\hat{\beta}\right) = 0 \quad \Rightarrow \quad \hat{\beta} = (X'X)^{-1} X'y

This means of deriving the formula requires no calculus. It provides another interpretation of how the OLS estimator is defined.

We can perhaps think of other variables that are not correlated with ε, say w. This may be needed if the weak exogeneity assumption fails for x. Let us assume that we have instruments w_t that satisfy E(w_t ε_t) = 0. If the dimension of w_t is greater than that of β, then we have more un...

24.1. MEPS data: more on count models


Note to self: this chapter is yet to be converted to use Octave.

To check the plausibility of the Poisson model, we can compare the sample unconditional variance with the estimated unconditional variance according to the Poisson model: \hat{V}(y) = \frac{1}{n}\sum_{t=1}^n \hat{\lambda}_t. For OBDV and ERV, we get

Table 1. Marginal Variances, Sample and Estimated (Poisson)
             OBDV     ERV
Sample       37.446   0.30614
Estimated    3.4540   0.19060

We see that even after conditioning, the overdispersion is not captured in either case. There is a huge problem with OBDV, and a significant problem with ERV. In both cases the Poisson model does not appear to be plausible.

24.1.1. Infinite mixture models. Reference: Cameron and Trivedi (1998) Regression analysis of count data, chapter 4.

The two measures seem to exhibit extra-Poisson variation. To capture unobserved heterogeneity, a possibility is the random parameters approach. Consider the possibility that the constant term in a Poisson model were random:

f_Y(y|x, \varepsilon) = \frac{\exp(-\theta)\,\theta^y}{y!}, \qquad \theta = \exp(x'\beta + \varepsilon) = \exp(x'\beta)\exp(\varepsilon) = \lambda\,\nu

where λ = exp(x'β) and ν = exp(ε). Here ν captures the randomness in the constant. The problem is that we don't observe ν, so we will need to marginalize it to get a usable density

f_Y(y|x) = \int \frac{\exp(-\theta)\,\theta^y}{y!}\, f_\nu(z)\, dz

This density can be used directly, perhaps using numerical integration to evaluate the likelihood function. In some cases, though, the integral will have an analytic solution. For example, if ν follows a certain one parameter gamma density, then

(24.1.1)   f_Y(y|x, \phi) = \frac{\Gamma(y + \psi)}{\Gamma(y + 1)\,\Gamma(\psi)} \left(\frac{\psi}{\psi + \lambda}\right)^{\psi} \left(\frac{\lambda}{\psi + \lambda}\right)^{y}

where φ = (λ, ψ). ψ appears since it is the parameter of the gamma density.
• For this density, E(y|x) = λ, which we have parameterized λ = exp(x'β).
• The variance depends upon how ψ is parameterized.
  – If ψ = λ/α, where α > 0, then V(y|x) = λ + αλ. Note that λ is a function of x, so that the variance is too. This is referred to as the NB-I model.
  – If ψ = 1/α, where α > 0, then V(y|x) = λ + αλ². This is referred to as the NB-II model.
So both forms of the NB model allow for overdispersion, with the NB-II model allowing for a more radical form.
• Testing reduction of a NB model to a Poisson model cannot be done by testing α = 0 using standard Wald or LR procedures. The critical values need to be adjusted to account for the fact that α = 0 is on the boundary of the parameter space. Without getting into details, suppose that the data were in fact Poisson, so there is equidispersion and the true α = 0. Then about half the time the sample data will be underdispersed, and about half the time overdispersed. When the data is underdispersed, the MLE of α will be α̂ = 0. Thus, under the null, there will be a probability spike in the asymptotic distribution of √n α̂ at 0, so standard testing methods will not be valid.
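For concreteness, here is a minimal Octave sketch of the log of the density (24.1.1) for a single observation under the NB-II parameterization ψ = 1/α. The function is a hypothetical helper (not one of the distributed example programs); it uses Octave's gammaln for the log-gamma terms.

% nb2_logdensity.m: log-density of one count observation under the NB-II model
% y: count; x: 1 x k row of regressors; beta: k x 1; alpha > 0 (overdispersion)
function logf = nb2_logdensity(y, x, beta, alpha)
  lambda = exp(x*beta);      % conditional mean, E(y|x)
  psi = 1/alpha;             % NB-II: V(y|x) = lambda + alpha*lambda^2
  logf = gammaln(y + psi) - gammaln(y + 1) - gammaln(psi) ...
         + psi*log(psi/(psi + lambda)) + y*log(lambda/(psi + lambda));
endfunction

Summing these contributions over the sample gives the log-likelihood maximized in the negbin results below; the NB-I case only changes the line defining psi to psi = lambda/alpha.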

Here are NB-I estimation results for OBDV, obtained using this estimation program.

MEPS data, OBDV
negbin results
Strong convergence
Observations = 500
Function value        -2.2656

t-Stats
              params       t(OPG)      t(Sand.)    t(Hess)
constant     -0.055766    -0.16793    -0.17418    -0.17215
pub_ins       0.47936      2.9406      2.8296      2.9122
priv_ins      0.20673      1.3847      1.4201      1.4086
sex           0.34916      3.2466      3.4148      3.3434
age           0.015116     3.3569      3.8055      3.5974
educ          0.014637     0.78661     0.67910     0.73757
inc           0.012581     0.60022     0.93782     0.76330
ln_alpha      1.7389      23.669      11.295      16.660

Information Criteria
Consistent Akaike     2323.3
Schwartz              2315.3
Hannan-Quinn          2294.8
Akaike                2281.6

Here are NB-II results for OBDV

*********************************************************************
MEPS data, OBDV
negbin results
Strong convergence
Observations = 500
Function value        -2.2616

t-Stats
              params       t(OPG)      t(Sand.)    t(Hess)
constant     -0.65981     -1.8913     -1.4717     -1.6977
pub_ins       0.68928      2.9991      3.1825      3.1436
priv_ins      0.22171      1.1515      1.2057      1.1917
sex           0.44610      3.8752      2.9768      3.5164
age           0.024221     3.8193      4.5236      4.3239
educ          0.020608     0.94844     0.74627     0.86004
inc           0.020040     0.87374     0.72569     0.86579
ln_alpha      0.47421      5.6622      4.6278      5.6281

Information Criteria
Consistent Akaike     2319.3
Schwartz              2311.3
Hannan-Quinn          2290.8
Akaike                2277.6

*********************************************************************

• For the OBDV model, the NB-II model does a better job, in terms of the average log-likelihood and the information criteria.
• Note that both versions of the NB model fit much better than does the Poisson model.
• The t-statistics are now similar for all three ways of calculating them, which might indicate that the serious specification problems of the Poisson model for the OBDV data are partially solved by moving to the NB model.
• The estimated α is highly significant.

To check the plausibility of the NB-II model, we can compare the sample unconditional variance with the estimated unconditional variance according to the NB-II model: \hat{V}(y) = \frac{1}{n}\sum_{t=1}^n \left(\hat{\lambda}_t + \hat{\alpha}\hat{\lambda}_t^2\right). For OBDV and ERV (estimation results not reported), we get

Table 2. Marginal Variances, Sample and Estimated (NB-II)
             OBDV     ERV
Sample       37.446   0.30614
Estimated    26.962   0.27620

The overdispersion problem is significantly better than in the Poisson case, but there is still some overdispersion that is not captured, for both OBDV and ERV.

24.2. Hurdle models


Returning to the Poisson model, let's look at actual and fitted count probabilities. Actual relative frequencies are p(y = j) = \frac{1}{n}\sum_t 1(y_t = j) and fitted frequencies are \hat{p}(y = j) = \frac{1}{n}\sum_{t=1}^n f_Y(j | x_t, \hat{\theta}).

Table 3. Actual and Poisson fitted frequencies
             OBDV               ERV
Count    Actual   Fitted    Actual   Fitted
0        0.32     0.06      0.86     0.83
1        0.18     0.15      0.10     0.14
2        0.11     0.19      0.02     0.02
3        0.10     0.18      0.004    0.002
4        0.052    0.15      0.002    0.0002
5        0.032    0.10      0        2.4e-5

We see that for the OBDV measure, there are many more actual zeros than predicted. For ERV, there are somewhat more actual zeros than fitted, but the difference is not too important.
Why might OBDV not fit the zeros well? What if, when people made the decision to contact the doctor for a first visit, they are sick, and then the doctor decides on whether or not follow-up visits are needed? This is a principal/agent type situation, where the total number of visits depends upon the decision of both the patient and the doctor. Since different parameters may govern the two decision-makers' choices, we might expect that different parameters govern the probability of zeros versus the other counts. Let λ_p be the parameters of the patient's demand for visits, and let λ_d be the parameter of the doctor's demand for visits. The patient will initiate visits according to a discrete choice model, for example, a logit model:

\Pr(y = 0) = f_Y(0) = \frac{1}{1 + \exp(\lambda_p)}, \qquad \Pr(y > 0) = \frac{\exp(\lambda_p)}{1 + \exp(\lambda_p)}

The above probabilities are used to estimate the binary 0/1 hurdle process. Then, for the observations where visits are positive, a truncated Poisson density is estimated. This density is

f_Y(y \mid y > 0) = \frac{f_Y(y)}{\Pr(y > 0)} = \frac{\exp(-\lambda_d)\,\lambda_d^y}{y!\left[1 - \exp(-\lambda_d)\right]}

since according to the Poisson model with the doctor's parameters, Pr(y = 0) = exp(−λ_d).

Since the hurdle and truncated components of the overall density for Y share no parameters, they may be estimated separately, which is computationally more efficient than estimating the overall model. (Recall that the BFGS algorithm, for example, will have to invert the approximated Hessian. The computational overhead is of order K², where K is the number of parameters to be estimated.) The expectation of Y is

E(Y|x) = \Pr(y > 0 | x)\, E(y \mid y > 0, x) = \frac{\exp(\lambda_p)}{1 + \exp(\lambda_p)} \cdot \frac{\lambda_d}{1 - \exp(-\lambda_d)}
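To illustrate the two-part structure, here is a minimal Octave sketch of one observation's contribution to the two log-likelihoods. It is a hypothetical helper (the name and interface are illustrative): lambda_p and lambda_d are the patient's logit index and the doctor's Poisson mean, parameterized as above, and the two parts can be maximized separately since they share no parameters.

% hurdle_contrib.m: log-likelihood contributions of one observation
% y: observed count; lambda_p: logit index of the hurdle; lambda_d: Poisson mean
function [ll_logit, ll_trunc] = hurdle_contrib(y, lambda_p, lambda_d)
  p0 = 1/(1 + exp(lambda_p));       % Pr(y = 0) from the logit hurdle
  if y == 0
    ll_logit = log(p0);
    ll_trunc = 0;                   % zeros do not enter the truncated Poisson part
  else
    ll_logit = log(1 - p0);
    % truncated Poisson density f_Y(y)/Pr(y > 0), y = 1, 2, ...
    ll_trunc = -lambda_d + y*log(lambda_d) - gammaln(y + 1) - log(1 - exp(-lambda_d));
  endif
endfunction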

Here are hurdle Poisson estimation results for OBDV, obtained from this estimation program

*********************************************************************
MEPS data, OBDV
logit results
Strong convergence
Observations = 500
Function value        -0.58939

t-Stats
              params       t(OPG)      t(Sand.)    t(Hess)
constant     -1.5502      -2.5709     -2.5269     -2.5560
pub_ins       1.0519       3.0520      3.0027      3.0384
priv_ins      0.45867      1.7289      1.6924      1.7166
sex           0.63570      3.0873      3.1677      3.1366
age           0.018614     2.1547      2.1969      2.1807
educ          0.039606     1.0467      0.98710     1.0222
inc           0.077446     1.7655      2.1672      1.9601

Information Criteria
Consistent Akaike     639.89
Schwartz              632.89
Hannan-Quinn          614.96
Akaike                603.39

*********************************************************************


The results for the truncated part:

*********************************************************************
MEPS data, OBDV
tpoisson results
Strong convergence
Observations = 500
Function value        -2.7042

t-Stats
              params        t(OPG)      t(Sand.)    t(Hess)
constant      0.54254       7.4291      1.1747      3.2323
pub_ins       0.31001       6.5708      1.7573      3.7183
priv_ins      0.014382      0.29433     0.10438     0.18112
sex           0.19075      10.293       1.1890      3.6942
age           0.016683     16.148       3.5262      7.9814
educ          0.016286      4.2144      0.56547     1.6353
inc          -0.0079016    -2.3186     -0.35309    -0.96078

Information Criteria
Consistent Akaike     2754.7
Schwartz              2747.7
Hannan-Quinn          2729.8
Akaike                2718.2

*********************************************************************


Fitted and actual probabilities (NB-II fits are provided as well) are:


TABLE 4. Actual and Hurdle Poisson fitted frequencies

                     OBDV                               ERV
Count   Actual  Fitted HP  Fitted NB-II     Actual  Fitted HP  Fitted NB-II
  0      0.32     0.32        0.34           0.86     0.86        0.86
  1      0.18     0.035       0.16           0.10     0.10        0.10
  2      0.11     0.071       0.11           0.02     0.02        0.02
  3      0.10     0.10        0.08           0.004    0.006       0.006
  4      0.052    0.11        0.06           0.002    0.002       0.002
  5      0.032    0.10        0.05           0        0.0005      0.001

For the Hurdle Poisson models, the ERV fit is very accurate. The OBDV fit is not so good. Zeros are exact, but 1's and 2's are underestimated, and higher counts are overestimated. For the NB-II fits, performance is at least as good as the hurdle Poisson model, and one should recall that many fewer parameters are used. Hurdle versions of the negative binomial model are also widely used.

24.2.1. Finite mixture models. The finite mixture approach to fitting health care demand was introduced by Deb and Trivedi (1997). The mixture approach has the intuitive appeal of allowing for subgroups of the population with different health status. If individuals are classified as healthy or unhealthy then two subgroups are defined. A finer classification scheme would lead to more subgroups. Many studies have incorporated objective and/or subjective indicators of health status in an effort to capture this heterogeneity. The available objective measures, such as limitations on activity, are not necessarily very informative about a person's overall health status. Subjective, self-reported measures may suffer from the same problem, and may also not be exogenous.


Finite mixture models are conceptually simple. The density is
\[
f_Y(y, \phi_1, \ldots, \phi_p, \pi_1, \ldots, \pi_{p-1})
  = \sum_{i=1}^{p-1} \pi_i f_Y^{(i)}(y, \phi_i) + \pi_p f_Y^{(p)}(y, \phi_p),
\]
where $\pi_i > 0$, $i = 1, 2, \ldots, p$, and $\pi_p = 1 - \sum_{i=1}^{p-1}\pi_i$. Identification requires that the $\pi_i$ are ordered in some way, for example, $\pi_1 \geq \pi_2 \geq \cdots \geq \pi_p$ and $\phi_i \neq \phi_j$, $i \neq j$. This is simple to accomplish post-estimation by rearrangement and possible elimination of redundant component densities.

The properties of the mixture density follow in a straightforward way from those of the components. In particular, the moment generating function is the same mixture of the moment generating functions of the component densities, so, for example, $E(Y \mid x) = \sum_{i=1}^{p} \pi_i \mu_i(x)$, where $\mu_i(x)$ is the mean of the $i$-th component density.
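To make this concrete, here is a tiny Octave sketch that evaluates a two-component mixture density and its mean. Poisson components are used only to keep the example short (the estimates below use NB-I components), and the parameter values are made up.

% Two-component Poisson mixture, evaluated at the counts 0..5.
lambda = [1.5; 6.0];          % component means (hypothetical values)
pi1 = 0.7;                    % mixing probability on the first component
y  = (0:5)';
f1 = exp(-lambda(1)) * lambda(1) .^ y ./ factorial(y);
f2 = exp(-lambda(2)) * lambda(2) .^ y ./ factorial(y);
fmix = pi1 * f1 + (1 - pi1) * f2              % mixture probabilities
Ey = pi1 * lambda(1) + (1 - pi1) * lambda(2)  % E(Y): the same mixture of the means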

Mixture densities may suffer from overparameterization, since the total number of parameters grows rapidly with the number of component densities. It is possible to constrain parameters across the mixtures.

Testing for the number of component densities is a tricky issue. For example, testing for $p = 1$ (a single component, which is to say, no mixture) versus $p = 2$ (a mixture of two components) involves the restriction $\pi_1 = 1$, which is on the boundary of the parameter space. Note that when $\pi_1 = 1$, the parameters of the second component can take on any value without affecting the density. Usual methods such as the likelihood ratio test are not applicable when parameters are on the boundary under the null hypothesis. Information criteria, as a means of choosing the model (see below), are valid.


The following are results for a mixture of 2 negative binomial (NB-I) models, for the OBDV data, which you can replicate using this estimation program:


*********************************************************************
MEPS data, OBDV
mixnegbin results
Strong convergence
Observations = 500
Function value    -2.2312

t-Stats
                  params      t(OPG)    t(Sand.)     t(Hess)
constant          0.64852     1.3851      1.3226      1.4358
pub_ins          -0.062139   -0.23188    -0.13802    -0.18729
priv_ins          0.093396    0.46948     0.33046     0.40854
sex               0.39785     2.6121      2.2148      2.4882
age               0.015969    2.5173      2.5475      2.7151
educ             -0.049175   -1.8013     -1.7061     -1.8036
inc               0.015880    0.58386     0.76782     0.73281
ln_alpha          0.69961     2.3456      2.0396      2.4029
constant         -3.6130     -1.6126     -1.7365     -1.8411
pub_ins           2.3456      1.7527      3.7677      2.6519
priv_ins          0.77431     0.73854     1.1366      0.97338
sex               0.34886     0.80035     0.74016     0.81892
age               0.021425    1.1354      1.3032      1.3387
educ              0.22461     2.0922      1.7826      2.1470
inc               0.019227    0.20453     0.40854     0.36313
ln_alpha          2.8419      6.2497      6.8702      7.6182
logit_inv_mix     0.85186     1.7096      1.4827      1.7883

Information Criteria
Consistent Akaike    2353.8
Schwartz             2336.8
Hannan-Quinn         2293.3
Akaike               2265.2
*********************************************************************
Delta method for mix parameter st. err.
        mix      se_mix
    0.70096     0.12043

The 95% confidence interval for the mix parameter is perilously close to 1, which suggests that there may really be only one component density, rather than a mixture. Again, this is not the way to test this; it is merely suggestive.

Education is interesting. For the subpopulation that is healthy, i.e., that makes relatively few visits, education seems to have a positive effect on visits. For the unhealthy group, education has a negative effect on visits. The other results are more mixed. A larger sample could help clarify things.
The following are results for a 2 component constrained mixture negative binomial model, where all the slope parameters in $\lambda_j = \exp(x'\beta_j)$ are the same across the two components. The constants and the overdispersion parameters are allowed to differ for the two components.


*********************************************************************
MEPS data, OBDV
cmixnegbin results
Strong convergence
Observations = 500
Function value    -2.2441

t-Stats
                  params      t(OPG)    t(Sand.)     t(Hess)
constant         -0.34153    -0.94203    -0.91456    -0.97943
pub_ins           0.45320     2.6206      2.5088      2.7067
priv_ins          0.20663     1.4258      1.3105      1.3895
sex               0.37714     3.1948      3.4929      3.5319
age               0.015822    3.1212      3.7806      3.7042
educ              0.011784    0.65887     0.50362     0.58331
inc               0.014088    0.69088     0.96831     0.83408
ln_alpha          1.1798      4.6140      7.2462      6.4293
const_2           1.2621      0.47525     2.5219      1.5060
lnalpha_2         2.7769      1.5539      6.4918      4.2243
logit_inv_mix     2.4888      0.60073     3.7224      1.9693

Information Criteria
Consistent Akaike    2323.5
Schwartz             2312.5
Hannan-Quinn         2284.3
Akaike               2266.1
*********************************************************************
Delta method for mix parameter st. err.
        mix      se_mix
    0.92335    0.047318

Now the mixture parameter is even closer to 1.


The slope parameter estimates are pretty close to what we got with the
NB-I model.




24.2.2. Comparing models using information criteria. A Poisson model can't be tested (using standard methods) as a restriction of a negative binomial model. Testing for collapse of a finite mixture to a mixture of fewer components has the same problem. How can we determine which of competing models is the best?

The information criteria approach is one possibility. Information criteria are functions of the log-likelihood, with a penalty for the number of parameters used. Three popular information criteria are the Akaike (AIC), Bayes (BIC) and consistent Akaike (CAIC). The formulae are
\begin{align*}
CAIC &= -2\ln L(\hat{\theta}) + k(\ln n + 1) \\
BIC  &= -2\ln L(\hat{\theta}) + k\ln n \\
AIC  &= -2\ln L(\hat{\theta}) + 2k
\end{align*}
where $k$ is the number of parameters and $n$ is the sample size. It can be shown that the CAIC and BIC will select the correctly specified model from a group of models, asymptotically. This doesn't mean, of course, that the correct model is necessarily in the group.


The AIC is not consistent, and will asymptotically favor an over-parameterized model over the correctly specified model. Here are information criteria values for the models we've seen, for OBDV.

TABLE 5. Information Criteria, OBDV

Model              AIC     BIC    CAIC
Poisson           3822    3911    3918
NB-I              2282    2315    2323
Hurdle Poisson    3333    3381    3395
MNB-I             2265    2337    2354
CMNB-I            2266    2312    2323

According to the AIC, the best is the MNB-I, which has relatively many parameters. The best according to the BIC is CMNB-I, and according to CAIC, the best is NB-I. The Poisson-based models do not do well.
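As a quick numerical check of these formulae against the reported output, here is a minimal Octave sketch; it assumes that the "Function value" printed by the estimation programs is the average log-likelihood over the n observations.

% Reproducing (approximately) the information criteria for the MNB-I model.
n = 500;             % observations
k = 17;              % parameters in the 2-component NB-I mixture
meanlogL = -2.2312;  % reported function value (assumed mean log-likelihood)
logL = n * meanlogL;
AIC  = -2 * logL + 2 * k             % approx. 2265
BIC  = -2 * logL + k * log(n)        % approx. 2337
CAIC = -2 * logL + k * (log(n) + 1)  % approx. 2354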
24.3. Models for time series data
This section can be ignored in its present form. Just left in to form a basis
for completion (by someone else ?!) at some point.
Hamilton, Time Series Analysis is a good reference for this section. This is
very incomplete and contributions would be very welcome.

dependent variables, e.g.,

gf

consider the behavior of

as a

These variables can of course contain lagged

8 f 8 
!@A88@9I f!UF  
8
c 

function of other variables

4f

Up to now weve considered the behavior of the dependent variable

Pure time series methods

as a function only of its own lagged values, un-

conditional on other observable variables. One can think of this as modeling

gf

the behavior of

after marginalizing out all other variables. While its not

immediately clear why a model that has other explanatory variables should
marginalize to a linear in the parameters time series model, most time series

24.3. MODELS FOR TIME SERIES DATA

475

work is done with linear models, though nonlinear time series is also a large
and growing eld. Well stick with linear time series models.

24.3.1. Basic concepts.

DEFINITION 53 (Stochastic process). A stochastic process is a sequence of random variables, indexed by time:

(24.3.1)
\[
\{Y_t\}_{t=-\infty}^{\infty}
\]

DEFINITION 54 (Time series). A time series is one observation of a stochastic process, over a specific interval:

(24.3.2)
\[
\{y_t\}_{t=1}^{n}
\]

So a time series is a sample of size $n$ from a stochastic process. It's important to keep in mind that conceptually, one could draw another sample, and that the values would be different.


DEFINITION 55 (Autocovariance). The $k$-th autocovariance of a stochastic process is

(24.3.3)
\[
\gamma_{kt} = E\left[(y_t - \mu_t)(y_{t-k} - \mu_{t-k})\right]
\]
where $\mu_t = E(y_t)$.

DEFINITION 56 (Covariance (weak) stationarity). A stochastic process is covariance stationary if it has time constant mean and autocovariances of all orders:
\begin{align*}
\mu_t &= \mu, \quad \forall t \\
\gamma_{kt} &= \gamma_k, \quad \forall t
\end{align*}
As we've seen, this implies that $\gamma_{kt} = \gamma_k$: the autocovariances depend only on the interval between observations, but not on the time of the observations.
DEFINITION 57 (Strong stationarity). A stochastic process is strongly stationary if the joint distribution of an arbitrary collection of the $y_t$ doesn't depend on $t$.

Since moments are determined by the distribution, strong stationarity implies weak stationarity.

What is the mean of $y_t$? The time series is one sample from the stochastic process. One could think of $M$ repeated samples from the stochastic process, e.g., $\{y_t^m\}$, $m = 1, 2, \ldots, M$. By a LLN, we would expect that
\[
\frac{1}{M}\sum_{m=1}^{M} y_t^m \overset{p}{\longrightarrow} E(y_t)
\]
as $M \to \infty$. The problem is, we have only one sample to work with, since we can't go back in time and collect another. How can $E(y_t)$ be estimated then? It turns out that ergodicity is the needed property.

DEFINITION 58 (Ergodicity). A stationary stochastic process is ergodic (for the mean) if the time average converges to the mean:

(24.3.4)
\[
\frac{1}{n}\sum_{t=1}^{n} y_t \overset{p}{\longrightarrow} \mu
\]

A sufficient condition for ergodicity is that the autocovariances be absolutely summable:
\[
\sum_{k=0}^{\infty} |\gamma_k| < \infty.
\]
This implies that the autocovariances die off, so that the $y_t$ are not so strongly dependent that they don't satisfy a LLN.

DEFINITION 59 (Autocorrelation). The $k$-th autocorrelation, $\rho_k$, is just the $k$-th autocovariance divided by the variance:

(24.3.5)
\[
\rho_k = \frac{\gamma_k}{\gamma_0}
\]
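As a brief numerical illustration of ergodicity, time averages computed from a single long realization approximate the corresponding population moments. A minimal Octave sketch, using a simple autoregressive process purely as an example:

% Time averages from one realization of a stationary, ergodic process.
n = 10000;
phi = 0.5;                        % example AR(1) coefficient, |phi| < 1
e = randn(n, 1);
y = zeros(n, 1);
for t = 2:n
  y(t) = phi * y(t-1) + e(t);
end
mean(y)                           % close to the true mean of 0
gamma0 = mean((y - mean(y)).^2);  % sample variance
gamma1 = mean((y(2:end) - mean(y)) .* (y(1:end-1) - mean(y)));
rho1 = gamma1 / gamma0            % close to the true rho_1 = 0.5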

DEFINITION 60 (White noise). White noise is just the time series literature term for a classical error. $\varepsilon_t$ is white noise if i) $E(\varepsilon_t) = 0, \;\forall t$; ii) $V(\varepsilon_t) = \sigma^2, \;\forall t$; and iii) $\varepsilon_t$ and $\varepsilon_s$ are independent, $t \neq s$. Gaussian white noise just adds a normality assumption.

24.3.2. ARMA models. With these concepts, we can discuss ARMA models. These are closely related to the AR and MA error processes that we've already discussed. The main difference is that the lhs variable is observed directly now.

24.3.2.1. MA(q) processes. A $q$-th order moving average (MA) process is

\[
y_t = \mu + \varepsilon_t + \theta_1\varepsilon_{t-1} + \theta_2\varepsilon_{t-2} + \cdots + \theta_q\varepsilon_{t-q}
\]
where $\varepsilon_t$ is white noise. The variance is
\begin{align*}
\gamma_0 &= E\left[(y_t - \mu)^2\right] \\
         &= E\left[(\varepsilon_t + \theta_1\varepsilon_{t-1} + \cdots + \theta_q\varepsilon_{t-q})^2\right] \\
         &= \sigma^2\left(1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2\right).
\end{align*}
Similarly, the autocovariances are
\[
\gamma_k =
\begin{cases}
\sigma^2\left(\theta_k + \theta_{k+1}\theta_1 + \theta_{k+2}\theta_2 + \cdots + \theta_q\theta_{q-k}\right), & k \leq q \\
0, & k > q.
\end{cases}
\]
Therefore an MA(q) process is necessarily covariance stationary and ergodic, as long as $\sigma^2$ and all of the $\theta_i$ are finite.
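A quick numerical check of these second-moment formulas in Octave, for an example MA(2); the $\theta$ values and $\sigma^2$ are made up.

% Theoretical autocovariances of an MA(2), plus a simulation check.
sig2 = 1;  theta = [0.6; 0.3];           % made-up parameter values
q = length(theta);  th = [1; theta];     % th(1) plays the role of theta_0 = 1
gamma = zeros(q+1, 1);
for k = 0:q
  gamma(k+1) = sig2 * sum(th(1+k:q+1) .* th(1:q+1-k));  % sum_j theta_{j+k} theta_j
end
gamma'                                   % gamma_0, gamma_1, gamma_2

n = 100000;  e = randn(n, 1);
y = e + theta(1) * [0; e(1:end-1)] + theta(2) * [0; 0; e(1:end-2)];
[mean(y.^2), mean(y(2:end).*y(1:end-1)), mean(y(3:end).*y(1:end-2))]  % close to gamma'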
24.3.2.2. AR(p) processes. An AR(p) process can be represented as
\[
y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t.
\]
The dynamic behavior of an AR(p) process can be studied by writing this $p$-th order difference equation as a vector first order difference equation:
\[
\begin{bmatrix} y_t \\ y_{t-1} \\ \vdots \\ y_{t-p+1} \end{bmatrix}
=
\begin{bmatrix} c \\ 0 \\ \vdots \\ 0 \end{bmatrix}
+
\begin{bmatrix}
\phi_1 & \phi_2 & \cdots & \phi_{p-1} & \phi_p \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \cdots & 1 & 0
\end{bmatrix}
\begin{bmatrix} y_{t-1} \\ y_{t-2} \\ \vdots \\ y_{t-p} \end{bmatrix}
+
\begin{bmatrix} \varepsilon_t \\ 0 \\ \vdots \\ 0 \end{bmatrix}
\]
or
\[
Y_t = C + F Y_{t-1} + E_t.
\]

With this, we can recursively work forward in time:
\begin{align*}
Y_{t+1} &= C + F Y_t + E_{t+1} \\
        &= C + F\left(C + F Y_{t-1} + E_t\right) + E_{t+1} \\
        &= (I_p + F)C + F^2 Y_{t-1} + F E_t + E_{t+1}
\end{align*}
and
\begin{align*}
Y_{t+2} &= C + F Y_{t+1} + E_{t+2} \\
        &= (I_p + F + F^2)C + F^3 Y_{t-1} + F^2 E_t + F E_{t+1} + E_{t+2}
\end{align*}
or in general
\[
Y_{t+j} = (I_p + F + \cdots + F^j)C + F^{j+1} Y_{t-1} + F^j E_t + F^{j-1} E_{t+1} + \cdots + F E_{t+j-1} + E_{t+j}.
\]
Consider the impact of a shock in period $t$ on $y_{t+j}$. This is simply
\[
\frac{\partial y_{t+j}}{\partial \varepsilon_t} = F^j_{(1,1)},
\]
the $(1,1)$ element of $F^j$. If the system is to be stationary, then as we move forward in time this impact must die off. Otherwise a shock causes a permanent change in the mean of $y_t$. Therefore, stationarity requires that
\[
\lim_{j \to \infty} F^j_{(1,1)} = 0.
\]
Save this result, we'll need it in a minute.

Consider the eigenvalues of the matrix $F$. These are the $\lambda$ such that
\[
\left|F - \lambda I_p\right| = 0.
\]
The determinant here can be expressed as a polynomial. For example, for $p = 1$, the matrix $F$ is simply $F = \phi_1$, so
\[
\left|\phi_1 - \lambda\right| = 0
\]
can be written as $\phi_1 - \lambda = 0$. When $p = 2$, the matrix $F$ is
\[
F = \begin{bmatrix} \phi_1 & \phi_2 \\ 1 & 0 \end{bmatrix}
\]
so
\[
F - \lambda I_p = \begin{bmatrix} \phi_1 - \lambda & \phi_2 \\ 1 & -\lambda \end{bmatrix}
\]
and
\[
\left|F - \lambda I_p\right| = \lambda^2 - \lambda\phi_1 - \phi_2.
\]
So the eigenvalues are the roots of the polynomial $\lambda^2 - \lambda\phi_1 - \phi_2$, which can be found using the quadratic equation. This generalizes. For a $p$-th order AR process, the eigenvalues are the roots of
\[
\lambda^p - \lambda^{p-1}\phi_1 - \lambda^{p-2}\phi_2 - \cdots - \lambda\phi_{p-1} - \phi_p = 0.
\]
Supposing that all of the roots of this polynomial are distinct, then the matrix $F$ can be factored as
\[
F = T \Lambda T^{-1}
\]
where $T$ is the matrix which has as its columns the eigenvectors of $F$, and $\Lambda$ is a diagonal matrix with the eigenvalues on the main diagonal. Using this decomposition, we can write
\[
F^j = \left(T\Lambda T^{-1}\right)\left(T\Lambda T^{-1}\right)\cdots\left(T\Lambda T^{-1}\right)
\]
where $T\Lambda T^{-1}$ is repeated $j$ times. This gives
\[
F^j = T \Lambda^j T^{-1}
\]
and $\Lambda^j = \mathrm{diag}\!\left(\lambda_1^j, \lambda_2^j, \ldots, \lambda_p^j\right)$. Supposing that the $\lambda_i$, $i = 1, 2, \ldots, p$ are all real valued, it is clear that
\[
\lim_{j \to \infty} F^j_{(1,1)} = 0
\]
requires that
\[
\left|\lambda_i\right| < 1, \quad i = 1, 2, \ldots, p,
\]
i.e., the eigenvalues must be less than one in absolute value.


It may be the case that some eigenvalues are complex-valued. The

previous result generalizes to the requirement that the eigenvalues be


less than one in modulus, where the modulus of a complex number
h

is

3U P
 pt

hP
U

3U P
6p

This leads to the famous statement that stationarity requires the roots
of the determinantal polynomial to lie inside the complex unit circle.
draw picture here.

When there are roots on the unit circle (unit roots) or outside the unit

Dynamic multipliers:

circle, we leave the world of stationary processes.


is a dynamic multiplier or an

r
I sI

 tG v$ f

impulse-response function. Real eigenvalues lead to steady movements,


whereas comlpex eigenvalue lead to ocillatory behavior. Of course,
when there are multiple eigenvalues the overall effect can be a mixture. pictures
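In practice, the stationarity condition is easy to check numerically. A minimal Octave sketch for an example AR(2); the coefficient values are made up.

% Stationarity check for an AR(2) via the companion matrix F.
phi = [1.2, -0.35];             % example AR(2) coefficients
F = [phi; 1, 0];                % companion matrix
abs(eig(F))                     % both moduli < 1, so the process is stationary
roots([1, -phi(1), -phi(2)])    % same values: roots of lambda^2 - phi_1*lambda - phi_2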

Invertibility of AR processes. To begin with, define the lag operator $L$:
\[
L y_t = y_{t-1}.
\]
The lag operator is defined to behave just as an algebraic quantity, e.g.,
\[
L^2 y_t = L(L y_t) = L y_{t-1} = y_{t-2}
\]
or
\[
(1 - L)(1 + L)y_t = y_t + L y_t - L y_t - L^2 y_t = y_t - y_{t-2}.
\]
A mean-zero AR(p) process can be written as
\[
y_t - \phi_1 y_{t-1} - \phi_2 y_{t-2} - \cdots - \phi_p y_{t-p} = \varepsilon_t
\]
or
\[
\left(1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p\right) y_t = \varepsilon_t.
\]
Factor this polynomial as
\[
1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p = (1 - \lambda_1 L)(1 - \lambda_2 L)\cdots(1 - \lambda_p L).
\]
For the moment, just assume that the $\lambda_i$ are coefficients to be determined. Since $L$ is defined to operate as an algebraic quantity, determination of the $\lambda_i$ is the same as determination of the $\lambda_i$ such that the following two expressions are the same for all $z$:
\[
1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p = (1 - \lambda_1 z)(1 - \lambda_2 z)\cdots(1 - \lambda_p z).
\]
Multiply both sides by $z^{-p}$:
\[
z^{-p} - \phi_1 z^{1-p} - \phi_2 z^{2-p} - \cdots - \phi_{p-1} z^{-1} - \phi_p = (z^{-1} - \lambda_1)(z^{-1} - \lambda_2)\cdots(z^{-1} - \lambda_p)
\]
and now define $\lambda = z^{-1}$, so we get
\[
\lambda^p - \phi_1\lambda^{p-1} - \phi_2\lambda^{p-2} - \cdots - \phi_{p-1}\lambda - \phi_p = (\lambda - \lambda_1)(\lambda - \lambda_2)\cdots(\lambda - \lambda_p).
\]
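Numerically, the reciprocal relationship between the roots in $z$ of the lag polynomial and the roots in $\lambda = z^{-1}$ (the eigenvalues of $F$) is easy to see with Octave's roots function, using the same example AR(2) coefficients as in the earlier sketch.

% Roots of 1 - phi_1 z - phi_2 z^2 versus roots of lambda^2 - phi_1 lambda - phi_2.
phi = [1.2, -0.35];
roots([-phi(2), -phi(1), 1])    % z-roots: outside the unit circle when stationary
roots([1, -phi(1), -phi(2)])    % lambda-roots (eigenvalues): their reciprocals, inside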

The LHS is precisely the determinantal polynomial that gives the eigenvalues of $F$. Therefore, the $\lambda_i$ that are the coefficients of the factorization are simply the eigenvalues of the matrix $F$.

Now consider a different stationary process:
\[
(1 - \phi L) y_t = \varepsilon_t.
\]
Stationarity, as above, implies that $|\phi| < 1$. Multiply both sides by $1 + \phi L + \phi^2 L^2 + \cdots + \phi^j L^j$ to get
\[
\left(1 + \phi L + \phi^2 L^2 + \cdots + \phi^j L^j\right)(1 - \phi L) y_t = \left(1 + \phi L + \phi^2 L^2 + \cdots + \phi^j L^j\right)\varepsilon_t
\]
or, multiplying the polynomials on the LHS, we get
\[
\left(1 + \phi L + \cdots + \phi^j L^j - \phi L - \phi^2 L^2 - \cdots - \phi^j L^j - \phi^{j+1} L^{j+1}\right) y_t = \left(1 + \phi L + \cdots + \phi^j L^j\right)\varepsilon_t
\]
and with cancellations we have
\[
\left(1 - \phi^{j+1} L^{j+1}\right) y_t = \left(1 + \phi L + \cdots + \phi^j L^j\right)\varepsilon_t
\]
so
\[
y_t = \phi^{j+1} L^{j+1} y_t + \left(1 + \phi L + \cdots + \phi^j L^j\right)\varepsilon_t.
\]
Now as $j \to \infty$, $\phi^{j+1} L^{j+1} y_t \to 0$, since $|\phi| < 1$, so
\[
y_t \cong \left(1 + \phi L + \cdots + \phi^j L^j\right)\varepsilon_t
\]
and the approximation becomes better and better as $j$ increases. However, we started with
\[
(1 - \phi L) y_t = \varepsilon_t.
\]
Substituting this into the above equation we have
\[
y_t \cong \left(1 + \phi L + \cdots + \phi^j L^j\right)(1 - \phi L) y_t
\]
so
\[
\left(1 + \phi L + \cdots + \phi^j L^j\right)(1 - \phi L) \cong 1
\]
and the approximation becomes arbitrarily good as $j$ increases arbitrarily. Therefore, for $|\phi| < 1$, define
\[
(1 - \phi L)^{-1} = \sum_{j=0}^{\infty} \phi^j L^j.
\]
Recall that our mean zero AR(p) process
\[
y_t - \phi_1 y_{t-1} - \cdots - \phi_p y_{t-p} = \varepsilon_t
\]
can be written using the factorization
\[
(1 - \lambda_1 L)(1 - \lambda_2 L)\cdots(1 - \lambda_p L) y_t = \varepsilon_t
\]
where the $\lambda_i$ are the eigenvalues of $F$, and given stationarity, all the $|\lambda_i| < 1$. Therefore, we can invert each first order polynomial on the LHS to get
\[
y_t = (1 - \lambda_1 L)^{-1}(1 - \lambda_2 L)^{-1}\cdots(1 - \lambda_p L)^{-1}\varepsilon_t.
\]
The RHS is a product of infinite-order polynomials in $L$, which can be represented as
\[
y_t = \left(1 + \psi_1 L + \psi_2 L^2 + \cdots\right)\varepsilon_t
\]
where the $\psi_i$ are real-valued and absolutely summable.

The $\psi_i$ are formed of products of powers of the $\lambda_i$, which are in turn functions of the $\phi_i$.

The $\psi_i$ are real-valued because any complex-valued $\lambda_i$ always occur in conjugate pairs. This means that if $a + bi$ is an eigenvalue of $F$, then so is $a - bi$. In multiplication,
\[
(a + bi)(a - bi) = a^2 - abi + abi - b^2 i^2 = a^2 + b^2,
\]
which is real-valued.

This shows that an AR(p) process is representable as an infinite-order MA(q) process.

Recall before that by recursive substitution, an AR(p) process can be written as
\[
Y_{t+j} = (I_p + F + \cdots + F^j)C + F^{j+1} Y_{t-1} + F^j E_t + F^{j-1} E_{t+1} + \cdots + F E_{t+j-1} + E_{t+j}.
\]
If the process is mean zero, then everything with a $C$ drops out. Take this and lag it by $j$ periods to get
\[
Y_t = F^{j+1} Y_{t-j-1} + F^j E_{t-j} + F^{j-1} E_{t-j+1} + \cdots + F E_{t-1} + E_t.
\]
As $j \to \infty$, the lagged $Y$ on the RHS drops out. The $F^m E_{t-m}$ are vectors of zeros except for their first element, so we see that the first equation here, in the limit, is just
\[
y_t = \sum_{j=0}^{\infty} F^j_{(1,1)} \varepsilon_{t-j},
\]
which makes explicit the relationship between the $\psi_i$ and the $\phi_i$ (and the $\lambda_i$ as well, recalling the previous factorization of $F^j$).

Moments of an AR(p) process. The AR(p) process is
\[
y_t = c + \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} + \varepsilon_t.
\]
Assuming stationarity, $E(y_t) = \mu, \;\forall t$, so
\[
\mu = c + \phi_1\mu + \phi_2\mu + \cdots + \phi_p\mu
\]
so
\[
\mu = \frac{c}{1 - \phi_1 - \phi_2 - \cdots - \phi_p}
\]
and
\[
c = \mu - \phi_1\mu - \cdots - \phi_p\mu.
\]
Substituting this into the process, we can write it in deviations from the mean:
\[
y_t - \mu = \phi_1(y_{t-1} - \mu) + \phi_2(y_{t-2} - \mu) + \cdots + \phi_p(y_{t-p} - \mu) + \varepsilon_t.
\]
With this, the second moments are easy to find: the variance is
\[
\gamma_0 = \phi_1\gamma_1 + \phi_2\gamma_2 + \cdots + \phi_p\gamma_p + \sigma^2.
\]
The autocovariances of orders $k \geq 1$ follow the rule
\begin{align*}
\gamma_k &= E\left[(y_t - \mu)(y_{t-k} - \mu)\right] \\
         &= E\left[\left(\phi_1(y_{t-1} - \mu) + \cdots + \phi_p(y_{t-p} - \mu) + \varepsilon_t\right)(y_{t-k} - \mu)\right] \\
         &= \phi_1\gamma_{k-1} + \phi_2\gamma_{k-2} + \cdots + \phi_p\gamma_{k-p}.
\end{align*}
Using the fact that $\gamma_{-k} = \gamma_k$, one can take the $p + 1$ equations for $k = 0, 1, \ldots, p$, which have $p + 1$ unknowns ($\gamma_0, \gamma_1, \ldots, \gamma_p$), and solve for the unknowns. With these, the $\gamma_k$ for $k > p$ can be solved for recursively.
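A sketch of this calculation in Octave for an example AR(2): the $p + 1 = 3$ equations for $k = 0, 1, 2$ are solved as a linear system, and higher-order autocovariances then follow from the recursion. The coefficient values and $\sigma^2$ are made up.

% Second moments of an AR(2) from the equations above.
phi = [1.2, -0.35];  sig2 = 1;          % made-up values
A = [ 1,       -phi(1),  -phi(2);
     -phi(1),  1-phi(2),  0;
     -phi(2),  -phi(1),   1 ];
g = A \ [sig2; 0; 0];                   % gamma_0, gamma_1, gamma_2
rho = g(2:3) / g(1)                     % autocorrelations rho_1, rho_2
% higher orders: gamma_k = phi(1)*gamma_{k-1} + phi(2)*gamma_{k-2}, k > 2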

24.3.2.3. Invertibility of MA(q) processes. An MA(q) can be written as
\[
y_t - \mu = \left(1 + \theta_1 L + \cdots + \theta_q L^q\right)\varepsilon_t.
\]
As before, the polynomial on the RHS can be factored as
\[
1 + \theta_1 L + \cdots + \theta_q L^q = (1 - \eta_1 L)(1 - \eta_2 L)\cdots(1 - \eta_q L)
\]
and each of the $(1 - \eta_i L)$ can be inverted as long as $|\eta_i| < 1$. If this is the case, then we can write
\[
\left(1 + \theta_1 L + \cdots + \theta_q L^q\right)^{-1}(y_t - \mu) = \varepsilon_t,
\]
where
\[
\left(1 + \theta_1 L + \cdots + \theta_q L^q\right)^{-1}
\]
will be an infinite-order polynomial in $L$,
\[
\left(1 + \theta_1 L + \cdots + \theta_q L^q\right)^{-1} = \sum_{j=0}^{\infty} -\delta_j L^j,
\]
with $\delta_0 = -1$, so we get
\[
-\sum_{j=0}^{\infty} \delta_j L^j (y_t - \mu) = \varepsilon_t
\]
or
\[
y_t - \mu = \delta_1(y_{t-1} - \mu) + \delta_2(y_{t-2} - \mu) + \cdots + \varepsilon_t.
\]

So we see that an MA(q) has an infinite AR representation, as long as the $|\eta_i| < 1$, $i = 1, 2, \ldots, q$.

It turns out that one can always manipulate the parameters of an MA(q) process to find an invertible representation. For example, the two MA(1) processes
\[
y_t - \mu = (1 + \theta L)\varepsilon_t
\]
and
\[
y_t^* - \mu = (1 + \theta^{-1} L)\varepsilon_t^*
\]
have exactly the same moments if
\[
\sigma_{\varepsilon^*}^2 = \sigma_{\varepsilon}^2\theta^2.
\]
For example, we've seen that $\gamma_0 = \sigma^2(1 + \theta^2)$. Given the above relationships amongst the parameters,
\[
\gamma_0^* = \sigma^2\theta^2\left(1 + \theta^{-2}\right) = \sigma^2\left(1 + \theta^2\right),
\]
so the variances are the same. It turns out that all the autocovariances will be the same, as is easily checked. This means that the two MA processes are observationally equivalent. As before, it's impossible to distinguish between observationally equivalent processes on the basis of data.
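A quick numerical check of this equivalence in Octave ($\theta = 0.5$ is an arbitrary example value):

% Two observationally equivalent MA(1) parameterizations.
theta = 0.5;  sig2 = 1;
theta_star = 1 / theta;  sig2_star = sig2 * theta^2;
gamma0  = sig2 * (1 + theta^2)              % variance of the first process
gamma0s = sig2_star * (1 + theta_star^2)    % variance of the second: identical
gamma1  = sig2 * theta                      % first autocovariance
gamma1s = sig2_star * theta_star            % identical as well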

For a given MA(q) process, it's always possible to manipulate the parameters to find an invertible representation (which is unique).

It's important to find an invertible representation, since it's the only representation that allows one to represent $\varepsilon_t$ as a function of past $y$'s. The other representations express $\varepsilon_t$ as a function of future $y$'s.

Why is invertibility important? The most important reason is that it provides a justification for the use of parsimonious models. Since an AR(1) process has an MA($\infty$) representation, one can reverse the argument and note that at least some MA($\infty$) processes have an AR(1) representation. At the time of estimation, it's a lot easier to estimate the single AR(1) coefficient rather than the infinite number of coefficients associated with the MA representation.

This is the reason that ARMA models are popular. Combining low-order AR and MA models can usually offer a satisfactory representation of univariate time series data with a reasonable number of parameters.

Stationarity and invertibility of ARMA models is similar to what we've seen; we won't go into the details. Likewise, calculating moments is similar.

EXERCISE 61. Calculate the autocovariances of an ARMA(1,1) model:
\[
y_t = c + \phi y_{t-1} + \varepsilon_t + \theta\varepsilon_{t-1}
\]
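One way to check an answer to this exercise is by simulation: generate a long realization and compare sample autocovariances to the analytical expressions. A minimal Octave sketch; all parameter values are arbitrary.

% Simulate an ARMA(1,1): y_t = c + phi*y_{t-1} + eps_t + theta*eps_{t-1}
n = 200000;  phi = 0.7;  theta = 0.4;  c = 1;
e = randn(n, 1);
y = zeros(n, 1);
for t = 2:n
  y(t) = c + phi * y(t-1) + e(t) + theta * e(t-1);
end
ybar = mean(y)                  % should be near c / (1 - phi)
yd = y - ybar;
gamma0 = mean(yd.^2)
gamma1 = mean(yd(2:end) .* yd(1:end-1))
gamma2 = mean(yd(3:end) .* yd(1:end-2))   % for an ARMA(1,1), gamma_2 = phi*gamma_1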

Bibliography
[1] Davidson, R. and J.G. MacKinnon (1993) Estimation and Inference in Econometrics, Oxford
Univ. Press.
[2] Davidson, R. and J.G. MacKinnon (2004) Econometric Theory and Methods, Oxford Univ.
Press.
[3] Gallant, A.R. (1985) Nonlinear Statistical Models, Wiley.
[4] Gallant, A.R. (1997) An Introduction to Econometric Theory, Princeton Univ. Press.
[5] Hamilton, J. (1994) Time Series Analysis, Princeton Univ. Press
[6] Hayashi, F. (2000) Econometrics, Princeton Univ. Press.
[7] Wooldridge (2003), Introductory Econometrics, Thomson. (undergraduate level, for supplementary use only).


Index

asymptotic equality, 442
Chain rule, 436
Cobb-Douglas model, 21
convergence, almost sure, 438
convergence, in distribution, 439
convergence, in probability, 438
Convergence, ordinary, 437
convergence, pointwise, 437
convergence, uniform, 437
convergence, uniform almost sure, 440
cross section, 17
estimator, linear, 28, 38
estimator, OLS, 23
extremum estimator, 247
leverage, 28
likelihood function, 49
matrix, idempotent, 27
matrix, projection, 26
matrix, symmetric, 27
observations, influential, 27
outliers, 27
own influence, 29
parameter space, 49
Product rule, 436
R-squared, centered, 32
R-squared, uncentered, 31