Econometrics

© Michael Creel

Version 0.70, September, 2005

Dept. of Economics and Economic History, Universitat Autònoma de Barcelona
michael.creel@uab.es, http://pareto.uab.es/mcreel

Contents

List of Figures
List of Tables

Chapter 1. About this document
1.1. License
1.2. Obtaining the materials
1.3. An easy way to use LyX and Octave today
1.4. Known Bugs

Chapter 2. Introduction: Economic and econometric models

Chapter 3. Ordinary Least Squares
3.1. The Linear Model
3.2. Estimation by least squares
3.3. Geometric interpretation of least squares estimation
3.4. Influential observations and outliers
3.5. Goodness of fit
3.6. The classical linear regression model
3.7. Small sample statistical properties of the least squares estimator
3.8. Example: The Nerlove model
Exercises

Chapter 4. Maximum likelihood estimation
4.1. The likelihood function
4.2. Consistency of MLE
4.3. The score function
4.4. Asymptotic normality of MLE
4.6. The information matrix equality
4.7. The Cramér-Rao lower bound
Exercises

Chapter 5. Asymptotic properties of the least squares estimator
5.1. Consistency
5.2. Asymptotic normality
5.3. Asymptotic efficiency

Chapter 6. Restrictions and hypothesis tests
6.1. Exact linear restrictions
6.2. Testing
6.3. The asymptotic equivalence of the LR, Wald and score tests
6.4. Interpretation of test statistics
6.5. Confidence intervals
6.6. Bootstrapping
6.7. Testing nonlinear restrictions, and the Delta Method
6.8. Example: the Nerlove data

Chapter 7. Generalized least squares
7.1. Effects of nonspherical disturbances on the OLS estimator
7.2. The GLS estimator
7.3. Feasible GLS
7.4. Heteroscedasticity
7.5. Autocorrelation
Exercises
Exercises

Chapter 8. Stochastic regressors
8.1. Case 1
8.2. Case 2
8.3. Case 3
8.4. When are the assumptions reasonable?
Exercises

Chapter 9. Data problems
9.1. Collinearity
9.2. Measurement error
9.3. Missing observations
Exercises
Exercises
Exercises

Chapter 10. Functional form and nonnested tests
10.1. Flexible functional forms
10.2. Testing nonnested hypotheses

Chapter 11. Exogeneity and simultaneity
11.1. Simultaneous equations
11.2. Exogeneity
11.3. Reduced form
11.4. IV estimation
11.5. Identification by exclusion restrictions
11.6. 2SLS
11.7. Testing the overidentifying restrictions
11.8. System methods of estimation
11.9. Example: 2SLS and Klein's Model 1

Chapter 12. Introduction to the second half

Chapter 13. Numeric optimization methods
13.1. Search
13.2. Derivative-based methods
13.3. Simulated Annealing
13.4. Examples
13.5. Duration data and the Weibull model
13.6. Numeric optimization: pitfalls
Exercises

Chapter 14. Asymptotic properties of extremum estimators
14.1. Extremum estimators
14.2. Consistency
14.3. Example: Consistency of Least Squares
14.4. Asymptotic Normality
14.5. Examples
14.6. Example: Linearization of a nonlinear model
Exercises

Chapter 15. Generalized method of moments (GMM)
15.1. Definition
15.2. Consistency
15.3. Asymptotic normality
15.4. Choosing the weighting matrix
15.5. Estimation of the variance-covariance matrix
15.6. Estimation using conditional moments
15.7. Estimation using dynamic moment conditions
15.8. A specification test
15.9. Other estimators interpreted as GMM estimators
15.10. Example: The Hausman Test
15.11. Application: Nonlinear rational expectations
15.12. Empirical example: a portfolio model
Exercises

Chapter 16. Quasi-ML

Chapter 17. Nonlinear least squares (NLS)
17.1. Introduction and definition
17.2. Identification
17.3. Consistency
17.4. Asymptotic normality
17.5. Example: The Poisson model for count data
17.6. The Gauss-Newton algorithm
17.7. Application: Limited dependent variables and sample selection

Chapter 18. Nonparametric inference
18.1. Possible pitfalls of parametric inference: estimation
18.2. Possible pitfalls of parametric inference: hypothesis testing
18.3. The Fourier functional form
18.4. Kernel regression estimators
18.5. Kernel density estimation
18.6. Semi-nonparametric maximum likelihood
18.7. Examples

Chapter 19. Simulation-based estimation
19.1. Motivation
19.2. Simulated maximum likelihood (SML)
19.3. Method of simulated moments (MSM)
19.4. Efficient method of moments (EMM)
19.5. Example: estimation of stochastic differential equations

Chapter 20. Parallel programming for econometrics

Chapter 21. Introduction to Octave
21.1. Getting started
21.2. A short introduction
21.3. If you're running a Linux installation...

Chapter 22. Notation and Review
22.1. Notation for differentiation of vectors and matrices
22.2. Convergence modes
22.3. Rates of convergence and asymptotic equality
Exercises

Chapter 23. The GPL

Chapter 24. The attic
24.1. MEPS data: more on count models
24.2. Hurdle models
24.3. Models for time series data

Bibliography
Index

List of Figures

1.2.1 LyX
1.2.2 Octave
3.2.1 Typical data, Classical Model
3.3.1 Example OLS Fit
3.3.2 The fit in observation space
3.4.1 Detection of influential observations
3.5.1 Uncentered R²
3.7.1 Unbiasedness of OLS under classical assumptions
3.7.2 Biasedness of OLS when an assumption fails
3.7.3 Gauss-Markov Result: The OLS estimator
3.7.4 Gauss-Markov Result: The split sample estimator
6.5.1 Joint and Individual Confidence Regions
6.8.1 RTS as a function of firm size
7.4.1 Residuals, Nerlove model, sorted by firm size
7.5.1 Autocorrelation induced by misspecification
7.5.2 Durbin-Watson critical values
7.6.1 Residuals of simple Nerlove model
7.6.2 OLS residuals, Klein consumption equation
9.1.1 When there is no collinearity
9.1.2 When there is collinearity
9.3.1 Sample selection bias
13.1.1 The search method
13.2.1 Increasing directions of search
13.2.2 Newton-Raphson method
13.2.3 Using MuPAD to get analytic derivatives
13.5.1 Life expectancy of mongooses, Weibull model
13.5.2 Life expectancy of mongooses, mixed Weibull model
13.6.1 A foggy mountain
15.10.1 OLS and IV estimators when regressors and errors are correlated
21.2.1 Running an Octave program

List of Tables

1 Marginal Variances, Sample and Estimated (Poisson)
2 Marginal Variances, Sample and Estimated (NB-II)
3 Actual and Poisson fitted frequencies
4 Actual and Hurdle Poisson fitted frequencies
5 Information Criteria, OBDV

CHAPTER 1

About this document


This document integrates lecture notes for a one year graduate level course with computer programs that illustrate and apply the methods that are studied. The immediate availability of executable (and modifiable) example programs when using the PDF¹ version of the document is one of the advantages of the system that has been used. On the other hand, when viewed in printed form, the document is a somewhat terse approximation to a textbook. These notes are not intended to be a perfect substitute for a printed textbook. If you are a student of mine, please note that last sentence carefully. There are many good textbooks available. A few of my favorites are listed in the bibliography.

With respect to contents, the emphasis is on estimation and inference within the world of stationary data, with a bias toward microeconometrics. The second half is somewhat more polished than the first half, since I have taught that course more often. If you take a moment to read the licensing information in the next section, you'll see that you are free to copy and modify the document. If anyone would like to contribute material that expands the contents, it would be very welcome. Error corrections and other additions are also welcome. As an example of a project that has made use of these notes, see these very nice lecture slides.

¹It is possible to have the program links open up in an editor, ready to run using keyboard macros. To do this with the PDF version you need to do some setup work. See the bootable CD described below.


1.1. License

All materials are copyrighted by Michael Creel with the date that appears above. They are provided under the terms of the GNU General Public License, which forms Section 23 of the notes. The main thing you need to know is that you are free to modify and distribute these materials in any way you like, as long as you do so under the terms of the GPL. In particular, you must make available the source files, in editable form, for your modified version of the materials.

1.2. Obtaining the materials

The materials are available on my web page, in a variety of forms including PDF and the editable sources, at pareto.uab.es/mcreel/Econometrics/. In addition to the final product, which you're looking at in some form now, you can obtain the editable sources, which will allow you to create your own version, if you like, or send error corrections and contributions. The main document was prepared using LyX (www.lyx.org) and Octave (www.octave.org). LyX is a free² "what you see is what you mean" word processor, basically working as a graphical frontend to LaTeX. It (with help from other applications) can export your work in LaTeX, HTML, PDF and several other forms. It will run on Linux, Windows, and MacOS systems. Figure 1.2.1 shows LyX editing this document.

GNU Octave has been used for the example programs, which are scattered through the document. This choice is motivated by several factors. The first is the high quality of the Octave environment for doing applied econometrics. The fundamental tools exist and are implemented in a way that make extending

²Free is used in the sense of freedom, but LyX is also free of charge.


FIGURE 1.2.1. LyX

them fairly easy. The example programs included here may convince you of this point. Secondly, Octave's licensing philosophy fits in with the goals of this project. Thirdly, it runs on Linux, Windows and MacOS. Figure 1.2.2 shows an Octave program being edited by NEdit, and the result of running the program in a shell window.

1.3. An easy way to use LyX and Octave today


The example programs are available as links to files on my web page in the PDF version, and here. Support files needed to run these are available here. The files won't run properly from your browser, since there are dependencies


FIGURE 1.2.2. Octave

between files - they are only illustrative when browsing. To see how to use these files (edit and run them), you should go to the home page of this document, since you will probably want to download the pdf version together with all the support files and examples. Then set the base URL of the PDF file to point to wherever the Octave files are installed. All of this may sound a bit complicated, because it is. An easier solution is available:

The file pareto.uab.es/mcreel/Econometrics/econometrics.iso is an ISO image file that may be burnt to CDROM. It contains a bootable-from-CD Gnu/Linux


system that has all of the tools needed to edit this document, run the Octave example programs, etcetera. In particular, it will allow you to cut out small portions of the notes and edit them, and send them to me as LyX (or TeX) files for inclusion in future versions. Think error corrections, additions, etc.! The CD automatically detects the hardware of your computer, and will not touch your hard disk unless you explicitly tell it to do so. It is based upon the Knoppix GNU/Linux distribution, with some material removed and other material added. Additionally, you can use it to install Debian GNU/Linux on your computer (run knoppix-installer as the root user). The versions of programs on the CD may be quite out of date, possibly with security problems that have not been fixed. So if you do a hard disk installation you should do apt-get update, apt-get upgrade toot sweet. See the Knoppix web page for more information.

1.4. Known Bugs

This section is a reminder to myself to try to fix a few things.
The PDF version has hyperlinks to figures that jump to the wrong figure. The numbers are correct, but the links are not. ps2pdf bugs?

CHAPTER 2

Introduction: Economic and econometric models


Economic theory tells us that the demand function for a good is something like:

x = x(p, m, z)

• x is the quantity demanded
• p is a vector of prices of the good and its substitutes and complements
• m is income
• z is a vector of other variables such as individual characteristics that affect preferences

Suppose we have a sample consisting of one observation on n individuals' demands at time period t (this is a cross section, where i = 1, 2, ..., n indexes the individuals in the sample). The individual demand functions are

xᵢ = xᵢ(pᵢ, mᵢ, zᵢ)

The model is not estimable as it stands, since:

• The form of the demand function is different for all i.
• Some components of zᵢ may not be observable to an outside modeler. For example, people don't eat the same lunch every day, and you can't tell what they will order just by looking at them. Suppose we can break zᵢ into the observable components wᵢ and a single unobservable component εᵢ.

A step toward an estimable econometric model is to suppose that the model may be written as

xᵢ = β₁ + pᵢ′βₚ + mᵢβₘ + wᵢ′β_w + εᵢ

We have imposed a number of restrictions on the theoretical model:

• The functions xᵢ(·), which in principle may differ for all i, have been restricted to all belong to the same parametric family.
• Of all parametric families of functions, we have restricted the model to the class of linear in the variables functions.
• The parameters are constant across individuals.
• There is a single unobservable component, and we assume it is additive.

If we assume nothing about the error term ε, we can always write the last equation. But in order for the β coefficients to have an economic meaning, and in order to be able to estimate them from sample data, we need to make additional assumptions. These additional assumptions have no theoretical basis: they are assumptions on top of those needed to prove the existence of a demand function. The validity of any results we obtain using this model will be contingent on these additional restrictions being at least approximately correct. For this reason, specification testing will be needed, to check that the model seems to be reasonable. Only when we are convinced that the model is at least approximately correct should we use it for economic analysis.


When testing a hypothesis using an econometric model, three factors can cause a statistical test to reject the null hypothesis:

(1) the hypothesis is false
(2) a type I error has occurred
(3) the econometric model is not correctly specified so the test does not have the assumed distribution

We would like to ensure that the third reason is not contributing to rejections, so that rejection will be due to either the first or second reasons. Hopefully the above example makes it clear that there are many possible sources of misspecification of econometric models. In the next few sections we will obtain results supposing that the econometric model is entirely correctly specified. Later we will examine the consequences of misspecification and see some methods for determining if a model is correctly specified. Later on, econometric methods that seek to minimize maintained assumptions are introduced.

CHAPTER 3

Ordinary Least Squares


3.1. The Linear Model

Consider approximating a variable y using the variables x₁, x₂, ..., x_k. We can consider a model that is a linear approximation:

Linearity: the model is a linear function of the parameter vector β⁰:

y = β⁰₁x₁ + β⁰₂x₂ + ... + β⁰_k x_k + ε

or, using vector notation:

y = x′β⁰ + ε

The dependent variable y is a scalar random variable, x = (x₁ x₂ ... x_k)′ is a k-vector of explanatory variables, and β⁰ = (β⁰₁ β⁰₂ ... β⁰_k)′. The superscript 0 in β⁰ means this is the "true value" of the unknown parameter. It will be defined more precisely later, and usually suppressed when it's not necessary for clarity.

Suppose that we want to use data to try to determine the best linear approximation to y using the variables x. The data {(yₜ, xₜ)}, t = 1, 2, ..., n, are obtained by some form of sampling¹. An individual observation is thus

yₜ = xₜ′β + εₜ

The n observations can be written in matrix form as

(3.1.1)   y = Xβ + ε,

where y = (y₁ y₂ ... yₙ)′ is n × 1 and X = (x₁ x₂ ... xₙ)′.

Linear models are more general than they might first appear, since one can employ nonlinear transformations of the variables:

φ₀(z) = [φ₁(w) φ₂(w) ⋯ φ_p(w)] β + ε

where the φᵢ() are known functions. Defining y = φ₀(z), x₁ = φ₁(w), etc. leads to a model in the form of equation 3.6.1. For example, the Cobb-Douglas model

z = A w₂^β₂ w₃^β₃ exp(ε)

can be transformed logarithmically to obtain

ln z = ln A + β₂ ln w₂ + β₃ ln w₃ + ε.

If we define y = ln z, β₁ = ln A, etc., we can put the model in the form needed. The approximation is linear in the parameters, but not necessarily linear in the variables.

¹For example, cross-sectional data may be obtained by random sampling. Time series data accumulate historically.
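As a quick illustration of this point, the following Octave fragment (our own sketch, not one of the example programs distributed with these notes; the parameter values are arbitrary) generates data from a Cobb-Douglas relationship and puts it in linear-in-parameters form by taking logs:

% A model nonlinear in the variables but linear in the parameters:
% generate z = A * w2^b2 * w3^b3 * exp(e), then regress ln z on a
% constant, ln w2 and ln w3.
n = 100;
A = 2; b2 = 0.6; b3 = 0.3;
w2 = exp(randn(n,1)); w3 = exp(randn(n,1));
e = 0.1*randn(n,1);
z = A .* w2.^b2 .* w3.^b3 .* exp(e);
y = log(z);                        % transformed dependent variable
X = [ones(n,1) log(w2) log(w3)];   % transformed regressors
betahat = X \ y;                   % OLS recovers [ln A; b2; b3] approximately
disp(betahat')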

3.2. Estimation by least squares


Figure 3.2.1, obtained by running TypicalData.m, shows some data that follow the linear model yₜ = β₁ + β₂xₜ₂ + εₜ. The green line is the true regression line β₁ + β₂xₜ₂, and the red crosses are the data points (xₜ₂, yₜ), where εₜ is a random error that has mean zero and is independent of xₜ₂. Exactly how the green line is defined will become clear later. In practice, we only have the data, and we don't know where the green line lies. We need to gain information about the straight line that best fits the data points.

FIGURE 3.2.1. Typical data, Classical Model (data points and the true regression line)

The ordinary least squares (OLS) estimator is defined as the value that minimizes the sum of the squared errors:

β̂ = argmin s(β)

where

s(β) = Σₜ₌₁ⁿ (yₜ − xₜ′β)²
     = (y − Xβ)′(y − Xβ)
     = y′y − 2y′Xβ + β′X′Xβ
     = ‖ y − Xβ ‖²

This last expression makes it clear how the OLS estimator is defined: it minimizes the Euclidean distance between y and Xβ. The fitted OLS coefficients β̂ will define the best linear approximation to y using x as basis functions, where "best" means minimum Euclidean distance. One could think of other estimators based upon other metrics. For example, the minimum absolute distance (MAD) estimator minimizes Σₜ|yₜ − xₜ′β|. Later, we will see that which estimator is "best" in terms of its statistical properties, rather than in terms of the metrics that define them, depends upon the properties of ε, about which we have as yet made no assumptions.

To minimize the criterion s(β), find the derivative with respect to β and set it to zero:

D_β s(β̂) = −2X′y + 2X′Xβ̂ ≡ 0

so

β̂ = (X′X)⁻¹X′y.

To verify that this is a minimum, check the second order sufficient condition:

D²_β s(β̂) = 2X′X.

Since ρ(X) = K, this matrix is positive definite, since it's a quadratic form in a p.d. matrix (the identity matrix of order n), so β̂ is in fact a minimizer.

• The fitted values are in the vector ŷ = Xβ̂.
• The residuals are in the vector ε̂ = y − Xβ̂.
• Note that

y = Xβ + ε = Xβ̂ + ε̂.

• Also, the first order conditions can be written as

X′y − X′Xβ̂ = 0
X′(y − Xβ̂) = 0
X′ε̂ = 0,

which is to say, the OLS residuals are orthogonal to X. Let's look at this more carefully.
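Before turning to the geometry, here is a minimal Octave sketch of these formulas (our own illustration with simulated data, not one of the programs referenced in the text):

% OLS via the first order conditions: betahat = (X'X)^(-1) X'y.
n = 50; K = 3;
X = [ones(n,1) randn(n,K-1)];
beta = [1; 2; -1];
y = X*beta + randn(n,1);
betahat = (X'*X) \ (X'*y);   % numerically, X \ y is equivalent and preferred
yhat = X*betahat;            % fitted values
ehat = y - yhat;             % residuals
disp(X'*ehat)                % first order conditions: X'ehat is (numerically) zero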

3.3. Geometric interpretation of least squares estimation


3.3.1. In X, Y Space. Figure 3.3.1 shows a typical fit to data, along with the true regression line. Note that the true line and the estimated line are different. This figure was created by running the Octave program OlsFit.m. You can experiment with changing the parameter values to see how this affects the fit, and to see how the fitted line will sometimes be close to the true line, and sometimes rather far away.

FIGURE 3.3.1. Example OLS Fit (data points, fitted line and true line)

3.3.2. In Observation Space. If we want to plot in observation space, we'll need to use only two or three observations, or we'll encounter some limitations of the blackboard. Let's use two. With only two observations, we can't have K > 1.

FIGURE 3.3.2. The fit in observation space

• We can decompose y into two components: the orthogonal projection onto the K-dimensional space spanned by X, Xβ̂, and the component that is the orthogonal projection onto the n − K dimensional subspace that is orthogonal to the span of X, ε̂.
• Since β̂ is chosen to make ε̂ as short as possible, ε̂ will be orthogonal to the space spanned by X. Since X is in this space, X′ε̂ = 0. Note that the f.o.c. that define the least squares estimator imply that this is so.

3.3.3. Projection Matrices. Xβ̂ is the projection of y onto the span of X, or

Xβ̂ = X(X′X)⁻¹X′y.

Therefore, the matrix that projects y onto the span of X is

P_X = X(X′X)⁻¹X′,

since

Xβ̂ = P_X y.

ε̂ is the projection of y onto the n − K dimensional space that is orthogonal to the span of X. We have that

ε̂ = y − Xβ̂ = y − X(X′X)⁻¹X′y = [Iₙ − X(X′X)⁻¹X′] y.

So the matrix that projects y onto the space orthogonal to the span of X is

M_X = Iₙ − X(X′X)⁻¹X′ = Iₙ − P_X.

We have

ε̂ = M_X y.

Therefore

y = P_X y + M_X y = Xβ̂ + ε̂.

These two projection matrices decompose the n-dimensional vector y into two orthogonal components - the portion that lies in the K-dimensional space defined by X, and the portion that lies in the orthogonal n − K dimensional space.

• Note that both P_X and M_X are symmetric and idempotent.
  – A symmetric matrix A is one such that A = A′.
  – An idempotent matrix A is one such that A = AA.
  – The only nonsingular idempotent matrix is the identity matrix.
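The following Octave fragment (ours, for illustration only) verifies these properties numerically:

% Projection onto the span of X and onto its orthogonal complement.
n = 20; K = 2;
X = [ones(n,1) randn(n,1)];
y = randn(n,1);
PX = X * inv(X'*X) * X';   % projects onto span(X)
MX = eye(n) - PX;          % projects onto the orthogonal complement
% Symmetry and idempotency (all of these should be numerically zero):
disp(norm(PX - PX'))
disp(norm(PX*PX - PX))
disp(norm(MX*MX - MX))
% Decomposition y = PX*y + MX*y, and the two parts are orthogonal:
disp(norm(y - (PX*y + MX*y)))
disp((PX*y)' * (MX*y))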

3.4. Influential observations and outliers

The OLS estimator of the i-th element of the vector β₀ is simply

β̂ᵢ = [(X′X)⁻¹X′]ᵢ· y ≡ cᵢ′y.

This is how we define a linear estimator - it's a linear function of the dependent variable. Since it's a linear combination of the observations on the dependent variable, where the weights are determined by the observations on the regressors, some observations may have more influence than others. Define

hₜ = (P_X)ₜₜ,

the t-th element on the main diagonal of P_X; equivalently, hₜ = eₜ′P_X eₜ, where eₜ is an n-vector of zeros with a 1 in the t-th position, so 0 ≤ hₜ ≤ 1. Since tr P_X = K, the average of the hₜ is K/n. So, on average, the weight on the yₜ's is K/n. If the weight is much higher, then the observation has the potential to affect the fit importantly. The weight hₜ is referred to as the leverage of the observation. However, an observation may also be influential due to the value of yₜ, rather than the weight it is multiplied by, which only depends on the xₜ's.

To account for this, consider estimation of β without using the t-th observation (designate this estimator as β̂⁽ᵗ⁾). One can show (see Davidson and MacKinnon, pp. 32-5 for proof) that

β̂⁽ᵗ⁾ = β̂ − (1/(1 − hₜ)) (X′X)⁻¹xₜ ε̂ₜ,

so the change in the t-th observation's fitted value is

xₜ′β̂ − xₜ′β̂⁽ᵗ⁾ = (hₜ/(1 − hₜ)) ε̂ₜ.

While an observation may be influential if it doesn't affect its own fitted value, it certainly is influential if it does. A fast means of identifying influential observations is to plot (hₜ/(1 − hₜ)) ε̂ₜ (which I will refer to as the own influence of the observation) as a function of t. Figure 3.4.1 gives an example plot of data, fit, leverage and influence. The Octave program is InfluentialObservation.m. If you re-run the program you will see that the leverage of the last observation (an outlying value of x) is always high, and the influence is sometimes high.

FIGURE 3.4.1. Detection of influential observations (data points, fitted line, leverage and influence)

After influential observations are detected, one needs to determine why they are influential. Possible causes include:

• data entry error, which can easily be corrected once detected. Data entry errors are very common.
• special economic factors that affect some observations. These would need to be identified and incorporated in the model. This is the idea behind structural change: the parameters may not be constant across all observations.
• pure randomness may have caused us to sample a low-probability observation.

There exist robust estimation methods that downweight outliers.
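A minimal Octave sketch of the leverage and own-influence calculations, written directly from the formulas above (the simulated data and the artificial outlier are our own choices; this is not the InfluentialObservation.m distributed with the notes):

% Leverage h_t (diagonal of P_X) and own influence: the change in the
% t-th fitted value when observation t is dropped.
n = 30;
x = [randn(n-1,1); 5];            % last observation is an outlier in x
X = [ones(n,1) x];
y = X*[1; 2] + randn(n,1);
betahat = X \ y;
ehat = y - X*betahat;
h = diag(X * inv(X'*X) * X');     % leverages; they average to K/n
own_influence = (h ./ (1 - h)) .* ehat;
[h(end) own_influence(end)]       % the outlying point has high leverage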


3.5. Goodness of fit

The fitted model is

y = Xβ̂ + ε̂.

Take the inner product:

y′y = β̂′X′Xβ̂ + 2β̂′X′ε̂ + ε̂′ε̂.

But the middle term of the RHS is zero since X′ε̂ = 0, so

(3.5.1)   y′y = β̂′X′Xβ̂ + ε̂′ε̂.

The uncentered R²ᵤ is defined as

R²ᵤ = 1 − ε̂′ε̂/y′y = β̂′X′Xβ̂/y′y = ‖P_X y‖²/‖y‖² = cos²(φ),

where φ is the angle between y and the span of X.

• The uncentered R² changes if we add a constant to y, since this changes φ (see Figure 3.5.1, the yellow vector is a constant, since it's on the 45 degree line in observation space). Another, more common definition measures the contribution of the variables, other than the constant term, to explaining the variation in y. Thus it measures the ability of the model to explain the variation of y about its unconditional sample mean.

FIGURE 3.5.1. Uncentered R²

Let ι = (1, 1, ..., 1)′, an n-vector. So

M_ι = Iₙ − ι(ι′ι)⁻¹ι′ = Iₙ − ιι′/n.

M_ι y just returns the vector of deviations from the mean. In terms of deviations from the mean, equation 3.5.1 becomes

y′M_ι y = β̂′X′M_ι Xβ̂ + ε̂′M_ι ε̂.

The centered R²_c is defined as

R²_c = 1 − ε̂′ε̂/(y′M_ι y).

Supposing that X contains a column of ones (i.e., there is a constant term), then X′ε̂ = 0 implies ι′ε̂ = Σₜ ε̂ₜ = 0, so M_ι ε̂ = ε̂. In this case

y′M_ι y = β̂′X′M_ι Xβ̂ + ε̂′ε̂.

• Supposing that a column of ones is in the space spanned by X, one can show that 0 ≤ R²_c ≤ 1.
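A small Octave fragment (our own sketch with simulated data) that computes both versions after an OLS fit:

% Uncentered and centered R^2 (X includes a constant column).
n = 40;
X = [ones(n,1) randn(n,2)];
y = X*[1; 0.5; -0.5] + randn(n,1);
betahat = X \ y;
ehat = y - X*betahat;
R2u = 1 - (ehat'*ehat) / (y'*y);              % uncentered
R2c = 1 - (ehat'*ehat) / sum((y - mean(y)).^2); % centered
[R2u R2c]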


3.6. The classical linear regression model


Up to this point the model is empty of content beyond the definition of a best linear approximation to y and some geometrical properties. There is no economic content to the model, and the regression parameters have no economic interpretation. For example, what is the partial derivative of y with respect to x_j? The linear approximation is

y = β₁x₁ + β₂x₂ + ... + β_k x_k + ε.

The partial derivative is

∂y/∂x_j = β_j + ∂ε/∂x_j.

Up to now, there's no guarantee that ∂ε/∂x_j = 0. For the β to have an economic meaning, we need to make additional assumptions. The assumptions that are appropriate to make depend on the data under consideration. We'll start with the classical linear regression model, which incorporates some assumptions that are clearly not realistic for economic data. This is to be able to explain some concepts with a minimum of confusion and notational clutter. Later we'll adapt the results to what we can get with more realistic assumptions.

Linearity: the model is a linear function of the parameter vector β⁰:

(3.6.1)   y = β⁰₁x₁ + β⁰₂x₂ + ... + β⁰_k x_k + ε,

or, using vector notation:

y = x′β⁰ + ε.

Nonstochastic linearly independent regressors: X is an n × K matrix of fixed constants, it has rank K, its number of columns, and

(3.6.2)   lim_{n→∞} (1/n)X′X = Q_X,

where Q_X is a finite positive definite matrix. This is needed to be able to identify the individual effects of the explanatory variables.

Independently and identically distributed errors:

(3.6.3)   ε ~ IID(0, σ²Iₙ):

ε is jointly distributed IID. This implies the following two properties:

Homoscedastic errors:

(3.6.4)   V(εₜ) = σ²₀, ∀t.

Nonautocorrelated errors:

(3.6.5)   E(εₜεₛ) = 0, ∀t ≠ s.

Optionally, we will sometimes assume that the errors are normally distributed.

Normally distributed errors:

(3.6.6)   ε ~ N(0, σ²Iₙ).


3.7. Small sample statistical properties of the least squares estimator


Up to now, we have only examined numeric properties of the OLS estimator, which always hold. Now we will examine statistical properties. The statistical properties depend upon the assumptions we can make.

3.7.1. Unbiasedness. We have β̂ = (X′X)⁻¹X′y. By linearity,

β̂ = (X′X)⁻¹X′(Xβ + ε) = β + (X′X)⁻¹X′ε.

By 3.6.2 and 3.6.3,

E[(X′X)⁻¹X′ε] = (X′X)⁻¹X′E(ε) = 0,

so the OLS estimator is unbiased under the assumptions of the classical model.

Figure 3.7.1 shows the results of a small Monte Carlo experiment where the OLS estimator was calculated for 10000 samples from the classical model y = β₁ + β₂x + ε, where x is fixed across samples. We can see that β₂ appears to be estimated without bias. The program that generates the plot is Unbiased.m, if you would like to experiment with this.

FIGURE 3.7.1. Unbiasedness of OLS under classical assumptions (histogram of β̂₂ − β₂ over the Monte Carlo replications)

With time series data, the OLS estimator will often be biased. Figure 3.7.2 shows the results of a small Monte Carlo experiment where the OLS estimator was calculated for 1000 samples from the AR(1) model yₜ = β₁ + β₂yₜ₋₁ + εₜ. In this case, assumption 3.6.2 does not hold: the regressors are stochastic. We can see that the bias in the estimation of β₂ is about -0.2.

The program that generates the plot is Biased.m, if you would like to experiment with this.
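A condensed sketch of the kind of Monte Carlo loop used by Unbiased.m (our own simplified version; the sample size, parameter values and number of replications here are arbitrary choices, not those used to produce the figures):

% Monte Carlo check of unbiasedness: X fixed across samples, normal errors.
n = 20; reps = 10000;
X = [ones(n,1) randn(n,1)];     % fixed regressors
beta = [1; 2];
b2 = zeros(reps,1);
for i = 1:reps
  y = X*beta + 3*randn(n,1);
  bhat = X \ y;
  b2(i) = bhat(2);
end
printf("mean of (beta2hat - beta2) over %d samples: %f\n", reps, mean(b2) - beta(2));
hist(b2 - beta(2), 30);          % histogram is centered near zero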

3.7.2. Normality. With the linearity assumption, we have β̂ = β + (X′X)⁻¹X′ε. This is a linear function of ε. Adding the assumption of normality (3.6.6, which implies strong exogeneity), then

β̂ ~ N(β, (X′X)⁻¹σ²₀),

since a linear function of a normal random vector is also normally distributed. In Figure 3.7.1 you can see that the estimator appears to be normally distributed. It in fact is normally distributed, since the DGP (see the Octave program) has normal errors. Even when the data may be taken to be IID, the assumption of normality is often questionable or simply untenable. For example, if the dependent variable is the number of automobile trips per week, it is a count variable with a discrete distribution, and is thus not normally distributed. Many variables in economics can take on only nonnegative values, which, strictly speaking, rules out normality.²

FIGURE 3.7.2. Biasedness of OLS when an assumption fails (histogram of β̂₂ − β₂ for the AR(1) Monte Carlo)

²Normality may be a good model nonetheless, as long as the probability of a negative value occurring is negligible under the model. This depends upon the mean being large enough in relation to the variance.

3.7.3. The variance of the OLS estimator and the Gauss-Markov theorem. Now let's make all the classical assumptions except the assumption of normality. We have β̂ = β + (X′X)⁻¹X′ε and we know that E(β̂) = β. So

V(β̂) = E[(β̂ − β)(β̂ − β)′] = E[(X′X)⁻¹X′εε′X(X′X)⁻¹] = (X′X)⁻¹σ²₀.

The OLS estimator is a linear estimator, which means that it is a linear function of the dependent variable, y:

β̂ = [(X′X)⁻¹X′] y,

where the weighting matrix is a function of the explanatory variables only, not the dependent variable. It is also unbiased under the present assumptions, as we proved above.

One could consider other weights W that are a function of X and that define some other linear estimator. We'll still insist upon unbiasedness. Consider β̃ = Wy, where W = W(X) is some K × n matrix function of X. Note that since W is a function of X, it is nonstochastic, too. If the estimator is unbiased, then we must have WX = I_K:

E(Wy) = E(WXβ₀ + Wε) = WXβ₀ = β₀  ⟹  WX = I_K.

The variance of β̃ is

V(β̃) = WW′σ²₀.

Define

D = W − (X′X)⁻¹X′,

so

W = D + (X′X)⁻¹X′.

Since WX = I_K, DX = 0, so

V(β̃) = [D + (X′X)⁻¹X′][D + (X′X)⁻¹X′]′σ²₀ = [DD′ + (X′X)⁻¹]σ²₀.

So

V(β̃) ≥ V(β̂).

The inequality is a shorthand means of expressing, more formally, that V(β̃) − V(β̂) is a positive semi-definite matrix. This is a proof of the Gauss-Markov Theorem: the OLS estimator is the "best linear unbiased estimator" (BLUE).

• It is worth emphasizing again that we have not used the normality assumption in any way to prove the Gauss-Markov theorem, so it is valid if the errors are not normally distributed, as long as the other assumptions hold.

To illustrate the Gauss-Markov result, consider the estimator that results from splitting the sample into p equally-sized parts, estimating using each part of the data separately by OLS, then averaging the p resulting estimators. You should be able to show that this estimator is unbiased, but inefficient with respect to the OLS estimator. The program Efficiency.m illustrates this using a small Monte Carlo experiment, which compares the OLS estimator and a 3-way split sample estimator. The data generating process follows the classical model. In Figures 3.7.3 and 3.7.4 we can see that the OLS estimator is more efficient, since the tails of its histogram are more narrow.

FIGURE 3.7.3. Gauss-Markov Result: The OLS estimator (histogram of β̂₂)

FIGURE 3.7.4. Gauss-Markov Result: The split sample estimator (histogram of β̂₂)

We have that E(β̂) = β and V(β̂) = (X′X)⁻¹σ²₀, but we still need to estimate the variance of ε, σ²₀, in order to have an idea of the precision of the estimates of β. A commonly used estimator of σ²₀ is

σ̂²₀ = (1/(n − K)) ε̂′ε̂.

This estimator is unbiased:

σ̂²₀ = (1/(n − K)) ε̂′ε̂ = (1/(n − K)) ε′M_X ε,

E(σ̂²₀) = (1/(n − K)) E(tr ε′M_X ε)
        = (1/(n − K)) E(tr M_X εε′)
        = (1/(n − K)) tr E(M_X εε′)
        = (1/(n − K)) σ²₀ tr M_X
        = (1/(n − K)) σ²₀ (n − K)
        = σ²₀,

where we use the fact that tr(AB) = tr(BA) when both products are conformable. Thus, this estimator is also unbiased under these assumptions.
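In Octave, the estimated error variance and the corresponding estimated covariance matrix of β̂ can be computed as in the following sketch (our own illustration with simulated data, not one of the referenced programs):

% Unbiased estimator of sigma^2 and the estimated var-cov of betahat.
n = 60; K = 3;
X = [ones(n,1) randn(n,K-1)];
y = X*[1; 2; -1] + 2*randn(n,1);
betahat = X \ y;
ehat = y - X*betahat;
sig2hat = (ehat'*ehat) / (n - K);   % unbiased under the classical assumptions
Vhat = sig2hat * inv(X'*X);         % estimated variance of betahat
se = sqrt(diag(Vhat));              % standard errors
[betahat se]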

3.8. Example: The Nerlove model


3.8.1. Theoretical background. For a firm that takes input prices w and the output level q as given, the cost minimization problem is to choose the quantities of inputs x to solve the problem

min_x w′x

subject to the restriction f(x) = q. The solution is the vector of factor demands x(w, q). The cost function is obtained by substituting the factor demands into the criterion function:

C(w, q) = w′x(w, q).

• Monotonicity. Increasing factor prices cannot decrease cost, so ∂C(w, q)/∂w ≥ 0. Remember that these derivatives give the conditional factor demands (Shephard's Lemma).
• Homogeneity. The cost function is homogeneous of degree 1 in input prices: C(tw, q) = tC(w, q), where t is a scalar constant. This is because the factor demands are homogeneous of degree zero in factor prices - they only depend upon relative prices.
• Returns to scale. The returns to scale parameter γ is defined as the inverse of the elasticity of cost with respect to output:

γ = [ (∂C(w, q)/∂q) (q/C(w, q)) ]⁻¹.

Constant returns to scale is the case where increasing production q implies that cost increases in the proportion 1:1. If this is the case, then γ = 1.


3.8.2. Cobb-Douglas functional form. The Cobb-Douglas functional form is linear in the logarithms of the regressors and the dependent variable. For a cost function, if there are g factors, the Cobb-Douglas cost function has the form

C = A w₁^β₁ ⋯ w_g^β_g q^β_q e^ε.

What is the elasticity of C with respect to w_j?

e^C_{w_j} = (∂C/∂w_j)(w_j/C)
          = β_j A w₁^β₁ ⋯ w_j^{β_j−1} ⋯ w_g^β_g q^β_q e^ε · w_j / (A w₁^β₁ ⋯ w_g^β_g q^β_q e^ε)
          = β_j.

This is one of the reasons the Cobb-Douglas form is popular - the coefficients are easy to interpret, since they are the elasticities of the dependent variable with respect to the explanatory variables. Note that in this case,

e^C_{w_j} = (∂C/∂w_j)(w_j/C) = x_j(w, q) w_j/C ≡ s_j(w, q),

the cost share of the j-th input. So with a Cobb-Douglas cost function, β_j = s_j(w, q): the cost shares are constants.

Note that after a logarithmic transformation we obtain

ln C = α + β₁ ln w₁ + ⋯ + β_g ln w_g + β_q ln q + ε,

where α = ln A. So we see that the transformed model is linear in the logs of the data.

One can verify that the property of HOD1 implies that

Σᵢ₌₁^g βᵢ = 1.

In other words, the cost shares add up to 1.

The hypothesis that the technology exhibits CRTS implies that γ = 1/β_q = 1, so β_q = 1. Likewise, monotonicity implies that the coefficients βᵢ ≥ 0, i = 1, ..., g.

3.8.3. The Nerlove data and OLS. The file nerlove.data contains data on 145 electric utility companies' cost of production, output and input prices. The data are for the U.S., and were collected by M. Nerlove. The observations are by row, and the columns are COMPANY, COST (C), OUTPUT (Q), PRICE OF LABOR (P_L), PRICE OF FUEL (P_F) and PRICE OF CAPITAL (P_K). Note that the data are sorted by output level (the third column).

We will estimate the Cobb-Douglas model

(3.8.1)   ln C = β₁ + β₂ ln Q + β₃ ln P_L + β₄ ln P_F + β₅ ln P_K + ε

using OLS. To do this yourself, you need the data file mentioned above, as well as Nerlove.m (the estimation program), and the library of Octave functions mentioned in the introduction to Octave that forms section 21 of this document.³
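For orientation, a bare-bones version of the estimation might look like the following Octave sketch (our own simplification, assuming nerlove.data is in the working directory and that its six columns are ordered as described above; the actual Nerlove.m in the distributed materials uses the supporting function library and produces the formatted output shown below):

% OLS estimation of the Cobb-Douglas cost function, equation 3.8.1.
data = load("nerlove.data");
cost     = data(:,2);   % columns: company, cost, output, P_labor, P_fuel, P_capital
output   = data(:,3);
plabor   = data(:,4);
pfuel    = data(:,5);
pcapital = data(:,6);
y = log(cost);
X = [ones(rows(data),1) log(output) log(plabor) log(pfuel) log(pcapital)];
[n, K] = size(X);
betahat = X \ y;
ehat = y - X*betahat;
sig2hat = (ehat'*ehat) / (n - K);
se = sqrt(diag(sig2hat * inv(X'*X)));
[betahat se]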
The results are

*********************************************************
OLS estimation results
Observations 145
R-squared 0.925955
Sigma-squared 0.153943

Results (Ordinary var-cov estimator)

            estimate   st.err.   t-stat.   p-value
constant      -3.527     1.774    -1.987     0.049
output         0.720     0.017    41.244     0.000
labor          0.436     0.291     1.499     0.136
fuel           0.427     0.100     4.249     0.000
capital       -0.220     0.339    -0.648     0.518

*********************************************************

• Do the theoretical restrictions hold?
• What do you think about RTS?
• Does the model fit well?


³If you are running the bootable CD, you have all of this installed and ready to run.


While we will use Octave programs as examples in this document, since following the programming statements is a useful way of learning how theory is put into practice, you may be interested in a more user-friendly environment for doing econometrics. I heartily recommend Gretl, the Gnu Regression, Econometrics, and Time-Series Library. This is an easy to use program, available in English, French, and Spanish, and it comes with a lot of data ready to use. It even has an option to save output as LaTeX fragments, so that I can just include the results into this document, no muss, no fuss. Here are the results of the Nerlove model from GRETL:

Model 2: OLS estimates using the 145 observations 1-145
Dependent variable: l_cost

(GRETL's coefficient table reports, for const, l_output, l_labor, l_fuel and l_capita, the coefficient, standard error, t-statistic and p-value, followed by summary statistics: the mean and S.D. of the dependent variable, the sum of squared residuals, the standard error of the residuals σ̂, the unadjusted and adjusted R², and the Akaike and Schwarz Bayesian information criteria.)

Fortunately, Gretl and my OLS program agree upon the results. Gretl is included in the bootable CD mentioned in the introduction. I recommend using GRETL to repeat the examples that are done using Octave.

The previous properties hold for finite sample sizes. Before considering the asymptotic properties of the OLS estimator it is useful to review the MLE estimator, since under the assumption of normal errors the two estimators coincide.


Exercises
(1) Prove that the split sample estimator used to generate figure 3.7.4 is unbiased.
(2) Calculate the OLS estimates of the Nerlove model using Octave and GRETL, and provide printouts of the results. Interpret the results.
(3) Do an analysis of whether or not there are influential observations for OLS estimation of the Nerlove model. Discuss.
(4) Using GRETL, examine the residuals after OLS estimation and tell me whether or not you believe that the assumption of independent identically distributed normal errors is warranted. No need to do formal tests, just look at the plots. Print out any that you think are relevant, and interpret them.
(5) For a random vector X ~ N(μ_x, Σ), what is the distribution of AX + b, where A and b are conformable matrices of constants?
(6) Using Octave, write a little program that verifies that tr(AB) = tr(BA) for A and B 4x4 matrices of random numbers. Note: there is an Octave function trace.
(7) For the model with a constant and a single regressor, yₜ = β₁ + β₂xₜ + εₜ, which satisfies the classical assumptions, prove that the variance of the OLS estimator declines to zero as the sample size increases.

CHAPTER 4

Maximum likelihood estimation


The maximum likelihood estimator is important since it is asymptotically efficient, as is shown below. For the classical linear model with normal errors, the ML and OLS estimators of β are the same, so the following theory is presented without examples. In the second half of the course, nonlinear models with nonnormal errors are introduced, and examples may be found there.
4.1. The likelihood function
Suppose we have a sample of size n of the random vectors y and z. Suppose the joint density of Y = (y₁ ... yₙ) and Z = (z₁ ... zₙ) is characterized by a parameter vector ψ₀:

f_{YZ}(Y, Z, ψ₀).

This is the joint density of the sample. This density can be factored as

f_{YZ}(Y, Z, ψ₀) = f_{Y|Z}(Y|Z, θ₀) f_Z(Z, ρ₀).

The likelihood function is just this density evaluated at other values ψ:

L(Y, Z, ψ) = f(Y, Z, ψ), ψ ∈ Ψ,

where Ψ is a parameter space. The maximum likelihood estimator of ψ₀ is the value of ψ that maximizes the likelihood function.

Note that if θ₀ and ρ₀ share no elements, then the maximizer of the conditional likelihood function f_{Y|Z}(Y|Z, θ) with respect to θ is the same as the maximizer of the overall likelihood function f_{YZ}(Y, Z, ψ) = f_{Y|Z}(Y|Z, θ) f_Z(Z, ρ), for the elements of ψ that correspond to θ. In this case, the variables Z are said to be exogenous for estimation of θ, and we may more conveniently work with the conditional likelihood function f_{Y|Z}(Y|Z, θ) for the purposes of estimating θ₀.

DEFINITION 4.1.1. The maximum likelihood estimator of θ₀ is θ̂ = argmax f_{Y|Z}(Y|Z, θ).

• If the n observations are independent, the likelihood function can be written as

L(Y|Z, θ) = ∏ₜ₌₁ⁿ f(yₜ|zₜ, θ),

where the fₜ are possibly of different form.
• If this is not possible, we can always factor the likelihood into contributions of observations, by using the fact that a joint density can be factored into the product of a marginal and conditional (doing this iteratively):

L(Y|Z, θ) = f(y₁|z₁, θ) f(y₂|y₁, z₂, θ) f(y₃|y₁, y₂, z₃, θ) ⋯ f(yₙ|y₁, ..., yₙ₋₁, zₙ, θ).

To simplify notation, define

xₜ = {y₁, ..., yₜ₋₁, zₜ},

so x₁ = z₁, x₂ = {y₁, z₂}, etc. - it contains exogenous and predetermined endogenous variables. Now the likelihood function can be written as

L(Y, θ) = ∏ₜ₌₁ⁿ f(yₜ|xₜ, θ).

The criterion function can be defined as the average log-likelihood function:

sₙ(θ) = (1/n) ln L(Y, θ) = (1/n) Σₜ₌₁ⁿ ln f(yₜ|xₜ, θ).

The maximum likelihood estimator may thus be defined equivalently as

θ̂ = argmax sₙ(θ),

where the set maximized over is defined below. Since ln(·) is a monotonic increasing function, ln L and sₙ maximize at the same value of θ. Dividing by n has no effect on θ̂.

4.1.1. Example: Bernoulli trial. Suppose that we are flipping a coin that may be biased, so that the probability of a heads may not be 0.5. Maybe we're interested in estimating the probability of a heads. Let y = 1(heads) be a binary variable that indicates whether or not a heads is observed. The outcome of a toss is a Bernoulli random variable:

f_Y(y, p₀) = p₀^y (1 − p₀)^{1−y}, y ∈ {0, 1}
           = 0, otherwise.

So a representative term that enters the likelihood function is

f_Y(y, p) = p^y (1 − p)^{1−y}

and

ln f_Y(y, p) = y ln p + (1 − y) ln(1 − p).

The derivative of this is

∂ ln f_Y(y, p)/∂p = y/p − (1 − y)/(1 − p).

Averaging this over a sample of size n gives

∂sₙ(p)/∂p = (1/n) Σᵢ₌₁ⁿ [ yᵢ/p − (1 − yᵢ)/(1 − p) ].

Setting to zero and solving gives

p̂ = ȳ.

So it's easy to calculate the MLE of p₀ in this case.

Now imagine that we had a bag full of bent coins, each bent around a sphere of a different radius (with the head pointing to the outside of the sphere). We might suspect that the probability of a heads could depend upon the radius. Suppose that pᵢ ≡ p(xᵢ, β) = (1 + exp(−xᵢ′β))⁻¹, where xᵢ = (1 rᵢ)′, so that β is a 2 × 1 vector. Now

∂pᵢ(β)/∂β = pᵢ(1 − pᵢ) xᵢ,

so

∂ ln f_Y(y, β)/∂β = [ yᵢ/pᵢ − (1 − yᵢ)/(1 − pᵢ) ] pᵢ(1 − pᵢ) xᵢ = (yᵢ − p(xᵢ, β)) xᵢ.

So the derivative of the average log likelihood function is now

∂sₙ(β)/∂β = (1/n) Σᵢ₌₁ⁿ (yᵢ − p(xᵢ, β)) xᵢ.

This is a set of 2 nonlinear equations in the two unknown elements in β. There is no explicit solution for the two elements that set the equations to zero. This is common with ML estimators: they are often nonlinear, and finding their values often requires use of numeric methods to find solutions to the first order conditions.
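To make the point concrete, here is a small Octave sketch (our own; the data generating values are arbitrary) that solves the first order conditions for the bent-coin example numerically with a few Newton iterations:

% ML estimation of p(x,beta) = 1/(1+exp(-x*beta)) by Newton's method
% applied to the score (1/n) sum (y_i - p_i) x_i.
n = 500;
x = [ones(n,1) rand(n,1)];          % second column plays the role of the radius
beta_true = [0.5; -1.0];
p = 1 ./ (1 + exp(-x*beta_true));
y = rand(n,1) < p;                   % Bernoulli outcomes
beta = zeros(2,1);                   % starting values
for iter = 1:20
  p = 1 ./ (1 + exp(-x*beta));
  g = x' * (y - p) / n;              % score
  H = -(x' * (x .* (p .* (1 - p)))) / n;   % Hessian of the average log-likelihood
  beta = beta - H \ g;               % Newton step
end
[beta beta_true]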

4.2. Consistency of MLE


To show consistency of the MLE, we need to make explicit some assumptions.

x g
pwd


x

This implies that is an interior point of the parameter space

which is compact.

imixation is over

an open bounded subset of

Compact parameter space:

Max-

Uniform convergence:

8
x

 d d
g p 7ea

g d
wc

d f
4RC!4 A f 4eS
d f

We have suppressed

here for simplicity. This requires that almost sure con-

vergence holds for all possible parameter values. For a given parameter value,
an ordinary Law of Large Numbers will usually imply almost sure convergence to the limit of the expectation. Convergence for a single element of
the parameter space, combined with the assumption of a compact parameter
space, ensures uniform convergence.

4.2. CONSISTENCY OF MLE


7

is

has a unique maximum in its rst argument.

We will use these assumptions to show that

f
d

First,

This implies that

8 p d d
f
T

p 7RaG
d d

87d
d f
eS9

Identication:

g d
 d

continuous in

8
x

is continuous in

p eaG
d  d

Continuity:

55

certainly exists, since a continuous function has a maximum on a

compact set.


b
p e
d
p e
d
d
4R f f
d
4R f f A
p d d


Second, for any

by Jensens inequality (

is a concave function).

Now, the expectation on the RHS is

p
4)  p e f 4eRdd
ft d

f
f

'

p e
d
d
e f f

is the density function of the observations, and since the integral of

any density is 1 Therefore, since

  t
)

p e
d

since

d
p p e


d
4R f f A
Taking limits, this is

8 E p Ry4R
d f d f

or

pd pd pd d
 e Guq 7e G
except on a set of zero probability (by the uniform convergence assumption).

4.3. THE SCORE FUNCTION

56

By the identication assumption there is a unique maximizer, the inequal-

is a limit point of

at least one limit point). Since

(any sequence from a compact set has

is a maximizer, independent of

we must

have

a.s.


q1

Cd

Suppose that

f
d

f
d

 p d cc p  p Ra uq p 7Ra
 d
d
d
d d

p d d


ity is strict if

p  p Ra uq p  Ra
d
d
d d

These last two inequalities imply that

a.s.

pd 

Thus there is only one limit point, and it is equal to the true parameter value
with probability one. In other words,

8 8  p d  d f

as

This completes the proof of strong consistency of the MLE. One can use weaker

assumptions to prove weak consistency (convergence in probability to

) of

the MLE. This is omitted here. Note that almost sure convergence implies
convergence in probability.

4.3. The score function

Differentiability: Assume that sₙ(θ) is twice continuously differentiable in a neighborhood N(θ₀) of θ₀, at least when n is large enough.

To maximize the log-likelihood function, take derivatives:

gₙ(Y, θ) = D_θ sₙ(θ) = (1/n) Σₜ₌₁ⁿ D_θ ln f(yₜ|xₜ, θ) ≡ (1/n) Σₜ₌₁ⁿ gₜ(θ).

This is the score vector (with dim K × 1). Note that the score function has Y as an argument, which implies that it is a random function. Y (and any exogenous variables) will often be suppressed for clarity, but one should not forget that they are still there.

The ML estimator θ̂ sets the derivatives to zero:

gₙ(θ̂) = (1/n) Σₜ₌₁ⁿ gₜ(θ̂) ≡ 0.

We will show that E_θ[gₜ(θ)] = 0, ∀t. This is the expectation taken with respect to the density f(θ), not necessarily f(θ₀):

E_θ[gₜ(θ)] = ∫ [D_θ ln f(yₜ|xₜ, θ)] f(yₜ|xₜ, θ) dyₜ
           = ∫ [1/f(yₜ|xₜ, θ)] [D_θ f(yₜ|xₜ, θ)] f(yₜ|xₜ, θ) dyₜ
           = ∫ D_θ f(yₜ|xₜ, θ) dyₜ.

Given some regularity conditions on boundedness of D_θ f, we can switch the order of integration and differentiation, by the dominated convergence theorem. This gives

E_θ[gₜ(θ)] = D_θ ∫ f(yₜ|xₜ, θ) dyₜ = D_θ 1 = 0,

where we use the fact that the integral of the density is 1.

• So E_θ[gₜ(θ)] = 0: the expectation of the score vector is zero.
• This holds for all t, so it implies that E_θ[gₙ(Y, θ)] = 0 (a quick simulation check of this property follows).
8  4S)
d f 

r  4Rx
d 

So

4.4. Asymptotic normality of MLE


is twice continuously differentiable. Take
d

about the true value

r p

d f
4RC9

a rst order Taylors series expansion of

d H


Recall that we assume that

h p d E eH y  p RD
d
d P d


 4 d H

or with appropriate denitions


d

Assume

is invertible (well justify

p RHb1 I 6 R
d 
d

 h p ipd 1
d

) P
Ycyd

e
f

this in a minute). So

e
d
8 e
4) "g  p
d
 d 
g p eHd  h p d R
d
d

where

4.4. ASYMPTOTIC NORMALITY OF MLE

This is
d

d
 e

8 d
e




where the notation

R
8 d d
d f
eS $

I
@
eE
d
f

eS
d f
RD y
d

Now consider

59

d f
4R 9

Given that this is an average of terms, it should usually be the case that this
satises a strong law of large numbers (SLLN). Regularity conditions are a set
of assumptions that guarantee that this will happen. There are different sets
of assumptions that can be used to justify appeal to different SLLNs. For

d
6IeE A

example, the

must not be too strongly dependent over time, and

their variances must not become innite. We dont assume any particular set
here, since the appropriate assumptions will depend upon the particularities
of a given model. However, we assume that a SLLN applies.
d

e
j

. Also, by the above differentiability assumtion,

converges to the limit of its expectation:

p d
R

d f
f
d
 eS Q A  R

This matrix converges to a nite limit.

k
l

e
d

continuous in . Given this,

we have that

is consistent, and since

d
4R
 p e cP
d
)

Also, since we know that

is

4.4. ASYMPTOTIC NORMALITY OF MLE

60

Re-arranging orders of limits and differentiation, which is legitimate given


regularity conditions, we get
d

p  p ea
d
d
d f
E p eS A f

p d
 R


Weve already seen that

p  p RaG p 7R
d
d
d  d
maximizes the limiting objective function. Since there is a unique maxd

i.e.,

imizer, and by the assumption that

p ea
d
d f
4RC9

(which holds in the limit), then

is twice continuously differentiable

must be negative denite, and there-

fore of full rank. Therefore the previous inversion is justied, asymptotically,

h p d 1
d
T


p d f 
 RCb1

1
1

1h
)


8  eE5)
d

by consistency. To avoid this collapse to a degenerate

r.v. (a constant vector) we need to scale by

8
1

 p RS
d f
T

Note that

As such, it is reasonable to assume that


h

a CLT applies.

I
@
p e
d
f
I
@
p c f
d 
 gE
f
d f
eS

8 d 
g p RDb1 h

Weve already seen that

This is

8 d 
g p eHb1 I 6 p ea
d

Now consider

(4.4.1)

and we have

A generic CLT states that, for

4.4. ASYMPTOTIC NORMALITY OF MLE

61

a random vector that satises certain conditions,

f
E$a 

4f a
f

will be of the form of an average, scaled by


h

1
I
a @f 1

f
 a
h

For example, if the

8
c
p RDb1
d 

the properties of the

for example. Then the properties of

depend on

have nite variances and are


h

This is the case for

ally,

must satisfy depend on the case at hand. Usu-

f
a

The certain conditions that

not too strongly dependent, then a CLT for dependent processes will apply.


9  q
  p eSgn1
d f 

Supposing that a CLT applies, and noting that


h

p e gb1 eI ! p ea
d f 
d
p

where

p e fb1 er f

p d
 ea


e4

p R o  q
d

(4.4.2)

d f
d f
R p RCD p RCh 1

This can also be written as

p RCb1
d f 

is known as the information matrix.


d

p ea o 
d

Combining [4.4.1] and [4.4.2], we get


d

8
I p ea p Ra o I p R  d u h p ssd 1
d
d
d

The MLE estimator is asymptotically normally distributed.

we get

and asymptotically normally distributed if


h

h p d 1
d

where

-consistent


n6

(4.4.3)

is

of a parameter
d

D EFINITION 1 (CAN). An estimator

62

4.4. ASYMPTOTIC NORMALITY OF MLE

is a nite positive denite matrix.

There do exist, in special cases, estimators that are consistent such that

1
8 h p ssd 1
d
T
h

mally,

These are known as superconsistent estimators, since nor-

is the highest factor that we can multiply by an still get convergence

to a stable limiting distribution.

D EFINITION 2 (Asymptotic unbiasedness). An estimator

of a parameter

is asymptotically unbiased if

8
7d  d t tA  f

pd

(4.4.4)

Estimators that are CAN are asymptotically unbiased, though not all consistent
estimators are asymptotically unbiased. Such cases are unusual, though. An
example is

E XERCISE 4.5. Consider an estimator

with density

r
1  d If
I

d
p

 d  I f )
d

Show that this estimator is consistent but asymptotically biased. Also ask
yourself how you could dene an estimator that would have this density.

R s4RD "s4RD 1 4 s4e


d d

 d


14

allows us to write
d

appear to be correlated one may question the specication of the model). This
(This forms the basis for a specication test proposed by White: if the scores

I f!A8@@889If g $ 2
d 
 f
 
4 2
I
@ 1

v
w4
d
w
HR 4dRus4dREg
 s4RE
f )

has conditioned on prior information, so what was random in

4
I@ 1

f )

are uncorrelated for

is xed in .

since for

If

and

and multiply by

d d P d
R s4eE ue5Vs4RE
y d

d
eE

d
ueE 5VP 4eE
A
f4RE s4RE y us4RE A 4eE
t d
d
d
P d
f t d
$4RE y e $eE eE A
d
P f t d d

g


54

d 4

The scores

Now sum over


(4.6.1)





Now differentiate again:

f$4RE E4ec
t d
d
f t d
$eE
so
be short for

d
4RE

 f t d
$eE

Let

d f
c  g

 )

8 d
g4Ra  4R
d

We will show that

4.6. The information matrix equality


d

4.6. THE INFORMATION MATRIX EQUALITY

63

p Rac
d

p Ra p ea o p e
d
d
d
I
I
p ea x d o
d
x d
x
I
p ea x
d
I
x d

p d
 Ran
p d
 Ranx
p d
 Ranx
x

all valid. These include

From this we see that there are alternative ways to estimate

that are

to estimate the information matrix. Why not?

p d
f f
R r d Sg n r d  n 1  Ra
8
4d

I
@
R 
$d d 
d1

f
p Ra
d
o

and

p R
d
d

p d
 ea

Note, one cant use


x

p d
 ea x o d
x

We can use

To estimate the asymptotic variance, we need estimators of

I 6 p Ra o  d h p ipd 1
d
d


(4.6.3)
h

p ea o I 6 p e
d
d
Using this,

I 6 p e
d

8p

dd

 h p d 1
d


in particular, for

simplies to


7d

This holds for all

8 d
4Ra o  e
d

(4.6.2)
d

limits, we get

since all cross products between different periods expect to zero. Finally take
4.6. THE INFORMATION MATRIX EQUALITY

64

4.7. THE CRAMR-RAO LOWER BOUND

65

These are known as the inverse Hessian, outer product of the gradient (OPG) and
sandwich estimators, respectively. The sandwich form is the most robust, since
it coincides with the covariance estimator of the quasi-ML estimator.
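Given the matrix of score contributions and the Hessian evaluated at θ̂, the three estimators can be formed as in the following Octave sketch (our own code, reusing the bent-coin logit example of Section 4.1.1; it is not part of the distributed example programs):

% Inverse Hessian, OPG and sandwich estimators of the asymptotic variance.
n = 500;
x = [ones(n,1) rand(n,1)];
beta_true = [0.5; -1.0];
p = 1 ./ (1 + exp(-x*beta_true));
y = rand(n,1) < p;
beta = zeros(2,1);
for iter = 1:20                            % get the MLE by Newton's method
  p = 1 ./ (1 + exp(-x*beta));
  g = x' * (y - p) / n;
  H = -(x' * (x .* (p .* (1 - p)))) / n;
  beta = beta - H \ g;
end
p = 1 ./ (1 + exp(-x*beta));               % evaluate everything at the MLE
H = -(x' * (x .* (p .* (1 - p)))) / n;     % Hessian of the average log-likelihood
G = x .* (y - p);                          % n x K matrix of score contributions g_t'
I_hat = (G'*G) / n;                        % estimate of the information matrix
V_invhess = -inv(H);                       % inverse Hessian form
V_opg     = inv(I_hat);                    % outer product of the gradient form
V_sand    = inv(H) * I_hat * inv(H);       % sandwich form
% Estimated var-cov of betahat itself is V/n in each case:
[diag(V_invhess) diag(V_opg) diag(V_sand)] / n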

4.7. The Cramér-Rao lower bound


THEOREM 3. [Cramér-Rao Lower Bound] The limiting variance of a CAN estimator of θ₀, say θ̃, minus the inverse of the information matrix is a positive semidefinite matrix.

Proof: Since the estimator is CAN, it is asymptotically unbiased, so

lim_{n→∞} E_θ(θ̃ − θ) = 0.

Differentiate wrt θ′:

D_θ′ lim_{n→∞} E_θ(θ̃ − θ) = lim_{n→∞} ∫ D_θ′ [f(Y, θ)(θ̃ − θ)] dy = 0

(this is a K × K matrix of zeros). Noting that D_θ′ f(Y, θ) = f(θ) D_θ′ ln f(θ), we can write

lim_{n→∞} ∫ (θ̃ − θ) f(θ) D_θ′ ln f(θ) dy + lim_{n→∞} ∫ f(Y, θ) D_θ′(θ̃ − θ) dy = 0.

Now note that D_θ′(θ̃ − θ) = −I_K, and ∫ f(Y, θ)(−I_K) dy = −I_K. With this we have

lim_{n→∞} ∫ (θ̃ − θ) f(θ) D_θ′ ln f(θ) dy = I_K.

Playing with powers of n we get

lim_{n→∞} ∫ √n(θ̃ − θ) (1/√n) [n D_θ′ ln f(θ)] f(θ) dy = I_K.

Note that the bracketed part is just the transpose of the score vector, g(θ), so we can write

lim_{n→∞} E_θ[ √n(θ̃ − θ) √n g(θ)′ ] = I_K.

This means that the covariance of the score function with √n(θ̃ − θ), for θ̃ any CAN estimator, is an identity matrix. Using this, suppose the variance of √n(θ̃ − θ) tends to V_∞(θ̃). Therefore,

(4.7.1)   V_∞ [ √n(θ̃ − θ) ; √n g(θ) ] = [ V_∞(θ̃)  I_K ; I_K  I_∞(θ) ].

Since this is a covariance matrix, it is positive semi-definite. Therefore, for any K-vector α,

[ α′  −α′I_∞⁻¹(θ) ] [ V_∞(θ̃)  I_K ; I_K  I_∞(θ) ] [ α ; −I_∞(θ)⁻¹α ] ≥ 0.

This simplifies to

α′ [ V_∞(θ̃) − I_∞⁻¹(θ) ] α ≥ 0.

Since α is arbitrary, V_∞(θ̃) − I_∞⁻¹(θ) is positive semidefinite. This concludes the proof.

This means that I_∞⁻¹(θ) is a lower bound for the asymptotic variance of a CAN estimator.

DEFINITION 4.7.1. (Asymptotic efficiency) Given two CAN estimators of a parameter θ₀, say θ̃ and θ̂, θ̂ is asymptotically efficient with respect to θ̃ if V_∞(θ̃) − V_∞(θ̂) is a positive semidefinite matrix.

A direct proof of asymptotic efficiency of an estimator is infeasible, but if one can show that the asymptotic variance is equal to the inverse of the information matrix, then the estimator is asymptotically efficient. In particular, the MLE is asymptotically efficient.

Summary of MLE

• Consistent
• Asymptotically normal (CAN)
• Asymptotically unbiased
• Asymptotically efficient
• This is for general MLE: we haven't specified the distribution or the linearity/nonlinearity of the estimator

EXERCISES

68

Exercises
(1) Consider coin tossing with a single possibly biased coin. The density function for the random variable $y$ is
\[ f_Y(y, p_0) = p_0^{\,y}(1 - p_0)^{1-y}, \qquad y \in \{0, 1\}. \]
Suppose that we have a sample of size $n$. We know from above that the ML estimator is $\hat p_0 = \bar y$. We also know from the theory above that
\[ \sqrt{n}\left(\bar y - p_0\right) \overset{d}{\to} N\left[0,\ \mathcal{J}_\infty(p_0)^{-1}\mathcal{I}_\infty(p_0)\mathcal{J}_\infty(p_0)^{-1}\right]. \]
a) Find the analytical expressions for $\mathcal{J}_\infty(p_0)$ and $\mathcal{I}_\infty(p_0)$ for this problem.
b) Write an Octave program that does a Monte Carlo study that shows that $\sqrt{n}(\bar y - p_0)$ is approximately normally distributed when $n$ is large. Please give me histograms that show the sampling frequency of $\sqrt{n}(\bar y - p_0)$ for several values of $n$.
(2) Consider the model $y_t = x_t'\beta + \alpha\epsilon_t$ where the errors follow the Cauchy (Student-t with 1 degree of freedom) density. So
\[ f(\epsilon_t) = \frac{1}{\pi\left(1 + \epsilon_t^2\right)}, \qquad -\infty < \epsilon_t < \infty. \]
The Cauchy density has a shape similar to a normal density, but with much thicker tails. Thus, extremely small and large errors occur much more frequently with this density than would happen if the errors were normally distributed. Find the score function $g_n(\theta)$ where $\theta = (\beta'\ \alpha)'$.
(3) Consider the classical linear regression model $y_t = x_t'\beta + \sigma\epsilon_t$ where $\epsilon_t \sim IIN(0,1)$. Find the score function $g_n(\theta)$ where $\theta = (\beta'\ \sigma)'$.
(4) Compare the first order conditions that define the ML estimators of problems 2 and 3 and interpret the differences. Why are the first order conditions that define an efficient estimator different in the two cases?
CHAPTER 5

Asymptotic properties of the least squares estimator


The OLS estimator under the classical assumptions is unbiased and BLUE, for all sample sizes. Now let's see what happens when the sample size tends to infinity.

5.1. Consistency

\[ \hat\beta = (X'X)^{-1}X'y = (X'X)^{-1}X'\left(X\beta_0 + \varepsilon\right) = \beta_0 + (X'X)^{-1}X'\varepsilon
   = \beta_0 + \left(\frac{X'X}{n}\right)^{-1}\frac{X'\varepsilon}{n}. \]
Consider the last two terms. By assumption $\lim_{n\to\infty}\left(\frac{X'X}{n}\right) = Q_X$, which implies that $\lim_{n\to\infty}\left(\frac{X'X}{n}\right)^{-1} = Q_X^{-1}$, since the inverse of a nonsingular matrix is a continuous function of the elements of the matrix. Considering $\frac{X'\varepsilon}{n}$,
\[ \frac{X'\varepsilon}{n} = \frac{1}{n}\sum_{t=1}^n x_t\varepsilon_t. \]
Each $x_t\varepsilon_t$ has expectation zero, so
\[ E\left(\frac{X'\varepsilon}{n}\right) = 0. \]
The variance of each term is
\[ V\left(x_t\varepsilon_t\right) = x_t x_t'\sigma^2. \]
As long as these are finite, and given a technical condition (for application of LLNs and CLTs, of which there are very many to choose from, I'm going to avoid the technicalities; basically, as long as terms of an average have finite variances and are not too strongly dependent, one will be able to find a LLN or CLT to apply), the Kolmogorov SLLN applies, so
\[ \frac{1}{n}\sum_{t=1}^n x_t\varepsilon_t \overset{a.s.}{\to} 0. \]
This implies that
\[ \hat\beta \overset{a.s.}{\to} \beta_0. \]
This is the property of strong consistency: the estimator converges almost surely to the true value.
- The consistency proof does not use the normality assumption.
- Remember that almost sure convergence implies convergence in probability.

5.2. Asymptotic normality

We've seen that the OLS estimator is normally distributed under the assumption of normal errors. If the error distribution is unknown, we of course don't know the distribution of the estimator. However, we can get asymptotic results. Assuming the distribution of $\varepsilon$ is unknown, but the other classical assumptions hold:
\[ \hat\beta = \beta_0 + (X'X)^{-1}X'\varepsilon \]
\[ \hat\beta - \beta_0 = (X'X)^{-1}X'\varepsilon \]
\[ \sqrt{n}\left(\hat\beta - \beta_0\right) = \left(\frac{X'X}{n}\right)^{-1}\frac{X'\varepsilon}{\sqrt{n}}. \]
- Now as before, $\left(\frac{X'X}{n}\right)^{-1} \to Q_X^{-1}$.
- Considering $\frac{X'\varepsilon}{\sqrt{n}}$, the limit of the variance is
\[ \lim_{n\to\infty} E\left(\frac{X'\varepsilon\varepsilon'X}{n}\right) = \lim_{n\to\infty}\frac{\sigma_0^2 X'X}{n} = \sigma_0^2 Q_X. \]
The mean is of course zero. To get asymptotic normality, we need to apply a CLT. We assume one (for instance, the Lindeberg-Feller CLT) holds, so
\[ \frac{X'\varepsilon}{\sqrt{n}} \overset{d}{\to} N\left(0, \sigma_0^2 Q_X\right). \]
Therefore,
\[ \sqrt{n}\left(\hat\beta - \beta_0\right) \overset{d}{\to} N\left(0, \sigma_0^2 Q_X^{-1}\right). \]
- In summary, the OLS estimator is normally distributed in small and large samples if $\varepsilon$ is normally distributed. If $\varepsilon$ is not normally distributed, $\hat\beta$ is asymptotically normally distributed when a CLT can be applied.
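The asymptotic normality result can be checked by simulation. The following Octave sketch (added for illustration; the sample size, number of replications and uniform error distribution are arbitrary choices) draws many samples with nonnormal errors and plots the distribution of the standardized slope estimate, which should look approximately Gaussian.

  % Minimal Monte Carlo sketch: OLS with nonnormal (uniform) errors.
  n = 200; reps = 1000; beta0 = [1; 0.5];
  z = zeros(reps, 1);
  for j = 1:reps
    x = [ones(n,1), rand(n,1)];
    e = sqrt(12)*(rand(n,1) - 0.5);      % uniform errors with mean 0, variance 1
    y = x*beta0 + e;
    b = (x'*x)\(x'*y);                   % OLS estimate
    z(j) = sqrt(n)*(b(2) - beta0(2));    % standardized slope coefficient
  end
  hist(z, 30);                           % roughly bell-shaped despite nonnormal errors
  printf("mean %f, std %f\n", mean(z), std(z));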

5.3. Asymptotic efficiency

The least squares objective function is
\[ s(\beta) = \sum_{t=1}^n \left(y_t - x_t'\beta\right)^2. \]
Supposing that $\varepsilon$ is normally distributed, the model is
\[ y = X\beta_0 + \varepsilon, \qquad \varepsilon \sim N\left(0, \sigma_0^2 I_n\right), \]
so
\[ f(\varepsilon) = \prod_{t=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{\varepsilon_t^2}{2\sigma^2}\right). \]
The joint density for $y$ can be constructed using a change of variables. We have $\varepsilon = y - X\beta$, so $\frac{\partial\varepsilon}{\partial y'} = I_n$ and $\left|\frac{\partial\varepsilon}{\partial y'}\right| = 1$, so
\[ f(y) = \prod_{t=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{\left(y_t - x_t'\beta\right)^2}{2\sigma^2}\right). \]
Taking logs,
\[ \ln L(\beta, \sigma) = -n\ln\sqrt{2\pi} - n\ln\sigma - \sum_{t=1}^n \frac{\left(y_t - x_t'\beta\right)^2}{2\sigma^2}. \]
It's clear that the fonc for the MLE of $\beta_0$ are the same as the fonc for OLS (up to multiplication by a constant), so the estimators are the same, under the present assumptions. Therefore, their properties are the same. In particular, under the classical assumptions with normality, the OLS estimator $\hat\beta$ is asymptotically efficient.

As we'll see later, it will be possible to use (iterated) linear estimation methods and still achieve asymptotic efficiency even if $Var(\varepsilon) \neq \sigma^2 I_n$, as long as $\varepsilon$ is still normally distributed. This is not the case if $\varepsilon$ is nonnormal. In general with nonnormal errors it will be necessary to use nonlinear estimation methods to achieve asymptotically efficient estimation.

CHAPTER 6

Restrictions and hypothesis tests


6.1. Exact linear restrictions
In many cases, economic theory suggests restrictions on the parameters of a model. For example, a demand function is supposed to be homogeneous of degree zero in prices and income. If we have a Cobb-Douglas (log-linear) model,
\[ \ln q = \beta_0 + \beta_1\ln p_1 + \beta_2\ln p_2 + \beta_3\ln m + \varepsilon, \]
then we need that
\[ k^0\ln q = \beta_0 + \beta_1\ln kp_1 + \beta_2\ln kp_2 + \beta_3\ln km + \varepsilon, \]
so
\[ \beta_1\ln p_1 + \beta_2\ln p_2 + \beta_3\ln m
   = \beta_1\ln kp_1 + \beta_2\ln kp_2 + \beta_3\ln km
   = \left(\ln k\right)\left(\beta_1 + \beta_2 + \beta_3\right) + \beta_1\ln p_1 + \beta_2\ln p_2 + \beta_3\ln m. \]
The only way to guarantee this for arbitrary $k$ is to set
\[ \beta_1 + \beta_2 + \beta_3 = 0, \]
which is a parameter restriction. In particular, this is a linear equality restriction, which is probably the most commonly encountered case.
6.1.1. Imposition. The general formulation of linear equality restrictions is the model
\[ y = X\beta + \varepsilon, \qquad R\beta = r, \]
where $R$ is a $Q \times K$ matrix, $Q < K$, and $r$ is a $Q \times 1$ vector of constants.
- We assume $R$ is of rank $Q$, so that there are no redundant restrictions.
- We also assume that there exists a $\beta$ that satisfies the restrictions: they aren't infeasible.

Let's consider how to estimate $\beta$ subject to the restrictions $R\beta = r$. The most obvious approach is to set up the Lagrangean
\[ \min_\beta\ s(\beta) = \frac{1}{n}\left(y - X\beta\right)'\left(y - X\beta\right) + 2\lambda'\left(R\beta - r\right). \]
The Lagrange multipliers are scaled by 2, which makes things less messy. The fonc are
\[ D_\beta s(\hat\beta_R, \hat\lambda) = -2X'y + 2X'X\hat\beta_R + 2R'\hat\lambda \equiv 0 \]
\[ D_\lambda s(\hat\beta_R, \hat\lambda) = R\hat\beta_R - r \equiv 0, \]
which can be written as
\[ \begin{bmatrix} X'X & R' \\ R & 0 \end{bmatrix}\begin{bmatrix} \hat\beta_R \\ \hat\lambda \end{bmatrix}
   = \begin{bmatrix} X'y \\ r \end{bmatrix}. \]
We get
\[ \begin{bmatrix} \hat\beta_R \\ \hat\lambda \end{bmatrix}
   = \begin{bmatrix} X'X & R' \\ R & 0 \end{bmatrix}^{-1}\begin{bmatrix} X'y \\ r \end{bmatrix}. \]
For the masochists: stepwise inversion. Define $P = R(X'X)^{-1}R'$. Working through the partitioned inverse, one obtains
\[ \begin{bmatrix} X'X & R' \\ R & 0 \end{bmatrix}^{-1}
   = \begin{bmatrix} (X'X)^{-1} - (X'X)^{-1}R'P^{-1}R(X'X)^{-1} & (X'X)^{-1}R'P^{-1} \\ P^{-1}R(X'X)^{-1} & -P^{-1} \end{bmatrix}, \]
so (everyone should start paying attention again)
\[ \hat\beta_R = \hat\beta - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}\left(R\hat\beta - r\right) \]
\[ \hat\lambda = \left[R(X'X)^{-1}R'\right]^{-1}\left(R\hat\beta - r\right). \]
The fact that $\hat\beta_R$ and $\hat\lambda$ are linear functions of $\hat\beta$ makes it easy to determine their distributions, since the distribution of $\hat\beta$ is already known. Recall that for $x$ a random vector, and for $A$ and $b$ a matrix and vector of constants, respectively, $Var(Ax + b) = A\,Var(x)\,A'$.

Though this is the obvious way to go about finding the restricted estimator, an easier way, if the number of restrictions is small, is to impose them by substitution. Write
\[ y = X_1\beta_1 + X_2\beta_2 + \varepsilon \]
\[ \begin{bmatrix} R_1 & R_2 \end{bmatrix}\begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} = r, \]
where $R_1$ is $Q \times Q$ nonsingular. Supposing the $Q$ restrictions are linearly independent, one can always make $R_1$ nonsingular by reorganizing the columns of $X$. Then
\[ \beta_1 = R_1^{-1}\left(r - R_2\beta_2\right). \]
Substitute this into the model
\[ y = X_1 R_1^{-1}\left(r - R_2\beta_2\right) + X_2\beta_2 + \varepsilon \]
\[ y - X_1 R_1^{-1}r = \left[X_2 - X_1 R_1^{-1}R_2\right]\beta_2 + \varepsilon, \]
or with the appropriate definitions,
\[ y_R = X_R\beta_2 + \varepsilon. \]
This model satisfies the classical assumptions, supposing the restriction is true. One can estimate by OLS. The variance of $\hat\beta_2$ is as before
\[ V(\hat\beta_2) = \left(X_R'X_R\right)^{-1}\sigma_0^2, \]
and the estimator is
\[ \hat V(\hat\beta_2) = \left(X_R'X_R\right)^{-1}\hat\sigma^2, \]
where one estimates $\sigma_0^2$ in the normal way, using the restricted model, i.e.,
\[ \hat\sigma_0^2 = \frac{\left(y_R - X_R\hat\beta_2\right)'\left(y_R - X_R\hat\beta_2\right)}{n - \left(K - Q\right)}. \]
To recover $\hat\beta_1$, use the restriction. To find the variance of $\hat\beta_1$, use the fact that it is a linear function of $\hat\beta_2$, so
\[ V(\hat\beta_1) = R_1^{-1}R_2\, V(\hat\beta_2)\, R_2'\left(R_1^{-1}\right)'. \]

6.1.2. Properties of the restricted estimator. We have that
\[ \hat\beta_R = \hat\beta - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}\left(R\hat\beta - r\right) \]
\[ = \beta + (X'X)^{-1}X'\varepsilon
   + (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}\left(r - R\beta\right)
   - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}X'\varepsilon, \]
so
\[ \hat\beta_R - \beta = (X'X)^{-1}X'\varepsilon
   + (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}\left(r - R\beta\right)
   - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}X'\varepsilon. \]
Mean squared error is
\[ MSE(\hat\beta_R) = E\left(\hat\beta_R - \beta\right)\left(\hat\beta_R - \beta\right)'. \]
Noting that the crosses between the second term and the other terms expect to zero, and that the cross of the first and third has a cancellation with the square of the third, we obtain
\[ MSE(\hat\beta_R) = (X'X)^{-1}\sigma^2
   + (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}\left(r - R\beta\right)\left(r - R\beta\right)'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}
   - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}\sigma^2. \]
So, the first term is the OLS covariance. The second term is PSD, and the third term is NSD.
- If the restriction is true, the second term is 0, so we are better off. True restrictions improve efficiency of estimation.
- If the restriction is false, we may be better or worse off, in terms of MSE, depending on the magnitudes of $r - R\beta$ and $\sigma^2$.
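To make the formulas of section 6.1.1 concrete, here is a small Octave sketch (added for illustration; the simulated data and the particular restriction are arbitrary assumptions) that computes the restricted estimator from the closed-form expression and verifies that the restriction holds exactly.

  % Minimal sketch: restricted OLS via
  % beta_R = beta_hat - (X'X)^{-1} R' [R (X'X)^{-1} R']^{-1} (R beta_hat - r)
  n = 100;
  x = [ones(n,1), randn(n,2)];
  beta0 = [1; 2; -1];
  y = x*beta0 + randn(n,1);
  R = [0 1 1];  r = 1;                       % restriction: beta_2 + beta_3 = 1
  b = (x'*x)\(x'*y);                         % unrestricted OLS
  xxi = inv(x'*x);
  b_r = b - xxi*R'*((R*xxi*R')\(R*b - r));   % restricted estimator, closed form
  printf("R*b_r - r = %g (should be zero)\n", R*b_r - r);

The same estimate could be obtained by the substitution method of the text; the closed form is convenient when the restrictions are easier to state as $R\beta = r$ than to solve out by hand.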
6.2. Testing
In many cases, one wishes to test economic theories. If theory suggests parameter restrictions, as in the above homogeneity example, one can test theory
by testing parameter restrictions. A number of tests are available.
6.2.1. t-test. Suppose one has the model
\[ y = X\beta + \varepsilon, \]
and one wishes to test the single restriction $H_0: R\beta = r$ vs. $H_A: R\beta \neq r$. Under $H_0$, with normality of the errors,
\[ R\hat\beta - r \sim N\left(0,\ R(X'X)^{-1}R'\sigma_0^2\right), \]
so
\[ \frac{R\hat\beta - r}{\sqrt{R(X'X)^{-1}R'\sigma_0^2}} \sim N(0,1). \]
The problem is that $\sigma_0^2$ is unknown. One could use the consistent estimator $\hat\sigma_0^2$ in place of $\sigma_0^2$, but the test would only be valid asymptotically in this case.

PROPOSITION 4.
(6.2.1) \[ \frac{N(0,1)}{\sqrt{\chi^2(q)/q}} \sim t(q), \]
as long as the $N(0,1)$ and the $\chi^2(q)$ are independent.

We need a few results on the $\chi^2$ distribution.

PROPOSITION 5. If $x \sim N(\mu, I_n)$ is a vector of $n$ independent r.v.'s, then
(6.2.2) \[ x'x \sim \chi^2(n, \lambda), \]
where $\lambda = \sum_i \mu_i^2 = \mu'\mu$ is the noncentrality parameter.

When a $\chi^2$ r.v. has the noncentrality parameter equal to zero, it is referred to as a central $\chi^2$ r.v., and its distribution is written as $\chi^2(n)$, suppressing the noncentrality parameter.

PROPOSITION 6. If the $n$ dimensional random vector $x \sim N(0, V)$, then
\[ x'V^{-1}x \sim \chi^2(n). \]
We'll prove this one as an indication of how the following unproven propositions could be proved. Proof: Factor $V^{-1}$ as $P'P$ (this is the Cholesky factorization). Then consider $y = Px$. We have
\[ y \sim N\left(0, PVP'\right), \]
but $P'P = V^{-1}$ implies $V = P^{-1}(P')^{-1}$, so $PVP' = I_n$ and thus $y \sim N(0, I_n)$. Thus $y'y \sim \chi^2(n)$, but
\[ y'y = x'P'Px = x'V^{-1}x, \]
and we get the result we wanted.

A more general proposition which implies this result is

PROPOSITION 7. If the $n$ dimensional random vector $x \sim N(0, V)$, then
(6.2.3) \[ x'Bx \sim \chi^2\left(\rho(B)\right) \]
if and only if $BV$ is idempotent.

An immediate consequence is

PROPOSITION 8. If the random vector (of dimension $n$) $x \sim N(0, I)$, and $B$ is idempotent with rank $r$, then
(6.2.4) \[ x'Bx \sim \chi^2(r). \]

Consider the random variable
\[ \frac{\hat\varepsilon'\hat\varepsilon}{\sigma_0^2} = \frac{\varepsilon'M_X\varepsilon}{\sigma_0^2}
   = \left(\frac{\varepsilon}{\sigma_0}\right)'M_X\left(\frac{\varepsilon}{\sigma_0}\right) \sim \chi^2(n - K). \]

PROPOSITION 9. If the random vector (of dimension $n$) $x \sim N(0, I)$, then $Ax$ and $x'Bx$ are independent if $AB = 0$.

Now consider (remember that we have only one restriction in this case)
\[ t = \frac{\dfrac{R\hat\beta - r}{\sqrt{\sigma_0^2 R(X'X)^{-1}R'}}}{\sqrt{\dfrac{\hat\varepsilon'\hat\varepsilon}{(n-K)\sigma_0^2}}}
     = \frac{R\hat\beta - r}{\sqrt{\hat\sigma_0^2 R(X'X)^{-1}R'}}. \]
This will have the $t(n-K)$ distribution if $\hat\beta$ and $\hat\varepsilon'\hat\varepsilon$ are independent. But $\hat\beta = \beta + (X'X)^{-1}X'\varepsilon$ and
\[ (X'X)^{-1}X'M_X = 0, \]
so
\[ t = \frac{R\hat\beta - r}{\hat\sigma_{R\hat\beta}} \sim t(n-K). \]
In particular, for the commonly encountered test of significance of an individual coefficient, for which $H_0: \beta_i = 0$ vs. $H_0: \beta_i \neq 0$, the test statistic is
\[ t = \frac{\hat\beta_i}{\hat\sigma_{\hat\beta_i}} \sim t(n-K). \]
- Note: the t-test is strictly valid only if the errors are actually normally distributed. If one has nonnormal errors, one could use the above asymptotic result to justify taking critical values from the $N(0,1)$ distribution, since $t(n-K) \overset{d}{\to} N(0,1)$ as $n \to \infty$. In practice, a conservative procedure is to take critical values from the $t$ distribution if nonnormality is suspected. This will reject $H_0$ less often since the $t$ distribution is fatter-tailed than is the normal.

6.2.2. F test. The F test allows testing multiple restrictions jointly.

PROPOSITION 10. If $x \sim \chi^2(r)$ and $y \sim \chi^2(s)$, then
(6.2.5) \[ \frac{x/r}{y/s} \sim F(r, s), \]
provided that $x$ and $y$ are independent.

PROPOSITION 11. If the random vector (of dimension $n$) $x \sim N(0, I)$, then $x'Ax$ and $x'Bx$ are independent if $AB = 0$.

Using these results, and previous results on the $\chi^2$ distribution, it is simple to show that the following statistic has the $F$ distribution:
\[ F = \frac{\left(R\hat\beta - r\right)'\left[R(X'X)^{-1}R'\right]^{-1}\left(R\hat\beta - r\right)}{q\,\hat\sigma_0^2} \sim F(q, n-K). \]
A numerically equivalent expression is
\[ \frac{\left(ESS_R - ESS_U\right)/q}{ESS_U/(n-K)} \sim F(q, n-K). \]
- Note: The F test is strictly valid only if the errors are truly normally distributed. The following tests will be appropriate when one cannot assume normally distributed errors.
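For concreteness, the following Octave sketch (added for illustration; the simulated regression and the particular restrictions are arbitrary assumptions) computes the t statistic for a single coefficient and the F statistic for a joint restriction, using the formulas above.

  % Minimal sketch: t test of H0: beta_2 = 0 and F test of H0: R*beta = r.
  n = 100;
  x = [ones(n,1), randn(n,2)];
  y = x*[1; 0; 0.5] + randn(n,1);
  [n, K] = size(x);
  b = (x'*x)\(x'*y);
  e = y - x*b;
  sig2 = (e'*e)/(n - K);                   % estimate of sigma^2
  vb = sig2*inv(x'*x);                     % estimated covariance of b
  tstat = b(2)/sqrt(vb(2,2));              % t(n-K) under H0: beta_2 = 0
  R = [0 1 0; 0 0 1];  r = [0; 0.5];  q = rows(R);
  F = (R*b - r)'*inv(R*inv(x'*x)*R')*(R*b - r)/(q*sig2);   % F(q, n-K) under H0
  printf("t = %f, F = %f\n", tstat, F);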
6.2.3. Wald-type tests. The Wald principle is based on the idea that if a restriction is true, the unrestricted model should approximately satisfy the restriction. Given that the least squares estimator is asymptotically normally distributed:
\[ \sqrt{n}\left(\hat\beta - \beta_0\right) \overset{d}{\to} N\left(0, \sigma_0^2 Q_X^{-1}\right), \]
then under $H_0: R\beta_0 = r$ we have
\[ \sqrt{n}\left(R\hat\beta - r\right) \overset{d}{\to} N\left(0, \sigma_0^2 R Q_X^{-1}R'\right), \]
so by Proposition [6]
\[ n\left(R\hat\beta - r\right)'\left[\sigma_0^2 R Q_X^{-1}R'\right]^{-1}\left(R\hat\beta - r\right) \overset{d}{\to} \chi^2(q). \]
Note that $Q_X^{-1}$ and $\sigma_0^2$ are not observable. The test statistic we use substitutes the consistent estimators. Use $(X'X/n)^{-1}$ as the consistent estimator of $Q_X^{-1}$. With this, there is a cancellation of $n$'s, and the statistic to use is
\[ \left(R\hat\beta - r\right)'\left[\hat\sigma_0^2 R(X'X)^{-1}R'\right]^{-1}\left(R\hat\beta - r\right) \overset{d}{\to} \chi^2(q). \]
- The Wald test is a simple way to test restrictions without having to estimate the restricted model.
- Note that this formula is similar to one of the formulae provided for the F test.
6.2.4. Score-type tests (Rao tests, Lagrange multiplier tests). In some cases, an unrestricted model may be nonlinear in the parameters, but the model is linear in the parameters under the null hypothesis. For example, the model
\[ y = \left(X\beta\right)^\gamma + \varepsilon \]
is nonlinear in $\beta$ and $\gamma$, but is linear in $\beta$ under $H_0: \gamma = 1$. Estimation of nonlinear models is a bit more complicated, so one might prefer to have a test based upon the restricted, linear model. The score test is useful in this situation.
- Score-type tests are based upon the general principle that the gradient vector of the unrestricted model, evaluated at the restricted estimate, should be asymptotically normally distributed with mean zero, if the restrictions are true. The original development was for ML estimation, but the principle is valid for a wide variety of estimation methods.

We have seen that
\[ \hat\lambda = \left[R(X'X)^{-1}R'\right]^{-1}\left(R\hat\beta - r\right), \]
so
\[ \frac{\hat\lambda}{\sqrt{n}} = \left[R\left(\frac{X'X}{n}\right)^{-1}R'\right]^{-1}\sqrt{n}\left(R\hat\beta - r\right). \]
Given that $\sqrt{n}(R\hat\beta - r) \overset{d}{\to} N(0, \sigma_0^2 R Q_X^{-1}R')$ under the null hypothesis, we obtain
\[ \frac{\hat\lambda}{\sqrt{n}} \overset{d}{\to} N\left(0,\ \sigma_0^2\left[R Q_X^{-1}R'\right]^{-1}\right), \]
so
\[ \left(\frac{\hat\lambda}{\sqrt{n}}\right)'\frac{R Q_X^{-1}R'}{\sigma_0^2}\left(\frac{\hat\lambda}{\sqrt{n}}\right) \overset{d}{\to} \chi^2(q). \]
However,
\[ \frac{1}{n}R Q_X^{-1}R' \approx R(X'X)^{-1}R', \]
so there is a cancellation and we get
\[ \frac{\hat\lambda' R(X'X)^{-1}R'\hat\lambda}{\sigma_0^2} \overset{d}{\to} \chi^2(q), \]
since the powers of $n$ cancel. To get a usable test statistic, substitute a consistent estimator of $\sigma_0^2$.
- This makes it clear why the test is sometimes referred to as a Lagrange multiplier test. It may seem that one needs the actual Lagrange multipliers to calculate this. If we impose the restrictions by substitution, these are not available. Note that the test can be written as
\[ \frac{\left(R'\hat\lambda\right)'(X'X)^{-1}R'\hat\lambda}{\sigma_0^2} \overset{d}{\to} \chi^2(q). \]
However, we can use the fonc for the restricted estimator:
\[ -X'y + X'X\hat\beta_R + R'\hat\lambda \equiv 0 \]
to get that
\[ R'\hat\lambda = X'y - X'X\hat\beta_R = X'\hat\varepsilon_R. \]
Substituting this into the above, we get
\[ \frac{\hat\varepsilon_R'X(X'X)^{-1}X'\hat\varepsilon_R}{\sigma_0^2} \overset{d}{\to} \chi^2(q), \]
but this is simply
\[ \hat\varepsilon_R'\frac{P_X}{\sigma_0^2}\hat\varepsilon_R \overset{d}{\to} \chi^2(q). \]
To see why the test is also known as a score test, note that the fonc for restricted least squares
\[ -X'y + X'X\hat\beta_R + R'\hat\lambda \equiv 0 \]
give us
\[ R'\hat\lambda = X'y - X'X\hat\beta_R, \]
and the rhs is simply the gradient (score) of the unrestricted model, evaluated at the restricted estimator. The scores evaluated at the unrestricted estimate are identically zero. The logic behind the score test is that the scores evaluated at the restricted estimate should be approximately zero, if the restriction is true. The test is also known as a Rao test, since P. Rao first proposed it in 1948.
6.2.5. Likelihood ratio-type tests. The Wald test can be calculated using the unrestricted model. The score test can be calculated using only the restricted model. The likelihood ratio test, on the other hand, uses both the restricted and the unrestricted estimators. The test statistic is
\[ LR = 2\left(\ln L(\hat\theta) - \ln L(\tilde\theta)\right), \]
where $\hat\theta$ is the unrestricted estimate and $\tilde\theta$ is the restricted estimate. To show that it is asymptotically $\chi^2$, take a second order Taylor's series expansion of $\ln L(\tilde\theta)$ about $\hat\theta$:
\[ \ln L(\tilde\theta) \simeq \ln L(\hat\theta) + \frac{n}{2}\left(\tilde\theta - \hat\theta\right)'H(\hat\theta)\left(\tilde\theta - \hat\theta\right) \]
(note, the first order term drops out since $D_\theta\ln L(\hat\theta) \equiv 0$ by the fonc, and we need to multiply the second-order term by $n$ since $H(\theta)$ is defined in terms of $\frac{1}{n}\ln L(\theta)$), so
\[ LR \simeq -n\left(\tilde\theta - \hat\theta\right)'H(\hat\theta)\left(\tilde\theta - \hat\theta\right). \]
As $n \to \infty$, $H(\hat\theta) \to H_\infty(\theta_0) = -\mathcal{I}_\infty(\theta_0)$, by the information matrix equality. So
\[ LR \simeq n\left(\tilde\theta - \hat\theta\right)'\mathcal{I}_\infty(\theta_0)\left(\tilde\theta - \hat\theta\right). \]
We also have that, from [??],
\[ \sqrt{n}\left(\hat\theta - \theta_0\right) \simeq \mathcal{I}_\infty(\theta_0)^{-1}\, n^{1/2}g(\theta_0). \]
An analogous result for the restricted estimator is (this is unproven here; to prove this set up the Lagrangean for MLE subject to $R\beta = r$ and manipulate the first order conditions):
\[ \sqrt{n}\left(\tilde\theta - \theta_0\right) \simeq \mathcal{I}_\infty(\theta_0)^{-1}
   \left(I_n - R'\left[R\mathcal{I}_\infty(\theta_0)^{-1}R'\right]^{-1}R\,\mathcal{I}_\infty(\theta_0)^{-1}\right)n^{1/2}g(\theta_0). \]
Combining the last two equations,
\[ \sqrt{n}\left(\tilde\theta - \hat\theta\right) \simeq -n^{1/2}\mathcal{I}_\infty(\theta_0)^{-1}R'\left[R\mathcal{I}_\infty(\theta_0)^{-1}R'\right]^{-1}R\,\mathcal{I}_\infty(\theta_0)^{-1}g(\theta_0), \]
so, substituting into [??],
\[ LR \simeq \left[n^{1/2}g(\theta_0)'\mathcal{I}_\infty(\theta_0)^{-1}R'\right]\left[R\mathcal{I}_\infty(\theta_0)^{-1}R'\right]^{-1}\left[R\mathcal{I}_\infty(\theta_0)^{-1}n^{1/2}g(\theta_0)\right]. \]
But since
\[ n^{1/2}g(\theta_0) \overset{d}{\to} N\left(0, \mathcal{I}_\infty(\theta_0)\right), \]
the linear function
\[ R\mathcal{I}_\infty(\theta_0)^{-1}n^{1/2}g(\theta_0) \overset{d}{\to} N\left(0, R\mathcal{I}_\infty(\theta_0)^{-1}R'\right). \]
We can see that LR is a quadratic form of this rv, with the inverse of its variance in the middle, so
\[ LR \overset{d}{\to} \chi^2(q). \]
6.3. The asymptotic equivalence of the LR, Wald and score tests

We have seen that the three tests all converge to $\chi^2$ random variables. In fact, they all converge to the same $\chi^2(q)$ rv, under the null hypothesis. We'll show that the Wald and LR tests are asymptotically equivalent. We have seen that the Wald test is asymptotically equivalent to
\[ W \simeq n\left(R\hat\beta - r\right)'\left[\sigma_0^2 R Q_X^{-1}R'\right]^{-1}\left(R\hat\beta - r\right) \overset{d}{\to} \chi^2(q). \]
Using $\hat\beta - \beta_0 = (X'X)^{-1}X'\varepsilon$ and $R\hat\beta - r = R(\hat\beta - \beta_0)$, we get
\[ \sqrt{n}R\left(\hat\beta - \beta_0\right) = \sqrt{n}R(X'X)^{-1}X'\varepsilon = R\left(\frac{X'X}{n}\right)^{-1}n^{-1/2}X'\varepsilon. \]
Substitute this into [??] to get
\[ W \simeq n^{-1}\varepsilon'X Q_X^{-1}R'\left[\sigma_0^2 R Q_X^{-1}R'\right]^{-1}R Q_X^{-1}X'\varepsilon
     \simeq \varepsilon'X(X'X)^{-1}R'\left[\sigma_0^2 R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}X'\varepsilon
     \simeq \frac{\varepsilon'A(A'A)^{-1}A'\varepsilon}{\sigma_0^2}
     \simeq \frac{\varepsilon'P_R\varepsilon}{\sigma_0^2}, \]
where $P_R$ is the projection matrix formed by the matrix $A = X(X'X)^{-1}R'$.
- Note that this matrix is idempotent and has $q$ columns, so the projection matrix has rank $q$.

Now consider the likelihood ratio statistic
\[ LR \simeq n^{1/2}g(\theta_0)'\mathcal{I}(\theta_0)^{-1}R'\left[R\mathcal{I}(\theta_0)^{-1}R'\right]^{-1}R\mathcal{I}(\theta_0)^{-1}n^{1/2}g(\theta_0). \]
Under normality, we have seen that the likelihood function is
\[ \ln L(\beta, \sigma) = -n\ln\sqrt{2\pi} - n\ln\sigma - \frac{1}{2}\frac{\left(y - X\beta\right)'\left(y - X\beta\right)}{\sigma^2}. \]
Using this,
\[ g(\beta_0) \equiv D_\beta\frac{1}{n}\ln L(\beta, \sigma) = \frac{X'\left(y - X\beta_0\right)}{n\sigma^2} = \frac{X'\varepsilon}{n\sigma^2}. \]
Also, by the information matrix equality:
\[ \mathcal{I}(\beta_0) = -H_\infty(\beta_0) = \lim -D_{\beta'}g(\beta_0) = \lim\frac{X'X}{n\sigma^2} = \frac{Q_X}{\sigma^2}, \]
so
\[ \mathcal{I}(\theta_0)^{-1} = \sigma^2 Q_X^{-1}. \]
Substituting these last expressions into [??], we get
\[ LR \simeq \varepsilon'X(X'X)^{-1}R'\left[\sigma_0^2 R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}X'\varepsilon
   = \frac{\varepsilon'P_R\varepsilon}{\sigma_0^2} \simeq W. \]
This completes the proof that the Wald and LR tests are asymptotically equivalent. Similarly, one can show that, under the null hypothesis,
\[ qF \simeq W \simeq LM \simeq LR. \]
- The proof for the statistics except for LR does not depend upon normality of the errors, as can be verified by examining the expressions for the statistics.
- The LR statistic is based upon distributional assumptions, since one can't write the likelihood function without them.
- However, due to the close relationship between the statistics qF and LR, supposing normality, the qF statistic can be thought of as a pseudo-LR statistic, in that it's like a LR statistic in that it uses the value of the objective functions of the restricted and unrestricted models, but it doesn't require distributional assumptions.
- The presentation of the score and Wald tests has been done in the context of the linear model. This is readily generalizable to nonlinear models and/or other estimation methods.

Though the four statistics are asymptotically equivalent, they are numerically different in small samples. The numeric values of the tests also depend upon how $\sigma^2$ is estimated, and we've already seen that there are several ways to do this. For example, all of the following are consistent for $\sigma^2$ under $H_0$:
\[ \frac{\hat\varepsilon'\hat\varepsilon}{n - K}, \qquad \frac{\hat\varepsilon'\hat\varepsilon}{n}, \qquad
   \frac{\hat\varepsilon_R'\hat\varepsilon_R}{n - K + Q}, \qquad \frac{\hat\varepsilon_R'\hat\varepsilon_R}{n}, \]
and in general the denominator can be replaced with any quantity $a$ such that $\lim a/n = 1$.

It can be shown, for linear regression models subject to linear restrictions, and if $\hat\varepsilon'\hat\varepsilon/n$ is used to calculate the Wald test and $\hat\varepsilon_R'\hat\varepsilon_R/n$ is used for the score test, that
\[ W > LR > LM. \]
For this reason, the Wald test will always reject if the LR test rejects, and in turn the LR test rejects if the LM test rejects. This is a bit problematic: there is the possibility that by careful choice of the statistic used, one can manipulate reported results to favor or disfavor a hypothesis. A conservative/honest approach would be to report all three test statistics when they are available. In the case of linear models with normal errors the F test is to be preferred, since asymptotic approximations are not an issue.

The small sample behavior of the tests can be quite different. The true size (probability of rejection of the null when the null is true) of the Wald test is often dramatically higher than the nominal size associated with the asymptotic distribution. Likewise, the true size of the score test is often smaller than the nominal size.
6.4. Interpretation of test statistics
Now that we have a menu of test statistics, we need to know how to use
them.
6.5. Confidence intervals

Confidence intervals for single coefficients are generated in the normal manner. Given the $t$ statistic
\[ t(\beta) = \frac{\hat\beta - \beta}{\hat\sigma_{\hat\beta}}, \]
a $100(1-\alpha)\%$ confidence interval for $\beta_0$ is defined by the bounds of the set of $\beta$ such that $t(\beta)$ does not reject $H_0: \beta_0 = \beta$, using an $\alpha$ significance level:
\[ C(\alpha) = \left\{\beta : -c_{\alpha/2} < \frac{\hat\beta - \beta}{\hat\sigma_{\hat\beta}} < c_{\alpha/2}\right\}. \]
The set of such $\beta$ is the interval
\[ \hat\beta \pm \hat\sigma_{\hat\beta}\, c_{\alpha/2}. \]
A confidence ellipse for two coefficients jointly would be, analogously, the set of $\{\beta_1, \beta_2\}$ such that the $F$ (or some other test statistic) doesn't reject at the specified critical value. This generates an ellipse, if the estimators are correlated.
- The region is an ellipse, since the CI for an individual coefficient defines an (infinitely long) rectangle with total prob. mass $1-\alpha$, since the other coefficient is marginalized (e.g., can take on any value). Since the ellipse is bounded in both dimensions but also contains mass $1-\alpha$, it must extend beyond the bounds of the individual CI.

FIGURE 6.5.1. Joint and Individual Confidence Regions.

From the picture we can see that:
- Rejection of hypotheses individually does not imply that the joint test will reject.
- Joint rejection does not imply individual tests will reject.
6.6. Bootstrapping
When we rely on asymptotic theory to use the normal distribution-based tests and confidence intervals, we're often at serious risk of making important errors. If the sample size is small and errors are highly nonnormal, the small sample distribution of $\sqrt{n}(\hat\beta - \beta_0)$ may be very different than its large sample distribution. Also, the distributions of test statistics may not resemble their limiting distributions at all. A means of trying to gain information on the small sample distribution of test statistics and estimators is the bootstrap. We'll consider a simple example, just to get the main idea.

Suppose that
\[ y = X\beta_0 + \varepsilon, \qquad \varepsilon \sim IID(0, \sigma_0^2), \qquad X \text{ is nonstochastic.} \]
Given that the distribution of $\varepsilon$ is unknown, the distribution of $\hat\beta$ will be unknown in small samples. However, since we have random sampling, we could generate artificial data. The steps are:
(1) Draw $n$ observations from $\hat\varepsilon$ with replacement. Call this vector $\tilde\varepsilon^j$ (it's an $n \times 1$ vector).
(2) Then generate the data by $\tilde y^j = X\hat\beta + \tilde\varepsilon^j$.
(3) Now take this and estimate $\tilde\beta^j = (X'X)^{-1}X'\tilde y^j$.
(4) Save $\tilde\beta^j$.
(5) Repeat steps 1-4, until we have a large number, $J$, of $\tilde\beta^j$.

With this, we can use the replications to calculate the empirical distribution of $\tilde\beta^j$. One way to form a $100(1-\alpha)\%$ confidence interval for $\beta_0$ would be to order the $\tilde\beta^j$ from smallest to largest, and drop the first and last $J\alpha/2$ of the replications, and use the remaining endpoints as the limits of the CI. Note that this will not give the shortest CI if the empirical distribution is skewed.
- Suppose one was interested in the distribution of some function of $\hat\beta$, for example a test statistic. Simple: just calculate the transformation for each $j$, and work with the empirical distribution of the transformation.
- If the assumption of iid errors is too strong (for example if there is heteroscedasticity or autocorrelation, see below) one can work with a bootstrap defined by sampling from $(y, x)$ with replacement.
- How to choose $J$: $J$ should be large enough that the results don't change with repetition of the entire bootstrap. This is easy to check. If you find the results change a lot, increase $J$ and try again.
- The bootstrap is based fundamentally on the idea that the empirical distribution of the sample data converges to the actual sampling distribution as $n$ becomes large, so statistics based on sampling from the empirical distribution should converge in distribution to statistics based on sampling from the actual sampling distribution.
- In finite samples, this doesn't hold. At a minimum, the bootstrap is a good way to check if asymptotic theory results offer a decent approximation to the small sample distribution.
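A minimal Octave sketch of steps (1)-(5) follows. It is added here for illustration only; it is not one of the course's bootstrap scripts, and the sample size, number of replications and skewed error distribution are arbitrary assumptions.

  % Minimal residual-bootstrap sketch for the percentile CI described above.
  n = 50;  J = 999;
  x = [ones(n,1), randn(n,1)];
  y = x*[1; 2] + (randn(n,1).^2 - 1);      % skewed, nonnormal errors
  b = (x'*x)\(x'*y);  e = y - x*b;
  bstar = zeros(J, columns(x));
  for j = 1:J
    idx = ceil(n*rand(n,1));               % draw residuals with replacement
    ystar = x*b + e(idx);
    bstar(j,:) = ((x'*x)\(x'*ystar))';
  end
  alpha = 0.05;
  s = sort(bstar(:,2));
  ci = [s(floor(J*alpha/2)+1), s(ceil(J*(1-alpha/2)))];   % percentile CI for beta_2
  printf("bootstrap CI for beta_2: [%f, %f]\n", ci(1), ci(2));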

6.7. Testing nonlinear restrictions, and the Delta Method
Testing nonlinear restrictions of a linear model is not much more difficult, at least when the model is linear. Since estimation subject to nonlinear restrictions requires nonlinear estimation methods, which are beyond the scope of this course, we'll just consider the Wald test for nonlinear restrictions on a linear model.

Consider the nonlinear restrictions
\[ r(\beta_0) = 0, \]
where $r(\cdot)$ is a $q$-vector valued function. Write the derivative of the restriction evaluated at $\beta$ as
\[ D_{\beta'}r(\beta)\big|_\beta = R(\beta). \]
We suppose that the restrictions are not redundant in a neighborhood of $\beta_0$, so that $\rho\left(R(\beta)\right) = q$ in a neighborhood of $\beta_0$. Take a first order Taylor's series expansion of $r(\hat\beta)$ about $\beta_0$:
\[ r(\hat\beta) = r(\beta_0) + R(\beta^*)\left(\hat\beta - \beta_0\right), \]
where $\beta^*$ is a convex combination of $\hat\beta$ and $\beta_0$. Under the null hypothesis we have
\[ r(\hat\beta) = R(\beta^*)\left(\hat\beta - \beta_0\right). \]
Due to consistency of $\hat\beta$ we can replace $\beta^*$ by $\beta_0$, asymptotically, so
\[ \sqrt{n}\,r(\hat\beta) \simeq \sqrt{n}\,R(\beta_0)\left(\hat\beta - \beta_0\right). \]
We've already seen the distribution of $\sqrt{n}(\hat\beta - \beta_0)$. Using this we get
\[ \sqrt{n}\,r(\hat\beta) \overset{d}{\to} N\left(0,\ R(\beta_0)Q_X^{-1}R(\beta_0)'\sigma_0^2\right). \]
Considering the quadratic form
\[ \frac{n\,r(\hat\beta)'\left[R(\beta_0)Q_X^{-1}R(\beta_0)'\right]^{-1}r(\hat\beta)}{\sigma_0^2} \overset{d}{\to} \chi^2(q) \]
under the null hypothesis. Substituting consistent estimators for $\beta_0$, $Q_X$ and $\sigma_0^2$, the resulting statistic is
\[ \frac{r(\hat\beta)'\left[R(\hat\beta)(X'X)^{-1}R(\hat\beta)'\right]^{-1}r(\hat\beta)}{\hat\sigma_0^2} \overset{d}{\to} \chi^2(q) \]
under the null hypothesis.
- This is known in the literature as the Delta method, or as Klein's approximation.
- Since this is a Wald test, it will tend to over-reject in finite samples. The score and LR tests are also possibilities, but they require estimation methods for nonlinear models, which aren't in the scope of this course.

Note that this also gives a convenient way to estimate nonlinear functions and associated asymptotic confidence intervals. If the nonlinear function $r(\beta_0)$ is not hypothesized to be zero, we just have
\[ \sqrt{n}\left(r(\hat\beta) - r(\beta_0)\right) \overset{d}{\to} N\left(0,\ R(\beta_0)Q_X^{-1}R(\beta_0)'\sigma_0^2\right), \]
so an approximation to the distribution of the function of the estimator is
\[ r(\hat\beta) \approx N\left(r(\beta_0),\ R(\beta_0)(X'X)^{-1}R(\beta_0)'\sigma_0^2\right). \]
For example, the vector of elasticities of a function $f(x)$ is
\[ \eta(x) = \frac{\partial f(x)}{\partial x} \odot \frac{x}{f(x)}, \]
where $\odot$ means element-by-element multiplication. Suppose we estimate a linear function
\[ y = x'\beta + \varepsilon. \]
The elasticities of $y$ w.r.t. $x$ are
\[ \eta(x) = \beta \odot \frac{x}{x'\beta} \]
(note that this is the entire vector of elasticities). The estimated elasticities are
\[ \hat\eta(x) = \hat\beta \odot \frac{x}{x'\hat\beta}. \]
To calculate the estimated standard errors of all five elasticities, use
\[ R(\beta) = \frac{\partial\eta(x)}{\partial\beta'} = \frac{\mathrm{diag}(x)}{x'\beta} - \frac{\left(\beta \odot x\right)x'}{\left(x'\beta\right)^2}. \]
To get a consistent estimator just substitute in $\hat\beta$. Note that the elasticity and the standard error are functions of $x$. The program ExampleDeltaMethod.m shows how this can be done.

In many cases, nonlinear restrictions can also involve the data, not just the parameters. For example, consider a model of expenditure shares. Let $x(p, m)$ be a demand function, where $p$ is prices and $m$ is income. An expenditure share system for $G$ goods is
\[ s_i(p, m) = \frac{p_i x_i(p, m)}{m}, \qquad i = 1, 2, \dots, G. \]
Now demand must be positive, and we assume that expenditures sum to income, so we have the restrictions
\[ 0 \leq s_i(p, m) \leq 1, \ \forall i, \qquad \sum_{i=1}^G s_i(p, m) = 1. \]
Suppose we postulate a linear model for the expenditure shares:
\[ s_i(p, m) = \alpha_i + p'\beta_i + m\gamma_i + \varepsilon_i. \]
It is fairly easy to write restrictions such that the shares sum to one, but the restriction that the shares lie in the $[0, 1]$ interval depends on both the parameters and the values of $p$ and $m$. It is impossible to impose the restriction that $0 \leq s_i(p, m) \leq 1$ for all possible $p$ and $m$. In such cases, one might consider whether or not a linear model is a reasonable specification.
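The following Octave sketch illustrates the delta method calculation. It is an illustration added here, not the ExampleDeltaMethod.m program referred to above; the simulated regression and the choice of the function $1/\beta_2$ are assumptions made only for the example.

  % Minimal delta-method sketch: standard error of a nonlinear function of beta_hat.
  n = 200;
  x = [ones(n,1), rand(n,1) + 0.5];
  y = x*[1; 0.8] + 0.1*randn(n,1);
  [n, K] = size(x);
  b = (x'*x)\(x'*y);
  e = y - x*b;
  vb = ((e'*e)/(n - K))*inv(x'*x);     % estimated covariance of beta_hat
  f  = 1/b(2);                          % the nonlinear function, e.g. RTS = 1/beta_2
  Rb = [0, -1/b(2)^2];                  % gradient of the function w.r.t. beta'
  se = sqrt(Rb*vb*Rb');                 % delta-method standard error
  printf("estimate %f, delta-method s.e. %f\n", f, se);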

6.8. Example: the Nerlove data
Remember from a previous example (section 3.8.3) that the OLS results for the Nerlove model are

*********************************************************
OLS estimation results
Observations 145
R-squared 0.925955
Sigma-squared 0.153943

Results (Ordinary var-cov estimator)

            estimate   st.err.   t-stat.   p-value
constant      -3.527     1.774    -1.987     0.049
output         0.720     0.017    41.244     0.000
labor          0.436     0.291     1.499     0.136
fuel           0.427     0.100     4.249     0.000
capital       -0.220     0.339    -0.648     0.518
*********************************************************

Remember that if we have constant returns to scale, then $\beta_Q = 1$, and if there is homogeneity of degree 1 then $\beta_L + \beta_F + \beta_K = 1$. We can test these hypotheses either separately or jointly. NerloveRestrictions.m imposes and tests CRTS and then HOD1. From it we obtain the results that follow:

Imposing and testing HOD1

*******************************************************
Restricted LS estimation results
Observations 145
R-squared 0.925652
Sigma-squared 0.155686

            estimate   st.err.   t-stat.   p-value
constant      -4.691     0.891    -5.263     0.000
output         0.721     0.018    41.040     0.000
labor          0.593     0.206     2.878     0.005
fuel           0.414     0.100     4.159     0.000
capital       -0.007     0.192    -0.038     0.969
*******************************************************

            Value     p-value
F            0.574     0.450
Wald         0.594     0.441
LR           0.593     0.441
Score        0.592     0.442

Imposing and testing CRTS

*******************************************************
Restricted LS estimation results
Observations 145
R-squared 0.790420
Sigma-squared 0.438861

            estimate   st.err.   t-stat.   p-value
constant      -7.530     2.966    -2.539     0.012
output         1.000     0.000       Inf     0.000
labor          0.020     0.489     0.040     0.968
fuel           0.715     0.167     4.289     0.000
capital        0.076     0.572     0.132     0.895
*******************************************************

            Value     p-value
F          256.262     0.000
Wald       265.414     0.000
LR         150.863     0.000
Score       93.771     0.000

Notice that the input price coefficients in fact sum to 1 when HOD1 is imposed. HOD1 is not rejected at usual significance levels (e.g., $\alpha = 0.10$). Also, $R^2$ does not drop much when the restriction is imposed, compared to the unrestricted results. For CRTS, you should note that $\hat\beta_Q = 1$, so the restriction is satisfied. Also note that the hypothesis that $\beta_Q = 1$ is rejected by the test statistics at all reasonable significance levels. Note that $R^2$ drops quite a bit when imposing CRTS. If you look at the unrestricted estimation results, you can see that a t-test for $\beta_Q = 1$ also rejects, and that a confidence interval for $\beta_Q$ does not overlap 1.

From the point of view of neoclassical economic theory, these results are not anomalous: HOD1 is an implication of the theory, but CRTS is not.

EXERCISE 12. Modify the NerloveRestrictions.m program to impose and test the restrictions jointly.
The Chow test. Since CRTS is rejected, let's examine the possibilities more carefully. Recall that the data is sorted by output (the third column). Define 5 subsamples of firms, with the first group being the 29 firms with the lowest output levels, then the next 29 firms, etc. The five subsamples can be indexed by $j = 1, 2, \dots, 5$, where $j = 1$ for $t = 1, 2, \dots, 29$, $j = 2$ for $t = 30, 31, \dots, 58$, etc. Define a piecewise linear model
(6.8.1)
\[ \ln C_t = \beta_1^j + \beta_2^j\ln Q_t + \beta_3^j\ln P_{Lt} + \beta_4^j\ln P_{Ft} + \beta_5^j\ln P_{Kt} + \varepsilon_t, \]
where $j$ is a superscript (not a power) that indicates that the coefficients may be different according to the subsample in which the observation falls. That is, the coefficients depend upon $j$, which in turn depends upon $t$. Note that the first column of nerlove.data indicates this way of breaking up the sample. The new model may be written as
(6.8.2)
\[ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_5 \end{bmatrix}
   = \begin{bmatrix} X_1 & 0 & \cdots & 0 \\ 0 & X_2 & & \\ & & \ddots & \\ 0 & & & X_5 \end{bmatrix}
     \begin{bmatrix} \beta^1 \\ \beta^2 \\ \vdots \\ \beta^5 \end{bmatrix}
   + \begin{bmatrix} \varepsilon^1 \\ \varepsilon^2 \\ \vdots \\ \varepsilon^5 \end{bmatrix}, \]
where $y_1$ is $29 \times 1$, $X_1$ is $29 \times 5$, $\beta^j$ is the $5 \times 1$ vector of coefficients for the $j$-th subsample, and $\varepsilon^j$ is the $29 \times 1$ vector of errors for the $j$-th subsample.

The Octave program Restrictions/ChowTest.m estimates the above model. It also tests the hypothesis that the five subsamples share the same parameter vector, or in other words, that there is coefficient stability across the five subsamples. The null to test is that the parameter vectors for the separate groups are all the same, that is,
\[ \beta^1 = \beta^2 = \beta^3 = \beta^4 = \beta^5. \]
This type of test, that parameters are constant across different sets of data, is sometimes referred to as a Chow test.
- There are 20 restrictions. If that's not clear to you, look at the Octave program.
- The restrictions are rejected at all conventional significance levels.

Since the restrictions are rejected, we should probably use the unrestricted model for analysis. What is the pattern of RTS as a function of the output group (small to large)? Figure 6.8.1 plots RTS. We can see that there is increasing RTS for small firms, but that RTS is approximately constant for large firms.

FIGURE 6.8.1. RTS as a function of firm size (RTS plotted against output group).

(1) Using the Chow test on the Nerlove model, we reject that there is coefficient stability across the 5 groups. But perhaps we could restrict the input price coefficients to be the same but let the constant and output coefficients vary by group size. This new model is
(6.8.3)
\[ \ln C_i = \beta_1^j + \beta_2^j\ln Q_i + \beta_3\ln P_{Li} + \beta_4\ln P_{Fi} + \beta_5\ln P_{Ki} + \varepsilon_i. \]
(a) Estimate this model by OLS, giving $R^2$, estimated standard errors for coefficients, t-statistics for tests of significance, and the associated p-values. Interpret the results in detail.
(b) Test the restrictions implied by this model using the F, Wald, score and likelihood ratio tests. Comment on the results.
(c) Plot the estimated RTS parameters as a function of firm size. Compare the plot to that given in the notes for the unrestricted model. Comment on the results.
(2) For the simple Nerlove model, estimated returns to scale is $\widehat{RTS} = 1/\hat\beta_q$. Apply the delta method to calculate the estimated standard error for estimated RTS. Directly test $H_0: RTS = 1$ versus $H_A: RTS \neq 1$, rather than testing $H_0: \beta_q = 1$ versus $H_A: \beta_q \neq 1$. Comment on the results.
(3) Perform a Monte Carlo study that generates data from the model
\[ y = -2 + 1x_2 + 1x_3 + \varepsilon, \]
where the sample size is 30, $x_2$ and $x_3$ are independently uniformly distributed on $[0, 1]$ and $\varepsilon \sim IIN(0, 1)$.
(a) Compare the means and standard errors of the estimated coefficients using OLS and restricted OLS, imposing the restriction that $\beta_2 + \beta_3 = 2$.
(b) Compare the means and standard errors of the estimated coefficients using OLS and restricted OLS, imposing the restriction that $\beta_2 + \beta_3 = 1$.
(c) Discuss the results.
(4) Get the Octave scripts bootstrap_example1.m, bootstrap.m, bootstrap_resample_iid.m and myols.m, figure out what they do, run them, and interpret the results.
CHAPTER 7

Generalized least squares

One of the assumptions we've made up to now is that
\[ \varepsilon_t \sim IID(0, \sigma^2), \]
or occasionally
\[ \varepsilon_t \sim IIN(0, \sigma^2). \]
Now we'll investigate the consequences of nonidentically and/or dependently distributed errors. We'll assume fixed regressors for now, relaxing this admittedly unrealistic assumption later. The model is
\[ y = X\beta + \varepsilon, \qquad E(\varepsilon) = 0, \qquad V(\varepsilon) = \Sigma, \]
where $\Sigma$ is a general symmetric positive definite matrix (we'll write $\Sigma$ in place of $\Sigma_0$ to simplify the typing of these notes).
- The case where $\Sigma$ is a diagonal matrix gives uncorrelated, nonidentically distributed errors. This is known as heteroscedasticity.
- The case where $\Sigma$ has the same number on the main diagonal but nonzero elements off the main diagonal gives identically (assuming higher moments are also the same) dependently distributed errors. This is known as autocorrelation.
- The general case combines heteroscedasticity and autocorrelation. This is known as "nonspherical" disturbances, though why this term is used, I have no idea. Perhaps it's because under the classical assumptions, a joint confidence region for $\varepsilon$ would be an $n$-dimensional hypersphere.
7.1. Effects of nonspherical disturbances on the OLS estimator
The least squares estimator is
\[ \hat\beta = (X'X)^{-1}X'y = \beta_0 + (X'X)^{-1}X'\varepsilon. \]
- We have unbiasedness, as before.
- The variance of $\hat\beta$ is
(7.1.1)
\[ E\left[\left(\hat\beta - \beta_0\right)\left(\hat\beta - \beta_0\right)'\right]
   = E\left[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}\right]
   = (X'X)^{-1}X'\Sigma X(X'X)^{-1}. \]
Due to this, any test statistic that is based upon $\hat\sigma^2$ or the probability limit $\sigma^2$ of $\hat\sigma^2$ is invalid. In particular, the formulas for the $t$, $F$ and $\chi^2$ based tests given above do not lead to statistics with these distributions.
- $\hat\beta$ is still consistent, following exactly the same argument given before.
- If $\varepsilon$ is normally distributed, then
\[ \hat\beta \sim N\left(\beta,\ (X'X)^{-1}X'\Sigma X(X'X)^{-1}\right). \]
The problem is that $\Sigma$ is unknown in general, so this distribution won't be useful for testing hypotheses.
- Without normality, and unconditional on $X$, we still have
\[ \sqrt{n}\left(\hat\beta - \beta_0\right) = \sqrt{n}(X'X)^{-1}X'\varepsilon = \left(\frac{X'X}{n}\right)^{-1}n^{-1/2}X'\varepsilon. \]
Define the limiting variance of $n^{-1/2}X'\varepsilon$ (supposing a CLT applies) as
\[ \lim_{n\to\infty}E\left(\frac{X'\varepsilon\varepsilon'X}{n}\right) = \Omega, \]
so we obtain
\[ \sqrt{n}\left(\hat\beta - \beta_0\right) \overset{d}{\to} N\left(0,\ Q_X^{-1}\Omega Q_X^{-1}\right). \]

Summary: OLS with heteroscedasticity and/or autocorrelation is:
- unbiased in the same circumstances in which the estimator is unbiased with iid errors
- has a different variance than before, so the previous test statistics aren't valid
- is consistent
- is asymptotically normally distributed, but with a different limiting covariance matrix. Previous test statistics aren't valid in this case for this reason.
- is inefficient, as is shown below.
7.2. The GLS estimator
Suppose $\Sigma$ were known. Then one could form the Cholesky decomposition
\[ P'P = \Sigma^{-1}. \]
We have
\[ P'P\Sigma = I_n, \]
so
\[ P'P\Sigma P' = P', \]
which implies that
\[ P\Sigma P' = I_n. \]
Consider the model
\[ Py = PX\beta + P\varepsilon, \]
or, making the obvious definitions,
\[ y^* = X^*\beta + \varepsilon^*. \]
This variance of $\varepsilon^* = P\varepsilon$ is
\[ E\left(P\varepsilon\varepsilon'P'\right) = P\Sigma P' = I_n. \]
Therefore, the model
\[ y^* = X^*\beta + \varepsilon^*, \qquad E(\varepsilon^*) = 0, \qquad V(\varepsilon^*) = I_n \]
satisfies the classical assumptions. The GLS estimator is simply OLS applied to the transformed model:
\[ \hat\beta_{GLS} = \left(X^{*\prime}X^*\right)^{-1}X^{*\prime}y^*
   = \left(X'P'PX\right)^{-1}X'P'Py
   = \left(X'\Sigma^{-1}X\right)^{-1}X'\Sigma^{-1}y. \]
The GLS estimator is unbiased in the same circumstances under which the OLS estimator is unbiased. For example, assuming $X$ is nonstochastic,
\[ E\left(\hat\beta_{GLS}\right) = E\left[\left(X'\Sigma^{-1}X\right)^{-1}X'\Sigma^{-1}\left(X\beta + \varepsilon\right)\right] = \beta. \]
The variance of the estimator, conditional on $X$, can be calculated using
\[ \hat\beta_{GLS} = \left(X^{*\prime}X^*\right)^{-1}X^{*\prime}y^* = \beta + \left(X^{*\prime}X^*\right)^{-1}X^{*\prime}\varepsilon^*, \]
so
\[ E\left[\left(\hat\beta_{GLS} - \beta\right)\left(\hat\beta_{GLS} - \beta\right)'\right]
   = \left(X^{*\prime}X^*\right)^{-1} = \left(X'\Sigma^{-1}X\right)^{-1}. \]
Either of these last formulas can be used.
- All the previous results regarding the desirable properties of the least squares estimator hold, when dealing with the transformed model, since the transformed model satisfies the classical assumptions.
- Tests are valid, using the previous formulas, as long as we substitute $X^*$ in place of $X$. Furthermore, any test that involves $\sigma^2$ can set it to 1. This is preferable to re-deriving the appropriate formulas.
- The GLS estimator is more efficient than the OLS estimator. This is a consequence of the Gauss-Markov theorem, since the GLS estimator is based on a model that satisfies the classical assumptions but the OLS estimator is not. To see this directly, note that (the following needs to be completed)
\[ Var(\hat\beta) - Var(\hat\beta_{GLS})
   = (X'X)^{-1}X'\Sigma X(X'X)^{-1} - \left(X'\Sigma^{-1}X\right)^{-1} = A\Sigma A', \]
where $A = \left[(X'X)^{-1}X' - \left(X'\Sigma^{-1}X\right)^{-1}X'\Sigma^{-1}\right]$. This may not seem obvious, but it is true, as you can verify for yourself. Then noting that $A\Sigma A'$ is a quadratic form in a positive definite matrix, we conclude that $A\Sigma A'$ is positive semi-definite, and that GLS is efficient relative to OLS.
- As one can verify by calculating fonc, the GLS estimator is the solution to the minimization problem
\[ \hat\beta_{GLS} = \arg\min_\beta \left(y - X\beta\right)'\Sigma^{-1}\left(y - X\beta\right), \]
so the metric $\Sigma^{-1}$ is used to weight the residuals.
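As a quick numerical check of the equivalence between the two forms (added as an illustration; the diagonal Σ used here is made up), the following Octave sketch computes the GLS estimator directly and via the transformed model:

  % Minimal GLS sketch with a known diagonal Sigma.
  n = 100;
  x = [ones(n,1), randn(n,1)];
  sig = 0.5 + (1:n)'/n;                          % known standard deviations (made up)
  y = x*[1; 2] + sig.*randn(n,1);
  Sigma_inv = diag(1./sig.^2);
  b_gls = (x'*Sigma_inv*x)\(x'*Sigma_inv*y);     % (X'Sigma^-1 X)^-1 X'Sigma^-1 y
  P = diag(1./sig);                              % here P'P = Sigma^-1
  b_star = ((P*x)'*(P*x))\((P*x)'*(P*y));        % OLS on the transformed model
  printf("max difference between the two forms: %g\n", max(abs(b_gls - b_star)));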
7.3. Feasible GLS
The problem is that $\Sigma$ isn't known usually, so this estimator isn't available.
- Consider the dimension of $\Sigma$: it's an $n \times n$ matrix with $\left(n^2 + n\right)/2$ unique elements.
- The number of parameters to estimate is larger than $n$ and increases faster than $n$. There's no way to devise an estimator that satisfies a LLN without adding restrictions.
- The feasible GLS estimator is based upon making sufficient assumptions regarding the form of $\Sigma$ so that a consistent estimator can be devised.

Suppose that we parameterize $\Sigma$ as a function of $X$ and $\theta$, where $\theta$ may include $\beta$ as well as other parameters, so that
\[ \Sigma = \Sigma(X, \theta), \]
where $\theta$ is of fixed dimension. If we can consistently estimate $\theta$, we can consistently estimate $\Sigma$, as long as $\Sigma(X, \theta)$ is a continuous function of $\theta$ (by the Slutsky theorem). In this case,
\[ \hat\Sigma = \Sigma(X, \hat\theta) \overset{p}{\to} \Sigma(X, \theta). \]
If we replace $\Sigma$ in the formulas for the GLS estimator with $\hat\Sigma$, we obtain the FGLS estimator. The FGLS estimator shares the same asymptotic properties as GLS. These are
(1) Consistency
(2) Asymptotic normality
(3) Asymptotic efficiency if the errors are normally distributed. (Cramér-Rao).
(4) Test procedures are asymptotically valid.

In practice, the usual way to proceed is
(1) Define a consistent estimator of $\theta$. This is a case-by-case proposition, depending on the parameterization $\Sigma(\theta)$. We'll see examples below.
(2) Form $\hat\Sigma = \Sigma(X, \hat\theta)$.
(3) Calculate the Cholesky factorization $\hat P = \mathrm{Chol}(\hat\Sigma^{-1})$.
(4) Transform the model using
\[ \hat P y = \hat P X\beta + \hat P\varepsilon. \]
(5) Estimate using OLS on the transformed model.
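To fix ideas, here is a hedged Octave sketch of steps (1)-(5) for one simple, assumed parameterization in which the error variance is proportional to an observed variable z_t. The parameterization and all settings are illustrative assumptions, not one of the cases treated below.

  % Minimal FGLS sketch assuming V(e_t) = theta * z_t (illustrative parameterization).
  n = 200;
  z = 1 + rand(n,1);
  x = [ones(n,1), randn(n,1)];
  y = x*[1; 2] + sqrt(z).*randn(n,1);
  b_ols = (x'*x)\(x'*y);                    % first-stage consistent estimate of beta
  e2 = (y - x*b_ols).^2;
  theta = (z'*e2)/(z'*z);                   % step 1: estimate theta by LS of e2 on z
  Sigma_hat = diag(theta*z);                % step 2: form Sigma(theta_hat)
  P = chol(inv(Sigma_hat));                 % step 3: P'P = Sigma_hat^{-1}
  ystar = P*y;  xstar = P*x;                % step 4: transform the model
  b_fgls = (xstar'*xstar)\(xstar'*ystar);   % step 5: OLS on the transformed model
  disp(b_fgls');

For a diagonal Σ the Cholesky step amounts to dividing each observation by its estimated standard deviation, which is how the corrections below are implemented.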
7.4. Heteroscedasticity
Heteroscedasticity is the case where
\[ E\left(\varepsilon\varepsilon'\right) = \Sigma \]
is a diagonal matrix, so that the errors are uncorrelated, but have different variances. Heteroscedasticity is usually thought of as associated with cross sectional data, though there is absolutely no reason why time series data cannot also be heteroscedastic. Actually, the popular ARCH (autoregressive conditionally heteroscedastic) models explicitly assume that a time series is heteroscedastic.

Consider a supply function
\[ q_i = \beta_1 + \beta_p P_i + \beta_s S_i + \varepsilon_i, \]
where $P_i$ is price and $S_i$ is some measure of size of the $i$-th firm. One might suppose that unobservable factors (e.g., talent of managers, degree of coordination between production units, etc.) account for the error term $\varepsilon_i$. If there is more variability in these factors for large firms than for small firms, then $\varepsilon_i$ may have a higher variance when $S_i$ is high than when it is low.

Another example, individual demand:
\[ q_i = \beta_1 + \beta_p P_i + \beta_m M_i + \varepsilon_i, \]
where $P$ is price and $M$ is income. In this case, $\varepsilon_i$ can reflect variations in preferences. There are more possibilities for expression of preferences when one is rich, so it is possible that the variance of $\varepsilon_i$ could be higher when $M$ is high.

Add example of group means.
7.4.1. OLS with heteroscedastic consistent varcov estimation. Eicker (1967) and White (1980) showed how to modify test statistics to account for heteroscedasticity of unknown form. The OLS estimator has asymptotic distribution
\[ \sqrt{n}\left(\hat\beta - \beta_0\right) \overset{d}{\to} N\left(0,\ Q_X^{-1}\Omega Q_X^{-1}\right), \]
as we've already seen. Recall that we defined
\[ \lim_{n\to\infty}E\left(\frac{X'\varepsilon\varepsilon'X}{n}\right) = \Omega. \]
This matrix has dimension $K \times K$ and can be consistently estimated, even if we can't estimate $\Sigma$ consistently. The consistent estimator, under heteroscedasticity but no autocorrelation, is
\[ \hat\Omega = \frac{1}{n}\sum_{t=1}^n x_t x_t'\hat\varepsilon_t^2. \]
One can then modify the previous test statistics to obtain tests that are valid when there is heteroscedasticity of unknown form. For example, the Wald test for $H_0: R\beta - r = 0$ would be
\[ n\left(R\hat\beta - r\right)'\left[R\left(\frac{X'X}{n}\right)^{-1}\hat\Omega\left(\frac{X'X}{n}\right)^{-1}R'\right]^{-1}\left(R\hat\beta - r\right) \overset{a}{\sim} \chi^2(q). \]
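A minimal Octave sketch of this heteroscedasticity-consistent covariance estimator and the corresponding robust Wald test follows (an illustration only; the data-generating choices are arbitrary assumptions).

  % Minimal sketch: Eicker-White covariance and a robust Wald test of H0: beta_2 = 0.
  n = 200;
  x = [ones(n,1), randn(n,1)];
  e = (0.5 + abs(x(:,2))).*randn(n,1);           % heteroscedastic errors
  y = x*[1; 0] + e;
  b = (x'*x)\(x'*y);
  ehat = y - x*b;
  Omega_hat = (x.*(ehat.^2*ones(1,2)))'*x/n;     % (1/n) sum of x_t x_t' * ehat_t^2
  Qinv = inv(x'*x/n);
  V_robust = Qinv*Omega_hat*Qinv/n;              % estimated Var(beta_hat)
  W = b(2)^2/V_robust(2,2);                      % robust Wald statistic, chi^2(1) under H0
  printf("robust Wald = %f\n", W);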
7.4.2. Detection. There exist many tests for the presence of heteroscedasticity. We'll discuss three methods.

Goldfeld-Quandt. The sample is divided in to three parts, with $n_1$, $n_2$ and $n_3$ observations, where $n_1 + n_2 + n_3 = n$. The model is estimated using the first and third parts of the sample, separately, so that $\hat\beta^1$ and $\hat\beta^3$ will be independent. Then we have
\[ \frac{\hat\varepsilon^{1\prime}\hat\varepsilon^1}{\sigma^2} = \frac{\varepsilon^{1\prime}M^1\varepsilon^1}{\sigma^2} \overset{d}{\to} \chi^2(n_1 - K) \]
and
\[ \frac{\hat\varepsilon^{3\prime}\hat\varepsilon^3}{\sigma^2} = \frac{\varepsilon^{3\prime}M^3\varepsilon^3}{\sigma^2} \overset{d}{\to} \chi^2(n_3 - K), \]
so
\[ \frac{\hat\varepsilon^{1\prime}\hat\varepsilon^1/(n_1 - K)}{\hat\varepsilon^{3\prime}\hat\varepsilon^3/(n_3 - K)} \overset{d}{\to} F(n_1 - K,\ n_3 - K). \]
The distributional result is exact if the errors are normally distributed. This test is a two-tailed test. Alternatively, and probably more conventionally, if one has prior ideas about the possible magnitudes of the variances of the observations, one could order the observations accordingly, from largest to smallest. In this case, one would use a conventional one-tailed F-test. Draw picture.
- Ordering the observations is an important step if the test is to have any power.
- The motive for dropping the middle observations is to increase the difference between the average variance in the subsamples, supposing that there exists heteroscedasticity. This can increase the power of the test. On the other hand, dropping too many observations will substantially increase the variance of the statistics $\hat\varepsilon^{1\prime}\hat\varepsilon^1$ and $\hat\varepsilon^{3\prime}\hat\varepsilon^3$. A rule of thumb, based on Monte Carlo experiments, is to drop around 25% of the observations.
- If one doesn't have any ideas about the form of the het. the test will probably have low power since a sensible data ordering isn't available.

White's test. When one has little idea if there exists heteroscedasticity, and no idea of its potential form, the White test is a possibility. The idea is that if there is homoscedasticity, then
\[ E\left(\varepsilon_t^2|x_t\right) = \sigma^2\quad \forall t, \]
so that $x_t$ or functions of $x_t$ shouldn't help to explain $E(\varepsilon_t^2)$. The test works as follows:
(1) Since $\varepsilon_t$ isn't available, use the consistent estimator $\hat\varepsilon_t$ instead.
(2) Regress
\[ \hat\varepsilon_t^2 = \sigma^2 + z_t'\gamma + v_t, \]
where $z_t$ is a $P$-vector. $z_t$ may include some or all of the variables in $x_t$, as well as other variables. White's original suggestion was to use $x_t$, plus the set of all unique squares and cross products of variables in $x_t$.
(3) Test the hypothesis that $\gamma = 0$. The $qF$ statistic in this case is
\[ qF = \frac{P\left(ESS_R - ESS_U\right)/P}{ESS_U/\left(n - P - 1\right)}. \]
Note that $ESS_R = TSS_U$, so dividing both numerator and denominator by this we get
\[ qF = \left(n - P - 1\right)\frac{R^2}{1 - R^2}. \]
Note that this is the $R^2$ of the artificial regression used to test for heteroscedasticity, not the $R^2$ of the original model.

An asymptotically equivalent statistic, under the null of no heteroscedasticity (so that $R^2$ should tend to zero), is
\[ nR^2 \overset{a}{\sim} \chi^2(P). \]
This doesn't require normality of the errors, though it does assume that the fourth moment of $\varepsilon_t$ is constant, under the null. Question: why is this necessary?
- The White test has the disadvantage that it may not be very powerful unless the $z_t$ vector is chosen well, and this is hard to do without knowledge of the form of heteroscedasticity.
- It also has the problem that specification errors other than heteroscedasticity may lead to rejection.
- Note: the null hypothesis of this test may be interpreted as $\theta = 0$ for the variance model $V(\varepsilon_t^2) = h(z_t'\theta)$, where $h(\cdot)$ is an arbitrary function of unknown form. The test is more general than is may appear from the regression that is used.

Plotting the residuals. A very simple method is to simply plot the residuals (or their squares). Draw pictures here. Like the Goldfeld-Quandt test, this will be more informative if the observations are ordered according to the suspected form of the heteroscedasticity.
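A sketch of the White test in Octave follows (added for illustration; here z_t is simply the nonconstant regressors, their squares and the cross product, and the nR² form of the statistic is used; all simulation settings are assumptions).

  % Minimal sketch of White's test using the nR^2 statistic.
  n = 200;
  x = [ones(n,1), randn(n,2)];
  e = (0.5 + abs(x(:,2))).*randn(n,1);            % heteroscedastic errors
  y = x*[1; 1; 1] + e;
  b = (x'*x)\(x'*y);  ehat2 = (y - x*b).^2;
  z = [x(:,2:3), x(:,2:3).^2, x(:,2).*x(:,3)];    % regressors, squares, cross product
  Z = [ones(n,1), z];
  g = (Z'*Z)\(Z'*ehat2);                          % auxiliary regression of ehat^2 on z
  u = ehat2 - Z*g;
  R2 = 1 - (u'*u)/sum((ehat2 - mean(ehat2)).^2);
  stat = n*R2;                                    % approx. chi^2(columns(z)) under H0
  printf("White nR^2 = %f, df = %d\n", stat, columns(z));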
7.4.3. Correction. Correcting for heteroscedasticity requires that a parametric form for $\Sigma(\theta)$ be supplied, and that a means for estimating $\theta$ consistently be determined. The estimation method will be specific to the form supplied for $\Sigma(\theta)$. We'll consider two examples. Before this, let's consider the general nature of GLS when there is heteroscedasticity.

Multiplicative heteroscedasticity
Suppose the model is
\[ y_t = x_t'\beta + \varepsilon_t, \qquad \sigma_t^2 = E\left(\varepsilon_t^2\right) = \left(z_t'\gamma\right)^\delta, \]
but the other classical assumptions hold. In this case
\[ \varepsilon_t^2 = \left(z_t'\gamma\right)^\delta + v_t, \]
and $v_t$ has mean zero. Nonlinear least squares could be used to estimate $\gamma$ and $\delta$ consistently, were $\varepsilon_t$ observable. The solution is to substitute the squared OLS residuals $\hat\varepsilon_t^2$ in place of $\varepsilon_t^2$, since it is consistent by the Slutsky theorem. Once we have $\hat\gamma$ and $\hat\delta$, we can estimate $\sigma_t^2$ consistently using
\[ \hat\sigma_t^2 = \left(z_t'\hat\gamma\right)^{\hat\delta} \overset{p}{\to} \sigma_t^2. \]
In the second step, we transform the model by dividing by the standard deviation:
\[ \frac{y_t}{\hat\sigma_t} = \frac{x_t'\beta}{\hat\sigma_t} + \frac{\varepsilon_t}{\hat\sigma_t}, \]
or
\[ y_t^* = x_t^{*\prime}\beta + \varepsilon_t^*. \]
Asymptotically, this model satisfies the classical assumptions.
- This model is a bit complex in that NLS is required to estimate the model of the variance. A simpler version would be
\[ y_t = x_t'\beta + \varepsilon_t, \qquad \sigma_t^2 = E\left(\varepsilon_t^2\right) = \sigma^2 z_t^\delta, \]
where $z_t$ is a single variable. There are still two parameters to be estimated, and the model of the variance is still nonlinear in the parameters. However, the search method can be used in this case to reduce the estimation problem to repeated applications of OLS.
- First, we define an interval of reasonable values for $\delta$, e.g., $\delta \in [0, 3]$.
- Partition this interval into $M$ equally spaced values, e.g., $\{0, .1, .2, \dots, 2.9, 3\}$.
- For each of these values, calculate the variable $z_t^{\delta_m}$.
- The regression
\[ \hat\varepsilon_t^2 = \sigma^2 z_t^{\delta_m} + v_t \]
is linear in the parameters, conditional on $\delta_m$, so one can estimate $\sigma^2$ by OLS.
- Save the pairs $(\sigma_m^2, \delta_m)$, and the corresponding $ESS_m$. Choose the pair with the minimum $ESS_m$ as the estimate.
- Next, divide the model by the estimated standard deviations.
- Can refine. Draw picture.
- Works well when the parameter to be searched over is low dimensional, as in this case.
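A hedged Octave sketch of the search method just described, for the simpler variance model $\sigma_t^2 = \sigma^2 z_t^\delta$, is given below. The grid, sample size and true parameter values are illustrative assumptions only.

  % Minimal sketch of the grid search over delta in sigma_t^2 = sigma^2 * z_t^delta.
  n = 300;
  z = 1 + rand(n,1);
  x = [ones(n,1), randn(n,1)];
  y = x*[1; 2] + sqrt(2*z.^1.5).*randn(n,1);      % true sigma^2 = 2, delta = 1.5
  b = (x'*x)\(x'*y);  e2 = (y - x*b).^2;          % squared OLS residuals
  grid = 0:0.1:3;  ess = zeros(size(grid));  s2 = ess;
  for m = 1:length(grid)
    w = z.^grid(m);
    s2(m) = (w'*e2)/(w'*w);                       % OLS of e2 on z^delta_m (no constant)
    ess(m) = sum((e2 - s2(m)*w).^2);              % save the sum of squared errors
  end
  [dummy, best] = min(ess);
  sighat = sqrt(s2(best)*z.^grid(best));          % fitted standard deviations
  xs = x./(sighat*ones(1,2));  ys = y./sighat;    % divide the model through by sigma_t_hat
  b_fgls = (xs'*xs)\(xs'*ys);
  printf("delta_hat = %.1f\n", grid(best));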

Groupwise heteroscedasticity
A common case is where we have repeated observations on each of a number of economic agents: e.g., 10 years of macroeconomic data on each of a set of countries or regions, or daily observations of transactions of 200 banks. This sort of data is a pooled cross-section time-series model. It may be reasonable to presume that the variance is constant over time within the cross-sectional units, but that it differs across them (e.g., firms or countries of different sizes...). The model is
\[ y_{it} = x_{it}'\beta + \varepsilon_{it}, \qquad E\left(\varepsilon_{it}^2\right) = \sigma_i^2\ \forall t, \]
where $i = 1, 2, \dots, G$ are the agents, and $t = 1, 2, \dots, n$ are the observations on each agent.
- The other classical assumptions are presumed to hold.
- In this case, the variance $\sigma_i^2$ is specific to each agent, but constant over the $n$ observations for that agent.
- In this model, we assume that $E\left(\varepsilon_{it}\varepsilon_{is}\right) = 0$, $t \neq s$. This is a strong assumption that we'll relax later.

To correct for heteroscedasticity, just estimate each $\sigma_i^2$ using the natural estimator:
\[ \hat\sigma_i^2 = \frac{1}{n}\sum_{t=1}^n \hat\varepsilon_{it}^2. \]
- Note that we use $1/n$ here since it's possible that there are more than $n$ regressors, so $n - K$ could be negative. Asymptotically the difference is unimportant.

With each of these, transform the model as usual:
\[ \frac{y_{it}}{\hat\sigma_i} = \frac{x_{it}'\beta}{\hat\sigma_i} + \frac{\varepsilon_{it}}{\hat\sigma_i}. \]
Do this for each cross-sectional group. This transformed model satisfies the classical assumptions, asymptotically.

FIGURE 7.4.1. Residuals, Nerlove model, sorted by firm size (plot of the regression residuals against observation number).
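The following Octave sketch of the groupwise correction is added for illustration; the group variances, the two regressors and the use of 5 equal-sized groups of 29 observations (matching the Nerlove grouping used below) are assumptions made only for the example.

  % Minimal sketch: groupwise heteroscedasticity correction.
  G = 5;  n = 29;  N = G*n;
  group = kron((1:G)', ones(n,1));          % group index for each observation
  x = [ones(N,1), randn(N,1)];
  sig = [0.8; 0.6; 0.4; 0.3; 0.2];          % made-up group standard deviations
  y = x*[1; 1] + sig(group).*randn(N,1);
  b = (x'*x)\(x'*y);  ehat = y - x*b;
  s = zeros(N,1);
  for i = 1:G
    s(group == i) = sqrt(mean(ehat(group == i).^2));   % sigma_i_hat with divisor 1/n
  end
  xs = x./(s*ones(1,2));  ys = y./s;        % transform each group by its own sigma_i_hat
  b_fgls = (xs'*xs)\(xs'*ys);
  disp(b_fgls');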
7.4.4. Example: the Nerlove model (again!) Let's check the Nerlove data for evidence of heteroscedasticity. In what follows, we're going to use the model with the constant and output coefficient varying across 5 groups, but with the input price coefficients fixed (see Equation 6.8.3 for the rationale behind this). Figure 7.4.1, which is generated by the Octave program GLS/NerloveResiduals.m, plots the residuals. We can see pretty clearly that the error variance is larger for small firms than for larger firms.

Now let's try out some tests to formally check for heteroscedasticity. The Octave program GLS/HetTests.m performs the White and Goldfeld-Quandt tests, using the above model. The results are

                 Value     p-value
GQ test         61.903     0.000
White's test    10.886     0.000

All in all, it is very clear that the data are heteroscedastic. That means that OLS estimation is not efficient, and tests of restrictions that ignore heteroscedasticity are not valid. The previous tests (CRTS, HOD1 and the Chow test) were calculated assuming homoscedasticity. The Octave program GLS/NerloveRestrictions-Het.m uses the Wald test to check for CRTS and HOD1, but using a heteroscedastic-consistent covariance estimator.^1 The results are

Testing HOD1
                 Value     p-value
Wald test        6.161     0.013

Testing CRTS
                 Value     p-value
Wald test       20.169     0.001

^1 By the way, notice that GLS/NerloveResiduals.m and GLS/HetTests.m use the restricted LS estimator directly to restrict the fully general model with all coefficients varying to the model with only the constant and the output coefficient varying. But GLS/NerloveRestrictions-Het.m estimates the model by substituting the restrictions into the model. The methods are equivalent, but the second is more convenient and easier to understand.

We see that the previous conclusions are altered - both CRTS and HOD1 are rejected at the 5% level. Maybe the rejection of HOD1 is due to the Wald test's tendency to over-reject?

From the previous plot, it seems that the variance of $\varepsilon$ is a decreasing function of output. Suppose that the 5 size groups have different error variances (heteroscedasticity by groups):
\[ Var\left(\varepsilon_i\right) = \sigma_j^2, \qquad \text{where } i \in \text{group } j,\ j = 1, 2, \dots, 5, \]
as before. The Octave program GLS/NerloveGLS.m estimates the model using GLS (through a transformation of the model so that OLS can be applied). The estimation results are
*********************************************************
OLS estimation results
Observations 145
R-squared 0.958822
Sigma-squared 0.090800

Results (Het. consistent var-cov estimator)

            estimate   st.err.   t-stat.   p-value
constant1     -1.046     1.276    -0.820     0.414
constant2     -1.977     1.364    -1.450     0.149
constant3     -3.616     1.656    -2.184     0.031
constant4     -4.052     1.462    -2.771     0.006
constant5     -5.308     1.586    -3.346     0.001
output1        0.391     0.090     4.363     0.000
output2        0.649     0.090     7.184     0.000
output3        0.897     0.134     6.688     0.000
output4        0.962     0.112     8.612     0.000
output5        1.101     0.090    12.237     0.000
labor          0.007     0.208     0.032     0.975
fuel           0.498     0.081     6.149     0.000
capital       -0.460     0.253    -1.818     0.071
*********************************************************

*********************************************************
OLS estimation results
Observations 145
R-squared 0.987429
Sigma-squared 1.092393

Results (Het. consistent var-cov estimator)

             estimate   st.err.   t-stat.   p-value
constant1      -1.580     0.917    -1.723     0.087
constant2      -2.497     0.988    -2.528     0.013
constant3      -4.108     1.327    -3.097     0.002
constant4      -4.494     1.180    -3.808     0.000
constant5      -5.765     1.274    -4.525     0.000
output1         0.392     0.090     4.346     0.000
output2         0.648     0.094     6.917     0.000
output3         0.892     0.138     6.474     0.000
output4         0.951     0.109     8.755     0.000
output5         1.093     0.086    12.684     0.000
labor           0.103     0.141     0.733     0.465
fuel            0.492     0.044    11.294     0.000
capital        -0.366     0.165    -2.217     0.028
*********************************************************
Testing HOD1
                  Value    p-value
Wald test         9.312      0.002
The first panel of output gives the OLS estimation results, which are used to consistently estimate the σ_j². The second panel gives the GLS estimation results. Some comments:

- The R² measures are not comparable - the dependent variables are not the same. The R² measure for the GLS results uses the transformed dependent variable. One could calculate a comparable R² measure, but I have not done so.
- The differences in estimated standard errors (smaller in general for GLS) can be interpreted as evidence of improved efficiency of GLS, since the OLS standard errors are calculated using the Huber-White estimator. They would not be comparable if the ordinary (inconsistent) estimator had been used.
- Note that the previously noted pattern in the output coefficients persists. The nonconstant CRTS result is robust.
- The coefficient on capital is now negative and significant at the 3% level. That seems to indicate some kind of problem with the model or the data, or economic theory.
- Note that HOD1 is now rejected. Problem of the Wald test over-rejecting? Specification error in the model?
7.5. Autocorrelation
Autocorrelation, which is the serial correlation of the error term, is a problem that is usually associated with time series data, but can also affect cross-sectional data. For example, a shock to oil prices will simultaneously affect all countries, so one could expect contemporaneous correlation of macroeconomic variables across countries.
7.5.1. Causes. Autocorrelation is the existence of correlation across the error terms of different observations:

    E(ε_t ε_s) ≠ 0,  t ≠ s.
Why might this occur? Plausible explanations include

(1) Lags in adjustment to shocks. In a model such as

    y_t = x_t'β + ε_t,

one could interpret x_t'β as the equilibrium value. Suppose x_t is constant over a number of observations. One can interpret ε_t as a shock that moves the system away from equilibrium. If the time needed to return to equilibrium is long with respect to the observation frequency, one could expect ε_{t+1} to be positive, conditional on ε_t positive, which induces a correlation.
(2) Unobserved factors that are correlated over time. The error term is
often assumed to correspond to unobservable factors. If these factors
are correlated, there will be autocorrelation.
(3) Misspecification of the model. Suppose that the DGP is

    y_t = β_0 + β_1 x_t + β_2 x_t² + ε_t

but we estimate

    y_t = β_0 + β_1 x_t + ε_t.

The effects are illustrated in Figure 7.5.1.

FIGURE 7.5.1. Autocorrelation induced by misspecification
7.5.2. Effects on the OLS estimator. The variance of the OLS estimator is the same as in the case of heteroscedasticity - the standard formula does not apply. The correct formula is given in equation 7.1.1. Next we discuss two GLS corrections for OLS. These will potentially induce inconsistency when the regressors are stochastic (see Chapter 8) and should either not be used in that case (which is usually the relevant case) or used with caution. The more recommended procedure is discussed in section 7.5.5.
7.5.3. AR(1). There are many types of autocorrelation. We'll consider two examples. The first is the most commonly encountered case: autoregressive order 1 (AR(1)) errors. The model is

    y_t = x_t'β + ε_t
    ε_t = ρ ε_{t−1} + u_t
    u_t ~ iid(0, σ_u²)
    E(ε_t u_s) = 0, t < s.

We assume that the model satisfies the other classical assumptions.

- We need a stationarity assumption: |ρ| < 1. Otherwise the variance of ε_t explodes as t increases, so standard asymptotics will not apply.
- By recursive substitution we obtain

    ε_t = ρ ε_{t−1} + u_t
        = ρ(ρ ε_{t−2} + u_{t−1}) + u_t
        = ρ² ε_{t−2} + ρ u_{t−1} + u_t
        = ρ²(ρ ε_{t−3} + u_{t−2}) + ρ u_{t−1} + u_t

  In the limit the lagged ε drops out, since |ρ| < 1, so we obtain

    ε_t = Σ_{m=0}^∞ ρ^m u_{t−m}.

  With this, the variance of ε_t is found as

    V(ε_t) = σ_u² Σ_{m=0}^∞ ρ^{2m} = σ_u² / (1 − ρ²).

- If we had directly assumed that ε_t were covariance stationary, we could obtain this using

    V(ε_t) = ρ² E(ε_{t−1}²) + 2ρ E(ε_{t−1} u_t) + E(u_t²) = ρ² V(ε_t) + σ_u²,

  so V(ε_t) = σ_u² / (1 − ρ²).
- The variance is the 0-order autocovariance: γ_0 = V(ε_t).
- Note that the variance does not depend on t.

Likewise, the first order autocovariance γ_1 is

    cov(ε_t, ε_{t−1}) = γ_1 = E[(ρ ε_{t−1} + u_t) ε_{t−1}] = ρ V(ε_t) = ρ σ_u² / (1 − ρ²).

- Using the same method, we find that for s < t

    cov(ε_t, ε_{t−s}) = γ_s = ρ^s σ_u² / (1 − ρ²).

- The autocovariances don't depend on t: the process {ε_t} is covariance stationary.

The correlation (in general, for r.v.'s x and y) is defined as

    corr(x, y) = cov(x, y) / (se(x) se(y)),

but in this case the two standard errors are the same, so the s-order autocorrelation ρ_s is ρ_s = ρ^s.

- All this means that the overall covariance matrix Σ has the form

    Σ = (σ_u² / (1 − ρ²)) ·  [ 1         ρ        ρ²    ···   ρ^{n−1}
                               ρ         1        ρ     ···   ρ^{n−2}
                               ⋮                   ⋱           ⋮
                               ρ^{n−1}   ···             ρ     1       ]

  where the leading scalar is the variance and the matrix is the correlation matrix.
So we have homoscedasticity, but elements off the main diagonal are not zero. All of this depends only on two parameters, ρ and σ_u². If we can estimate these consistently, we can apply FGLS.
It turns out that it's easy to estimate these consistently. The steps are

(1) Estimate the model y_t = x_t'β + ε_t by OLS.
(2) Take the residuals, and estimate the model

    ε̂_t = ρ ε̂_{t−1} + v_t.

    Since ε̂_t →p ε_t, this regression is asymptotically equivalent to the regression

    ε_t = ρ ε_{t−1} + u_t,

    which satisfies the classical assumptions. Therefore, ρ̂ obtained by applying OLS to ε̂_t = ρ ε̂_{t−1} + v_t is consistent. Also, since v_t →p u_t, the estimator

    σ̂_u² = (1/n) Σ_t v̂_t²  →p  σ_u².

(3) With the consistent estimators σ̂_u² and ρ̂, form Σ̂ = Σ(σ̂_u², ρ̂) using the previous structure of Σ, and estimate by FGLS. Actually, one can omit the factor σ̂_u²/(1 − ρ²), since it cancels out in the formula

    β̂_FGLS = (X'Σ̂⁻¹X)⁻¹ X'Σ̂⁻¹ y.

- One can iterate the process, by taking the first FGLS estimator of β, re-estimating ρ and σ_u², etc. If one iterates to convergence it's equivalent to MLE (supposing normal errors).
- An asymptotically equivalent approach is to simply estimate the transformed model

    y_t − ρ̂ y_{t−1} = (x_t − ρ̂ x_{t−1})'β + u_t*

  using n − 1 observations (since y_0 and x_0 aren't available). This is the method of Cochrane and Orcutt. Dropping the first observation is asymptotically irrelevant, but it can be very important in small samples. One can recuperate the first observation by putting

    y_1* = y_1 √(1 − ρ̂²)
    x_1* = x_1 √(1 − ρ̂²).

  This somewhat odd-looking result is related to the Cholesky factorization of Σ⁻¹. See Davidson and MacKinnon, pg. 348-49 for more discussion. Note that the variance of y_1* is σ_u², asymptotically, so we see that the transformed model will be homoscedastic (and nonautocorrelated, since the u's are uncorrelated with the y's in different time periods).
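As an illustration (a minimal sketch, not one of the text's programs), the AR(1) correction with the first observation recuperated, assuming hypothetical variables y (n x 1) and x (n x k):

% sketch: AR(1) FGLS via quasi-differencing (hypothetical variable names)
n = rows(y);
b_ols = (x'*x)\(x'*y);
e = y - x*b_ols;
rho = (e(1:n-1)'*e(2:n)) / (e(1:n-1)'*e(1:n-1));   % OLS of e_t on e_(t-1)
ys = y(2:n)   - rho*y(1:n-1);                       % quasi-differenced data
xs = x(2:n,:) - rho*x(1:n-1,:);
ys = [sqrt(1-rho^2)*y(1);   ys];                    % recuperate first observation
xs = [sqrt(1-rho^2)*x(1,:); xs];
b_fgls = (xs'*xs)\(xs'*ys);                         % OLS on transformed model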

7.5.4. MA(1). The linear regression model with moving average order 1 errors is

    y_t = x_t'β + ε_t
    ε_t = u_t + φ u_{t−1}
    u_t ~ iid(0, σ_u²)
    E(ε_t u_s) = 0, t < s.

In this case,

    V(ε_t) = γ_0 = E[(u_t + φ u_{t−1})²] = σ_u² + φ² σ_u² = σ_u²(1 + φ²).

Similarly,

    γ_1 = E[(u_t + φ u_{t−1})(u_{t−1} + φ u_{t−2})] = φ σ_u²

and

    γ_s = 0,  s > 1,

so in this case

    Σ = σ_u² [ 1+φ²    φ       0     ···    0
               φ       1+φ²    φ            ⋮
               0       φ       ⋱      ⋱     0
               ⋮               ⋱      ⋱     φ
               0       ···     0      φ     1+φ² ]

Note that the first order autocorrelation is

    ρ_1 = γ_1/γ_0 = φ σ_u² / (σ_u²(1 + φ²)) = φ / (1 + φ²).

- This achieves a maximum at φ = 1 and a minimum at φ = −1, and the maximal and minimal autocorrelations are 1/2 and -1/2. Therefore, series that are more strongly autocorrelated can't be MA(1) processes.
Again the covariance matrix has a simple structure that depends on only two parameters. The problem in this case is that one can't estimate φ using OLS on

    ε̂_t = u_t + φ u_{t−1}

because the u_t are unobservable and they can't be estimated consistently. However, there is a simple way to estimate the parameters.

- Since the model is homoscedastic, we can estimate

    V(ε_t) = σ_ε² = σ_u²(1 + φ²)

  using the typical estimator:

    σ̂_ε² = (1/n) Σ_{t=1}^n ε̂_t².

- By the Slutsky theorem, we can interpret this as defining an (unidentified) estimator of both σ_u² and φ, e.g., use this as

    σ̂_u²(1 + φ̂²) = (1/n) Σ_{t=1}^n ε̂_t².

  However, this isn't sufficient to define consistent estimators of the parameters, since it's unidentified.
- To solve this problem, estimate the covariance of ε_t and ε_{t−1} using

    Ĉov(ε_t, ε_{t−1}) = φ̂ σ̂_u² = (1/n) Σ_{t=2}^n ε̂_t ε̂_{t−1}.

  This is a consistent estimator, following a LLN (and given that the epsilon hats are consistent for the epsilons). As above, this can be interpreted as defining an unidentified estimator of the two parameters.
- Now solve these two equations to obtain identified (and therefore consistent) estimators of both φ and σ_u². Define the consistent estimator

    Σ̂ = Σ(φ̂, σ̂_u²)

  following the form we've seen above, and transform the model using the Cholesky decomposition. The transformed model satisfies the classical assumptions asymptotically.
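A minimal sketch of the moment-matching step (not from the text; it assumes a hypothetical vector e of OLS residuals and that the estimated first-order autocorrelation is nonzero and no larger than 1/2 in absolute value):

% sketch: solving the two moment equations of the MA(1) model
n  = rows(e);
g0 = e'*e/n;                        % estimate of sig_u^2 * (1 + phi^2)
g1 = e(2:n)'*e(1:n-1)/n;            % estimate of phi * sig_u^2
r  = g1/g0;                         % first-order autocorrelation, phi/(1+phi^2)
phi  = (1 - sqrt(1 - 4*r^2))/(2*r); % invertible root (needs 0 < |r| <= 1/2)
sigu = g1/phi;                      % implied innovation variance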

7.5.5. Asymptotically valid inferences with autocorrelation of unknown form. See Hamilton Ch. 10, pp. 261-2 and 280-84.

When the form of autocorrelation is unknown, one may decide to use the OLS estimator, without correction. We've seen that this estimator has the limiting distribution

    √n (β̂ − β_0) →d N(0, Q_X⁻¹ Ω Q_X⁻¹)

where, as before, Ω is

    Ω = lim E[(1/n) X'εε'X].

We need a consistent estimate of Ω. Define m_t = x_t ε_t (recall that x_t is defined as a K×1 vector). Note that

    X'ε = Σ_{t=1}^n x_t ε_t = Σ_{t=1}^n m_t,

so that

    Ω = lim (1/n) E[(Σ_{t=1}^n m_t)(Σ_{t=1}^n m_t)'].

We assume that m_t is covariance stationary (so that the covariance between m_t and m_{t−s} does not depend on t). Define the v-th autocovariance of m_t as

    Γ_v = E(m_t m_{t−v}').

Note that E(m_t m_{t+v}') = Γ_v'. (Show this with an example.) In general, we expect that:

- m_t will be autocorrelated, since ε_t is potentially autocorrelated:

    Γ_v = E(m_t m_{t−v}') ≠ 0.

  Note that this autocovariance does not depend on t, due to covariance stationarity.
- m_t will be contemporaneously correlated (E(m_{it} m_{jt}) ≠ 0), since the regressors in m_t will in general be correlated (more on this later).
- m_t will be heteroscedastic (E(m_{it}²) = σ_i², which depends upon i), again since the regressors will have different variances.

While one could estimate Ω parametrically, we in general have little information upon which to base a parametric specification. Recent research has focused on consistent nonparametric estimators of Ω.

Now define

    Ω_n = E[(1/n)(Σ_{t=1}^n m_t)(Σ_{t=1}^n m_t)'].

We have (show that the following is true, by expanding the sum and shifting rows to the left)

    Ω_n = Γ_0 + ((n−1)/n)(Γ_1 + Γ_1') + ((n−2)/n)(Γ_2 + Γ_2') + ··· + (1/n)(Γ_{n−1} + Γ_{n−1}').

The natural, consistent estimator of Γ_v is

    Γ̂_v = (1/n) Σ_{t=v+1}^n m̂_t m̂_{t−v}',

where m̂_t = x_t ε̂_t (note: one could put 1/(n−v) instead of 1/n here). So, a natural, but inconsistent, estimator of Ω_n would be

    Ω̂_n = Γ̂_0 + ((n−1)/n)(Γ̂_1 + Γ̂_1') + ··· + (1/n)(Γ̂_{n−1} + Γ̂_{n−1}').

This estimator is inconsistent in general, since the number of parameters to estimate is more than the number of observations, and increases more rapidly than n, so information does not build up as n → ∞.

On the other hand, supposing that Γ_v tends to zero sufficiently rapidly as v tends to ∞, a modified estimator

    Ω̂_n = Γ̂_0 + Σ_{v=1}^{q(n)} (Γ̂_v + Γ̂_v')

will be consistent, provided q(n) grows sufficiently slowly.

- The assumption that autocorrelations die off is reasonable in many cases. For example, the AR(1) model with |ρ| < 1 has autocorrelations that die off.
- The term (n−v)/n can be dropped because it tends to one for v < q(n), given that q(n) increases slowly relative to n.
- A disadvantage of this estimator is that it may not be positive definite. This could cause one to calculate a negative χ² statistic, for example!
- Newey and West proposed an estimator (Econometrica, 1987) that solves the problem of possible nonpositive definiteness of the above estimator. Their estimator is

    Ω̂_n = Γ̂_0 + Σ_{v=1}^{q(n)} [1 − v/(q+1)] (Γ̂_v + Γ̂_v').

  This estimator is p.d. by construction. The condition for consistency is that q grows slowly enough that n^{−1/4} q → 0, which is a very slow rate of growth for q. This estimator is nonparametric - we've placed no parametric restrictions on the form of Ω. It is an example of a kernel estimator.

Finally, since Ω_n has Ω as its limit, Ω̂_n →p Ω. We can now use Ω̂_n and Q̂_X = (1/n) X'X to consistently estimate the limiting distribution of the OLS estimator under heteroscedasticity and autocorrelation of unknown form. With this, asymptotically valid tests are constructed in the usual way.
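A minimal sketch of the Newey-West calculation (not from the text; it assumes hypothetical variables x (n x k), OLS residuals e (n x 1), and a chosen truncation lag q):

% sketch: Newey-West estimate of the OLS limiting covariance
n = rows(x);
m = x .* repmat(e, 1, columns(x));        % m_t = x_t * eps_t, row by row
omega = m'*m/n;                            % Gamma_0
for v = 1:q
    gamma_v = (m(v+1:n,:)' * m(1:n-v,:)) / n;       % v-th autocovariance
    omega  += (1 - v/(q+1)) * (gamma_v + gamma_v'); % Bartlett (Newey-West) weight
endfor
xx_inv = inv(x'*x/n);
V = xx_inv * omega * xx_inv / n;           % estimated var-cov of beta_hat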

7.5.6. Testing for autocorrelation. Durbin-Watson test

The Durbin-Watson test statistic is

    DW = Σ_{t=2}^n (ε̂_t − ε̂_{t−1})² / Σ_{t=1}^n ε̂_t²
       = Σ_{t=2}^n (ε̂_t² − 2 ε̂_t ε̂_{t−1} + ε̂_{t−1}²) / Σ_{t=1}^n ε̂_t².

- The null hypothesis is that the first order autocorrelation of the errors is zero: H_0: ρ_1 = 0. The alternative is of course H_A: ρ_1 ≠ 0. Note that the alternative is not that the errors are AR(1), since many general patterns of autocorrelation will have the first order autocorrelation different than zero. For this reason the test is useful for detecting autocorrelation in general. For the same reason, one shouldn't just assume that an AR(1) model is appropriate when the DW test rejects the null.
- Under the null, the middle term tends to zero, and the other two tend to one, so DW →p 2.
- Supposing that we had an AR(1) error process with ρ = 1: in this case the middle term tends to −2, so DW →p 0.
- Supposing that we had an AR(1) error process with ρ = −1: in this case the middle term tends to 2, so DW →p 4.
- These are the extremes: DW always lies between 0 and 4.
FIGURE 7.5.2. Durbin-Watson critical values
- The distribution of the test statistic depends on the matrix of regressors, X, so tables can't give exact critical values. They give upper and lower bounds, which correspond to the extremes that are possible. See Figure 7.5.2. There are means of determining exact critical values conditional on X.
- Note that DW can be used to test for nonlinearity (add discussion).
- The DW test is based upon the assumption that the matrix X is fixed in repeated samples. This is often unreasonable in the context of economic time series, which is precisely the context where the test would have application. It is possible to relate the DW test to other test statistics which are valid without strict exogeneity.
Breusch-Godfrey test

This test uses an auxiliary regression, as does the White test for heteroscedasticity. The regression is

    ε̂_t = x_t'δ + γ_1 ε̂_{t−1} + γ_2 ε̂_{t−2} + ··· + γ_P ε̂_{t−P} + v_t

and the test statistic is the nR² statistic, just as in the White test. There are P restrictions, so the test statistic is asymptotically distributed as a χ²(P).

- The intuition is that the lagged errors shouldn't contribute to explaining the current error if there is no autocorrelation.
- x_t is included as a regressor to account for the fact that the ε̂_t are not independent even if the ε_t are. This is a technicality that we won't go into here.
- This test is valid even if the regressors are stochastic and contain lagged dependent variables, so it is considerably more useful than the DW test for typical time series data.
- The alternative is not that the model is an AR(P), following the argument above. The alternative is simply that some or all of the first P autocorrelations are different from zero. This is compatible with many specific forms of autocorrelation.
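A minimal sketch of the Breusch-Godfrey calculation (not the text's GLS/HetTests.m or GLS/NerloveAR.m; hypothetical variables y (n x 1), x (n x k, including a constant), and P lags):

% sketch: Breusch-Godfrey test with P lagged residuals
n = rows(y);
b = (x'*x)\(x'*y);
e = y - x*b;
elags = zeros(n, P);
for j = 1:P
    elags(j+1:n, j) = e(1:n-j);          % lagged residuals, zeros at the start
endfor
z  = [x, elags];                          % auxiliary regression
g  = (z'*z)\(z'*e);
u  = e - z*g;
R2 = 1 - (u'*u)/(e'*e);                   % R^2 of the auxiliary regression
BG = n*R2                                 % compare to a chi^2(P) critical value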

7.5.7. Lagged dependent variables and autocorrelation. We've seen that the OLS estimator is consistent under autocorrelation, as long as plim X'ε/n = 0. This will be the case when E(X'ε) = 0, following a LLN. An important exception is the case where X contains lagged dependent variables and the errors are autocorrelated. A simple example is the case of a single lag of the dependent variable with AR(1) errors. The model is

    y_t = x_t'β + γ y_{t−1} + ε_t
    ε_t = ρ ε_{t−1} + u_t.

Now we can write

    E(y_{t−1} ε_t) = E[(x_{t−1}'β + γ y_{t−2} + ε_{t−1})(ρ ε_{t−1} + u_t)],

which is clearly nonzero, since one of the terms is E(ρ ε_{t−1}²) = ρ σ_ε² ≠ 0. In this case E(X'ε) ≠ 0, and therefore plim X'ε/n ≠ 0. Since

    plim β̂ = β + plim (X'X/n)⁻¹ (X'ε/n),

the OLS estimator is inconsistent in this case. One needs to estimate by instrumental variables (IV), which we'll get to later.
7.5.8. Examples.

Nerlove model, yet again. The Nerlove model uses cross-sectional data, so one may not think of performing tests for autocorrelation. However, specification error can induce autocorrelated errors. Consider the simple Nerlove model

    ln C = β_1 + β_2 ln Q + β_3 ln P_L + β_4 ln P_F + β_5 ln P_K + ε

and the extended Nerlove model used earlier in this chapter, in which the constant and the output coefficient are allowed to differ across the five output groups.
FIGURE 7.6.1. Residuals of simple Nerlove model (legend: Residuals; Quadratic fit to Residuals)
We have seen evidence that the extended model is preferred. So if it is in fact the proper model, the simple model is misspecified. Let's check if this misspecification might induce autocorrelated errors.

The Octave program GLS/NerloveAR.m estimates the simple Nerlove model, plots the residuals as a function of ln Q, and calculates a Breusch-Godfrey test statistic. The residual plot is in Figure 7.6.1, and the test results are:

                          Value    p-value
Breusch-Godfrey test     34.930      0.000
Clearly, there is a problem of autocorrelated residuals.


EXERCISE 7.6. Repeat the autocorrelation tests using the extended Nerlove model (Equation ??) to see if the problem is solved.
Klein model. Klein's Model 1 is a simple macroeconometric model. One of the equations in the model explains consumption (C) as a function of profits (P), both current and lagged, as well as the sum of wages in the private sector (W^p) and wages in the government sector (W^g). Have a look at the README file for this data set. This gives the variable names and other information.

Consider the model

    C_t = α_0 + α_1 P_t + α_2 P_{t−1} + α_3 (W_t^p + W_t^g) + ε_t.

The Octave program GLS/Klein.m estimates this model by OLS, plots the residuals, and performs the Breusch-Godfrey test, using 1 lag of the residuals. The estimation and test results are:
*********************************************************
OLS estimation results
Observations 21
R-squared 0.981008
Sigma-squared 1.051732

Results (Ordinary var-cov estimator)

                  estimate   st.err.   t-stat.   p-value
Constant            16.237     1.303    12.464     0.000
Profits              0.193     0.091     2.115     0.049
Lagged Profits       0.090     0.091     0.992     0.335
Wages                0.796     0.040    19.933     0.000
*********************************************************

                          Value    p-value
Breusch-Godfrey test      1.539      0.215

FIGURE 7.6.2. OLS residuals, Klein consumption equation (legend: Regression residuals; Residuals)
and the residual plot is in Figure 7.6.2. The test does not reject the null of non-autocorrelated errors, but we should remember that we have only 21 observations, so power is likely to be fairly low. The residual plot leads me to suspect that there may be autocorrelation - there are some significant runs below and above the x-axis. Your opinion may differ.

Since it seems that there may be autocorrelation, let's try an AR(1) correction. The Octave program GLS/KleinAR1.m estimates the Klein consumption equation assuming that the errors follow the AR(1) pattern. The results, with the Breusch-Godfrey test for remaining autocorrelation, are:
*********************************************************
OLS estimation results
Observations 21
R-squared 0.967090
Sigma-squared 0.983171

Results (Ordinary var-cov estimator)

                  estimate   st.err.   t-stat.   p-value
Constant            16.992     1.492    11.388     0.000
Profits              0.215     0.096     2.232     0.039
Lagged Profits       0.076     0.094     0.806     0.431
Wages                0.774     0.048    16.234     0.000
*********************************************************

                          Value    p-value
Breusch-Godfrey test      2.129      0.345
- The test is farther away from the rejection region than before, and the residual plot is a bit more favorable for the hypothesis of nonautocorrelated residuals, IMHO. For this reason, it seems that the AR(1) correction might have improved the estimation.
- Nevertheless, there has not been much of an effect on the estimated coefficients nor on their estimated standard errors. This is probably because the estimated AR(1) coefficient is not very large (around 0.2).
The existence or not of autocorrelation in this model will be important


later, in the section on simultaneous equations.

Exercises

(1) Comparing the variances of the OLS and GLS estimators, I claimed that the following holds:

    Var(β̃) − Var(β̂_GLS) = A Σ A',  where A = (X'X)⁻¹X' − (X'Σ⁻¹X)⁻¹X'Σ⁻¹.

(2) Verify that this is true.
(3) Show that the GLS estimator can be defined as

    β̂_GLS = argmin_β (y − Xβ)' Σ⁻¹ (y − Xβ).

(4) The limiting distribution of the OLS estimator with heteroscedasticity of unknown form is

    √n (β̂ − β_0) →d N(0, Q_X⁻¹ Ω Q_X⁻¹),

    where

    Ω = lim E[(1/n) X'εε'X].

    Explain why

    Ω̂ = (1/n) Σ_{t=1}^n x_t x_t' ε̂_t²

    is a consistent estimator of this matrix.
(5) Define the v-th autocovariance of a covariance stationary process m_t as

    Γ_v = E(m_t m_{t−v}').

    Show that E(m_t m_{t+v}') = Γ_v'.
(6) For the Nerlove model

    ln C = β_1 + β_2 ln Q + β_3 ln P_L + β_4 ln P_F + β_5 ln P_K + ε,

    assume that the error variance differs across the five output groups (groupwise heteroscedasticity, as in the example above).
    (a) Calculate the FGLS estimator and interpret the estimation results.
    (b) Test the transformed model to check whether it appears to satisfy homoscedasticity.
CHAPTER 8

Stochastic regressors

Up to now we have treated the regressors as fixed, which is clearly unrealistic. Now we will assume they are random. There are several ways to think of the problem. First, if we are interested in an analysis conditional on the explanatory variables, then it is irrelevant if they are stochastic or not, since conditional on the values the regressors take on, they are nonstochastic, which is the case already considered.

- In cross-sectional analysis it is usually reasonable to make the analysis conditional on the regressors.
- In dynamic models, where y_t may depend on y_{t−1}, a conditional analysis is not sufficiently general, since we may want to predict into the future many periods out, so we need to consider the behavior of β̂ and the relevant test statistics unconditional on X.

The model we'll deal with involves a combination of the following assumptions.

Linearity: the model is a linear function of the parameter vector β_0:

    y_t = x_t'β_0 + ε_t,

or in matrix form,

    y = Xβ_0 + ε,

where y is n×1, X = (x_1, x_2, ..., x_n)' with x_t a K×1 vector, and β_0 and ε are conformable.
Stochastic, linearly independent regressors:
X has rank K with probability 1; X is stochastic; and

    lim_{n→∞} Pr((1/n) X'X = Q_X) = 1,

where Q_X is a finite positive definite matrix.

Central limit theorem:

    n^{−1/2} X'ε →d N(0, Q_X σ_0²).

Normality (Optional): ε is normally distributed.

Strongly exogenous regressors:

    E(ε_t | X) = 0, ∀t.                                    (8.0.1)

Weakly exogenous regressors:

    E(ε_t | x_t) = 0, ∀t.                                  (8.0.2)

In both cases, x_t'β is the conditional mean of y_t given x_t: E(y_t | x_t) = x_t'β.
8.1. Case 1

Normality of ε, strongly exogenous regressors

In this case,

    β̂ = β_0 + (X'X)⁻¹X'ε
    E(β̂ | X) = β_0 + (X'X)⁻¹X'E(ε | X) = β_0,

and since this holds for all X, E(β̂) = β_0, unconditional on X. Likewise,

    β̂ | X ~ N(β_0, (X'X)⁻¹ σ_0²).

- If the density of X is dμ(X), the marginal density of β̂ is obtained by multiplying the conditional density by dμ(X) and integrating over X. Doing this leads to a nonnormal density for β̂, in small samples.
- However, conditional on X, the usual test statistics have the t, F and χ² distributions. Importantly, these distributions don't depend on X, so when marginalizing to obtain the unconditional distribution, nothing changes. The tests are valid in small samples.
- Summary: When X is stochastic but strongly exogenous and ε is normally distributed:
  (1) β̂ is unbiased
  (2) β̂ is nonnormally distributed
  (3) The usual test statistics have the same distribution as with nonstochastic X.
  (4) The Gauss-Markov theorem still holds, since it holds conditionally on X, and this is true for all X.
  (5) Asymptotic properties are treated in the next section.
8.2. Case 2

ε nonnormally distributed, strongly exogenous regressors

The unbiasedness of β̂ carries through as before. However, the argument regarding test statistics doesn't hold, due to nonnormality of ε. Still, we have

    β̂ = β_0 + (X'X)⁻¹X'ε = β_0 + (X'X/n)⁻¹ (X'ε/n).

Now

    (X'X/n)⁻¹ →p Q_X⁻¹

by assumption, and

    X'ε/n = (n^{−1/2} X'ε)/√n →p 0,

since the numerator converges to a N(0, Q_X σ_0²) r.v. and the denominator still goes to infinity. We have unbiasedness and the variance disappearing, so the estimator is consistent:

    β̂ →p β_0.

Considering the asymptotic distribution,

    √n (β̂ − β_0) = (X'X/n)⁻¹ n^{−1/2} X'ε,

so

    √n (β̂ − β_0) →d N(0, Q_X⁻¹ σ_0²),

directly following the assumptions. Asymptotic normality of the estimator still holds. Since the asymptotic results on all test statistics only require this, all the previous asymptotic results on test statistics are also valid in this case.

- Summary: Under strongly exogenous regressors, with ε normal or nonnormal, β̂ has the properties:
  (1) Unbiasedness
  (2) Consistency
  (3) Gauss-Markov theorem holds, since it holds in the previous case and doesn't depend on normality.
  (4) Asymptotic normality
  (5) Tests are asymptotically valid, but are not valid in small samples.
8.3. Case 3

Weakly exogenous regressors

An important class of models are dynamic models, where lagged dependent variables have an impact on the current value. A simple version of these models that captures the important points is

    y_t = z_t'α + γ y_{t−1} + ε_t = x_t'β + ε_t,

where now x_t contains lagged dependent variables. Clearly, even with E(ε_t | x_t) = 0, X and ε are not uncorrelated, so one can't show unbiasedness. For example,

    E(x_t ε_{t−1}) ≠ 0,

since x_t contains y_{t−1} (which is a function of ε_{t−1}) as an element.

- This fact implies that all of the small sample properties such as unbiasedness, the Gauss-Markov theorem, and small sample validity of test statistics do not hold in this case. Recall Figure 3.7.2. This is a case of weakly exogenous regressors, and we see that the OLS estimator is biased in this case.
- Nevertheless, under the above assumptions, all asymptotic properties continue to hold, using the same arguments as before.
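A minimal simulation sketch (not from the text) illustrating the small-sample bias of OLS with a lagged dependent variable and AR(1) errors; all names and parameter values are hypothetical:

% sketch: OLS bias with lagged dependent variable and AR(1) errors
n = 1000; rho = 0.9; gam = 0.5;
e = zeros(n,1); y = zeros(n,1);
for t = 2:n
    e(t) = rho*e(t-1) + randn;    % AR(1) error
    y(t) = gam*y(t-1) + e(t);     % true coefficient on lagged y is 0.5
endfor
x = y(1:n-1);                     % regressor: lagged y
b = (x'*x)\(x'*y(2:n))            % OLS estimate; tends to exceed 0.5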

8.4. When are the assumptions reasonable?

The two assumptions we've added are

(1) lim_{n→∞} Pr((1/n) X'X = Q_X) = 1, with Q_X a finite positive definite matrix.
(2) n^{−1/2} X'ε →d N(0, Q_X σ_0²).

The most complicated case is that of dynamic models, since the other cases can be treated as nested in this case. There exist a number of central limit theorems for dependent processes, many of which are fairly technical. We won't enter into details (see Hamilton, Chapter 7 if you're interested). A main requirement for use of standard asymptotics for a dependent sequence {z_t}, i.e., for

    z̄_n = (1/n) Σ_{t=1}^n z_t

to converge in probability to a finite limit, is that {z_t} be stationary, in some sense.

- Strong stationarity requires that the joint distribution of the set {z_t, z_{t+s}, z_{t+q}, ...} not depend on t.
- Covariance (weak) stationarity requires that the first and second moments of this set not depend on t.
- An example of a sequence that doesn't satisfy this is an AR(1) process with a unit root (a random walk):

    x_t = x_{t−1} + ε_t,  ε_t ~ iid(0, σ²).

  One can show that the variance of x_t depends upon t in this case.
- Stationarity prevents the process from trending off to plus or minus infinity, and prevents cyclical behavior which would allow correlations between far removed z_t and z_s to be high. Draw a picture here.
- In summary, the assumptions are reasonable when the stochastic conditioning variables have variances that are finite, and are not too strongly dependent. The AR(1) model with unit root is an example of a case where the dependence is too strong for standard asymptotics to apply.
- The econometrics of nonstationary processes has been an active area of research in the last two decades. The standard asymptotics don't apply in this case. This isn't in the scope of this course.
Exercises

(1) Show that for two random variables A and B, if E(A|B) = 0, then E(A f(B)) = 0 for any function f(B). How is this used in the Gauss-Markov theorem?
(2) Is it possible for an AR(1) model for time series data, e.g., y_t = β_0 + β_1 y_{t−1} + ε_t, to satisfy weak exogeneity? Strong exogeneity? Discuss.
CHAPTER 9

Data problems

In this section we'll consider problems associated with the regressor matrix: collinearity, missing observations and measurement error.

9.1. Collinearity

Collinearity is the existence of linear relationships amongst the regressors. We can always write

    λ_1 x_1 + λ_2 x_2 + ··· + λ_K x_K + v = 0,

where x_i is the i-th column of the regressor matrix X, and v is an n×1 vector. In the case that there exists collinearity, the variation in v is relatively small, so that there is an approximately exact linear relation between the regressors.

- "Relative" and "approximate" are imprecise, so it's difficult to define when collinearity exists.

In the extreme, if there are exact linear relationships (every element of v equal to zero), then ρ(X) < K, so ρ(X'X) < K, so X'X is not invertible and the OLS estimator is not uniquely defined. For example, if the model is

    y_t = β_1 + β_2 x_{2t} + β_3 x_{3t} + ε_t
    x_{2t} = α_1 + α_2 x_{3t},
then we can write

    y_t = β_1 + β_2(α_1 + α_2 x_{3t}) + β_3 x_{3t} + ε_t
        = (β_1 + β_2 α_1) + (β_2 α_2 + β_3) x_{3t} + ε_t
        ≡ γ_1 + γ_2 x_{3t} + ε_t.

- The γ's can be consistently estimated, but since the γ's define two equations in three β's, the β's can't be consistently estimated (there are multiple values of β that solve the fonc). The β's are unidentified in the case of perfect collinearity.
- Perfect collinearity is unusual, except in the case of an error in construction of the regressor matrix, such as including the same regressor twice.
Another case where perfect collinearity may be encountered is with models with dummy variables, if one is not careful. Consider a model of the rental price (y_i) of an apartment. This could depend on factors such as size, quality, etc., collected in x_i, as well as on the location of the apartment. Let B_i = 1 if the i-th apartment is in Barcelona, B_i = 0 otherwise. Similarly, define G_i, T_i and L_i for Girona, Tarragona and Lleida. One could use a model such as

    y_i = β_1 + β_2 B_i + β_3 G_i + β_4 T_i + β_5 L_i + x_i'γ + ε_i.

In this model, B_i + G_i + T_i + L_i = 1, ∀i, so there is an exact relationship between these variables and the column of ones corresponding to the constant. One must either drop the constant, or one of the qualitative variables.
FIGURE 9.1.1. The OLS objective function when there is no collinearity
9.1.1. A brief aside on dummy variables. Introduce a brief discussion of dummy variables here.

9.1.2. Back to collinearity. The more common case, if one doesn't make mistakes such as these, is the existence of inexact linear relationships, i.e., correlations between the regressors that are less than one in absolute value, but not zero. The basic problem is that when two (or more) variables move together, it is difficult to determine their separate influences. This is reflected in imprecise estimates, i.e., estimates with high variances. With economic data, collinearity is commonly encountered, and is often a severe problem.

When there is collinearity, the minimizing point of the objective function that defines the OLS estimator (the sum of squared errors) is relatively poorly defined. This is seen in Figures 9.1.1 and 9.1.2.
FIGURE 9.1.2. The OLS objective function when there is collinearity
To see the effect of collinearity on variances, partition the regressor matrix as

    X = [x  W],

where x is the first column of X (note: we can interchange the columns of X if we like, so there's no loss of generality in considering the first column). Now, the variance of β̂, under the classical assumptions, is

    V(β̂) = (X'X)⁻¹ σ².

Using the partition,

    X'X = [ x'x   x'W
            W'x   W'W ],

and following a rule for partitioned inversion,

    (X'X)⁻¹_{1,1} = (x'x − x'W(W'W)⁻¹W'x)⁻¹
                  = (x'(I_n − W(W'W)⁻¹W')x)⁻¹
                  = (ESS_{x|W})⁻¹,

where by ESS_{x|W} we mean the error sum of squares obtained from the regression

    x = Wλ + v.

Since R² = 1 − ESS/TSS, we have ESS = TSS(1 − R²), so the variance of the coefficient corresponding to x is

    V(β̂_x) = σ² / (TSS_x (1 − R²_{x|W})).

We see three factors influence the variance of this coefficient. It will be high if

(1) σ² is large
(2) There is little variation in x. Draw a picture here.
(3) There is a strong linear relationship between x and the other regressors, so that W can explain the movement in x well. In this case, R²_{x|W} will be close to 1. As R²_{x|W} → 1, V(β̂_x) → ∞.

The last of these cases is collinearity.
Intuitively, when there are strong linear relations between the regressors, it is difficult to determine the separate influence of the regressors on the dependent variable. This can be seen by comparing the OLS objective function in the case of no correlation between regressors with the objective function with correlation between the regressors. See the figures nocollin.ps (no correlation) and collin.ps (correlation), available on the web site.
9.1.3. Detection of collinearity. The best way is simply to regress each explanatory variable in turn on the remaining regressors. If any of these auxiliary regressions has a high R², there is a problem of collinearity. Furthermore, this procedure identifies which parameters are affected.

- Sometimes, we're only interested in certain parameters. Collinearity isn't a problem if it doesn't affect what we're interested in estimating.

An alternative is to examine the matrix of correlations between the regressors.

- High correlations are sufficient but not necessary for severe collinearity.

Also indicative of collinearity is that the model fits well (high R²), but none of the variables is significantly different from zero (e.g., their separate influences aren't well determined).

In summary, the artificial regressions are the best approach if one wants to be careful.
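A minimal sketch of the auxiliary regressions (not from the text; it assumes a hypothetical n x k regressor matrix x whose first column is the constant):

% sketch: R^2 of each regressor on the remaining regressors
k = columns(x);
R2 = zeros(k,1);
for j = 2:k                                % skip the constant in column 1
    xj = x(:,j);
    W  = x(:, [1:j-1, j+1:k]);             % all other regressors
    g  = (W'*W)\(W'*xj);
    u  = xj - W*g;
    R2(j) = 1 - (u'*u)/sum((xj - mean(xj)).^2);
endfor
R2(2:k)   % values close to 1 signal a collinearity problem for that regressor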
9.1.4. Dealing with collinearity. More information

Collinearity is a problem of an uninformative sample. The first question is: is all the available information being used? Is more data available? Are there coefficient restrictions that have been neglected? Picture illustrating how a restriction can solve the problem of perfect collinearity.

Stochastic restrictions and ridge regression
Supposing that there is no more data or neglected restrictions, one possibility is to change perspectives, to Bayesian econometrics. One can express prior beliefs regarding the coefficients using stochastic restrictions. A stochastic linear restriction would be something of the form

    Rβ = r + v,

where R and r are as in the case of exact linear restrictions, but v is a random vector. For example, the model could be

    y = Xβ + ε
    Rβ = r + v
    ε ~ N(0, σ_ε² I_n)
    v ~ N(0, σ_v² I_q).

This sort of model isn't in line with the classical interpretation of parameters as constants: according to this interpretation the left hand side of Rβ = r + v is constant but the right is random. This model does fit the Bayesian perspective: we combine information coming from the model and the data, summarized in

    y = Xβ + ε,  ε ~ N(0, σ_ε² I_n),

with prior beliefs regarding the distribution of the parameter, summarized in

    Rβ ~ N(r, σ_v² I_q).

Since the sample is random it is reasonable to suppose that E(εv') = 0, which is the last piece of information in the specification. How can you estimate using this model? The solution is to treat the restrictions as artificial data. Write

    [ y ]   [ X ]       [ ε ]
    [ r ] = [ R ] β  +  [−v ].

This model is heteroscedastic, since σ_ε² ≠ σ_v². Define the prior precision k = σ_ε/σ_v. This expresses the degree of belief in the restriction relative to the variability of the data. Supposing that we specify k, then the model

    [ y  ]   [ X  ]       [ ε  ]
    [ kr ] = [ kR ] β  +  [−kv ]

is homoscedastic and can be estimated by OLS. Note that this estimator is biased. It is consistent, however, given that k is a fixed constant, even if the restriction is false (this is in contrast to the case of false exact restrictions). To see this, note that there are Q restrictions, where Q is the number of rows of R. As n → ∞, these Q artificial observations have no weight in the objective function, so the estimator has the same limiting objective function as the OLS estimator, and is therefore consistent.
To motivate the use of stochastic restrictions, consider the expectation of the squared length of β̂:

    E(β̂'β̂) = E[(β + (X'X)⁻¹X'ε)'(β + (X'X)⁻¹X'ε)]
            = β'β + E[ε'X(X'X)⁻¹(X'X)⁻¹X'ε]
            = β'β + σ² Tr[(X'X)⁻¹]
            = β'β + σ² Σ_{i=1}^K λ_i       (the trace is the sum of eigenvalues)
            ≥ β'β + σ² λ_max[(X'X)⁻¹]      (the eigenvalues are all positive, since X'X is p.d.),

so

    E(β̂'β̂) ≥ β'β + σ² / λ_min(X'X),

where λ_min(X'X) is the minimum eigenvalue of X'X (which is the inverse of the maximum eigenvalue of (X'X)⁻¹). As collinearity becomes worse and worse, X'X becomes more nearly singular, so λ_min(X'X) tends to zero (recall that the determinant is the product of the eigenvalues) and E(β̂'β̂) tends to infinity. On the other hand, β'β is finite.

Now considering the restriction I_K β = 0 + v. With this restriction the model becomes

    [ y ]   [ X    ]       [ ε  ]
    [ 0 ] = [ kI_K ] β  +  [−kv ],

and the estimator is

    β̂_ridge = (X'X + k² I_K)⁻¹ X'y.

This is the ordinary ridge regression estimator. The ridge regression estimator can be seen to add k² I_K, which is nonsingular, to X'X, which is more and more nearly singular as collinearity becomes worse and worse. As k → ∞, the restrictions tend to β = 0, that is, the coefficients are shrunken toward zero. Also, the estimator tends to

    β̂_ridge = (X'X + k² I_K)⁻¹ X'y → (k² I_K)⁻¹ X'y = X'y/k² → 0,

so β̂_ridge → 0. This is clearly a false restriction in the limit, if our original model is at all sensible.
There should be some amount of shrinkage that is in fact a true restriction. The problem is to determine the k such that the restriction is correct. The interest in ridge regression centers on the fact that it can be shown that there exists a k such that MSE(β̂_ridge) < MSE(β̂_OLS). The problem is that this k depends on β and σ², which are unknown.

The ridge trace method plots the squared length β̂_ridge'β̂_ridge as a function of k, and chooses the value of k that artistically seems appropriate (e.g., where the effect of increasing k dies off). Draw picture here. This means of choosing k is obviously subjective. This is not a problem from the Bayesian perspective: the choice of k reflects prior beliefs about the length of β.

In summary, the ridge estimator offers some hope, but it is impossible to guarantee that it will outperform the OLS estimator. Collinearity is a fact of life in econometrics, and there is no clear solution to the problem.
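A minimal sketch of the ridge estimator and a ridge trace (not from the text; hypothetical names y (n x 1) and x (n x k), and an arbitrary grid for k):

% sketch: ordinary ridge regression and the ridge trace
k = columns(x);
kvals = linspace(0, 10, 50);              % grid of shrinkage parameters
len = zeros(50,1);
for i = 1:50
    b_r = (x'*x + kvals(i)^2*eye(k)) \ (x'*y);   % ridge estimator
    len(i) = sqrt(b_r'*b_r);                      % length of the estimate
endfor
plot(kvals, len);                          % ridge trace: length versus k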

9.2. Measurement error

Measurement error is exactly what it says: either the dependent variable or the regressors are measured with error. Thinking about the way economic data are reported, measurement error is probably quite prevalent. For example, estimates of growth of GDP, inflation, etc. are commonly revised several times. Why should the last revision necessarily be correct?
9.2.1. Error of measurement of the dependent variable. Measurement errors in the dependent variable and the regressors have important differences.

First consider error in measurement of the dependent variable. The data generating process is presumed to be

    y* = Xβ + ε
    y = y* + v
    v_t ~ iid(0, σ_v²),

where y* is the unobservable true dependent variable, and y is what is observed. We assume that ε and v are independent and that y* = Xβ + ε satisfies the classical assumptions. Given this, we have

    y = Xβ + ε + v
      = Xβ + ω,
    ω_t ~ iid(0, σ_ω²),  σ_ω² = σ_ε² + σ_v².

- As long as v is uncorrelated with X, this model satisfies the classical assumptions and can be estimated by OLS. This type of measurement error isn't a problem, then.
9.2.2. Error of measurement of the regressors. The situation isn't so good in this case. The DGP is

    y_t = x̃_t'β + ε_t
    x_t = x̃_t + v_t
    v_t ~ iid(0, Σ_v),

where Σ_v is a K×K matrix. Now x̃_t contains the true, unobserved regressors, and x_t is what is observed. Again assume that v is independent of ε, and that the model y = X̃β + ε satisfies the classical assumptions. Now we have

    y_t = (x_t − v_t)'β + ε_t
        = x_t'β − v_t'β + ε_t
        = x_t'β + ω_t.

The problem is that now there is a correlation between x_t and ω_t, since

    E(x_t ω_t) = E[(x̃_t + v_t)(−v_t'β + ε_t)] = −Σ_v β,

where Σ_v = E(v_t v_t'). Because of this correlation, the OLS estimator is biased and inconsistent, just as in the case of autocorrelated errors with lagged dependent variables. In matrix notation, write the estimated model as

    y = Xβ + ω.
We have that

    β̂ = (X'X/n)⁻¹ (X'y/n),

and

    plim (X'X/n) = plim ((X̃ + V)'(X̃ + V)/n) = Q_X̃ + Σ_v,

since X̃ and V are independent, and

    plim (V'V/n) = lim E[(1/n) Σ_t v_t v_t'] = Σ_v.

Likewise,

    plim (X'y/n) = plim ((X̃ + V)'(X̃β + ε)/n) = Q_X̃ β,

so

    plim β̂ = (Q_X̃ + Σ_v)⁻¹ Q_X̃ β.

So we see that the least squares estimator is inconsistent when the regressors are measured with error.

- A potential solution to this problem is the instrumental variables (IV) estimator, which we'll discuss shortly.
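A minimal simulation sketch (not from the text) of the attenuation implied by the plim formula above; the variances used are arbitrary illustrative values:

% sketch: attenuation bias from measurement error in a single regressor
n = 10000;
xtrue = randn(n,1);                    % true regressor, variance 1
x     = xtrue + sqrt(0.5)*randn(n,1);  % observed with error, Var(v) = 0.5
y     = xtrue + randn(n,1);            % true beta = 1
b     = (x'*x)\(x'*y)                  % plim is 1/(1+0.5) = 2/3, not 1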

9.3. Missing observations

Missing observations occur quite frequently: time series data may not be gathered in a certain year, or respondents to a survey may not answer all questions. We'll consider two cases: missing observations on the dependent variable and missing observations on the regressors.

9.3.1. Missing observations on the dependent variable. In this case, we have

    [ y_1 ]   [ X_1 ]       [ ε_1 ]
    [ y_2 ] = [ X_2 ] β  +  [ ε_2 ],

or y = Xβ + ε, where y_2 is not observed. Otherwise, we assume the classical assumptions hold.

- A clear alternative is to simply estimate using the complete observations

    y_1 = X_1 β + ε_1.

  Since these observations satisfy the classical assumptions, one could estimate by OLS.
- The question remains whether or not one could somehow replace the unobserved y_2 by a predictor, and improve over OLS in some sense.
- Let ŷ_2 be the predictor of y_2. Now

    β̂ = ( [X_1' X_2'] [ X_1 ; X_2 ] )⁻¹ [X_1' X_2'] [ y_1 ; ŷ_2 ]
      = (X_1'X_1 + X_2'X_2)⁻¹ (X_1'y_1 + X_2'ŷ_2).

Recall that the OLS fonc are

    X'Xβ̂ = X'y,

so if we regressed using only the first (complete) observations, we would have

    X_1'X_1 β̂_1 = X_1'y_1.

Likewise, an OLS regression using only the second (filled in) observations would give

    X_2'X_2 β̂_2 = X_2'ŷ_2.

Substituting these into the equation for the overall combined estimator gives

    β̂ = (X_1'X_1 + X_2'X_2)⁻¹ [X_1'X_1 β̂_1 + X_2'X_2 β̂_2]
      = (X_1'X_1 + X_2'X_2)⁻¹ X_1'X_1 β̂_1 + (X_1'X_1 + X_2'X_2)⁻¹ X_2'X_2 β̂_2
      ≡ A β̂_1 + (I_K − A) β̂_2,

where A ≡ (X_1'X_1 + X_2'X_2)⁻¹ X_1'X_1, and we use

    (X_1'X_1 + X_2'X_2)⁻¹ X_2'X_2 = (X_1'X_1 + X_2'X_2)⁻¹ (X_1'X_1 + X_2'X_2 − X_1'X_1)
                                  = I_K − (X_1'X_1 + X_2'X_2)⁻¹ X_1'X_1
                                  = I_K − A.

Now,

    E(β̂) = A β + (I_K − A) E(β̂_2),

and this will be unbiased only if E(β̂_2) = β.
- The conclusion is that the filled-in observations alone would need to define an unbiased estimator. This will be the case only if

    ŷ_2 = X_2 β + ε̂_2,

  where ε̂_2 has mean zero. Clearly, it is difficult to satisfy this condition without knowledge of β.
- Note that putting ŷ_2 = ȳ_1 does not satisfy the condition and therefore leads to a biased estimator.

EXERCISE 13. Formally prove this last statement.
- One possibility that has been suggested (see Greene, page 275) is to estimate β using a first round estimation using only the complete observations

    β̂_1 = (X_1'X_1)⁻¹ X_1'y_1,

  then use this estimate to predict y_2:

    ŷ_2 = X_2 β̂_1 = X_2 (X_1'X_1)⁻¹ X_1'y_1.

  Now, the overall estimate is a weighted average of β̂_1 and β̂_2, just as above, but we have

    β̂_2 = (X_2'X_2)⁻¹ X_2'ŷ_2 = (X_2'X_2)⁻¹ X_2'X_2 β̂_1 = β̂_1.

  This shows that this suggestion is completely empty of content: the final estimator is the same as the OLS estimator using only the complete observations.
9.3.2. The sample selection problem. In the above discussion we assumed that the missing observations are random. The sample selection problem is a case where the missing observations are not random. Consider the model

    y_t* = x_t'β + ε_t,

which is assumed to satisfy the classical assumptions. However, y_t* is not always observed. What is observed is y_t, defined as

    y_t = y_t*  if  y_t* ≥ 0.

Or, in other words, y_t* is missing when it is less than zero.

- The difference in this case is that the missing values are not random: they are correlated with the x_t. Consider the case

    y* = x + ε,

  with V(ε) = 25, but using only the observations for which y* > 0 to estimate. Figure 9.3.1 illustrates the bias. The Octave program is sampsel.m.

FIGURE 9.3.1. Sample selection bias (legend: Data; True Line; Fitted Line)
9.3.3. Missing observations on the regressors. Again the model is

    [ y_1 ]   [ X_1 ]       [ ε_1 ]
    [ y_2 ] = [ X_2 ] β  +  [ ε_2 ],

but we assume now that each row of X_2 has an unobserved component(s). Again, one could just estimate using the complete observations, but it may seem frustrating to have to drop observations simply because of a single missing variable. In general, if the unobserved X_2 is replaced by some prediction, X_2*, then we are in the case of errors of observation. As before, this means that the OLS estimator is biased when X_2* is used instead of X_2. Consistency is salvaged, however, as long as the number of missing observations doesn't increase with n.

- Including observations that have missing values replaced by ad hoc values can be interpreted as introducing false stochastic restrictions. In general, this introduces bias. It is difficult to determine whether MSE increases or decreases. Monte Carlo studies suggest that it is dangerous to simply substitute the mean, for example.
- In the case that there is only one regressor other than the constant, substitution of x̄ for the missing x_t does not lead to bias. This is a special case that doesn't hold for K > 2.

EXERCISE 14. Prove this last statement.
- In summary, if one is strongly concerned with bias, it is best to drop observations that have missing components. There is potential for reduction of MSE through filling in missing elements with intelligent guesses, but this could also increase MSE.
Exercises

(1) Consider the Nerlove model

    ln C = β_1 + β_2 ln Q + β_3 ln P_L + β_4 ln P_F + β_5 ln P_K + ε.

When this model is estimated by OLS, some coefficients are not significant. This may be due to collinearity.
(a) Calculate the correlation matrix of the regressors.
(b) Perform artificial regressions to see if collinearity is a problem.
(c) Apply the ridge regression estimator.
    (i) Plot the ridge trace diagram.
    (ii) Check what happens as k goes to zero, and as k becomes very large.
CHAPTER 10

Functional form and nonnested tests

Though theory often suggests which conditioning variables should be included, and suggests the signs of certain derivatives, it is usually silent regarding the functional form of the relationship between the dependent variable and the regressors. For example, considering a cost function, one could have a Cobb-Douglas model

    C = A w_1^{β_1} w_2^{β_2} q^{β_q} e^ε.

This model, after taking logarithms, gives

    ln C = β_0 + β_1 ln w_1 + β_2 ln w_2 + β_q ln q + ε,

where β_0 = ln A. Theory suggests that β_1, β_2, β_q > 0. This model isn't compatible with a fixed cost of production since C = 0 when q = 0. Homogeneity of degree one in input prices suggests that β_1 + β_2 = 1, while constant returns to scale implies β_q = 1.

While this model may be reasonable in some cases, an alternative

    √C = β_0 + β_1 √w_1 + β_2 √w_2 + β_q √q + ε

may be just as plausible. Note that √x and ln(x) look quite alike, for certain values of the regressors, and up to a linear transform, so it may be difficult to choose between these models.
The basic point is that many functional forms are compatible with the linear-in-parameters model, since this model can incorporate a wide variety of nonlinear transformations of the dependent variable and the regressors. For example, suppose that g(·) is a real valued function and that x(·) is a K-vector-valued function. The following model is linear in the parameters but nonlinear in the variables:

    x_t = x(z_t)
    y_t = x_t'β + ε_t.

There may be P fundamental conditioning variables z_t, but there may be K regressors, where K may be smaller than, equal to or larger than P. For example, x_t could include squares and cross products of the conditioning variables in z_t.
10.1. Flexible functional forms

Given that the functional form of the relationship between the dependent variable and the regressors is in general unknown, one might wonder if there exist parametric models that can closely approximate a wide variety of functional relationships. A "Diewert-flexible" functional form is defined as one such that the function, the vector of first derivatives and the matrix of second derivatives can take on an arbitrary value at a single data point. Flexibility in this sense clearly requires that there be at least

    1 + P + (P² + P)/2

free parameters: one for each independent effect that we wish to model.
Suppose that the model is

    y = g(x) + ε.

A second-order Taylor's series expansion (with remainder term) of the function g(x) about the point x = 0 is

    g(x) = g(0) + x'D_x g(0) + (1/2) x'D²_x g(0) x + R.

Use the approximation, which simply drops the remainder term, as an approximation to g(x):

    g(x) ≈ g_K(x) = g(0) + x'D_x g(0) + (1/2) x'D²_x g(0) x.

As x → 0, the approximation becomes more and more exact, in the sense that g_K(x) → g(x), D_x g_K(x) → D_x g(x) and D²_x g_K(x) → D²_x g(x). For x = 0, the approximation is exact, up to the second order. The idea behind many flexible functional forms is to note that g(0), D_x g(0) and D²_x g(0) are all constants. If we treat them as parameters, the approximation will have exactly enough free parameters to approximate the function g(x), which is of unknown form, exactly, up to second order, at the point x = 0. The model is

    g_K(x) = α + x'β + (1/2) x'Γ x,

so the regression model to fit is

    y = α + x'β + (1/2) x'Γ x + ε.
- While the regression model has enough free parameters to be Diewert-flexible, the question remains: is α̂ →p g(0)? Is β̂ →p D_x g(0)? Is Γ̂ →p D²_x g(0)?
- The answer is no, in general. The reason is that if we treat the true values of the parameters as these derivatives, then ε is forced to play the part of the remainder term, which is a function of x, so that x and ε are correlated in this case. As before, the estimator is biased in this case.
- A simpler example would be to consider a first-order T.S. approximation to a quadratic function. Draw picture.
- The conclusion is that "flexible functional forms" aren't really flexible in a useful statistical sense, in that neither the function itself nor its derivatives are consistently estimated, unless the function belongs to the parametric family of the specified functional form. In order to lead to consistent inferences, the regression model must be correctly specified.
10.1.1. The translog form. In spite of the fact that FFFs aren't really as flexible as they were originally claimed to be, they are useful, and they are certainly subject to less bias due to misspecification of the functional form than are many popular forms, such as the Cobb-Douglas or the simple linear in the variables model. The translog model is probably the most widely used FFF. This model is as above, except that the variables are subjected to a logarithmic transformation. Also, the expansion point is usually taken to be the sample mean of the data, after the logarithmic transformation. The model is defined
by

    ln y = α + x'β + (1/2) x'Γ x + ε,
    x = ( ln(z_1/z̄_1), ln(z_2/z̄_2), ..., ln(z_P/z̄_P) )'.

In this presentation, the t subscript that distinguishes observations is suppressed for simplicity. Note that

    ∂ ln y / ∂ x_i = β_i + (Γ x)_i     (the other part of x is constant)
                   = ∂ ln y / ∂ ln z_i,

which is the elasticity of y with respect to z_i. This is a convenient feature of the translog model. Note that at the means of the conditioning variables, z̄, we have x = 0, so the β's are the first-order elasticities, at the means of the data.

To illustrate, consider that y is cost of production:

    y = c(w, q),

where w is a vector of input prices and q is output. We could add other variables by extending q in the obvious manner, but this is suppressed for simplicity.
By Shephard's lemma, the conditional factor demands are

    x(w, q) = ∂c(w, q)/∂w,

and the cost shares of the factors are therefore

    s = w ⊙ x / c = (∂c(w, q)/∂w) ⊙ w / c,

which is simply the vector of elasticities of cost with respect to input prices. If the cost function is modeled using a translog function, we have

    ln c = α + x'β + z δ + (1/2) [ x'  z ] [ Γ_11   Γ_12 ] [ x ]
                                           [ Γ_12'  Γ_22 ] [ z ]
         = α + x'β + z δ + (1/2) x'Γ_11 x + x'Γ_12 z + (1/2) z² Γ_22,

where x = ln(w/w̄) (element-by-element division), z = ln(q/q̄), and

    Γ_11 = [ γ_11  γ_12 ],   Γ_12 = [ γ_13 ],   Γ_22 = γ_33.
           [ γ_12  γ_22 ]           [ γ_23 ]

Note that symmetry of the second derivatives has been imposed.

Then the share equations are just

    s = β + [ Γ_11  Γ_12 ] [ x ]
                           [ z ].
Therefore, the share equations and the cost equation have parameters in common. By pooling the equations together and imposing the (true) restriction that the parameters of the equations be the same, we can gain efficiency.

To illustrate in more detail, consider the case of two inputs, so x = (x_1, x_2)'. In this case the translog model of the logarithmic cost function is

    ln c = α + β_1 x_1 + β_2 x_2 + δ z + (γ_11/2) x_1² + (γ_22/2) x_2² + (γ_33/2) z² + γ_12 x_1 x_2 + γ_13 x_1 z + γ_23 x_2 z.

The two cost shares of the inputs are the derivatives of ln c with respect to x_1 and x_2:

    s_1 = β_1 + γ_11 x_1 + γ_12 x_2 + γ_13 z
    s_2 = β_2 + γ_12 x_1 + γ_22 x_2 + γ_23 z.

Note that the share equations and the cost equation have parameters in common. One can do a pooled estimation of the three equations at once, imposing that the parameters are the same. In this way we're using more observations and therefore more information, which will lead to improved efficiency. Note that this does assume that the cost equation is correctly specified (i.e., not an approximation), since otherwise the derivatives would not be the true derivatives of the log cost function, and would then be misspecified for the shares. To pool the equations, write the model in matrix form (adding in error terms). With the appropriate notation, a single observation can be written as

    y_t = X_t θ + ε_t,

which is one observation on the three equations, where

    y_t = ( ln c_t,  s_{1t},  s_{2t} )',
    θ   = ( α, β_1, β_2, δ, γ_11, γ_22, γ_33, γ_12, γ_13, γ_23 )',

and X_t stacks the regressors of the cost equation and the two share equations:

    X_t = [ 1  x_1  x_2  z  x_1²/2  x_2²/2  z²/2  x_1 x_2  x_1 z  x_2 z
            0   1    0   0   x_1      0      0      x_2      z     0
            0   0    1   0    0      x_2     0      x_1      0     z   ].

The overall model would stack observations on the three equations, for a total of 3n observations:

    [ y_1 ]   [ X_1 ]       [ ε_1 ]
    [ y_2 ] = [ X_2 ] θ  +  [ ε_2 ]
    [  ⋮  ]   [  ⋮  ]       [  ⋮  ]
    [ y_n ]   [ X_n ]       [ ε_n ].
Next we need to consider the errors. For observation t the errors can be placed in a vector

    ε_t = ( ε_{1t}, ε_{2t}, ε_{3t} )'.

First consider the covariance matrix of this vector: the shares are certainly correlated since they must sum to one. (In fact, with 2 shares the variances are equal and the covariance is -1 times the variance. General notation is used to allow easy extension to the case of more than 2 inputs.) Also, it's likely that the shares and the cost equation have different variances. Supposing that the model is covariance stationary, the variance of ε_t won't depend upon t:

    Var(ε_t) = Σ_0 = [ σ_11  σ_12  σ_13
                        ·    σ_22  σ_23
                        ·     ·    σ_33 ].

Note that this matrix is singular, since the shares sum to 1. Assuming that there is no autocorrelation, the overall covariance matrix has the seemingly unrelated regressions (SUR) structure:

    Var[ ε_1 ]   [ Σ_0   0   ···   0  ]
       [ ε_2 ] = [  0   Σ_0        ⋮  ]  =  I_n ⊗ Σ_0,
       [  ⋮  ]   [  ⋮         ⋱       ]
       [ ε_n ]   [  0   ···        Σ_0 ]

where the symbol ⊗ indicates the Kronecker product. The Kronecker product of two matrices A and B is

    A ⊗ B = [ a_11 B   a_12 B   ···   a_1q B
                ⋮                        ⋮
              a_p1 B    ···           a_pq B ].

- Personally, I can never keep straight the roles of A and B.
10.1.2. FGLS estimation of a translog model. So, this model has heteroscedasticity and autocorrelation, so OLS won't be efficient. The next question is: how do we estimate efficiently using FGLS? FGLS is based upon inverting the estimated error covariance Σ̂. So we need to estimate Σ.

An asymptotically efficient procedure is (supposing normality of the errors):
(1) Estimate each equation by OLS.
(2) Estimate Σ_0 using

    Σ̂_0 = (1/n) Σ_{t=1}^n ε̂_t ε̂_t'.

(3) Next we need to account for the singularity of Σ_0. It can be shown that Σ̂_0 will be singular when the shares sum to one, so FGLS won't work. The solution is to drop one of the share equations, for example the second. The model becomes, with the appropriate rows of y_t and X_t retained,

    y_t* = X_t* θ + ε_t*,

which is the matrix notation for one observation on the two retained equations.
In stacked notation for all observations we have the 2n observations:

    [ y_1* ]   [ X_1* ]       [ ε_1* ]
    [ y_2* ] = [ X_2* ] θ  +  [ ε_2* ]
    [   ⋮  ]   [   ⋮  ]       [   ⋮  ]
    [ y_n* ]   [ X_n* ]       [ ε_n* ],

or, finally, in matrix notation for all observations:

    y* = X* θ + ε*.

Considering the error covariance, we can define

    Σ_0* = Var(ε_t*),
    Σ*   = I_n ⊗ Σ_0*.

Define Σ̂_0* as the leading 2×2 block of Σ̂_0, and form

    Σ̂* = I_n ⊗ Σ̂_0*.

This is a consistent estimator, following the consistency of OLS and applying a LLN.

(4) Next compute the Cholesky factorization

    P̂_0 = Chol(Σ̂_0*)⁻¹

(defined so that P̂_0 Σ̂_0* P̂_0' = I), and the Cholesky factorization of the overall covariance matrix of the 2 equation model, which can be calculated as

    P̂ = Chol(Σ̂*) = I_n ⊗ P̂_0.
(5) Finally the FGLS estimator can be calculated by applying OLS to the transformed model

    P̂ y* = P̂ X* θ + P̂ ε*,

or by directly using the GLS formula

    θ̂_FGLS = (X*' Σ̂*⁻¹ X*)⁻¹ X*' Σ̂*⁻¹ y*.

It is equivalent to transform each observation individually:

    P̂_0 y_t* = P̂_0 X_t* θ + P̂_0 ε_t*,

and then apply OLS. This is probably the simplest approach.

A few last comments.

(1) We have assumed no autocorrelation across time. This is clearly restrictive. It is relatively simple to relax this, but we wont go into it
here.
(2) Also, we have only imposed symmetry of the second derivatives. Another restriction that the model should satisfy is that the estimated
shares should sum to 1. This can be accomplished by imposing

)
8 7 5
774)  
%

I
B

q B 
|
VH
PI

These are linear parameter restrictions, so they are easy to impose and
will improve efciency if they are true.

10.2. TESTING NONNESTED HYPOTHESES

195

(3) The estimation procedure outlined above can be iterated. That is, esti-

as above, then re-estimate

using errors calculated as

@0

C0

mate


jf  G

These might be expected to lead to a better estimate than the es-

@0

since FGLS is asymptotically more efcient.

using the new estimated error covariance. It can


d

Then re-estimate

timator based on

be shown that if this is repeated until the estimates dont change (i.e.,
iterated to convergence) then the resulting estimator is the MLE. At
any rate, the asymptotic properties of the iterated and uniterated estimators are the same, since both are based upon a consistent estimator
of the error covariance.

10.2. Testing nonnested hypotheses


Given that the choice of functional form isnt perfectly clear, in that many
possibilities exist, how can one choose between forms? When one form is a
parametric restriction of another, the previously studied tests such as Wald,
P

LR, score or

are all possibilities. For example, the Cobb-Douglas model is a

parametric restriction of the translog: The translog is

G P
VS  R  5  R  P
) P

 gf

&
g

where the variables are in logarithms, while the Cobb-Douglas is

G P
0 R  P
8 

 f

&

so a test of the Cobb-Douglas versus the translog is simply a test that

10.2. TESTING NONNESTED HYPOTHESES

196

The situation is more complicated when we want to test non-nested hypotheses. If the two functional forms are linear in the parameters, and use the same
transformation of the dependent variable, then they may be written as

! 3
t3
G P
Vi
P


W

! d 3
t3


ktG


f
 $r

wd

p
B r

85
74)  3

is misspecied, for

fr
 $UI

We wish to test hypotheses of the form:

is correctly specied versus

One could account for non-iid errors, but well suppress this for simplicity.
test, pro

There are a number of ways to proceed. Well consider the

posed by Davidson and MacKinnon, Econometrica (1981). The idea is


to articially nest the two models, e.g.,

P W
Ht

is zero.

P
"c

&

)
 f

On the other hand, if the second model is correctly specied then


X&

&

&

If the rst model is correctly specied, then the true value of

The problem is that this model is not identied in general. For


example, if the models share some regressors, as in

C3  | iS  iDC
P  P
 P I
G P
0C |  | VC  VDH
P
PI

fr
 $I
f
 $r

8
4)


B r


10.2. TESTING NONNESTED HYPOTHESES

197

then the composite model is

P
C3  |


P
C 
&

&


PI
DC

P
S |  | c

&

&

) P
C  c

&

) P I
DH

)
c  gf
&

Combining terms we get

P
i |  | c

) P
i 
&

&

8


in place of

8 Q&


&


 gf

P
si3  3 C |  | i  I
P
P
P
P & EC & Hc & c
) P I P I )

test is to substitute

&

is not, since we have four equa-

tions in 7 unknowns, so one cant test the hypothesis that

The idea of the

are consistently estimable, but

P
C3  |

The four

This is a consistent

estimator supposing that the second model is correctly specied. It will tend
to a nite probability limit even if the second model is misspecied. Then
estimate the model

P W
H  e

&

P
f

is consistently estimable, and

P d
p 
8
f a  f R W I pW R e
W

 f

is asymptotically normal:

T 2


)
x9
&  2

If the second model is correctly specied, then

since

&


Q&

and that the ordinary -statistic for

&

)
 f

one can show that, under the hypothesis that the rst model is correct,

&

In this model,
&

&

P
"

where

tends in

probability to 1, while its estimated standard error tends to zero. Thus


the test will always reject the false null model, asymptotically, since the
statistic will eventually exceed any critical value with probability one.

10.2. TESTING NONNESTED HYPOTHESES

198

We can reverse the roles of the models, testing the second against the

It may be the case that neither model is correctly specied. In this case,

rst.

the test will still reject the null hypothesis, asymptotically, if we use
distribution, since as long as


T 2

)
t9 j

something different from zero,

&

critical values from the

tends to

Of course, when we switch

the roles of the models the other will also be rejected asymptotically.

In summary, there are 4 possible outcomes when we test two models,


each against the other. Both may be rejected, neither may be rejected,
or one of the two may be rejected.

There are other tests available for non-nested models. The

test is

simple to apply when both models are linear in the parameters. The

-test is similar, but easier to apply when

is nonlinear.

The above presentation assumes that the same transformation of the

dependent variable is used by both models. MacKinnon, White and


Davidson, Journal of Econometrics, (1983) shows how to deal with the
case of different transformations.
Monte-Carlo evidence shows that these tests often over-reject a correctly specied model. Can use bootstrap critical values to get betterperforming tests.

CHAPTER 11

Exogeneity and simultaneity


Several times weve encountered cases where correlation between regressors and the error term lead to biasedness and inconsistency of the OLS estimator. Cases include autocorrelation with lagged dependent variables and
measurement error in the regressors. Another important case is that of simultaneous equations. The cause is different, but the effect is the same.

11.1. Simultaneous equations


Up until now our model is

G P
V"  f

not interested in conditioning on

as xed. This means that

When analyzing dynamic models, were


as we saw in the section on stochastic

regressors. Nevertheless, the OLS estimator obtained by treating

we condition on

%
8
i

when estimating

where, for purposes of estimation we can treat

continues to have desirable asymptotic properties even in that case.


199

as xed

11.1. SIMULTANEOUS EQUATIONS

200

Simultaneous equations is a different prospect. An example of a simultaneous equation system is a simple supply-demand system:

 v

 v

r G I
$G n

are jointly determined at the same time by

S

the intersection of these equations. Well assume that

4f

I
G

and

2 
S

H H 
I II
G0P S uDH
PI
P
S  & I &
P

The presumption is that

&

Supply:

I G
$VP f |

Demand:

is determined by some

unrelated process. Its easy to see that we have correlation between regressors

C


&
H0H
I II


&
&
I G& I
$G j$G P gf | & P HjqI &

I
r I
qG

S

&
&
&

G$G P gf | & P H0qI &  S


I

I
GI$VP gf | & DH0qI &  S & S
G
PI

VP S VsH  $VP f | &


G P I
I G

and errors. Solving for

P
CS

&

P
sI

&

Now consider whether

is uncorrelated with

I
 $G U


Because of this correlation, OLS estimation of the demand equation will be


biased and inconsistent. The same applies to the supply equation, for the same
reason.

11.1. SIMULTANEOUS EQUATIONS

and

are the endogenous varibles (endogs), that are deter-

S

In this model,

201

mined within the system.

is an exogenous variable (exogs). These concepts

gf

are a bit tricky, and well return to it in a minute. First, some notation. Suppose
endogs,

is

8
c

Group current and lagged exogs, as well as lagged endogs in the vector
equations into the error vector

&

Stack the errors of the

8)
4'

The model, with additional assumtions, can be written as


 R U


 2

observations and write the model as

sT j
i

Rf

 

RI


Rf

8 & '
aiX1

 

is

.
.
.

RI
.
.
.


 R

Rf

and

.
.
.

R
RI

where

 Q1
'

is

2T
 S
R P

We can stack all

8
c

a
8) '
4"&

, which is



 & '
aiX1

is

&

If there are

we group together current endogs in the vector

This system is complete, in that there are as many equations as endogs.


There is a normality assumption. This isnt necessary, but allows us to

consider the relationship between least squares and ML estimators.

11.2. EXOGENEITY

202

Since there is no autocorrelation of the

s, and since the columns of

are individually homoscedastic, then

f
f Huxxb f H f H
I bb
I
II

.
.
.
.. .
. .
.

may contain lagged endogenous and exogenous variables. These

variables are predetermined.

We need to dene what is meant by endogenous and exogenous

when classifying the current period variables.

11.2. Exogeneity
The model denes a data generating process. The model involves two sets of
as well as a parameter vector

S
R r R x

R x

R x n 

In general, without additional restrictions, is a

P &
45 V aP P &
&
&

and

variables,

dimensional vector. This is the parameter vector that were inter-

&

ested in estimating.

8
t

depends on a parameter vector

Write this density as

and


ca

In principle, there exists a joint density function for

which

 
o t EiE

11.2. EXOGENEITY

s of course. This can be factored into the density of


times the marginal density of

conditional on

This includes lagged

and lagged

8
2

is the information set in period

where

203

  t 
o st aE o CE  o st acC
 

This is a general factorization, but is may very well be the case that not
to indicate elements

may share elements,

that enter into the conditional density and write

of course. We have

I
st

that enter into the marginal. In general,

and

of

affect both factors. So use

I
st

all parameters in

for parameters

o  t aE o 6c C  o st aCE
 I t
 

Recall that the model is


 2
2T j
 S
R P i
R

R
s


 R U

Normality and lack of correlation over time imply that the observations are
independent of one another, so we can write the log-likelihood function as the

11.2. EXOGENEITY

204

sum of likelihood contributions of each observation:

I
@
P I t
 6Dca CE
o
f

I
@
!ca iE
I t
o
f

I
@
st aCE

f

is weakly exogeneous for

cannot share elements if

Supposing that

g
tI t

I
t

arbitrary combinations of

(the

that is invariant

is weakly exoge-

changes, which prevents consideration of

is weakly exogenous, then the MLE of

I
st

would change as

I
t

nous, since

and

8t

This implies that

8EISd  SIg 
t t d  tI t
t

More formally, for an arbitrary


to

to
d

original parameter vector) if there is a mapping from

I
@

 o  t a
f

o  t aE


 o 7d f

D EFINITION 15 (Weak Exogeneity).

using the joint

density is the same as the MLE using only the conditional density

8 t
I
@
 
6Ita iE

d
A f  o i f

In other words, the joint

and conditional log-likelihoods maximize at the same value of

Since the DGP of

as xed in inference.

is irrelevant, we can treat

By the invariance property of MLE, the MLE of is

It Sd

8
d

I
6st

rameter of interest,

is irrelevant for

is sufcient to recover the pa-

and knowledge of

I
t

inference on

With weak exogeneity, knowledge of the DGP of

8I
st

since the conditional likelihood doesnt depend on

and this map-

ping is assumed to exist in the denition of weak exogeneity.

11.3. REDUCED FORM

205

Of course, well need to gure out just what this mapping is to recover

With lack of weak exogeneity, the joint and conditional likelihood func-

This is the famous identication problem.

tions maximize in different places. For this reason, we cant treat

8I
Dt

from

as

xed in inference. The joint MLE is valid, but the conditional MLE is
not.
to be weakly exogenous if

we are to be able to treat them as xed in estimation. Lagged

In resume, we require the variables in

sat-

isfy the denition, since they are in the conditioning information set,
Lagged

arent exogenous in the normal usage of the

8
c

I i

e.g.,

word, since their values are determined within the model, just earlier
on. Weakly exogenous variables include exogenous (in the normal sense)
variables as well as all predetermined variables.

11.3. Reduced form


Recall that the model is

R P i
R
S

R
s

This is the model in structural form.

D EFINITION 16 (Structural form). An equation is in structural form when


more than one current period endogenous variable is included.

Sgf P I
P

&
&
P I &
GjG P gf |
I

HjqI &
&
j$VSgf | & HjqI &
G I G P
PI
I G P
0Cf |
&

P
SS

&

P
I
&

 S

 S & S
G PI
 VP S uDH
Similarly, the rf for price is

&
G &
I
G


$G
I

uP gf |

I G
$VP gf |
&

I f
HP I I
P I


P gf& P H & qI
| &
I & &
us 0sH qI
P G P I & &
&

P qH jv & I &
P
G I

 g

 g

 g

&

demand:
tity is obtained by solving the supply equation for price and substituting into
An example is our supply/demand system. The reduced form for quancurrent period endog is included.
D EFINITION 17 (Reduced form). An equation is in reduced form if only one
reduced form.
Now only one current period endog appears in each equation. This is the

R R
 P ei

R
I R P I i


R


The solution for the current period endogs is easy to nd. It is


11.3. REDUCED FORM

206

11.3. REDUCED FORM

207

The interesting thing about the rf is that the equations individually satisfy the
by assumption,

&
&

I
H

  B
f

is

82

P H & g0H
I
5 II
&
&



&
I
H$G G & $G
I

I
D

The variance of

and

The errors of the rf are

zz gz x
z gx
z
z
z z  gx

i=1,2,

I
G

and therefore

is uncorrelated with

classical assumptions, since

I
 H


This is constant over time, so the rst rf equation is homoscedastic.


are independent over time, so are the

8
ci

Likewise, since the




The variance of the second rf error is


I 5 II
VP & HVH
&
&
G I
jH$G jH$G
G I





and the contemporaneous covariance of the errors across equations is


I
VP H & P qH
II
&

&
&
jG G & $G
G I
I


I
 HU

In summary the rf equations individually satisfy the classical assumptions, under the assumtions weve made, but they are contemporaneously correlated.

11.4. IV ESTIMATION

208

The general form of the rf is

R R
P ei

R
I R P I i

R



so we have that

R I $ x u R I  C

if the

2
 h I

and that the

are timewise independent (note that this wouldnt be the case

were autocorrelated).
11.4. IV estimation

The IV estimator may appear a bit unusual at rst, but it will grow on you
over time.
The simultaneous equations model is

Considering the rst equation (this is without loss of generality, since we can
matrix as

r D f n 
I

is the rst column

always reorder the equations) we can partition the

are the other endogenous variables that enter the rst equation
are endogs that are excluded from this equation
as

r e n 
I


I 
f 

Similarly, partition

11.4. IV ESTIMATION

are the excluded exogs.

are the included exogs, and

209

I
 

Finally, partition the error matrix as

r I G n 

Assume that

has ones on the main diagonal. These are normalization

restrictions that simply scale the remaining coefcients on each equation, and
which scale the variances of the error terms.
Given this scaling and our partitioning, the coefcient matrices can be written as

I
C

I
H




With this, the rst equation can be written as

 f

G P I I P I I
VDHVDCsD

7G

is correlated with

since


endogs.

G
VP

The problem, as weve seen is that

is formed of

11.4. IV ESTIMATION

210

Now, lets consider the general problem of a linear regression model with
correlation between regressors and the error term:

8  RG U

gf  3 G
t3
G P
0"  f
The present case of a structural equation from a system of equations ts into
this notation, but so do other problems, such as measurement error or lagged

dependent variables with autocorrelated errors. Consider some matrix

which is formed of variables uncorrelated with . This matrix denes a projec-

tion matrix

R I R 

by the denition of


G

projection matrix we get

correlated with

so that anything that is projected onto the space spanned by

Transforming the model with this

G
P
" !  f !
or

G P
V  f

are uncorrelated, since this is simply

G R U

G R R U

!
!

 RG U

and

Now we have that

will be un-

11.4. IV ESTIMATION

211

and

R I R 
!
This is a linear combination of

so it must be uncorrelated with


X

OLS to the model

8
G

the columns of

on

is the tted value from a regression of

This implies that applying

G P
V  f

will lead to a consistent estimator, given a few more assumptions. This is the

ments. The estimator is

generalized instrumental variables estimator.

is known as the matrix of instru-

f R
R
i I !X ! i 

from which we obtain

P
R I X ! R
!
G P R
V" i I !X i
R
!
!


so

G I 6RRi I I !ddi
R

R R R
G R
R
% I 6X ! i
1


 0)


Now we can introduce factors of

to get

1
1
1

p 1

1
1

RG I R
R
R I R
R Y  j)
I

so that

Assuming that each of the terms with a

in the denominator satises a LLN,

11.4. IV ESTIMATION

212

!
!

, a nite pd matrix

(= cols

a nite matrix with rank

T y 
f!
T y 
T !yf 

then the plim of the rhs is zero. This last term has plim 0 since we assume that
and are uncorrelated, e.g.,

  tG R

Given these assumtions the IV estimator is consistent

8
T


)
h

we have
h

1
1
1
GR I R R
I


q1

Furthermore, scaling by
h

1
1
1
R I R R


 h 0) 1

Assuming that the far right term saties a CLT, so that


!
!


m

are the obvious ones. An estimator for

is

8 h

!
!


!
h 0) 1

and

The estimators for


I ! ! R ! ! t
I
!

then we get

jf R h ) f ) 1 

This estimator is consistent following the proof of consistency of the OLS esti-

mator of

when the classical assumptions hold.

11.4. IV ESTIMATION

213

The formula used to estimate the variance of

is

h Q I dD%  )
R R R


I

The IV estimator is

(1) Consistent
(2) Asymptotically normally distributed
!

I X ! R
R
G

!
R I X ! R U6  G ! R

(3) Biased in general, since even though

are not independent.

then the IV estimator using

8I
gj

as the estimator that used

I
0

When we have two sets of instruments,

and

!
!

ments inuences the efciency of the estimator.

The choice of instru-

such that

 j
I

and these depend upon the choice of

depends upon

and

An important point is that the asymptotic distribution of

and

may not be zero, since

is at least as efciently asymptotically

More instruments leads to more asymp-

totically efcient estimation, in general.


There are special cases where there is no gain (simultaneous equations

is an example of this, as well see).

The penalty for indiscriminant use of instruments is that the small


sample bias of the IV estimator rises as the number of instruments

to

increases. The reason for this is that

becomes closer and closer

itself as the number of instruments increases.

IV estimation can clearly be used in the case of simultaneous equations. The only issue is which instruments to use.

11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS

214

11.5. Identication by exclusion restrictions


The identication problem in simultaneous equations is in fact of the same
nature as the identication problem in any estimation setting: does the limiting objective function have the proper curvature so that there is a unique
global minimum or maximum at the true parameter value? In the context of
IV estimation, this is the case if the limiting covariance of the IV estimator is

I 6 ! R !! !  ) an
I

 RG I f 3 

positive denite and

. This matrix is

The necessary and sufcient condition for identication is simply that

this matrix be positive denite, and that the instruments be (asymptotically) uncorrelated with .

For this matrix to be positive denite, we need that the conditions

).

!
!

of full rank (

must be positive denite and

noted above hold:

must be

These identication conditions are not that intuitive nor is it very ob-

vious how to check them.

11.5.1. Necessary conditions. If we use IV estimation for a single equation


of the system, the equation can be written as

G
VP

 f
where
W

Let

r dD n 
I I

Notation:

be the total numer of weakly exogenous variables.

I
 ` 

be the number of excluded exogs (in this equation).

&V&

 s&
)PEID I  &



Let

be the number of included exogs, and let

Let

215

11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS

be the total number of included endogs, and let

be the number of excluded endogs.

Using this notation, consider the selection of instruments.

exhausts the set of possible instruments, in that

I


if the variables in

It turns out that

ments.

are weakly exogenous and can serve as their own instru-

Now the

dont lead to an identied model then no other

instruments will identify the model either. Assuming this is true (well
prove it in a moment), then a necessary condition for identication is
since if not then at least one instrument must

I
ED I

will not have full column rank:

) wP
&

!
a

&
) wP

be used twice, so

that

This is the order condition for identication in a set of simultaneous


equations. When the only identifying information is exclusion restrictions on the variables that enter an equation, then the number of excluded exogs must be greater than or equal to the number of included
endogs, minus 1 (the normalized lhs endog), e.g.,

) &

To show that this is in fact a necessary condition consider some arbi-

trary set of instruments

A necessary condition for identication is

r I
VP I  n R 1 3   W R 1 3 

I
)
)
I
D
I
s
r I I
eDwP VP I n R 1  W R 1

I
)
)

and

converges in probability to zero, so

s are uncorrelated with the

s, by assumption, the cross

I
HP VP I   D

I
I
r D
I

n P

|I

I
I
I

between

Because the
so
so we have

r n  r D f n
I
I

we can write the reduced form using the same partition

r e n 
I
r D f n 
I
P

Given the reduced form


as

Recall that weve partitioned the model

r dD n 
I I

) wP
&


R ) 1 3 $

where

that
11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS

216

Since the far rhs term is formed only of linear combinations of columns of

columns, then it is not of full column rank.

columns we have

P
) &

 

or noting that

regardless of the choice of

has more than

When

has more than

instruments. If

the rank of this matrix can never be greater than

217


%

11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS

) &

In this case, the limiting matrix is not of full column rank, and the identication
condition fails.

11.5.2. Sufcient conditions. Identication essentially requires that the structural parameters be recoverable from the data. This wont be the case, in general, unless the structural model is subject to some restrictions. Weve already
identied necessary conditions. Turning to sufcient conditions (again, were
only considering identication through zero restricitions on the parameters,
for the moment).
The model is

P i
R
S

R
s

11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS

218

This leads to the reduced form

I SR I
R
SP ei

R
I P I i

R




 S


The reduced form parameters are consistently estimable, but none of them are
known a priori, and there are no restrictions on their values. The problem is
that more than one structural form has the same reduced form, so knowledge
of the reduced form parameters alone isnt enough to determine the structural
parameters. To see this, consider the model

matrix. The rf of this new model

R
R

&
i' &

 P

CwP

I P I

I I P P I I P
P
P
P P I P
P

is

is some arbirary nonsingular

where

R
i

R
i

 R

11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS

219

Likewise, the covariance of the rf of the transformed model is

 I

Since the two structural forms lead to the same rf, and the rf is all that is directly estimable, the models are said to be observationally equivalent. What we

ble

and

such that the only admissi

need for identication are restrictions on

is an identity matrix (if all of the equations are to be identied). Take the

coefcient matrices as partitioned before:

I
C

I
H

The coefcients of the rst equation of the transformed model are simply these
. This gives

I
I

I
C

coefcients multiplied by the rst column of

I
H

I
I

11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS

220

For identication of the rst equation we need that there be enough restrictions
so that the only admissible

I
I


P

be the leading column of an identity matrix, so that

)


I
Ct

I
H

I
I


I
i

I
H

Note that the third and fth rows are




Supposing that the leading matrix is of full column rank, e.g.,

) & 

` 



then the only way this can hold, without additional restrictions on the models
is a vector of zeros, then

)  I
I

) 

I
I

) & 

r I


) n
Therefore, as long as

the rst equation

is a vector of zeros. Given that


P

parameters, is if

11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS

221

then

I
I



P

The rst equation is identied in this case, so the condition is sufcient for
identication. It is also necessary, since the condition implies that this submarows. Since this matrix has

&
P V& 

) &

trix must have at least

&
rows, we obtain

) &

&
P 0&
or

) &

which is the previously derived necessary condition.


The above result is fairly intuitive (draw picture here). The necessary condition ensures that there are enough variables not in the equation of interest to
potentially move the other equations, so as to trace out the equation of interest. The sufcient condition ensures that those other equations in fact do move
around as the variables change their values. Some points:

)
4(s& 

When an equation has

is is exactly identied, in that

omission of an identiying restriction is not possible without loosing


consistency.

)
4

&

When

the equation is overidentied, since one could

drop a restriction and still retain consistency. Overidentifying restrictions are therefore testable. When an equation is overidentied we

11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS

222

have more instruments than are strictly necessary for consistent estimation. Since estimation by IV with more instruments is more efcient
asymptotically, one should employ overidentifying restrictions if one
is condent that theyre true.
We can repeat this partition for each equation in the system, to see

which equations are identied and which arent.


These results are valid assuming that the only identifying informa-

tion comes from knowing which variables appear in which equations,


e.g., by exclusion restrictions, and through the use of a normalization. There are other sorts of identifying information that can be used.
These include
(1) Cross equation restrictions
(2) Additional restrictions on parameters within equations (as in the
Klein model discussed below)
(3) Restrictions on the covariance matrix of the errors
(4) Nonlinearities in variables
When these sorts of information are available, the above conditions

arent necessary for identication, though they are of course still sufcient.

To give an example of how other information can be used, consider the model

where

is an upper triangular matrix with 1s on the main diagonal. This is a

triangular system of equations. In this case, the rst equation is

I I  f
P
I

11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS

223

Since only exogs appear on the rhs, this equation is identied.


The second equation is

5  &


P 0DI t f

P I f  

This equation has

excluded exogs, and

included endogs, so it

fails the order (necessary) condition for identication.

  I

However, suppose that we have the restriction

so that the rst

and second structural errors are uncorrelated. In this case

I f
G I G P R
 0DI i7  GU
S

so theres no problem of simultaneity. If the entire

matrix is diago-

nal, then following the same logic, all of the equations are identied.
This is known as a fully recursive model.

11.5.3. Example: Kleins Model 1. To give an example of determining identication status, consider the following macro model (this is the widely known

8 I a I I w s&  ) n  R
and the predetermined variables are all others:

r
a T o

n R
The endogenous variables are the
government


c





} }




8
c w

and a time trend,

taxes,

|
|

|
H H
I II
I
| H


s&

lhs variables,
nonwage spending,

The other variables are the government wage bill,


I
P

|e
e
I
ge


T V Ha
& P P
wS S
| VS w | sI a CaCiP p 
G P  P  P I 
0sI | uDI VC HuP p
G P
P
P I
I G P
$V P T | & I & C I & P p &

P
P


 a

 T



Capital Stock:
Prots:
Output:
Private Wages:

Investment:
Consumption:
Kleins Model 1)

11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS

224

11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS

225

The model assumes that the errors of the equations are contemporaneously

correlated, by nonautocorrelated. The model written as

gives

)
| &



|

&
|


| &
p p
&



)


)

HY I &
) Ct
I

and

)

)

To check this identication of the consumption equation, we need to extract

the submatrices of coefcients of endogs and exogs that dont appear

in this equation. These are the rows that have zeros in the rst column, and

11.5. IDENTIFICATION BY EXCLUSION RESTRICTIONS

226

we need to drop the rst column. We get

example, selecting rows 3,4,5,6, and 7 we obtain the matrix

'


|
|
)


)



) qC
) I
)

)

We need to nd a set of 5 rows of this matrix gives a full-rank 5

matrix. For

) |
|
)


w

This matrix is of full rank, so the sufcient condition for identication is met.

and counting excluded exogs,

 

) 7 

) & 


77  &

Counting included endogs,

so

11.6. 2SLS

227

The equation is over-identied by three restrictions, according to the

counting rules, which are correct when the only identifying information are the exclusion restrictions. However, there is additional infor-

and

mation in this case. Both

enter the consumption equation,

and their coefcients are restricted to be the same. For this reason the
consumption equation is in fact overidentied by four restrictions.

11.6. 2SLS
When we have no information regarding cross-equation restrictions or the
structure of the error covariance matrix, one can estimate the parameters of a
single equation of the system without regard to the other equations.
This isnt always efcient, as well see, but it has the advantage that

misspecications in other equations will not affect the consistency of


the estimator of the parameters of the equation of interest.
Also, estimation of the equation wont be affected by identication
problems in other equations.
is re-

gressed on all the weakly exogenous variables in the system, e.g., the entire
matrix. The tted values are

I
D
I R R
i I !Xiu
on the space spanned by

I
s




and since any vector in this space is uncorrelated with

by assumption,

ID

I
 

Since these tted values are the projection of


i

The 2SLS estimator is very simple: in the rst stage, each column of

is

11.6. 2SLS

8
G

I


related with

Since

I


uncorrelated with

228

is simply the reduced-form prediction, it is cor-

The only other requirement is that the instruments be linearly

independent. This should be the case when the order condition is satised,
in this case.

in place of

I


original model is

I

The second stage substitutes

than in

I
D

since there are more columns in

and estimates by OLS. This

 f

G P I I P I I
VDHVDCsD
G
VP


W

and the second stage model is

8 G P I I P I I
VDHVDCsD  f

is in the space spanned by

I


stage model as

I
  %
I 

Since

so we can write the second

G
VP W s

G P I I P I I
VDH DCsD s

 f
$

The OLS estimator applied to this model is

f s R W I pW s R e 
W

which is exactly what we get if we estimate using IV, with the reduced form
predictions of the endogs used as instruments. Note that if we dene

 W

r I I
dD n

11.6. 2SLS

are the instruments for


YW

so that

229

then we can write

fyW I !pWyW 
R R

Important note: OLS on the transformed model can be used to calcusince we see that its equivalent to IV using

late the 2SLS estimate of

a particular set of instruments. However the OLS covariance formula is


not valid. We need to apply the IV covariance formula already seen
above.

Actually, there is also a simplication of the general IV variance formula. Dene

 W

s


The IV covariance estimator would ordinarily be

R
R
R

h WW h W yW h W yW d
I
I

However, looking at the last term in brackets

I
RI
I
 RI
I
c R I DE R I
I


I I
I I
 r eD n R r d n  W R W

11.6. 2SLS

we can write

I
RI
I
R I

r eD
I I

is idempotent and since

I I
n R r dD
I
D s RI
ID sUs R I


i 

but since

230

r I I
r I I
 eD n R  D n


R
yW

Therefore, the second and last term in the variance formula cancel, so the 2SLS
varcov estimator simplies to

h W yW 

I

which, following some algebra similar to the above, can also be written as

h W yW d
I

Finally, recall that though this is presented in terms of the rst equation, it is
general since any equation can be placed rst.
Properties of 2SLS:

(1) Consistent
(2) Asymptotically normal
(3) Biased when the mean esists (the existence of moments is a technical
issue we wont go into here).
(4) Asymptotically inefcient, except in special circumstances (more on
this later).

11.7. TESTING THE OVERIDENTIFYING RESTRICTIONS

231

11.7. Testing the overidentifying restrictions


The selection of which variables are endogs and which are exogs is part of
the specication of the model. As such, there is room for error here: one might
erroneously classify a variable as exog when it is in fact correlated with the
error term. A general test for the specication on the model can be formulated
as follows:
The IV estimator can be calculated by applying OLS to the transformed
model, so the IV objective function at the minimized value is

 h ) jf

R
h

jf l )

but


)
!

GVP"

R I X !
!
f R I X

!
!
f i I !X
R

jf

R
iujf





R uj

R uj

G P
0" w


where

R
R
% I 6X ! iuj
$

w
so

GVPi

R R PR

w R w diugtG  s)
!

. Substituting a consistent


w R w ' )

!


w R w ' )

!

)
t9 j


G
w R w R G  )
!

random variable with an idempotent matrix in

are normally distributed, with variance


xs

estimator,

This isnt available, since we need to estimate


the middle, so
is a quadratic form of a

variable
Supposing the

then the random

G
R
w R w tG  s)
!
so


l
R I Q R u


!
!



w


is orthogonal to

8 i I 6X
R

! R

R
Ri I !X iu i I !X
!
!
!
!
R !
R !
R
% I 6X ! iuj ! i I 6X
!

R
iu !

R%u ! !
R
iu !

Furthermore,

w R w
!


w R w
!

Moreover,

is idempotent, as can be veried by multiplication:


11.7. TESTING THE OVERIDENTIFYING RESTRICTIONS

232

11.7. TESTING THE OVERIDENTIFYING RESTRICTIONS

Even if the

233

arent normally distributed, the asymptotic result still

holds. The last thing we need to determine is the rank of the idempotent matrix. We have
!

R
R
i I !X ! iu !


w ! R w
so

R
I 6X iu
R !
i I 6X !

R I R y


R
! i y ! y

!
R
iu ! ! y

I R R y

and






is the number of columns of

 w R w
!
8
%

columns of

where

is the number of

The degrees of freedom of the test is simply the number

of overidentifying restrictions: the number of instruments we have


beyond the number that is strictly necessary for consistent estimation.

This test is an overall specication test: the joint null hypothesis is that

the model is correctly specied and that the

form valid instruments

(e.g., that the variables classied as exogs really are uncorrelated with
W

8
G
 f

and

8
7G

or that there is correlation between

G
XP

Rejection can mean that either the model

is misspecied,

This is a particular case of the GMM criterion test, which is covered in

the second half of the course. See Section 15.8.


Note that since

G 
w

11.7. TESTING THE OVERIDENTIFYING RESTRICTIONS

234

and

G w R w R G  s)

!
we can write



)

D1

! e 1
1
GR G

G R I R  R I R R G

test statistic.

on all of the instruments

is the uncentered

from a regression of the

where

residuals

. This is a convenient way to calculate the

On an aside, consider IV estimation of a just-identied model, using the standard notation

G P
V"  f
is the matrix of instruments. If we have exact identication then


y

, so

is a square matrix. The transformed model is

X I

G
P
" !  f !
and the fonc are

f R
 s) j %
!
The IV estimator is


 `

and

f R

R
i I X ! i 

h ) R Vf R

P
!
!
)

R R P
i s f i
R
!

!
h ) jf

R
i
!

h ) jf

!

) jf

!

) jf

!
) jf

R h

R
 h

R
 h

jf

R f


R f


R
f


R
f



 )

The objective function for the generalized IV estimator is

f I 6Q
R R



Rf I R R I R  R I Q R
f R I R  R I Q R


!
we obtain


f R
!

R R R
I @i I !X
R R
R
I I !ddi I !X
R R R
I I 6di





Now multiplying this by




R
 I X i
!
Considering the inverse here

11.7. TESTING THE OVERIDENTIFYING RESTRICTIONS

235

11.8. SYSTEM METHODS OF ESTIMATION

236

by the fonc for generalized IV. However, when were in the just indentied
case, this is
!


 )

f R I X R u R I R 0 R I R R f

f I 6Qu f
R R
R
R R ! R
f I 6QVjf f




The value of the objective function of the IV estimator is zero in the just identied
case. This makes sense, since weve already shown that the objective function
with degrees of freedom equal to the

rv, which has mean 0 and variance 0,

is asymptotically

after dividing by

number of overidentifying restrictions. In the present case, there are no overi

dentifying restrictions, so we have a

e.g., its simply 0. This means were not able to test the identifying restrictions
in the case of exact identication.

11.8. System methods of estimation


2SLS is a single equation method of estimation, as noted above. The advantage of a single equation method is that its unaffected by the other equations
of the system, so they dont need to be specied (except for dening what are
the exogs, so 2SLS can use the complete set of instruments). The disadvantage
of 2SLS is that its inefcient, in general.
Recall that overidentication improves efciency of estimation, since

an overidentied equation can use more instruments than are necessary for consistent estimation.
Secondly, the assumption is that

11.8. SYSTEM METHODS OF ESTIMATION

237

sT j
i

R
 i

Since there is no autocorrelation of the

s, and since the columns of

are individually homoscedastic, then

f
f Huxxb f H f H
I bb
I
II

.
.
.
.. .
. .
.

This means that the structural equations are heteroscedastic and correlated with one another
In general, ignoring this will lead to inefcient estimation, following

the section on GLS. When equations are correlated with one another
estimation should account for the correlation in order to obtain efciency.
Also, since the equations are correlated, information about one equa-

tion is implicitly information about all equations. Therefore, overidentication restrictions in any equation improve efciency for all equations, even the just identied equations.
Single equation methods cant use these types of information, and are
therefore inefcient (in general).

11.8. SYSTEM METHODS OF ESTIMATION

238

11.8.1. 3SLS. Note: It is easier and more practical to treat the 3SLS estimator as a generalized method of moments estimator (see Chapter 15). I no
longer teach the following section, but it is retained for its possible historical
interest. Another alternative is to use FIML (Subsection 11.8.2), if you are willing to make distributional assumptions on the errors. This is computationally
feasible with modern computers.
Following our above notation, each structural equation can be written as

G
B VP B B W
G PI PI
B 0H B 0Di B


B f


.
.
.

I
UW

bb
xxb

G
VP

I
f


 f
W

where we already have that

.
.
.

..

bb
xxb

.
.
.

.
.
.

.
.
.

I
G

or

equations together we get

&

Grouping the

RG
 tG
i

11.8. SYSTEM METHODS OF ESTIMATION

239

The 3SLS estimator is just 2SLS combined with a GLS correction that takes
as

R I X R u

bb
xxb

..

 W

bb
xxb
.
.
.

bb
xxb

I I
d

.
.
.

..

.
.
.

8
i

W R X R u

I
I

UW R I X R u

bb
xxb

.
.
.

Dene

advantage of the structure of

These instruments are simply the unrestricted rf predicitions of the endogs,


combined with the exogs. The distinction is that if the model is overidentied,
then

'

does not impose these restrictions. Also, note that

and

and

may be subject to some zero restrictions, depending on the restrictions on

is calculated

using OLS equation by equation. More on this later.


The 2SLS estimator would be

fyW I !pWyW 
R R

as can be veried by simple multiplication, and noting that the inverse of a


block-diagonal matrix is just the matrix with the inverses of the blocks on the
main diagonal. This IV estimator still ignores the covariance information. The
natural extension is to add the GLS transformation, putting the inverse of the

11.8. SYSTEM METHODS OF ESTIMATION

240

error covariance into the formula, which gives the 3SLS estimator

f f I yS yW h W f I yS yW
R
R
I

f I $f egyW h W I 4f yW
S R
S R
I

The solution is to dene a feasible


The obvious solution is to use an

8
S

B f HB G


Then the element

1

G BR G B


3

Substitute

is estimated by

B

B B

of

not
W

(IMPORTANT NOTE: this is calculated using

estimator based on the 2SLS residuals:

0
@`

0
CG

estimator using a consistent estimator of

8
g B

8
S

This estimator requires knowledge of

into the formula above to get the feasible 3SLS estimator.

Analogously to what we did in the case of 2SLS, the asymptotic distribution


of the 3SLS estimator can be shown to be
h

$f
I I

S
R

 f 

0
@`

| 1

A formula for estimating the variance of the 3SLS estimator in nite samples
is

h W hf
I

1

R
I y S W  h

(cancelling out the powers of


0
@`

This is analogous to the 2SLS formula in equation (??), combined with


the GLS correction.

11.8. SYSTEM METHODS OF ESTIMATION

241

In the case that all equations are just identied, 3SLS is numerically
equivalent to 2SLS. Proving this is easiest if we use a GMM interpre-

tation of 2SLS and 3SLS. GMM is presented in the next econometrics


course. For now, take it on faith.

equation by equation using OLS:

The 3SLS estimator is based upon the rf parameter estimator

calculated

d% I 6Xi 
R R

which is simply

r xxb f f n R I X R 
f bb
I

that is, OLS equation by equation using all the exogs in the estimation of each

column of

It may seem odd that we use OLS on the reduced form, since the rf equations are correlated:

R P R

R
I R P I i

R



and

2
 h I

R I  u R I  C

Let this var-cov matrix be indicated by

R I 

11.8. SYSTEM METHODS OF ESTIMATION

242

OLS equation by equation to get the rf is equivalent to

is the

 3

 3

endog,

is the entire
column of

and

bb
xxb

column of

I
f


 3

) '
w1

Bf

'
j1

Use the notation

is the

.
.
.

vector of observations of the

matrix of exogs,

.
.
.

bb
xxb

is the

..

where

.
.
.

I
$

.
.
.

.
.
.

 f

to indicate the pooled model. Following this notation, the error covariance
matrix is


 7R

This is a special case of a type of model known as a set of seemingly


unrelated equations (SUR) since the parameter vector of each equation

is different. The equations are contemporanously correlated, however.

The general case would have a different

for each equation.

Note that each equation of the system individually satises the classi-

cal assumptions.
However, pooled estimation using the GLS correction is more efcient,

since equation-by-equation estimation is equivalent to pooled estimais block diagonal, but ignoring the covariance informa-

The model is estimated by GLS, where

tion.

tion, since

is estimated using the OLS

residuals from equation-by-equation estimation, which are consistent.

11.8. SYSTEM METHODS OF ESTIMATION

are the same, which is true in the

v
 R
I

Using the rules

and

we get


I 
f R I X R u

f iw I I 6Xiu
R
R
f iw I I Xwtf iw I t
R

R

R Xwf Xwtf I 4f R Xwtf



.
.
.

w


(3)

w 
R R w
I w 

(2)

(1)

I
8 
%wtf

this note that in this case

OLS. To show

f I $f

present case of estimation of the rf parameters, SUR


$

In the special case that all the

243






So the unrestricted rf coefcients can be estimated efciently (assum-

We have ignored any potential zeros in the matrix

ing normality) by OLS, even if the equations are correlated.


which if they

exist could potentially increase the efciency of estimation of the rf.


Another example where SUR OLS is in estimation of vector autore$

gressions. See two sections ahead.

11.8.2. FIML. Full information maximum likelihood is an alternative estimation method. FIML will be asymptotically efcient, since ML estimators based on a given information set are asymptotically efcient w.r.t. all
other estimators that use the same information set, and in the case of the

11.8. SYSTEM METHODS OF ESTIMATION

244

full-information ML estimator we use the entire information set. The 2SLS


and 3SLS estimators dont require distributional assumptions, while FIML of
course does. Our model is, recall

R
s


 R U

means that the density for

requires the Jacobian

R t

}  t }

5 ~ p
R R j R I ) R R )  7g} eI I

is

} p 

so the density for

to

is the multivariate nor-

I S R )5  } eI I
~ p

The transformation from

mal, which is


 2
2T j
 S
R P i
R

The joint normality of

5
} p

Given the assumption of independence over time, the joint log-likelihood function is

R R j R I 4 R j R

I@ 5
I
f )

5
} A 1


5
S
1 P 5
} p A D1  T  f
&

do this in the next section.

of this can be done using iterative numeric methods. Well see how to

This is a nonlinear in the parameters objective function. Maximixation

11.9. EXAMPLE: 2SLS AND KLEINS MODEL 1

245

It turns out that the asymptotic distribution of 3SLS and FIML are the

same, assuming normality of the errors.


One can calculate the FIML estimator by iterating the 3SLS estimator,

thus avoiding the use of a nonlinear optimizer. The steps are

8 CGI0 | CG | 

|
|
0
@`

as normal.
This is new, we didnt estimate

in

(2) Calculate

and

0
CG

(1) Calculate

this way before. This estimator may have some zeros in it. When
Greene says iterated 3SLS doesnt lead to FIML, he means this for

and

and

you do converge to FIML.

and calculate

using

(3) Calculate the instruments

and

If you update

but only updates

a procedure that doesnt update

to get the estimated errors, applying the usual estimator.

(5) Repeat steps 2-4 until there is no change in the parameters.

8
S

(4) Apply 3SLS using these new instruments and the estimate of

FIML is fully efcient, since its an ML estimator that uses all informa-

tion. This implies that 3SLS is fully efcient when the errors are normally
distributed. Also, if each equation is just identied and the errors are
normal, then 2SLS will be fully efcient, since in this case 2SLS 3SLS.
$

When the errors arent normally distributed, the likelihood function is


of course different than whats written above.

11.9. Example: 2SLS and Kleins Model 1


The Octave program Simeq/Klein.m performs 2SLS estimation for the 3
equations of Kleins model 1, assuming nonautocorrelated errors, so that lagged
endogenous variables can be used as instruments. The results are:
CONSUMPTION EQUATION

11.9. EXAMPLE: 2SLS AND KLEINS MODEL 1

246

*******************************************************
2SLS estimation results
Observations 21
R-squared 0.976711
Sigma-squared 1.044059

estimate

st.err.

t-stat.

p-value

16.555

1.321

12.534

0.000

Profits

0.017

0.118

0.147

0.885

Lagged Profits

0.216

0.107

2.016

0.060

Wages

0.810

0.040

20.129

0.000

Constant

*******************************************************
INVESTMENT EQUATION

*******************************************************
2SLS estimation results
Observations 21
R-squared 0.884884
Sigma-squared 1.383184

estimate

st.err.

t-stat.

p-value

20.278

7.543

2.688

0.016

Profits

0.150

0.173

0.867

0.398

Lagged Profits

0.616

0.163

3.784

0.001

Constant

11.9. EXAMPLE: 2SLS AND KLEINS MODEL 1

Lagged Capital

-0.158

0.036

247

-4.368

0.000

*******************************************************
WAGES EQUATION

*******************************************************
2SLS estimation results
Observations 21
R-squared 0.987414
Sigma-squared 0.476427

estimate

st.err.

t-stat.

p-value

Constant

1.500

1.148

1.307

0.209

Output

0.439

0.036

12.316

0.000

Lagged Output

0.147

0.039

3.777

0.002

Trend

0.130

0.029

4.475

0.000

*******************************************************

The above results are not valid (specically, they are inconsistent) if the errors are autocorrelated, since lagged endogenous variables will not be valid
instruments in that case. You might consider eliminating the lagged endogenous variables as instruments, and re-estimating by 2SLS, to obtain consistent
parameter estimates in this more complex case. Standard errors will still be
estimated inconsistently, unless use a Newey-West type covariance estimator.
Food for thought...

CHAPTER 12

Introduction to the second half

over a set
x

d f f
47C9

optimizing element of an objective function

D EFINITION 12.0.1. [Extremum estimator] An extremum estimator

available data, based on a sample of size .

be the

Well begin with study of extremum estimators in general. Let

is the

Well usually write the objective function suppressing the dependence on

Example: Least squares, linear model


Stacking observations

8 R h f  bxbxb  I  d f
 t"HCf  
fG Ppd f

d  1
8 5 2 G
p r!A@8@8974)  SctjP p d R  gf
v
x
pg

vertically,

where

8f
7

Let the d.g.p. be

The least squares

estimator is dened as

8
R I R  d

f D R `@f sD 1  eS9 E
f d
f ) d f

d
`@

We readily nd that

Example: Maximum likelihood

5
7~g} I
p
e

fq
I
@

d
 4eSf

~

$

Because the logarithmic function is strictly increasing on

The maxi-

, maximization

k
5

d f
4ig

mum likelihood estimator is dened as

8 )  p d
t99ej Yf

Suppose that the continuous random variable

of the average logarithm of the likelihood function is achieved at the same as


d

248

12. INTRODUCTION TO THE SECOND HALF

249

for the likelihood function:

8
 d

I
@
5
)
D1 5 5  4R f D1  4eS ~ g
)
d
) d f
d f
i f

Solution of the f.o.c. leads to the familiar result that

MLE estimators are asymptotically efcient (Cramr-Rao lower bound,

Theorem3), supposing the strong distributional assumptions upon which


they are based are true.
One can investigate the properties of an ML estimator supposing

that the distributional assumptions are incorrect. This gives a quasiML estimator, which well study later.
The strong distributional assumptions of MLE may be questionable

in many cases. It is possible to estimate using weaker distributional


assumptions based only on some of the moments of a random variable(s).

Example: Method of moments

from the

pd
ge

Suppose we draw a random sample of

distribution. Here,

is the parameter of interest. The rst moment (expectation),

I
Q

of a random

In this example, the relationship is the identity function

p
xgd  eQ
p d I

variable will in general be a function of the parameters of the distribution, i.e.,


.
is a moment-parameter equation.

sample rst moment is

8
q1 f

I@
I
 Q
f

p RvQ
d I

p d I
ggRvQ  Q 
I

though in general the relationship may be more complicated. The

12. INTRODUCTION TO THE SECOND HALF

250

Dene

I dI
Q q4egQ  4R%
d I
The method of moments principle is to choose the estimator of the

parameter to set the estimate of the population moment equal to the

I
d i

sample moment, i.e.,

. Then the moment-parameter equation

is inverted to solve for the parameter estimate.

I@

pd  4d %
I
f

p
gd

1 f @
I
T f

Since

8  1 gf

In this case,

by the LLN, the estimator is consistent.

More on the method of moments

r.v. is

8 p @5  p ig  g
d
d f
f
1
d d
p@5  4e
f f I
7ug @f

Dene

pd
ge

Continuing with the above example, the variance of a

The MM estimator would set

8
$

pd 5 4d

f f I
7g @f 

Again, by the LLN, the sample variance is consistent for the true vari-

 f f1 5 I
d
7g @f 

So,

1
8 p d@5
f f I
T 7ug @f

ance, that is,

12. INTRODUCTION TO THE SECOND HALF

251

which is obtained by inverting the moment-parameter equation, is


consistent.
Example: Generalized method of moments (GMM)

The previous two examples give two estimators of

which are both con-

sistent. With a given sample, the estimators will be different in general.


With two moment-parameter equations and only one parameter, we

have overidentication, which means that we have more information


than is strictly necessary for consistent estimation of the parameter.
The GMM combines information from the two moment-parameter equations to form a new estimator which will be more efcient, in general

(proof of this below).

i.e.,

I
@
81

gf pd
f
I
@
eEI%
d
1)
f


.

d I
 e%

p d I
 ge%

and

p

 dI
4REi

Clearly, when evaluated at the true parameter value

both

From the second example we dene additional moment conditions

f f d d
7gp@5  eE
8 f f1 I
d d
p@5  4R
7ug @f

and

is

dI
 s p eE%

8 f
gwd  4REi
dI

the sample average of

We already have that

d I
evi

From the rst example, dene

12. INTRODUCTION TO THE SECOND HALF

to set either

I
k d i

chose

or

The MM estimator would

In general, no single value of


d

8  4d

8 p e
d


Again, it is clear from the LLN that

252

will solve the two equations simultaneously.

and choosing

8HE4eXpt  4deS  d
d
f

 R E4e g4R%  eX
d  d I
d

" wR


 t

An example would be to choose

where

where

 d
gE4eXt

The GMM estimator is based on dening a measure of distance

is a positive denite

matrix. While its clear that the MM gives consistent estimates if there is a oneto-one relationship between parameters and moments, its not immediately
obvious that the GMM estimator is consistent. (Well see later that it is.)
These examples show that these widely used estimators may all be interpreted as the solution of an optimization problem. For this reason, the study
of extremum estimators is useful for its generality. We will see that the general
results extend smoothly to the more specialized results available for specic
estimators. After studying extremum estimators in general, we will study the
GMM estimator, then QML and NLS. The reason we study GMM rst is that
LS, IV, NLS, MLE, QML and other well-known parametric estimators may all
be interpreted as special cases of the GMM estimator, so the general results on
GMM can simplify and unify the treatment of these other estimators. Nevertheless, there are some special results on QML and NLS, and both are important in empirical research, which makes focus on them useful.

One of the focal points of the course will be nonlinear models. This is not to
suggest that linear models arent useful. Linear models are more general than

12. INTRODUCTION TO THE SECOND HALF

253

they might rst appear, since one can employ nonlinear transformations of the
variables:

r bb

G
t0P p d  $T m xxb  m  vI m n  g p m
f

 f

P
SI  uP

&


i

ts this form.

G P
tVC  I  P I 

For example,

The important point is that the model is linear in the parameters but not
necessarily linear in the variables.

In spite of this generality, situations often arise which simply can not be convincingly represented by linear in the parameters models. Also, theory that
applies to nonlinear models also applies to linear models, so one may as well
start off with the general case.
Example: Expenditure shares

f
!d  B
f
!d DC

goods is

 B
f CAB 

) B B
 I

)
} 'g B

or

B
C

for

and

8 f
B

so necessarily

 3

An expenditure share is

of

&

Roys Identity states that the quantity demanded of the

. No linear in the parameters model

with a parameter space that is dened independent of the data can

guarantee that either of these conditions holds. These constraints will often be
violated by estimated linear models, which calls into question their appropriateness in cases of this sort.
Example: Binary limited dependent variable

12. INTRODUCTION TO THE SECOND HALF

254

The referendum contingent valuation (CV) method of inferring the social
value of a project provides a simple example. This example is a special case
of more general discrete choice (or binary response) models. Individuals are
asked if they would pay an amount A for provision of a project. Indirect utility in the base case (no project) is v^0(m, z) + ε^0, where m is income and z is a
vector of other variables such as prices, personal characteristics, etc. After provision, utility is v^1(m, z) + ε^1. The random terms ε^j, j = 0, 1, reflect variations
of preferences in the population. With this, an individual agrees(1) to pay A if

ε^0 − ε^1 < v^1(m − A, z) − v^0(m, z).

Define ε = ε^0 − ε^1, let w collect m and z, and let Δv(w, A) = v^1(m − A, z) − v^0(m, z).
Define y = 1 if the consumer agrees to pay A for the change, y = 0 otherwise.
The probability of agreement is

(12.0.1)    Pr(y = 1) = F_ε[Δv(w, A)].

To make the example specific, suppose that

v^1(m, z) = α − βm
v^0(m, z) = −βm

and that ε^0 and ε^1 are i.i.d. extreme value random variables. That is, utility depends only on income, preferences in both states are homothetic, and a specific
distributional assumption is made on the distribution of preferences in
the population. With these assumptions (the details are unimportant here, see
articles by D. McFadden if you're interested) it can be shown that

Pr(y = 1) = Λ(α + βA),

where Λ(·) is the logistic distribution function,

Λ(z) = (1 + exp(−z))^{−1}.

(1) We assume here that responses are truthful, that is, there is no strategic behavior and that
individuals are able to order their preferences in this hypothetical situation.
This is the simple logit model: the choice probability is the logit function of a
linear in parameters function.
Now, y is either 0 or 1, and the expected value of y is Λ(α + βA). Thus, we
can write

y = Λ(α + βA) + η

where E(η) = 0. One could estimate this by (nonlinear) least squares,

(α̂, β̂) = arg min (1/n) Σ_t [ y_t − Λ(α + βA_t) ]².

The main point is that it is impossible that Λ(α + βA) can be written as a linear
in the parameters model, in the sense that, for arbitrary A, there are no θ, φ(A)
such that

Λ(α + βA) = φ(A)'θ, ∀A,

where φ(A) is a p-vector valued function of A and θ is a p-dimensional parameter. This is because for any θ, we can always find an A such that φ(A)'θ will be
negative or greater than 1, which is illogical, since it is the expectation of a 0/1
binary random variable. Since this sort of problem occurs often in empirical
work, it is useful to study NLS and other nonlinear models.


After discussing these estimation methods for parametric models we'll briefly
introduce nonparametric estimation methods. These methods allow one, for example, to estimate f(x_t) consistently when we are not willing to assume that a
model of the form

y_t = f(x_t) + ε_t

can be restricted to a parametric form

y_t = f(x_t, θ) + ε_t,  θ ∈ Θ,

where f(·, θ) and perhaps the distribution of ε_t are of known functional form. This is important since economic theory gives us general information about functions
and the signs of their derivatives, but not about their specific form.
Then we'll look at simulation-based methods in econometrics. These methods allow us to substitute computer power for mental power. Since computer
power is becoming relatively cheap compared to mental effort, any econometrician who lives by the principles of economic theory should be interested in
these techniques.
Finally, we'll look at how econometric computations can be done in parallel on a cluster of computers. This allows us to harness more computational
power to work with more complex models than can be dealt with using a desktop computer.

CHAPTER 13

Numeric optimization methods


Readings: Hamilton, ch. 5, section 7 (pp. 133-139); Gourieroux and Monfort, Vol. 1, ch. 13, pp. 443-60; Goffe, et al. (1994).

If we're going to be applying extremum estimators, we'll need to know
how to find an extremum. This section gives a very brief introduction to what
is a large literature on numeric optimization methods. We'll consider a few
well-known techniques, and one fairly new technique that may allow one to
solve difficult problems. The main objective is to become familiar with the
issues, and to learn how to use the BFGS algorithm at the practical level.

The general problem we consider is how to find the maximizing element θ̂
(a K-vector) of a function s(θ). This function may not be continuous, and
it may not be differentiable. Even if it is twice continuously differentiable, it
may not be globally concave, so local maxima, minima and saddlepoints may
all exist. Supposing s(θ) were a quadratic function of θ, e.g.,

s(θ) = a + b'θ + (1/2) θ'Cθ,

the first order conditions would be linear:

D_θ s(θ) = b + Cθ,

so the maximizing (minimizing) element would be θ̂ = −C^{−1}b. This is the sort
of problem we have with linear models estimated by OLS. It's also the case for

feasible GLS, since conditional on the estimate of the varcov matrix, we have
a quadratic objective function in the remaining parameters.
More general problems will not have linear f.o.c., and we will not be able
to solve for the maximizer analytically. This is when we need a numeric optimization method.
13.1. Search
The idea is to create a grid over the parameter space and evaluate the function at each point on the grid. Select the best point. Then refine the grid in
the neighborhood of the best point, and continue until the accuracy is good
enough. See Figure 13.1.1. One has to be careful that the grid is fine enough
in relationship to the irregularity of the function to ensure that sharp peaks are
not missed entirely.

To check q values in each dimension of a K-dimensional parameter space,
we need to check q^K points. For example, if q = 100 and K = 10, there would
be 100^{10} = 10^{20} points to check. If 1000 points can be checked in a second, it would
take roughly 3.2 × 10^9 years to perform the calculations, which is approximately the
age of the earth. The search method is a very reasonable choice if K is small,
but it quickly becomes infeasible if K is moderate or large.
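To fix ideas, here is a minimal Octave sketch of grid search with refinement for a scalar parameter (the objective function, grid sizes and tolerances are illustrative assumptions, not taken from the course programs):

% grid search with refinement, for a scalar parameter
f = @(theta) -(theta - 0.3).^2 + 0.01*cos(50*theta);   % illustrative objective
lo = -1; hi = 1;
for iter = 1:20
  grid = linspace(lo, hi, 1000);            % evaluate on a grid
  [fbest, i] = max(f(grid));
  best = grid(i);
  width = (hi - lo)/1000;
  lo = best - 5*width; hi = best + 5*width; % refine around the best point so far
endfor
printf("approximate maximizer: %f, value: %f\n", best, fbest);

If the grid is too coarse relative to the wiggles of the objective, a sharp peak can be missed entirely, which is the caveat noted above.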

13.2. Derivative-based methods


13.2.1. Introduction. Derivative-based methods are defined by

(1) the method for choosing the initial value, θ^1;
(2) the iteration method for choosing θ^{k+1} given θ^k (based upon derivatives);
(3) the stopping criterion.


Figure 13.1.1: The search method

The iteration method can be broken into two problems: choosing the stepsize
a_k (a scalar) and choosing the direction of movement, d_k, which is of the same
dimension as θ, so that

θ^{k+1} = θ^k + a_k d_k.

A locally increasing direction of search d is a direction such that

∂s(θ + ad)/∂a > 0

for a positive but small. That is, if we go in direction d, we will improve on the
objective function, at least if we don't go too far in that direction.


As long as the gradient at θ is not zero there exist increasing directions,
and they can all be represented as Q_k g(θ^k), where Q_k is a symmetric positive definite
matrix and g(θ) = D_θ s(θ) is the gradient at θ. To see this, take a Taylor series
expansion around a^0 = 0:

s(θ + ad) = s(θ + 0d) + a g(θ + 0d)'d + o(1) = s(θ) + a g(θ)'d + o(1).

For small enough a the o(1) term can be ignored. If d is to be an increasing
direction, we need g(θ)'d > 0. Defining d = Qg(θ), where Q is positive definite,
we guarantee that

g(θ)'d = g(θ)'Qg(θ) > 0

unless g(θ) = 0. Every increasing direction can be represented in this
way (p.d. matrices are those such that the angle between g(θ) and Qg(θ)
is less than 90 degrees). See Figure 13.2.1.

With this, the iteration rule becomes

θ^{k+1} = θ^k + a_k Q_k g(θ^k)

and we keep going until the gradient becomes zero, so that there is no increasing direction. The problem is how to choose a and Q.

- Conditional on Q, choosing a is fairly straightforward. A simple line search is an attractive possibility, since a is a scalar.
- The remaining problem is how to choose Q.
- Note also that this gives no guarantees to find a global maximum.


Figure 13.2.1: Increasing directions of search

13.2.2. Steepest descent. Steepest descent (ascent if we're maximizing) just
sets Q to an identity matrix, since the gradient provides the direction of maximum rate of change of the objective function.

- Advantages: fast - doesn't require anything more than first derivatives.
- Disadvantages: this doesn't always work too well, however (draw a picture of a banana-shaped function).
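As a concrete illustration, a bare-bones steepest ascent iteration with a step-halving line search might look as follows in Octave (the quadratic objective is purely illustrative):

1;
function [s, g] = obj(theta)   % illustrative objective and its analytic gradient
  s = -(theta(1) - 1)^2 - 10*(theta(2) + 0.5)^2;
  g = [-2*(theta(1) - 1); -20*(theta(2) + 0.5)];
endfunction

theta = [0; 0];
for k = 1:100
  [s, g] = obj(theta);
  if norm(g) < 1e-6, break; endif
  a = 1;                         % Q = I: the search direction is the gradient itself
  while obj(theta + a*g) < s     % halve the stepsize until the objective improves
    a = a/2;
  endwhile
  theta = theta + a*g;
endfor
theta

On an elongated (banana-shaped) objective the same code zig-zags and converges slowly, which is the disadvantage mentioned above.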


13.2.3. Newton-Raphson. The Newton-Raphson method uses information
about the slope and curvature of the objective function to determine which direction and how far to move from an initial point. Supposing we're trying
to maximize s_n(θ), take a second order Taylor's series approximation of s_n(θ)
about θ^k (an initial guess):

s_n(θ) ≈ s_n(θ^k) + g(θ^k)'(θ − θ^k) + (1/2)(θ − θ^k)'H(θ^k)(θ − θ^k).

To attempt to maximize s_n(θ), we can maximize the portion of the right-hand
side that depends on θ, i.e., we can maximize

s̃(θ) = g(θ^k)'θ + (1/2)(θ − θ^k)'H(θ^k)(θ − θ^k)

with respect to θ. This is a much easier problem, since it is a quadratic function
in θ, so it has linear first order conditions. These are

D_θ s̃(θ) = g(θ^k) + H(θ^k)(θ − θ^k).

So the solution for the next round estimate is

θ^{k+1} = θ^k − H(θ^k)^{−1} g(θ^k).

This is illustrated in Figure 13.2.2.

Figure 13.2.2: Newton-Raphson method

However, it's good to include a stepsize, since the approximation to s_n(θ)
may be bad far away from the maximizer θ̂, so the actual iteration formula is

θ^{k+1} = θ^k − a_k H(θ^k)^{−1} g(θ^k).

- A potential problem is that the Hessian may not be negative definite
when we're far from the maximizing point. So −H(θ^k)^{−1} may not be
positive definite, and −H(θ^k)^{−1}g(θ^k) may not define an increasing direction of search. This can happen when the objective function has flat
regions, in which case the Hessian matrix is very ill-conditioned (e.g.,
is nearly singular), or when we're in the vicinity of a local minimum,
where H(θ^k) is positive definite, and our direction is a decreasing direction
of search. Matrix inverses computed by computers are subject to large errors
when the matrix is ill-conditioned. Also, we certainly don't want to
go in the direction of a minimum when we're maximizing. To solve
this problem, quasi-Newton methods simply add a positive definite
component to H(θ) to ensure that the resulting matrix is positive definite, e.g., Q = −H(θ) + bI, where b is chosen large enough so that Q
is well-conditioned and positive definite. This has the benefit that
improvement in the objective function is guaranteed.

- Another variation of quasi-Newton methods is to approximate the
Hessian by using successive gradient evaluations. This avoids actual
calculation of the Hessian, which is an order of magnitude (in the dimension of the parameter vector) more costly than calculation of the
gradient. They can be done to ensure that the approximation is p.d.
DFP and BFGS are two well-known examples.
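A sketch of the Newton-Raphson iteration with a stepsize, for an illustrative quadratic objective with analytic derivatives (assumed here for simplicity; this is not the course's bfgsmin):

1;
function [g, H] = derivs(theta)  % gradient and Hessian of s(theta) = -theta(1)^2 - 3*theta(2)^2 + 2*theta(1)*theta(2)
  g = [-2*theta(1) + 2*theta(2); 2*theta(1) - 6*theta(2)];
  H = [-2 2; 2 -6];
endfunction

theta = [1; 1];
for k = 1:50
  [g, H] = derivs(theta);
  if norm(g) < 1e-8, break; endif
  a = 1;                          % stepsize; in practice chosen by a line search
  theta = theta - a * (H \ g);    % Newton step: theta(k+1) = theta(k) - a*inv(H)*g
endfor
theta

Since this objective is exactly quadratic with a negative definite Hessian, the first Newton step already lands on the maximizer; on non-quadratic objectives the safeguards discussed above (stepsizes, positive definite corrections) become important.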

Stopping criteria

The last thing we need is to decide when to stop. A digital computer is
subject to limited machine precision and round-off errors. For these reasons,
it is unreasonable to hope that a program can exactly find the point that maximizes a function. We need to define acceptable tolerances. Some stopping
criteria are:

- Negligible change in parameters: |θ_j^k − θ_j^{k−1}| < ε_1, ∀j
- Negligible relative change: |(θ_j^k − θ_j^{k−1}) / θ_j^{k−1}| < ε_2, ∀j
- Negligible change of function: |s(θ^k) − s(θ^{k−1})| < ε_3
- Gradient negligibly different from zero: |g_j(θ^k)| < ε_4, ∀j


- Or, even better, check all of these.
- Also, if we're maximizing, it's good to check that the last round (real,
not approximate) Hessian is negative definite.

Starting values

The Newton-Raphson and related algorithms work well if the objective
function is concave (when maximizing), but not so well if there are convex
regions and local minima or multiple local maxima. The algorithm may converge to a local minimum or to a local maximum that is not optimal. The
algorithm may also have difficulties converging at all.

- The usual way to ensure that a global maximum has been found
is to use many different starting values, and choose the solution that
returns the highest objective function value. THIS IS IMPORTANT
in practice. More on this later.

Calculating derivatives

The Newton-Raphson algorithm requires first and second derivatives. It
is often difficult to calculate derivatives (especially the Hessian) analytically if
the function s_n(θ) is complicated. Possible solutions are to calculate derivatives
numerically, or to use programs such as MuPAD or Mathematica to calculate
analytic derivatives. For example, Figure 13.2.3 shows MuPAD(1) calculating a
derivative that I didn't know off the top of my head, and one that I did know.

- Numeric derivatives are less accurate than analytic derivatives, and
are usually more costly to evaluate. Both factors usually cause optimization programs to be less successful when numeric derivatives are
used.
(1) MuPAD is not a freely distributable program, so it's not on the CD. You can download it from
http://www.mupad.de/download.shtml


Figure 13.2.3: Using MuPAD to get analytic derivatives

- One advantage of numeric derivatives is that you don't have to worry
about having made an error in calculating the analytic derivative. When
programming analytic derivatives it's a good idea to check that they
are correct by using numeric derivatives. This is a lesson I learned the
hard way when writing my thesis.
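For instance, a central-difference numeric gradient (a sketch; the function handle, the point and the perturbation size below are assumptions) can be used to check an analytic gradient:

1;
function g = numgrad(f, theta)
  h = 1e-6;                                 % perturbation; should match the scale of theta
  g = zeros(size(theta));
  for j = 1:numel(theta)
    e = zeros(size(theta)); e(j) = h;
    g(j) = (f(theta + e) - f(theta - e)) / (2*h);   % central difference
  endfor
endfunction

% example: compare an analytic gradient with the numeric one
f = @(t) -(t(1)^2 + 2*t(2)^2);
analytic = @(t) [-2*t(1); -4*t(2)];
theta = [0.5; -1];
max(abs(numgrad(f, theta) - analytic(theta)))       % should be close to zero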
- Numeric second derivatives are much more accurate if the data are
scaled so that the elements of the gradient are of the same order of
magnitude. Example: if the model is y_t = h(αx_t + βz_t) + ε_t, and estimation is by NLS, suppose that D_α s_n(·) = 1000 and D_β s_n(·) = 0.001.
One could define α* = α/1000, x_t* = 1000 x_t, β* = 1000 β and z_t* = z_t/1000.
In this case, the gradients D_{α*} s_n(·) and D_{β*} s_n(·) will both be 1.

- In general, estimation programs always work better if data is scaled
in this way, since roundoff errors are less likely to become important.
This is important in practice.

- There are algorithms (such as BFGS and DFP) that use the sequential gradient evaluations to build up an approximation to the Hessian.
The iterations are faster for this reason since the actual Hessian isn't
calculated, but more iterations usually are required for convergence.

- Switching between algorithms during iterations is sometimes useful.

13.3. Simulated Annealing


Simulated annealing is an algorithm which can find an optimum in the
presence of nonconcavities, discontinuities and multiple local minima/maxima.
Basically, the algorithm randomly selects evaluation points, accepts all points
that yield an increase in the objective function, but also accepts some points
that decrease the objective function. This allows the algorithm to escape from
local minima. As more and more points are tried, periodically the algorithm
focuses on the best point so far, and reduces the range over which random
points are generated. Also, the probability that a negative move is accepted
is reduced. The algorithm relies on many evaluations, as in the search method,
but focuses in on promising areas, which reduces function evaluations with
respect to the search method. It does not require derivatives to be evaluated. I
have a program to do this if you're interested.


13.4. Examples
This section gives a few examples of how some nonlinear models may be
estimated using maximum likelihood.
13.4.1. Discrete Choice: The logit model. In this section we will consider
maximum likelihood estimation of the logit model for binary 0/1 dependent
variables. We will use the BFGS algorithm to find the MLE.
We saw an example of a binary choice model in equation 12.0.1. A more
general representation is

y* = g(x) − ε
y = 1(y* > 0)
Pr(y = 1) = F_ε[g(x)] ≡ p(x, θ).

The log-likelihood function is

s_n(θ) = (1/n) Σ_{i=1}^n [ y_i ln p(x_i, θ) + (1 − y_i) ln(1 − p(x_i, θ)) ].

For the logit model (see the contingent valuation example above), the probability has the specific form

p(x, θ) = 1 / (1 + exp(−x'θ)).

You should download and examine LogitDGP.m, which generates data
according to the logit model, logit.m, which calculates the loglikelihood, and
EstimateLogit.m, which sets things up and calls the estimation routine, which
uses the BFGS algorithm.
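A minimal sketch of the kind of computation logit.m performs is the following (this is illustrative code, not the actual file, and the variable layout and true parameter values are assumptions):

1;
function lnL = logit_loglik(theta, y, x)
  % average log-likelihood of the logit model; y is n x 1 (0/1), x is n x K
  p = 1 ./ (1 + exp(-x*theta));                     % choice probabilities
  lnL = mean(y .* log(p) + (1 - y) .* log(1 - p));
endfunction

% example usage with simulated data
n = 100; x = [ones(n,1) randn(n,1)]; theta0 = [0.5; 1];
y = rand(n,1) < 1 ./ (1 + exp(-x*theta0));          % draws from a logit DGP
logit_loglik(theta0, y, x)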

Here are some estimation results with n = 100 and the true parameter values set by LogitDGP.m:

***********************************************
Trial of MLE estimation of Logit model

MLE Estimation Results
BFGS convergence: Normal convergence

Average Log-L: 0.607063
Observations: 100

           estimate   st. err   t-stat    p-value
constant   0.5400     0.2229    2.4224    0.0154
slope      0.7566     0.2374    3.1863    0.0014

Information Criteria
CAIC : 132.6230
BIC  : 130.6230
AIC  : 125.4127
***********************************************

The estimation program is calling mle_results(), which in turn calls
a number of other routines. These functions are part of the octave-forge
repository.
13.4.2. Count Data: The Poisson model. Demand for health care is usually thought of as a derived demand: health care is an input to a home production function that produces health, and health is an argument of the utility
function. Grossman (1972), for example, models health as a capital stock that
is subject to depreciation (e.g., the effects of ageing). Health care visits restore
the stock. Under the home production framework, individuals decide when to
make health care visits to maintain their health stock, or to deal with negative
shocks to the stock in the form of accidents or illnesses. As such, individual


demand will be a function of the parameters of the individuals' utility functions.

The MEPS health data file, meps1996.data, contains 4564 observations
on six measures of health care usage. The data is from the 1996 Medical Expenditure Panel Survey (MEPS). You can get more information at http://www.meps.ahrq.gov/.
The six measures of use are office-based visits (OBDV), outpatient visits (OPV), inpatient visits (IPV), emergency room visits (ERV), dental visits
(VDV), and number of prescription drugs taken (PRESCR). These form columns
1 - 6 of meps1996.data. The conditioning variables are public insurance
(PUBLIC), private insurance (PRIV), sex (SEX), age (AGE), years of education
(EDUC), and income (INCOME). These form columns 7 - 12 of the file, in the
order given here. PRIV and PUBLIC are 0/1 binary variables, where a 1 indicates that the person has access to public or private insurance coverage. SEX
is also 0/1, where 1 indicates that the person is female. This data will be used
in examples fairly extensively in what follows.
The program ExploreMEPS.m shows how the data may be read in, and
gives some descriptive information about variables, which follows:
All of the measures of use are count data, which means that they take on
the values {0, 1, 2, ...}. It might be reasonable to try to use this information by
specifying the density as a count data density. One of the simplest count data
densities is the Poisson density, which is

f_Y(y) = exp(−λ) λ^y / y!.

The Poisson average log-likelihood function is

s_n(θ) = (1/n) Σ_{i=1}^n [ −λ_i + y_i ln λ_i − ln y_i! ].


We will parameterize the model as

λ_i = exp(x_i'β)
x_i = [1, PUBLIC_i, PRIV_i, SEX_i, AGE_i, EDUC_i, INC_i]'.

This ensures that the mean is positive, as is required for the Poisson model.
Note that for this parameterization

∂E(y|x)/∂x_j · x_j / E(y|x) = β_j x_j,

so β_j x_j is the elasticity of the conditional mean of y with respect to the j-th conditioning
variable.
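A sketch of the corresponding average log-likelihood function (hypothetical code, not the actual EstimatePoisson.m):

1;
function lnL = poisson_loglik(theta, y, x)
  % average log-likelihood for the Poisson model with lambda_i = exp(x_i'theta)
  lambda = exp(x*theta);
  lnL = mean(-lambda + y .* log(lambda) - gammaln(y + 1));   % gammaln(y+1) = log(y!)
endfunction

Maximizing this function over theta with an algorithm such as BFGS gives the ML estimates.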

The program EstimatePoisson.m estimates a Poisson model using the full
data set. The results of the estimation, using OBDV as the dependent variable,
are here:
MPITB extensions found

OBDV

******************************************************
Poisson model, MEPS 1996 full data set

MLE Estimation Results


BFGS convergence: Normal convergence

Average Log-L: -3.671090
Observations: 4564

             estimate   st. err   t-stat     p-value
constant     -0.791     0.149     -5.290     0.000
pub. ins.     0.848     0.076     11.093     0.000
priv. ins.    0.294     0.071      4.137     0.000
sex           0.487     0.055      8.797     0.000
age           0.024     0.002     11.471     0.000
edu           0.029     0.010      3.061     0.002
inc          -0.000     0.000     -0.978     0.328

Information Criteria
CAIC : 33575.6881    Avg. CAIC: 7.3566
BIC  : 33568.6881    Avg. BIC:  7.3551
AIC  : 33523.7064    Avg. AIC:  7.3452

******************************************************

13.5. Duration data and the Weibull model


In some cases the dependent variable may be the time that passes between
the occurrence of two events. For example, it may be the duration of a strike,
or the time needed to find a job once one is unemployed. Such variables take
on values on the positive real line, and are referred to as duration data.

A spell is the period of time between the occurrence of the initial event and the
concluding event. For example, the initial event could be the loss of a job, and
the final event is the finding of a new job. The spell is the period of unemployment.
Let t_0 be the time the initial event occurs, and t_1 be the time the concluding event occurs. For simplicity, assume that time is measured in years. The
random variable D is the duration of the spell, D = t_1 − t_0. Define the density
function of D, f_D(t), with distribution function F_D(t) = Pr(D < t).

Several questions may be of interest. For example, one might wish to know
the expected time one has to wait to find a job given that one has already
waited s years. The probability that a spell lasts more than s years is

Pr(D > s) = 1 − Pr(D ≤ s) = 1 − F_D(s).

The density of D conditional on the spell already having lasted s years is

f_D(t | D > s) = f_D(t) / [1 − F_D(s)].

The expected additional time required for the spell to end given that it has
already lasted s years is the expectation of D with respect to this density, minus s:

E = E(D | D > s) − s = [ ∫_s^∞ z f_D(z) / (1 − F_D(s)) dz ] − s.

To estimate this function, one needs to specify the density f_D(t) as a parametric density, then estimate by maximum likelihood. There are a number of
possibilities including the exponential density, the lognormal, etc. A reasonably flexible model that is a generalization of the exponential density is the
Weibull density

f_D(t|θ) = exp[−(λt)^γ] λγ(λt)^{γ−1}.

According to this model, E(D) = λ^{−1} Γ(1 + 1/γ). The log-likelihood is just the sum of
the log densities.
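A sketch of the Weibull average log-likelihood under this parameterization (illustrative only; the ordering of the parameters in theta is an assumption):

1;
function lnL = weibull_loglik(theta, t)
  % theta = [lambda; gamma], t is the n x 1 vector of observed durations
  lambda = theta(1); gam = theta(2);
  lnd = -(lambda*t).^gam + log(lambda) + log(gam) + (gam - 1)*log(lambda*t);
  lnL = mean(lnd);      % average log density
endfunction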

To illustrate application of this model, 402 observations on the lifespan of
mongooses in Serengeti National Park (Tanzania) were used to fit a Weibull
model. The spell in this case is the lifetime of an individual mongoose. The
estimated parameters λ̂ and γ̂ (with their standard errors) are reported in the
original output, and the log-likelihood value is -659.3. Figure 13.5.1 presents fitted
life expectancy (expected additional years of life) as a function of age, with 95%
confidence bands. The plot is accompanied by a nonparametric Kaplan-Meier
estimate of life-expectancy. This nonparametric estimator simply averages all
spell lengths greater than age, and then subtracts age. This is consistent by the
LLN.

In the figure one can see that the model doesn't fit the data well, in that it
predicts life expectancy quite differently than does the nonparametric model.
For ages 4-6, the nonparametric estimate is outside the confidence interval that
results from the parametric model, which casts doubt upon the parametric
model. Mongooses that are between 2-6 years old seem to have a lower life
expectancy than is predicted by the Weibull model, whereas young mongooses
that survive beyond infancy have a higher life expectancy, up to a bit beyond
2 years. Due to the dramatic change in the death rate as a function of t, one
might specify f_D(t) as a mixture of two Weibull densities,

f_D(t|θ) = δ { exp[−(λ_1 t)^{γ_1}] λ_1 γ_1 (λ_1 t)^{γ_1−1} } + (1 − δ) { exp[−(λ_2 t)^{γ_2}] λ_2 γ_2 (λ_2 t)^{γ_2−1} }.

Figure 13.5.1: Life expectancy of mongooses, Weibull model

The parameters γ_i and λ_i, i = 1, 2, are the parameters of the two Weibull densities, and δ is the parameter that mixes the two.

With the same data, θ can be estimated using the mixed model. The result
is a log-likelihood of -623.17. Note that a standard likelihood ratio test cannot be used to choose between the two models, since under the null that δ = 1
(single density), the two parameters λ_2 and γ_2 are not identified. It is possible to take this into account, but this topic is out of the scope of this course.
Nevertheless, the improvement in the likelihood function is considerable. The
parameter estimates are


Parameter   Estimate   St. Error
δ̂           0.233      0.016
λ̂_1         1.722      0.166
γ̂_1         1.731      0.101
λ̂_2         1.522      0.096
γ̂_2         0.428      0.035

Note that the mixture parameter is highly significant. This model leads to
the fit in Figure 13.5.2. Note that the parametric and nonparametric fits are
quite close to one another, up to around 6 years. The disagreement after this
point is not too important, since less than 5% of mongooses live more than 6
years, which implies that the Kaplan-Meier nonparametric estimate has a high
variance (since it's an average of a small number of observations).
Mixture models are often an effective way to model complex responses,
though they can suffer from overparameterization. Alternatives will be discussed later.
13.6. Numeric optimization: pitfalls
In this section we'll examine two common problems that can be encountered when doing numeric optimization of nonlinear models, and some solutions.
13.6.1. Poor scaling of the data. When the data is scaled so that the magnitudes of the first and second derivatives are of different orders, problems can
easily result. If we uncomment the appropriate line in EstimatePoisson.m, the
data will not be scaled, and the estimation program will have difficulty converging (it seems to take an infinite amount of time). With unscaled data, the
elements of the score vector have very different magnitudes at the initial value
of θ (all zeros).

Figure 13.5.2: Life expectancy of mongooses, mixed Weibull model

To see this, run CheckScore.m. With unscaled data, one element
of the gradient is very large, and the maximum and minimum elements are 5
orders of magnitude apart. This causes convergence problems due to serious
numerical inaccuracy when doing inversions to calculate the BFGS direction
of search. With scaled data, none of the elements of the gradient are very
large, and the maximum difference in orders of magnitude is 3. Convergence
is quick.
13.6.2. Multiple optima. Multiple optima (one global, others local) can
complicate life, since we have limited means of determining if there is a higher
maximum than the one we're at. Think of climbing a mountain in an unknown
range, in a very foggy place (Figure 13.6.1). You can go up until there's nowhere
else to go up, but since you're in the fog you don't know if the true summit
is across the gap that's at your feet. Do you claim victory and go home, or do
you trudge down the gap and explore the other side?

Figure 13.6.1: A foggy mountain

The best way to avoid stopping at a local maximum is to use many starting
values, for example on a grid, or randomly generated. Or perhaps one might
have priors about possible values for the parameters (e.g., from previous studies of similar data).


Let's try to find the true minimizer of minus 1 times the foggy mountain
function (since the algorithms are set up to minimize). From the picture, you
can see it's close to (0.037, 0), but let's pretend there is fog, and that we don't know
that. The program FoggyMountain.m shows that poor start values can lead to
problems. It uses SA, which finds the true global minimum, and it shows that
BFGS using a battery of random start values can also find the global minimum.
The output of one run is here:
MPITB extensions found

======================================================
BFGSMIN final results

Used numeric gradient

------------------------------------------------------
STRONG CONVERGENCE
Function conv 1  Param conv 1  Gradient conv 1
------------------------------------------------------
Objective function value -0.0130329
Stepsize 0.102833
43 iterations
------------------------------------------------------

    param    gradient     change
  15.9999     -0.0000     0.0000
 -28.8119      0.0000     0.0000


The result with poor start values is

ans =

   16.000  -28.812

================================================
SAMIN final results
NORMAL CONVERGENCE

Func. tol. 1.000000e-10  Param. tol. 1.000000e-03
Obj. fn. value -0.100023

   parameter   search width
    0.037419       0.000018
   -0.000000       0.000051
================================================

Now try a battery of random start values and
a short BFGS on each, then iterate to convergence.
The result using 20 random start values:

ans =

   3.7417e-02   2.7628e-07

The true maximizer is near (0.037, 0).


In that run, the single BFGS run with bad start values converged to a point far
from the true minimizer, while simulated annealing and BFGS using a battery
of random start values both found the true maximizer. The moral of the story is:
be cautious and don't publish your results too quickly.


Exercises
(1) In Octave, type help bfgsmin_example, to find out the location of the
file. Edit the file to examine it and learn how to call bfgsmin. Run it, and
examine the output.
(2) In Octave, type help samin_example, to find out the location of the
file. Edit the file to examine it and learn how to call samin. Run it, and
examine the output.
(3) Using logit.m and EstimateLogit.m as templates, write a function to calculate the probit loglikelihood, and a script to estimate a probit model. Run
it using data that actually follows a logit model (you can generate it in the
same way that is done in the logit example).
(4) Study mle_results.m to see what it does. Examine the functions that
mle_results.m calls, and in turn the functions that those functions call.
Write a complete description of how the whole chain works.
(5) Look at the Poisson estimation results for the OBDV measure of health care
use and give an economic interpretation. Estimate Poisson models for the
other 5 measures of health care usage.

CHAPTER 14

Asymptotic properties of extremum estimators


Readings: Gourieroux and Monfort (1995), Vol. 2, Ch. 24; Amemiya, Ch. 4, section 4.1; Davidson and MacKinnon, pp. 591-96; Gallant, Ch. 3; Newey and McFadden (1994), "Large Sample Estimation and Hypothesis Testing," in Handbook of Econometrics, Vol. 4, Ch. 36.
Handbook of Econometrics, Vol. 4, Ch. 36.

14.1. Extremum estimators

In Definition 12.0.1 we defined an extremum estimator θ̂ as the optimizing
element of an objective function s_n(θ) over a set Θ. Let the objective function
s_n(Z_n, θ) depend upon an n × p random matrix Z_n = [z_1 z_2 ⋯ z_n]', where the z_t
are p-vectors and p is finite.

EXAMPLE 18. Given the model y_i = x_i'θ + ε_i, with n observations, define
z_i = (y_i, x_i')'. The OLS estimator minimizes

s_n(Z_n, θ) = (1/n) Σ_{i=1}^n (y_i − x_i'θ)² = (1/n) ‖Y − Xθ‖²,

where Y and X are defined similarly to Z_n.


14.2. Consistency

The following theorem is patterned on a proof in Gallant (1987) (the article,
ref. later), which we'll see in its original form later in the course. It is interesting to compare the following proof with Amemiya's Theorem 4.1.1, which is
done in terms of convergence in probability.

THEOREM 19. [Consistency of e.e.] Suppose that θ̂_n is obtained by maximizing s_n(θ) over Θ. Assume
(1) Compactness: The parameter space Θ is an open bounded subset of Euclidean
space ℝ^K. The closure of Θ, Θ̄, is compact.
(2) Uniform Convergence: There is a nonstochastic function s_∞(θ) that is
continuous in θ on Θ̄ such that

lim_{n→∞} sup_{θ∈Θ̄} |s_n(θ) − s_∞(θ)| = 0, a.s.

(3) Identification: s_∞(·) has a unique global maximum at θ^0 ∈ Θ, i.e.,
s_∞(θ^0) > s_∞(θ), ∀θ ≠ θ^0, θ ∈ Θ̄.
Then θ̂_n → θ^0, a.s.

Proof: Select a ω ∈ Ω and hold it fixed. Then {s_n(ω, θ)} is a fixed sequence
of functions. Suppose that ω is such that s_n(ω, θ) converges uniformly to s_∞(θ).
This happens with probability one by assumption (2). The sequence {θ̂_n} lies
in the compact set Θ̄, by assumption (1) and the fact that maximization is over
Θ̄. Since every sequence from a compact set has at least one limit point (Davidson, Thm. 2.12), say that θ̂ is a limit point of {θ̂_n}. There is a subsequence {θ̂_{n_m}}
({n_m} is simply a sequence of increasing integers) with lim_{m→∞} θ̂_{n_m} = θ̂. By
uniform convergence and continuity,

lim_{m→∞} s_{n_m}(θ̂_{n_m}) = s_∞(θ̂).

To see this, first of all, select an element θ̂_t from the sequence {θ̂_{n_m}}. Then
uniform convergence implies

lim_{m→∞} s_{n_m}(θ̂_t) = s_∞(θ̂_t).

Continuity of s_∞(·) implies that

lim_{t→∞} s_∞(θ̂_t) = s_∞(θ̂),

since the limit as t → ∞ of {θ̂_t} is θ̂. So the above claim is true.
Next, by maximization,

s_{n_m}(θ̂_{n_m}) ≥ s_{n_m}(θ^0),

which holds in the limit, so

lim_{m→∞} s_{n_m}(θ̂_{n_m}) ≥ lim_{m→∞} s_{n_m}(θ^0).

However,

lim_{m→∞} s_{n_m}(θ̂_{n_m}) = s_∞(θ̂),

as seen above, and

lim_{m→∞} s_{n_m}(θ^0) = s_∞(θ^0)

by uniform convergence, so

s_∞(θ̂) ≥ s_∞(θ^0).

But by assumption (3), there is a unique global maximum of s_∞(θ) at θ^0, so
we must have s_∞(θ̂) = s_∞(θ^0) and θ̂ = θ^0. Finally, all of the above limits hold
almost surely, since so far we have held ω fixed, but now we need to consider
all ω ∈ Ω. Therefore {θ̂_n} has only one limit point, θ^0, except on a set C ⊂ Ω
with P(C) = 0.
Discussion of the proof:

- This proof relies on the identification assumption of a unique global
maximum at θ^0. An equivalent way to state this is
(c) Identification: Any point θ in Θ̄ with s_∞(θ) ≥ s_∞(θ^0) must have
|θ − θ^0| = 0,
which matches the way we will write the assumption in the section on nonparametric inference.
- We assume that θ̂_n is in fact a global maximum of s_n(θ). It is not required to be unique for n finite, though the identification assumption
requires that the limiting objective function have a unique maximizing argument. The next section on numeric optimization methods will
show that actually finding the global maximum of s_n(θ) may be a non-trivial problem.
- See Amemiya's Example 4.1.4 for a case where discontinuity leads to
breakdown of consistency.
- The assumption that θ^0 is in the interior of Θ (part of the identification assumption) has not been used to prove consistency, so we could
directly assume that θ^0 is simply an element of a compact set Θ. The
reason that we assume it's in the interior here is that this is necessary
for subsequent proof of asymptotic normality, and I'd like to maintain
a minimal set of simple assumptions, for clarity. Parameters on the
boundary of the parameter set cause theoretical difficulties that we
will not deal with in this course. Just note that conventional hypothesis testing methods do not apply in this case.
- Note that s_n(θ) is not required to be continuous, though s_∞(θ) is.
- The following figures illustrate why uniform convergence is important.

With uniform convergence, the maximum of the sample
objective function eventually must be in the neighborhood
of the maximum of the limiting objective function.

With pointwise convergence, the sample objective function
may have its maximum far away from that of the limiting
objective function.

We need a uniform strong law of large numbers in order to verify assumption (2) of Theorem 19. The following theorem is from Davidson, pg. 337.

THEOREM 20. [Uniform Strong LLN] Let {G_n(θ)} be a sequence of stochastic
real-valued functions on a totally-bounded metric space (Θ, ρ). Then

sup_{θ∈Θ} |G_n(θ)| → 0, a.s.,

if and only if
(a) G_n(θ) → 0, a.s., for each θ ∈ Θ_0, where Θ_0 is a dense subset of Θ, and
(b) {G_n(θ)} is strongly stochastically equicontinuous.

- The metric space we are interested in now is simply Θ ⊂ ℝ^K, using
the Euclidean norm.
- The pointwise almost sure convergence needed for assumption (a) comes
from one of the usual SLLNs.


Stronger assumptions that imply those of the theorem are:

- the parameter space is compact (this has already been assumed);
- the objective function is continuous and bounded with probability one on the entire parameter space;
- a standard SLLN can be shown to apply to some point in the parameter space.

These are reasonable conditions in many cases, and henceforth when
dealing with specific estimators we'll simply assume that pointwise
almost sure convergence can be extended to uniform almost sure convergence in this way.

- The more general theorem is useful in the case that the limiting objective function can be continuous in θ even if s_n(θ) is discontinuous.
This can happen because discontinuities may be smoothed out as we
take expectations over the data. In the section on simulation-based estimation we will see a case of a discontinuous objective function.

14.3. Example: Consistency of Least Squares

We suppose that data is generated by random sampling of (y, w), where
y_t = α^0 + β^0 w_t + ε_t. (w_t, ε_t) has the common distribution function μ_w μ_ε (w and
ε are independent) with support W × E. Suppose that the variances σ²_w and σ²_ε
are finite. Let θ^0 = (α^0, β^0)' ∈ Θ, for which Θ is compact. Let x_t = (1, w_t)', so
we can write y_t = x_t'θ^0 + ε_t.

The sample objective function for a sample size n is

s_n(θ) = (1/n) Σ_{t=1}^n (y_t − x_t'θ)² = (1/n) Σ_{t=1}^n (x_t'θ^0 + ε_t − x_t'θ)²
       = (1/n) Σ_{t=1}^n [x_t'(θ^0 − θ)]² + (2/n) Σ_{t=1}^n x_t'(θ^0 − θ) ε_t + (1/n) Σ_{t=1}^n ε_t².

- Considering the last term, by the SLLN,

(1/n) Σ_{t=1}^n ε_t² → ∫_W ∫_E ε² dμ_W dμ_E = σ²_ε, a.s.

- Considering the second term, since E(ε) = 0 and w and ε are independent, the SLLN implies that it converges to zero.
- Finally, for the first term, for a given θ, we assume that a SLLN applies
so that

(14.3.1)
(1/n) Σ_{t=1}^n [x_t'(θ^0 − θ)]² → ∫_W [x'(θ^0 − θ)]² dμ_W
  = (α^0 − α)² + 2(α^0 − α)(β^0 − β) ∫_W w dμ_W + (β^0 − β)² ∫_W w² dμ_W
  = (α^0 − α)² + 2(α^0 − α)(β^0 − β) E(w) + (β^0 − β)² E(w²), a.s.

Finally, the objective function is clearly continuous, and the parameter space
is assumed to be compact, so the convergence is also uniform. Thus,

s_∞(θ) = (α^0 − α)² + 2(α^0 − α)(β^0 − β) E(w) + (β^0 − β)² E(w²) + σ²_ε.

A minimizer of this is clearly α = α^0, β = β^0.

EXERCISE 21. Show that in order for the above solution to be unique it is
necessary that E(w²) ≠ 0. Discuss the relationship between this condition and
the problem of colinearity of regressors.


This example shows that Theorem 19 can be used to prove strong consistency of the OLS estimator. There are easier ways to show this, of course - this
is only an example of application of the theorem.
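To see the result at work, one can simulate from the model and watch the OLS estimates settle down as n grows (the true values and the distributions of w and ε below are illustrative assumptions):

% simulation check: OLS estimates approach (alpha0, beta0) as n grows
alpha0 = 1; beta0 = 2;                   % illustrative true values
for n = [100 10000 1000000]
  w = randn(n,1);                        % a distribution with E(w^2) > 0
  e = randn(n,1);
  y = alpha0 + beta0*w + e;
  x = [ones(n,1) w];
  thetahat = x \ y;                      % OLS
  printf("n = %8d  alphahat = %8.4f  betahat = %8.4f\n", n, thetahat(1), thetahat(2));
endfor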

14.4. Asymptotic Normality

A consistent estimator is oftentimes not very useful unless we know how
fast it is likely to be converging to the true value, and the probability that it
is far away from the true value. Establishment of asymptotic normality with
a known scaling factor solves these two problems. The following theorem is
similar to Amemiya's Theorem 4.1.3 (pg. 111).

THEOREM 22. [Asymptotic normality of e.e.] In addition to the assumptions
of Theorem 19, assume
(a) J_n(θ) ≡ D²_θ s_n(θ) exists and is continuous in an open, convex neighborhood of θ^0.
(b) {J_n(θ_n)} → J_∞(θ^0), a.s., a finite negative definite matrix, for any sequence
{θ_n} that converges almost surely to θ^0.
(c) √n D_θ s_n(θ^0) →^d N[0, I_∞(θ^0)], where I_∞(θ^0) = lim_{n→∞} Var √n D_θ s_n(θ^0).
Then √n (θ̂ − θ^0) →^d N[0, J_∞(θ^0)^{−1} I_∞(θ^0) J_∞(θ^0)^{−1}].

Proof: By Taylor expansion:

D_θ s_n(θ̂_n) = D_θ s_n(θ^0) + D²_θ s_n(θ*)(θ̂ − θ^0),

where θ* = λθ̂ + (1 − λ)θ^0, 0 ≤ λ ≤ 1.

- Note that θ̂ will be in the neighborhood where D²_θ s_n(θ) exists with
probability one as n becomes large, by consistency.
- Now the l.h.s. of this equation is zero, at least asymptotically, since θ̂_n
is a maximizer and the f.o.c. must hold exactly since the limiting
objective function is strictly concave in a neighborhood of θ^0.
- Also, since θ* is between θ̂_n and θ^0, and since θ̂_n → θ^0, a.s., assumption (b)
gives

D²_θ s_n(θ*) → J_∞(θ^0), a.s.

So

0 = D_θ s_n(θ^0) + [J_∞(θ^0) + o_s(1)] (θ̂ − θ^0).

And

0 = √n D_θ s_n(θ^0) + [J_∞(θ^0) + o_s(1)] √n (θ̂ − θ^0).

Now J_∞(θ^0) is a finite negative definite matrix, so the o_s(1) term is asymptotically irrelevant next to J_∞(θ^0), so we can write

0 ≈ √n D_θ s_n(θ^0) + J_∞(θ^0) √n (θ̂ − θ^0),
√n (θ̂ − θ^0) ≈ −J_∞(θ^0)^{−1} √n D_θ s_n(θ^0).

Because of assumption (c), and the formula for the variance of a linear combination of r.v.'s,

√n (θ̂ − θ^0) →^d N[0, J_∞(θ^0)^{−1} I_∞(θ^0) J_∞(θ^0)^{−1}].

- Assumption (b) is not implied by the Slutsky theorem. The Slutsky
theorem says that g(x_n) → g(x), a.s., if x_n → x and g(·) is continuous at x.
However, in our case J_n(θ_n) is a function of n. A theorem which applies (Amemiya,
Ch. 4) is

THEOREM 23. If g_n(θ) converges uniformly almost surely to a nonstochastic
function g_∞(θ) uniformly on an open neighborhood of θ^0, then g_n(θ̂) → g_∞(θ^0), a.s.,
if g_∞(θ^0) is continuous at θ^0 and θ̂ → θ^0, a.s.

- To apply this to the second derivatives, sufficient conditions would
be that the second derivatives be strongly stochastically equicontinuous on a neighborhood of θ^0, and that an ordinary LLN applies to the
derivatives when evaluated at θ^0.
- Stronger conditions that imply this are as above: continuous and bounded
second derivatives in a neighborhood of θ^0.
- Skip this in lecture. A note on the order of these matrices: Supposing
that s_n(θ) is representable as an average of n terms, which is the case
for all estimators we consider, D²_θ s_n(θ) is also an average of n matrices,
the elements of which are not centered (they do not have zero expectation). Supposing a SLLN applies, the almost sure limit of D²_θ s_n(θ^0)
is J_∞(θ^0) = O(1), as we saw in Example 51. On the other hand, assumption (c):

√n D_θ s_n(θ^0) →^d N[0, I_∞(θ^0)]

means that √n D_θ s_n(θ^0) = O_p(1), where we use the result of Example 49. If we were to omit the
√n, we'd have

D_θ s_n(θ^0) = n^{−1/2} O_p(1) = O_p(n^{−1/2}),

where we use the fact that O_p(n^r) O_p(n^q) = O_p(n^{r+q}). The sequence
D_θ s_n(θ^0) is centered, so we need to scale by √n to avoid convergence
to zero.

14.5. Examples

14.5.1. Binary response models. Binary response models arise in a variety
of contexts. We've already seen a logit model. Another simple example is a
probit threshold-crossing model. Assume that

y* = x'β − ε
y = 1(y* > 0),

where ε ~ N(0,1). Here, y* is an unobserved (latent) continuous variable, and y is a binary variable that indicates whether y* is negative or positive. Then

Pr(y = 1) = Pr(ε < x'β) = Φ(x'β),

where Φ(·) is the standard normal distribution function.

In general, a binary response model will require that the choice probability
be parameterized in some form. For a vector of explanatory variables x, the
response probability will be parameterized in some manner

Pr(y = 1|x) = p(x, θ).

If p(x, θ) = Λ(x'θ), where Λ(·) is the logistic distribution function,
we have a logit model. If p(x, θ) = Φ(x'θ), where Φ(·) is the
standard normal distribution function, then we have a probit model.

Regardless of the parameterization, we are dealing with a Bernoulli density,

f_{Y_i}(y_i|x_i) = p(x_i, θ)^{y_i} [1 − p(x_i, θ)]^{1−y_i},

so as long as the observations are independent, the maximum likelihood (ML)
estimator, θ̂, is the maximizer of

(14.5.1)
s_n(θ) = (1/n) Σ_{i=1}^n { y_i ln p(x_i, θ) + (1 − y_i) ln[1 − p(x_i, θ)] } ≡ (1/n) Σ_{i=1}^n s(y_i, x_i, θ).

Following the above theoretical results, θ̂ tends in probability to the θ^0 that
maximizes the uniform almost sure limit of s_n(θ). Noting that E(y_i|x_i) = p(x_i, θ^0),
and following a SLLN for i.i.d. processes, s_n(θ) converges almost surely to the
expectation of a representative term s(y, x, θ). First one can take the expectation
conditional on x to get

E_{y|x}{ y ln p(x, θ) + (1 − y) ln[1 − p(x, θ)] } = p(x, θ^0) ln p(x, θ) + [1 − p(x, θ^0)] ln[1 − p(x, θ)].

Next taking expectation over x we get the limiting objective function

(14.5.2)
s_∞(θ) = ∫_X { p(x, θ^0) ln p(x, θ) + [1 − p(x, θ^0)] ln[1 − p(x, θ)] } μ(x) dx,

where μ(x) is the (joint - the integral is understood to be multiple, and X is the
support of x) density function of the explanatory variables x. This is clearly
continuous in θ, as long as p(x, θ) is continuous, and if the parameter space is
compact we therefore have uniform almost sure convergence. Note that p(x, θ) is
continuous for the logit and probit models, for example. The maximizing
element of s_∞(θ), θ*, solves the first order conditions

∫_X { [p(x, θ^0)/p(x, θ*)] ∂p(x, θ*)/∂θ − [ (1 − p(x, θ^0)) / (1 − p(x, θ*)) ] ∂p(x, θ*)/∂θ } μ(x) dx = 0.

This is clearly solved by θ* = θ^0. Provided the solution is unique, θ̂ is consistent. Question: what's needed to ensure that the solution is unique?

The asymptotic normality theorem tells us that

√n (θ̂ − θ^0) →^d N[0, J_∞(θ^0)^{−1} I_∞(θ^0) J_∞(θ^0)^{−1}].

In the case of i.i.d. observations I_∞(θ^0) = lim_{n→∞} Var √n D_θ s_n(θ^0) is simply
the expectation of a typical element of the outer product of the gradient.

- There's no need to subtract the mean, since it's zero, following the
f.o.c. in the consistency proof above and the fact that observations are
i.i.d.
- The terms in n also drop out by the same argument:

lim Var √n D_θ s_n(θ^0) = lim Var √n D_θ (1/n) Σ_t s(θ^0)
  = lim Var (1/√n) Σ_t D_θ s(θ^0)
  = lim (1/n) Var Σ_t D_θ s(θ^0)
  = lim Var D_θ s(θ^0) = Var D_θ s(θ^0).

So we get

I_∞(θ^0) = E{ [∂s(y, x, θ^0)/∂θ] [∂s(y, x, θ^0)/∂θ'] }.

Likewise,

J_∞(θ^0) = E[ ∂²s(y, x, θ^0)/∂θ∂θ' ].

Expectations are jointly over y and x, or equivalently, first over y conditional
on x, then over x. From above, a typical element of the objective function is

s(y, x, θ^0) = y ln p(x, θ^0) + (1 − y) ln[1 − p(x, θ^0)].

Now suppose that we are dealing with a correctly specified logit model:

p(x, θ) = (1 + exp(−x'θ))^{−1}.

We can simplify the above results in this case. We have that

∂p(x, θ)/∂θ = (1 + exp(−x'θ))^{−2} exp(−x'θ) x
            = (1 + exp(−x'θ))^{−1} [exp(−x'θ)/(1 + exp(−x'θ))] x
            = p(x, θ)[1 − p(x, θ)] x
            = [p(x, θ) − p(x, θ)²] x.

So

(14.5.3)    ∂s(y, x, θ^0)/∂θ = [y − p(x, θ^0)] x
(14.5.4)    ∂²s(y, x, θ^0)/∂θ∂θ' = −[p(x, θ^0) − p(x, θ^0)²] x x'.

Taking expectations over y then x gives

(14.5.5)    I_∞(θ^0) = ∫ E_Y[ y² − 2 y p(x, θ^0) + p(x, θ^0)² ] x x' μ(x) dx
(14.5.6)             = ∫ [ p(x, θ^0) − p(x, θ^0)² ] x x' μ(x) dx,

where we use the fact that E_Y(y) = E_Y(y²) = p(x, θ^0). Likewise,

J_∞(θ^0) = −∫ [ p(x, θ^0) − p(x, θ^0)² ] x x' μ(x) dx.

Note that we arrive at the expected result: the information matrix equality
holds (that is, J_∞(θ^0) = −I_∞(θ^0)). With this,

√n (θ̂ − θ^0) →^d N[0, J_∞(θ^0)^{−1} I_∞(θ^0) J_∞(θ^0)^{−1}]

simplifies to

√n (θ̂ − θ^0) →^d N[0, −J_∞(θ^0)^{−1}],

which can also be expressed as

√n (θ̂ − θ^0) →^d N[0, I_∞(θ^0)^{−1}].

On a final note, the logit and standard normal CDFs are very similar - the
logit distribution is a bit more fat-tailed. While coefficients will vary slightly
between the two models, functions of interest such as estimated probabilities
p(x, θ̂) will be virtually identical for the two models.

14.6. Example: Linearization of a nonlinear model

Ref. Gourieroux and Monfort, section 8.3.4. White, Int'l Econ. Rev. 1980 is
an earlier reference.

Suppose we have a nonlinear model

y_i = h(x_i, θ^0) + ε_i,

where ε_i ~ iid(0, σ²). The nonlinear least squares estimator solves

θ̂_n = arg min (1/n) Σ_{i=1}^n [y_i − h(x_i, θ)]².

We'll study this more later, but for now it is clear that the foc for minimization
will require solving a set of nonlinear equations. A common approach to the
problem seeks to avoid this difficulty by linearizing the model. A first order
Taylor's series expansion about the point x^0 with remainder gives

y_i = h(x^0, θ^0) + (x_i − x^0)' ∂h(x^0, θ^0)/∂x + ν_i,

where ν_i encompasses both ε_i and the Taylor's series remainder. Note that ν_i
is no longer a classical error - its mean is not zero. We should expect problems.

Define

α* = h(x^0, θ^0) − x^0' ∂h(x^0, θ^0)/∂x
β* = ∂h(x^0, θ^0)/∂x.

Given this, one might try to estimate α* and β* by applying OLS to

y_i = α + β x_i + ν_i.

- Question: will α̂ and β̂ be consistent for α* and β*?
- The answer is no, as one can see by interpreting α̂ and β̂ as extremum
estimators. Let γ = (α, β')'. Then

γ̂ = arg min s_n(γ) = (1/n) Σ_{i=1}^n (y_i − α − β x_i)².

The objective function converges to its expectation

s_n(γ) → s_∞(γ) = E_X E_{Y|X} (y − α − βx)², u.a.s.,

and γ̂ converges a.s. to the γ^0 that minimizes s_∞(γ):

γ^0 = arg min E_X E_{Y|X} (y − α − βx)².

Noting that

E_X E_{Y|X} (y − α − βx)² = E_X E_{Y|X} [h(x, θ^0) + ε − α − βx]² = σ² + E_X [h(x, θ^0) − α − βx]²,

since cross products involving ε drop out, α^0 and β^0 correspond to the hyperplane that is closest to the true regression function h(x, θ^0) according to the
mean squared error criterion. This depends on both the shape of h(·) and the
density function of the conditioning variables.

Figure: Inconsistency of the linear approximation, even at the approximation point x^0 (the tangent line at x^0 and the fitted OLS line both differ from h(x, θ)).

- It is clear that the tangent line does not minimize MSE, since, for example, if h(x, θ^0) is concave, all errors between the tangent line and
the true function are negative.
- Note that the true underlying parameter θ^0 is not estimated consistently, either (it may be of a different dimension than the dimension
of the parameter of the approximating model, which is 2 in this example).
- Second order and higher-order approximations suffer from exactly
the same problem, though to a less severe degree, of course. For
this reason, translog, Generalized Leontief and other flexible functional forms based upon second-order approximations in general suffer from bias and inconsistency. The bias may not be too important for
analysis of conditional means, but it can be very important for analyzing first and second derivatives. In production and consumer analysis,
first and second derivatives (e.g., elasticities of substitution) are often
of interest, so in this case, one should be cautious of unthinking application of models that impose strong restrictions on second derivatives.
- This sort of linearization about a long run equilibrium is a common
practice in dynamic macroeconomic models. It is justified for the purposes of theoretical analysis of a model given the model's parameters,
but it is not justifiable for the estimation of the parameters of the model
using data. The section on simulation-based methods offers a means
of obtaining consistent estimators of the parameters of dynamic macro
models that are too complex for standard methods of analysis.


Exercises

(1) Suppose that x_t ~ uniform(0,1), and y_t = 1 − x_t² + ε_t, where ε_t is iid(0, σ²).
Suppose we estimate the misspecified model y_t = α + β x_t + η_t by OLS. Find
the numeric values of α^0 and β^0 that are the probability limits of α̂ and β̂.
(2) Verify your results using Octave by generating data that follows the above
model, and calculating the OLS estimator. When the sample size is very
large the estimator should be very close to the analytical results you obtained in question 1.
(3) Use the asymptotic normality theorem to find the asymptotic distribution
of the ML estimator of β^0 for the model y = xβ^0 + ε, where ε ~ N(0,1)
and is independent of x. This means finding D²_β s_n(β), J_∞(β^0), D_β s_n(β)
and I_∞(β^0).

CHAPTER 15

Generalized method of moments (GMM)


Readings: Hamilton Ch. 14 ; Davidson and MacKinnon, Ch. 17 (see pg.

587 for refs. to applications); Newey and McFadden (1994), Large Sample
Estimation and Hypothesis Testing, in Handbook of Econometrics, Vol. 4, Ch.
36.

15.1. Definition

We've already seen one example of GMM in the introduction, based upon
the χ² distribution. Consider the following example based upon the t-distribution.
The density function of a t-distributed r.v. Y_t is

f_{Y_t}(y_t, θ^0) = Γ[(θ^0 + 1)/2] / { (πθ^0)^{1/2} Γ(θ^0/2) } · [1 + y_t²/θ^0]^{−(θ^0+1)/2}.

Given an iid sample of size n, one could estimate θ^0 by maximizing the log-likelihood function

θ̂ ≡ arg max_Θ ln L_n(θ) = Σ_{t=1}^n ln f_{Y_t}(y_t, θ).

- This approach is attractive since ML estimators are asymptotically efficient. This is because the ML estimator uses all of the available information (e.g., the distribution is fully specified up to a parameter). Recalling that a distribution is completely characterized by its moments,
the ML estimator is interpretable as a GMM estimator that uses all of
the moments. The method of moments estimator uses only K moments to estimate a K-dimensional parameter. Since information is
discarded, in general, by the MM estimator, efficiency is lost relative
to the ML estimator.
- Continuing with the example, a t-distributed r.v. with density f_{Y_t}(y_t, θ^0)
has mean zero and variance V(y_t) = θ^0/(θ^0 − 2) (for θ^0 > 2).
- Using the notation introduced previously, define a moment condition
m_{1t}(θ) = θ/(θ − 2) − y_t² and m_1(θ) = (1/n) Σ_{t=1}^n m_{1t}(θ) = θ/(θ − 2) − (1/n) Σ_{t=1}^n y_t².
As before, when evaluated at the true parameter value θ^0, both
E_{θ^0}[m_{1t}(θ^0)] = 0 and E_{θ^0}[m_1(θ^0)] = 0.
- Choosing θ̂ to set m_1(θ̂) = 0 yields a MM estimator:

(15.1.1)    θ̂ = 2 / [ 1 − n / Σ_t y_t² ].

This estimator is based on only one moment of the distribution - it uses less
information than the ML estimator, so it is intuitively clear that the MM estimator will be inefficient relative to the ML estimator.

- An alternative MM estimator could be based upon the fourth moment
of the t-distribution. The fourth moment of a t-distributed r.v. is

μ_4 ≡ E(y_t^4) = 3(θ^0)² / [(θ^0 − 2)(θ^0 − 4)],

provided θ^0 > 4. We can define a second moment condition

m_2(θ) = 3θ² / [(θ − 2)(θ − 4)] − (1/n) Σ_{t=1}^n y_t^4.

- A second, different MM estimator chooses θ̂ to set m_2(θ̂) = 0. If you
solve this you'll see that the estimate is different from that in equation
15.1.1.

This estimator isn't efficient either, since it uses only one moment. A GMM estimator would use the two moment conditions together to estimate the single
parameter. The GMM estimator is overidentified, which leads to an estimator which is efficient relative to the just identified MM estimators (more on
efficiency later).

As before, set m_n(θ) = (m_1(θ), m_2(θ))', stacking the averaged moment conditions. The n
subscript is used to indicate the sample size. Note that m_n(θ^0) = O_p(n^{−1/2}), since it is an
average of centered random variables, whereas m_n(θ) = O_p(1), θ ≠ θ^0,
where expectations are taken using the true distribution with parameter θ^0. This is the fundamental reason that GMM is consistent.

- A GMM estimator requires defining a measure of distance, d(m_n(θ)). A
popular choice (for reasons noted below) is to set d(m_n(θ)) = m_n'W_n m_n,
and we minimize s_n(θ) = m_n(θ)'W_n m_n(θ). We assume W_n converges to a
finite positive definite matrix.
- In general, assume we have g moment conditions, so m_n(θ) is a g-vector
and W_n is a g × g matrix.

For the purposes of this course, the following definition of the GMM estimator
is sufficiently general:

DEFINITION 24. The GMM estimator of the K-dimensional parameter vector θ^0 is θ̂ ≡ arg min_Θ s_n(θ) ≡ m_n(θ)'W_n m_n(θ), where m_n(θ) = (1/n) Σ_{t=1}^n m_t(θ)
is a g-vector, g ≥ K, with E_θ m(θ) = 0, and W_n converges almost surely to a finite
g × g symmetric positive definite matrix W_∞.
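A sketch of the overidentified GMM objective for the t-distribution example, using both moment conditions and, for illustration, an identity weighting matrix (this is assumed code, not from the course materials):

1;
function s = gmm_obj(theta, y, W)
  % stacked moment conditions for the t-distribution example
  m1 = theta/(theta - 2) - mean(y.^2);                     % second moment condition
  m2 = 3*theta^2/((theta - 2)*(theta - 4)) - mean(y.^4);   % fourth moment condition
  m = [m1; m2];
  s = m' * W * m;                                          % GMM objective m_n' W_n m_n
endfunction

% given data y (theta must exceed 4 for both moments to exist), one could
% minimize the objective over a grid or with a numeric optimizer, e.g.:
%   grid = linspace(4.1, 30, 1000);
%   [smin, i] = min(arrayfun(@(t) gmm_obj(t, y, eye(2)), grid));  thetahat = grid(i)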

What's the reason for using GMM if MLE is asymptotically efficient?

- Robustness: GMM is based upon a limited set of moment conditions.
For consistency, only these moment conditions need to be correctly
specified, whereas MLE in effect requires correct specification of every
conceivable moment condition. GMM is robust with respect to distributional misspecification. The price for robustness is loss of efficiency with
respect to the MLE estimator. Keep in mind that the true distribution
is not known, so if we erroneously specify a distribution and estimate
by MLE, the estimator will be inconsistent in general (not always).
- Feasibility: in some cases the MLE estimator is not available, because we are not able to deduce the likelihood function. More
on this in the section on simulation-based estimation. The GMM
estimator may still be feasible even though MLE is not possible.

15.2. Consistency
We simply assume that the assumptions of Theorem 19 hold, so the GMM
estimator is strongly consistent. The only assumption that warrants additional comments is that of identification. In Theorem 19, the third assumption reads: (c) Identification: s_∞(·) has a unique global maximum at θ^0, i.e.,
s_∞(θ^0) > s_∞(θ), ∀θ ≠ θ^0. Taking the case of a quadratic objective function
s_n(θ) = m_n(θ)'W_n m_n(θ), first consider m_n(θ).

- Applying a uniform law of large numbers, we get m_n(θ) → m_∞(θ), a.s.
- Since E_{θ^0} m_n(θ^0) = 0 by assumption, m_∞(θ^0) = 0.
- Since s_∞(θ^0) = m_∞(θ^0)'W_∞ m_∞(θ^0) = 0, in order for asymptotic identification, we need that m_∞(θ) ≠ 0 for θ ≠ θ^0, for at least some element
of the vector. This and the assumption that W_n → W_∞, a.s., a finite positive
definite g × g matrix, guarantee that θ^0 is asymptotically identified.
- Note that asymptotic identification does not rule out the possibility
of lack of identification for a given data set - there may be multiple
minimizing solutions in finite samples.

15.3. Asymptotic normality


We also simply assume that the conditions of Theorem 22 hold, so we will
have asymptotic normality. However, we do need to find the structure of the
asymptotic variance-covariance matrix of the estimator. From Theorem 22, we
have

√n (θ̂ − θ^0) →^d N[0, J_∞(θ^0)^{−1} I_∞(θ^0) J_∞(θ^0)^{−1}],

where J_∞(θ^0) is the almost sure limit of ∂²s_n(θ)/∂θ∂θ' evaluated at θ^0, and
I_∞(θ^0) = lim_{n→∞} Var √n ∂s_n(θ^0)/∂θ.

We need to determine the form of these matrices given the objective function
s_n(θ) = m_n(θ)'W_n m_n(θ).

Now using the product rule from the introduction,

(15.3.1)    ∂s_n(θ)/∂θ = 2 [∂m_n'(θ)/∂θ] W_n m_n(θ).

Define the K × g matrix

D_n(θ) ≡ ∂m_n'(θ)/∂θ,

so:

∂s(θ)/∂θ = 2 D(θ) W m(θ).

(Note that s_n(θ), D_n(θ), W_n and m_n(θ) all depend on the sample size n, but it
is omitted to unclutter the notation.)

To take second derivatives, let D_i be the i-th row of D(θ). Using the product rule,

∂²s(θ)/∂θ'∂θ_i = 2 D_i(θ) W D(θ)' + 2 m(θ)'W [∂D_i(θ)'/∂θ'].

When evaluating the term 2 m_n(θ)'W [∂D_i(θ)'/∂θ'] at θ^0, assume that ∂D_i(θ^0)'/∂θ'
satisfies a LLN, so that it converges almost surely to a finite limit. In this case, we have

2 m_n(θ^0)'W [∂D_i(θ^0)'/∂θ'] → 0, a.s.,

since m_n(θ^0) = o_p(1) and W → W_∞, a.s.

Stacking these results over the K rows of D, we get

lim ∂²s_n(θ^0)/∂θ∂θ' = J_∞(θ^0) = 2 D_∞ W_∞ D_∞', a.s.,

where we define lim D_n = D_∞, a.s., and lim W_n = W_∞, a.s. (we assume a LLN
holds).

With regard to I_∞(θ^0), following equation 15.3.1, and noting that the scores
D_n W_n m_n(θ^0) have mean zero at θ^0 (since E m_n(θ^0) = 0 by assumption), we have

I_∞(θ^0) = lim Var √n ∂s_n(θ^0)/∂θ = lim E[ 4n D_n W_n m_n(θ^0) m_n(θ^0)'W_n D_n' ].

Now, given that m_n(θ^0) is an average of centered (mean-zero) quantities, it is
reasonable to expect a CLT to apply, after multiplication by √n. Assuming
this,

√n m_n(θ^0) →^d N(0, Ω_∞),

where Ω_∞ = lim E[ n m_n(θ^0) m_n(θ^0)' ].

Using this, and the last equation, we get

I_∞(θ^0) = 4 D_∞ W_∞ Ω_∞ W_∞ D_∞'.

Using these results, the asymptotic normality theorem gives us

√n (θ̂ − θ^0) →^d N[ 0, (D_∞ W_∞ D_∞')^{−1} D_∞ W_∞ Ω_∞ W_∞ D_∞' (D_∞ W_∞ D_∞')^{−1} ],

the asymptotic distribution of the GMM estimator for arbitrary weighting matrix W_n. Note that for J_∞ to be positive definite, D_∞ must have full row rank,
ρ(D_∞) = k.

15.4. Choosing the weighting matrix


$W$ is a weighting matrix, which determines the relative importance of violations of the individual moment conditions. For example, if we are much more sure of the first moment condition, which is based upon the variance, than of the second, which is based upon the fourth moment, we could set
\[
W = \begin{bmatrix} a & 0 \\ 0 & b \end{bmatrix}
\]
with $a$ much larger than $b$. In this case, errors in the second moment condition have less weight in the objective function.

• Since moments are not independent, in general, we should expect that there be a correlation between the moment conditions, so it may not be desirable to set the off-diagonal elements to 0. $W$ may be a random, data dependent matrix.
• We have already seen that the choice of $W$ will influence the asymptotic distribution of the GMM estimator. Since the GMM estimator is already inefficient w.r.t. MLE, we might like to choose the $W$ matrix to make the GMM estimator efficient within the class of GMM estimators defined by $m_n(\theta)$.
• To provide a little intuition, consider the linear model $y = X\beta + \varepsilon$, where $\varepsilon \sim N(0, \Omega)$. That is, we have heteroscedasticity and autocorrelation.
• Let $P$ be the Cholesky factorization of $\Omega^{-1}$, e.g., $P'P = \Omega^{-1}$.
• Then the model $Py = PX\beta + P\varepsilon$ satisfies the classical assumptions of homoscedasticity and nonautocorrelation, since $V(P\varepsilon) = P V(\varepsilon) P' = P\Omega P' = P(P'P)^{-1}P' = PP^{-1}(P')^{-1}P' = I_n$. (Note: we use $(AB)^{-1} = B^{-1}A^{-1}$ for $A$, $B$ both nonsingular). This means that the transformed model is efficient.
• The OLS estimator of the model $Py = PX\beta + P\varepsilon$ minimizes the objective function $(y - X\beta)'\Omega^{-1}(y - X\beta)$. Interpreting $(y - X\beta) = \varepsilon(\beta)$ as moment conditions (note that they do have zero expectation when evaluated at $\beta_0$), the optimal weighting matrix is seen to be the inverse of the covariance matrix of the moment conditions. This result carries over to GMM estimation. (Note: this presentation of GLS is not a GMM estimator, because the number of moment conditions here is equal to the sample size, $n$. Later we'll see that GLS can be put into the GMM framework defined above).

THEOREM 25. If $\hat{\theta}$ is a GMM estimator that minimizes $m_n(\theta)' W_n m_n(\theta)$, the asymptotic variance of $\hat{\theta}$ will be minimized by choosing $W_n$ so that $W_n \stackrel{a.s.}{\longrightarrow} W_\infty = \Omega_\infty^{-1}$, where $\Omega_\infty = \lim E\left[n\, m_n(\theta_0)\, m_n(\theta_0)'\right]$.

Proof: For $W_\infty = \Omega_\infty^{-1}$, the asymptotic variance
\[
\left(D_\infty W_\infty D_\infty'\right)^{-1} D_\infty W_\infty \Omega_\infty W_\infty D_\infty'\left(D_\infty W_\infty D_\infty'\right)^{-1}
\]
simplifies to $\left(D_\infty \Omega_\infty^{-1} D_\infty'\right)^{-1}$. Now, for any choice such that $W_\infty \neq \Omega_\infty^{-1}$, consider the difference of the inverses of the variances when $W = \Omega^{-1}$ versus when $W$ is some arbitrary positive definite matrix:
\[
D_\infty \Omega_\infty^{-1} D_\infty' - D_\infty W_\infty D_\infty'\left(D_\infty W_\infty \Omega_\infty W_\infty D_\infty'\right)^{-1} D_\infty W_\infty D_\infty'
= D_\infty \Omega_\infty^{-1/2}\left[I - \Omega_\infty^{1/2} W_\infty D_\infty'\left(D_\infty W_\infty \Omega_\infty W_\infty D_\infty'\right)^{-1} D_\infty W_\infty \Omega_\infty^{1/2}\right]\Omega_\infty^{-1/2} D_\infty',
\]
as can be verified by multiplication. The term in brackets is idempotent, which is also easy to check by multiplication, and is therefore positive semidefinite. A quadratic form in a positive semidefinite matrix is also positive semidefinite. The difference of the inverses of the variances is positive semidefinite, which implies that the difference of the variances is negative semidefinite, which proves the theorem.

The result
\[
(15.4.1)\qquad \sqrt{n}\left(\hat{\theta} - \theta_0\right) \stackrel{d}{\longrightarrow} N\left[0,\, \left(D_\infty \Omega_\infty^{-1} D_\infty'\right)^{-1}\right]
\]
allows us to treat
\[
\hat{\theta} \approx N\left[\theta_0,\, \frac{\left(D_\infty \Omega_\infty^{-1} D_\infty'\right)^{-1}}{n}\right],
\]
where $\approx$ means "approximately distributed as." To operationalize this we need estimators of $D_\infty$ and $\Omega_\infty$.

• The obvious estimator of $\widehat{D_\infty}$ is simply $\frac{\partial}{\partial\theta} m_n'\left(\hat{\theta}\right)$, which is consistent by the consistency of $\hat{\theta}$, assuming that $\frac{\partial}{\partial\theta} m_n'$ is continuous in $\theta$. Stochastic equicontinuity results can give us this result even if $\frac{\partial}{\partial\theta} m_n'$ is not continuous. We now turn to estimation of $\Omega_\infty$.
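As a rough illustration of the estimators just described, the following Octave sketch computes $\widehat{D_\infty}$ by central finite differences of the sample moments and then forms the estimated asymptotic variance of the efficient GMM estimator. The interface (the moments handle introduced earlier) and the step size h are assumptions for illustration; an estimate of $\Omega_\infty$ is taken as given here and is discussed in the next section.

% Sketch (hypothetical interface): Dhat = d m_n'(theta)/d theta by central differences,
% then Vhat = inv(Dhat * inv(Omegahat) * Dhat') / n, the estimated variance of thetahat.
function Vhat = gmm_variance(thetahat, moments, data, Omegahat)
  M0 = moments(thetahat, data);
  n = rows(M0);
  g = columns(M0);
  K = length(thetahat);
  Dhat = zeros(K, g);
  h = 1e-6;                                   % finite-difference step (assumed)
  for k = 1:K
    tp = thetahat; tm = thetahat;
    tp(k) += h; tm(k) -= h;
    Dhat(k,:) = (mean(moments(tp, data)) - mean(moments(tm, data))) / (2*h);
  endfor
  Vhat = inv(Dhat * inv(Omegahat) * Dhat') / n; % (15.4.1), divided by n
endfunction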

15.5. Estimation of the variance-covariance matrix


(See Hamilton Ch. 10, pp. 261-2 and 280-84.)

In the case that we wish to use the optimal weighting matrix, we need an estimate of $\Omega_\infty$, the limiting variance-covariance matrix of $\sqrt{n}\, m_n(\theta_0)$. While one could estimate $\Omega_\infty$ parametrically, we in general have little information upon which to base a parametric specification. In general, we expect that:

• $m_t$ will be autocorrelated ($\Gamma_{ts} = E(m_t m_{t-s}') \neq 0$). Note that this autocovariance will not depend on $t$ if the moment conditions are covariance stationary.
• contemporaneously correlated, since the individual moment conditions will not in general be independent of one another ($E(m_{it} m_{jt}) \neq 0$).
• and have different variances ($E(m_{it}^2) = \sigma_{it}^2$).

Since we need to estimate so many components if we are to take the parametric approach, it is unlikely that we would arrive at a correct parametric specification. For this reason, research has focused on consistent nonparametric estimators of $\Omega_\infty$.

Henceforth we assume that $m_t$ is covariance stationary (the covariance between $m_t$ and $m_{t-s}$ does not depend on $t$). Define the $v$-th autocovariance of the moment conditions $\Gamma_v = E(m_t m_{t-v}')$. Note that $E(m_t m_{t+v}') = \Gamma_v'$. Recall that $m_t$ and $m_n$ are functions of $\theta$, so for now assume that we have some consistent estimator of $\theta_0$, so that $\hat{m}_t = m_t(\hat{\theta})$. Now
\[
\Omega_n = E\left[n\, m_n(\theta_0)\, m_n(\theta_0)'\right]
= E\left[\frac{1}{n}\left(\sum_{t=1}^{n} m_t\right)\left(\sum_{t=1}^{n} m_t'\right)\right]
= \Gamma_0 + \frac{n-1}{n}\left(\Gamma_1 + \Gamma_1'\right) + \frac{n-2}{n}\left(\Gamma_2 + \Gamma_2'\right) + \cdots + \frac{1}{n}\left(\Gamma_{n-1} + \Gamma_{n-1}'\right).
\]
A natural, consistent estimator of $\Gamma_v$ is
\[
\hat{\Gamma}_v = \frac{1}{n}\sum_{t=v+1}^{n} \hat{m}_t \hat{m}_{t-v}'
\]
(you might use $n - v$ in the denominator instead). So, a natural, but inconsistent, estimator of $\Omega_\infty$ would be
\[
\hat{\Omega} = \hat{\Gamma}_0 + \sum_{v=1}^{n-1}\frac{n-v}{n}\left(\hat{\Gamma}_v + \hat{\Gamma}_v'\right).
\]
This estimator is inconsistent in general, since the number of parameters to estimate is more than the number of observations, and increases more rapidly than $n$, so information does not build up as $n \to \infty$.

On the other hand, supposing that $\Gamma_v$ tends to zero sufficiently rapidly as $v$ tends to $\infty$, a modified estimator
\[
\hat{\Omega} = \hat{\Gamma}_0 + \sum_{v=1}^{q(n)}\left(\hat{\Gamma}_v + \hat{\Gamma}_v'\right),
\]
where $q(n) \to \infty$ as $n \to \infty$, will be consistent, provided $q(n)$ grows sufficiently slowly. The term $\frac{n-v}{n}$ can be dropped because $q(n)$ must be $o(n)$. This allows information to accumulate at a rate that satisfies a LLN. A disadvantage of this estimator is that it may not be positive definite. This could cause one to calculate a negative $\chi^2$ statistic, for example!

• Note: the formula for $\hat{\Omega}$ requires an estimate of $m_n(\theta_0)$, which in turn requires an estimate of $\theta$, which is based upon an estimate of $\Omega$! The solution to this circularity is to set the weighting matrix $W$ arbitrarily (for example to an identity matrix), obtain a first consistent but inefficient estimate of $\theta_0$, then use this estimate to form $\hat{\Omega}$, then re-estimate $\theta_0$. The process can be iterated until neither $\hat{\Omega}$ nor $\hat{\theta}$ change appreciably between iterations.
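The two-step (or iterated) procedure just described can be outlined in Octave as follows. This is only a sketch under the same assumed moments interface used above: gmm_obj is the objective sketched earlier, estimate_omega stands for whichever covariance estimator is chosen (e.g., the Newey-West estimator of the next subsection), and theta_start, maxiter, g and data are user-supplied.

% Sketch of two-step / iterated efficient GMM (hypothetical helper names).
theta = theta_start;                    % initial value supplied by the user
W = eye(g);                             % arbitrary first-round weighting matrix
for iter = 1:maxiter
  obj = @(t) gmm_obj(t, @moments, data, W);
  theta_new = fminunc(obj, theta);      % minimize s_n(theta) for the current W
  Omegahat = estimate_omega(theta_new, @moments, data);  % e.g., Newey-West
  W_new = inv(Omegahat);                % efficient weighting matrix estimate
  if norm(theta_new - theta) < 1e-6     % stop when the estimates settle down
    theta = theta_new; W = W_new; break;
  endif
  theta = theta_new; W = W_new;
endfor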

15.5.1. Newey-West covariance estimator. The Newey-West estimator (Econometrica, 1987) solves the problem of possible nonpositive definiteness of the above estimator. Their estimator is
\[
\hat{\Omega} = \hat{\Gamma}_0 + \sum_{v=1}^{q(n)}\left[1 - \frac{v}{q+1}\right]\left(\hat{\Gamma}_v + \hat{\Gamma}_v'\right).
\]
This estimator is p.d. by construction. The condition for consistency is that $q = o(n^{1/4})$. Note that this is a very slow rate of growth for $q$. This estimator is nonparametric - we've placed no parametric restrictions on the form of $\Omega$. It is an example of a kernel estimator.

In a more recent paper, Newey and West (Review of Economic Studies, 1994) use pre-whitening before applying the kernel estimator. The idea is to fit a VAR model to the moment conditions. It is expected that the residuals of the VAR model will be more nearly white noise, so that the Newey-West covariance estimator might perform better with short lag lengths. The VAR model is
\[
\hat{m}_t = \Theta_1 \hat{m}_{t-1} + \cdots + \Theta_p \hat{m}_{t-p} + u_t.
\]
This is estimated, giving the residuals $\hat{u}_t$. Then the Newey-West covariance estimator is applied to these pre-whitened residuals, and the covariance $\Omega$ is estimated combining the fitted VAR with the kernel estimate of the covariance of the $u_t$. See Newey-West for details.

• I have a program that does this if you're interested.
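A minimal Octave sketch of the Newey-West estimator (without pre-whitening) follows. It assumes M is the $n \times g$ matrix whose $t$-th row is $\hat{m}_t'$, and q is the chosen lag truncation; both are user-supplied assumptions here.

% Sketch: Newey-West covariance estimator of Omega from an n x g matrix M of
% moment contributions (rows are mhat_t'), with Bartlett weights and q lags.
function Omegahat = newey_west(M, q)
  [n, g] = size(M);
  Omegahat = (M' * M) / n;                    % Gamma_0 hat
  for v = 1:q
    Gv = (M(v+1:n, :)' * M(1:n-v, :)) / n;    % Gamma_v hat = (1/n) sum_t mhat_t mhat_{t-v}'
    w = 1 - v / (q + 1);                      % Bartlett kernel weight
    Omegahat += w * (Gv + Gv');
  endfor
endfunction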

15.6. Estimation using conditional moments


If the above VAR model does succeed in removing unmodeled heteroscedasticity and autocorrelation, might this imply that this information is not being used efficiently in estimation? In other words, since the performance of GMM depends on which moment conditions are used, if the set of selected moments exhibits heteroscedasticity and autocorrelation, can't we use this information, a la GLS, to guide us in selecting a better set of moment conditions to improve efficiency? The answer to this may not be so clear when moments are defined unconditionally, but it can be analyzed more carefully when the moments used in estimation are derived from conditional moments.

So far, the moment conditions have been presented as unconditional expectations. One common way of defining unconditional moment conditions is based upon conditional moment conditions.

Suppose that a random variable $Y$ has zero expectation conditional on the random variable $X$:
\[
E_{Y|X}\, Y = \int Y f_{Y|X}(Y|X)\, dY = 0.
\]
Then the unconditional expectation of the product of $Y$ and a function $g(X)$ of $X$ is also zero. The unconditional expectation is
\[
E\left[Y g(X)\right] = \int_X\left(\int_Y Y g(X) f(Y, X)\, dY\right) dX.
\]
This can be factored into a conditional expectation and an expectation w.r.t. the marginal density of $X$:
\[
E\left[Y g(X)\right] = \int_X\left(\int_Y Y g(X) f(Y|X)\, dY\right) f(X)\, dX.
\]
Since $g(X)$ doesn't depend on $Y$ it can be pulled out of the integral
\[
E\left[Y g(X)\right] = \int_X\left(\int_Y Y f(Y|X)\, dY\right) g(X) f(X)\, dX.
\]
But the term in parentheses on the rhs is zero by assumption, so
\[
E\left[Y g(X)\right] = 0
\]
as claimed.

This is important econometrically, since models often imply restrictions on conditional moments. Suppose a model tells us that the function $K(y_t, x_t)$ has expectation, conditional on the information set $I_t$, equal to $k(x_t, \theta)$:
\[
E_\theta\left[K(y_t, x_t)|I_t\right] = k(x_t, \theta).
\]
For example, in the context of the classical linear model $y_t = x_t'\beta + \varepsilon_t$, we can set $K(y_t, x_t) = y_t$ so that $k(x_t, \theta) = x_t'\beta$. With this, the function
\[
h_t(\theta) = K(y_t, x_t) - k(x_t, \theta)
\]
has conditional expectation equal to zero
\[
E_\theta\left[h_t(\theta)|I_t\right] = 0.
\]
This is a scalar moment condition, which wouldn't be sufficient to identify a $K$-dimensional parameter $\theta$ ($K > 1$). However, the above result allows us to form various unconditional expectations
\[
m_t(\theta) = Z(w_t)\, h_t(\theta),
\]
where $Z(w_t)$ is a $g \times 1$-vector valued function of $w_t$, and $w_t$ is a set of variables drawn from the information set $I_t$. The $Z(w_t)$ are instrumental variables. We now have $g$ moment conditions, so as long as $g \geq K$ the necessary condition for identification holds.

One can form the $n \times g$ matrix
\[
Z_n = \begin{bmatrix} Z_1(w_1) & Z_2(w_1) & \cdots & Z_g(w_1)\\ Z_1(w_2) & Z_2(w_2) & \cdots & Z_g(w_2)\\ \vdots & & & \vdots\\ Z_1(w_n) & Z_2(w_n) & \cdots & Z_g(w_n)\end{bmatrix}
= \begin{bmatrix} Z_1'\\ Z_2'\\ \vdots\\ Z_n'\end{bmatrix},
\]
where $Z_t'$ is the $t$-th row of $Z_n$. With this we can form the $g$ moment conditions
\[
m_n(\theta) = \frac{1}{n} Z_n'\, h_n(\theta),\qquad h_n(\theta) = \begin{bmatrix} h_1(\theta)\\ \vdots\\ h_n(\theta)\end{bmatrix}.
\]
This fits the previous treatment. An interesting question that arises is how one should choose the instrumental variables $Z(w_t)$ to achieve maximum efficiency.

Note that with this choice of moment conditions, we have that $D_n \equiv \frac{\partial}{\partial\theta} m_n'(\theta)$ (a $K \times g$ matrix) is
\[
D_n(\theta) = \frac{\partial}{\partial\theta}\frac{1}{n} h_n'(\theta) Z_n = \frac{1}{n} H_n(\theta) Z_n,
\]
which we can define to be
\[
D_n(\theta) = \frac{1}{n} H_n Z_n,
\]
where $H_n$ is a $K \times n$ matrix that has the derivatives of the individual moment conditions as its columns. Likewise, define the var-cov. of the moment conditions
\[
\Omega_n = E\left[n\, m_n(\theta_0)\, m_n(\theta_0)'\right]
= E\left[\frac{1}{n} Z_n'\, h_n(\theta_0)\, h_n(\theta_0)'\, Z_n\right]
= Z_n' E\left[\frac{1}{n} h_n(\theta_0) h_n(\theta_0)'\right] Z_n
\equiv Z_n'\frac{\Phi_n}{n} Z_n,
\]
where we have defined $\Phi_n = \operatorname{Var} h_n(\theta_0)$. Note that $\Phi_n$ is growing with the sample size and is not consistently estimable without additional assumptions.

The asymptotic normality theorem above says that the GMM estimator using the optimal weighting matrix is distributed as
\[
\sqrt{n}\left(\hat{\theta} - \theta_0\right) \stackrel{d}{\longrightarrow} N(0, V_\infty),
\]
where
\[
(15.6.1)\qquad V_\infty = \lim\left[\left(\frac{H_n Z_n}{n}\right)\left(\frac{Z_n'\Phi_n Z_n}{n}\right)^{-1}\left(\frac{Z_n' H_n'}{n}\right)\right]^{-1}.
\]
Using an argument similar to that used to prove that $\Omega_\infty^{-1}$ is the efficient weighting matrix, we can show that putting
\[
Z_n = \Phi_n^{-1} H_n'
\]
causes the above var-cov matrix to simplify to
\[
(15.6.2)\qquad V_\infty = \lim\left[\frac{H_n \Phi_n^{-1} H_n'}{n}\right]^{-1},
\]
and furthermore, this matrix is smaller than the limiting var-cov for any other choice of instrumental variables. (To prove this, examine the difference of the inverses of the var-cov matrices with the optimal instruments and with non-optimal instruments. As above, you can show that the difference is positive semi-definite).

• Note that both $H_n$, which we should write more properly as $H_n(\theta_0)$ since it depends on $\theta_0$, and $\Phi$ must be consistently estimated to apply this.
• Usually, estimation of $H_n$ is straightforward - one just uses $\hat{H} = \frac{\partial}{\partial\theta} h_n'(\tilde{\theta})$, where $\tilde{\theta}$ is some initial consistent estimator based on non-optimal instruments.
• Estimation of $\Phi_n$ may not be possible. It is an $n \times n$ matrix, so it has more unique elements than $n$, the sample size, so without restrictions on the parameters it can't be estimated consistently. Basically, you need to provide a parametric specification of the covariances of the $h_t(\theta)$ in order to be able to use optimal instruments. A solution is to approximate this matrix parametrically to define the instruments. Note that the simplified var-cov matrix in equation 15.6.2 will not apply if approximately optimal instruments are used - it will be necessary to use an estimator based upon equation 15.6.1, where the term $\frac{Z_n'\Phi_n Z_n}{n}$ must be estimated consistently apart, for example by the Newey-West procedure.

15.7. Estimation using dynamic moment conditions


Note that dynamic moment conditions simplify the var-cov matrix, but are often harder to formulate. They will be added in future editions. For now, the Hansen application below is enough.

15.8. A specification test

The first order conditions for minimization, using an estimate of the optimal weighting matrix, are
\[
\frac{\partial}{\partial\theta} s(\hat{\theta}) = 2\left[\frac{\partial}{\partial\theta} m_n'(\hat{\theta})\right]\hat{\Omega}^{-1} m_n(\hat{\theta}) \equiv 0
\]
or
\[
D(\hat{\theta})\,\hat{\Omega}^{-1} m_n(\hat{\theta}) \equiv 0.
\]
Consider a Taylor expansion of $m(\hat{\theta})$:
\[
(15.8.1)\qquad m(\hat{\theta}) = m_n(\theta_0) + D_n'(\theta_0)\left(\hat{\theta} - \theta_0\right) + o_p(1).
\]
Multiplying by $D\hat{\Omega}^{-1}$ we obtain
\[
D(\hat{\theta})\,\hat{\Omega}^{-1} m(\hat{\theta}) = D(\hat{\theta})\,\hat{\Omega}^{-1} m_n(\theta_0) + D(\hat{\theta})\,\hat{\Omega}^{-1} D(\theta_0)'\left(\hat{\theta} - \theta_0\right) + o_p(1).
\]
The lhs is zero, and since $\hat{\theta}$ tends to $\theta_0$ and $\hat{\Omega}$ tends to $\Omega_\infty$, we can write
\[
D_\infty \Omega_\infty^{-1} m_n(\theta_0) \stackrel{a}{=} -D_\infty \Omega_\infty^{-1} D_\infty'\left(\hat{\theta} - \theta_0\right),
\]
or
\[
\sqrt{n}\left(\hat{\theta} - \theta_0\right) \stackrel{a}{=} -\left(D_\infty \Omega_\infty^{-1} D_\infty'\right)^{-1} D_\infty \Omega_\infty^{-1}\sqrt{n}\, m_n(\theta_0).
\]
With this, and taking into account the original expansion (equation 15.8.1), we get
\[
\sqrt{n}\, m(\hat{\theta}) \stackrel{a}{=} \sqrt{n}\, m_n(\theta_0) - D_\infty'\left(D_\infty \Omega_\infty^{-1} D_\infty'\right)^{-1} D_\infty \Omega_\infty^{-1}\sqrt{n}\, m_n(\theta_0).
\]
This last can be written as
\[
\Omega_\infty^{-1/2}\sqrt{n}\, m(\hat{\theta}) \stackrel{a}{=} \left[I_g - \Omega_\infty^{-1/2} D_\infty'\left(D_\infty \Omega_\infty^{-1} D_\infty'\right)^{-1} D_\infty \Omega_\infty^{-1/2}\right]\Omega_\infty^{-1/2}\sqrt{n}\, m_n(\theta_0).
\]
Now $\Omega_\infty^{-1/2}\sqrt{n}\, m_n(\theta_0) \stackrel{d}{\longrightarrow} N(0, I_g)$, and one can easily verify that
\[
P = \left[I_g - \Omega_\infty^{-1/2} D_\infty'\left(D_\infty \Omega_\infty^{-1} D_\infty'\right)^{-1} D_\infty \Omega_\infty^{-1/2}\right]
\]
is idempotent of rank $g - K$ (recall that the rank of an idempotent matrix is equal to its trace), so
\[
\left(\Omega_\infty^{-1/2}\sqrt{n}\, m(\hat{\theta})\right)'\left(\Omega_\infty^{-1/2}\sqrt{n}\, m(\hat{\theta})\right) = n\, m(\hat{\theta})'\Omega_\infty^{-1} m(\hat{\theta}) \stackrel{d}{\longrightarrow} \chi^2(g - K).
\]
Since $\hat{\Omega}$ converges to $\Omega_\infty$, we also have
\[
n\, m(\hat{\theta})'\hat{\Omega}^{-1} m(\hat{\theta}) \stackrel{d}{\longrightarrow} \chi^2(g - K),
\]
or
\[
n \cdot s_n(\hat{\theta}) \stackrel{d}{\longrightarrow} \chi^2(g - K),
\]
supposing the model is correctly specified. This is a convenient test since we just multiply the optimized value of the objective function by $n$, and compare with a $\chi^2(g - K)$ critical value. The test is a general test of whether or not the moments used to estimate are correctly specified.

• This won't work when the estimator is just identified. The f.o.c. are
\[
D\,\hat{\Omega}^{-1} m(\hat{\theta}) \equiv 0.
\]
But with exact identification, both $D$ and $\hat{\Omega}$ are square and invertible (at least asymptotically, assuming that asymptotic normality holds), so
\[
m(\hat{\theta}) \equiv 0.
\]
So the moment conditions are zero regardless of the weighting matrix used. As such, we might as well use an identity matrix and save trouble. Also $s_n(\hat{\theta}) = 0$, so the test breaks down.
• A note: this sort of test often over-rejects in finite samples. If the sample size is small, it might be better to use bootstrap critical values. That is, draw artificial samples of size $n$ by sampling from the data with replacement. For $R$ bootstrap samples, optimize and calculate the test statistic $n \cdot s_j(\hat{\theta}_j)$, $j = 1, 2, \ldots, R$. Define the bootstrap critical value $C_b$ such that $100\alpha$ percent of the $n \cdot s_j(\hat{\theta}_j)$ exceed the value. Of course, $R$ must be a very large number if $\alpha$ is small, in order to determine the critical value with precision. This sort of test has been found to have quite good small sample properties.
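As an illustration, once the efficient-GMM objective has been minimized the test statistic is a one-liner. The sketch below assumes the iterated estimation loop shown earlier has already produced theta and W (the inverse of the estimated moment covariance), with n, g and K the sample size and the numbers of moments and parameters.

% Sketch: Hansen's J test of the overidentifying restrictions.
J = n * gmm_obj(theta, @moments, data, W);  % n times the optimized objective value
df = g - K;                                 % degrees of freedom
pvalue = 1 - gammainc(J/2, df/2);           % chi-square(df) upper tail probability
printf("J = %f, df = %d, p-value = %f\n", J, df, pvalue);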

15.9. Other estimators interpreted as GMM estimators


15.9.1. OLS with heteroscedasticity of unknown form.

EXAMPLE 26. White's heteroscedastic consistent varcov estimator for OLS.

Suppose $y = X\beta_0 + \varepsilon$, where $\varepsilon \sim N(0, \Sigma)$, $\Sigma$ a diagonal matrix.

• The typical approach is to parameterize $\Sigma = \Sigma(\sigma)$, where $\sigma$ is a finite dimensional parameter vector, and to estimate $\beta$ and $\sigma$ jointly (feasible GLS). This will work well if the parameterization of $\Sigma$ is correct.
• If we're not confident about parameterizing $\Sigma$, we can still estimate $\beta$ consistently by OLS. However, the typical covariance estimator $V(\hat{\beta}) = (X'X)^{-1}\hat{\sigma}^2$ will be biased and inconsistent, and will lead to invalid inferences.

By exogeneity of the regressors $x_t$ (a $K \times 1$ column vector) we have $E(x_t\varepsilon_t) = 0$, which suggests the moment condition
\[
m_t(\beta) = x_t\left(y_t - x_t'\beta\right).
\]
In this case, we have exact identification ($K$ parameters and $K$ moment conditions). We have
\[
m(\beta) = \frac{1}{n}\sum_t m_t = \frac{1}{n}\sum_t x_t y_t - \frac{1}{n}\sum_t x_t x_t'\beta.
\]
For any choice of $W$, $m(\beta)$ will be identically zero at the minimum, due to exact identification. That is, since the number of moment conditions is identical to the number of parameters, the foc imply that $m(\hat{\beta}) \equiv 0$ regardless of $W$. There is no need to use the "optimal" weighting matrix in this case, an identity matrix works just as well for the purpose of estimation. Therefore
\[
\hat{\beta} = \left(\sum_t x_t x_t'\right)^{-1}\sum_t x_t y_t = (X'X)^{-1}X'y,
\]
which is the usual OLS estimator.

The GMM estimator of the asymptotic varcov matrix is $\left(\widehat{D_\infty}\hat{\Omega}^{-1}\widehat{D_\infty}'\right)^{-1}$. Recall that $\widehat{D_\infty}$ is simply $\frac{\partial}{\partial\theta} m_n'(\hat{\theta})$. In this case
\[
\widehat{D_\infty} = -\frac{1}{n}\sum_t x_t x_t' = -\frac{X'X}{n}.
\]
Recall that a possible estimator of $\Omega$ is
\[
\hat{\Omega} = \hat{\Gamma}_0 + \sum_{v=1}^{n-1}\left(\hat{\Gamma}_v + \hat{\Gamma}_v'\right).
\]
This is in general inconsistent, but in the present case of nonautocorrelation, it simplifies to
\[
\hat{\Omega} = \hat{\Gamma}_0,
\]
which has a constant number of elements to estimate, so information will accumulate, and consistency obtains. In the present case
\[
\hat{\Omega} = \hat{\Gamma}_0 = \frac{1}{n}\sum_t \hat{m}_t \hat{m}_t' = \frac{1}{n}\sum_t x_t x_t'\hat{\varepsilon}_t^2 = \frac{X'\hat{E}X}{n},
\]
where $\hat{E}$ is an $n \times n$ diagonal matrix with $\hat{\varepsilon}_t^2$ in position $t, t$.

Therefore, the GMM varcov estimator, which is consistent, is
\[
\hat{V}\left[\sqrt{n}\left(\hat{\beta} - \beta\right)\right]
= \left[\left(-\frac{X'X}{n}\right)\left(\frac{X'\hat{E}X}{n}\right)^{-1}\left(-\frac{X'X}{n}\right)\right]^{-1}
= \left(\frac{X'X}{n}\right)^{-1}\left(\frac{X'\hat{E}X}{n}\right)\left(\frac{X'X}{n}\right)^{-1}.
\]
This is the varcov estimator that White (1980) arrived at in an influential article. This estimator is consistent under heteroscedasticity of an unknown form. If there is autocorrelation, the Newey-West estimator can be used to estimate $\Omega$ - the rest is the same.
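A short Octave sketch of the White estimator, written directly from the GMM varcov formula above and assuming y and X hold the data, might look as follows.

% Sketch: OLS with White's heteroscedasticity-consistent covariance estimator.
betahat = (X' * X) \ (X' * y);          % usual OLS estimator
ehat = y - X * betahat;                 % residuals
n = rows(X);
XEX = (X .* (ehat.^2))' * X / n;        % (1/n) sum_t x_t x_t' ehat_t^2
XX = X' * X / n;
V = inv(XX) * XEX * inv(XX);            % est. asy. var of sqrt(n)(betahat - beta)
se = sqrt(diag(V) / n);                 % standard errors of betahat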

15.9.2. Weighted Least Squares. Consider the previous example of a linear model with heteroscedasticity of unknown form:
\[
y = X\beta_0 + \varepsilon,\qquad \varepsilon \sim N(0, \Sigma),
\]
where $\Sigma$ is a diagonal matrix.

Now, suppose that the form of $\Sigma$ is known, so that $\Sigma(\theta_0)$ is a correct parametric specification (which may also depend upon $X$). In this case, the GLS estimator is
\[
\tilde{\beta} = \left(X'\Sigma^{-1}X\right)^{-1}X'\Sigma^{-1}y.
\]
This estimator can be interpreted as the solution to the $K$ moment conditions
\[
m(\tilde{\beta}) = \frac{1}{n}\sum_t \frac{x_t y_t}{\sigma_t(\theta_0)} - \frac{1}{n}\sum_t \frac{x_t x_t'}{\sigma_t(\theta_0)}\tilde{\beta} \equiv 0,
\]
where $\sigma_t(\theta_0)$ is the $t$-th diagonal element of $\Sigma(\theta_0)$. That is, the GLS estimator in this case has an obvious representation as a GMM estimator. With autocorrelation, the representation exists but it is a little more complicated. Nevertheless, the idea is the same. There are a few points:

• The (feasible) GLS estimator is known to be asymptotically efficient in the class of linear asymptotically unbiased estimators (Gauss-Markov).
• This means that it is more efficient than the above example of OLS with White's heteroscedastic consistent covariance, which is an alternative GMM estimator.
• This means that the choice of the moment conditions is important to achieve efficiency.

15.9.3. 2SLS. Consider the linear model
\[
y_t = z_t'\beta + \varepsilon_t,
\]
or
\[
y = Z\beta + \varepsilon
\]
using the usual construction, where $\beta$ is $K \times 1$ and $\varepsilon_t$ is i.i.d. Suppose that this equation is one of a system of simultaneous equations, so that $z_t$ contains both endogenous and exogenous variables. Suppose that $x_t$ is the vector of all exogenous and predetermined variables that are uncorrelated with $\varepsilon_t$ (suppose that $x_t$ is $r \times 1$).

Define $\hat{Z}$ as the vector of predictions of $Z$ when regressed upon $X$, e.g.,
\[
\hat{Z} = X\left(X'X\right)^{-1}X'Z.
\]
Since $\hat{Z}$ is a linear combination of the exogenous variables $x$, $\hat{z}_t$ must be uncorrelated with $\varepsilon$. This suggests the $K$-dimensional moment condition $m_t(\beta) = \hat{z}_t\left(y_t - z_t'\beta\right)$, and so
\[
m(\beta) = \frac{1}{n}\sum_t \hat{z}_t\left(y_t - z_t'\beta\right).
\]
Since we have $K$ parameters and $K$ moment conditions, the GMM estimator will set $m$ identically equal to zero, regardless of $W$, so we have
\[
\hat{\beta} = \left(\sum_t \hat{z}_t z_t'\right)^{-1}\sum_t \hat{z}_t y_t = \left(\hat{Z}'Z\right)^{-1}\hat{Z}'y.
\]
This is the standard formula for 2SLS. We use the exogenous variables and the reduced form predictions of the endogenous variables as instruments, and apply IV estimation. See Hamilton pp. 420-21 for the varcov formula (which is the standard formula for 2SLS), and for how to deal with $\varepsilon_t$ heterogeneous and dependent (basically, just use the Newey-West or some other consistent estimator of $\Omega$, and apply the usual formula). Note that $\varepsilon_t$ dependent causes lagged endogenous variables to lose their status as legitimate instruments.
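Here is a compact Octave sketch of the formula, assuming y, Z (regressors, possibly endogenous) and X (instruments) are data matrices:

% Sketch: 2SLS / generalized IV estimator written as in the text.
Zhat = X * ((X' * X) \ (X' * Z));          % reduced-form predictions of Z
betahat_2sls = (Zhat' * Z) \ (Zhat' * y);  % (Zhat'Z)^{-1} Zhat'y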


15.9.4. Nonlinear simultaneous equations. GMM provides a convenient way to estimate nonlinear systems of simultaneous equations. We have a system of equations of the form
\[
\begin{aligned}
y_{1t} &= f_1(z_t, \theta_1^0) + \varepsilon_{1t}\\
y_{2t} &= f_2(z_t, \theta_2^0) + \varepsilon_{2t}\\
&\ \ \vdots\\
y_{Gt} &= f_G(z_t, \theta_G^0) + \varepsilon_{Gt},
\end{aligned}
\]
or in compact notation
\[
y_t = f(z_t, \theta^0) + \varepsilon_t,
\]
where $f(\cdot)$ is a $G$-vector valued function, and $\theta^0 = \left(\theta_1^{0\prime}, \theta_2^{0\prime}, \ldots, \theta_G^{0\prime}\right)'$.

We need to find an $l_i \times 1$ vector of instruments $x_{it}$, for each equation, that are uncorrelated with $\varepsilon_{it}$. Typical instruments would be low order monomials in the exogenous variables in $z_t$, with their lagged values. Then we can define the $\left(\sum_{i=1}^{G} l_i\right) \times 1$ orthogonality conditions
\[
m_t(\theta) = \begin{bmatrix}
\left(y_{1t} - f_1(z_t, \theta_1)\right)x_{1t}\\
\left(y_{2t} - f_2(z_t, \theta_2)\right)x_{2t}\\
\vdots\\
\left(y_{Gt} - f_G(z_t, \theta_G)\right)x_{Gt}
\end{bmatrix}.
\]
• A note on identification: selection of instruments that ensure identification is a non-trivial problem.
• A note on efficiency: the selected set of instruments has important effects on the efficiency of estimation. Unfortunately there is little theory offering guidance on what is the optimal set. More on this later.


15.9.5. Maximum likelihood. In the introduction we argued that ML will in general be more efficient than GMM since ML implicitly uses all of the moments of the distribution while GMM uses a limited number of moments. Actually, a distribution with $P$ parameters can be uniquely characterized by $P$ moment conditions. However, some sets of $P$ moment conditions may contain more information than others, since the moment conditions could be highly correlated. A GMM estimator that chose an optimal set of $P$ moment conditions would be fully efficient. Here we'll see that the optimal moment conditions are simply the scores of the ML estimator.

Let $y_t$ be a $G$-vector of variables, and let $Y_t = (y_1', y_2', \ldots, y_t')'$. Then at time $t$, $Y_{t-1}$ has been observed (refer to it as the information set, since we assume the conditioning variables have been selected to take advantage of all useful information). The likelihood function is the joint density of the sample:
\[
L(\theta) = f(y_1, y_2, \ldots, y_n, \theta),
\]
which can be factored as
\[
L(\theta) = f(y_n|Y_{n-1}, \theta)\cdot f(Y_{n-1}, \theta),
\]
and we can repeat this to get
\[
L(\theta) = f(y_n|Y_{n-1}, \theta)\cdot f(y_{n-1}|Y_{n-2}, \theta)\cdot\ldots\cdot f(y_1).
\]
The log-likelihood function is therefore
\[
\ln L(\theta) = \sum_{t=1}^{n}\ln f(y_t|Y_{t-1}, \theta).
\]
Define
\[
m_t(Y_t, \theta) \equiv D_\theta \ln f(y_t|Y_{t-1}, \theta)
\]
as the score of the $t$-th observation. It can be shown, under the regularity conditions, that the scores have conditional mean zero when evaluated at $\theta_0$ (see notes to Introduction to Econometrics):
\[
E\left\{m_t(Y_t, \theta_0)|Y_{t-1}\right\} = 0,
\]
so one could interpret these as moment conditions to use to define a just-identified GMM estimator (if there are $K$ parameters there are $K$ score equations). The GMM estimator sets
\[
\frac{1}{n}\sum_{t=1}^{n} m_t(Y_t, \hat{\theta}) = \frac{1}{n}\sum_{t=1}^{n} D_\theta \ln f(y_t|Y_{t-1}, \hat{\theta}) = 0,
\]
which are precisely the first order conditions of MLE. Therefore, MLE can be interpreted as a GMM estimator. The GMM varcov formula is $V_\infty = \left(D_\infty \Omega^{-1} D_\infty'\right)^{-1}$.

Consistent estimates of the variance components are as follows.

• $D_\infty$:
\[
\widehat{D_\infty} = \frac{\partial}{\partial\theta} m_n'(Y_t, \hat{\theta}) = \frac{1}{n}\sum_{t=1}^{n} D_\theta^2 \ln f(y_t|Y_{t-1}, \hat{\theta}).
\]
• $\Omega$: It is important to note that $m_t$ and $m_{t-s}$, $s > 0$, are both conditionally and unconditionally uncorrelated. Conditional uncorrelation follows from the fact that $m_{t-s}$ is a function of $Y_{t-s}$, which is in the information set at time $t$. Unconditional uncorrelation follows from the fact that conditional uncorrelation holds regardless of the realization of $Y_{t-1}$, so marginalizing with respect to $Y_{t-1}$ preserves uncorrelation (see the section on ML estimation, above). The fact that the scores are serially uncorrelated implies that $\Omega$ can be estimated by the estimator of the 0-th autocovariance of the moment conditions:
\[
\hat{\Omega} = \frac{1}{n}\sum_{t=1}^{n} m_t(Y_t, \hat{\theta})\, m_t(Y_t, \hat{\theta})'
= \frac{1}{n}\sum_{t=1}^{n}\left[D_\theta \ln f(y_t|Y_{t-1}, \hat{\theta})\right]\left[D_\theta \ln f(y_t|Y_{t-1}, \hat{\theta})\right]'.
\]
Recall from the study of ML estimation that the information matrix equality (equation 4.6.2) states that
\[
E\left\{\left[D_\theta \ln f(y_t|Y_{t-1}, \theta_0)\right]\left[D_\theta \ln f(y_t|Y_{t-1}, \theta_0)\right]'\right\} = -E\left\{D_\theta^2 \ln f(y_t|Y_{t-1}, \theta_0)\right\}.
\]
This result implies the well known (and already seen) result that we can estimate $V_\infty$ in any of three ways. Define
\[
\hat{J} = \frac{1}{n}\sum_{t=1}^{n} D_\theta^2 \ln f(y_t|Y_{t-1}, \hat{\theta}),\qquad
\hat{I} = \frac{1}{n}\sum_{t=1}^{n}\left[D_\theta \ln f(y_t|Y_{t-1}, \hat{\theta})\right]\left[D_\theta \ln f(y_t|Y_{t-1}, \hat{\theta})\right]'.
\]
• The sandwich version:
\[
\widehat{V_\infty} = \left[\hat{J}\,\hat{I}^{-1}\hat{J}\right]^{-1}.
\]
• or the inverse of the negative of the Hessian (since the middle and last term cancel, except for a minus sign):
\[
\widehat{V_\infty} = \left[-\hat{J}\right]^{-1}.
\]
• or the inverse of the outer product of the gradient (since the middle and last cancel except for a minus sign, and the first term converges to minus the inverse of the middle term, which is still inside the overall inverse):
\[
\widehat{V_\infty} = \hat{I}^{-1}.
\]
This simplification is a special result for the MLE estimator - it doesn't apply to GMM estimators in general.

Asymptotically, if the model is correctly specified, all of these forms converge to the same limit. In small samples they will differ. In particular, there is evidence that the outer product of the gradient formula does not perform very well in small samples (see Davidson and MacKinnon, pg. 477). White's Information matrix test (Econometrica, 1982) is based upon comparing the two ways to estimate the information matrix: outer product of gradient or negative of the Hessian. If they differ by too much, this is evidence of misspecification of the model.

15.10. Example: The Hausman Test


This section discusses the Hausman test, which was originally presented in Hausman, J.A. (1978), "Specification tests in econometrics", Econometrica, 46, 1251-71.

Consider the simple linear regression model $y_t = x_t'\beta + \epsilon_t$. We assume that the functional form and the choice of regressors is correct, but that some of the regressors may be correlated with the error term, which as you know will produce inconsistency of $\hat{\beta}$. For example, this will be a problem if

• some regressors are endogenous
• some regressors are measured with error
• lagged values of the dependent variable are used as regressors and $\epsilon_t$ is autocorrelated.

FIGURE 15.10.1. OLS and IV estimators when regressors and errors are correlated

To illustrate, the Octave program biased.m performs a Monte Carlo experiment where errors are correlated with regressors, and estimation is by OLS and IV. Figure 15.10.1 shows that the OLS estimator is quite biased, while the IV estimator is on average much closer to the true value. If you play with the program, increasing the sample size, you can see evidence that the OLS estimator is asymptotically biased, while the IV estimator is consistent.

We have seen that an inconsistent estimator and a consistent estimator converge to different probability limits. This is the idea behind the Hausman test - a pair of consistent estimators converge to the same probability limit, while if one is consistent and the other is not they converge to different limits. If we accept that one is consistent (e.g., the IV estimator), but we are doubting if the other is consistent (e.g., the OLS estimator), we might try to check if the difference between the estimators is significantly different from zero.

• If we're doubting about the consistency of OLS (or QML, etc.), why should we be interested in testing - why not just use the IV estimator? Because the OLS estimator is more efficient when the regressors are exogenous and the other classical assumptions (including normality of the errors) hold. When we have a more efficient estimator that relies on stronger assumptions (such as exogeneity) than the IV estimator, we might prefer to use it, unless we have evidence that the assumptions are false.

So, let's consider the covariance between the MLE estimator $\hat{\theta}$ (or any other fully efficient estimator) and some other CAN estimator, say $\tilde{\theta}$. Now, let's recall some results from MLE. Equation 4.4.1 is
\[
\sqrt{n}\left(\hat{\theta} - \theta_0\right) \stackrel{a.s.}{\longrightarrow} -\mathcal{H}_\infty(\theta_0)^{-1}\sqrt{n}\, g(\theta_0).
\]
Equation 4.6.2 is
\[
\mathcal{H}_\infty(\theta) = -\mathcal{I}_\infty(\theta).
\]
Combining these two equations, we get
\[
\sqrt{n}\left(\hat{\theta} - \theta_0\right) \stackrel{a.s.}{\longrightarrow} \mathcal{I}_\infty(\theta_0)^{-1}\sqrt{n}\, g(\theta_0).
\]
Also, equation 4.7.1 tells us that the asymptotic covariance between any CAN estimator and the MLE score vector is
\[
V_\infty\begin{bmatrix}\sqrt{n}\left(\tilde{\theta} - \theta\right)\\ \sqrt{n}\, g(\theta)\end{bmatrix}
= \begin{bmatrix}V_\infty(\tilde{\theta}) & I_K\\ I_K & \mathcal{I}_\infty(\theta)\end{bmatrix}.
\]
Now, consider
\[
\begin{bmatrix}I_K & 0_K\\ 0_K & \mathcal{I}_\infty(\theta)^{-1}\end{bmatrix}
\begin{bmatrix}\sqrt{n}\left(\tilde{\theta} - \theta\right)\\ \sqrt{n}\, g(\theta)\end{bmatrix}
\stackrel{a.s.}{\longrightarrow}
\begin{bmatrix}\sqrt{n}\left(\tilde{\theta} - \theta\right)\\ \sqrt{n}\left(\hat{\theta} - \theta\right)\end{bmatrix}.
\]
The asymptotic covariance of this is
\[
V_\infty\begin{bmatrix}\sqrt{n}\left(\tilde{\theta} - \theta\right)\\ \sqrt{n}\left(\hat{\theta} - \theta\right)\end{bmatrix}
= \begin{bmatrix}I_K & 0_K\\ 0_K & \mathcal{I}_\infty(\theta)^{-1}\end{bmatrix}
\begin{bmatrix}V_\infty(\tilde{\theta}) & I_K\\ I_K & \mathcal{I}_\infty(\theta)\end{bmatrix}
\begin{bmatrix}I_K & 0_K\\ 0_K & \mathcal{I}_\infty(\theta)^{-1}\end{bmatrix}
= \begin{bmatrix}V_\infty(\tilde{\theta}) & \mathcal{I}_\infty(\theta)^{-1}\\ \mathcal{I}_\infty(\theta)^{-1} & \mathcal{I}_\infty(\theta)^{-1}\end{bmatrix},
\]
which, for clarity in what follows, we might write as
\[
V_\infty\begin{bmatrix}\sqrt{n}\left(\tilde{\theta} - \theta\right)\\ \sqrt{n}\left(\hat{\theta} - \theta\right)\end{bmatrix}
= \begin{bmatrix}V_\infty(\tilde{\theta}) & \mathcal{I}_\infty(\theta)^{-1}\\ \mathcal{I}_\infty(\theta)^{-1} & V_\infty(\hat{\theta})\end{bmatrix}.
\]
So, the asymptotic covariance between the MLE and any other CAN estimator is equal to the MLE asymptotic variance (the inverse of the information matrix).

Now, suppose we wish to test whether the two estimators are in fact both converging to $\theta_0$, versus the alternative hypothesis that the "MLE" estimator is not in fact consistent (the consistency of $\tilde{\theta}$ is a maintained hypothesis). Under the null hypothesis that they are, we have
\[
\begin{bmatrix}I_K & -I_K\end{bmatrix}
\begin{bmatrix}\sqrt{n}\left(\tilde{\theta} - \theta_0\right)\\ \sqrt{n}\left(\hat{\theta} - \theta_0\right)\end{bmatrix}
= \sqrt{n}\left(\tilde{\theta} - \hat{\theta}\right),
\]
which will be asymptotically normally distributed as
\[
\sqrt{n}\left(\tilde{\theta} - \hat{\theta}\right) \stackrel{d}{\longrightarrow} N\left(0,\, V_\infty(\tilde{\theta}) - V_\infty(\hat{\theta})\right).
\]
So,
\[
n\left(\tilde{\theta} - \hat{\theta}\right)'\left(V_\infty(\tilde{\theta}) - V_\infty(\hat{\theta})\right)^{-1}\left(\tilde{\theta} - \hat{\theta}\right) \stackrel{d}{\longrightarrow} \chi^2(\rho),
\]
where $\rho$ is the rank of the difference of the asymptotic variances. A statistic that has the same asymptotic distribution is
\[
\left(\tilde{\theta} - \hat{\theta}\right)'\left(\hat{V}(\tilde{\theta}) - \hat{V}(\hat{\theta})\right)^{-1}\left(\tilde{\theta} - \hat{\theta}\right) \stackrel{d}{\longrightarrow} \chi^2(\rho).
\]
This is the Hausman test statistic, in its original form. The reason that this test has power under the alternative hypothesis is that in that case the "MLE" estimator will not be consistent, and will converge to $\theta_A$, say, where $\theta_A \neq \theta_0$. Then the mean of the asymptotic distribution of the vector $\tilde{\theta} - \hat{\theta}$ will be $\theta_0 - \theta_A$, a non-zero vector, so the test statistic will eventually reject, regardless of how small a significance level is used.

• Note: if the test is based on a sub-vector of the entire parameter vector of the MLE, it is possible that the inconsistency of the MLE will not show up in the portion of the vector that has been used. If this is the case, the test may not have power to detect the inconsistency. This may occur, for example, when the consistent but inefficient estimator is not identified for all the parameters of the model.

Some things to note:

• The rank, $\rho$, of the difference of the asymptotic variances is often less than the dimension of the matrices, and it may be difficult to determine what the true rank is. If the true rank is lower than what is taken to be true, the test will be biased against rejection of the null hypothesis. The contrary holds if we underestimate the rank.
• A solution to this problem is to use a rank 1 test, by comparing only a single coefficient. For example, if a variable is suspected of possibly being endogenous, that variable's coefficients may be compared.
• This simple formula only holds when the estimator that is being tested for consistency is fully efficient under the null hypothesis. This means that it must be a ML estimator or a fully efficient estimator that has the same asymptotic distribution as the ML estimator. This is quite restrictive since modern estimators such as GMM and QML are not in general fully efficient.
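As a hedged illustration under the classical conditions (OLS efficient under the null, IV consistent under both hypotheses), the original form of the test can be sketched as below. betahat_ols, V_ols, betahat_iv and V_iv are assumed to be the estimates and their estimated covariance matrices (of the estimators themselves, not scaled by n) produced beforehand; pinv is used since the variance difference may be singular, with the degrees of freedom set to its rank.

% Sketch: Hausman test comparing an efficient estimator (OLS under the null)
% with a consistent but inefficient one (IV).
d = betahat_iv - betahat_ols;          % difference of the two estimators
Vdiff = V_iv - V_ols;                  % estimated variance of the difference
rho = rank(Vdiff);                     % degrees of freedom
H = d' * pinv(Vdiff) * d;              % Hausman statistic
pvalue = 1 - gammainc(H/2, rho/2);     % chi-square(rho) upper tail probability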

Following up on this last point, let's think of two not necessarily efficient estimators, $\hat{\theta}_1$ and $\hat{\theta}_2$, where one is assumed to be consistent, but the other may not be. We assume for expositional simplicity that both $\hat{\theta}_1$ and $\hat{\theta}_2$ belong to the same parameter space, and that they can be expressed as generalized method of moments (GMM) estimators. The estimators are defined (suppressing the dependence upon data) by
\[
\hat{\theta}_i = \arg\min_{\theta_i} m_i(\theta_i)'\, W_i\, m_i(\theta_i),
\]
where $m_i(\theta_i)$ is a $g_i \times 1$ vector of moment conditions, and $W_i$ is a $g_i \times g_i$ positive definite weighting matrix, $i = 1, 2$. Consider the omnibus GMM estimator
\[
(15.10.1)\qquad \left(\hat{\theta}_1, \hat{\theta}_2\right) = \arg\min_{\Theta \times \Theta}
\begin{bmatrix} m_1(\theta_1)' & m_2(\theta_2)' \end{bmatrix}
\begin{bmatrix} W_1 & 0_{(g_1 \times g_2)}\\ 0_{(g_2 \times g_1)} & W_2 \end{bmatrix}
\begin{bmatrix} m_1(\theta_1)\\ m_2(\theta_2) \end{bmatrix}.
\]
Suppose that the asymptotic covariance of the omnibus moment vector is
\[
(15.10.2)\qquad \Sigma = \lim_{n\to\infty}\operatorname{Var}\left\{\sqrt{n}\begin{bmatrix} m_1(\theta_1)\\ m_2(\theta_2)\end{bmatrix}\right\}
\equiv \begin{bmatrix} \Sigma_1 & \Sigma_{12}\\ \cdot & \Sigma_2\end{bmatrix}.
\]
The standard Hausman test is equivalent to a Wald test of the equality of $\theta_1$ and $\theta_2$ (or subvectors of the two) applied to the omnibus GMM estimator, but with the covariance of the moment conditions estimated as
\[
\hat{\Sigma} = \begin{bmatrix} \hat{\Sigma}_1 & 0_{(g_1 \times g_2)}\\ 0_{(g_2 \times g_1)} & \hat{\Sigma}_2\end{bmatrix}.
\]
While this is clearly an inconsistent estimator in general, the omitted $\Sigma_{12}$ term cancels out of the test statistic when one of the estimators is asymptotically efficient, as we have seen above, and thus it need not be estimated.

The general solution when neither of the estimators is efficient is clear: the entire $\Sigma$ matrix must be estimated consistently, since the $\Sigma_{12}$ term will not cancel out. Methods for consistently estimating the asymptotic covariance of a vector of moment conditions are well-known, e.g., the Newey-West estimator discussed previously. The Hausman test using a proper estimator of the overall covariance matrix will now have an asymptotic $\chi^2$ distribution when neither estimator is efficient. However, the test suffers from a loss of power due to the fact that the omnibus GMM estimator of equation 15.10.1 is defined using an inefficient weight matrix. A new test can be defined by using an alternative omnibus GMM estimator
\[
(15.10.3)\qquad \left(\hat{\theta}_1, \hat{\theta}_2\right) = \arg\min_{\Theta \times \Theta}
\begin{bmatrix} m_1(\theta_1)' & m_2(\theta_2)' \end{bmatrix}
\left(\tilde{\Sigma}\right)^{-1}
\begin{bmatrix} m_1(\theta_1)\\ m_2(\theta_2)\end{bmatrix},
\]
where $\tilde{\Sigma}$ is a consistent estimator of the overall covariance matrix $\Sigma$ of equation 15.10.2. By standard arguments, this is a more efficient estimator than that defined by equation 15.10.1, so the Wald test using this alternative is more powerful. See my article in Applied Economics, 2004, for more details, including simulation results.

15.11. Application: Nonlinear rational expectations


Readings: Hansen and Singleton, 1982; Tauchen, 1986.

Though GMM estimation has many applications, application to rational expectations models is elegant, since theory directly suggests the moment conditions. Hansen and Singleton's 1982 paper is also a classic worth studying in itself. Though I strongly recommend reading the paper, I'll use a simplified model with similar notation to Hamilton's.

We assume a representative consumer maximizes expected discounted utility over an infinite horizon. Utility is temporally additive, and the expected utility hypothesis holds. The future consumption stream is the stochastic sequence $\{c_t\}_{t=0}^{\infty}$. The objective function at time $t$ is the discounted expected utility
\[
(15.11.1)\qquad \sum_{s=0}^{\infty}\beta^s E\left(u(c_{t+s})|I_t\right).
\]
• The parameter $\beta$ is between 0 and 1, and reflects discounting.
• $I_t$ is the information set at time $t$, and includes all realizations of random variables indexed $t$ and earlier.
• The choice variable is $c_t$ - current consumption, which is constrained to be less than or equal to current wealth $w_t$.
• Suppose the consumer can invest in a risky asset. A dollar invested in the asset yields a gross return
\[
(1 + r_{t+1}) = \frac{p_{t+1} + d_{t+1}}{p_t},
\]
where $p_t$ is the price and $d_t$ is the dividend in period $t$. The price of $c_t$ is normalized to 1.
• Current wealth $w_t = (1 + r_t) i_{t-1}$, where $i_{t-1}$ is investment in period $t - 1$. So the problem is to allocate current wealth between current consumption and investment to finance future consumption: $w_t = c_t + i_t$.
• Future net rates of return $r_{t+s}$, $s > 0$ are not known in period $t$: the asset is risky.

A partial set of necessary conditions for utility maximization have the form:
\[
(15.11.2)\qquad u'(c_t) = \beta E\left\{u'(c_{t+1})(1 + r_{t+1})|I_t\right\}.
\]
To see that the condition is necessary, suppose that the lhs < rhs. Then reducing current consumption marginally would cause equation 15.11.1 to drop by $u'(c_t)$, since there is no discounting of the current period. At the same time, the marginal reduction in consumption finances investment, which has gross return $(1 + r_{t+1})$, which could finance consumption in period $t + 1$. This increase in consumption would cause the objective function to increase by $\beta E\left\{u'(c_{t+1})(1 + r_{t+1})|I_t\right\}$. Therefore, unless the condition holds, the expected discounted utility function is not maximized.

• To use this we need to choose the functional form of utility. A constant relative risk aversion form is
\[
u(c_t) = \frac{c_t^{1-\gamma} - 1}{1 - \gamma},
\]
where $\gamma$ is the coefficient of relative risk aversion ($\gamma \neq 1$). With this form,
\[
u'(c_t) = c_t^{-\gamma},
\]
so the foc are
\[
c_t^{-\gamma} = \beta E\left\{c_{t+1}^{-\gamma}(1 + r_{t+1})|I_t\right\}.
\]
While it is true that
\[
E\left[\left(c_t^{-\gamma} - \beta c_{t+1}^{-\gamma}(1 + r_{t+1})\right)|I_t\right] = 0,
\]
so that we could use this to define moment conditions, it is unlikely that $c_t$ is stationary, even though it is in real terms, and our theory requires stationarity. To solve this, divide through by $c_t^{-\gamma}$:
\[
E\left[\left(1 - \beta\left(\frac{c_{t+1}}{c_t}\right)^{-\gamma}(1 + r_{t+1})\right)\Big|I_t\right] = 0
\]
(note that $c_t^{-\gamma}$ can be passed through the conditional expectation since $c_t$ is chosen based only upon information available in time $t$).

Suppose that $x_t$ is a vector of variables drawn from the information set $I_t$. We can use the necessary conditions to form the expressions
\[
\left[1 - \beta\left(\frac{c_{t+1}}{c_t}\right)^{-\gamma}(1 + r_{t+1})\right]x_t \equiv m_t(\theta),
\]
where $\theta$ represents $\beta$ and $\gamma$.

• Therefore, the above expression may be interpreted as a moment condition which can be used for GMM estimation of the parameters $\theta_0$.

Note that at time $t$, $m_{t-s}$ has been observed, and is therefore an element of the information set. By rational expectations, the autocovariances of the moment conditions other than $\Gamma_0$ should be zero. The optimal weighting matrix is therefore the inverse of the variance of the moment conditions:
\[
\Omega_\infty = \lim E\left[n\, m_n(\theta_0)\, m_n(\theta_0)'\right],
\]
which can be consistently estimated by
\[
\hat{\Omega} = \frac{1}{n}\sum_{t=1}^{n} m_t(\hat{\theta})\, m_t(\hat{\theta})'.
\]
As before, this estimate depends on an initial consistent estimate of $\theta$, which can be obtained by setting the weighting matrix $W$ arbitrarily (to an identity matrix, for example). After obtaining $\hat{\theta}$, we then minimize
\[
s(\theta) = m_n(\theta)'\hat{\Omega}^{-1} m_n(\theta).
\]
This process can be iterated, e.g., use the new estimate to re-estimate $\Omega$, use this to estimate $\theta_0$, and repeat until the estimates don't change.

• This whole approach relies on the very strong assumption that equation 15.11.2 holds without error. Supposing agents were heterogeneous, this wouldn't be reasonable. If there were an error term here, it could potentially be autocorrelated, which would no longer allow any variable in the information set to be used as an instrument.
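The moment conditions just derived are easy to code. The sketch below is a hypothetical moment-contribution routine in the spirit of the moments functions sketched earlier: it takes $\theta = (\beta, \gamma)'$, consumption growth cgrowth ($c_{t+1}/c_t$), gross returns R ($1 + r_{t+1}$), and an $n \times l$ instrument matrix Xinst, all aligned so that row $t$ of the instruments is in the time-$t$ information set. Names and layout are assumptions for illustration.

% Sketch: moment contributions for the CRRA Euler equation,
% m_t(theta) = [1 - beta * (c_{t+1}/c_t)^(-gamma) * (1 + r_{t+1})] * x_t.
function M = euler_moments(theta, cgrowth, R, Xinst)
  beta = theta(1); gamma = theta(2);
  e = 1 - beta * (cgrowth .^ (-gamma)) .* R;  % n x 1 Euler equation "errors"
  M = Xinst .* e;                             % n x l matrix: each instrument times e_t
endfunction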

• In principle, we could use a very large number of moment conditions in estimation, since any current or lagged variable could be used in $x_t$. Since use of more moment conditions will lead to a more (asymptotically) efficient estimator, one might be tempted to use many instrumental variables. We will do a computer lab that will show that this may not be a good idea with finite samples. This issue has been studied using Monte Carlos (Tauchen, JBES, 1986). The reason for poor performance when using many instruments is that the estimate of $\Omega$ becomes very imprecise.
• Empirical papers that use this approach often have serious problems in obtaining precise estimates of the parameters. Note that we are basing everything on a single partial first order condition. Probably this f.o.c. is simply not informative enough. Simulation-based estimation methods (discussed below) are one means of trying to use more informative moment conditions to estimate this sort of model.
15.12. Empirical example: a portfolio model
The Octave program portfolio.m performs GMM estimation of a portfolio model, using the data file tauchen.data. The columns of this data file are $c$, $p$, and $d$ in that order. There are 95 observations (source: Tauchen, 1986). As instruments we use 2 lags of $c$ and $r$. The estimation results are

***********************************************
Example of GMM estimation of rational expectations model

GMM Estimation Results
BFGS convergence: Normal convergence

Objective function value: 0.071872
Observations: 93

              Value      df   p-value
X^2 test     6.6841  5.0000    0.2452

         estimate   st. err    t-stat   p-value
beta       0.8723    0.0220   39.6079    0.0000
gamma      3.1555    0.2854   11.0580    0.0000
***********************************************

• Iterate the estimation of $\hat{\theta} = (\hat{\beta}, \hat{\gamma})$ and $\hat{\Omega}$ to convergence.
• Comment on the results. Are the results sensitive to the set of instruments used? (Look at $\hat{\Omega}$ as well as $\hat{\theta}$. Experiment with the program using lags of 1, 3 and 4 periods to define instruments.) Are these good instruments? Are the instruments highly correlated with one another?


Exercises
(1) Show how to cast the generalized IV estimator presented in section 11.4 as a GMM estimator. Identify what are the moment conditions $m_t(\theta)$, what is the form of the matrix $D_n$, what is the efficient weight matrix, and show that the covariance matrix formula given previously corresponds to the GMM covariance matrix formula.
(2) Using Octave, generate data from the logit dgp. Recall that $E(y_t|x_t) = p(x_t, \theta) = \left[1 + \exp(-x_t'\theta)\right]^{-1}$. Consider the moment conditions (exactly identified): $m_t(\theta) = \left[y_t - p(x_t, \theta)\right]x_t$.
   (a) Estimate by GMM, using these moments. Estimate by MLE.
   (b) The two estimators should coincide. Prove analytically that the estimators coincide.
(3) Verify the missing steps needed to show that $n \cdot m(\hat{\theta})'\hat{\Omega}^{-1}m(\hat{\theta})$ has a $\chi^2(g - K)$ distribution. That is, show that the monster matrix is idempotent and has trace equal to $g - K$.

CHAPTER 16

Quasi-ML
Quasi-ML is the estimator one obtains when a misspecified probability model is used to calculate an "ML" estimator.

Given a sample of size $n$ of a random vector $\mathbf{y}$ and a vector of conditioning variables $\mathbf{x}$, suppose the joint density of $\mathbf{Y} = (\mathbf{y}_1, \ldots, \mathbf{y}_n)$ conditional on $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$ is a member of the parametric family $p_Y(\mathbf{Y}|\mathbf{X}, \rho)$, $\rho \in \Xi$. The true joint density is associated with the vector $\rho_0$:
\[
p_Y(\mathbf{Y}|\mathbf{X}, \rho_0).
\]
As long as the marginal density of $\mathbf{X}$ doesn't depend on $\rho_0$, this conditional density fully characterizes the random characteristics of samples: i.e., it fully describes the probabilistically important features of the d.g.p. The likelihood function is just this density evaluated at other values $\rho$:
\[
L(\mathbf{Y}|\mathbf{X}, \rho) = p_Y(\mathbf{Y}|\mathbf{X}, \rho),\qquad \rho \in \Xi.
\]
• Let $\mathbf{Y}_{t-1} = (\mathbf{y}_1, \ldots, \mathbf{y}_{t-1})$, $\mathbf{Y}_0 = 0$, and let $\mathbf{X}_t = (\mathbf{x}_1, \ldots, \mathbf{x}_t)$. The likelihood function, taking into account possible dependence of observations, can be written as
\[
L(\mathbf{Y}|\mathbf{X}, \rho) = \prod_{t=1}^{n} p_t(\mathbf{y}_t|\mathbf{Y}_{t-1}, \mathbf{X}_t, \rho) \equiv \prod_{t=1}^{n} p_t(\rho).
\]
• The average log-likelihood function is:
\[
s_n(\rho) = \frac{1}{n}\ln L(\mathbf{Y}|\mathbf{X}, \rho) = \frac{1}{n}\sum_{t=1}^{n}\ln p_t(\rho).
\]
• Suppose that we do not have knowledge of the family of densities $p_t(\rho)$. Mistakenly, we may assume that the conditional density of $\mathbf{y}_t$ is a member of the family $f_t(\mathbf{y}_t|\mathbf{Y}_{t-1}, \mathbf{X}_t, \theta)$, $\theta \in \Theta$, where there is no $\theta_0$ such that $f_t(\mathbf{y}_t|\mathbf{Y}_{t-1}, \mathbf{X}_t, \theta_0) = p_t(\mathbf{y}_t|\mathbf{Y}_{t-1}, \mathbf{X}_t, \rho_0)$, $\forall t$ (this is what we mean by "misspecified").
• This setup allows for heterogeneous time series data, with dynamic misspecification.

The QML estimator is the argument that maximizes the misspecified average log likelihood, which we refer to as the quasi-log likelihood function. This objective function is
\[
s_n(\theta) = \frac{1}{n}\sum_{t=1}^{n}\ln f_t(\mathbf{y}_t|\mathbf{Y}_{t-1}, \mathbf{X}_t, \theta) \equiv \frac{1}{n}\sum_{t=1}^{n}\ln f_t(\theta)
\]
and the QML is
\[
\hat{\theta}_n = \arg\max_{\Theta} s_n(\theta).
\]
A SLLN for dependent sequences applies (we assume), so that
\[
s_n(\theta) \stackrel{a.s.}{\longrightarrow} \lim_{n\to\infty} E\frac{1}{n}\sum_{t=1}^{n}\ln f_t(\theta) \equiv s_\infty(\theta).
\]
We assume that this can be strengthened to uniform convergence, a.s., following the previous arguments. The "pseudo-true" value of $\theta$ is the value that maximizes $s_\infty(\theta)$:
\[
\theta_0 = \arg\max_{\Theta} s_\infty(\theta).
\]
Given assumptions so that theorem 19 is applicable, we obtain
\[
\lim_{n\to\infty}\hat{\theta}_n = \theta_0,\ \text{a.s.}
\]
• An example of sufficient conditions for consistency are: $\Theta$ is compact; $s_n(\theta)$ is continuous and converges pointwise almost surely to $s_\infty(\theta)$ (this means that $s_\infty(\theta)$ will be continuous, and this combined with compactness of $\Theta$ means $s_\infty(\theta)$ is uniformly continuous); and $\theta_0$ is a unique global maximizer. A stronger version of this assumption that allows for asymptotic normality is that $D_\theta^2 s_\infty(\theta)$ exists and is negative definite in a neighborhood of $\theta_0$.

Applying the asymptotic normality theorem,
\[
\sqrt{n}\left(\hat{\theta} - \theta_0\right) \stackrel{d}{\longrightarrow} N\left[0,\, J_\infty(\theta_0)^{-1} I_\infty(\theta_0) J_\infty(\theta_0)^{-1}\right],
\]
where
\[
J_\infty(\theta_0) = \lim_{n\to\infty} E\, D_\theta^2 s_n(\theta_0)
\]
and
\[
I_\infty(\theta_0) = \lim_{n\to\infty}\operatorname{Var}\sqrt{n}\, D_\theta s_n(\theta_0).
\]
• Note that asymptotic normality only requires that the additional assumptions regarding $J$ and $I$ hold in a neighborhood of $\theta_0$ for $J$ and at $\theta_0$ for $I$, not throughout $\Theta$. In this sense, asymptotic normality is a local property.

16.0.1. Consistent Estimation of Variance Components. Consistent estimation of $J_\infty(\theta_0)$ is straightforward. Assumption (b) of Theorem 22 implies that
\[
J_n(\hat{\theta}_n) = \frac{1}{n}\sum_{t=1}^{n} D_\theta^2 \ln f_t(\hat{\theta}_n)
\stackrel{a.s.}{\longrightarrow} \lim_{n\to\infty} E\frac{1}{n}\sum_{t=1}^{n} D_\theta^2 \ln f_t(\theta_0) = J_\infty(\theta_0).
\]
That is, just calculate the Hessian using the estimate $\hat{\theta}_n$ in place of $\theta_0$.

Consistent estimation of $I_\infty(\theta_0)$ is more difficult, and may be impossible.

• Notation: Let $g_t \equiv D_\theta f_t(\theta_0)$.

We need to estimate
\[
I_\infty(\theta_0) = \lim_{n\to\infty}\operatorname{Var}\sqrt{n}\,\frac{1}{n}\sum_{t=1}^{n} D_\theta \ln f_t(\theta_0)
= \lim_{n\to\infty}\frac{1}{n}\operatorname{Var}\sum_{t=1}^{n} g_t
= \lim_{n\to\infty}\frac{1}{n} E\left\{\left[\sum_{t=1}^{n}\left(g_t - E g_t\right)\right]\left[\sum_{t=1}^{n}\left(g_t - E g_t\right)\right]'\right\}.
\]
This is going to contain a term
\[
\lim_{n\to\infty}\frac{1}{n}\sum_{t=1}^{n}\left(E g_t\right)\left(E g_t\right)',
\]
which will not tend to zero, in general. This term is not consistently estimable in general, since it requires calculating an expectation using the true density under the d.g.p., which is unknown.

• There are important cases where $I_\infty(\theta_0)$ is consistently estimable. For example, suppose that the data come from a random sample (i.e., they are iid). This would be the case with cross sectional data, for example. (Note: we have that the joint distribution of $(y_t, x_t)$ is identical. This does not imply that the conditional density $f(y_t|x_t)$ is identical).
• With random sampling, the limiting objective function is simply
\[
s_\infty(\theta_0) = E_X E_0 \ln f(y|x, \theta_0),
\]
where $E_0$ means expectation of $y|x$ and $E_X$ means expectation with respect to the marginal density of $x$.
• By the requirement that the limiting objective function be maximized at $\theta_0$ we have
\[
D_\theta E_X E_0 \ln f(y|x, \theta_0) = D_\theta s_\infty(\theta_0) = 0.
\]
• The dominated convergence theorem allows switching the order of expectation and differentiation, so
\[
D_\theta s_\infty(\theta_0) = E_X E_0 D_\theta \ln f(y|x, \theta_0) = 0.
\]
The CLT implies that
\[
\frac{1}{\sqrt{n}}\sum_{t=1}^{n} D_\theta \ln f(y_t|x_t, \theta_0) \stackrel{d}{\longrightarrow} N\left(0, I_\infty(\theta_0)\right).
\]
• That is, it's not necessary to subtract the individual means, since they are zero. Given this, and due to independent observations, a consistent estimator is
\[
\hat{I} = \frac{1}{n}\sum_{t=1}^{n}\left[D_\theta \ln f_t(\hat{\theta})\right]\left[D_\theta \ln f_t(\hat{\theta})\right]'.
\]
This is an important case where consistent estimation of the covariance matrix is possible. Other cases exist, even for dynamically misspecified time series models.
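For the iid case just described, a sandwich covariance estimate can be computed directly from the per-observation score and Hessian. The Octave sketch below assumes hypothetical user-written functions score_t(theta, t) and hessian_t(theta, t) returning $D_\theta \ln f_t(\theta)$ (a $K \times 1$ vector) and $D_\theta^2 \ln f_t(\theta)$ (a $K \times K$ matrix) for observation $t$, with thetahat and n already available.

% Sketch: QML sandwich estimator of the asymptotic variance, iid case.
K = length(thetahat);
Jhat = zeros(K, K);      % average Hessian estimate of J_infinity
Ihat = zeros(K, K);      % outer-product-of-gradient estimate of I_infinity
for t = 1:n
  gt = score_t(thetahat, t);
  Jhat += hessian_t(thetahat, t) / n;
  Ihat += (gt * gt') / n;
endfor
Vhat = inv(Jhat) * Ihat * inv(Jhat);   % asy. var of sqrt(n)(thetahat - theta0)
se = sqrt(diag(Vhat) / n);             % standard errors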

CHAPTER 17

Nonlinear least squares (NLS)


Readings: Davidson and MacKinnon, Ch. 2 and 5; Gallant, Ch. 1.

17.1. Introduction and definition

Nonlinear least squares (NLS) is a means of estimating the parameter of the model
\[
y_t = f(x_t, \theta_0) + \varepsilon_t.
\]
• In general, $\varepsilon_t$ will be heteroscedastic and autocorrelated, and possibly nonnormally distributed. However, dealing with this is exactly as in the case of linear models, so we'll just treat the iid case here,
\[
\varepsilon_t \sim iid(0, \sigma^2).
\]
If we stack the observations vertically, defining
\[
\mathbf{y} = (y_1, y_2, \ldots, y_n)',\qquad
\mathbf{f} = \left(f(x_1, \theta), f(x_2, \theta), \ldots, f(x_n, \theta)\right)' \equiv f(\theta),
\]
and
\[
\boldsymbol{\varepsilon} = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)',
\]
we can write the $n$ observations as
\[
\mathbf{y} = f(\theta) + \boldsymbol{\varepsilon}.
\]
Using this notation, the NLS estimator can be defined as
\[
\hat{\theta} \equiv \arg\min_{\Theta} s_n(\theta) = \frac{1}{n}\left[\mathbf{y} - f(\theta)\right]'\left[\mathbf{y} - f(\theta)\right] = \frac{1}{n}\left\Vert \mathbf{y} - f(\theta)\right\Vert^2.
\]
• The estimator minimizes the weighted sum of squared errors, which is the same as minimizing the Euclidean distance between $\mathbf{y}$ and $f(\theta)$.

The objective function can be written as
\[
s_n(\theta) = \frac{1}{n}\left[\mathbf{y}'\mathbf{y} - 2\mathbf{y}'f(\theta) + f(\theta)'f(\theta)\right],
\]
which gives the first order conditions
\[
-\left[\frac{\partial}{\partial\theta}f(\hat{\theta})'\right]\mathbf{y} + \left[\frac{\partial}{\partial\theta}f(\hat{\theta})'\right]f(\hat{\theta}) \equiv 0.
\]
Define the $n \times K$ matrix
\[
(17.1.1)\qquad \mathbf{F}(\hat{\theta}) \equiv D_{\theta'}f(\hat{\theta}).
\]
In shorthand, use $\hat{\mathbf{F}}$ in place of $\mathbf{F}(\hat{\theta})$. Using this, the first order conditions can be written as
\[
-\hat{\mathbf{F}}'\mathbf{y} + \hat{\mathbf{F}}'f(\hat{\theta}) \equiv 0,
\]
or
\[
(17.1.2)\qquad \hat{\mathbf{F}}'\left[\mathbf{y} - f(\hat{\theta})\right] \equiv 0.
\]
This bears a good deal of similarity to the f.o.c. for the linear model - the derivative of the prediction is orthogonal to the prediction error. If $f(\theta) = \mathbf{X}\theta$, then $\hat{\mathbf{F}}$ is simply $\mathbf{X}$, so the f.o.c. (with spherical errors) simplify to
\[
\mathbf{X}'\mathbf{y} - \mathbf{X}'\mathbf{X}\beta = 0,
\]
the usual OLS f.o.c.

We can interpret this geometrically: INSERT drawings of geometrical depiction of OLS and NLS (see Davidson and MacKinnon, pgs. 8, 13 and 46).

• Note that the nonlinearity of the manifold leads to potential multiple local maxima, minima and saddlepoints: the objective function $s_n(\theta)$ is not necessarily well-behaved and may be difficult to minimize.

17.2. Identification

As before, identification can be considered conditional on the sample, and asymptotically. The condition for asymptotic identification is that $s_n(\theta)$ tend to a limiting function $s_\infty(\theta)$ such that $s_\infty(\theta_0) < s_\infty(\theta)$, $\forall \theta \neq \theta_0$. This will be the case if $s_\infty(\theta_0)$ is strictly convex at $\theta_0$, which requires that $D_\theta^2 s_\infty(\theta_0)$ be positive definite. Consider the objective function:
\[
s_n(\theta) = \frac{1}{n}\sum_{t=1}^{n}\left[y_t - f(x_t, \theta)\right]^2
= \frac{1}{n}\sum_{t=1}^{n}\left[f(x_t, \theta_0) + \varepsilon_t - f(x_t, \theta)\right]^2
\]
\[
= \frac{1}{n}\sum_{t=1}^{n}\left[f(x_t, \theta_0) - f(x_t, \theta)\right]^2
+ \frac{1}{n}\sum_{t=1}^{n}\varepsilon_t^2
+ \frac{2}{n}\sum_{t=1}^{n}\left[f(x_t, \theta_0) - f(x_t, \theta)\right]\varepsilon_t.
\]
• As in example 14.3, which illustrated the consistency of extremum estimators using OLS, we conclude that the second term will converge to a constant which does not depend upon $\theta$.
• A LLN can be applied to the third term to conclude that it converges pointwise to 0, as long as $f(\theta)$ and $\varepsilon$ are uncorrelated.
• Next, pointwise convergence needs to be strengthened to uniform almost sure convergence. There are a number of possible assumptions one could use. Here, we'll just assume it holds.
• Turning to the first term, we'll assume a pointwise law of large numbers applies, so
\[
(17.2.1)\qquad \frac{1}{n}\sum_{t=1}^{n}\left[f(x_t, \theta_0) - f(x_t, \theta)\right]^2 \stackrel{a.s.}{\longrightarrow} \int\left[f(z, \theta_0) - f(z, \theta)\right]^2 d\mu(z),
\]
where $\mu(z)$ is the distribution function of $z$. In many cases, $f(z, \theta)$ will be bounded and continuous, for all $\theta \in \Theta$, so strengthening to uniform almost sure convergence is immediate. For example if $f(z, \theta) = \left[1 + \exp(-z'\theta)\right]^{-1}$, $f: \Re^K \to (0, 1)$, a bounded range, and the function is continuous in $\theta$.

Given these results, it is clear that a minimizer is $\theta_0$. When considering identification (asymptotic), the question is whether or not there may be some other minimizer. A local condition for identification is that
\[
\frac{\partial^2}{\partial\theta\partial\theta'} s_\infty(\theta) = \frac{\partial^2}{\partial\theta\partial\theta'}\int\left[f(z, \theta_0) - f(z, \theta)\right]^2 d\mu(z)
\]
be positive definite at $\theta_0$. Evaluating this derivative, we obtain (after a little work)
\[
\left.\frac{\partial^2}{\partial\theta\partial\theta'}\int\left[f(z, \theta_0) - f(z, \theta)\right]^2 d\mu(z)\right|_{\theta_0}
= 2\int\left[D_\theta f(z, \theta_0)\right]\left[D_{\theta'} f(z, \theta_0)\right] d\mu(z),
\]
the expectation of the outer product of the gradient of the regression function evaluated at $\theta_0$. (Note: the uniform boundedness we have already assumed allows passing the derivative through the integral, by the dominated convergence theorem.) This matrix will be positive definite (wp1) as long as the gradient vector is of full rank (wp1). The tangent space to the regression manifold must span a $K$-dimensional space if we are to consistently estimate a $K$-dimensional parameter vector. This is analogous to the requirement that there be no perfect colinearity in a linear model. This is a necessary condition for identification. Note that the LLN implies that the above expectation is equal to
\[
J_\infty(\theta_0) = 2\lim E\frac{\mathbf{F}'\mathbf{F}}{n}.
\]

17.3. Consistency
We simply assume that the conditions of Theorem 19 hold, so the estimator is consistent. Given that the strong stochastic equicontinuity conditions hold, as discussed above, and given the above identification conditions and a compact estimation space (the closure of the parameter space $\Theta$), the consistency proof's assumptions are satisfied.

17.4. Asymptotic normality


As in the case of GMM, we also simply assume that the conditions for asymptotic normality as in Theorem 22 hold. The only remaining problem is to determine the form of the asymptotic variance-covariance matrix. Recall that the result of the asymptotic normality theorem is
\[
\sqrt{n}\left(\hat{\theta} - \theta_0\right) \stackrel{d}{\longrightarrow} N\left[0,\, J_\infty(\theta_0)^{-1} I_\infty(\theta_0) J_\infty(\theta_0)^{-1}\right],
\]
where $J_\infty(\theta_0)$ is the almost sure limit of $\frac{\partial^2}{\partial\theta\partial\theta'} s_n(\theta)$ evaluated at $\theta_0$, and
\[
I_\infty(\theta_0) = \lim \operatorname{Var}\sqrt{n}\, D_\theta s_n(\theta_0).
\]
The objective function is
\[
s_n(\theta) = \frac{1}{n}\sum_{t=1}^{n}\left[y_t - f(x_t, \theta)\right]^2,
\]
so
\[
D_\theta s_n(\theta) = -\frac{2}{n}\sum_{t=1}^{n}\left[y_t - f(x_t, \theta)\right] D_\theta f(x_t, \theta).
\]
Evaluating at $\theta_0$,
\[
D_\theta s_n(\theta_0) = -\frac{2}{n}\sum_{t=1}^{n}\varepsilon_t\, D_\theta f(x_t, \theta_0).
\]
With this we obtain
\[
n\, D_\theta s_n(\theta_0)\, D_{\theta'} s_n(\theta_0)
= \frac{4}{n}\left[\sum_{t=1}^{n}\varepsilon_t\, D_\theta f(x_t, \theta_0)\right]\left[\sum_{t=1}^{n}\varepsilon_t\, D_{\theta'} f(x_t, \theta_0)\right].
\]
Noting that $\sum_t \varepsilon_t\, D_\theta f(x_t, \theta_0) = \mathbf{F}'\boldsymbol{\varepsilon}$, where $\mathbf{F}$ is evaluated at $\theta_0$, we can write the above as
\[
n\, D_\theta s_n(\theta_0)\, D_{\theta'} s_n(\theta_0) = 4\,\frac{\mathbf{F}'\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}'\mathbf{F}}{n}.
\]
This converges almost surely to its expectation, following a LLN, so
\[
I_\infty(\theta_0) = 4\sigma^2 \lim E\frac{\mathbf{F}'\mathbf{F}}{n},
\]
where the expectation is with respect to the joint density of $x$ and $\varepsilon$. We've already seen that
\[
J_\infty(\theta_0) = 2\lim E\frac{\mathbf{F}'\mathbf{F}}{n}.
\]
Combining these expressions for $J_\infty(\theta_0)$ and $I_\infty(\theta_0)$, and the result of the asymptotic normality theorem, we get
\[
\sqrt{n}\left(\hat{\theta} - \theta_0\right) \stackrel{d}{\longrightarrow} N\left[0,\, \left(\lim E\frac{\mathbf{F}'\mathbf{F}}{n}\right)^{-1}\sigma^2\right].
\]
We can consistently estimate the variance covariance matrix using
\[
(17.4.1)\qquad \left(\frac{\hat{\mathbf{F}}'\hat{\mathbf{F}}}{n}\right)^{-1}\hat{\sigma}^2,
\]
where $\hat{\mathbf{F}}$ is defined as in equation 17.1.1 and
\[
\hat{\sigma}^2 = \frac{\left[\mathbf{y} - f(\hat{\theta})\right]'\left[\mathbf{y} - f(\hat{\theta})\right]}{n},
\]
the obvious estimator. Note the close correspondence to the results for the linear model.
17.5. Example: The Poisson model for count data
Suppose that $y_t$ conditional on $x_t$ is independently distributed Poisson. A Poisson random variable is a count data variable, which means it can take the values {0, 1, 2, ...}. This sort of model has been used to study visits to doctors per year, number of patents registered by businesses per year, etc.

The Poisson density is
\[
f(y_t) = \frac{\exp(-\lambda_t)\lambda_t^{y_t}}{y_t!},\qquad y_t \in \{0, 1, 2, \ldots\}.
\]
The mean of $y_t$ is $\lambda_t$, as is the variance. Note that $\lambda_t$ must be positive. Suppose that the true mean is
\[
\lambda_t^0 = \exp(x_t'\beta_0),
\]
which enforces the positivity of $\lambda_t$. Suppose we estimate $\beta_0$ by nonlinear least squares:
\[
\hat{\beta} = \arg\min s_n(\beta) = \frac{1}{n}\sum_{t=1}^{n}\left[y_t - \exp(x_t'\beta)\right]^2.
\]
We can write
\[
s_n(\beta) = \frac{1}{n}\sum_{t=1}^{n}\left[\exp(x_t'\beta_0) + \varepsilon_t - \exp(x_t'\beta)\right]^2
\]
\[
= \frac{1}{n}\sum_{t=1}^{n}\left[\exp(x_t'\beta_0) - \exp(x_t'\beta)\right]^2
+ \frac{1}{n}\sum_{t=1}^{n}\varepsilon_t^2
+ \frac{2}{n}\sum_{t=1}^{n}\varepsilon_t\left[\exp(x_t'\beta_0) - \exp(x_t'\beta)\right].
\]
The last term has expectation zero since the assumption that $E(y_t|x_t) = \exp(x_t'\beta_0)$ implies that $E(\varepsilon_t|x_t) = 0$, which in turn implies that functions of $x_t$ are uncorrelated with $\varepsilon_t$. Applying a strong LLN, and noting that the objective function is continuous on a compact parameter space, we get
\[
s_\infty(\beta) = E_x\left[\exp(x'\beta_0) - \exp(x'\beta)\right]^2 + E_x \exp(x'\beta_0),
\]
where the last term comes from the fact that the conditional variance of $\varepsilon$ is the same as the variance of $y$. This function is clearly minimized at $\beta = \beta_0$, so the NLS estimator is consistent as long as identification holds.

EXERCISE 27. Determine the limiting distribution of $\sqrt{n}\left(\hat{\beta} - \beta_0\right)$. This means finding the specific forms of $\frac{\partial^2}{\partial\beta\partial\beta'} s_\infty(\beta_0)$, $J(\beta_0)$, $\left.\frac{\partial}{\partial\beta} s_n(\beta)\right|_{\beta_0}$, and $I(\beta_0)$. Again, use a CLT as needed, no need to verify that it can be applied.
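A brief Octave sketch of the estimation just described, assuming y (counts) and x (an n x K regressor matrix) are loaded and using fminunc as a generic minimizer, with a zero starting value assumed for illustration:

% Sketch: NLS for the Poisson regression mean lambda_t = exp(x_t'beta).
ssr = @(beta) sumsq(y - exp(x * beta));   % objective: sum of squared residuals
beta0 = zeros(columns(x), 1);             % starting value (assumed)
betahat = fminunc(ssr, beta0);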

17.6. The Gauss-Newton algorithm


Readings: Davidson and MacKinnon, Chapter 6, pgs. 201-207 .

The Gauss-Newton optimization technique is specifically designed for nonlinear least squares. The idea is to linearize the nonlinear model, rather than the objective function. The model is
\[
y=f(\theta^{0})+\varepsilon.
\]
At some $\theta$ in the parameter space, not equal to $\theta^{0}$, we have
\[
y=f(\theta)+\nu,
\]
where $\nu$ is a combination of the fundamental error term $\varepsilon$ and the error due to evaluating the regression function at $\theta$ rather than the true value $\theta^{0}$. Take a first order Taylor's series approximation around a point $\theta^{1}$:
\[
y=f(\theta^{1})+\left[D_{\theta^{\prime}}f(\theta^{1})\right]\left(\theta-\theta^{1}\right)+\nu+\text{approximation error}.
\]
Define $z\equiv y-f(\theta^{1})$ and $b\equiv(\theta-\theta^{1})$. Then this can be written as
\[
z=\mathbf{F}(\theta^{1})b+\omega,
\]
where, as above, $\mathbf{F}(\theta^{1})\equiv D_{\theta^{\prime}}f(\theta^{1})$ is the $n\times K$ matrix of derivatives of the regression function, evaluated at $\theta^{1}$, and $\omega$ is $\nu$ plus approximation error from the truncated Taylor's series.
• Note that $\mathbf{F}$ is known, given $\theta^{1}$. The other new element here is $z$, which is also known, given $\theta^{1}$.
• Note that one could estimate $b$ simply by performing OLS on the above equation.
• Given $\hat{b}$, we calculate a new round estimate of $\theta^{0}$ as $\theta^{2}=\hat{b}+\theta^{1}$. With this, take a new Taylor's series expansion around $\theta^{2}$ and repeat the process. Stop when $\hat{b}=0$ (to within a specified tolerance).

To see why this might work, consider the above approximation, but evaluated at the NLS estimator:
\[
y=f(\hat{\theta})+\mathbf{F}(\hat{\theta})\left(\theta-\hat{\theta}\right)+\omega.
\]
The OLS estimate of $b\equiv\theta-\hat{\theta}$ is
\[
\hat{b}=\left(\hat{\mathbf{F}}^{\prime}\hat{\mathbf{F}}\right)^{-1}\hat{\mathbf{F}}^{\prime}\left[y-f(\hat{\theta})\right].
\]
This must be zero, since
\[
\hat{\mathbf{F}}^{\prime}\left[y-f(\hat{\theta})\right]\equiv0
\]
by definition of the NLS estimator (these are the normal equations as in equation 17.1.2). Since $\hat{b}\equiv0$ when we evaluate at $\hat{\theta}$, updating would stop.
• The Gauss-Newton method doesn't require second derivatives, as does the Newton-Raphson method, so it's faster.
• The varcov estimator, as in equation 17.4.1, is simple to calculate, since we have $\hat{\mathbf{F}}$ as a by-product of the estimation process (i.e., it's just the last round regressor matrix). In fact, a normal OLS program will give the NLS varcov estimator directly, since it's just the OLS varcov estimator from the last iteration.
• The method can suffer from convergence problems since $\mathbf{F}(\theta)^{\prime}\mathbf{F}(\theta)$ may be very nearly singular, even with an asymptotically identified model, especially if $\theta$ is very far from $\hat{\theta}$. Consider the example
\[
y=\beta_{1}+\beta_{2}x^{\beta_{3}}+\varepsilon.
\]
When evaluated at a point where $\beta_{2}$ is close to zero, $\beta_{3}$ has virtually no effect on the NLS objective function, so $\mathbf{F}$ will have rank that is essentially 2, rather than 3. In this case, $\mathbf{F}^{\prime}\mathbf{F}$ will be nearly singular, so $\left(\mathbf{F}^{\prime}\mathbf{F}\right)^{-1}$ will be subject to large roundoff errors.
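Below is a small Python sketch of the Gauss-Newton iteration just described, applied to an illustrative exponential-mean model (the model and variable names are assumptions, not the text's example). The last-round regressor matrix is reused for the varcov estimator, as noted above.

import numpy as np

def gauss_newton(y, x, beta, tol=1e-8, maxit=100):
    for _ in range(maxit):
        f = np.exp(x @ beta)                      # regression function f(x, beta)
        F = f[:, None] * x                        # n x K derivative matrix
        z = y - f                                 # current residual
        b, *_ = np.linalg.lstsq(F, z, rcond=None) # OLS of z on F gives the update
        beta = beta + b
        if np.max(np.abs(b)) < tol:
            break
    # last-round OLS varcov is the NLS varcov estimator
    F = np.exp(x @ beta)[:, None] * x
    sigma2 = np.mean((y - np.exp(x @ beta)) ** 2)
    vcov = sigma2 * np.linalg.inv(F.T @ F)
    return beta, vcov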

17.7. Application: Limited dependent variables and sample selection


Readings: Davidson and MacKinnon, Ch. 15 (a quick reading is sufficient); J. Heckman, "Sample Selection Bias as a Specification Error", Econometrica, 1979. (This is a classic article, not required reading, and which is a bit out-dated. Nevertheless it's a good place to start if you encounter sample selection problems in your research.)
Sample selection is a common problem in applied research. The problem
occurs when observations used in estimation are sampled non-randomly, according to some selection scheme.

17.7.1. Example: Labor Supply. Labor supply of a person is a positive number of hours per unit time supposing the offer wage is higher than the reservation wage, which is the wage at which the person prefers not to work. The model (very simple, with subscripts suppressed):
• Characteristics of individual: $x$
• Latent labor supply: $s^{*}=x^{\prime}\beta+\omega$
• Offer wage: $w^{o}=z^{\prime}\gamma+\nu$
• Reservation wage: $w^{r}=q^{\prime}\delta+\eta$

Write the wage differential as
\[
w^{*}=\left(z^{\prime}\gamma+\nu\right)-\left(q^{\prime}\delta+\eta\right)\equiv r^{\prime}\theta+\varepsilon.
\]
We have the set of equations
\begin{align*}
s^{*} & =x^{\prime}\beta+\omega\\
w^{*} & =r^{\prime}\theta+\varepsilon.
\end{align*}
Assume that
\[
\left[\begin{array}{c}\omega\\ \varepsilon\end{array}\right]\sim N\left(\left[\begin{array}{c}0\\ 0\end{array}\right],\left[\begin{array}{cc}\sigma^{2} & \rho\sigma\\ \rho\sigma & 1\end{array}\right]\right).
\]
We assume that the offer wage and the reservation wage, as well as the latent variable $s^{*}$, are unobservable. What is observed is
\begin{align*}
w & =1\left[w^{*}>0\right]\\
s & =ws^{*}.
\end{align*}
In other words, we observe whether or not a person is working. If the person is working, we observe labor supply, which is equal to latent labor supply, $s^{*}$. Otherwise, $s=0\neq s^{*}$. Note that we are using a simplifying assumption that individuals can freely choose their weekly hours of work.

Suppose we estimated the model
\[
s^{*}=x^{\prime}\beta+\text{residual}
\]
using only observations for which $s>0$. The problem is that these observations are those for which $w^{*}>0$, or equivalently, $-\varepsilon<r^{\prime}\theta$, and
\[
E\left[\omega\,|-\varepsilon<r^{\prime}\theta\right]\neq0,
\]
since $\varepsilon$ and $\omega$ are dependent. Furthermore, this expectation will in general depend on $x$, since elements of $x$ can enter in $r$. Because of these two facts, least squares estimation is biased and inconsistent.

Consider more carefully $E\left[\omega\,|-\varepsilon<r^{\prime}\theta\right]$. Given the joint normality of $\omega$ and $\varepsilon$, we can write (see for example Spanos, Statistical Foundations of Econometric Modelling, pg. 122)
\[
\omega=\rho\sigma\varepsilon+\eta,
\]
where $\eta$ has mean zero and is independent of $\varepsilon$. With this we can write
\[
s^{*}=x^{\prime}\beta+\rho\sigma\varepsilon+\eta.
\]
If we condition this equation on $-\varepsilon<r^{\prime}\theta$ we get
\[
s=x^{\prime}\beta+\rho\sigma E\left(\varepsilon\,|-\varepsilon<r^{\prime}\theta\right)+\eta.
\]
A useful result is that for $z\sim N(0,1)$
\[
E\left(z\,|\,z>z^{*}\right)=\frac{\phi(z^{*})}{1-\Phi(z^{*})},
\]
where $\phi(\cdot)$ and $\Phi(\cdot)$ are the standard normal density and distribution function, respectively. The quantity on the RHS above is known as the inverse Mill's ratio:
\[
IMR(z^{*})=\frac{\phi(z^{*})}{1-\Phi(z^{*})}.
\]
With this we can write
\begin{align}
s & =x^{\prime}\beta+\rho\sigma\frac{\phi(r^{\prime}\theta)}{\Phi(r^{\prime}\theta)}+\eta\tag{17.7.1}\\
 & \equiv\left[\begin{array}{cc}x^{\prime} & \frac{\phi(r^{\prime}\theta)}{\Phi(r^{\prime}\theta)}\end{array}\right]\left[\begin{array}{c}\beta\\ \zeta\end{array}\right]+\eta,\tag{17.7.2}
\end{align}
where $\zeta=\rho\sigma$. The error term $\eta$ has conditional mean zero, and is uncorrelated with the regressors $x^{\prime}$, $\frac{\phi(r^{\prime}\theta)}{\Phi(r^{\prime}\theta)}$. At this point, we can estimate the equation by NLS.
• Heckman showed how one can estimate this in a two step procedure where first $\theta$ is estimated, then equation 17.7.2 is estimated by least squares using the estimated value of $\theta$ to form the regressors. This is inefficient and estimation of the covariance is a tricky issue. It is probably easier (and more efficient) just to do MLE.
• The model presented above depends strongly on joint normality. There exist many alternative models which weaken the maintained assumptions. It is possible to estimate consistently without distributional assumptions. See Ahn and Powell, Journal of Econometrics, 1994.
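As a rough illustration of the two-step procedure, here is a hedged Python sketch (the text itself does not provide code for this). It assumes numpy, scipy and statsmodels are available; all variable names are illustrative.

import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

def heckman_two_step(s, w, X, R):
    """s: observed hours (0 when not working), w: 0/1 work indicator,
    X: hours-equation regressors, R: selection-equation regressors."""
    # Step 1: probit of the participation decision on R
    theta_hat = sm.Probit(w, R).fit(disp=0).params
    idx = R @ theta_hat
    imr = norm.pdf(idx) / norm.cdf(idx)       # phi(r'theta)/Phi(r'theta)
    # Step 2: OLS on the selected sample, adding the inverse Mills ratio
    sel = w == 1
    Xaug = np.column_stack([X[sel], imr[sel]])
    beta_aug = sm.OLS(s[sel], Xaug).fit().params
    return beta_aug                            # last element estimates rho*sigma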

CHAPTER 18

Nonparametric inference
18.1. Possible pitfalls of parametric inference: estimation
Readings: H. White (1980) Using Least Squares to Approximate Unknown
Regression Functions, International Economic Review, pp. 149-70.
In this section we consider a simple example, which illustrates both why
nonparametric methods may in some cases be preferred to parametric methods.

We suppose that data is generated by random sampling of $(y,x)$, where
\[
y=f(x)+\varepsilon,
\]
$x$ is uniformly distributed on $(0,2\pi)$, and $\varepsilon$ is a classical error. The true regression function $f(x)$ is a particular smooth function with curvature (a quadratic in $x$ in this example). The problem of interest is to estimate the elasticity of $f(x)$ with respect to $x$, throughout the range of $x$.

In general, the functional form of $f(x)$ is unknown. One idea is to take a Taylor's series approximation to $f(x)$ about some point $x_{0}$. Flexible functional forms such as the transcendental logarithmic (usually known as the translog) can be interpreted as second order Taylor's series approximations. We'll work with a first order approximation, for simplicity. Approximating about $x_{0}$:
\[
h(x)=f(x_{0})+D_{x}f(x_{0})\left(x-x_{0}\right).
\]
If the approximation point is $x_{0}=0$, we can write
\[
h(x)=a+bx.
\]
The coefficient $a$ is the value of the function at $x=0$, and the slope is the value of the derivative at $x=0$. These are of course not known. One might try estimation by ordinary least squares. The objective function is
\[
s_{n}(a,b)=\frac{1}{n}\sum_{t=1}^{n}\left(y_{t}-h(x_{t})\right)^{2}.
\]
The limiting objective function, following the argument we used to get equations 14.3.1 and 17.2.1, is
\[
s_{\infty}(a,b)=\int_{0}^{2\pi}\left(f(x)-h(x)\right)^{2}dx.
\]
The theorem regarding the consistency of extremum estimators (Theorem 19) tells us that $\hat{a}$ and $\hat{b}$ will converge almost surely to the values that minimize the limiting objective function. Solving the first order conditions¹ reveals the minimizing values $(a^{0},b^{0})$, so the estimated approximating function $\hat{h}(x)$ tends almost surely to
\[
h_{\infty}(x)=a^{0}+b^{0}x.
\]
We may plot the true function and the limit of the approximation to see the asymptotic bias as a function of $x$:

[Figure: the approximating model is the straight line, the true model has curvature.]

Note that the approximating model is in general inconsistent, even at the approximation point. This shows that flexible functional forms based upon Taylor's series approximations do not in general allow consistent estimation. The mathematical properties of the Taylor's series do not carry over when coefficients are estimated.

The approximating model seems to fit the true model fairly well, asymptotically. However, we are interested in the elasticity of the function. Recall that an elasticity is the marginal function divided by the average function:
\[
\varepsilon(x)=x\frac{\phi^{\prime}(x)}{\phi(x)}.
\]
Good approximation of the elasticity over the range of $x$ will require a good approximation of both $f(x)$ and $f^{\prime}(x)$ over the range of $x$. The approximating elasticity is
\[
\eta(x)=x\frac{h^{\prime}(x)}{h(x)}.
\]
Plotting the true elasticity and the elasticity obtained from the limiting approximating model:

[Figure: the true elasticity is the line that has negative slope for large $x$.]

Visually we see that the elasticity is not approximated so well. Root mean squared error in the approximation of the elasticity,
\[
\left(\int_{0}^{2\pi}\left(\varepsilon(x)-\eta(x)\right)^{2}dx\right)^{1/2},
\]
is sizeable in this example.

Now suppose we use the leading terms of a trigonometric series as the approximating model. The reason for using a trigonometric series as an approximating model is motivated by the asymptotic properties of the Fourier flexible functional form (Gallant, 1981, 1982), which we will study in more detail below. Normally with this type of model the number of basis functions is an increasing function of the sample size. Here we hold the set of basis functions fixed. We will consider the asymptotic behavior of a fixed model, which we interpret as an approximation to the estimator's behavior in finite samples.

Consider a set of basis functions consisting of a constant, a linear term, and the leading sine and cosine terms,
\[
Z(x)=\left[\begin{array}{cccccc}1 & x & \cos(x) & \sin(x) & \cos(2x) & \sin(2x)\end{array}\right].
\]
The approximating model is
\[
g_{K}(x)=Z(x)\alpha.
\]
Maintaining these basis functions as the sample size increases, we find that the limiting objective function is minimized at a particular coefficient vector $\alpha^{0}$. Substituting these values into $g_{K}(x)$ we obtain the almost sure limit of the approximation

(18.1.1)
\[
g_{\infty}(x)=Z(x)\alpha^{0}.
\]
Plotting the approximation and the true function:

[Figure: true function and limiting trigonometric approximation.]

Clearly the truncated trigonometric series model offers a better approximation, asymptotically, than does the linear model. Plotting elasticities:

[Figure: true and approximating elasticities.]

On average, the fit is better, though there is some implausible wavyness in the estimate. Root mean squared error in the approximation of the elasticity is about half that of the RMSE when the first order approximation is used. If the trigonometric series contained infinite terms, this error measure would be driven to zero, as we shall see.

¹All calculations were done using Scientific Workplace.
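To make the asymptotic-bias point concrete, here is a small Python sketch that computes the limiting linear approximation and the RMSE of the implied elasticity by simulation. The quadratic $f(x)$ used here is a hypothetical stand-in, not the function used in the text.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 2 * np.pi, size=200_000)      # large n mimics the limit
f = 1 + 0.5 * x - 0.05 * x ** 2                  # hypothetical true function
X = np.column_stack([np.ones_like(x), x])
a, b = np.linalg.lstsq(X, f, rcond=None)[0]      # limiting OLS coefficients

grid = np.linspace(0.1, 2 * np.pi, 400)
true_elas = grid * (0.5 - 0.1 * grid) / (1 + 0.5 * grid - 0.05 * grid ** 2)
approx_elas = grid * b / (a + b * grid)
rmse = np.sqrt(np.mean((true_elas - approx_elas) ** 2))
print(a, b, rmse)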

18.2. Possible pitfalls of parametric inference: hypothesis testing


What do we mean by the term "nonparametric inference"? Simply, this means inferences that are possible without restricting the functions of interest to belong to a parametric family.
• Consider means of testing for the hypothesis that consumers maximize utility. A consequence of utility maximization is that the Slutsky matrix $D_{p}^{2}h(p,U)$, where $h(p,U)$ are the set of compensated demand functions, must be negative semi-definite. One approach to testing for utility maximization would estimate a set of normal demand functions $x(p,m)$.
• Estimation of these functions by normal parametric methods requires specification of the functional form of demand, for example
\[
x(p,m)=x(p,m,\theta^{0})+\varepsilon,\qquad\theta^{0}\in\Theta_{0},
\]
where $x(p,m,\theta^{0})$ is a function of known form and $\Theta_{0}$ is a finite dimensional parameter.
• After estimation, we could use $\hat{x}=x(p,m,\hat{\theta})$ to calculate (by solving the integrability problem, which is non-trivial) $\hat{D}_{p}^{2}h(p,U)$. If we can statistically reject that the matrix is negative semi-definite, we might conclude that consumers don't maximize utility.
• The problem with this is that the reason for rejection of the theoretical proposition may be that our choice of functional form is incorrect. In the introductory section we saw that functional form misspecification leads to inconsistent estimation of the function and its derivatives.
• Testing using parametric models always means we are testing a compound hypothesis. The hypothesis that is tested is 1) the economic proposition we wish to test, and 2) the model is correctly specified. Failure of either 1) or 2) can lead to rejection. This is known as the "model-induced augmenting hypothesis."
• Varian's WARP allows one to test for utility maximization without specifying the form of the demand functions. The only assumptions used in the test are those directly implied by theory, so rejection of the hypothesis calls into question the theory.
• Nonparametric inference allows direct testing of economic propositions, without the model-induced augmenting hypothesis.

18.3. The Fourier functional form


Readings: Gallant, 1987, "Identification and consistency in semi-nonparametric regression," in Advances in Econometrics, Fifth World Congress, V. 1, Truman Bewley, ed., Cambridge.

Suppose we have a multivariate model
\[
y=f(x)+\varepsilon,
\]
where $f(x)$ is of unknown form and $x$ is a $P$-dimensional vector. For simplicity, assume that $\varepsilon$ is a classical error. Let us take the estimation of the vector of elasticities with typical element
\[
\xi_{x_{i}}=\frac{x_{i}}{f(x)}\frac{\partial f(x)}{\partial x_{i}},
\]
at an arbitrary point $x_{i}$.

The Fourier form, following Gallant (1982), but with a somewhat different parameterization, may be written as

(18.3.1)
\[
g_{K}(x\,|\,\theta_{K})=\alpha+x^{\prime}\beta+\frac{1}{2}x^{\prime}Cx+\sum_{\alpha=1}^{A}\sum_{j=1}^{J}\left(u_{j\alpha}\cos(jk_{\alpha}^{\prime}x)-v_{j\alpha}\sin(jk_{\alpha}^{\prime}x)\right),
\]
where the $K$-dimensional parameter vector is

(18.3.2)
\[
\theta_{K}=\left\{\alpha,\beta^{\prime},vec^{*}(C)^{\prime},u_{11},v_{11},\ldots,u_{JA},v_{JA}\right\}^{\prime}.
\]
• We assume that the conditioning variables $x$ have each been transformed to lie in an interval that is shorter than $2\pi$. This is required to avoid periodic behavior of the approximation, which is desirable since economic functions aren't periodic. For example, subtract sample means, divide by the maxima of the conditioning variables, and multiply by $2\pi-\text{eps}$, where eps is some positive number less than $2\pi$ in value.
• The $k_{\alpha}$ are "elementary multi-indices," which are simply $P$-vectors formed of integers (negative, positive and zero). The $k_{\alpha}$, $\alpha=1,2,\ldots,A$, are required to be linearly independent, and we follow the convention that the first non-zero element be positive. For example, a vector whose first nonzero element is positive is a potential multi-index to be used, but one whose first nonzero element is negative is not, nor is any scalar multiple of a multi-index already in the set.
• We parameterize the matrix $C$ differently than does Gallant because it simplifies things in practice. The cost of this is that we are no longer able to test a quadratic specification using nested testing.

The vector of first partial derivatives is

(18.3.3)
\[
D_{x}g_{K}(x\,|\,\theta_{K})=\beta+Cx+\sum_{\alpha=1}^{A}\sum_{j=1}^{J}\left[\left(-u_{j\alpha}\sin(jk_{\alpha}^{\prime}x)-v_{j\alpha}\cos(jk_{\alpha}^{\prime}x)\right)jk_{\alpha}\right],
\]
and the matrix of second partial derivatives is

(18.3.4)
\[
D_{x}^{2}g_{K}(x\,|\,\theta_{K})=C+\sum_{\alpha=1}^{A}\sum_{j=1}^{J}\left[\left(-u_{j\alpha}\cos(jk_{\alpha}^{\prime}x)+v_{j\alpha}\sin(jk_{\alpha}^{\prime}x)\right)j^{2}k_{\alpha}k_{\alpha}^{\prime}\right].
\]
To define a compact notation for partial derivatives, let $\lambda$ be an $N$-dimensional multi-index with no negative elements. Define $|\lambda^{*}|$ as the sum of the elements of $\lambda$. If we have $N$ arguments $x$ of the (arbitrary) function $h(x)$, use $D^{\lambda}h(x)$ to indicate a certain partial derivative:
\[
D^{\lambda}h(x)\equiv\frac{\partial^{|\lambda^{*}|}}{\partial x_{1}^{\lambda_{1}}\partial x_{2}^{\lambda_{2}}\cdots\partial x_{N}^{\lambda_{N}}}h(x).
\]
When $\lambda$ is the zero vector, $D^{\lambda}h(x)\equiv h(x)$. Taking this definition and the last few equations into account, we see that it is possible to define a $(1\times K)$ vector $z^{\lambda}(x)$ so that

(18.3.5)
\[
D^{\lambda}g_{K}(x\,|\,\theta_{K})=z^{\lambda}(x)^{\prime}\theta_{K}.
\]
• Both the approximating model and the derivatives of the approximating model are linear in the parameters.
• For the approximating model to the function (not derivatives), write $g_{K}(x\,|\,\theta_{K})=z^{\prime}\theta_{K}$ for simplicity.
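As an illustration of this linearity in the parameters, here is a Python sketch (the course code itself is in Ox) that builds a Fourier-form regressor matrix for a single conditioning variable, so the multi-indices are trivial; the rescaling constant and the exact set of columns are illustrative assumptions.

import numpy as np

def fourier_regressors(x, J):
    # rescale x into an interval shorter than 2*pi, as discussed above
    eps = 0.1
    s = (x - x.min()) / (x.max() - x.min()) * (2 * np.pi - eps)
    cols = [np.ones_like(s), s, 0.5 * s ** 2]     # constant, linear, quadratic parts
    for j in range(1, J + 1):
        cols.append(np.cos(j * s))
        cols.append(-np.sin(j * s))
    return np.column_stack(cols)

OLS of y on this matrix gives an estimate of the parameter vector; the derivatives of the approximation are linear in the same parameters, with the corresponding derivative regressors.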

The following theorem can be used to prove the consistency of the Fourier
form.

THEOREM 28. [Gallant and Nychka, 1987] Suppose that $\hat{h}_{n}$ is obtained by maximizing a sample objective function $s_{n}(h)$ over $\mathcal{H}_{K_{n}}$, where $\mathcal{H}_{K}$ is a subset of some function space $\mathcal{H}$ on which is defined a norm $\Vert h\Vert$. Consider the following conditions:
(a) Compactness: The closure of $\mathcal{H}$ with respect to $\Vert h\Vert$ is compact in the relative topology defined by $\Vert h\Vert$.
(b) Denseness: $\cup_{K}\mathcal{H}_{K}$, $K=1,2,3,\ldots$, is a dense subset of the closure of $\mathcal{H}$ with respect to $\Vert h\Vert$, and $\mathcal{H}_{K}\subset\mathcal{H}_{K+1}$.
(c) Uniform convergence: There is a point $h^{*}$ in $\mathcal{H}$, and there is a function $s_{\infty}(h,h^{*})$ that is continuous in $h$ with respect to $\Vert h\Vert$ such that
\[
\lim_{n\rightarrow\infty}\sup_{\mathcal{H}}\left|s_{n}(h)-s_{\infty}(h,h^{*})\right|=0
\]
almost surely.
(d) Identification: Any point $h$ in the closure of $\mathcal{H}$ with $s_{\infty}(h,h^{*})\geq s_{\infty}(h^{*},h^{*})$ must have $\Vert h-h^{*}\Vert=0$.
Under these conditions $\lim_{n\rightarrow\infty}\Vert h^{*}-\hat{h}_{n}\Vert=0$ almost surely, provided that $\lim_{n\rightarrow\infty}K_{n}=\infty$ almost surely.

The modification of the original statement of the theorem that has been made is to set the parameter space $\Theta$ in Gallant and Nychka's (1987) Theorem 0 to a single point, and to state the theorem in terms of maximization rather than minimization.

This theorem is very similar in form to Theorem 19. The main differences are:
(1) A generic norm $\Vert h\Vert$ is used in place of the Euclidean norm. This norm may be stronger than the Euclidean norm, so that convergence with respect to $\Vert h\Vert$ implies convergence w.r.t the Euclidean norm. Typically we will want to make sure that the norm is strong enough to imply convergence of all functions of interest.
(2) The "estimation space" $\mathcal{H}$ is a function space. It plays the role of the parameter space $\Theta$ in our discussion of parametric estimators. There is no restriction to a parametric family, only a restriction to a space of functions that satisfy certain conditions. This formulation is much less restrictive than the restriction to a parametric family.
(3) There is a denseness assumption that was not present in the other theorem.

We will not prove this theorem (the proof is quite similar to the proof of theorem [19], see Gallant, 1987) but we will discuss its assumptions, in relation to the Fourier form as the approximating model.

18.3.1. Sobolev norm. Since all of the assumptions involve the norm $\Vert h\Vert$, we need to make explicit what norm we wish to use. We need a norm that guarantees that the errors in approximation of the functions we are interested in are accounted for. Since we are interested in first-order elasticities in the present case, we need close approximation of both the function $f(x)$ and its first derivative $f^{\prime}(x)$, throughout the range of $x$. Let $\mathcal{X}$ be an open set that contains all values of $x$ that we're interested in. The Sobolev norm is appropriate in this case. It is defined, making use of our notation for partial derivatives, as:
\[
\Vert h\Vert_{m,\mathcal{X}}=\max_{|\lambda^{*}|\leq m}\sup_{\mathcal{X}}\left|D^{\lambda}h(x)\right|.
\]
To see whether or not the function $f(x)$ is well approximated by an approximating model $g_{K}(x\,|\,\theta_{K})$, we would evaluate
\[
\Vert f(x)-g_{K}(x\,|\,\theta_{K})\Vert_{m,\mathcal{X}}.
\]
We see that this norm takes into account errors in approximating the function and partial derivatives up to order $m$. If we want to estimate first order elasticities, as is the case in this example, the relevant $m$ would be $m=1$. Furthermore, since we examine the sup over $\mathcal{X}$, convergence w.r.t. the Sobolev norm means uniform convergence, so that we obtain consistent estimates for all values of $x$.

18.3.2. Compactness. Verifying compactness with respect to this norm is quite technical and unenlightening. It is proven by Elbadawi, Gallant and Souza, Econometrica, 1983. The basic requirement is that if we need consistency w.r.t. $\Vert h\Vert_{m,\mathcal{X}}$, then the functions of interest must belong to a Sobolev space which takes into account derivatives of order $m+1$. A Sobolev space is the set of functions
\[
\mathcal{W}_{m+1,\mathcal{X}}(D)=\left\{h(x):\Vert h(x)\Vert_{m+1,\mathcal{X}}<D\right\},
\]
where $D$ is a finite constant. In plain words, the functions must have bounded partial derivatives of one order higher than the derivatives we seek to estimate.
18.3.3. The estimation space and the estimation subspace. Since in our case we're interested in consistent estimation of first-order elasticities, we'll define the estimation space as follows:

DEFINITION 29. [Estimation space] The estimation space $\mathcal{H}=\mathcal{W}_{2,\mathcal{X}}(D)$. The estimation space is an open set, and we presume that $h^{*}\in\mathcal{H}$.

So we are assuming that the function to be estimated has bounded second derivatives throughout $\mathcal{X}$.

With seminonparametric estimators, we don't actually optimize over the estimation space. Rather, we optimize over a subspace, $\mathcal{H}_{K_{n}}$, defined as:

DEFINITION 30. [Estimation subspace] The estimation subspace $\mathcal{H}_{K}$ is defined as
\[
\mathcal{H}_{K}=\left\{g_{K}(x\,|\,\theta_{K}):g_{K}(x\,|\,\theta_{K})\in\mathcal{W}_{2,\mathcal{X}}(D),\;\theta_{K}\in\mathbb{R}^{K}\right\},
\]
where $g_{K}(x\,|\,\theta_{K})$ is the Fourier form approximation as defined in Equation 18.3.1.

18.3.4. Denseness. The important point here is that $\mathcal{H}_{K}$ is a space of functions that is indexed by a finite dimensional parameter ($\theta_{K}$ has $K$ elements, as in equation 18.3.2). With $n$ observations, $n>K$, this parameter is estimable. Note that the true function $h^{*}$ is not necessarily an element of $\mathcal{H}_{K}$, so optimization over $\mathcal{H}_{K}$ may not lead to a consistent estimator. In order for optimization over $\mathcal{H}_{K}$ to be equivalent to optimization over $\mathcal{H}$, at least asymptotically, we need that:
(1) The dimension of the parameter vector, $\dim\theta_{K_{n}}\rightarrow\infty$ as $n\rightarrow\infty$. This is achieved by making $A$ and $J$ in equation 18.3.1 increasing functions of $n$, the sample size. It is clear that $K$ will have to grow more slowly than $n$. The second requirement is:
(2) We need that the $\mathcal{H}_{K}$ be dense subsets of $\mathcal{H}$.

The estimation subspace $\mathcal{H}_{K}$, defined above, is a subset of the closure of the estimation space, $\mathcal{H}$. A set of subsets $\mathcal{A}_{a}$ of a set $\mathcal{A}$ is "dense" if the closure of the countable union of the subsets is equal to the closure of $\mathcal{A}$. Use a picture here. The rest of the discussion of denseness is provided just for completeness: there's no need to study it in detail. To show that $\mathcal{H}_{K}$ is a dense subset of $\mathcal{H}$ with respect to $\Vert h\Vert_{1,\mathcal{X}}$, it is useful to apply Theorem 1 of Gallant (1982), who in turn cites Edmunds and Moscatelli (1977). We reproduce the theorem as presented by Gallant, with minor notational changes, for convenience of reference:

THEOREM 31. [Edmunds and Moscatelli, 1977] Let the real-valued function $h^{*}(x)$ be continuously differentiable up to order $m$ on an open set containing the closure of $\mathcal{X}$. Then it is possible to choose a triangular array of coefficients $\theta_{1},\theta_{2},\ldots,\theta_{K},\ldots$, such that for every $q$ with $0\leq q<m$, and every $\varepsilon>0$, $\Vert h^{*}(x)-h_{K}(x\,|\,\theta_{K})\Vert_{q,\mathcal{X}}=o(K^{-m+q+\varepsilon})$ as $K\rightarrow\infty$.

In the present application, $q=1$ and $m=2$. By definition of the estimation space, the elements of $\mathcal{H}$ are once continuously differentiable on $\mathcal{X}$, which is open and contains the closure of $\mathcal{X}$, so the theorem is applicable. Closely following Gallant and Nychka (1987), $\cup_{\infty}\mathcal{H}_{K}$ is the countable union of the $\mathcal{H}_{K}$. The implication of Theorem 31 is that there is a sequence of $\{h_{K}\}$ from $\cup_{\infty}\mathcal{H}_{K}$ such that
\[
\lim_{K\rightarrow\infty}\Vert h^{*}-h_{K}\Vert_{1,\mathcal{X}}=0
\]
for all $h^{*}\in\mathcal{H}$. Therefore, the closure of $\cup_{\infty}\mathcal{H}_{K}$ contains $\mathcal{H}$. However, $\cup_{\infty}\mathcal{H}_{K}\subset\mathcal{H}$, so its closure is also contained in the closure of $\mathcal{H}$. Therefore
\[
\overline{\cup_{\infty}\mathcal{H}_{K}}=\bar{\mathcal{H}},
\]
so $\cup_{\infty}\mathcal{H}_{K}$ is a dense subset of $\mathcal{H}$, with respect to the norm $\Vert h\Vert_{1,\mathcal{X}}$.

18.3.5. Uniform convergence. We now turn to the limiting objective function. We estimate by OLS. The sample objective function stated in terms of maximization is
\[
s_{n}(\theta_{K})=-\frac{1}{n}\sum_{t=1}^{n}\left(y_{t}-g_{K}(x_{t}\,|\,\theta_{K})\right)^{2}.
\]
With random sampling, as in the case of Equations 14.3.1 and 17.2.1, the limiting objective function is

(18.3.6)
\[
s_{\infty}(g,f)=-\int_{\mathcal{X}}\left(f(x)-g(x)\right)^{2}d\mu(x)-\sigma_{\varepsilon}^{2},
\]
where the true function $f(x)$ takes the place of the generic function $h^{*}$ in the presentation of the theorem. Both $g(x)$ and $f(x)$ are elements of $\overline{\cup_{\infty}\mathcal{H}_{K}}$.

The pointwise convergence of the objective function needs to be strengthened to uniform convergence. We will simply assume that this holds, since the way to verify this depends upon the specific application. We also have continuity of the objective function in $g$, with respect to the norm $\Vert h\Vert_{1,\mathcal{X}}$, since
\[
\lim_{\Vert g^{1}-g^{0}\Vert_{1,\mathcal{X}}\rightarrow0}\left[s_{\infty}\left(g^{1},f\right)-s_{\infty}\left(g^{0},f\right)\right]=\lim_{\Vert g^{1}-g^{0}\Vert_{1,\mathcal{X}}\rightarrow0}\int_{\mathcal{X}}\left[\left(g^{1}(x)-f(x)\right)^{2}-\left(g^{0}(x)-f(x)\right)^{2}\right]d\mu(x)=0.
\]
By the dominated convergence theorem (which applies since the finite bound $D$ used to define $\mathcal{W}_{2,\mathcal{X}}(D)$ is dominated by an integrable function), the limit and the integral can be interchanged, so by inspection, the limit is zero.

18.3.6. Identification. The identification condition requires that for any point $(g,f)$ in $\mathcal{H}\times\mathcal{H}$, $s_{\infty}(g,f)\geq s_{\infty}(f,f)\Rightarrow\Vert f-g\Vert_{1,\mathcal{X}}=0$. This condition is clearly satisfied given that $g$ and $f$ are once continuously differentiable (by the assumption that defines the estimation space).

18.3.7. Review of concepts. For the example of estimation of first-order elasticities, the relevant concepts are:
• Estimation space $\mathcal{H}=\mathcal{W}_{2,\mathcal{X}}(D)$: the function space in the closure of which the true function must lie.
• Consistency norm $\Vert h\Vert_{1,\mathcal{X}}$. The closure of $\mathcal{H}$ is compact with respect to this norm.
• Estimation subspace $\mathcal{H}_{K}$. The estimation subspace is the subset of $\mathcal{H}$ that is representable by a Fourier form with parameter $\theta_{K}$. These are dense subsets of $\mathcal{H}$.
• Sample objective function $s_{n}(\theta_{K})$, the negative of the sum of squares. By standard arguments this converges uniformly to the
• Limiting objective function $s_{\infty}(g,f)$, which is continuous in $g$ and has a global maximum in its first argument, over the closure of the infinite union of the estimation subspaces, at $g=f$.
• As a result of this, first order elasticities
\[
\frac{x_{i}}{f(x)}\frac{\partial f(x)}{\partial x_{i}}
\]
are consistently estimated for all $x\in\mathcal{X}$.

18.3.8. Discussion. Consistency requires that the number of parameters used in the expansion increase with the sample size, tending to infinity. If parameters are added at a high rate, the bias tends relatively rapidly to zero. A basic problem is that a high rate of inclusion of additional parameters causes the variance to tend more slowly to zero. The issue of how to choose the rate at which parameters are added and which to add first is fairly complex. A problem is that the allowable rates for asymptotic normality to obtain (Andrews 1991; Gallant and Souza, 1991) are very strict. Supposing we stick to these rates, our approximating model is:
\[
y=g_{K}(x\,|\,\theta_{K})+\varepsilon.
\]
Define $Z_{K}$ as the $n\times K$ matrix of regressors obtained by stacking observations. The LS estimator is
\[
\hat{\theta}_{K}=\left(Z_{K}^{\prime}Z_{K}\right)^{+}Z_{K}^{\prime}y,
\]
where $(\cdot)^{+}$ is the Moore-Penrose generalized inverse. This is used since $Z_{K}^{\prime}Z_{K}$ may be singular, as would be the case for $K(n)$ large enough when some dummy variables are included. The prediction, $z^{\prime}\hat{\theta}_{K}$, of the unknown function $f(x)$ is asymptotically normally distributed:
\[
\sqrt{n}\left(z^{\prime}\hat{\theta}_{K}-f(x)\right)\overset{d}{\rightarrow}N(0,A_{V}),
\]
where
\[
A_{V}=\lim_{n\rightarrow\infty}E\left[z^{\prime}\left(\frac{Z_{K}^{\prime}Z_{K}}{n}\right)^{+}z\,\hat{\sigma}^{2}\right].
\]
Formally, this is exactly the same as if we were dealing with a parametric linear model. I emphasize, though, that this is only valid if $K$ grows very slowly as $n$ grows. If we can't stick to acceptable rates, we should probably use some other method of approximating the small sample distribution. Bootstrapping is a possibility. We'll discuss this in the section on simulation.
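A corresponding Python sketch of the least squares fit with the Moore-Penrose inverse, continuing the fourier_regressors sketch given earlier; names are illustrative.

import numpy as np

def series_fit(y, Z):
    theta_hat = np.linalg.pinv(Z.T @ Z) @ (Z.T @ y)   # (Z'Z)^+ Z'y
    yhat = Z @ theta_hat                              # prediction of f(x) at sample points
    return theta_hat, yhat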


18.4. Kernel regression estimators


Readings: Bierens, 1987, "Kernel estimators of regression functions," in Advances in Econometrics, Fifth World Congress, V. 1, Truman Bewley, ed., Cambridge.

An alternative method to the semi-nonparametric method is a fully nonparametric method of estimation. Kernel regression estimation is an example (others are splines, nearest neighbor, etc.). We'll consider the Nadaraya-Watson kernel regression estimator in a simple case. Suppose we have an iid sample from the joint density $f(x,y)$, where $x$ is $k$-dimensional. The model is
\[
y_{t}=g(x_{t})+\varepsilon_{t},
\]
where
\[
E(\varepsilon_{t}|x_{t})=0.
\]
The conditional expectation of $y$ given $x$ is $g(x)$. By definition of the conditional expectation, we have
\[
g(x)=\int y\frac{f(x,y)}{h(x)}dy=\frac{1}{h(x)}\int yf(x,y)dy,
\]
where $h(x)$ is the marginal density of $x$:
\[
h(x)=\int f(x,y)dy.
\]
This suggests that we could estimate $g(x)$ by estimating $h(x)$ and $\int yf(x,y)dy$.

18.4.1. Estimation of the denominator. A kernel estimator for $h(x)$ has the form
\[
\hat{h}(x)=\frac{1}{n\gamma_{n}^{k}}\sum_{t=1}^{n}K\left[\left(x-x_{t}\right)/\gamma_{n}\right],
\]
where $n$ is the sample size and $k$ is the dimension of $x$.
• The function $K(\cdot)$ (the kernel) is absolutely integrable,
\[
\int\left|K(x)\right|dx<\infty,
\]
and $K(\cdot)$ integrates to 1:
\[
\int K(x)dx=1.
\]
In this respect, $K(\cdot)$ is like a density function, but we do not necessarily restrict $K(\cdot)$ to be nonnegative.
• The window width parameter, $\gamma_{n}$, is a sequence of positive numbers that satisfies
\[
\lim_{n\rightarrow\infty}\gamma_{n}=0,\qquad\lim_{n\rightarrow\infty}n\gamma_{n}^{k}=\infty.
\]
So, the window width must tend to zero, but not too quickly.
• To show pointwise consistency of $\hat{h}(x)$ for $h(x)$, first consider the expectation of the estimator (since the estimator is an average of iid terms, we only need to consider the expectation of a representative term):
\[
E\left[\hat{h}(x)\right]=\int\gamma_{n}^{-k}K\left[\left(x-z\right)/\gamma_{n}\right]h(z)dz.
\]
Change variables as $z^{*}=(x-z)/\gamma_{n}$, so $z=x-\gamma_{n}z^{*}$ and $\left|dz/dz^{*\prime}\right|=\gamma_{n}^{k}$; we obtain
\[
E\left[\hat{h}(x)\right]=\int K\left(z^{*}\right)h(x-\gamma_{n}z^{*})dz^{*}.
\]
Now, asymptotically,
\[
\lim_{n\rightarrow\infty}E\left[\hat{h}(x)\right]=\int K\left(z^{*}\right)h(x)dz^{*}=h(x)\int K\left(z^{*}\right)dz^{*}=h(x),
\]
since $\gamma_{n}\rightarrow0$ and $\int K(z^{*})dz^{*}=1$ by assumption. (Note: that we can pass the limit through the integral is a result of the dominated convergence theorem. For this to hold we need that $h(\cdot)$ be dominated by an absolutely integrable function.)
• Next, considering the variance of $\hat{h}(x)$, we have, due to the iid assumption and the representative term argument,
\[
n\gamma_{n}^{k}\mathrm{Var}\left[\hat{h}(x)\right]=\gamma_{n}^{-k}\mathrm{Var}\left\{K\left[\left(x-z\right)/\gamma_{n}\right]\right\}.
\]
Also, since $\mathrm{Var}(u)=E(u^{2})-\left[E(u)\right]^{2}$, we have
\[
n\gamma_{n}^{k}\mathrm{Var}\left[\hat{h}(x)\right]=\gamma_{n}^{-k}E\left\{K\left[\left(x-z\right)/\gamma_{n}\right]^{2}\right\}-\gamma_{n}^{-k}\left\{E\left(K\left[\left(x-z\right)/\gamma_{n}\right]\right)\right\}^{2}.
\]
The second term converges to zero:
\[
\gamma_{n}^{-k}\left\{E\left(K\left[\left(x-z\right)/\gamma_{n}\right]\right)\right\}^{2}=\gamma_{n}^{k}\left\{E\left[\hat{h}(x)\right]\right\}^{2}\rightarrow0,
\]
by the previous result regarding the expectation and the fact that $\gamma_{n}\rightarrow0$. Therefore,
\[
\lim_{n\rightarrow\infty}n\gamma_{n}^{k}\mathrm{Var}\left[\hat{h}(x)\right]=\lim_{n\rightarrow\infty}\gamma_{n}^{-k}E\left\{K\left[\left(x-z\right)/\gamma_{n}\right]^{2}\right\}.
\]
Using exactly the same change of variables as before, this can be shown to be
\[
\lim_{n\rightarrow\infty}n\gamma_{n}^{k}\mathrm{Var}\left[\hat{h}(x)\right]=h(x)\int\left[K(z^{*})\right]^{2}dz^{*}.
\]
Since both $h(x)$ and $\int\left[K(z^{*})\right]^{2}dz^{*}$ are bounded, this is bounded, and since $n\gamma_{n}^{k}\rightarrow\infty$ by assumption, we have that
\[
\mathrm{Var}\left[\hat{h}(x)\right]\rightarrow0.
\]
• Since the bias and the variance both go to zero, we have pointwise consistency (convergence in quadratic mean implies convergence in probability).

18.4.2. Estimation of the numerator. To estimate $\int yf(x,y)dy$, we need an estimator of $f(x,y)$. The estimator has the same form as the estimator for $h(x)$, only with one dimension more:
\[
\hat{f}(x,y)=\frac{1}{n\gamma_{n}^{k+1}}\sum_{t=1}^{n}K_{*}\left[\left(y-y_{t}\right)/\gamma_{n},\left(x-x_{t}\right)/\gamma_{n}\right].
\]
The kernel $K_{*}(\cdot)$ is required to have mean zero,
\[
\int yK_{*}\left(y,x\right)dy=0,
\]
and to marginalize to the previous kernel for $h(x)$:
\[
\int K_{*}\left(y,x\right)dy=K(x).
\]
With this kernel, we have
\[
\int y\hat{f}(x,y)dy=\frac{1}{n\gamma_{n}^{k}}\sum_{t=1}^{n}y_{t}K\left[\left(x-x_{t}\right)/\gamma_{n}\right]
\]
by marginalization of the kernel, so we obtain
\[
\hat{g}(x)=\frac{1}{\hat{h}(x)}\int y\hat{f}(x,y)dy=\frac{\sum_{t=1}^{n}y_{t}K\left[\left(x-x_{t}\right)/\gamma_{n}\right]}{\sum_{t=1}^{n}K\left[\left(x-x_{t}\right)/\gamma_{n}\right]}.
\]
This is the Nadaraya-Watson kernel regression estimator (a code sketch follows the discussion below).

18.4.3. Discussion.
• The kernel regression estimator for $g(x_{t})$ is a weighted average of the $y_{j}$, $j=1,2,\ldots,n$, where higher weights are associated with points that are closer to $x_{t}$. The weights sum to 1.
• The window width parameter $\gamma_{n}$ imposes smoothness. The estimator is increasingly flat as $\gamma_{n}\rightarrow\infty$, since in this case each weight tends to $1/n$.
• A large window width reduces the variance (strong imposition of flatness), but increases the bias.
• A small window width reduces the bias, but makes very little use of information except points that are in a small neighborhood of $x_{t}$. Since relatively little information is used, the variance is large when the window width is small.
• The standard normal density is a popular choice for $K(\cdot)$ and $K_{*}(y,x)$, though there are possibly better alternatives.
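Here is a minimal Python sketch of the Nadaraya-Watson estimator with a Gaussian kernel (the course programs use Ox); it is illustrative only, with a univariate regressor for simplicity.

import numpy as np
from scipy.stats import norm

def nw_regression(x_eval, x, y, h):
    # weights K((x_eval - x_t)/h); the common 1/(n h) factors cancel in the ratio
    w = norm.pdf((x_eval[:, None] - x[None, :]) / h)
    return (w @ y) / w.sum(axis=1)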

18.4.4. Choice of the window width: Cross-validation. The selection of an appropriate window width is important. One popular method is cross validation. This consists of splitting the sample into two parts (e.g., 50%-50%). The first part is the "in sample" data, which is used for estimation, and the second part is the "out of sample" data, used for evaluation of the fit through RMSE or some other criterion. The steps are:
(1) Split the data. The out of sample data is $y^{out}$ and $x^{out}$.
(2) Choose a window width $\gamma$.
(3) With the in sample data, fit $\hat{y}_{t}^{out}$ corresponding to each $x_{t}^{out}$. This fitted value is a function of the in sample data, as well as the evaluation point $x_{t}^{out}$, but it does not involve $y_{t}^{out}$.
(4) Repeat for all out of sample points.
(5) Calculate RMSE($\gamma$).
(6) Go to step (2), or to the next step if enough window widths have been tried.
(7) Select the $\gamma$ that minimizes RMSE($\gamma$) (verify that a minimum has been found, for example by plotting RMSE as a function of $\gamma$).
(8) Re-estimate using the best $\gamma$ and all of the data.

This same principle can be used to choose $A$ and $J$ in a Fourier form model. A code sketch of the closely related leave-one-out variant appears below.
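A Python sketch of the leave-one-out variant referred to above, reusing the nw_regression sketch; illustrative only.

import numpy as np

def cv_score(h, x, y):
    n = len(y)
    errs = np.empty(n)
    for t in range(n):
        keep = np.arange(n) != t
        yhat_t = nw_regression(x[t:t + 1], x[keep], y[keep], h)[0]
        errs[t] = y[t] - yhat_t
    return np.sqrt(np.mean(errs ** 2))           # RMSE(h)

# best_h = min(candidate_widths, key=lambda h: cv_score(h, x, y))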

18.5. Kernel density estimation


The previous discussion suggests that a kernel density estimator may easily
be constructed. We have already seen how joint densities may be estimated.
conditional on ,

then the kernel estimate of the conditional density is simply

If were interested in a conditional density, for example of

f
I
f

  A I@
G
f   t j f I@f )

f f f

b
I@ f

%b p
f I

 b
I@ f

b p
r  b p u u f I

! 
f

u


where we obtain the expressions for the joint and marginal densities from the
section on kernel regression.
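A minimal Python sketch of this conditional density estimator, using Gaussian kernels for both the joint and marginal estimates; illustrative only, with scalar $y$ and $x$.

import numpy as np
from scipy.stats import norm

def cond_density(y0, x0, y, x, h):
    joint = np.mean(norm.pdf((y0 - y) / h) * norm.pdf((x0 - x) / h)) / h ** 2
    marg = np.mean(norm.pdf((x0 - x) / h)) / h
    return joint / marg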
18.6. Semi-nonparametric maximum likelihood
Readings: Gallant and Nychka, Econometrica, 1987. For a Fortran program to do this and a useful discussion in the user's guide, see this link. See also Cameron and Johansson, Journal of Applied Econometrics, V. 12, 1997.
MLE is the estimation method of choice when we are confident about specifying the density. Is it possible to obtain the benefits of MLE when we're not so confident about the specification? In part, yes.

Suppose we're interested in the density of $y$ conditional on $x$ (both may be vectors). Suppose that the density $f(y|x,\phi)$ is a reasonable starting approximation to the true density. This density can be reshaped by multiplying it by a squared polynomial. The new density is
\[
g_{p}(y|x,\phi,\gamma)=\frac{h_{p}^{2}(y|\gamma)f(y|x,\phi)}{\eta_{p}(x,\phi,\gamma)},
\]
where
\[
h_{p}(y|\gamma)=\sum_{k=0}^{p}\gamma_{k}y^{k}
\]
and $\eta_{p}(x,\phi,\gamma)$ is a normalizing factor to make the density integrate (sum) to one. Because $h_{p}^{2}(y|\gamma)/\eta_{p}(x,\phi,\gamma)$ is a homogenous function of $\gamma$, it is necessary to impose a normalization: $\gamma_{0}$ is set to 1.

The normalization factor $\eta_{p}(\phi,\gamma)$ is calculated (following Cameron and Johansson) using the raw moments of the baseline density,
\[
m_{r}=\sum_{y=0}^{\infty}y^{r}f(y|\phi).
\]
Since
\[
\sum_{y=0}^{\infty}h_{p}^{2}(y|\gamma)f(y|\phi)=\sum_{y=0}^{\infty}\sum_{k=0}^{p}\sum_{l=0}^{p}\gamma_{k}\gamma_{l}y^{k+l}f(y|\phi)=\sum_{k=0}^{p}\sum_{l=0}^{p}\gamma_{k}\gamma_{l}m_{k+l},
\]
we get that the normalizing factor is

(18.6.1)
\[
\eta_{p}(\phi,\gamma)=\sum_{k=0}^{p}\sum_{l=0}^{p}\gamma_{k}\gamma_{l}m_{k+l}.
\]
Recall that $\gamma_{0}$ is set to 1 to achieve identification. The $m_{r}$ in equation 18.6.1 are the raw moments of the baseline density. Gallant and Nychka (1987) give conditions under which such a density may be treated as correctly specified, asymptotically. Basically, the order of the polynomial must increase as the sample size increases. However, there are technicalities.
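To fix ideas, here is a hedged Python sketch of the reshaping calculation; raw_moment and base_density are hypothetical callables standing in for the baseline density's raw moments and pmf, and are not from the text.

import numpy as np

def normalizing_factor(gam, raw_moment):
    # eta_p = sum_k sum_l gam_k gam_l m_{k+l}, with gam[0] fixed at 1
    p = len(gam) - 1
    return sum(gam[k] * gam[l] * raw_moment(k + l)
               for k in range(p + 1) for l in range(p + 1))

def reshaped_density(y, base_density, gam, raw_moment):
    h = np.polyval(gam[::-1], y)                 # h_p(y) = sum_k gam_k y^k
    return (h ** 2) * base_density(y) / normalizing_factor(gam, raw_moment)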
Similarly to Cameron and Johansson (1997), we may develop a negative binomial polynomial (NBP) density for count data. The negative binomial baseline density may be written as
\[
f_{Y}(y|\phi)=\frac{\Gamma(y+\psi)}{\Gamma(y+1)\Gamma(\psi)}\left(\frac{\psi}{\psi+\lambda}\right)^{\psi}\left(\frac{\lambda}{\psi+\lambda}\right)^{y},
\]
where $\phi=\{\lambda,\psi\}$, $\lambda>0$ and $\psi>0$. The usual means of incorporating conditioning variables $x$ is the parameterization $\lambda=e^{x^{\prime}\beta}$. When $\psi=\lambda/\alpha$ we have the negative binomial-I model (NB-I). When $\psi=1/\alpha$ we have the negative binomial-II (NB-II) model. For both forms, $E(Y)=\lambda$.

The reshaped density, with normalization to sum to one, is

(18.6.2)
\[
f_{Y}(y|\phi,\gamma)=\frac{\left[h_{p}(y|\gamma)\right]^{2}}{\eta_{p}(\phi,\gamma)}\frac{\Gamma(y+\psi)}{\Gamma(y+1)\Gamma(\psi)}\left(\frac{\psi}{\psi+\lambda}\right)^{\psi}\left(\frac{\lambda}{\psi+\lambda}\right)^{y}.
\]
To get the normalization factor, we need the moment generating function:

(18.6.3)
\[
M_{Y}(t)=\psi^{\psi}\left(\lambda-e^{t}\lambda+\psi\right)^{-\psi}.
\]

To illustrate, here are the first through fourth raw moments of the NB density, calculated using MuPAD, which is a Computer Algebra System that is free for personal use, and then programmed in Ox. These are the moments you would need to use a second order polynomial ($p=2$):

if(k_gam >= 1)
{
    m[][0] = lambda;
    m[][1] = (lambda .* (lambda + psi + lambda .* psi)) ./ psi;
}
if(k_gam >= 2)
{
    m[][2] = (lambda .* (psi .^ 2 + 3 .* lambda .* psi .* (1 + psi)
             + lambda .^ 2 .* (2 + 3 .* psi + psi .^ 2))) ./ psi .^ 2;
    m[][3] = (lambda .* (psi .^ 3 + 7 .* lambda .* psi .^ 2 .* (1 + psi)
             + 6 .* lambda .^ 2 .* psi .* (2 + 3 .* psi + psi .^ 2)
             + lambda .^ 3 .* (6 + 11 .* psi + 6 .* psi .^ 2 + psi .^ 3))) ./ psi .^ 3;
}
After calculating the raw moments, the normalization factor is calculated using equation 18.6.1, again with the help of MuPAD.

if(k_gam == 1)
{
    norm_factor = 1 + gam[0][] .* (2 .* m[][0] + gam[0][] .* m[][1]);
}
else
if(k_gam == 2)
{
    norm_factor = 1 + gam[0][] .^ 2 .* m[][1]
                + 2 .* gam[0][] .* (m[][0] + gam[1][] .* m[][2])
                + gam[1][] .* (2 .* m[][1] + gam[1][] .* m[][3]);
}

For higher-order polynomials the analogous formulae are impressively (i.e., several pages) long. This is an example of a model that would be difficult to formulate without the help of a program like MuPAD.

It is possible that there is conditional heterogeneity such that the appropriate reshaping should be more local. This can be accommodated by allowing the $\gamma_{k}$ parameters to depend upon the conditioning variables, for example using polynomials.

Gallant and Nychka, Econometrica, 1987 prove that this sort of density can approximate a wide variety of densities arbitrarily well as the degree of the polynomial increases with the sample size. This approach is not without its drawbacks: the sample objective function can have an extremely large number of local maxima that can lead to numeric difficulties. If someone could figure out how to do this in a way such that the sample objective function was nice and smooth, they would probably get the paper published in a good journal. Any ideas?

Here's a plot of the true and the limiting SNP approximations (with the order of the polynomial fixed) to four different count data densities, which variously exhibit over- and underdispersion, as well as excess zeros. The baseline model is a negative binomial density.

[Figure: Figures/SNP.eps — true densities and limiting SNP approximations for four count data densities.]


18.7. Examples
18.7.1. Fourier form estimation. You need to get the file FFF.ox, which sets up the data matrix for Fourier form estimation. The first DGP generates data with a nonlinear mean and additive errors (with the mean subtracted out). Then the program fourierform.ox allows you to experiment with different sample sizes and numbers of basis functions. There is no need to specify multi-indices with a univariate regressor (as is the case here, to keep the graphics simple). For a given sample size, here are several plots with different numbers of basis functions.

This first plot shows an underparameterized fit.

[Figure: Nonparametric-I/fff_2.eps]

This next one looks pretty good.

[Figure: Nonparametric-I/fff_4.eps]

Here's an example of an overfitted model — we are starting to chase the error term too much.

[Figure: Nonparametric-I/fff_10.eps]

18.7.2. Kernel regression estimation. You need to get the file KernelLib.ox, which contains the routines for kernel regression and density estimation.

18.7.3. Kernel regression. We will use the same data generating process as for the above examples of Fourier form models. The program kernelreg1.ox allows you to experiment with different sample sizes and window widths. For a given sample size, here are several plots with different window widths. Note that too small a window width (ww = 0.1) leads to a very irregular fit, while setting the window width too high leads to too flat a fit.

[Figure: Nonparametric-I/undersmoothed.eps]

[Figure: Nonparametric-I/oversmoothed.eps]

[Figure: Nonparametric-I/justright.eps]

Cross Validation. The leave-one-out method of cross validation consists of doing an out-of-sample fit to each data point in turn, and calculating the MSE. This is repeated for various window widths. The minimum MSE window width may then be chosen. The program kernelreg2.ox does this. The results are:

[Figure: Nonparametric-I/cvscores.eps]

[Figure: Nonparametric-I/crossvalidated.eps]

18.7.4. Kernel density estimation. The second DGP generates draws of a random variable, then estimates their density using kernel density estimation. The program kerneldens.ox allows you to experiment using different sample sizes, kernels, and window widths. The following figure shows an Epanechnikov kernel fit using different window widths. To change kernels you need to selectively (un)comment lines in the KernelLib.ox file.

[Figure: Nonparametric-I/kerneldensfit.eps]

18.7.5. Seminonparametric density estimation and MuPAD. Following the lecture notes, an SNP density for count data may be obtained by reshaping a negative binomial density using a squared polynomial:

(18.7.1)
\[
f_{Y}(y|\phi,\gamma)=\frac{\left[h_{p}(y|\gamma)\right]^{2}}{\eta_{p}(\phi,\gamma)}\frac{\Gamma(y+\psi)}{\Gamma(y+1)\Gamma(\psi)}\left(\frac{\psi}{\psi+\lambda}\right)^{\psi}\left(\frac{\lambda}{\psi+\lambda}\right)^{y},
\]
where

(18.7.2)
\[
h_{p}(y|\gamma)=\sum_{k=0}^{p}\gamma_{k}y^{k}.
\]
The normalization factor is

(18.7.3)
\[
\eta_{p}(\phi,\gamma)=\sum_{k=0}^{p}\sum_{l=0}^{p}\gamma_{k}\gamma_{l}m_{k+l}.
\]
To implement this using a polynomial of order $p$, we need the raw moments of the negative binomial density up to order $2p$. I couldn't find the NB moment generating function anywhere, so a solution is to calculate it using a Computer Algebra System (CAS). Rather than using one of the expensive alternatives, we can try out MuPAD, which can be downloaded and is free (in the sense of free beer) for personal use. It is installed on the Linux machines in the computer room, and if you like you can install the Windows version, too.

The file negbinSNP.mpd, if run using the command mupad negbinSNP.mpd, will give you the output that follows:

*----*   MuPAD 2.5.1 -- The Open Computer Algebra System
|        Copyright (c) 1997 - 2002 by SciFace Software
*----*   All rights reserved.
Licensed to:   Dr. Michael Creel

Negative Binomial SNP Density

First define the NB density

    gamma(a + y) (a/(a + b))^a (b/(a + b))^y
    ----------------------------------------
              gamma(a) gamma(y + 1)

Verify that it sums to 1

Define the MGF

         (a/(a + b))^a
    -------------------------------
    ((a + b - b exp(t))/(a + b))^a

Print the MGF in TeX format

"\\frac{\\frac{a}{\\left(a + b\\right)}^a}{\\frac{\\left(a + b - b\\, \
ox{exp}\\left(t\\right)\\right)}{\\left(a + b\\right)}^a}"

Find the first moment (which we know is b (lambda))

Find the fifth moment (which we probably don't know)

[long polynomial in a and b; see the Fortran and TeX forms below]

Print the fifth moment in fortran form, to program ln L

"   t3 = a**-4*(b**5*24.0D0+60.0D0*a*b**4+a**4*b+50.0D0*a*b**5+50.0D
  ~(a*a)*b**3+15.0D0*a**3*(b*b)+110.0D0*(a*a)*b**4+75.0D0*a**3*b**3
  ~5.0D0*a**4*(b*b)+35.0D0*(a*a)*b**5+60.0D0*a**3*b**4+25.0D0*a**4*
  ~*3+10.0D0*a**3*b**5+10.0D0*a**4*b**4+a**4*b**5)"

Print the fifth moment in TeX form

"\\frac{24\\, b^5 + 60\\, a\\, b^4 + a^4\\, b + 50\\, a\\, b^5 + 50\\,
\\, b^3 + 15\\, a^3\\, b^2 + 110\\, a^2\\, b^4 + 75\\, a^3\\, b^3 + 15\
a^4\\, b^2 + 35\\, a^2\\, b^5 + 60\\, a^3\\, b^4 + 25\\, a^4\\, b^3 + 1
, a^3\\, b^5 + 10\\, a^4\\, b^4 + a^4\\, b^5}{a^4}"

To get the normalizing factor, we need expressions of the form of the following

a(0) b(0) m(0) + a(0) b(1) m(1) + b(0) a(1) m(1) + a(0) b(2) m(2) +
b(0) a(2) m(2) + a(1) b(1) m(2) + a(0) b(3) m(3) + b(0) a(3) m(3) +
a(1) b(2) m(3) + a(2) b(1) m(3) + a(1) b(3) m(4) + a(2) b(2) m(4) +
b(1) a(3) m(4) + a(2) b(3) m(5) + a(3) b(2) m(5) + a(3) b(3) m(6)

>> quit
Once you get expressions for the moments and the double sums, you can use these to program a loglikelihood function in Ox, without too much trouble. The file NegBinSNP.ox implements this. The file EstimateNBSNP.ox will let you estimate NegBinSNP models for the MEPS data. The estimation results for OBDV, using a second order polynomial and a NB-I baseline model, are

Ox version 3.20 (Linux) (C) J.A. Doornik, 1994-2002

***********************************************************************
MEPS data, OBDV
negbin_snp_obj results
Strong convergence
Observations = 500
Avg. Log Likelihood   -2.2426

Standard Errors
             params      se(OPG)     se(Sand.)   se(Hess)
constant     1.5340      0.13289     0.12645     0.12593
pub_ins      0.16113     0.053100    0.056824    0.054144
priv_ins     0.090624    0.062689    0.065619    0.063835
sex          0.16863     0.047614    0.050720    0.048707
age          0.17950     0.048407    0.045060    0.046301
educ         0.039692    0.047968    0.058794    0.052521
inc          0.032581    0.064384    0.043708    0.051033
ln_alpha     1.8138      0.18466     0.17398     0.17378
gam1        -0.052710    0.0089429   0.0078799   0.0083419
gam2         0.013382    0.0042349   0.0039745   0.0040547

t-Stats
             params      t(OPG)      t(Sand.)    t(Hess)
constant     1.5340      11.543      12.132      12.181
pub_ins      0.16113     3.0344      2.8356      2.9759
priv_ins     0.090624    1.4456      1.3811      1.4197
sex          0.16863     3.5416      3.3248      3.4621
age          0.17950     3.7082      3.9837      3.8769
educ         0.039692    0.82746     0.67509     0.75573
inc          0.032581    0.50603     0.74541     0.63842
ln_alpha     1.8138      9.8226      10.425      10.438
gam1        -0.052710   -5.8941     -6.6892     -6.3188
gam2         0.013382    3.1599      3.3669      3.3003

Information Criteria
CAIC        BIC         AIC
2314.7      2304.7      2262.6
***********************************************************************

Note that the CAIC and BIC are lower for this model than for the ordinary NB-I model. NOTE: density functions formed in this way may have MANY local maxima, so you need to be careful before accepting the results of a casual run. To guard against having converged to a local maximum, one can try using multiple starting values, or one could try simulated annealing as an optimization method. To do this, copy maxsa.ox and maxsa.h into your working directory, and then use the program EstimateNBSNP2.ox to see how to implement SA estimation of the reshaped negative binomial model. For more details on the Ox implementation of SA, see Charles Bos' page. Note - in my own experience, using a gradient-based method such as BFGS with many starting values is as successful as SA, and is usually faster. Perhaps I'm not using SA as well as is possible... YMMV.

CHAPTER 19

Simulation-based estimation
Readings: In addition to the book mentioned previously, articles include Gallant and Tauchen (1996), "Which Moments to Match?", ECONOMETRIC THEORY, Vol. 12, 1996, pages 657-681; Gourieroux, Monfort and Renault (1993), "Indirect Inference," J. Apl. Econometrics; Pakes and Pollard (1989), Econometrica; McFadden (1989), Econometrica.

19.1. Motivation
Simulation methods are of interest when the DGP is fully characterized by
a parameter vector, but the likelihood function is not calculable. If it were
available, we would simply estimate by MLE, which is asymptotically fully
efcient.

19.1.1. Example: Multinomial and/or dynamic discrete response models.


Let $y_{i}^{*}$ be a latent random vector of dimension $m$. Suppose that
\[
y_{i}^{*}=X_{i}\beta+\varepsilon_{i},
\]
where $X_{i}$ is $m\times K$. Suppose that

(19.1.1)
\[
\varepsilon_{i}\sim N(0,\Omega).
\]
Henceforth drop the $i$ subscript when it is not needed for clarity.
• Suppose that $y^{*}$ is not observed. Rather, we observe a many-to-one mapping
\[
y=\tau(y^{*}).
\]
This mapping is such that each element of $y$ is either zero or one (in some cases only one element will be one).
• Define
\[
A_{i}=A(y_{i})=\left\{y^{*}|y_{i}=\tau(y^{*})\right\}.
\]
Suppose random sampling of $(y_{i},X_{i})$. In this case the elements of $y_{i}$ may not be independent of one another (and clearly are not if $\Omega$ is not diagonal). However, $y_{i}$ is independent of $y_{j}$, $i\neq j$.
• Let $\theta=(\beta^{\prime},(vec^{*}\Omega)^{\prime})^{\prime}$ be the vector of parameters of the model. The contribution of the $i^{th}$ observation to the likelihood function is
\[
p_{i}(\theta)=\int_{A_{i}}n(y_{i}^{*}-X_{i}\beta,\Omega)dy_{i}^{*},
\]
where
\[
n(\varepsilon,\Omega)=(2\pi)^{-m/2}\left|\Omega\right|^{-1/2}\exp\left(-\frac{\varepsilon^{\prime}\Omega^{-1}\varepsilon}{2}\right)
\]
is the multivariate normal density of an $m$-dimensional random vector. The log-likelihood function is
\[
\ln L(\theta)=\frac{1}{n}\sum_{i=1}^{n}\ln p_{i}(\theta),
\]
and the MLE $\hat{\theta}$ solves the score equations
\[
\frac{1}{n}\sum_{i=1}^{n}g_{i}(\hat{\theta})=\frac{1}{n}\sum_{i=1}^{n}\frac{D_{\theta}p_{i}(\hat{\theta})}{p_{i}(\hat{\theta})}\equiv0.
\]
• The problem is that evaluation of $p_{i}(\theta)$ and its derivative w.r.t. $\theta$ by standard methods of numeric integration such as quadrature is computationally infeasible when $m$ (the dimension of $y$) is higher than 3 or 4 (as long as there are no restrictions on $\Omega$).
• The mapping $\tau(y^{*})$ has not been made specific so far. This setup is quite general: for different choices of $\tau(y^{*})$ it nests the case of dynamic binary discrete choice models as well as the case of multinomial discrete choice (the choice of one out of a finite set of alternatives).
– Multinomial discrete choice is illustrated by a (very simple) job search model. We have cross sectional data on individuals' matching to a set of $m$ jobs that are available (one of which is unemployment). The utility of alternative $j$ is
\[
u_{j}=X_{j}\beta+\varepsilon_{j}.
\]
Utilities of jobs, stacked in the vector $u_{i}$, are not observed. Rather, we observe the vector formed of elements
\[
y_{j}=1\left[u_{j}>u_{k},\forall k\in m,k\neq j\right].
\]
Only one of these elements is different than zero.
– Dynamic discrete choice is illustrated by repeated choices over time between two alternatives. Let alternative $j$ have utility
\[
u_{jt}=W_{jt}\beta-\varepsilon_{jt},\qquad j\in\{1,2\},\; t\in\{1,2,\ldots,m\}.
\]
Then
\[
y^{*}=u_{2}-u_{1}=(W_{2}-W_{1})\beta+\varepsilon_{2}-\varepsilon_{1}\equiv X\beta+\varepsilon.
\]
Now the mapping is (element-by-element)
\[
y_{t}=1\left[y_{t}^{*}>0\right],
\]
that is, $y_{it}=1$ if individual $i$ chooses the second alternative in period $t$, zero otherwise.

19.1.2. Example: Marginalization of latent variables. Economic data often presents substantial heterogeneity that may be difcult to model. A possibility is to introduce latent random variables. This can cause the problem that
there may be no known closed form for the distribution of observable variables after marginalizing out the unobservable latent variables. For example,
count data (that takes values $\{0,1,2,\ldots\}$) is often modeled using the Poisson distribution
\[
\Pr(y=i)=\frac{\exp(-\lambda)\lambda^{i}}{i!}.
\]
The mean and variance of the Poisson distribution are both equal to $\lambda$. Often, one parameterizes the conditional mean as
\[
\lambda_{i}=\exp(x_{i}^{\prime}\beta).
\]
This ensures that the mean is positive (as it must be). Estimation by ML is straightforward.

Often, count data exhibits "overdispersion," which simply means that
\[
V(y|x)>E(y|x).
\]
If this is the case, a solution is to use the negative binomial distribution rather than the Poisson. An alternative is to introduce a latent variable that reflects heterogeneity into the specification:
\[
\lambda_{i}=\exp(x_{i}^{\prime}\beta+\eta_{i}),
\]
where $\eta_{i}$ has some specified density with support $S$ (this density may depend on additional parameters). Let $d\mu(\eta_{i})$ be the density of $\eta_{i}$. In some cases, the marginal density of $y$,
\[
\Pr(y=y_{i})=\int_{S}\frac{\exp\left[-\exp(x_{i}^{\prime}\beta+\eta_{i})\right]\left[\exp(x_{i}^{\prime}\beta+\eta_{i})\right]^{y_{i}}}{y_{i}!}d\mu(\eta_{i}),
\]
will have a closed-form solution (one can derive the negative binomial distribution in this way if $\eta$ has an exponential distribution), but often this will not be possible. In this case, simulation is a means of calculating $\Pr(y=y_{i})$, which is then used to do ML estimation. This would be an example of Simulated Maximum Likelihood (SML) estimation.
• In this case, since there is only one latent variable, quadrature is probably a better choice. However, a more flexible model with heterogeneity would allow all parameters (not just the constant) to vary. For example
\[
\Pr(y=y_{i})=\int_{S}\frac{\exp\left[-\exp(x_{i}^{\prime}\beta_{i})\right]\left[\exp(x_{i}^{\prime}\beta_{i})\right]^{y_{i}}}{y_{i}!}d\mu(\beta_{i})
\]
entails a $K=\dim\beta_{i}$-dimensional integral, which will not be evaluable by quadrature when $K$ gets large.

19.1.3. Estimation of models specified in terms of stochastic differential equations. It is often convenient to formulate models in terms of continuous time using differential equations. A realistic model should account for exogenous shocks to the system, which can be done by assuming a random component. This leads to a model that is expressed as a system of stochastic differential equations. Consider the process
\[
dy_{t}=g(\theta,y_{t})dt+h(\theta,y_{t})dW_{t},
\]
which is assumed to be stationary. $\{W_{t}\}$ is a standard Brownian motion (Wiener process), such that
\[
W(T)=\int_{0}^{T}dW_{t}\sim N(0,T).
\]
Brownian motion is a continuous-time stochastic process such that
• $W(0)=0$
• $\left[W(s)-W(t)\right]\sim N(0,s-t)$
• $\left[W(s)-W(t)\right]$ and $\left[W(j)-W(k)\right]$ are independent for $s>t>j>k$. That is, non-overlapping segments are independent.
One can think of Brownian motion as the accumulation of independent normally distributed shocks with infinitesimal variance.
• The function $g(\theta,y_{t})$ is the deterministic part.
• $h(\theta,y_{t})$ determines the variance of the shocks.
To estimate a model of this sort, we typically have data that are assumed to be observations of $y_{t}$ in discrete points $y_{1},y_{2},\ldots,y_{T}$. That is, though $y_{t}$ is a continuous process it is observed in discrete time.

To perform inference on $\theta$, direct ML or GMM estimation is not usually feasible, because one cannot, in general, deduce the transition density $f(y_{t}|y_{t-1},\theta)$. This density is necessary to evaluate the likelihood function or to evaluate moment conditions (which are based upon expectations with respect to this density).
• A typical solution is to "discretize" the model, by which we mean to find a discrete time approximation to the model. The discretized version of the model is
\begin{align*}
y_{t}-y_{t-1} & =g(\phi,y_{t-1})+h(\phi,y_{t-1})\varepsilon_{t}\\
\varepsilon_{t} & \sim N(0,1).
\end{align*}
The discretization induces a new parameter, $\phi$ (that is, the $\phi^{0}$ which defines the best approximation of the discretization to the actual (unknown) discrete time version of the model is not equal to $\theta^{0}$, which is the true parameter value). This is an approximation, and as such "ML" estimation of $\phi$ (which is actually quasi-maximum likelihood, QML) based upon this equation is in general biased and inconsistent for the original parameter, $\theta$. Nevertheless, the approximation shouldn't be too bad, which will be useful, as we will see. A simulation sketch of this discretized model appears just below.
• The important point about these three examples is that computational difficulties prevent direct application of ML, GMM, etc. Nevertheless the model is fully specified in probabilistic terms up to a parameter vector. This means that the model is simulable, conditional on the parameter vector.
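A minimal Python sketch of simulating the discretized (Euler) version of the model, under the assumption that the drift g and diffusion h are supplied as functions; illustrative only.

import numpy as np

def simulate_discretized(y0, g, h, phi, T, rng):
    y = np.empty(T)
    y[0] = y0
    eps = rng.standard_normal(T)
    for t in range(1, T):
        y[t] = y[t - 1] + g(phi, y[t - 1]) + h(phi, y[t - 1]) * eps[t]
    return y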

19.2. Simulated maximum likelihood (SML)


For simplicity, consider cross-sectional data. An ML estimator solves
0

is the density function of the

is an infeasible estimator. However,

d f
4 g

does not have a known closed form,

observation. When

d f
a 

 2
I
@
1 d f
dca f


A f )  4RC ~ E

where

it may be possible to dene a random function such that

d f d  f a
a gs  4cag!p

is known. If this is the case, the simulator

d  f
4ag 
) 
8 d f
g4a gs

d  f
4a 

The SML simply substitutes

in place of

log-likelihood function, that is

d f
ca g

is unbiased for

d  f a
4 Eg!@t

where the density of

in the

B 1 d f
daf 
 

f )  eS ~ 

19.2.1. Example: multinomial probit. Recall that the utility of alternative


is

G P
V


19.2. SIMULATED MAXIMUM LIKELIHOOD (SML)

is formed of elements

  g
""i h


Y ) f

cant be calculated when

d ) U
 f

The problem is that

C%

and the vector

416

is larger than 4 or

5. However, it is easy to simulate this probability.

) U B f


times and dene


I
B f g U B
B

-vector formed of the

. Each element of

tween 0 and 1, and the elements sum to one.

is be-

d f
B } RB f   B  B 

Now

as the

Dene

Repeat this

is the matrix formed by stacking the

  g
Q"iD hB

Dene

(where

P
B G B B


Calculate

C%

B G

from the distribution

B
 j

Draw




The SML multinomial probit log-likelihood function is

4d f RB B 1

B  B  A f ) 
f

This is to be maximized w.r.t.

are draw only once and are used repeatedly during

The draws are different for each

are re-drawn at every iteration the estimator will not converge.

B G

If the

and

the iterations used to nd

draws of

8
E3

The

B G

Notes:

and

The log-likelihood function with this simulator is a discontinuous func

and

tion of

This does not cause problems from a theoretical point

19.2. SIMULATED MAXIMUM LIKELIHOOD (SML)

A


of view since it can be shown that

417

is stochastically equicon-

tinuous. However, it does cause problems if one attempts to use a


d

gradient-based optimization method such as Newton-Raphson.

are zero. If the corresponding element of

equal to 1, there will be a

some elements of

, are used, that

problem.

Bf

It may be the case, particularly if few simulations,

is

• Solutions to discontinuity:
  1) Use an estimation method that doesn't require a continuous and differentiable objective function, for example, simulated annealing. This is computationally costly.
  2) Smooth the simulated probabilities so that they are continuous functions of the parameters. For example, apply a kernel transformation such as

\tilde{y}_{ij} = \frac{\exp\left[A\,(\tilde{U}_{ij} - \max_k \tilde{U}_{ik})\right]}{\sum_{l=1}^m \exp\left[A\,(\tilde{U}_{il} - \max_k \tilde{U}_{ik})\right]}

  where A is a large positive number. This approximates a step function: ỹ_ij is very close to zero if Ũ_ij is not the maximum, and very close to one if it is the maximum. This makes ỹ_ij, and therefore π̃_ij and ln L̃(β), a continuous and differentiable function of β. Consistency requires that A(n) → ∞ as the sample size increases, so that the approximation to a step function becomes arbitrarily close. There are alternative methods (e.g., Gibbs sampling) that may work better, but this is too technical to discuss here.
• To solve the log(0) problem, one possibility is to search the web for the slog function. Also, increase H if this is a serious problem.
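To make the frequency simulator above concrete, here is a minimal Octave sketch. All names and the interface are illustrative assumptions (this is not code distributed with the document); it implements the simple, unsmoothed frequency simulator, with the error draws generated once and reused at every trial value of β.

% freq_sim.m: simple frequency simulator for multinomial choice probabilities
% beta: k x 1 parameters; X: cell array where X{j} is the n x k regressor matrix
%       of alternative j; E: n x m x H array of error draws, made ONCE and reused
%       at every trial value of beta (re-drawing would prevent convergence)
function P = freq_sim(beta, X, E)
  [n, m, H] = size(E);
  P = zeros(n, m);
  for h = 1:H
    U = zeros(n, m);
    for j = 1:m
      U(:, j) = X{j}*beta + E(:, j, h);    % simulated utility of alternative j
    endfor
    [dummy, best] = max(U, [], 2);         % alternative with the highest utility
    for j = 1:m
      P(:, j) += (best == j);              % count how often j is chosen
    endfor
  endfor
  P = P/H;   % each row lies in [0,1] and sums to one
endfunction

With y_i coded as an m-vector of zeros and a single one, observation i contributes log(P(i,:)*y_i) to the simulated log-likelihood.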


19.2.2. Properties. The properties of the SML estimator depend on how H is set. The following is taken from Lee (1995), "Asymptotic Bias in Simulated Maximum Likelihood Estimation of Discrete Choice Models," Econometric Theory, 11, pp. 437-483.

Theorem 32. [Lee] 1) If lim_{n→∞} n^{1/2}/H = 0, then

\sqrt{n}\left(\hat{\theta}_{SML} - \theta^0\right) \overset{d}{\to} N\left(0, I^{-1}(\theta^0)\right)

2) If lim_{n→∞} n^{1/2}/H = λ, λ a finite constant, then

\sqrt{n}\left(\hat{\theta}_{SML} - \theta^0\right) \overset{d}{\to} N\left(B, I^{-1}(\theta^0)\right)

where B is a finite vector of constants.

• This means that the SML estimator is asymptotically biased if H doesn't grow faster than n^{1/2}.
• The varcov is the typical inverse of the information matrix, so that as long as H grows fast enough the estimator is consistent and fully asymptotically efficient.
19.3. Method of simulated moments (MSM)

Suppose we have a DGP(y|x, θ) which is simulable given θ, but is such that the density of y is not calculable.

One could, in principle, base a GMM estimator upon the moment conditions

m_t(\theta) = \left[K(y_t, x_t) - k(x_t, \theta)\right] z_t

where

k(x_t, \theta) = \int K(y_t, x_t)\, p(y|x_t, \theta)\, dy,

z_t is a vector of instruments in the information set and p(y|x_t, θ) is the density of y conditional on x_t. The problem is that this density is not available.

• However k(x_t, θ) is readily simulated using

\tilde{k}(x_t, \theta) = \frac{1}{H}\sum_{h=1}^H K(\tilde{y}_t^h, x_t)

• By the law of large numbers, k̃(x_t, θ) →a.s. k(x_t, θ) as H → ∞, which provides a clear intuitive basis for the estimator, though in fact we obtain consistency even for H finite, since a law of large numbers is also operating across the n observations of real data, so errors introduced by simulation cancel themselves out.
• This allows us to form the moment conditions

(19.3.1)   \tilde{m}_t(\theta) = \left[K(y_t, x_t) - \tilde{k}(x_t, \theta)\right] z_t

where z_t is drawn from the information set. As before, form

(19.3.2)   \tilde{m}(\theta) = \frac{1}{n}\sum_{t=1}^n \tilde{m}_t(\theta) = \frac{1}{n}\sum_{t=1}^n \left[K(y_t, x_t) - \frac{1}{H}\sum_{h=1}^H K(\tilde{y}_t^h, x_t)\right] z_t

with which we form the GMM criterion and estimate as usual. Note that the unbiased simulator K(ỹ_t^h, x_t) appears linearly within the sums.
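As an illustration of how (19.3.2) is computed, here is a minimal Octave sketch. The functions simulate_y (a draw from the model given x and θ) and Kfn (the function K(y, x) above) are hypothetical placeholders for user-supplied code, and in practice the random draws behind simulate_y should be held fixed across iterations, e.g. by fixing the seed or passing the draws in.

% msm_moments.m: averaged simulated moment conditions, equation (19.3.2)
% y: n x 1 data; x: n x p conditioning variables; z: n x g instruments
% theta: trial parameter value; H: number of simulations per observation
function mbar = msm_moments(theta, y, x, z, H)
  n = rows(y);
  ktilde = zeros(n, 1);
  for h = 1:H
    ysim = simulate_y(theta, x);          % hypothetical: one simulated y for each x_t
    ktilde += Kfn(ysim, x)/H;             % unbiased simulator of k(x_t, theta)
  endfor
  e = Kfn(y, x) - ktilde;                 % K(y_t, x_t) - k~(x_t, theta)
  m = z .* repmat(e, 1, columns(z));      % each row is m~_t(theta)'
  mbar = mean(m)';                        % g x 1 vector of averaged moment conditions
endfunction

The GMM criterion is then formed from mbar and a weighting matrix in the usual way, and minimized over θ.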

19.3.1. Properties. Suppose that the optimal weighting matrix is used. McFadden (ref. above) and Pakes and Pollard (refs. above) show that the asymptotic distribution of the MSM estimator is very similar to that of the infeasible GMM estimator. In particular, assuming that the optimal weighting matrix is used, and for H finite,

(19.3.3)   \sqrt{n}\left(\hat{\theta}_{MSM} - \theta^0\right) \overset{d}{\to} N\left[0, \left(1 + \frac{1}{H}\right)\left(D_\infty \Omega^{-1} D_\infty'\right)^{-1}\right]

where (D_∞ Ω⁻¹ D_∞')⁻¹ is the asymptotic variance of the infeasible GMM estimator. That is, the asymptotic variance is inflated by a factor 1 + 1/H. For this reason the MSM estimator is not fully asymptotically efficient relative to the infeasible GMM estimator, for H finite, but the efficiency loss is small and controllable, by setting H reasonably large.
• The estimator is asymptotically unbiased even for H = 1. This is an advantage relative to SML.
• If one doesn't use the optimal weighting matrix, the asymptotic varcov is just the ordinary GMM varcov, inflated by 1 + 1/H.
• The above presentation is in terms of a specific moment condition based upon the conditional mean. Simulated GMM can be applied to moment conditions of any form.
19.3.2. Comments. Why is SML inconsistent if H is finite, while MSM is consistent? The reason is that SML is based upon an average of logarithms of an unbiased simulator (the densities of the observations). To use the multinomial probit model as an example, the log-likelihood function is

\ln L(\beta) = \frac{1}{n}\sum_{i=1}^n y_i' \ln p_i(\beta)

The SML version is

\ln \tilde{L}(\beta) = \frac{1}{n}\sum_{i=1}^n y_i' \ln \tilde{p}_i(\beta)

The problem is that

E \ln\left(\tilde{p}_i(\beta)\right) \neq \ln\left(E\, \tilde{p}_i(\beta)\right)

in spite of the fact that

E\, \tilde{p}_i(\beta) = p_i(\beta),

due to the fact that ln(·) is a nonlinear transformation. The only way for the two to be equal (in the limit) is if H tends to infinity, so that p̃(·) tends to p(·).

The reason that MSM does not suffer from this problem is that in this case the unbiased simulator appears linearly within every sum of terms, and it appears within a sum over n (see equation 19.3.2). Therefore the SLLN applies to cancel out simulation errors, from which we get consistency. That is, using simple notation for the random sampling case, the moment conditions

(19.3.4)   \tilde{m}(\theta) = \frac{1}{n}\sum_{t=1}^n \left[K(y_t, x_t) - \frac{1}{H}\sum_{h=1}^H K(\tilde{y}_t^h, x_t)\right] z_t

(19.3.5)   \phantom{\tilde{m}(\theta)} = \frac{1}{n}\sum_{t=1}^n \left[k(x_t, \theta^0) + \varepsilon_t - \frac{1}{H}\sum_{h=1}^H \left(k(x_t, \theta) + \tilde{\varepsilon}_{ht}\right)\right] z_t

(where ε_t and ε̃_{ht} are the mean-zero deviations of K(y_t, x_t) and K(ỹ_t^h, x_t) from their respective conditional means) converge almost surely to

\tilde{m}_\infty(\theta) = \int \left[k(x, \theta^0) - k(x, \theta)\right] z(x)\, d\mu(x).

The objective function converges to

s_\infty(\theta) = \tilde{m}_\infty(\theta)' \Omega_\infty^{-1} \tilde{m}_\infty(\theta)

which obviously has a minimum at θ⁰; hence, consistency.
• If you look at equation 19.3.5 a bit, you will see why the variance inflation factor is (1 + 1/H).
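To spell out that last remark, here is a minimal sketch of the calculation, under the simplifying assumption that the simulation errors ε̃_{ht} in (19.3.5) are i.i.d., independent of ε_t, and have the same variance σ²:

\mathrm{Var}\left(\varepsilon_t - \frac{1}{H}\sum_{h=1}^H \tilde{\varepsilon}_{ht}\right)
= \mathrm{Var}(\varepsilon_t) + \frac{1}{H^2}\sum_{h=1}^H \mathrm{Var}(\tilde{\varepsilon}_{ht})
= \sigma^2 + \frac{\sigma^2}{H}
= \left(1 + \frac{1}{H}\right)\sigma^2

so the variance of each moment condition, and therefore the asymptotic variance of the estimator, is scaled up by the factor (1 + 1/H), which shrinks toward one as H grows.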

19.4. Efficient method of moments (EMM)


The choice of which moments upon which to base a GMM estimator can
have very pronounced effects upon the efciency of the estimator.

A poor choice of moment conditions may lead to very inefcient es-

timators, and can even cause identication problems (as weve seen
with the GMM problem set).
The drawback of the above approach MSM is that the moment condi-

tions used in estimation are selected arbitrarily. The asymptotic efciency of the estimator may be low.
The asymptotically optimal choice of moments would be the score vector of the likelihood function,

d
eE   4RE
d
As before, this choice is unavailable.

The efcient method of moments (EMM) (see Gallant and Tauchen (1996),
Which Moments to Match?, ECONOMETRIC THEORY, Vol. 12, 1996, pages
657-681) seeks to provide moment conditions that closely mimic the score vector. If the approximation is very good, the resulting estimator will be very
nearly fully efcient.
The DGP is characterized by random sampling from the density

p eE 
d

p   g
d 
f

19.4. EFFICIENT METHOD OF MOMENTS (EMM)

423

We can define an auxiliary model, called the "score generator", which simply provides a (misspecified) parametric density

f(y \mid x_t, \lambda) \equiv f_t(\lambda)

• This density is known up to a parameter λ. We assume that this density function is calculable. Therefore quasi-ML estimation is possible. Specifically,

\hat{\lambda} = \arg\max_{\Lambda} \; s_n(\lambda) = \frac{1}{n}\sum_{t=1}^n \ln f_t(\lambda).

• After determining λ̂ we can calculate the score functions D_λ ln f(y_t | x_t, λ̂).
• The important point is that even if the density is misspecified, there is a pseudo-true λ⁰ for which the true expectation, taken with respect to the true but unknown density of y, p(y | x_t, θ⁰), and then marginalized over x, is zero:

\exists \lambda^0 : \; E_X E_{Y|X}\left[ D_\lambda \ln f(y \mid x, \lambda^0) \right] = \int_X \int_{Y|X} D_\lambda \ln f(y \mid x, \lambda^0)\, p(y \mid x, \theta^0)\, dy\, d\mu(x) = 0

• We have seen in the section on QML that λ̂ →p λ⁰; this suggests using the moment conditions

(19.4.1)   m_n(\theta, \hat{\lambda}) = \frac{1}{n}\sum_{t=1}^n \int D_\lambda \ln f_t(\hat{\lambda})\, p_t(\theta)\, dy

• These moment conditions are not calculable, since p_t(θ) is not available, but they are simulable using

\tilde{m}_n(\theta, \hat{\lambda}) = \frac{1}{n}\sum_{t=1}^n \frac{1}{H}\sum_{h=1}^H D_\lambda \ln f(\tilde{y}_t^h \mid x_t, \hat{\lambda})

where ỹ_t^h is a draw from DGP(θ), holding x_t fixed. By the LLN and the fact that λ̂ converges to λ⁰,

\tilde{m}_\infty(\theta^0, \lambda^0) = 0.

This is not the case for other values of θ, assuming that λ⁰ is identified.
• The advantage of this procedure is that if f(y_t | x_t, λ) closely approximates p(y | x_t, θ), then m̃_n(θ, λ̂) will closely approximate the optimal moment conditions which characterize maximum likelihood estimation, which is fully efficient.
• If one has prior information that a certain density approximates the data well, it would be a good choice for f(·).
• If one has no density in mind, there exist good ways of approximating unknown distributions parametrically: Philips' ERA's (Econometrica, 1983) and Gallant and Nychka's (Econometrica, 1987) SNP density estimator which we saw before. Since the SNP density is consistent, the efficiency of the indirect estimator is the same as the infeasible ML estimator.
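The structure of the simulated moment conditions m̃_n(θ, λ̂) can be sketched in a few lines of Octave. The functions simulate_y (a simulator of the model at θ, holding x fixed) and score_f (the auxiliary score D_λ ln f(y|x, λ), returned as an n × dim(λ) matrix) are hypothetical placeholders for user-supplied code, as is the function name itself.

% emm_moments.m: simulated EMM moment conditions
% theta: model parameters; lambda_hat: QML estimate of the score generator's parameter
% x: n x p conditioning variables; H: number of simulations per observation
function mbar = emm_moments(theta, lambda_hat, x, H)
  scores = 0;
  for h = 1:H
    ysim = simulate_y(theta, x);                 % hypothetical: draws from the model at theta
    scores += score_f(ysim, x, lambda_hat)/H;    % average the auxiliary scores over simulations
  endfor
  mbar = mean(scores)';                          % dim(lambda) x 1 moment conditions
endfunction

At θ = θ⁰ these averaged scores converge to zero by construction of the pseudo-true λ⁰, which is what identifies θ.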

19.4.1. Optimal weighting matrix. I will present the theory for H finite, and possibly small. This is done because it is sometimes impractical to estimate with H very large. Gallant and Tauchen give the theory for the case of H so large that it may be treated as infinite (the difference being irrelevant given the numerical precision of a computer). The theory for the case of H infinite follows directly from the results presented here.

The moment condition m̃(θ, λ̂) depends on the pseudo-ML estimate λ̂. We can apply Theorem 22 to conclude that

(19.4.2)   \sqrt{n}\left(\hat{\lambda} - \lambda^0\right) \overset{d}{\to} N\left[0,\; J(\lambda^0)^{-1} I(\lambda^0) J(\lambda^0)^{-1}\right]

If the density f(y_t | x_t, λ̂) were in fact the true density p(y | x_t, θ), then λ̂ would be the maximum likelihood estimator, and J(λ⁰)⁻¹I(λ⁰) would be an identity matrix, due to the information matrix equality. However, in the present case we assume that f(y_t | x_t, λ̂) is only an approximation to p(y | x_t, θ), so there is no cancellation.

Recall that J(λ⁰) ≡ plim ∂²s_n(λ⁰)/∂λ∂λ'. Comparing the definition of s_n(λ) with the definition of the moment condition in Equation 19.4.1, we see that

J(\lambda^0) = D_{\lambda'}\, m(\theta^0, \lambda^0).

As in Theorem 22,

I(\lambda^0) = \lim_{n\to\infty} E\left[ n\, \frac{\partial s_n(\lambda)}{\partial \lambda}\Big|_{\lambda^0} \frac{\partial s_n(\lambda)}{\partial \lambda'}\Big|_{\lambda^0} \right].

In this case, this is simply the asymptotic variance covariance matrix of the moment conditions, Ω. Now take a first order Taylor's series approximation to √n m_n(θ⁰, λ̂) about λ⁰:

\sqrt{n}\, \tilde{m}_n(\theta^0, \hat{\lambda}) = \sqrt{n}\, \tilde{m}_n(\theta^0, \lambda^0) + \sqrt{n}\, D_{\lambda'} \tilde{m}(\theta^0, \lambda^0)\left(\hat{\lambda} - \lambda^0\right) + o_p(1)

First consider √n m̃_n(θ⁰, λ⁰). It is straightforward but somewhat tedious to show that the asymptotic variance of this term is (1/H) I_∞(λ⁰).

Next consider the second term √n D_{λ'}m̃(θ⁰, λ⁰)(λ̂ − λ⁰). Note that D_{λ'}m̃_n(θ⁰, λ⁰) →a.s. J(λ⁰), so we have

\sqrt{n}\, D_{\lambda'} \tilde{m}(\theta^0, \lambda^0)\left(\hat{\lambda} - \lambda^0\right) = \sqrt{n}\, J(\lambda^0)\left(\hat{\lambda} - \lambda^0\right), \; a.s.

But noting equation 19.4.2,

\sqrt{n}\, J(\lambda^0)\left(\hat{\lambda} - \lambda^0\right) \overset{a}{\sim} N\left[0, I(\lambda^0)\right].

Now, combining the results for the first and second terms,

\sqrt{n}\, \tilde{m}_n(\theta^0, \hat{\lambda}) \overset{a}{\sim} N\left[0, \left(1 + \frac{1}{H}\right) I(\lambda^0)\right]

Suppose that Î(λ) is a consistent estimator of the asymptotic variance-covariance matrix of the moment conditions. This may be complicated if the score generator is a poor approximator, since the individual score contributions may not have mean zero in this case (see the section on QML). Even if this is the case, the individual means can be calculated by simulation, so it is always possible to consistently estimate I(λ⁰) when the model is simulable. On the other hand, if the score generator is taken to be correctly specified, the ordinary estimator of the information matrix is consistent. Combining this with the result on the efficient GMM weighting matrix in Theorem 25, we see that defining θ̂ as

\hat{\theta} = \arg\min_{\Theta} \; m_n(\theta, \hat{\lambda})' \left[\left(1 + \frac{1}{H}\right)\hat{I}(\hat{\lambda})\right]^{-1} m_n(\theta, \hat{\lambda})

is the GMM estimator with the efficient choice of weighting matrix.
• If one has used the Gallant-Nychka ML estimator as the auxiliary model, the appropriate weighting matrix is simply the information matrix of the auxiliary model, since the scores are uncorrelated. (E.g., it really is ML estimation asymptotically, since the score generator can approximate the unknown density arbitrarily well.)

19.4.2. Asymptotic distribution. Since we use the optimal weighting matrix, the asymptotic distribution is as in Equation 15.4.1, so we have (using the result in Equation 19.4.2):

\sqrt{n}\left(\hat{\theta} - \theta^0\right) \overset{d}{\to} N\left[0, \left(D_\infty \left[\left(1 + \frac{1}{H}\right) I(\lambda^0)\right]^{-1} D_\infty'\right)^{-1}\right]

where

D_\infty = \lim_{n\to\infty} E\left[ D_\theta\, m_n'(\theta^0, \lambda^0) \right].

This can be consistently estimated using

\hat{D} = D_\theta\, m_n'(\hat{\theta}, \hat{\lambda}).

19.4.3. Diagnostic testing. The fact that

\sqrt{n}\, m_n(\theta^0, \lambda^0) \overset{d}{\to} N\left[0, \left(1 + \frac{1}{H}\right) I(\lambda^0)\right]

implies that

n\, m_n(\hat{\theta}, \hat{\lambda})' \left[\left(1 + \frac{1}{H}\right)\hat{I}(\hat{\lambda})\right]^{-1} m_n(\hat{\theta}, \hat{\lambda}) \overset{d}{\to} \chi^2(q)

where q = dim(λ) − dim(θ), since without at least dim(θ) moment conditions the model is not identified, so testing is impossible. One test of the model is simply based on this statistic: if it exceeds the χ²(q) critical point, something may be wrong (the small sample performance of this sort of test would be a topic worth investigating).
• Information about what is wrong can be gotten from the pseudo-t-statistics:

\left(\mathrm{diag}\left[\left(1 + \frac{1}{H}\right)\hat{I}(\hat{\lambda})\right]\right)^{-1/2} \sqrt{n}\, m_n(\hat{\theta}, \hat{\lambda})

can be used to test which moments are not well modeled. Since these moments are related to parameters of the score generator, which are usually related to certain features of the model, this information can be used to revise the model. These aren't actually distributed as N(0,1), since √n m_n(θ⁰, λ̂) and √n m_n(θ̂, λ̂) have different distributions (that of √n m_n(θ̂, λ̂) is somewhat more complicated). It can be shown that the pseudo-t statistics are biased toward nonrejection. See Gourieroux et. al. or Gallant and Long, 1995, for more details.
19.5. Example: estimation of stochastic differential equations
It is often convenient to formulate theoretical models in terms of differential equations, and when the observation frequency is high (e.g., weekly, daily, hourly or real-time) it may be more natural to adopt this framework for econometric models of time series.

The most common approach to estimation of stochastic differential equations is to discretize the model, as above, and estimate using the discretized version. However, since the discretization is only an approximation to the true discrete-time version of the model (which is not calculable), the resulting estimator is in general biased and inconsistent.

An alternative is to use indirect inference: the discretized model is used as the score generator. That is, one estimates by QML to obtain the scores of the discretized approximation:

y_t = y_{t-1} + g(\phi, y_{t-1}) + h(\phi, y_{t-1})\, \varepsilon_t, \qquad \varepsilon_t \sim N(0, 1)

Indicate these scores by m_n(φ̂). Then the system of stochastic differential equations

dy_t = g(\theta, y_t)\, dt + h(\theta, y_t)\, dW_t

is simulated over θ, and the scores are calculated and averaged over the simulations

\tilde{m}_n(\theta, \hat{\phi}) = \frac{1}{H}\sum_{h=1}^H m_n^h(\theta, \hat{\phi})

θ̂ is chosen to set the simulated scores to zero

\tilde{m}_n(\hat{\theta}, \hat{\phi}) \equiv 0

(since θ and φ are of the same dimension).

This method requires simulating the stochastic differential equation. There are many ways of doing this. Basically, they involve doing very fine discretizations:

y_{t+\tau} = y_t + g(\theta, y_t)\, \tau + h(\theta, y_t)\, \sqrt{\tau}\, \nu_t, \qquad \nu_t \sim N(0, 1)

By setting the time step τ very small, the sequence of simulated increments approximates a Brownian motion fairly well.

This is only one method of using indirect inference for estimation of differential equations. There are others (see Gallant and Long, 1995 and Gourieroux et. al.). Use of a series approximation to the transitional density as in Gallant and Long is an interesting possibility since the score generator may have a higher dimensional parameter than the model, which allows for diagnostic testing. In the method described above the score generator's parameter φ is of the same dimension as is θ, so diagnostic testing is not possible.
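As an illustration of the fine discretization just described, here is a minimal Octave sketch of simulating a path of the SDE by the Euler scheme. The drift function g(theta, y) and diffusion function h(theta, y), like the function name itself, are hypothetical placeholders to be supplied by the user.

% euler_sim.m: simulate an SDE path by a fine Euler discretization
% theta: parameters; y0: initial condition; T: number of unit-length periods
% steps: fine sub-steps per period, so tau = 1/steps
function y = euler_sim(theta, y0, T, steps)
  tau = 1/steps;
  y = zeros(T, 1);
  ycurrent = y0;
  for t = 1:T
    for s = 1:steps
      nu = randn;   % standard normal shock
      ycurrent = ycurrent + g(theta, ycurrent)*tau + h(theta, ycurrent)*sqrt(tau)*nu;
    endfor
    y(t) = ycurrent;   % record the value at the end of each unit period
  endfor
endfunction

Each simulated path generated this way is then passed to the discretized model's score function, and the scores are averaged over the H simulations as above.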

CHAPTER 20

Parallel programming for econometrics


In this chapter we'll see how commonly used computations in econometrics can be done in parallel on a cluster of computers.

CHAPTER 21

Introduction to Octave
Why is Octave being used here, since its not that well-known by econometricians? Well, because it is a high quality environment that is easily extensible,
uses well-tested and high performance numerical libraries, it is licensed under
the GNU GPL, so you can get it for free and modify it if you like, and it runs
on both GNU/Linux, Mac OSX and Windows systems. Its also quite easy to
learn.
21.1. Getting started
Get the bootable CD, as was described in Section 1.3. Then burn the image, and boot your computer with it. This will give you this same PDF file, but with all of the example programs ready to run. The editor is configured with a macro to execute the programs using Octave, which is of course installed. From this point, I assume you are running the CD (or sitting in the computer room across the hall from my office), or that you have configured your computer to be able to run the *.m files mentioned below.
21.2. A short introduction
The objective of this introduction is to learn just the basics of Octave. There
are other ways to use Octave, which I encourage you to explore. These are just
some rudiments. After this, you can look at the example programs scattered
throughout the document (and edit them, and run them) to learn more about
how Octave can be used to do econometrics. Students of mine: your problem sets will include exercises that can be done by modifying the example programs in relatively minor ways. So study the examples!

Figure 21.2.1. Running an Octave program
Octave can be used interactively, or it can be used to run programs that are written using a text editor. We'll use this second method, preparing programs with NEdit, and calling Octave from within the editor. The program first.m gets us started. To run this, open it up with NEdit (by finding the correct file inside the /home/knoppix/Desktop/Econometrics folder and clicking on the icon) and then type CTRL-ALT-o, or use the Octave item in the Shell menu (see Figure 21.2.1).


Note that the output is not formatted in a pleasing way. That's because printf() doesn't automatically start a new line. Edit first.m so that the 8th line reads printf("hello world\n"); and re-run the program.
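The difference is easy to see interactively (a two-line sketch, not the actual contents of first.m):

printf("hello world");     % no newline: the next output continues on the same line
printf("hello world\n");   % the \n character ends the line, giving cleaner output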
We need to know how to load and save data. The program second.m shows how. Once you have run this, you will find the file x in the directory Econometrics/Include/OctaveIntro/. You might have a look at it with NEdit to see Octave's default format for saving data. Basically, if you have data in an ASCII text file, named for example myfile.data, formed of numbers separated by spaces, just use the command load myfile.data. After having done so, the matrix myfile (without extension) will contain the data.
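A minimal sketch of the round trip (the file name here is just an example, not one of the data sets distributed with the document):

x = rand(5, 3);             % some data to play with
save -ascii mydata.data x   % write it to disk as plain ASCII text
clear x
load mydata.data            % creates a matrix named "mydata" (extension is dropped)
disp(mydata)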
Please have a look at CommonOperations.m for examples of how to do some basic things in Octave. Now that we're done with the basics, have a look at the Octave programs that are included as examples. If you are looking at the browsable PDF version of this document, then you should be able to click on links to open them. If not, the example programs are available here and the support files needed to run these are available here. Those pages will allow you to examine individual files, out of context. To actually use these files (edit and run them), you should go to the home page of this document, since you will probably want to download the pdf version together with all the support files and examples. Or get the bootable CD.

There are some other resources for doing econometrics with Octave. You might like to check the article Econometrics with Octave and the Econometrics Toolbox, which is for Matlab, but much of which could be easily used with Octave.


21.3. If you're running a Linux installation...

Then to get the same behavior as found on the CD, you need to:
• Get the collection of support programs and the examples, from the document home page.
• Put them somewhere, and tell Octave how to find them, e.g., by putting a link to the MyOctaveFiles directory in /usr/local/share/octave/site-m
• Make sure nedit is installed and configured to run Octave and use syntax highlighting. Copy the file /home/econometrics/.nedit from the CD to do this. Or, get the file NeditConfiguration and save it in your $HOME directory with the name .nedit. Not to put too fine a point on it, please note that there is a period in that name.
• Associate *.m files with NEdit so that they open up in the editor when you click on them. That should do it.

CHAPTER 22

Notation and Review


All vectors will be column vectors, unless they have a transpose symbol (or I forget to apply this rule - your help catching typos and er0rors
vector,

) '
V"

vector. When I refer to a -vector, I mean a column vector.

R 

is a

is a


' )

is much appreciated). For example, if

22.1. Notation for differentiation of vectors and matrices


[3, Chapter 1]

Let s(·) : ℜ^K → ℜ be a real valued function of the K-vector θ. Then ∂s(θ)/∂θ is organized as a K-vector,

\frac{\partial s(\theta)}{\partial \theta} = \left[ \frac{\partial s(\theta)}{\partial \theta_1}, \; \frac{\partial s(\theta)}{\partial \theta_2}, \; \ldots, \; \frac{\partial s(\theta)}{\partial \theta_K} \right]'

Following this convention, ∂s(θ)/∂θ' is a 1 × K matrix, and ∂²s(θ)/∂θ∂θ' is a K × K matrix.

Exercise 33. For a and x both K-vectors, show that ∂a'x/∂x = a.

Let f(θ) : ℜ^K → ℜ^n be an n-vector valued function of the K-vector θ. Let f(θ)' be the 1 × n valued transpose of f. Then (∂f(θ)'/∂θ)' = ∂f(θ)/∂θ'.

Product rule: Let f(θ) : ℜ^K → ℜ^n and h(θ) : ℜ^K → ℜ^n be n-vector valued functions of the K-vector θ. Then

\frac{\partial}{\partial \theta'} \left[ h(\theta)' f(\theta) \right] = h' \frac{\partial f}{\partial \theta'} + f' \frac{\partial h}{\partial \theta'}

has dimension 1 × K. Applying the transposition rule we get

\frac{\partial}{\partial \theta} \left[ h(\theta)' f(\theta) \right] = \frac{\partial f'}{\partial \theta}\, h + \frac{\partial h'}{\partial \theta}\, f

which has dimension K × 1.

Exercise 34. For A a K × K matrix and x a K × 1 vector, show that ∂x'Ax/∂x = (A + A')x.

Chain rule: Let f(·) : ℜ^K → ℜ^n be an n-vector valued function of a K-vector argument, and let g() : ℜ^M → ℜ^K be a K-vector valued function of an M-vector valued argument ρ. Then

\frac{\partial}{\partial \rho'} f\left[g(\rho)\right] = \frac{\partial f(\theta)}{\partial \theta'}\Big|_{\theta = g(\rho)} \; \frac{\partial g(\rho)}{\partial \rho'}

has dimension n × M.

Exercise 35. For x and β both K × 1 vectors, show that ∂exp(x'β)/∂β = exp(x'β) x.
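As a quick illustration of these conventions, the rule in Exercise 33 can be checked numerically in a few lines of Octave (a self-contained sketch using crude forward differences; the numgradient function mentioned in the exercises at the end of this chapter can be used instead):

K = 4;
a = randn(K, 1);  x = randn(K, 1);
s = @(x) a'*x;                    % s(x) = a'x, a scalar-valued function of the K-vector x
h = 1e-6;  g = zeros(K, 1);
for i = 1:K
  e = zeros(K, 1);  e(i) = h;
  g(i) = (s(x + e) - s(x)) / h;   % forward-difference approximation to ds/dx_i
endfor
disp([g a])                       % the two columns should agree: ds/dx = a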

22.2. Convergence modes


Readings: [1, Chapter 4]; [4, Chapter 4].

We will consider several modes of convergence. The first three modes discussed are simply for background. The stochastic modes are those which will be used later in the course.

Definition 36. A sequence is a mapping from the natural numbers {1, 2, ...} = {n}_{n=1}^∞ = {n} to some other set, so that the set is ordered according to the natural numbers associated with its elements.

Real-valued sequences:

Definition 37. [Convergence] A real-valued sequence of vectors {a_n} converges to the vector a if for any ε > 0 there exists an integer N_ε such that for all n > N_ε, ||a_n − a|| < ε. a is the limit of a_n, written a_n → a.

Deterministic real-valued functions. Consider a sequence of functions {f_n(ω)} where f_n : Ω → T ⊆ ℜ. Ω may be an arbitrary set.

Definition 38. [Pointwise convergence] A sequence of functions {f_n(ω)} converges pointwise on Ω to the function f(ω) if for all ε > 0 and ω ∈ Ω there exists an integer N_{εω} such that |f_n(ω) − f(ω)| < ε for all n > N_{εω}.

It's important to note that N_{εω} depends upon ω, so that convergence may be much more rapid for certain ω than for others. Uniform convergence requires a similar rate of convergence throughout Ω.

Definition 39. [Uniform convergence] A sequence of functions {f_n(ω)} converges uniformly on Ω to the function f(ω) if for any ε > 0 there exists an integer N such that

\sup_{\omega \in \Omega} |f_n(\omega) - f(\omega)| < \varepsilon, \quad \text{for all } n > N.

(insert a diagram here showing the envelope around f(ω) in which f_n(ω) must lie)

Stochastic sequences. In econometrics, we typically deal with stochastic sequences. Given a probability space (Ω, F, P), recall that a random variable maps the sample space to the real line, i.e., X(ω) : Ω → ℜ. A sequence of random variables {X_n(ω)} is a collection of such mappings, i.e., each X_n(ω) is a random variable with respect to the probability space (Ω, F, P). For example, given the model y = Xβ⁰ + ε, the OLS estimator β̂_n = (X'X)⁻¹X'y, where n is the sample size, can be used to form a sequence of random vectors {β̂_n}.

A number of modes of convergence are in use when dealing with sequences of random variables. Several such modes of convergence should already be familiar:

Definition 40. [Convergence in probability] Let X_n(ω) be a sequence of random variables, and let X(ω) be a random variable. Let A_n(ε) = {ω : |X_n(ω) − X(ω)| > ε}. Then {X_n(ω)} converges in probability to X(ω) if lim_{n→∞} P(A_n(ε)) = 0 for any ε > 0. Convergence in probability is written as X_n →p X, or plim X_n = X.

Definition 41. [Almost sure convergence] Let X_n(ω) be a sequence of random variables, and let X(ω) be a random variable. Let A = {ω : lim_{n→∞} X_n(ω) = X(ω)}. Then {X_n(ω)} converges almost surely to X(ω) if P(A) = 1. In other words, X_n(ω) → X(ω) (ordinary convergence of the two functions) except on a set C = Ω − A such that P(C) = 0. Almost sure convergence is written as X_n →a.s. X, or X_n → X, a.s. One can show that almost sure convergence implies convergence in probability.

Definition 42. [Convergence in distribution] Let the r.v. X_n have distribution function F_n and the r.v. X have distribution function F. If F_n → F at every continuity point of F, then X_n converges in distribution to X. Convergence in distribution is written as X_n →d X. It can be shown that convergence in probability implies convergence in distribution.

Stochastic functions. Simple laws of large numbers (LLN's) allow us to directly conclude that β̂_n →a.s. β⁰ in the OLS example, since

\hat{\beta}_n = \beta^0 + \left(\frac{X'X}{n}\right)^{-1} \frac{X'\varepsilon}{n}

and X'ε/n →a.s. 0 by a SLLN. Note that this term is not a function of the parameter β. This easy proof is a result of the linearity of the model, which allows us to express the estimator in a way that separates parameters from random functions. In general, this is not possible. We often deal with the more complicated situation where the stochastic sequence depends on parameters in a manner that is not reducible to a simple sequence of random variables. In this case, we have a sequence of random functions that depend on θ: {X_n(ω, θ)}, where each X_n(ω, θ) is a random variable with respect to a probability space (Ω, F, P) and the parameter θ belongs to a parameter space Θ.

Definition 43. [Uniform almost sure convergence] {X_n(ω, θ)} converges uniformly almost surely in Θ to X(ω, θ) if

\lim_{n \to \infty} \sup_{\theta \in \Theta} \left\| X_n(\omega, \theta) - X(\omega, \theta) \right\| = 0 \quad (a.s.)

Implicit is the assumption that all X_n(ω, θ) and X(ω, θ) are random variables w.r.t. (Ω, F, P) for all θ ∈ Θ. We'll indicate uniform almost sure convergence by →u.a.s. and uniform convergence in probability by →u.p.

• An equivalent definition, based on the fact that "almost sure" means "with probability one", is

\Pr\left( \lim_{n \to \infty} \sup_{\theta \in \Theta} \left\| X_n(\omega, \theta) - X(\omega, \theta) \right\| = 0 \right) = 1

This has a form similar to that of the definition of a.s. convergence - the essential difference is the addition of the sup.

22.3. Rates of convergence and asymptotic equality


Its often useful to have notation for the relative magnitudes of quantities.
Quantities that are small relative to others can often be ignored, which simplies analysis.

be two real-valued functions.

1 
DH

is a nite constant.

means there exists some

xf

9f 

1
 

1
 

x
 9ff 


This denition doesnt require that

edly).

and

such that for

1


1
ED

where

be two real-valued functions.

1
H

means

D EFINITION 45. [Big-O] Let


The notation

The notation

and

8  9f

xf  f
1
DH
1
D

D EFINITION 44. [Little-o] Let

have a limit (it may uctuate bound-

22.3. RATES OF CONVERGENCE AND ASYMPTOTIC EQUALITY

are sequences of random variables analogous denitions

1


D EFINITION 46. The notation

)
tcT
)
xT  G R I X R

 (


 GVPpgdp R I X R  R I X R  d

x
8 T 9ff 
1 
HT 

are

and

f
g

If

442

means

E XAMPLE 47. The least squares estimator

8 ) P p
gtcT gd  d

8
G R I Q R gd

P p

and

I
y  y

Since plim

we can write

Asymptotically, the term

is negligible. This is just a

way of indicating that the LS estimator is consistent.

1
 D

 1

H
1
G)
D
1

8Gj) f
f
)
t f
 a
S

such that

)
gtcT

then

since, given


G

E XAMPLE 49. If

is a nite constant.

always some

S
d

where

and all

1 
EHT

such that for

means there exists some

D EFINITION 48. The notation

there is

Useful rules:

( T T  ( T T T 
1
1
1

( T $T T ( T S T T 
1 S
1
1 S

E XAMPLE 50. Consider a random sample of iid r.v.s with mean 0 and vari

 )
gtcT S d eI 1 D! jk d eI 1
8
p
p
B I
C fB 1 )  d

. The estimator of the mean

distributed, e.g.,

)
gxT  d

we had

is asymptotically normally

So

so

8 p
g eI 1$T S d

ance

Before

now we have have the stronger result that relates the rate of

convergence to the sample size.

443

E XAMPLE 51. Now consider a random sample of iid r.v.s with mean


gt)cT S h Qpd p 1 g  h
8
eI
I
BC fB 1 )  d

. The estimator of the mean

8) S
gxT T d

p 1
Qspd
eI

and variance

normally distributed, e.g.,


so

22.3. RATES OF CONVERGENCE AND ASYMPTOTIC EQUALITY

is asymptotically

So

so

 p 1
gg eI T

S
T

pd

These two examples show that averages of centered (mean zero) quantities typically have plim 0, while averages of uncentered quantities have nite

1
D

does not mean that

and

are of the same order. Asymptotic equality ensures that this is the case.

1
) DH
 D 3 
1

vious way.

and
S

f
4  f

Finally, analogous almost sure versions of

if

7f

asymptotically equal (written

and

f
g}

D EFINITION 52. Two sequences of random variables

1
D

nonzero plims. Note that the denition of

are

are dened in the ob-


Exercises
(1) For a and x both K × 1 vectors, show that ∂a'x/∂x = a.
(2) For A a K × K matrix and x a K × 1 vector, show that ∂x'Ax/∂x = (A + A')x.
(3) For x and β both K × 1 vectors, show that ∂exp(x'β)/∂β = exp(x'β) x.
(4) For x and β both K × 1 vectors, find the analytic expression for ∂²exp(x'β)/∂β∂β'.
(5) Write an Octave program that verifies each of the previous results by taking numeric derivatives. For a hint, type help numgradient and help numhessian inside octave.

CHAPTER 23

The GPL
This document and the associated examples and materials are copyright
Michael Creel, under the terms of the GNU General Public License. This license follows:
GNU GENERAL PUBLIC LICENSE Version 2, June 1991
Copyright (C) 1989, 1991 Free Software Foundation, Inc. 59 Temple Place,
Suite 330, Boston, MA 02111-1307 USA Everyone is permitted to copy and
distribute verbatim copies of this license document, but changing it is not allowed.
Preamble
The licenses for most software are designed to take away your freedom to
share and change it. By contrast, the GNU General Public License is intended
to guarantee your freedom to share and change free software--to make sure the
software is free for all its users. This General Public License applies to most
of the Free Software Foundation's software and to any other program whose
authors commit to using it. (Some other Free Software Foundation software is
covered by the GNU Library General Public License instead.) You can apply it
to your programs, too.
When we speak of free software, we are referring to freedom, not price. Our
General Public Licenses are designed to make sure that you have the freedom
to distribute copies of free software (and charge for this service if you wish),
that you receive source code or can get it if you want it, that you can change

the software or use pieces of it in new free programs; and that you know you
can do these things.
To protect your rights, we need to make restrictions that forbid anyone to
deny you these rights or to ask you to surrender the rights. These restrictions
translate to certain responsibilities for you if you distribute copies of the software, or if you modify it.
For example, if you distribute copies of such a program, whether gratis or
for a fee, you must give the recipients all the rights that you have. You must
make sure that they, too, receive or can get the source code. And you must
show them these terms so they know their rights.
We protect your rights with two steps: (1) copyright the software, and
(2) offer you this license which gives you legal permission to copy, distribute
and/or modify the software.
Also, for each author's protection and ours, we want to make certain that
everyone understands that there is no warranty for this free software. If the
software is modied by someone else and passed on, we want its recipients to
know that what they have is not the original, so that any problems introduced
by others will not reflect on the original authors' reputations.
Finally, any free program is threatened constantly by software patents. We
wish to avoid the danger that redistributors of a free program will individually
obtain patent licenses, in effect making the program proprietary. To prevent
this, we have made it clear that any patent must be licensed for everyone's
free use or not licensed at all.
The precise terms and conditions for copying, distribution and modication follow.


GNU GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
0. This License applies to any program or other work which contains a
notice placed by the copyright holder saying it may be distributed under the
terms of this General Public License. The "Program", below, refers to any such
program or work, and a "work based on the Program" means either the Program or any derivative work under copyright law: that is to say, a work containing the Program or a portion of it, either verbatim or with modications
and/or translated into another language. (Hereinafter, translation is included
without limitation in the term "modication".) Each licensee is addressed as
"you".
Activities other than copying, distribution and modication are not covered by this License; they are outside its scope. The act of running the Program is not restricted, and the output from the Program is covered only if its
contents constitute a work based on the Program (independent of having been
made by running the Program). Whether that is true depends on what the
Program does.
1. You may copy and distribute verbatim copies of the Program's source
code as you receive it, in any medium, provided that you conspicuously and
appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to
the absence of any warranty; and give any other recipients of the Program a
copy of this License along with the Program.
You may charge a fee for the physical act of transferring a copy, and you
may at your option offer warranty protection in exchange for a fee.


2. You may modify your copy or copies of the Program or any portion of
it, thus forming a work based on the Program, and copy and distribute such
modications or work under the terms of Section 1 above, provided that you
also meet all of these conditions:
a) You must cause the modied les to carry prominent notices stating that
you changed the les and the date of any change.
b) You must cause any work that you distribute or publish, that in whole
or in part contains or is derived from the Program or any part thereof, to be
licensed as a whole at no charge to all third parties under the terms of this
License.
c) If the modied program normally reads commands interactively when
run, you must cause it, when started running for such interactive use in the
most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying
that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License.
(Exception: if the Program itself is interactive but does not normally print such
an announcement, your work based on the Program is not required to print an
announcement.)
These requirements apply to the modied work as a whole. If identiable
sections of that work are not derived from the Program, and can be reasonably
considered independent and separate works in themselves, then this License,
and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole
which is a work based on the Program, the distribution of the whole must be


on the terms of this License, whose permissions for other licensees extend to
the entire whole, and thus to each and every part regardless of who wrote it.
Thus, it is not the intent of this section to claim rights or contest your rights
to work written entirely by you; rather, the intent is to exercise the right to
control the distribution of derivative or collective works based on the Program.
In addition, mere aggregation of another work not based on the Program
with the Program (or with a work based on the Program) on a volume of a
storage or distribution medium does not bring the other work under the scope
of this License.
3. You may copy and distribute the Program (or a work based on it, under
Section 2) in object code or executable form under the terms of Sections 1 and
2 above provided that you also do one of the following:
a) Accompany it with the complete corresponding machine-readable source
code, which must be distributed under the terms of Sections 1 and 2 above on
a medium customarily used for software interchange; or,
b) Accompany it with a written offer, valid for at least three years, to give
any third party, for a charge no more than your cost of physically performing
source distribution, a complete machine-readable copy of the corresponding
source code, to be distributed under the terms of Sections 1 and 2 above on a
medium customarily used for software interchange; or,
c) Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code
or executable form with such an offer, in accord with Subsection b above.)
The source code for a work means the preferred form of the work for making modications to it. For an executable work, complete source code means


all the source code for all modules it contains, plus any associated interface
denition les, plus the scripts used to control compilation and installation of
the executable. However, as a special exception, the source code distributed
need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the
operating system on which the executable runs, unless that component itself
accompanies the executable.
If distribution of executable or object code is made by offering access to
copy from a designated place, then offering equivalent access to copy the
source code from the same place counts as distribution of the source code,
even though third parties are not compelled to copy the source along with the
object code.
4. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy,
modify, sublicense or distribute the Program is void, and will automatically
terminate your rights under this License. However, parties who have received
copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance.
5. You are not required to accept this License, since you have not signed
it. However, nothing else grants you permission to modify or distribute the
Program or its derivative works. These actions are prohibited by law if you do
not accept this License. Therefore, by modifying or distributing the Program
(or any work based on the Program), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or
modifying the Program or works based on it.


6. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor
to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients exercise
of the rights granted herein. You are not responsible for enforcing compliance
by third parties to this License.
7. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict
the conditions of this License, they do not excuse you from the conditions of
this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Program at all. For example, if a patent
license would not permit royalty-free redistribution of the Program by all those
who receive copies directly or indirectly through you, then the only way you
could satisfy both it and this License would be to refrain entirely from distribution of the Program.
If any portion of this section is held invalid or unenforceable under any
particular circumstance, the balance of the section is intended to apply and the
section as a whole is intended to apply in other circumstances.
It is not the purpose of this section to induce you to infringe any patents or
other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system, which is implemented by public license practices. Many people
have made generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that system; it is


up to the author/donor to decide if he or she is willing to distribute software


through any other system and a licensee cannot impose that choice.
This section is intended to make thoroughly clear what is believed to be a
consequence of the rest of this License. 8. If the distribution and/or use of the
Program is restricted in certain countries either by patents or by copyrighted
interfaces, the original copyright holder who places the Program under this License may add an explicit geographical distribution limitation excluding those
countries, so that distribution is permitted only in or among countries not thus
excluded. In such case, this License incorporates the limitation as if written in
the body of this License.
9. The Free Software Foundation may publish revised and/or new versions
of the General Public License from time to time. Such new versions will be
similar in spirit to the present version, but may differ in detail to address new
problems or concerns.
Each version is given a distinguishing version number. If the Program
species a version number of this License which applies to it and "any later
version", you have the option of following the terms and conditions either of
that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of this License, you
may choose any version ever published by the Free Software Foundation.
10. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask
for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions
for this. Our decision will be guided by the two goals of preserving the free


status of all derivatives of our free software and of promoting the sharing and
reuse of software generally.
NO WARRANTY
11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE
IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED
BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING
THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE
PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE
DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
REPAIR OR CORRECTION.
12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED
TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY
WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED
INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR
A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED
OF THE POSSIBILITY OF SUCH DAMAGES.
END OF TERMS AND CONDITIONS


How to Apply These Terms to Your New Programs


If you develop a new program, and you want it to be of the greatest possible
use to the public, the best way to achieve this is to make it free software which
everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest to attach
them to the start of each source le to most effectively convey the exclusion of
warranty; and each le should have at least the "copyright" line and a pointer
to where the full notice is found.
<one line to give the program's name and a brief idea of what it does.>
Copyright (C) 19yy <name of author>
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your option) any
later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public
License for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
Also add information on how to contact you by electronic and paper mail.
If the program is interactive, make it output a short notice like this when it
starts in an interactive mode:
Gnomovision version 69, Copyright (C) 19yy name of author Gnomovision
comes with ABSOLUTELY NO WARRANTY; for details type show w. This is


free software, and you are welcome to redistribute it under certain conditions;
type show c for details.
The hypothetical commands show w and show c should show the appropriate parts of the General Public License. Of course, the commands you
use may be called something other than show w and show c; they could
even be mouse-clicks or menu itemswhatever suits your program.
You should also get your employer (if you work as a programmer) or your
school, if any, to sign a "copyright disclaimer" for the program, if necessary.
Here is a sample; alter the names:
Yoyodyne, Inc., hereby disclaims all copyright interest in the program Gnomovision (which makes passes at compilers) written by James Hacker.
<signature of Ty Coon>, 1 April 1989 Ty Coon, President of Vice
This General Public License does not permit incorporating your program
into proprietary programs. If your program is a subroutine library, you may
consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Library General Public
License instead of this License.

CHAPTER 24

The attic

This holds material that is not really ready to be incorporated into the main body, but that I don't want to lose. Basically, ignore it, unless you'd like to help get it ready for inclusion.

The GMM estimator, briefly

The OLS estimator can be thought of as a method of moments estimator. With weak exogeneity, E(x_t ε_t) = 0, so E[x_t (y_t − x_t'β)] = 0. The idea of the MM estimator is to choose the estimator to make the sample counterpart hold:

\frac{1}{n}\sum_{t=1}^n x_t \left(y_t - x_t'\hat{\beta}\right) = \frac{1}{n} X'\left(y - X\hat{\beta}\right) = 0 \quad \Rightarrow \quad \hat{\beta} = (X'X)^{-1} X'y

This means of deriving the formula requires no calculus. It provides another interpretation of how the OLS estimator is defined.

We can perhaps think of other variables that are not correlated with ε, say w. This may be needed if the weak exogeneity assumption fails for x. Let us assume that we have instruments w_t that satisfy E(w_t ε_t) = 0. If the dimension of w_t is greater than that of β, then we have more un...

24.1. MEPS data: more on count models


Note to self: this chapter is yet to be converted to use Octave.

To check the plausibility of the Poisson model, we can compare the sample unconditional variance with the estimated unconditional variance according to the Poisson model: \hat{V}(y) = \frac{1}{n}\sum_{t=1}^n \hat{\lambda}_t. For OBDV and ERV, we get

Table 1. Marginal Variances, Sample and Estimated (Poisson)
             OBDV     ERV
Sample       37.446   0.30614
Estimated    3.4540   0.19060

We see that even after conditioning, the overdispersion is not captured in either case. There is a huge problem with OBDV, and a significant problem with ERV. In both cases the Poisson model does not appear to be plausible.

24.1.1. Infinite mixture models. Reference: Cameron and Trivedi (1998) Regression analysis of count data, chapter 4.

The two measures seem to exhibit extra-Poisson variation. To capture unobserved heterogeneity, a possibility is the random parameters approach. Consider the possibility that the constant term in a Poisson model were random:

f_Y(y|x, \varepsilon) = \frac{\exp(-\theta)\,\theta^y}{y!}, \qquad \theta = \exp(x'\beta + \varepsilon) = \exp(x'\beta)\exp(\varepsilon) = \lambda\,\nu

where λ = exp(x'β) and ν = exp(ε). Here ν captures the randomness in the constant. The problem is that we don't observe ν, so we will need to marginalize it to get a usable density

f_Y(y|x) = \int \frac{\exp(-\theta)\,\theta^y}{y!}\, f_\nu(z)\, dz

This density can be used directly, perhaps using numerical integration to evaluate the likelihood function. In some cases, though, the integral will have an analytic solution. For example, if ν follows a certain one parameter gamma density, then

(24.1.1)   f_Y(y|x, \phi) = \frac{\Gamma(y + \psi)}{\Gamma(y + 1)\,\Gamma(\psi)} \left(\frac{\psi}{\psi + \lambda}\right)^{\psi} \left(\frac{\lambda}{\psi + \lambda}\right)^{y}

where φ = (λ, ψ). ψ appears since it is the parameter of the gamma density.
• For this density, E(y|x) = λ, which we have parameterized λ = exp(x'β).
• The variance depends upon how ψ is parameterized.
  – If ψ = λ/α, where α > 0, then V(y|x) = λ + αλ. Note that λ is a function of x, so that the variance is too. This is referred to as the NB-I model.
  – If ψ = 1/α, where α > 0, then V(y|x) = λ + αλ². This is referred to as the NB-II model.
So both forms of the NB model allow for overdispersion, with the NB-II model allowing for a more radical form.
• Testing reduction of a NB model to a Poisson model cannot be done by testing α = 0 using standard Wald or LR procedures. The critical values need to be adjusted to account for the fact that α = 0 is on the boundary of the parameter space. Without getting into details, suppose that the data were in fact Poisson, so there is equidispersion and the true α = 0. Then about half the time the sample data will be underdispersed, and about half the time overdispersed. When the data is underdispersed, the MLE of α will be α̂ = 0. Thus, under the null, there will be a probability spike in the asymptotic distribution of √n α̂ at 0, so standard testing methods will not be valid.
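For concreteness, here is a minimal Octave sketch of the log of the density (24.1.1) for a single observation under the NB-II parameterization ψ = 1/α. The function is a hypothetical helper (not one of the distributed example programs); it uses Octave's gammaln for the log-gamma terms.

% nb2_logdensity.m: log-density of one count observation under the NB-II model
% y: count; x: 1 x k row of regressors; beta: k x 1; alpha > 0 (overdispersion)
function logf = nb2_logdensity(y, x, beta, alpha)
  lambda = exp(x*beta);      % conditional mean, E(y|x)
  psi = 1/alpha;             % NB-II: V(y|x) = lambda + alpha*lambda^2
  logf = gammaln(y + psi) - gammaln(y + 1) - gammaln(psi) ...
         + psi*log(psi/(psi + lambda)) + y*log(lambda/(psi + lambda));
endfunction

Summing these contributions over the sample gives the log-likelihood maximized in the negbin results below; the NB-I case only changes the line defining psi to psi = lambda/alpha.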

Here are NB-I estimation results for OBDV, obtained using this estimation program.

MEPS data, OBDV
negbin results
Strong convergence
Observations = 500
Function value        -2.2656

t-Stats
              params       t(OPG)      t(Sand.)    t(Hess)
constant     -0.055766    -0.16793    -0.17418    -0.17215
pub_ins       0.47936      2.9406      2.8296      2.9122
priv_ins      0.20673      1.3847      1.4201      1.4086
sex           0.34916      3.2466      3.4148      3.3434
age           0.015116     3.3569      3.8055      3.5974
educ          0.014637     0.78661     0.67910     0.73757
inc           0.012581     0.60022     0.93782     0.76330
ln_alpha      1.7389      23.669      11.295      16.660

Information Criteria
Consistent Akaike     2323.3
Schwartz              2315.3
Hannan-Quinn          2294.8
Akaike                2281.6

Here are NB-II results for OBDV

*********************************************************************
MEPS data, OBDV
negbin results
Strong convergence
Observations = 500
Function value        -2.2616

t-Stats
              params       t(OPG)      t(Sand.)    t(Hess)
constant     -0.65981     -1.8913     -1.4717     -1.6977
pub_ins       0.68928      2.9991      3.1825      3.1436
priv_ins      0.22171      1.1515      1.2057      1.1917
sex           0.44610      3.8752      2.9768      3.5164
age           0.024221     3.8193      4.5236      4.3239
educ          0.020608     0.94844     0.74627     0.86004
inc           0.020040     0.87374     0.72569     0.86579
ln_alpha      0.47421      5.6622      4.6278      5.6281

Information Criteria
Consistent Akaike     2319.3
Schwartz              2311.3
Hannan-Quinn          2290.8
Akaike                2277.6

*********************************************************************

• For the OBDV model, the NB-II model does a better job, in terms of the average log-likelihood and the information criteria.
• Note that both versions of the NB model fit much better than does the Poisson model.
• The t-statistics are now similar for all three ways of calculating them, which might indicate that the serious specification problems of the Poisson model for the OBDV data are partially solved by moving to the NB model.
• The estimated α is highly significant.

To check the plausibility of the NB-II model, we can compare the sample unconditional variance with the estimated unconditional variance according to the NB-II model: \hat{V}(y) = \frac{1}{n}\sum_{t=1}^n \left(\hat{\lambda}_t + \hat{\alpha}\hat{\lambda}_t^2\right). For OBDV and ERV (estimation results not reported), we get

Table 2. Marginal Variances, Sample and Estimated (NB-II)
             OBDV     ERV
Sample       37.446   0.30614
Estimated    26.962   0.27620

The overdispersion problem is significantly better than in the Poisson case, but there is still some overdispersion that is not captured, for both OBDV and ERV.

24.2. Hurdle models


Returning to the Poisson model, let's look at actual and fitted count probabilities. Actual relative frequencies are p(y = j) = \frac{1}{n}\sum_t 1(y_t = j) and fitted frequencies are \hat{p}(y = j) = \frac{1}{n}\sum_{t=1}^n f_Y(j | x_t, \hat{\theta}).

Table 3. Actual and Poisson fitted frequencies
             OBDV               ERV
Count    Actual   Fitted    Actual   Fitted
0        0.32     0.06      0.86     0.83
1        0.18     0.15      0.10     0.14
2        0.11     0.19      0.02     0.02
3        0.10     0.18      0.004    0.002
4        0.052    0.15      0.002    0.0002
5        0.032    0.10      0        2.4e-5

We see that for the OBDV measure, there are many more actual zeros than predicted. For ERV, there are somewhat more actual zeros than fitted, but the difference is not too important.
Why might OBDV not fit the zeros well? What if, when people made the decision to contact the doctor for a first visit, they are sick, and then the doctor decides on whether or not follow-up visits are needed? This is a principal/agent type situation, where the total number of visits depends upon the decision of both the patient and the doctor. Since different parameters may govern the two decision-makers' choices, we might expect that different parameters govern the probability of zeros versus the other counts. Let λ_p be the parameters of the patient's demand for visits, and let λ_d be the parameter of the doctor's demand for visits. The patient will initiate visits according to a discrete choice model, for example, a logit model:

\Pr(y = 0) = f_Y(0) = \frac{1}{1 + \exp(\lambda_p)}, \qquad \Pr(y > 0) = \frac{\exp(\lambda_p)}{1 + \exp(\lambda_p)}

The above probabilities are used to estimate the binary 0/1 hurdle process. Then, for the observations where visits are positive, a truncated Poisson density is estimated. This density is

f_Y(y \mid y > 0) = \frac{f_Y(y)}{\Pr(y > 0)} = \frac{\exp(-\lambda_d)\,\lambda_d^y}{y!\left[1 - \exp(-\lambda_d)\right]}

since according to the Poisson model with the doctor's parameters, Pr(y = 0) = exp(−λ_d).

Since the hurdle and truncated components of the overall density for Y share no parameters, they may be estimated separately, which is computationally more efficient than estimating the overall model. (Recall that the BFGS algorithm, for example, will have to invert the approximated Hessian. The computational overhead is of order K², where K is the number of parameters to be estimated.) The expectation of Y is

E(Y|x) = \Pr(y > 0 | x)\, E(y \mid y > 0, x) = \frac{\exp(\lambda_p)}{1 + \exp(\lambda_p)} \cdot \frac{\lambda_d}{1 - \exp(-\lambda_d)}
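To illustrate the two-part structure, here is a minimal Octave sketch of one observation's contribution to the two log-likelihoods. It is a hypothetical helper (the name and interface are illustrative): lambda_p and lambda_d are the patient's logit index and the doctor's Poisson mean, parameterized as above, and the two parts can be maximized separately since they share no parameters.

% hurdle_contrib.m: log-likelihood contributions of one observation
% y: observed count; lambda_p: logit index of the hurdle; lambda_d: Poisson mean
function [ll_logit, ll_trunc] = hurdle_contrib(y, lambda_p, lambda_d)
  p0 = 1/(1 + exp(lambda_p));       % Pr(y = 0) from the logit hurdle
  if y == 0
    ll_logit = log(p0);
    ll_trunc = 0;                   % zeros do not enter the truncated Poisson part
  else
    ll_logit = log(1 - p0);
    % truncated Poisson density f_Y(y)/Pr(y > 0), y = 1, 2, ...
    ll_trunc = -lambda_d + y*log(lambda_d) - gammaln(y + 1) - log(1 - exp(-lambda_d));
  endif
endfunction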

Here are hurdle Poisson estimation results for OBDV, obtained from this estimation program

*********************************************************************
MEPS data, OBDV
logit results
Strong convergence
Observations = 500
Function value        -0.58939

t-Stats
              params       t(OPG)      t(Sand.)    t(Hess)
constant     -1.5502      -2.5709     -2.5269     -2.5560
pub_ins       1.0519       3.0520      3.0027      3.0384
priv_ins      0.45867      1.7289      1.6924      1.7166
sex           0.63570      3.0873      3.1677      3.1366
age           0.018614     2.1547      2.1969      2.1807
educ          0.039606     1.0467      0.98710     1.0222
inc           0.077446     1.7655      2.1672      1.9601

Information Criteria
Consistent Akaike     639.89
Schwartz              632.89
Hannan-Quinn          614.96
Akaike                603.39

*********************************************************************


The results for the truncated part:

*********************************************************************
MEPS data, OBDV
tpoisson results
Strong convergence
Observations = 500
Function value        -2.7042

t-Stats
              params        t(OPG)      t(Sand.)    t(Hess)
constant      0.54254       7.4291      1.1747      3.2323
pub_ins       0.31001       6.5708      1.7573      3.7183
priv_ins      0.014382      0.29433     0.10438     0.18112
sex           0.19075      10.293       1.1890      3.6942
age           0.016683     16.148       3.5262      7.9814
educ          0.016286      4.2144      0.56547     1.6353
inc          -0.0079016    -2.3186     -0.35309    -0.96078

Information Criteria
Consistent Akaike     2754.7
Schwartz              2747.7
Hannan-Quinn          2729.8
Akaike                2718.2

*********************************************************************


Fitted and actual probabilities (NB-II fits are provided as well) are:


TABLE 4. Actual and Hurdle Poisson fitted frequencies

                     OBDV                               ERV
Count   Actual  Fitted HP  Fitted NB-II     Actual  Fitted HP  Fitted NB-II
  0      0.32     0.32        0.34           0.86     0.86        0.86
  1      0.18     0.035       0.16           0.10     0.10        0.10
  2      0.11     0.071       0.11           0.02     0.02        0.02
  3      0.10     0.10        0.08           0.004    0.006       0.006
  4      0.052    0.11        0.06           0.002    0.002       0.002
  5      0.032    0.10        0.05           0        0.0005      0.001

For the Hurdle Poisson models, the ERV fit is very accurate. The OBDV fit is not so good. Zeros are exact, but 1's and 2's are underestimated, and higher counts are overestimated. For the NB-II fits, performance is at least as good as the hurdle Poisson model, and one should recall that many fewer parameters are used. Hurdle versions of the negative binomial model are also widely used.

24.2.1. Finite mixture models. The finite mixture approach to fitting health care demand was introduced by Deb and Trivedi (1997). The mixture approach has the intuitive appeal of allowing for subgroups of the population with different health status. If individuals are classified as healthy or unhealthy then two subgroups are defined. A finer classification scheme would lead to more subgroups. Many studies have incorporated objective and/or subjective indicators of health status in an effort to capture this heterogeneity. The available objective measures, such as limitations on activity, are not necessarily very informative about a person's overall health status. Subjective, self-reported measures may suffer from the same problem, and may also not be exogenous.


Finite mixture models are conceptually simple. The density is
\[
f_Y(y, \phi_1, \ldots, \phi_p, \pi_1, \ldots, \pi_{p-1})
  = \sum_{i=1}^{p-1} \pi_i f_Y^{(i)}(y, \phi_i) + \pi_p f_Y^{(p)}(y, \phi_p),
\]
where $\pi_i > 0$, $i = 1, 2, \ldots, p$, and $\pi_p = 1 - \sum_{i=1}^{p-1}\pi_i$. Identification requires that the $\pi_i$ are ordered in some way, for example, $\pi_1 \geq \pi_2 \geq \cdots \geq \pi_p$ and $\phi_i \neq \phi_j$, $i \neq j$. This is simple to accomplish post-estimation by rearrangement and possible elimination of redundant component densities.

The properties of the mixture density follow in a straightforward way from those of the components. In particular, the moment generating function is the same mixture of the moment generating functions of the component densities, so, for example, $E(Y \mid x) = \sum_{i=1}^{p} \pi_i \mu_i(x)$, where $\mu_i(x)$ is the mean of the $i$-th component density.
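To make this concrete, here is a tiny Octave sketch that evaluates a two-component mixture density and its mean. Poisson components are used only to keep the example short (the estimates below use NB-I components), and the parameter values are made up.

% Two-component Poisson mixture, evaluated at the counts 0..5.
lambda = [1.5; 6.0];          % component means (hypothetical values)
pi1 = 0.7;                    % mixing probability on the first component
y  = (0:5)';
f1 = exp(-lambda(1)) * lambda(1) .^ y ./ factorial(y);
f2 = exp(-lambda(2)) * lambda(2) .^ y ./ factorial(y);
fmix = pi1 * f1 + (1 - pi1) * f2              % mixture probabilities
Ey = pi1 * lambda(1) + (1 - pi1) * lambda(2)  % E(Y): the same mixture of the means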

Mixture densities may suffer from overparameterization, since the total number of parameters grows rapidly with the number of component densities. It is possible to constrain parameters across the mixtures.

Testing for the number of component densities is a tricky issue. For example, testing for $p = 1$ (a single component, which is to say, no mixture) versus $p = 2$ (a mixture of two components) involves the restriction $\pi_1 = 1$, which is on the boundary of the parameter space. Note that when $\pi_1 = 1$, the parameters of the second component can take on any value without affecting the density. Usual methods such as the likelihood ratio test are not applicable when parameters are on the boundary under the null hypothesis. Information criteria, as a means of choosing the model (see below), are valid.


The following are results for a mixture of 2 negative binomial (NB-I) models, for the OBDV data, which you can replicate using this estimation program:


*********************************************************************
MEPS data, OBDV
mixnegbin results
Strong convergence
Observations = 500
Function value    -2.2312

t-Stats
                  params      t(OPG)    t(Sand.)     t(Hess)
constant          0.64852     1.3851      1.3226      1.4358
pub_ins          -0.062139   -0.23188    -0.13802    -0.18729
priv_ins          0.093396    0.46948     0.33046     0.40854
sex               0.39785     2.6121      2.2148      2.4882
age               0.015969    2.5173      2.5475      2.7151
educ             -0.049175   -1.8013     -1.7061     -1.8036
inc               0.015880    0.58386     0.76782     0.73281
ln_alpha          0.69961     2.3456      2.0396      2.4029
constant         -3.6130     -1.6126     -1.7365     -1.8411
pub_ins           2.3456      1.7527      3.7677      2.6519
priv_ins          0.77431     0.73854     1.1366      0.97338
sex               0.34886     0.80035     0.74016     0.81892
age               0.021425    1.1354      1.3032      1.3387
educ              0.22461     2.0922      1.7826      2.1470
inc               0.019227    0.20453     0.40854     0.36313
ln_alpha          2.8419      6.2497      6.8702      7.6182
logit_inv_mix     0.85186     1.7096      1.4827      1.7883

Information Criteria
Consistent Akaike    2353.8
Schwartz             2336.8
Hannan-Quinn         2293.3
Akaike               2265.2
*********************************************************************
Delta method for mix parameter st. err.
        mix      se_mix
    0.70096     0.12043

The 95% confidence interval for the mix parameter is perilously close to 1, which suggests that there may really be only one component density, rather than a mixture. Again, this is not the way to test this; it is merely suggestive.

Education is interesting. For the subpopulation that is healthy, i.e., that makes relatively few visits, education seems to have a positive effect on visits. For the unhealthy group, education has a negative effect on visits. The other results are more mixed. A larger sample could help clarify things.
The following are results for a 2 component constrained mixture negative binomial model, where all the slope parameters in $\lambda_j = \exp(x'\beta_j)$ are the same across the two components. The constants and the overdispersion parameters are allowed to differ for the two components.


*********************************************************************
MEPS data, OBDV
cmixnegbin results
Strong convergence
Observations = 500
Function value    -2.2441

t-Stats
                  params      t(OPG)    t(Sand.)     t(Hess)
constant         -0.34153    -0.94203    -0.91456    -0.97943
pub_ins           0.45320     2.6206      2.5088      2.7067
priv_ins          0.20663     1.4258      1.3105      1.3895
sex               0.37714     3.1948      3.4929      3.5319
age               0.015822    3.1212      3.7806      3.7042
educ              0.011784    0.65887     0.50362     0.58331
inc               0.014088    0.69088     0.96831     0.83408
ln_alpha          1.1798      4.6140      7.2462      6.4293
const_2           1.2621      0.47525     2.5219      1.5060
lnalpha_2         2.7769      1.5539      6.4918      4.2243
logit_inv_mix     2.4888      0.60073     3.7224      1.9693

Information Criteria
Consistent Akaike    2323.5
Schwartz             2312.5
Hannan-Quinn         2284.3
Akaike               2266.1
*********************************************************************
Delta method for mix parameter st. err.
        mix      se_mix
    0.92335    0.047318

Now the mixture parameter is even closer to 1.


The slope parameter estimates are pretty close to what we got with the
NB-I model.




24.2.2. Comparing models using information criteria. A Poisson model can't be tested (using standard methods) as a restriction of a negative binomial model. Testing for collapse of a finite mixture to a mixture of fewer components has the same problem. How can we determine which of competing models is the best?

The information criteria approach is one possibility. Information criteria are functions of the log-likelihood, with a penalty for the number of parameters used. Three popular information criteria are the Akaike (AIC), Bayes (BIC) and consistent Akaike (CAIC). The formulae are
\begin{align*}
CAIC &= -2\ln L(\hat{\theta}) + k(\ln n + 1) \\
BIC  &= -2\ln L(\hat{\theta}) + k\ln n \\
AIC  &= -2\ln L(\hat{\theta}) + 2k
\end{align*}
where $k$ is the number of parameters and $n$ is the sample size. It can be shown that the CAIC and BIC will select the correctly specified model from a group of models, asymptotically. This doesn't mean, of course, that the correct model is necessarily in the group.


The AIC is not consistent, and will asymptotically favor an over-parameterized model over the correctly specified model. Here are information criteria values for the models we've seen, for OBDV.

TABLE 5. Information Criteria, OBDV

Model              AIC     BIC    CAIC
Poisson           3822    3911    3918
NB-I              2282    2315    2323
Hurdle Poisson    3333    3381    3395
MNB-I             2265    2337    2354
CMNB-I            2266    2312    2323

According to the AIC, the best is the MNB-I, which has relatively many parameters. The best according to the BIC is CMNB-I, and according to CAIC, the best is NB-I. The Poisson-based models do not do well.
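As a quick numerical check of these formulae against the reported output, here is a minimal Octave sketch; it assumes that the "Function value" printed by the estimation programs is the average log-likelihood over the n observations.

% Reproducing (approximately) the information criteria for the MNB-I model.
n = 500;             % observations
k = 17;              % parameters in the 2-component NB-I mixture
meanlogL = -2.2312;  % reported function value (assumed mean log-likelihood)
logL = n * meanlogL;
AIC  = -2 * logL + 2 * k             % approx. 2265
BIC  = -2 * logL + k * log(n)        % approx. 2337
CAIC = -2 * logL + k * (log(n) + 1)  % approx. 2354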
24.3. Models for time series data
This section can be ignored in its present form. Just left in to form a basis
for completion (by someone else ?!) at some point.
Hamilton, Time Series Analysis is a good reference for this section. This is
very incomplete and contributions would be very welcome.

dependent variables, e.g.,

gf

consider the behavior of

as a

These variables can of course contain lagged

8 f 8 
!@A88@9I f!UF  
8
c 

function of other variables

4f

Up to now weve considered the behavior of the dependent variable

Pure time series methods

as a function only of its own lagged values, un-

conditional on other observable variables. One can think of this as modeling

gf

the behavior of

after marginalizing out all other variables. While its not

immediately clear why a model that has other explanatory variables should
marginalize to a linear in the parameters time series model, most time series

24.3. MODELS FOR TIME SERIES DATA

475

work is done with linear models, though nonlinear time series is also a large
and growing eld. Well stick with linear time series models.

24.3.1. Basic concepts.

DEFINITION 53 (Stochastic process). A stochastic process is a sequence of random variables, indexed by time:

(24.3.1)
\[
\{Y_t\}_{t=-\infty}^{\infty}
\]

DEFINITION 54 (Time series). A time series is one observation of a stochastic process, over a specific interval:

(24.3.2)
\[
\{y_t\}_{t=1}^{n}
\]

So a time series is a sample of size $n$ from a stochastic process. It's important to keep in mind that conceptually, one could draw another sample, and that the values would be different.


DEFINITION 55 (Autocovariance). The $k$-th autocovariance of a stochastic process is

(24.3.3)
\[
\gamma_{kt} = E\left[(y_t - \mu_t)(y_{t-k} - \mu_{t-k})\right]
\]
where $\mu_t = E(y_t)$.

DEFINITION 56 (Covariance (weak) stationarity). A stochastic process is covariance stationary if it has time constant mean and autocovariances of all orders:
\begin{align*}
\mu_t &= \mu, \quad \forall t \\
\gamma_{kt} &= \gamma_k, \quad \forall t
\end{align*}
As we've seen, this implies that $\gamma_{kt} = \gamma_k$: the autocovariances depend only on the interval between observations, but not on the time of the observations.
DEFINITION 57 (Strong stationarity). A stochastic process is strongly stationary if the joint distribution of an arbitrary collection of the $y_t$ doesn't depend on $t$.

Since moments are determined by the distribution, strong stationarity implies weak stationarity.

What is the mean of $y_t$? The time series is one sample from the stochastic process. One could think of $M$ repeated samples from the stochastic process, e.g., $\{y_t^m\}$, $m = 1, 2, \ldots, M$. By a LLN, we would expect that
\[
\frac{1}{M}\sum_{m=1}^{M} y_t^m \overset{p}{\longrightarrow} E(y_t)
\]
as $M \to \infty$. The problem is, we have only one sample to work with, since we can't go back in time and collect another. How can $E(y_t)$ be estimated then? It turns out that ergodicity is the needed property.

DEFINITION 58 (Ergodicity). A stationary stochastic process is ergodic (for the mean) if the time average converges to the mean:

(24.3.4)
\[
\frac{1}{n}\sum_{t=1}^{n} y_t \overset{p}{\longrightarrow} \mu
\]

A sufficient condition for ergodicity is that the autocovariances be absolutely summable:
\[
\sum_{k=0}^{\infty} |\gamma_k| < \infty.
\]
This implies that the autocovariances die off, so that the $y_t$ are not so strongly dependent that they don't satisfy a LLN.

DEFINITION 59 (Autocorrelation). The $k$-th autocorrelation, $\rho_k$, is just the $k$-th autocovariance divided by the variance:

(24.3.5)
\[
\rho_k = \frac{\gamma_k}{\gamma_0}
\]
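As a brief numerical illustration of ergodicity, time averages computed from a single long realization approximate the corresponding population moments. A minimal Octave sketch, using a simple autoregressive process purely as an example:

% Time averages from one realization of a stationary, ergodic process.
n = 10000;
phi = 0.5;                        % example AR(1) coefficient, |phi| < 1
e = randn(n, 1);
y = zeros(n, 1);
for t = 2:n
  y(t) = phi * y(t-1) + e(t);
end
mean(y)                           % close to the true mean of 0
gamma0 = mean((y - mean(y)).^2);  % sample variance
gamma1 = mean((y(2:end) - mean(y)) .* (y(1:end-1) - mean(y)));
rho1 = gamma1 / gamma0            % close to the true rho_1 = 0.5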

DEFINITION 60 (White noise). White noise is just the time series literature term for a classical error. $\varepsilon_t$ is white noise if i) $E(\varepsilon_t) = 0, \;\forall t$; ii) $V(\varepsilon_t) = \sigma^2, \;\forall t$; and iii) $\varepsilon_t$ and $\varepsilon_s$ are independent, $t \neq s$. Gaussian white noise just adds a normality assumption.

24.3.2. ARMA models. With these concepts, we can discuss ARMA models. These are closely related to the AR and MA error processes that we've already discussed. The main difference is that the lhs variable is observed directly now.

24.3.2.1. MA(q) processes. A $q$-th order moving average (MA) process is

\[
y_t = \mu + \varepsilon_t + \theta_1\varepsilon_{t-1} + \theta_2\varepsilon_{t-2} + \cdots + \theta_q\varepsilon_{t-q}
\]
where $\varepsilon_t$ is white noise. The variance is
\begin{align*}
\gamma_0 &= E\left[(y_t - \mu)^2\right] \\
         &= E\left[(\varepsilon_t + \theta_1\varepsilon_{t-1} + \cdots + \theta_q\varepsilon_{t-q})^2\right] \\
         &= \sigma^2\left(1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2\right).
\end{align*}
Similarly, the autocovariances are
\[
\gamma_k =
\begin{cases}
\sigma^2\left(\theta_k + \theta_{k+1}\theta_1 + \theta_{k+2}\theta_2 + \cdots + \theta_q\theta_{q-k}\right), & k \leq q \\
0, & k > q.
\end{cases}
\]
Therefore an MA(q) process is necessarily covariance stationary and ergodic, as long as $\sigma^2$ and all of the $\theta_i$ are finite.
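A quick numerical check of these second-moment formulas in Octave, for an example MA(2); the $\theta$ values and $\sigma^2$ are made up.

% Theoretical autocovariances of an MA(2), plus a simulation check.
sig2 = 1;  theta = [0.6; 0.3];           % made-up parameter values
q = length(theta);  th = [1; theta];     % th(1) plays the role of theta_0 = 1
gamma = zeros(q+1, 1);
for k = 0:q
  gamma(k+1) = sig2 * sum(th(1+k:q+1) .* th(1:q+1-k));  % sum_j theta_{j+k} theta_j
end
gamma'                                   % gamma_0, gamma_1, gamma_2

n = 100000;  e = randn(n, 1);
y = e + theta(1) * [0; e(1:end-1)] + theta(2) * [0; 0; e(1:end-2)];
[mean(y.^2), mean(y(2:end).*y(1:end-1)), mean(y(3:end).*y(1:end-2))]  % close to gamma'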
24.3.2.2. AR(p) processes. An AR(p) process can be represented as
\[
y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t.
\]
The dynamic behavior of an AR(p) process can be studied by writing this $p$-th order difference equation as a vector first order difference equation:
\[
\begin{bmatrix} y_t \\ y_{t-1} \\ \vdots \\ y_{t-p+1} \end{bmatrix}
=
\begin{bmatrix} c \\ 0 \\ \vdots \\ 0 \end{bmatrix}
+
\begin{bmatrix}
\phi_1 & \phi_2 & \cdots & \phi_{p-1} & \phi_p \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \cdots & 1 & 0
\end{bmatrix}
\begin{bmatrix} y_{t-1} \\ y_{t-2} \\ \vdots \\ y_{t-p} \end{bmatrix}
+
\begin{bmatrix} \varepsilon_t \\ 0 \\ \vdots \\ 0 \end{bmatrix}
\]
or
\[
Y_t = C + F Y_{t-1} + E_t.
\]

With this, we can recursively work forward in time:
\begin{align*}
Y_{t+1} &= C + F Y_t + E_{t+1} \\
        &= C + F\left(C + F Y_{t-1} + E_t\right) + E_{t+1} \\
        &= (I_p + F)C + F^2 Y_{t-1} + F E_t + E_{t+1}
\end{align*}
and
\begin{align*}
Y_{t+2} &= C + F Y_{t+1} + E_{t+2} \\
        &= (I_p + F + F^2)C + F^3 Y_{t-1} + F^2 E_t + F E_{t+1} + E_{t+2}
\end{align*}
or in general
\[
Y_{t+j} = (I_p + F + \cdots + F^j)C + F^{j+1} Y_{t-1} + F^j E_t + F^{j-1} E_{t+1} + \cdots + F E_{t+j-1} + E_{t+j}.
\]
Consider the impact of a shock in period $t$ on $y_{t+j}$. This is simply
\[
\frac{\partial y_{t+j}}{\partial \varepsilon_t} = F^j_{(1,1)},
\]
the $(1,1)$ element of $F^j$. If the system is to be stationary, then as we move forward in time this impact must die off. Otherwise a shock causes a permanent change in the mean of $y_t$. Therefore, stationarity requires that
\[
\lim_{j \to \infty} F^j_{(1,1)} = 0.
\]
Save this result, we'll need it in a minute.

Consider the eigenvalues of the matrix $F$. These are the $\lambda$ such that
\[
\left|F - \lambda I_p\right| = 0.
\]
The determinant here can be expressed as a polynomial. For example, for $p = 1$, the matrix $F$ is simply $F = \phi_1$, so
\[
\left|\phi_1 - \lambda\right| = 0
\]
can be written as $\phi_1 - \lambda = 0$. When $p = 2$, the matrix $F$ is
\[
F = \begin{bmatrix} \phi_1 & \phi_2 \\ 1 & 0 \end{bmatrix}
\]
so
\[
F - \lambda I_p = \begin{bmatrix} \phi_1 - \lambda & \phi_2 \\ 1 & -\lambda \end{bmatrix}
\]
and
\[
\left|F - \lambda I_p\right| = \lambda^2 - \lambda\phi_1 - \phi_2.
\]
So the eigenvalues are the roots of the polynomial $\lambda^2 - \lambda\phi_1 - \phi_2$, which can be found using the quadratic equation. This generalizes. For a $p$-th order AR process, the eigenvalues are the roots of
\[
\lambda^p - \lambda^{p-1}\phi_1 - \lambda^{p-2}\phi_2 - \cdots - \lambda\phi_{p-1} - \phi_p = 0.
\]
Supposing that all of the roots of this polynomial are distinct, then the matrix $F$ can be factored as
\[
F = T \Lambda T^{-1}
\]
where $T$ is the matrix which has as its columns the eigenvectors of $F$, and $\Lambda$ is a diagonal matrix with the eigenvalues on the main diagonal. Using this decomposition, we can write
\[
F^j = \left(T\Lambda T^{-1}\right)\left(T\Lambda T^{-1}\right)\cdots\left(T\Lambda T^{-1}\right)
\]
where $T\Lambda T^{-1}$ is repeated $j$ times. This gives
\[
F^j = T \Lambda^j T^{-1}
\]
and $\Lambda^j = \mathrm{diag}\!\left(\lambda_1^j, \lambda_2^j, \ldots, \lambda_p^j\right)$. Supposing that the $\lambda_i$, $i = 1, 2, \ldots, p$ are all real valued, it is clear that
\[
\lim_{j \to \infty} F^j_{(1,1)} = 0
\]
requires that
\[
\left|\lambda_i\right| < 1, \quad i = 1, 2, \ldots, p,
\]
i.e., the eigenvalues must be less than one in absolute value.


It may be the case that some eigenvalues are complex-valued. The

previous result generalizes to the requirement that the eigenvalues be


less than one in modulus, where the modulus of a complex number
h

is

3U P
 pt

hP
U

3U P
6p

This leads to the famous statement that stationarity requires the roots
of the determinantal polynomial to lie inside the complex unit circle.
draw picture here.

When there are roots on the unit circle (unit roots) or outside the unit

Dynamic multipliers:

circle, we leave the world of stationary processes.


is a dynamic multiplier or an

r
I sI

 tG v$ f

impulse-response function. Real eigenvalues lead to steady movements,


whereas comlpex eigenvalue lead to ocillatory behavior. Of course,
when there are multiple eigenvalues the overall effect can be a mixture. pictures
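In practice, the stationarity condition is easy to check numerically. A minimal Octave sketch for an example AR(2); the coefficient values are made up.

% Stationarity check for an AR(2) via the companion matrix F.
phi = [1.2, -0.35];             % example AR(2) coefficients
F = [phi; 1, 0];                % companion matrix
abs(eig(F))                     % both moduli < 1, so the process is stationary
roots([1, -phi(1), -phi(2)])    % same values: roots of lambda^2 - phi_1*lambda - phi_2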

Invertibility of AR processes. To begin with, define the lag operator $L$:
\[
L y_t = y_{t-1}.
\]
The lag operator is defined to behave just as an algebraic quantity, e.g.,
\[
L^2 y_t = L(L y_t) = L y_{t-1} = y_{t-2}
\]
or
\[
(1 - L)(1 + L)y_t = y_t + L y_t - L y_t - L^2 y_t = y_t - y_{t-2}.
\]
A mean-zero AR(p) process can be written as
\[
y_t - \phi_1 y_{t-1} - \phi_2 y_{t-2} - \cdots - \phi_p y_{t-p} = \varepsilon_t
\]
or
\[
\left(1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p\right) y_t = \varepsilon_t.
\]
Factor this polynomial as
\[
1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p = (1 - \lambda_1 L)(1 - \lambda_2 L)\cdots(1 - \lambda_p L).
\]
For the moment, just assume that the $\lambda_i$ are coefficients to be determined. Since $L$ is defined to operate as an algebraic quantity, determination of the $\lambda_i$ is the same as determination of the $\lambda_i$ such that the following two expressions are the same for all $z$:
\[
1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p = (1 - \lambda_1 z)(1 - \lambda_2 z)\cdots(1 - \lambda_p z).
\]
Multiply both sides by $z^{-p}$:
\[
z^{-p} - \phi_1 z^{1-p} - \phi_2 z^{2-p} - \cdots - \phi_{p-1} z^{-1} - \phi_p = (z^{-1} - \lambda_1)(z^{-1} - \lambda_2)\cdots(z^{-1} - \lambda_p)
\]
and now define $\lambda = z^{-1}$, so we get
\[
\lambda^p - \phi_1\lambda^{p-1} - \phi_2\lambda^{p-2} - \cdots - \phi_{p-1}\lambda - \phi_p = (\lambda - \lambda_1)(\lambda - \lambda_2)\cdots(\lambda - \lambda_p).
\]
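Numerically, the reciprocal relationship between the roots in $z$ of the lag polynomial and the roots in $\lambda = z^{-1}$ (the eigenvalues of $F$) is easy to see with Octave's roots function, using the same example AR(2) coefficients as in the earlier sketch.

% Roots of 1 - phi_1 z - phi_2 z^2 versus roots of lambda^2 - phi_1 lambda - phi_2.
phi = [1.2, -0.35];
roots([-phi(2), -phi(1), 1])    % z-roots: outside the unit circle when stationary
roots([1, -phi(1), -phi(2)])    % lambda-roots (eigenvalues): their reciprocals, inside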

The LHS is precisely the determinantal polynomial that gives the eigenvalues of $F$. Therefore, the $\lambda_i$ that are the coefficients of the factorization are simply the eigenvalues of the matrix $F$.

Now consider a different stationary process:
\[
(1 - \phi L) y_t = \varepsilon_t.
\]
Stationarity, as above, implies that $|\phi| < 1$. Multiply both sides by $1 + \phi L + \phi^2 L^2 + \cdots + \phi^j L^j$ to get
\[
\left(1 + \phi L + \phi^2 L^2 + \cdots + \phi^j L^j\right)(1 - \phi L) y_t = \left(1 + \phi L + \phi^2 L^2 + \cdots + \phi^j L^j\right)\varepsilon_t
\]
or, multiplying the polynomials on the LHS, we get
\[
\left(1 + \phi L + \cdots + \phi^j L^j - \phi L - \phi^2 L^2 - \cdots - \phi^j L^j - \phi^{j+1} L^{j+1}\right) y_t = \left(1 + \phi L + \cdots + \phi^j L^j\right)\varepsilon_t
\]
and with cancellations we have
\[
\left(1 - \phi^{j+1} L^{j+1}\right) y_t = \left(1 + \phi L + \cdots + \phi^j L^j\right)\varepsilon_t
\]
so
\[
y_t = \phi^{j+1} L^{j+1} y_t + \left(1 + \phi L + \cdots + \phi^j L^j\right)\varepsilon_t.
\]
Now as $j \to \infty$, $\phi^{j+1} L^{j+1} y_t \to 0$, since $|\phi| < 1$, so
\[
y_t \cong \left(1 + \phi L + \cdots + \phi^j L^j\right)\varepsilon_t
\]
and the approximation becomes better and better as $j$ increases. However, we started with
\[
(1 - \phi L) y_t = \varepsilon_t.
\]
Substituting this into the above equation we have
\[
y_t \cong \left(1 + \phi L + \cdots + \phi^j L^j\right)(1 - \phi L) y_t
\]
so
\[
\left(1 + \phi L + \cdots + \phi^j L^j\right)(1 - \phi L) \cong 1
\]
and the approximation becomes arbitrarily good as $j$ increases arbitrarily. Therefore, for $|\phi| < 1$, define
\[
(1 - \phi L)^{-1} = \sum_{j=0}^{\infty} \phi^j L^j.
\]
Recall that our mean zero AR(p) process
\[
y_t - \phi_1 y_{t-1} - \cdots - \phi_p y_{t-p} = \varepsilon_t
\]
can be written using the factorization
\[
(1 - \lambda_1 L)(1 - \lambda_2 L)\cdots(1 - \lambda_p L) y_t = \varepsilon_t
\]
where the $\lambda_i$ are the eigenvalues of $F$, and given stationarity, all the $|\lambda_i| < 1$. Therefore, we can invert each first order polynomial on the LHS to get
\[
y_t = (1 - \lambda_1 L)^{-1}(1 - \lambda_2 L)^{-1}\cdots(1 - \lambda_p L)^{-1}\varepsilon_t.
\]
The RHS is a product of infinite-order polynomials in $L$, which can be represented as
\[
y_t = \left(1 + \psi_1 L + \psi_2 L^2 + \cdots\right)\varepsilon_t
\]
where the $\psi_i$ are real-valued and absolutely summable.

The $\psi_i$ are formed of products of powers of the $\lambda_i$, which are in turn functions of the $\phi_i$.

The $\psi_i$ are real-valued because any complex-valued $\lambda_i$ always occur in conjugate pairs. This means that if $a + bi$ is an eigenvalue of $F$, then so is $a - bi$. In multiplication,
\[
(a + bi)(a - bi) = a^2 - abi + abi - b^2 i^2 = a^2 + b^2,
\]
which is real-valued.

This shows that an AR(p) process is representable as an infinite-order MA(q) process.

Recall before that by recursive substitution, an AR(p) process can be written as
\[
Y_{t+j} = (I_p + F + \cdots + F^j)C + F^{j+1} Y_{t-1} + F^j E_t + F^{j-1} E_{t+1} + \cdots + F E_{t+j-1} + E_{t+j}.
\]
If the process is mean zero, then everything with a $C$ drops out. Take this and lag it by $j$ periods to get
\[
Y_t = F^{j+1} Y_{t-j-1} + F^j E_{t-j} + F^{j-1} E_{t-j+1} + \cdots + F E_{t-1} + E_t.
\]
As $j \to \infty$, the lagged $Y$ on the RHS drops out. The $F^m E_{t-m}$ are vectors of zeros except for their first element, so we see that the first equation here, in the limit, is just
\[
y_t = \sum_{j=0}^{\infty} F^j_{(1,1)} \varepsilon_{t-j},
\]
which makes explicit the relationship between the $\psi_i$ and the $\phi_i$ (and the $\lambda_i$ as well, recalling the previous factorization of $F^j$).

Moments of an AR(p) process. The AR(p) process is
\[
y_t = c + \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} + \varepsilon_t.
\]
Assuming stationarity, $E(y_t) = \mu, \;\forall t$, so
\[
\mu = c + \phi_1\mu + \phi_2\mu + \cdots + \phi_p\mu
\]
so
\[
\mu = \frac{c}{1 - \phi_1 - \phi_2 - \cdots - \phi_p}
\]
and
\[
c = \mu - \phi_1\mu - \cdots - \phi_p\mu.
\]
Substituting this into the process, we can write it in deviations from the mean:
\[
y_t - \mu = \phi_1(y_{t-1} - \mu) + \phi_2(y_{t-2} - \mu) + \cdots + \phi_p(y_{t-p} - \mu) + \varepsilon_t.
\]
With this, the second moments are easy to find: the variance is
\[
\gamma_0 = \phi_1\gamma_1 + \phi_2\gamma_2 + \cdots + \phi_p\gamma_p + \sigma^2.
\]
The autocovariances of orders $k \geq 1$ follow the rule
\begin{align*}
\gamma_k &= E\left[(y_t - \mu)(y_{t-k} - \mu)\right] \\
         &= E\left[\left(\phi_1(y_{t-1} - \mu) + \cdots + \phi_p(y_{t-p} - \mu) + \varepsilon_t\right)(y_{t-k} - \mu)\right] \\
         &= \phi_1\gamma_{k-1} + \phi_2\gamma_{k-2} + \cdots + \phi_p\gamma_{k-p}.
\end{align*}
Using the fact that $\gamma_{-k} = \gamma_k$, one can take the $p + 1$ equations for $k = 0, 1, \ldots, p$, which have $p + 1$ unknowns ($\gamma_0, \gamma_1, \ldots, \gamma_p$), and solve for the unknowns. With these, the $\gamma_k$ for $k > p$ can be solved for recursively.
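A sketch of this calculation in Octave for an example AR(2): the $p + 1 = 3$ equations for $k = 0, 1, 2$ are solved as a linear system, and higher-order autocovariances then follow from the recursion. The coefficient values and $\sigma^2$ are made up.

% Second moments of an AR(2) from the equations above.
phi = [1.2, -0.35];  sig2 = 1;          % made-up values
A = [ 1,       -phi(1),  -phi(2);
     -phi(1),  1-phi(2),  0;
     -phi(2),  -phi(1),   1 ];
g = A \ [sig2; 0; 0];                   % gamma_0, gamma_1, gamma_2
rho = g(2:3) / g(1)                     % autocorrelations rho_1, rho_2
% higher orders: gamma_k = phi(1)*gamma_{k-1} + phi(2)*gamma_{k-2}, k > 2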

24.3.2.3. Invertibility of MA(q) processes. An MA(q) can be written as
\[
y_t - \mu = \left(1 + \theta_1 L + \cdots + \theta_q L^q\right)\varepsilon_t.
\]
As before, the polynomial on the RHS can be factored as
\[
1 + \theta_1 L + \cdots + \theta_q L^q = (1 - \eta_1 L)(1 - \eta_2 L)\cdots(1 - \eta_q L)
\]
and each of the $(1 - \eta_i L)$ can be inverted as long as $|\eta_i| < 1$. If this is the case, then we can write
\[
\left(1 + \theta_1 L + \cdots + \theta_q L^q\right)^{-1}(y_t - \mu) = \varepsilon_t,
\]
where
\[
\left(1 + \theta_1 L + \cdots + \theta_q L^q\right)^{-1}
\]
will be an infinite-order polynomial in $L$,
\[
\left(1 + \theta_1 L + \cdots + \theta_q L^q\right)^{-1} = \sum_{j=0}^{\infty} -\delta_j L^j,
\]
with $\delta_0 = -1$, so we get
\[
-\sum_{j=0}^{\infty} \delta_j L^j (y_t - \mu) = \varepsilon_t
\]
or
\[
y_t - \mu = \delta_1(y_{t-1} - \mu) + \delta_2(y_{t-2} - \mu) + \cdots + \varepsilon_t.
\]

So we see that an MA(q) has an infinite AR representation, as long as the $|\eta_i| < 1$, $i = 1, 2, \ldots, q$.

It turns out that one can always manipulate the parameters of an MA(q) process to find an invertible representation. For example, the two MA(1) processes
\[
y_t - \mu = (1 + \theta L)\varepsilon_t
\]
and
\[
y_t^* - \mu = (1 + \theta^{-1} L)\varepsilon_t^*
\]
have exactly the same moments if
\[
\sigma_{\varepsilon^*}^2 = \sigma_{\varepsilon}^2\theta^2.
\]
For example, we've seen that $\gamma_0 = \sigma^2(1 + \theta^2)$. Given the above relationships amongst the parameters,
\[
\gamma_0^* = \sigma^2\theta^2\left(1 + \theta^{-2}\right) = \sigma^2\left(1 + \theta^2\right),
\]
so the variances are the same. It turns out that all the autocovariances will be the same, as is easily checked. This means that the two MA processes are observationally equivalent. As before, it's impossible to distinguish between observationally equivalent processes on the basis of data.
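A quick numerical check of this equivalence in Octave ($\theta = 0.5$ is an arbitrary example value):

% Two observationally equivalent MA(1) parameterizations.
theta = 0.5;  sig2 = 1;
theta_star = 1 / theta;  sig2_star = sig2 * theta^2;
gamma0  = sig2 * (1 + theta^2)              % variance of the first process
gamma0s = sig2_star * (1 + theta_star^2)    % variance of the second: identical
gamma1  = sig2 * theta                      % first autocovariance
gamma1s = sig2_star * theta_star            % identical as well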

For a given MA(q) process, it's always possible to manipulate the parameters to find an invertible representation (which is unique).

It's important to find an invertible representation, since it's the only representation that allows one to represent $\varepsilon_t$ as a function of past $y$'s. The other representations express $\varepsilon_t$ as a function of future $y$'s.

Why is invertibility important? The most important reason is that it provides a justification for the use of parsimonious models. Since an AR(1) process has an MA($\infty$) representation, one can reverse the argument and note that at least some MA($\infty$) processes have an AR(1) representation. At the time of estimation, it's a lot easier to estimate the single AR(1) coefficient rather than the infinite number of coefficients associated with the MA representation.

This is the reason that ARMA models are popular. Combining low-order AR and MA models can usually offer a satisfactory representation of univariate time series data with a reasonable number of parameters.

Stationarity and invertibility of ARMA models is similar to what we've seen; we won't go into the details. Likewise, calculating moments is similar.

EXERCISE 61. Calculate the autocovariances of an ARMA(1,1) model:
\[
y_t = c + \phi y_{t-1} + \varepsilon_t + \theta\varepsilon_{t-1}
\]
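One way to check an answer to this exercise is by simulation: generate a long realization and compare sample autocovariances to the analytical expressions. A minimal Octave sketch; all parameter values are arbitrary.

% Simulate an ARMA(1,1): y_t = c + phi*y_{t-1} + eps_t + theta*eps_{t-1}
n = 200000;  phi = 0.7;  theta = 0.4;  c = 1;
e = randn(n, 1);
y = zeros(n, 1);
for t = 2:n
  y(t) = c + phi * y(t-1) + e(t) + theta * e(t-1);
end
ybar = mean(y)                  % should be near c / (1 - phi)
yd = y - ybar;
gamma0 = mean(yd.^2)
gamma1 = mean(yd(2:end) .* yd(1:end-1))
gamma2 = mean(yd(3:end) .* yd(1:end-2))   % for an ARMA(1,1), gamma_2 = phi*gamma_1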

Bibliography
[1] Davidson, R. and J.G. MacKinnon (1993) Estimation and Inference in Econometrics, Oxford
Univ. Press.
[2] Davidson, R. and J.G. MacKinnon (2004) Econometric Theory and Methods, Oxford Univ.
Press.
[3] Gallant, A.R. (1985) Nonlinear Statistical Models, Wiley.
[4] Gallant, A.R. (1997) An Introduction to Econometric Theory, Princeton Univ. Press.
[5] Hamilton, J. (1994) Time Series Analysis, Princeton Univ. Press
[6] Hayashi, F. (2000) Econometrics, Princeton Univ. Press.
[7] Wooldridge (2003), Introductory Econometrics, Thomson. (undergraduate level, for supplementary use only).


Index

asymptotic equality, 442
Chain rule, 436
Cobb-Douglas model, 21
convergence, almost sure, 438
convergence, in distribution, 439
convergence, in probability, 438
Convergence, ordinary, 437
convergence, pointwise, 437
convergence, uniform, 437
convergence, uniform almost sure, 440
cross section, 17
estimator, linear, 28, 38
estimator, OLS, 23
extremum estimator, 247
leverage, 28
likelihood function, 49
matrix, idempotent, 27
matrix, projection, 26
matrix, symmetric, 27
observations, influential, 27
outliers, 27
own influence, 29
parameter space, 49
Product rule, 436
R-squared, centered, 32
R-squared, uncentered, 31