Econometrics

© Michael Creel
Dept. of Economics and Economic History, Universitat Autònoma de Barcelona
Contents

List of Figures
List of Tables

1.1. License
3.5. Goodness of fit
5.1. Consistency
6.2. Testing
6.3. The asymptotic equivalence of the LR, Wald and score tests
6.6. Bootstrapping
7.4. Heteroscedasticity
7.5. Autocorrelation
8.1. Case 1
8.2. Case 2
8.3. Case 3
9.1. Collinearity
11.2. Exogeneity
11.4. IV estimation
11.6. 2SLS
13.1. Search
13.4. Examples
14.2. Consistency
14.5. Examples
15.2. Consistency
17.2. Identification
17.3. Consistency
18.7. Examples
19.1. Motivation
Bibliography
Index
List of Figures

1.2.1  LYX
1.2.2  Octave
3.5.1  Uncentered R²
13.2.2 Newton-Raphson method
13.6.1 A foggy mountain
List of Tables
CHAPTER 1
It is possible to have the program links open up in an editor, ready to run using keyboard
macros. To do this with the PDF version you need to do some setup work. See the bootable
CD described below.
1.1. License

All materials are copyrighted by Michael Creel with the date that appears above. They are provided under the terms of the GNU General Public License, which forms Section 23 of the notes. The main thing you need to know is that you are free to modify and distribute these materials in any way you like, as long as you do so under the terms of the GPL. In particular, you must make available the source files, in editable form, for your modified version of the materials.
1.2. Obtaining the materials

The materials are available on my web page, in a variety of forms including PDF and the editable sources, at pareto.uab.es/mcreel/Econometrics/. In addition to the final product, which you're looking at in some form now, you can obtain the editable sources, which will allow you to create your own version, if you like, or send error corrections and contributions. The main document was prepared using LYX (www.lyx.org) and Octave (www.octave.org). LYX is a free² "what you see is what you mean" word processor, basically working as a graphical frontend to LaTeX. It (with help from other applications) can export your work in LaTeX, HTML, PDF and several other forms. It will run on Linux, Windows, and MacOS systems. Figure 1.2.1 shows LYX editing this document.

GNU Octave has been used for the example programs, which are scattered throughout the document. This choice is motivated by several factors. The first is the high quality of the Octave environment for doing applied econometrics. The fundamental tools exist and are implemented in a way that makes extending them fairly easy. The example programs included here may convince you of this point. Secondly, Octave's licensing philosophy fits in with the goals of this project. Thirdly, it runs on Linux, Windows and MacOS.  Figure 1.2.2 shows an Octave program being edited by NEdit, and the result of running the program in a shell window.

² "Free" is used in the sense of freedom, but LYX is also free of charge.
between files - they are only illustrative when browsing. To see how to use these files (edit and run them), you should go to the home page of this document, since you will probably want to download the pdf version together with all the support files and examples. Then set the base URL of the PDF file to point to wherever the Octave files are installed. All of this may sound a bit complicated, because it is. An easier solution is available:
The file pareto.uab.es/mcreel/Econometrics/econometrics.iso is an ISO image file that may be burnt to CDROM. It contains a bootable-from-CD GNU/Linux system that has all of the tools needed to edit this document, run the Octave example programs, etcetera. In particular, it will allow you to cut out small portions of the notes and edit them, and send them to me as LYX (or TeX) files for inclusion in future versions. Think error corrections, additions, etc.! The CD automatically detects the hardware of your computer, and will not touch your hard disk unless you explicitly tell it to do so. It is based upon the Knoppix GNU/Linux distribution, with some material removed and other material added. Additionally, you can use it to install Debian GNU/Linux on your computer (run knoppix-installer as the root user). The versions of programs on the CD may be quite out of date, possibly with security problems that have not been fixed. So if you do a hard disk installation you should do apt-get update, apt-get upgrade toot sweet. See the Knoppix web page for more information.
1.4. Known Bugs

This section is a reminder to myself to try to fix a few things.

The PDF version has hyperlinks to figures that jump to the wrong figure. The numbers are correct, but the links are not. ps2pdf bugs?
CHAPTER 2

Consider the demand function

    x_i = x_i(p_i, m_i, z_i)

where

- x_i is the quantity demanded
- p_i is the vector of prices of the good and its substitutes and complements
- m_i is income
- z_i is a vector of other variables such as individual characteristics that affect preferences
- i = 1, 2, ..., n indexes the individuals in the sample.

The model is not estimable as it stands, since:

- The form of the demand function is unknown, and in principle differs for all i.
- Some components of z_i are not observable. For example, people don't eat the same lunch every day, and you can't tell what they will order just by looking at them. Suppose we can break z_i into an observable component w_i and a single unobservable component ε_i.

A step toward an estimable econometric model is to suppose that the demand function may be written as

    x_i = β₁ + p_i'β_p + m_i β_m + w_i'β_w + ε_i

We have imposed a number of restrictions on the theoretical model:

- The functions x_i(·), which in principle may differ for all i, have been restricted to all belong to the same parametric family.
- Of all parametric families of functions, we have restricted the model to the class of functions that are linear in the variables.
- There is a single unobservable influence, and it enters additively.

If we assume nothing about the error term ε, we can always write the last equation. But the β are unknown parameters, and in order to be able to estimate them from sample data, we need to make additional assumptions. These additional assumptions have no theoretical basis, they are assumptions on top of those needed to prove the existence of a demand function. The validity of any results we obtain using this model will be contingent on these additional restrictions being at least approximately correct. For this reason, specification testing will be needed, to check that the model seems to be reasonable. Only when we are convinced that the model is at least approximately correct should we use it for economic analysis.
CHAPTER 3

Ordinary Least Squares

3.1. The Linear Model

Consider approximating a variable y using the variables x₁, x₂, ..., x_K. We can consider a model that is a linear approximation:

    y = β₁x₁ + β₂x₂ + ... + β_K x_K + ε

or, using vector notation:

    y = x'β⁰ + ε

Here y is a scalar random variable, x = (x₁, x₂, ..., x_K)' is a K-vector of explanatory variables, and β⁰ = (β₁, ..., β_K)'. The superscript 0 in β⁰ means that this is the "true value" of the unknown parameter vector. It will be defined more precisely later, and usually suppressed when it's not necessary for clarity.

Suppose that we want to use data to try to determine the best linear approximation to y using the variables x. The data {(y_t, x_t)}, t = 1, 2, ..., n are obtained by some form of sampling¹. An individual observation is

    y_t = x_t'β + ε_t

The n observations can be written in matrix form as

    y = Xβ + ε                                                    (3.1.1)

where y = (y₁ y₂ ... y_n)' is n×1 and X = (x₁ x₂ ... x_n)' is an n×K matrix with ρ(X) = K.

¹ For example, cross-sectional data may be obtained by random sampling. Time series data accumulate historically.
Linear models are more general than they might first appear, since one can employ nonlinear transformations of the variables:

    φ₀(z_t) = φ₁(w_t)β₁ + φ₂(w_t)β₂ + ... + φ_K(w_t)β_K + ε_t

where the φ_i(·) are known functions. Defining y_t = φ₀(z_t), x_{t1} = φ₁(w_t), etc. leads to a model in the form of equation 3.6.1. For example, the Cobb-Douglas model

    z = A w₂^{β₂} w₃^{β₃} exp(ε)

can be transformed logarithmically to obtain

    ln z = ln A + β₂ ln w₂ + β₃ ln w₃ + ε

If we define y = ln z, β₁ = ln A, etc., we can put the model in the linear-in-parameters form. The approximation is linear in the parameters, but not necessarily linear in the variables.

3.2. Estimation by least squares

Figure 3.2.1 shows some data that follow the linear model y_t = α + βx_t + ε_t, along with the true regression line α + βx_t, where ε_t is a random error. Exactly how the true regression line is defined will become clear later. In practice, we only have the data, and
[Figure 3.2.1: data and the true regression line]
we don't know where the green line lies. We need to gain information about the straight line that best fits the data points.
The ordinary least squares (OLS) estimator is defined as the value that minimizes the sum of the squared errors:

    β̂ = argmin s(β)

where

    s(β) = Σ_{t=1}^{n} (y_t − x_t'β)²
         = (y − Xβ)'(y − Xβ)
         = y'y − 2y'Xβ + β'X'Xβ
         = ‖y − Xβ‖²

This last expression makes it clear how the OLS estimator is defined: it minimizes the Euclidean distance between y and Xβ. Here, "best" means minimum Euclidean distance. One could think of other estimators based upon other metrics. For example, the minimum absolute distance (MAD) estimator minimizes Σ_t |y_t − x_t'β|. Which estimator is "best" in terms of statistical properties, rather than in terms of the metrics that define them, depends upon the properties of ε, about which we have as yet made no assumptions.

To minimize the criterion s(β), take the derivative and set it to zero:

    D_β s(β̂) = −2X'y + 2X'Xβ̂ ≡ 0

so

    β̂ = (X'X)⁻¹X'y.

To verify that this is a minimum, check the second order sufficient condition:

    D²_β s(β̂) = 2X'X.

Since ρ(X) = K, this matrix is positive definite, so β̂ is in fact a minimizer.
Note that we can write

    y = Xβ̂ + e ≡ ŷ + e

where ŷ ≡ Xβ̂ is the vector of fitted values, and e ≡ y − Xβ̂ is the vector of residuals. Let's look at an example. Figure 3.3.1 shows data along with the fitted line and the true regression line. Note that the true line and the estimated line are different. This figure was created by running the Octave program OlsFit.m. You can experiment with changing the parameter values to see how this affects the fit, and to see how the fitted line will sometimes be close to the true line, and sometimes rather far away.
[Figure 3.3.1: data points, fitted line, and true line]
[Figure 3.3.2: projection in observation space: Xβ̂ = P_X y, e = M_X y, span S(X)]

Since β̂ is chosen to make the residual vector e as short as possible, e will be orthogonal to the span of X, S(X). Since ŷ = Xβ̂ is in this space, X'e = 0. Note that the f.o.c. that define the least squares estimator imply that this is so: X'(y − Xβ̂) = 0.

The fitted values are

    ŷ = Xβ̂ = X(X'X)⁻¹X'y ≡ P_X y

where P_X ≡ X(X'X)⁻¹X' is the matrix that projects y onto the span of X. The residuals are

    e = y − Xβ̂ = y − P_X y = (I_n − P_X)y ≡ M_X y

where M_X ≡ I_n − P_X projects y onto the space orthogonal to the span of X. We have

    y = P_X y + M_X y = ŷ + e

Therefore P_X and M_X decompose the n-dimensional vector y into two orthogonal components.

- A symmetric matrix A is one such that A = A'.
- An idempotent matrix A is one such that A = AA.

One can verify that P_X and M_X are both symmetric and idempotent, and that tr(P_X) = K and tr(M_X) = n − K.

3.4. Influential observations and outliers

The OLS estimator, since it is β̂ = (X'X)⁻¹X'y, is simply
a linear function of the dependent variable. This is how we define a linear estimator - it's a linear function of the dependent variable. Since it's a linear combination of the observations on the dependent variable, where the weights are determined by the observations on the regressors, some observations may have more influence than others. Define

    h_t = (P_X)_{tt} = e_t'P_X e_t = ‖P_X e_t‖²

where e_t is an n-vector of zeros with a 1 in the t-th position. So h_t is the t-th diagonal element of P_X, and 0 ≤ h_t ≤ 1. Also, Σ_t h_t = tr(P_X) = K, so the average of the h_t is K/n. If the weight h_t is much higher than average, then the observation has the potential to affect the fit importantly. The weight h_t is the leverage of observation t. One can show that the estimator obtained without the t-th observation is

    β̂⁽ᵗ⁾ = β̂ − (1/(1 − h_t)) (X'X)⁻¹ x_t e_t

so a high-leverage observation with a large residual can move the estimate substantially.
[Figure 3.4.1: data points, fitted line, leverage, and influence]

While an observation may be influential if it doesn't affect its own fitted value, it certainly is influential if it does. A fast means of identifying influential observations is to plot h_t e_t/(1 − h_t), the change in the own fitted value from dropping observation t, as a function of t.
data entry error, which can easily be corrected once detected. Data entry errors are a common cause of extreme observations.

3.5. Goodness of fit

The fitted model is

    y = Xβ̂ + e

Take the inner product:

    y'y = β̂'X'Xβ̂ + 2β̂'X'e + e'e

But the middle term has X'e = 0, so

    y'y = β̂'X'Xβ̂ + e'e                                          (3.5.1)

The uncentered R²_u is defined as

    R²_u = 1 − e'e/y'y = β̂'X'Xβ̂/y'y = ‖P_X y‖²/‖y‖² = cos²(φ)

where φ is the angle between y and the span of X (see Figure 3.5.1; the yellow vector is a constant, since it's on the 45 degree line in observation space). Another, more common definition measures the contribution of the variables, other than the constant term, to explaining the variation in y. Let ι = (1, 1, ..., 1)', an n-vector, and let

    M_ι = I_n − ι(ι'ι)⁻¹ι' = I_n − ιι'/n

M_ι y just returns the vector of deviations from the mean. In terms of deviations from the mean, equation 3.5.1 becomes

    y'M_ι y = β̂'X'M_ι Xβ̂ + e'e

Supposing that X contains a column of ones, X'e = 0 implies Σ_t e_t = 0, so M_ι e = e, and the centered R²_c is defined as

    R²_c = 1 − e'e / (y'M_ι y) = 1 − e'e / Σ_{t=1}^{n}(y_t − ȳ)².
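The centered R²_c is easy to compute by hand once the residuals are in hand. A small Python sketch (made-up data, not an example from the notes) for the simple-regression case:

```python
# Centered R^2 = 1 - e'e / sum_t (y_t - ybar)^2, for a simple regression
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1]
n = len(x)
xbar = sum(x) / n; ybar = sum(y) / n
b2 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b1 = ybar - b2 * xbar
e = [yi - b1 - b2 * xi for xi, yi in zip(x, y)]   # residuals
tss = sum((yi - ybar) ** 2 for yi in y)           # total variation
r2 = 1 - sum(ei ** 2 for ei in e) / tss
print(r2)   # near 1, since this data is nearly linear
```

Because the data lie close to a straight line, almost all the variation around the mean is explained and R²_c ≈ 0.99.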
Up to this point there is no economic content to the model, and the regression parameters have no economic interpretation. In order for the model

    y_t = x_t'β + ε_t                                            (3.6.1)

or, in matrix form,

    y = Xβ + ε

to have an economic interpretation, and in order to obtain the statistical properties of the estimator, we make additional assumptions. The classical linear regression model consists of the linear model plus the following:

Nonstochastic linearly independent regressors:

    X is nonstochastic, and ρ(X) = K.                            (3.6.2)

Errors with mean zero:

    E(ε) = 0.                                                    (3.6.3)

Homoscedastic errors:

    V(ε_t) = σ₀², ∀t.                                            (3.6.4)

Nonautocorrelated errors:

    E(ε_t ε_s) = 0, ∀t ≠ s.                                      (3.6.5)

Optionally, we will sometimes assume that the errors are normally distributed.

Normally distributed errors:

    ε ~ N(0, σ₀² I_n).                                           (3.6.6)
36
Rf I X R
. By linearity,
Gi I 6Qi
R R P
G P
V" R I Q R
By 3.6.2 and 3.6.3
G i I 6Xi
R R
Gi I !Xi
R R
R R
Gi I !X%
so the OLS estimator is unbiased under the assumptions of the classical model.
Figure 3.7.1 shows the results of a small Monte Carlo experiment where the
OLS estimator was calculated for 10000 samples from the classical model with
,
, and
5 1
G 5 P
P 0) f
that the
, where
and
)
5 1
where
G P
tvI f 8 P 4f
was calculated for 1000 samples from the AR(1) model with
[Figure 3.7.1]
regressors are stochastic. We can see that the bias in the estimation of ρ is about -0.2. The program that generates the plot is Biased.m, if you would like to experiment with this.

3.7.2. Normality. With normally distributed errors, β̂ is normally distributed, since

    β̂ = β + (X'X)⁻¹X'ε

is a linear function of ε. Under the classical assumptions with normality,

    β̂ ~ N(β, σ₀²(X'X)⁻¹).
[Figure 3.7.2]
However, the assumption of normality is often questionable or simply untenable. For example, if the dependent variable is the number of automobile trips per week, it is a count variable with a discrete distribution, and is thus not normally distributed. Many variables in economics can take on only nonnegative values, which, strictly speaking, rules out normality.²

² Normality may be a good model nonetheless, as long as the probability of a negative value occurring is negligible under the model. This depends upon the mean being large enough in relation to the variance.

3.7.3. The variance of the OLS estimator and the Gauss-Markov theorem. Now let's make all the classical assumptions except the assumption of normality. We have β̂ = β + (X'X)⁻¹X'ε, and we know that E(β̂) = β. So

    V(β̂) = E[(β̂ − β)(β̂ − β)']
         = E[(X'X)⁻¹X'εε'X(X'X)⁻¹]
         = (X'X)⁻¹σ₀².
The OLS estimator is a linear estimator, which means that it is a linear function of the dependent variable y:

    β̂ = [(X'X)⁻¹X'] y

and it is unbiased under the present assumptions, as we proved above. One could consider other weight matrices that define other linear estimators. We'll still insist upon unbiasedness. Consider β̃ = Wy, where W = W(X) is some K×n matrix function of X. Since W is a function of X, it is nonstochastic, too. For the estimator to be unbiased we must have WX = I_K, since

    E(Wy) = E(WXβ + Wε) = WXβ = β  ∀β  ⟹  WX = I_K.

The variance of β̃ is

    V(β̃) = WW'σ₀².

Define

    D = W − (X'X)⁻¹X'

so

    W = D + (X'X)⁻¹X'.

Since WX = I_K, we have DX = 0, so

    V(β̃) = (D + (X'X)⁻¹X')(D + (X'X)⁻¹X')'σ₀²
          = DD'σ₀² + (X'X)⁻¹σ₀².

So

    V(β̃) ≥ V(β̂):

the difference is the positive semidefinite matrix DD'σ₀². This is a proof of the Gauss-Markov theorem:

Theorem. The OLS estimator is the best linear unbiased estimator (BLUE).

It is worth emphasizing again that we have not used the normality assumption in any way to prove the Gauss-Markov theorem, so it is valid if the errors are not normally distributed, as long as the other assumptions hold.
To illustrate the Gauss-Markov result, consider the estimator that results from splitting the sample into p equally-sized parts, estimating using each part of the data separately, and then averaging the p resulting estimators. You should be able to show that this estimator is unbiased, but inefficient with respect to the OLS estimator. The program Efficiency.m illustrates this using
[Figure 3.7.3]
a small Monte Carlo experiment, which compares the OLS estimator and a 3-way split sample estimator. The data generating process follows the classical model. Comparing the histograms in Figures 3.7.3 and 3.7.4, we can see that the OLS estimator is more efficient, since the tails of its histogram are more narrow.
[Figure 3.7.4]

An estimator of σ₀². The residuals are e = M_X y = M_X ε. We have that

    e'e = ε'M_X'M_X ε = ε'M_X ε

and the expectation is

    E(e'e) = E(tr(ε'M_X ε))
           = E(tr(M_X εε'))
           = tr(M_X E(εε'))
           = σ₀² tr(M_X)
           = σ₀²(n − K),

using tr(M_X) = tr(I_n) − tr(P_X) = n − K. Therefore the estimator

    σ̂² ≡ e'e/(n − K)

is unbiased for σ₀², while e'e/n is biased downward by the factor (n − K)/n.
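The degrees-of-freedom correction can be seen in a small simulation. A Python sketch (simulated classical model with an intercept and one slope, so K = 2; the design and parameter values are arbitrary choices, not from the notes):

```python
import random

# Monte Carlo check: E[e'e/(n-K)] = sigma0^2, while e'e/n is biased down
random.seed(1)
n, K, sigma2 = 20, 2, 9.0
x = [random.uniform(0, 10) for _ in range(n)]   # fixed across replications
xbar = sum(x) / n
ssx = sum((xi - xbar) ** 2 for xi in x)
unbiased, biased = [], []
for _ in range(5000):
    y = [1.0 + 2.0 * xi + random.gauss(0, sigma2 ** 0.5) for xi in x]
    ybar = sum(y) / n
    b2 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ssx
    b1 = ybar - b2 * xbar
    sse = sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y))
    unbiased.append(sse / (n - K))
    biased.append(sse / n)
m_unb = sum(unbiased) / len(unbiased)
m_b = sum(biased) / len(biased)
print(m_unb)   # close to sigma0^2 = 9
print(m_b)     # close to 9 * (n - K)/n = 8.1
```

With n = 20 the downward bias of e'e/n is about 10%, which matches (n − K)/n = 0.9.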
3.8. Example: The Nerlove model

3.8.1. Theoretical background. For a firm that takes input prices w and the output level q as given, the cost minimization problem is to choose the quantities of inputs x to solve the problem

    min_x w'x  subject to  f(x) = q.

The solution is the vector of factor demands x(w, q), and the cost function is obtained by substituting the factor demands into the criterion function:

    C(w, q) = w'x(w, q).

Monotonicity: Increasing factor prices cannot decrease cost, so ∂C(w, q)/∂w ≥ 0.

Homogeneity: The cost function is homogeneous of degree 1 in input prices, C(tw, q) = tC(w, q), where t is a scalar constant. This is because the factor demands are homogeneous of degree zero in factor prices - they only depend upon relative prices.

Returns to scale: Constant returns to scale is the case where increasing production q implies that cost increases in the proportion 1:1. If this is the case, then the elasticity of cost with respect to output is 1.

3.8.2. Cobb-Douglas functional form. The Cobb-Douglas cost function has the form

    C = A w₁^{β₁} ⋯ w_g^{β_g} q^{β_q} e^ε.

The elasticity of C with respect to the j-th input price is

    e_{C/w_j} = (∂C/∂w_j)(w_j/C) = β_j.

This is one of the reasons the Cobb-Douglas form is popular - the coefficients are easy to interpret, since they are the elasticities of the dependent variable with respect to the explanatory variables. After a logarithmic transformation we obtain

    ln C = α + β₁ ln w₁ + ⋯ + β_g ln w_g + β_q ln q + ε

where α = ln A, so the transformed model is linear in the logs of the data,
3.8.3. The Nerlove data and OLS. The file nerlove.data contains data on 145 electric utility companies' cost of production, output and input prices. The data are for the U.S., and were collected by M. Nerlove. The observations are by row, and the columns are COMPANY, COST, OUTPUT, PRICE OF LABOR, PRICE OF FUEL and PRICE OF CAPITAL. Note that the data are sorted by output level (the third column).

We will estimate the Cobb-Douglas model

    ln C = β₁ + β₂ ln Q + β₃ ln P_L + β₄ ln P_F + β₅ ln P_K + ε   (3.8.1)

using OLS. To do this yourself, you need the data file mentioned above, as well as Nerlove.m (the estimation program), and the library of Octave functions mentioned in the introduction to Octave that forms section 21 of this document.³ The results are
*********************************************************
OLS estimation results
Observations 145
R-squared 0.925955
Sigma-squared 0.153943

Results (Ordinary var-cov estimator)

            estimate   st.err.   t-stat.   p-value
constant      -3.527     1.774    -1.987     0.049
output         0.720     0.017    41.244     0.000
labor          0.436     0.291     1.499     0.136
fuel           0.427     0.100     4.249     0.000
capital       -0.220     0.339    -0.648     0.518
*********************************************************
³ If you are running the bootable CD, you have all of this installed and ready to run.

While we will use Octave programs as examples in this document, since following the programming statements is a useful way of learning how theory is put into practice, you may be interested in a more user-friendly environment for doing econometrics. I heartily recommend Gretl, the Gnu Regression, Econometrics, and Time-Series Library. This is an easy to use program, available in English, French, and Spanish, and it comes with a lot of data ready to use. It even has an option to save output as LaTeX fragments, so that I can just include the results into this document, no muss, no fuss. Here are the results of the Nerlove model from GRETL:
Variable     Coefficient   Std. Error   t-statistic   p-value
const           -3.527        1.774        -1.987       0.049
l_output         0.720        0.017        41.244       0.000
l_labor          0.436        0.291         1.499       0.136
l_fuel           0.427        0.100         4.249       0.000
l_capita        -0.220        0.339        -0.648       0.518

Unadjusted R² 0.925955
Fortunately, Gretl and my OLS program agree upon the results. Gretl is included in the bootable CD mentioned in the introduction. I recommend using GRETL to repeat the examples that are done using Octave.

The previous properties hold for finite sample sizes. Before considering the asymptotic properties of the OLS estimator it is useful to review the MLE estimator, since under the assumption of normal errors the two estimators coincide.
Exercises

(1) Prove that the split sample estimator used to generate figure 3.7.4 is unbiased.
(2) Calculate the OLS estimates of the Nerlove model using Octave and GRETL, and provide printouts of the results. Interpret the results.
(3) Do an analysis of whether or not there are influential observations for OLS estimation of the Nerlove model. Discuss.
(4) Using GRETL, examine the residuals after OLS estimation and tell me whether or not you believe that the assumption of independent identically distributed normal errors is warranted. No need to do formal tests, just look at the plots. Print out any that you think are relevant, and interpret them.
(5) Using Octave, write a little program that verifies that tr(AB) = tr(BA) for A and B 4×4 matrices of random numbers. Note: there is an Octave function trace.
(6) For the model y = Xβ + ε, which satisfies the classical assumptions, prove that the variance of the OLS estimator declines to zero as the sample size increases.
CHAPTER 4

Maximum likelihood estimation

This chapter is presented without examples. In the second half of the course, nonlinear models with nonnormal errors are introduced, and examples may be found there.

4.1. The likelihood function

Suppose we have a sample of size n of the random vectors y and z. Suppose the joint density of Y = (y₁ ... y_n) and Z = (z₁ ... z_n) is characterized by a parameter vector ψ⁰:

    f_{YZ}(Y, Z | ψ⁰).

This is the joint density of the sample. This density can be factored as

    f_{YZ}(Y, Z | ψ⁰) = f_{Y|Z}(Y | Z, θ⁰) f_Z(Z | ρ⁰)

where θ⁰ and ρ⁰ are the parameters of the conditional and marginal densities, respectively. The likelihood function is just this density evaluated at other values of the parameters:

    L(Y, Z | ψ) = f(Y, Z | ψ),  ψ ∈ Ψ,

where Ψ is a parameter space.
The maximum likelihood estimator of θ⁰ is the value of θ that maximizes the likelihood function.

Note that if θ⁰ and ρ⁰ are functionally independent (they share no elements and are not related by restrictions), then the maximizer of the conditional likelihood f_{Y|Z}(Y | Z, θ) with respect to θ is the same as the maximizer of the overall likelihood, so we can work with the simpler conditional likelihood

    L(Y | Z, θ) = f_{Y|Z}(Y | Z, θ).

If the n observations are independent, the likelihood function can be written as

    L(Y | θ) = ∏_{t=1}^{n} f(y_t | θ)

where the f_t are possibly of different form. If this is not possible, we can always factor the likelihood into contributions of observations, by using the fact that a joint density can be factored into a marginal times a conditional, applied iteratively:

    f(y₁, ..., y_n) = f(y₁) f(y₂ | y₁) f(y₃ | y₁, y₂) ⋯ f(y_n | y₁, ..., y_{n−1})

so the likelihood function can be written as

    L(Y | θ) = ∏_{t=1}^{n} f(y_t | Y_{t−1}, θ)

where Y_{t−1} denotes the information available at t.
The criterion function for estimation is taken to be the average log-likelihood:

    s_n(θ) = (1/n) ln L(Y | θ) = (1/n) Σ_{t=1}^{n} ln f(y_t | θ).

Since ln(·) is a monotonic increasing function, the logarithmic transformation has no effect on the maximizing value θ̂, and dividing by n likewise has no effect on θ̂.
4.1.1. Example: Bernoulli trial. Suppose that we are flipping a coin that may be biased, so that the probability of a heads may not be 0.5. Maybe we're interested in estimating the probability of a heads. Let y = 1 (heads) or y = 0 (tails) be a binary variable that indicates the outcome of a toss. The outcome is a Bernoulli random variable:

    f_Y(y, p⁰) = (p⁰)^y (1 − p⁰)^{1−y},  y ∈ {0, 1},

so a representative term that enters the likelihood function is

    f_Y(y, p) = p^y (1 − p)^{1−y}

and

    ln f_Y(y, p) = y ln p + (1 − y) ln(1 − p).

The derivative of this with respect to p is

    ∂ ln f_Y(y, p)/∂p = y/p − (1 − y)/(1 − p) = (y − p)/(p(1 − p)).

Averaging this over a sample of size n gives

    ∂s_n(p)/∂p = (1/n) Σ_{i=1}^{n} (y_i − p)/(p(1 − p)).

Setting this to zero and solving gives

    p̂ = ȳ,

so it's easy to calculate the MLE of p⁰ in this case.

Now imagine that we had a bag full of bent coins, each bent around a sphere of a different radius (with the head pointing to the outside of the sphere). We might suspect that the probability of a heads could depend upon the radius. Suppose that

    p_i ≡ p(x_i, β) = (1 + exp(−x_i'β))⁻¹

where x_i = (1, r_i)', so that β is a 2×1 vector. Now

    ∂p_i(β)/∂β = p_i(1 − p_i) x_i

so

    ∂ ln f_Y(y_i, β)/∂β = [(y_i − p_i)/(p_i(1 − p_i))] p_i(1 − p_i) x_i = (y_i − p_i) x_i

and the derivative of the average log-likelihood is

    ∂s_n(β)/∂β = (1/n) Σ_{i=1}^{n} (y_i − p_i) x_i.

Setting this to zero, there is no explicit solution for the two elements of β. This is common with ML estimators: they are often nonlinear, and finding their values often requires use of numeric methods to find solutions to the first order conditions.
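A numeric solution of these first order conditions can be sketched as follows (Python rather than Octave, with simulated bent-coin data; the true parameter values are arbitrary choices). Newton's method is applied to the score Σᵢ(yᵢ − pᵢ)xᵢ, whose negative Jacobian is Σᵢ pᵢ(1 − pᵢ)xᵢxᵢ':

```python
import math, random

# Simulate bent-coin data: P(heads) = 1/(1 + exp(-(b1 + b2 * r)))
random.seed(0)
b_true = (-1.0, 0.5)
data = []
for _ in range(2000):
    r = random.uniform(0, 10)
    p = 1 / (1 + math.exp(-(b_true[0] + b_true[1] * r)))
    data.append((r, 1 if random.random() < p else 0))

# Newton's method on the score sum_i (y_i - p_i) x_i = 0, x_i = (1, r_i)'
b1, b2 = 0.0, 0.0
for _ in range(25):
    g1 = g2 = h11 = h12 = h22 = 0.0
    for r, y in data:
        p = 1 / (1 + math.exp(-(b1 + b2 * r)))
        g1 += y - p; g2 += (y - p) * r    # score components
        w = p * (1 - p)                   # negative-Hessian weights
        h11 += w; h12 += w * r; h22 += w * r * r
    det = h11 * h22 - h12 * h12
    b1 += (h22 * g1 - h12 * g2) / det     # solve the 2x2 Newton step
    b2 += (h11 * g2 - h12 * g1) / det
print(b1, b2)   # near the true (-1.0, 0.5)
```

Chapter 13 of the notes treats numeric optimization methods like this one in detail.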
4.2. Consistency of MLE

To show consistency of the MLE, we need to make explicit some assumptions.

Compact parameter space: θ ∈ Θ, an open bounded subset of ℝ^K. Maximization is over Θ̄, the closure of Θ, which is compact.

Uniform convergence:

    s_n(θ) → lim_{n→∞} E_{θ⁰}[s_n(θ)] ≡ s_∞(θ, θ⁰), a.s., uniformly in θ ∈ Θ̄.

We have suppressed Y here for simplicity. This requires that almost sure convergence holds for all possible parameter values. For a given parameter value, an ordinary Law of Large Numbers will usually imply almost sure convergence to the limit of the expectation. Convergence for a single element of the parameter space, combined with the assumption of a compact parameter space, ensures uniform convergence.
Continuity: s_n(θ) is continuous in θ, for θ ∈ Θ̄. This implies that the limit s_∞(θ, θ⁰) is continuous in θ.

Identification: s_∞(θ, θ⁰) has a unique maximum in its first argument, at θ = θ⁰.

We will use these assumptions to show that θ̂_n → θ⁰, a.s. First, θ̂_n certainly exists, since a continuous function has a maximum on a compact set.
Second, for any θ ≠ θ⁰,

    E[ln(L(θ)/L(θ⁰))] ≤ ln E[L(θ)/L(θ⁰)]

by Jensen's inequality (ln(·) is a concave function). Now, the expectation on the right hand side is

    E[L(θ)/L(θ⁰)] = ∫ (L(θ)/L(θ⁰)) L(θ⁰) dy = ∫ L(θ) dy = 1,

since L(θ⁰) is the density of Y. So, since ln 1 = 0,

    E[ln(L(θ)/L(θ⁰))] ≤ 0,

or

    E[s_n(θ)] − E[s_n(θ⁰)] ≤ 0.

Taking limits, this is

    s_∞(θ, θ⁰) − s_∞(θ⁰, θ⁰) ≤ 0

except on a set of zero probability (by the uniform convergence assumption).
By the identification assumption there is a unique maximizer, so the inequality is strict if θ ≠ θ⁰:

    s_∞(θ, θ⁰) − s_∞(θ⁰, θ⁰) < 0, ∀θ ≠ θ⁰, a.s.

Suppose that θ* is a limit point of {θ̂_n} (any sequence from a compact set has at least one limit point). Since θ̂_n is a maximizer, independent of n, we must have

    s_∞(θ*, θ⁰) − s_∞(θ⁰, θ⁰) ≥ 0.

These last two inequalities imply that θ* = θ⁰, a.s. Thus there is only one limit point, and it is equal to the true parameter value with probability one. In other words,

    lim_{n→∞} θ̂ = θ⁰, a.s.

This completes the proof of strong consistency of the MLE. One can use weaker assumptions to prove weak consistency (convergence in probability to θ⁰) of the MLE. This is omitted here. Note that almost sure convergence implies convergence in probability.
4.3. The score function

To study the asymptotic distribution of the MLE, assume that s_n(θ) is twice continuously differentiable in a neighborhood N(θ⁰) of θ⁰, at least when n is large enough.

Define the score function as

    g_n(Y, θ) = D_θ s_n(θ) = (1/n) Σ_{t=1}^{n} D_θ ln f(y_t | Y_{t−1}, θ) ≡ (1/n) Σ_{t=1}^{n} g_t(θ).

The dependence on Y (and on other conditioning variables) will often be suppressed for clarity, but one should not forget that they are still there.

The ML estimator θ̂ sets the derivatives to zero:

    g_n(θ̂) = (1/n) Σ_{t=1}^{n} g_t(θ̂) ≡ 0.

We will show that E_θ[g_t(θ)] = 0, ∀t. This is the expectation taken with respect to the density f(θ), not necessarily f(θ⁰):

    E_θ[g_t(θ)] = ∫ [D_θ ln f(y_t | θ)] f(y_t | θ) dy_t
                = ∫ (1/f(y_t | θ)) [D_θ f(y_t | θ)] f(y_t | θ) dy_t
                = ∫ D_θ f(y_t | θ) dy_t.

Given some regularity conditions on boundedness of D_θ f, we can switch the order of integration and differentiation, by the dominated convergence theorem. This gives

    E_θ[g_t(θ)] = D_θ ∫ f(y_t | θ) dy_t = D_θ 1 = 0,

so the expectation of the score vector is zero, provided the expectation is taken with respect to the same density used to define the score.
4.4. Asymptotic normality of MLE

Take a first order Taylor series expansion of g(Y, θ̂) about the true value θ⁰:

    0 ≡ g(θ̂) = g(θ⁰) + (D_{θ'} g(θ*))(θ̂ − θ⁰)

where θ* = λθ̂ + (1 − λ)θ⁰, 0 < λ < 1. Define the Hessian of the average log-likelihood,

    H(θ) ≡ D_{θ'} g(θ) = D²_θ s_n(θ),

so, assuming H(θ*) is invertible (we'll justify this in a minute),

    √n (θ̂ − θ⁰) = −H(θ*)⁻¹ √n g(θ⁰).

Now consider H(θ*). This is

    H(θ*) = (1/n) Σ_{t=1}^{n} D²_θ ln f(y_t | θ*),

an average of the Hessian contributions of the observations.
Given that this is an average of terms, it should usually be the case that this satisfies a strong law of large numbers (SLLN). Regularity conditions are a set of assumptions that guarantee that this will happen. There are different sets of assumptions that can be used to justify appeal to different SLLNs. For example, one SLLN requires that the terms be independent, and their variances must not become infinite. We don't assume any particular set here, since the appropriate assumptions will depend upon the particularities of a given model. However, we assume that a SLLN applies.

Also, since θ̂ is consistent and θ* = λθ̂ + (1 − λ)θ⁰, we have that θ* → θ⁰, a.s. Given a SLLN and continuity,

    H(θ*) → lim_{n→∞} E[H(θ⁰)] ≡ H_∞(θ⁰), a.s.
Here H_∞(θ⁰) is the limiting expected Hessian, evaluated at the true parameter value. We assume that H_∞(θ⁰) is negative definite, and therefore invertible. This justifies the invertibility of H(θ*) when n is large enough, and it implies strong concavity of the limiting objective function at θ⁰, i.e., θ⁰ maximizes the limiting objective function. Since there is a unique maximizer, this is consistent with the identification assumption.

Now consider √n g(θ⁰). This is

    √n g_n(θ⁰) = √n D_θ s_n(θ⁰) = (1/√n) Σ_{t=1}^{n} g_t(θ⁰).    (4.4.1)

We've seen that E[g_t(θ⁰)] = 0, so it is reasonable to assume that a CLT applies, and we have
the following. Note that g_n(θ⁰) → 0 a.s., so to avoid collapse to a degenerate (constant) random variable we must scale by √n. A generic CLT states that, for x_n a random vector that satisfies certain conditions,

    x_n − E(x_n) →d N(0, lim V(x_n)).

The conditions that x_n must satisfy depend on the case at hand. Usually, x_n will be of the form of an average, scaled by √n, which is the case for √n g(θ⁰). Then the properties of x_n depend on the properties of the terms of the average: if they have finite variances and are not too strongly dependent, then a CLT for dependent processes will apply. Supposing that a CLT applies, and noting that E[√n g_n(θ⁰)] = 0, we get

    √n g_n(θ⁰) →d N(0, I_∞(θ⁰))                                  (4.4.2)

where

    I_∞(θ⁰) ≡ lim_{n→∞} E_{θ⁰}[ n g_n(θ⁰) g_n(θ⁰)' ]

is the information matrix. Combining equation 4.4.2 with √n(θ̂ − θ⁰) = −H(θ*)⁻¹√n g(θ⁰) and H(θ*) → H_∞(θ⁰) a.s., we get

    √n (θ̂ − θ⁰) →d N(0, H_∞(θ⁰)⁻¹ I_∞(θ⁰) H_∞(θ⁰)⁻¹).          (4.4.3)

The MLE estimator is asymptotically normal, and it is √n-consistent: not only does it converge in probability, it converges at rate √n to a nondegenerate limiting distribution.
There do exist, in special cases, estimators that are consistent at rates faster than √n; these are known as superconsistent estimators. For stationary data, √n is the usual rate. More formally, an estimator θ̂ of a parameter θ⁰ is asymptotically unbiased if

    lim_{n→∞} E_θ(θ̂) = θ.                                        (4.4.4)

Estimators that are CAN are asymptotically unbiased, though not all consistent estimators are asymptotically unbiased. Such cases are unusual, though. An example is an estimator θ̂ with density

    f(θ̂) = 1 − 1/n,  θ̂ = θ⁰
            1/n,      θ̂ = n.

Show that this estimator is consistent but asymptotically biased. Also ask yourself how you could define an estimator that would have this density.
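The behavior of this example estimator can be checked by simulation. A minimal Python sketch (θ⁰ = 2 is an arbitrary choice): the estimator collapses onto θ⁰ with probability approaching one, yet its mean tends to θ⁰ + 1, since the rare outcome n contributes n · (1/n) = 1 to the expectation:

```python
import random

# The estimator equals theta0 with prob 1 - 1/n, and equals n with prob 1/n.
# Consistent (P(thetahat != theta0) -> 0), yet E(thetahat) -> theta0 + 1.
random.seed(5)
theta0 = 2.0
def mean_of_estimator(n, reps=100000):
    draws = [theta0 if random.random() < 1 - 1 / n else float(n)
             for _ in range(reps)]
    return sum(draws) / reps

for n in (10, 100, 1000):
    print(n, mean_of_estimator(n))   # approaches theta0 + 1 = 3.0
```

This makes the point of equation 4.4.4 concrete: consistency constrains where the probability mass goes, not where the mean goes.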
4.6. The information matrix equality

We will show that H_∞(θ) = −I_∞(θ). Let f_t(θ) be short for f(y_t | Y_{t−1}, θ), so that

    1 = ∫ f_t(θ) dy_t.

Differentiate with respect to θ:

    0 = ∫ D_θ f_t(θ) dy_t = ∫ [D_θ ln f_t(θ)] f_t(θ) dy_t = ∫ g_t(θ) f_t(θ) dy_t.

Now differentiate again, with respect to θ':

    0 = ∫ [D²_θ ln f_t(θ)] f_t(θ) dy_t + ∫ [D_θ ln f_t(θ)][D_{θ'} f_t(θ)] dy_t
      = E_θ[D²_θ ln f_t(θ)] + ∫ [D_θ ln f_t(θ)][D_{θ'} ln f_t(θ)] f_t(θ) dy_t
      = E_θ[H_t(θ)] + E_θ[g_t(θ) g_t(θ)'],

so

    E_θ[H_t(θ)] = −E_θ[g_t(θ) g_t(θ)'].

Now sum over n and multiply by 1/n:

    E_θ[(1/n) Σ_{t=1}^{n} H_t(θ)] = −E_θ[(1/n) Σ_{t=1}^{n} g_t(θ) g_t(θ)'].

The scores g_t and g_s are uncorrelated for t ≠ s, since for t > s the score g_t has conditional expectation zero given Y_{t−1}, which includes g_s. (This forms the basis for a specification test proposed by White: if the scores appear to be correlated one may question the specification of the model.) This allows us to write

    E_θ[H_n(θ)] = −E_θ[n g_n(θ) g_n(θ)'],

since all cross products between different periods expect to zero. Finally, taking limits, we get

    H_∞(θ) = −I_∞(θ).                                            (4.6.2)

This holds for all θ, in particular, for θ⁰. Using this, the asymptotic normality result

    √n (θ̂ − θ⁰) →d N(0, H_∞(θ⁰)⁻¹ I_∞(θ⁰) H_∞(θ⁰)⁻¹)

simplifies to

    √n (θ̂ − θ⁰) →d N(0, I_∞(θ⁰)⁻¹) = N(0, −H_∞(θ⁰)⁻¹).          (4.6.3)

To estimate the asymptotic variance, we need estimators of H_∞(θ⁰) and I_∞(θ⁰). We can use

    Î_∞ = (1/n) Σ_{t=1}^{n} g_t(θ̂) g_t(θ̂)'
    Ĥ_∞ = H(θ̂),

so the asymptotic variance may be estimated by −Ĥ_∞⁻¹, by Î_∞⁻¹, or by Ĥ_∞⁻¹ Î_∞ Ĥ_∞⁻¹.
These are known as the inverse Hessian, outer product of the gradient (OPG) and
sandwich estimators, respectively. The sandwich form is the most robust, since
it coincides with the covariance estimator of the quasi-ML estimator.
4.7. The Cramér-Rao lower bound

Theorem [Cramér-Rao lower bound]. The limiting variance of a CAN estimator of θ⁰, say θ̃, minus the inverse of the information matrix is a positive semidefinite matrix.

Proof: Since the estimator is CAN, it is asymptotically unbiased, so

    lim_{n→∞} E_θ(θ̃ − θ) = 0.

Differentiate wrt θ':

    D_{θ'} lim_{n→∞} E_θ(θ̃ − θ) = lim_{n→∞} ∫ D_{θ'}[f(Y, θ)(θ̃ − θ)] dy = 0

(this is a K×K matrix of zeros). Noting that D_{θ'} f(Y, θ) = f(θ) D_{θ'} ln f(θ), we can write

    lim_{n→∞} ∫ (θ̃ − θ) f(θ) D_{θ'} ln f(θ) dy + lim_{n→∞} ∫ f(Y, θ) D_{θ'}(θ̃ − θ) dy = 0.

Now note that D_{θ'}(θ̃ − θ) = −I_K, and ∫ f(Y, θ)(−I_K) dy = −I_K. With this we have

    lim_{n→∞} ∫ (θ̃ − θ) f(θ) D_{θ'} ln f(θ) dy = I_K.

Playing with powers of n we get

    lim_{n→∞} ∫ √n (θ̃ − θ) (1/√n) [D_{θ'} ln f(θ)] f(θ) dy = I_K.

Note that the bracketed part is just the transpose of the score vector g(θ), so we can write

    lim_{n→∞} E_θ[√n (θ̃ − θ) √n g(θ)'] = I_K.

This means that the covariance of the score function with √n(θ̃ − θ), for any CAN estimator, is an identity matrix. Using this, suppose the variance of √n(θ̃ − θ) tends to V_∞(θ̃). Therefore,

    V_∞ [ √n(θ̃ − θ) ; √n g(θ) ] = [ V_∞(θ̃)  I_K ; I_K  I_∞(θ) ].   (4.7.1)

Since this is a covariance matrix, it is positive semidefinite. Therefore, for any K-vector α,

    [ α'  −α'I_∞(θ)⁻¹ ] [ V_∞(θ̃)  I_K ; I_K  I_∞(θ) ] [ α ; −I_∞(θ)⁻¹α ] ≥ 0.

This simplifies to

    α' [ V_∞(θ̃) − I_∞(θ)⁻¹ ] α ≥ 0.

Since α is arbitrary, V_∞(θ̃) − I_∞(θ)⁻¹ is positive semidefinite. This concludes the proof.

This means that I_∞(θ)⁻¹ is a lower bound for the asymptotic variance of a CAN estimator.

Definition (Asymptotic efficiency): Given two CAN estimators of a parameter θ⁰, say θ̃ and θ̂, θ̂ is asymptotically efficient with respect to θ̃ if V_∞(θ̃) − V_∞(θ̂) is a positive semidefinite matrix.

A direct proof of asymptotic efficiency of an estimator is infeasible, but if one can show that the asymptotic variance is equal to the inverse of the information matrix, then the estimator is asymptotically efficient. In particular, the MLE is asymptotically efficient.

Summary of MLE:
- Consistent
- Asymptotically normal (CAN)
- Asymptotically efficient
- Asymptotically unbiased
Exercises

(1) Consider coin tossing with a single possibly biased coin. The density function for the random variable y = 1 (heads) or y = 0 (tails) is

    f_Y(y, p⁰) = (p⁰)^y (1 − p⁰)^{1−y},  y ∈ {0, 1}.

Suppose that we have a sample of size n. We know from above that the ML estimator is p̂ = ȳ.
  (a) Find the asymptotic distribution of the ML estimator of p⁰.
  (b) Write an Octave program that does a Monte Carlo study that shows that √n(ȳ − p⁰) is approximately normally distributed when n is large. Please give me histograms that show the sampling frequency of √n(ȳ − p⁰) for several values of n.

(2) Consider the model y_t = x_t'β + αε_t where the errors follow the Cauchy (Student-t with 1 degree of freedom) density,

    f(ε_t) = 1/(π(1 + ε_t²)),  −∞ < ε_t < ∞.

The Cauchy density has a shape similar to a normal density, but with much thicker tails. Thus, extremely small and large errors occur much more frequently with this density than would happen if the errors were normally distributed. Find the score function g_n(θ), where θ = (β', α)'.

(3) Consider the classical linear regression model y_t = x_t'β + ε_t with normal errors. Find the score function g_n(θ), where θ = (β', σ)'.

(4) Compare the first order conditions that define the ML estimators of problems 2 and 3 and interpret the differences. Why are the first order conditions that define an efficient estimator different in the two cases?
CHAPTER 5

Asymptotic properties of the least squares estimator

5.1. Consistency

The OLS estimator is

    β̂ = (X'X)⁻¹X'y
       = (X'X)⁻¹X'(Xβ⁰ + ε)
       = β⁰ + (X'X)⁻¹X'ε
       = β⁰ + (X'X/n)⁻¹(X'ε/n).

Consider the last two terms. Assume that lim_{n→∞} X'X/n = Q_X, a finite positive definite matrix, so (X'X/n)⁻¹ → Q_X⁻¹. Considering X'ε/n, write it as an average:

    X'ε/n = (1/n) Σ_{t=1}^{n} x_t ε_t.

Each x_t ε_t has expectation zero, since E(ε_t) = 0 and x_t is nonstochastic, and the variance of each term is

    V(x_t ε_t) = x_t x_t' σ².

As long as these are finite, and given a technical condition¹, the Kolmogorov SLLN applies, so

    (1/n) Σ_{t=1}^{n} x_t ε_t → 0, a.s.,

and therefore β̂ → β⁰, a.s.: the OLS estimator is strongly consistent.

¹ For application of LLNs and CLTs, of which there are very many to choose from, I'm going to avoid the technicalities. Basically, as long as terms of an average have finite variances and are not too strongly dependent, one will be able to find a LLN or CLT to apply.
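Consistency is easy to see in simulation: as n grows, the sampling error of β̂ shrinks roughly like 1/√n. A Python sketch (simulated simple regression; the true slope 2.0 is an arbitrary choice):

```python
import random

# Consistency: the OLS slope estimate converges to the true value as n grows
random.seed(3)
def ols_slope(n):
    x = [random.uniform(0, 10) for _ in range(n)]
    y = [1.0 + 2.0 * xi + random.gauss(0, 3) for xi in x]
    xbar = sum(x) / n; ybar = sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
           sum((xi - xbar) ** 2 for xi in x)

for n in (10, 100, 10000):
    print(n, ols_slope(n))   # estimation errors shrink as n grows
```

Note that this only illustrates the result; the proof above is what guarantees it, given the assumptions on X'X/n and the errors.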
5.2. Asymptotic normality

We've seen that the OLS estimator is consistent. To examine its asymptotic distribution, write

    √n (β̂ − β⁰) = (X'X/n)⁻¹ (X'ε/√n).

Now as before, (X'X/n)⁻¹ → Q_X⁻¹. Considering X'ε/√n, the mean is zero and the limit of its variance is

    lim_{n→∞} E[(X'ε/√n)(ε'X/√n)] = lim_{n→∞} σ₀²(X'X/n) = σ₀² Q_X.

Assuming a CLT applies,

    X'ε/√n →d N(0, σ₀² Q_X),

and therefore,

    √n (β̂ − β⁰) →d N(0, σ₀² Q_X⁻¹).

In summary, the OLS estimator is normally distributed in small and large samples if ε is normally distributed. If ε is not normally distributed, β̂ is asymptotically normally distributed when a CLT can be applied.
5.3. Asymptotic efficiency

The least squares objective function is the sum of squared errors. Supposing normality, the model is

    y_t = x_t'β + ε_t,  ε_t ~ N(0, σ²).

The joint density for y can be constructed using a change of variables. We have ε_t = y_t − x_t'β, so ∂ε_t/∂y_t = 1 and |∂ε/∂y'| = 1, so

    f(y) = ∏_{t=1}^{n} (1/√(2πσ²)) exp(−(y_t − x_t'β)²/(2σ²)).

Taking logs,

    ln L(β, σ) = −n ln √(2π) − n ln σ − Σ_{t=1}^{n} (y_t − x_t'β)²/(2σ²).

Maximizing this function with respect to β gives the same answer as minimizing the sum of squared errors (up to multiplication by a constant), so the ML and OLS estimators of β are the same, under the present assumptions. Therefore, their properties are the same. In particular, under the classical assumptions with normality, the OLS estimator β̂ is asymptotically efficient. As we will see later, this conclusion depends on ε actually being normally distributed.
CHAPTER 6

Restrictions and hypothesis testing

6.1. Exact linear restrictions

In many cases, economic theory suggests restrictions on the parameters of a model. For example, a demand function is supposed to be homogeneous of degree zero in prices and income. If we have the demand model

    ln q = β₁ + β₂ ln p + β₃ ln m + ε,

homogeneity of degree zero means that multiplying prices and income by a common factor must leave the quantity demanded unchanged, so the restriction is β₂ + β₃ = 0. One simple way to impose this is to set β₃ = −β₂ and substitute into the model. More generally, the restricted model is

    y = Xβ + ε
    Rβ = r

where R is a Q×K matrix, Q < K, and r is a Q×1 vector of constants. We assume R is of rank Q, so that there are no redundant restrictions. We also assume that a β that satisfies the restrictions exists, so that the restrictions are feasible.

6.1.1. Imposition. The most obvious approach is to minimize the sum of squared errors subject to the restriction, using the Lagrangean

    min s(β, λ) = (y − Xβ)'(y − Xβ) + 2λ'(Rβ − r).

The Lagrange multipliers are scaled by 2, which makes things less messy. The fonc are

    D_β s(β̂_R, λ̂) = −2X'y + 2X'Xβ̂_R + 2R'λ̂ ≡ 0
    D_λ s(β̂_R, λ̂) = Rβ̂_R − r ≡ 0,

which can be written as

    [ X'X  R' ; R  0 ] [ β̂_R ; λ̂ ] = [ X'y ; r ].

We get

    [ β̂_R ; λ̂ ] = [ X'X  R' ; R  0 ]⁻¹ [ X'y ; r ].

For the masochists: stepwise inversion of the partitioned matrix gives, after some algebra, with P ≡ R(X'X)⁻¹R',

    β̂_R = β̂ − (X'X)⁻¹R'P⁻¹(Rβ̂ − r)
    λ̂ = P⁻¹(Rβ̂ − r),

so the restricted estimator adjusts the unrestricted OLS estimator β̂ by a term that depends on how badly β̂ violates the restriction.
Though this is the obvious way to go about finding the restricted estimator, an easier way, if the number of restrictions is small, is to impose them by substitution. Write

    y = X₁β₁ + X₂β₂ + ε
    [ R₁  R₂ ] [ β₁ ; β₂ ] = r

where R₁ is Q×Q nonsingular. Supposing the Q restrictions are linearly independent, one can always make R₁ nonsingular by reorganizing the columns of X. Then

    β₁ = R₁⁻¹(r − R₂β₂).

Substitute this into the model:

    y = X₁R₁⁻¹r − X₁R₁⁻¹R₂β₂ + X₂β₂ + ε
    y − X₁R₁⁻¹r = (X₂ − X₁R₁⁻¹R₂)β₂ + ε

or, with the appropriate definitions,

    y_R = X_R β₂ + ε.

This model satisfies the classical assumptions, supposing the restriction is true. One can estimate β₂ by OLS. The variance of β̂₂ is as before

    V(β̂₂) = (X_R'X_R)⁻¹σ₀²

and the estimator is

    V̂(β̂₂) = (X_R'X_R)⁻¹σ̂².

To recover β̂₁, use the restriction. To find its variance, use the fact that β̂₁ is a linear function of β̂₂, so

    V(β̂₁) = R₁⁻¹R₂ V(β̂₂) R₂'(R₁⁻¹)'.
Properties of the restricted estimator. We have

    β̂_R = β̂ − (X'X)⁻¹R'P⁻¹(Rβ̂ − r)
        = β + (X'X)⁻¹X'ε − (X'X)⁻¹R'P⁻¹(Rβ + R(X'X)⁻¹X'ε − r)

so

    β̂_R − β = (X'X)⁻¹X'ε − (X'X)⁻¹R'P⁻¹(Rβ − r) − (X'X)⁻¹R'P⁻¹R(X'X)⁻¹X'ε.

Noting that the crosses between the second term and the other terms expect to zero, and that the cross of the first and third has a cancellation with the square of the third, we obtain the mean squared error matrix

    MSE(β̂_R) = (X'X)⁻¹σ²
              + (X'X)⁻¹R'P⁻¹(Rβ − r)(Rβ − r)'P⁻¹R(X'X)⁻¹
              − (X'X)⁻¹R'P⁻¹R(X'X)⁻¹σ².

So, the first term is the OLS covariance. The second term is PSD, and the third term is NSD.

- If the restriction is true, the second term is 0, so we are better off. True restrictions improve efficiency of estimation.
- If the restriction is false, we may be better or worse off, in terms of MSE, depending on the magnitudes of Rβ − r and σ².
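The restricted-estimator formula is easy to verify numerically for one restriction. A Python sketch (simulated two-regressor model with no constant, with the hypothetical restriction β₁ + β₂ = 1, i.e. R = [1 1], r = 1; all values are illustrative, not from the notes):

```python
import random

# Restricted least squares: b_R = b - (X'X)^{-1} R' P^{-1} (R b - r),
# with P = R (X'X)^{-1} R', for R = [1, 1] and r = 1.
random.seed(7)
n = 200
x1 = [random.uniform(1, 5) for _ in range(n)]
x2 = [random.uniform(1, 5) for _ in range(n)]
y = [0.3 * a + 0.7 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

# X'X and X'y for the two-regressor model
s11 = sum(a * a for a in x1); s12 = sum(a * b for a, b in zip(x1, x2))
s22 = sum(b * b for b in x2)
t1 = sum(a * yi for a, yi in zip(x1, y)); t2 = sum(b * yi for b, yi in zip(x2, y))
det = s11 * s22 - s12 * s12
i11, i12, i22 = s22 / det, -s12 / det, s11 / det   # (X'X)^{-1} entries
b1 = i11 * t1 + i12 * t2                           # unrestricted OLS
b2 = i12 * t1 + i22 * t2
P = i11 + 2 * i12 + i22                            # scalar R (X'X)^{-1} R'
viol = b1 + b2 - 1.0                               # R b - r
br1 = b1 - (i11 + i12) * viol / P                  # restricted estimates
br2 = b2 - (i12 + i22) * viol / P
print(br1 + br2)   # equals 1 up to rounding: the restriction holds exactly
```

Since the true parameters (0.3, 0.7) satisfy the restriction, the restricted estimates stay close to them while satisfying β̂₁ + β̂₂ = 1 exactly.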
6.2. Testing
In many cases, one wishes to test economic theories. If theory suggests parameter restrictions, as in the above homogeneity example, one can test theory
by testing parameter restrictions. A number of tests are available.
6.2.1. t-test. Suppose one has the model

y = Xβ₀ + ε,  ε ~ N(0, σ₀²Iₙ),

and one wishes to test the single restriction H₀: Rβ₀ = r vs. H_A: Rβ₀ ≠ r, where R is a 1 × K row vector. Under H₀, with normality of the errors,

Rβ̂ − r ~ N(0, σ₀²R(X'X)⁻¹R'),

so

(Rβ̂ − r) / √(σ₀²R(X'X)⁻¹R') ~ N(0, 1).

The problem is that σ₀² is unknown. To deal with this, we need a few results on the χ² and related distributions.
PROPOSITION 4. If x ~ N(μ, Iₙ), then

(6.2.1)  x'x ~ χ²(n, λ),

where λ = μ'μ is the noncentrality parameter. When a χ² random variable has the noncentrality parameter equal to zero, it is referred to as a central χ² random variable, and the noncentrality parameter is suppressed in the notation.

PROPOSITION 5. If the n-dimensional random vector x ~ N(0, V), then

(6.2.2)  x'V⁻¹x ~ χ²(n).

We'll prove this one as an indication of how the following unproven propositions could be proved. Proof: Factor V⁻¹ as P'P (this is the Cholesky factorization). Then consider y = Px. We have

y ~ N(0, PVP'),

but

VP'P = Iₙ
PVP'P = P,

so PVP' = Iₙ, and thus y ~ N(0, Iₙ). Thus y'y ~ χ²(n), but

y'y = x'P'Px = x'V⁻¹x,

and we get the result we wanted.

A more general proposition which implies this result is

PROPOSITION 6. If the n-dimensional random vector x ~ N(0, V), then

(6.2.3)  x'Bx ~ χ²(ρ(B))

if and only if BV is idempotent.

An immediate consequence is

PROPOSITION 7. If the random vector (of dimension n) x ~ N(0, I), and B is idempotent with rank r, then x'Bx ~ χ²(r).

Consider the random variable

ε̂'ε̂/σ₀² = ε'M_X ε/σ₀² = (ε/σ₀)'M_X(ε/σ₀) ~ χ²(n − K),

since M_X = Iₙ − X(X'X)⁻¹X' is idempotent with rank n − K.

PROPOSITION 8. If the random vector (of dimension n) x ~ N(0, I), then Ax and x'Bx are independent if AB = 0.

Consider the random variables Rβ̂ − r (a linear function of ε) and ε̂'ε̂ (a quadratic form in ε). Since

R(X'X)⁻¹X' · M_X = 0,

they are independent.

PROPOSITION 9. If z ~ N(0, 1) and w ~ χ²(r) are independent, then

z/√(w/r) ~ t(r).

With these results we can derive the t statistic. Under H₀ and normality,

z = (Rβ̂ − r)/√(σ₀²R(X'X)⁻¹R') ~ N(0, 1)

and

(n − K)σ̂²/σ₀² = ε̂'ε̂/σ₀² ~ χ²(n − K),

and the two are independent, as long as σ̂² = ε̂'ε̂/(n − K) is used in place of σ₀². The unknown σ₀² cancels between numerator and denominator, so

(6.2.4)  t = (Rβ̂ − r)/√(σ̂²R(X'X)⁻¹R') ~ t(n − K).

- Note: the t-test is strictly valid only if the errors are actually normally distributed. If one has nonnormal errors, one could use the above asymptotic results, in which case t →a N(0, 1), so the t(n − K) distribution is only an approximation.
- In practice, a conservative procedure is to take critical values from the t(n − K) distribution rather than the N(0, 1) distribution, since the t(n − K) critical values are larger, and t(n − K) → N(0, 1) as n → ∞.
6.2.2. F test.

PROPOSITION 10. If x ~ χ²(r) and y ~ χ²(s) are independent, then

(x/r)/(y/s) ~ F(r, s).

Using this, one can show that, under H₀: Rβ = r (with R a q × K matrix of rank q) and normality of the errors,

(6.2.5)  F = [(ESS_R − ESS_U)/q] / [ESS_U/(n − K)] ~ F(q, n − K),

where ESS_R = ε̃'ε̃ is the sum of squared residuals of the restricted model and ESS_U = ε̂'ε̂ is that of the unrestricted model: the numerator and denominator are independent χ² random variables, each divided by its degrees of freedom.

- Note: the F test is strictly valid only if the errors are truly normally distributed. If nonnormality is present, qF still converges in distribution to a χ²(q) random variable under H₀, so the test remains asymptotically valid.
6.2.3. Wald-type tests. The Wald principle is based on the idea that if a
restriction is true, the unrestricted model should approximately satisfy the
restriction. Given that the least squares estimator is asymptotically normally
distributed:

√n(β̂ − β₀) →d N(0, σ₀²Q_X⁻¹),

then under H₀: Rβ₀ = r we have

√n(Rβ̂ − r) →d N(0, σ₀²RQ_X⁻¹R'),

so by Proposition [6]

n(Rβ̂ − r)' [σ₀²RQ_X⁻¹R']⁻¹ (Rβ̂ − r) →d χ²(q).

Note that Q_X and σ₀² are not observable. With the consistent estimators Q̂_X = X'X/n in place of Q_X and σ̂² in place of σ₀², the n's cancel and we obtain the Wald statistic

W = (Rβ̂ − r)' [σ̂²R(X'X)⁻¹R']⁻¹ (Rβ̂ − r) →a χ²(q),

where q = ρ(R) is the number of restrictions. Note that when σ̂² = ε̂'ε̂/(n − K), this statistic is simply qF, so the Wald test is asymptotically equivalent to the F test.
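A minimal sketch of the Wald statistic in Python/NumPy (the notes use Octave; the simulated data and the hard-coded χ²(1) critical value 3.84 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([0.5, 1.0])
y = X @ beta + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
sig2 = e @ e / (n - X.shape[1])

# H0: beta_2 = 1 (true here), written as R b = r
R = np.array([[0.0, 1.0]])
r = np.array([1.0])
V = sig2 * R @ np.linalg.inv(X.T @ X) @ R.T
W = float((R @ b - r) @ np.linalg.solve(V, R @ b - r))
# Compare W to the chi^2(1) 5% critical value, approximately 3.84
reject = W > 3.84
```

Because V is positive definite, W is a nonnegative quadratic form, as a χ² statistic must be.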
6.2.4. Score-type tests (Rao tests, Lagrange multiplier tests). In some cases,
an unrestricted model may be nonlinear in the parameters, but the model is
linear in the parameters under the null hypothesis. For example, the model
y = (x'β)^γ + ε

(a purely illustrative specification) is nonlinear in β and γ, but is linear in β under H₀: γ = 1. Estimation of nonlinear models is a bit complicated, so one might prefer to have a test based upon the restricted, linear model. The score test is useful in this situation.

Recall that the restricted least squares fonc give the Lagrange multiplier

λ̂ = [R(X'X)⁻¹R']⁻¹(Rβ̂ − r).

Given that

√n(Rβ̂ − r) →d N(0, σ₀²RQ_X⁻¹R')

under the null hypothesis, and noting that R(X'X)⁻¹R' = (1/n)RQ̂_X⁻¹R', we obtain

(1/√n) λ̂ = [RQ̂_X⁻¹R']⁻¹ √n(Rβ̂ − r) →d N(0, σ₀²[RQ_X⁻¹R']⁻¹),

since the n's cancel and inserting the limit of a matrix of constants changes nothing. Therefore

(λ̂/√n)' [RQ_X⁻¹R'] (λ̂/√n) / σ₀² →d χ²(q).

In this case, substituting the estimable quantities (the n's again cancel), the score (or Lagrange multiplier, LM) statistic is

LM = λ̂' [R(X'X)⁻¹R'] λ̂ / σ̃² →d χ²(q),

where σ̃² is a consistent estimator of σ₀² based on the restricted model. Noting from the restricted fonc that R'λ̂ = X'ε̃, where ε̃ is the vector of restricted residuals, we can substitute to get that

LM = ε̃'X(X'X)⁻¹X'ε̃ / σ̃².
To see why the test is also known as a score test, note that the fonc for restricted
least squares
−2X'(y − Xβ̂_R) + 2R'λ̂ ≡ 0

give us

R'λ̂ = X'ε̃,
and the rhs is simply the gradient (score) of the unrestricted model, evaluated
at the restricted estimator. The scores evaluated at the unrestricted estimate are
identically zero. The logic behind the score test is that the scores evaluated at
the restricted estimate should be approximately zero, if the restriction is true.
The test is also known as a Rao test, since Rao first proposed it in 1948.
6.2.5. Likelihood ratio-type tests. The Wald test can be calculated using
the unrestricted model. The score test can be calculated using only the restricted model. The likelihood ratio test, on the other hand, uses both the restricted and the unrestricted estimators. The test statistic is
LR = 2(ln L(θ̂) − ln L(θ̃)),

where θ̂ is the unrestricted estimator and θ̃ is the restricted estimator. To show that LR is asymptotically χ², take a second-order Taylor expansion of ln L(θ̃) about θ̂:

ln L(θ̃) ≈ ln L(θ̂) + g(θ̂)'(θ̃ − θ̂) + (1/2)(θ̃ − θ̂)'H(θ̂)(θ̃ − θ̂).

The first-order term is zero, since g(θ̂) ≡ 0 by the fonc for the unrestricted estimator, so

LR ≈ −(θ̃ − θ̂)'H(θ̂)(θ̃ − θ̂)
   = √n(θ̃ − θ̂)' (−H(θ̂)/n) √n(θ̃ − θ̂).

As n → ∞, −H(θ̂)/n → I_∞(θ₀), the information matrix, so

LR = √n(θ̃ − θ̂)' I_∞(θ₀) √n(θ̃ − θ̂) + o_p(1).

We also have that, under H₀: Rθ₀ = r, one can expand the fonc of the restricted estimator and manipulate to obtain

√n(θ̃ − θ̂) = −I_∞(θ₀)⁻¹R'[R I_∞(θ₀)⁻¹R']⁻¹ R √n(θ̂ − θ₀) + o_p(1).

Substituting this into the expression for LR and simplifying (the inner I_∞ factors cancel), we get, under H₀,

LR = √n(Rθ̂ − r)' [R I_∞(θ₀)⁻¹R']⁻¹ √n(Rθ̂ − r) + o_p(1).

But since

√n(θ̂ − θ₀) →d N(0, I_∞(θ₀)⁻¹),

we have

√n(Rθ̂ − r) →d N(0, R I_∞(θ₀)⁻¹R').

We can see that LR is a quadratic form of this random variable, with the inverse of its variance in the middle, so

LR →d χ²(q).
6.3. The asymptotic equivalence of the LR, Wald and score tests
We have seen that the three tests all converge, under the null hypothesis, to χ²(q) random variables. In fact, they all converge to the same χ²(q) random variable. We will show that the Wald and LR tests are asymptotically equivalent. We have seen that the Wald test is asymptotically equivalent to

W ≈ n(Rβ̂ − r)' [σ₀²RQ_X⁻¹R']⁻¹ (Rβ̂ − r).

Using β̂ − β₀ = (X'X)⁻¹X'ε and Rβ̂ − r = R(β̂ − β₀) (under the null), we get

√n(Rβ̂ − r) = √n R(X'X)⁻¹X'ε = R(X'X/n)⁻¹ n^(−1/2) X'ε.

Substituting into the expression for W, the n's cancel:

W ≈ ε'X(X'X)⁻¹R' [σ₀²R(X'X)⁻¹R']⁻¹ R(X'X)⁻¹X'ε.

Now define the n × q matrix A = X(X'X)⁻¹R', and note that A'A = R(X'X)⁻¹R'. Then

W ≈ ε'A(A'A)⁻¹A'ε / σ₀² = ε'P_A ε / σ₀²,

where P_A is the projection matrix onto the column space of A. Since P_A is idempotent with rank q, W →d χ²(q) by the propositions above.

Turning to the LR statistic, we had

LR ≈ √n(Rθ̂ − r)' [R I_∞(θ₀)⁻¹R']⁻¹ √n(Rθ̂ − r).

For the classical linear model with normal errors, the information matrix for the slope parameters is I_∞(θ₀) = Q_X/σ₀², so substituting as before gives

LR ≈ ε'P_A ε / σ₀²

as well. This completes the proof that the Wald and LR tests are asymptotically equivalent. Similarly, one can show that, under the null hypothesis,

qF ≈ W  and  LM ≈ W,

so all four test statistics converge to the same χ²(q) random variable.
The proof for the statistics other than LR does not depend upon normality of the errors, as can be verified by examining the expressions for the statistics. The LR statistic is based upon distributional assumptions, since one can't write the likelihood function without them.

Though the four statistics are asymptotically equivalent, they are numerically different in small samples. The numeric values of the tests also depend upon how σ² is estimated, and we've already seen that there are several ways to do this. It can be shown, for linear regression models subject to linear restrictions, that if ε̂'ε̂/n is used to calculate the Wald test and ε̃'ε̃/n is used for the score test, then

W ≥ LR ≥ LM.
For this reason, the Wald test will always reject if the LR test rejects, and in
turn the LR test rejects if the LM test rejects. This is a bit problematic: there is
the possibility that by careful choice of the statistic used, one can manipulate
reported results to favor or disfavor a hypothesis. A conservative/honest approach would be to report all three test statistics when they are available. In
the case of linear models with normal errors the F test is to be preferred, since it is exact.

6.5. Confidence intervals

Confidence intervals for single coefficients are generated in the normal manner. Given the t statistic

t(β) = (β̂ − β)/σ̂_β̂,

a 100(1 − α)% confidence interval for β₀ is defined by the bounds of the set of β such that t(β) does not reject H₀: β₀ = β, using a α significance level:

C(α) = {β : −c_(α/2) < (β̂ − β)/σ̂_β̂ < c_(α/2)},

where c_(α/2) is the critical value such that a t(n − K) random variable exceeds it with probability α/2. The set of such β is the interval

β̂ ± σ̂_β̂ c_(α/2).
A confidence ellipse for two coefficients jointly would be, analogously, the set of {β₁, β₂} such that the F (or some other test statistic) doesn't reject at the specified critical value. This generates an ellipse, if the estimators are correlated. The region is an ellipse, since the CI for an individual coefficient defines an (infinitely long) rectangle with total probability mass 1 − α, since the other coefficient is marginalized (e.g., can take on any value).

6.6. Bootstrapping

When we rely on asymptotic theory, exact small-sample results are unavailable: the t and F distributions derived above are exact only under normality, and the asymptotic approximation to the distribution of √n(β̂ − β₀) may be inaccurate when there are important departures from the assumptions. If the sample size is small and errors are highly nonnormal, the asymptotic approximation may be a poor guide to the small
sample distribution. Also, the distributions of test statistics may not resemble
their limiting distributions at all. A means of trying to gain information on the
small sample distribution of test statistics and estimators is the bootstrap. We'll
consider a simple example, just to get the main idea.
Suppose that

y = Xβ₀ + ε,

where ε is iid(0, σ₀²Iₙ) but not necessarily normally distributed, and X is nonstochastic. The small-sample distribution of a test statistic will be unknown (its asymptotic distribution is known). The bootstrap proceeds as follows.

(1) Draw n observations from the OLS residuals ε̂ with replacement. Call this vector ε̃ʲ (it is an n-vector).
(2) Then generate the data ỹʲ = Xβ̂ + ε̃ʲ.
(3) Now take this and estimate β̃ʲ = (X'X)⁻¹X'ỹʲ.
(4) Save β̃ʲ.
(5) Repeat steps 1–4 until we have a large number, J, of the β̃ʲ.

With this, we can use the replications to calculate the empirical distribution of β̃ʲ. One way to form a 100(1 − α)% confidence interval for β₀ would be to order the β̃ʲ from smallest to largest, drop the first and last Jα/2 of the replications, and use the remaining endpoints as the limits of the CI. Note that this will not give the shortest CI if the empirical distribution is skewed.

- Suppose one was interested in the distribution of some function of β̂, for example a test statistic. Simple: just calculate the transformation for each j, and work with the empirical distribution of the transformation.
- How to choose J: J should be large enough that the results don't change with repetition of the entire bootstrap. This is easy to check.
- The bootstrap is based fundamentally on the idea that the empirical distribution of the sample data converges to the actual sampling distribution as n becomes large, so statistics based on sampling from the empirical distribution should converge in distribution to statistics based on sampling from the actual sampling distribution.
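The five steps above can be sketched as follows in Python/NumPy (the notes use Octave; the simulated data, seed, and the deliberately nonnormal Laplace errors are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])
eps = rng.laplace(size=n)          # deliberately nonnormal errors
y = X @ beta + eps

b = np.linalg.solve(X.T @ X, X.T @ y)
ehat = y - X @ b

J = 999
boot = np.empty((J, 2))
for j in range(J):
    e_star = rng.choice(ehat, size=n, replace=True)   # step 1: resample residuals
    y_star = X @ b + e_star                           # step 2: regenerate data
    boot[j] = np.linalg.solve(X.T @ X, X.T @ y_star)  # steps 3-4: re-estimate, save

# Percentile-method 95% CI for the slope coefficient
lo, hi = np.percentile(boot[:, 1], [2.5, 97.5])
```

Ordering the replications and trimming the tails, as the percentile call does, implements the CI construction described in the text.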
6.7. Testing nonlinear restrictions, and the delta method

Testing nonlinear restrictions of the form H₀: r(β₀) = 0 vs. H_A: r(β₀) ≠ 0, where r(·) is a q-vector valued function, is handled by linearizing. Take a first-order Taylor series expansion of r(β̂) about β₀:

r(β̂) = r(β₀) + R(β*)(β̂ − β₀),

where β* is a convex combination of β̂ and β₀, and R(β) ≡ ∂r(β)/∂β' is the q × K Jacobian, evaluated at β*. Under the null, r(β₀) = 0, so

√n r(β̂) = R(β*) √n(β̂ − β₀).

Due to consistency of β̂ we can replace β* by β₀, asymptotically, so

√n r(β̂) →d N(0, R(β₀) σ₀²Q_X⁻¹ R(β₀)').

The resulting Wald-type statistic, with unknowns replaced by consistent estimators, is

W = r(β̂)' [R(β̂)(X'X)⁻¹R(β̂)' σ̂²]⁻¹ r(β̂) →d χ²(q).

This linearization is known in the literature as the delta method.
Since this is a Wald test, it will tend to over-reject in finite samples. The score and LR tests are also possibilities, but they require estimation methods for nonlinear models, which aren't in the scope of this course.
Note that this also gives a convenient way to estimate nonlinear functions of the parameters and associated asymptotic confidence intervals. If the nonlinear function r(β₀) is not hypothesized to be zero, we just have

√n (r(β̂) − r(β₀)) →d N(0, R(β₀) σ₀²Q_X⁻¹ R(β₀)'),

so an approximation to the distribution of the function of the estimator is

r(β̂) ≈ N(r(β₀), R(β₀) σ̂²(X'X)⁻¹ R(β₀)').

For example, consider the linear model

y = x'β + ε.

The elasticities of y with respect to the elements of x are

η(x) = β ⊙ x / (x'β)

(note that this is the entire vector of elasticities, with ⊙ denoting elementwise multiplication). The estimated elasticities are η̂ = β̂ ⊙ x / (x'β̂). To calculate their estimated standard errors, use the K × K Jacobian

R(β) = ∂η/∂β' = [diag(x)(x'β) − (β ⊙ x)x'] / (x'β)².

To get a consistent estimator just substitute in β̂. Note that the elasticity and the standard error are functions of x.
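The elasticity calculation can be sketched as follows in Python/NumPy (the notes use Octave; the simulated design and the evaluation point x0 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 500, 3
X = rng.uniform(1.0, 2.0, size=(n, k))
beta = np.array([1.0, 0.5, 0.25])
y = X @ beta + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
sig2 = e @ e / (n - k)

x0 = np.array([1.5, 1.5, 1.5])      # hypothetical evaluation point
xb = x0 @ b
eta = b * x0 / xb                   # vector of estimated elasticities

# Delta method: R(b) = [diag(x0)(x0'b) - (b*x0) x0'] / (x0'b)^2
R = (np.diag(x0) * xb - np.outer(b * x0, x0)) / xb**2
V_eta = R @ (sig2 * XtX_inv) @ R.T
se_eta = np.sqrt(np.abs(np.diag(V_eta)))
```

For a linear model the elasticities sum to one identically at any x, which is a handy sanity check on the formula.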
In many cases, nonlinear restrictions can also involve the data, not just the parameters. For example, consider a model of expenditure shares. Let x(p, m) be a demand function, where p is prices and m is income. An expenditure share system for G goods is

sᵢ(p, m) = pᵢ xᵢ(p, m)/m,  i = 1, 2, ..., G.

Now demand must be positive, and we assume that expenditures sum to income, so we have the restrictions

0 ≤ sᵢ(p, m) ≤ 1, ∀i
Σᵢ sᵢ(p, m) = 1.

Suppose we postulate the linear model

sᵢ(p, m) = βᵢ₁ + p'βᵢₚ + βᵢₘ m + εᵢ.

It is fairly easy to write restrictions such that the shares sum to one, but the restriction that the shares lie in the [0, 1] interval depends on both parameters and the values of p and m: it is impossible to impose the restriction for all possible p and m. In such cases, one should consider whether the linear specification is sensible.
6.8. Example: the Nerlove data. Remember the Nerlove cost function model, ln C = β₁ + β₂ ln Q + β₃ ln P_L + β₄ ln P_F + β₅ ln P_K + ε. The theory of cost functions implies homogeneity of degree one in input prices (HOD1), β₃ + β₄ + β₅ = 1, while constant returns to scale (CRTS) corresponds to β₂ = 1. The unrestricted OLS results are

*********************************************************
OLS estimation results
Observations 145
R-squared 0.925955
Sigma-squared 0.153943

Results (Ordinary var-cov estimator)

            estimate   st.err.   t-stat.   p-value
constant      -3.527     1.774    -1.987     0.049
output         0.720     0.017    41.244     0.000
labor          0.436     0.291     1.499     0.136
fuel           0.427     0.100     4.249     0.000
capital       -0.220     0.339    -0.648     0.518
*********************************************************
Note that the sum of the input price coefficients is 0.436 + 0.427 − 0.220 = 0.643, which does not satisfy HOD1 exactly, and that the output coefficient, 0.720, is well below 1. Imposing and testing HOD1 gives
*******************************************************
Restricted LS estimation results (HOD1 imposed)
Observations 145
R-squared 0.925652
Sigma-squared 0.155686

            estimate   st.err.   t-stat.   p-value
constant      -4.691     0.891    -5.263     0.000
output         0.721     0.018    41.040     0.000
labor          0.593     0.206     2.878     0.005
fuel           0.414     0.100     4.159     0.000
capital       -0.007     0.192    -0.038     0.969
*******************************************************

            Value    p-value
F           0.574      0.450
Wald        0.594      0.441
LR          0.593      0.441
Score       0.592      0.442
*******************************************************
Restricted LS estimation results (CRTS imposed)
Observations 145
R-squared 0.790420
Sigma-squared 0.438861

            estimate   st.err.   t-stat.   p-value
constant      -7.530     2.966    -2.539     0.012
output         1.000     0.000       Inf     0.000
labor          0.020     0.489     0.040     0.968
fuel           0.715     0.167     4.289     0.000
capital        0.076     0.572     0.132     0.895
*******************************************************

            Value    p-value
F         256.262      0.000
Wald      265.414      0.000
LR        150.863      0.000
Score      93.771      0.000
Notice that the input price coefficients in fact sum to 1 when HOD1 is imposed (0.593 + 0.414 − 0.007 = 1). Also, R² does not drop much when the restriction is imposed, compared to the unrestricted results, and none of the tests reject: HOD1 is not rejected at usual significance levels. Imposing CRTS is another matter: R² drops sharply, and all of the test statistics reject CRTS very strongly. If you look at the unrestricted estimation results, you can see that a confidence interval for β₂ does not overlap 1, so a t-test of β₂ = 1 rejects as well.

From the point of view of neoclassical economic theory, these results are not anomalous: HOD1 is an implication of the theory, but CRTS is not.
EXERCISE 12. Modify the NerloveRestrictions.m program to impose and test the restrictions jointly.
The Chow test. Since CRTS is rejected, let's examine the possibilities more carefully. Recall that the data is sorted by output (the third column). Define 5 subsamples of firms, with the first group being the 29 firms with the lowest output levels, then the next 29 firms, etc. The five subsamples can be indexed by j = 1, 2, ..., 5, where j = 1 for t = 1, 2, ..., 29, j = 2 for t = 30, 31, ..., 58, etc. Consider the model

(6.8.1)  ln C = β₁ʲ + β₂ʲ ln Q + β₃ʲ ln P_L + β₄ʲ ln P_F + β₅ʲ ln P_K + ε,

where j is a superscript (not a power) that indicates that the coefficients may be different according to the subsample in which the observation falls. That is, the coefficients depend upon j, which in turn depends upon t.
The first column of nerlove.data indicates this way of breaking up the sample. The new model may be written as

(6.8.2)
[ y₁ ]   [ X₁  0   ···  0  ] [ β¹ ]
[ y₂ ]   [ 0   X₂       ⋮  ] [ β² ]
[ ⋮  ] = [ ⋮        ⋱      ] [ ⋮  ] + ε,
[ y₅ ]   [ 0   ···      X₅ ] [ β⁵ ]

where y₁ is the 29×1 vector of observations on the dependent variable for the first subsample, X₁ is the 29×5 matrix of regressors for the first subsample, and so on for the other subsamples. Restricting all five coefficient vectors to be equal, β¹ = β² = ··· = β⁵, takes us back to the original model.
This type of test, that parameters are constant across different sets of data, is
sometimes referred to as a Chow test.
There are 20 restrictions. If that's not clear to you, look at the Octave program.
Since the restrictions are rejected, we should probably use the unrestricted
model for analysis. What is the pattern of RTS as a function of the output
[Figure 6.8.1: RTS (vertical axis, roughly 0.8 to 2.4) plotted against output group (horizontal axis, 1 to 5).]
group (small to large)? Figure 6.8.1 plots RTS. We can see that there is increasing RTS for small firms, but that RTS is approximately constant for large firms.
(1) Using the Chow test on the Nerlove model, we reject that there is coefficient stability across the 5 groups. But perhaps we could restrict the input price coefficients to be the same but let the constant and output coefficients vary by group size. This new model is

(6.8.3)  ln Cᵢ = β₁ʲ + β₂ʲ ln Qᵢ + β₃ ln P_Li + β₄ ln P_Fi + β₅ ln P_Ki + εᵢ.

(a) estimate this model by OLS, giving R², estimated standard errors
for coefficients, t-statistics for tests of significance, and the associated p-values. Interpret the results in detail.
(b) Test the restrictions implied by this model using the F, Wald, score
and likelihood ratio tests. Comment on the results.
(c) Plot the estimated RTS parameters as a function of firm size. Compare the plot to that given in the notes for the unrestricted model.
Comment on the results.
(2) For the simple Nerlove model, estimated returns to scale is RTS = 1/β̂₂. Apply the delta method to calculate the estimated standard error for estimated RTS. Use it to test H₀: RTS = 1 versus H_A: RTS ≠ 1, rather than testing H₀: β₂ = 1 versus H_A: β₂ ≠ 1. Comment on the results.
(3) Perform a Monte Carlo study that generates data from the model

y = −2 + 1·x₂ + 1·x₃ + ε,

where the sample size is 30, x₂ and x₃ are independently uniformly distributed on [0, 1] and ε ~ iid N(0, 1).
(a) Compare the means and standard errors of the estimated coefficients using OLS and restricted OLS, imposing the restriction that β₂ + β₃ = 2.
(b) Compare the means and standard errors of the estimated coefficients using OLS and restricted OLS, imposing the restriction that β₂ + β₃ = 1.
(c) Comment on the results.
CHAPTER 7

Generalized least squares

One of the assumptions we've maintained up to now is that the errors are spherical: V(ε) = σ₀²Iₙ, or occasionally E(εε') = σ₀²Iₙ. Now we'll investigate the consequences of nonidentically and/or dependently distributed errors. We'll assume fixed regressors for now, relaxing this admittedly unrealistic assumption later. The model is

y = Xβ + ε
E(ε) = 0
V(ε) = Σ,

where Σ is a general symmetric positive definite matrix (we'll write β in place of β₀ to simplify the typing of these notes).

- The case where Σ is a diagonal matrix gives uncorrelated, nonidentically distributed errors. This is known as heteroscedasticity.
- The case where Σ has the same number on the main diagonal but nonzero elements off the main diagonal gives identically (assuming higher moments are also the same) dependently distributed errors. This is known as autocorrelation.
- The general case combines heteroscedasticity and autocorrelation. This is known as "nonspherical" disturbances; under the classical assumptions, a joint confidence region for ε would be an n-dimensional hypersphere.
7.1. Effects of nonspherical disturbances on the OLS estimator

The OLS estimator is

β̂ = (X'X)⁻¹X'y = β + (X'X)⁻¹X'ε,

so we have unbiasedness, as before. The variance of β̂, however, is

(7.1.1)  E[(β̂ − β)(β̂ − β)'] = E[(X'X)⁻¹X'εε'X(X'X)⁻¹] = (X'X)⁻¹X'ΣX(X'X)⁻¹,

which differs from the classical formula σ₀²(X'X)⁻¹. Supposing that the needed limits exist, we still have asymptotic normality:

√n(β̂ − β) →d N(0, Q_X⁻¹ lim (X'ΣX/n) Q_X⁻¹).

Summarizing:

- OLS is unbiased and consistent.
- OLS has a different variance than before, so the previous test statistics aren't valid, unless an estimator of the correct covariance matrix is used. Previous test statistics aren't valid in this case for this reason.
- OLS is inefficient, as is shown below.
7.2. The GLS estimator

Suppose Σ were known. Then one could form the Cholesky decomposition

P'P = Σ⁻¹.

We have

P'PΣ = Iₙ,

so

P'PΣP' = P',

which implies that PΣP' = Iₙ. Consider the model

Py = PXβ + Pε,

or, making the obvious definitions,

y* = X*β + ε*.

The variance of ε* = Pε is

E(Pεε'P') = PΣP' = Iₙ.

Therefore, the model

y* = X*β + ε*,  E(ε*) = 0,  V(ε*) = Iₙ

satisfies the classical assumptions. The GLS estimator is simply OLS applied to the transformed model:

β̂_GLS = (X*'X*)⁻¹X*'y* = (X'P'PX)⁻¹X'P'Py = (X'Σ⁻¹X)⁻¹X'Σ⁻¹y.

The GLS estimator is unbiased in the same circumstances under which the OLS estimator is unbiased; e.g., assuming X is nonstochastic,

E(β̂_GLS) = E[(X'Σ⁻¹X)⁻¹X'Σ⁻¹(Xβ + ε)] = β.

The variance of the estimator can be calculated using

β̂_GLS = β + (X*'X*)⁻¹X*'ε*,

so

E[(β̂_GLS − β)(β̂_GLS − β)'] = (X*'X*)⁻¹X*'E(ε*ε*')X*(X*'X*)⁻¹ = (X*'X*)⁻¹ = (X'Σ⁻¹X)⁻¹.

- All the previous results regarding the desirable properties of the least squares estimator hold when dealing with the transformed model, since it satisfies the classical assumptions.
- Tests are valid, using the previous formulas, as long as we substitute X* in place of X. Furthermore, any test that involves σ² can set it to 1. This is preferable to re-deriving the appropriate formulas.
- The GLS estimator is more efficient than the OLS estimator. This is a consequence of the Gauss-Markov theorem, since the GLS estimator is based on a model that satisfies the classical assumptions. To see this directly, note that

Var(β̂) − Var(β̂_GLS) = (X'X)⁻¹X'ΣX(X'X)⁻¹ − (X'Σ⁻¹X)⁻¹ = AΣA',

where A = (X'X)⁻¹X' − (X'Σ⁻¹X)⁻¹X'Σ⁻¹. This may not seem obvious, but it is true, as you can verify for yourself. Then noting that AΣA' is a quadratic form in a positive definite matrix, we conclude that AΣA' is positive semi-definite, and that GLS is efficient relative to OLS.
- As one can verify by calculating fonc, the GLS estimator is the solution to the minimization problem

min_β (y − Xβ)'Σ⁻¹(y − Xβ),

so the metric Σ⁻¹ is used to weight the residuals.
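The equivalence between the explicit GLS formula and OLS on the transformed model can be checked numerically; here is a sketch in Python/NumPy (the notes use Octave; the diagonal "known Σ" and the simulated data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 120
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, -0.5])

# A known heteroscedastic (diagonal) Sigma, for illustration
sig = np.linspace(0.5, 3.0, n)
y = X @ beta + sig * rng.normal(size=n)

# GLS = (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y
Si = np.diag(1.0 / sig**2)
b_gls = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)

# Equivalently, OLS on the transformed model P y = P X beta, with P'P = Sigma^{-1}
P = np.diag(1.0 / sig)
Xs, ys = P @ X, P @ y
b_trans = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
```

The two computations are algebraically identical, so they agree to floating-point precision.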
7.3. Feasible GLS

The problem is that Σ ordinarily isn't known, so this estimator isn't available. Consider the dimension of Σ: it's an n × n matrix with (n² − n)/2 + n = (n² + n)/2 unique elements. The number of parameters to estimate is larger than n and increases faster than n. There's no way to devise an estimator that satisfies a LLN without adding restrictions. The feasible GLS estimator is based upon making sufficient assumptions regarding the form of Σ so that a consistent estimator can be devised.

Suppose that we parameterize Σ as a function of X and θ, where θ may include β as well as other parameters, so that

Σ = Σ(X, θ),

where θ is of fixed dimension. If we can consistently estimate θ, we can consistently estimate Σ, as long as Σ(X, θ) is a continuous function of θ (by the Slutsky theorem). In this case,

Σ̂ = Σ(X, θ̂) →p Σ(X, θ₀).

If we replace Σ in the formulas for the GLS estimator with Σ̂, we obtain the FGLS estimator. The FGLS estimator shares the same asymptotic properties as GLS. These are

(1) Consistency
(2) Asymptotic normality
(3) Asymptotic efficiency if the errors are normally distributed. (Cramer-Rao).
(4) Test procedures are asymptotically valid.

In practice, the usual way to proceed is

(1) Define a consistent estimator of θ. This is a case-by-case proposition, depending on the parameterization Σ(θ). We'll see examples below.
(2) Form Σ̂ = Σ(X, θ̂).
(3) Calculate the Cholesky factorization P̂, where P̂'P̂ = Σ̂⁻¹.
(4) Transform the model: P̂y = P̂Xβ + P̂ε.
(5) Estimate using OLS on the transformed model.
7.4. Heteroscedasticity
Heteroscedasticity is the case where

E(εε') = Σ

is a diagonal matrix, so that the errors are uncorrelated, but have different variances. Heteroscedasticity is usually thought of as associated with cross-sectional data, though there is absolutely no reason why time series data cannot also be heteroscedastic. Actually, the popular ARCH (autoregressive conditionally heteroscedastic) models explicitly assume that a time series is heteroscedastic.

Consider a supply function

qᵢ = β₁ + β_p Pᵢ + β_s Sᵢ + εᵢ,

where Pᵢ is price and Sᵢ is some measure of size of the iᵗʰ firm. One might suppose that unobservable factors (e.g., talent of managers, degree of coordination between production units, etc.) account for the error term εᵢ. If there is more variability in these factors for large firms than for small firms, then εᵢ may have a higher variance when Sᵢ is high than when it is low.

Another example is a demand equation

qᵢ = β₁ + β_p Pᵢ + β_m Mᵢ + εᵢ,

where Pᵢ is price and Mᵢ is income. In this case εᵢ can reflect variations in preferences; there are more possibilities for expression of preferences when one is rich, so it is plausible that the variance of εᵢ is higher when Mᵢ is high.
7.4.1. OLS with heteroscedasticity-consistent varcov estimation. Eicker and White showed how to modify test statistics to account for heteroscedasticity of unknown form. Under heteroscedasticity, the OLS estimator has asymptotic distribution

√n(β̂ − β) →d N(0, Q_X⁻¹ Ω Q_X⁻¹),

where

Ω = lim (1/n) X'ΣX = lim (1/n) Σₜ σₜ² xₜxₜ'.

One can't estimate Σ consistently (it has n unknown parameters), but this is not needed: White showed that

Ω̂ = (1/n) Σₜ ε̂ₜ² xₜxₜ',

where ε̂ₜ is the OLS residual, is consistent for Ω. One can then modify the previous test statistics to obtain tests that are valid when there is heteroscedasticity of unknown form. For example, the Wald test for H₀: Rβ − r = 0 would be

n(Rβ̂ − r)' [R (X'X/n)⁻¹ Ω̂ (X'X/n)⁻¹ R']⁻¹ (Rβ̂ − r) →a χ²(q).
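The sandwich covariance estimator can be sketched as follows in Python/NumPy (the notes use Octave; the simulated heteroscedastic design is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(13)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + np.abs(x) * rng.normal(size=n)  # heteroscedastic errors

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b

# White/Eicker sandwich: (X'X)^{-1} (sum_t e_t^2 x_t x_t') (X'X)^{-1}
meat = (X * (e**2)[:, None]).T @ X
V_hc = XtX_inv @ meat @ XtX_inv
se_hc = np.sqrt(np.diag(V_hc))

# Naive (homoscedastic-formula) standard errors, for comparison
sig2 = e @ e / (n - 2)
se_ols = np.sqrt(np.diag(sig2 * XtX_inv))
```

Here the error variance grows with |x|, so the robust slope standard error exceeds the naive one, illustrating why ignoring heteroscedasticity invalidates the usual tests.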
7.4.2. Detection. There exist many tests for the presence of heteroscedasticity. We'll discuss three methods.
Goldfeld-Quandt. The sample is divided in to three parts, with n₁, n₂ and n₃ observations, where n₁ + n₂ + n₃ = n. The model is estimated using the first and third parts of the sample, separately, so that β̂¹ and β̂³ will be independent. Then we have

ε̂¹'ε̂¹/σ² = ε¹'M¹ε¹/σ² ~ χ²(n₁ − K)

and

ε̂³'ε̂³/σ² = ε³'M³ε³/σ² ~ χ²(n₃ − K),

so

[ε̂¹'ε̂¹/(n₁ − K)] / [ε̂³'ε̂³/(n₃ − K)] ~ F(n₁ − K, n₃ − K).

The distributional result is exact if the errors are normally distributed. This test is a two-tailed test. Alternatively, and probably more conventionally, if one has prior ideas about the possible magnitudes of the variances of the observations, one could order the observations accordingly, from largest to smallest. In this case, one would use a conventional one-tailed F-test. Draw picture.

- Ordering the observations is an important step if the test is to have any power.
- The motive for dropping the middle observations is to increase the difference between the average variance in the subsamples, supposing that there exists heteroscedasticity. This can increase the power of the test. On the other hand, dropping too many observations will substantially increase the variance of the statistics ε̂¹'ε̂¹ and ε̂³'ε̂³. A rule of thumb, based on Monte Carlo experiments, is to drop around 25% of the observations.
- If one doesn't have any ideas about the form of the het. the test will probably have low power since a sensible data ordering isn't available.
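The Goldfeld-Quandt mechanics can be sketched as follows in Python/NumPy (the notes use Octave; the simulated design, the ordering variable, and the 25% middle drop are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 90
x = np.sort(rng.uniform(1, 10, size=n))   # order by the suspected variance driver
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + x * rng.normal(size=n)  # variance grows with x

def ess(Xs, ys):
    """Sum of squared OLS residuals on a subsample."""
    b = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
    e = ys - Xs @ b
    return e @ e

# Drop roughly the middle 25% of the ordered observations
n1, n3, k = 34, 34, 2
ess1 = ess(X[:n1], y[:n1])      # low-variance group
ess3 = ess(X[-n3:], y[-n3:])    # high-variance group

# One-tailed F statistic, suspected larger variance in the numerator
GQ = (ess3 / (n3 - k)) / (ess1 / (n1 - k))
```

With strong heteroscedasticity the per-degree-of-freedom residual variance in the third group dominates that of the first, pushing the statistic well above 1.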
White's test. When one has little idea if there exists heteroscedasticity, and no idea of its potential form, the White test is a possibility. The idea is that if there is homoscedasticity, then
E(εₜ² | xₜ) = σ², ∀t,

so that xₜ or functions of xₜ shouldn't help to explain E(εₜ²). The test works as follows:

(1) Since εₜ isn't available, use the consistent estimator ε̂ₜ instead.
(2) Regress

ε̂ₜ² = σ² + zₜ'γ + vₜ,

where zₜ is a P-vector. zₜ may include some or all of the variables in xₜ, plus the set of all unique squares and cross products of variables in xₜ.
(3) Test the hypothesis that γ = 0. The qF statistic in this case is

qF = P(ESS_R − ESS_U)/P / [ESS_U/(n − P − 1)].

Note that ESS_R = TSS_U, so dividing both numerator and denominator by this we get

qF = (n − P − 1) R²/(1 − R²),

where R² comes from the auxiliary regression (under the null of homoscedasticity it should tend to zero). An asymptotically equivalent statistic, under the null of no heteroscedasticity, is

nR² →a χ²(P).

This doesn't require normality of the errors, though it does assume that the fourth moment of εₜ is constant, under the null. Question: why is this necessary?
- The White test has the disadvantage that it may not be very powerful unless the zₜ vector is chosen well, which is hard to do without knowledge of the form of heteroscedasticity.
- It also has the problem that specification errors other than heteroscedasticity may lead to rejection.
- Note: the null hypothesis of this test may be interpreted as θ = 0 for the variance model V(εₜ²) = h(α + zₜ'θ), where h(·) is an arbitrary function of unknown form. The test is more general than it may appear from the regression that is used.
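The nR² version of the White test can be sketched as follows in Python/NumPy (the notes use Octave; the simulated data and the choice zₜ = (xₜ, xₜ²) are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 1.0]) + np.exp(0.5 * x) * rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e2 = (y - X @ b) ** 2

# Auxiliary regression of squared residuals on a constant, x and x^2
Z = np.column_stack([np.ones(n), x, x**2])
g = np.linalg.solve(Z.T @ Z, Z.T @ e2)
v = e2 - Z @ g
R2 = 1.0 - (v @ v) / ((e2 - e2.mean()) @ (e2 - e2.mean()))
white_stat = n * R2        # asymptotically chi^2(2) under homoscedasticity
```

The statistic is compared to χ²(P) critical values, where P counts the auxiliary regressors other than the constant (two here).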
Plotting the residuals. A very simple method is to simply plot the residuals
(or their squares). Draw pictures here. Like the Goldfeld-Quandt test, this will be more informative if the observations are ordered according to the suspected form of the heteroscedasticity.
7.4.3. Correction. Correcting for heteroscedasticity requires that a parametric form for Σ(θ) be supplied, and that a means for estimating θ consistently be determined. The estimation method will be specific to the form supplied for Σ(θ). We'll consider an example.

Multiplicative heteroscedasticity. Suppose the model is

yₜ = xₜ'β + εₜ
σₜ² = E(εₜ²) = (zₜ'γ)^δ,

but the other classical assumptions hold. In this case

εₜ² = (zₜ'γ)^δ + vₜ,

and vₜ has mean zero. Nonlinear least squares could be used to estimate γ and δ consistently, were εₜ observable. The solution is to substitute the squared OLS residuals ε̂ₜ² in place of εₜ², since β̂ is consistent. Once we have γ̂ and δ̂, we can estimate σₜ² consistently using

σ̂ₜ² = (zₜ'γ̂)^δ̂ →p σₜ².

In the second step, we transform the model by dividing by the standard deviation:

yₜ/σ̂ₜ = (xₜ/σ̂ₜ)'β + εₜ/σ̂ₜ,

or

yₜ* = xₜ*'β + εₜ*.

Asymptotically, this model satisfies the classical assumptions.

- This model is a bit complex in that NLS is required to estimate the model of the variance. A simpler version would be σₜ² = σ² zₜ^δ, where zₜ is a single variable. There are still two parameters to be estimated, and the model of the variance is still nonlinear in the parameters. However, the search method can be used in this case to reduce the estimation problem to repeated applications of OLS.
- First, we define an interval of reasonable values for δ, e.g., δ ∈ [0, 3].
- Partition this interval into M equally spaced values, e.g., {0, .1, .2, ..., 2.9, 3}.
- For each of these values, calculate the variable zₜ^δₘ.
- The regression ε̂ₜ² = σ² zₜ^δₘ + vₜ is, for each δₘ, linear in σ², so it can be estimated by OLS. Save the ESS for each δₘ.
- Choose the pair (σ̂², δ̂ₘ) with minimal ESS as the estimate.
- Next, divide the model by the estimated standard deviations.
- Can refine. Draw picture.
- Works well when the parameter to be searched over is low dimensional, as in this case.
Groupwise heteroscedasticity
A common case is where we have repeated observations on each of a number of economic agents: e.g., 10 years of macroeconomic data on each of a set
of countries or regions, or daily observations of transactions of 200 banks. This
sort of data is a pooled cross-section time-series model. It may be reasonable to presume that the variance is constant over time within the cross-sectional units,
but that it differs across them (e.g., firms or countries of different sizes...). The
model is
y_it = x_it'β + ε_it,  i = 1, 2, ..., G;  t = 1, 2, ..., n,

where i indexes the cross-sectional units (agents) and t indexes the observations for each agent. The variance is assumed constant over time within each unit, but to differ across units:

E(ε_it²) = σᵢ².

The other classical assumptions are presumed to hold; in particular, we assume that E(ε_it ε_js) = 0 unless i = j and t = s. Following the general scheme, estimate each σᵢ² using the natural estimator:

σ̂ᵢ² = (1/n) Σₜ ε̂_it².

Note that we use 1/n here since it's possible that there are more than n regressors, so n − K could be negative; asymptotically the difference is unimportant. With each of these, transform the corresponding block of the model:
y_it/σ̂ᵢ = (x_it/σ̂ᵢ)'β + ε_it/σ̂ᵢ.

[Figure 7.4.1: Residuals of the Nerlove model (vertical axis, roughly −1.5 to 1) against observation number (horizontal axis, 20 to 160, firms ordered by output).]
Do this for each cross-sectional group. This transformed model satisfies the classical assumptions, asymptotically.
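The groupwise FGLS scheme can be sketched as follows in Python/NumPy (the notes use Octave; the number of groups, sample sizes, and simulated variances are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
G, T = 5, 30
sig_g = np.array([3.0, 2.0, 1.5, 1.0, 0.5])   # group standard deviations
groups = np.repeat(np.arange(G), T)
X = np.column_stack([np.ones(G * T), rng.normal(size=G * T)])
beta = np.array([1.0, 2.0])
y = X @ beta + sig_g[groups] * rng.normal(size=G * T)

# Step 1: OLS residuals
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b_ols

# Step 2: per-group variance estimates, sigma_i^2 = (1/T) sum_t e_it^2
s2 = np.array([np.mean(e[groups == g] ** 2) for g in range(G)])

# Step 3: divide each observation by its group standard deviation, re-run OLS
w = 1.0 / np.sqrt(s2[groups])
Xw, yw = X * w[:, None], y * w
b_fgls = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)
```

Dividing each block by its estimated standard deviation is exactly the transformation described in the text, applied group by group.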
7.4.4. Example: the Nerlove model (again!) Let's check the Nerlove data for evidence of heteroscedasticity. In what follows, we're going to use the model with the constant and output coefficient varying across 5 groups, but with the input price coefficients fixed (see Equation 6.8.3 for the rationale behind this). Figure 7.4.1, which is generated by the Octave program GLS/NerloveResiduals.m, plots the residuals. We can see pretty clearly that the error variance is larger for small firms than for larger firms.
Now let's try out some tests to formally check for heteroscedasticity. The Octave program GLS/HetTests.m performs the White and Goldfeld-Quandt tests, using the above model. The results are
                Value    p-value
GQ test        61.903      0.000
White's test   10.886      0.000
All in all, it is very clear that the data are heteroscedastic. That means that OLS estimation is not efficient, and tests of restrictions that ignore heteroscedasticity are not valid. The previous tests (CRTS, HOD1 and the Chow test) were calculated assuming homoscedasticity. The Octave program GLS/NerloveRestrictions-Het.m uses the Wald test to check for CRTS and HOD1, but using a heteroscedastic-consistent covariance estimator.¹ The results are
Testing HOD1
                Value    p-value
Wald test       6.161      0.013

Testing CRTS
                Value    p-value
Wald test      20.169      0.001
¹By the way, notice that GLS/NerloveResiduals.m and GLS/HetTests.m use the restricted LS estimator directly to restrict the fully general model with all coefficients varying to the model with only the constant and the output coefficient varying. But GLS/NerloveRestrictions-Het.m estimates the model by substituting the restrictions into the model. The methods are equivalent, but the second is more convenient and easier to understand.
We see that the previous conclusions are altered: both CRTS and HOD1 are rejected at the 5% level. Maybe the rejection of HOD1 is due to the Wald test's tendency to over-reject?
From the previous plot, it seems that the variance of ε is a decreasing function of output. Suppose that the 5 size groups have different error variances (heteroscedasticity by groups):

Var(εᵢ) = σⱼ²,  where observation i falls in group j, j = 1, 2, ..., 5.

The Octave program estimates the model using GLS (through a transformation of the model so that OLS can be applied). The estimation results are
*********************************************************
OLS estimation results
Observations 145
R-squared 0.958822
Sigma-squared 0.090800

            estimate   st.err.   t-stat.   p-value
constant1     -1.046     1.276    -0.820     0.414
constant2     -1.977     1.364    -1.450     0.149
constant3     -3.616     1.656    -2.184     0.031
constant4     -4.052     1.462    -2.771     0.006
constant5     -5.308     1.586    -3.346     0.001
output1        0.391     0.090     4.363     0.000
output2        0.649     0.090     7.184     0.000
output3        0.897     0.134     6.688     0.000
output4        0.962     0.112     8.612     0.000
output5        1.101     0.090    12.237     0.000
labor          0.007     0.208     0.032     0.975
fuel           0.498     0.081     6.149     0.000
capital       -0.460     0.253    -1.818     0.071
*********************************************************
*********************************************************
OLS estimation results (transformed model)
Observations 145
R-squared 0.987429
Sigma-squared 1.092393

            estimate   st.err.   t-stat.   p-value
constant1     -1.580     0.917    -1.723     0.087
constant2     -2.497     0.988    -2.528     0.013
constant3     -4.108     1.327    -3.097     0.002
constant4     -4.494     1.180    -3.808     0.000
constant5     -5.765     1.274    -4.525     0.000
output1        0.392     0.090     4.346     0.000
output2        0.648     0.094     6.917     0.000
output3        0.892     0.138     6.474     0.000
output4        0.951     0.109     8.755     0.000
output5        1.093     0.086    12.684     0.000
labor          0.103     0.141     0.733     0.465
fuel           0.492     0.044    11.294     0.000
capital       -0.366     0.165    -2.217     0.028
*********************************************************
Testing HOD1
                Value    p-value
Wald test       9.312      0.002
The first panel of output are the OLS estimation results, which are used to consistently estimate the error variances by group. The second panel of results are the GLS estimation results. Some comments:

- The R² measures are not comparable: the dependent variables are not the same. The measure for the GLS results uses the transformed dependent variable.
- Note that the previously noted pattern in the output coefficients persists. The coefficient on capital is now negative and significant at about the 3% level. That seems to indicate some kind of problem with the model or the data, or economic theory.
- Note that HOD1 is now rejected. Problem of Wald test over-rejecting? Specification error in model?
7.5. Autocorrelation
Autocorrelation, which is the serial correlation of the error term, is a problem that is usually associated with time series data, but also can affect cross-sectional data. For example, a shock to oil prices will simultaneously affect
all countries, so one could expect contemporaneous correlation of macroeconomic variables across countries.
7.5.1. Causes. Autocorrelation is the existence of correlation across the error term:

E(εₜεₛ) ≠ 0,  t ≠ s.
Why might this occur? Plausible explanations include
(1) Lags in adjustment to shocks. In a model such as

yₜ = xₜ'β + εₜ,

one could interpret xₜ'β as the equilibrium value. Suppose xₜ is constant over a number of observations. One can interpret εₜ as a shock that moves the system away from equilibrium. If the time needed to return to equilibrium is long with respect to the observation frequency, one could expect εₜ₊₁ to be positive, conditional on εₜ positive, which induces a correlation.
(2) Unobserved factors that are correlated over time. The error term is
often assumed to correspond to unobservable factors. If these factors
are correlated, there will be autocorrelation.
(3) Misspecification of the model. Suppose that the DGP is

yₜ = β₀ + β₁xₜ + β₂xₜ² + εₜ,

but we estimate

yₜ = β₀ + β₁xₜ + νₜ.

The effect of the omitted term is pushed into νₜ, which will be autocorrelated if xₜ is autocorrelated.
7.5.2. Effects on the OLS estimator. The variance of the OLS estimator is the same as in the case of heteroscedasticity: the standard formula does not apply. The correct formula is given in equation 7.1.1. Next we discuss two GLS corrections for OLS. These will potentially induce inconsistency when the regressors are stochastic (see Chapter 8) and should either not be used in that case (which is usually the relevant case) or used with caution. The more recommended procedure is discussed in section 7.5.5.
7.5.3. AR(1). There are many types of autocorrelation. We'll consider two examples. The first is the most commonly encountered case: autoregressive order 1 (AR(1)) errors. The model is

yₜ = xₜ'β + εₜ
εₜ = ρεₜ₋₁ + uₜ
uₜ ~ iid(0, σᵤ²)
E(εₜuₛ) = 0, t < s.

We assume that the model is stationary, so that |ρ| < 1. By repeated substitution,

εₜ = uₜ + ρuₜ₋₁ + ρ²uₜ₋₂ + ···;

in the limit the lagged ε drops out, since ρᵐ → 0 as m → ∞, so we obtain

εₜ = Σ_{m=0}^∞ ρᵐ uₜ₋ₘ.

With this, the variance of εₜ is found as

E(εₜ²) = σᵤ² Σ_{m=0}^∞ ρ²ᵐ = σᵤ²/(1 − ρ²).

If we had directly assumed that εₜ were covariance stationary, we could obtain this using

V(εₜ) = ρ²E(εₜ₋₁²) + 2ρE(εₜ₋₁uₜ) + E(uₜ²) = ρ²V(εₜ) + σᵤ²,

which gives the same result. The variance is the 0th-order autocovariance, and it does not depend on t. Likewise, the first-order autocovariance is

Cov(εₜ, εₜ₋₁) = E[(ρεₜ₋₁ + uₜ)εₜ₋₁] = ρV(εₜ) = ρσᵤ²/(1 − ρ²).

Using the same method, the s-order autocovariance is

Cov(εₜ, εₜ₋ₛ) = ρˢσᵤ²/(1 − ρ²):

the autocovariances do not depend on t, so {εₜ} is covariance stationary. The correlation of two random variables x and y is defined as

corr(x, y) = cov(x, y)/[se(x) se(y)],

but in this case, the two standard errors are the same, so the s-order autocorrelation is simply ρˢ. All this means that the overall matrix Σ has the form

            [ 1        ρ      ρ²    ···  ρⁿ⁻¹ ]
    σᵤ²     [ ρ        1      ρ          ⋮    ]
Σ = ———— ·  [ ⋮               ⋱               ]
   1 − ρ²   [ ρⁿ⁻¹     ···               1    ]

(the first factor is the variance, the second is the correlation matrix), so we have homoscedasticity, but elements off the main diagonal differ from zero.
Σ depends on only two parameters, ρ and σᵤ². If we can estimate these consistently, we can apply FGLS. It turns out that it's easy to estimate these consistently. The steps are

(1) Estimate the model yₜ = xₜ'β + εₜ by OLS.
(2) Take the residuals and estimate the regression

ε̂ₜ = ρε̂ₜ₋₁ + uₜ*.

Since ε̂ₜ →p εₜ, this regression is asymptotically equivalent to the regression εₜ = ρεₜ₋₁ + uₜ, which satisfies the classical assumptions, so ρ̂ obtained by OLS here is consistent. Also, since uₜ* →p uₜ, the estimator

σ̂ᵤ² = (1/n) Σₜ₌₂ⁿ (ûₜ*)² →p σᵤ².

(3) With the consistent estimators σ̂ᵤ² and ρ̂, form Σ̂ = Σ(σ̂ᵤ², ρ̂) using the previous structure of Σ, and estimate by FGLS:

β̂_FGLS = (X'Σ̂⁻¹X)⁻¹X'Σ̂⁻¹y.

Actually, one can omit the factor σ̂ᵤ²/(1 − ρ²), since it cancels out in the formula.
One can iterate the process, by taking the first FGLS estimator of β, re-estimating ρ and σᵤ², etc. If one iterates to convergence it's equivalent to MLE (supposing normal errors).

An asymptotically equivalent approach is to simply estimate the transformed model

yₜ − ρ̂yₜ₋₁ = (xₜ − ρ̂xₜ₋₁)'β + uₜ*

using n − 1 observations (since y₀ and x₀ aren't available). This is the method of Cochrane and Orcutt. Dropping the first observation is asymptotically irrelevant, but it can be very important in small samples. One can recuperate the first observation by putting

y₁* = y₁ √(1 − ρ̂²)
x₁* = x₁ √(1 − ρ̂²).

This somewhat odd-looking result is related to the Cholesky factorization of Σ⁻¹. See Davidson and MacKinnon, pg. 348-49 for more discussion. Note that the variance of y₁* is σᵤ², asymptotically, so we see that the transformed model will be homoscedastic (and nonautocorrelated, since the u's are uncorrelated with the y's, in different time periods).
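The Cochrane-Orcutt steps can be sketched as follows in Python/NumPy (the notes use Octave; the simulated AR(1) data and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 300
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta = np.array([1.0, 2.0])
rho_true = 0.7
u = rng.normal(size=n)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = rho_true * eps[t - 1] + u[t]   # AR(1) errors
y = X @ beta + eps

# Step 1: OLS and residuals
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b

# Step 2: estimate rho from the residual regression e_t = rho e_{t-1} + u_t
rho = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])

# Step 3: quasi-difference and re-estimate, dropping the first observation
ys = y[1:] - rho * y[:-1]
Xs = X[1:] - rho * X[:-1]
b_co = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
```

One could iterate steps 2–3 to convergence, which under normality is equivalent to MLE, as the text notes.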
7.5.4. MA(1). The linear regression model with moving average order 1
errors is
yₜ = xₜ'β + εₜ
εₜ = uₜ + φuₜ₋₁
uₜ ~ iid(0, σᵤ²)
E(εₜuₛ) = 0, t < s.

In this case,

V(εₜ) = γ₀ = E[(uₜ + φuₜ₋₁)²] = σᵤ² + φ²σᵤ² = σᵤ²(1 + φ²).

Similarly,

γ₁ = E(εₜεₜ₋₁) = E[(uₜ + φuₜ₋₁)(uₜ₋₁ + φuₜ₋₂)] = φσᵤ²,

and

γₛ = 0, s > 1,

so in this case

         [ 1+φ²   φ      0     ···    0    ]
         [ φ      1+φ²   φ            ⋮    ]
Σ = σᵤ² ·[ 0      φ      ⋱             0   ]
         [ ⋮             ⋱             φ   ]
         [ 0      ···    0      φ    1+φ²  ].

Note that the first-order autocorrelation is

ρ₁ = φσᵤ² / [σᵤ²(1 + φ²)] = γ₁/γ₀ = φ/(1 + φ²).

This achieves a maximum at φ = 1 and a minimum at φ = −1, and the maximal and minimal autocorrelations are 1/2 and −1/2. Therefore, series that are more strongly autocorrelated can't be MA(1) processes.
Again the covariance matrix has a simple structure that depends on only two parameters. The problem in this case is that one can't estimate φ using OLS on

ε̂ₜ = uₜ + φuₜ₋₁,

because the uₜ are unobservable and can't be consistently estimated. However, we can use the fact that

σ_ε² = E(εₜ²) = σᵤ²(1 + φ²),

and estimate it using the typical estimator:

σ̂_ε² = (1/n) Σₜ ε̂ₜ².

By the Slutsky theorem, we can interpret this as defining an (unidentified) estimator of both σᵤ² and φ, e.g., use this as

σ̂ᵤ²(1 + φ̂²) = (1/n) Σₜ ε̂ₜ².

However, this isn't sufficient to define consistent estimators of the parameters, since it's unidentified: two unknowns, one equation. To solve this problem, estimate the covariance of εₜ and εₜ₋₁ using

Ĉov(εₜ, εₜ₋₁) = φ̂σ̂ᵤ² = (1/n) Σₜ₌₂ⁿ ε̂ₜ ε̂ₜ₋₁,

which is consistent, again by the consistency of the OLS residuals.
Now solve these two equations to obtain identified (and therefore consistent) estimators of both φ and σᵤ². Define the consistent estimator

Σ̂ = Σ(φ̂, σ̂ᵤ²)

following the form we've seen above, and transform the model using the Cholesky decomposition. The transformed model satisfies the classical assumptions asymptotically.
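The two moment equations can be solved in closed form: ρ₁ = γ₁/γ₀ = φ/(1 + φ²) gives a quadratic in φ, whose invertible root is φ = (1 − √(1 − 4ρ₁²))/(2ρ₁). A sketch in Python/NumPy (the notes use Octave; the simulated MA(1) series and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 5000
phi_true, su2_true = 0.5, 1.0
u = rng.normal(scale=np.sqrt(su2_true), size=n + 1)
eps = u[1:] + phi_true * u[:-1]           # MA(1) series, standing in for residuals

g0 = np.mean(eps**2)                       # estimates su2*(1 + phi^2)
g1 = np.mean(eps[1:] * eps[:-1])           # estimates phi*su2

# Solve the two moment equations, taking the invertible root |phi| < 1
rho1 = g1 / g0                             # = phi/(1 + phi^2)
phi_hat = (1.0 - np.sqrt(1.0 - 4.0 * rho1**2)) / (2.0 * rho1)
su2_hat = g1 / phi_hat
```

The other root of the quadratic, with the plus sign, corresponds to the noninvertible representation and is discarded.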
7.5.5. Asymptotically valid inferences with autocorrelation of unknown form. When the form of autocorrelation is unknown, one may decide to use the OLS estimator, without correction. We've seen that this estimator has the limiting distribution

√n(β̂ − β₀) →d N(0, Q_X⁻¹ Ω Q_X⁻¹),

where, as before,

Ω = lim E[(1/n) X'εε'X].

We need a consistent estimate of Ω. Define vₜ = xₜεₜ (recall that xₜ is defined as a K × 1 vector). Note that

X'ε = Σₜ₌₁ⁿ xₜεₜ = Σₜ₌₁ⁿ vₜ,

so that

Ω = lim (1/n) E[(Σₜ vₜ)(Σₜ vₜ')].

Define the v-th autocovariance of vₜ as

Γᵥ = E(vₜ vₜ₋ᵥ'),

and note that E(vₜ vₜ₊ᵥ') = Γᵥ'. We assume that vₜ is covariance stationary, so that Γᵥ does not depend on t. In general, we expect that:

- vₜ will be autocorrelated, since εₜ is potentially autocorrelated: Γᵥ ≠ 0. This autocovariance does not depend on t, due to covariance stationarity.
- vₜ will be contemporaneously correlated (E(vᵢₜvⱼₜ) ≠ 0), since the regressors in xₜ will in general be correlated.
- vₜ will be heteroscedastic (E(vᵢₜ²) = σᵢₜ², which depends upon i), since the regressors will have different variances.

While one could estimate Ω parametrically, we in general have little information upon which to base a parametric specification. Recent research has focused on consistent nonparametric estimators of Ω.
Now define

Ωₙ = (1/n) E[(Σₜ vₜ)(Σₜ vₜ')].

We have (show that the following is true, by expanding the sum and shifting rows to the left)

Ωₙ = Γ₀ + ((n−1)/n)(Γ₁ + Γ₁') + ((n−2)/n)(Γ₂ + Γ₂') + ··· + (1/n)(Γₙ₋₁ + Γₙ₋₁').

The natural, consistent estimator of Γᵥ is

Γ̂ᵥ = (1/n) Σₜ₌ᵥ₊₁ⁿ v̂ₜ v̂ₜ₋ᵥ',

where v̂ₜ = xₜε̂ₜ (note: one could put 1/(n − v) instead of 1/n here). So, a natural, but inconsistent, estimator of Ωₙ would be

Ω̂ₙ = Γ̂₀ + ((n−1)/n)(Γ̂₁ + Γ̂₁') + ··· + (1/n)(Γ̂ₙ₋₁ + Γ̂ₙ₋₁').

This estimator is inconsistent in general, since the number of autocovariances estimated grows with n, so information does not build up. On the other hand, supposing that Γᵥ tends to zero sufficiently rapidly as v tends to ∞, a modified estimator

Ω̂ₙ = Γ̂₀ + Σᵥ₌₁^q(n) (Γ̂ᵥ + Γ̂ᵥ'),

where q(n) → ∞ as n → ∞, will be consistent, provided q(n) grows sufficiently slowly. The term (n−v)/n can be dropped because it tends to one for v ≤ q(n), given that q(n) increases slowly relative to n.

A disadvantage of this estimator is that it may not be positive definite in finite samples. This could cause one to calculate a negative χ² statistic, for example! Newey and West proposed and analyzed the estimator

Ω̂ₙ = Γ̂₀ + Σᵥ₌₁^q(n) [1 − v/(q+1)] (Γ̂ᵥ + Γ̂ᵥ').

This estimator is p.d. by construction. The condition for consistency is that n^(−1/4) q(n) → 0; note that this is a very slow rate of growth for q. This estimator is nonparametric: we've placed no parametric restrictions on the form of Ω.

Finally, since Ωₙ has Ω as its limit, Ω̂ₙ →p Ω. We can now use Ω̂ₙ and Q̂_X = (1/n)X'X to consistently estimate the limiting distribution of the OLS estimator under heteroscedasticity and autocorrelation of unknown form. With this, asymptotically valid tests are constructed in the usual way.
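The Newey-West estimator can be sketched as follows in Python/NumPy (the notes use Octave; the simulated data, seed, and lag choice q = 4 are illustrative assumptions):

```python
import numpy as np

def newey_west(X, e, q):
    """Estimate Omega = lim (1/n) E[X' e e' X] with Bartlett (Newey-West) weights."""
    n = X.shape[0]
    v = X * e[:, None]                  # v_t = x_t * eps_t, stored row by row
    Omega = v.T @ v / n                 # Gamma_0
    for lag in range(1, q + 1):
        w = 1.0 - lag / (q + 1.0)       # weight 1 - v/(q+1)
        G = v[lag:].T @ v[:-lag] / n    # Gamma_lag
        Omega += w * (G + G.T)
    return Omega

rng = np.random.default_rng(10)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])
e = rng.normal(size=n)                  # stand-in for OLS residuals
Om = newey_west(X, e, q=4)

# HAC covariance of sqrt(n)(bhat - b): Q^{-1} Omega Q^{-1}
Q = X.T @ X / n
V = np.linalg.inv(Q) @ Om @ np.linalg.inv(Q)
```

The Bartlett weights make each weighted term a symmetric contribution, which is what guarantees positive semi-definiteness of the result.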
7.5.6. Durbin-Watson test. The Durbin-Watson statistic is

DW = Σₜ₌₂ⁿ (ε̂ₜ − ε̂ₜ₋₁)² / Σₜ₌₁ⁿ ε̂ₜ²
   = Σₜ₌₂ⁿ (ε̂ₜ² − 2ε̂ₜε̂ₜ₋₁ + ε̂ₜ₋₁²) / Σₜ₌₁ⁿ ε̂ₜ².

- The null hypothesis is that the first order autocorrelation of the errors is zero: H₀: ρ₁ = 0. The alternative is H_A: ρ₁ ≠ 0. Note that the alternative is not that the errors are AR(1), since many general patterns of autocorrelation will have the first order autocorrelation different than zero. For this reason the test is useful for detecting autocorrelation in general. For the same reason, one shouldn't just assume that an AR(1) model is appropriate when the DW test rejects the null.
- Under the null, the middle term tends to zero, and the other two tend to one, so DW →p 2.
- Supposing that we had an AR(1) error process with ρ = 1: in this case the middle term tends to −2, so DW →p 0.
- Supposing that we had an AR(1) error process with ρ = −1: in this case the middle term tends to 2, so DW →p 4.
- These are the extremes: DW always lies between 0 and 4.
- The distribution of the test statistic depends on the matrix of regressors, X, so tables can't give exact critical values. They give upper and lower bounds, which correspond to the extremes that are possible. See Figure 7.5.2. There are means of determining exact critical values conditional on X.
- The DW test is based upon the assumption that the matrix X is fixed in repeated samples. This is often unreasonable in the context of economic time series, which is precisely the context where the test would have application. It is possible to relate the DW test to other test statistics which are valid without strict exogeneity.
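The behavior of the statistic under the null and under strong positive autocorrelation can be illustrated with a short Python/NumPy sketch (the notes use Octave; the simulated series and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 500
e_white = rng.normal(size=n)           # no autocorrelation
e_ar = np.zeros(n)                     # strong positive AR(1), rho = 0.9
for t in range(1, n):
    e_ar[t] = 0.9 * e_ar[t - 1] + e_white[t]

def dw(e):
    """Durbin-Watson statistic for a residual series."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

dw_white = dw(e_white)    # expected to be near 2
dw_ar = dw(e_ar)          # expected to be well below 2
```

As the text says, DW ≈ 2(1 − ρ̂₁), which is why values far below 2 signal positive autocorrelation.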
Breusch-Godfrey test
This test uses an auxiliary regression, as does the White test for heteroscedasticity. The regression is
ε̂ₜ = xₜ'γ + ρ₁ε̂ₜ₋₁ + ρ₂ε̂ₜ₋₂ + ··· + ρ_P ε̂ₜ₋P + vₜ,

and the test statistic is the nR² statistic, just as in the White test. There are P restrictions, so the statistic is asymptotically distributed as χ²(P).

- The intuition is that the lagged errors shouldn't contribute to explaining the current error if there is no autocorrelation.
- xₜ is included as a regressor to account for the fact that the ε̂ₜ are not independent even if the εₜ are. This is a technicality that we won't go into here.
- This test is valid even if the regressors are stochastic and contain lagged dependent variables, so it is considerably more useful than the DW test for typical time series data.
- The alternative is not that the model is an AR(P), following the argument above: the alternative is simply that some or all of the first P autocorrelations are different from zero. This is compatible with many specific forms of autocorrelation.
R f
4 RG U
y f 3
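The auxiliary regression is easy to code directly. A Python sketch (my own interface, assuming the residuals `e` come from a first-stage OLS fit):

```python
import numpy as np

def breusch_godfrey(X, e, p):
    """Breusch-Godfrey test: regress the residuals on X and p lags of the
    residuals; n R^2 from this auxiliary regression is asymptotically
    chi^2(p) under the null of no autocorrelation."""
    n = len(e)
    # column j holds e_{t-j-1} for t = p, ..., n-1
    lags = np.column_stack([e[p - j - 1 : n - j - 1] for j in range(p)])
    Z = np.column_stack([X[p:], lags])   # auxiliary-regression design
    y = e[p:]
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    u = y - Z @ b
    r2 = 1.0 - (u @ u) / ((y - y.mean()) @ (y - y.mean()))
    return (n - p) * r2                  # compare to a chi^2(p) critical value
```

With perfectly negatively autocorrelated residuals the auxiliary $R^2$ is one and the statistic hits its maximum, correctly signalling autocorrelation.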
A simple example is the case of a single lag of the dependent variable with AR(1) errors. The model is
$$y_t = x_t'\beta + \gamma y_{t-1} + \varepsilon_t$$
$$\varepsilon_t = \rho\varepsilon_{t-1} + v_t.$$
Now we can write
$$E\left(y_{t-1}\varepsilon_t\right) = E\left[\left(x_{t-1}'\beta + \gamma y_{t-2} + \varepsilon_{t-1}\right)\left(\rho\varepsilon_{t-1} + v_t\right)\right] \ne 0,$$
since one of the terms is $\rho E\left(\varepsilon_{t-1}^2\right)$, which is clearly nonzero. In this case the regressors (which include $y_{t-1}$) are correlated with the error, and therefore
$$\operatorname{plim}\frac{X'\varepsilon}{n} \ne 0.$$
Since
$$\operatorname{plim}\hat\beta = \beta + \operatorname{plim}\left(\frac{X'X}{n}\right)^{-1}\frac{X'\varepsilon}{n},$$
the OLS estimator is inconsistent in this case. One needs to estimate by instrumental variables (IV), which we'll get to later.
7.5.8. Examples.
Nerlove model, yet again. The Nerlove model uses cross-sectional data, so one may not think of performing tests for autocorrelation. However, specification error can induce autocorrelated errors. Consider the simple Nerlove model
$$\ln C = \beta_1 + \beta_2\ln Q + \beta_3\ln P_L + \beta_4\ln P_F + \beta_5\ln P_K + \varepsilon.$$

[FIGURE 7.6.1. Residuals of the simple Nerlove model, with a quadratic fit to the residuals.]

Estimating this model by OLS and calculating the Breusch-Godfrey test statistic, the residual plot is in Figure 7.6.1, and the test results are:
                         Value    p-value
Breusch-Godfrey test    34.930      0.000
Klein's Model I. The Klein consumption equation explains consumption ($C$) as a function of profits ($P$), both current and lagged, as well as the sum of wages in the private sector ($W^p$) and the government sector ($W^g$). Have a look at the README file for this data set. This gives the variable names and other information.

Consider the model
$$C_t = \alpha_0 + \alpha_1 P_t + \alpha_2 P_{t-1} + \alpha_3\left(W^p_t + W^g_t\right) + \varepsilon_t.$$
The Octave program GLS/Klein.m estimates this model by OLS, plots the
residuals, and performs the Breusch-Godfrey test, using 1 lag of the residuals. The estimation and test results are:
*********************************************************
OLS estimation results
Observations 21
R-squared 0.981008
Sigma-squared 1.051732

                  estimate   st.err.   t-stat.   p-value
Constant            16.237     1.303    12.464     0.000
Profits              0.193     0.091     2.115     0.049
Lagged Profits       0.090     0.091     0.992     0.335
Wages                0.796     0.040    19.933     0.000
*********************************************************

[FIGURE 7.6.2. Residuals of the Klein consumption equation.]
                         Value    p-value
Breusch-Godfrey test     1.539      0.215
and the residual plot is in Figure 7.6.2. The test does not reject the null of nonautocorrelated errors, but we should remember that we have only 21 observations, so power is likely to be fairly low. The residual plot leads me to suspect that there may be autocorrelation - there are some significant runs below and above the x-axis. Your opinion may differ.

Since it seems that there may be autocorrelation, let's try an AR(1) correction. The Octave program GLS/KleinAR1.m estimates the Klein consumption equation assuming that the errors follow the AR(1) pattern. The results, with the Breusch-Godfrey test for remaining autocorrelation, are:
*********************************************************
OLS estimation results
Observations 21
R-squared 0.967090
Sigma-squared 0.983171

                  estimate   st.err.   t-stat.   p-value
Constant            16.992     1.492    11.388     0.000
Profits              0.215     0.096     2.232     0.039
Lagged Profits       0.076     0.094     0.806     0.431
Wages                0.774     0.048    16.234     0.000
*********************************************************
                         Value    p-value
Breusch-Godfrey test     2.129      0.345
The test is farther away from the rejection region than before, and the residual plot is a bit more favorable for the hypothesis of nonautocorrelated residuals, IMHO. For this reason, it seems that the AR(1) correction might have improved the estimation.

Nevertheless, there has not been much of an effect on the estimated coefficients nor on their estimated standard errors. This is probably because the estimated AR(1) coefficient is not very large (around 0.2).
Exercises

(1) Comparing the variances of the OLS and GLS estimators, I claimed that the following holds:
$$\operatorname{Var}\left(\hat\beta\right) - \operatorname{Var}\left(\hat\beta_{GLS}\right) = A\Sigma A',$$
where $A = \left(X'X\right)^{-1}X' - \left(X'\Sigma^{-1}X\right)^{-1}X'\Sigma^{-1}$. Verify that this is true.

(2) The limiting distribution of the OLS estimator with heteroscedasticity of unknown form is
$$\sqrt{n}\left(\hat\beta - \beta\right) \overset{d}{\to} N\left(0, Q_X^{-1}\Omega Q_X^{-1}\right),$$
where
$$\Omega = \lim_{n\to\infty} E\left(\frac{X'\varepsilon\varepsilon'X}{n}\right).$$
Explain why
$$\hat\Omega = \frac{1}{n}\sum_{t=1}^{n}x_t x_t'\hat\varepsilon_t^2$$
is a consistent estimator of this matrix as $n \to \infty$.

(3) Consider the Nerlove model
$$\ln C = \beta_1 + \beta_2\ln Q + \beta_3\ln P_L + \beta_4\ln P_F + \beta_5\ln P_K + \varepsilon,$$
and assume that the errors are heteroscedastic.
(a) Calculate the FGLS estimator and interpret the estimation results.
(b) Test the transformed model to check whether it appears to satisfy homoscedasticity.
CHAPTER 8
Stochastic regressors
Up to now we have treated the regressors as fixed, which is clearly unrealistic. Now we will assume they are random. There are several ways to think of the problem. First, if we are interested in an analysis conditional on the explanatory variables, then it is irrelevant if they are stochastic or not, since conditional on the values the regressors take on, they are nonstochastic, which is the case already considered.
In cross-sectional analysis it is usually reasonable to make the analysis conditional on the regressors. In dynamic models, where $y_t$ may depend on its own lagged values, a conditional analysis is not sufficiently general, since we may want to predict into the future many periods out, so we need to consider the behavior of $\hat\beta$ and the test statistics unconditional on $X$.
The model we'll deal with will involve a combination of the following assumptions.

Linearity: the model is a linear function of the parameter vector $\beta_0$:
$$y_t = x_t'\beta_0 + \varepsilon_t,$$
or in matrix form,
$$y = X\beta_0 + \varepsilon,$$
where $y$ is $n \times 1$, $X = \left(x_1\ x_2\ \cdots\ x_n\right)'$, where $x_t$ is $K \times 1$, and $\beta_0$ and $\varepsilon$ are conformable.
Stochastic, linearly independent regressors: $X$ has rank $K$ with probability 1; $X$ is stochastic; and
$$\lim_{n\to\infty}\Pr\left(\frac{1}{n}X'X = Q_X\right) = 1,$$
where $Q_X$ is finite and positive definite.

Central limit theorem:
$$n^{-1/2}X'\varepsilon \overset{d}{\to} N\left(0, Q_X\sigma_0^2\right).$$

Normality (Optional): $\varepsilon$ is normally distributed conditional on $X$: $\varepsilon|X \sim N\left(0, \sigma^2 I_n\right)$.

Strongly exogenous regressors:
(8.0.1) $\qquad E\left(\varepsilon_t|X\right) = 0, \quad \forall t.$

Weakly exogenous regressors:
(8.0.2) $\qquad E\left(\varepsilon_t|x_t\right) = 0, \quad \forall t.$

In both cases, $x_t'\beta$ is the conditional mean of $y_t$ given $x_t$: $E\left(y_t|x_t\right) = x_t'\beta$.
8.1. Case 1

Normality of $\varepsilon$, strongly exogenous regressors. In this case,
$$\hat\beta = \beta_0 + \left(X'X\right)^{-1}X'\varepsilon,$$
so
$$E\left(\hat\beta|X\right) = \beta_0 + \left(X'X\right)^{-1}X'E\left(\varepsilon|X\right) = \beta_0,$$
and since this holds for all $X$, $E(\hat\beta) = \beta_0$, unconditional on $X$. Likewise,
$$\hat\beta|X \sim N\left(\beta_0, \left(X'X\right)^{-1}\sigma_0^2\right).$$
However, conditional on $X$, $\hat\beta$ is normally distributed in small samples, so the usual test statistics have their usual distributions, conditional on $X$. If the density of $X$ is $d\mu(X)$, the marginal density of $\hat\beta$ is obtained by multiplying the conditional density by $d\mu(X)$ and integrating over $X$, which leads to a nonnormal marginal density for $\hat\beta$ in small samples.

Summary: When $X$ is stochastic but strongly exogenous and $\varepsilon$ is normally distributed:
(1) $\hat\beta$ is unbiased
(2) $\hat\beta$ is nonnormally distributed, unconditional on $X$
(3) The usual test statistics have the same distribution as with nonstochastic $X$, conditional on $X$
(4) The Gauss-Markov theorem still holds, since it holds conditionally on $X$, and this is true for all $X$.
8.2. Case 2

$\varepsilon$ nonnormally distributed, strongly exogenous regressors. The unbiasedness of $\hat\beta$ carries through as before. However, the argument regarding test statistics doesn't, since it relied on small-sample normality. Still, we have
$$\hat\beta = \beta_0 + \left(X'X\right)^{-1}X'\varepsilon = \beta_0 + \left(\frac{X'X}{n}\right)^{-1}\frac{X'\varepsilon}{n}.$$
Now
$$\left(\frac{X'X}{n}\right)^{-1} \overset{p}{\to} Q_X^{-1}$$
by assumption, and
$$\frac{X'\varepsilon}{n} = \frac{n^{-1/2}X'\varepsilon}{\sqrt{n}} \overset{p}{\to} 0,$$
since the numerator converges in distribution to a $N\left(0, Q_X\sigma_0^2\right)$ random variable while the denominator goes to infinity. We have unbiasedness and the variance disappearing, so the estimator is consistent:
$$\hat\beta \overset{p}{\to} \beta_0.$$
Considering the asymptotic distribution,
$$\sqrt{n}\left(\hat\beta - \beta_0\right) = \left(\frac{X'X}{n}\right)^{-1}n^{-1/2}X'\varepsilon,$$
so
$$\sqrt{n}\left(\hat\beta - \beta_0\right) \overset{d}{\to} N\left(0, Q_X^{-1}\sigma_0^2\right)$$
directly following the assumptions. Asymptotic normality of the estimator still holds, so the asymptotic results on test statistics remain valid.

Summary: Under strongly exogenous regressors, with $\varepsilon$ normal or nonnormal, $\hat\beta$ has the properties:
(1) Unbiasedness
(2) Consistency
(3) Gauss-Markov theorem holds, since it holds in the previous case and doesn't depend on normality.
(4) Asymptotic normality
(5) Tests are asymptotically valid, but are not valid in small samples.
8.3. Case 3
Weakly exogenous regressors
An important class of models are dynamic models, where lagged dependent
variables have an impact on the current value. A simple version of these models that captures the important points is
$$y_t = z_t'\alpha + \sum_{s=1}^{p}\gamma_s y_{t-s} + \varepsilon_t = x_t'\beta + \varepsilon_t,$$
where now $x_t$ contains lagged dependent variables. Clearly, $X$ and $\varepsilon$ are not uncorrelated, so one can't show unbiasedness. For example,
$$E\left(\varepsilon_{t-1}x_t\right) \ne 0,$$
since $x_t$ contains $y_{t-1}$ (which is a function of $\varepsilon_{t-1}$) as an element.
This fact implies that all of the small sample properties such as unbiasedness, Gauss-Markov theorem, and small sample validity of test
statistics do not hold in this case. Recall Figure 3.7.2. This is a case of
weakly exogenous regressors, and we see that the OLS estimator is
biased in this case.
Nevertheless, under the above assumptions, all asymptotic properties
continue to hold, using the same arguments as before.
The arguments rely on the two key assumptions:
(1) $\lim_{n\to\infty}\Pr\left(\frac{1}{n}X'X = Q_X\right) = 1$, with $Q_X$ finite and positive definite
(2) $n^{-1/2}X'\varepsilon \overset{d}{\to} N\left(0, Q_X\sigma_0^2\right)$
The most complicated case is that of dynamic models, since the other cases can be treated as nested in this case. There exist a number of central limit theorems for dependent processes, many of which are fairly technical. We won't enter into details (see Hamilton, Chapter 7 if you're interested). A main requirement
for use of standard asymptotics for a dependent sequence $\{z_t\}$ is that $z_t$ be stationary, in some sense.

Strong stationarity requires that the joint distribution of an arbitrary collection $\{z_t, z_{t-s_1}, z_{t-s_2}, \ldots\}$ not depend on $t$.

Covariance (weak) stationarity requires that the first and second moments of such a collection not depend on $t$.

An example of a sequence that doesn't satisfy this is an AR(1) process with a unit root,
$$y_t = y_{t-1} + \varepsilon_t,$$
since the variance of $y_t$ depends upon $t$ in this case.

Stationarity prevents the process from trending off to plus or minus infinity, and prevents cyclical behavior which would allow correlations between far removed $z_t$ and $z_s$ to be high.
The conditioning variables must have variances that are finite, and must not be too strongly dependent. The AR(1) model with unit root is an example of a case where the dependence is too strong for standard asymptotics to apply. The econometrics of nonstationary processes has been an active area of research in the last two decades. The standard asymptotics don't apply in this case. This isn't in the scope of this course.
Exercises

(1) Show that for two random variables $A$ and $B$, if $E(A|B) = 0$, then $E\left(A f(B)\right) = 0$ for any function $f(\cdot)$.

(2) Is it possible for an AR(1) model for time series data, e.g., $y_t = \alpha + \rho y_{t-1} + \varepsilon_t$, to satisfy weak exogeneity? Strong exogeneity? Discuss.
CHAPTER 9
Data problems
In this section we'll consider problems associated with the regressor matrix: collinearity, missing observations, and measurement error.
9.1. Collinearity
Collinearity is the existence of linear relationships amongst the regressors.
We can always write
$$\lambda_1 x_1 + \lambda_2 x_2 + \cdots + \lambda_K x_K + v = 0,$$
where $x_i$ is the $i$-th column of the regressor matrix $X$, and $v$ is an $n \times 1$ vector. In the case that there exists collinearity, the variation in $v$ is relatively small, so that there is an approximately exact linear relationship between the regressors.

In the extreme, if there are exact linear relationships (every element of $v$ equal to zero) then $\rho(X) < K$, so $\rho(X'X) < K$, and $X'X$ is not invertible: the OLS estimator is not defined. For example, consider the model
$$y_t = \beta_1 + \beta_2 x_{2t} + \beta_3 x_{3t} + \varepsilon_t$$
$$x_{3t} = \alpha_1 + \alpha_2 x_{2t}.$$
Substituting the second equation into the first, we can write
$$y_t = \beta_1 + \beta_2 x_{2t} + \beta_3\left(\alpha_1 + \alpha_2 x_{2t}\right) + \varepsilon_t = \left(\beta_1 + \alpha_1\beta_3\right) + \left(\beta_2 + \alpha_2\beta_3\right)x_{2t} + \varepsilon_t \equiv \gamma_1 + \gamma_2 x_{2t} + \varepsilon_t.$$
The $\gamma$'s can be consistently estimated, but since they define only two equations in the three unknown $\beta$'s, the $\beta$'s are unidentified in this case.
Perfect collinearity is unusual, except in the case of an error in construction of the regressor matrix, such as including the same regressor
twice.
Another case where perfect collinearity may be encountered is with models with dummy variables. Consider a model with a constant and a complete set of mutually exclusive dummies, e.g.,
$$y_i = \beta_1 + \beta_2 B_i + \beta_3 N_i + \varepsilon_i,$$
where $B_i = 1$ if the $i$-th apartment is in Barcelona, $B_i = 0$ otherwise, and $N_i = 1 - B_i$ indicates apartments elsewhere. Since $B_i + N_i = 1$ for every observation, there is an exact linear relationship between these variables and the column of ones corresponding to the constant. One must either drop the constant, or one of the qualitative variables.
[FIGURE 9.1.1. The OLS objective function with no correlation between the regressors.]
9.1.2. Back to collinearity. The more common case, if one doesn't make mistakes such as these, is the existence of inexact linear relationships, i.e., correlations between the regressors that are less than one in absolute value, but not zero. The basic problem is that when two (or more) variables move together, it is difficult to determine their separate influences. This is reflected in imprecise estimates, i.e., estimates with high variances. With economic data, collinearity is commonly encountered, and is often a severe problem.

When there is collinearity, the minimizing point of the objective function
that defines the OLS estimator (the sum of squared errors) is relatively poorly defined. This is seen in Figures 9.1.1 and 9.1.2.

[FIGURE 9.1.2. The OLS objective function with correlated regressors: a poorly defined minimizing point.]

To see the effect of collinearity on variances, partition the regressor matrix as
$$X = \begin{bmatrix} x & W \end{bmatrix},$$
where $x$ is the first column of $X$ (there is no loss of generality, since the columns can be reordered). The variance of $\hat\beta$, under the classical assumptions, is $\left(X'X\right)^{-1}\sigma^2$, and by the rule for partitioned inversion, the element corresponding to $x$ is
$$\left[\left(X'X\right)^{-1}\right]_{1,1} = \left(x'x - x'W\left(W'W\right)^{-1}W'x\right)^{-1} = \left(ESS_{x|W}\right)^{-1},$$
where $ESS_{x|W}$ is the sum of squared residuals from the artificial regression of $x$ on $W$: $x = W\lambda + v$. Since $R^2 = 1 - ESS/TSS$ for this regression (with $TSS$ in deviations from the mean, assuming $W$ contains a constant), we have $ESS_{x|W} = TSS_x\left(1 - R^2_{x|W}\right)$, so the variance of the coefficient corresponding to $x$ is
$$\operatorname{Var}\left(\hat\beta_x\right) = \frac{\sigma^2}{TSS_x\left(1 - R^2_{x|W}\right)}.$$
We see three factors influence the variance of this coefficient. It will be high if
(1) $\sigma^2$ is large
(2) There is little variation in $x$. Draw a picture here.
(3) There is a strong linear relationship between $x$ and the other regressors, so that $W$ can explain the movement in $x$ well. In this case, $R^2_{x|W}$ will be close to 1. As $R^2_{x|W} \to 1$, $\operatorname{Var}(\hat\beta_x) \to \infty$.
Intuitively, when there are strong linear relations between the regressors, it is difficult to determine the separate influence of the regressors on the dependent variable. This can be seen by comparing the OLS objective function in the case of no correlation between regressors with the objective function with correlation between the regressors. See the figures nocollin.ps (no correlation) and collin.ps (correlation), available on the web site.
9.1.3. Detection of collinearity. The best way is simply to regress each explanatory variable in turn on the remaining regressors. If any of these auxiliary regressions has a high $R^2$, there is a problem of collinearity, and this procedure identifies which variables are involved. Another indication is a significant $F$ statistic for the regression as a whole, while none of the individual coefficients is significantly different from zero (e.g., their separate influences aren't well determined).

In summary, the artificial regressions are the best approach if one wants to be careful.
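The auxiliary-regression check is a few lines of code. A Python sketch (my own interface; it assumes no column of `X` is exactly constant, so center the data or drop the intercept column first):

```python
import numpy as np

def auxiliary_r2(X):
    """R^2 from regressing each column of X on the remaining columns.
    Values near 1 flag that column as part of a (near-)collinear relation."""
    n, K = X.shape
    r2 = np.empty(K)
    for j in range(K):
        xj = X[:, j]
        W = np.delete(X, j, axis=1)
        b, *_ = np.linalg.lstsq(W, xj, rcond=None)
        u = xj - W @ b
        r2[j] = 1.0 - (u @ u) / ((xj - xj.mean()) @ (xj - xj.mean()))
    return r2
```

With an exact linear relation among the columns, every involved column's auxiliary $R^2$ equals one.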
9.1.4. Dealing with collinearity. More information

Collinearity is a problem of an uninformative sample. The first question is: is all the available information being used? Is more data available? Are there coefficient restrictions that have been neglected? Picture illustrating how a restriction can solve the problem of perfect collinearity.
Stochastic restrictions and ridge regression

Supposing that there is no more data or neglected restrictions, one possibility is to change perspectives, to Bayesian econometrics. One can express prior beliefs regarding the coefficients using stochastic restrictions. A stochastic linear restriction would be something of the form
$$R\beta = r + v,$$
where $R$ and $r$ are as in the case of exact linear restrictions, but $v$ is a random vector. Suppose that the model is
$$y = X\beta + \varepsilon$$
$$R\beta = r + v,$$
with $\varepsilon \sim N\left(0, \sigma_\varepsilon^2 I_n\right)$, $v \sim N\left(0, \sigma_v^2 I_q\right)$, and $\varepsilon$ and $v$ independent. This sort of model isn't in line with the classical interpretation of parameters as constants: the left hand side of $R\beta = r + v$ is constant but the right is random. This model does fit the Bayesian perspective: we combine information coming from the model and the data, summarized in
$$y = X\beta + \varepsilon, \qquad \varepsilon \sim N\left(0, \sigma_\varepsilon^2 I_n\right),$$
with prior beliefs regarding the distribution of the parameter, summarized in
$$R\beta \sim N\left(r, \sigma_v^2 I_q\right),$$
which is the last piece of information in the specification. How can you estimate using this model? The solution is to treat the restrictions as artificial data. Write
$$\begin{bmatrix} y \\ r \end{bmatrix} = \begin{bmatrix} X \\ R \end{bmatrix}\beta + \begin{bmatrix} \varepsilon \\ -v \end{bmatrix}.$$
This model is heteroscedastic, since $\sigma_\varepsilon^2 \ne \sigma_v^2$ in general. Define the prior precision $k = \sigma_\varepsilon/\sigma_v$. This expresses the degree of belief in the restriction relative to the variability of the data. Supposing that we specify $k$, then the model
$$\begin{bmatrix} y \\ kr \end{bmatrix} = \begin{bmatrix} X \\ kR \end{bmatrix}\beta + \begin{bmatrix} \varepsilon \\ -kv \end{bmatrix}$$
is homoscedastic and can be estimated by OLS. Note that this estimator is biased. It is consistent, however: as $n \to \infty$, the $q$ artificial observations receive vanishing weight in the objective function, so the estimator has the same limiting objective function as the OLS estimator, and is therefore consistent.
To motivate the use of stochastic restrictions, consider the expectation of the squared length of $\hat\beta$ (under the classical assumptions):
$$E\left(\hat\beta'\hat\beta\right) = E\left[\left(\beta + \left(X'X\right)^{-1}X'\varepsilon\right)'\left(\beta + \left(X'X\right)^{-1}X'\varepsilon\right)\right] = \beta'\beta + \operatorname{Tr}\left[\left(X'X\right)^{-1}\right]\sigma^2,$$
so
$$E\left(\hat\beta'\hat\beta\right) = \beta'\beta + \sigma^2\sum_{i=1}^{K}\lambda_i > \beta'\beta + \sigma^2\lambda_{\max},$$
where the $\lambda_i$ are the eigenvalues of $\left(X'X\right)^{-1}$ (the trace is the sum of the eigenvalues) and $\lambda_{\max}$ is the maximum of these, which is the inverse of the minimum eigenvalue of $X'X$. As collinearity becomes worse and worse, $X'X$ approaches singularity, its minimum eigenvalue tends to zero, and $\lambda_{\max}$ tends to infinity, so $E\left(\hat\beta'\hat\beta\right)$ becomes very large, while $\beta'\beta$ is finite: on average the OLS estimator is too long when collinearity is severe.

This motivates the ridge regression estimator,
$$\hat\beta_{ridge} = \left(X'X + kI_K\right)^{-1}X'y.$$
This is the ordinary ridge regression estimator. It can be seen to add $kI_K$, which is nonsingular, to $X'X$, which is more and more nearly singular as collinearity becomes worse. It is the OLS estimator of the model augmented by the artificial data corresponding to the stochastic restriction $0 = \sqrt{k}\beta + v$, so as $k \to \infty$ the restrictions, and hence the estimator, tend to
$$\lim_{k\to\infty}\hat\beta_{ridge} = 0;$$
that is, the restrictions pull the estimator toward zero, so $\hat\beta_{ridge}'\hat\beta_{ridge}$ is shorter than $\hat\beta'\hat\beta$. Whether this shrinkage is desirable depends on whether the implicit prior model is at all sensible.

Interest in ridge regression centers on the fact that it can be shown that there exists a $k$ such that $MSE\left(\hat\beta_{ridge}\right) < MSE\left(\hat\beta_{OLS}\right)$. The problem is that this $k$ depends on $\beta$ and $\sigma^2$, which are unknown. The ridge trace method plots $\hat\beta_{ridge}'\hat\beta_{ridge}$ as a function of $k$, and chooses the value of $k$ where the effect of increasing $k$ further dies off. This choice is obviously subjective. This is not a problem from the Bayesian perspective: the choice of $k$ reflects prior beliefs about the length of $\beta$.
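The estimator itself is a one-liner. A Python sketch (interface mine, not the text's code):

```python
import numpy as np

def ridge(X, y, k):
    """Ordinary ridge regression estimator: (X'X + k I)^{-1} X'y.
    k = 0 gives OLS; as k grows, the coefficients shrink toward zero."""
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(K), X.T @ y)
```

Plotting `ridge(X, y, k)` over a grid of `k` values produces the ridge trace described above.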
9.2. Measurement error

9.2.1. Error of measurement of the dependent variable. Measurement errors in the dependent variable and the regressors have important differences. First consider error in measurement of the dependent variable. The data generating process is presumed to be
$$y_t^* = x_t'\beta + \varepsilon_t$$
$$y_t = y_t^* + v_t$$
$$v_t \sim \text{iid}\left(0, \sigma_v^2\right),$$
where $y_t^*$ is the unobservable true dependent variable, and $y_t$ is what is observed. We assume that $\varepsilon$ and $v$ are independent and that $y_t^* = x_t'\beta + \varepsilon_t$ satisfies the classical assumptions. Given this, we have
$$y_t = x_t'\beta + \varepsilon_t + v_t,$$
so the composite error $\omega_t = \varepsilon_t + v_t$ is iid with mean zero and variance $\sigma_\varepsilon^2 + \sigma_v^2$. As long as $v_t$ is uncorrelated with $x_t$, this model satisfies the classical assumptions and can be estimated by OLS. The only effect of this kind of measurement error is to inflate the error variance, and thus the variance of the estimator.
9.2.2. Error of measurement of the regressors. The situation isn't so good in this case. The DGP is
$$y_t = \tilde{x}_t'\beta + \varepsilon_t$$
$$x_t = \tilde{x}_t + v_t$$
$$v_t \sim \text{iid}\left(0, \Sigma_v\right),$$
where $\Sigma_v$ is a $K \times K$ matrix. Now $\tilde{x}_t$ contains the true, unobserved regressors, and $x_t$ is what is observed. Again assume that $v$ is independent of $\varepsilon$, and that the model $y = \tilde{X}\beta + \varepsilon$ satisfies the classical assumptions. Now we have
$$y_t = \left(x_t - v_t\right)'\beta + \varepsilon_t = x_t'\beta - v_t'\beta + \varepsilon_t = x_t'\beta + \omega_t,$$
where $\omega_t = \varepsilon_t - v_t'\beta$. The problem is that now there is a correlation between $x_t$ and $\omega_t$, since
$$E\left(x_t\omega_t\right) = E\left[\left(\tilde{x}_t + v_t\right)\left(\varepsilon_t - v_t'\beta\right)\right] = -\Sigma_v\beta,$$
which is not zero in general. Because of this correlation, the OLS estimator is biased and inconsistent, just as in the case of autocorrelated errors with lagged dependent variables. In matrix notation, write the estimated model as
$$y = X\beta + \omega.$$
We have that
$$\hat\beta = \left(\frac{X'X}{n}\right)^{-1}\frac{X'y}{n},$$
and
$$\operatorname{plim}\left(\frac{X'X}{n}\right)^{-1} = \operatorname{plim}\left(\frac{\left(\tilde{X}' + V'\right)\left(\tilde{X} + V\right)}{n}\right)^{-1} = \left(Q_{\tilde{X}} + \Sigma_v\right)^{-1},$$
since $\tilde{X}$ and $V$ are independent, and
$$\operatorname{plim}\frac{V'V}{n} = \lim E\left(\frac{1}{n}\sum_{t=1}^{n}v_t v_t'\right) = \Sigma_v.$$
Likewise,
$$\operatorname{plim}\frac{X'y}{n} = \operatorname{plim}\frac{\left(\tilde{X}' + V'\right)\left(\tilde{X}\beta + \varepsilon\right)}{n} = Q_{\tilde{X}}\beta,$$
so
$$\operatorname{plim}\hat\beta = \left(Q_{\tilde{X}} + \Sigma_v\right)^{-1}Q_{\tilde{X}}\beta.$$
So we see that the least squares estimator is inconsistent when the regressors are measured with error.
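The inconsistency (attenuation toward zero in the scalar case) is easy to see by simulation. A Python sketch under illustrative parameter values of my choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta = 2.0
x_true = rng.normal(0.0, 1.0, n)       # true regressor, Q = Var(x*) = 1
v = rng.normal(0.0, 1.0, n)            # measurement error, Sigma_v = 1
x_obs = x_true + v                     # what is actually observed
y = beta * x_true + rng.normal(0.0, 1.0, n)
b_ols = (x_obs @ y) / (x_obs @ x_obs)  # OLS slope on the mismeasured regressor
# plim b_ols = Q / (Q + Sigma_v) * beta = 0.5 * 2 = 1, not beta = 2
```

Halving of the slope here matches the formula $\operatorname{plim}\hat\beta = \left(Q_{\tilde X} + \Sigma_v\right)^{-1}Q_{\tilde X}\beta$ with $Q_{\tilde X} = \Sigma_v = 1$.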
9.3. Missing observations

9.3.1. Missing observations on the dependent variable. In this case, we have
$$y = X\beta + \varepsilon,$$
or
$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}\beta + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix},$$
where $y_2$ is not observed. Otherwise, we assume the classical assumptions hold. A clear alternative is to simply estimate using the complete observations
$$y_1 = X_1\beta + \varepsilon_1.$$
Since these observations satisfy the classical assumptions, one could estimate by OLS.

The question remains whether or not one could somehow replace the unobserved $y_2$ by a predictor, and improve over OLS in some sense.
Now
$$\hat\beta = \left[\begin{pmatrix} X_1 \\ X_2 \end{pmatrix}'\begin{pmatrix} X_1 \\ X_2 \end{pmatrix}\right]^{-1}\begin{pmatrix} X_1 \\ X_2 \end{pmatrix}'\begin{pmatrix} y_1 \\ \hat{y}_2 \end{pmatrix} = \left[X_1'X_1 + X_2'X_2\right]^{-1}\left[X_1'y_1 + X_2'\hat{y}_2\right],$$
where $\hat{y}_2$ is the predictor of the unobserved $y_2$. An OLS regression using only the first (complete) observations would give $\hat\beta_1 = \left(X_1'X_1\right)^{-1}X_1'y_1$, so that $X_1'y_1 = X_1'X_1\hat\beta_1$. Likewise, an OLS regression using only the second (filled in) observations would give $\hat\beta_2 = \left(X_2'X_2\right)^{-1}X_2'\hat{y}_2$, so that $X_2'\hat{y}_2 = X_2'X_2\hat\beta_2$. Substituting these into the equation for the overall combined estimator gives
$$\hat\beta = \left[X_1'X_1 + X_2'X_2\right]^{-1}X_1'X_1\hat\beta_1 + \left[X_1'X_1 + X_2'X_2\right]^{-1}X_2'X_2\hat\beta_2 \equiv A\hat\beta_1 + \left(I_K - A\right)\hat\beta_2,$$
where $A \equiv \left[X_1'X_1 + X_2'X_2\right]^{-1}X_1'X_1$. The combined estimator is unbiased only if the predictor satisfies $\hat{y}_2 = X_2\beta + \nu$, where $\nu$ has mean zero. Clearly, it is difficult to satisfy this condition without knowledge of $\beta$.

One possibility that has been suggested (see Greene, page 275) is to estimate $\beta$ using a first round estimation with only the complete observations, obtaining $\hat\beta_1$, and then to use the predictor
$$\hat{y}_2 = X_2\hat\beta_1$$
for the missing observations. But then
$$\hat\beta_2 = \left(X_2'X_2\right)^{-1}X_2'X_2\hat\beta_1 = \hat\beta_1,$$
so the combined estimator is
$$\hat\beta = A\hat\beta_1 + \left(I_K - A\right)\hat\beta_1 = \hat\beta_1.$$
This shows that this suggestion is completely empty of content: the final estimator is the same as the OLS estimator using only the complete observations.
9.3.2. The sample selection problem. In the above discussion we assumed that the missing observations are random. The sample selection problem is a case where the missing observations are not random. Consider the model
$$y_t^* = x_t'\beta + \varepsilon_t,$$
which is assumed to satisfy the classical assumptions. However, $y_t^*$ is not always observed. What is observed is $y_t$, defined as
$$y_t = y_t^* \quad \text{if } y_t^* \ge 0; \quad y_t \text{ not observed otherwise}.$$
The difference in this case is that the missing values are not random: they occur precisely when $y_t^*$, and therefore $\varepsilon_t$, is small. Consider the case
$$y^* = x + \varepsilon,$$
but using only the observations for which $y^* > 0$ to estimate.
[FIGURE 9.3.1. Sample selection bias: the data, the true line, and the fitted line.]

Figure 9.3.1 illustrates the resulting bias: the fitted line has a flatter slope and a higher intercept than the true line, since the low draws of $\varepsilon$ are systematically dropped at small values of $x$.

9.3.3. Missing observations on the regressors. Again, one could just estimate using the complete observations, but it may seem frustrating to drop observations because of a single missing regressor. The question is whether the missing values can be filled in without losing consistency, and this must be verified; it does not hold automatically as the number of filled-in observations increases with $n$. In the case that there is only one regressor other than the constant, substitution of $\bar{x}$ for the missing $x_t$ biases the slope estimate toward zero.
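The sample selection bias can be reproduced by simulation. A Python sketch (the design and variance are illustrative choices of mine, not the text's exact figure):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.uniform(0.0, 10.0, n)
eps = rng.normal(0.0, 5.0, n)
ystar = x + eps                      # true line: intercept 0, slope 1
keep = ystar > 0                     # the selection rule drops low draws
x1, y1 = x[keep], ystar[keep]
X = np.column_stack([np.ones(x1.size), x1])
b = np.linalg.lstsq(X, y1, rcond=None)[0]
# selection truncates negative eps mostly where x is small, so the
# fitted intercept is pushed up and the slope is pushed below 1
```

The fitted slope is well below the true value of one, matching the flattened fitted line in Figure 9.3.1.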
Exercises

(1) Consider the Nerlove model
$$\ln C = \beta_1 + \beta_2\ln Q + \beta_3\ln P_L + \beta_4\ln P_F + \beta_5\ln P_K + \varepsilon.$$
When this model is estimated by OLS, some coefficients are not significant. This may be due to collinearity.
(a) Calculate the correlation matrix of the regressors.
(b) Perform artificial regressions to see if collinearity is a problem.
(c) Apply the ridge regression estimator.
    (i) Plot the ridge trace diagram to see what happens as $k$ becomes very large.
CHAPTER 10

Functional form and nonnested tests

Though theory often suggests which variables should enter an econometric relationship, and sometimes the signs of certain derivatives, it is usually silent regarding the functional form of the relationship between the dependent variable and the regressors. Many different functional forms may be consistent with the same theory, while only some of them are linear in the parameters.
The basic point is that many functional forms are compatible with the linear-in-parameters model, since this model can incorporate a wide variety of nonlinear transformations of the dependent variable and the regressors. For example, suppose that $\varphi(\cdot)$ is a $K$-vector-valued function of the conditioning variables. The following model is linear in the parameters but nonlinear in the variables:
$$x_t = \varphi(z_t)$$
$$y_t = x_t'\beta + \varepsilon_t.$$
There may be $P$ fundamental conditioning variables in $z_t$, but $K$ regressors, where $K$ may be smaller than, equal to, or larger than $P$; for example, $x_t$ could include squares and cross products of the elements of $z_t$.

10.1. Flexible functional forms

Given that the functional form of the relationship is in general unknown, one might wonder if there exist parametric models that can closely approximate a wide variety of functional relationships. A Diewert-flexible functional form is one such that the function, the vector of first derivatives, and the matrix of second derivatives can take on arbitrary values at a single data point. Flexibility in this sense clearly requires that there be at least
$$K = 1 + P + \frac{P^2 - P}{2} + P$$
free parameters: one for each independent effect that we wish to model. Suppose that the model is
$$y = f(x) + \varepsilon.$$
A second-order Taylor's series expansion (with remainder term) of the function $f(x)$ about the point $x = 0$ is
$$f(x) = f(0) + x'D_x f(0) + \frac{x'D_x^2 f(0)\,x}{2} + R.$$
Use the approximation, which simply drops the remainder term, as an approximation to $f(x)$:
$$f(x) \simeq g_K(x) = f(0) + x'D_x f(0) + \frac{x'D_x^2 f(0)\,x}{2}.$$
As $x \to 0$, the approximation becomes more and more exact, in the sense that $g_K(x) \to f(x)$, $D_x g_K(x) \to D_x f(x)$ and $D_x^2 g_K(x) \to D_x^2 f(x)$. For $x = 0$, the approximation is exact, up to the second order. The idea behind many flexible functional forms is to note that $f(0)$, $D_x f(0)$ and $D_x^2 f(0)$ are all constants. If we treat them as parameters, the approximation will have exactly enough free parameters to approximate the function $f(x)$, which is of unknown form, exactly, up to second order, at the point $x = 0$. The model is
$$g_K(x) = \alpha + x'\beta + \frac{x'\Gamma x}{2},$$
so the regression model to fit is
$$y = \alpha + x'\beta + \frac{x'\Gamma x}{2} + \varepsilon.$$
While the model has enough free parameters to be Diewert-flexible, the question remains: is $\operatorname{plim}\hat\alpha = f(0)$? Is $\operatorname{plim}\hat\beta = D_x f(0)$? Is $\operatorname{plim}\hat\Gamma = D_x^2 f(0)$?

The answer is no, in general. The reason is that if we treat the true values of the parameters as these derivatives, then $\varepsilon$ is forced to play the part of the remainder term, which is a function of $x$, so that $x$ and $\varepsilon$ are correlated in this case. As before, the estimator is biased in this case.

A simpler example would be to consider a first-order Taylor series approximation to a quadratic function: the approximation error is itself a function of $x$. Draw a picture.
10.1.1. The translog form. In spite of the fact that FFF's aren't really as flexible as they were originally claimed to be, they are useful, and they are certainly subject to less bias due to misspecification of the functional form than are many popular forms, such as the Cobb-Douglas or the simple linear in the variables model. The translog model is probably the most widely used FFF. This model is as above, except that the variables are subjected to a logarithmic transformation. Also, the expansion point is usually taken to be the sample mean of the data, after the logarithmic transformation. The model is defined
by
$$y = \ln c$$
$$x = \ln\left(\frac{w}{\bar{w}}\right),$$
where $c$ is cost of production, $w$ is a vector of input prices, and $\bar{w}$ is the sample mean of the input prices, so that the expansion point is the sample mean of the logged data. The model could be generalized to include other variables by extending $x$ in the obvious manner, but this is suppressed for simplicity.
By Shephard's lemma, the conditional factor demands are $x_i = \partial c/\partial w_i$, so the cost shares of the factors are
$$s_i = \frac{w_i x_i}{c} = \frac{\partial \ln c}{\partial \ln w_i},$$
which is simply the vector of elasticities of cost with respect to input prices. If the cost function is modeled using a translog function, we have
$$\frac{\partial \ln c}{\partial x} = \beta + \Gamma x,$$
so the vector of cost shares is linear in the normalized log prices $x$, with coefficients given by the first- and second-order parameters of the translog form.
Therefore, the share equations and the cost equation have parameters in common. By pooling the equations together and imposing the (true) restriction that the parameters of the equations be the same, we can gain efficiency.
To illustrate in more detail, consider the case of two inputs, so
$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}.$$
In this case the translog model of the logarithmic cost function is
$$\ln c = \alpha + \beta_1 x_1 + \beta_2 x_2 + \frac{\gamma_{11}}{2}x_1^2 + \frac{\gamma_{22}}{2}x_2^2 + \gamma_{12}x_1 x_2.$$
The two cost shares of the inputs are the derivatives of $\ln c$ with respect to $x_1$ and $x_2$:
$$s_1 = \beta_1 + \gamma_{11}x_1 + \gamma_{12}x_2$$
$$s_2 = \beta_2 + \gamma_{12}x_1 + \gamma_{22}x_2.$$
Note that the share equations and the cost equation have parameters in common. One can do a pooled estimation of the three equations at once, imposing that the parameters are the same. In this way we're using more observations and therefore more information, which will lead to improved efficiency. Note that this does assume that the cost equation is correctly specified (i.e., not an approximation), since otherwise the derivatives would not be the true derivatives of the log cost function, and would then be misspecified for the shares. To pool the equations, write the model in matrix form (adding in error terms):
$$\begin{bmatrix} \ln c \\ s_1 \\ s_2 \end{bmatrix} = \begin{bmatrix} 1 & x_1 & x_2 & \frac{x_1^2}{2} & \frac{x_2^2}{2} & x_1 x_2 \\ 0 & 1 & 0 & x_1 & 0 & x_2 \\ 0 & 0 & 1 & 0 & x_2 & x_1 \end{bmatrix}\begin{bmatrix} \alpha \\ \beta_1 \\ \beta_2 \\ \gamma_{11} \\ \gamma_{22} \\ \gamma_{12} \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \end{bmatrix}.$$
This is one observation on the three equations. With the appropriate notation, a single observation can be written as
$$y_t = X_t\theta + \varepsilon_t.$$
The overall model would stack $n$ observations on the three equations for a total of $3n$ observations:
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix}\theta + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}.$$
10.1. FLEXIBLE FUNCTIONAL FORMS
Next we need to consider the errors. For observation $t$ the errors can be placed in a vector
$$\varepsilon_t = \begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \\ \varepsilon_{3t} \end{bmatrix}.$$
First consider the covariance matrix of this vector: the shares are certainly correlated since they must sum to one. (In fact, with 2 shares the variances are equal and the covariance is -1 times the variance. General notation is used to allow easy extension to the case of more than 2 inputs). Also, it's likely that the shares and the cost equation have different variances. Supposing that the covariance matrix won't depend upon $t$:
$$\operatorname{Var}\left(\varepsilon_t\right) = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \cdot & \sigma_{22} & \sigma_{23} \\ \cdot & \cdot & \sigma_{33} \end{bmatrix} = \Sigma_0.$$
Note that this matrix is singular, since the shares sum to 1. Assuming that there is no autocorrelation, the overall covariance matrix has the seemingly unrelated regressions (SUR) structure:
$$\operatorname{Var}\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} = \begin{bmatrix} \Sigma_0 & 0 & \cdots & 0 \\ 0 & \Sigma_0 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \Sigma_0 \end{bmatrix} = I_n \otimes \Sigma_0,$$
where the symbol $\otimes$ indicates the Kronecker product. The Kronecker product of two matrices $A$ and $B$ is
$$A \otimes B = \begin{bmatrix} a_{11}B & a_{12}B & \cdots & a_{1q}B \\ \vdots & & & \vdots \\ a_{p1}B & \cdots & & a_{pq}B \end{bmatrix}.$$
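The Kronecker-product structure of the stacked covariance is easy to verify numerically; `numpy.kron` implements $A \otimes B$ (the matrix values below are illustrative):

```python
import numpy as np

Sigma0 = np.array([[2.0, 0.5],
                   [0.5, 1.0]])   # within-period error covariance, 2 equations
V = np.kron(np.eye(3), Sigma0)    # covariance of the stacked errors for n = 3
```

The result is block-diagonal: identical $\Sigma_0$ blocks down the diagonal and zeros elsewhere, exactly the no-autocorrelation SUR structure above.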
10.1.2. FGLS estimation of a translog model. So, this model has heteroscedasticity and autocorrelation (the errors within each period's block of three stacked observations are correlated), so OLS won't be efficient. The next question is: how do we estimate efficiently using FGLS? FGLS is based upon inverting the estimated error covariance, so we need to estimate $\Sigma_0$. An asymptotically efficient procedure is (supposing normality of the errors):

(1) Estimate each equation by OLS.
(2) Estimate $\Sigma_0$ using
$$\hat\Sigma_0 = \frac{1}{n}\sum_{t=1}^{n}\hat\varepsilon_t\hat\varepsilon_t'.$$
(3) Note that $\hat\Sigma_0$ will be singular when the shares sum to one, so FGLS won't work. The solution is to drop one of the share equations, for example the second. The model becomes
$$\begin{bmatrix} \ln c \\ s_1 \end{bmatrix} = \begin{bmatrix} 1 & x_1 & x_2 & \frac{x_1^2}{2} & \frac{x_2^2}{2} & x_1 x_2 \\ 0 & 1 & 0 & x_1 & 0 & x_2 \end{bmatrix}\begin{bmatrix} \alpha \\ \beta_1 \\ \beta_2 \\ \gamma_{11} \\ \gamma_{22} \\ \gamma_{12} \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix},$$
or, in matrix notation for a single observation,
$$y_t^* = X_t^*\theta + \varepsilon_t^*,$$
and, stacking the $n$ observations on the two equations,
$$y^* = X^*\theta + \varepsilon^*,$$
for a total of $2n$ observations.
(4) Define $\hat\Sigma_0^*$ as the leading $2 \times 2$ block of $\hat\Sigma_0$, and form
$$\hat\Sigma^* = I_n \otimes \hat\Sigma_0^*.$$
This is a consistent estimator of the covariance of the stacked errors $\varepsilon^*$.
(5) Finally the FGLS estimator can be calculated by applying OLS to the transformed model
$$\left(\hat\Sigma^*\right)^{-1/2}y^* = \left(\hat\Sigma^*\right)^{-1/2}X^*\theta + \left(\hat\Sigma^*\right)^{-1/2}\varepsilon^*,$$
or equivalently by using the GLS formula directly:
$$\hat\theta_{FGLS} = \left[X^{*\prime}\left(\hat\Sigma^*\right)^{-1}X^*\right]^{-1}X^{*\prime}\left(\hat\Sigma^*\right)^{-1}y^*.$$
(1) We have assumed no autocorrelation across time. This is clearly restrictive. It is relatively simple to relax this, but we won't go into it here.
(2) Also, we have only imposed symmetry of the second derivatives. Another restriction that the model should satisfy is that the estimated shares should sum to 1. This can be accomplished by imposing
$$\beta_1 + \beta_2 = 1$$
$$\gamma_{11} + \gamma_{12} = 0$$
$$\gamma_{12} + \gamma_{22} = 0.$$
These are linear parameter restrictions, so they are easy to impose and will improve efficiency if they are true.
(3) The estimation procedure outlined above can be iterated. That is, estimate $\hat\theta_{FGLS}$ as above, then re-estimate $\Sigma_0$ using the errors implied by $\hat\theta_{FGLS}$, i.e., $\hat\varepsilon = y - X\hat\theta_{FGLS}$. Then re-estimate $\theta$ using the new estimator of the error covariance. It can be shown that if this is repeated until the estimates don't change (i.e., iterated to convergence) then the resulting estimator is the MLE. At any rate, the asymptotic properties of the iterated and uniterated estimators are the same, since both are based upon a consistent estimator of the error covariance.
Note that one specification can be tested against another when it is nested within it. For example, the restricted model
$$y = \alpha + x'\beta + \varepsilon$$
is nested within the translog model
$$y = \alpha + x'\beta + \frac{x'\Gamma x}{2} + \varepsilon,$$
so the restriction $\Gamma = 0$ can be tested with the usual Wald, LR, score or $qF$ tests.
The situation is more complicated when we want to test non-nested hypotheses. If the two functional forms are linear in the parameters, and use the same transformation of the dependent variable, then they may be written as
$$M_1: \quad y = X\beta + \varepsilon, \qquad \varepsilon \sim \operatorname{iid}\left(0, \sigma_\varepsilon^2\right)$$
$$M_2: \quad y = Z\gamma + \eta, \qquad \eta \sim \operatorname{iid}\left(0, \sigma_\eta^2\right).$$
We wish to test hypotheses of the form: $H_0$: $M_i$ is correctly specified, versus $H_A$: $M_i$ is misspecified, for $i = 1, 2$. One could account for non-iid errors, but we'll suppress this for simplicity. There are a number of ways to proceed; we'll consider the $J$ test, proposed by Davidson and MacKinnon. The idea is to artificially nest the two models, e.g.,
$$y = \left(1 - \alpha\right)X\beta + \alpha\left(Z\gamma\right) + \omega.$$
If the first model is correctly specified, then the true value of $\alpha$ is zero. On the other hand, if the second model is correctly specified, then $\alpha = 1$.
C3 | iS iDC
P P
P I
G P
0C | | VC VDH
P
PI
fr
$I
f
$r
8
4)
B r
197
P
C3 |
P
C
&
&
PI
DC
P
S | | c
&
&
) P
C c
&
) P I
DH
)
c gf
&
P
i | | c
) P
i
&
&
8
in place of
8 Q&
&
gf
P
si3 3 C | | i I
P
P
P
P & EC & Hc & c
) P I P I )
test is to substitute
&
P
C3 |
The four
This is a consistent estimator supposing that the second model is correctly specified. It will tend to a finite probability limit even if the second model is misspecified. Then estimate the model
$$y = \left(1 - \alpha\right)X\beta + \alpha\left(Z\hat\gamma\right) + \omega = X\theta + \alpha\hat{y} + \omega,$$
where $\hat{y} = Z\left(Z'Z\right)^{-1}Z'y$. In this model, $\alpha$ is consistently estimable, and one can show that, under the hypothesis that the first model is correct, $\hat\alpha$ tends in probability to zero, and that the ordinary $t$-statistic for $\alpha = 0$,
$$t = \frac{\hat\alpha}{\hat\sigma_{\hat\alpha}},$$
is asymptotically normal: $t \overset{a}{\sim} N(0, 1)$.
We can reverse the roles of the models, testing the second against the first.

It may be the case that neither model is correctly specified. In this case, the test will still reject the null hypothesis, asymptotically, if we use critical values from the $N(0,1)$ distribution, since as long as $\hat\alpha$ tends to something different from zero, $|t| \overset{p}{\to} \infty$. Of course, when we switch the roles of the models the other will also be rejected asymptotically.

The $J$ test is simple to apply when both models are linear in the parameters. The $P$ test is similar, but easier to apply when $M_1$ is nonlinear.
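The two-step recipe — fit $M_2$, form $\hat y = Z\hat\gamma$, then regress $y$ on $X$ and $\hat y$ — is short to code. A Python sketch (interface mine, and using the ordinary OLS variance for the $t$-statistic, as in the linear case above):

```python
import numpy as np

def j_test(y, X, Z):
    """Davidson-MacKinnon J test of M1: y = X b against M2: y = Z g.
    Returns the t-statistic on the fitted values of M2; asymptotically
    N(0,1) when M1 is correctly specified."""
    g, *_ = np.linalg.lstsq(Z, y, rcond=None)
    yhat2 = Z @ g                          # fitted values from model 2
    Xa = np.column_stack([X, yhat2])       # artificially nested regression
    b, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    u = y - Xa @ b
    n, k = Xa.shape
    s2 = (u @ u) / (n - k)
    V = s2 * np.linalg.inv(Xa.T @ Xa)
    return b[-1] / np.sqrt(V[-1, -1])
```

Under a simulated design where $M_1$ is true, the statistic behaves like a standard normal draw.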
CHAPTER 11

Exogeneity and simultaneity

Up until now the model has been
$$y = X\beta + \varepsilon,$$
where we condition on $X$ when estimating, treating it as fixed in repeated samples.
Simultaneous equations is a different prospect. An example of a simultaneous equation system is a simple supply-demand system:
$$\text{Demand:}\quad q_t = \alpha_1 + \alpha_2 p_t + \alpha_3 y_t + \varepsilon_{1t}$$
$$\text{Supply:}\quad q_t = \beta_1 + \beta_2 p_t + \varepsilon_{2t}$$
$$E\left(\begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix}\begin{bmatrix} \varepsilon_{1t} & \varepsilon_{2t} \end{bmatrix}\right) = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix} \equiv \Sigma, \quad \forall t.$$
The presumption is that $q_t$ and $p_t$ are jointly determined at the same time by the intersection of these equations. We'll assume that $y_t$ (income) is determined by some unrelated process. It's easy to see that we have correlation between regressors and errors. Solving the two equations for $p_t$,
$$p_t = \frac{\beta_1 - \alpha_1}{\alpha_2 - \beta_2} - \frac{\alpha_3}{\alpha_2 - \beta_2}y_t + \frac{\varepsilon_{2t} - \varepsilon_{1t}}{\alpha_2 - \beta_2}.$$
Now consider whether $p_t$ is uncorrelated with $\varepsilon_{1t}$:
$$E\left(p_t\varepsilon_{1t}\right) = E\left[\left(\frac{\beta_1 - \alpha_1}{\alpha_2 - \beta_2} - \frac{\alpha_3}{\alpha_2 - \beta_2}y_t + \frac{\varepsilon_{2t} - \varepsilon_{1t}}{\alpha_2 - \beta_2}\right)\varepsilon_{1t}\right] = \frac{\sigma_{12} - \sigma_{11}}{\alpha_2 - \beta_2},$$
which is not zero in general. Because of this correlation, OLS estimation of the demand equation will be biased and inconsistent.
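The inconsistency is easy to see by simulation. A Python sketch (the parameter values are illustrative choices of mine; with $\alpha_2 = -1$, $\beta_2 = 1$ and independent unit-variance errors, the formula above implies the OLS coefficient on price converges to zero rather than to $\alpha_2 = -1$):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
a1, a2, a3 = 1.0, -1.0, 1.0          # demand: q = a1 + a2 p + a3 y + e1
b1, b2 = 0.0, 1.0                    # supply: q = b1 + b2 p + e2
y = rng.normal(0.0, 1.0, n)
e1 = rng.normal(0.0, 1.0, n)
e2 = rng.normal(0.0, 1.0, n)
# equilibrium price from the two equations
p = (b1 - a1) / (a2 - b2) - a3 * y / (a2 - b2) + (e2 - e1) / (a2 - b2)
q = b1 + b2 * p + e2
X = np.column_stack([np.ones(n), p, y])
bh = np.linalg.lstsq(X, q, rcond=None)[0]   # OLS on the demand equation
# bh[1] converges to 0, badly biased away from the true a2 = -1
```

The estimated price coefficient is pushed all the way to zero because the endogenous part of $p_t$ is positively correlated with $\varepsilon_{1t}$.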
201
gf
are a bit tricky, and well return to it in a minute. First, some notation. Suppose
endogs,
is
8
c
Group current and lagged exogs, as well as lagged endogs in the vector
equations into the error vector
&
8)
4'
R U
2
sT j
i
Rf
RI
Rf
8 & '
aiX1
is
.
.
.
RI
.
.
.
R
Rf
and
.
.
.
R
RI
where
Q1
'
is
2T
S
R P
8
c
a
8) '
4"&
, which is
& '
aiX1
is
&
If there are
11.2. EXOGENEITY
202
f
f Huxxb f H f H
I bb
I
II
.
.
.
.. .
. .
.
11.2. Exogeneity

The model defines a data generating process. The model involves two sets of variables, $Y_t$ and $X_t$, as well as a parameter vector
$$\theta = \left[\operatorname{vec}\left(\Gamma\right)'\ \operatorname{vec}\left(B\right)'\ \operatorname{vec}^*\left(\Sigma\right)'\right]',$$
where $\operatorname{vec}^*\left(\Sigma\right)$ collects the unique elements of $\Sigma$. This is the parameter vector that we're interested in estimating.

In principle, there exists a joint density function for $Y_t$ and $X_t$, which depends on a parameter vector $\varphi$. Write this density as
$$f_t\left(Y_t, X_t|\varphi, I_t\right),$$
where $I_t$ is the information set in period $t$, including lagged $Y$'s and $X$'s. This joint density can be factored into the density of $Y_t$ conditional on $X_t$ times the marginal density of $X_t$. This is a general factorization, but it may very well be the case that not all parameters in $\varphi$ affect both factors. So use $\varphi_1$ to indicate elements of $\varphi$ that enter into the conditional density and $\varphi_2$ for parameters that enter into the marginal:
$$f_t\left(Y_t, X_t|\varphi, I_t\right) = f_t\left(Y_t|X_t, \varphi_1, I_t\right)f_t\left(X_t|\varphi_2, I_t\right).$$
In general, $\varphi_1$ and $\varphi_2$ may share elements, of course.
Normality and lack of correlation over time imply that the observations are independent of one another, so we can write the log-likelihood function as the sum of likelihood contributions of each observation:
$$\ln L\left(Y|\varphi, I\right) = \sum_{t=1}^{T}\ln f_t\left(Y_t, X_t|\varphi, I_t\right) = \sum_{t=1}^{T}\ln f_t\left(Y_t|X_t, \varphi_1, I_t\right) + \sum_{t=1}^{T}\ln f_t\left(X_t|\varphi_2, I_t\right).$$
DEFINITION (Weak exogeneity). $X_t$ is weakly exogenous for $\theta$ (the parameters of the conditional model) if there is a mapping from $\varphi$ to $\theta$, $\theta = \theta\left(\varphi\right)$, that is invariant to $\varphi_2$, and $\varphi_1$ and $\varphi_2$ are variation free; that is, arbitrary combinations of $\left(\varphi_1, \varphi_2\right)$ are possible, so knowledge of $\varphi_2$ carries no information about $\varphi_1$. Without weak exogeneity, $\theta$ would change as $\varphi_2$ changes.

Supposing that $X_t$ is weakly exogenous, the second sum in the log-likelihood is irrelevant for inference on $\theta$, so the MLE of the parameter of interest using the joint density is the same as the MLE using only the conditional density
$$\sum_{t=1}^{T}\ln f_t\left(Y_t|X_t, \varphi_1, I_t\right).$$
With weak exogeneity, knowledge of the DGP of $X_t$ is irrelevant for inference on $\theta$, and we can treat $X_t$ as fixed in inference.
Of course, we'll need to figure out just what this mapping from $\varphi_1$ to $\theta$ is to recover $\theta$; this is the identification problem, taken up below.

With lack of weak exogeneity, the joint and conditional likelihood functions maximize in different places, so we can't treat $X_t$ as fixed in inference. The joint MLE is valid, but the conditional MLE is not.

In summary, we require the variables in $X_t$ to be weakly exogenous if we are to be able to treat them as fixed in estimation. Lagged $Y_t$'s satisfy the definition, since they are in the conditioning information set, e.g., $Y_{t-1} \in I_t$. Lagged $Y_t$'s aren't exogenous in the normal usage of the word, since their values are determined within the model, just earlier on. Weakly exogenous variables include exogenous (in the normal sense) variables as well as all predetermined variables.
11.3. Reduced form

Recall that the model, in structural form, is
$$Y_t'\Gamma = X_t'B + E_t'.$$
An equation is in structural form when more than one current period endogenous variable is included. The solution for the current period endogs is easy to find. It is
$$Y_t' = X_t'B\Gamma^{-1} + E_t'\Gamma^{-1} = X_t'\Pi + V_t'.$$
Now only one current period endog appears in each equation. This is the reduced form.

DEFINITION 17 (Reduced form). An equation is in reduced form if only one current period endog is included.

An example is our supply/demand system. The reduced form for quantity is obtained by solving the supply equation for price and substituting into demand:
$$q_t = \frac{\alpha_1\beta_2 - \alpha_2\beta_1}{\beta_2 - \alpha_2} + \frac{\beta_2\alpha_3}{\beta_2 - \alpha_2}y_t + \frac{\beta_2\varepsilon_{1t} - \alpha_2\varepsilon_{2t}}{\beta_2 - \alpha_2} = \pi_{11} + \pi_{21}y_t + v_{1t}.$$
Similarly, the rf for price is
$$p_t = \frac{\beta_1 - \alpha_1}{\alpha_2 - \beta_2} - \frac{\alpha_3}{\alpha_2 - \beta_2}y_t + \frac{\varepsilon_{2t} - \varepsilon_{1t}}{\alpha_2 - \beta_2} = \pi_{12} + \pi_{22}y_t + v_{2t}.$$
The interesting thing about the rf is that the equations individually satisfy the classical assumptions, since $y_t$ is uncorrelated with $\varepsilon_{1t}$ and $\varepsilon_{2t}$ by assumption, and therefore $E\left(y_t v_{it}\right) = 0$, $i = 1, 2$, $\forall t$. The rf errors are linear combinations of the structural errors; for example,
$$v_{1t} = \frac{\beta_2\varepsilon_{1t} - \alpha_2\varepsilon_{2t}}{\beta_2 - \alpha_2},$$
so the variance of $v_{1t}$ is
$$V\left(v_{1t}\right) = \frac{\beta_2^2\sigma_{11} - 2\alpha_2\beta_2\sigma_{12} + \alpha_2^2\sigma_{22}}{\left(\beta_2 - \alpha_2\right)^2}.$$
This is constant over time, so the first rf equation is homoscedastic. Likewise, since the $\varepsilon_t$ are independent over time, so are the $v_t$. The variance of the second rf error and the contemporaneous covariance $E\left(v_{1t}v_{2t}\right)$ can be calculated in the same way; the covariance is not zero in general. In summary, the rf equations individually satisfy the classical assumptions, under the assumptions we've made, but they are contemporaneously correlated.
More generally, the rf errors are $V_t' = E_t'\Gamma^{-1}$, so the $v_t$ inherit their time-series properties from the $\varepsilon_t$ (the $v_t$ would be autocorrelated only if the $\varepsilon_t$ were autocorrelated).
11.4. IV estimation

The IV estimator may appear a bit unusual at first, but it will grow on you over time.

The simultaneous equations model is
$$Y\Gamma = XB + E.$$
Considering the first equation (this is without loss of generality, since we can always reorder the equations), partition the variable matrix as
$$Y = \begin{bmatrix} y & Y_1 & Y_2 \end{bmatrix},$$
where $y$ is the first column, $Y_1$ are the other endogenous variables that enter the first equation, and $Y_2$ are endogs that are excluded from this equation. Similarly, partition $X$ as
$$X = \begin{bmatrix} X_1 & X_2 \end{bmatrix},$$
where $X_1$ are the included exogs and $X_2$ are the excluded exogs. Assume that $\Gamma$ has ones on the main diagonal. These are normalization restrictions that simply scale the remaining coefficients on each equation, and which scale the variances of the error terms.

Given this scaling and our partitioning, the coefficient matrices can be written as
$$\Gamma = \begin{bmatrix} 1 & \Gamma_{12} \\ -\gamma_1 & \Gamma_{22} \\ 0 & \Gamma_{32} \end{bmatrix}, \qquad B = \begin{bmatrix} \beta_1 & B_{12} \\ 0 & B_{22} \end{bmatrix}.$$
With this, the first equation can be written as
$$y = Y_1\gamma_1 + X_1\beta_1 + \varepsilon = Z\delta + \varepsilon,$$
where $Z = \begin{bmatrix} Y_1 & X_1 \end{bmatrix}$ and $\delta = \begin{bmatrix} \gamma_1' & \beta_1' \end{bmatrix}'$. The problem, as we've seen, is that $Y_1$ is correlated with $\varepsilon$, since $Y_1$ is formed of endogs.
Now, let's consider the general problem of a linear regression model with correlation between regressors and the error term:
$$y = X\beta + \varepsilon, \qquad \varepsilon \sim \operatorname{iid}\left(0, \sigma^2 I_n\right), \qquad E\left(X'\varepsilon\right) \ne 0.$$
The present case of a structural equation from a system of equations fits into this notation, but so do other problems, such as measurement error or lagged dependent variables with autocorrelated errors. Consider some matrix $W$ which is formed of variables uncorrelated with $\varepsilon$. This matrix defines the projection matrix
$$P_W = W\left(W'W\right)^{-1}W',$$
so that anything that is projected onto the space spanned by $W$ will be uncorrelated with $\varepsilon$, by the definition of $W$. Transforming the model with this projection matrix we get
$$P_W y = P_W X\beta + P_W\varepsilon,$$
or
$$y^* = X^*\beta + \varepsilon^*.$$
Now we have that $\varepsilon^*$ and $X^*$ are uncorrelated, since
$$E\left(X^{*\prime}\varepsilon^*\right) = E\left(X'P_W'P_W\varepsilon\right) = E\left(X'P_W\varepsilon\right),$$
and
$$P_W X = W\left(W'W\right)^{-1}W'X$$
is a linear combination of the columns of $W$, so it must be uncorrelated with $\varepsilon$. This implies that applying OLS to the transformed model will lead to a consistent estimator, given a few more assumptions. This is the generalized instrumental variables estimator; $W$ is known as the matrix of instruments. The estimator is
$$\hat\beta_{IV} = \left(X'P_W X\right)^{-1}X'P_W y.$$
From this we see that
$$\hat\beta_{IV} = \beta + \left(X'P_W X\right)^{-1}X'P_W\varepsilon,$$
so that
$$\hat\beta_{IV} = \beta + \left[\left(\frac{X'W}{n}\right)\left(\frac{W'W}{n}\right)^{-1}\left(\frac{W'X}{n}\right)\right]^{-1}\left(\frac{X'W}{n}\right)\left(\frac{W'W}{n}\right)^{-1}\frac{W'\varepsilon}{n}.$$
Assuming that each of the terms with an $n$ in the denominator satisfies a LLN, so that
$$\frac{W'W}{n} \overset{p}{\to} Q_{WW}, \text{ a finite pd matrix,}$$
$$\frac{W'X}{n} \overset{p}{\to} Q_{WX}, \text{ a finite matrix with rank } K\ \left(= \operatorname{cols}\left(X\right)\right),$$
$$\frac{W'\varepsilon}{n} \overset{p}{\to} 0,$$
then the plim of the rhs is zero. This last term has plim 0 since we assume that $W$ and $\varepsilon$ are uncorrelated, e.g.,
$$E\left(W_t'\varepsilon_t\right) = 0.$$
Given these assumptions, the IV estimator is consistent:
$$\hat\beta_{IV} \overset{p}{\to} \beta.$$
Furthermore, scaling by $\sqrt{n}$, we have
$$\sqrt{n}\left(\hat\beta_{IV} - \beta\right) = \left[\left(\frac{X'W}{n}\right)\left(\frac{W'W}{n}\right)^{-1}\left(\frac{W'X}{n}\right)\right]^{-1}\left(\frac{X'W}{n}\right)\left(\frac{W'W}{n}\right)^{-1}\frac{W'\varepsilon}{\sqrt{n}}.$$
Assuming that the far right term satisfies a CLT, so that
$$\frac{W'\varepsilon}{\sqrt{n}} \overset{d}{\to} N\left(0, Q_{WW}\sigma^2\right),$$
then we get
$$\sqrt{n}\left(\hat\beta_{IV} - \beta\right) \overset{d}{\to} N\left(0, \left(Q_{XW}Q_{WW}^{-1}Q_{WX}\right)^{-1}\sigma^2\right).$$
The estimators for $Q_{XW}$ and $Q_{WW}$ are the obvious ones. An estimator for $\sigma^2$ is
$$\hat\sigma^2_{IV} = \frac{1}{n}\left(y - X\hat\beta_{IV}\right)'\left(y - X\hat\beta_{IV}\right).$$
This estimator is consistent following the proof of consistency of the OLS estimator of $\sigma^2$, when the classical assumptions hold.
The formula used to estimate the asymptotic variance is
$$\hat{V}\left(\hat\beta_{IV}\right) = \left[X'W\left(W'W\right)^{-1}W'X\right]^{-1}\hat\sigma^2_{IV}.$$
The IV estimator is
(1) Consistent
(2) Asymptotically normally distributed
(3) Biased in general, since even though $E\left(X^{*\prime}\varepsilon\right) = 0$, $E\left[\left(X^{*\prime}X^*\right)^{-1}X^{*\prime}\varepsilon\right]$ may not be zero, because $\left(X^{*\prime}X^*\right)^{-1}$ and $X^{*\prime}\varepsilon$ are not independent.

An important point is that the asymptotic distribution of $\hat\beta_{IV}$ depends upon $Q_{XW}$ and $Q_{WW}$, and these depend upon the choice of $W$. The choice of instruments influences the efficiency of the estimator.
IV estimation can clearly be used in the case of simultaneous equations. The only issue is which instruments to use.
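The estimator itself can be computed without ever forming the $n \times n$ projection matrix, using the idempotency of $P_W$. A Python sketch (interface mine):

```python
import numpy as np

def iv(X, W, y):
    """Generalized IV estimator (X' P_W X)^{-1} X' P_W y, computed
    without building P_W: since P_W is symmetric and idempotent,
    X' P_W X = (P_W X)' X."""
    WtW_inv_WtX = np.linalg.solve(W.T @ W, W.T @ X)
    Xstar = W @ WtW_inv_WtX          # P_W X, the projected regressors
    return np.linalg.solve(Xstar.T @ X, Xstar.T @ y)
```

As a sanity check, when the regressors serve as their own instruments ($W = X$), the estimator collapses to OLS.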
11.5. Identification by exclusion restrictions

The identification problem is whether the structural parameters can be recovered from the data. In the IV context, recall that the limiting covariance of the IV estimator is
$$V_{\infty}\left(\hat{\beta}_{IV}\right)=\left(Q_{XW}Q_{WW}^{-1}Q_{XW}'\right)^{-1}\sigma^{2}.$$
- The necessary and sufficient condition for identification is simply that this matrix be positive definite, and that the instruments be (asymptotically) uncorrelated with $\varepsilon$.
- For this, we need $Q_{WW}$ to be positive definite and $Q_{XW}$ to be of full rank ($K$).
- These identification conditions are not that intuitive nor is it very obvious how to check them.

11.5.1. Necessary conditions. If we use IV estimation for a single equation of the system, the equation can be written as
$$y=Z\delta+\varepsilon,$$
where
$$Z=\left[\begin{array}{cc}Y_{1} & X_{1}\end{array}\right].$$
Notation:
- Let $K$ be the total number of weakly exogenous variables.
- Let $K^{*}=$ cols$(X_{1})$ be the number of included exogs, and $K^{**}=K-K^{*}$ the number of excluded exogs (in this equation).
- Let $G^{*}=$ cols$(Y_{1})+1$ be the total number of included endogs, and $G^{**}=G-G^{*}$ the number of excluded endogs.

The $X_{1}$ are weakly exogenous and can serve as their own instruments. It turns out that $X$ exhausts the set of possible instruments, in that if the variables in $X$ don't lead to an identified model then no other instruments will identify the model either. Assuming this is true (we'll prove it in a moment), then a necessary condition for identification is that the number of excluded exogs be at least as large as the number of included rhs endogs, since if not then at least one instrument must be used twice, so that the matrix of instruments will not have full column rank. This is the order condition for identification:
$$K^{**}\geq G^{*}-1.$$
To see why $X$ exhausts the set of instruments, consider an arbitrary set of instruments $W$. A necessary condition for identification is that
$$\operatorname{plim}\frac{1}{n}W'Z$$
have full column rank. Writing the reduced form of the included endogs as $Y_{1}=X\Pi_{1}+V_{1}$, we have
$$\frac{1}{n}W'Z=\frac{1}{n}W'\left[\begin{array}{cc}X\Pi_{1}+V_{1} & X_{1}\end{array}\right].$$
Because the $V_{1}$ are uncorrelated with any valid instruments in the limit, we have
$$\operatorname{plim}\frac{1}{n}W'Z=\operatorname{plim}\frac{1}{n}W'\left[\begin{array}{cc}X\Pi_{1} & X_{1}\end{array}\right],$$
so the limiting matrix whose rank matters is built only from $X\Pi_{1}$ and $X_{1}$.
Since the far rhs term is formed only of linear combinations of columns of $X$, the rank of the limiting matrix can be no greater than the number of linearly independent columns that $X$ can supply. For it to have full column rank, $K^{*}+G^{*}-1$, the $K^{**}$ excluded exogs must supply at least $G^{*}-1$ columns beyond the included exogs, so instruments other than $X$ add nothing. When $K^{**}<G^{*}-1$, there are too few excluded exogs to serve as instruments. In this case, the limiting matrix is not of full column rank, and the identification condition fails.
11.5.2. Sufficient conditions. Identification essentially requires that the structural parameters be recoverable from the data. This won't be the case, in general, unless the structural model is subject to some restrictions. We've already identified necessary conditions. Turning to sufficient conditions (again, we're only considering identification through zero restrictions on the parameters, for the moment):
The model is
$$Y\Gamma=XB+E,$$
which has the reduced form
$$Y=XB\Gamma^{-1}+E\Gamma^{-1}\equiv X\Pi+V.$$
The reduced form parameters are consistently estimable, but none of them are known a priori, and there are no restrictions on their values. The problem is that more than one structural form has the same reduced form, so knowledge of the reduced form parameters alone isn't enough to determine the structural parameters. To see this, consider the model transformed by postmultiplication by some $G\times G$ nonsingular matrix $F$:
$$Y\Gamma F=XBF+EF,$$
with structural parameters $\Gamma^{*}=\Gamma F$ and $B^{*}=BF$. The reduced form of the transformed model is
$$Y=XB^{*}\Gamma^{*-1}+E^{*}\Gamma^{*-1}=XBFF^{-1}\Gamma^{-1}+EFF^{-1}\Gamma^{-1}=X\Pi+V,$$
which is the same as before.
Since the two structural forms lead to the same rf, and the rf is all that is directly estimable, the models are said to be observationally equivalent. What we
need for identification are restrictions on $\Gamma$ and $B$ such that the only admissible transformation $F$ is an identity matrix (if all of the equations are to be identified). Take the coefficients of the first equation of the transformed model: they are simply the original coefficients postmultiplied by the first column of $F$, say $F_{\cdot 1}=\left(f_{11},F_{2}'\right)'$. For identification of the first equation we need that there be enough restrictions so that the only admissible $F_{\cdot 1}$ is the first column of an identity matrix. In particular, the zero restrictions on the first equation (the excluded endogs and excluded exogs) require that the corresponding rows of the other equations' coefficients, collected in the submatrix
$$\left[\begin{array}{c}\Gamma_{12}\\ B_{22}\end{array}\right]$$
(the coefficients of the variables excluded from the first equation, as they appear in the remaining equations), satisfy
$$\left[\begin{array}{c}\Gamma_{12}\\ B_{22}\end{array}\right]F_{2}=0.$$
Therefore, as long as
$$\rho\left(\left[\begin{array}{c}\Gamma_{12}\\ B_{22}\end{array}\right]\right)=G-1,$$
then the only way this can hold, without additional restrictions on the model's parameters, is if $F_{2}$ is a vector of zeros. Given that $F_{2}$ is a vector of zeros, the normalization of the lhs coefficient pins down $f_{11}=1$, so $F_{\cdot 1}$ is the first column of an identity matrix. The first equation is identified in this case, so the condition is sufficient for identification. It is also necessary, since the condition implies that this submatrix must have at least $G-1$ rows. Since this matrix has $G^{**}+K^{**}$ rows, we obtain
$$G^{**}+K^{**}\geq G-1,$$
or, noting that $G^{**}=G-G^{*}$,
$$K^{**}\geq G^{*}-1,$$
which is the previously derived order condition.

When $K^{**}=G^{*}-1$, the equation is exactly identified. When $K^{**}>G^{*}-1$, the equation is overidentified, since one could drop a restriction and still retain consistency. Overidentifying restrictions are therefore testable. When an equation is overidentified we have more instruments than are strictly necessary for consistent estimation. Since estimation by IV with more instruments is more efficient asymptotically, one should employ overidentifying restrictions if one is confident that they're true.

We can repeat this partition for each equation in the system, to see which equations are identified. These results apply to identification through exclusion restrictions; other sorts of restrictions can also identify an equation, so exclusion restrictions aren't necessary for identification, though they are of course still sufficient.
To give an example of how other information can be used, consider the model
$$Y\Gamma=XB+E,$$
where $\Gamma$ is an upper triangular matrix with 1's on the main diagonal. This is a triangular system of equations. In this case, the first equation is
$$y_{1}=XB_{\cdot 1}+E_{\cdot 1}.$$
Since only exogs appear on the rhs, this equation is identified. The second equation contains $y_{1}$ as well as exogs: it has no excluded exogs and two included endogs, so it fails the order condition. However, suppose that we have the restriction $\Sigma_{21}=0$, so that the first and second structural errors are uncorrelated. In this case
$$E\left(y_{1t}\varepsilon_{2t}\right)=E\left\{\left(X_{t}'B_{\cdot 1}+\varepsilon_{1t}\right)\varepsilon_{2t}\right\}=0,$$
so there's no problem of simultaneity: $y_{1}$ can serve as its own instrument. If the entire $\Sigma$ matrix is diagonal, then following the same logic, all of the equations are identified. This is known as a fully recursive model.
11.5.3. Example: Klein's Model 1. To give an example of determining identification status, consider the following macro model (this is the widely known Klein's Model 1):

Consumption: $C_{t}=\alpha_{0}+\alpha_{1}P_{t}+\alpha_{2}P_{t-1}+\alpha_{3}\left(W_{t}^{p}+W_{t}^{g}\right)+\varepsilon_{1t}$
Investment: $I_{t}=\beta_{0}+\beta_{1}P_{t}+\beta_{2}P_{t-1}+\beta_{3}K_{t-1}+\varepsilon_{2t}$
Private Wages: $W_{t}^{p}=\gamma_{0}+\gamma_{1}X_{t}+\gamma_{2}X_{t-1}+\gamma_{3}A_{t}+\varepsilon_{3t}$
Output: $X_{t}=C_{t}+I_{t}+G_{t}$
Profits: $P_{t}=X_{t}-T_{t}-W_{t}^{p}$
Capital Stock: $K_{t}=K_{t-1}+I_{t}$

Here $W_{t}^{g}$ is the government wage bill, $G_{t}$ is government nonwage spending, $T_{t}$ is taxes, and $A_{t}$ is a time trend. The endogenous variables are the lhs variables,
$$Y_{t}=\left[\begin{array}{cccccc}C_{t} & I_{t} & W_{t}^{p} & X_{t} & P_{t} & K_{t}\end{array}\right],$$
and the predetermined variables are all others:
$$X_{t}=\left[\begin{array}{cccccccc}1 & W_{t}^{g} & G_{t} & T_{t} & A_{t} & P_{t-1} & K_{t-1} & X_{t-1}\end{array}\right].$$
The model assumes that the errors of the equations are contemporaneously correlated but nonautocorrelated. Consider the consumption equation. There are $G=6$ equations and $K=8$ predetermined variables. The included endogs are $C_{t}$, $P_{t}$ and $W_{t}^{p}$, so $G^{*}=3$; the included exogs are the constant, $P_{t-1}$ and $W_{t}^{g}$, so $K^{*}=3$ and $K^{**}=5$. Since $K^{**}\geq G^{*}-1$, the order condition is satisfied.

Checking the sufficient (rank) condition requires forming the submatrix of coefficients of the variables excluded from the consumption equation, as they appear in the other equations: these are the rows of the stacked coefficient matrix that have zeros in the first column, with that first column deleted. Writing out this matrix for Klein's model and checking, it is of full rank, $G-1=5$, so the sufficient condition for identification is met.

Counting exclusion restrictions, $K^{**}-\left(G^{*}-1\right)=5-2=3$, so the consumption equation appears overidentified by three restrictions according to the counting rules, which are correct when the only identifying information are the exclusion restrictions. However, there is additional information in this case: both $W_{t}^{p}$ and $W_{t}^{g}$ enter the consumption equation, and their coefficients are restricted to be the same. For this reason the consumption equation is in fact overidentified by four restrictions.
11.6. 2SLS
When we have no information regarding cross-equation restrictions or the structure of the error covariance matrix, one can estimate the parameters of a single equation of the system without regard to the other equations.

- This isn't always efficient, as we'll see, but it has the advantage that misspecification in other equations will not affect the estimator of the equation of interest.

The 2SLS estimator is very simple: in the first stage, each column of $Y_{1}$ is regressed on all the weakly exogenous variables in the system, e.g., the entire $X$ matrix. The fitted values are
$$\hat{Y}_{1}=X\left(X'X\right)^{-1}X'Y_{1}=P_{X}Y_{1}=X\hat{\Pi}_{1}.$$
Since these fitted values are the projection of $Y_{1}$ on the space spanned by $X$, and since any vector in this space is uncorrelated with $\varepsilon$ by assumption, $\hat{Y}_{1}$ is uncorrelated with $\varepsilon$. Since $\hat{Y}_{1}$ is simply the reduced-form prediction, it is correlated with $Y_{1}$. The only other requirement is that the instruments be linearly independent. This should be the case when the order condition is satisfied, since there are more columns in $X_{2}$ than in $Y_{1}$ in this case.

The second stage substitutes $\hat{Y}_{1}$ in place of $Y_{1}$, and estimates by OLS. The original model is
$$y=Y_{1}\gamma_{1}+X_{1}\beta_{1}+\varepsilon=Z\delta+\varepsilon,$$
and the second stage model is
$$y=\hat{Y}_{1}\gamma_{1}+X_{1}\beta_{1}+\varepsilon.$$
Since $X_{1}$ is in the space spanned by $X$, $P_{X}X_{1}=X_{1}$, so we can write the second stage model as
$$y=P_{X}Y_{1}\gamma_{1}+P_{X}X_{1}\beta_{1}+\varepsilon\equiv P_{X}Z\delta+\varepsilon.$$
The OLS estimator applied to this model is
$$\hat{\delta}=\left(Z'P_{X}Z\right)^{-1}Z'P_{X}y,$$
which is exactly what we get if we estimate using IV, with the reduced form predictions of the endogs used as instruments. Note that if we define
$$\hat{Z}=P_{X}Z=\left[\begin{array}{cc}\hat{Y}_{1} & X_{1}\end{array}\right],$$
so that $\hat{Z}$ are the instruments for $Z$, then we can write
$$\hat{\delta}=\left(\hat{Z}'Z\right)^{-1}\hat{Z}'y.$$
- Important note: OLS on the transformed model can be used to calculate the 2SLS estimate of $\delta$, since we see that it's equivalent to IV using a particular set of instruments. However the OLS covariance formula is not valid. We need to apply the IV covariance formula already seen above.

Actually, there is also a simplification of the general IV variance formula. The IV covariance estimator, using $\hat{Z}$ as instruments, is
$$\hat{V}\left(\hat{\delta}\right)=\left(\hat{Z}'Z\right)^{-1}\left(\hat{Z}'\hat{Z}\right)\left(Z'\hat{Z}\right)^{-1}\widehat{\sigma_{IV}^{2}},$$
but since $P_{X}$ is idempotent and $\hat{Z}=P_{X}Z$, we have
$$\hat{Z}'\hat{Z}=Z'P_{X}P_{X}Z=Z'P_{X}Z=\hat{Z}'Z.$$
Therefore, the second and last term in the variance formula cancel, so the 2SLS varcov estimator simplifies to
$$\hat{V}\left(\hat{\delta}\right)=\left(Z'\hat{Z}\right)^{-1}\widehat{\sigma_{IV}^{2}},$$
which, following some algebra similar to the above, can also be written as
$$\hat{V}\left(\hat{\delta}\right)=\left(\hat{Z}'\hat{Z}\right)^{-1}\widehat{\sigma_{IV}^{2}}.$$
Finally, recall that though this is presented in terms of the first equation, it is general since any equation can be placed first.

Properties of 2SLS:
(1) Consistent
(2) Asymptotically normal
(3) Biased when the mean exists (the existence of moments is a technical issue we won't go into here).
(4) Asymptotically inefficient, except in special circumstances (more on this later).
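The two-stage construction above can be sketched numerically. In this hypothetical simulation (all names and parameter values are invented for illustration), a structural equation has one endogenous rhs variable and one included exog; two excluded exogs serve as instruments, and the 2SLS estimate is computed as $\hat{\delta}=(\hat{Z}'Z)^{-1}\hat{Z}'y$.

```python
import numpy as np

# Hypothetical system: y1 = 2.0 - 1.5*y2 + 0.5*x1 + eps, with y2 endogenous.
rng = np.random.default_rng(1)
n = 5000
x1, x2, x3 = rng.normal(size=(3, n))
u = rng.normal(size=n)                                   # shared shock
y2 = 1.0 + x1 + x2 + 0.5 * x3 + u + rng.normal(size=n)   # endogenous regressor
y1 = 2.0 - 1.5 * y2 + 0.5 * x1 + u + rng.normal(size=n)

Z = np.column_stack([np.ones(n), y2, x1])        # rhs of the structural equation
X = np.column_stack([np.ones(n), x1, x2, x3])    # all weakly exogenous variables

# First stage: fitted values Zhat = P_X Z (regress each column of Z on X)
Zhat = X @ np.linalg.lstsq(X, Z, rcond=None)[0]

# Second stage in IV form: delta = (Zhat'Z)^{-1} Zhat'y
delta = np.linalg.solve(Zhat.T @ Z, Zhat.T @ y1)
```

The estimated coefficient on the endogenous variable should be close to the true value $-1.5$, while OLS on the structural equation would be biased toward zero by the shared shock.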
11.7. Testing the overidentifying restrictions

The IV objective function can be used to test the overidentifying restrictions. Write the equation of interest as
$$y=Z\delta+\varepsilon,$$
where $W$ is the matrix of instruments. The generalized IV estimator minimizes
$$s(\delta)=\left(y-Z\delta\right)'P_{W}\left(y-Z\delta\right),$$
and the fonc are
$$\hat{\varepsilon}'P_{W}Z=0,$$
where $\hat{\varepsilon}=y-Z\hat{\delta}_{IV}$. Consider the value of the objective function at the minimum,
$$s\left(\hat{\delta}_{IV}\right)=\hat{\varepsilon}'P_{W}\hat{\varepsilon}.$$
Substituting $y=Z\delta+\varepsilon$ into $\hat{\varepsilon}=y-Z\left(Z'P_{W}Z\right)^{-1}Z'P_{W}y$, we can write
$$\hat{\varepsilon}=\left(I-Z\left(Z'P_{W}Z\right)^{-1}Z'P_{W}\right)\varepsilon\equiv A\varepsilon,$$
so
$$s\left(\hat{\delta}_{IV}\right)=\varepsilon'A'P_{W}A\varepsilon.$$
Furthermore, multiplying out shows that
$$A'P_{W}A=P_{W}-P_{W}Z\left(Z'P_{W}Z\right)^{-1}Z'P_{W},$$
which is idempotent, as can be verified by squaring it. The rank of an idempotent matrix is equal to its trace:
$$\operatorname{tr}\left(P_{W}-P_{W}Z\left(Z'P_{W}Z\right)^{-1}Z'P_{W}\right)=\operatorname{tr}P_{W}-\operatorname{tr}\left(Z'P_{W}Z\left(Z'P_{W}Z\right)^{-1}\right)=\operatorname{cols}(W)-\operatorname{cols}(Z),$$
where $\operatorname{cols}(W)-\operatorname{cols}(Z)$ is the number of overidentifying restrictions. Supposing the $\varepsilon$ are distributed with variance $\sigma^{2}$, and given a consistent estimator $\widehat{\sigma^{2}}$, we have that, asymptotically,
$$\frac{\hat{\varepsilon}'P_{W}\hat{\varepsilon}}{\widehat{\sigma^{2}}}\sim\chi^{2}\left(\operatorname{cols}(W)-\operatorname{cols}(Z)\right),$$
since it is a quadratic form in asymptotically normal variables with an idempotent weight matrix. Even if the $\varepsilon$ aren't normally distributed, the asymptotic result still holds.

This test is an overall specification test: the joint null hypothesis is that the model is correctly specified and that the instruments are valid (e.g., that the variables classified as exogs really are uncorrelated with $\varepsilon$). Rejection can mean that either the model $y=Z\delta+\varepsilon$ is misspecified, or that there is correlation between $X$ and $\varepsilon$.

Using $\widehat{\sigma^{2}}=\hat{\varepsilon}'\hat{\varepsilon}/n$, we can write the statistic as
$$\frac{n\,\hat{\varepsilon}'P_{W}\hat{\varepsilon}}{\hat{\varepsilon}'\hat{\varepsilon}}=n\,\frac{\hat{\varepsilon}'W\left(W'W\right)^{-1}W'\hat{\varepsilon}}{\hat{\varepsilon}'\hat{\varepsilon}}=nR_{u}^{2},$$
where $R_{u}^{2}$ is the uncentered $R^{2}$ from a regression of the IV residuals on the matrix of instruments $W$. This is a convenient way to calculate the test statistic.

If we have exact identification then $\operatorname{cols}(W)=\operatorname{cols}(Z)$, so the test cannot be calculated. In this case $W'Z$ is a square, invertible matrix, and the fonc
$$\hat{\varepsilon}'P_{W}Z=0$$
imply $W'\hat{\varepsilon}=0$ exactly. To see this, note that in the just identified case the IV estimator reduces to
$$\hat{\delta}_{IV}=\left(W'Z\right)^{-1}W'y,$$
so
$$W'\hat{\varepsilon}=W'y-W'Z\left(W'Z\right)^{-1}W'y=0$$
by the fonc for generalized IV. However, when we're in the just identified case, this means
$$\hat{\varepsilon}'P_{W}\hat{\varepsilon}=\hat{\varepsilon}'W\left(W'W\right)^{-1}W'\hat{\varepsilon}=0.$$
The value of the objective function of the IV estimator is zero in the just identified case. This makes sense, since we've already shown that the objective function after dividing by $\sigma^{2}$ is asymptotically $\chi^{2}$ with degrees of freedom equal to the number of overidentifying restrictions, which is zero here; e.g., it's simply 0. This means we're not able to test the identifying restrictions in the case of exact identification.
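The $nR_{u}^{2}$ form of the overidentification statistic is easy to compute. The sketch below (a hypothetical simulation with one overidentifying restriction, so the statistic is asymptotically $\chi^{2}(1)$ under the null) regresses the IV residuals on the instruments and forms $n$ times the uncentered $R^{2}$.

```python
import numpy as np

# Hypothetical design: 2 regressors (constant + x), 3 instruments -> 1 overid restriction.
rng = np.random.default_rng(2)
n = 5000
w1, w2 = rng.normal(size=(2, n))
u = rng.normal(size=n)
x = w1 + w2 + 0.5 * u + rng.normal(size=n)
eps = 0.5 * u + rng.normal(size=n)
y = 1.0 + 2.0 * x + eps

X = np.column_stack([np.ones(n), x])
W = np.column_stack([np.ones(n), w1, w2])

# Generalized IV estimate and residuals
PW = W @ np.linalg.solve(W.T @ W, W.T)
b_iv = np.linalg.solve(X.T @ PW @ X, X.T @ PW @ y)
e = y - X @ b_iv

# Regress IV residuals on the instruments; statistic = n * uncentered R^2
ehat = W @ np.linalg.lstsq(W, e, rcond=None)[0]
stat = n * (ehat @ ehat) / (e @ e)
# Under the null, stat ~ chi^2(cols(W) - cols(Z)) = chi^2(1); 5% critical value ~ 3.84
```

Since the instruments here are valid by construction, the statistic should be small relative to the $\chi^{2}(1)$ critical value in most runs.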
11.8. System methods of estimation

System methods of estimation make use of information about all equations in the system; the main system methods are 3SLS and FIML. There are two reasons they can improve on single equation methods. First, an overidentified equation can use more instruments than are necessary for consistent estimation. Secondly, the assumption is that
$$E\left(EE'\right)=\Sigma\otimes I_{n}=\left[\begin{array}{ccc}\sigma_{11}I_{n} & \cdots & \sigma_{1G}I_{n}\\ \vdots & \ddots & \vdots\\ \sigma_{G1}I_{n} & \cdots & \sigma_{GG}I_{n}\end{array}\right].$$
This means that the structural equations are heteroscedastic and correlated with one another.
- In general, ignoring this will lead to inefficient estimation, following the section on GLS. When equations are correlated with one another, estimation should account for the correlation in order to obtain efficiency.
- Also, since the equations are correlated, information about one equation is implicitly information about all equations. Therefore, overidentification restrictions in any equation improve efficiency for all equations, even the just identified equations.
- Single equation methods can't use these types of information, and are therefore inefficient (in general).
11.8.1. 3SLS. Note: It is easier and more practical to treat the 3SLS estimator as a generalized method of moments estimator (see Chapter 15). I no longer teach the following section, but it is retained for its possible historical interest. Another alternative is to use FIML (Subsection 11.8.2), if you are willing to make distributional assumptions on the errors. This is computationally feasible with modern computers.
Following our above notation, each structural equation can be written as
$$y_{i}=Y_{i}\gamma_{i}+X_{i}\beta_{i}+\varepsilon_{i}=Z_{i}\delta_{i}+\varepsilon_{i}.$$
Grouping the $G$ equations together we get
$$\left[\begin{array}{c}y_{1}\\ y_{2}\\ \vdots\\ y_{G}\end{array}\right]=\left[\begin{array}{cccc}Z_{1} & 0 & \cdots & 0\\ 0 & Z_{2} & & \vdots\\ \vdots & & \ddots & 0\\ 0 & \cdots & 0 & Z_{G}\end{array}\right]\left[\begin{array}{c}\delta_{1}\\ \delta_{2}\\ \vdots\\ \delta_{G}\end{array}\right]+\left[\begin{array}{c}\varepsilon_{1}\\ \varepsilon_{2}\\ \vdots\\ \varepsilon_{G}\end{array}\right],$$
or
$$y=Z\delta+\varepsilon,$$
where we already have that
$$E\left(\varepsilon\varepsilon'\right)=\Psi=\Sigma\otimes I_{n}.$$
The 3SLS estimator is just 2SLS combined with a GLS correction that takes advantage of the structure of $\Psi$. Define $\hat{Z}$ as
$$\hat{Z}=\left[\begin{array}{cccc}X\left(X'X\right)^{-1}X'Z_{1} & 0 & \cdots & 0\\ 0 & X\left(X'X\right)^{-1}X'Z_{2} & & \vdots\\ \vdots & & \ddots & 0\\ 0 & \cdots & 0 & X\left(X'X\right)^{-1}X'Z_{G}\end{array}\right]=\left[\begin{array}{cccc}\left[\hat{Y}_{1}\;X_{1}\right] & 0 & \cdots & 0\\ 0 & \left[\hat{Y}_{2}\;X_{2}\right] & & \vdots\\ \vdots & & \ddots & 0\\ 0 & \cdots & 0 & \left[\hat{Y}_{G}\;X_{G}\right]\end{array}\right].$$
These instruments are simply the unrestricted rf predictions of the endogs, combined with the exogs. Note that $\hat{\Pi}$ is calculated using OLS equation by equation, so if the model is overidentified the zero restrictions that $\Pi=B\Gamma^{-1}$ may satisfy are not imposed. More on this in a moment.

The 2SLS estimator for the entire system, following the above scheme, is
$$\hat{\delta}=\left(\hat{Z}'Z\right)^{-1}\hat{Z}'y,$$
and the 3SLS estimator substitutes the error covariance into the formula, which gives the 3SLS estimator
$$\hat{\delta}_{3SLS}=\left(\hat{Z}'\left(\hat{\Sigma}\otimes I_{n}\right)^{-1}Z\right)^{-1}\hat{Z}'\left(\hat{\Sigma}\otimes I_{n}\right)^{-1}y.$$
This requires an estimate of $\Sigma$. The obvious solution is to use an estimator based on the 2SLS residuals,
$$\hat{\varepsilon}_{i}=y_{i}-Z_{i}\hat{\delta}_{i,2SLS}$$
(important note: this is calculated using $Z_{i}$, not $\hat{Z}_{i}$). Then the $i,j$ element of $\Sigma$ is estimated by
$$\hat{\sigma}_{ij}=\frac{\hat{\varepsilon}_{i}'\hat{\varepsilon}_{j}}{n}.$$
Substitute $\hat{\Sigma}$ into the formula above to get the feasible 3SLS estimator. A formula for estimating the variance of the 3SLS estimator in finite samples is
$$\hat{V}\left(\hat{\delta}_{3SLS}\right)=\left(\hat{Z}'\left(\hat{\Sigma}\otimes I_{n}\right)^{-1}\hat{Z}\right)^{-1}.$$
- In the case that all equations are just identified, 3SLS is numerically equivalent to 2SLS. Proving this is easiest if we use a GMM interpretation of 2SLS and 3SLS (see the chapter on GMM).
- Note that the instruments use $\hat{\Pi}$ calculated as
$$\hat{\Pi}=\left(X'X\right)^{-1}X'Y,$$
which is simply
$$\hat{\Pi}=\left(X'X\right)^{-1}X'\left[\begin{array}{cccc}y_{1} & y_{2} & \cdots & y_{G}\end{array}\right],$$
that is, OLS equation by equation using all the exogs in the estimation of each column of $\Pi$.
It may seem odd that we use OLS on the reduced form, since the rf equations are correlated:
$$y_{i}=X\pi_{i}+v_{i},$$
where $y_{i}$ is the $i$-th endog, $\pi_{i}$ is the $i$-th column of $\Pi$, $X$ is the entire $n\times K$ matrix of exogs, and the $v_{i}$ are correlated across equations. Stacking the $G$ rf equations gives the system
$$\left[\begin{array}{c}y_{1}\\ \vdots\\ y_{G}\end{array}\right]=\left[\begin{array}{ccc}X & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & X\end{array}\right]\left[\begin{array}{c}\pi_{1}\\ \vdots\\ \pi_{G}\end{array}\right]+\left[\begin{array}{c}v_{1}\\ \vdots\\ v_{G}\end{array}\right],$$
or
$$y=\left(I_{G}\otimes X\right)\pi+v,$$
where we use $y$, $\pi$ and $v$ to indicate the pooled model. Following this notation, the error covariance matrix is
$$V(v)=\Xi\otimes I_{n}.$$
- Note that each equation of the system individually satisfies the classical assumptions.
- Since the equations are contemporaneously correlated, one would expect pooled estimation using the GLS correction to be more efficient, since equation-by-equation estimation is equivalent to pooled estimation ignoring the covariance information.
- However, because the regressors are identical in every equation, GLS on the pooled model is in fact identical to OLS equation by equation, so nothing is lost.

To show this, use the Kronecker product rules
(1) $\left(A\otimes B\right)'=A'\otimes B'$
(2) $\left(A\otimes B\right)^{-1}=A^{-1}\otimes B^{-1}$
(3) $\left(A\otimes B\right)\left(C\otimes D\right)=AC\otimes BD$
(when the products and inverses exist). Writing $\mathcal{X}=I_{G}\otimes X$, we get
$$\hat{\pi}_{GLS}=\left(\mathcal{X}'\left(\Xi\otimes I_{n}\right)^{-1}\mathcal{X}\right)^{-1}\mathcal{X}'\left(\Xi\otimes I_{n}\right)^{-1}y=\left(\Xi^{-1}\otimes X'X\right)^{-1}\left(\Xi^{-1}\otimes X'\right)y=\left(I_{G}\otimes\left(X'X\right)^{-1}X'\right)y,$$
which is exactly OLS applied equation by equation.
11.8.2. FIML. Full information maximum likelihood is an alternative estimation method. FIML will be asymptotically efficient, since ML estimators based on a given information set are asymptotically efficient w.r.t. all other estimators that use the same information set, and in the case of the full-information ML estimator we use the entire information set. The 2SLS and 3SLS estimators don't require distributional assumptions, while FIML of course does. It is practical to obtain the FIML estimator only if the errors are assumed to be multivariate normal, independent over time:
$$E_{t}\sim N\left(0,\Sigma\right).$$
The joint density of the structural errors for observation $t$ is
$$f\left(E_{t}\right)=\left(2\pi\right)^{-G/2}\left|\Sigma\right|^{-1/2}\exp\left(-\tfrac{1}{2}E_{t}\Sigma^{-1}E_{t}'\right).$$
The observed data are the $Y_{t}$, however, and $E_{t}=Y_{t}\Gamma-X_{t}B$, so the transformation from $E_{t}$ to $Y_{t}$ requires the Jacobian $\left|\det\Gamma\right|$, giving the density of $Y_{t}$ conditional on $X_{t}$ as
$$f\left(Y_{t}\right)=\left(2\pi\right)^{-G/2}\left|\det\Gamma\right|\left|\Sigma\right|^{-1/2}\exp\left(-\tfrac{1}{2}\left(Y_{t}\Gamma-X_{t}B\right)\Sigma^{-1}\left(Y_{t}\Gamma-X_{t}B\right)'\right).$$
Given the assumption of independence over time, the joint log-likelihood function is
$$\ln L\left(B,\Gamma,\Sigma\right)=-\frac{nG}{2}\ln\left(2\pi\right)+n\ln\left|\det\Gamma\right|-\frac{n}{2}\ln\left|\Sigma\right|-\frac{1}{2}\sum_{t=1}^{n}\left(Y_{t}\Gamma-X_{t}B\right)\Sigma^{-1}\left(Y_{t}\Gamma-X_{t}B\right)'.$$
This is a nonlinear in the parameters objective function. Maximization of this can be done using iterative numeric methods. We'll see how to do this in the next chapter.
- It turns out that the asymptotic distribution of 3SLS and FIML are the same, assuming normality of the errors.
- One can calculate the FIML estimator using the following iterative scheme:
(1) Calculate $\hat{\Gamma}$ and $\hat{B}$ by 3SLS, as normal.
(2) Calculate $\hat{\Pi}=\hat{B}\hat{\Gamma}^{-1}$. This is new, we didn't estimate $\Pi$ in this way before. This estimator may have some zeros in it. When Greene says iterated 3SLS doesn't lead to FIML, he means this for a procedure that doesn't update $\hat{\Pi}$, but only updates $\hat{\Sigma}$, $\hat{B}$ and $\hat{\Gamma}$. If you update $\hat{\Pi}$ you do converge to FIML.
(3) Calculate the instruments $\hat{Y}=X\hat{\Pi}$, and calculate $\hat{\Sigma}$ using $\hat{\Gamma}$ and $\hat{B}$ to get the estimated errors.
(4) Apply 3SLS using these new instruments and the estimate of $\Sigma$.
(5) Repeat steps 2-4 until there is no change in the parameters.
- FIML is fully efficient, since it's an ML estimator that uses all information. This implies that 3SLS is fully efficient when the errors are normally distributed. Also, if each equation is just identified and the errors are normal, then 2SLS will be fully efficient, since in this case 2SLS $\equiv$ 3SLS.
- When the errors aren't normally distributed, the likelihood function is of course different from that written above, and the above efficiency claims need not hold.
The following results were obtained by estimating the equations of Klein's Model 1 by 2SLS.

*******************************************************
CONSUMPTION EQUATION
*******************************************************
2SLS estimation results
Observations 21
R-squared 0.976711
Sigma-squared 1.044059

                  estimate   st.err.   t-stat.   p-value
Constant            16.555     1.321    12.534     0.000
Profits              0.017     0.118     0.147     0.885
Lagged Profits       0.216     0.107     2.016     0.060
Wages                0.810     0.040    20.129     0.000

*******************************************************
INVESTMENT EQUATION
*******************************************************
2SLS estimation results
Observations 21
R-squared 0.884884
Sigma-squared 1.383184

                  estimate   st.err.   t-stat.   p-value
Constant            20.278     7.543     2.688     0.016
Profits              0.150     0.173     0.867     0.398
Lagged Profits       0.616     0.163     3.784     0.001
Lagged Capital      -0.158     0.036    -4.368     0.000
*******************************************************
WAGES EQUATION
*******************************************************
2SLS estimation results
Observations 21
R-squared 0.987414
Sigma-squared 0.476427

                  estimate   st.err.   t-stat.   p-value
Constant             1.500     1.148     1.307     0.209
Output               0.439     0.036    12.316     0.000
Lagged Output        0.147     0.039     3.777     0.002
Trend                0.130     0.029     4.475     0.000

*******************************************************
The above results are not valid (specifically, they are inconsistent) if the errors are autocorrelated, since lagged endogenous variables will not be valid instruments in that case. You might consider eliminating the lagged endogenous variables as instruments, and re-estimating by 2SLS, to obtain consistent parameter estimates in this more complex case. Standard errors will still be estimated inconsistently, unless a Newey-West type covariance estimator is used. Food for thought...
CHAPTER 12

Introduction to the second half

We'll begin with study of extremum estimators in general. Let $Z_{n}$ denote the available data, based on a sample of size $n$.

DEFINITION (Extremum estimator). An extremum estimator $\hat{\theta}$ is the optimizing element of an objective function $s_{n}\left(Z_{n},\theta\right)$ over a set $\Theta$.

Example (OLS). Stacking observations vertically, the model is $y_{n}=X_{n}\theta_{0}+\varepsilon_{n}$, where $X_{n}=\left(x_{1}',x_{2}',\ldots,x_{n}'\right)'$. The least squares estimator is defined as the optimizing element of the average of squared residuals:
$$\hat{\theta}\equiv\arg\min_{\Theta}s_{n}(\theta)=\frac{1}{n}\left(y_{n}-X_{n}\theta\right)'\left(y_{n}-X_{n}\theta\right).$$
We readily find that $\hat{\theta}=\left(X'X\right)^{-1}X'y$.

Example (Maximum likelihood). Suppose the density of an observation $z_{t}$ is $f\left(z_{t},\theta_{0}\right)$, and the observations are i.i.d. The maximum likelihood estimator is defined as
$$\hat{\theta}\equiv\arg\max_{\Theta}L_{n}(\theta)=\prod_{t=1}^{n}f\left(z_{t},\theta\right).$$
Because the logarithm is a strictly increasing function, maximization of the average logarithm of the likelihood function is achieved at the same $\hat{\theta}$, so the MLE is also an extremum estimator, with objective function
$$s_{n}(\theta)=\frac{1}{n}\ln L_{n}(\theta)=\frac{1}{n}\sum_{t=1}^{n}\ln f\left(z_{t},\theta\right).$$
- ML estimators are asymptotically efficient, supposing the strong distributional assumptions upon which they are based are true.
- One can investigate the properties of an "ML" estimator supposing that the distributional assumptions are incorrect. This gives a quasi-ML estimator, which we'll study later.
- The strong distributional assumptions of MLE may be questionable in many cases. It is possible to estimate using weaker assumptions based only on some of the moments of a random variable's density.
Example (Method of moments). Suppose we draw a random sample of $z_{t}$ from the $\chi^{2}\left(\theta_{0}\right)$ distribution. Here, $\theta_{0}$ is the parameter of interest. The first moment (expectation), $\mu_{1}$, of a random variable will in general be a function of the parameters of the distribution:
$$\mu_{1}=\mu_{1}\left(\theta_{0}\right).$$
- In this example, the relationship is the identity function $\mu_{1}\left(\theta_{0}\right)=\theta_{0}$, though in general the relationship may be more complicated. The sample first moment is
$$\hat{\mu}_{1}=\frac{1}{n}\sum_{t=1}^{n}z_{t}.$$
- Define
$$m_{1}(\theta)=\mu_{1}(\theta)-\hat{\mu}_{1}.$$
- The method of moments principle is to choose the estimator of the parameter to set the estimate of the population moment equal to the sample moment, i.e., $m_{1}(\hat{\theta})\equiv 0$. Then the moment-parameter equation is inverted to solve for the parameter estimate. In this case,
$$m_{1}(\hat{\theta})=\hat{\theta}-\frac{1}{n}\sum_{t=1}^{n}z_{t}=0.$$
Since $\frac{1}{n}\sum_{t=1}^{n}z_{t}\overset{p}{\to}\theta_{0}$ by the LLN, the estimator is consistent.

Example (Method of moments, continued). The variance of a $\chi^{2}\left(\theta_{0}\right)$ r.v. is
$$V\left(z_{t}\right)=E\left(z_{t}-\theta_{0}\right)^{2}=2\theta_{0}.$$
- Define
$$m_{2}(\theta)=2\theta-\frac{1}{n}\sum_{t=1}^{n}\left(z_{t}-\bar{z}\right)^{2}.$$
- The MM estimator using the variance would set $m_{2}(\hat{\theta})\equiv 0$. Again, by the LLN, the sample variance is consistent for the true variance, that is,
$$\frac{1}{n}\sum_{t=1}^{n}\left(z_{t}-\bar{z}\right)^{2}\overset{p}{\to}2\theta_{0}.$$
So
$$\hat{\theta}=\frac{1}{2n}\sum_{t=1}^{n}\left(z_{t}-\bar{z}\right)^{2},$$
which is obtained by inverting the moment-parameter equation, is consistent.

Example (GMM). The previous two examples give two estimators of $\theta_{0}$, both of which are consistent. With a given sample, the estimators will be different in general. With two moment-parameter equations and only one parameter, we have overidentification: more information than is strictly necessary for consistent estimation. The GMM combines information from the two moment-parameter equations to form a new estimator which, it turns out, can be more efficient than either of the separate estimators. Define the moment vector $m(\theta)=\left(m_{1}(\theta),\,m_{2}(\theta)\right)'$ and choose
$$\hat{\theta}=\arg\min_{\Theta}s_{n}(\theta)=m(\theta)'W_{n}m(\theta),$$
where $W_{n}$ is a positive definite weighting matrix. While it's clear that the MM gives consistent estimates if there is a one-to-one relationship between parameters and moments, it's not immediately obvious that the GMM estimator is consistent. (We'll see later that it is.)
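The two moment-parameter equations above are easy to try out. The sketch below (with a hypothetical true degrees-of-freedom parameter) computes both method-of-moments estimators of $\theta_{0}$ from a simulated $\chi^{2}$ sample:

```python
import numpy as np

# Simulate chi^2(theta0) data and apply both MM estimators from the text.
rng = np.random.default_rng(4)
theta0 = 5.0                                    # hypothetical true parameter
z = rng.chisquare(theta0, size=200_000)

theta_mean = z.mean()                           # from E(z) = theta
theta_var = ((z - z.mean())**2).mean() / 2.0    # from V(z) = 2*theta
```

Both estimates should be close to $\theta_{0}=5$, but they differ in any finite sample, which is exactly the overidentification that GMM exploits.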
These examples show that these widely used estimators may all be interpreted as the solution of an optimization problem. For this reason, the study of extremum estimators is useful for its generality. We will see that the general results extend smoothly to the more specialized results available for specific estimators. After studying extremum estimators in general, we will study the GMM estimator, then QML and NLS. The reason we study GMM first is that LS, IV, NLS, MLE, QML and other well-known parametric estimators may all be interpreted as special cases of the GMM estimator, so the general results on GMM can simplify and unify the treatment of these other estimators. Nevertheless, there are some special results on QML and NLS, and both are important in empirical research, which makes focus on them useful.
One of the focal points of the course will be nonlinear models. This is not to suggest that linear models aren't useful. Linear models are more general than they might first appear, since one can employ nonlinear transformations of the variables:
$$\varphi_{0}\left(y_{t}\right)=\left[\begin{array}{cccc}\varphi_{1}\left(x_{t}\right) & \varphi_{2}\left(x_{t}\right) & \cdots & \varphi_{p}\left(x_{t}\right)\end{array}\right]\theta_{0}+\varepsilon_{t}.$$
For example, a model such as $\ln y_{t}=\alpha+\beta\ln x_{t}+\varepsilon_{t}$ fits this form. The important point is that the model is linear in the parameters but not necessarily linear in the variables.
In spite of this generality, situations often arise which simply can not be convincingly represented by linear in the parameters models. Also, theory that
applies to nonlinear models also applies to linear models, so one may as well
start off with the general case.
Example: Expenditure shares. Roy's Identity states that the quantity demanded of the $i$-th of $G$ goods is
$$x_{i}=-\frac{\partial v(p,y)/\partial p_{i}}{\partial v(p,y)/\partial y}.$$
An expenditure share is
$$s_{i}\equiv\frac{p_{i}x_{i}}{y},$$
so necessarily $s_{i}\in[0,1]$ and $\sum_{i=1}^{G}s_{i}=1$. No linear in the parameters model for $x_{i}$ or $s_{i}$, with a parameter space that is defined independently of the data, can
guarantee that either of these conditions holds. These constraints will often be violated by estimated linear models, which calls into question their appropriateness in cases of this sort.
Example: Binary limited dependent variable. The referendum contingent valuation method of inferring the social value of a project provides a simple example. Individuals are asked if they would pay an amount $A$ for provision of a project. Indirect utility in the base case (no project) is $v^{0}(m,z)+\varepsilon^{0}$, where $m$ is income and $z$ is a vector of other variables such as prices, personal characteristics, etc. After provision, utility is $v^{1}(m-A,z)+\varepsilon^{1}$. The random terms $\varepsilon^{i}$, $i=0,1$, reflect variations of preferences in the population. With this, an individual agrees¹ to pay $A$ if
$$\varepsilon^{0}-\varepsilon^{1}<v^{1}(m-A,z)-v^{0}(m,z).$$
Define $\varepsilon=\varepsilon^{0}-\varepsilon^{1}$, let $w$ collect $m$ and $z$, and define $\Delta v(w,A)=v^{1}(m-A,z)-v^{0}(m,z)$. Define $y=1$ if the consumer agrees to pay $A$ for the change, $y=0$ otherwise. The probability of agreement is
$$\Pr(y=1)=F_{\varepsilon}\left[\Delta v(w,A)\right].\qquad(12.0.1)$$
To simplify notation, define $p(w,A)\equiv F_{\varepsilon}\left[\Delta v(w,A)\right]$.

¹We assume here that responses are truthful, that is there is no strategic behavior and that individuals are able to order their preferences in this hypothetical situation.
255
w VP &
d
w
8 I $Y 7g} $
# ~ P ) #
#
$
where
This is the simple logit model: the choice probability is the logit function of a
linear in parameters function.
is
. Thus, we
P
w VP &
can write
is either
w VP &
Now,
f
U
E w uP &
can be written as a linear
w uP &
1
h
) E &
in the parameters model, in the sense that, for arbitrary , there are no
R m
7Bdc w w VP &
such that
R m
we can always nd a
7d
w m
4)
where
such that
w m d
will be
binary random variable. Since this sort of problem occurs often in empirical
work, it is useful to study NLS and other nonlinear models.
After discussing these estimation methods for parametric models we'll briefly introduce nonparametric estimation methods. These methods allow one, for example, to estimate the regression function $f\left(x_{t}\right)$ in the model
$$y_{t}=f\left(x_{t}\right)+\varepsilon_{t}$$
consistently when we are not willing to assume that $f(\cdot)$ belongs to a known parametric family, and perhaps to estimate related objects such as conditional moments of $y_{t}$ given $x_{t}$.
CHAPTER 13

Numeric optimization methods

Readings: Gourieroux and Monfort, Vol. 1, ch. 13, pp. 443-60; Goffe, et al. (1994).

The general problem we confront is that of finding the maximizer $\hat{\theta}$ of a function $s(\theta)$, where $\theta$ is a $K$-vector of parameters. In some cases the first order conditions
$$D_{\theta}s(\hat{\theta})=0$$
are linear in $\theta$, so the maximizer has an explicit, analytic solution. This is the sort of problem we have with linear models estimated by OLS. It's also the case for feasible GLS, since conditional on the estimate of the varcov matrix, we have a quadratic objective function in the remaining parameters.
More general problems will not have linear f.o.c., and we will not be able
to solve for the maximizer analytically. This is when we need a numeric optimization method.
13.1. Search
The idea is to create a grid over the parameter space and evaluate the function at each point on the grid. Select the best point. Then refine the grid in the neighborhood of the best point, and continue until the accuracy is good enough. See Figure 13.1.1. One has to be careful that the grid is fine enough in relationship to the irregularity of the function to ensure that sharp peaks are not missed entirely.
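The refine-around-the-best-point idea can be sketched in a few lines. The objective function here is hypothetical, chosen to have small bumps that a too-coarse grid could misread:

```python
import numpy as np

# Grid search with refinement for a 1-D objective (hypothetical bumpy function).
def s(theta):
    return -(theta - 1.3)**2 + 0.1 * np.cos(20 * theta)

lo, hi = -5.0, 5.0
for _ in range(20):                      # refine the grid 20 times
    grid = np.linspace(lo, hi, 101)
    best = grid[np.argmax(s(grid))]      # best point on the current grid
    step = grid[1] - grid[0]
    lo, hi = best - step, best + step    # new grid brackets the current best
```

After a few refinements `best` settles near the global maximizer. With many parameters this approach becomes infeasible, as the next section notes.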
To check $q$ values in each dimension of a $K$-dimensional parameter space, we need to check $q^{K}$ points. For example, if $q=100$ and $K=10$, there would be $100^{10}$ points to check. The search method is a reasonable choice only if $K$ is small; it quickly becomes infeasible if $K$ is moderate or large.

13.2. Derivative-based methods

Derivative-based methods are defined by
(1) the method for choosing the initial value, $\theta^{1}$;
(2) the iteration method for choosing $\theta^{k+1}$ given $\theta^{k}$ (based on derivatives);
(3) the stopping criterion.
The iteration method can be broken into two problems: choosing the stepsize $a_{k}$ (a scalar) and choosing the direction of movement, $d^{k}$, which is of the same dimension of $\theta$, so that
$$\theta^{(k+1)}=\theta^{(k)}+a_{k}d^{k}.$$
A locally increasing direction of search $d$ is a direction such that
$$\frac{\partial s(\theta+ad)}{\partial a}>0$$
for $a$ positive but small. That is, if we go in direction $d$, we will improve on the objective function, at least if we don't go too far in that direction.

As long as the gradient at $\theta$, $g(\theta)\equiv D_{\theta}s(\theta)$, is not zero, there exist increasing directions of search. They can be located by taking $d=Qg(\theta)$, where $Q$ is a symmetric pd matrix. To see this, take a first order Taylor expansion of $s(\theta+ad)$ around $a=0$:
$$s(\theta+ad)=s(\theta)+a\,g(\theta)'d+o(1).$$
For small enough $a$ the $o(1)$ term can be ignored. If $d$ is to be an increasing direction, we need $g(\theta)'d>0$. Defining $d=Qg(\theta)$, with $Q$ positive definite, we guarantee that
$$g(\theta)'d=g(\theta)'Qg(\theta)>0$$
unless $g(\theta)=0$. Every increasing direction can be represented in this way (p.d. matrices are those such that the angle between $g(\theta)$ and $Qg(\theta)$ is less than 90 degrees). With this, the iteration rule becomes
$$\theta^{(k+1)}=\theta^{(k)}+a_{k}Q_{k}g(\theta^{(k)}),$$
and we keep going until the gradient becomes zero, so that there is no increasing direction. The remaining problems are how to choose $a$ and $Q$. Conditional on $Q$, choosing $a$ is fairly straightforward, since it is a one-dimensional line search.

Steepest descent. Steepest descent (ascent if we're maximizing) just sets $Q$ to an identity matrix, since the gradient provides the direction of maximum rate of change of the objective function.
- Advantages: fast; doesn't require anything more than first derivatives.
- Disadvantages: This doesn't always work too well however (draw picture of banana function).
Newton-Raphson. The Newton-Raphson method uses information about the slope and curvature of the objective function to determine which direction and how far to move from an initial point. Supposing we're trying to maximize $s_{n}(\theta)$, take a second order Taylor's series approximation of $s_{n}(\theta)$ about $\theta^{k}$ (an initial guess):
$$s_{n}(\theta)\approx s_{n}(\theta^{k})+g(\theta^{k})'\left(\theta-\theta^{k}\right)+\tfrac{1}{2}\left(\theta-\theta^{k}\right)'H(\theta^{k})\left(\theta-\theta^{k}\right).$$
To attempt to maximize $s_{n}(\theta)$, we can maximize the portion of the right-hand side that depends on $\theta$, i.e., we can maximize
$$\tilde{s}(\theta)=g(\theta^{k})'\theta+\tfrac{1}{2}\left(\theta-\theta^{k}\right)'H(\theta^{k})\left(\theta-\theta^{k}\right)$$
with respect to $\theta$. This is a much easier problem, since it is a quadratic function in $\theta$, so it has linear first order conditions:
$$D_{\theta}\tilde{s}(\theta)=g(\theta^{k})+H(\theta^{k})\left(\theta-\theta^{k}\right)=0.$$
So the solution for the next round estimate is
$$\theta^{k+1}=\theta^{k}-H(\theta^{k})^{-1}g(\theta^{k}).$$
However, it's good to include a stepsize, since the approximation to $s_{n}(\theta)$ may be bad far away from the maximizer, so the actual iteration formula is
$$\theta^{k+1}=\theta^{k}-a_{k}H(\theta^{k})^{-1}g(\theta^{k}).$$
A potential problem is that the Hessian may not be negative definite when we're far from the maximum, so that $-H(\theta^{k})^{-1}$ may not be positive definite, and $-H(\theta^{k})^{-1}g(\theta^{k})$ may not define an increasing direction of search. This can happen when the objective function has flat regions, in which case the Hessian matrix is very ill-conditioned (e.g., nearly singular), or when we're in the vicinity of a local minimum, where $H(\theta^{k})$ is positive definite. Matrix inverses by computers are subject to large errors when the matrix is ill-conditioned, and we certainly don't want to move toward a minimum when we're maximizing. Quasi-Newton methods deal with this by adding a positive definite component to $-H(\theta)$, e.g., $Q=-H(\theta)+bI$, with $b$ chosen large enough so that $Q$ is well-conditioned and positive definite.
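The iteration $\theta^{k+1}=\theta^{k}-H(\theta^{k})^{-1}g(\theta^{k})$ can be sketched directly, here with numeric derivatives (discussed below) and a toy concave objective, so everything is hypothetical except the update formula itself:

```python
import numpy as np

def s(theta):
    # Toy concave objective with maximizer (1, -2)
    return -np.sum((theta - np.array([1.0, -2.0]))**2)

def num_grad(f, th, h=1e-5):
    g = np.zeros_like(th)
    for j in range(len(th)):
        e = np.zeros_like(th); e[j] = h
        g[j] = (f(th + e) - f(th - e)) / (2 * h)     # two-sided difference
    return g

def num_hess(f, th, h=1e-4):
    k = len(th)
    H = np.zeros((k, k))
    for j in range(k):
        e = np.zeros_like(th); e[j] = h
        H[:, j] = (num_grad(f, th + e, h) - num_grad(f, th - e, h)) / (2 * h)
    return H

theta = np.zeros(2)
for _ in range(50):
    g = num_grad(s, theta)
    if np.max(np.abs(g)) < 1e-6:                     # gradient stopping criterion
        break
    H = num_hess(s, theta)
    theta = theta - np.linalg.solve(H, g)            # Newton-Raphson update
```

For this quadratic objective a single Newton step lands essentially on the maximizer; for real objective functions a stepsize and the safeguards described above are needed.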
Stopping criteria
The last thing we need is to decide when to stop. A digital computer is subject to limited machine precision and round-off errors. For these reasons, it is unreasonable to hope that a program can exactly find the point that maximizes a function. We need to define acceptable tolerances. Some stopping criteria are:
- Negligible change in parameters:
$$\left|\theta_{j}^{k}-\theta_{j}^{k-1}\right|<\varepsilon_{1},\;\forall j$$
- Negligible relative change:
$$\left|\frac{\theta_{j}^{k}-\theta_{j}^{k-1}}{\theta_{j}^{k-1}}\right|<\varepsilon_{2},\;\forall j$$
- Negligible change of function:
$$\left|s(\theta^{k})-s(\theta^{k-1})\right|<\varepsilon_{3}$$
- Gradient negligibly different from zero:
$$\left|g_{j}(\theta^{k})\right|<\varepsilon_{4},\;\forall j$$
- Or, even better, all of the above.
- Also, if we're maximizing, it's good to check that the last round (real, not approximate) Hessian is negative definite.
Starting values
The Newton-Raphson and related algorithms work well if the objective
function is concave (when maximizing), but not so well if there are convex
regions and local minima or multiple local maxima. The algorithm may converge to a local minimum or to a local maximum that is not optimal. The
algorithm may also have difficulties converging at all.
The usual way to ensure that a global maximum has been found
is to use many different starting values, and choose the solution that
returns the highest objective function value. THIS IS IMPORTANT
in practice. More on this later.
Calculating derivatives
The Newton-Raphson algorithm requires first and second derivatives. It is often difficult to calculate derivatives (especially the Hessian) analytically if the function $s_{n}(\theta)$ is complicated. One possibility is to calculate them numerically; another is to use a computer algebra program such as MuPAD to derive them analytically.

MuPAD is not a freely distributable program, so it's not on the CD. You can download it from http://www.mupad.de/download.shtml
Numeric derivatives are an alternative, though they are less accurate than analytic derivatives. A one-sided approximation to the $j$-th element of the gradient is
$$g_{j}(\theta)\approx\frac{s\left(\theta+he_{j}\right)-s(\theta)}{h},$$
and a two-sided approximation is
$$g_{j}(\theta)\approx\frac{s\left(\theta+he_{j}\right)-s\left(\theta-he_{j}\right)}{2h},$$
where $e_{j}$ is the $j$-th column of the identity matrix and $h$ is a small positive scalar. The two-sided version is more accurate, at the cost of twice as many function evaluations.
There are algorithms (such as BFGS and DFP) that use the sequential gradient evaluations to build up an approximation to the Hessian. The iterations are faster for this reason since the actual Hessian isn't calculated, but more iterations usually are required for convergence. Switching between algorithms during iterations is sometimes useful.
13.4. Examples
This section gives a few examples of how some nonlinear models may be
estimated using maximum likelihood.
13.4.1. Discrete Choice: The logit model. In this section we will consider
maximum likelihood estimation of the logit model for binary 0/1 dependent
variables. We will use the BFGS algorithm to find the MLE.
We saw an example of a binary choice model in equation 12.0.1. A more general representation is
$$y^{*}=g(x)-\varepsilon$$
$$y=1\left(y^{*}>0\right)$$
$$\Pr(y=1)=F_{\varepsilon}\left[g(x)\right]\equiv p(x,\theta).$$
The log-likelihood of one observation is $y\ln p(x,\theta)+\left(1-y\right)\ln\left(1-p(x,\theta)\right)$, so the average log-likelihood function is
$$s_{n}(\theta)=\frac{1}{n}\sum_{i=1}^{n}\left[y_{i}\ln p\left(x_{i},\theta\right)+\left(1-y_{i}\right)\ln\left(1-p\left(x_{i},\theta\right)\right)\right].$$
For the logit model (see the contingent valuation example above), the probability has the specific form
$$p(x,\theta)=\frac{1}{1+e^{-x'\theta}}.$$
Estimation by BFGS on a generated data set gives output such as the following:
***********************************************
Trial of MLE estimation of Logit model
MLE Estimation Results
BFGS convergence: Normal convergence
Average Log-L: 0.607063
Observations: 100
constant
slope
estimate
0.5400
0.7566
st. err
0.2229
0.2374
t-stat
2.4224
3.1863
p-value
0.0154
0.0014
Information Criteria
CAIC : 132.6230
BIC : 130.6230
AIC : 125.4127
***********************************************
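The notes use Octave's BFGS routine; the same logit log-likelihood can be maximized with a few Newton iterations in plain NumPy, shown below on simulated data (the true parameter values are hypothetical, chosen to resemble the run above):

```python
import numpy as np

# Simulate logit data and maximize the average log-likelihood by Newton's method.
rng = np.random.default_rng(5)
n = 2000
x = np.column_stack([np.ones(n), rng.normal(size=n)])
theta_true = np.array([0.5, 1.0])               # hypothetical true values
p = 1.0 / (1.0 + np.exp(-x @ theta_true))
y = (rng.uniform(size=n) < p).astype(float)

theta = np.zeros(2)
for _ in range(25):
    mu = 1.0 / (1.0 + np.exp(-x @ theta))       # fitted probabilities
    g = x.T @ (y - mu)                          # score of the log-likelihood
    H = -(x * (mu * (1 - mu))[:, None]).T @ x   # Hessian (negative definite)
    theta = theta - np.linalg.solve(H, g)       # Newton ascent step
```

Because the logit log-likelihood is globally concave, Newton's method converges reliably here; BFGS builds up the curvature information instead of computing `H` directly.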
13.4.2. Count data: the Poisson model. Demand for health care services is a count, taking the values $0,1,2,\ldots$. One can model count data by specifying the density as a count data density. One of the simplest count data densities is the Poisson density, which is
$$f_{Y}(y)=\frac{e^{-\lambda}\lambda^{y}}{y!}.$$
The average log-likelihood function for a sample of $n$ observations is
$$s_{n}(\theta)=\frac{1}{n}\sum_{i=1}^{n}\left(-\lambda_{i}+y_{i}\ln\lambda_{i}-\ln y_{i}!\right).$$
We will parameterize the model as
$$\lambda_{i}=\exp\left(x_{i}'\beta\right),$$
where $x_{i}$ contains a constant and the conditioning variables. This ensures that the mean is positive, as is required for the Poisson model. Note that for this parameterization
$$\beta_{j}=\frac{\partial\ln\lambda}{\partial x_{j}},$$
so the coefficients are semi-elasticities of the conditional mean with respect to the conditioning variables. Estimating such a model for the OBDV measure of health care use gives the following results:

******************************************************
Poisson model, MEPS 1996 full data set
             estimate   st. err    t-stat   p-value
constant       -0.791     0.149    -5.290     0.000
pub. ins.       0.848     0.076    11.093     0.000
priv. ins.      0.294     0.071     4.137     0.000
sex             0.487     0.055     8.797     0.000
age             0.024     0.002    11.471     0.000
edu             0.029     0.010     3.061     0.002
inc            -0.000     0.000    -0.978     0.328

Information Criteria
CAIC : 33575.6881    Avg. CAIC: 7.3566
BIC  : 33568.6881    Avg. BIC:  7.3551
AIC  : 33523.7064    Avg. AIC:  7.3452
******************************************************
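A Poisson regression with $\lambda_{i}=\exp(x_{i}'\beta)$ can be fit with the same Newton machinery used for the logit sketch. Everything in the simulation below is hypothetical; only the score and Hessian formulas come from the model above:

```python
import numpy as np

# Simulate Poisson counts with a log-link mean and estimate by ML (Newton).
rng = np.random.default_rng(6)
n = 5000
x = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.2, 0.3])        # hypothetical coefficients
y = rng.poisson(np.exp(x @ beta_true))

beta = np.zeros(2)
for _ in range(50):
    lam = np.exp(x @ beta)              # conditional mean
    g = x.T @ (y - lam)                 # score of the log-likelihood
    if np.max(np.abs(g)) < 1e-8:        # gradient stopping criterion
        break
    H = -(x * lam[:, None]).T @ x       # Hessian (negative definite)
    beta = beta - np.linalg.solve(H, g)
```

The `ln y_i!` term drops out of the score, so it does not affect the estimates, only the log-likelihood level.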
13.4.3. Duration data and the Weibull model. A spell is the period of time between the occurrence of an initial event and a concluding event. For example, the initial event could be the loss of a job, and the final event is the finding of a new job. The spell is the period of unemployment.
Let $t_{0}$ be the time the initial event occurs, and $t_{1}$ be the time the concluding event occurs. For simplicity, assume that time is measured in years. The random variable $D$ is the duration of the spell, $D=t_{1}-t_{0}$. Define the density function of $D$, $f_{D}(t)$, with distribution function $F_{D}(t)=\Pr(D\leq t)$.

Several questions may be of interest. For example, one might wish to know the expected time one has to wait to find a job given that one has already waited $s$ years. The probability that a spell lasts more than $s$ years is
$$\Pr(D>s)=1-\Pr(D\leq s)=1-F_{D}(s).$$
The density of $D$, conditional on the spell already having lasted $s$ years, is
$$f_{D}\left(t\mid D>s\right)=\frac{f_{D}(t)}{1-F_{D}(s)}.$$
The expected additional time required for the spell to end, given that it has already lasted $s$ years, is the expectation of $D$ with respect to this density, minus $s$:
$$E=\mathcal{E}\left(D\mid D>s\right)-s=\left(\int_{s}^{\infty}t\,\frac{f_{D}(t)}{1-F_{D}(s)}\,dt\right)-s.$$
To estimate this function, one needs to specify $f_{D}(t)$ as a parametric density, then estimate the parameters by maximum likelihood. A flexible choice that generalizes the exponential density is the Weibull density,
$$f_{D}\left(t\mid\theta\right)=e^{-\left(\lambda t\right)^{\gamma}}\lambda\gamma\left(\lambda t\right)^{\gamma-1},$$
with parameters $\lambda>0$ and $\gamma>0$. The Weibull model was estimated using data on the lifespans of a sample of mongooses, and Figure 13.5.1 presents fitted
life expectancy (expected additional years of life) as a function of age, with 95% confidence bands. The plot is accompanied by a nonparametric Kaplan-Meier estimate of life-expectancy. This nonparametric estimator simply averages all spell lengths greater than age, and then subtracts age. This is consistent by the LLN.
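The simple nonparametric estimator just described — average all observed spell lengths greater than the given age, then subtract the age — takes only a few lines. The lifespans below are simulated (a hypothetical Weibull population stands in for the real data):

```python
import numpy as np

# Nonparametric expected additional lifespan at a given age, as described above.
rng = np.random.default_rng(7)
lifespans = rng.weibull(1.5, size=100_000) * 3.0   # hypothetical Weibull lifetimes

def expected_additional_life(age, spells):
    survivors = spells[spells > age]               # spells longer than the given age
    return survivors.mean() - age

e0 = expected_additional_life(0.0, lifespans)      # expected lifespan at birth
e2 = expected_additional_life(2.0, lifespans)      # expected additional years at age 2
```

Because a Weibull with shape greater than one has an increasing hazard, `e2` comes out below `e0`, the pattern the parametric fit should reproduce.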
In the figure one can see that the model doesn't fit the data well, in that it predicts life expectancy quite differently than does the nonparametric model. For ages 4-6, the nonparametric estimate is outside the confidence interval that results from the parametric model, which casts doubt upon the parametric model. Mongooses that are between 2-6 years old seem to have a lower life expectancy than is predicted by the Weibull model, whereas young mongooses that survive beyond infancy have a higher life expectancy, up to a bit beyond 2 years. Due to the dramatic change in the death rate as a function of $t$, one might specify $f_{D}(t)$ as a mixture of two Weibull densities,
$$f_{D}\left(t\mid\theta\right)=\delta\left(e^{-\left(\lambda_{1}t\right)^{\gamma_{1}}}\lambda_{1}\gamma_{1}\left(\lambda_{1}t\right)^{\gamma_{1}-1}\right)+\left(1-\delta\right)\left(e^{-\left(\lambda_{2}t\right)^{\gamma_{2}}}\lambda_{2}\gamma_{2}\left(\lambda_{2}t\right)^{\gamma_{2}-1}\right).$$
The parameters $\gamma_{i}$ and $\lambda_{i}$, $i=1,2$, are the parameters of the two Weibull densities, and $\delta$ is the parameter that mixes the two.

With the same data, $\theta$ can be estimated using the mixed model. The results are a log-likelihood = -623.17. Note that a standard likelihood ratio test cannot be used to choose between the two models, since under the null that $\delta=1$ (single density), the two parameters $\lambda_{2}$ and $\gamma_{2}$ are not identified. It is possible to take this into account, but this topic is out of the scope of this course. Nevertheless, the improvement in the likelihood function is considerable. The parameter estimates are
276
0.016
1.722
0.166
1.731
0.101
1.522
0.096
0.428
0.035
I
i
Note that the mixture parameter is highly significant. This model leads to the fit in Figure 13.5.2. Note that the parametric and nonparametric fits are quite close, diverging only at the oldest ages. The divergence at that point is not too important, since less than 5% of mongooses live more than 6 years, which implies that the Kaplan-Meier nonparametric estimate has a high variance there (since it's an average of a small number of observations).

Mixture models are often an effective way to model complex responses, though they can suffer from overparameterization. Alternatives will be discussed later.
13.6. Numeric optimization: pitfalls

In this section we'll examine two common problems that can be encountered when doing numeric optimization of nonlinear models, and some solutions.

13.6.1. Poor scaling of the data. When the data is scaled so that the magnitudes of the first and second derivatives are of different orders, problems can easily result. If we uncomment the appropriate line in EstimatePoisson.m, the data will not be scaled, and the estimation program will have difficulty converging (it seems to take an infinite amount of time). With unscaled data, the elements of the score vector have very different magnitudes at the initial value of $\theta$ (all zeros). To see this, run CheckScore.m. With unscaled data, one element of the gradient is very large, and the maximum and minimum elements are 5 orders of magnitude apart. This causes convergence problems due to serious numerical inaccuracy when doing inversions to calculate the BFGS direction of search. With scaled data, none of the elements of the gradient are very large, and the maximum difference in orders of magnitude is 3. Convergence is quick.
13.6.2. Multiple optima. Multiple optima (one global, others local) can complicate life, since we have limited means of determining if there is a higher maximum than the one we're at. Think of climbing a mountain in an unknown range, in a very foggy place (Figure 13.6.1). You can go up until there's nowhere else to go up, but since you're in the fog you don't know if the true summit is across the gap that's at your feet. Do you claim victory and go home, or do you trudge down the gap and explore the other side?

The best way to avoid stopping at a local maximum is to use many starting values, for example on a grid, or randomly generated. Or perhaps one might have priors about possible values for the parameters (e.g., from previous studies of similar data).
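The many-starting-values strategy is simple to sketch: run a local optimizer from each random start and keep the best result. The bumpy one-dimensional function below is a hypothetical stand-in for the foggy mountain (the real example is in FoggyMountain.m):

```python
import numpy as np

# Multistart local minimization of a function with many local minima.
def f(th):
    return (th - 2.0)**2 + 2.0 * np.sin(5.0 * th)

def local_min(th, lr=0.01, iters=500, h=1e-6):
    # Plain gradient descent with a numeric derivative (a crude local optimizer).
    for _ in range(iters):
        grad = (f(th + h) - f(th - h)) / (2 * h)
        th = th - lr * grad
    return th

rng = np.random.default_rng(8)
starts = rng.uniform(-10.0, 10.0, size=200)        # battery of random start values
candidates = [local_min(t) for t in starts]
best = min(candidates, key=f)                      # keep the lowest local minimum
```

A single badly placed start would get trapped in one of the side basins; with 200 random starts the global basin near $\theta\approx 2.2$ is found with near certainty, which is the point of the battery.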
Let's try to find the true minimizer of minus 1 times the foggy mountain function (since the algorithms are set up to minimize). From the picture, you can see roughly where the minimizer is, but a numeric method must find it without that. The program FoggyMountain.m shows that poor start values can lead to problems. It uses SA, which finds the true global minimum, and it shows that BFGS using a battery of random start values can also find the global minimum. The output of one run is here:
MPITB extensions found
======================================================
BFGSMIN final results
-----------------------------------------------------STRONG CONVERGENCE
Function conv 1
Param conv 1
Gradient conv 1
param        gradient     change
  15.9999     -0.0000      0.0000
 -28.8119      0.0000      0.0000

16.000  -28.812
================================================
SAMIN final results
NORMAL CONVERGENCE
parameter    search width
  0.037419     0.000018
 -0.000000     0.000051
================================================
Now try a battery of random start values and
a short BFGS on each, then iterate to convergence
The result using 20 randoms start values
ans =
3.7417e-02
2.7628e-07
In that run, the single BFGS run with bad start values converged to a point far from the true minimizer, while simulated annealing and BFGS using a battery of random start values both found the true minimizer. The moral of the story is: be cautious, and don't publish your results too quickly.
Exercises
(1) In Octave, type help bfgsmin_example, to find out the location of the file. Edit the file to examine it and learn how to call bfgsmin. Run it, and examine the output.
(2) In Octave, type help samin_example, to find out the location of the file. Edit the file to examine it and learn how to call samin. Run it, and examine the output.
(3) Using logit.m and EstimateLogit.m as templates, write a function to calculate the probit loglikelihood, and a script to estimate a probit model. Run
it using data that actually follows a logit model (you can generate it in the
same way that is done in the logit example).
(4) Study mle_results.m to see what it does. Examine the functions that
mle_results.m calls, and in turn the functions that those functions call.
Write a complete description of how the whole chain works.
(5) Look at the Poisson estimation results for the OBDV measure of health care
use and give an economic interpretation. Estimate Poisson models for the
other 5 measures of health care usage.
CHAPTER 14

Asymptotic properties of extremum estimators

Readings: Amemiya, Ch. 4, section 4.1; Davidson and MacKinnon, pp. 591-96; Gallant, Ch. 3; Newey and McFadden (1994).
An estimator $\hat\theta$ is an extremum estimator if there is an objective function $s_n(Z_n, \theta)$, where $Z_n = [z_1\; z_2\; \cdots\; z_n]'$ are the available observations, such that $\hat\theta$ is the optimizing element:
\[
\hat\theta = \arg\max_{\theta \in \Theta} s_n(Z_n, \theta)
\]
over a set $\Theta$. For example, for the model $y_t = x_t'\theta^0 + \varepsilon_t$, given $n$ observations, define the average of squared residuals
\[
s_n(\theta) = \frac{1}{n}\sum_{t=1}^n (y_t - x_t'\theta)^2 .
\]
The OLS estimator is the extremum estimator that minimizes $s_n(\theta)$ (equivalently, maximizes $-s_n(\theta)$) over $\Theta$.
14.2. Consistency
The following theorem is patterned on a proof in Gallant (1987) (the article, ref. later), which we'll see in its original form later in the course. It is interesting to compare the following proof with Amemiya's Theorem 4.1.1, which is done in terms of convergence in probability.
THEOREM 19. [Consistency of extremum estimators] Suppose that $\hat\theta_n$ is obtained by maximizing $s_n(\theta)$ over $\Theta$. Assume:
(1) Compactness: The closure of the parameter space $\Theta$, denoted $\bar\Theta$, is compact.
(2) Uniform convergence: There is a nonstochastic function $s_\infty(\theta)$ that is continuous in $\theta$ on $\bar\Theta$ such that
\[
\lim_{n\to\infty} \sup_{\theta \in \bar\Theta} | s_n(\theta) - s_\infty(\theta) | = 0, \text{ a.s.}
\]
(3) Identification: $s_\infty(\cdot)$ has a unique global maximum at $\theta^0 \in \Theta$, i.e., $s_\infty(\theta^0) > s_\infty(\theta)$, $\forall \theta \neq \theta^0$, $\theta \in \bar\Theta$.
Then $\hat\theta_n \to \theta^0$, a.s.

Proof: Select a $\omega \in \Omega$ and hold it fixed. Then $\{ s_n(\omega, \theta) \}$ is a fixed sequence of functions. Suppose that $\omega$ is such that $s_n(\omega, \theta)$ converges uniformly to $s_\infty(\theta)$; this happens with probability one by assumption (2). The sequence $\{\hat\theta_n\}$ lies in the compact set $\bar\Theta$. Since every sequence from a compact set has at least one limit point (Davidson, Thm. 2.12), say that $\hat\theta$ is a limit point of $\{\hat\theta_n\}$. There is a subsequence $\{\hat\theta_{n_m}\}$ with $\lim_{m\to\infty} \hat\theta_{n_m} = \hat\theta$. By uniform convergence and continuity,
\[
\lim_{m\to\infty} s_{n_m}(\hat\theta_{n_m}) = s_\infty(\hat\theta).
\]
Next, by maximization,
\[
s_{n_m}(\hat\theta_{n_m}) \ge s_{n_m}(\theta^0),
\]
which holds in the limit, so
\[
\lim_{m\to\infty} s_{n_m}(\hat\theta_{n_m}) \ge \lim_{m\to\infty} s_{n_m}(\theta^0).
\]
However, the lhs is $s_\infty(\hat\theta)$, as seen above, and the rhs is $s_\infty(\theta^0)$ by uniform convergence, so
\[
s_\infty(\hat\theta) \ge s_\infty(\theta^0).
\]
But since $\theta^0$ is the unique global maximum of $s_\infty(\theta)$, we must have $s_\infty(\hat\theta) = s_\infty(\theta^0)$, and by the identification assumption, $\hat\theta = \theta^0$. Thus every limit point of $\{\hat\theta_n\}$ is $\theta^0$, so $\hat\theta_n \to \theta^0$, except possibly on a set $C \subset \Omega$ with $P(C) = 0$.
Discussion of the assumptions:
- The identification assumption requires that the limiting objective function have a unique maximizing argument, which matches the way we will write the assumption in the section on nonparametric inference.
- We assume that $\hat\theta_n$ is in fact a global maximum of $s_n(\theta)$. Since $s_n(\theta)$ may be a nonconcave function, finding its global maximum may be a non-trivial problem, as discussed in the section on numeric optimization methods; failure to find the global maximum is a possible source of breakdown of consistency in practice.
- We assume that $\theta^0$ is in the interior of $\Theta$. The reason that we assume it's in the interior here is that this is necessary for subsequent proof of asymptotic normality, and I'd like to maintain a minimal set of simple assumptions, for clarity. Parameters on the boundary of the parameter set cause theoretical difficulties that we will not deal with in this course. Just note that conventional hypothesis testing methods do not apply in this case.
- Note that $s_n(\theta)$ is not required to be continuous, though $s_\infty(\theta)$ is.
We need a uniform strong law of large numbers in order to verify assumption (2) of Theorem 19. The following theorem is from Davidson, pg. 337.

THEOREM 20. [Uniform Strong LLN] Let $\{G_n(\theta)\}$ be a sequence of stochastic real-valued functions on a totally-bounded metric space $(\Theta, \rho)$. Then
\[
\sup_{\theta \in \Theta} | G_n(\theta) | \to 0 \text{ a.s.}
\]
if and only if
(a) $G_n(\theta) \to 0$ a.s. for each $\theta \in \Theta_0$, where $\Theta_0$ is a dense subset of $\Theta$, and
(b) $\{G_n(\theta)\}$ is strongly stochastically equicontinuous.

The pointwise almost sure convergence needed for assumption (a) comes from one of the usual SLLNs.

Example: Strong consistency of least squares. Suppose the model is $y_t = x_t \theta^0 + \varepsilon_t$, with the $(x_t, \varepsilon_t)$ i.i.d., $E(x_t \varepsilon_t) = 0$, $E(\varepsilon_t^2) = \sigma^2$, and $\Theta$ compact. Using the objective function
\[
s_n(\theta) = \frac{1}{n}\sum_{t=1}^n (y_t - x_t\theta)^2
       = \frac{1}{n}\sum_{t=1}^n \left( x_t(\theta^0 - \theta) + \varepsilon_t \right)^2,
\]
we can write
\[
s_n(\theta) = (\theta^0 - \theta)^2 \frac{1}{n}\sum_t x_t^2
            + 2(\theta^0 - \theta) \frac{1}{n}\sum_t x_t \varepsilon_t
            + \frac{1}{n}\sum_t \varepsilon_t^2 .
\]
The middle term converges to zero almost surely, since $E(x_t\varepsilon_t) = 0$. For the first term, for a given $\theta$, we assume that a SLLN applies so that
(14.3.1)
\[
s_n(\theta) \to (\theta^0 - \theta)^2 E(x^2) + \sigma^2, \text{ a.s.}
\]
Finally, the objective function is clearly continuous, and the parameter space is assumed to be compact, so the convergence is also uniform. Thus
\[
s_\infty(\theta) = (\theta^0 - \theta)^2 E(x^2) + \sigma^2,
\]
which is minimized at $\theta = \theta^0$.

EXERCISE 21. Show that in order for the above solution to be unique it is necessary that $E(x^2) \neq 0$.

This example shows that Theorem 19 can be used to prove strong consistency of the OLS estimator. There are easier ways to show this, of course; this is only an example of application of the theorem.
14.4. Asymptotic normality.

THEOREM 22. [Asymptotic normality of extremum estimators] In addition to the assumptions of Theorem 19, assume
(a) $J_n(\theta) \equiv D^2_\theta s_n(\theta)$ exists and is continuous in an open, convex neighborhood of $\theta^0$.
(b) $\{J_n(\theta_n)\} \to J_\infty(\theta^0)$, a.s., a finite negative definite matrix, for any sequence $\{\theta_n\}$ that converges almost surely to $\theta^0$.
(c) $\sqrt{n}\, D_\theta s_n(\theta^0) \to^d N[0, I_\infty(\theta^0)]$, where $I_\infty(\theta^0) = \lim_{n\to\infty} \operatorname{Var} \sqrt{n}\, D_\theta s_n(\theta^0)$.
Then
\[
\sqrt{n}\,(\hat\theta - \theta^0) \to^d N\!\left[ 0,\; J_\infty(\theta^0)^{-1} I_\infty(\theta^0) J_\infty(\theta^0)^{-1} \right].
\]

Proof: By a Taylor expansion of the gradient about $\theta^0$,
\[
D_\theta s_n(\hat\theta) = D_\theta s_n(\theta^0) + D^2_\theta s_n(\theta^*)\,(\hat\theta - \theta^0),
\]
where $\theta^*$ lies between $\hat\theta$ and $\theta^0$. Note that $\hat\theta$ will be in the neighborhood where $D^2_\theta s_n(\theta)$ exists with probability one as $n$ becomes large, by consistency. The lhs is zero by the first order conditions for maximization. Also, since $\theta^*$ is between $\hat\theta$ and $\theta^0$, and since $\hat\theta \to \theta^0$ a.s., assumption (b) gives
\[
D^2_\theta s_n(\theta^*) \to J_\infty(\theta^0), \text{ a.s.}
\]
So, multiplying by $\sqrt{n}$ and rearranging,
\[
\sqrt{n}\,(\hat\theta - \theta^0) = - \left[ J_\infty(\theta^0) + o_s(1) \right]^{-1} \sqrt{n}\, D_\theta s_n(\theta^0).
\]
Because of assumption (c), and the formula for the variance of a linear combination of r.v.'s, the claimed limiting distribution follows.

Remarks:
- $J_n(\theta^0)$ is an average of matrices, the elements of which are not centered (they do not have zero expectation). Supposing a SLLN applies, the almost sure limit of $J_n(\theta^0)$ is $J_\infty(\theta^0)$, as needed for assumption (b). Stronger conditions that imply this are as above: continuous and bounded second derivatives on a neighborhood of $\theta^0$.
- Assumption (c) will follow from a CLT applied to the terms of the score vector, which are centered when evaluated at $\theta^0$.

THEOREM 23. If $g(\cdot)$ is continuous at $\theta^0$ and $\hat\theta \to \theta^0$ a.s., then $g(\hat\theta) \to g(\theta^0)$ a.s. This justifies, for example, estimating $J_\infty(\theta^0)$ by $J_n(\hat\theta)$ when $J_n(\cdot)$ is continuous on a neighborhood of $\theta^0$.
14.5. Examples

14.5.1. Binary response models. Binary response models arise in a variety of contexts. We've already seen a logit model. Another simple example is a probit threshold-crossing model. Assume that
\[
y^* = x'\beta - \varepsilon, \qquad
y = 1(y^* > 0), \qquad
\varepsilon \sim N(0,1).
\]
Here, $y^*$ is an unobserved (latent) continuous variable, and $y$ is a binary variable that indicates whether $y^*$ is negative or positive. Then $\Pr(y = 1) = \Pr(\varepsilon < x'\beta) = \Phi(x'\beta)$, where $\Phi(\cdot)$ is the standard normal distribution function.

In general, a binary response model will require that the choice probability be parameterized in some form. For a vector of explanatory variables $x$, the response probability will be parameterized in some manner as
\[
\Pr(y = 1 \mid x) = p(x, \theta).
\]
If $p(x, \theta) = \Lambda(x'\theta) = (1 + e^{-x'\theta})^{-1}$, we have a logit model. If $p(x, \theta) = \Phi(x'\theta)$, we have a probit model.

Regardless of the parameterization, we are dealing with a Bernoulli density conditional on $x_i$,
\[
f(y_i \mid x_i) = p(x_i, \theta)^{y_i} \left( 1 - p(x_i, \theta) \right)^{1 - y_i},
\]
so as long as the observations are independent, the maximum likelihood (ML) estimator, $\hat\theta$, is the maximizer of
(14.5.1)
\[
s_n(\theta) = \frac{1}{n}\sum_{i=1}^n \left( y_i \ln p(x_i, \theta) + (1 - y_i) \ln[ 1 - p(x_i, \theta) ] \right)
\equiv \frac{1}{n}\sum_{i=1}^n s(y_i, x_i, \theta).
\]
Noting that $E(y_i \mid x_i) = p(x_i, \theta^0)$, taking expectations first conditional on $x$ and then over $x$, the limiting objective function is
(14.5.2)
\[
s_\infty(\theta) = \int \left\{ p(x, \theta^0) \ln p(x, \theta) + [1 - p(x, \theta^0)] \ln[1 - p(x, \theta)] \right\} \mu(x)\, dx ,
\]
where $\mu(x)$ is the density function of the explanatory variables. This is clearly maximized at $\theta = \theta^0$. The maximizing element of $s_n(\theta)$, $\hat\theta$, is consistent for $\theta^0$ as long as identification holds and $p(x, \theta)$ is continuous in $\theta$, which is the case for the logit and probit models, for example.

The asymptotic normality theorem tells us that
\[
\sqrt{n}\,(\hat\theta - \theta^0) \to^d N\!\left[ 0, J_\infty(\theta^0)^{-1} I_\infty(\theta^0) J_\infty(\theta^0)^{-1} \right].
\]
In the case of i.i.d. observations, $I_\infty(\theta^0) = \lim \operatorname{Var} \sqrt{n}\, D_\theta s_n(\theta^0)$ is simply the expectation of a typical element of the outer product of the gradient. There's no need to subtract the mean, since it's zero, following the f.o.c. in the consistency proof above and the fact that observations are i.i.d.

Now suppose that we are dealing with a correctly specified logit model:
\[
p(x, \theta) = \left( 1 + \exp(-x'\theta) \right)^{-1}.
\]
We can simplify the above results in this case. We have that
(14.5.3)
\[
\frac{\partial}{\partial\theta} p(x, \theta)
= \left( 1 + \exp(-x'\theta) \right)^{-2} \exp(-x'\theta)\, x
= p(x, \theta)\left[ 1 - p(x, \theta) \right] x .
\]
With this, the score and Hessian of a representative term are
(14.5.4)
\[
D_\theta s(y, x, \theta) = \left[ y - p(x, \theta) \right] x,
\]
(14.5.5)
\[
D^2_\theta s(y, x, \theta) = - \left[ p(x, \theta) - p(x, \theta)^2 \right] x x' .
\]
Likewise, taking expectations at $\theta^0$, using $E(y \mid x) = p(x, \theta^0)$ and $E\{ [y - p(x, \theta^0)]^2 \mid x \} = p(x, \theta^0)[1 - p(x, \theta^0)]$, gives
(14.5.6)
\[
I_\infty(\theta^0) = \int \left[ p(x, \theta^0) - p(x, \theta^0)^2 \right] x x' \mu(x)\, dx = -J_\infty(\theta^0).
\]
Note that we arrive at the expected result: the information matrix equality holds. With this, the asymptotic variance
\[
J_\infty(\theta^0)^{-1} I_\infty(\theta^0) J_\infty(\theta^0)^{-1}
\]
simplifies to $I_\infty(\theta^0)^{-1}$, so
\[
\sqrt{n}\,(\hat\theta - \theta^0) \to^d N\!\left[ 0, I_\infty(\theta^0)^{-1} \right].
\]

On a final note, the logit and standard normal CDFs are very similar: the logit distribution is a bit more fat-tailed. While coefficients will vary slightly between the two models, functions of interest such as estimated probabilities will be virtually identical for the two models.
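The logit objective (14.5.1) and score (14.5.4) can be sketched numerically as follows. This is a Python illustration (the course code uses Octave), with hypothetical function names and a tiny made-up dataset; it is not the logit.m used elsewhere in these notes.

```python
import math

def logit_prob(x, theta):
    # p(x, theta) = 1 / (1 + exp(-x'theta)), the logit choice probability.
    z = sum(xi * ti for xi, ti in zip(x, theta))
    return 1.0 / (1.0 + math.exp(-z))

def logit_loglik(theta, y, X):
    # Average log-likelihood s_n(theta), equation (14.5.1).
    total = 0.0
    for yi, xi in zip(y, X):
        p = logit_prob(xi, theta)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(y)

def logit_score(theta, y, X):
    # Average score: (1/n) sum_i (y_i - p(x_i, theta)) x_i, cf. (14.5.4).
    n, K = len(y), len(theta)
    g = [0.0] * K
    for yi, xi in zip(y, X):
        p = logit_prob(xi, theta)
        for k in range(K):
            g[k] += (yi - p) * xi[k] / n
    return g

# Tiny made-up dataset: constant plus one regressor.
y = [1, 0, 1, 1, 0]
X = [[1.0, 0.5], [1.0, -1.0], [1.0, 2.0], [1.0, 0.3], [1.0, -0.7]]
theta = [0.1, 0.2]
ll = logit_loglik(theta, y, X)
grad = logit_score(theta, y, X)
```

Passing `logit_loglik` (times minus one) to any numeric minimizer reproduces the ML estimator; at the maximizer, `logit_score` returns approximately zero, matching the f.o.c.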
14.5.2. Example: Linearization of a nonlinear model. Suppose we have a nonlinear model
\[
y_t = h(x_t, \theta^0) + \varepsilon_t,
\]
where $\varepsilon_t \sim iid(0, \sigma^2)$. The nonlinear least squares estimator solves
\[
\hat\theta_n = \arg\min \frac{1}{n}\sum_{t=1}^n \left( y_t - h(x_t, \theta) \right)^2 .
\]
We'll study this more later, but for now it is clear that the foc for minimization will require solving a set of nonlinear equations. A common approach to the problem seeks to avoid this difficulty by linearizing the model. A first order Taylor's series expansion about the point $x^0$ with remainder gives
\[
y_t = h(x^0, \theta^0) + (x_t - x^0)' \frac{\partial h(x^0, \theta^0)}{\partial x} + \nu_t,
\]
where $\nu_t$ encompasses both $\varepsilon_t$ and the Taylor's series remainder. Note that $\nu_t$ is no longer a classical error: its mean is not zero. We should expect problems.

Define
\[
\alpha^* = h(x^0, \theta^0) - x^{0\prime} \frac{\partial h(x^0, \theta^0)}{\partial x}, \qquad
\beta^* = \frac{\partial h(x^0, \theta^0)}{\partial x}.
\]
Given this, one might try to estimate $\alpha^*$ and $\beta^*$ by applying OLS to
\[
y_t = \alpha + \beta' x_t + \nu_t .
\]
Question: will $\hat\alpha$ and $\hat\beta$ be consistent for $\alpha^*$ and $\beta^*$? The answer is no in general, as one can see by interpreting $\hat\alpha$ and $\hat\beta$ as extremum estimators. Let $\gamma = (\alpha, \beta')'$, so $\hat\gamma$ is the estimator that minimizes
\[
s_n(\gamma) = \frac{1}{n}\sum_{t=1}^n \left( y_t - \alpha - \beta' x_t \right)^2 .
\]
The objective function converges to its expectation, so $\hat\gamma \to \gamma^0 \equiv \arg\min s_\infty(\gamma)$, where
\[
s_\infty(\gamma) = E_x E_{y|x} \left( y - \alpha - \beta' x \right)^2 .
\]
Noting that
\[
E_x E_{y|x} \left( y - \alpha - \beta' x \right)^2
= E_x E_{y|x} \left( h(x, \theta^0) + \varepsilon - \alpha - \beta' x \right)^2
= \sigma^2 + E_x \left( h(x, \theta^0) - \alpha - \beta' x \right)^2,
\]
since the cross products involving $\varepsilon$ drop out, we see that $\alpha^0$ and $\beta^0$ correspond to the hyperplane that is closest to the true regression function $h(x, \theta^0)$ according to the mean squared error criterion. This depends both on the shape of $h(\cdot)$ and on the density function of the conditioning variables; it is not the tangent plane at $x^0$, so $\hat\alpha$ and $\hat\beta$ are not consistent for $\alpha^*$ and $\beta^*$.
[Figure: the function h(x, theta^0) plotted against x, together with the fitted OLS line and the tangent line at the point x_0; the two lines differ.]
It is clear that the tangent line does not minimize MSE, since, for example, if $h(x, \theta)$ is concave, all errors between the tangent line and the true function are negative. Note also that the true underlying parameter $\theta^0$ is not estimated consistently, either (it may be of a different dimension than that of the approximating model). Second order and higher order approximations suffer from the same problem, though to a less severe degree. The bias may not be too important for analysis of conditional means, but it can be very important for analyzing first and second derivatives. In production and consumer analysis, first and second derivatives (e.g., elasticities of substitution) are often the objects of interest, so in this case, one should be cautious of unthinking application of models that impose strong restrictions on second derivatives.
This sort of linearization about a long run equilibrium is a common practice in dynamic macroeconomic models. It is justified for the purposes of theoretical analysis of a model given the model's parameters, but it is not justifiable for the estimation of the parameters of the model using data. The section on simulation-based methods offers a means of obtaining consistent estimators of the parameters of dynamic macro models that are too complex for standard methods of analysis.
Exercises
(1) Suppose that $x_t \sim$ uniform(0,1), and $y_t = 1 - x_t^2 + \varepsilon_t$, where $\varepsilon_t$ is iid(0, $\sigma^2$). Suppose we estimate the misspecified model $y_t = \alpha + \beta x_t + \eta_t$ by OLS. Find the numeric values of $\alpha^0$ and $\beta^0$ that are the probability limits of $\hat\alpha$ and $\hat\beta$.
(2) Verify your results using Octave by generating data that follows the above model, and calculating the OLS estimator. When the sample size is very large the estimator should be very close to the analytical results you obtained in question 1.
(3) Use the asymptotic normality theorem to find the asymptotic distribution of the ML estimator of $\beta^0$ for the model $y_t = x_t \beta^0 + \varepsilon_t$, where $\varepsilon_t \sim N(0,1)$ and is independent of $x_t$.
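The kind of simulation check asked for in question 2 can be sketched as follows. This is a Python illustration rather than the Octave the notes use, with a hypothetical function name and an arbitrarily chosen error standard deviation of 0.1; it generates data from the stated dgp and computes the OLS estimates of the misspecified linear model, without giving away the analytical answer to question 1.

```python
import random

def simulate_ols(n, seed=1):
    # Simulate x ~ U(0,1), y = 1 - x^2 + eps (eps normal, sd 0.1 here),
    # then fit the misspecified model y = a + b x + eta by OLS.
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [1.0 - x * x + rng.gauss(0.0, 0.1) for x in xs]
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)
    a = ybar - b * xbar
    return a, b

a_hat, b_hat = simulate_ols(200000)
```

With a very large sample size, the estimates should be very close to the probability limits worked out analytically in question 1.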
CHAPTER 15

Generalized method of moments (GMM)

Readings: Hamilton Ch. 14; Davidson and MacKinnon, Ch. 17 (see pg. 587 for refs. to applications); Newey and McFadden (1994), "Large Sample Estimation and Hypothesis Testing", in Handbook of Econometrics, Vol. 4, Ch. 36.
15.1. Definition

We've already seen one example of GMM in the introduction, based upon the $\chi^2$ distribution. Consider the following example based upon the t-distribution. The density function of a t-distributed r.v. $y_t$ is
\[
f_Y(y_t, \theta^0) = \frac{\Gamma\left( (\theta^0 + 1)/2 \right)}{ (\pi \theta^0)^{1/2} \Gamma(\theta^0/2) }
\left( 1 + y_t^2/\theta^0 \right)^{-(\theta^0 + 1)/2} .
\]
Given an i.i.d. sample of size $n$, one could estimate $\theta^0$ by maximizing the log-likelihood function
\[
\hat\theta \equiv \arg\max_\Theta \ln L_n(\theta) = \sum_{t=1}^n \ln f_Y(y_t, \theta).
\]
This approach is attractive since ML estimators are asymptotically efficient. This is because the ML estimator uses all of the available information (e.g., the distribution is fully specified up to a parameter). Recalling that a distribution is completely characterized by its moments, the ML estimator is interpretable as a GMM estimator that uses all of the moments. The method of moments estimator uses only $K$ moments to estimate a $K$-dimensional parameter. Since information is discarded, in general, by the MM estimator, efficiency is lost relative to the ML estimator.

Continuing with the example, a t-distributed r.v. with density $f_Y(y_t, \theta^0)$ has mean zero and variance $V(y_t) = \theta^0/(\theta^0 - 2)$ (for $\theta^0 > 2$). Using the notion that the sample variance should be close to the population variance, form the moment condition
\[
m_{1n}(\theta) = \frac{1}{n}\sum_{t=1}^n y_t^2 - \frac{\theta}{\theta - 2}.
\]
Setting $m_{1n}(\hat\theta) = 0$ yields a MM estimator:
(15.1.1)
\[
\hat\theta = \frac{2 \hat V}{\hat V - 1}, \qquad \hat V = \frac{1}{n}\sum_{t=1}^n y_t^2 .
\]
This estimator is based on only one moment of the distribution - it uses less information than the ML estimator, so it is intuitively clear that the MM estimator will be inefficient relative to the ML estimator.

An alternative MM estimator could be based upon the fourth moment of the t-distribution, which is
\[
E(y_t^4) = \frac{3 (\theta^0)^2}{(\theta^0 - 2)(\theta^0 - 4)},
\]
provided $\theta^0 > 4$. Choosing to set the corresponding moment condition
\[
m_{2n}(\theta) = \frac{1}{n}\sum_{t=1}^n y_t^4 - \frac{3\theta^2}{(\theta - 2)(\theta - 4)}
\]
to zero defines another MM estimator. If you solve this you'll see that the estimate is different from that in equation 15.1.1.

This estimator isn't efficient either, since it uses only one moment. A GMM estimator would use the two moment conditions together to estimate the single parameter. The GMM estimator is overidentified, which leads to an estimator which is efficient relative to the just identified MM estimators (more on efficiency later).

As before, set $m_n(\theta) = ( m_{1n}(\theta),\ m_{2n}(\theta) )'$. With two moment conditions and one parameter we cannot set both to zero, so instead we minimize a quadratic form
\[
s_n(\theta) = m_n(\theta)' W_n m_n(\theta),
\]
where $W_n$ is a weighting matrix. Here, expectations are taken using the true distribution with parameter $\theta^0$, and we assume $E_{\theta^0} m_n(\theta^0) = 0$, since the moment conditions are centered using the true moments.

For the purposes of this course, the following definition of the GMM estimator is sufficiently general:

DEFINITION 24. The GMM estimator of the $K$-dimensional parameter vector $\theta^0$ is
\[
\hat\theta \equiv \arg\min_\Theta s_n(\theta) \equiv m_n(\theta)' W_n m_n(\theta),
\]
where $m_n(\theta) = \frac{1}{n}\sum_{t=1}^n m(y_t, x_t, \theta)$ is a $g$-vector, $g \ge K$, with $E_{\theta} m(y_t, x_t, \theta) = 0$, and $W_n$ converges almost surely to a finite $g \times g$ symmetric positive definite matrix $W_\infty$.
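The two-moment t-distribution example can be sketched numerically. This is a Python illustration (the course uses Octave); `mm_from_variance` and `gmm_tdf` are hypothetical names, the grid search is only a toy stand-in for a proper numeric minimizer, and the sample moments fed in below are chosen, for checking purposes, to agree exactly with $\theta^0 = 10$.

```python
def mm_from_variance(vbar):
    # Just-identified MM estimator of equation (15.1.1):
    # solve vbar = theta / (theta - 2) for theta.
    return 2.0 * vbar / (vbar - 1.0)

def gmm_tdf(vbar, m4bar, w11=1.0, w22=1.0):
    # GMM estimate of the t degrees of freedom from the 2nd and 4th
    # sample moments, minimizing m(theta)' W m(theta) with diagonal W,
    # via a crude grid search over theta in (4.5, 64.5).
    def obj(th):
        m1 = vbar - th / (th - 2.0)
        m2 = m4bar - 3.0 * th ** 2 / ((th - 2.0) * (th - 4.0))
        return w11 * m1 ** 2 + w22 * m2 ** 2
    grid = [4.5 + 0.001 * i for i in range(60000)]
    return min(grid, key=obj)

theta_mm = mm_from_variance(1.25)     # second moment consistent with theta = 10
theta_gmm = gmm_tdf(1.25, 6.25)       # both moments consistent with theta = 10
```

With real data the two sample moments would not agree exactly, and the weighting matrix would determine how the conflict is resolved; that is the subject of the following sections.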
15.2. Consistency

We simply assume that the assumptions of Theorem 19 hold, so the GMM estimator is strongly consistent. The only assumption that warrants additional comments is that of identification. In Theorem 19, the third assumption reads: (c) Identification: $s_\infty(\cdot)$ has a unique global maximum at $\theta^0$, i.e., $s_\infty(\theta^0) > s_\infty(\theta)$, $\forall \theta \neq \theta^0$. Taking the case of a quadratic objective function $s_n(\theta) = m_n(\theta)' W_n m_n(\theta)$, first consider $m_n(\theta)$. Applying a uniform law of large numbers, we get $m_n(\theta) \to m_\infty(\theta)$, a.s. Since $E_{\theta^0} m_n(\theta^0) = 0$ by assumption, $m_\infty(\theta^0) = 0$. Since $s_\infty(\theta^0) = m_\infty(\theta^0)' W_\infty m_\infty(\theta^0) = 0$, in order for asymptotic identification we need that $m_\infty(\theta) \neq 0$ for $\theta \neq \theta^0$, for at least some element of the vector. This, and the assumption that $W_n \to W_\infty$ a.s., a finite positive definite $g \times g$ matrix, guarantee that $\theta^0$ is asymptotically identified.

Note that asymptotic identification does not rule out the possibility of lack of identification for a given data set - there may be multiple minimizing solutions in finite samples.
15.3. Asymptotic normality

We also simply assume that the conditions of the asymptotic normality theorem hold, so we will have
\[
\sqrt{n}\,(\hat\theta - \theta^0) \to^d N\!\left[ 0, J_\infty(\theta^0)^{-1} I_\infty(\theta^0) J_\infty(\theta^0)^{-1} \right],
\]
where $J_\infty(\theta^0)$ is the almost sure limit of $\frac{\partial^2}{\partial\theta \partial\theta'} s_n(\theta^0)$ and $I_\infty(\theta^0) = \lim \operatorname{Var} \sqrt{n}\, \frac{\partial}{\partial\theta} s_n(\theta^0)$. We need to determine the form of these matrices given the objective function $s_n(\theta) = m_n(\theta)' W_n m_n(\theta)$.

Now using the product rule,
(15.3.1)
\[
\frac{\partial}{\partial\theta} s_n(\theta) = 2 \left[ \frac{\partial}{\partial\theta} m_n'(\theta) \right] W_n m_n(\theta).
\]
Define the $K \times g$ matrix $D_n(\theta) \equiv \frac{\partial}{\partial\theta} m_n'(\theta)$, so the gradient is $2 D_n(\theta) W_n m_n(\theta)$. (Note that $s_n(\theta)$, $D_n(\theta)$, $W_n$ and $m_n(\theta)$ all depend on the sample size $n$, but it is omitted to unclutter the notation.)

To take second derivatives, let $D_i$ be the $i$-th row of $D(\theta)$. Using the product rule, the $i$-th row of the Hessian is
\[
\frac{\partial^2}{\partial\theta' \partial\theta_i} s_n(\theta)
= 2 D_i W D' + 2 m' W \left[ \frac{\partial}{\partial\theta'} D_i' \right].
\]
When evaluating this at $\theta^0$, $m_n(\theta^0) \to 0$ a.s., while the remaining factors converge to finite limits (by assumption), so the second term tends to zero. Stacking the rows, we get
\[
J_\infty(\theta^0) = 2 D_\infty W_\infty D_\infty',
\]
where we define $D_\infty \equiv \lim D_n(\theta^0)$, a.s. With regard to $I_\infty(\theta^0)$, following equation 15.3.1, and evaluating at $\theta^0$,
\[
\sqrt{n}\, \frac{\partial}{\partial\theta} s_n(\theta^0) = 2 D_n(\theta^0) W_n \sqrt{n}\, m_n(\theta^0).
\]
Assuming that a CLT applies, so that $\sqrt{n}\, m_n(\theta^0) \to^d N(0, \Omega_\infty)$, where $\Omega_\infty = \lim E\left[ n\, m_n(\theta^0) m_n(\theta^0)' \right]$, we get
\[
I_\infty(\theta^0) = 4 D_\infty W_\infty \Omega_\infty W_\infty D_\infty'.
\]
Using these results, the asymptotic normality theorem gives us
\[
\sqrt{n}\,(\hat\theta - \theta^0) \to^d N\!\left[ 0, (D_\infty W_\infty D_\infty')^{-1} D_\infty W_\infty \Omega_\infty W_\infty D_\infty' (D_\infty W_\infty D_\infty')^{-1} \right],
\]
the asymptotic distribution of the GMM estimator for arbitrary weighting matrix $W_n$. Note that for $J_\infty$ to be positive definite, $D_\infty$ must have full row rank, $\rho(D_\infty) = K$.
15.4. Choosing the weighting matrix

$W$ is a weighting matrix, which may be a random, data-dependent matrix. Its role is to determine the relative importance of the various moment conditions. For example, if we are much more sure of the first moment condition, which is based upon the variance, than of the second, which is based upon the fourth moment, we could set $W = \operatorname{diag}(a, b)$ with $a$ much larger than $b$. Since moments are not independent, in general, we should expect that there be a correlation between the moment conditions, so it may not be desirable to set the off-diagonal elements to 0. We have already seen that the choice of $W$ will influence the asymptotic distribution of the GMM estimator, and we would like to choose $W$ to make the GMM estimator efficient within the class of GMM estimators defined by $m_n(\theta)$. (For intuition, recall that the GLS estimator of the linear model, which weights by the inverse of the error covariance, is efficient.)

THEOREM 25. If $\hat\theta$ is a GMM estimator that minimizes $m_n(\theta)' W_n m_n(\theta)$, the asymptotic variance of $\hat\theta$ will be minimized by choosing $W_n$ so that $W_n \to W_\infty = \Omega_\infty^{-1}$ a.s., where $\Omega_\infty = \lim E\left[ n\, m_n(\theta^0) m_n(\theta^0)' \right]$. With this choice,
(15.4.1)
\[
\sqrt{n}\,(\hat\theta - \theta^0) \to^d N\!\left[ 0, \left( D_\infty \Omega_\infty^{-1} D_\infty' \right)^{-1} \right].
\]

Proof: For $W_\infty = \Omega_\infty^{-1}$, the asymptotic variance
\[
(D_\infty W_\infty D_\infty')^{-1} D_\infty W_\infty \Omega_\infty W_\infty D_\infty' (D_\infty W_\infty D_\infty')^{-1}
\]
simplifies to $\left( D_\infty \Omega_\infty^{-1} D_\infty' \right)^{-1}$. Now consider the difference of the inverses of the variances when $W_\infty = \Omega_\infty^{-1}$ versus when $W_\infty$ is some arbitrary positive definite matrix:
\[
\left( D_\infty \Omega_\infty^{-1} D_\infty' \right)
- D_\infty W_\infty D_\infty' \left[ D_\infty W_\infty \Omega_\infty W_\infty D_\infty' \right]^{-1} D_\infty W_\infty D_\infty' .
\]
Set $\Omega_\infty = C C'$, and define $A = C^{-1} D_\infty'$ and $B = C' W_\infty D_\infty'$. Then the difference is
\[
A'A - A'B (B'B)^{-1} B'A = A' \left[ I - B(B'B)^{-1}B' \right] A,
\]
a quadratic form in an idempotent matrix, so it is positive semidefinite. This shows that $W_\infty = \Omega_\infty^{-1}$ is the efficient choice.

The result allows us to treat
\[
\hat\theta \approx N\!\left( \theta^0, \frac{ \left( D_\infty \Omega_\infty^{-1} D_\infty' \right)^{-1} }{n} \right),
\]
where $\approx$ means "approximately distributed as." To operationalize this we need estimators of $D_\infty$ and $\Omega_\infty$. The obvious estimator of $\hat D_\infty$ is simply $\frac{\partial}{\partial\theta} m_n'(\hat\theta)$, which is consistent by the consistency of $\hat\theta$, assuming that $\frac{\partial}{\partial\theta} m_n'$ is continuous in $\theta$. Stochastic equicontinuity results can give us this result even if $\frac{\partial}{\partial\theta} m_n'$ is not continuous. Estimation of $\Omega_\infty$ is the topic of the next section.
15.5. Estimation of the variance-covariance matrix

In the case that we wish to use the optimal weighting matrix, we need an estimate of $\Omega_\infty$, the limiting variance-covariance matrix of $\sqrt{n}\, m_n(\theta^0)$. While one could estimate $\Omega_\infty$ parametrically, we in general have little information upon which to base a parametric specification. In general, we expect that the $m_t$:
- will be autocorrelated ($\Gamma_{ts} = E(m_t m_{t-s}') \neq 0$);
- will be contemporaneously correlated across the individual moment conditions;
- and will have different variances across moment conditions.
Since we need to estimate so many components if we are to take the parametric approach, it is unlikely that we would arrive at a correct parametric specification. For this reason, research has focused on consistent nonparametric estimators of $\Omega_\infty$. Henceforth we assume that $m_t$ is covariance stationary, so the autocovariances do not depend on $t$.

Define the $v$-th autocovariance of the moment conditions $\Gamma_v = E(m_t m_{t-v}')$. Note that $E(m_t m_{t+v}') = \Gamma_v'$. Recall that $m_t$ and $m_n$ are functions of $\theta$, so for now assume that we have some consistent estimator of $\theta^0$, so that $\hat m_t = m_t(\hat\theta)$. Now
\[
\Omega_n = E\left[ n\, m_n(\theta^0) m_n(\theta^0)' \right]
= \Gamma_0 + \frac{n-1}{n}\left( \Gamma_1 + \Gamma_1' \right) + \cdots + \frac{1}{n}\left( \Gamma_{n-1} + \Gamma_{n-1}' \right).
\]
A natural, consistent estimator of $\Gamma_v$ is
\[
\hat\Gamma_v = \frac{1}{n}\sum_{t=v+1}^n \hat m_t \hat m_{t-v}'.
\]
The corresponding estimator of $\Omega_n$, using all $n-1$ estimated autocovariances, would be inconsistent, since the high-order autocovariances are estimated from very few observations. On the other hand, supposing that $\Gamma_v$ tends to zero sufficiently rapidly as $v$ tends to $\infty$, a modified estimator
\[
\hat\Omega = \hat\Gamma_0 + \sum_{v=1}^{q(n)} \left( \hat\Gamma_v + \hat\Gamma_v' \right),
\]
where $q(n) \to \infty$ as $n \to \infty$, and $q$ grows sufficiently slowly, will be consistent. A disadvantage of this estimator is that it may not be positive definite. This could cause one to calculate a negative $\chi^2$ statistic, for example!

Note: the formula for $\hat\Omega$ requires an estimate of $m(\theta^0)$, which in turn requires an estimate of $\theta$, which is based upon an estimate of $\Omega$! The solution to this circularity is to set the weighting matrix $W$ arbitrarily (for example to an identity matrix), obtain a first consistent but inefficient estimate of $\theta^0$, then use this estimate to form $\hat\Omega$, then re-estimate $\theta^0$. The process can be iterated until neither $\hat\Omega$ nor $\hat\theta$ change appreciably between iterations.

15.5.1. Newey-West covariance estimator. The Newey-West estimator (Econometrica, 1987) solves the problem of possible nonpositive definiteness of the above estimator. Their estimator is
\[
\hat\Omega = \hat\Gamma_0 + \sum_{v=1}^{q(n)} \left[ 1 - \frac{v}{q+1} \right] \left( \hat\Gamma_v + \hat\Gamma_v' \right).
\]
This estimator is positive definite by construction. It is an example of a kernel estimator.

In a more recent paper, Newey and West (Review of Economic Studies, 1994) use pre-whitening before applying the kernel estimator. The idea is to fit a VAR model to the moment conditions. It is expected that the residuals of the VAR model will be more nearly white noise, so that the Newey-West covariance estimator might perform better with short lag lengths. The VAR model is
\[
\hat m_t = \Theta_1 \hat m_{t-1} + \cdots + \Theta_p \hat m_{t-p} + u_t .
\]
This is estimated, giving the residuals $\hat u_t$. Then the Newey-West covariance estimator is applied to these pre-whitened residuals, and the covariance $\Omega$ is estimated combining the fitted VAR coefficients with the kernel estimate of the covariance of the $u_t$. See Newey and West for details.
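The Newey-West formula above can be sketched as follows. This is a Python illustration (the course code uses Octave); `newey_west` is a hypothetical name, and the triple loop is written for clarity, not speed.

```python
def newey_west(moments, q):
    # Newey-West covariance estimate of g moment conditions:
    #   Omega = Gamma_0 + sum_{v=1}^{q} (1 - v/(q+1)) (Gamma_v + Gamma_v'),
    # with Gamma_v = (1/n) sum_{t=v+1}^{n} m_t m_{t-v}'.
    # `moments` is a list of length-g lists, one m_t per observation.
    n, g = len(moments), len(moments[0])

    def gamma(v):
        G = [[0.0] * g for _ in range(g)]
        for t in range(v, n):
            for i in range(g):
                for j in range(g):
                    G[i][j] += moments[t][i] * moments[t - v][j] / n
        return G

    omega = gamma(0)
    for v in range(1, q + 1):
        w = 1.0 - v / (q + 1.0)
        Gv = gamma(v)
        for i in range(g):
            for j in range(g):
                omega[i][j] += w * (Gv[i][j] + Gv[j][i])
    return omega
```

With `q = 0` this reduces to the plain outer-product estimate $\hat\Gamma_0$, appropriate when the moment conditions are serially uncorrelated.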
15.6. Estimation using conditional moments

So far, the moment conditions have been presented as unconditional expectations. One common way of defining unconditional moment conditions is based upon conditional moment conditions. Suppose that a random variable $Y$ has zero conditional mean with respect to a random variable $X$:
\[
E_{Y|X} Y = \int y\, f_{Y|X}(y)\, dy = 0 .
\]
Then the unconditional expectation of the product of $Y$ and a function $g(X)$ of $X$ is also zero. The unconditional expectation is
\[
E\left[ Y g(X) \right] = \int_X \left( \int_Y y\, g(x)\, f(y, x)\, dy \right) dx .
\]
This can be factored into a conditional expectation and an expectation w.r.t. the marginal density of $X$:
\[
E\left[ Y g(X) \right] = \int_X \left( \int_Y y\, g(x)\, f(y|x)\, dy \right) f(x)\, dx .
\]
Since $g(X)$ doesn't depend on $Y$ it can be pulled out of the inner integral, which is then zero by assumption, so $E\left[ Y g(X) \right] = 0$, as claimed.

This is important econometrically, since models often imply restrictions on conditional moments. Suppose a model tells us that the function $K(y_t, x_t)$ has expectation, conditional on the information set $I_t$, equal to $k(x_t, \theta)$:
\[
E_\theta \left[ K(y_t, x_t) \mid I_t \right] = k(x_t, \theta).
\]
With this, the function
\[
h_t(\theta) = K(y_t, x_t) - k(x_t, \theta)
\]
has conditional expectation equal to zero:
\[
E_\theta \left[ h_t(\theta) \mid I_t \right] = 0 .
\]
This is a scalar moment condition, which isn't sufficient to identify a $K$-dimensional parameter $\theta$ (for $K > 1$). However, the above result allows us to form various unconditional expectations
\[
m_t(\theta) = Z(w_t)\, h_t(\theta),
\]
where $Z(w_t)$ is a $g \times 1$ vector-valued function of $w_t$, and $w_t$ is a set of variables drawn from the information set $I_t$. The $Z(w_t)$ are instrumental variables. We now have $g$ moment conditions, so as long as $g \ge K$ the necessary condition for identification holds. Stacking the $Z(w_t)'$ as the rows of an $n \times g$ matrix $Z_n$, and letting $h_n(\theta)$ be the $n$-vector with $t$-th element $h_t(\theta)$, we can form the $g$ moment conditions
\[
m_n(\theta) = \frac{1}{n} Z_n' h_n(\theta).
\]
Define $H_n(\theta) \equiv \frac{\partial}{\partial\theta} h_n'(\theta)$, the $K \times n$ matrix with the derivatives of the moment conditions as its columns. Likewise, define the var-cov. of the moment conditions,
\[
\Phi_n = E\left[ h_n(\theta^0) h_n(\theta^0)' \right],
\]
so that $\Omega_n = \frac{1}{n} Z_n' \Phi_n Z_n$. Using the efficient weighting matrix for the given instruments, the limiting variance of the GMM estimator is
(15.6.1)
\[
V = \lim \left[ \frac{H_n Z_n}{n} \left( \frac{Z_n' \Phi_n Z_n}{n} \right)^{-1} \frac{Z_n' H_n'}{n} \right]^{-1}.
\]
With the optimal instruments $Z_n^* = \Phi_n^{-1} H_n'$, this simplifies to
(15.6.2)
\[
V^* = \lim \left( \frac{1}{n} H_n \Phi_n^{-1} H_n' \right)^{-1},
\]
and furthermore, this matrix is smaller than the limiting var-cov for any other choice of instrumental variables. (To prove this, examine the difference of the inverses of the var-cov matrices with the optimal instruments and with non-optimal instruments. As above, you can show that the difference is positive semi-definite.)

Notes:
- Usually, estimation of the optimal instruments is not possible, since they depend on unknown quantities ($\theta^0$ and $\Phi_n$). They are an index of unattainable efficiency relative to feasible choices of instruments.
- Estimation of $\Phi_n$ is also problematic: it is an $n \times n$ matrix, so it has more unique elements than $n$, the sample size, and cannot be estimated consistently without further restrictions.
15.7. Estimation using dynamic moment conditions

Moment conditions may depend on data from more than one time period, as in the rational expectations application below. Such dynamic moment conditions fit into the same framework; the complication is that the $m_t$ are in general autocorrelated, so that nonparametric estimators of $\Omega_\infty$ of the type just discussed become necessary.

15.8. A specification test

The first order conditions for minimization, using an estimate of the optimal weighting matrix, are
\[
\frac{\partial}{\partial\theta} s_n(\hat\theta)
= 2 \left[ \frac{\partial}{\partial\theta} m_n'(\hat\theta) \right] \hat\Omega^{-1} m_n(\hat\theta) \equiv 0,
\]
or, defining $\hat D = \frac{\partial}{\partial\theta} m_n'(\hat\theta)$,
\[
\hat D \hat\Omega^{-1} m_n(\hat\theta) \equiv 0 .
\]
Consider a first order Taylor expansion of $m_n(\hat\theta)$ about $\theta^0$:
(15.8.1)
\[
m_n(\hat\theta) = m_n(\theta^0) + D_n'(\theta^*)\,(\hat\theta - \theta^0),
\]
for some $\theta^*$ between $\hat\theta$ and $\theta^0$. With this, and taking into account the f.o.c., one can solve for $(\hat\theta - \theta^0)$ and substitute back, to find that asymptotically
\[
\sqrt{n}\, m_n(\hat\theta) = \left[ I_g - D_\infty' \left( D_\infty \Omega_\infty^{-1} D_\infty' \right)^{-1} D_\infty \Omega_\infty^{-1} \right] \sqrt{n}\, m_n(\theta^0) + o_p(1).
\]
The matrix in brackets is idempotent of rank $g - K$ (as can be verified by calculating its trace), so
\[
n\, s_n(\hat\theta) = n\, m_n(\hat\theta)' \hat\Omega^{-1} m_n(\hat\theta) \to^d \chi^2(g - K),
\]
supposing the model is correctly specified. This is a convenient test, since we just multiply the optimized value of the objective function by $n$, and compare with a $\chi^2(g-K)$ critical value. The test is a general test of whether or not the moments used to estimate are correctly specified.

This won't work when the estimator is just identified. The f.o.c. are $\hat D \hat\Omega^{-1} m_n(\hat\theta) \equiv 0$, but with exact identification both $\hat D$ and $\hat\Omega$ are square and invertible (at least asymptotically), so $m_n(\hat\theta) \equiv 0$. So the moment conditions are zero regardless of the weighting matrix used; as such, we might as well use an identity matrix and save trouble. Also $s_n(\hat\theta) = 0$, so the test is not possible.

A note: this sort of test often over-rejects in finite samples. If the sample size is small, it might be better to use bootstrap critical values. That is, draw artificial samples of size $n$ by sampling from the data with replacement. For each of $R$ bootstrap samples, optimize and calculate the test statistic; the bootstrap critical value is the value of the statistic such that $100(1-\alpha)$ percent of the $R$ simulated statistics are smaller. One should choose $R$ large enough to determine the critical value with precision. This sort of test has been found to have quite good small sample properties.
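The percentile bootstrap just described can be sketched as follows. This is a Python illustration (the notes otherwise use Octave); `bootstrap_critical_value` is a hypothetical name, and the statistic is left generic since in the GMM case it would be the optimized objective times $n$, recomputed on each resample.

```python
import random

def bootstrap_critical_value(data, statistic, n_boot, alpha, seed=0):
    # Percentile bootstrap: resample the data with replacement, recompute
    # the statistic, and take the (1 - alpha) quantile of the bootstrap
    # distribution as the critical value.
    rng = random.Random(seed)
    n = len(data)
    stats = []
    for _ in range(n_boot):
        sample = [data[rng.randrange(n)] for _ in range(n)]
        stats.append(statistic(sample))
    stats.sort()
    k = min(int((1.0 - alpha) * n_boot), n_boot - 1)
    return stats[k]
```

One then rejects the null when the statistic computed on the original sample exceeds this critical value, rather than the asymptotic $\chi^2$ critical value.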
15.9. Other estimators interpreted as GMM estimators

15.9.1. OLS with heteroscedasticity of unknown form. Example: White's heteroscedasticity-consistent varcov estimator. Suppose $y = X\beta^0 + \varepsilon$, where $\varepsilon \sim N(0, \Sigma)$, with $\Sigma$ a diagonal matrix. The typical approach is to parameterize $\Sigma = \Sigma(\sigma)$, where $\sigma$ is a finite dimensional parameter vector, and to estimate $\beta$ and $\sigma$ jointly (feasible GLS). This will work well if the parameterization of $\Sigma$ is correct. If we're not confident about parameterizing $\Sigma$, we can still estimate $\beta$ consistently by OLS. However, the typical covariance estimator $\hat V(\hat\beta) = (X'X)^{-1}\hat\sigma^2$ will be biased and inconsistent, and will lead to invalid inferences.

By exogeneity of the regressors $x_t$ (a $K \times 1$ column vector) we have $E(x_t \varepsilon_t) = 0$, which suggests the moment condition $m_t(\beta) = x_t (y_t - x_t'\beta)$. In this case, we have exact identification ($K$ parameters and $K$ moment conditions). We have
\[
m_n(\beta) = \frac{1}{n}\sum_t x_t y_t - \left( \frac{1}{n}\sum_t x_t x_t' \right) \beta .
\]
Since the number of moment conditions is identical to the number of parameters, the foc imply that $m_n(\hat\beta) \equiv 0$ regardless of $W$. There is no need to use the optimal weighting matrix in this case, an identity matrix works just as well for the purpose of estimation. Therefore
\[
\hat\beta = \left( \sum_t x_t x_t' \right)^{-1} \sum_t x_t y_t = (X'X)^{-1} X'y,
\]
which is the usual OLS estimator. The GMM estimator of the asymptotic varcov matrix is $\left( \hat D_\infty \hat\Omega^{-1} \hat D_\infty' \right)^{-1}$. Recall that $\hat D_\infty$ is simply $\frac{\partial}{\partial\theta} m_n'(\hat\theta)$; in this case $\hat D_\infty = -\frac{1}{n} X'X$. Since the moment conditions are serially uncorrelated here, $\hat\Omega = \hat\Gamma_0 = \frac{1}{n}\sum_t \hat\varepsilon_t^2 x_t x_t'$, which has a constant number of elements to estimate, so information will accumulate, and consistency obtains. In the present case
\[
\hat V\left( \sqrt{n}(\hat\beta - \beta^0) \right)
= \left( \frac{X'X}{n} \right)^{-1} \left( \frac{1}{n}\sum_t \hat\varepsilon_t^2 x_t x_t' \right) \left( \frac{X'X}{n} \right)^{-1}.
\]
This is the varcov estimator that White (1980) arrived at in an influential article. This estimator is consistent under heteroscedasticity of an unknown form. If there is autocorrelation, the Newey-West estimator of $\Omega$ can be used instead; the rest is the same.
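The sandwich formula above can be sketched for the one-regressor case, where no matrix inversion is needed. This is a Python illustration with a hypothetical name; for $K > 1$ the same formula applies with matrices.

```python
def white_se(x, e):
    # White heteroscedasticity-consistent standard error for the slope in a
    # one-regressor model (no intercept):
    #   Var_hat = (sum x^2)^-1 (sum e^2 x^2) (sum x^2)^-1.
    sxx = sum(xi * xi for xi in x)
    meat = sum((ei * xi) ** 2 for xi, ei in zip(x, e))
    return (meat / (sxx * sxx)) ** 0.5
```

Under homoscedasticity this collapses to the usual OLS standard error, so it costs little to use it routinely when heteroscedasticity is suspected.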
15.9.2. Weighted Least Squares. Consider the previous example of a linear model with heteroscedasticity of unknown form:
\[
y = X\beta^0 + \varepsilon, \qquad \varepsilon \sim N(0, \Sigma),
\]
where $\Sigma$ is a diagonal matrix. Now, suppose that the form of $\Sigma$ is known, so that $\Sigma(\theta^0)$ is a correct parametric specification (which may also depend upon $X$). In this case, the GLS estimator is
\[
\tilde\beta = \left( X' \Sigma^{-1} X \right)^{-1} X' \Sigma^{-1} y .
\]
This estimator can be interpreted as the solution to the $K$ moment conditions
\[
m_n(\tilde\beta) = \frac{1}{n}\sum_t \frac{x_t y_t}{\sigma_t^2} - \frac{1}{n}\sum_t \frac{x_t x_t'}{\sigma_t^2}\, \tilde\beta \equiv 0 .
\]
That is, the GLS estimator in this case has an obvious representation as a GMM estimator. With autocorrelation, the representation exists but it is a little more complicated. Nevertheless, the idea is the same. There are a few points: the GLS estimator is asymptotically efficient in its class. This means that it is more efficient than the above example of OLS with White's heteroscedasticity-consistent covariance, which is an alternative GMM estimator based on the same data. The choice of the moment conditions matters for efficiency.
15.9.3. 2SLS. Consider the linear model
\[
y_t = z_t'\beta + \varepsilon_t,
\]
or $y = Z\beta + \varepsilon$ in the usual matrix notation. Suppose that this equation is one of a system of simultaneous equations, so that $z_t$ contains both endogenous and exogenous variables. Suppose that $x_t$ is the vector of all exogenous and predetermined variables that are uncorrelated with $\varepsilon_t$ (suppose $x_t$ is $r \times 1$). Define $\hat Z$ as the matrix of predictions of $Z$ when regressed upon $X$:
\[
\hat Z = X (X'X)^{-1} X' Z .
\]
Since $\hat Z$ is a linear combination of the exogenous variables $x$, $\hat z_t$ must be uncorrelated with $\varepsilon_t$. This suggests the $K$-dimensional moment condition $m_t(\beta) = \hat z_t (y_t - z_t'\beta)$, so
\[
m_n(\beta) = \frac{1}{n}\sum_t \hat z_t \left( y_t - z_t'\beta \right).
\]
Since we have $K$ parameters and $K$ moment conditions, the GMM estimator will set $m_n$ identically equal to zero, regardless of $W$, so we have
\[
\hat\beta = \left( \sum_t \hat z_t z_t' \right)^{-1} \sum_t \hat z_t y_t = \left( \hat Z' Z \right)^{-1} \hat Z' y .
\]
This is the standard formula for 2SLS. We use the exogenous variables and the reduced form predictions of the endogenous variables as instruments, and apply IV estimation. See Hamilton pp. 420-21 for the varcov formula (which is the standard formula for 2SLS), and for how to deal with $\varepsilon_t$ heterogeneous and dependent (basically, just use the Newey-West or some other consistent estimator of $\Omega$, and apply the usual formula). Note that $\varepsilon_t$ dependent causes lagged endogenous variables to lose their status as legitimate instruments.
15.9.4. Nonlinear simultaneous equations. GMM provides a convenient way to estimate nonlinear systems of equations. We have a system of equations of the form
\[
y_{1t} = f_1(z_t, \theta_1^0) + \varepsilon_{1t}, \quad \ldots, \quad y_{Gt} = f_G(z_t, \theta_G^0) + \varepsilon_{Gt},
\]
or in compact notation
\[
y_t = f(z_t, \theta^0) + \varepsilon_t,
\]
where $f(\cdot)$ is a $G$-vector valued function, and $\theta^0 = (\theta_1^{0\prime}, \theta_2^{0\prime}, \ldots, \theta_G^{0\prime})'$. We need to find an $A_i \times 1$ vector of instruments $x_{it}$, for each equation, that are uncorrelated with $\varepsilon_{it}$. Typical instruments would be low order monomials in the exogenous variables in $z_t$, with their lagged values. Then we can define the $\left( \sum_i A_i \right) \times 1$ orthogonality conditions
\[
m_t(\theta) = \begin{bmatrix}
\left( y_{1t} - f_1(z_t, \theta_1) \right) x_{1t} \\
\vdots \\
\left( y_{Gt} - f_G(z_t, \theta_G) \right) x_{Gt}
\end{bmatrix}.
\]
Once the moment conditions are defined, the estimation procedure is the same as for the other GMM estimators we have seen: average the $m_t(\theta)$, and minimize the quadratic form in an estimate of the optimal weighting matrix.
15.9.5. Maximum likelihood. A distribution with $P$ parameters can be characterized by $P$ moment conditions, but some sets of $P$ moment conditions may contain more information than others, since the moment conditions could be highly correlated. A GMM estimator that chose an optimal set of $P$ moment conditions would be fully efficient. Here we'll see that the optimal moment conditions are simply the scores of the ML estimator.

Let $y_t$ be a $G$-vector of variables, and let $Y_t = (y_1', y_2', \ldots, y_t')'$. Then at time $t$, $Y_{t-1}$ has been observed (refer to it as the information set, since the conditioning variables have been selected to take advantage of all useful information). The likelihood function is the joint density of the sample:
\[
L(\theta) = f(y_n, y_{n-1}, \ldots, y_1; \theta),
\]
which can be factored as
\[
L(\theta) = f(y_n \mid Y_{n-1}; \theta) \cdot f(Y_{n-1}; \theta),
\]
and repeating this,
\[
L(\theta) = \prod_{t=1}^n f(y_t \mid Y_{t-1}; \theta).
\]
The average log-likelihood function is therefore
\[
s_n(\theta) = \frac{1}{n} \ln L(\theta) = \frac{1}{n} \sum_{t=1}^n \ln f(y_t \mid Y_{t-1}; \theta).
\]
Following the usual ML conditions, the scores have conditional mean zero when evaluated at $\theta^0$ (see notes to Introduction to Econometrics):
\[
E\left\{ D_\theta \ln f(y_t \mid Y_{t-1}; \theta^0) \mid Y_{t-1} \right\} = 0,
\]
so one could interpret these as moment conditions to use to define a just-identified GMM estimator (if there are $K$ parameters there are $K$ score equations). The GMM estimator sets
\[
\frac{1}{n}\sum_{t=1}^n m_t(\hat\theta) = \frac{1}{n}\sum_{t=1}^n D_\theta \ln f(y_t \mid Y_{t-1}; \hat\theta) = 0,
\]
which are precisely the first order conditions of MLE. Therefore, MLE can be interpreted as a GMM estimator. Because the scores have conditional mean zero, they are serially uncorrelated (the law of iterated expectations, conditioning on $Y_{t-1}$, preserves uncorrelation - see the section on ML estimation, above). The fact that the scores are serially uncorrelated implies that $\Omega$ can be estimated by the estimator of the 0-th autocovariance of the moment conditions:
\[
\hat\Omega = \frac{1}{n}\sum_{t=1}^n \left[ D_\theta \ln f(y_t \mid Y_{t-1}; \hat\theta) \right] \left[ D_\theta \ln f(y_t \mid Y_{t-1}; \hat\theta) \right]'.
\]
Recall from study of ML estimation that the information matrix equality (equation ??) states that
\[
E\left\{ \left[ D_\theta \ln f(y_t \mid Y_{t-1}; \theta^0) \right] \left[ D_\theta \ln f(y_t \mid Y_{t-1}; \theta^0) \right]' \right\}
= -E\left\{ D^2_\theta \ln f(y_t \mid Y_{t-1}; \theta^0) \right\}.
\]
This result implies the well known (and already seen) result that we can estimate the variance of the ML estimator in any of three ways:
- the sandwich version, combining the Hessian and the outer product of the gradient;
- the inverse of the negative of the Hessian (since the middle and last terms cancel, except for a minus sign);
- the inverse of the outer product of the gradient (since the middle and last cancel except for a minus sign, and the first term converges to minus the inverse of the middle term, which is still inside the overall inverse).

This simplification is a special result for the MLE estimator - it doesn't apply to GMM estimators in general.

Asymptotically, if the model is correctly specified, all of these forms converge to the same limit. In small samples they will differ. In particular, there is evidence that the outer product of the gradient formula does not perform very well in small samples (see Davidson and MacKinnon, pg. 477). White's Information matrix test (Econometrica, 1982) is based upon comparing the two ways to estimate the information matrix: outer product of gradient or negative of the Hessian. If they differ by too much, this is evidence of misspecification of the model.
15.10. Example: The Hausman Test. Consider the simple linear regression model $y_t = x_t'\beta + \varepsilon_t$. We assume that the functional form and the choice of regressors is correct, but that some of the regressors may be correlated with the error term, which as you know will produce inconsistency of $\hat\beta$. For example, this will be the case if some regressors are endogenous, if regressors are measured with error, or if lagged values of the dependent variable are used as regressors and $\varepsilon_t$ is autocorrelated.
FIGURE 15.10.1. OLS and IV estimators when regressors and errors are correlated
To illustrate, the Octave program biased.m performs a Monte Carlo experiment where errors are correlated with regressors, and estimation is by OLS and IV. Figure 15.10.1 shows that the OLS estimator is quite biased, while the IV estimator is on average much closer to the true value. If you play with the program, increasing the sample size, you can see evidence that the OLS estimator is asymptotically biased, while the IV estimator is consistent.

We have seen that the inconsistent and the consistent estimators converge to different probability limits. This is the idea behind the Hausman test: a pair of consistent estimators converge to the same probability limit, while if one is consistent and the other is not they converge to different limits. If we accept that one is consistent (e.g., the IV estimator), but we are doubting if the other is consistent (e.g., the OLS estimator), we might try to check if the difference between the estimators is significantly different from zero.
If we're doubting about the consistency of OLS (or QML, etc.), why should we be interested in testing, rather than just using the IV estimator? Because the OLS estimator is more efficient when the regressors are exogenous and the other classical assumptions (including normality of the errors) hold. When we have a more efficient estimator that relies on stronger assumptions (such as exogeneity) than the IV estimator, we might prefer to use it, unless we have evidence that the assumptions are false.
So, let's consider the covariance between the MLE estimator $\hat\theta$ (or any other fully efficient estimator) and some other CAN estimator, say $\tilde\theta$. Now, let's recall some results from MLE. The MLE has the first order asymptotic representation
\[
\sqrt{n}\,(\hat\theta - \theta^0) = -H_\infty(\theta^0)^{-1} \sqrt{n}\, g_n(\theta^0) + o_p(1).
\]
Equation 4.6.2 is $H_\infty(\theta) = -I_\infty(\theta)$. Combining these two equations, we get
\[
\sqrt{n}\,(\hat\theta - \theta^0) = I_\infty(\theta^0)^{-1} \sqrt{n}\, g_n(\theta^0) + o_p(1).
\]
Also, equation 4.7.1 tells us that the asymptotic covariance between any CAN estimator and the MLE score vector is
\[
V_\infty \begin{bmatrix} \sqrt{n}\,(\tilde\theta - \theta) \\ \sqrt{n}\, g(\theta) \end{bmatrix}
= \begin{bmatrix} V_\infty(\tilde\theta) & I_K \\ I_K & I_\infty(\theta) \end{bmatrix}.
\]
Now, consider
\[
\begin{bmatrix} \sqrt{n}\,(\tilde\theta - \theta) \\ \sqrt{n}\,(\hat\theta - \theta) \end{bmatrix}
= \begin{bmatrix} I_K & 0 \\ 0 & I_\infty^{-1} \end{bmatrix}
\begin{bmatrix} \sqrt{n}\,(\tilde\theta - \theta) \\ \sqrt{n}\, g(\theta) \end{bmatrix} + o_p(1).
\]
The asymptotic covariance of this is
\[
\begin{bmatrix} I_K & 0 \\ 0 & I_\infty^{-1} \end{bmatrix}
\begin{bmatrix} V_\infty(\tilde\theta) & I_K \\ I_K & I_\infty \end{bmatrix}
\begin{bmatrix} I_K & 0 \\ 0 & I_\infty^{-1} \end{bmatrix}
= \begin{bmatrix} V_\infty(\tilde\theta) & I_\infty^{-1} \\ I_\infty^{-1} & I_\infty^{-1} \end{bmatrix}.
\]
So, the asymptotic covariance between the MLE and any other CAN estimator is equal to the MLE asymptotic variance (the inverse of the information matrix).

Now, suppose we wish to test whether the two estimators are in fact both converging to $\theta^0$, versus the alternative hypothesis that the "MLE" estimator is not in fact consistent (the consistency of $\tilde\theta$ is a maintained hypothesis). Under the null hypothesis,
\[
\sqrt{n}\,(\tilde\theta - \hat\theta)
= \begin{bmatrix} I_K & -I_K \end{bmatrix}
\begin{bmatrix} \sqrt{n}\,(\tilde\theta - \theta^0) \\ \sqrt{n}\,(\hat\theta - \theta^0) \end{bmatrix}
\to^d N\!\left[ 0,\; V_\infty(\tilde\theta) - I_\infty(\theta^0)^{-1} \right].
\]
So,
\[
n\,(\tilde\theta - \hat\theta)' \left( \hat V_\infty(\tilde\theta) - \hat I_\infty^{-1} \right)^{-1} (\tilde\theta - \hat\theta) \to^d \chi^2(\rho),
\]
where $\rho$ is the rank of the difference of the asymptotic variances. A statistic with the same asymptotic distribution, expressed in terms of estimated variances of the two estimators, is
\[
(\tilde\theta - \hat\theta)' \left( \hat V(\tilde\theta) - \hat V(\hat\theta) \right)^{-1} (\tilde\theta - \hat\theta) \to^d \chi^2(\rho).
\]
This is the Hausman test statistic, in its original form. The reason that this test has power under the alternative hypothesis is that in that case the "MLE" estimator will not be consistent, and will converge to $\theta_A$, say, where $\theta_A \neq \theta^0$. Then the mean of the asymptotic distribution of $\sqrt{n}\,(\tilde\theta - \hat\theta)$ will be nonzero, so the test statistic will eventually reject, regardless of how small a significance level is used.

Notes:
- If the test is based on a sub-vector of the parameter vector of the MLE, it is possible that the inconsistency of the MLE will not show up in the portion of the vector that has been used. If this is the case, the test may not have power to detect the inconsistency. This may occur, for example, when the consistent but inefficient estimator is not identified for all the parameters of the model.
- The rank $\rho$ is often less than the dimension of the matrices, and it may be difficult to determine what the true rank is. If the true rank is lower than what is taken to be true, the test will be biased against rejection of the null hypothesis. The contrary holds if we underestimate the rank.
- A solution to this problem is to use a rank 1 test, by comparing only a single coefficient.
Following up on this last point, let's think of two not necessarily efficient estimators, $\hat\theta_1$ and $\hat\theta_2$, where one is assumed to be consistent, but the other may not be. We assume that both belong to the same parameter space, and that they can be expressed as generalized method of moments (GMM) estimators. The estimators are defined (suppressing the dependence upon data) by
\[
\hat\theta_i = \arg\min_{\theta_i} m_i(\theta_i)' W_i\, m_i(\theta_i),
\]
where $m_i(\theta_i)$ is a $g_i \times 1$ vector of moment conditions, and $W_i$ is a $g_i \times g_i$ positive definite weighting matrix, $i = 1, 2$. Consider the omnibus GMM estimator
(15.10.1)
\[
\left( \hat\theta_1, \hat\theta_2 \right) = \arg\min
\begin{bmatrix} m_1(\theta_1)' & m_2(\theta_2)' \end{bmatrix}
\begin{bmatrix} W_1 & 0 \\ 0 & W_2 \end{bmatrix}
\begin{bmatrix} m_1(\theta_1) \\ m_2(\theta_2) \end{bmatrix}.
\]
Suppose that the asymptotic covariance of the omnibus moment vector is
(15.10.2)
\[
\Sigma = \lim \operatorname{Var}\left\{ \sqrt{n} \begin{bmatrix} m_1(\theta_1) \\ m_2(\theta_2) \end{bmatrix} \right\}
\equiv \begin{bmatrix} \Sigma_1 & \Sigma_{12} \\ \cdot & \Sigma_2 \end{bmatrix}.
\]
The standard Hausman test is equivalent to a Wald test of the equality of $\theta_1$ and $\theta_2$ (or subvectors of the two) applied to the omnibus GMM estimator, but with the covariance of the moment conditions estimated as
\[
\hat\Sigma = \begin{bmatrix} \hat\Sigma_1 & 0 \\ 0 & \hat\Sigma_2 \end{bmatrix}.
\]
While this is clearly an inconsistent estimator in general, the omitted $\Sigma_{12}$ term cancels out of the test statistic when one of the estimators is asymptotically efficient, as we have seen above, and thus it need not be estimated.

The general solution when neither of the estimators is efficient is clear: the entire $\Sigma$ matrix must be estimated consistently, since the $\Sigma_{12}$ term will not cancel out when neither estimator is efficient. However, the test suffers from a loss of power due to the fact that the omnibus GMM estimator of equation 15.10.1 is defined using an inefficient weight matrix. A new test can be defined by using an alternative omnibus GMM estimator
(15.10.3)
\[
\left( \hat\theta_1, \hat\theta_2 \right) = \arg\min
\begin{bmatrix} m_1(\theta_1)' & m_2(\theta_2)' \end{bmatrix}
\tilde\Sigma^{-1}
\begin{bmatrix} m_1(\theta_1) \\ m_2(\theta_2) \end{bmatrix},
\]
where $\tilde\Sigma$ is a consistent estimate of $\Sigma$, and applying the Wald test to the estimator of equation 15.10.3.
15.11. Application: Nonlinear rational expectations

A simple example is the consumption-based asset pricing model. A representative consumer maximizes the expected discounted stream of utility from the sequence of consumptions:

(15.11.1)   E_t Σ_{s=0}^∞ βˢ u(c_{t+s})

subject to a budget constraint. Current wealth is divided between consumption c_t, whose price is normalized to 1, and investment i_t in an asset whose gross return (1 + r_{t+1}) between t and t + 1 is risky.

A partial set of necessary conditions for utility maximization have the form:

(15.11.2)   u′(c_t) = β E_t [u′(c_{t+1})(1 + r_{t+1})]

To see that the condition is necessary, suppose that the lhs < rhs. Then reducing current consumption marginally would cause equation 15.11.1 to drop by u′(c_t), while the marginally increased savings, invested at the risky rate, would increase expected discounted utility by β E_t[u′(c_{t+1})(1 + r_{t+1})]. The net change would be positive, so the original plan could not have been optimal.

Suppose that the consumer has constant relative risk aversion utility, u(c_t) = (c_t^{1−γ} − 1)/(1 − γ), so that u′(c_t) = c_t^{−γ}, where γ represents the coefficient of relative risk aversion. With this form of utility the foc are

c_t^{−γ} = β E_t [c_{t+1}^{−γ} (1 + r_{t+1})]

or, dividing through by c_t^{−γ},

1 = β E_t [(c_{t+1}/c_t)^{−γ} (1 + r_{t+1})].

Note that the ratio c_{t+1}/c_t is stationary, even though consumption is in real terms, and our theory requires stationarity.

Define h_{t+1}(θ) ≡ β (1 + r_{t+1})(c_{t+1}/c_t)^{−γ} − 1, where θ = (β, γ)′. By the foc, E_t h_{t+1}(θ°) = 0. Suppose that x_t is a vector of variables drawn from the information set at time t. Since x_t is known at t, rational expectations implies E[h_{t+1}(θ°) x_t] = 0, so the functions m_t(θ) = h_{t+1}(θ) x_t are usable moment conditions for GMM estimation of β and γ.

With an initial consistent estimate (e.g., using an identity weight matrix), one can estimate the efficient weight matrix and minimize again. This process can be iterated, e.g., use the new estimate θ̂ to re-estimate the covariance of the moment conditions, use this to estimate a new θ̂, and so on, to convergence.

This whole approach relies on the very strong assumption that equation 15.11.2 holds without error. Supposing agents were heterogeneous, this wouldn't be reasonable. If there were an error term here, it could potentially be autocorrelated, which would no longer allow any variable in the information set to be used as an instrument.
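These moment conditions are easy to check numerically. The sketch below (in Python; the data generating process is an assumed construction in which the Euler equation holds by design — it is not the design used for the estimates reported below) verifies that m_t(θ) averages to approximately zero at the true (β, γ):

```python
import numpy as np

# Simulate a DGP in which the CRRA Euler equation holds exactly in
# expectation (an assumed construction, for illustration only).
rng = np.random.default_rng(9)
n, beta, gamma, sig = 5000, 0.95, 2.0, 0.1
g = np.exp(0.02 + 0.1 * rng.normal(size=n + 1))           # gross consumption growth
zeta = np.exp(sig * rng.normal(size=n + 1) - sig**2 / 2)  # E[zeta | I_t] = 1
R = g**gamma / beta * zeta                                # gross return: Euler eq. holds

# h_{t+1}(theta) = beta (1 + r_{t+1}) (c_{t+1}/c_t)^{-gamma} - 1
e = beta * R[1:] * g[1:] ** (-gamma) - 1.0
z = np.column_stack([np.ones(n), g[:-1], R[:-1]])         # instruments from I_t
mbar = (e[:, None] * z).mean(axis=0)                      # sample moments at theta0
```

At the true parameter values the three sample moments are all close to zero, as the theory requires.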
Since use of more moment conditions will lead to a more (asymptotically) efficient estimator, one might be tempted to use many instrumental variables. We will do a computer lab that will show that this may not be a good idea with finite samples. This issue has been studied using Monte Carlos (Tauchen, JBES, 1986). The reason for poor performance when using many instruments is that the estimate of the efficient weight matrix becomes imprecise.

Empirical papers that use this approach often have serious problems in obtaining precise estimates of the parameters. Note that we are basing everything on a single partial first order condition, which may simply not be very informative.

The following output illustrates GMM estimation of the rational expectations model, using the data file tauchen.data, which contains the consumption and return series used in the moment conditions, generated as in Tauchen (JBES, 1986):

***********************************************
Example of GMM estimation of rational expectations model

X^2 test    Value     df       p-value
            6.6841    5.0000   0.2452

          estimate   st. err   t-stat    p-value
beta      0.8723     0.0220    39.6079   0.0000
gamma     3.1555     0.2854    11.0580   0.0000
***********************************************

The instruments used were a constant and lagged values of the variables, and the weight matrix was iterated to convergence. Comment on the results. Are the results sensitive to the set of instruments used?
Exercises

(1) Show how to cast the generalized IV estimator presented in section 11.4 as a GMM estimator. In particular, what are the moment conditions, and what is the weight matrix?

(2) Using Octave, generate data from the logit dgp. Recall that E(y|x) = Λ(x′θ) = [1 + exp(−x′θ)]⁻¹. Consider the moment conditions (exactly identified): m_t(θ) = [y_t − Λ(x_t′θ)] x_t.
    (a) Estimate by GMM, using these moments. Estimate by MLE.
    (b) The two estimators should coincide. Prove analytically that the estimators coincide.
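Part (b) rests on the fact that the logit score is exactly the moment vector given above, so the MLE solves the (exactly identified) GMM moment equations. A minimal numerical check, sketched in Python rather than Octave:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
n, theta0 = 1000, np.array([0.5, -1.0])
x = np.column_stack([np.ones(n), rng.normal(size=n)])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-x @ theta0))).astype(float)

def moments(t):
    # m_n(theta) = (1/n) sum [y_t - Lambda(x_t' theta)] x_t
    return ((y - 1 / (1 + np.exp(-x @ t)))[:, None] * x).mean(axis=0)

def neg_loglik(t):
    z = x @ t
    return -(y * z - np.log1p(np.exp(z))).sum()

def neg_score(t):
    return -n * moments(t)  # the logit score IS the moment vector (times n)

theta_mle = minimize(neg_loglik, np.zeros(2), jac=neg_score).x
# The MLE drives the GMM moments to zero, so the two estimators coincide.
```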
CHAPTER 16

Quasi-ML

Quasi-ML is the estimator one obtains when a misspecified probability model is used to calculate an "ML" estimator.

Given a sample of size n of a random vector y and a vector of conditioning variables x, suppose the joint density of Y = (y₁, ..., yₙ) conditional on X = (x₁, ..., xₙ) is a member of the parametric family p(Y|X, ρ), ρ ∈ Ξ, and that the marginal density of X doesn't depend on ρ, so that this conditional density fully characterizes the random characteristics of samples. The QML estimator is based instead on a (possibly misspecified) family of densities for the individual observations, f_t(y_t|x_t, θ), θ ∈ Θ, such that there is no θ ∈ Θ for which f_t(y_t|x_t, θ) equals the true conditional density of y_t, for all t (this is what we mean by misspecified).

This setup allows for heterogeneous time series data, with dynamic misspecification.

The QML estimator is the argument that maximizes the misspecified average log likelihood, which we refer to as the quasi-log likelihood function. This objective function is

s_n(θ) = (1/n) Σ_{t=1}^n ln f_t(y_t|x_t, θ)

and the QML is θ̂_n = arg max_Θ s_n(θ).
A SLLN applies to s_n(θ), pointwise in θ, so s_n(θ) → s̄(θ) ≡ lim E s_n(θ), a.s. We assume that this can be strengthened to uniform convergence, a.s., following the previous arguments. The pseudo-true value of θ is the value that maximizes s̄(θ):

θ° = arg max_Θ s̄(θ).

Given assumptions so that the consistency theorem for extremum estimators applies (compactness of Θ, continuity of s_n(θ) and its uniform almost sure convergence to s̄(θ), and a unique maximizer θ°), we obtain

lim_{n→∞} θ̂_n = θ°, a.s.

An argument similar to that used for the asymptotic normality of the MLE yields

√n (θ̂_n − θ°) →d N[0, J(θ°)⁻¹ I(θ°) J(θ°)⁻¹]

where J(θ°) = lim E D²_θ s_n(θ°) and I(θ°) = lim Var √n D_θ s_n(θ°).

Note that asymptotic normality only requires that the additional assumptions regarding J and I hold in a neighborhood of θ° for J and at θ° for I, not throughout Θ. In this sense, asymptotic normality is a local property.
16.0.1. Consistent Estimation of Variance Components. Consistent estimation of J(θ°) is straightforward. Assumption (b) of Theorem 22 implies that

J_n(θ̂_n) = (1/n) Σ_{t=1}^n D²_θ ln f_t(y_t|x_t, θ̂_n) → J(θ°), a.s.

That is, just calculate the Hessian using the estimate θ̂_n in place of θ°.

Consistent estimation of I(θ°) is more difficult. Notation: let g_t ≡ D_θ ln f_t(y_t|x_t, θ°). We need to estimate

I(θ°) = lim Var √n D_θ s_n(θ°)
      = lim Var (1/√n) Σ_{t=1}^n g_t
      = lim (1/n) E [ (Σ_t (g_t − E g_t)) (Σ_t (g_t − E g_t))′ ].

This is going to contain a term

lim (1/n) Σ_t (E g_t)(E g_t)′

which will not tend to zero, in general. This term is not consistently estimable in general, since it requires calculating an expectation using the true density under the d.g.p., which is unknown.

There are important cases where I(θ°) is consistently estimable. For example, suppose that the data come from a random sample (i.e., they are iid). This would be the case with cross sectional data, for example. (Note: under i.i.d. sampling only the joint distribution of (y_t, x_t) is identical. This does not imply that the conditional density f(y_t|x_t) is identical.) With random sampling, the limiting objective function is simply

s̄(θ°) = E_X E₀ ln f(y|x, θ°)

where E₀ means expectation of y conditional on x and E_X means expectation with respect to the marginal density of x. By the requirement that the limiting objective function be maximized at θ° we have

D_θ E_X E₀ ln f(y|x, θ°) = D_θ s̄(θ°) = 0.

The dominated convergence theorem allows switching the order of expectation and differentiation, so

D_θ E_X E₀ ln f(y|x, θ°) = E_X E₀ D_θ ln f(y|x, θ°) = 0.

The CLT implies that

(1/√n) Σ_{t=1}^n D_θ ln f(y_t|x_t, θ°) →d N(0, I(θ°)).

That is, it's not necessary to subtract the individual means, since they are zero. Given this, and due to independent observations, a consistent estimator is

Î = (1/n) Σ_{t=1}^n D_θ ln f_t(θ̂) D_θ′ ln f_t(θ̂).

This is an important case where consistent estimation of the covariance matrix is possible.
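To make the iid case concrete, here is a sketch (in Python; the Poisson-with-gamma-heterogeneity design is an assumed example) that fits a deliberately misspecified Poisson model by QML and compares the sandwich variance J⁻¹ÎJ⁻¹/n to the "naive" ML variance that would be valid were the model correct:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = np.column_stack([np.ones(n), rng.normal(size=n)])
theta_true = np.array([0.5, 0.3])
# Overdispersed counts: Poisson rate with gamma heterogeneity, so the
# homogeneous Poisson "model" fitted below is misspecified (QML setting).
lam = np.exp(x @ theta_true) * rng.gamma(shape=2.0, scale=0.5, size=n)
y = rng.poisson(lam)

def score_hess(theta):
    mu = np.exp(x @ theta)
    g = (y - mu)[:, None] * x            # per-observation scores D ln f_t
    H = -(x * mu[:, None]).T @ x / n     # average Hessian J_n(theta)
    return g, H

theta = np.zeros(2)
for _ in range(50):                      # Newton iterations for the Poisson QML
    g, H = score_hess(theta)
    step = np.linalg.solve(-H, g.mean(axis=0))
    theta = theta + step
    if np.max(np.abs(step)) < 1e-10:
        break

g, J = score_hess(theta)
I_hat = g.T @ g / n                      # outer-product estimate of I(theta0)
sandwich = np.linalg.solve(J, I_hat) @ np.linalg.inv(J) / n
naive = np.linalg.inv(-J) / n            # correct only if the model were true
```

With overdispersion the sandwich standard errors are noticeably larger than the naive ones, which is the point of estimating I(θ°) separately from J(θ°).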
CHAPTER 17

Nonlinear least squares (NLS)

The nonlinear least squares model is y_t = f(x_t, θ°) + ε_t. In general, ε_t may be heteroscedastic and autocorrelated, but dealing with this is exactly as in the case of linear models, so we'll just treat the iid case here: ε_t ~ iid(0, σ²).

If we stack the observations vertically, defining y = (y₁, y₂, ..., yₙ)′, f(θ) = (f(x₁, θ), f(x₂, θ), ..., f(xₙ, θ))′ and ε = (ε₁, ε₂, ..., εₙ)′, we can write the n observations as

y = f(θ) + ε.

The NLS estimator minimizes s_n(θ) = (1/n)[y − f(θ)]′[y − f(θ)], i.e., the squared Euclidean distance between y and f(θ). Expanding, s_n(θ) = (1/n)[y′y − 2y′f(θ) + f(θ)′f(θ)]. Define the n × K matrix of derivatives of the regression function

(17.1.1)   F(θ) ≡ D_θ′ f(θ).

In shorthand, use F̂ in place of F(θ̂). The first order conditions for the NLS estimator θ̂ can then be written as

−F̂′y + F̂′f(θ̂) ≡ 0

or

(17.1.2)   F̂′[y − f(θ̂)] ≡ 0.

This bears a good deal of similarity to the f.o.c. for the linear model: the derivative of the prediction is orthogonal to the prediction error. If f(θ) = Xθ, then F̂ is simply X, so the f.o.c. reduce to the OLS normal equations.
17.2. Identification

As before, identification can be considered conditional on the sample, and asymptotically. The condition for asymptotic identification is that s_n(θ) tend to a limiting function s̄(θ) such that θ° is its unique minimizer, which will be the case if s̄(θ) is strictly convex at θ°. Writing y_t = f(x_t, θ°) + ε_t, the objective function is

s_n(θ) = (1/n) Σ_t [f(x_t, θ°) + ε_t − f(x_t, θ)]²
       = (1/n) Σ_t [f(x_t, θ°) − f(x_t, θ)]² + (1/n) Σ_t ε_t² + (2/n) Σ_t ε_t [f(x_t, θ°) − f(x_t, θ)].

As in example 14.3, which illustrated the consistency of extremum estimators using OLS, we conclude that the last term will converge pointwise to 0, since the regression function is uncorrelated with the error, and a LLN applies to the middle term, so it converges to σ². In many cases f(x_t, θ) is bounded and continuous in θ — for example the logistic regression function f(x_t, θ) = [1 + exp(−x_t′θ)]⁻¹ — so strengthening to uniform almost sure convergence is straightforward. The limiting objective function is therefore

(17.2.1)   s̄(θ) = E_x [f(x, θ°) − f(x, θ)]² + σ²,

which is clearly minimized at θ = θ°. For identification (asymptotic), the question is whether or not there may be some other minimizer. A local condition for identification is that

D²_θ s̄(θ) = D²_θ E_x [f(x, θ°) − f(x, θ)]²

be positive definite at θ°. Evaluating this derivative at θ°, we obtain (after a little work)

D²_θ s̄(θ°) = 2 E_x [D_θ f(x, θ°) D_θ′ f(x, θ°)],

the expectation of the outer product of the gradient of the regression function evaluated at θ°. (Note: the uniform boundedness we have assumed allows passing the derivative through the integral, by the dominated convergence theorem.) This matrix will be positive definite (wp1) as long as the gradient vector is of full rank (wp1). The tangent space to the regression manifold must span a K-dimensional space if we are to consistently estimate a K-dimensional parameter vector. This is analogous to the requirement that there be no perfect collinearity in a linear model.
17.3. Consistency

We simply assume that the conditions of Theorem 19 hold, so the estimator is consistent. Given that the strong stochastic equicontinuity conditions hold, as discussed above, and given the above identification conditions and a compact estimation space, the consistency proof's assumptions are satisfied.

17.4. Asymptotic normality

The usual asymptotic normality result takes the form

√n (θ̂ − θ°) →d N[0, J(θ°)⁻¹ I(θ°) J(θ°)⁻¹],

so we need the limiting Hessian J(θ°) and the limiting variance of the scaled score, I(θ°) = lim Var √n D_θ s_n(θ°). The objective function is s_n(θ) = (1/n) Σ_t [y_t − f(x_t, θ)]², so

D_θ s_n(θ) = −(2/n) Σ_t [y_t − f(x_t, θ)] D_θ f(x_t, θ).

Evaluating at θ°, and noting that y_t − f(x_t, θ°) = ε_t,

D_θ s_n(θ°) = −(2/n) Σ_t ε_t D_θ f(x_t, θ°) = −(2/n) F′ε,

where F = F(θ°), so

√n D_θ s_n(θ°) = −2 (1/√n) F′ε →d N(0, 4σ² Q*),   Q* = lim E (F′F/n),

giving I(θ°) = 4σ² Q*. Taking the second derivative of s_n(θ), evaluating at θ° and passing to the limit (the term involving ε has zero expectation) gives J(θ°) = 2Q*. Combining these results,

(17.4.1)   √n (θ̂ − θ°) →d N(0, σ² Q*⁻¹),

with σ̂² (F̂′F̂/n)⁻¹, where σ̂² = (1/n) Σ_t [y_t − f(x_t, θ̂)]², the obvious estimator. Note the close correspondence to the results for the linear model.
17.5. Example: The Poisson model for count data

Suppose that y_t conditional on x_t is independently distributed Poisson. A Poisson random variable is a count data variable, which means it can take the values {0,1,2,...}. This sort of model has been used to study visits to doctors per year, number of patents registered by businesses per year, etc.

The Poisson density is

f(y_t) = exp(−λ_t) λ_t^{y_t} / y_t!,   y_t ∈ {0, 1, 2, ...}.

The mean of y_t is λ_t, as is the variance. Note that λ_t must be positive. Suppose that the true mean is λ_t° = exp(x_t′β°), which enforces the positivity of λ_t. Suppose we estimate β° by nonlinear least squares:

β̂ = arg min s_n(β) = (1/n) Σ_t [y_t − exp(x_t′β)]².

We can write

s_n(β) = (1/n) Σ_t [exp(x_t′β°) − exp(x_t′β)]² + (1/n) Σ_t ε_t² + (2/n) Σ_t ε_t [exp(x_t′β°) − exp(x_t′β)].

The last term has expectation zero since the assumption that E(y|x) = exp(x′β°) implies that E(ε|x) = 0, which in turn implies that functions of x are uncorrelated with ε. Applying a strong LLN, and noting that the objective function is continuous on a compact parameter space, we get

s̄(β) = E_x [exp(x′β°) − exp(x′β)]² + E_x exp(x′β°),

where the last term comes from the fact that the conditional variance of ε is the same as the conditional variance of y, which is exp(x′β°) under the Poisson assumption. This limiting objective function is clearly minimized at β = β°, so the NLS estimator is consistent as long as identification holds.
17.6. The Gauss-Newton algorithm

The Gauss-Newton optimization technique is specifically designed for nonlinear least squares. The idea is to linearize the nonlinear model, rather than the objective function. The model is

y = f(θ°) + ε.

At some θ¹ in the parameter space, not equal to θ°, we have

y = f(θ¹) + ν,

where ν includes both the fundamental error ε and the error due to evaluating the regression function at θ¹ rather than the true value θ°. Take a first order Taylor series approximation around θ¹:

y = f(θ¹) + [D_θ′ f(θ¹)] (θ − θ¹) + ν + approximation error.

Defining z ≡ y − f(θ¹) and b ≡ (θ − θ¹), this can be written as

z = F(θ¹) b + ω,

where, as above, F(θ¹) is the n × K matrix of derivatives of the regression function, evaluated at θ¹, and ω is ν plus the approximation error from the truncated Taylor series. Note that F(θ¹) is known, given θ¹, so b can be estimated simply by OLS. Given b̂, a new round estimate is θ² = θ¹ + b̂. With this, take a new approximation around θ² and repeat the process, stopping when b̂ ≈ 0 (to within a specified tolerance).

To see why this might work, consider the above approximation, but evaluated at the NLS estimator:

y = f(θ̂) + F(θ̂)(θ − θ̂) + ω.

The OLS estimate of b ≡ θ − θ̂ is

b̂ = (F̂′F̂)⁻¹ F̂′[y − f(θ̂)],

which is zero by definition of the NLS estimator (these are the normal equations as in equation 17.1.2). So when we evaluate at θ̂, updating stops: θ̂ is a fixed point of the algorithm. The method doesn't require second derivatives of the regression function, and the varcov estimator of equation 17.4.1 is available as a by-product: it is just the OLS varcov estimator from the last iteration.

The method can suffer from convergence problems, since F(θ)′F(θ) may be very nearly singular even with an asymptotically identified model, especially if θ is very far from θ̂. In this case (F′F)⁻¹ will be subject to large rounding errors.
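A minimal Gauss-Newton iteration, sketched in Python for an assumed exponential regression function. Note that we start from a rough preliminary value since, as just discussed, the method can fail when started far from θ̂:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = np.column_stack([np.ones(n), rng.normal(size=n)])
theta_true = np.array([1.0, 0.5])
y = np.exp(x @ theta_true) + rng.normal(scale=0.5, size=n)

f = lambda t: np.exp(x @ t)                    # regression function f(theta)
F = lambda t: np.exp(x @ t)[:, None] * x       # n x K derivative matrix F(theta)

theta = np.array([0.9, 0.4])                   # rough starting value
for _ in range(100):
    z = y - f(theta)                           # z = y - f(theta^1)
    b, *_ = np.linalg.lstsq(F(theta), z, rcond=None)  # OLS of z on F(theta^1)
    theta = theta + b                          # theta^2 = theta^1 + b_hat
    if np.max(np.abs(b)) < 1e-10:
        break

# varcov as a by-product: the OLS varcov from the last iteration (eq. 17.4.1)
resid = y - f(theta)
sigma2 = resid @ resid / n
vcov = sigma2 * np.linalg.inv(F(theta).T @ F(theta))
```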
17.7. Application: Limited dependent variables and sample selection

Readings: Davidson and MacKinnon, Ch. 15 (a quick read is sufficient), J. Heckman, "Sample Selection Bias as a Specification Error," Econometrica, 1979 (This is a classic article, not required for reading, and which is a bit out-dated. Nevertheless it's a good place to start if you encounter sample selection problems in your research).

Sample selection is a common problem in applied research. The problem occurs when observations used in estimation are sampled non-randomly, according to some selection scheme.

17.7.1. Example: Labor supply. Suppose we are interested in the determinants of labor supply. For each individual we imagine:

Characteristics of individual: x
Latent labor supply: s* = x′β + ω
Offer wage: w° = z′γ + ν
Reservation wage: wʳ = q′δ + η

Write the difference between the offer wage and the reservation wage as

w* = w° − wʳ ≡ r′θ + ε.

We assume that the offer wage and the reservation wage, as well as the latent variable s*, are unobservable. What is observed is

w = 1[w* > 0],   s = w s*.

In other words, we observe whether or not a person is working and, if so, the labor supply, which equals latent labor supply, s*. Otherwise, s = 0 ≠ s*. Assume that (ω, ε) is jointly normally distributed, with Var(ε) normalized to 1 and Cov(ω, ε) = ρσ, where σ² = Var(ω).

Suppose we estimated the regression s = x′β + residual using only observations for which s > 0. The problem is that these observations are those for which w* > 0, or equivalently, −ε < r′θ, and

E[ω | −ε < r′θ] ≠ 0,

since ε and ω are dependent. Furthermore, this expectation will in general depend on x, since elements of x can enter in r. Because of these two facts, least squares on the selected sample is biased and inconsistent.

By joint normality we can write ω = ρσ ε + η*, where η* has mean zero and is independent of ε, so on the selected sample we get

s = x′β + ρσ ε + η*,  conditional on −ε < r′θ.

A useful result is that for z ~ N(0, 1),

E(z | z > z*) = φ(z*)/[1 − Φ(z*)],

where φ(·) and Φ(·) are the standard normal density and distribution functions. Applying this, the conditional mean on the selected sample is

(17.7.1)   E[s | s > 0, x, r] = x′β + ρσ λ(r′θ),

where λ(r′θ) = φ(r′θ)/Φ(r′θ) is the inverse Mill's ratio, so we can write

(17.7.2)   s = x′β + ρσ λ(r′θ) + η,   E[η | s > 0, x, r] = 0.

Heckman showed how one can estimate this in a two step procedure where first θ is estimated by probit, then equation 17.7.2 is estimated by least squares on the selected subsample. One could also estimate jointly by ML or NLS.
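A sketch of the two-step procedure on simulated data (Python; the design below — regressors, coefficient values — is invented for illustration): a probit for the selection equation, then OLS of s on x and the inverse Mill's ratio over the selected subsample.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 3000
x = np.column_stack([np.ones(n), rng.normal(size=n)])
r = np.column_stack([np.ones(n), rng.normal(size=n)])
beta, theta, rho, sigma = np.array([1.0, 1.0]), np.array([0.5, 1.0]), 0.6, 1.0
eps = rng.normal(size=n)
omega = rho * sigma * eps + rng.normal(scale=sigma * np.sqrt(1 - rho**2), size=n)
w = r @ theta + eps > 0                        # selection: person works
s = np.where(w, x @ beta + omega, 0.0)         # labor supply observed if working

# Step 1: probit for the selection equation
def probit_nll(t):
    p = np.clip(norm.cdf(r @ t), 1e-10, 1 - 1e-10)
    return -(w * np.log(p) + (~w) * np.log(1 - p)).sum()
theta_hat = minimize(probit_nll, np.zeros(2)).x

# Step 2: OLS of s on x and the inverse Mill's ratio, selected subsample only
mills = norm.pdf(r @ theta_hat) / norm.cdf(r @ theta_hat)
X2 = np.column_stack([x[w], mills[w]])
coef, *_ = np.linalg.lstsq(X2, s[w], rcond=None)
beta_hat, rho_sigma_hat = coef[:2], coef[2]
```

The coefficient on the Mill's ratio estimates ρσ; its standard error from step 2 is not directly valid because θ̂ is a generated regressor.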
CHAPTER 18

Nonparametric inference

18.1. Possible pitfalls of parametric inference: estimation

Readings: H. White (1980) "Using Least Squares to Approximate Unknown Regression Functions," International Economic Review, pp. 149-70.

In this section we consider a simple example which illustrates why nonparametric methods may in some cases be preferred to parametric methods.

Suppose that data are generated by random sampling of (y, x), where y = f(x) + ε, x is uniformly distributed on (0, 2π), and ε is a classical error with respect to x. The problem of interest is to estimate the elasticity of f(x) with respect to x, throughout the range of x, when the functional form of f(x) is unknown.

One idea is to take a first order Taylor series approximation to f(x), i.e., to use the approximating model g(x|a, b) = a + bx. The coefficient a absorbs the constant of the expansion, so without loss of generality we can write the approximating model as y = a + bx + e.

The limiting objective function, following the argument we used to get equations 14.3.1 and 17.2.1, is

s̄(a, b) = E_x [f(x) − a − bx]² + σ²_ε,

and the theory of extremum estimators tells us that (â, b̂) converges almost surely to (a°, b°), the values that minimize the limiting objective function. Solving the first order conditions reveals these limiting values.

We may plot the true function and the limit of the approximation to see the asymptotic bias as a function of x. (The approximating model is the straight line; the true model has curvature.) Note that the approximating model is in general inconsistent, even at the approximation point. This shows that flexible functional forms based upon Taylor series approximations do not in general allow consistent estimation. The mathematical properties of the Taylor series do not carry over when coefficients are estimated.

The approximating model seems to fit the true model fairly well, asymptotically. However, we are interested in the elasticity of the function. Recall that an elasticity is the marginal function divided by the average function:

ε(x) = x f′(x)/f(x).

Good approximation of the elasticity over the range of x requires a good approximation of both f(x) and f′(x). Plotting the true elasticity and the elasticity obtained from the limiting approximating model, the true elasticity is the line that has negative slope for large x. Visually we see that the elasticity is not approximated so well; the root mean squared error in the approximation of the elasticity is considerably larger than for the function itself.

Now suppose we use the leading terms of a trigonometric series as the approximating model, holding the set of basis functions fixed as the sample size increases (normally, with this type of model, the number of basis functions is an increasing function of the sample size). We will consider the asymptotic behavior of this fixed model, which we interpret as an approximation to the estimator's behavior in finite samples. Consider the set of basis functions

Z(x) = [ 1  x  cos(x)  sin(x)  cos(2x)  sin(2x) ]

and the approximating model g_K(x|α) = Z(x)α. Maintaining these basis functions as the sample size increases, we find that the limiting objective function is minimized at

(18.1.1)   α° = arg min_α E_x [f(x) − Z(x)α]².

Clearly the truncated trigonometric series model offers a better approximation, asymptotically, than does the linear model. Plotting elasticities: on average, the fit is better, though there is some implausible wavyness in the estimate. The root mean squared error in the approximation of the elasticity is about half that of the RMSE when the first order approximation is used. If the trigonometric series contained infinite terms, this error measure would be driven to zero, as we shall see.
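The limiting approximations can be computed numerically. In the Python sketch below, the true f(x) is an assumed stand-in (the chapter's exact example isn't reproduced here); it is chosen so that the trigonometric basis can represent it exactly, which makes the improvement over the linear model dramatic — in general the gain is only partial:

```python
import numpy as np

# Limiting least-squares approximations to f(x), x ~ U(0, 2*pi), on a fine grid.
x = np.linspace(1e-3, 2 * np.pi, 2000)
f = 2.0 + np.sin(x) + x / 4.0             # assumed true function (positive)
fprime = np.cos(x) + 0.25
elast = x * fprime / f                     # true elasticity

def limit_fit(Z):
    a, *_ = np.linalg.lstsq(Z, f, rcond=None)   # minimizes E_x [f - Z a]^2
    return a

Z1 = np.column_stack([np.ones_like(x), x])       # first order (linear) model
a1 = limit_fit(Z1)
g1, g1p = Z1 @ a1, np.full_like(x, a1[1])

Z2 = np.column_stack([np.ones_like(x), x, np.cos(x), np.sin(x),
                      np.cos(2 * x), np.sin(2 * x)])
a2 = limit_fit(Z2)
g2 = Z2 @ a2
g2p = (a2[1] - a2[2] * np.sin(x) + a2[3] * np.cos(x)
       - 2 * a2[4] * np.sin(2 * x) + 2 * a2[5] * np.cos(2 * x))

rmse = lambda err: np.sqrt(np.mean(err ** 2))
rmse_lin = rmse(x * g1p / g1 - elast)      # elasticity error, linear model
rmse_trig = rmse(x * g2p / g2 - elast)     # elasticity error, trig model
```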
18.2. Possible pitfalls of parametric inference: hypothesis testing

Consider means of testing for the hypothesis that consumers maximize utility. A consequence of utility maximization is that the Slutsky matrix D²_p h(p, U), where h(p, U) is the compensated demand function, must be negative semi-definite. One approach to testing would be to specify a parametric demand system x = x(p, m, θ°) + ε, where θ° is a finite dimensional parameter, estimate it, and then check whether the implied Slutsky matrix satisfies the restriction. If we can statistically reject that the matrix is negative semi-definite, we might conclude that consumers don't maximize utility. The problem is that rejection may equally well be due to misspecification of the functional form: testing with a parametric model is always a test of a compound hypothesis, the economic proposition jointly with the correctness of the specification.
18.3. The Fourier functional form

Suppose we have a model y = f(x) + ε, where f(·) is of unknown form, x is a P-dimensional vector, and ε is a classical error. The problem of interest is to estimate the function f and its first derivatives — e.g., the elasticity of f(x) with respect to x — at an arbitrary point x.

The Fourier form, following Gallant (1982), but with a somewhat different parameterization, may be written as

(18.3.1)   g_K(x|θ_K) = α + x′β + (1/2) x′Γx + Σ_{a=1}^A Σ_{j=1}^J [u_{ja} cos(j k_a′x) − v_{ja} sin(j k_a′x)],

where the K-dimensional parameter vector is

(18.3.2)   θ_K = (α, β′, vec*(Γ)′, u_{11}, v_{11}, ..., u_{JA}, v_{JA})′.

We assume that the conditioning variables x have each been transformed to lie in an interval that is shorter than 2π. This is required to avoid periodic behavior of the approximation, which is desirable since economic functions aren't periodic. For example, one could subtract sample means, divide by the maxima of the conditioning variables, and multiply by 2π − eps, where eps is some positive number less than 2π in value.

The k_a are "elementary multi-indices": P-vectors of integers (negative, positive and zero), required to be linearly independent, with the convention that the first non-zero element is positive. For example, [0 1 −1 0 1]′ is a potential multi-index to be used, but [0 −1 −1 0 1]′ is not since its first nonzero element is negative. Nor is [0 2 −2 0 2]′ a multi-index we would use, since it is a scalar multiple of the original multi-index.

The first and second partial derivatives of the approximation are

(18.3.3)   D_x g_K(x|θ_K) = β + Γx + Σ_a Σ_j [−u_{ja} sin(j k_a′x) − v_{ja} cos(j k_a′x)] j k_a

and

(18.3.4)   D²_x g_K(x|θ_K) = Γ + Σ_a Σ_j [−u_{ja} cos(j k_a′x) + v_{ja} sin(j k_a′x)] j² k_a k_a′.

Stacking the basis functions evaluated at an observation into a row vector z_t′, the model g_K(x_t|θ_K) = z_t′θ_K is linear in θ_K, so it can be estimated by least squares. The dimension K of the parameter vector grows as the number of multi-indices A and the order J are increased (equation 18.3.5), and when more arguments of f are of interest, additional multi-indices are used.
The following theorem can be used to prove the consistency of the Fourier form; it is a version of the consistency theorem for extremum estimators, adapted by Gallant (1987) to estimation over a function space by means of a sequence of finite-dimensional approximating spaces. Its conditions are, roughly:

(a) Compactness: the closure of the parameter space, with respect to an appropriate norm, is compact.
(b) Denseness: the union of the increasing sequence of finite-dimensional parameter spaces is dense in the closure of the full parameter space, so that the functions of interest can be approximated arbitrarily well.
(c) Uniform convergence: the sample objective function converges to its limit uniformly, almost surely.
(d) Identification: the limiting objective function has a unique optimizer.

Under these conditions, the estimated function converges almost surely to the function of interest, with respect to the chosen norm. The modifications of the original statement of the theorem that have been made are: (1) the parameter space is now a space of functions, and convergence is with respect to a norm on that space — typically we will want to make sure that the norm is strong enough to imply convergence of all functions of interest; (2) the objective function is defined on the function space; and (3) there is a denseness assumption that was not present in the other theorem.

We will not prove this theorem (the proof is quite similar to the proof of theorem [19], see Gallant, 1987) but we will discuss its assumptions, in relation to the Fourier form as the approximating model.
18.3.1. Sobolev norm. Since all of the assumptions involve the norm ‖·‖, we need to make explicit what norm we wish to use. We need a norm that guarantees that the errors in approximation of the functions we are interested in are accounted for. Since we are interested in first-order elasticities in the present case, we need close approximation of both the function f(x) and its first derivatives, throughout the range of x. Let X be an open set that contains all values of x that we're interested in. The Sobolev norm is appropriate in this case. It is defined, making use of our notation for partial derivatives, as

‖h‖_{m,X} = max_{|λ| ≤ m} sup_X |Dᵏ h(x)|  (the max over all partial derivatives Dᵏ of order |λ| ≤ m).

To see whether or not a function f is well approximated by an approximating model g_K(x|θ_K), we would evaluate ‖f − g_K(·|θ_K)‖_{m,X}. We see that this norm takes into account errors in approximating the function and all of its partial derivatives up to order m. If we want to estimate first order elasticities, the relevant m is 1. Furthermore, since the norm involves a sup over X, convergence with respect to it implies uniform convergence of the function and its derivatives over X.

18.3.2. Compactness. Verifying compactness with respect to this norm is quite technical. The basic idea is that the estimation space must be restricted to sufficiently smooth functions: compactness of the closure of the estimation space is obtained by working with functions that have bounded partial derivatives of one order higher than the derivatives we seek to estimate.
18.3.3. The estimation space and the estimation subspace. Since in our case we're interested in consistent estimation of first-order elasticities, we'll define the estimation space H as a space of functions that are once continuously differentiable on X, with partial derivatives up to second order bounded by a constant D (which makes the closure of H compact with respect to ‖·‖_{1,X}, as discussed in section 18.3.2). The estimation subspace H_K is the set of functions representable by the Fourier form of equation 18.3.1 with a K-dimensional parameter, and belonging to the estimation space:

H_K = { g_K(x|θ_K) : g_K ∈ H, θ_K ∈ ℜᴷ }.

18.3.4. Denseness. The important point is that H_K is indexed by a finite-dimensional parameter. With n observations, n > K, this parameter is estimable, so optimization over H_K is feasible. But the true function f is not necessarily an element of H_K, so optimization over H_K may not lead to a consistent estimator. In order for optimization over H_K to be asymptotically equivalent to optimization over H, we need that:

(1) The dimension of the parameter vector grows: K_n → ∞ as n → ∞. This is achieved by making A and J in equation 18.3.1 increasing functions of n, the sample size, with K growing more slowly than n.
(2) The H_K be dense subsets of the closure of the estimation space: the closure of the countable union of the subspaces equals the closure of H, so that any element of the estimation space can be approximated arbitrarily well by elements of the estimation subspaces as K grows. (Use a picture here. The rest of the discussion of denseness is provided just for completeness.)

To show denseness with respect to ‖·‖_{1,X} it is useful to apply a theorem used by Gallant (1982), who in turn cites Edmunds and Moscatelli (1977). We reproduce the theorem as presented by Gallant, with minor notational changes, for convenience of reference:

THEOREM 31. [Edmunds and Moscatelli, 1977] Let the real-valued function h*(x) be continuously differentiable up to order m on an open set containing the closure of X. Then it is possible to choose a triangular array of coefficients θ₁, θ₂, ..., θ_K, ..., such that for every q with 0 ≤ q < m, ‖h*(x) − h_K(x|θ_K)‖_{q,X} = o(K^{−m+q+ε}) as K → ∞, for any ε > 0.

In the present application, q = 1 and m = 2, so the theorem applies to the elements of the estimation space. Therefore, any element of the closure of the estimation space can be approximated arbitrarily well, in the ‖·‖_{1,X} norm, by a Fourier form with K large enough; so H_K is a dense subset of the closure of H, with respect to this norm.
18.3.5. Uniform convergence. We now turn to the limiting objective function. We estimate by OLS. The sample objective function, stated in terms of maximization, is

s_n(θ_K) = −(1/n) Σ_{t=1}^n [y_t − g_K(x_t|θ_K)]².

With random sampling, as in the case of Equations 14.3.1 and 17.2.1, the limiting objective function is

(18.3.6)   s̄(g, f) = −E_x [f(x) − g(x)]² − σ²_ε,

where g and f are elements of the closure of the estimation space. The pointwise convergence of the objective function needs to be strengthened to uniform convergence. We will simply assume that this holds, since the way to verify this depends upon the specific application. We also have continuity of the limiting objective function in g with respect to the norm ‖·‖_{1,X}, since convergence in this norm implies uniform convergence of the functions themselves; by the dominated convergence theorem (which applies since the finite bound D used to define the estimation space provides an integrable dominating function), the limit may be passed through the integral defining s̄.
18.3.6. Identification. The identification condition requires that for any point (g, f) in H̄ × H̄, s̄(g, f) ≥ s̄(f, f) implies ‖g − f‖_{1,X} = 0. This condition is satisfied in the present case, since the first term of equation 18.3.6 is maximized (at zero) only when g coincides with f.

18.3.7. Review of concepts. For the example of estimation of first-order elasticities, the relevant concepts are:

Estimation space H: the function space in the closure of which the true function must lie.
Consistency norm ‖·‖_{1,X}: the closure of H is compact with respect to this norm.
Estimation subspace H_K: the subset of H representable by a Fourier form with parameter θ_K. These are dense subsets of H̄.

The sample objective function s_n(θ_K) converges (uniformly, by assumption) to the limiting objective function s̄(g, f), which is continuous in g and has a global maximum in its first argument, over the closure of the infinite union of the estimation subspaces, at g = f. As a result, the function and its first derivatives — and therefore elasticities — are consistently estimated for all x ∈ X.
Defining θ̂_K as the maximizer of the sample objective function for a given K, the prediction ĝ(x) = g_K(x|θ̂_K) at a point x is, holding K fixed, asymptotically normally distributed by the usual arguments for extremum estimators (linear in this case).

18.4. Kernel regression estimators

We now turn to kernel regression estimators. Suppose we have an iid sample {(y_t, x_t), t = 1, ..., n} from the joint density f(x, y), where x is k-dimensional. The model is

y_t = g(x_t) + ε_t,   E(ε_t|x_t) = 0.

The conditional expectation of y given x is g(x). By definition of the conditional expectation, we have

g(x) = ∫ y f(x, y)/h(x) dy = (1/h(x)) ∫ y f(x, y) dy,

where h(x) is the marginal density of x:

h(x) = ∫ f(x, y) dy.

This suggests that we could estimate g(x) by estimating h(x) and ∫ y f(x, y) dy.

18.4.1. Estimation of the denominator. A kernel estimator for h(x) has the form

ĥ(x) = (1/n) Σ_{t=1}^n K[(x − x_t)/γ_n] / γ_nᵏ,

where n is the sample size and k is the dimension of x. The function K(·) (the kernel) is absolutely integrable, ∫ |K(x)| dx < ∞, and K(·) integrates to 1:

∫ K(x) dx = 1.

In this respect, K(·) is like a density function, but we do not necessarily restrict K(·) to be nonnegative. The window width parameter γ_n is a sequence of positive numbers that satisfies

γ_n → 0,    n γ_nᵏ → ∞

as n → ∞. So, the window width must tend to zero, but not too quickly.
To show pointwise consistency of ĥ(x) for h(x), first consider the expectation of the estimator (since the estimator is an average of iid terms, we only need to consider the expectation of a representative term):

E[ĥ(x)] = ∫ γ_n⁻ᵏ K[(x − z)/γ_n] h(z) dz.

Change variables as z* = (x − z)/γ_n, so z = x − γ_n z* and |dz/dz*′| = γ_nᵏ; we obtain

E[ĥ(x)] = ∫ K(z*) h(x − γ_n z*) dz*.

Now, asymptotically,

lim E[ĥ(x)] = ∫ K(z*) h(x) dz* = h(x),

since γ_n → 0 and ∫ K(z*) dz* = 1. (Note: that we can pass the limit through the integral is a result of the dominated convergence theorem; for this to hold we need that h(·) be dominated by an absolutely integrable function.) So the bias goes to zero.

Next, considering the variance of ĥ(x), we have, due to the iid assumption,

n γ_nᵏ Var[ĥ(x)] = γ_nᵏ Var{ γ_n⁻ᵏ K[(x − x_t)/γ_n] }.

Also, since Var(w) = E(w²) − [E(w)]², we have

n γ_nᵏ Var[ĥ(x)] = E{ γ_n⁻ᵏ K²[(x − x_t)/γ_n] } − γ_nᵏ { E[ĥ(x)] }².

The second term converges to zero, by the previous result regarding the expectation and the fact that γ_n → 0. Therefore,

lim n γ_nᵏ Var[ĥ(x)] = lim E{ γ_n⁻ᵏ K²[(x − x_t)/γ_n] }.

Using exactly the same change of variables as before, this can be shown to equal h(x) ∫ K²(z*) dz*, which is bounded. Since n γ_nᵏ → ∞ by assumption, we have

Var[ĥ(x)] → 0.

Since the bias and the variance both go to zero, we have pointwise consistency (convergence in quadratic mean implies convergence in probability).
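The bias/variance trade-off is easy to see numerically. A minimal sketch (Python, standard normal kernel, with γ_n = n^(−1/5) as an assumed window-width rule):

```python
import numpy as np

rng = np.random.default_rng(5)

def h_hat(x0, data, gamma):
    # (1/(n*gamma)) sum K((x0 - x_t)/gamma), standard normal kernel, k = 1
    u = (x0 - data) / gamma
    return np.mean(np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)) / gamma

true_h = 1 / np.sqrt(2 * np.pi)            # N(0,1) density at x0 = 0
errs = []
for n in (100, 10000):
    data = rng.normal(size=n)
    gamma = n ** (-1 / 5)                  # gamma -> 0 while n*gamma -> infinity
    errs.append(abs(h_hat(0.0, data, gamma) - true_h))
```

With the larger sample and the correspondingly smaller window width, the estimation error at the evaluation point shrinks, as the consistency argument predicts.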
18.4.2. Estimation of the numerator. To estimate ∫ y f(x, y) dy, we need an estimator of the joint density f(x, y). Use the kernel estimator

f̂(x, y) = (1/n) Σ_{t=1}^n γ_n^{−(k+1)} K*[(x − x_t)/γ_n, (y − y_t)/γ_n],

where the kernel K*(·,·) is required to integrate to 1 and to marginalize to the previous kernel for h(x):

∫ K*(x, y) dy = K(x).

With this kernel, we have

∫ y f̂(x, y) dy = (1/n) Σ_{t=1}^n y_t γ_n⁻ᵏ K[(x − x_t)/γ_n],

by marginalization of the kernel, so we obtain

ĝ(x) = (1/ĥ(x)) ∫ y f̂(x, y) dy = [ Σ_t y_t K[(x − x_t)/γ_n] ] / [ Σ_t K[(x − x_t)/γ_n] ].

This is the Nadaraya-Watson kernel regression estimator.

18.4.3. Discussion.

The kernel regression estimator for g(x) is a weighted average of the y_t, t = 1, 2, ..., n, where higher weights are associated with points x_t that are closer to x. The weights sum to 1.

The window width parameter γ_n imposes smoothness. The estimator is increasingly flat as γ_n → ∞, since in this case each weight tends to 1/n.

A large window width reduces the variance (strong imposition of flatness), but increases the bias. A small window width reduces the bias, but makes very little use of information except for points that are in a small neighborhood of x. Since relatively little information is used, the variance is large when the window width is small.

The standard normal density is a popular choice for the kernel, though there are possibly better alternatives.
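The estimator is only a few lines of code. A sketch in Python (standard normal kernel; the sine regression function is an assumed test case):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000
x = rng.uniform(0, 2 * np.pi, size=n)
y = np.sin(x) + rng.normal(scale=0.2, size=n)

def g_hat(x0, gamma):
    # Nadaraya-Watson: sum y_t K((x0-x_t)/gamma) / sum K((x0-x_t)/gamma);
    # the gamma^k normalizations cancel in the ratio
    w = np.exp(-0.5 * ((x0 - x) / gamma) ** 2)
    return (w * y).sum() / w.sum()

est = g_hat(np.pi / 2, gamma=0.3)   # true value is sin(pi/2) = 1
```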
18.5. Choosing the window width: Cross-validation

The selection of an appropriate window width is important. One popular method is cross validation. This consists of splitting the sample into two parts (e.g., 50%-50%). The first part is the "in sample" data, used for estimation, and the second part is the "out of sample" data, used for evaluation of the fit:

(1) Split the data into in-sample and out-of-sample parts.
(2) Choose a window width γ.
(3) With the in-sample data, compute the fitted value corresponding to each out-of-sample point x_t. This fitted value is a function of the in-sample data and of the evaluation point x_t, but it does not involve y_t.
(4) Calculate RMSE(γ) over the out-of-sample observations.
(5) Go to step (2), or to the next step if enough window widths have been tried.
(6) Select the γ that minimizes RMSE(γ).
(7) Re-estimate using the best γ and all of the data.

The same kernels can also be used to estimate the conditional density of y given x, as f̂(y|x) = f̂(x, y)/ĥ(x), where we obtain the expressions for the joint and marginal densities from the section on kernel regression.
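The cross-validation procedure can be sketched as follows (Python, Nadaraya-Watson fit with a normal kernel; the grid of window widths is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 400
x = rng.uniform(0, 2 * np.pi, size=n)
y = np.sin(x) + rng.normal(scale=0.3, size=n)
x_in, y_in = x[: n // 2], y[: n // 2]            # (1) split the data
x_out, y_out = x[n // 2 :], y[n // 2 :]

def nw(x_eval, x_d, y_d, gamma):
    w = np.exp(-0.5 * ((x_eval[:, None] - x_d[None, :]) / gamma) ** 2)
    return (w * y_d).sum(axis=1) / w.sum(axis=1)

def rmse(gamma):                                  # (3)-(4) out-of-sample fit
    return np.sqrt(np.mean((nw(x_out, x_in, y_in, gamma) - y_out) ** 2))

grid = [0.05, 0.1, 0.2, 0.4, 0.8, 1.6]            # (2)/(5) widths to try
best = min(grid, key=rmse)                        # (6) pick the minimizer
# (7) would re-estimate with gamma = best, using all of the data
```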
18.6. Semi-nonparametric maximum likelihood

Readings: Gallant and Nychka, Econometrica, 1987. For a Fortran program to do this and a useful discussion in the user's guide, see this link. See also Cameron and Johansson, Journal of Applied Econometrics, V. 12, 1997.

MLE is the estimation method of choice when we are confident about specifying the density. Is it possible to obtain the benefits of MLE when we're not so confident about the specification? In part, yes.

Suppose we're interested in the density of y conditional on x (both may be vectors). Suppose that a reasonable baseline approximation to this density is f(y|x, φ). The baseline density can be reshaped by multiplying it by a squared polynomial in y:

h_p(y|x, φ, γ) = f(y|x, φ) h²_p(y|γ) / η_p(x, φ, γ),

where h_p(y|γ) = Σ_{k=0}^p γ_k yᵏ and η_p(x, φ, γ) is a normalizing factor that makes the reshaped density integrate to one. Because h²_p(y|γ)/η_p(x, φ, γ) is a homogenous function of degree zero in γ, it is necessary to impose a normalization: γ₀ is set to 1.
The normalizing factor can be calculated from the raw moments of the baseline density:

(18.6.1)   η_p(x, φ, γ) = Σ_{k=0}^p Σ_{l=0}^p γ_k γ_l m_{k+l},

where m_r = E(Yʳ) is the r-th raw moment of the baseline density. By setting γ_k = 0 for k ≥ 1 in equation 18.6.1, we obtain the baseline density, so the baseline model is nested in the reshaped model. Gallant and Nychka (1987) give conditions under which such a density may be treated as correctly specified, asymptotically. Basically, the order of the polynomial must increase as the sample size increases. However, there are technicalities.

Similarly to Cameron and Johansson (1997), we may develop a negative binomial polynomial (NBP) density for count data. The negative binomial baseline density may be written as

(18.6.2)   f_Y(y|φ) = [Γ(y + ψ) / (Γ(y + 1) Γ(ψ))] (ψ/(ψ + λ))^ψ (λ/(ψ + λ))^y,

where φ = {λ, ψ}, λ > 0 and ψ > 0. The usual means of incorporating conditioning variables x is the parameterization λ = exp(x′β). When ψ = λ/α we have the negative binomial-I model (NB-I), and when ψ = 1/α we have the negative binomial-II (NB-II) model; α is the overdispersion parameter. The reshaped density is then

(18.6.3)   h_p(y|x, φ, γ) = f_Y(y|φ) h²_p(y|γ) / η_p(x, φ, γ).
To illustrate, here are the first through fourth raw moments of the NB density, calculated using MuPAD, which is a Computer Algebra System that is free for personal use, and then programmed in Ox. These are the moments you would need to use a polynomial of up to second order. The Ox code has the form

if(k_gam >= 1)
{
    m[][0] = lambda;
    m[][1] = (lambda .* (lambda + psi + lambda .* psi)) ./ psi;
}
if(k_gam >= 2)
{
    ...
}

(the higher-order moments are longer expressions of the same kind). Moments of still higher order would be needed for higher order polynomials.
Gallant and Nychka, Econometrica, 1987 prove that this sort of density can approximate a wide variety of densities arbitrarily well as the degree of the polynomial increases with the sample size. This approach is not without its drawbacks: the sample objective function can have an extremely large number of local maxima that can lead to numeric difficulties. If someone could figure out how to do this in a way such that the sample objective function was nice and smooth, they would probably get the paper published in a good journal. Any ideas?

Here's a plot of true and the limiting SNP approximations (with the order of the polynomial fixed) to four different count data densities, which variously exhibit over and underdispersion, as well as excess zeros. The baseline model is a negative binomial density.
18.7. Examples

18.7.1. Fourier form estimation. You need to get the file FFF.ox, which sets up the data matrix for Fourier form estimation. The first DGP generates data with a nonlinear mean and errors (with the mean subtracted out). The program fourierform.ox then allows you to experiment with different sample sizes and different numbers of basis functions. There is no need to specify multi-indices with a univariate regressor (as is the case here).
18.7.2. Kernel regression estimation. Note that too small a window width (ww = 0.1) leads to a very irregular fit, while setting the window width too high leads to too flat a fit.

18.7.3. Cross validation.
18.7.4. Kernel density estimation. The second DGP generates data for density estimation. The program kerneldens.ox allows you to experiment using different sample sizes, kernels, and window widths. The following figure shows the results.

18.7.5. Seminonparametric ML estimation. Equations 18.7.1 through 18.7.3 restate the reshaped negative binomial (NBP) density, its normalizing factor, and the moment formulae of section 18.6, which are needed to program the log-likelihood. The raw moments of the baseline density can be obtained with MuPAD, which is free (as in beer) for personal use. It is installed on the Linux machines in the computer room, and if you like you can install the Windows version, too. The file negbinSNP.mpd, if run using the command mupad negbinSNP.mpd, will give you the output that follows:
(MuPAD startup banner: Copyright (c) 1997 - 2002 by SciFace Software; "Licensed to: ...")

The negative binomial density, in MuPAD's notation (a = psi, b = lambda):

                     /   a   \a   /   b   \y
    gamma(a + y)     | ----- |    | ----- |
                     \ a + b /    \ a + b /
    -----------------------------------------
           gamma(a) gamma(y + 1)

and its moment generating function:

           /   a   \a
           | ----- |
           \ a + b /
    -------------------------
    / a + b - b exp(t) \a
    | ----------------- |
    \       a + b      /
A representative higher-order raw moment, as computed by MuPAD, is

    (24 b^5 + 60 a b^4 + a^4 b + 50 a b^5 + 50 a^2 b^3 + 15 a^3 b^2
     + 110 a^2 b^4 + 75 a^3 b^3 + 15 a^4 b^2 + 35 a^2 b^5 + 60 a^3 b^4
     + 25 a^4 b^3 + 10 a^3 b^5 + 10 a^4 b^4 + a^4 b^5) / a^4

which MuPAD also outputs in Fortran and TeX forms:
t3 = a**-4*(b**5*24.0D0+60.0D0*a*b**4+a**4*b+50.0D0*a*b**5+50.0D
~(a*a)*b**3+15.0D0*a**3*(b*b)+110.0D0*(a*a)*b**4+75.0D0*a**3*b**3
~5.0D0*a**4*(b*b)+35.0D0*(a*a)*b**5+60.0D0*a**3*b**4+25.0D0*a**4*
~*3+10.0D0*a**3*b**5+10.0D0*a**4*b**4+a**4*b**5)"
"\\frac{24\\, b^5 + 60\\, a\\, b^4 + a^4\\, b + 50\\, a\\, b^5 + 50\\,
\\, b^3 + 15\\, a^3\\, b^2 + 110\\, a^2\\, b^4 + 75\\, a^3\\, b^3 + 15\
a^4\\, b^2 + 35\\, a^2\\, b^5 + 60\\, a^3\\, b^4 + 25\\, a^4\\, b^3 + 1
a(0) b(0) m(0) + a(0) b(1) m(1) + b(0) a(1) m(1) + a(0) b(2) m(2) +
b(0) a(2) m(2) + a(1) b(1) m(2) + a(0) b(3) m(3) + b(0) a(3) m(3) +
a(1) b(2) m(3) + a(2) b(1) m(3) + a(1) b(3) m(4) + a(2) b(2) m(4) +
b(1) a(3) m(4) + a(2) b(3) m(5) + a(3) b(2) m(5) + a(3) b(3) m(6)
>> quit
Once you get expressions for the moments and the double sums, you can use these to program a loglikelihood function in Ox, without too much trouble. The file NegBinSNP.ox implements this. The file EstimateNBSNP.ox will let you estimate NegBinSNP models for the MEPS data. The estimation results:
***********************************************************************
MEPS data, OBDV
negbin_snp_obj results
Strong convergence
Observations = 500

Standard Errors
             params      se(OPG)    se(Sand.)  se(Hess)
constant     1.5340      0.13289    0.12645    0.12593
pub_ins      0.16113     0.053100   0.056824   0.054144
priv_ins     0.090624    0.062689   0.065619   0.063835
sex          0.16863     0.047614   0.050720   0.048707
age          0.17950     0.048407   0.045060   0.046301
educ         0.039692    0.047968   0.058794   0.052521
inc          0.032581    0.064384   0.043708   0.051033
ln_alpha     1.8138      0.18466    0.17398    0.17378
             -0.052710   0.0089429  0.0078799  0.0083419
             0.013382    0.0042349  0.0039745  0.0040547

t-Stats
             params      t(OPG)     t(Sand.)   t(Hess)
constant     1.5340      11.543     12.132     12.181
pub_ins      0.16113     3.0344     2.8356     2.9759
priv_ins     0.090624    1.4456     1.3811     1.4197
sex          0.16863     3.5416     3.3248     3.4621
age          0.17950     3.7082     3.9837     3.8769
educ         0.039692    0.82746    0.67509    0.75573
inc          0.032581    0.50603    0.74541    0.63842
ln_alpha     1.8138      9.8226     10.425     10.438
             -0.052710   -5.8941    -6.6892    -6.3188
             0.013382    3.1599     3.3669     3.3003

Information Criteria
CAIC : 2314.7
BIC  : 2304.7
AIC  : 2262.6
***********************************************************************
Note that the CAIC and BIC are lower for this model than for the ordinary NB-I model. NOTE: density functions formed in this way may have MANY local maxima, so you need to be careful before accepting the results of a casual run. To guard against having converged to a local maximum, one can try using multiple starting values, or one could try simulated annealing as an optimization method. To do this, copy maxsa.ox and maxsa.h into your working directory, and then use the program EstimateNBSNP2.ox to see how to implement SA estimation of the reshaped negative binomial model. For more details on the Ox implementation of SA, see Charles Bos' page. Note: in my own experience, using a gradient-based method such as BFGS with many starting values is as successful as SA, and is usually faster. Perhaps I'm not using SA as well as is possible... YMMV.
CHAPTER 19

Simulation-based estimation

Readings: In addition to the book mentioned previously, articles include Gallant and Tauchen (1996), "Which Moments to Match?", ECONOMETRIC THEORY, Vol. 12, 1996, pages 657-681; Gourieroux, Monfort and Renault (1993), "Indirect Inference," J. Appl. Econometrics; Pakes and Pollard (1989), Econometrica; McFadden (1989), Econometrica.

19.1. Motivation

Simulation methods are of interest when the DGP is fully characterized by a parameter vector, but the likelihood function is not calculable. If it were available, we would simply estimate by MLE, which is asymptotically fully efficient.
19.1.1. Example: Multinomial and/or dynamic discrete response models. Let y*_i be a latent random vector of dimension m. Suppose that

y*_i = X_i β + ε_i,

where X_i is m × K, and that

(19.1.1)   ε_i ~ N(0, Ω).

Henceforth drop the i subscript when it is not needed for clarity. Suppose the latent vector is not observed; rather, we observe the vector of indicators y with elements

y_j = 1[y*_j > 0],   j = 1, ..., m.

The contribution of the i-th observation to the likelihood function is the probability of the observed pattern y_i, which is the integral of the multivariate normal density n(y* − X_i β, Ω) over the region of ℜᵐ consistent with y_i. The density of y_i does not factor into univariate terms unless Ω is diagonal, which in general it is not. As a result, evaluation of each likelihood contribution requires an m-dimensional integral, which is not calculable analytically and is poorly approximated by quadrature when the dimension is higher than 3 or 4. The mapping from y* to y accommodates binary discrete choice models as well as the case of multinomial discrete choice (the choice of one out of a finite set of alternatives), and dynamic discrete response models.

Multinomial discrete choice is illustrated by a (very simple) job search model. We have cross sectional data on individuals' matching to a set of m jobs that are available (one of which can be unemployment). The utility of alternative j is

u_j = X_j β + ε_j.

Utilities of jobs, stacked in the vector u_i, are not observed. Rather, we observe the vector formed of elements

y_j = 1[u_j > u_k, ∀ k ∈ {1, ..., m}, k ≠ j],

so only one element of y_i differs from zero. Dynamic discrete response is accommodated by letting y_{ijt} = 1 if individual i chooses alternative j in period t, and zero otherwise; serial correlation in the errors again makes the relevant integrals high-dimensional.
19.1.2. Example: Marginalization of latent variables. Economic data often presents substantial heterogeneity that may be difficult to model. A possibility is to introduce latent random variables. This can cause the problem that there may be no known closed form for the distribution of observable variables after marginalizing out the unobservable latent variables. For example, count data is often modeled using the Poisson distribution

Pr(y = i) = exp(−λ) λⁱ / i!,   i = 0, 1, 2, ....

The mean and variance of the Poisson distribution are both equal to λ:

E(y) = Var(y) = λ.

Often, the conditional mean is parameterized as λ_i = exp(X_i β). This ensures that the mean is positive (as it must be). Estimation by ML is straightforward.

Often, count data exhibits overdispersion, which simply means that Var(y) > E(y). If this is the case, a solution is to use the negative binomial distribution rather than the Poisson. An alternative is to introduce a latent variable that reflects heterogeneity into the specification:

λ_i = exp(X_i β + η_i),

where η_i has some specified density with support S. Let dμ(η_i) be the density of η_i. In some cases, the marginal density of y,

Pr(y = y_i) = ∫_S [ exp(−exp(X_iβ + η_i)) exp(X_iβ + η_i)^{y_i} / y_i! ] dμ(η_i),

will have a closed-form solution (one can derive the negative binomial distribution in the case where η_i has an exponential distribution), but often this will not be the case. In that case, simulation may be used to evaluate the required integral. In this particular case, since there is only one latent variable, quadrature is probably a better choice. However, a more flexible model with heterogeneity would allow all parameters (not just the constant) to vary. For example,

Pr(y = y_i) = ∫_S [ exp(−exp(X_i β_i)) exp(X_i β_i)^{y_i} / y_i! ] dμ(β_i)

entails a K = dim β_i-dimensional integral, which will not be evaluable by quadrature when K gets large.
19.1.3. Example: Estimation of models specified in terms of stochastic differential equations. It is often convenient to formulate models in continuous time. Such a model may be specified as
\[
dy_t = g(\theta, y_t)\,dt + h(\theta, y_t)\,dW_t,
\]
where $W_t$ is a standard Brownian motion (Wiener process), so that over an interval of length $T$, $\int_0^T dW_t \sim N(0, T)$. The functions $g(\cdot)$ and $h(\cdot)$ determine the drift and the instantaneous variability of the process.

To estimate a model of this sort, we typically have data that are assumed to be
observations of $y_t$ in discrete points $t = 1, 2, \ldots, T$. That is, though $y_t$ is a continuous process, it is observed only in discrete time. To perform inference on $\theta$, we need the density of $y_t$ conditional on $y_{t-1}$. This density is necessary to evaluate the likelihood function or to evaluate moment conditions (which are based upon expectations with respect to this density). In general, this transition density has no known closed form.

A typical solution is to discretize the model, by which we mean to
find a discrete time approximation to the model. The discretized version of the model is
\[
y_t - y_{t-1} = g(\phi, y_{t-1}) + h(\phi, y_{t-1})\,\varepsilon_t, \qquad \varepsilon_t \sim N(0,1).
\]
The discretization induces a new parameter, $\phi$: the value $\phi^0$ which defines the best approximation of the discretization to the true (unknown) discrete-time version of the model is not equal to $\theta^0$, the true parameter value. Because the discretization is only an approximation, "ML" estimation of $\phi$ based upon this equation is in general biased and inconsistent for the
original parameter, $\theta$. Nevertheless, the approximation shouldn't be too bad, which will be useful, as we will see below.
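Simulating the discretized model is straightforward. The following sketch applies an Euler discretization to a mean-reverting diffusion (an Ornstein-Uhlenbeck process, chosen purely as an illustration; the drift and diffusion functions and all parameter values are assumptions of the example, not taken from the text):

```python
import random

def euler_simulate(theta, y0, n, dt, seed=9):
    # Euler discretization of dy = g(theta, y) dt + h(theta, y) dW,
    # here with g(theta, y) = kappa * (mu - y) and h(theta, y) = sigma
    # (an Ornstein-Uhlenbeck process, used only for illustration).
    kappa, mu, sigma = theta
    rng = random.Random(seed)
    y = [y0]
    for _ in range(n):
        dw = rng.gauss(0.0, dt ** 0.5)  # Brownian increment, variance dt
        y.append(y[-1] + kappa * (mu - y[-1]) * dt + sigma * dw)
    return y

path = euler_simulate((0.8, 1.0, 0.2), y0=0.0, n=2000, dt=0.01)
```

With these values the process reverts toward the long-run mean 1.0, so the later part of the simulated path fluctuates around that level.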
19.2. Simulated maximum likelihood (SML)

For simplicity, consider cross-sectional data. An ML estimator solves
\[
\hat{\theta}_{ML} = \arg\max\, s_n(\theta) = \frac{1}{n}\sum_{t=1}^{n} \ln p(y_t|X_t, \theta),
\]
where $p(y_t|X_t,\theta)$ is the density of the $t^{th}$ observation. When $p(y_t|X_t,\theta)$ does not have a known closed form, the simulated ML estimator uses an unbiased simulator $\tilde{p}(y_t|X_t,\theta)$ in place of the true density, and solves
\[
\hat{\theta}_{SML} = \arg\max \frac{1}{n}\sum_{t=1}^{n} \ln \tilde{p}(y_t|X_t, \theta).
\]

19.2.1. Example: multinomial probit. Recall that the utility of alternative $j$ is
\[
u_{ij} = X_{ij}\beta + \varepsilon_{ij},
\]
and the vector $y_i$ is formed of elements
\[
y_{ij} = 1\left[u_{ij} > u_{ik},\ \forall k \neq j\right].
\]
The problem is that the choice probabilities cannot be calculated exactly when the number of alternatives is larger than 4 or 5, since they are integrals of a multivariate normal density over rectangular regions. However, the probabilities are easy to simulate:

- Draw $\tilde{\varepsilon}_i$ from the distribution of $\varepsilon_i$.
- Calculate $\tilde{u}_i = X_i\beta + \tilde{\varepsilon}_i$ (where $X_i$ is the matrix formed by stacking the $X_{ij}$).
- Define $\tilde{y}_{ij} = 1\left[\tilde{u}_{ij} > \tilde{u}_{ik},\ \forall k \neq j\right]$.
- Repeat this $H$ times, and define $\tilde{\pi}_{ij} = \frac{1}{H}\sum_{h=1}^{H}\tilde{y}_{ijh}$.
- Define $\tilde{\pi}_i$ as the vector formed of the $\tilde{\pi}_{ij}$. Each element of $\tilde{\pi}_i$ is between 0 and 1, and the elements sum to one.
- Now $\tilde{p}(y_i,\theta) = y_i'\tilde{\pi}_i$, and the simulated log-likelihood is formed by summing $\ln \tilde{p}(y_i,\theta)$ over observations and maximizing with respect to the parameters.

Notes:

- The $H$ draws of $\tilde{\varepsilon}_i$ are drawn only once and are used repeatedly during the iterations used to find the parameter estimates. The draws are different for each $i$. If the draws are made anew at every iteration, the estimator will not converge.
- The log-likelihood function with this simulator is a discontinuous function of the parameters, since a small parameter change can flip some of the indicators. This does not cause problems from a theoretical point of view, since the simulated log-likelihood is stochastically equicontinuous. It does, however, cause problems if one attempts to use a gradient-based optimization method.
- It may be the case, particularly if few simulations $H$ are used, that some elements of $\tilde{\pi}_i$ are zero. If the corresponding element of $y_i$ is equal to 1, there will be a $\log(0)$ problem.
- Solutions to discontinuity: (1) use an optimization method that does not require a continuous and differentiable objective function, for example a direct-search method; (2) smooth the simulated probabilities, replacing the indicator functions with smooth approximations (for example, a steeply scaled CDF of the utility differences), so that $\tilde{\pi}_{ij}$, and therefore the objective function, becomes a continuous function of the parameters. The scale factor controls the trade-off between smoothness and the quality of the approximation to the indicator.
- To solve the $\log(0)$ problem, one possibility is to search the web for the slog function. Also, increase $H$ if this is a serious problem.

19.2.2. Properties. The properties of the SML estimator depend on how $H$ is set. Under typical assumptions:

(1) if $\lim_{n\to\infty} n^{1/2}/H = 0$, then
\[
\sqrt{n}\left(\hat{\theta}_{SML} - \theta^0\right) \overset{d}{\to} N\left(0, \mathcal{I}^{-1}(\theta^0)\right);
\]
(2) if $\lim_{n\to\infty} n^{1/2}/H = \lambda$, with $\lambda$ a finite nonzero constant, then the limiting normal distribution has a nonzero mean, so the estimator is asymptotically biased.

Thus the SML estimator is consistent and fully asymptotically efficient only if $H$ grows faster than $n^{1/2}$; in that case it is asymptotically efficient.
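The frequency simulator for choice probabilities can be sketched as follows. This is an illustration only: the systematic utilities and the Cholesky factor of the error covariance are made-up inputs, and independent errors are used for simplicity.

```python
import random

def simulate_choice_probs(V, chol_rows, H, seed=1):
    # Frequency simulator: estimate Pr(alternative j has the maximum utility).
    # V:         systematic utilities (X_ij * beta) for the m alternatives
    # chol_rows: lower-triangular Cholesky factor of the error covariance,
    #            so that eps = L z with z standard normal
    rng = random.Random(seed)
    m = len(V)
    counts = [0] * m
    for _ in range(H):
        z = [rng.gauss(0.0, 1.0) for _ in range(m)]
        eps = [sum(chol_rows[i][k] * z[k] for k in range(i + 1)) for i in range(m)]
        u = [V[i] + eps[i] for i in range(m)]
        counts[u.index(max(u))] += 1  # indicator of the chosen alternative
    return [c / H for c in counts]

# Three alternatives with independent standard normal errors
L = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
probs = simulate_choice_probs([0.5, 0.0, -0.5], L, H=5000)
```

The simulated probabilities are each between 0 and 1 and sum to one, as the text notes; holding the seed fixed corresponds to re-using the same draws across iterations.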
19.3. Method of simulated moments (MSM)
Suppose we have a DGP which is simulable given $\theta$, but is such that the density of $y_t$ is not calculable. One could, in principle, base a GMM estimator upon the moment conditions
\[
m_t(\theta) = \left[K(y_t, x_t) - k(x_t, \theta)\right] z_t,
\]
where
\[
k(x_t, \theta) = \int K(y_t, x_t)\, p(y|x_t, \theta)\, dy,
\]
$z_t$ is a vector of instruments in the information set, and $p(y|x_t,\theta)$ is the density of $y_t$ conditional on $x_t$. The problem is that this density is not available. However, $k(x_t,\theta)$ is readily simulated using
\[
\tilde{k}(x_t, \theta) = \frac{1}{H}\sum_{h=1}^{H} K(\tilde{y}_t^h, x_t),
\]
where the $\tilde{y}_t^h$ are draws from the DGP at parameter value $\theta$. By the law of large numbers, $\tilde{k}(x_t,\theta) \overset{a.s.}{\to} k(x_t,\theta)$ as $H \to \infty$, which
provides a clear intuitive basis for the estimator, though in fact we obtain consistency even for $H$ finite, since a law of large numbers is also operating across the $n$ observations of real data, so simulation errors cancel out. The simulated moment conditions are
\[
\tilde{m}_t(\theta) = \left[K(y_t, x_t) - \tilde{k}(x_t, \theta)\right] z_t, \qquad (19.3.1)
\]
and, averaging over observations,
\[
\tilde{m}(\theta) = \frac{1}{n}\sum_{t=1}^{n}\left[K(y_t, x_t) - \frac{1}{H}\sum_{h=1}^{H} K(\tilde{y}_t^h, x_t)\right] z_t, \qquad (19.3.2)
\]
with which we form the GMM criterion and estimate as usual. Note that the unbiased simulator $K(\tilde{y}_t^h, x_t)$ appears linearly within the sums.
19.3.1. Properties. Suppose that the optimal weighting matrix is used. McFadden (ref. above) and Pakes and Pollard (refs. above) show that the asymptotic distribution of the MSM estimator is very similar to that of the infeasible
GMM estimator. In particular, assuming that the optimal weighting matrix is used, and for $H$ finite,
\[
\sqrt{n}\left(\hat{\theta}_{MSM} - \theta^0\right) \overset{d}{\to}
N\left[0, \left(1 + \frac{1}{H}\right)\left(D_\infty \Omega^{-1} D_\infty'\right)^{-1}\right], \qquad (19.3.3)
\]
where $\left(D_\infty \Omega^{-1} D_\infty'\right)^{-1}$ is the asymptotic variance of the infeasible GMM estimator.

- The asymptotic variance is inflated by the factor $1 + 1/H$. For this reason the MSM estimator is not fully asymptotically efficient relative to the infeasible GMM estimator for $H$ finite, but the efficiency loss is small and controllable, by setting $H$ reasonably large.
- The estimator is asymptotically unbiased even for $H = 1$. This is an advantage relative to SML.
- If one doesn't use the optimal weighting matrix, the asymptotic variance-covariance matrix is just the ordinary GMM variance-covariance matrix, inflated by the same factor $1 + 1/H$.

19.3.2. Comments. Why is SML inconsistent if $H$ is finite, while MSM is consistent? The reason is that SML is based upon an average of logarithms of an unbiased simulator of the density. Because the logarithm is a nonlinear transformation, the expectation of the log of the simulator is not the log of its expectation, so the simulation error does not average out across observations with $H$ fixed; the SML objective function tends to the ML objective function only as $H \to \infty$.

The reason that MSM does not suffer from this problem is that in this case
the unbiased simulator appears linearly within every sum of terms, and it appears within a sum over $n$. Therefore the strong law of large numbers applies
to cancel out simulation errors, from which we get consistency. That is, using
simple notation for the random sampling case, the moment conditions
\[
\tilde{m}(\theta) = \frac{1}{n}\sum_{t=1}^{n}\left[K(y_t, x_t) - \frac{1}{H}\sum_{h=1}^{H} K(\tilde{y}_t^h, x_t)\right] z_t \qquad (19.3.4)
\]
\[
\phantom{\tilde{m}(\theta)} = \frac{1}{n}\sum_{t=1}^{n}\left[k(x_t, \theta^0) + \varepsilon_t - \frac{1}{H}\sum_{h=1}^{H}\left(k(x_t, \theta) + \tilde{\varepsilon}_{ht}\right)\right] z_t \qquad (19.3.5)
\]
converge almost surely to
\[
\bar{m}_\infty(\theta) = \int \left[k(x, \theta^0) - k(x, \theta)\right] z(x)\, d\mu(x)
\]
(note: $z_t$ is assumed to be made up of functions of $x_t$). The objective function converges to
\[
s_\infty(\theta) = \bar{m}_\infty(\theta)'\, \Omega_\infty^{-1}\, \bar{m}_\infty(\theta),
\]
which obviously has a minimum at $\theta^0$, and hence consistency follows.

If you look at equation 19.3.5 a bit, you will see why the variance inflation factor is $\left(1 + \frac{1}{H}\right)$: the simulation errors $\tilde{\varepsilon}_{ht}$ enter with weight $1/H$ and contribute an additional $1/H$ times the variance contributed by the sampling errors $\varepsilon_t$.
Arbitrarily chosen moment conditions may produce inefficient estimators, and can even cause identification problems (as we've seen
with the GMM problem set). This is the drawback of the above approach to MSM: the moment conditions used in estimation are selected arbitrarily, so the asymptotic efficiency of the estimator may be low. The asymptotically optimal choice of moments would be the score vector of the likelihood function,
\[
m_t(\theta) = D_\theta \ln p_t(\theta \mid I_t).
\]
As before, this choice is unavailable, since the density is unknown.

19.4. Efficient method of moments (EMM)

The efficient method of moments (EMM) (see Gallant and Tauchen (1996),
"Which Moments to Match?", Econometric Theory, Vol. 12, pages
657-681) seeks to provide moment conditions that closely mimic the score vector. If the approximation is very good, the resulting estimator will be very
nearly fully efficient.
The DGP is characterized by random sampling from the density
\[
p(y_t | x_t, \theta^0) \equiv p_t(\theta^0).
\]
We can define an auxiliary model, called the "score generator", which simply provides a (misspecified) parametric density
\[
f(y | x_t, \lambda) \equiv f_t(\lambda).
\]
This density is known up to a parameter $\lambda$ and is assumed to be calculable, so quasi-ML estimation is possible:
\[
\hat{\lambda} = \arg\max_{\Lambda}\, s_n(\lambda) = \frac{1}{n}\sum_{t=1}^{n} \ln f_t(\lambda).
\]
After determining $\hat{\lambda}$, the moment conditions are the expectation of the score of the auxiliary model, taken with respect to the density implied by the structural model:
\[
m_n(\theta, \hat{\lambda}) = \frac{1}{n}\sum_{t=1}^{n} \int D_\lambda \ln f(y | x_t, \hat{\lambda})\, p(y | x_t, \theta)\, dy. \qquad (19.4.1)
\]
When this integral is not available analytically, the moment conditions can be simulated by averaging the auxiliary score over draws $\tilde{y}_t^h$ from the structural model at parameter value $\theta$, holding $x_t$ fixed:
\[
\tilde{m}_n(\theta, \hat{\lambda}) = \frac{1}{n}\sum_{t=1}^{n}\frac{1}{H}\sum_{h=1}^{H} D_\lambda \ln f(\tilde{y}_t^h | x_t, \hat{\lambda}).
\]
The quasi-ML estimator $\hat{\lambda}$ converges to a pseudo-true value $\lambda^0$, and the expected auxiliary score is zero when evaluated at $(\theta^0, \lambda^0)$, so the moment conditions identify $\theta^0$ provided $\lambda^0$ is identified.

The advantage of this procedure is that if $f_t(\lambda)$ closely approximates $p(y | x_t, \theta)$, then $m_n(\theta, \hat{\lambda})$ will closely approximate the optimal
moment conditions which characterize maximum likelihood estimation, which is fully efficient. If one has no likely parametric density in mind, there exist good ways of approximating unknown distributions parametrically: for example, the ERA approach (Econometrica,
1983) and Gallant and Nychka's (Econometrica, 1987) SNP density estimator, which we saw before. Since the SNP density is consistent, the
efficiency of the indirect estimator is the same as the infeasible ML
estimator.

In the present discussion, $H$ is finite, and possibly small. This is done because it is sometimes impractical to estimate with $H$
very large. Gallant and Tauchen give the theory for the case of $H$
so large that it may be treated as infinite (the difference being irrelevant given
the numerical precision of a computer). The theory for the case of $H$ infinite
follows directly from the results presented here.
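To make the score-generator idea concrete, here is a toy sketch. The auxiliary model is $N(\lambda, 1)$, whose score with respect to $\lambda$ is $(y - \lambda)$, so the quasi-ML estimate $\hat{\lambda}$ is the sample mean, and the simulated EMM moment matches the mean of data simulated from the structural model to $\hat{\lambda}$. The structural model, grid search, and all numbers are assumptions of the illustration.

```python
import random

def emm_estimate(data, simulate, theta_grid, H, seed=5):
    # EMM with auxiliary model y ~ N(lambda, 1): the auxiliary score is
    # (y - lambda), and the quasi-ML estimate of lambda is the sample mean.
    lam_hat = sum(data) / len(data)
    rng = random.Random(seed)
    shocks = [rng.gauss(0.0, 1.0) for _ in range(H)]  # fixed draws
    best = None
    for theta in theta_grid:
        # average auxiliary score over data simulated from the structural model
        m = sum(simulate(theta, e) - lam_hat for e in shocks) / H
        crit = m * m
        if best is None or crit < best[0]:
            best = (crit, theta)
    return best[1]

# Structural model: y = theta^2 + N(0,1), theta restricted to be positive
rng = random.Random(11)
data = [1.5 ** 2 + rng.gauss(0.0, 1.0) for _ in range(400)]
grid = [i / 100.0 for i in range(0, 301)]
theta_hat = emm_estimate(data, lambda th, e: th * th + e, grid, H=300)
```

Here one auxiliary parameter is matched against one structural parameter, so the simulated score is set (approximately) to zero, as in the exactly identified case discussed below.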
19.4.1. Optimal weighting matrix. As in Theorem 22, the optimal weighting matrix is the inverse of the asymptotic variance of the moment conditions,
\[
\Omega \equiv \lim_{n\to\infty} \operatorname{Var}\left[\sqrt{n}\, m_n(\theta^0, \hat{\lambda})\right]. \qquad (19.4.2)
\]
This variance reflects both the randomness of the data and the fact that $\hat{\lambda}$ is itself estimated: it is obtained by expanding $m_n(\theta^0, \hat{\lambda})$ about $\lambda^0$, substituting the asymptotic distribution of the quasi-ML estimator $\hat{\lambda}$, and combining the results for the first and second terms of the expansion. If the density $f(y|x_t, \lambda)$ were the true density, part of the resulting expression would reduce to an identity matrix, due to the information matrix equality. However, in the present case $f_t(\lambda)$ is only an approximation to $p_t(\theta)$, so there is no such cancellation, and the full expression must be used.

Consistent estimation of $\Omega$ may be complicated if the score generator is a poor approximator, since the individual score contributions may not
have mean zero in this case (see the section on QML). Even if this is the case,
the individual means can be calculated by simulation, so it is always possible
to consistently estimate $\Omega$. (In principle, the estimator can approach the efficiency of ML estimation asymptotically, since the score generator can approximate the unknown density arbitrarily well.)
19.4.2. Asymptotic distribution. Since we use the optimal weighting matrix, the asymptotic distribution is as in Equation 15.4.1, so we have (using the
result in Equation 19.4.2):
\[
\sqrt{n}\left(\hat{\theta} - \theta^0\right) \overset{d}{\to}
N\left[0, \left(D_\infty \Omega^{-1} D_\infty'\right)^{-1}\right],
\]
where
\[
D_\infty = \lim_{n\to\infty} E\left[D_\theta\, m_n'(\theta^0, \lambda^0)\right].
\]
This matrix can be consistently estimated using the simulations.

19.4.3. Diagnostic testing. The fact that
\[
\sqrt{n}\, m_n(\theta^0, \hat{\lambda}) \overset{d}{\to} N(0, \Omega)
\]
implies that
\[
n\, m_n(\hat{\theta}, \hat{\lambda})'\, \hat{\Omega}^{-1}\, m_n(\hat{\theta}, \hat{\lambda}) \overset{d}{\to} \chi^2(q),
\]
where $q = \dim(\lambda) - \dim(\theta)$, since $\dim(\theta)$ degrees of freedom are used up in estimating $\theta$; if $q = 0$, $\theta$ is exactly identified, so testing is impossible. One test of the model is simply based
on this statistic: if it exceeds the $\chi^2(q)$ critical point, something may be wrong (the small sample performance of this sort of test would be a topic worth investigating). Information about what may be wrong can be gotten from the pseudo-t statistics: the elements of
\[
\left[\operatorname{diag}\,\hat{\Omega}\right]^{-1/2}\sqrt{n}\, m_n(\hat{\theta}, \hat{\lambda})
\]
can be used to test which moments are not well modeled. Since these
moments are related to parameters of the score generator, which are
usually related to certain features of the model, this information can be used to revise the model. These statistics are not actually distributed as $N(0,1)$, since $\sqrt{n}\, m_n(\theta^0, \hat{\lambda})$ and $\sqrt{n}\, m_n(\hat{\theta}, \hat{\lambda})$ have different distributions, and the latter is somewhat more complicated.
19.5. Example: estimation of stochastic differential equations. The discretized version of the model,
\[
y_t - y_{t-1} = g(\phi, y_{t-1}) + h(\phi, y_{t-1})\,\varepsilon_t, \qquad \varepsilon_t \sim N(0,1),
\]
can be used as the score generator: it is estimated by quasi-ML, giving $\hat{\phi}$. Then the continuous-time model
\[
dy_t = g(\theta, y_t)\,dt + h(\theta, y_t)\,dW_t
\]
is simulated over $\theta$, and the scores of the discretized model are calculated and averaged over the simulations. The estimator $\hat{\theta}$
is chosen to set the simulated scores to zero,
\[
\tilde{m}_n(\hat{\theta}, \hat{\phi}) \equiv 0
\]
(since $\theta$ and $\phi$ are of the same dimension, the equations can be solved exactly). By setting the number of simulations large enough, this approach can recover the parameters of the continuous-time model fairly well.

This is only one method of using indirect inference for estimation of differential equations. There are others (see Gallant and Long, 1995, and Gourieroux
et al.). Use of a series approximation to the transitional density, as in Gallant and Long, is an interesting possibility, since the score generator may have
a higher-dimensional parameter than the model, which allows for diagnostic testing of the specification.
CHAPTER 20
CHAPTER 21
Introduction to Octave
Why is Octave being used here, since it's not that well-known by econometricians? Well, because it is a high quality environment that is easily extensible,
uses well-tested and high performance numerical libraries, it is licensed under
the GNU GPL, so you can get it for free and modify it if you like, and it runs
on GNU/Linux, Mac OSX and Windows systems. It's also quite easy to
learn.
21.1. Getting started
Get the bootable CD, as was described in Section 1.3. Then burn the image,
and boot your computer with it. This will give you this same PDF file, but with
all of the example programs ready to run. The editor is configured with a macro
to execute the programs using Octave, which is of course installed. From this
point, I assume you are running the CD (or sitting in the computer room across
the hall from my office), or that you have configured your computer to be able
to run the *.m files mentioned below.
21.2. A short introduction
The objective of this introduction is to learn just the basics of Octave. There
are other ways to use Octave, which I encourage you to explore. These are just
some rudiments. After this, you can look at the example programs scattered
throughout the document (and edit them, and run them) to learn more about
how Octave can be used to do econometrics. Students of mine: your problem
sets will include exercises that can be done by modifying the example programs in relatively minor ways. So study the examples!
Octave can be used interactively, or it can be used to run programs that are
written using a text editor. We'll use this second method, preparing programs
with NEdit, and calling Octave from within the editor. The program first.m
gets us started. To run this, open it up with NEdit (by finding the correct
file inside the /home/knoppix/Desktop/Econometrics folder and clicking on the icon) and then type CTRL-ALT-o, or use the Octave item in the Shell
menu (see Figure 21.2.1).
Note that the output is not formatted in a pleasing way. That's because
printf() doesn't automatically start a new line. Edit first.m so that the
8th line reads printf("hello world\n"); and re-run the program.
We need to know how to load and save data. The program second.m
shows how. Once you have run this, you will find the file x in the directory
Econometrics/Include/OctaveIntro/. You might have a look at it with
NEdit to see Octave's default format for saving data. Basically, if you have
data in an ASCII text file, named for example myfile.data, formed of
numbers separated by spaces, just use the command load myfile.data.
After having done so, the matrix myfile (without extension) will contain
the data.
Please have a look at CommonOperations.m for examples of how to do
some basic things in Octave. Now that we're done with the basics, have a look
at the Octave programs that are included as examples. If you are looking at
the browsable PDF version of this document, then you should be able to click
on links to open them. If not, the example programs are available here and the
support les needed to run these are available here. Those pages will allow
you to examine individual les, out of context. To actually use these les (edit
and run them), you should go to the home page of this document, since you
will probably want to download the pdf version together with all the support
les and examples. Or get the bootable CD.
There are some other resources for doing econometrics with Octave. You
might like to check the article "Econometrics with Octave" and the Econometrics Toolbox,
which is for Matlab, but much of which could be easily used with Octave.
If you are not using the bootable CD:
(1) Get the collection of support programs and the examples from the document's home page.
(2) Put them somewhere, and tell Octave how to find them.
(3) Make sure NEdit is installed and configured to run Octave.
CHAPTER 22
Notation and Review

22.1. Notation for differentiation of vectors and matrices

Let $s(\theta): \Re^k \to \Re$ be a real-valued function of the $k$-vector $\theta$. Then $\frac{\partial s(\theta)}{\partial \theta}$ is
organized as a $k$-vector,
\[
\frac{\partial s(\theta)}{\partial \theta} =
\begin{bmatrix}
\frac{\partial s(\theta)}{\partial \theta_1} \\
\frac{\partial s(\theta)}{\partial \theta_2} \\
\vdots \\
\frac{\partial s(\theta)}{\partial \theta_k}
\end{bmatrix}.
\]
Following this convention, $\frac{\partial s(\theta)}{\partial \theta'}$ is a $1 \times k$ row vector, and the Hessian matrix $\frac{\partial^2 s(\theta)}{\partial \theta\, \partial \theta'}$ is a $k \times k$ matrix.

Let $f(\theta): \Re^k \to \Re^n$ be an $n$-vector valued function of the $k$-vector $\theta$, and let $f(\theta)'$ be the $1 \times n$ valued transpose of $f$. Then
\[
\left(\frac{\partial}{\partial \theta} f(\theta)'\right)' = \frac{\partial}{\partial \theta'} f(\theta),
\]
a matrix with $n$ rows and $k$ columns.

Product rule: let $f(\theta): \Re^k \to \Re^n$ and $h(\theta): \Re^k \to \Re^n$ both be $n$-vector valued functions of the $k$-vector $\theta$. Then
\[
\frac{\partial}{\partial \theta'}\, h(\theta)' f(\theta) = h' \left(\frac{\partial}{\partial \theta'} f\right) + f' \left(\frac{\partial}{\partial \theta'} h\right),
\]
which has dimension $1 \times k$. Applied to a quadratic form, with $A$ a $k \times k$ matrix that does not depend on $\theta$,
\[
\frac{\partial}{\partial \theta'}\, \theta' A \theta = \theta' \left(A + A'\right).
\]

Chain rule: let $f(\cdot): \Re^k \to \Re^n$ be an $n$-vector valued function of a $k$-vector argument, and let $g(\rho): \Re^r \to \Re^k$ be a $k$-vector valued function of the $r$-vector $\rho$. Then
\[
\frac{\partial}{\partial \rho'}\, f\left[g(\rho)\right] = \left.\frac{\partial}{\partial \theta'} f(\theta)\right|_{\theta = g(\rho)} \frac{\partial}{\partial \rho'}\, g(\rho)
\]
has dimension $n \times r$.
22.2. Convergence modes

A sequence is a mapping from the natural numbers $\{1, 2, \ldots\}$
to some other set, so that the set is ordered according to the natural
numbers associated with its elements.

DEFINITION (Convergence). A real-valued sequence of vectors $\{a_n\}$ converges to the vector $a$ if for any $\varepsilon > 0$ there exists an integer $N_\varepsilon$ such that for all $n > N_\varepsilon$, $\|a_n - a\| < \varepsilon$. In this case, $a$ is the limit of $a_n$, written $a_n \to a$.

Consider next a sequence of deterministic real-valued functions $\{f_n(\omega)\}$ defined on a common domain $\Omega$.

DEFINITION (Pointwise convergence). A sequence of functions $\{f_n(\omega)\}$ converges pointwise on $\Omega$ to the function $f(\omega)$ if for all $\varepsilon > 0$ and all $\omega \in \Omega$ there exists an integer $N_{\varepsilon\omega}$ such that
\[
|f_n(\omega) - f(\omega)| < \varepsilon, \qquad \forall n > N_{\varepsilon\omega}.
\]
It's important to note that $N_{\varepsilon\omega}$ depends upon $\omega$, so that convergence may be much more rapid for certain $\omega$ than for others. Uniform convergence requires a similar rate of convergence throughout $\Omega$.

DEFINITION (Uniform convergence). A sequence of functions $\{f_n(\omega)\}$ converges uniformly on $\Omega$ to the function $f(\omega)$ if for any $\varepsilon > 0$ there exists an integer $N_\varepsilon$ such that
\[
\sup_{\omega \in \Omega}\, |f_n(\omega) - f(\omega)| < \varepsilon, \qquad \forall n > N_\varepsilon.
\]
(Insert a diagram here showing the envelope within which $f_n(\omega)$ must lie.)

In econometrics, we typically deal with stochastic sequences. Given a probability space $(\Omega, \mathcal{F}, P)$, recall that a random variable maps the sample space to the real line, i.e., $X(\omega): \Omega \to \Re$. A sequence of random variables $\{X_n(\omega)\}$ is a collection of such mappings. For example, given the model $y = X\beta^0 + \varepsilon$, the OLS estimator $\hat{\beta}_n = (X'X)^{-1}X'y$, where $n$ is the sample size, can be used to form a sequence of random vectors $\{\hat{\beta}_n\}$. A number of modes of convergence are in use when dealing with sequences of random variables.

DEFINITION (Convergence in probability). Let $X_n(\omega)$ be a sequence of random variables, and let $X(\omega)$ be a random variable. Then $\{X_n(\omega)\}$ converges in probability to $X(\omega)$ if
\[
\lim_{n\to\infty} P\left(\left\{\omega : |X_n(\omega) - X(\omega)| > \varepsilon\right\}\right) = 0, \qquad \forall \varepsilon > 0.
\]
Convergence in probability is written as $X_n \overset{p}{\to} X$, or plim $X_n = X$.

DEFINITION (Almost sure convergence). Let $X_n(\omega)$ be a sequence of random variables, and let $X(\omega)$ be a random variable. Then $\{X_n(\omega)\}$ converges almost surely to $X(\omega)$ if
\[
P\left(\left\{\omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)\right\}\right) = 1.
\]
In other words, $X_n(\omega) \to X(\omega)$ (ordinary convergence of the functions), except on a set $C \subset \Omega$ such that $P(C) = 0$. Almost sure convergence is written as $X_n \overset{a.s.}{\to} X$. One can show that almost sure convergence implies convergence in probability.

DEFINITION (Convergence in distribution). Let the r.v. $X_n$ have distribution function $F_n$ and the r.v. $X$ have distribution function $F$. If $F_n \to F$ at every continuity point of $F$, then $X_n$ converges in distribution to $X$, written $X_n \overset{d}{\to} X$.

Simple laws of large numbers allow us to conclude directly that $\hat{\beta}_n \overset{a.s.}{\to} \beta^0$ in the OLS example, since
\[
\hat{\beta}_n = \beta^0 + \left(\frac{X'X}{n}\right)^{-1}\left(\frac{X'\varepsilon}{n}\right)
\]
and the last term converges almost surely to zero.
This easy proof is a result of the linearity of the model, which allows us to
express the estimator in a way that separates parameters from random functions. In general, this is not possible. We often deal with the more complicated
situation where the stochastic sequence depends on parameters in a manner
that is not reducible to a simple sequence of random variables. In this case,
we have a sequence of random functions that depend on a parameter: $\{X_n(\omega, \theta)\}$, where each $X_n(\omega, \theta)$ is a random variable with respect to a probability space $(\Omega, \mathcal{F}, P)$ and the parameter $\theta$ belongs to a parameter space $\Theta$.

DEFINITION (Uniform almost sure convergence). $\{X_n(\omega, \theta)\}$ converges uniformly almost surely in $\Theta$ to $X(\omega, \theta)$ if
\[
\lim_{n\to\infty}\, \sup_{\theta \in \Theta}\, |X_n(\omega, \theta) - X(\omega, \theta)| = 0 \quad \text{(a.s.)}.
\]
Implicit is the assumption that all $X_n(\omega, \theta)$ and $X(\omega, \theta)$ are random variables w.r.t. $(\Omega, \mathcal{F}, P)$ for all $\theta \in \Theta$.

22.3. Rates of convergence and asymptotic equality

It's often useful to have notation for the relative magnitudes of quantities. Quantities that are small relative to others can often be ignored, which simplifies analysis.

DEFINITION (Little-o). Let $f(n)$ and $g(n)$ be two real-valued functions. The notation $f(n) = o(g(n))$ means $\lim_{n\to\infty} \frac{f(n)}{g(n)} = 0$.

DEFINITION (Big-O). Let $f(n)$ and $g(n)$ be two real-valued functions. The notation $f(n) = O(g(n))$ means there exists some $N$ such that for $n > N$, $\left|\frac{f(n)}{g(n)}\right| < K$, where $K$ is a finite constant.

If $\{f_n\}$ and $\{g_n\}$ are sequences of random variables, the analogous definitions are:

DEFINITION. The notation $f(n) = o_p(g(n))$ means $\frac{f(n)}{g(n)} \overset{p}{\to} 0$.

EXAMPLE. The least squares estimator is
\[
\hat{\theta} = (X'X)^{-1}X'Y = (X'X)^{-1}X'\left(X\theta^0 + \varepsilon\right) = \theta^0 + (X'X)^{-1}X'\varepsilon.
\]
Since plim $(X'X)^{-1}X'\varepsilon = 0$, we can write $(X'X)^{-1}X'\varepsilon = o_p(1)$ and $\hat{\theta} = \theta^0 + o_p(1)$. Asymptotically, the term $o_p(1)$ is negligible. This is just a way of indicating that the LS estimator is consistent.

DEFINITION. The notation $f(n) = O_p(g(n))$ means there exists some $N_\varepsilon$ such that for $\varepsilon > 0$ and all $n > N_\varepsilon$,
\[
P\left(\left|\frac{f(n)}{g(n)}\right| < K_\varepsilon\right) > 1 - \varepsilon,
\]
where $K_\varepsilon$ is a finite constant.

EXAMPLE 49. If $X_n \sim N(0,1)$ then $X_n = O_p(1)$, since, given $\varepsilon$, there is
always some $K_\varepsilon$ such that $P(|X_n| < K_\varepsilon) > 1 - \varepsilon$.

Useful rules:
\[
O_p(n^p)\, O_p(n^q) = O_p(n^{p+q}),
\]
\[
o_p(n^p)\, o_p(n^q) = o_p(n^{p+q}).
\]

EXAMPLE 50. Consider a random sample of iid r.v.'s with mean 0 and variance $\sigma^2$. The estimator of the mean, $\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} x_i$, is asymptotically normally
distributed, e.g., $n^{1/2}\hat{\theta} \overset{A}{\sim} N(0, \sigma^2)$. So $n^{1/2}\hat{\theta} = O_p(1)$, so $\hat{\theta} = O_p(n^{-1/2})$. Before we had $\hat{\theta} = o_p(1)$;
now we have the stronger result that relates the rate of convergence to the sample size.

EXAMPLE 51. Now consider a random sample of iid r.v.'s with mean $\mu$
and variance $\sigma^2$. The estimator of the mean, $\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} x_i$, is asymptotically normally distributed, e.g., $n^{1/2}\left(\hat{\theta} - \mu\right) \overset{A}{\sim} N(0, \sigma^2)$. So $n^{1/2}\left(\hat{\theta} - \mu\right) = O_p(1)$, so $\hat{\theta} - \mu = O_p(n^{-1/2})$, so $\hat{\theta} = O_p(1)$.

These two examples show that averages of centered (mean zero) quantities typically have plim 0, while averages of uncentered quantities have finite
plims. Note that the definition of $O_p$ does not mean that $f(n)$ and $g(n)$
are of the same order. Asymptotic equality ensures that this is the case.

DEFINITION (Asymptotic equality). Two sequences of random variables $\{f_n\}$ and $\{g_n\}$ are asymptotically equal (written $f_n \overset{a}{=} g_n$) if
\[
\text{plim}\, \frac{f(n)}{g(n)} = 1.
\]

Finally, analogous almost sure versions of $o_p$ and $O_p$ are defined in the obvious way.
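The rate results in Examples 50 and 51 can be checked by Monte Carlo: the standard deviation of the sample mean should fall like $n^{-1/2}$, so quadrupling $n$ should roughly halve it. A sketch in Python (the sample sizes, replication counts, and seeds are arbitrary choices of the illustration):

```python
import random

def sd_of_sample_mean(n, reps, seed):
    # Monte Carlo estimate of the standard deviation of the sample mean
    # of n iid N(0,1) draws, over `reps` replications
    rng = random.Random(seed)
    means = [sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n for _ in range(reps)]
    mu = sum(means) / reps
    return (sum((m - mu) ** 2 for m in means) / reps) ** 0.5

sd_small = sd_of_sample_mean(100, 500, seed=1)
sd_large = sd_of_sample_mean(400, 500, seed=2)
ratio = sd_small / sd_large  # theory predicts roughly sqrt(400/100) = 2
```

The ratio comes out near 2, consistent with the $O_p(n^{-1/2})$ rate for the centered sample mean.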
Exercises

(1) For $a$ and $x$ both $p \times 1$ vectors, show that $\frac{\partial a'x}{\partial x} = a$.
(2) For $A$ a $p \times p$ matrix and $x$ a $p \times 1$ vector, show that $\frac{\partial^2 x'Ax}{\partial x\, \partial x'} = A + A'$.
(3) For $x$ and $\beta$ both $p \times 1$ vectors, show that $\frac{\partial \exp(x'\beta)}{\partial \beta} = \exp(x'\beta)\, x$.
(4) For $x$ and $\beta$ both $p \times 1$ vectors, find the analytic expression for $\frac{\partial^2 \exp(x'\beta)}{\partial \beta\, \partial \beta'}$.
(5) Write an Octave program that verifies each of the previous results by taking numeric derivatives. For a hint, type help numgradient and help
numhessian inside octave.
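Exercise (5) asks for an Octave program using numgradient; an analogous check of one of the derivative identities above (that the gradient of $a'x$ with respect to $x$ is $a$) can be sketched in Python with a hand-rolled central-difference gradient:

```python
def numgradient(f, x, h=1e-6):
    # central-difference numerical gradient of a scalar function f at point x
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2.0 * h))
    return g

a = [1.0, -2.0, 3.0]
x = [0.5, 0.25, -1.0]
# s(x) = a'x; its gradient should be (approximately) the vector a
grad = numgradient(lambda v: sum(ai * vi for ai, vi in zip(a, v)), x)
```

For a linear function the central difference is exact up to floating-point rounding, so the numerical gradient matches $a$ to high precision.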
CHAPTER 23
The GPL
This document and the associated examples and materials are copyright
Michael Creel, under the terms of the GNU General Public License. This license follows:
GNU GENERAL PUBLIC LICENSE Version 2, June 1991
Copyright (C) 1989, 1991 Free Software Foundation, Inc. 59 Temple Place,
Suite 330, Boston, MA 02111-1307 USA Everyone is permitted to copy and
distribute verbatim copies of this license document, but changing it is not allowed.
Preamble
The licenses for most software are designed to take away your freedom to
share and change it. By contrast, the GNU General Public License is intended
to guarantee your freedom to share and change free software--to make sure the
software is free for all its users. This General Public License applies to most
of the Free Software Foundation's software and to any other program whose
authors commit to using it. (Some other Free Software Foundation software is
covered by the GNU Library General Public License instead.) You can apply it
to your programs, too.
When we speak of free software, we are referring to freedom, not price. Our
General Public Licenses are designed to make sure that you have the freedom
to distribute copies of free software (and charge for this service if you wish),
that you receive source code or can get it if you want it, that you can change
the software or use pieces of it in new free programs; and that you know you
can do these things.
To protect your rights, we need to make restrictions that forbid anyone to
deny you these rights or to ask you to surrender the rights. These restrictions
translate to certain responsibilities for you if you distribute copies of the software, or if you modify it.
For example, if you distribute copies of such a program, whether gratis or
for a fee, you must give the recipients all the rights that you have. You must
make sure that they, too, receive or can get the source code. And you must
show them these terms so they know their rights.
We protect your rights with two steps: (1) copyright the software, and
(2) offer you this license which gives you legal permission to copy, distribute
and/or modify the software.
Also, for each author's protection and ours, we want to make certain that
everyone understands that there is no warranty for this free software. If the
software is modified by someone else and passed on, we want its recipients to
know that what they have is not the original, so that any problems introduced
by others will not reect on the original authors reputations.
Finally, any free program is threatened constantly by software patents. We
wish to avoid the danger that redistributors of a free program will individually
obtain patent licenses, in effect making the program proprietary. To prevent
this, we have made it clear that any patent must be licensed for everyone's
free use or not licensed at all.
The precise terms and conditions for copying, distribution and modification follow.
GNU GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
0. This License applies to any program or other work which contains a
notice placed by the copyright holder saying it may be distributed under the
terms of this General Public License. The "Program", below, refers to any such
program or work, and a "work based on the Program" means either the Program or any derivative work under copyright law: that is to say, a work containing the Program or a portion of it, either verbatim or with modications
and/or translated into another language. (Hereinafter, translation is included
without limitation in the term "modication".) Each licensee is addressed as
"you".
Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running the Program is not restricted, and the output from the Program is covered only if its
contents constitute a work based on the Program (independent of having been
made by running the Program). Whether that is true depends on what the
Program does.
1. You may copy and distribute verbatim copies of the Program's source
code as you receive it, in any medium, provided that you conspicuously and
appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to
the absence of any warranty; and give any other recipients of the Program a
copy of this License along with the Program.
You may charge a fee for the physical act of transferring a copy, and you
may at your option offer warranty protection in exchange for a fee.
2. You may modify your copy or copies of the Program or any portion of
it, thus forming a work based on the Program, and copy and distribute such
modifications or work under the terms of Section 1 above, provided that you
also meet all of these conditions:
a) You must cause the modified files to carry prominent notices stating that
you changed the files and the date of any change.
b) You must cause any work that you distribute or publish, that in whole
or in part contains or is derived from the Program or any part thereof, to be
licensed as a whole at no charge to all third parties under the terms of this
License.
c) If the modified program normally reads commands interactively when
run, you must cause it, when started running for such interactive use in the
most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying
that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License.
(Exception: if the Program itself is interactive but does not normally print such
an announcement, your work based on the Program is not required to print an
announcement.)
These requirements apply to the modified work as a whole. If identifiable
sections of that work are not derived from the Program, and can be reasonably
considered independent and separate works in themselves, then this License,
and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole
which is a work based on the Program, the distribution of the whole must be
on the terms of this License, whose permissions for other licensees extend to
the entire whole, and thus to each and every part regardless of who wrote it.
Thus, it is not the intent of this section to claim rights or contest your rights
to work written entirely by you; rather, the intent is to exercise the right to
control the distribution of derivative or collective works based on the Program.
In addition, mere aggregation of another work not based on the Program
with the Program (or with a work based on the Program) on a volume of a
storage or distribution medium does not bring the other work under the scope
of this License.
3. You may copy and distribute the Program (or a work based on it, under
Section 2) in object code or executable form under the terms of Sections 1 and
2 above provided that you also do one of the following:
a) Accompany it with the complete corresponding machine-readable source
code, which must be distributed under the terms of Sections 1 and 2 above on
a medium customarily used for software interchange; or,
b) Accompany it with a written offer, valid for at least three years, to give
any third party, for a charge no more than your cost of physically performing
source distribution, a complete machine-readable copy of the corresponding
source code, to be distributed under the terms of Sections 1 and 2 above on a
medium customarily used for software interchange; or,
c) Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code
or executable form with such an offer, in accord with Subsection b above.)
The source code for a work means the preferred form of the work for making modifications to it. For an executable work, complete source code means
all the source code for all modules it contains, plus any associated interface
definition files, plus the scripts used to control compilation and installation of
the executable. However, as a special exception, the source code distributed
need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the
operating system on which the executable runs, unless that component itself
accompanies the executable.
If distribution of executable or object code is made by offering access to
copy from a designated place, then offering equivalent access to copy the
source code from the same place counts as distribution of the source code,
even though third parties are not compelled to copy the source along with the
object code.
4. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy,
modify, sublicense or distribute the Program is void, and will automatically
terminate your rights under this License. However, parties who have received
copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance.
5. You are not required to accept this License, since you have not signed
it. However, nothing else grants you permission to modify or distribute the
Program or its derivative works. These actions are prohibited by law if you do
not accept this License. Therefore, by modifying or distributing the Program
(or any work based on the Program), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or
modifying the Program or works based on it.
6. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor
to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients exercise
of the rights granted herein. You are not responsible for enforcing compliance
by third parties to this License.
7. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict
the conditions of this License, they do not excuse you from the conditions of
this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Program at all. For example, if a patent
license would not permit royalty-free redistribution of the Program by all those
who receive copies directly or indirectly through you, then the only way you
could satisfy both it and this License would be to refrain entirely from distribution of the Program.
If any portion of this section is held invalid or unenforceable under any
particular circumstance, the balance of the section is intended to apply and the
section as a whole is intended to apply in other circumstances.
It is not the purpose of this section to induce you to infringe any patents or
other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system, which is implemented by public license practices. Many people
have made generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that system; it is
up to the author/donor to decide if he or she is willing to distribute software
through any other system and a licensee cannot impose that choice.
This section is intended to make thoroughly clear what is believed to be a
consequence of the rest of this License.
8. If the distribution and/or use of the Program is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder
who places the Program under this License may add an explicit geographical
distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License
incorporates the limitation as if written in the body of this License.
9. The Free Software Foundation may publish revised and/or new versions
of the General Public License from time to time. Such new versions will be
similar in spirit to the present version, but may differ in detail to address new
problems or concerns.
Each version is given a distinguishing version number. If the Program specifies a version number of this License which applies to it and "any later version",
you have the option of following the terms and conditions either of that version
or of any later version published by the Free Software Foundation. If the Program does not specify a version number of this License, you may choose any
version ever published by the Free Software Foundation.
10. If you wish to incorporate parts of the Program into other free programs
whose distribution conditions are different, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation,
write to the Free Software Foundation; we sometimes make exceptions for this.
Our decision will be guided by the two goals of preserving the free
status of all derivatives of our free software and of promoting the sharing and
reuse of software generally.
NO WARRANTY
11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE
IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED
BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING
THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE
PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE
DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
REPAIR OR CORRECTION.
12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED
TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY
WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED
INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR
A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED
OF THE POSSIBILITY OF SUCH DAMAGES.
END OF TERMS AND CONDITIONS
How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest possible
use to the public, the best way to achieve this is to make it free software which
everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest to attach
them to the start of each source file to most effectively convey the exclusion of
warranty; and each file should have at least the "copyright" line and a pointer
to where the full notice is found.
<one line to give the program's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your option) any
later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
Also add information on how to contact you by electronic and paper mail.
If the program is interactive, make it output a short notice like this when it
starts in an interactive mode:
Gnomovision version 69, Copyright (C) year name of author Gnomovision
comes with ABSOLUTELY NO WARRANTY; for details type `show w'. This is
free software, and you are welcome to redistribute it under certain conditions;
type `show c' for details.
The hypothetical commands `show w' and `show c' should show the appropriate parts of the General Public License. Of course, the commands you
use may be called something other than `show w' and `show c'; they could
even be mouse-clicks or menu items--whatever suits your program.
You should also get your employer (if you work as a programmer) or your
school, if any, to sign a "copyright disclaimer" for the program, if necessary.
Here is a sample; alter the names:
Yoyodyne, Inc., hereby disclaims all copyright interest in the program Gnomovision (which makes passes at compilers) written by James Hacker.
<signature of Ty Coon>, 1 April 1989 Ty Coon, President of Vice
This General Public License does not permit incorporating your program
into proprietary programs. If your program is a subroutine library, you may
consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Library General Public
License instead of this License.
CHAPTER 24
The attic
This holds material that is not really ready to be incorporated into the main
body, but that I don't want to lose. Basically, ignore it, unless you'd like to help
get it ready for inclusion.

The GMM estimator, briefly

The OLS estimator can be thought of as a method of moments estimator.
The population moment condition is that the regressors are orthogonal to the error:
\[
E\left[x_t\left(y_t - x_t'\beta\right)\right] = 0.
\]
The idea of the MM estimator is to choose the estimator to make the sample
counterpart hold:
\[
\frac{1}{n}\, X'\left(y - X\hat{\beta}\right) = 0,
\]
which is solved by the OLS estimator $\hat{\beta} = (X'X)^{-1}X'y$. So, likewise, a method of moments estimator can be defined for a general vector of sample moment conditions $m(\theta)$ whose population counterpart is zero at the true parameter value: let us choose $\hat{\theta}$ to satisfy $m(\hat{\theta}) = 0$. If the dimension
of $m(\theta)$ is greater than the dimension of $\theta$, the equations cannot all be satisfied exactly, and the GMM estimator instead minimizes a quadratic form in the moment conditions.
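The MM characterization of OLS can be verified directly by solving the sample moment conditions for a small exact dataset. A minimal Python sketch (two regressors and noise-free data, so the solution is exact; these are assumptions of the illustration):

```python
def ols_via_moments(X, y):
    # Solve the sample moment conditions (1/n) X'(y - X b) = 0, i.e. the
    # normal equations X'X b = X'y, for the two-regressor case by Cramer's rule.
    n = len(y)
    xtx = [[sum(X[t][i] * X[t][j] for t in range(n)) for j in range(2)] for i in range(2)]
    xty = [sum(X[t][i] * y[t] for t in range(n)) for i in range(2)]
    det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
    b0 = (xty[0] * xtx[1][1] - xty[1] * xtx[0][1]) / det
    b1 = (xty[1] * xtx[0][0] - xty[0] * xtx[1][0]) / det
    return [b0, b1]

# Exact data: y = 1 + 2*x with no error, so the moment conditions
# recover the coefficients (1, 2) exactly.
X = [[1.0, float(t)] for t in range(1, 6)]
y = [1.0 + 2.0 * t for t in range(1, 6)]
b = ols_via_moments(X, y)
```

With noise-free data the orthogonality conditions hold exactly at the true coefficients, which is why the recovered values are exact rather than approximate.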
The negative binomial model. Consider the Poisson model with multiplicative latent heterogeneity:
\[
f_Y(y|x, \nu) = \frac{\exp(-\lambda\nu)\,(\lambda\nu)^{y}}{y!},
\]
where $\lambda = \exp(x'\beta)$ and $\nu$ is an unobservable that captures heterogeneity. The marginal density of $y$ is obtained by integrating out $\nu$:
\[
f_Y(y|x) = \int \frac{\exp(-\lambda\nu)\,(\lambda\nu)^{y}}{y!}\, f_\nu(\nu)\, d\nu.
\]
This density can be used directly, perhaps using numerical integration to evaluate the likelihood function. In some cases, though, the integral will have an analytic solution. If $\nu$ follows a certain one parameter gamma density, then
\[
f_Y(y|x) = \frac{\Gamma(y+\psi)}{\Gamma(y+1)\,\Gamma(\psi)}
\left(\frac{\psi}{\psi+\lambda}\right)^{\psi}
\left(\frac{\lambda}{\psi+\lambda}\right)^{y}, \qquad (24.1.1)
\]
where $\psi > 0$ is the gamma shape parameter. This is the negative binomial density, with mean $\lambda$ and variance
\[
V(y|x) = \lambda\left(1 + \frac{\lambda}{\psi}\right).
\]
Note that how $\psi$ is parameterized determines the form of the overdispersion:

- If $\psi = \lambda/\alpha$, where $\alpha > 0$, then $V(y|x) = \lambda(1+\alpha)$. This is referred to as the NB-I model.
- If $\psi = 1/\alpha$, where $\alpha > 0$, then $V(y|x) = \lambda(1+\alpha\lambda)$. This is referred to as the NB-II model.

So both forms of the NB model allow for overdispersion, with the NB-II model
allowing for a more radical form.

Testing reduction of a NB model to a Poisson model cannot be done
by testing $\alpha = 0$ using standard Wald or LR procedures. The critical values need to be adjusted to account for the fact that $\alpha = 0$ is on the boundary of the parameter space.
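The claimed mean and variance of the negative binomial density (24.1.1) can be checked numerically. A sketch using the NB-II parameterization, computing the pmf in logs to avoid overflow in the gamma functions (the values of $\lambda$ and $\alpha$ are arbitrary choices of the illustration):

```python
import math

def nb_pmf(y, lam, psi):
    # negative binomial pmf as in equation (24.1.1), computed in logs
    logp = (math.lgamma(y + psi) - math.lgamma(y + 1.0) - math.lgamma(psi)
            + psi * math.log(psi / (psi + lam))
            + y * math.log(lam / (psi + lam)))
    return math.exp(logp)

lam, alpha = 2.0, 0.5
psi = 1.0 / alpha                       # NB-II parameterization
probs = [nb_pmf(y, lam, psi) for y in range(200)]  # tail beyond 200 is negligible
mean = sum(y * p for y, p in enumerate(probs))
var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
```

With these values the NB-II formula gives $V(y|x) = \lambda(1 + \alpha\lambda) = 2(1 + 1) = 4$, and the numerically computed moments agree.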
Here are NB-I estimation results for OBDV, obtained using this estimation program.

Function value   -2.2656

              params      t(OPG)     t(Sand.)    t(Hess)
constant     -0.055766   -0.16793   -0.17418    -0.17215
pub_ins       0.47936     2.9406     2.8296      2.9122
priv_ins      0.20673     1.3847     1.4201      1.4086
sex           0.34916     3.2466     3.4148      3.3434
age           0.015116    3.3569     3.8055      3.5974
educ          0.014637    0.78661    0.67910     0.73757
inc           0.012581    0.60022    0.93782     0.76330
ln_alpha      1.7389     23.669     11.295      16.660

Information Criteria
Consistent Akaike    2323.3
Schwartz             2315.3
Hannan-Quinn         2294.8
Akaike               2281.6

*********************************************************************
MEPS data, OBDV
negbin results
Strong convergence
Observations = 500

Function value   -2.2616

              params      t(OPG)     t(Sand.)    t(Hess)
constant     -0.65981    -1.8913    -1.4717     -1.6977
pub_ins       0.68928     2.9991     3.1825      3.1436
priv_ins      0.22171     1.1515     1.2057      1.1917
sex           0.44610     3.8752     2.9768      3.5164
age           0.024221    3.8193     4.5236      4.3239
educ          0.020608    0.94844    0.74627     0.86004
inc           0.020040    0.87374    0.72569     0.86579
ln_alpha      0.47421     5.6622     4.6278      5.6281

Information Criteria
Consistent Akaike    2319.3
Schwartz             2311.3
Hannan-Quinn         2290.8
Akaike               2277.6
*********************************************************************
For the OBDV model, the NB-II model does a better job, in terms of the average log-likelihood and the information criteria. Note that both versions of the NB model fit much better than does the Poisson model. The t-statistics are now similar for all three ways of calculating them, which is evidence that the model may be well specified. The estimated overdispersion parameter α is highly significant.

To check the plausibility of the NB-II model, we can compare the sample unconditional variance with the estimated unconditional variance according to the model, V̂(y) = (1/n) Σ_t (λ̂_t + α̂ λ̂_t²), and we can compare the actual and fitted frequencies of each count. For OBDV, by this measure, there are many more actual zeros than predicted. For ERV, there are somewhat more actual zeros than fitted, but the difference is not too important.
Why might OBDV not fit the zeros well? What if people make the decision to contact the doctor for a first visit when they are sick, but the doctor decides on whether or not follow-up visits are needed? This is a principal/agent type situation, where the total number of visits depends upon the decisions of both the patient and the doctor. Since different parameters may govern the two decision-makers' choices, we might expect that different parameters govern the probability of zero visits and the distribution of positive counts. Let β_p be the parameters of the patient's demand for visits, and let β_d be the parameters of the doctor's demand for visits. The patient will initiate visits according to a discrete choice model, for example, a logit model:

Pr(y = 0) = 1 / [1 + exp(x'β_p)]
Pr(y > 0) = exp(x'β_p) / [1 + exp(x'β_p)].

The above probabilities are used to estimate the binary 0/1 hurdle process. Then, for the observations where visits are positive, a truncated Poisson density is estimated. This density is

f_Y(y | y > 0) = f_Y(y) / Pr(y > 0) = [exp(−λ_d) λ_d^y / y!] / [1 − exp(−λ_d)],

where λ_d = exp(x'β_d). Since the hurdle and truncated components of the overall density for y share no parameters, they may be estimated separately, which is computationally more efficient than estimating the overall model.
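The two components can be sketched in a few lines. The following is my own Python illustration (the text's programs are in Octave), verifying that the truncated Poisson piece is a proper density on y = 1, 2, ... with mean λ/(1 − exp(−λ)):

```python
# Sketch: the two pieces of the hurdle Poisson model.
import math

def logit_prob_positive(xb):
    # Pr(y > 0) under the logit hurdle, xb = x'beta_p
    return math.exp(xb) / (1.0 + math.exp(xb))

def trunc_poisson_pmf(y, lam):
    # f(y | y > 0) = exp(-lam) lam^y / (y! (1 - exp(-lam))), y = 1, 2, ...
    return math.exp(-lam) * lam ** y / (math.factorial(y) * (1.0 - math.exp(-lam)))

lam = 2.0
total = sum(trunc_poisson_pmf(y, lam) for y in range(1, 100))
mean = sum(y * trunc_poisson_pmf(y, lam) for y in range(1, 100))
# mean of a zero-truncated Poisson is lam / (1 - exp(-lam))
```

Because the hurdle and truncated pieces share no parameters, their log-likelihoods can be maximized separately, exactly as the text notes.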
Here are hurdle Poisson estimation results for OBDV, obtained from this estimation program.

*********************************************************************
MEPS data, OBDV
logit results
Strong convergence
Observations = 500
Function value        -0.58939

t-Stats
           params      t(OPG)     t(Hess)    t(Sand.)
constant   -1.5502     -2.5709    -2.5269    -2.5560
pub_ins    1.0519      3.0520     3.0027     3.0384
priv_ins   0.45867     1.7289     1.6924     1.7166
sex        0.63570     3.0873     3.1677     3.1366
age        0.018614    2.1547     2.1969     2.1807
educ       0.039606    1.0467     0.98710    1.0222
inc        0.077446    1.7655     2.1672     1.9601

Information Criteria
Consistent Akaike     639.89
Schwartz              632.89
Hannan-Quinn          614.96
Akaike                603.39
*********************************************************************
*********************************************************************
MEPS data, OBDV
tpoisson results
Strong convergence
Observations = 500
Function value        -2.7042

t-Stats
           params       t(OPG)     t(Hess)    t(Sand.)
constant   0.54254      7.4291     1.1747     3.2323
pub_ins    0.31001      6.5708     1.7573     3.7183
priv_ins   0.014382     0.29433    0.10438    0.18112
sex        0.19075      10.293     1.1890     3.6942
age        0.016683     16.148     3.5262     7.9814
educ       0.016286     4.2144     0.56547    1.6353
inc        -0.0079016   -2.3186    -0.35309   -0.96078

Information Criteria
Consistent Akaike     2754.7
Schwartz              2747.7
Hannan-Quinn          2729.8
Akaike                2718.2
*********************************************************************
For the hurdle Poisson models, the ERV fit is very accurate. The OBDV fit is not so good. Zeros are exact, but 1's and 2's are underestimated, and higher counts are overestimated. For the NB-II fits, performance is at least as good as the hurdle Poisson model, and one should recall that many fewer parameters are used. Hurdle versions of the negative binomial model are also widely used.
24.2.1. Finite mixture models. The finite mixture approach to fitting health care demand was introduced by Deb and Trivedi (1997). The mixture approach has the intuitive appeal of allowing for subgroups of the population with different health status. If individuals are classified as healthy or unhealthy then two subgroups are defined. A finer classification scheme would lead to more subgroups. Many studies have incorporated objective and/or subjective indicators of health status in an effort to capture this heterogeneity. The available objective measures, such as limitations on activity, are not necessarily very informative about a person's overall health status. Subjective, self-reported measures may suffer from the same problem, and may also not be exogenous.
The finite mixture model is

f_Y(y, φ_1, ..., φ_p, π_1, ..., π_{p−1}) = Σ_{i=1}^p π_i f_Y^{(i)}(y, φ_i),

where π_i > 0, i = 1, 2, ..., p, π_p = 1 − Σ_{i=1}^{p−1} π_i, and Σ_{i=1}^p π_i = 1. Identification requires that the π_i are ordered in some way, for example π_1 ≥ π_2 ≥ ··· ≥ π_p, and that φ_i ≠ φ_j for i ≠ j, so that the labeling of the components is pinned down. The properties of the mixture density follow in a straightforward manner from those of each component density.

The total number of parameters grows rapidly with the number of component densities. It is possible to constrain parameters across the mixtures.

Testing for the number of component densities is a tricky issue. For example, testing for p = 1 (a single component, which is to say, no mixture) versus p = 2 (a mixture of two components) involves the restriction π_1 = 1, which is on the boundary of the parameter space. Not only that, but when π_1 = 1 the parameters of the second component density can take on any value without affecting the density. Usual methods such as the likelihood ratio test are not applicable when parameters are on the boundary under the null hypothesis. Information criteria means of choosing the model (see below) are valid.
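The basic mechanics of a mixture density are simple to illustrate. A sketch in Python (my own example, using Poisson components rather than the NB components of the text, purely for brevity), verifying that a two-component mixture is a proper density with mean π_1 λ_1 + (1 − π_1) λ_2:

```python
# Sketch: a two-component Poisson mixture density.
import math

def poisson_pmf(y, lam):
    return math.exp(-lam) * lam ** y / math.factorial(y)

pi1, lam1, lam2 = 0.7, 1.0, 5.0  # hypothetical mixing weight and component means
mix = [pi1 * poisson_pmf(y, lam1) + (1.0 - pi1) * poisson_pmf(y, lam2)
       for y in range(100)]
total = sum(mix)
mean = sum(y * p for y, p in enumerate(mix))
```

The moments of the mixture are the same mixture of the component moments, which is why the properties of the mixture density follow so directly from those of the components.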
The following are results for a mixture of 2 negative binomial (NB-I) models, for the OBDV data, which you can replicate using this estimation program.

*********************************************************************
MEPS data, OBDV
mixnegbin results
Strong convergence
Observations = 500
Function value        -2.2312

t-Stats
                params      t(OPG)     t(Hess)    t(Sand.)
constant        0.64852     1.3851     1.3226     1.4358
pub_ins         -0.062139   -0.23188   -0.13802   -0.18729
priv_ins        0.093396    0.46948    0.33046    0.40854
sex             0.39785     2.6121     2.2148     2.4882
age             0.015969    2.5173     2.5475     2.7151
educ            -0.049175   -1.8013    -1.7061    -1.8036
inc             0.015880    0.58386    0.76782    0.73281
ln_alpha        0.69961     2.3456     2.0396     2.4029
constant        -3.6130     -1.6126    -1.7365    -1.8411
pub_ins         2.3456      1.7527     3.7677     2.6519
priv_ins        0.77431     0.73854    1.1366     0.97338
sex             0.34886     0.80035    0.74016    0.81892
age             0.021425    1.1354     1.3032     1.3387
educ            0.22461     2.0922     1.7826     2.1470
inc             0.019227    0.20453    0.40854    0.36313
ln_alpha        2.8419      6.2497     6.8702     7.6182
logit_inv_mix   0.85186     1.7096     1.4827     1.7883

Information Criteria
Consistent Akaike     2353.8
Schwartz              2336.8
Hannan-Quinn          2293.3
Akaike                2265.2
*********************************************************************

Delta method for mix parameter st. err.
        mix       se_mix
        0.70096   0.12043
The 95% confidence interval for the mix parameter is perilously close to 1, which suggests that there may really be only one component density, rather than a mixture. Again, this is not the way to test this; it is merely suggestive.

Education is interesting. For the subpopulation that is healthy, i.e., that makes relatively few visits, education seems to have a positive effect on visits, while for the other subpopulation the estimated effect of education is negative (note the sign change on educ across the two components).

The following are results for a constrained mixture of two negative binomial densities, where the slope coefficients in λ_j = exp(x'β_j) are restricted to be equal across the two components; only the constants and the overdispersion parameters are allowed to differ.
*********************************************************************
MEPS data, OBDV
cmixnegbin results
Strong convergence
Observations = 500
Function value        -2.2441

t-Stats
                params      t(OPG)     t(Hess)    t(Sand.)
constant        -0.34153    -0.94203   -0.91456   -0.97943
pub_ins         0.45320     2.6206     2.5088     2.7067
priv_ins        0.20663     1.4258     1.3105     1.3895
sex             0.37714     3.1948     3.4929     3.5319
age             0.015822    3.1212     3.7806     3.7042
educ            0.011784    0.65887    0.50362    0.58331
inc             0.014088    0.69088    0.96831    0.83408
ln_alpha        1.1798      4.6140     7.2462     6.4293
const_2         1.2621      0.47525    2.5219     1.5060
lnalpha_2       2.7769      1.5539     6.4918     4.2243
logit_inv_mix   2.4888      0.60073    3.7224     1.9693

Information Criteria
Consistent Akaike     2323.5
Schwartz              2312.5
Hannan-Quinn          2284.3
Akaike                2266.1
*********************************************************************

Delta method for mix parameter st. err.
        mix       se_mix
        0.92335   0.047318
Information criteria penalize the maximized log-likelihood, ln L(θ̂), for the number of parameters, k, with penalties that depend on the sample size, n:

CAIC = −2 ln L(θ̂) + k (ln n + 1)
BIC  = −2 ln L(θ̂) + k ln n
AIC  = −2 ln L(θ̂) + 2k
It can be shown that the CAIC and BIC will select the correctly specified model from a group of models, asymptotically. This doesn't mean, of course, that the correct model is necessarily in the group. The AIC is not consistent, and will asymptotically favor an over-parameterized model over the correctly specified model. Here are information criteria values for the models we've seen, for OBDV. According to the AIC, the best is the MNB-I, which has relatively many
TABLE 5. Information Criteria, OBDV

Model             AIC    BIC   CAIC
Poisson          3822   3911   3918
NB-I             2282   2315   2323
Hurdle Poisson   3333   3381   3395
MNB-I            2265   2337   2354
CMNB-I           2266   2312   2323
parameters. The best according to the BIC is CMNB-I, and according to CAIC,
the best is NB-I. The Poisson-based models do not do well.
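As a check on the formulas, the NB-I row of the table can be reproduced from the NB-I output above: the reported average log-likelihood is −2.2656 over n = 500 observations, with k = 8 parameters (the parameter count is read off the results table, including ln_alpha). A minimal Python sketch:

```python
# Sketch: computing AIC, BIC (Schwartz) and CAIC from an average
# log-likelihood, sample size n and parameter count k.
import math

def criteria(avg_loglik, n, k):
    lnL = avg_loglik * n
    aic = -2.0 * lnL + 2.0 * k
    bic = -2.0 * lnL + k * math.log(n)
    caic = -2.0 * lnL + k * (math.log(n) + 1.0)
    return aic, bic, caic

aic, bic, caic = criteria(-2.2656, 500, 8)
print(round(aic, 1), round(bic, 1), round(caic, 1))  # 2281.6 2315.3 2323.3
```

These match the Akaike (2281.6), Schwartz (2315.3) and Consistent Akaike (2323.3) values reported for NB-I.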
24.3. Models for time series data
This section can be ignored in its present form. Just left in to form a basis
for completion (by someone else ?!) at some point.
Hamilton, Time Series Analysis is a good reference for this section. This is
very incomplete and contributions would be very welcome.
Pure time series methods model the behavior of a variable y_t as a function only of its own lagged values, unconditional on other observable variables. One can think of this as modeling y_t after all other explanatory variables have been marginalized out. While it's not immediately clear why a model that has other explanatory variables should marginalize to a linear in the parameters time series model, most time series work is done with linear models, though nonlinear time series is also a large and growing field. We'll stick with linear time series models.
DEFINITION 53 (Stochastic process). A stochastic process is a sequence of random variables, indexed by time:

(24.3.1)    {Y_t}_{t=−∞}^{∞}

DEFINITION 54 (Time series). A time series is one observation of a stochastic process, over a specific interval:

(24.3.2)    {y_t}_{t=1}^{n}

So a time series is a sample of size n from a stochastic process. It's important to keep in mind that conceptually, one could draw another sample, and that the values would be different.
DEFINITION 55 (Autocovariance). The j-th autocovariance of a stochastic process is

(24.3.3)    γ_{jt} = E[(y_t − μ_t)(y_{t−j} − μ_{t−j})],

where μ_t = E(y_t).

DEFINITION 56 (Covariance (weak) stationarity). A stochastic process is covariance stationary if it has time constant mean and autocovariances of all orders:

μ_t = μ, ∀t
γ_{jt} = γ_j, ∀t

As we've seen, this implies that γ_j = γ_{−j}: the autocovariances depend on the interval between observations, but not the time of the observations.

DEFINITION 57 (Strong stationarity). A stochastic process is strongly stationary if the joint distribution of an arbitrary collection of the {Y_t} doesn't depend on t.

Since moments are determined by the distribution, strong stationarity implies weak stationarity, provided the first and second moments exist.

What is the mean of Y_t? The time series is one sample from the stochastic process. One could imagine M repeated samples from the stochastic process, {y_t^m}, and calculate an ensemble average across the samples. The problem is, we have only one sample to work with, since we can't go back in time and collect another. However, if the process is stationary and ergodic, the time average

(24.3.4)    (1/n) Σ_{t=1}^n y_t

is a consistent estimator of the mean μ.
DEFINITION 58 (Ergodicity). A stationary stochastic process is ergodic (for the mean) if the time average (1/n) Σ_{t=1}^n y_t converges in probability to the mean μ.

DEFINITION 59 (Autocorrelation). The j-th autocorrelation, ρ_j, is just the j-th autocovariance divided by the variance:

(24.3.5)    ρ_j = γ_j / γ_0

DEFINITION 60 (White noise). White noise is just the time series literature term for a classical error: ε_t is white noise if i) E(ε_t) = 0, ∀t, ii) V(ε_t) = σ², ∀t, and iii) ε_t and ε_s are independent, t ≠ s. Gaussian white noise just adds a normality assumption.
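The point of ergodicity is that one realization is enough to estimate these moments. A sketch (my own Python illustration) that estimates the first autocorrelation of a simulated AR(1) process y_t = 0.5 y_{t−1} + ε_t from a single long realization; under stationarity and ergodicity the sample autocorrelation should be close to the population value 0.5:

```python
# Sketch: sample autocovariances/autocorrelations from one realization
# of a stationary, ergodic process (a simulated AR(1) with phi = 0.5).
import random

random.seed(1)
phi, n = 0.5, 20000
y = [0.0]
for _ in range(n):
    y.append(phi * y[-1] + random.gauss(0.0, 1.0))
y = y[1:]  # drop the startup value

mu = sum(y) / n  # time average, consistent for E(y_t) by ergodicity

def gamma_hat(j):
    # sample j-th autocovariance
    return sum((y[t] - mu) * (y[t - j] - mu) for t in range(j, n)) / n

rho1 = gamma_hat(1) / gamma_hat(0)  # should be close to phi = 0.5
```

With n = 20000 observations the estimate is within sampling error of 0.5.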
24.3.2. ARMA models. With these concepts, we can discuss ARMA models. These are closely related to the AR and MA error processes that we've already discussed. The main difference is that the lhs variable is observed directly now.

MA(q) processes. A q-th order moving average (MA) process is

y_t = μ + ε_t + θ_1 ε_{t−1} + θ_2 ε_{t−2} + ··· + θ_q ε_{t−q},

where ε_t is white noise. With this, the second moments are easy to find. The variance is

γ_0 = E(y_t − μ)² = σ²(1 + θ_1² + θ_2² + ··· + θ_q²)

and the autocovariances are

γ_j = σ²(θ_j + θ_{j+1}θ_1 + θ_{j+2}θ_2 + ··· + θ_q θ_{q−j}),  j ≤ q
γ_j = 0,  j > q.

Therefore an MA(q) process is covariance stationary and ergodic, as long as σ² and all of the θ_j are finite.

AR(p) processes. An AR(p) process can be represented as

y_t = c + φ_1 y_{t−1} + φ_2 y_{t−2} + ··· + φ_p y_{t−p} + ε_t.

The dynamic behavior of an AR(p) process can be studied by writing this p-th order difference equation as a vector first order difference equation. Define Y_t = (y_t, y_{t−1}, ..., y_{t−p+1})', C = (c, 0, ..., 0)', E_t = (ε_t, 0, ..., 0)', and the companion matrix

F = [ φ_1  φ_2  ···  φ_{p−1}  φ_p ]
    [ 1    0    ···  0        0   ]
    [ 0    1    ···  0        0   ]
    [ ...            ...          ]
    [ 0    0    ···  1        0   ]

so that

Y_t = C + F Y_{t−1} + E_t.

With this, we can recursively work forward in time:

Y_{t+1} = C + F Y_t + E_{t+1}
        = C + F(C + F Y_{t−1} + E_t) + E_{t+1}
        = (I_p + F)C + F² Y_{t−1} + F E_t + E_{t+1}

and

Y_{t+2} = (I_p + F + F²)C + F³ Y_{t−1} + F² E_t + F E_{t+1} + E_{t+2},

or, in general,

Y_{t+j} = (I_p + F + ··· + F^j)C + F^{j+1} Y_{t−1} + F^j E_t + F^{j−1} E_{t+1} + ··· + E_{t+j}.

Consider the impact of a shock in period t on y_{t+j}. This is simply the (1,1) element of F^j. If the process is stationary, this impact must die off as j grows; otherwise a shock would permanently change the path of y. Therefore, stationarity requires

lim_{j→∞} F^j_{(1,1)} = 0.

Consider the eigenvalues of the matrix F. These are the values λ such that |F − λ I_p| = 0. For p = 1, F is simply the scalar φ_1, so the single eigenvalue is λ = φ_1. When p = 2, the matrix F is

F = [ φ_1  φ_2 ]
    [ 1    0   ]

so |F − λ I_2| = λ² − λφ_1 − φ_2, and the eigenvalues are the roots of λ² − λφ_1 − φ_2 = 0, which can be found using the quadratic equation. This generalizes. For a p-th order AR process, the eigenvalues are the roots of

λ^p − λ^{p−1} φ_1 − λ^{p−2} φ_2 − ··· − λ φ_{p−1} − φ_p = 0.

Supposing that all of the roots of this polynomial are distinct, then the matrix F can be factored as F = T Λ T^{−1}, where T is the matrix whose columns are the eigenvectors of F, and Λ is a diagonal matrix with the eigenvalues on the main diagonal. Using this decomposition, we can write

F^j = (T Λ T^{−1})(T Λ T^{−1}) ··· (T Λ T^{−1}) = T Λ^j T^{−1},

and Λ^j is a diagonal matrix with λ_i^j on the main diagonal. So F^j_{(1,1)} → 0 requires that

|λ_i| < 1,  i = 1, 2, ..., p.

The eigenvalues may be complex-valued. Complex eigenvalues always occur in conjugate pairs: if a + bi is an eigenvalue of F, then so is a − bi. The modulus of a complex number a + bi is

mod(a + bi) = √(a² + b²),

which is real-valued, and stationarity requires the modulus of each eigenvalue to be less than one. This leads to the famous statement that stationarity requires the roots of the determinantal polynomial to lie inside the complex unit circle. (draw picture here.) When there are roots on the unit circle (unit roots) or outside the unit circle, the process is nonstationary.

Dynamic multipliers: ∂y_{t+j}/∂ε_t = F^j_{(1,1)} is a dynamic multiplier, and the sequence of these for j = 0, 1, 2, ... is the impulse-response function. The dynamic multipliers are functions of the eigenvalues λ_i, so given stationarity the impulse-response function tapers off to zero.

Moments of a stationary AR(p) process. Assuming stationarity, E(y_t) = μ, ∀t, so taking expectations of the AR(p) equation gives

μ = c + φ_1 μ + φ_2 μ + ··· + φ_p μ,

so

μ = c / (1 − φ_1 − φ_2 − ··· − φ_p).

Using this, the AR(p) equation can be written in deviations from the mean:

y_t − μ = φ_1(y_{t−1} − μ) + ··· + φ_p(y_{t−p} − μ) + ε_t.

With this, the second moments are easy to find. Multiplying both sides by y_{t−j} − μ and taking expectations gives

γ_j = φ_1 γ_{j−1} + φ_2 γ_{j−2} + ··· + φ_p γ_{j−p},  j ≥ 1,

and, for j = 0,

γ_0 = φ_1 γ_1 + φ_2 γ_2 + ··· + φ_p γ_p + σ².

Using the fact that γ_{−j} = γ_j, one can take the p + 1 equations for j = 0, 1, ..., p, which have p + 1 unknowns (γ_0, γ_1, ..., γ_p), and solve for the unknowns. With these, the γ_j for j > p can be solved for recursively.

Invertibility of AR processes. Define the lag operator L by L y_t = y_{t−1}. The lag operator behaves as an algebraic quantity, e.g., L² y_t = y_{t−2}. A mean-zero AR(1) process can be written as

(1 − φL) y_t = ε_t.

Multiply both sides by 1 + φL + φ²L² + ··· + φ^j L^j to get

(1 + φL + ··· + φ^j L^j)(1 − φL) y_t = (1 − φ^{j+1} L^{j+1}) y_t = y_t − φ^{j+1} y_{t−j−1}.

Now as j → ∞, φ^{j+1} → 0 given stationarity (|φ| < 1), so

y_t = lim_{j→∞} (1 + φL + ··· + φ^j L^j) ε_t,

which is to say the AR(1) process has the MA(∞) representation y_t = Σ_{i=0}^∞ φ^i ε_{t−i}, and we can write (1 − φL)^{−1} = 1 + φL + φ²L² + ···. Similarly, the mean-zero AR(p) process

(1 − φ_1 L − φ_2 L² − ··· − φ_p L^p) y_t = ε_t

can be written using the factorization

(1 − λ_1 L)(1 − λ_2 L) ··· (1 − λ_p L) y_t = ε_t.

Setting L = 1/λ and multiplying by λ^p, the LHS is precisely the determinantal polynomial that gives the eigenvalues of F. Therefore, the λ_i of the factorization are simply the eigenvalues of the companion matrix F, and given stationarity, all of the λ_i are less than one in modulus. Therefore, we can invert each first order polynomial on the LHS to get

y_t = (Σ_j λ_1^j L^j)(Σ_j λ_2^j L^j) ··· (Σ_j λ_p^j L^j) ε_t.

The RHS is a product of infinite-order polynomials in L, which can be represented as

y_t = (1 + ψ_1 L + ψ_2 L² + ···) ε_t,

where the ψ_i are real-valued functions of the λ_i. This is the MA(∞) representation of a stationary AR(p) process. The ψ_i are real-valued even when some λ_i are complex, because complex eigenvalues occur in conjugate pairs, and sums and products of conjugate pairs, e.g., (a + bi)(a − bi) = a² + b², are real-valued.

Invertibility of MA processes. An MA process is said to be invertible if ε_t can be expressed as a (convergent) function of current and past y's. For the MA(1) process y_t − μ = (1 + θL) ε_t with |θ| < 1, inverting the polynomial gives ε_t = Σ_{j=0}^∞ (−θ)^j (y_{t−j} − μ).

It turns out that one can always manipulate the parameters of an MA(q) process to find an invertible representation. For example, the two MA(1) processes

y_t − μ = (1 + θL) ε_t,    ε_t white noise with variance σ²,

and

y_t* − μ = (1 + θ^{−1} L) ε_t*,    ε_t* white noise with variance θ²σ²,

have the same moments. For example, we've seen that

γ_0 = σ²(1 + θ²),

and for the second process

γ_0* = θ²σ²(1 + θ^{−2}) = σ²(θ² + 1),

so the variances are the same. It turns out that all the autocovariances will be the same, as is easily checked. This means that the two MA processes are observationally equivalent. As before, it's impossible to distinguish between observationally equivalent processes on the basis of data.

For a given MA(q) process, it's always possible to manipulate the parameters to find an invertible representation, which is unique. The invertible representation is the one that expresses ε_t as a convergent function of past and present y's, which is what is needed, for example, to evaluate the likelihood recursively.

ARMA(p,q) processes. Putting the pieces together, an ARMA(p,q) process is

y_t = c + φ_1 y_{t−1} + ··· + φ_p y_{t−p} + ε_t + θ_1 ε_{t−1} + ··· + θ_q ε_{t−q}.

Low order AR and MA models can usually offer a satisfactory representation of univariate time series data with a reasonable number of parameters. Stationarity and invertibility of ARMA models is similar to what we've seen; we won't go into the details. Likewise, calculating moments is similar.
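Both the eigenvalue condition for stationarity and the observational equivalence of the two MA(1) parameterizations are easy to verify numerically. A sketch (my own Python illustration; the parameter values are hypothetical):

```python
# Two checks: (i) AR(2) stationarity via the eigenvalues of the companion
# matrix F = [[phi1, phi2], [1, 0]]; (ii) the MA(1) processes with parameter
# theta (variance sigma^2) and 1/theta (variance theta^2 sigma^2) have
# identical autocovariances.
import cmath

def companion_eigs(phi1, phi2):
    # eigenvalues of F are the roots of z^2 - phi1 z - phi2 = 0
    disc = cmath.sqrt(phi1 ** 2 + 4.0 * phi2)
    return ((phi1 + disc) / 2.0, (phi1 - disc) / 2.0)

stationary = all(abs(z) < 1 for z in companion_eigs(0.5, 0.3))   # both roots inside unit circle
explosive = all(abs(z) < 1 for z in companion_eigs(1.2, 0.3))    # a root outside: not stationary

def ma1_gammas(theta, sigma2):
    # (gamma_0, gamma_1) of y_t - mu = (1 + theta L) eps_t
    return (sigma2 * (1.0 + theta ** 2), sigma2 * theta)

theta, sigma2 = 2.0, 1.0
g_a = ma1_gammas(theta, sigma2)
g_b = ma1_gammas(1.0 / theta, theta ** 2 * sigma2)  # observationally equivalent
```

The second check reproduces the variance calculation above: both parameterizations give γ_0 = σ²(1 + θ²) and γ_1 = σ²θ.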
Bibliography
[1] Davidson, R. and J.G. MacKinnon (1993) Estimation and Inference in Econometrics, Oxford
Univ. Press.
[2] Davidson, R. and J.G. MacKinnon (2004) Econometric Theory and Methods, Oxford Univ.
Press.
[3] Gallant, A.R. (1985) Nonlinear Statistical Models, Wiley.
[4] Gallant, A.R. (1997) An Introduction to Econometric Theory, Princeton Univ. Press.
[5] Hamilton, J. (1994) Time Series Analysis, Princeton Univ. Press.
[6] Hayashi, F. (2000) Econometrics, Princeton Univ. Press.
[7] Wooldridge (2003), Introductory Econometrics, Thomson. (undergraduate level, for supplementary use only).
Index

leverage, 28
likelihood function, 49
matrix, idempotent, 27
matrix, projection, 26
matrix, symmetric, 27
observations, influential, 27
outliers, 27
own influence, 29
parameter space, 49
R-squared, centered, 32
R-squared, uncentered, 31