Econometrics I Fall 2020
I Probability 1
1 Basic concepts of probability 1
1.1 Radon-Nikodym Derivative . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Conditional Means as Orthogonal Projections . . . . . . . . . . . . . 12
1.3 Different Concepts of Dependence . . . . . . . . . . . . . . . . . . . . 13
1.4 Other characteristics of distributions . . . . . . . . . . . . . . . . . . 15
1.5 Stochastic Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2 Stochastic Convergence 23
II Statistics 36
4 Statistical Models: Identification and Specification 36
J.C. Escanciano Econometrics I Fall 2020
Part I
Probability
These notes are not a substitute for Shao's book, which is the main reference for the first part of the course (Probability, Chapter 1). The notes on probability are intended as a complement to the book. For the second part, on Statistics, these notes are the main reference, and Shao's book becomes a complement.
(i) $\Omega$ is the set of all possible outcomes, and it is called the sample space.

Basic Definitions: The subsets of $\mathcal{F}$ are called events. If a statement holds for all $\omega \in A$ such that $P(A) = 1$, we say that the statement is true almost surely (a.s.). Let $(M, d)$ be a metric space and $\mathcal{B}$ its Borel $\sigma$-field (the $\sigma$-field generated by the open sets). A random element $X$ with values in $(M, d)$ is a measurable function from $\Omega$ to $M$, i.e., $X : (\Omega, \mathcal{F}) \to (M, \mathcal{B})$ such that $X^{-1}(B) \in \mathcal{F}$ for all Borel sets $B \in \mathcal{B}$. If $M = \mathbb{R}$, $X$ is called a random variable, and if $M = \mathbb{R}^n$, $n > 1$, $X$ a random vector (r.v.). $P_X(B) \equiv P(X^{-1}(B))$ is the law of $X$. For $M = \mathbb{R}^n$, $F_X(x) = P_X((-\infty, x]) = P(X \le x)$ is the cumulative distribution function (cdf).
The characteristic function is the function $\varphi_X(x) = E[e^{ix'X}]$, where $i = \sqrt{-1}$ and $E$ is the expectation operator
$$E[X] = \int_{\Omega} X(\omega)\,dP = \int_{M} x\,dP_X.$$
Notice that
$$F_X(x) = P(X \le x) = E[1_{(-\infty, x]}(X)],$$
and the $B_i$'s are disjoint. This last property is useful and explains the subadditivity and continuity properties of probabilities. In particular, if $A_1 \subseteq A_2 \subseteq \cdots$, then $A_n = \cup_{i=1}^{n} A_i = \cup_{i=1}^{n} B_i$ and
$$P\Big(\lim_{n\to\infty} A_n\Big) = P\Big(\bigcup_{i=1}^{\infty} B_i\Big) = \lim_{n\to\infty}\sum_{i=1}^{n} P(B_i) = \lim_{n\to\infty} P(A_n).$$
$$\varphi(x) = \sum_{i=1}^{k} a_i 1_{A_i}(x),$$
Without loss of generality we can take $f$ to be bounded, with values in $[0, 1)$. Then,
$$\varphi_n(x) = \sum_{k=0}^{2^n - 1} \frac{k}{2^n}\, 1_{A_k}(x),$$
where $A_k = f^{-1}([k/2^n, (k+1)/2^n))$ will do it. Note $\varphi_n(x) = k/2^n$ if $f(x) \in [k/2^n, (k+1)/2^n)$; thus
$$|\varphi_n(x) - f(x)| \le \frac{1}{2^n}$$
and $\varphi_n(x) \le f(x)$. Moreover, $\varphi_n(x) \le \varphi_{n+1}(x)$. This monotone convergence result will be fundamental for integration theory.
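The dyadic approximation above is easy to check numerically. A small Python sketch (added for illustration, not from the notes; `phi_n` and `f` are made-up names):

```python
# Sketch: dyadic approximation of a bounded measurable f with values in
# [0, 1) by simple functions, phi_n(x) = k/2^n on A_k = f^{-1}([k/2^n, (k+1)/2^n)).
import math

def phi_n(f, x, n):
    """Value of the n-th dyadic simple approximation at x."""
    return math.floor(f(x) * 2**n) / 2**n

f = lambda x: x * x % 1.0            # any measurable f with values in [0, 1)
for n in (2, 4, 8):
    err = max(abs(phi_n(f, x / 1000, n) - f(x / 1000)) for x in range(1000))
    assert err <= 2.0**(-n)          # |phi_n - f| <= 2^{-n}
assert phi_n(f, 0.7, 3) <= phi_n(f, 0.7, 4) <= f(0.7)   # monotone in n
```

The two assertions are exactly the two displayed properties: the $2^{-n}$ error bound and the monotonicity $\varphi_n \le \varphi_{n+1} \le f$.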
To check measurability, the following result is often useful. If $f : (\Omega, \mathcal{F}) \to (M, \mathcal{B})$ and $\mathcal{B} = \sigma(\mathcal{C})$, then it suffices to check measurability for inverses of $\mathcal{C}$. That is, $f$ is measurable iff $f^{-1}(C) \in \mathcal{F}$ for all $C \in \mathcal{C}$. This follows from the definition of measurability and the fact that
$$\{E \subseteq M : f^{-1}(E) \in \mathcal{F}\}$$
is a $\sigma$-field (check this and the proof of the sufficiency condition).
If $X : (\Omega, \mathcal{F}) \to (M, \mathcal{B})$ is a r.v. we have that $\{X^{-1}(B) : B \in \mathcal{B}\}$ is a $\sigma$-field, and it is denoted $\sigma(X)$ (the $\sigma$-field generated by $X$). The following result will have important implications, for example, when defining conditional means (this is not proved in Shao).
Since $A_{m,n} \in \sigma(Y)$ and $\sigma(Y) \subseteq \sigma(X)$, there exists $B_{m,n} \in \mathcal{B}$ such that $A_{m,n} = X^{-1}(B_{m,n})$. Then, construct the function
$$\varphi_n(x) = \sum_{m=-\infty}^{\infty} \frac{m}{2^n}\, 1_{B_{m,n}}(x).$$
Note by $A_{m,n} = X^{-1}(B_{m,n})$,
$$\varphi_n(X(\omega)) = \sum_{m=-\infty}^{\infty} \frac{m}{2^n}\, 1_{A_{m,n}}(\omega),$$
and hence $\varphi_n(X)$ is close to $Y$ (within $2^{-n}$ distance), i.e.
$$\varphi_n(X) \le Y \le \varphi_n(X) + \frac{1}{2^n}.$$
Taking limits we conclude
The starting point of integration theory is the integration of indicator functions: for a measure $\nu$ and a measurable set $A$,
$$\int 1_A(y)\,d\nu = \nu(A).$$
By linearity, the integral of a simple function $\sum_{i=1}^{m} c_i 1_{C_i}$ is then
$$\sum_{i=1}^{m} c_i\,\nu(C_i).$$
To see why this is true, as usual with integration, we first prove it for indicators, then for simple functions, and then for general functions. For indicators,
$$\int 1_A(y)\,d\delta_x(y) = \delta_x(A) = 1_A(x),$$
where the first equality uses integration with indicator functions. Using linearity, the same holds for simple functions, and for general functions.
Another application of linearity gives, for the counting measure $v(A) = \sum_{i=1}^{n}\delta_{x_i}(A)$,
$$\int f(y)\,dv(y) = \sum_{i=1}^{n} f(x_i).$$
Simple functions are to indicators as empirical measures are to Dirac measures. The discrete measure at the points $a_1 < a_2 < \cdots < a_k$ with probabilities $p_i$, $p_i \ge 0$ with $\sum_{i=1}^{k} p_i = 1$ (possibly with $k = \infty$), is (see 1.11 in Shao)
$$P(A) = \sum_{i=1;\, a_i \in A}^{k} p_i = \sum_{i=1}^{k} p_i\,\delta_{a_i}(A).$$
This is the distribution of a discrete random variable (compare with 1.10 in Shao). For $k = n$ equal to the sample size, and $a_i$ the $i$-th datum of a random sample $\{X_i\}_{i=1}^{n}$, the discrete measure is
$$P_n(A) = \frac{1}{n}\sum_{i=1}^{n} 1(X_i \in A) = \frac{1}{n}\sum_{i=1}^{n}\delta_{X_i}(A)$$
for a Borel set $A$ of $\mathbb{R}^d$. This measure is called the empirical (random) measure and the corresponding cdf $F_n(x)$ the empirical cdf
$$F_n(x) := P_n((-\infty, x]) = \frac{1}{n}\sum_{i=1}^{n} 1(X_i \le x).$$
These quantities play a fundamental role in statistics, as we shall see in future lectures. Conditional on the data points $\{X_i\}_{i=1}^{n}$ and assuming no ties in the data, $P_n$ and $F_n$ represent the probability and cdf, respectively, of a discrete random variable with uniform probabilities on the data (a discrete uniform distribution or multinomial distribution with probabilities $1/n$). For $n = 2$, this is the model for flipping a coin.
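The empirical measure and empirical cdf can be written down directly from their definitions. A minimal Python sketch (illustrative, not from the notes; `P_n` and `F_n` mirror the notation above):

```python
# Sketch: the empirical measure P_n and empirical cdf F_n of a sample,
# each point of the sample carrying mass 1/n.
def P_n(sample, A):
    """Empirical measure of a set A (A given as a predicate)."""
    return sum(A(x) for x in sample) / len(sample)

def F_n(sample, x):
    """Empirical cdf: P_n((-inf, x])."""
    return sum(xi <= x for xi in sample) / len(sample)

data = [2.0, 0.5, 3.0, 0.5]
assert F_n(data, 0.5) == 0.5                     # two of four points <= 0.5
assert F_n(data, 10.0) == 1.0
assert P_n(data, lambda x: 1 <= x <= 3) == 0.5   # mass 1/n on each point
```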
There are several theorems that give conditions allowing one to exchange limits and integration. Probably the most intuitive of them is the monotone convergence theorem.
Shao mentions that this formula generalizes the one for Riemann integrals (i.e. change of variable $y = f(x)$)
$$\int g(f(x))\, f'(x)\,dx = \int g(y)\,dy.$$
But what is the connection? To see this, define the measure (assume $f' \ge 0$; otherwise apply to positive and negative parts)
$$v(A) = \int_A f'(x)\,dx.$$
Note that by the fundamental theorem of calculus $v([a, b]) = f(b) - f(a)$, and thus $v \circ f^{-1} = $ Lebesgue.
Using that for two disjoint sets $1_{A\cup B} = 1_A + 1_B$, it can be shown that
$$\nu(A \cup B) = \int 1_{A\cup B}\, f\,d\mu = \int (1_A + 1_B)\, f\,d\mu = \nu(A) + \nu(B).$$
$$\mu(A) = 0 \implies \nu(A) = 0.$$
$$P(A) = \sum_{i=1;\, a_i \in A}^{k} p_i = \sum_{i=1}^{k} p_i\,\delta_{a_i}(A).$$
Take $a_i$ with corresponding probability $p_i > 0$; then $P(\{a_i\}) = p_i > 0$, but the Lebesgue measure $m(\{a_i\}) = 0$. Thus, $P$ is not absolutely continuous wrt the Lebesgue measure and the RND does not exist. How about the RND wrt the counting measure? First, the conditions of the RN theorem hold (check this). Then,
$$P(A) = \int (f\, 1_A)\,d\nu = \sum_{i=1}^{k} f(a_i) 1_A(a_i) = \sum_{i=1}^{k} f(a_i)\,\delta_{a_i}(A),$$
(a) $Q_a$ is a measure.
(b) $Q_a \ll P$ ($Q_a$ is absolutely continuous wrt $P$).
(c) If $Q_{\perp}(A) = Q(A \cap \{p = 0\})$, then $Q(A) = Q_a(A) + Q_{\perp}(A)$.
(d) The Radon-Nikodym derivative $dQ_a/dP = q/p$.
for all $B \in \mathcal{B}$. To prove existence and uniqueness note that for $Y \ge 0$ a.s. the right-hand side is a measure absolutely continuous wrt $P$. For a general $Y$ we can take positive and negative parts and apply the previous result. Also note that $E[Y \mid \sigma(X)]$ can be written as $h(X)$ for a measurable $h$. To see this note that if $Z = E[Y \mid \sigma(X)]$ is
a r.v. on $\sigma(X)$, then $\sigma(Z) \subseteq \sigma(X)$ and by the previous proposition $Z = h(X)$ for some measurable $h$. This construction of conditional means based on the RND is more elegant and only requires finite first moments, in contrast to the projection approach, which requires finite second moments.
For the case of random variables we can apply a change of variables to the last display to obtain, for $h(x) = E[Y \mid X = x]$ and all Borel sets $B$,
$$\int_{B} h(x)\,dP_X(x) = \int_{\mathbb{R}}\int_{B} y\,dP_{Y,X}(y, x),$$
or equivalently
$$E\big[E[Y \mid X]\, 1\{X \in B\}\big] = E\big[Y\, 1\{X \in B\}\big].$$
As this holds for indicators, it also holds for simple functions, and for general measurable functions (for which the moments are well defined), so that
$$E\big[(Y - E[Y \mid X])\, g(X)\big] = 0$$
for all $g$ for which $E[(Y - E[Y \mid X])\, g(X)]$ is well defined. This "orthogonality" restriction is used extensively in these notes.
We can use these concepts to define conditional probabilities, $\Pr[B \mid X] = E[1_B(Y) \mid X]$ a.s. for all Borel sets $B$. We can also define conditional probabilities that are defined for every $x$, not just a.s. in $x$. Since this is a bit technical we do not emphasize it much in the course.
$$g(E[X]) \le E[g(X)].$$
$L_r$ Inequality: If $1 \le r \le p$,
$$\|X\|_r \le \|X\|_p.$$
Hölder's Inequality: If $p > 1$ and $q > 1$ with $\frac{1}{p} + \frac{1}{q} = 1$, then
$$E|XY| \le \|X\|_p\,\|Y\|_q.$$
Cauchy-Schwarz Inequality:
$$E|XY| \le \|X\|_2\,\|Y\|_2.$$
Minkowski's Inequality:
$$\|X + Y\|_p \le \|X\|_p + \|Y\|_p.$$
Another important application of this theorem is when $S$ is the space of all linear transformations of a given random vector $X$, i.e. $S = \{\theta' X : \theta \in \mathbb{R}^d\}$. In that case, the orthogonality condition in (3) boils down to
$$E[(Y - \theta_0' X)\, X] = 0,$$
or
$$E[X X']\,\theta_0 = E[Y X].$$
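In the sample, the analogue of this system is the OLS normal equation. A numerical sketch (illustrative, not from the notes; the design and coefficient values are made up):

```python
# Sketch: the sample analogue of E[XX'] theta_0 = E[YX] is the OLS
# normal equation (X'X) theta_hat = X'Y.
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # includes an intercept
theta0 = np.array([1.0, 2.0])
Y = X @ theta0 + rng.normal(size=n)                     # error orthogonal to X

theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)           # solves E_n[XX'] t = E_n[YX]
assert np.allclose(theta_hat, theta0, atol=0.05)

# the empirical orthogonality condition E_n[(Y - theta'X) X] = 0 holds exactly
resid = Y - X @ theta_hat
assert np.allclose(X.T @ resid / n, 0.0, atol=1e-10)
```

Note that the orthogonality condition holds exactly in the sample (up to floating point), while $\hat\theta$ only approaches $\theta_0$ as $n$ grows.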
$$E[Y \mid X, Z] = E[Y \mid Z] \quad \text{a.s.}$$
The same relation as in (5) holds in the conditional case, where conditionally uncorrelated means
$$E[YX \mid Z] = E[Y \mid Z]\, E[X \mid Z] \quad \text{a.s.}$$
Consider the following important application of conditional independence. Suppose we want to evaluate a policy or treatment on an outcome of interest. Let $D$ be the treatment indicator ($=1$ if treated, $0$ otherwise), $Y(1)$ the outcome under treatment, and $Y(0)$ the outcome without treatment. We only observe $(Y, X, D)$, with $Y = Y(1)D + Y(0)(1 - D)$. Assume $(Y(1), Y(0))$ and $D$ are independent conditional on $X$. This assumption is called unconfoundedness or selection on observables. Under this condition, the Average Treatment Effect (ATE) $E[Y_i(1) - Y_i(0)]$ is identified, as we now show. First, note that the treatment effect for individual $i$, $Y_i(1) - Y_i(0)$, is not identified because either $Y_i(1)$ or $Y_i(0)$ is not observed.
Nevertheless, using the selection on observables assumption and the law of iterated expectations,
$$E[Y(1) - Y(0)] = E\big[E[Y \mid X, D = 1] - E[Y \mid X, D = 0]\big].$$
We say the ATE is identified because we have written it as a function of the distribution of observables. We will study identification in more detail below.
To illustrate these points, suppose we run the regression
$$Y_i = \alpha_0 + \beta_0 D_i + \gamma_0' X_i + \varepsilon_i.$$
Thus, $\beta_0 = Y_i(1) - Y_i(0)$ is the same for every $i$! Moreover, $E[Y_i \mid X_i, D_i] = \alpha_0 + \beta_0 D_i + \gamma_0' X_i$, which is a strong parametric functional form assumption (linearity).
The nonparametric identification result above does not require parametric assumptions, and in that sense is general. But selection on observables may be strong in many applications (e.g. returns to education and ability). It holds in a randomized experiment where $D_i$ is independent of everything. However, in many applications we only have observational data and not experiments (i.e. $D_i$ is a choice variable of individual $i$; e.g. education).
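The identification formula above can be checked in a simulation where unconfoundedness holds by construction. A Python sketch (illustrative design, not from the notes; the plug-in estimator here simply averages within-$X$ differences in means):

```python
# Sketch: ATE = E[ E[Y|X,D=1] - E[Y|X,D=0] ] in a design where treatment
# depends on X only, so selection on observables holds by construction.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
X = rng.binomial(1, 0.5, size=n)                 # binary covariate
p = np.where(X == 1, 0.8, 0.2)                   # treatment probability given X
D = rng.binomial(1, p)
Y1 = 2.0 + X + rng.normal(size=n)                # potential outcomes:
Y0 = 1.0 + X + rng.normal(size=n)                # true ATE = E[Y1 - Y0] = 1
Y = Y1 * D + Y0 * (1 - D)                        # observed outcome

# plug-in: average over x of E_n[Y|X=x,D=1] - E_n[Y|X=x,D=0]
ate = sum(
    (Y[(X == x) & (D == 1)].mean() - Y[(X == x) & (D == 0)].mean())
    * np.mean(X == x)
    for x in (0, 1)
)
assert abs(ate - 1.0) < 0.05

# the naive difference in means is biased upward (treated units have higher X)
naive = Y[D == 1].mean() - Y[D == 0].mean()
assert naive - 1.0 > 0.2
```

The contrast between `ate` and `naive` is the point of the identification argument: conditioning on $X$ removes the selection bias.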
1.4.1 Quantiles
For any univariate cdf $F$, define the quantile function
$$F^{-1}(u) = \inf\{t \in \mathbb{R} : F(t) \ge u\}, \quad u \in (0, 1].$$
1.4.2 Copulas
A copula function is a multivariate cdf with uniform $U[0,1]$ marginal cdfs. An important theorem called Sklar's Theorem (Sklar 1959) says that for any multivariate cdf $F$ with marginals $F_1, \ldots, F_d$, there exists a copula function $C$ such that
$$F(x_1, \ldots, x_d) = C(F_1(x_1), \ldots, F_d(x_d)).$$
If the marginals $F_i$ are continuous, then $C$ is unique. Copulas have been used in financial econometrics to model dependence (e.g. tail dependence).
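A common concrete example is the Gaussian copula. A sketch (illustrative, not from the notes): dependence comes from the correlation $\rho$ of a latent normal vector, while applying the normal cdf makes each marginal $U[0,1]$.

```python
# Sketch: sampling from a bivariate Gaussian copula with correlation rho.
import math
import numpy as np

rng = np.random.default_rng(2)
rho, n = 0.8, 100_000
cov = np.array([[1.0, rho], [rho, 1.0]])
Z = rng.multivariate_normal([0.0, 0.0], cov, size=n)

# standard normal cdf via the error function
Phi = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))
U = Phi(Z)                                   # columns are dependent U[0,1] draws

assert abs(U[:, 0].mean() - 0.5) < 0.01      # uniform marginal: mean 1/2
assert abs(U[:, 0].var() - 1 / 12) < 0.01    # uniform marginal: variance 1/12
assert np.corrcoef(U[:, 0], U[:, 1])[0, 1] > 0.5   # dependence survives
```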
$$\lambda(t) = \frac{f(t)}{1 - F(t)},$$
the so-called hazard function, which can be interpreted as (using the definitions of $F$ and $f$)
$$\lambda(t) = \lim_{h \to 0} \frac{F(t + h) - F(t)}{h\,(1 - F(t))} = \lim_{h \to 0} \frac{\Pr[t < X \le t + h \mid t < X]}{h}.$$
Definition 5 A stochastic process $\{X_t\}_{t\in\mathbb{Z}}$ is strictly stationary if for any given integer $r$ and for any set of subscripts $i_1, i_2, \ldots, i_r$, the joint distribution of $(X_i, X_{i_1}, \ldots, X_{i_r})$ depends only on $i_1 - i, i_2 - i, \ldots, i_r - i$, but not on $i$.

Definition 6 A random process $\{X_t\}_{t\in\mathbb{Z}}$ is called weakly stationary (or covariance stationary) if the random variables $X_t$ have finite second moments (in short, $X_t \in L_2(\mathcal{F})$) and both $E[X_t]$ and $E[X_t X_{t-k}]$ do not depend on the time index $t$ for all $k = 0, 1, 2, \ldots$
Example 7 Consider the weak white noise (WN) defined as a univariate random process $\{\varepsilon_t\}_{t=-\infty}^{+\infty}$ such that $E[\varepsilon_t] = 0$ and
$$E[\varepsilon_t \varepsilon_{t-k}] = \begin{cases} \sigma^2 & \text{if } k = 0, \\ 0 & \text{otherwise}, \end{cases}$$
where $\sigma^2$ is a finite constant. Similarly, we can define the multivariate weak white noise $\{\varepsilon_t\}_{t=-\infty}^{+\infty}$ by the properties $E[\varepsilon_t] = 0$ and
$$E[\varepsilon_t \varepsilon_{t-k}'] = \begin{cases} \Sigma & \text{if } k = 0, \\ 0 & \text{otherwise}, \end{cases}$$
where $\Sigma$ is a constant matrix. Both processes are weakly stationary but not necessarily strictly stationary.
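An iid Gaussian sequence is one example of a weak white noise, and its defining moment conditions can be checked on simulated data. A Python sketch (illustrative, not from the notes; the known-zero-mean autocovariance formula is a simplification):

```python
# Sketch: an iid N(0, sigma^2) sequence is a weak WN; sample
# autocovariances at nonzero lags should be near zero.
import numpy as np

rng = np.random.default_rng(3)
sigma2, n = 2.0, 100_000
eps = rng.normal(0.0, np.sqrt(sigma2), size=n)

def acov(x, k):
    """Sample autocovariance at lag k (mean known to be zero)."""
    return np.mean(x[k:] * x[: len(x) - k])

assert abs(acov(eps, 0) - sigma2) < 0.1      # variance ~ sigma^2 = 2
assert abs(acov(eps, 1)) < 0.05              # ~ 0 at lag 1
assert abs(acov(eps, 5)) < 0.05              # ~ 0 at lag 5
```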
Consider a random process $\{Y_t\}_{t=1}^{+\infty}$ and an increasing sequence of information sets $\{\mathcal{F}_t\}_{t=0}^{+\infty}$, i.e., a collection of $\sigma$-fields $\mathcal{F}_t$ with the property
The idea behind ergodicity is that the time average of our dynamic process must converge to the average of infinitely many identical realizations of the same process at one point in time. In the last example we had "too much" dependence, since
$$\mathrm{Cov}(X_t, X_{t-k}) = \mathrm{Cov}(U_t + Z, U_{t-k} + Z) = \begin{cases} 1 + \frac{1}{12} & \text{if } k = 0, \\ 1 & \text{if } k \neq 0. \end{cases}$$
In the next sections we shall study sufficient conditions for the sequence of sample means to converge almost surely to the unconditional expectation of $X_t$. Some of these conditions will be based on mixing assumptions. A prominent example is the strong-mixing concept.
Doukhan, P. and Ango Nze, P. (2004). Weak dependence, models and applications to econometrics. Econometric Theory 20, 995-1045.
2 Stochastic Convergence
Asymptotic theory is very useful in econometrics for (i) approximating critical regions of tests and confidence intervals of parameters; and for (ii) studying the quality of inference procedures.
$$X_n \to_p X, \qquad \operatorname*{plim}_{n\to\infty} X_n = X, \qquad X_n = X + o_P(1).$$
Remark 1: For convergence in probability all the r.v.'s $X_n$, $n = 1, 2, \ldots$, and $X$ have to be defined on the same probability space $(\Omega, \mathcal{F}, P)$.
Remark 2: $X_n \to_p X$ is equivalent to $|X_n - X| \to_p 0$.
Notice that
$$X_n \to X \ \text{a.s.} \iff \lim_{n\to\infty} P\Big(\bigcup_{T=n}^{\infty} A_T\Big) = 0,$$
(iii) $X_n \to_p X$ implies $X_n \to_d X$;
(i) $X_n + Y_n \to_d X + c$;
(ii) $X_n Y_n \to_d Xc$;
(iii) $X_n Y_n^{-1} \to_d X c^{-1}$, provided $c \neq 0$.
By $E[|X_i|] < \infty$, $\varphi_{X_i}(t)$ is differentiable around zero (by dominated convergence) and
Problem. Use the previous results to establish the convergence of the t-ratio
$$t_n = \frac{\sqrt{n}\,\bar{X}}{S_n},$$
where $\bar{X}$ is the sample mean of iid r.v.'s with finite second moments and $S_n$ is the sample standard deviation, i.e. $S_n^2 = (n-1)^{-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$.
Problem. Explain the relation between the previous result and that used in undergraduate statistics (the t-ratio follows a Student t distribution with $n - 1$ degrees of freedom).
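A Monte Carlo sketch of the first problem (illustrative, not a solution from the notes): for iid mean-zero data the t-ratio should be approximately standard normal.

```python
# Sketch: the t-ratio t_n = sqrt(n) * Xbar / S_n computed from iid
# mean-zero data is approximately N(0, 1) for moderately large n.
import numpy as np

rng = np.random.default_rng(4)
n, reps = 200, 20_000
X = rng.uniform(-0.5, 0.5, size=(reps, n))      # mean 0, finite 2nd moment
t = np.sqrt(n) * X.mean(axis=1) / X.std(axis=1, ddof=1)

assert abs(t.mean()) < 0.05
assert abs(t.std() - 1.0) < 0.05
assert 0.04 < np.mean(np.abs(t) > 1.96) < 0.065   # ~5% beyond +-1.96
```

With non-normal data the exact Student t result of the second problem fails in finite samples, but the asymptotic normal approximation still applies.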
(i) If for some $\delta > 0$, $\sup_{n \ge 1} E[|X_n|^{1+\delta}] < \infty$, then $\{X_n, n = 1, 2, \ldots\}$ is u.i.
(ii) If $\{X_n, n = 1, 2, \ldots\}$ are identically distributed (i.d.) and $E[|X_1|] < \infty$, then $\{X_n, n = 1, 2, \ldots\}$ is u.i.
Proof. $\lim_{c\to\infty}\sup_{n\ge 1} E\{|X_n|\, 1(|X_n| > c)\} = \lim_{c\to\infty} E\{|X_1|\, 1(|X_1| > c)\} = 0$, where the last equality follows by the Dominated Convergence Theorem (DCT) applied to any sequence $c_n \uparrow \infty$ and $f_n(x) = |x|\, 1(|x| > c_n) \le |x| \equiv g(x)$.
(f) $X_n \to_d X \implies X_n = O_P(1)$.
Proof. Without loss of generality choose $C_\varepsilon$ a continuity point of the cdf of $X_n$ and $X$, and by convergence in distribution we have that there exists an $n_\varepsilon$ such that for all $n > n_\varepsilon$,
We can always make $P(|X| > C_\varepsilon) < \varepsilon/2$ by choosing $C_\varepsilon$ sufficiently large.
Problem. Use the calculus of $O_P$ and $o_P$ to find the probability limit of $S_n^2$ (as an alternative to the CMT used before). Furthermore, establish the asymptotic distribution of $\sqrt{n}(S_n^2 - \sigma^2)$, providing sufficient conditions for the validity of the convergence.
$$\begin{aligned}
S_n^2 &= \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \mu)^2 + \frac{2}{n-1}(\mu - \bar{X})\sum_{i=1}^{n}(X_i - \mu) + \frac{n}{n-1}(\mu - \bar{X})^2 \\
&= \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \mu)^2 + o_P(1)o_P(1) + O_P(1)o_P(1) \\
&= \frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2 + \frac{1}{(n-1)n}\sum_{i=1}^{n}(X_i - \mu)^2 + o_P(1) \\
&= \frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2 + o_P(1) \\
&= \sigma^2 + o_P(1).
\end{aligned}$$
$$\begin{aligned}
\sqrt{n}(S_n^2 - \sigma^2) &= \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\big[(X_i - \mu)^2 - \sigma^2\big] \\
&\quad + \frac{1}{\sqrt{n}(n-1)}\sum_{i=1}^{n}(X_i - \mu)^2 + \frac{2\sqrt{n}}{n-1}(\mu - \bar{X})\sum_{i=1}^{n}(X_i - \mu) + \frac{n}{n-1}\sqrt{n}(\mu - \bar{X})^2 \\
&= \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\big[(X_i - \mu)^2 - \sigma^2\big] + O_P(n^{-1/2}) + O_P(1)o_P(1) + O_P(n^{-1/2})O_P(1) \\
&= \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\big[(X_i - \mu)^2 - \sigma^2\big] + o_P(1).
\end{aligned}$$
Define $Z_i = (X_i - \mu)^2 - \sigma^2$, and note that $E[Z_i] = 0$ and under the assumption that $E[X_i^4] < \infty$, it follows that
$$E[Z_i^2] = E[(X_i - \mu)^4] - \sigma^4 < \infty.$$
Thus, by the CLT
$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\big[(X_i - \mu)^2 - \sigma^2\big] \to_d N(0, E[Z_i^2]).$$
By Slutsky's theorem then
$$\sqrt{n}(S_n^2 - \sigma^2) \to_d N(0, E[Z_i^2]).$$
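This limit is easy to verify by simulation. A Monte Carlo sketch (illustrative, not from the notes; for standard normal data the limiting variance is known in closed form):

```python
# Sketch: sqrt(n)(S_n^2 - sigma^2) ->_d N(0, E[Z_i^2]) with
# Z_i = (X_i - mu)^2 - sigma^2. For X ~ N(0,1): E[Z_i^2] = E[X^4] - 1 = 2.
import numpy as np

rng = np.random.default_rng(5)
n, reps = 500, 10_000
X = rng.normal(size=(reps, n))                    # mu = 0, sigma^2 = 1
stat = np.sqrt(n) * (X.var(axis=1, ddof=1) - 1.0)

assert abs(stat.mean()) < 0.05                    # centered at 0
assert abs(stat.var() - 2.0) < 0.15               # limiting variance E[Z_i^2] = 2
```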
$$P_n(A) = \frac{1}{n}\sum_{i=1}^{n} 1(X_i \in A) = \frac{1}{n}\sum_{i=1}^{n}\delta_{X_i}(A)$$
for a Borel set $A$ of $\mathbb{R}^d$, and where $\delta_{X_i}(A)$ is the point mass or Dirac measure, $\delta_{X_i}(A) = 1$ if $X_i \in A$ and zero otherwise. A related quantity is the indicator function, $1_A(x) = 1$ if $x \in A$ and zero otherwise. The following simple but important relationship holds,
$$\delta_x(A) = \int 1_A(y)\,d\delta_x = 1_A(x).$$
$$F_n(x) := P_n((-\infty, x]) = \frac{1}{n}\sum_{i=1}^{n} 1(X_i \le x).$$
Laws of Large Numbers (LLN) are theorems for establishing the convergence in probability or a.s. of sample means $E_n[X]$ to population means $E[X]$.
Central Limit Theorems (CLT) are theorems for establishing the convergence in distribution of $a_n(E_n[X] - E[X])$ to a proper r.v., for suitable sequences $a_n \uparrow \infty$ as $n \to \infty$.
We shall study these laws under different regularity conditions. Informally speaking, for these laws to hold we need to restrict the dependence and the moments of the processes under consideration. From now on we will use the empirical mean notation $E_n$ to denote sample means.
$$E_n[X] \to_p E[X].$$
The Kolmogorov "three series" theorem establishes necessary and sufficient conditions for a.s. convergence under independence; see Davidson (1994, p. 311).
$$\sum_{t=1}^{n} E\big[X_{nt}^2\, 1(|X_{nt}| > \varepsilon)\big] \to 0, \quad \forall \varepsilon > 0,\ n \to \infty.$$
Then
$$S_n \to_d X \sim N(0, 1).$$
$$\sum_{t=1}^{n} E\big[|X_{nt}|^{2+\delta}\big] \to 0, \quad \text{for some } \delta > 0, \text{ as } n \to \infty.$$
Lindeberg-Levy CLT: $\{X_{nt} : t = 1, \ldots, n\}$ iid with $n E[X_{nt}^2] < C < \infty$. The classical CLT corresponds to $X_{nt} = (X_t - \mu)/(\sigma\sqrt{n})$ for iid $\{X_t\}$.
then,
$$E_n[X] \to E[X] \ \text{in } L_2.$$
Proof. Trivial.
Virtually all standard asymptotic theory of time series processes hinges on the ergodicity property. For this reason, applied econometricians frequently assume that the time series of interest are ergodic. The following central limit theorem is an extension of the Lindeberg-Levy central limit theorem to stationary and ergodic martingale difference sequences.
Theorem 34 (Martingale LLN) Let $\{X_i, \mathcal{F}_i\}$ be a mds with variance sequence $\{\sigma_i^2\}$, $S_n = a_n^{-1}\sum_{i=1}^{n} X_i$ and $\{a_i\}$ a positive sequence with $a_i \uparrow \infty$. If $\sum_{i=1}^{\infty}\sigma_i^2/a_i^2 < \infty$, then
$$S_n \to 0 \ \text{a.s.}$$
Theorem 35 (Mixing LLN) Let $\{X_i\}$ be a strictly stationary and mixing process with $E[|X_1|] < \infty$. Then
$$E_n[X] \to E[X_1] \ \text{a.s.}$$
(i) (Ergodic, strictly stationary sequences, Billingsley 1961). Let $\{X_t\}$ be a strictly stationary and ergodic mds with $E[X_1 X_1'] = \Sigma$. Then
$$n^{-1/2}\sum_{t=1}^{n} X_t \to_d X \sim N(0, \Sigma).$$
(ii) Let $\{X_{ni}, \mathcal{F}_{ni}\}$ be a md array with (unconditional) variance sequence $\{\sigma_{ni}^2\}$, with $\sum_{i=1}^{n}\sigma_{ni}^2 = 1$. If
(a) $\sum_{i=1}^{n} X_{ni}^2 \to_p 1$, and
(b) $\max_{1 \le i \le n}|X_{ni}| \to_p 0$,
then
$$S_n = \sum_{i=1}^{n} X_{ni} \to_d X \sim N(0, 1).$$
The function $g$ is differentiable at $(E[X], E[X^2])'$ with derivative $(-2E[X], 1)'$. Hence if $(T_1, T_2)$ is the vector that possesses the normal distribution of the last display, by the Delta Method
$$\sqrt{n}(S_n^2 - \sigma^2) \to_d -2E[X]\,T_1 + T_2.$$
In fact, if $E[X] = 0$ the limit distribution is a normal r.v. with mean zero and variance $E[X^4] - (E[X^2])^2$ (show this). A more direct proof of this result (without using the Delta Method) is based on
$$S_n^2 = n^{-1}\sum_{t=1}^{n}(X_t - E[X])^2 - (E_n[X] - E[X])^2 = n^{-1}\sum_{t=1}^{n}(X_t - E[X])^2 + o_P(1).$$
Part II
Statistics
4 Statistical Models: Identification and Specification
Suppose we have a sample $X_1, \ldots, X_n$ from a distribution $P$ that belongs to a statistical model $\mathcal{P}$ on the sample space $(\mathcal{X}, \mathcal{A})$. A statistical model is a collection of probabilities. Here, $P$ is called the population and $n$ the sample size. In parametric modelling we assume $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ where $\Theta$ is a subset of Euclidean space $\mathbb{R}^p$. For example $\mathcal{P}$ can be a normal probability model for a univariate $X$, with $\theta = (\mu, \sigma^2) \in \Theta := \mathbb{R} \times (0, \infty)$. In nonparametric econometrics, $\Theta$ is a subset (of infinite dimension) of an infinite dimensional metric space. In any case, we say the model $\mathcal{P}$ is correctly specified if $P \in \mathcal{P}$, that is, there exists $\theta_0 \in \Theta$ such that $P = P_{\theta_0}$. From now on a subscript $0$ means that the parameter is the one that generated the true data generating process $P$ (i.e. $P = P_{\theta_0}$). If $\theta_0 \in \Theta$ is unique, in the sense that if $P = P_\theta$ then $\theta = \theta_0$, we say that $\theta_0$ is identified. The identified set is
$$\Theta_0(P) = \{\theta_0 \in \Theta : P = P_{\theta_0}\}.$$
$$P_n(A) := \frac{1}{n}\sum_{i=1}^{n}\delta_{X_i}(A) = \frac{1}{n}\sum_{i=1}^{n} 1_A(X_i),$$
$$F_n(x) := \frac{1}{n}\sum_{i=1}^{n} 1_{(-\infty, x]}(X_i).$$
These are random measures and cdfs. Given a realization $x_1, \ldots, x_n$ of $X_1, \ldots, X_n$, the empirical measure conditional on the data becomes
$$P_n(A) := \frac{1}{n}\sum_{i=1}^{n}\delta_{x_i}(A),$$
which is a standard (discrete) measure with a cdf which, with some abuse of notation, is also denoted by $F_n$ and which is a distribution function putting mass $1/n$ at each point $x_i$ (when there are no ties). That is, conditional on the data, $P_n(\cdot)$ is a multinomial probability measure.
Assume univariate data. For a fixed $x \in \mathbb{R}$, $\{1(X_i \le x)\}_{i=1}^{n}$ is a sequence of independent and identically distributed (iid) Bernoulli variables. The mean and variance of $1(X_i \le x)$ are, respectively, $F(x)$ and $F(x)(1 - F(x))$ (see footnote 3). As an estimator of $F(x)$, $F_n(x)$ is unbiased and consistent in mean square. Furthermore, by Hoeffding's inequality, for any $x$ and $\varepsilon > 0$,
$$P\big(|F_n(x) - F(x)| > \varepsilon\big) \le 2e^{-2n\varepsilon^2}.$$
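Hoeffding's bound can be compared with simulated deviation frequencies. A sketch (illustrative, not from the notes; the bound is typically far from tight):

```python
# Sketch: the observed frequency of |F_n(x) - F(x)| > eps versus
# Hoeffding's bound 2 exp(-2 n eps^2), for U[0,1] data where F(x) = x.
import numpy as np

rng = np.random.default_rng(6)
n, reps, x, eps = 100, 50_000, 0.3, 0.1
U = rng.uniform(size=(reps, n))
Fn_x = (U <= x).mean(axis=1)                     # F_n(x) in each replication
freq = np.mean(np.abs(Fn_x - x) > eps)
bound = 2.0 * np.exp(-2.0 * n * eps**2)          # = 2 e^{-2} ~ 0.27

assert freq <= bound                             # bound holds
assert freq > 0.0                                # but the event does occur
```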
This result can be easily extended to multivariate convergence of $(\nu_n(x_1), \nu_n(x_2), \ldots, \nu_n(x_m))$ for arbitrary points $(x_1, x_2, \ldots, x_m) \in \mathbb{R}^m$ and $m < \infty$. The r.v. $(\nu_n(x_1), \nu_n(x_2), \ldots, \nu_n(x_m))$ converges in distribution to a multivariate normal with zero mean vector and covariance matrix with $(j, k)$-th element ($a \wedge b := \min(a, b)$)
$$F(x_j \wedge x_k) - F(x_j)F(x_k).$$
Theorem 39 (Glivenko-Cantelli, 1933) For $\{X_i\}$ iid r.v.'s (with an arbitrary cdf $F$!),
$$\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| \to 0 \ \text{a.s.}$$
Footnote 3: $E[1(X_i \le x)] = P(X_i \le x) = F(x)$ and $V[1(X_i \le x)] = E[1^2(X_i \le x)] - (E[1(X_i \le x)])^2 = F(x) - F^2(x)$.
For any cdf $F$, define the quantile function $F^{-1}(u) = \inf\{t \in \mathbb{R} : F(t) \ge u\}$, $u \in (0, 1)$. This is a generalization of the concept of inverse. Since $F$ is non-decreasing, has left limits and is right continuous, it can be shown that $F^{-1}$ is non-decreasing, has right limits and is left continuous. Furthermore, from the definition of $F^{-1}$, for each $t$ and any r.v. $U$,
$$\{F^{-1}(U) \le t\} = \{U \le F(t)\}. \tag{8}$$
becomes distribution free. That is, its finite sample distribution does not depend on $F$ and is fully known. Although it is possible to develop statistical theory in finite samples under certain conditions (e.g. exponential families), most of the discussion in this course will be based on asymptotic arguments because finite sample distributions are in general unknown. To learn about exponential families I recommend reading Section 2.1.3 in Shao. Many well known distributions are members of the exponential family, including the normal, binomial and multinomial, among others.
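The generalized inverse and identity (8) can be checked directly for a discrete cdf, where $F^{-1}$ differs from an ordinary inverse. A Python sketch (illustrative, not from the notes; the atoms and probabilities are made up):

```python
# Sketch: the generalized inverse F^{-1}(u) = inf{t : F(t) >= u} and the
# identity {F^{-1}(U) <= t} = {U <= F(t)}, for a discrete cdf with jumps.
import bisect
import random

atoms = [(1.0, 0.2), (2.0, 0.5), (5.0, 0.3)]     # support points, probabilities
cum = [0.2, 0.7, 1.0]                            # F evaluated at the atoms

def F(t):
    return sum(p for a, p in atoms if a <= t)

def F_inv(u):
    """inf{t : F(t) >= u} for this discrete cdf, u in (0, 1]."""
    return atoms[bisect.bisect_left(cum, u)][0]

random.seed(7)
for _ in range(10_000):
    u = 1.0 - random.random()                    # u in (0, 1]
    t = random.choice([0.5, 1.0, 3.0, 5.0])
    assert (F_inv(u) <= t) == (u <= F(t))        # identity (8)
```

This is also the basis of the inverse-transform method: $F^{-1}(U)$ with $U \sim U(0,1)$ has cdf $F$.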
Note that the Glivenko-Cantelli theorem holds for an arbitrary cdf $F$. It generalizes the LLN (it is a Uniform LLN, in short ULLN). Moreover, this theorem justifies the analog principle in econometrics: $\theta(F)$ can be estimated by the corresponding functional of the empirical measure $F_n$, i.e. $\theta(F_n)$, when the latter is well-defined. An important class of such functionals are linear functionals or integrals
$$\theta(F) = \int \varphi(x)\,dF,$$
$$\theta(F_n) = \int \varphi(x)\,dF_n = \int \varphi(x)\,d\Big(\frac{1}{n}\sum_{i=1}^{n}\delta_{X_i}\Big) = \frac{1}{n}\sum_{i=1}^{n}\int \varphi(x)\,d\delta_{X_i} = \frac{1}{n}\sum_{i=1}^{n}\varphi(X_i)$$
(where the first follows from a property of indicator functions while the second from a property of Dirac measures).
In many cases of interest the functional $\theta_0 = \theta(F)$ is implicit rather than explicit. For example,
$$\theta(F) = \arg\min_{\theta \in \Theta} Q_F(\theta).$$
This more complicated case is quite general and is investigated in the next sections.
In some sense the empirical cdf $F_n$ is a sufficient statistic. Assume there are no ties. The empirical cdf has jumps at the realizations of the order statistics, $x_{(i)}$, where $x_{(1)} < x_{(2)} < \cdots < x_{(n)}$ are realizations of $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$. In fact, note
$$F_n(x_{(j)}) = \frac{j}{n}$$
and $R_j = nF_n(X_{(j)})$ are the ranks. As we shall show below, the order statistics are sufficient statistics for $P$. What do we mean by this?
Denote $X_n = (X_1, \ldots, X_n)$. A statistic $T_n(X_n)$ is a measurable mapping that is known if $X_n$ is known. It seems inefficient to code $n$ numbers (or vectors) in a function $F_n$, so often researchers try to reduce the dimensionality and look for statistics with reduced range. What do we mean by sufficiency?
A statistic $T(X_n)$ provides a reduction of the $\sigma$-field $\sigma(X_n)$. Does such a reduction result in any loss of information concerning the unknown population? If a statistic $T(X_n)$ is fully as informative as the original sample $X_n$, then statistical analyses can be done using $T(X_n)$, which is simpler than $X_n$. The next concept describes what we mean by fully informative.
Once we observe Xn and compute a su¢ cient statistic T (Xn ), the original data
Xn do not contain any further information concerning the unknown population
P (since its conditional distribution is unrelated to P ) and can be discarded.
If $T$ is sufficient for $P \in \mathcal{P}$, then $T$ is also sufficient for $P \in \mathcal{P}_0 \subset \mathcal{P}$, but not necessarily sufficient for $P \in \mathcal{P}_1 \supset \mathcal{P}$.
Example 41 Suppose that $X_n = (X_1, \ldots, X_n)$ and $X_1, \ldots, X_n$ are i.i.d. from the binomial distribution with the p.d.f. (w.r.t. the counting measure)
$$f_\theta(z) = \theta^z(1 - \theta)^{1-z} I_{\{0,1\}}(z), \quad z \in \mathbb{R}, \ \theta \in (0, 1).$$
Consider the statistic $T(X_n) = \sum_{i=1}^{n} X_i$, which is the number of ones in $X_n$. For any realization $x$ of $X_n$, $x$ is a sequence of $n$ ones and zeros. $T$ contains all information about $\theta$, since $\theta$ is the probability of an occurrence of a one in $x$ and, given $T = t$, what is left in the data set $x$ is the redundant information about the positions of the $t$ ones. To show $T$ is sufficient for $P$, we compute $P(X_n = x \mid T = t)$. Let $t = 0, 1, \ldots, n$ and $B_t = \{(x_1, \ldots, x_n) : x_i = 0, 1,\ \sum_{i=1}^{n} x_i = t\}$.
Also
$$P(T = t) = \binom{n}{t}\theta^t(1 - \theta)^{n-t} I_{\{0,1,\ldots,n\}}(t).$$
Then
$$P(X_n = x \mid T = t) = \frac{P(X_n = x, T = t)}{P(T = t)} = \binom{n}{t}^{-1} I_{B_t}(x).$$
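This conditional distribution is free of $\theta$, which can be verified by brute-force enumeration for a small $n$. A Python sketch (illustrative, not from the notes):

```python
# Sketch: for the Bernoulli example, P(X = x | T = t) = 1 / C(n, t),
# the same for every theta -- T = sum(X_i) is sufficient.
from itertools import product
from math import comb

n, t = 4, 2
for theta in (0.2, 0.5, 0.9):
    # joint probabilities of all 0-1 sequences with sum t
    probs = {
        x: theta**t * (1 - theta) ** (n - t)
        for x in product((0, 1), repeat=n)
        if sum(x) == t
    }
    p_t = sum(probs.values())            # = C(n,t) theta^t (1-theta)^{n-t}
    for x, p in probs.items():
        assert abs(p / p_t - 1 / comb(n, t)) < 1e-12   # free of theta
```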
How to find a sufficient statistic? Finding a sufficient statistic by means of the definition is not convenient. It involves guessing a statistic $T$ that might be sufficient and computing the conditional distribution of $X$ given $T = t$. For families of populations having p.d.f.'s, a simple way of finding sufficient statistics is to use the factorization theorem.
Example 44 Let $X_n = (X_1, \ldots, X_n)$ and $X_1, \ldots, X_n$ be i.i.d. random variables having a distribution $P \in \mathcal{P}$, where $\mathcal{P}$ is the family of distributions on $\mathbb{R}$ having Lebesgue p.d.f.'s. Let $X_{(1)}, \ldots, X_{(n)}$ be the order statistics. Note that the joint p.d.f. of $X$ is
$\mathcal{X}$: the range of $X$.
Definition 45 Loss function $L(P, a)$: a function from $\mathcal{P} \times \mathbb{A}$ to $[0, \infty)$. $L(P, a)$ is Borel for each $P$. If $X = x$ is observed and our decision rule is $d$, then our "loss" is $L(P, d(x))$.
It is difficult to compare $L(P, d_1(X))$ and $L(P, d_2(X))$ for two decision rules, $d_1$ and $d_2$, since both of them are random. This motivates the definition of risk.
If $\mathcal{P}$ is a parametric family indexed by $\theta$, the loss and risk are denoted by $L(\theta, a)$ and $R_d(\theta)$. $d_1$ is as good as $d_2$ iff
$$R_{d_1}(P) \le R_{d_2}(P) \quad \forall P \in \mathcal{P},$$
and is better than $d_2$ if, in addition, $R_{d_1}(P) < R_{d_2}(P)$ for at least one $P \in \mathcal{P}$. Two decision rules $d_1$ and $d_2$ are equivalent iff $R_{d_1}(P) = R_{d_2}(P)$ for all $P \in \mathcal{P}$.
An example of a graph of $R_d(P)$ is Figure 2.2 of Shao (p. 127). The 0-1 loss implies that the losses for the two types of incorrect decisions (accepting $H_0$ when $P \in \mathcal{P}_1$ and rejecting $H_0$ when $P \in \mathcal{P}_0$) are the same. In some cases, one might assume unequal losses: $L(P, j) = 0$ for $P \in \mathcal{P}_j$, $L(P, 0) = c_0$ when $P \in \mathcal{P}_1$, and $L(P, 1) = c_1$ when $P \in \mathcal{P}_0$.
If $d$ is as good as any other rule in $D$, a class of allowable decision rules, then $d$ is D-optimal (or optimal if $D$ contains all possible rules). In most applications it is not possible to find a decision rule that is best uniformly in $P \in \mathcal{P}$. For this reason, we relax the definition of "best" and consider different concepts such as admissibility (introduced later). To find admissible rules it is convenient to enlarge (or better, convexify) the class of decision rules to include randomized decision rules.
Definition 49 A randomized decision rule is a function $\delta$ on $\mathcal{X} \times \mathcal{A}$ such that, for every $A \in \mathcal{A}$, $\delta(\cdot, A)$ is a Borel function and, for every $x \in \mathcal{X}$, $\delta(x, \cdot)$ is a probability measure on $(\mathbb{A}, \mathcal{A})$.
If $X = x$ is observed, we have a distribution of actions: $\delta(x, \cdot)$.
A nonrandomized decision rule $d$ previously discussed can be viewed as a special randomized decision rule with $\delta(x, \{a\}) = I_{\{a\}}(d(x))$, $a \in \mathbb{A}$, $x \in \mathcal{X}$.
To choose an action in $\mathbb{A}$ when a randomized rule $\delta$ is used, we need to simulate a pseudorandom element of $\mathbb{A}$ according to $\delta(x, \cdot)$. Thus, an alternative way to describe a randomized rule is to specify the method of simulating the action from $\mathbb{A}$ for each $x \in \mathcal{X}$.
The loss function for a randomized rule $\delta$ is defined as
$$L(P, \delta, x) = \int_{\mathbb{A}} L(P, a)\,d\delta(x, a),$$
which reduces to the same loss function we discussed when $\delta$ is a nonrandomized rule.
The risk of a randomized rule $\delta$ is then
$$R_\delta(P) = E[L(P, \delta, X)] = \int_{\mathcal{X}}\int_{\mathbb{A}} L(P, a)\,d\delta(x, a)\,dP_X(x).$$
If a decision rule $d$ is inadmissible, then there exists a rule better than $d$, and $d$ should not be used in principle.
If there are two D-admissible rules that are not equivalent, then there does not exist any D-optimal rule.
Suppose that we have a sufficient statistic $d(X)$ for $P \in \mathcal{P}$. Intuitively, our decision rule should be a function of $d$. This is not true in general, but the following result indicates that it is true if randomized decision rules are allowed.
Proposition 51 Let $d(X)$ be a sufficient statistic for $P \in \mathcal{P}$ and let $\delta_0$ be a decision rule. Then
$$\delta_1(t, A) = E[\delta_0(X, A) \mid d = t],$$
which is a randomized decision rule depending only on $d$, is equivalent to $\delta_0$ if $R_{\delta_0}(P) < \infty$ for any $P \in \mathcal{P}$.
If $\delta_0$ is a nonrandomized rule, $\delta_1$ is still a randomized rule, unless $\delta_0(X) = h(d(X))$ a.s. $P$ for some Borel function $h$. Hence, this Proposition does not apply to situations where randomized rules are not allowed.
The following result tells us when nonrandomized rules are all we need and when decision rules that are not functions of sufficient statistics are inadmissible.
How to find a decision rule? The concepts of admissibility and sufficiency help us to eliminate some decision rules. However, usually there are still too many rules left after the elimination of some rules according to admissibility and sufficiency. Although one is typically interested in a D-optimal rule, frequently it does not exist, if $D$ is either too large or too small.
We now show that there is no $D_1$-optimal rule, i.e., there does not exist $d^* = \sum_{i=1}^{n} c_i^* X_i$ such that $R_{d^*}(P) \le R_d(P)$ for any $P \in \mathcal{P}$ and $d \in D_1$. If there is such a $d^*$, then $(c_1^*, \ldots, c_n^*)$ is a minimum of the function of $(c_1, \ldots, c_n)$ on the right-hand side of (10). Then $c_1^*, \ldots, c_n^*$ must be the same and equal to $\mu^2/(\sigma^2 + n\mu^2)$, which depends on $P$, i.e., $d^*$ is not a statistic.
Consider now a subclass $D_2 \subset D_1$ with $c_i$'s satisfying $\sum_{i=1}^n c_i = 1$. From (10), $R_d(P) = \sigma^2 \sum_{i=1}^n c_i^2$ if $d \in D_2$. Minimizing $\sigma^2\sum_{i=1}^n c_i^2$ subject to $\sum_{i=1}^n c_i = 1$ leads to $c_i = n^{-1}$. Thus, the sample mean $\bar{X}$ is $D_2$-optimal. There may not be any optimal rule if we consider a small class of rules. For example, if $D_3$ contains all the rules in $D_2$ except $\bar{X}$, then one can show that there is no $D_3$-optimal rule.
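The optimality of the equal weights $c_i = n^{-1}$ within $D_2$ can be checked numerically. The sketch below is a hypothetical illustration (not part of the original example): it compares the risk $\sigma^2\sum_i c_i^2$ of the sample-mean weights against randomly drawn weights rescaled to sum to one.

```python
import random

def risk(c, sigma2=1.0):
    # risk of d = sum_i c_i X_i for the mean when sum(c) = 1:
    # the bias is zero, so the risk is the variance sigma^2 * sum(c_i^2)
    return sigma2 * sum(ci * ci for ci in c)

n = 10
uniform = [1.0 / n] * n          # the sample-mean weights c_i = 1/n

random.seed(0)
for _ in range(100):
    # draw random weights and rescale them to sum to one
    c = [random.random() for _ in range(n)]
    s = sum(c)
    c = [ci / s for ci in c]
    assert risk(uniform) <= risk(c) + 1e-12

print(abs(risk(uniform) - 0.1) < 1e-12)  # True: the minimal risk is sigma^2/n
```

By the Cauchy-Schwarz inequality, $\sum_i c_i^2 \ge 1/n$ whenever $\sum_i c_i = 1$, so the loop cannot find a better weight vector.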
In view of the fact that an optimal rule often does not exist, statisticians adopt two common approaches to choose a decision rule. The first approach is to define a class $D$ of decision rules that have some desirable properties (statistical and/or nonstatistical) and then try to find the best rule in $D$. In the previous example, for instance, any estimator $d$ in $D_2$ has the property that $d$ is linear in $X$ and $\mathrm{E}[d(X)] = \mu$. In a general estimation problem, we can use the following concept: the bias of $d$ is $b_d(P) = \mathrm{E}[d(X)] - \vartheta$, and $d$ is unbiased if $b_d(P) = 0$ for all $P \in \mathcal{P}$.
The second approach to finding a good decision rule is to consider some characteristic $R_d$ of $R_d(P)$, for a given decision rule $d$, and then minimize $R_d$ over $d \in D$. The following are two popular ways to carry out this idea. The first method is the Bayes rule. Consider an average of $R_d(P)$ over $P \in \mathcal{P}$:
$$ r_d(\Pi) = \int_{\mathcal{P}} R_d(P)\, d\Pi(P), $$
where $\Pi$ is a given probability measure on $\mathcal{P}$.
If $r_{d^*}(\Pi) \le r_d(\Pi)$ for any $d \in D$, then $d^*$ is called a $D$-Bayes rule (or Bayes rule when $D$ contains all possible rules) w.r.t. $\Pi$.

The second method is the minimax rule. Consider the worst situation, i.e., $\sup_{P\in\mathcal{P}} R_d(P)$. If $d^* \in D$ and
$$ \sup_{P\in\mathcal{P}} R_{d^*}(P) \le \sup_{P\in\mathcal{P}} R_d(P) $$
for any $d \in D$, then $d^*$ is called a $D$-minimax rule (or minimax rule when $D$ contains all possible rules).
Example 55 Consider the estimation of $\mu \in \mathbb{R}$ under loss $L(\mu, a) = (\mu - a)^2$ and
$$ r_d(\Pi) = \int_{\mathbb{R}} \mathrm{E}\left[(\mu - d(X))^2\right] d\Pi(\mu). $$
For early references on nonlinear estimation see R.A. Fisher (1921, 1925), Wald (1949), Huber (1967), Jennrich (1969) and Malinvaud (1970). For a general treatment of this topic see Amemiya (1973, 1985) and Newey and McFadden (1994). We now provide some important examples to motivate the general theory. Throughout, we use the notation $\mathrm{E}_n[g(w)] = n^{-1}\sum_{i=1}^n g(w_i)$ to denote the empirical expectation operator based on a sample $\{w_i\}_{i=1}^n$ of size $n$, for a measurable function $g$.
$$ \hat\beta_n = \left(\mathrm{E}_n[xx']\right)^{-1} \mathrm{E}_n[xy], $$
provided $\mathrm{E}_n[xx']$ is non-singular. In this example we obtain a closed form for the so-called Ordinary Least Squares (OLS) estimator.
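As a minimal numerical sketch of the closed form (assuming, for illustration, a scalar regressor without intercept, so that $\hat\beta_n = \mathrm{E}_n[xy]/\mathrm{E}_n[x^2]$; the data are hypothetical):

```python
# noise-free toy data with y_i = 2 * x_i, so OLS must recover beta = 2 exactly
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0 * xi for xi in x]

def En(vals):
    # empirical expectation operator E_n[g(w)] = n^{-1} * sum_i g(w_i)
    return sum(vals) / len(vals)

# scalar OLS closed form: beta_hat = En[x*y] / En[x^2]
beta_hat = En([xi * yi for xi, yi in zip(x, y)]) / En([xi * xi for xi in x])
print(beta_hat)  # 2.0
```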
for a given function $\rho$ that is typically chosen to reduce sensitivity to outlying observations: "Robust estimation"; see Huber (1967).
Note that for $\lambda_n = 0$ we recover OLS. Ridge estimators do not perform model selection (see the plot for $p = 2$ for illustration).
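The shrinkage behavior is easy to see in the scalar case, where the ridge first-order condition gives $\hat\beta = \mathrm{E}_n[xy]/(\mathrm{E}_n[x^2] + \lambda)$. The sketch below (an illustration with a hypothetical penalty parametrization) recovers OLS at $\lambda = 0$ and shrinks toward zero as $\lambda$ grows:

```python
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0 * xi for xi in x]          # noiseless data with true slope 3

def En(vals):
    return sum(vals) / len(vals)

def ridge(lam):
    # scalar ridge: minimize En[(y - x*b)^2] + lam * b^2;
    # the first-order condition gives b = En[x*y] / (En[x^2] + lam)
    return En([a * b for a, b in zip(x, y)]) / (En([a * a for a in x]) + lam)

print(ridge(0.0))   # 3.0 : lambda = 0 recovers OLS
print(ridge(7.5))   # 1.5 : shrinkage toward zero, but never exactly zero
```

Note the ridge coefficient is never exactly zero for finite $\lambda$, which is the "no model selection" point made above.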
Example 60 (MLE, CML, PMLE or QMLE) The OLS estimate in the previous example can be seen as a Conditional Maximum Likelihood (CML) estimate of $\beta_0$ under the parametric assumption that $y$ given $x$ is $N(x'\beta_0, \sigma^2)$. More generally, the CML estimate maximizes the conditional (log) likelihood
$$ Q_n(\beta, \sigma^2) = \mathrm{E}_n\left[\ell(y|x; \beta, \sigma^2)\right] = \mathrm{E}_n\left[\log f(y|x; \beta, \sigma^2)\right] = \frac{1}{n}\sum_{i=1}^n \log f(y_i|x_i; \beta, \sigma^2). $$
Let $f(x_i; \gamma)$ denote the marginal pdf of $x_i$, so the log likelihood of the data $\{y_i, x_i\}$ is
$$ \log f(y_i, x_i; \beta, \sigma^2, \gamma) = \log f(y_i|x_i; \beta, \sigma^2) + \log f(x_i; \gamma). $$
The maximum likelihood estimator (MLE) of $(\beta, \sigma^2, \gamma)$ maximizes
$$ Q_n(\beta, \sigma^2, \gamma) = \mathrm{E}_n\left[\log f(y, x; \beta, \sigma^2, \gamma)\right]. $$
The CML of $\beta$ (and $\sigma^2$) only maximizes the first term, ignoring the second one. If $(\beta, \sigma^2)$ and $\gamma$ are functionally unrelated then the CML and the joint (or full) ML estimates of $(\beta, \sigma^2)$ are numerically the same.
If they are related (e.g. one parameter appears in both distributions) then they are no longer numerically equal; intuitively, the CML misses information that could be obtained from the marginal distribution, and in general it is less efficient. In many cases this is the price we pay when we do not know, or do not want to specify, $f(x; \gamma)$.

If the true underlying distribution is other than the one specified (e.g. the Gaussian distribution), the resulting estimator is called the Pseudo-MLE or Quasi-MLE (QMLE), and it might still be consistent and asymptotically normal under fairly weak regularity conditions, although not efficient; see Gourieroux, Monfort and Trognon (1984, Econometrica). That is, the QMLE allows for some misspecification.

Models with limited dependent variables and endogenous selection are examples of models that are often estimated by MLE. This approach is often criticized because of its strong distributional assumptions.

The normal distribution is a special case of an exponential family. Exponential families will be discussed in class.

In general $\hat\theta_n$ is only implicitly defined, due to the special form of the objective function: in many cases because of the nonlinearity of the model, but in others because of the nature of the estimate.
Example 61 (Estimation of GARCH models for financial data: QMLE) The GARCH(1,1) model is the most popular model in financial econometrics, and it is used, for instance, in modeling market risk by financial institutions. It is defined as
$$ Y_t = \sqrt{h_t}\,\varepsilon_t, \qquad h_t = w_0 + \alpha_0 Y_{t-1}^2 + \beta_0 h_{t-1}, \qquad t \in \mathbb{Z}, $$
where $\varepsilon_t$ is a sequence of strictly stationary and ergodic random variables satisfying
$$ \mathrm{E}[\varepsilon_t^2 \mid \mathcal{F}_{t-1}] = 1 \quad \text{almost surely (a.s.)}, \tag{11} $$
and $w_0 > 0$, $\alpha_0 \ge 0$, $\beta_0 \ge 0$. Here $\mathcal{F}_{t-1}$ is the $\sigma$-field generated by $(Y_{t-1}, Y_{t-2}, \dots)$. Define the vector of parameters $\theta = (w, \alpha, \beta)'$ and the parameter space $\Theta \subset (0, +\infty)\times[0, +\infty)^2$. The true parameter value is unknown, and it is denoted by $\theta_0 = (w_0, \alpha_0, \beta_0)'$. The prime denotes transposition. The problem we tackle is, given a sample of size $n$ of $Y_t$, $\{Y_t\}_{t=1}^n$ say, to estimate the parameter $\theta_0$. A QMLE estimator is defined as any measurable solution $\hat\theta_n$ of
$$ \hat\theta_n = \arg\min_{\theta\in\Theta} Q_n(\theta) \equiv n^{-1}\sum_{t=1}^n \tilde\ell_t(\theta), \qquad \tilde\ell_t(\theta) = \frac{Y_t^2}{\tilde{e}_t^2(\theta)} + \log \tilde{e}_t^2(\theta), \tag{12} $$
where $\tilde{e}_t^2(\theta)$ is computed recursively from the GARCH equation, $\tilde{e}_t^2(\theta) = w + \alpha Y_{t-1}^2 + \beta\,\tilde{e}_{t-1}^2(\theta)$,
starting with initial values chosen, for instance, as $Y_0^2 = \tilde{e}_0^2 = Y_1^2$. It can be shown that the initial values do not matter for the asymptotic properties of the QMLE. If the true innovation's distribution is Gaussian, the estimator is consistent and efficient. If the distribution is not Gaussian but (11) and other mild conditions hold, then the QMLE is consistent and asymptotically normal, although it is in general not efficient; see Escanciano (2008) and references therein.
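The recursion in (12) is straightforward to code. Below is a rough, self-contained sketch; the simulator, the seed, the parameter grid, and the initialization of $h$ at the unconditional variance are illustrative choices, not part of the notes:

```python
import math
import random

def simulate_garch(n, w, a, b, seed=1):
    # GARCH(1,1): Y_t = sqrt(h_t)*eps_t,  h_t = w + a*Y_{t-1}^2 + b*h_{t-1}
    random.seed(seed)
    y, h = [], w / (1.0 - a - b)       # start h at the unconditional variance
    for _ in range(n):
        y.append(math.sqrt(h) * random.gauss(0.0, 1.0))
        h = w + a * y[-1] ** 2 + b * h
    return y

def Qn(theta, y):
    # quasi-likelihood objective (12): n^{-1} sum_t [ Y_t^2/e_t^2 + log e_t^2 ],
    # with e_t^2 computed recursively from the initial value e^2 = Y_1^2
    w, a, b = theta
    e2, total = y[0] ** 2, 0.0
    for t in range(len(y)):
        total += y[t] ** 2 / e2 + math.log(e2)
        e2 = w + a * y[t] ** 2 + b * e2
    return total / len(y)

y = simulate_garch(2000, w=0.1, a=0.1, b=0.8)
# crude profile search over the ARCH parameter, holding the others at the truth
grid = [0.0, 0.05, 0.1, 0.2, 0.4]
a_hat = min(grid, key=lambda a: Qn((0.1, a, 0.8), y))
print(a_hat)
```

In practice the minimization would be done over all three parameters with a numerical optimizer; the grid search here only illustrates how the objective is evaluated.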
Example 62 (CML in Probit Binary Choice) Let the scalar dependent variable $y$ denote participation in the labor market, that is, $y = 1$ if the individual participates and $y = 0$ otherwise. Let $x$ be some explanatory variables including gender, education, age, etc. We are interested in modeling the conditional probability of labor participation given the $d$-dimensional vector of variables $x$. We assume the model
$$ \mathbb{P}(y = 1 \mid x) = \Phi(x'\beta_0), $$
where $\Phi$ denotes the standard normal cdf.
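The CML objective for this model is $Q_n(\beta) = \mathrm{E}_n[\,y\log\Phi(x'\beta) + (1-y)\log(1-\Phi(x'\beta))\,]$, which can be sketched as follows (a scalar-regressor toy example with hypothetical data; $\Phi$ is computed from the error function):

```python
import math

def Phi(z):
    # standard normal cdf via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def loglik(beta, data):
    # conditional log likelihood
    # Q_n(beta) = E_n[ y*log Phi(x*beta) + (1-y)*log(1 - Phi(x*beta)) ]
    total = 0.0
    for y, x in data:
        p = Phi(beta * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)

# toy data in which y = 1 exactly when x > 0, so large positive beta fits well
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
data = [(1 if x > 0 else 0, x) for x in xs]

assert loglik(3.0, data) > loglik(0.0, data) > loglik(-3.0, data)
```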
This method can be applied, for instance, to the Probit model, as an alternative to CML. Another example of a nonlinear specification that is often used with positive dependent variables is the Poisson regression model
where $\mathbb{P}[v_i \le 0 \mid x_i] = \tau$ a.s., for some $\tau \in (0, 1)$. Then, $\beta_0' x_i$ becomes the $\tau$-th conditional quantile of the distribution of $y_i$ given $x_i$ (prove this). The most popular estimator of $\beta_0$ under this framework is the Quantile Regression Estimator (QRE), proposed by Koenker and Bassett (1978). The QRE is defined as any solution $\hat\beta_{KB,n}(\tau)$ minimizing
$$ \beta \mapsto Q_n(\beta) = \sum_{i=1}^n \rho_\tau(y_i - x_i'\beta) $$
with respect to $\beta \in \mathbb{R}^p$, where $\rho_\tau(\varepsilon) = (\tau - 1(\varepsilon \le 0))\,\varepsilon$ is the check function. The Least Absolute Deviation (LAD) estimator corresponds to the median, $\tau = 0.5$. Rather than relying on a single measure of conditional location, the quantile regression approach allows the researcher to explore a range of conditional quantile functions, thereby providing a more complete analysis of the conditional dependence structure of the variables under consideration. Quantile regression allows labor economists to investigate, for example, the effect of education on different parts of the wage distribution (note $\beta_0 = \beta_0(\tau)$ may change with $\tau$).
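A quick way to see that minimizing the check function yields quantiles is the intercept-only case, where $Q_n(q) = \sum_i \rho_\tau(y_i - q)$ is minimized at the $\tau$-th sample quantile. A minimal sketch with hypothetical data (the minimizer can always be taken among the data points):

```python
def rho(tau, e):
    # check function rho_tau(e) = (tau - 1(e <= 0)) * e, which is nonnegative
    return (tau - (1.0 if e <= 0 else 0.0)) * e

def objective(tau, q, ys):
    # intercept-only quantile objective Q_n(q) = sum_i rho_tau(y_i - q)
    return sum(rho(tau, y - q) for y in ys)

ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
# minimize over the data points themselves (a minimizer always lies there)
median = min(ys, key=lambda q: objective(0.5, q, ys))
q25 = min(ys, key=lambda q: objective(0.25, q, ys))
print(median, q25)  # 5.0 3.0
```

For $\tau = 0.5$ the objective is half the sum of absolute deviations, recovering the LAD/median case.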
$$ \mathrm{E}_n[\varphi(w)] = h(\hat\theta_n). $$
Assuming that $h$ is invertible, with inverse $h^{-1}$, the estimator is given by $\hat\theta_n = h^{-1}(\mathrm{E}_n[\varphi(w)])$. We can then apply the Delta method to investigate the asymptotic properties of
$\hat\theta_n$. Many times $\varphi(w)$ involves the first $p$ moments. For example, for a scalar $w$, $\varphi(w) = (w, w^2, \dots, w^p)$. This is a special case of a Zero estimator that solves the sample analog of the following orthogonality conditions
$$ \mathrm{E}[m] = \mathrm{E}\left[m(w, \theta_0)\right] = 0 \quad (p \times 1) $$
(where we can assume that $w$ is ergodic stationary). In this situation, we can estimate $\theta_0$ by the so-called MM estimate $\hat\theta_n^{MM}$, which is a solution of
$$ \mathrm{E}_n\left[m(w, \hat\theta_n^{MM})\right] = 0. $$
$$ \mathrm{E}[m] = \mathrm{E}\left[m(w, \theta_0)\right] = 0 \quad (q \times 1) $$
(where we can assume that $w$ is ergodic stationary). In this situation, we can estimate $\theta_0$ by the so-called GMM estimate $\hat\theta_n^{GMM}(\hat{W}_n)$, which is a solution of
$$ \hat\theta_n^{GMM}(\hat{W}_n) = \arg\min_\theta\; \mathrm{E}_n[m(w, \theta)]'\, \hat{W}_n\, \mathrm{E}_n[m(w, \theta)] $$
for a symmetric matrix $\hat{W}_n$, where
$$ \mathrm{E}_n[m(w, \theta)] = \frac{1}{n}\sum_{i=1}^n m(w_i, \theta). $$
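As a hypothetical one-dimensional illustration of the MM idea: for an exponential variable with rate $\theta$ we have $\mathrm{E}[w] = 1/\theta$, so the moment condition $m(w, \theta) = w - 1/\theta$ gives the MM estimate $\hat\theta_n^{MM} = 1/\bar{w}$, and the empirical moment is exactly zero at that value:

```python
# hypothetical data; the moment condition is m(w, theta) = w - 1/theta
w = [0.5, 1.0, 1.5, 2.0]

def En_m(theta, w):
    # empirical moment E_n[m(w, theta)]
    return sum(wi - 1.0 / theta for wi in w) / len(w)

theta_hat = 1.0 / (sum(w) / len(w))        # MM solution: 1 / sample mean
print(theta_hat)                            # 0.8
print(abs(En_m(theta_hat, w)) < 1e-12)      # True: the moment condition holds
```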
The first examples differ from Example 66 in that the objective functions in the first ones are sample means, while in the last one it is a quadratic form of such sample means.

The first ones, including NLS and (C)ML estimates, which minimize (or maximize)
$$ Q_n(\theta) = \mathrm{E}_n\left[g(w, \theta)\right] $$
for some function $g$ of the available random sample of $w$, are generally denoted as M-Estimates (from Maximum likelihood-like), by contrast with the quadratic form case.

Other estimates, such as minimum distance, are of the type $\arg\min Q_n(\theta) \equiv g_n(\theta)' W g_n(\theta)$, where $g_n(\theta)$ is not necessarily a sample mean. Examples of minimum distance estimators include the minimum chi-square methods for discrete data, as well as estimators for simultaneous equation models in Rothenberg (1973) and panel data in Chamberlain (1982). For a more recent application to estimation of dynamic games see, e.g., Pesendorfer and Schmidt-Dengler (2008, ReStud).

Another large class of estimates is defined by the empirical equality $\mathrm{E}_n[g(w, \theta)] = 0$. These are often called Z-estimates (from Zero) or estimating equation estimates. Notice that by considering $Q_n(\theta) = \|\mathrm{E}_n[g(w, \theta)]\|$ these are particular cases of extremum estimates.
One can easily check that $\theta_n = 1/n = \arg\min_{\theta\in\Theta} Q_n(\theta) \to 0 \ne 1 = \arg\min_{\theta\in\Theta} Q(\theta)$. We shall see below that uniform convergence and some additional assumptions will be sufficient for consistency. Uniform convergence in probability means: $\forall c > 0$,
$$ \mathbb{P}\left(\sup_{\theta\in\Theta} |Q_n(\theta) - Q(\theta)| > c\right) \to 0. $$
Example 68 (NLS Probit) Consider the Probit model for the binary outcome $y$ (e.g. working) and regressors $x$. Assume the data $\{w_i = (y_i, x_i)\}$ is iid. Then, the LLN implies, for each $\beta \in \mathbb{R}^d$,
$$ Q_n(\beta) = \mathrm{E}_n\left[(y_i - \Phi(x_i'\beta))^2\right] \to_p Q(\beta) = \mathrm{E}\left[(y - \Phi(x'\beta))^2\right]. $$
The LLN can be applied because $\{(y_i - \Phi(x_i'\beta))^2\}$ are iid and bounded. But how do we prove the consistency of $\hat\beta_n$? First, we need to show identification (meaning?).
7.2.1 Consistency

Existence. A first question is whether such an estimate $\hat\theta_n$ exists. A continuous function $Q_n$ of $\theta$, measurable wrt the data, and a compact $\Theta$ guarantee this. We will always assume that there are no problems of measurability and that such a solution exists. See Jennrich (1969) for further details on this issue.

We first provide conditions under which $\hat\theta_n$ is consistent for a "true value" $\theta_0$. Henceforth, the function $Q$ is the pointwise limit of $Q_n$, i.e. $Q_n(\theta) \to_p Q(\theta)$ for each $\theta \in \Theta$. This is our first general result on consistency; its identification condition (ii) requires that for all $\varepsilon > 0$,
$$ \inf_{d(\theta, \theta_0) > \varepsilon} Q(\theta) > Q(\theta_0). $$
Proof. Choose $\varepsilon > 0$ in (ii); then there exists a $\delta > 0$ such that $d(\theta, \theta_0) > \varepsilon$ implies $Q(\theta) \ge Q(\theta_0) + \delta$, which in turn implies $|Q(\theta) - Q(\theta_0)| \ge \delta$. Thus,
$$ \mathbb{P}\left(d(\hat\theta_n, \theta_0) > \varepsilon\right) \le \mathbb{P}\left(|Q(\hat\theta_n) - Q(\theta_0)| \ge \delta\right). $$
Then, we have to show that the RHS converges to zero, which is equivalent to $Q(\hat\theta_n) \to_p Q(\theta_0)$. Now,
Then, we have the following useful corollary, which applies to MLE, CML, NLS,
among others. Its proof is left to the reader. It will be referred to as Main Consis-
tency Result. The example that follows illustrates how the main consistency result
is applied.
Example 73 (NLS Probit, cont.) We show the consistency of the NLS $\hat\beta_n$ for the Probit model using (72). Assume $\Theta$ is a compact subset of $\mathbb{R}^d$. First, we show identification of the nonlinear model. Note
$$ Q(\beta) = \mathrm{E}\left[(y_i - \Phi(x_i'\beta))^2\right] = \mathrm{E}\left[(y_i - \Phi(x_i'\beta_0))^2\right] + \mathrm{E}\left[(\Phi(x_i'\beta_0) - \Phi(x_i'\beta))^2\right], $$
where the cross-product is zero by the orthogonality condition $\mathrm{E}[y - \Phi(x'\beta_0) \mid x] = 0$, since by iterated expectations
$$ \mathrm{E}\left[(y_i - \Phi(x_i'\beta_0))(\Phi(x_i'\beta_0) - \Phi(x_i'\beta))\right] = \mathrm{E}\left[\mathrm{E}[y - \Phi(x'\beta_0) \mid x]\,(\Phi(x_i'\beta_0) - \Phi(x_i'\beta))\right] = 0. $$
Now, strict monotonicity of $\Phi$ yields
Example 74 (MLE cont.) Suppose that $\{w_i\}$ are iid with pdf $f(w_i, \theta)$ and (i) if $\theta \ne \theta_0$ then $f(w_i, \theta) \ne f(w_i, \theta_0)$; (ii) $\theta_0 \in \Theta$, which is compact; (iii) $\ln f(w_i, \theta)$ is continuous at each $\theta \in \Theta$ with probability one; (iv) $\mathrm{E}[\sup_{\theta\in\Theta} |\ln f(w, \theta)|] < \infty$. Then $\hat\theta_{MLE} \to_p \theta_0$. The identification condition here follows from (i) and Jensen's inequality (how?). To see how, by the strict version of Jensen's inequality, if $f(w_i, \theta) \ne f(w_i, \theta_0)$,
$$ Q(\theta) - Q(\theta_0) = \mathrm{E}[\ln f(w, \theta)] - \mathrm{E}[\ln f(w, \theta_0)] = \mathrm{E}\left[\ln \frac{f(w, \theta)}{f(w, \theta_0)}\right] < \ln \mathrm{E}\left[\frac{f(w, \theta)}{f(w, \theta_0)}\right] = \ln(1) = 0. $$
Consider, for example, an exponential density with parameter $\theta > 0$: $f(w, \theta) = \theta \exp(-\theta w)$ for $w > 0$. This density satisfies the consistency conditions, with $\Theta$ a compact subset of $(0, \infty)$ (verify this).
Example 75 (Quantile Regression, cont.) The quantile regression objective function
$$ \beta \mapsto Q_n(\beta) = \frac{1}{n}\sum_{i=1}^n \rho_\tau(y_i - x_i'\beta) $$
is convex, where recall $\rho_\tau(\varepsilon) = (\tau - 1(\varepsilon \le 0))\,\varepsilon$. Then, for consistency it suffices to show that (i) $Q(\beta)$ is uniquely minimized at $\beta_0$; (ii) $\beta_0$ is an element of the interior of a convex set $\Theta$ and (iii) $Q_n(\beta) \to_p Q(\beta)$ for all $\beta \in \Theta$. Write
$$ Q(\beta) = \mathrm{E}\left[\rho_\tau(y_i - x_i'\beta)\right] = Q(\beta_0) + \left[Q(\beta) - Q(\beta_0)\right] = Q(\beta_0) + \mathrm{E}\left[\int_0^{\delta' x_i} (\delta' x_i - \varepsilon) f(\varepsilon \mid x_i)\, d\varepsilon\right], $$
where $\delta = \delta(\beta) = \beta - \beta_0$, $\varepsilon_0 = y - \beta_0' x$ and $f(\varepsilon \mid x)$ is the conditional density of $\varepsilon_0$ given $x$. By Dominated Convergence and Leibniz's rule, for $S(\beta) = Q(\beta) - Q(\beta_0)$,
$$ \frac{\partial S(\beta)}{\partial \beta} = \mathrm{E}\left[(\delta' x_i - \varepsilon) f(\varepsilon \mid x_i)\Big|_{\varepsilon = \delta' x_i}\, x_i + \int_0^{\delta' x_i} x_i f(\varepsilon \mid x_i)\, d\varepsilon\right] = \mathrm{E}\left[\int_0^{\delta' x_i} x_i f(\varepsilon \mid x_i)\, d\varepsilon\right] $$
and, evaluated at $\beta = \beta_0$,
$$ \frac{\partial^2 S(\beta)}{\partial\beta\,\partial\beta'}\Big|_{\beta = \beta_0} = \mathrm{E}\left[x_i x_i' f(0 \mid x_i)\right]. $$
Note $f(0 \mid x_i)$ can also be expressed as the conditional density of $y_i$ given $x_i$ evaluated at the quantile $\beta_0' x_i$. Thus, under the assumption that $\mathrm{E}[x_i x_i' f(0 \mid x_i)]$ is positive definite, the strictly convex function $Q(\beta)$ has a unique minimum at $\beta = \beta_0$. We conclude that the QRE is consistent by the convexity theorem.
ASS. 1.
(i) $\theta_0$ is an interior point of $\Theta \subset \mathbb{R}^p$.
(ii) $Q_n(\theta)$ is twice continuously differentiable in a neighborhood of $\theta_0$ and
$$ \sqrt{n}\,\frac{\partial}{\partial\theta} Q_n(\theta_0) \to_d N(0, D), $$
$$ \tilde\theta_n \to_p \theta_0 \;\Longrightarrow\; \frac{\partial^2}{\partial\theta\,\partial\theta'} Q_n(\tilde\theta_n) \to_p E > 0 \text{ nonstochastic.} \tag{17} $$
(iii) $\hat\theta_n \to_p \theta_0$.
If, in addition, $E = D$, then
$$ \sqrt{n}(\hat\theta_n - \theta_0) \to_d N(0, D^{-1}). $$
Remark 2: A sufficient condition for the second display of 1(ii) is the uniform convergence of the Hessian $\partial^2 Q_n(\theta)/\partial\theta\,\partial\theta'$ and the continuity of the limit.
This remark is very useful in many applications and we shall use it extensively in these notes. Notice that we only need $\hat\theta_n$ to lie in the relevant neighborhood with probability tending to one for the previous argument to work, and that no rates of convergence of $\hat\theta_n$ are needed. In a typical application
$$ G_n(\hat\theta_n) = \frac{1}{n}\sum_{i=1}^n m(z_i, \hat\theta_n), $$
for some function $m$. The law of large numbers cannot be directly applied to $G_n(\hat\theta_n)$ because $\{m(z_i, \hat\theta_n)\}_{i=1}^n$ are generally dependent. The trick above suggests a solution to this "problem" by applying a ULLN (from our Main ULLN typically) and using the consistency of $\hat\theta_n$ and continuity of the limit (a typical by-product of the Main ULLN). Thus, 1(ii) follows from a ULLN and the consistency theorem.
We must show
$$ \Pr(d_n \le x) \to \Pr(X \le x) \quad \text{for } X \sim N(0, 1),\ \forall x. $$
But
where
$$ \Pr(I = 0) = \Pr\left(\|\hat\theta - \theta_0\| \ge \delta\right) \text{ for some } \delta > 0, \text{ by (i)} \;\to\; 0 \text{ by (ii)} $$
$$ \Longrightarrow\; \Pr(d_n \le x) = \Pr(d_n \le x \mid I = 1) + o(1) \text{ uniformly in } x. $$
By the mean value theorem
$$ 0 = \sqrt{n}\,\frac{\partial}{\partial\theta_i} Q_n(\hat\theta) = \sqrt{n}\,\frac{\partial}{\partial\theta_i} Q_n(\theta_0) + F_i(\bar\theta_i)\,\sqrt{n}(\hat\theta - \theta_0), \qquad i = 1, \dots, p. $$
But $\hat\theta \to_p \theta_0 \Rightarrow \bar\theta_i \to_p \theta_0$, $i = 1, \dots, p$,
$$ \Longrightarrow\; F_i(\bar\theta_i) \to_p\; i\text{-th row of } E $$
$$ \Longrightarrow\; 0 = \sqrt{n}\,\frac{\partial}{\partial\theta} Q_n(\theta_0) + \left(E + o_p(1)\right)\sqrt{n}(\hat\theta - \theta_0) $$
$$ \Longrightarrow\; E^{-1}\left(E + o_p(1)\right)\sqrt{n}(\hat\theta - \theta_0) = -E^{-1}\sqrt{n}\,\frac{\partial}{\partial\theta} Q_n(\theta_0) \to_d N\left(0, E^{-1} D E^{-1}\right) $$
$$ \Longrightarrow\; \sqrt{n}(\hat\theta - \theta_0) \to_d N\left(0, E^{-1} D E^{-1}\right). $$
The following corollary is useful. Its proof is left to the reader. Define
$$ D = \mathrm{E}\left[\frac{\partial g(w_t, \theta_0)}{\partial\theta}\,\frac{\partial g(w_t, \theta_0)}{\partial\theta'}\right] $$
and
$$ E = \mathrm{E}\left[\frac{\partial^2 g(w_t, \theta_0)}{\partial\theta\,\partial\theta'}\right]. $$
The following result is extremely useful in this course, and will be referred to as the Main Asymptotic Normality result. It is used to establish our third main step for inference (proving asymptotic normality, after identification and consistency).
Example 78 (NLS Probit, cont.) We now show the AN of the NLS $\hat\beta_n$ for the Probit model. Assume $\Theta$ is a compact subset of $\mathbb{R}^d$. We have shown before that if $\mathrm{E}[x_i x_i']$ is non-singular then $\hat\beta_n$ is consistent for $\beta_0$. Next, we check that
$$ g(w_i, \beta) = (y_i - \Phi(x_i'\beta))^2 $$
satisfies the conditions of (77). First, it is twice continuously differentiable for all $\beta$ and $w_i$, with
$$ \frac{\partial g(w_t, \beta)}{\partial\beta} = -2(y_i - \Phi(x_i'\beta))\,\phi(x_i'\beta)\, x_i $$
$$ \frac{\partial^2 g(w_t, \beta)}{\partial\beta\,\partial\beta'} = 2\phi^2(x_i'\beta)\, x_i x_i' - 2(y_i - \Phi(x_i'\beta))\,\dot\phi(x_i'\beta)\, x_i x_i', $$
where $\phi$ is the standard normal density and $\dot\phi$ its derivative. Then, since $\phi$ and $\dot\phi$ are bounded, we can take $h(w) = C|x|^2$ for a positive constant $C$. Thus, the NLS $\hat\beta_n$ is CAN:
$$ \sqrt{n}(\hat\beta_n - \beta_0) \to_d N\left(0, E^{-1} D E^{-1}\right), $$
with
$$ D = \mathrm{E}\left[4(y_i - \Phi(x_i'\beta_0))^2\, \phi^2(x_i'\beta_0)\, x_i x_i'\right] $$
and (by the orthogonality condition)
$$ E = \mathrm{E}\left[2\phi^2(x_i'\beta_0)\, x_i x_i'\right]. $$
Example 79 (MLE cont.) Suppose that $\{w_i\}$ are iid with pdf $f(w_i, \theta)$ and (i) the conditions for consistency hold; (ii) $\theta_0 \in \operatorname{int}(\Theta)$, with $\Theta$ compact; (iii) $f(w_i, \theta)$ is twice continuously differentiable and $f(w_i, \theta) > 0$ in a neighborhood $\mathcal{N}$ of $\theta_0$; (iv) $\mathrm{E}[\sup_{\theta\in\Theta} \|\nabla_\theta \log f(w_i, \theta)\|] < \infty$,
$$ \int \sup_{\theta\in\mathcal{N}} \left\|\nabla_\theta f(w, \theta)\right\| dw < \infty, \qquad \int \sup_{\theta\in\mathcal{N}} \left\|\nabla_{\theta\theta'} f(w, \theta)\right\| dw < \infty; $$
then the MLE is AN. The function $s(w, \theta) = \nabla_\theta \log f(w, \theta)$ is called the score. The variance $D = \mathrm{E}[s(w, \theta_0) s(w, \theta_0)']$ is called the Fisher Information matrix. We shall show in the next pages that $D = -E$, where $E = \mathrm{E}[\nabla_{\theta\theta'} \log f(w, \theta_0)]$ (this is the Information Equality). Thus, for the MLE, $\sqrt{n}(\hat\theta_n - \theta_0) \to_d N(0, I^{-1})$, where $I = D = -E$ is the Fisher Information. The Fisher Information plays a key role in statistics. Under regularity conditions, $I^{-1}$ represents the variance efficiency bound (the smallest possible variance for a regular estimator of $\theta_0$). Heuristically, regular estimators are estimators for which the convergence in distribution of $\sqrt{n}(\hat\theta_n - \theta_0)$ holds uniformly (locally, around the true distribution). The Hodges' estimator is not regular (so there is no contradiction with the previous statement on efficiency).
and
$$ E = \mathrm{E}\left[2\phi^2(x_i'\beta_0)\, x_i x_i'\right]. $$
Therefore, $\hat{V} = \hat{E}^{-1}\hat{D}\hat{E}^{-1}$, with
$$ \hat{D} = \mathrm{E}_n\left[4(y_i - \Phi(x_i'\hat\beta_n))^2\, \phi^2(x_i'\hat\beta_n)\, x_i x_i'\right] $$
and
$$ \hat{E} = \mathrm{E}_n\left[2\phi^2(x_i'\hat\beta_n)\, x_i x_i'\right]. $$
$$ \mathrm{E}\left[\sup_{\theta\in\Theta} |\log f(y_i|x_i; \theta)|\right] < \infty, \qquad \mathrm{E}\left[\sup_{\theta\in\Theta} \left\|\nabla_\theta \log f(y_i|x_i; \theta)\right\|\right] < \infty, $$
$$ \int \sup_{\theta\in\mathcal{N}} \left\|\nabla_\theta f(y|x_i; \theta)\right\| dy < \infty, \tag{18} $$
$$ \int \sup_{\theta\in\mathcal{N}} \left\|\nabla_{\theta\theta'} f(y|x_i; \theta)\right\| dy < \infty, \tag{19} $$
where $D := V[s(\theta_0)]$. To see that $\mathrm{E}[S_n(\theta_0)|x] = \mathrm{E}[s(\theta_0)|x] = 0$ note that
$$ \mathrm{E}[s(\theta) \mid x] = \int s(\theta; y, x) f(y|\theta, x)\, dy = \int \nabla_\theta f(y|\theta, x)\, dy = \nabla_\theta \underbrace{\int f(y|\theta, x)\, dy}_{=1,\ \forall\theta} = 0, $$
for all $\theta$, where the exchange of integration and differentiation is allowed by the uniform integrability condition (18); so for $\theta = \theta_0$, $\mathrm{E}[s(\theta_0) \mid x] = 0$. In dynamic models,
is the Hessian (which is negative definite because we are now maximizing), so that we obtain the so-called information equality
$$ \mathrm{E}\left[h(\theta_0; y, x) \mid x\right] = -\,\mathrm{E}\left[s(\theta_0) s(\theta_0)' \mid x\right], $$
because
$$ \underbrace{\nabla_{\theta'} \int s(\theta; y, x) f(y|\theta, x)\, dy}_{=0} = \int \nabla_{\theta'}\left\{s(\theta; y, x) f(y|\theta, x)\right\} dy = \int h(\theta; y, x) f(y|\theta, x)\, dy + \int s(\theta; y, x) s(\theta; y, x)' f(y|\theta, x)\, dy. $$
The sufficient conditions given here for AN of CML are not optimal. They can be relaxed using the concept of functional derivatives (Fréchet derivative) of the square root of the density, but this is beyond the scope of these notes.
$$ \Phi(x_i'\beta_0) = \Phi(x_i'\beta), $$
which in turn implies $x_i'\beta_0 = x_i'\beta$. Multiplying both sides by $x_i$ and taking expectations, we conclude $\beta_0 = \beta$ provided
is continuous in $\beta$ for all $w_i$. Verifying the dominance conditions is a little more involved than for the NLS. Note that by the Mean Value Theorem
$$ |\log \Phi(x_i'\beta)| \le |\log \Phi(0)| + \lambda(\bar{u})\,|x_i'\beta| \le C + M|x_i|^2, $$
where $\lambda(u) = \phi(u)/\Phi(u)$ is the derivative of $\log\Phi(u)$, $\bar{u}$ is some intermediate point between $x_i'\beta$ and zero, and $C$ and $M$ are constants. In these inequalities we have used that $\lambda(u) \le C|1 + u|$ and that $\Theta$ is compact (and hence bounded). A similar bound holds for $\log(1 - \Phi(x_i'\beta))$. Therefore, the moment $g(w_i, \beta)$ satisfies the conditions of the ULLN, and consistency follows. For asymptotic normality, we compute derivatives (after some computation)
$$ \frac{\partial Q_n(\beta)}{\partial\beta} = \mathrm{E}_n\left[y_i x_i \lambda(x_i'\beta) - (1 - y_i) x_i \lambda(-x_i'\beta)\right], $$
$$ \frac{\partial^2 Q_n(\beta)}{\partial\beta\,\partial\beta'} = \mathrm{E}_n\left[\left(\dot\lambda(x_i'\beta)\, y_i + \dot\lambda(-x_i'\beta)(1 - y_i)\right) x_i x_i'\right]. $$
Then, by the CLT
$$ \sqrt{n}\,\frac{\partial}{\partial\beta} Q_n(\beta_0) = S_n(\beta_0) \to_d N(0, D), $$
where
$$ D = \mathrm{E}\left[\frac{\phi^2(x_i'\beta_0)}{\Phi(x_i'\beta_0)\left(1 - \Phi(x_i'\beta_0)\right)}\, x_i x_i'\right]. $$
We shall apply Theorem 11 with $g(w_i, \beta)$ equal to the integrand of the Hessian. This function is continuous for all $\beta$ w.p.1, and noting that $\dot\lambda(u)$ is bounded, we have $|g(w_i, \beta)| \le C|x_i|^2$.
This result follows from the Delta Method applied to the inverse of $h(\theta)$ around $\theta_0$ (see Theorem 4.1 in van der Vaart (1998)). This result could also be obtained from a Taylor expansion argument in the empirical equation
$$ \mathrm{E}_n\left[m(w, \hat\theta_n^{GMM})\right] = 0. $$
$$ \hat\theta_n = \arg\min_{\theta\in\Theta} Q_n(\theta). $$
We first discuss concentration techniques that can be useful when the dimension $p$ is large, or when a closed form solution exists for a subcomponent of $\hat\theta_n$.

Concentration

Suppose we partition $\theta$ as $\theta = (\theta_1', \theta_2')'$, where $\theta_1$ is $p_1 \times 1$, $\theta_2$ is $p_2 \times 1$ and $p_1 + p_2 = p$. Suppose that for given $\theta_1$, we can obtain an explicit formula for the optimizing value of $\theta_2$:
$$ \hat\theta_{2n}(\theta_1) = \arg\min_{\theta_2} Q_n(\theta_1, \theta_2) $$
can be written
$$ \hat\theta_{2n}(\theta_1) = g_n(\theta_1) $$
for a given function $g_n$. Then $\hat\theta_{1n}$ minimizes the concentrated objective $R_n(\theta_1) = Q_n(\theta_1, g_n(\theta_1))$, and partitioning $\hat\theta_n = (\hat\theta_{1n}', \hat\theta_{2n}')'$ we have
$$ \hat\theta_{2n} = g_n(\hat\theta_{1n}). $$
E.g.
$$ y = \alpha + \beta e^{\gamma z} + v, \qquad \theta_1 = \gamma, \qquad \theta_2 = (\alpha, \beta)', \qquad h(\theta_1) = (1, e^{\gamma z})', $$
so that the model can be written as $y = \theta_2' h(\theta_1) + v$ and
$$ Q_n(\theta) = \sum_{i=1}^n \left(y_i - \theta_2' h_i(\theta_1)\right)^2, $$
$$ \hat\theta_{2n}(\theta_1) = \left(\sum_{i=1}^n h_i(\theta_1) h_i'(\theta_1)\right)^{-1} \sum_{i=1}^n h_i(\theta_1) y_i = g_n(\theta_1), $$
$$ R_n(\theta_1) = \sum_{i=1}^n \left(y_i - \hat\theta_{2n}'(\theta_1) h_i(\theta_1)\right)^2 = y'\left[I - h(\theta_1)\left\{h'(\theta_1) h(\theta_1)\right\}^{-1} h'(\theta_1)\right] y, $$
where
$$ y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad h(\theta_1) = \begin{pmatrix} h_1'(\theta_1) \\ \vdots \\ h_n'(\theta_1) \end{pmatrix}. $$
We can do a similar thing in Gaussian Maximum Likelihood estimation of
$$ y_i = \theta_{2a}' h(\theta_1) + v_i, \qquad v_i \sim N(0, \theta_{2b}), \qquad \theta_2 = \begin{pmatrix} \theta_{2a} \\ \theta_{2b} \end{pmatrix}, $$
i.e. we can concentrate out the disturbance variance $\theta_{2b}$ as well as the scale parameters $\theta_{2a}$.
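The concentration idea can be sketched for the model $y = \alpha + \beta e^{\gamma z} + v$: for each $\gamma$ the inner problem is OLS of $y$ on $(1, e^{\gamma z})'$, and the concentrated objective $R_n(\gamma)$ is then minimized by a one-dimensional search. The data below are hypothetical and noiseless, so $R_n$ is (numerically) zero at the true $\gamma$:

```python
import math

def fit_theta2(gamma, z, y):
    # inner OLS of y on h(gamma) = (1, exp(gamma*z))': solve the 2x2 normal equations
    ez = [math.exp(gamma * zi) for zi in z]
    n = float(len(z))
    s1, s2 = sum(ez), sum(e * e for e in ez)
    b1, b2 = sum(y), sum(e * yi for e, yi in zip(ez, y))
    det = n * s2 - s1 * s1
    alpha = (s2 * b1 - s1 * b2) / det
    beta = (n * b2 - s1 * b1) / det
    return alpha, beta

def Rn(gamma, z, y):
    # concentrated objective R_n(gamma) = Q_n(gamma, g_n(gamma))
    alpha, beta = fit_theta2(gamma, z, y)
    return sum((yi - alpha - beta * math.exp(gamma * zi)) ** 2
               for zi, yi in zip(z, y))

z = [0.0, 0.5, 1.0, 1.5, 2.0]
y = [1.0 + 2.0 * math.exp(0.7 * zi) for zi in z]   # alpha=1, beta=2, gamma=0.7

grid = [k / 10.0 for k in range(1, 21)]            # 0.1, 0.2, ..., 2.0
gamma_hat = min(grid, key=lambda g: Rn(g, z, y))
print(gamma_hat)  # 0.7
```

The search is one-dimensional no matter how many linear parameters $(\alpha, \beta)$ there are, which is the point of concentrating.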
Search Methods

We have been implicitly assuming that $\Theta$ contains an uncountable infinity of points. Suppose instead that $\Theta_A$ is a finite set of $N < \infty$ points. Let
$$ \hat\theta_N = \arg\min_{\theta\in\Theta_A} Q_n(\theta). $$
3. Alternating Search.
Put $\theta = (\theta_1, \theta_2, \dots, \theta_p)'$. Fix $\theta_2, \dots, \theta_p$ and search over $\theta_1$. Then, for the optimizing $\theta_1$ and the previous $\theta_3, \dots, \theta_p$, search over $\theta_2$, etc.
$$ y = h(\theta_1)'\theta_2 + v $$
7.3.1 Newton-Raphson

We now assume that $\hat\theta_n$ also satisfies
$$ f_n(\hat\theta_n) = \frac{\partial Q_n(\hat\theta_n)}{\partial\theta} = 0. $$
Define
$$ F_n(\theta) = \frac{\partial f_n(\theta)}{\partial\theta'} = \frac{\partial^2 Q_n(\theta)}{\partial\theta\,\partial\theta'}. $$
Thus
$$ 0 \approx f_n(\theta_{n(k)}) + F_n(\theta_{n(k)})\left(\hat\theta_n - \theta_{n(k)}\right) $$
or
$$ \hat\theta_n \approx \theta_{n(k)} - F_n(\theta_{n(k)})^{-1} f_n(\theta_{n(k)}), $$
which suggests the iteration
$$ \theta_{n(k+1)} = \theta_{n(k)} - F_n(\theta_{n(k)})^{-1} f_n(\theta_{n(k)}), \qquad k = 1, 2, \dots $$
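A minimal scalar illustration of the iteration, with the hypothetical convex objective $Q_n(\theta) = e^\theta - 2\theta$ (whose minimizer is $\log 2$):

```python
import math

def f(theta):
    # f_n(theta) = dQ/dtheta for Q(theta) = exp(theta) - 2*theta
    return math.exp(theta) - 2.0

def F(theta):
    # F_n(theta) = d^2 Q / dtheta^2
    return math.exp(theta)

theta = 0.0
for _ in range(10):
    theta = theta - f(theta) / F(theta)     # Newton-Raphson update

print(theta)   # converges to log(2) = 0.6931...
```

The quadratic convergence is visible: a handful of iterations already reaches machine precision.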
For nonlinear least squares,
$$ y_i = \mu_i(\theta) + v_i, \qquad Q_n(\theta) = \frac{1}{n}\sum_{i=1}^n \left(y_i - \mu_i(\theta)\right)^2, $$
$$ f_n(\theta) = -\frac{2}{n}\sum_{i=1}^n \left(y_i - \mu_i(\theta)\right)\frac{\partial \mu_i(\theta)}{\partial\theta}, $$
$$ F_n(\theta) = \frac{2}{n}\sum_{i=1}^n \frac{\partial \mu_i(\theta)}{\partial\theta}\,\frac{\partial \mu_i(\theta)}{\partial\theta'} - \frac{2}{n}\sum_{i=1}^n \left(y_i - \mu_i(\theta)\right)\frac{\partial^2 \mu_i(\theta)}{\partial\theta\,\partial\theta'}. $$
7.3.2 Gauss-Newton

Suppose $Q_n(\theta)$ is of the form
$$ Q_n(\theta) = \frac{1}{2}\, r_n(\theta)' r_n(\theta), \qquad r_n \text{ is } N \times 1, $$
for a vector $r_n(\theta)$. Now
$$ f_n(\theta) = \frac{\partial r_n'(\theta)}{\partial\theta}\, r_n(\theta) = \sum_{i=1}^N \frac{\partial r_{ni}(\theta)}{\partial\theta}\, r_{ni}(\theta). $$
The Gauss-Newton iteration is
$$ \theta^G_{n(k+1)} = \theta^G_{n(k)} - G_n(\theta^G_{n(k)})^{-1} f_n(\theta^G_{n(k)}), \qquad G_n(\theta) = \frac{\partial r_n'(\theta)}{\partial\theta}\,\frac{\partial r_n(\theta)}{\partial\theta'}. $$
Advantages:
1. $G_n(\theta)$ is always psd.
On the other hand, Gauss-Newton tends to converge more slowly than Newton-Raphson. GN can be used in many econometric problems.
The Score Method replaces the Hessian $G_n(\theta^G_{n(k)})$ by (minus) the Fisher information matrix $I(\theta^G_{n(k)})$ (i.e., its expectation).
The term
$$ -2\sum_{i=1}^n \left(y_i - \mu_{ni}(\theta)\right)\frac{\partial^2 \mu_{ni}(\theta)}{\partial\theta\,\partial\theta'} $$
can be neglected (note that this term has zero mean at $\theta_0$, the true value of $\theta$).
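A sketch of the Gauss-Newton iteration for the nonlinear regression $y_i = e^{\theta x_i} + v_i$ (the data are hypothetical and noiseless, so the residuals vanish at the truth and convergence is fast):

```python
import math

x = [0.1 * k for k in range(1, 11)]                # 0.1, ..., 1.0
y = [math.exp(0.5 * xi) for xi in x]               # true theta = 0.5, no noise

def gn_step(theta):
    # residuals r_i = y_i - exp(theta*x_i);  dr_i/dtheta = -x_i*exp(theta*x_i)
    r = [yi - math.exp(theta * xi) for xi, yi in zip(x, y)]
    dr = [-xi * math.exp(theta * xi) for xi in x]
    f = sum(di * ri for di, ri in zip(dr, r))      # gradient of (1/2) r'r
    G = sum(di * di for di in dr)                  # Gauss-Newton "Hessian"
    return theta - f / G

theta = 0.3
for _ in range(25):
    theta = gn_step(theta)

print(theta)
```

Note that `G` only uses first derivatives of the residuals, which is the Gauss-Newton simplification of the full Hessian.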
$$ Q_n(\beta) = \mathrm{E}_n\left[(y - x'\beta)^2\right] + \frac{\lambda_n}{n}\sum_{j=1}^p |\beta_j|, $$
where $\beta = (\beta_1, \dots, \beta_p)'$ and $\lambda_n$ is a sequence of positive numbers such that
$$ \frac{\lambda_n}{\sqrt{n}} \to \lambda_0 \ge 0. $$
$$ V_n(u) = \mathrm{E}_n\left[\left(\varepsilon_0 - x'u/\sqrt{n}\right)^2 - \varepsilon_0^2\right] + \frac{\lambda_n}{n}\sum_{j=1}^p \sqrt{n}\left(\left|\beta_j + u_j/\sqrt{n}\right| - |\beta_j|\right). $$
⁴ This material does not go into the exam.
while
$$ \frac{\lambda_n}{\sqrt{n}}\sum_{j=1}^p \sqrt{n}\left(\left|\beta_j + u_j/\sqrt{n}\right| - |\beta_j|\right) \to \lambda_0 \sum_{j=1}^p \left[u_j\,\mathrm{sgn}(\beta_j)\,1(\beta_j \ne 0) + |u_j|\,1(\beta_j = 0)\right]. $$
Thus, $V_n(u) \to_d V(u)$ and since $V_n(u)$ is convex and $V(u)$ has a unique minimum, it follows that
$$ \sqrt{n}(\hat\beta_n - \beta_0) = \arg\min_u V_n(u) \to_d \arg\min_u V(u). $$
where
$$ \Sigma(\tau) = V^{-1}(\tau)\, K(\tau)\, V^{-1}(\tau) $$
and
$$ K(\tau) = \mathrm{E}\left[(1(\varepsilon_0 \le 0) - \tau)(1(\varepsilon_0 \le 0) - \tau)\, xx'\right]. $$
If the quantile model is correctly specified, then $\Sigma(\tau) = \tau(1 - \tau)\, V^{-1}(\tau)\, \mathrm{E}[xx']\, V^{-1}(\tau)$.
How do we estimate $\Sigma(\tau)$? This is a difficult problem. See Escanciano and Goh ("Quantile-Regression Inference With Adaptive Control of Size", Journal of the American Statistical Association, 114, 382-393, 2019) for a recent proposal.
$$ \beta_T(P) = \mathrm{E}_P[T(X)], \qquad P \in \mathcal{P}, $$
which is the type I error probability of $T(X)$ when $P \in \mathcal{P}_0$ and one minus the type II error probability of $T(X)$ when $P \in \mathcal{P}_1$.

With a sample of a fixed size, we are not able to minimize the two error probabilities simultaneously. Our approach involves maximizing the power $\beta_T(P)$ for all $P \in \mathcal{P}_1$ (i.e., minimizing the type II error probability) over all tests $T$ satisfying
$$ \sup_{P\in\mathcal{P}_0} \beta_T(P) \le \alpha, $$
where $\alpha \in [0, 1]$ is a given level of significance. The left-hand side of the last expression is defined to be the size of $T$. The level of significance $\alpha$ is often small (e.g. 0.1, 0.05 or 0.01), so a type I error is considered a more serious error than a type II error.
Example 88 (Normal variables) Let $X_1, \dots, X_n$ be i.i.d. $N(\mu, \sigma^2)$ random variables. Suppose that $H_0: \mu \le 0$ and $H_1: \mu > 0$. A popular test is the t-test
$$ T(X) = 1(t_n > c), $$
where
$$ t_n = \frac{\sqrt{n}\,\bar{X}}{S} $$
is the t test statistic and $c$ is a constant to be determined so that $\sup_{P\in\mathcal{P}_0} \beta_T(P) \le \alpha$. Note $t_n = Z_n + \delta_n$, where $Z_n = \sqrt{n}(\bar{X} - \mu)/S$ and $\delta_n = \sqrt{n}\,\mu/S$. Then,
$$ \sup_{P\in\mathcal{P}_0} \beta_T(P) = \sup_{P\in\mathcal{P}_0} \mathrm{E}_P[T(X)] = \sup_{P\in\mathcal{P}_0} P(Z_n + \delta_n > c) = P_0(Z_n > c), $$
which again is known. The t-test then rejects $H_0$ at significance level $\alpha$ if $t_n > c_{n,\alpha}$. The critical region is $S(\alpha) = \{x : t_n(x) > c_{n,\alpha}\}$. The p-value is defined as the smallest significance level at which $H_0$ is rejected given the data, i.e. $\hat\alpha(x) = \inf\{\alpha \in (0, 1) : x \in S(\alpha)\}$.
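A minimal sketch of the one-sided t-test with an asymptotic critical value; the data are hypothetical, and 1.645 is used as an approximate 0.95 quantile of the standard normal:

```python
import math

# hypothetical sample with a clearly positive mean
X = [0.8, 1.2, 0.5, 1.0, 1.5, 0.9, 1.1, 0.7, 1.3, 1.0]
n = len(X)
xbar = sum(X) / n
S = math.sqrt(sum((xi - xbar) ** 2 for xi in X) / (n - 1))  # sample std. dev.
tn = math.sqrt(n) * xbar / S                                # t statistic

z_alpha = 1.645        # approximate 0.95 quantile of N(0,1), alpha = 0.05
reject = tn > z_alpha
print(round(tn, 2), reject)
```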
Unless
$$ \alpha = \sum_{j=m+1}^n \binom{n}{j} p_0^j (1 - p_0)^{n-j} $$
for some integer $m$, in which case we can choose $\gamma = 0$, the UMP test $T^*$ is a randomized test.

An interesting phenomenon in this example is that the UMP test $T^*$ does not depend on $p_1$. In such a case, $T^*$ is in fact a UMP test for testing $H_0: p = p_0$ versus $H_1: p > p_0$. This last property generalizes to families with a monotone likelihood ratio property (see Shao for the definition).

An interesting application of this example is to Backtesting in risk management.
For two-sided hypotheses, UMP tests are rare. Imposing unbiasedness helps with real-valued parameters. More generally, for multivariate parameters, UMP tests typically do not exist, and further restrictions (such as invariance) need to be introduced to define UMP tests.
1. (a) Derive the most powerful test of size $\alpha = 0.05$ for testing $H_0: \theta = \theta_0$ against $H_1: \theta = \theta_1$, where $\theta_0$ and $\theta_1$ are known constants, $0 < \theta_0 < \theta_1$. A complete answer should include an explicit description of the critical region, using a known distribution. [Hint: $(2/\theta)\sum_{i=1}^n X_i$ has a chi-square distribution with $2n$ degrees of freedom. You do not need to prove this result.]

(b) Argue that the test obtained in (a) is also a uniformly most powerful size 0.05 test for testing $H_0: \theta = \theta_0$ against $H_1: \theta > \theta_0$. Express the power of the test in terms of the CDF, say $G_n(\cdot)$, of a chi-square r.v. with $2n$ degrees of freedom. Is the power a decreasing or increasing function of $\theta$? Explain.

(c) Propose an asymptotic test for the problem in (b) with limiting size $\alpha$ based on the CLT and the fact that $Var(X_1) = \theta^2$, and show that this test is consistent.

SOL: (a) By the Neyman-Pearson Lemma a UMP test of size $\alpha$ is
$$ T^*(X) = \begin{cases} 1 & \text{if } f_1(X) > c f_0(X) \\ \gamma & \text{if } f_1(X) = c f_0(X) \\ 0 & \text{if } f_1(X) < c f_0(X), \end{cases} $$
where
$$ f_j(X) = \theta_j^{-n} \exp\left(-\frac{Y}{\theta_j}\right), \qquad j = 0, 1, $$
and $Y = \sum_{i=1}^n X_i$. Thus, since the distribution of the LR $\frac{f_1(X)}{f_0(X)}$ is continuous and a monotone increasing function of $Y$, we deduce there is a constant $m$ such that
$$ T^*(X) = T^*(Y) = \begin{cases} 1 & \text{if } Y > m \\ 0 & \text{if } Y < m, \end{cases} $$
where $m$ satisfies $\alpha = \mathrm{E}_0[T^*(Y)] = P_0(Y > m)$. By the hint, $2Y/\theta_0 \sim \chi^2_{2n}$ under $H_0$, so $m = \theta_0\,\chi^2_{2n,1-\alpha}/2$.
(b) Since the UMP test $T^*$ does not depend on $\theta_1$, this test is also a UMP test for testing $H_0: \theta = \theta_0$ against $H_1: \theta > \theta_0$. The power function of the UMP test is
$$ \beta_{T^*}(\theta) = \mathrm{E}_\theta[T^*(X)] = P_\theta(2Y/\theta > 2m/\theta) = 1 - G_n\left(\chi^2_{2n,1-\alpha}\,\theta_0/\theta\right), $$
which is an increasing function of $\theta$. Note $\beta_{T^*}(\theta_0) = \alpha$, $\beta_{T^*}(\theta) < \alpha$ if $\theta < \theta_0$ and $\beta_{T^*}(\theta) > \alpha$ if $\theta > \theta_0$.

(c) From the CLT
$$ \sqrt{n}\,\frac{\bar{X} - \theta}{\theta} \to_d N(0, 1), $$
and under any fixed alternative the power of the resulting test converges to $1 - \Phi(-\infty) = 1$ as $n \to \infty$.
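The power expression in (b) can be evaluated directly: $Y = \sum_i X_i$ is a sum of $n$ exponentials with mean $\theta$, whose survival function has the closed form $P_\theta(Y > m) = e^{-m/\theta}\sum_{k=0}^{n-1}(m/\theta)^k/k!$. The sketch below (an illustrative calibration, with $m$ found by bisection rather than from chi-square tables) checks that the power is increasing in $\theta$:

```python
import math

def erlang_survival(m, n, theta):
    # P(Y > m) for Y = sum of n iid Exponential(mean theta) variables:
    # closed form exp(-m/theta) * sum_{k<n} (m/theta)^k / k!
    lam = m / theta
    return math.exp(-lam) * sum(lam ** k / math.factorial(k) for k in range(n))

n = 5
theta0 = 1.0
# calibrate the critical value m so that the size is alpha = 0.05:
# solve P_{theta0}(Y > m) = 0.05 by bisection (survival is decreasing in m)
lo, hi = 0.0, 50.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if erlang_survival(mid, n, theta0) > 0.05:
        lo = mid
    else:
        hi = mid
m = 0.5 * (lo + hi)

# the power P_theta(Y > m) is increasing in theta
powers = [erlang_survival(m, n, th) for th in (0.5, 1.0, 2.0, 4.0)]
print(all(a < b for a, b in zip(powers, powers[1:])))  # True
```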
Remark 2 For composite null and composite alternatives UMP tests are harder to find. The following important example illustrates this point.

1. Let $X_1, \dots, X_n$ be i.i.d. $N_k(\mu, \Sigma)$ random variables, with $\Sigma$ positive definite and known.

(a) Let $c$ be a fixed vector in $\mathbb{R}^k$, and fix $\mu_1$ with $c'\mu_1 > 0$. (i) Derive the most powerful test of size $\alpha = 0.05$ for testing $H_0: \mu = \mu_0$ against $H_1: \mu = \mu_1$, where $\mu_0 = \mu_1 - (c'\mu_1/c'\Sigma c)\,\Sigma c$. (ii) Argue that the test obtained is also a uniformly most powerful size 0.05 test for testing $H_0: c'\mu = 0$ against $H_1: c'\mu > 0$.

(b) Consider the two-sample problem where we have independent observations, $Z_1, \dots, Z_n$ i.i.d. $N(\mu_1, \sigma_1^2)$ and $Y_1, \dots, Y_n$ i.i.d. $N(\mu_2, \sigma_2^2)$ (treatment and control groups, respectively). (i) Assuming $\sigma_1^2$ and $\sigma_2^2$ are known, derive the most powerful test of size $\alpha = 0.05$ for testing no treatment effect $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 > \mu_2$. [Hint: Use part (a) for a specific vector $c$]. (ii) For future planning of an experiment, you want to compute the minimal sample size needed to achieve a power of at least $1 - \beta$ for an alternative of size $\Delta = \mu_1 - \mu_2 > 0$; how would you compute such a sample size? What happens with this sample size when $\Delta$ is very small?
SOL: (a) (i) Computing the likelihood ratio and using $\mu_0 = \mu_1 - (c'\mu_1/c'\Sigma c)\,\Sigma c$ we obtain
$$ \frac{f_1(X)}{f_0(X)} = \exp\left(\frac{c'\mu_1}{c'\Sigma c}\, c'Y - n\,\frac{(c'\mu_1)^2}{2\,c'\Sigma c}\right), $$
where $Y = \sum_{i=1}^n X_i$. The UMP test thus rejects for large values of $c'Y$; precisely, $T^*(Y) = 1(t_n > z_\alpha)$, where $t_n = \sqrt{n}\, c'\bar{X}/\sqrt{c'\Sigma c}$. Since $c'Y$ has a normal distribution $N(nc'\mu, nc'\Sigma c)$, which is continuous, we can set $\gamma = 0$, with the critical value the corresponding quantile of the normal distribution. (ii) The UMP test $T^*$ does not depend on $\mu_0$ and $\mu_1$, $c'\mu_0 = 0$, and hence it is also a UMP test for testing $H_0: c'\mu = 0$ vs $H_1: c'\mu > 0$.

(b) (i) Define $X = (Z, Y)$. By independence of $Z$ and $Y$, $X$ is bivariate normal with mean $\mu = (\mu_1, \mu_2)$ and diagonal variance-covariance matrix (with known diagonal elements $\sigma_1^2$ and $\sigma_2^2$). Taking $c = (1, -1)$ and applying part (a) we conclude that the UMP test for testing $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 > \mu_2$ is $T^*(Y) = 1(t_n > z_\alpha)$, where $t_n = \sqrt{n}(\bar{Z} - \bar{Y})/\sqrt{\sigma_1^2 + \sigma_2^2}$. (ii) Define $Z_n = \sqrt{n}(\bar{Z} - \bar{Y} - \Delta)/\sqrt{\sigma_1^2 + \sigma_2^2}$ and $\delta_n = \sqrt{n}\,\Delta/\sqrt{\sigma_1^2 + \sigma_2^2}$. The power function of the UMP test follows as in the one-sample case. When $\Delta$ is very small the required sample size is very large (unless $\sigma_1^2 + \sigma_2^2$ is very small too).
Example 93 (Nonparametric test for mean) Let $X_1, \dots, X_n$ be i.i.d. random variables with finite variance. We want to test $H_0: \mu = 0$ vs $H_1: \mu = \mu_1$, $\mu_1 > 0$, but we are not willing to assume a parametric distribution for the $X$s. That is, we relax the normal distribution assumption. We could use the t-statistic as before, but when we try to compute $c$ such that the test has level $\alpha$,
$$ P_0(t_n > c) = \alpha, $$
this $c = J_n^{-1}(1-\alpha; P_0)$ is unknown, as the quantile $J_n^{-1}(1-\alpha; P_0)$ of the distribution $J_n(x; P_0) = P_0(t_n \le x)$ depends on $P_0$, which is unknown (we only know its mean is
zero, but other aspects of the distribution that may affect $J_n$ are unknown). As we
do not know the finite sample distribution of the test statistic, we are not even able
to define the test. How do we address this problem? By resorting to asymptotic theory.
Although $J_n(x; P_0) = P_0(t_n \le x)$ is not known, from the CLT we know
$$J_n(x; P_0) \to \Phi(x), \quad \text{as } n \to \infty.$$
Thus, choosing $c = \Phi^{-1}(1-\alpha) \equiv z_\alpha$, the last convergence implies that the test has asymptotic level $\alpha$.
Note the test $T_n = 1(t_n > z_\alpha)$ does not necessarily have level $\alpha$, but we say it is a uni-
formly asymptotic level $\alpha$ test, i.e.
$$\limsup_{n \to \infty} \sup_{P \in \mathcal{P}_0} \alpha_{T_n}(P) \le \alpha,$$
where $\alpha_{T_n}(P)$ denotes the rejection probability of $T_n$ under $P$. With a composite null this may be hard to achieve, and we simply require a pointwise
asymptotic level $\alpha$ test, i.e. for each $P \in \mathcal{P}_0$,
$$\limsup_{n \to \infty} \alpha_{T_n}(P) \le \alpha.$$
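As a concrete illustration, the CLT-based one-sided test for the mean can be coded directly. The following is a minimal sketch; the data-generating choices and the function name `t_test_mean` are my own, not from the notes.

```python
import math
import random
from statistics import NormalDist

def t_test_mean(x, mu0=0.0, alpha=0.05):
    """One-sided asymptotic test of H0: mu = mu0 vs H1: mu > mu0.

    Rejects when t_n = sqrt(n)(xbar - mu0)/s_n exceeds z_alpha, the
    (1 - alpha) standard normal quantile (approx. 1.645 for alpha = 0.05).
    """
    n = len(x)
    xbar = sum(x) / n
    s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)  # sample variance
    tn = math.sqrt(n) * (xbar - mu0) / math.sqrt(s2)
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return tn, tn > z_alpha

# Hypothetical data with true mean 0.5, so H0: mu = 0 should be rejected.
random.seed(0)
x = [random.gauss(0.5, 1.0) for _ in range(200)]
tn, reject = t_test_mean(x, mu0=0.0)
```

Note that only the CLT is used here: no normality of the $X_i$ is assumed, exactly as in Example 93.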
Let
$$\theta = \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix}, \qquad \theta_1 \;\; q \times 1, \quad \theta_2 \;\; s \times 1, \quad \theta \;\; p \times 1, \quad s = p - q.$$
True value:
$$\theta_0 = \begin{pmatrix} \theta_{01} \\ \theta_{02} \end{pmatrix}.$$
The null hypothesis is composite if $q < p$:
$$H_0: \theta_{01} = 0. \tag{20}$$
It is simple if $q = p$.
Alternative hypothesis:
$$H_1: \theta_{01} \neq 0.$$
The parameter $\theta$ needn't be the "natural" parameters. Let $\eta$ be the "natural" parame-
ters and suppose we test
$$H_0: g(\eta_0) = 0$$
for a smooth $q \times 1$ function $g$. Put
$$\theta_1 = g(\eta), \qquad \theta_2 = \eta_2, \qquad \theta = \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix}.$$
In the linear case, $\theta_1 = R\eta - r$, $\theta_2 = \eta_2$,
$$\Rightarrow \quad \theta = \begin{pmatrix} R \\ O \;\; I_s \end{pmatrix} \eta - \begin{pmatrix} r \\ 0 \end{pmatrix}
\quad \Rightarrow \quad \eta = \begin{pmatrix} R \\ O \;\; I_s \end{pmatrix}^{-1} \left( \theta + \begin{pmatrix} r \\ 0 \end{pmatrix} \right).$$
Thus there is no real loss of generality in $H_0$ in (20). Note that in applied problems
there may be a sequence of tests: e.g. first test $\theta_{11} = 0$; if rejected, test $(\theta_{11}', \theta_{12}')' = 0$.
So the specification of $\theta_1$ may change.
The desired consistency property of a test with critical value $c$ is
$$\pi_{n,c}(\theta_{01}) \to 1 \text{ as } n \to \infty, \quad \forall \theta_{01} \neq 0, \; \forall c > 0.$$
We will discuss a trio of asymptotic tests: the Wald, Lagrange Multiplier (LM,
also called Rao or Score) and Likelihood Ratio (LR) tests, which can be used for testing
the null hypothesis. The three statistics are asymptotically equivalent in that they
share the same asymptotic distribution ($\chi^2_q$), and their properties can be extended to
more general situations, see Econometrics II. Some of the results are more generally
valid for asymptotically normal estimators, as is the case, for instance, for Wald tests.
ASS. 2
$$\sqrt{n}(\hat{\theta}_n - \theta_0) \to_d N(0, A) \quad \forall \theta_0 \in \Theta,$$
where $A = A(\theta_0) > 0$ and
$$A(\theta) = \begin{pmatrix} A_{11}(\theta) & A_{12}(\theta) \\ A_{21}(\theta) & A_{22}(\theta) \end{pmatrix},$$
with $A_{11}$ of dimension $q \times q$ and $A_{22}$ of dimension $s \times s$.
ASS. 3 Under $H_0$,
$$\hat{A}_{11,n} \to_p A_{11}\!\begin{pmatrix} 0 \\ \theta_{02} \end{pmatrix} \equiv A_{11}^0.$$
The Wald statistic is
$$W_n = n\, \hat{\theta}_{1n}' \hat{A}_{11,n}^{-1} \hat{\theta}_{1n}.$$
PROOF. Under $H_0$, ASS. 2 $\Rightarrow$ $\sqrt{n}\,\hat{\theta}_{1n} = \sqrt{n}(\hat{\theta}_{1n} - \theta_{01}) \to_d N(0, A_{11}^0)$,
$$\Rightarrow \quad n\, \hat{\theta}_{1n}' (A_{11}^0)^{-1} \hat{\theta}_{1n} \to_d \chi^2_q.$$
Then
$$W_n = n\, \hat{\theta}_{1n}' \hat{A}_{11,n}^{-1} \hat{\theta}_{1n}
= n\, \hat{\theta}_{1n}' (A_{11}^0)^{-1} \hat{\theta}_{1n} + \sqrt{n}\, \hat{\theta}_{1n}' \left( \hat{A}_{11,n}^{-1} - (A_{11}^0)^{-1} \right) \sqrt{n}\, \hat{\theta}_{1n}$$
$$= n\, \hat{\theta}_{1n}' (A_{11}^0)^{-1} \hat{\theta}_{1n} + O_p(1)\, o_p(1)\, O_p(1)
= n\, \hat{\theta}_{1n}' (A_{11}^0)^{-1} \hat{\theta}_{1n} + o_p(1) \to_d \chi^2_q.$$
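The Wald statistic above is a one-line computation once $\hat\theta_n$ and $\hat A_{11,n}$ are available. The sketch below uses hypothetical numbers for $\hat\theta_n$ and $\hat A_n$ (they are illustrative, not from any model in the notes) and a hard-coded $\chi^2_2$ 95% critical value.

```python
import numpy as np

def wald_subvector(theta_hat, A_hat, q, n):
    """Wald statistic W_n = n * theta1' A11^{-1} theta1 for H0: theta1 = 0,
    where theta1 is the first q components of theta_hat and A_hat estimates
    the asymptotic variance of sqrt(n)(theta_hat - theta0)."""
    theta1 = theta_hat[:q]
    A11 = A_hat[:q, :q]
    # solve(A11, theta1) avoids forming the inverse explicitly
    return float(n * theta1 @ np.linalg.solve(A11, theta1))

# Hypothetical estimates with q = 2 restricted components and p = 3.
theta_hat = np.array([0.05, -0.02, 1.3])
A_hat = np.diag([1.0, 1.0, 2.0])
Wn = wald_subvector(theta_hat, A_hat, q=2, n=500)
reject = Wn > 5.991  # chi-square(2) 95% critical value
```

Here $W_n = 500(0.05^2 + 0.02^2) = 1.45$, well below the critical value, so $H_0$ is not rejected.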
ASS. 4
$$\hat{A}_{11,n} \to_p A_{11}(\theta_0) > 0, \quad \forall \theta_0.$$
PROOF:
$$\sqrt{n}\,(\hat{\theta}_{1n} - \theta_{01}) \to_d N(0, A_{11}(\theta_0))$$
$$\Rightarrow \quad \sqrt{n}\,\hat{\theta}_{1n} = \sqrt{n}\,\theta_{01} + O_p(1).$$
$\Rightarrow$ By ASS. 4,
$$W_n = \left( \sqrt{n}\,\theta_{01} + O_p(1) \right)' \left( A_{11}^{-1} + o_p(1) \right) \left( \sqrt{n}\,\theta_{01} + O_p(1) \right)$$
$$= n\, \theta_{01}' A_{11}^{-1} \theta_{01} + 2\, O_p(1)\, A_{11}^{-1} \sqrt{n}\,\theta_{01} + O_p(1)
= n\, \theta_{01}' A_{11}^{-1} \theta_{01} + O_p(n^{1/2}),$$
$$\Rightarrow \quad \pi_c(\theta_{01}) = \Pr(W_n > c \mid \theta_{01})
= \Pr\left( n\, \theta_{01}' A_{11}^{-1} \theta_{01} + O_p(n^{1/2}) > c \mid \theta_{01} \right) \to 1,$$
because $n\, \theta_{01}' A_{11}^{-1} \theta_{01} > 0$ and dominates the $O_p(n^{1/2})$ term as $n \to \infty$.
Examples of $Q_n(\theta)$ are the sum of squared residuals for the LSE or minus the log likeli-
hood for the MLE.
Put
$$Q_{\theta,n}(\theta) = \frac{\partial Q_n(\theta)}{\partial \theta}, \qquad Q_{\theta\theta,n}(\theta) = \frac{\partial^2 Q_n(\theta)}{\partial \theta\, \partial \theta'},$$
where
$$\sqrt{n}\, Q_{\theta,n}(\theta_0) \to_d N(0, D), \qquad Q_{\theta\theta,n}(\theta_0) \to_p E.$$
Under suitable conditions
$$\sqrt{n}(\hat{\theta}_n - \theta_0) \to_d N(0, E^{-1} D E^{-1}).$$
This is true in some cases, e.g. MLE under correct specification, but it is not
true in others, e.g. LSE. Henceforth we will assume for simplicity in the arguments
that "$E = D$". For Wald tests the theory follows for general cases, but for LM and
LR tests we need $E = D$. Therefore, we shall consider Assumptions 2 and 3 with
$A = D^{-1}$.
A word on notation:
$$D = \begin{pmatrix} D_{11} & D_{12} \\ D_{21} & D_{22} \end{pmatrix}, \qquad
D^{-1} = \begin{pmatrix} D^{11} & D^{12} \\ D^{21} & D^{22} \end{pmatrix}.$$
Notice that e.g. $D^{11} = (D_{11} - D_{12} D_{22}^{-1} D_{21})^{-1}$.
The asymptotic theory of the Wald test follows as in the previous section. Let's restate
our findings on the Wald test in the context of extremum estimators:
ASS. 5
$$\sqrt{n}\, Q_{\theta,n}(\theta_0) \to_d N(0, D);$$
$$\theta_n \to_p \theta_0 \;\Rightarrow\; Q_{\theta\theta,n}(\theta_n) \to_p D, \quad \forall \theta_n.$$
PROOF:
$$\Rightarrow \quad \sqrt{n}(\hat{\theta}_n - \theta_0) \to_d N(0, D^{-1})$$
$$\Rightarrow \quad \sqrt{n}(\hat{\theta}_{1n} - \theta_{01}) \to_d N(0, D^{11}).$$
PROOF:
$$\sqrt{n}\,\hat{\theta}_{1n} = \sqrt{n}\,\theta_{01} + O_p(1) \quad \text{(Theorem 102)}$$
$$(\to_p \infty).$$
Let
$$\tilde{\theta}_n = \arg\min_{H_0} Q_n(\theta), \qquad
\tilde{\theta}_n = \begin{pmatrix} 0 \\ \tilde{\theta}_{2n} \end{pmatrix},
\quad \text{i.e. } Q_{2n}(\tilde{\theta}_n) = 0, \qquad Q_{in}(\theta) = \frac{\partial Q_n(\theta)}{\partial \theta_i}.$$
Put (the Lagrangian)
$$L_n(\theta) = Q_n(\theta) - \lambda' \theta_1.$$
$$\frac{\partial L_n(\theta)}{\partial \theta_1} = Q_{1n}(\theta) - \lambda \;\Rightarrow\; \tilde{\lambda}_n = Q_{1n}(\tilde{\theta}_n) \quad (\tilde{\theta}_n \text{ satisfies } H_0);$$
$$\frac{\partial L_n(\tilde{\theta}_n)}{\partial \theta_2} = Q_{2n}(\tilde{\theta}_n) = 0.$$
ASS. 6 Under $H_0$,
$$\tilde{D}^{11}_n \to_p D^{11}\!\begin{pmatrix} 0 \\ \theta_{02} \end{pmatrix} \equiv D^{11}_0.$$
Remark: since $Q_{2n}(\tilde{\theta}_n) = 0$, the LM test is also called the Score test:
$$LM_n = n\, Q_{\theta,n}(\tilde{\theta}_n)' \tilde{D}_n^{-1} Q_{\theta,n}(\tilde{\theta}_n).$$
PROOF:
$$Q_{in}^0 = Q_{in}(\theta_0), \qquad Q_{\theta,n}^0 = Q_{\theta,n}(\theta_0) = \begin{pmatrix} Q_{1n}^0 \\ Q_{2n}^0 \end{pmatrix}, \qquad \theta_0 = \begin{pmatrix} 0 \\ \theta_{02} \end{pmatrix}.$$
$$\Rightarrow \quad \tilde{\theta}_{2n} - \theta_{02} = -\tilde{Q}_{22,n}^{-1} Q_{2n}^0.$$
$$\Rightarrow \quad \sqrt{n}\,\tilde{\lambda}_n = \left( I, \; -\tilde{Q}_{12,n} \tilde{Q}_{22,n}^{-1} \right) \sqrt{n}\, Q_{\theta,n}^0$$
$$\to_d N\left( 0, \; \left( I, \; -D_{12} (D_{22})^{-1} \right) \begin{pmatrix} D_{11} & D_{12} \\ D_{21} & D_{22} \end{pmatrix} \begin{pmatrix} I \\ -(D_{22})^{-1} D_{21} \end{pmatrix} \right)$$
$$= N\left(0, \; D_{11} - D_{12} (D_{22})^{-1} D_{21}\right)\big|_{H_0} = N\left(0, (D^{11}_0)^{-1}\right).$$
$$\Rightarrow \quad LM_n = n\, \tilde{\lambda}_n' D^{11}_0 \tilde{\lambda}_n + n\, \tilde{\lambda}_n' \left( \tilde{D}^{11}_n - D^{11}_0 \right) \tilde{\lambda}_n
\to_d \chi^2_q.$$
ASS. 7
$$\tilde{D}^{11}_n \to_p D^{11}, \quad \forall \theta_0.$$
PROOF:
$$\sqrt{n}\,\tilde{\lambda}_n = \left( I, \; -\tilde{Q}_{12,n} \tilde{Q}_{22,n}^{-1} \right) \sqrt{n}\, Q_{\theta,n}^0 \quad \text{(i.e. with } \theta_{01} \neq 0\text{)}$$
$$= \left( I, \; -\tilde{Q}_{12,n} \tilde{Q}_{22,n}^{-1} \right) \sqrt{n} \left( Q_{\theta,n}\!\begin{pmatrix} 0 \\ \theta_{02} \end{pmatrix} + \frac{\partial}{\partial \theta_1'} Q_{\theta,n}(\bar{\theta})\, \theta_{01} \right)$$
$$= O_p(1) + O_p(1)\sqrt{n} \;\to_p\; \infty.$$
PROOF:
$$0 = Q_{2n}(\tilde{\theta}_n) = Q_{2n}^0 + \tilde{Q}_{22,n} (\tilde{\theta}_{2n} - \theta_{02}) \tag{24}$$
$$\Rightarrow \quad \sqrt{n}\left( \hat{\theta}_n - \tilde{\theta}_n \right) = \sqrt{n} \begin{pmatrix} \hat{\theta}_{1n} \\ \hat{\theta}_{2n} - \tilde{\theta}_{2n} \end{pmatrix}
= \begin{pmatrix} I \\ -\tilde{Q}_{22,n}^{-1} \tilde{Q}_{21,n} \end{pmatrix} \sqrt{n}\,\hat{\theta}_{1n} + o_p(1)
= \begin{pmatrix} I \\ -(D_{22})^{-1} D_{21} \end{pmatrix} \sqrt{n}\,\hat{\theta}_{1n} + o_p(1)$$
under $H_0$. Also
$$\sqrt{n}\,\hat{\theta}_{1n} \to_d N(0, D^{11}_0).$$
Thus
$$LR_n = n\left( \tilde{\theta}_n - \hat{\theta}_n \right)' Q_{\theta\theta,n} \left( \tilde{\theta}_n - \hat{\theta}_n \right)
= n\left( \tilde{\theta}_n - \hat{\theta}_n \right)' D \left( \tilde{\theta}_n - \hat{\theta}_n \right) + o_p(1).$$
But
$$\left( I, \; -D_{12} (D_{22})^{-1} \right) \begin{pmatrix} D_{11} & D_{12} \\ D_{21} & D_{22} \end{pmatrix} \begin{pmatrix} I \\ -(D_{22})^{-1} D_{21} \end{pmatrix}
= D_{11} - D_{12}(D_{22})^{-1}D_{21} = (D^{11}_0)^{-1}.$$
$$\Rightarrow \quad LR_n = n\, \hat{\theta}_{1n}' (D^{11}_0)^{-1} \hat{\theta}_{1n} + o_p(1) \to_d \chi^2_q.$$
Some Remarks
Which test to use?
– The Wald test only uses estimators under the alternative (unrestricted), the LM
test under the null (restricted), and the LR test under both the null and the alternative.
– In the theory above the LM and LR tests require the key assumption $E = D$.
These tests can be robustified against misspecification, i.e. they can also be applied
to cases where $E \neq D$ if they are suitably modified, as the following important
application illustrates.
where the second equality uses the independence, and the third the definition of $Y_i$.
To illustrate these points, suppose we run the regression
$$Y_i = \alpha_0 + \beta_0 D_i + \gamma_0' X_i + \varepsilon_i.$$
That is, $\beta_0$ can be interpreted as the average treatment effect. The first identification
above is nonparametric, while the second uses additional functional form assumptions
and is less appealing.
We now provide the expression for a robust (to heteroskedasticity) LM test based
on the OLS objective function for the hypotheses
$$H_0: \beta_0 = 0 \quad \text{vs} \quad H_1: \beta_0 \neq 0.$$
Strictly speaking, the previous theory for the LM test does not apply to this problem
because this is a situation where $E \neq D$, even under conditional homoskedastic-
ity. We can however construct a version of the LM test that is robust to conditional
heteroskedasticity as follows (see Wooldridge, page 60 for discussion). Write the re-
gression as
$$Y_i = \theta_0' W_i + \varepsilon_i,$$
where $W_i = (1, D_i, X_i')'$. Then the objective function of OLS is
$$Q_n(\theta) = \frac{1}{2n} \sum_{i=1}^n (Y_i - W_i'\theta)^2.$$
Then
$$Q_{\theta,n}(\theta) = \frac{\partial}{\partial \theta} Q_n(\theta) = -\frac{1}{n} \sum_{i=1}^n (Y_i - W_i'\theta) W_i.$$
Let
$$\tilde{\theta}_n = \arg\min_{\beta = 0} Q_n(\theta).$$
That is, $\tilde{\theta}_n = (\hat{\alpha}, 0, \hat{\gamma}')'$, where $\hat{\alpha}$ and $\hat{\gamma}$ are the OLS estimates of the following restricted
regression:
$$Y_i = \alpha_0 + \gamma_0' X_i + \varepsilon_i.$$
Then the Lagrange multiplier test statistic can be constructed as
$$LM_n = n\, Q_{\theta,n}(\tilde{\theta}_n)' \tilde{D}_n^{-1} Q_{\theta,n}(\tilde{\theta}_n).$$
Hence, we need to compute the asymptotic variance of the middle term. It can be
shown that under the null
$$\frac{1}{\sqrt{n}} \sum_{i=1}^n (Y_i - \tilde{\theta}_n' W_i) D_i = \frac{1}{\sqrt{n}} \sum_{i=1}^n (Y_i - \theta_0' W_i) D_i^r + o_p(1) \to_d N(0, B),$$
where $D_i^r$ is the population OLS error in the regression of $D_i$ against $1$ and $X_i$ and
$B = E[\varepsilon_i^2 (D_i^r)^2]$. To get some intuition on the last display, note we can always write,
by a least squares projection,
$$D_i = \pi_0 + \pi_1' X_i + D_i^r,$$
where, by construction, $D_i^r$ is uncorrelated with $X_i$ and has zero mean. On the
other hand, we know from the FOC of OLS and substituting from the last display
$$\frac{1}{\sqrt{n}} \sum_{i=1}^n (Y_i - \tilde{\theta}_n' W_i) D_i = \frac{1}{\sqrt{n}} \sum_{i=1}^n (Y_i - \tilde{\theta}_n' W_i) D_i^r,$$
$$\frac{1}{\sqrt{n}} \sum_{i=1}^n (Y_i - \tilde{\theta}_n' W_i) D_i^r = \frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i D_i^r - \sqrt{n}(\tilde{\theta}_n - \theta_0)' \frac{1}{n} \sum_{i=1}^n W_i D_i^r$$
and
$$\sqrt{n}(\tilde{\theta}_n - \theta_0)' \frac{1}{n} \sum_{i=1}^n W_i D_i^r \to_p 0,$$
since $E[W_i D_i^r] = (0, E[D_i D_i^r], 0')'$ and $\tilde{\theta}_n - \theta_0$ has a zero in the component corre-
sponding to $D_i$ under $H_0$.
Denote by $\hat{D}_i^r$ the OLS residual in the regression of $D_i$ against $1$ and $X_i$. Then,
the heteroskedasticity-robust LM statistic is given as
$$LM_n = \left( \frac{1}{\sqrt{n}} \sum_{i=1}^n \tilde{\varepsilon}_i D_i \right)' \left( \frac{1}{n} \sum_{i=1}^n \tilde{\varepsilon}_i^2 (\hat{D}_i^r)^2 \right)^{-1} \left( \frac{1}{\sqrt{n}} \sum_{i=1}^n \tilde{\varepsilon}_i D_i \right).$$
Since, under $H_0$,
$$\frac{1}{n} \sum_{i=1}^n \tilde{\varepsilon}_i^2 (\hat{D}_i^r)^2 \to_p B,$$
it follows that under $H_0$,
$$LM_n \to_d \chi^2_1.$$
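The construction above reduces to two auxiliary OLS regressions and a ratio of sums. A minimal numerical sketch follows; the function name `robust_lm` and the simulated DGP are my own illustrations, not from the notes.

```python
import numpy as np

def robust_lm(y, d, x):
    """Heteroskedasticity-robust LM statistic for H0: beta0 = 0 in
    Y = alpha + beta*D + gamma'X + eps.

    Steps: (1) restricted OLS of Y on (1, X) -> residuals eps_tilde;
    (2) OLS of D on (1, X) -> residuals D_r_hat;
    (3) LM_n = (sum eps_tilde*D_r_hat)^2 / sum (eps_tilde*D_r_hat)^2,
    asymptotically chi-square(1) under H0.
    """
    n = len(y)
    Z = np.column_stack([np.ones(n), x])                  # restricted regressors (1, X)
    eps = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]    # restricted residuals
    d_r = d - Z @ np.linalg.lstsq(Z, d, rcond=None)[0]    # residual of D on (1, X)
    s = eps * d_r
    return float(s.sum() ** 2 / (s ** 2).sum())

# Hypothetical heteroskedastic DGP with beta0 = 0, so H0 is true.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
d = 0.5 * x + rng.normal(size=n)
eps = (1 + 0.5 * np.abs(x)) * rng.normal(size=n)  # heteroskedastic errors
y = 1.0 + 0.8 * x + eps                           # beta0 = 0
lm0 = robust_lm(y, d, x)                          # should look like a chi-square(1) draw
```

Note the code uses $\sum \tilde\varepsilon_i \hat D_i^r$ in the outer terms, which equals $\sum \tilde\varepsilon_i D_i$ by the FOC of the restricted OLS.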
The LM test can be applied to Lasso and other machine learning methods if it is
implemented in the form
$$LM_n = \left( \frac{1}{\sqrt{n}} \sum_{i=1}^n \tilde{\varepsilon}_i \hat{D}_i^r \right)' \left( \frac{1}{n} \sum_{i=1}^n \tilde{\varepsilon}_i^2 (\hat{D}_i^r)^2 \right)^{-1} \left( \frac{1}{\sqrt{n}} \sum_{i=1}^n \tilde{\varepsilon}_i \hat{D}_i^r \right).$$
points in the support of $Z$, say $z$ and $w$ with $p(z) > p(w)$, then
This equation highlights the identification problem arising in the use of IV as a causal
estimand. This motivated the Monotonicity assumption: for all $z, w$ in the support of
$Z$, either $D_i(z) \geq D_i(w)$ for all $i$, or $D_i(z) \leq D_i(w)$ for all $i$. Under these conditions,
$$\frac{E[Y_i \mid Z_i = z] - E[Y_i \mid Z_i = w]}{E[D_i \mid Z_i = z] - E[D_i \mid Z_i = w]} = E[Y_i(1) - Y_i(0) \mid D_i(z) \neq D_i(w)].$$
The right hand side is called the Local Average Treatment Effect (LATE), and has a
causal interpretation. If $Z$ is binary, $z = 1$ and $w = 0$, then
$$\text{Cov}(Y_i - \alpha_0 - \beta_0 D_i, Z_i) = 0.$$
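With a binary instrument, the sample analogue of the LATE display above is a simple ratio of differences in means (the Wald IV estimator). A minimal sketch with tiny hypothetical data; the function name `wald_late` is mine.

```python
def wald_late(y, d, z):
    """Sample analogue of the LATE with binary instrument z:
    (E[Y|Z=1] - E[Y|Z=0]) / (E[D|Z=1] - E[D|Z=0])."""
    def mean_given(v, zval):
        sel = [vi for vi, zi in zip(v, z) if zi == zval]
        return sum(sel) / len(sel)
    num = mean_given(y, 1) - mean_given(y, 0)
    den = mean_given(d, 1) - mean_given(d, 0)
    return num / den

# Tiny hypothetical data where Y = 2*D exactly, so the ratio recovers 2.
z = [0, 0, 0, 0, 1, 1, 1, 1]
d = [0, 0, 0, 1, 1, 1, 1, 0]
y = [0, 0, 0, 2, 2, 2, 2, 0]
late = wald_late(y, d, z)  # -> 2.0
```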
Ideally, we would like finite sample confidence sets, but these are hard to find. If
$C_n$ is a uniform asymptotic confidence set, then the following is true: for any $\varepsilon > 0$
there exists $n(\varepsilon)$ such that the coverage of $C_n$ (i.e. $P(\theta \in C_n)$) is at least $1 - \alpha - \varepsilon$
for all $n > n(\varepsilon)$. With a pointwise asymptotic confidence set, there may not exist a
finite $n(\varepsilon)$. Unfortunately, commonly used confidence sets are pointwise asymptotic.
The typical construction for a pointwise asymptotic $1-\alpha$ confidence interval is
$$\hat{\theta}_n \pm z_{\alpha/2}\, se(\hat{\theta}_n),$$
where $z_\alpha = \Phi^{-1}(1-\alpha)$, which is based on the asymptotic normality result for the
univariate case, i.e.
$$\frac{\hat{\theta}_n - \theta_0}{se(\hat{\theta}_n)} \to_d N(0, 1).$$
This is often obtained by Slutsky's Theorem from
$$\sqrt{n}(\hat{\theta}_n - \theta_0) \to_d N(0, V)$$
and $se(\hat{\theta}_n) = \sqrt{\hat{V}/n}$, with $\hat{V} \to_P V$. For example, for the MLE, $se(\hat{\theta}_n) = I_n(\hat{\theta}_n)^{-1/2}$,
where $I_n(\hat{\theta}_n)$ is the Fisher information for the full sample ($I_n(\hat{\theta}_n) = n I(\hat{\theta}_n)$).
We can extend this to the multivariate case. Suppose we can prove the asymptotic
normality result
$$\sqrt{n}(\hat{\theta}_n - \theta_0) \to_d N(0, \mathbf{V}).$$
The construction in (28) is based on the Wald test statistic $T_n(\theta)$, but other tests
such as LM or LR could be used to construct pointwise asymptotic $1-\alpha$ confidence
sets.
Example 115 (Bernoulli variables) Let $X_1, \ldots, X_n$ be i.i.d. binary random vari-
ables with $p = P(X_1 = 1)$. Let $\hat{\theta}_n$ denote the sample mean $\bar{X}_n$. The central limit
theorem leads to
$$\sqrt{n}(\hat{\theta}_n - p) \to_d N(0, p(1-p)).$$
Thus, a pointwise asymptotic $1-\alpha$ confidence interval for $p$ is
$$\hat{\theta}_n \pm z_{\alpha/2}\, se(\hat{\theta}_n)$$
with
$$se(\hat{\theta}_n) = \sqrt{\frac{\hat{\theta}_n (1 - \hat{\theta}_n)}{n}}.$$
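The interval of Example 115 can be sketched in a few lines; the function name `bernoulli_ci` and the example counts are mine.

```python
import math
from statistics import NormalDist

def bernoulli_ci(x, alpha=0.05):
    """Pointwise asymptotic 1-alpha Wald CI for p based on the CLT:
    p_hat +/- z_{alpha/2} * sqrt(p_hat*(1-p_hat)/n)."""
    n = len(x)
    p_hat = sum(x) / n
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

# 30 successes out of n = 100, so p_hat = 0.3 and se ~ 0.046.
lo, hi = bernoulli_ci([1] * 30 + [0] * 70)  # -> roughly (0.210, 0.390)
```

This interval is pointwise asymptotic: its coverage can be poor for $p$ near 0 or 1 in finite samples.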
Monte Carlo simulation uses numerical simulation to compute $J_n(x; F)$ for se-
lected choices of $F$. This is useful to investigate the performance of $T_n$ for reasonable
situations and sample sizes. The basic idea is that for any given $F$, the distribution
function $J_n(x; F)$ can be approximated through simulation.
The method of Monte Carlo is quite simple to describe. The researcher chooses
$F$ (the underlying cdf) and the sample size $n$. A true value for $\theta$ is implied by the
choice of $F$. The following algorithm is conducted:
Step 1: An i.i.d. sample $X_1, \ldots, X_n$ is generated from the chosen $F$.
Step 2: The test statistic is computed with the previous data, $T_n = T_n(X_1, \ldots, X_n; \theta)$.
Step 3: Repeat Step 1 and Step 2 $B$ times, where $B$ is a large number, getting
$B$ values of $T_n$, say $T_{n1}, \ldots, T_{nB}$. Typically, we set $B = 1000$ or $B = 5000$.
For Step 1, most computer packages have procedures for generating random num-
bers from many well known distributions. In any case, you can always use that
Example 117 (t-statistic (cont.)) Suppose we are interested in the Type I error
associated with an asymptotic 5% two-sided t-test. We can compute
$$\hat{P} = \frac{1}{B} \sum_{b=1}^B 1(|T_{nb}| \geq 1.96),$$
the percentage of the simulated t-ratios which exceed the asymptotic 5% critical value.
The r.v.'s $1(|T_{nb}| \geq 1.96)$ are i.i.d. Bernoulli distributed. The sample average $\hat{P}$ is
therefore an unbiased estimator of $P$ with standard error $s(\hat{P}) = \sqrt{P(1-P)/B}$,
which can be estimated by $\hat{s}(\hat{P}) = \sqrt{\hat{P}(1-\hat{P})/B}$ or using a hypothesized value.
For example, if we are assessing an asymptotic 5% test, then we can set
$s(\hat{P}) = \sqrt{0.05(0.95)/B} \approx 0.22/\sqrt{B}$, which for $B = 100$, $1000$ and $5000$ gives, respectively,
$s(\hat{P}) = 0.022$, $0.007$ and $0.003$.
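The three-step Monte Carlo algorithm for Example 117 can be sketched as follows; the DGP choice ($F$ standard normal) and the function name are my own illustrations.

```python
import math
import random

def mc_type1_error(n=50, B=2000, seed=1):
    """Monte Carlo estimate of the Type I error of the asymptotic 5%
    two-sided t-test when F is standard normal (so H0: mu = 0 is true)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(B):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]   # Step 1: draw sample from F
        xbar = sum(x) / n
        s = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
        tn = math.sqrt(n) * xbar / s                  # Step 2: compute statistic
        rejections += abs(tn) >= 1.96
    return rejections / B                             # Step 3: aggregate over B draws

p_hat = mc_type1_error()
# With B = 2000, s(P_hat) = sqrt(0.05*0.95/2000) ~ 0.005, so p_hat should be near 0.05.
```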
While the asymptotic distribution of $T_n$ might be known, the exact (finite sample)
distribution $J_n$ is generally unknown and depends on the underlying cdf $F$.
Asymptotic inference is based on approximating the cdf $J_n(x; F)$ with $J(x; F) =
\lim_{n\to\infty} J_n(x; F)$. When $J(x; F) = J(x)$ does not depend on $F$, we say that $T_n$ is
asymptotically pivotal and use the distribution function $J$ for inferential purposes.
In a seminal contribution, Efron (1979) proposed the bootstrap, which makes a
different approximation. The unknown $F$ is replaced by an estimate $F_n$ (e.g. the
empirical cdf) and plugged into $J_n(x; F)$ to obtain
$$J_n^*(x) = J_n(x; F_n).$$
The bootstrap algorithm parallels the Monte Carlo one:
Step 1: A bootstrap sample $X_1^*, \ldots, X_n^*$ is generated from $F_n$.
Step 2: The test statistic is computed with the bootstrap data, $T_n^* = T_n(X_1^*, \ldots, X_n^*; \hat{\theta}_n)$.
Step 3: Repeat Step 1 and Step 2 $B$ times, where $B$ is a large number, getting
$B$ values of $T_n^*$, say $T_{n1}^*, \ldots, T_{nB}^*$. Typically, we set $B = 1000$ or $B = 5000$.
is invalid but the smoothed nonparametric bootstrap is valid; see e.g. the maximum
score estimator of Manski (1975).
In parametric problems, $F = F(\theta)$ is estimated through $F(\hat{\theta})$, for a consistent
estimator $\hat{\theta}$ of $\theta$. This is called the parametric bootstrap.
Other bootstrap methods are specific to the problem at hand, e.g. bootstrap
methods for regression models such as the wild bootstrap (WB).
Why bootstrap methods?
Bootstrap methods can permit statistical inference when conventional methods
such as standard error computation are difficult to implement.
Bootstrap methods can be "better" than asymptotic approximations: they provide
refinements that can lead to better approximations in finite samples.
Can the bootstrap approximation fail? Yes, sometimes. There are exam-
ples where the bootstrap approximation fails. There are, however, general theo-
rems proving the consistency of the bootstrap under mild smoothness conditions.
That is, the intuition is that if $F_n \approx F$ and $J_n(x; F)$ is smooth enough in $F$, then
$J_n(x; F_n) \approx J_n(x; F)$.
Examples of applications of the bootstrap: confidence intervals, estimation of
standard deviations, bias reduction and hypothesis testing, to mention a few.
where $\hat{\theta}^*$ is computed with the bootstrap sample $X_1^*, \ldots, X_n^*$. The symbol $E^*$ stands
for expectation with respect to the bootstrap sample (i.e. conditional on the original
sample). $\tau_n$ is estimated by the simulation described previously by
$$\hat{\tau}_n = \frac{1}{B} \sum_{b=1}^B T_{nb}^* = \frac{1}{B} \sum_{b=1}^B \hat{\theta}_b^* - \hat{\theta} = \bar{\theta}^* - \hat{\theta}.$$
$$\hat{V}_n = \frac{1}{B} \sum_{b=1}^B \left( \hat{\theta}_{nb}^* - \bar{\theta}_n^* \right)^2.$$
A bootstrap standard error for $\hat{\theta}$ is the square root of the last display.
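The bootstrap standard error just described can be sketched in a few lines; the function name `bootstrap_se` and the simulated data are my own illustrations.

```python
import random

def bootstrap_se(x, stat, B=1000, seed=0):
    """Bootstrap standard error: resample x with replacement B times,
    recompute the statistic, and take the standard deviation of the B
    replicates (the square root of V_hat_n in the display above)."""
    rng = random.Random(seed)
    n = len(x)
    reps = []
    for _ in range(B):
        xs = [x[rng.randrange(n)] for _ in range(n)]  # bootstrap sample from F_n
        reps.append(stat(xs))
    mean = sum(reps) / B
    return (sum((r - mean) ** 2 for r in reps) / B) ** 0.5

# For the sample mean with n = 100 i.i.d. N(0,1) data, the bootstrap SE
# should be close to the textbook value s_n / sqrt(n) ~ 0.1.
random.seed(3)
x = [random.gauss(0, 1) for _ in range(100)]
se_mean = bootstrap_se(x, lambda s: sum(s) / len(s))
```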
Let $\theta$ be a parameter of interest and $T_n = T_n(X_1, \ldots, X_n; \theta)$ be a test statistic.
The cdf of $T_n$ is
$$J_n(x; F) = P(T_n(\theta) \leq x).$$
For a distribution function $J_n(x; F)$ let $q_n(\alpha; F)$ denote its $\alpha$-quantile, i.e.
$$q_n(\alpha; F) = \inf\{x : J_n(x; F) \geq \alpha\}.$$
How to construct confidence intervals (CI) for $\theta_0$? Choose $c_1$ and $c_2$ such that
$$P(c_1 \leq T_n(\theta) \leq c_2) = 1 - \alpha.$$
Examples:
$c_1 = -\infty$, $\quad c_2 = q_n(1-\alpha; F)$  (one sided)
$c_1 = q_n(\alpha; F)$, $\quad c_2 = +\infty$  (one sided)
$c_1 = q_n(\alpha/2; F)$, $\quad c_2 = q_n(1-\alpha/2; F)$  (two sided)
If $T_n(\theta)$ is a pivot, use $q_n(\alpha; F) \equiv q_n(\alpha)$. If $T_n(\theta)$ is an asymptotic pivot, use $q(\alpha) =
\lim_{n\to\infty} q_n(\alpha; F)$. Bootstrap methods use $\hat{q}_n(\alpha; F_n)$, a simulated value of $q_n(\alpha; F_n)$.
There are general results that guarantee the validity of the bootstrap approximation
for CI computation, in the sense that the one-sided coverage error $(CE_n)$, $P(T_n(\theta) \leq$
$\hat{q}_n(1-\alpha; F_n)) - (1-\alpha)$, shrinks at the rates given in the following table.
CE$_n$        Norm           BB             SB
One Sided     $O(n^{-1/2})$  $O(n^{-1/2})$  $O(n^{-1})$
Two Sided     $O(n^{-1})$    $O(n^{-1})$    $O(n^{-1})$
Alternatively, we can get a symmetric two-sided bootstrap CI based on $|T_n(\theta)| =
\sqrt{n}\, s_n^{-1} |\bar{X}_n - \theta|$, i.e.
$$CI = \{\theta : \sqrt{n}\, s_n^{-1} |\bar{X}_n - \theta| \leq \hat{q}_n(1-\alpha; F_n)\}.$$
In the latter case $CE_n = O(n^{-2})$. These rates of convergence can be computed using
Edgeworth expansions. These expansions are beyond the scope of these notes.
Note that the bootstrap provides asymptotic refinements compared to asymptotic-based
CIs in terms of the rate of convergence of the CE. So, a general rule for the bootstrap is: when
possible, use asymptotic pivots and symmetric two-sided tests.
$$H_0: P \in \mathcal{P}_0 \quad \text{vs} \quad H_1: P \in \mathcal{P}_1,$$
and we have a test statistic $T_n$, set up in a way that large values of $T_n$ indicate $H_1$.
More specifically, it is common to have
(i) $T_n \to_d$ some fixed limit distribution under $H_0$;
(ii) $T_n \to_P \infty$ under $H_1$.
Let $t_n := \sqrt{n}\, s_n^{-1}(\bar{X}_n - \theta_0)$. Then, if $\theta = E[X]$ an example is
$$H_0: \theta = \theta_0 \quad \text{vs} \quad H_1: \theta > \theta_0;$$
then set $T_n = t_n$. We reject for large values of $T_n$; the critical region is $\{T_n > c\}$,
where $c$ is computed such that
$$P(T_n > c \mid H_0) = \alpha.$$
So $c = q_n(1-\alpha; F)$.
Asymptotic theory for pivotal tests: $c = q_\infty(1-\alpha) = \lim_{n\to\infty} q_n(1-\alpha; F)$.
Bootstrap: $c = \hat{q}_n(1-\alpha; Q_n)$ where $Q_n$ is an estimate of the true $F$ that imposes
the null hypothesis $H_0$. Notice that in hypothesis testing the choice $\hat{q}_n(1-\alpha; F_n)$ with
$F_n$ the empirical cdf leads to an inconsistent test. For instance, in the previous
example assume $\theta_0 = 0$, $\sigma^2 = 1$ (known) and $T_n = \sqrt{n}\, \bar{X}_n$. Here $Z$ and $z(1-\alpha)$
represent a standard normal r.v. and its $(1-\alpha)$ quantile, respectively. Then,
$$P(T_n > \hat{q}_n(1-\alpha; F_n) \mid H_1) \to P(Z + \sqrt{n}\,\mu > z(1-\alpha) + \sqrt{n}\,\mu) = \alpha.$$
Therefore, the bootstrap test has no power under the alternative! To solve this
problem we impose the null in the estimation of the unknown $F$. In this example, we
take $Q_n$ to be the empirical cdf of the centered sample $X_1 - \bar{X}_n, \ldots, X_n - \bar{X}_n$. Then,
$$P(T_n > \hat{q}_n(1-\alpha; Q_n) \mid H_1) \to P(Z + \sqrt{n}\,\mu > z(1-\alpha)) \to 1.$$
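The null-imposed (centered) bootstrap test of the example can be sketched as follows; the function name `bootstrap_test_mean` and the simulated data are my own illustrations.

```python
import math
import random

def bootstrap_test_mean(x, B=999, alpha=0.05, seed=0):
    """Bootstrap test of H0: mu = 0 with Tn = sqrt(n)*Xbar.
    The null is imposed by resampling from the CENTERED sample X_i - Xbar
    (the Q_n of the text), so the bootstrap critical value stays bounded
    even when the true mean is nonzero."""
    rng = random.Random(seed)
    n = len(x)
    xbar = sum(x) / n
    centered = [xi - xbar for xi in x]  # empirical cdf Q_n imposes mean zero
    tn = math.sqrt(n) * xbar
    t_star = []
    for _ in range(B):
        xs = [centered[rng.randrange(n)] for _ in range(n)]
        t_star.append(math.sqrt(n) * sum(xs) / n)
    t_star.sort()
    crit = t_star[int(math.ceil((1 - alpha) * B)) - 1]  # simulated q_n(1-alpha; Q_n)
    return tn, crit, tn > crit

# Under the alternative (true mean 0.5) the test should reject.
random.seed(7)
x_alt = [random.gauss(0.5, 1.0) for _ in range(100)]
tn, crit, reject = bootstrap_test_mean(x_alt)
```

Resampling from the uncentered $F_n$ instead would shift the bootstrap quantile by roughly $\sqrt{n}\,\bar{X}_n$, reproducing the inconsistency described above.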
Let $T_n$ have cdf $J_n(x; F)$ and $J_\infty(x; F) = \lim_{n\to\infty} J_n(x; F)$. Write $T_n = T_n(X_1, \ldots, X_n)$.
Let $T_{b,i} = T_b(X_i, \ldots, X_{i+b-1})$ be the statistic computed with the subsample $(X_i, \ldots, X_{i+b-1})$
of size $b$. We note that each subsample of size $b$ (taken without replacement from
the original data) is indeed a sample of size $b$ from the true DGP. Hence, it is clear
that one can approximate the sampling distribution $J_n(x; F)$ using the distribution
of the values of $T_{b,i}$ computed over the $n - b + 1$ different subsamples of size $b$ in a
time series context, or using the $\binom{n}{b}$ possible subsets with $b$ elements taken from the
original sample in an i.i.d. setup. That is, we approximate $J_n(x; F)$ by
$$J_{n,b}(x) = \frac{1}{n-b+1} \sum_{i=1}^{n-b+1} 1(T_{b,i} \leq x).$$
Suppose that we are testing a hypothesis and large values of $T_n$ indicate rejection.
Let $c_{n,1-\alpha,b}$ be the $(1-\alpha)$-th sample quantile of $J_{n,b}$, i.e.,
$$c_{n,1-\alpha,b} = \inf\{x : J_{n,b}(x) \geq 1-\alpha\}.$$
Thus, the subsampling test rejects the null hypothesis if $T_n > c_{n,1-\alpha,b}$.
Politis, Romano and Wolf (1999) showed the validity of the subsampling proce-
dure for strong mixing processes.
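The time-series version of the subsampling approximation (all $n-b+1$ consecutive blocks) can be sketched as follows; the function name, the choice $b = 50$, and the simulated data are my own illustrations.

```python
import math
import random

def subsampling_critical_value(x, stat, b, alpha=0.05):
    """Subsampling critical value: compute the statistic on all n-b+1
    consecutive blocks of size b and take the (1-alpha) sample quantile
    of the resulting distribution J_{n,b}."""
    n = len(x)
    vals = sorted(stat(x[i:i + b]) for i in range(n - b + 1))
    k = int(math.ceil((1 - alpha) * len(vals))) - 1
    return vals[k]

# Hypothetical use with Tn = sqrt(n)*|Xbar| on an i.i.d. N(0,1) sample.
random.seed(11)
x = [random.gauss(0, 1) for _ in range(500)]
stat = lambda s: math.sqrt(len(s)) * abs(sum(s) / len(s))
c = subsampling_critical_value(x, stat, b=50)
tn = stat(x)
reject = tn > c
```

In practice $b$ must satisfy $b \to \infty$ and $b/n \to 0$; the choice of $b$ matters in finite samples.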
$$y_i^* = m(\hat{\theta}_n, z_i) + \hat{v}_i^*.$$
Not surprisingly, this bootstrap does not work under heteroskedasticity (or in general
under dependence between $z$ and $v$). A more general bootstrap in this context is the
wild bootstrap (WB) introduced in Wu (1986) and Liu (1988). The bootstrap data
are obtained from the following algorithm:
1) Estimate the original model and obtain the residuals $\{\hat{v}_1, \ldots, \hat{v}_n\}$.
$$y_i^* = m(\hat{\theta}_n, z_i) + \hat{v}_i^*.$$
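A single wild-bootstrap draw can be sketched as follows. The use of Rademacher ($\pm 1$) weights is one common choice of wild-bootstrap multiplier, not necessarily the one in Wu (1986) or Liu (1988); the function name and the toy numbers are mine.

```python
import random

def wild_bootstrap_sample(fitted, residuals, seed=None):
    """One wild-bootstrap draw: y_i* = m(theta_hat, z_i) + v_hat_i * eta_i,
    where eta_i are i.i.d. Rademacher (+1/-1) weights. Multiplying each
    residual by its own weight preserves its magnitude, and hence the
    conditional heteroskedasticity, unlike naive residual resampling."""
    rng = random.Random(seed)
    return [f + v * rng.choice([-1.0, 1.0])
            for f, v in zip(fitted, residuals)]

# Toy fitted values m(theta_hat, z_i) and residuals v_hat_i.
fitted = [1.0, 2.0, 3.0]
resid = [0.1, -0.2, 0.3]
y_star = wild_bootstrap_sample(fitted, resid, seed=0)
```

Repeating this draw $B$ times and re-estimating the model on each $\{y_i^*\}$ yields the wild-bootstrap distribution of the estimator.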