
Econometrics I: Lecture Notes

Juan Carlos Escanciano


Universidad Carlos III de Madrid
Department of Economics
September 2020
Contents

I Probability 1
1 Basic concepts of probability 1
1.1 Radon-Nikodym Derivative . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Conditional Means as Orthogonal Projections . . . . . . . . . . . . . 12
1.3 Different Concepts of Dependence . . . . . . . . . . . . . . . . . . . . 13
1.4 Other characteristics of distributions . . . . . . . . . . . . . . . . . . 15
1.5 Stochastic Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2 Stochastic Convergence 23

3 Law of Large Numbers and Central Limit Theorems. 31


3.1 The Delta Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

II Statistics 36
4 Statistical Models: Identification and Specification 36

5 Empirical Measures and Sufficiency 37

6 Statistical Decision Theory 44

7 Statistical Inference: Estimation 50


7.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
7.2 Asymptotic properties of Extremum Estimates . . . . . . . . . . . . . 57
7.3 Numerical Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 72
7.4 Extensions to non-smooth objective functions1 . . . . . . . . . . . . . 79

8 Statistical Inference: Hypothesis Testing 81


8.1 Finite Sample Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
8.2 Asymptotic Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
8.3 When does a policy or treatment have significant effects? . . . . . . 99

9 Statistical Inference: Confidence Intervals 105


1
This material does not go into the exam.

10 Monte Carlo and Bootstrap Methods 108


10.1 Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
10.2 Bootstrap Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109


Part I
Probability
These notes are not a substitute for Shao's book, which is the main reference for the
first part of the course (Probability, Chapter 1). The notes on probability are intended
as a complement to the book. For the second part, on Statistics, these notes are the
main reference, and Shao's book becomes a complement.

1 Basic concepts of probability


In this section we summarize some basic results of probability theory, with further
details provided in Shao. A random experiment (probability space) is a triple (Ω, F, P)
where

(i) Ω is the set of all possible outcomes, and it is called the sample space.

(ii) F is a family of subsets of Ω which has the structure of a σ-field.

(iii) P is a probability measure on F, i.e., a countably additive (σ-additive) measure from F to [0, 1] with P(Ω) = 1.

Basic Definitions: The elements of F are called events. If a statement holds for all ω ∈ A for a set A with P(A) = 1, we say that the statement is true almost surely (a.s.). Let (M, d) be a metric space and B its Borel σ-field (the σ-field generated by the open sets). A random element X with values in (M, d) is a measurable function from Ω to M, i.e., X : (Ω, F) → (M, B) such that X^{-1}(B) ∈ F for all Borel sets B ∈ B. If M = R, X is called a random variable, and if M = R^n, n > 1, X is called a random vector (r.v.). P_X(B) := P(X^{-1}(B)) is the law of X. For M = R^n, F_X(x) = P_X((-∞, x]) = P(X ≤ x) is the cumulative distribution function (cdf). The characteristic function is the function φ_X(x) = E[exp(i x'X)], where i = √(-1) and E is the expectation operator

E[X] = ∫_Ω X(ω) dP = ∫_M x dP_X.

Notice that

F_X(x) = P(X ≤ x) = E[1_{(-∞,x]}(X)],


where 1_A(·) is the indicator function, 1_A(x) = 1 if x ∈ A and 1_A(x) = 0
otherwise (x ∉ A). Sometimes we write 1_{(-∞,x]}(X) = 1(X ≤ x), where for a vector
this is understood component-wise. The functions F_X(x) and φ_X(x) contain all the
information about P_X (this will not be proved, but you should have some intuition
why this is true).
For a r.v. X with finite r-th moment define the norm ||X||_r = E[|X|^r]^{1/r}, r > 0,
where |·| is the Euclidean norm. Let L_r(P) be the space of all r.v.'s with finite
r-th moment. The covariance operator is Cov(X, Y) = E[(X − E[X])(Y − E[Y])'] and
the variance operator is V[X] = Cov(X, X) (which exists if X ∈ L_2(P)).
Let Ω be an arbitrary set, let F be the collection of all subsets of Ω, and let
x ∈ Ω be an arbitrary element. A key measure is the point mass or Dirac measure
at x, defined as δ_x(A) = 1 if x ∈ A, and zero otherwise, where A ⊂ Ω (show that
δ_x is a measure). If Ω = {x_1, ..., x_n} and F is as before, define the counting measure
as v(A) = Σ_{i=1}^n δ_{x_i}(A), i.e. the number of elements of A. This can be extended to
an arbitrary Ω (see Example 1.1 in Shao) in a simple way: v(A) is the number of
elements of A (hence counting measure).
There is a close connection between indicator functions and Dirac measures, although
indicators are functions and Diracs are measures. Indeed, for fixed x and
A, 1_A(x) = δ_x(A). From this, v(A) = Σ_{i=1}^n δ_{x_i}(A) = Σ_{i=1}^n 1_A(x_i). As an example, if
Ω = {Trump, Biden} and we aim to count the votes in favor of Trump, we could
choose A = {Trump} and simply compute v(A) = Σ_{i=1}^n 1_A(x_i) (adding one for each
person that voted for Trump). This is an absolute frequency.
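A minimal numerical sketch of the identity v(A) = Σ_{i=1}^n 1_A(x_i) in the vote-counting example above; the list of ballots is made up purely for illustration.

    import numpy as np

    # Hypothetical ballots (the outcomes x_1, ..., x_n).
    ballots = np.array(["Trump", "Biden", "Biden", "Trump", "Trump"])

    # Indicator 1_A(x) for A = {Trump}.
    indicator_A = (ballots == "Trump").astype(int)

    # Counting measure v(A) = sum_i 1_A(x_i): the absolute frequency of A.
    v_A = indicator_A.sum()
    print(v_A)  # 3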
If A ⊂ B, then we can write B = A ∪ (B − A), which is used to show monotonicity
of measures, and for B = Ω that P(A^c) = 1 − P(A). More generally, we can write
A = (A ∩ B) ∪ (A ∩ B^c), and similarly B in terms of A's, which together with
A ∪ B = (A ∩ B^c) ∪ (A ∩ B) ∪ (A^c ∩ B), show the formula

P(A ∩ B) + P(A ∪ B) = P(A) + P(B).

Importantly, for an arbitrary n and A_1, ..., A_n, we can define B_1 = A_1, B_2 = A_2 ∩ A_1^c, ...,
B_k = A_k ∩ A_1^c ∩ ··· ∩ A_{k−1}^c, for k = 1, ..., n. Then

∪_{i=1}^n A_i = ∪_{i=1}^n B_i,

and the B's are disjoint. This last property is useful and explains the subadditivity
and continuity properties of probabilities. In particular, if A_1 ⊂ A_2 ⊂ ···, then
A_n = ∪_{i=1}^n A_i = ∪_{i=1}^n B_i and

P(lim_{n→∞} A_n) = P(∪_{i=1}^∞ B_i) = Σ_{i=1}^∞ P(B_i) = lim_{n→∞} Σ_{i=1}^n P(B_i) = lim_{n→∞} P(A_n).


This is useful to show some of the properties of cdfs.


Let Ω be an arbitrary set, and A ⊂ Ω. Before, we have introduced indicator
functions 1_A(·) : Ω → R. Let B denote a Borel set. Note that 1_A^{-1}(B) = {x ∈ Ω :
1_A(x) ∈ B}. Since the range of indicators is zero and one, we have

1_A^{-1}(B) =  ∅    if 0 ∉ B, 1 ∉ B,
               A    if 0 ∉ B, 1 ∈ B,
               A^c  if 0 ∈ B, 1 ∉ B,
               Ω    if 0 ∈ B, 1 ∈ B.

Thus, measurability of the indicator 1_A(·) is equivalent to measurability of A. Next
in complexity to indicators is the class of simple functions

φ(x) = Σ_{i=1}^k a_i 1_{A_i}(x),

for measurable sets A_1, ..., A_k. Let f : (Ω, F) → [0, ∞] be measurable. A key result in
measure theory is that there exists an increasing sequence φ_n(·) of simple functions
whose pointwise limit is f, i.e.

lim_{n→∞} φ_n(x) = f(x).

Without loss of generality we can take f to be bounded, with values in [0, 1). Then,

φ_n(x) = Σ_{k=0}^{2^n − 1} (k/2^n) 1_{A_k}(x),

where A_k = f^{-1}([k/2^n, (k+1)/2^n)), will do it. Note φ_n(x) = k/2^n if f(x) ∈
[k/2^n, (k+1)/2^n), thus

|φ_n(x) − f(x)| ≤ 1/2^n

and φ_n(x) ≤ f(x). Moreover, φ_n(x) ≤ φ_{n+1}(x). This monotone convergence result
will be fundamental for integration theory.
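A short numerical sketch of the dyadic approximation φ_n above, for an assumed f with values in [0, 1) evaluated on a grid; the uniform error bound 2^{-n} and monotonicity below f can be checked directly.

    import numpy as np

    def simple_approx(f_values, n):
        # Dyadic simple function: phi_n = k/2^n on the set where f is in [k/2^n, (k+1)/2^n).
        return np.floor(f_values * 2**n) / 2**n

    # Example: f(x) = x^2 on a grid of points, so f takes values in [0, 1).
    x = np.linspace(0.0, 0.999, 1000)
    f = x**2
    for n in (1, 3, 6):
        phi_n = simple_approx(f, n)
        # phi_n <= f and the error is at most 2^{-n}, as claimed in the text.
        print(n, np.all(phi_n <= f), np.max(f - phi_n) <= 2.0**(-n))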
To check measurability, the following result is often useful. If f : (Ω, F) → (M, B)
and B = σ(C), then it suffices to check measurability for inverses of C. That is,
f is measurable iff f^{-1}(C) ∈ F for all C ∈ C. This follows from the definition of
measurability and the fact that

{E ⊂ M : f^{-1}(E) ∈ F}


is a σ-field (check this and the proof of the sufficiency condition).
If X : (Ω, F) → (M, B) is a r.v. we have that {X^{-1}(B) : B ∈ B} is a σ-field,
and it is denoted σ(X) (the sigma-field generated by X). The following result will
have important implications, for example, when defining conditional means (this is
not proved in Shao).

Proposition: If X : (Ω, F) → (M, B) and Y : (Ω, F) → (M, B) are r.v. then
Y = f(X) for a Borel measurable function f iff σ(Y) ⊂ σ(X).

Proof: To prove this, note that if Y = f(X), then for each B ∈ B

Y^{-1}(B) = X^{-1}(f^{-1}(B)) ∈ σ(X)

because f^{-1}(B) is measurable. Thus, σ(Y) ⊂ σ(X). The other implication is a bit
harder to prove. For each n ≥ 1, define the sets

A_{m,n} = {ω ∈ Ω : m 2^{-n} ≤ Y(ω) < (m + 1) 2^{-n}}.

Since A_{m,n} ∈ σ(Y) and σ(Y) ⊂ σ(X), there exists B_{m,n} ∈ B such that A_{m,n} =
X^{-1}(B_{m,n}). Then, construct the function

φ_n(x) = Σ_{m=−∞}^{∞} (m/2^n) 1_{B_{m,n}}(x).

Note, by A_{m,n} = X^{-1}(B_{m,n}),

φ_n(X(ω)) = Σ_{m=−∞}^{∞} (m/2^n) 1_{A_{m,n}}(ω)

and hence φ_n(X) is close to Y (within 2^{-n} distance), i.e.

φ_n(X) ≤ Y ≤ φ_n(X) + 2^{-n}.

Taking limits we conclude

f(X) := lim_{n→∞} φ_n(X) = Y.


The starting point of integration theory is the integration of indicator functions: for
a measure μ and a measurable set A,

∫ 1_A(y) dμ = μ(A).

This represents measures as integrals. Next, we extend the definition to simple
functions:

∫ (Σ_{i=1}^k a_i 1_{A_i}(x)) dμ = Σ_{i=1}^k a_i μ(A_i).

Note that a simple function can have two different representations, but the integrals
for the two representations are the same. To see this, note that if φ(x) =
Σ_{i=1}^k a_i 1_{A_i}(x) = Σ_{i=1}^{k'} b_i 1_{B_i}(x), since φ takes a finite number of distinct values,
say c_1, ..., c_m, then C_j = {x : φ(x) = c_j}, j = 1, ..., m, are disjoint and φ(x) =
Σ_{i=1}^m c_i 1_{C_i}(x), so both integrals are

Σ_{i=1}^m c_i μ(C_i).

For a generic measurable function f : (Ω, F) → [0, ∞] we define the integral through
the previously introduced monotone sequence of simple functions, as

∫ f dμ = lim_{n→∞} ∫ φ_n dμ.

Another fundamental equality in integration is

∫ f(y) dδ_x(y) = f(x).

To see why this is true, as usual with integration, we first prove it for indicators,
then for simple functions, and then for general functions. For indicators,

∫ 1_A(y) dδ_x(y) = δ_x(A) = 1_A(x),

where the first equality uses integration with indicator functions. Using linearity, the
same holds for simple functions, and for general functions.
Another application of linearity gives, for the counting measure v(A) = Σ_{i=1}^n δ_{x_i}(A),

∫ f(y) dv(y) = Σ_{i=1}^n f(x_i).


So, adding is a special case of integration.


Double sums are also integrals with respect to the product counting measure:

∫ f(x, y) d(v × v)(x, y) = Σ_{i=1}^n Σ_{j=1}^m f(x_i, y_j).

Simple functions are to indicators as empirical measures are to Dirac measures. The
empirical measure at the points a_1 < a_2 < ··· < a_k with probabilities p_i, p_i ≥ 0 with
Σ_{i=1}^k p_i = 1 (possibly with k = ∞), is (see 1.11 in Shao)

P(A) = Σ_{i=1, a_i ∈ A}^k p_i = Σ_{i=1}^k p_i δ_{a_i}(A).

The associated cdf is

F(x) = Σ_{i=1}^k p_i δ_{a_i}((-∞, x]) = Σ_{i=1}^k p_i 1(a_i ≤ x).

This is the cdf of a discrete random variable (compare with 1.10 in Shao). For k = n
equal to the sample size, and a_i the i-th datum of a random sample {X_i}_{i=1}^n, the
discrete measure is

P_n(A) = (1/n) Σ_{i=1}^n 1(X_i ∈ A) = (1/n) Σ_{i=1}^n δ_{X_i}(A)

for a Borel set A of R^d. This measure is called the empirical (random) measure
and the corresponding cdf F_n(x) the empirical cdf

F_n(x) := P_n((-∞, x]) = (1/n) Σ_{i=1}^n 1(X_i ≤ x).

These quantities play a fundamental role in statistics, as we shall see in future lectures.
Conditional on the data points {X_i}_{i=1}^n and assuming no ties in the data, P_n
and F_n represent the probability and cdf, respectively, of a discrete random variable
with uniform probabilities on the data (a discrete uniform distribution or multinomial
distribution with probabilities 1/n). For n = 2, this is the model for flipping a
coin.
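A small sketch of the empirical measure and empirical cdf of a random sample, emphasizing that conditional on the data they put mass 1/n on each observation; the normal population and the sample size are chosen arbitrarily.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=20)            # a random sample X_1, ..., X_n

    def P_n(A_indicator):
        # Empirical measure: P_n(A) = (1/n) * sum_i 1(X_i in A).
        return np.mean(A_indicator(X))

    def F_n(x):
        # Empirical cdf: F_n(x) = (1/n) * sum_i 1(X_i <= x).
        return np.mean(X <= x)

    print(P_n(lambda x: x > 0))        # empirical probability of (0, infinity)
    print(F_n(0.0))                    # empirical cdf at 0; the two numbers add up to 1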
There are several theorems that give conditions that allow one to exchange limits
and integration. Probably the most intuitive of them is the monotone convergence


theorem. As an application of this theorem to time series econometrics consider the
following problem. Here {X_t} is a sequence of r.v.'s.
Prove that if Σ_{j=1}^∞ E[|X_{t−j}|] < ∞, then the series Σ_{j=1}^∞ X_{t−j} converges absolutely
almost surely, i.e.

P(Σ_{j=1}^∞ |X_{t−j}| < ∞) = 1.

Apply the monotone convergence theorem to f_n = Σ_{j=1}^n |X_{t−j}|: E[f] = lim_{n→∞} E[f_n] =
Σ_{j=1}^∞ E[|X_{t−j}|] < ∞. Thus, f = Σ_{j=1}^∞ |X_{t−j}| is integrable and nonnegative, and therefore f < ∞ a.s.
Theorem 1.2 in Shao provides the change of variables formula for general measures.
We provide some intuition to understand the result. The general result is:

Theorem: (Theorem 1.2 in Shao) If f : (Ω, F, ν) → (Λ, G) and g : (Λ, G) → (R, B)
are measurable, then

∫ g ∘ f(x) dν(x) = ∫ g(y) d(ν ∘ f^{-1})(y).

Shao mentions that this formula generalizes the one for Riemann integrals (i.e. the
change of variable y = f(x))

∫ g ∘ f(x) f'(x) dx = ∫ g(y) dy.

But what is the connection? To see this, define the measure (assume f' ≥ 0; otherwise
apply the argument to positive and negative parts)

v(A) = ∫_A f'(x) dx.

Note that by the fundamental theorem of calculus v([a, b]) = f(b) − f(a), and thus
v ∘ f^{-1} is the Lebesgue measure.

1.1 Radon-Nikodym Derivative


Let (Ω, F, μ) be a measure space. From the measure μ we can create many more measures.
To see this, note that for any nonnegative measurable Borel function f we
can define the set function

ν(A) = ∫_A f dμ.   (1)


Using that for two disjoint sets 1_{A∪B} = 1_A + 1_B, it can be shown that

ν(A ∪ B) = ∫ 1_{A∪B} f dμ = ∫ (1_A + 1_B) f dμ = ν(A) + ν(B).

Indeed, ν(A) is a measure with the property that

μ(A) = 0 ⟹ ν(A) = 0.

We then say ν is absolutely continuous with respect to (wrt) μ (ν ≪ μ). The Radon-Nikodym
theorem essentially says that all such ν are of this form: if ν is σ-finite
and ν ≪ μ, then there exists a nonnegative Borel measurable function f, called the
Radon-Nikodym derivative (RND) or density of ν wrt μ, such that (1) holds for all
A. The concept of density is a fundamental concept in probability and statistics.
Sometimes we write f = dν/dμ.
Note that interpreting ν(A) as an integral we write (1) as

∫ 1_A dν = ∫ 1_A f dμ,

which can be extended from indicators to arbitrary measurable functions g. This is
a change of measures.
What is the RND of an empirical measure wrt the Lebesgue measure? The
empirical measure at the points a_1 < a_2 < ··· < a_k with probabilities p_i, p_i ≥ 0 with
Σ_{i=1}^k p_i = 1 (possibly with k = ∞), is (see 1.11 in Shao)

P(A) = Σ_{i=1, a_i ∈ A}^k p_i = Σ_{i=1}^k p_i δ_{a_i}(A).

Take a_i with corresponding probability p_i > 0; then P({a_i}) = p_i > 0, but the
Lebesgue measure gives m({a_i}) = 0. Thus, P is not absolutely continuous wrt the
Lebesgue measure and the RND does not exist. How about the RND wrt the
counting measure? First, the conditions of the RN theorem hold (check this). Then,

P(A) = ∫ f 1_A dv = Σ_{i=1}^k f(a_i) 1_A(a_i) = Σ_{i=1}^k f(a_i) δ_{a_i}(A),

so the RND is f(a_i) = p_i.


Once we have defined densities as RNDs, we can compute the densities of transformations
of variables, as explained in Proposition 1.8 in Shao. To understand the
result better, note that if Y = g(X),

F_Y(y) = E[1{Y ≤ y}]   (definition of cdf and expectation)
       = E[1{Y ≤ y} Σ_{j=1}^m 1{X ∈ A_j}]   (since Σ_{j=1}^m 1{X ∈ A_j} = 1 for a partition)
       = Σ_{j=1}^m E[1{Y ≤ y} 1{X ∈ A_j}]
       = Σ_{j=1}^m E[1{X ≤ h_j(y)} 1{X ∈ A_j}].

Taking derivatives wrt y and applying a multivariate change of variables formula we
obtain the transformation formula of Proposition 1.8.
The Radon-Nikodym Theorem implies the following decomposition (called the
Lebesgue decomposition of Q wrt P), which is not in Shao. Let P and Q be probability
measures on (Ω, A) with densities p and q with respect to a measure μ. Define,
for A ∈ A,

Q_a(A) = Q(A ∩ {p > 0}).

Then, the following is true:

(a) Q_a is a measure.
(b) Q_a ≪ P (Q_a is absolutely continuous wrt P).
(c) If Q_⊥(A) = Q(A ∩ {p = 0}), then Q(A) = Q_a(A) + Q_⊥(A).
(d) The Radon-Nikodym derivative dQ_a/dP = q/p.

Another important application of the RN theorem is the definition of conditional
means. If X : (Ω, F) → (M, B) and Y : (Ω, F) → (R, B_R) are r.v.'s, with Y integrable,
then we define E[Y | X] := E[Y | σ(X)] as the a.s. unique measurable function in
σ(X) satisfying

∫_{X^{-1}(B)} E[Y | σ(X)] dP = ∫_{X^{-1}(B)} Y dP

for all B ∈ B. To prove existence and uniqueness note that for Y ≥ 0 a.s. the right
hand side is a measure absolutely continuous wrt P. For a general Y we can take
positive and negative parts and apply the previous result. Also note that E[Y | σ(X)]
can be written as h(X) for a measurable h. To see this note that if Z = E[Y | σ(X)] is

a r.v on (X); then (Z) (X) and by the previous proposition Z = h(X) for some
measurable h: This construction of conditional means based on RN is more elegant
and only require …rst bounded moments, in contrast to the projection approach that
requires second …nite moments.
For the case of random variables we can apply change of variables to the last
display to obtain for h(x) = E [Y j X = x] ; for all borel sets B;
Z Z Z
h(x)dPX (x) = ydPY;X (y; x);
B R B

or equivalently
E [E [Y j X] 1fX 2 Bg] = E [Y 1fX 2 Bg] :
As this holds for indicators, it also holds for simple functions, and for general mea-
surable functions (for which the moments as well de…ned), so that

E [(Y E [Y j X]) g(X)] = 0;

for all g for which E [(Y E [Y j X]) g(X)] is well de…ned. This “orthogonality”
restriction is used extensively in these notes.
We can use these concepts to de…ne conditional probabilities, Pr [B j X] = E [1B (Y ) j X]
a.s. for all B 2 (Y ): We can also de…ne conditional probabilities that are de…ned
for every x, not just a.s. in x: Since this is a bit technical we do not emphasize it
much in the course.

1.1.1 Some Inequalities


Jensen's inequality: If g(·) : R → R is convex (g(λx_1 + (1 − λ)x_2) ≤ λg(x_1) +
(1 − λ)g(x_2), λ ∈ (0, 1)), then

g(E[X]) ≤ E[g(X)].

L_r inequality: If 1 ≤ r ≤ p,

||X||_r ≤ ||X||_p.

Hölder's inequality: If p > 1 and q > 1 with 1/p + 1/q = 1, then

E[|XY|] ≤ E[|X|^p]^{1/p} E[|Y|^q]^{1/q}.

Cauchy-Schwarz inequality:

E[|XY|] ≤ E[|X|²]^{1/2} E[|Y|²]^{1/2}.

Minkowski's inequality:

||X + Y||_p ≤ ||X||_p + ||Y||_p.

Chebyshev's inequality: For all ε > 0 and p > 0,

P(|X| > ε) ≤ ε^{-p} E[|X|^p].


1.2 Conditional Means as Orthogonal Projections


The goal is to predict Y from knowledge of some variables in {S : S ∈ S}.

The Projection Theorem: Let Y and {S : S ∈ S} be random variables (defined
on the same probability space) with finite second moments. A random variable Ŝ is
called a projection of Y onto S (or L_2 projection) if Ŝ ∈ S and Ŝ minimizes

S ↦ E[(Y − S)²], S ∈ S.   (2)

Often S is a linear space (λ_1 S_1 + λ_2 S_2 ∈ S if S_1, S_2 ∈ S and λ_1, λ_2 ∈ R). In this case
we have the following results:

(i) Ŝ ∈ S is the projection of Y if and only if

E[(Y − Ŝ) S] = 0, for all S ∈ S.   (3)

If the linear space S contains the constant functions, then E[Y] = E[Ŝ] and
Cov(Y − Ŝ, S) = 0 for all S ∈ S. Moreover, E[Y²] = E[(Y − Ŝ)²] + E[Ŝ²]
(Pythagoras' Theorem).

(ii) (Uniqueness) Every two projections of Y are almost surely equal.

(iii) (Existence) If S is closed (with respect to the second-moment norm), then the
projection always exists.

One important application of the projection theorem is the definition of the
Conditional Mean or Optimal Predictor: Define S as the space of all measurable
functions of a given random variable X with finite second moments. Hence, given
the random variable Y we can define the projection of Y onto S, and denote it by
Ŝ = E[Y | X], which satisfies the orthogonality restrictions

E[(Y − E[Y | X]) g(X)] = 0, for all g ∈ S.   (4)

The conditional mean E[Y | X] is the unique (up to sets of probability zero; these
could be very big sets!) measurable function of X satisfying the orthogonality conditions
(4). Notice that the conditional mean can also be defined by another method (Radon-Nikodym
derivatives) that does not need second moments, as discussed in Section 1.1.
Nevertheless, the current approach is useful as it has a simple geometric
interpretation. Some useful properties of conditional means:

1. Taking g ≡ 1 in (4), we obtain E[Y] = E[E[Y | X]] (Law of Iterated Expectations).

2. If Y = f(X) for some measurable f, it is clear from (2) that E[Y | X] = f(X).

3. If Y is conditional mean independent of X, i.e. E[Y | X] = 0, then E[Y g(X)] =
0 for all g(X) ∈ S. By the Projection Theorem the converse is also true.

4. From Pythagoras' Theorem, E[(E[Y | X])²] ≤ E[Y²].

Another important application of this theorem is when S is the space of all linear
transformations of a given random vector X, i.e. S = {β'X : β ∈ R^d}. In that case,
the orthogonality condition in (3) boils down to

E[(Y − β_0'X) X] = 0,

which is equivalent to the normal equations

E[XX'] β_0 = E[YX].

Note that regardless of whether E[XX'] is singular or non-singular there is always
a unique projection β_0'X; we call this projection the Optimal Linear Predictor
(OLP) and denote it by L[Y | X] = β_0'X.
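A sketch of the optimal linear predictor computed from the sample analogue of the normal equations E[XX'] β_0 = E[YX]; the data generating process and coefficient values are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 10_000
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # include a constant
    beta_true = np.array([1.0, 2.0, -0.5])
    Y = X @ beta_true + rng.normal(size=n)

    # Sample analogue of the normal equations E[XX'] beta = E[YX].
    EXX = X.T @ X / n
    EYX = X.T @ Y / n
    beta_hat = np.linalg.solve(EXX, EYX)     # OLS = sample optimal linear predictor

    # Orthogonality check: the residual Y - beta_hat'X is (numerically) uncorrelated with X.
    residuals = Y - X @ beta_hat
    print(beta_hat)
    print(X.T @ residuals / n)               # approximately a zero vector

With a singular sample E[XX'] one could use np.linalg.pinv instead of solve; the fitted values β_0'X are unchanged, which is the uniqueness of the projection mentioned above.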

1.3 Different Concepts of Dependence


Let Y and X denote random vectors, and let L_2(Y) denote the set of square-integrable
measurable functions of Y (and similarly for L_2(X)). We have seen what
independence means in terms of probability measures, cdfs, pdfs, characteristic functions,
etc. But a useful way to characterize independence is as follows.

Proposition 1 Two random vectors Y and X are independent iff for all f ∈ L_2(Y)
and g ∈ L_2(X),

Cov(f(Y), g(X)) = 0.

Proof. If Y and X are independent, then Cov(f(Y), g(X)) = E[f(Y)g(X)] −
E[f(Y)]E[g(X)] = 0. For the other implication take f_y(Y) = 1(Y ≤ y) and g_x(X) =
1(X ≤ x) and note Cov(f_y(Y), g_x(X)) = 0 implies the joint cdf of (Y, X) equals the
product of the marginals, which implies independence.

We introduce the following (asymmetric) concept of independence.

Definition 2 Y is Conditionally Mean Independent (CMI) of X if E[Y | X] = c,
where c is a constant.

Note that the constant c has necessarily to be c = E[Y] (why?). Then, we have
the following characterization:

Proposition 3 Y is CMI of X iff for all g ∈ L_2(X),

Cov(Y, g(X)) = 0.

Proof. If Y is CMI of X then Cov(Y, g(X)) = E[Y g(X)] − E[Y]E[g(X)] =
E[E[Y | X] g(X)] − E[Y]E[g(X)] = 0. The other implication follows from the projection
theorem.

From the discussion above we see that

independence ⟹ CMI ⟹ uncorrelatedness,   (5)

with the reverse implications not being true in general.


We generalize these concepts to conditional versions.

Definition 4 Y is Conditionally Independent of X given Z iff

Pr[B | X, Z] = Pr[B | Z] a.s. for all B ∈ σ(Y),

whereas Y is CMI of X given Z iff

E[Y | X, Z] = E[Y | Z] a.s.

The same relation as in (5) holds in the conditional case, where conditionally uncorrelated
means

E[YX | Z] = E[Y | Z] E[X | Z] a.s.
Consider the following important application of conditional independence. Suppose
we want to evaluate a policy or treatment on an outcome of interest. Let D be the
treatment indicator (= 1 if treated, 0 otherwise), Y(1) the outcome under treatment,
and Y(0) the outcome without treatment. We only observe (Y, X, D), with
Y = Y(1) D + Y(0) (1 − D). Assume (Y(1), Y(0)) and D are independent conditional
on X. This assumption is called unconfoundedness or selection on observables.
Under this condition, the Average Treatment Effect (ATE) E[Y_i(1) − Y_i(0)]
is identified, as we now show. First, note that the treatment effect for individual
i, Y_i(1) − Y_i(0), is not identified because either Y_i(1) or Y_i(0) is not observed.


Nevertheless, using the selection on observables assumption and the law of iterated
expectations,

E[Y_i(1) − Y_i(0)] = E[E[Y_i(1)|X_i] − E[Y_i(0)|X_i]]   (iterated expectations)
                   = E[E[Y_i(1)|X_i, D_i = 1] − E[Y_i(0)|X_i, D_i = 0]]   (conditional independence)
                   = E[E[Y_i|X_i, D_i = 1] − E[Y_i|X_i, D_i = 0]]   (definition of Y).

We say the ATE is identified because we have written it as a function of the distribution
of observables. We will study identification in more detail below.
To illustrate these points, suppose we run the regression

Y_i = α_0 + β_0 D_i + γ_0'X_i + ε_i,

where it is assumed that E[ε_i | X_i, D_i] = 0. What is the interpretation of β_0 in this
model? Does OLS have a causal interpretation here? In what sense are these assumptions
strong? Note the potential outcomes in this model are

Y_i(1) = α_0 + β_0 + γ_0'X_i + ε_i
Y_i(0) = α_0 + γ_0'X_i + ε_i.

Thus, β_0 = Y_i(1) − Y_i(0) is the same for every individual i! Moreover, E[Y_i|X_i, D_i] =
α_0 + β_0 D_i + γ_0'X_i, which is a strong parametric functional form assumption (linearity).
The nonparametric identification result above does not require parametric assumptions,
and in that sense is general. But selection on observables may be strong in
many applications (e.g. returns to education and ability). It holds in a randomized
experiment where D_i is independent of everything. However, in many applications
we only have observational data and not experiments (i.e. D_i is a choice variable of
individual i, e.g. education).
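A simulation sketch of the identification argument above: the ATE is recovered by averaging the difference of conditional means of Y given X across treatment arms. The design (a single binary confounder, heterogeneous effects) is made up for illustration.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 200_000
    X = rng.binomial(1, 0.5, size=n)                 # observed confounder
    D = rng.binomial(1, 0.3 + 0.4 * X)               # treatment depends on X (selection on observables)
    Y0 = 1.0 + 0.5 * X + rng.normal(size=n)          # potential outcome without treatment
    Y1 = Y0 + 1.0 + 1.0 * X                          # heterogeneous treatment effect 1 + X
    Y = D * Y1 + (1 - D) * Y0                        # observed outcome
    true_ate = np.mean(Y1 - Y0)                      # 1.5 by construction

    # E[ E[Y|X, D=1] - E[Y|X, D=0] ]: average the within-X treated/control mean difference.
    ate_hat = 0.0
    for x in (0, 1):
        m1 = Y[(X == x) & (D == 1)].mean()
        m0 = Y[(X == x) & (D == 0)].mean()
        ate_hat += np.mean(X == x) * (m1 - m0)

    naive = Y[D == 1].mean() - Y[D == 0].mean()      # ignores X, biased under selection on X
    print(true_ate, ate_hat, naive)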

1.4 Other characteristics of distributions


In addition to cdfs, pdfs, characteristic functions, mgfs, and moments, other important
characteristics of distributions include quantile functions (for r.v.'s), copulas (for
multivariate distributions), Gini coefficients, hazard functions, etc.

1.4.1 Quantiles
For any univariate cdf F, define the quantile function

F^{-1}(u) = inf{t ∈ R : F(t) ≥ u}, u ∈ (0, 1].

From the monotonicity property of cdfs it follows that F^{-1} is non-decreasing. That
is, if 0 < u_1 ≤ u_2 < 1, then, by the definition of a quantile, F(F^{-1}(u_2)) ≥ u_2 ≥ u_1,
so that F^{-1}(u_1) ≤ F^{-1}(u_2) (by the inf definition). The parameter F^{-1}(0.5) is called
the (population) median, and F^{-1}(0.75) − F^{-1}(0.25) the interquartile range. The
definition of F^{-1} makes sense even when F does not have an inverse. Suppose F
is discontinuous at c with jump F(c) − F(c⁻), where F(c⁻) = lim_{x↑c} F(x). Then,
the usual inverse F^{-1}(u) for any F(c⁻) < u < F(c) does not exist, but F^{-1}(u) = c, since
F(c) ≥ u and F(t) ≤ F(c⁻) for t < c. This example shows that F(F^{-1}(u)) = u
may not hold. For points F^{-1}(u) where F is continuous, F(F^{-1}(u)) = u. Always
F(F^{-1}(u)) ≥ u, so continuity gives the additional F(F^{-1}(u)) ≤ u. Continuity of
F alone does not imply that F^{-1} is an inverse (there could be flat parts of F). By
definition as an inf, F^{-1}(F(x)) ≤ x, and equality is true except when x is in the
interior or at the right end of a flat part of F.
One application of quantile functions is to generate data with a given distribution
or for representations in terms of U[0,1] variables. First, note that using properties
of cdfs, for each t and any random variable (r.v.) U, the following two events are
equal:

{F^{-1}(U) ≤ t} = {U ≤ F(t)}.

Using this, one can easily show that if U is a r.v. with uniform distribution on
[0,1], then F^{-1}(U) is distributed as F (even when F is discontinuous).
If F is continuous and strictly increasing, then F^{-1}(u) is the standard inverse
function of F. Define U = F(X). It is straightforward to show that U is a uniformly
distributed r.v. on [0,1] such that X = F^{-1}(U) with probability one. This last
statement is also true even when F is not continuous, but proving it becomes a more
difficult problem.
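A sketch of the generation result {F^{-1}(U) ≤ t} = {U ≤ F(t)}: drawing U ~ U[0,1] and applying F^{-1} reproduces F. The exponential distribution is used only because its quantile function has a closed form.

    import numpy as np

    rng = np.random.default_rng(3)
    lam = 2.0

    def F(t):            # cdf of an Exponential(rate = lam)
        return 1.0 - np.exp(-lam * np.maximum(t, 0.0))

    def F_inv(u):        # quantile function F^{-1}(u) = -log(1 - u) / lam
        return -np.log1p(-u) / lam

    U = rng.uniform(size=100_000)
    X = F_inv(U)                      # X should be distributed as F

    # Compare the empirical cdf of X with F at a few points.
    for t in (0.1, 0.5, 1.0, 2.0):
        print(t, np.mean(X <= t), F(t))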

1.4.2 Copulas
A copula function is a multivariate cdf with uniform U[0, 1] marginal cdfs. An
important theorem called Sklar's Theorem (Sklar, 1959) says that for any multivariate
cdf F with marginals F_1, ..., F_d, there exists a copula function C such that

F(x_1, ..., x_d) = C(F_1(x_1), ..., F_d(x_d)).

If the marginals F_i are continuous, then C is unique. Copulas have been used in
financial econometrics to model dependence (e.g. tail dependence).


1.4.3 Hazard functions


Another quantity that characterizes the distribution of a r.v. X is its cumulative hazard
function

Λ_F(t) = ∫_{(-∞,t]} 1/(1 − F(x⁻)) dF(x).

This quantity plays an important role in the analysis of duration data (where X
represents the time to an event). For an absolutely continuous F, F(x⁻) = F(x) and we
can compute

Λ_F(t) = −ln(1 − F(t)),

from which the one-to-one mapping between Λ_F and F can be seen directly. In
this case, the derivative of Λ_F(t) is

λ_F(t) = f(t)/(1 − F(t)),

the so-called hazard function, which can be interpreted as (using the definitions of F
and f)

λ_F(t) = lim_{h→0} [F(t + h) − F(t)] / [h (1 − F(t))]
       = lim_{h→0} Pr[t < X ≤ t + h | t < X] / h.

1.4.4 Gini Coefficient

Let Y denote income, and let F be its continuous cdf, with mean μ and quantile
function F^{-1}. Continuity implies F(F^{-1}(u)) = u. By a change of variables
F^{-1}(u) = x, it can be shown that μ = ∫_0^1 F^{-1}(u) du. Define the Lorenz curve

L(p) = (1/μ) ∫_0^p F^{-1}(u) du, for 0 ≤ p ≤ 1.

Since F^{-1}(u) ≥ 0 and F^{-1} is non-decreasing, L is a non-decreasing, convex function
connecting the points (0,0) and (1,1) (so its graph is always below the straight line
connecting these points).
If A is the area between the 45-degree line and the Lorenz curve, and B is the
area under the Lorenz curve, the Gini coefficient is defined as

G = A/(A + B) = 1 − 2 ∫_0^1 L(p) dp.

It can be shown that

G = E[|Y_1 − Y_2|] / E[Y_1 + Y_2],

where Y_1 and Y_2 are independent copies of Y. Indeed, by the change of variables
F^{-1}(u) = y and Fubini it is shown that

G = 1 − (2/μ) ∫_0^1 ∫_0^p F^{-1}(u) du dp
  = 1 − (2/μ) ∫_0^1 [∫_0^1 1(u < p) dp] F^{-1}(u) du
  = 1 − (2/μ)(μ − C)
  = 2C/μ − 1,

where

C = ∫_0^∞ y F(y) dF(y).

On the other hand, by symmetry,

E[|Y_1 − Y_2|] = ∫_0^∞ ∫_0^∞ |y_1 − y_2| dF(y_1) dF(y_2)
             = 2 ∫_0^∞ ∫_0^∞ (y_1 − y_2) 1(y_1 > y_2) dF(y_1) dF(y_2)
             = 2 ∫_0^∞ [∫_0^{y_1} (y_1 − y_2) dF(y_2)] dF(y_1)
             = 2 [∫_0^∞ y_1 F(y_1) dF(y_1) − ∫_0^∞ ∫_0^{y_1} y_2 dF(y_2) dF(y_1)]
             = 2 [∫_0^∞ y_1 F(y_1) dF(y_1) − ∫_0^∞ y_2 (1 − F(y_2)) dF(y_2)]
             = 2 ∫_0^∞ y_1 (2F(y_1) − 1) dF(y_1)
             = 4C − 2μ.

Thus,

G = 2C/μ − 1 = (4C − 2μ)/(2μ) = E[|Y_1 − Y_2|] / E[Y_1 + Y_2].
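A sketch checking the equivalent expressions for G on simulated incomes: the sample versions of 2C/μ − 1 and E|Y_1 − Y_2| / E[Y_1 + Y_2] should agree up to simulation error. The lognormal income distribution is an arbitrary choice (its population Gini is roughly 0.52 for σ = 1).

    import numpy as np

    rng = np.random.default_rng(4)
    Y = rng.lognormal(mean=0.0, sigma=1.0, size=2_000)   # simulated incomes
    mu = Y.mean()

    # Version 1: G = E|Y1 - Y2| / E[Y1 + Y2], independent copies replaced by all sample pairs.
    gini_pairs = np.abs(Y[:, None] - Y[None, :]).mean() / (2.0 * mu)

    # Version 2: G = 2C/mu - 1 with C = E[Y F(Y)], estimated by plugging in the empirical cdf.
    Y_sorted = np.sort(Y)
    n = Y.size
    F_hat = np.arange(1, n + 1) / n                      # F_n evaluated at the order statistics
    C_hat = np.mean(Y_sorted * F_hat)
    gini_plugin = 2.0 * C_hat / mu - 1.0

    print(gini_pairs, gini_plugin)                       # close to each other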


1.5 Stochastic Processes


A stochastic process on (Ω, F, P) with state space (M, B) and index set T is a
family of measurable functions X_t, t ∈ T. We shall consider only cases where T = N
or T = Z.

Definition 5 A stochastic process {X_t}_{t∈Z} is strictly stationary if for any given
integer r and for any set of subscripts i_1, i_2, ..., i_r, the joint distribution of (X_i, X_{i_1}, ..., X_{i_r})
depends only on i_1 − i, i_2 − i, ..., i_r − i, but not on i.

Definition 6 A random process {X_t}_{t∈Z} is called weakly stationary (or covariance
stationary) if the random variables X_t have finite second moments (in short,
X_t ∈ L_2(F)) and both E[X_t] and E[X_t X_{t−k}] do not depend on the time index t,
for all k = 0, 1, 2, ....

Example 7 Consider the weak white noise (WN) defined as a univariate random
process {ε_t}_{t=−∞}^{+∞} such that E[ε_t] = 0 and

E[ε_t ε_{t−k}] = σ² if k = 0, and 0 otherwise,

where σ² is a finite constant. Similarly, we can define the multivariate weak white
noise {ε_t}_{t=−∞}^{+∞} by the properties E[ε_t] = 0 and

E[ε_t ε'_{t−k}] = Σ if k = 0, and 0 otherwise,

where Σ is a constant matrix. Both processes are weakly stationary but not necessarily
strictly stationary.

Example 8 Any sequence of i.i.d. random variables is obviously strictly stationary.
Consider, for instance, an i.i.d. WN process. A special case of an i.i.d. WN process is
Gaussian white noise {ε_t}_{t=−∞}^{+∞}, defined as an i.i.d. WN process such that the ε_t are
normally distributed with zero mean E[ε_t] = 0.

Consider a random process {Y_t}_{t=1}^{+∞} and an increasing sequence of information
sets {F_t}_{t=0}^{+∞}, i.e., a collection of σ-fields F_t with the property²

F_0 ⊂ F_1 ⊂ ... ⊂ F_t ⊂ F_{t+1} ⊂ ... ⊂ F_{+∞}, with F_{+∞} ⊂ F.

² In the mathematical literature such an increasing collection of information sets is called a filtration.
In applications it is usually assumed that Y_1, ..., Y_t are known by time t (notation: Y_1, ..., Y_t ∈ L_0(F_t)
for each t). If this is the case, the process {Y_t}_{t=1}^{+∞} is called adapted to the filtration {F_t}_{t=0}^{+∞}.


Definition 9 If Y_t belongs to the information set F_t and is absolutely integrable (i.e.,
Y_t ∈ L_0(F_t) ∩ L_1(F)), and if

E[Y_{t+1} | F_t] = Y_t for all finite t,

then {Y_t}_{t=1}^{+∞} is called a martingale adapted to F_t.

Now consider a random process {u_t}_{t=0}^{+∞} and an increasing sequence of information
sets {F_t}_{t=0}^{+∞}.

Definition 10 If u_t belongs to the information set F_t (u_t ∈ L_0(F_t)) and

E[u_t | F_{t−1}] = 0 for all finite t,

then {u_t}_{t=0}^{+∞} is called a martingale difference sequence (mds) adapted to F_t.

Obviously, if {Y_t}_{t=0}^{+∞} is a martingale process adapted to F_t, then the process
{u_t}_{t=1}^{+∞} defined by u_t = Y_t − Y_{t−1} is an mds adapted to F_t. It is also easy to see
that a weakly stationary mds is a weak white noise process, and that an i.i.d. white
noise process is an mds with respect to any filtration. However, a weak white noise
process does not need to be a martingale difference sequence adapted to F_t.
The definitions of martingale and martingale difference sequence can be trivially
extended to vector processes.
When looking at asset prices, the idea of lack of predictability has been commonly
referred to as the random walk hypothesis. Unfortunately, the term random
walk has been used in different contexts to mean different statistical objects. For
instance, in the Campbell, Lo and MacKinlay (1997) textbook, they distinguish three
types of random walks according to the dependence structure of the increment series.
Random walk 1 corresponds to independent increments, random walk 2 to mds
increments, and random walk 3 to uncorrelated increments. Of these three notions,
the two relevant for financial econometrics are the second and the third. The notion
of random walk 1 is clearly rejected in financial data for many reasons, the most
important being conditional heteroskedasticity.
Another popular concept of dependence in economics is that of a Markov process.
We say the stochastic process {X_t}_{t∈Z} is a Markov process of order one if

Pr[B | X_t, X_{t−1}, ..., X_0] = Pr[B | X_t] a.s. for all B ∈ σ(X_{t+1}).

This means X_{t+1} is independent of (X_{t−1}, ..., X_0) conditional on X_t.


The concept of ergodicity for stationary processes concerns the convergence
of sample moments or time averages of a single realization. The exact mathematical


definition of ergodicity is cumbersome (e.g. Taniguchi and Kakizawa, 2000, pp. 17-18),
and it will not be covered in class. Intuitively, if the time average converges to the population
moment we say that the process is ergodic. The following example illustrates a
situation where ergodicity fails.

Example 11 Consider a strictly stationary random process {X_t} defined by X_t =
U_t + Z, which is the sum of a sequence of i.i.d. uniform [0, 1] random variables U_t,
t = 1, 2, 3, ..., and a random variable Z ~ N(0, 1) independent of all U_t. Then

Y_T = (1/T) Σ_{t=1}^T X_t = (1/T) Σ_{t=1}^T U_t + Z →a.s. 1/2 + Z ≠ 1/2

by the strong law of large numbers for i.i.d. sequences.

The idea behind ergodicity is that the time average of our dynamic process must
converge to the average of infinitely many identical realizations of the same process
at one point in time. In the last example we had "too much" dependence, since

Cov(X_t, X_{t−k}) = Cov(U_t + Z, U_{t−k} + Z) = 1 + 1/12 if k = 0, and 1 if k ≠ 0.
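A quick simulation sketch of Example 11: along a single realization the time average converges to 1/2 + Z, not to the population mean E[X_t] = 1/2, because the common shock Z never averages out.

    import numpy as np

    rng = np.random.default_rng(5)
    T = 100_000
    Z = rng.normal()                      # one draw of the common component, fixed along the path
    U = rng.uniform(size=T)               # i.i.d. U[0,1] innovations
    X = U + Z                             # X_t = U_t + Z, strictly stationary but not ergodic

    print(X.mean())                       # close to 0.5 + Z ...
    print(0.5 + Z)                        # ... not to the population mean E[X_t] = 0.5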

In the next sections we shall study sufficient conditions for the sequence of sample
means to converge almost surely to the unconditional expectation of X_t. Some of
these conditions will be based on mixing assumptions. A prominent example is the
strong-mixing concept.

Definition 12 A stochastic process {X_t}_{t∈Z} is strong mixing if α(m) → 0 as
m → ∞, where α(m) are the α-strong mixing coefficients defined as

α(m) = sup_{n∈Z} sup_{B∈F_n, A∈P_{n+m}} |P(A ∩ B) − P(A)P(B)|, m ≥ 1,

where the σ-fields F_n and P_n are F_n := σ(X_t, t ≤ n) and P_n := σ(X_t, t ≥ n),
respectively.

Remark 1: It is relatively straightforward to check stationarity in linear time
series models. However, it is by no means easy to check whether a time series defined
by a nonlinear model is strictly stationary. The common practice is to represent the
time series as a Markov chain and to establish ergodicity. Stationarity follows from
the fact that an ergodic Markov chain is stationary.


Remark 2: To establish the asymptotic theory for stochastic processes we need
to restrict the "memory" of the process. There are many ways to do that. Ergodicity,
martingale or mixing concepts are just a few examples. There are more: mixingales,
near epoch dependence, other concepts of mixing (e.g. absolute regularity, ...);
see Davidson (1994). Martingales arise naturally in economics (from FOCs in agent
optimization problems). It remains an open problem to find sufficiently general
conditions under which a given process satisfies one of these "mixing" concepts.

Recommended reading for weak dependence concepts:

Doukhan, P. and Ango Nze, P. (2004). Weak dependence, models and applications
to econometrics. Econometric Theory 20, 995-1045.

Wooldridge, J. (1994). Estimation and Inference for Dependent Processes, in
Handbook of Econometrics, Volume 4. R.F. Engle and D.L. McFadden (eds.),
2639-2738. Amsterdam: North-Holland.


2 Stochastic Convergence
Asymptotic theory is very useful in econometrics for (i) approximating critical
regions of tests and confidence intervals of parameters; and for (ii) studying the
quality of inference procedures.

Definition 13 We say that a sequence of r.v.'s X_n, n = 1, 2, ..., converges in probability
to X if

lim_{n→∞} P(|X_n − X| > ε) = 0, for all ε > 0,

and it will be expressed as:

X_n →p X,   plim_{n→∞} X_n = X,   X_n = X + o_P(1).

Remark 1: For convergence in probability all the r.v.'s X_n, n = 1, 2, ..., and X have
to be defined on the same probability space (Ω, F, P).
Remark 2: X_n →p X is equivalent to |X_n − X| →p 0.

Definition 14 We say that a sequence of r.v.'s X_n, n = 1, 2, ..., converges almost
surely (a.s.) to X if

P(lim_{n→∞} |X_n − X| = 0) = 1,

and it will be expressed as:

X_n → X a.s.,   X_n = X + o(1) a.s.

Notice that

X_n → X a.s. ⟺ lim_{n→∞} P(∪_{T=n}^∞ A_T) = 0 for each ε > 0,

where A_T = {|X_T − X| > ε}.

Definition 15 We say that a sequence of r.v.'s X_n ∈ L_r(P), n = 1, 2, ..., converges
in r-th mean to X ∈ L_r(P) if

lim_{n→∞} ||X_n − X||_r = 0,

and it will be expressed as:

X_n →_r X.


Definition 16 We say that a sequence of r.v.'s X_n with cdfs F_n, n = 1, 2, ..., converges
in distribution to X with cdf F if

lim_{n→∞} F_n(x) = F(x) at each continuity point x of F,

and it will be expressed as:

X_n →d X.

When F is continuous, it can be shown that the convergence to F is uniform.

Definition 17 An estimate θ̂_n (based on a sample of size n) of a parameter θ is
(weakly) consistent for θ if θ̂_n →p θ. If θ̂_n → θ a.s. then we say that θ̂_n is strongly
consistent for θ. The estimator θ̂_n is unbiased if E[θ̂_n] = θ and is asymptotically
unbiased if E[θ̂_n] → θ as n → ∞.

Example 18 Let X_1, ..., X_n be a sequence of independent and identically distributed
r.v.'s, distributed as N(θ, 1), and let θ̂_n = n^{-1} Σ_{i=1}^n X_i be the sample mean. Then
E[θ̂_n] = θ, so the sample mean is an unbiased estimator. By the SLLN (see Section
3) it is also strongly consistent. Notice that consistency and asymptotic unbiasedness
do not imply each other. Look at the examples: (i) θ̂_n ~ N(θ, 1) for all n; and (ii) θ̂_n = θ
with probability 1 − n^{-1} and θ̂_n = n^δ with probability n^{-1}, with δ > 1.

Recommended reading for asymptotic theory:

van der Vaart, A.W. (1998). Asymptotic Statistics. Cambridge University
Press. Chapter 2.


Theorem 19 (Continuous Mapping Theorem) Let g : R^k → R^m be continuous
at every point of a set C ⊂ R^k such that P(X ∈ C) = 1. Then

(i) If X_n →d X, then g(X_n) →d g(X).

(ii) If X_n →p X, then g(X_n) →p g(X).

(iii) If X_n → X a.s., then g(X_n) → g(X) a.s.

Theorem 20 Let X_n, Y_n, X and Y be r.v.'s and c a constant. Then

(i) X_n → X a.s. implies X_n →p X;

(ii) X_n →_r X, r > 0, implies X_n →p X;

(iii) X_n →p X implies X_n →d X;

(iv) X_n →p c if and only if X_n →d c;

(v) If X_n →d X and |X_n − Y_n| →p 0, then Y_n →d X;

(vi) If X_n →d X and Y_n →p c, then (X_n, Y_n) →d (X, c);

(vii) If X_n →p X and Y_n →p Y, then (X_n, Y_n) →p (X, Y);

(viii) X_n →_r X, r > 0, implies X_n →_s X for r > s > 0.

Corollary 21 (Slutsky Lemma, Slutsky Theorem) Let X_n →d X, Y_n →d c
with c a constant. Then

(i) X_n + Y_n →d X + c;

(ii) X_n Y_n →d Xc;

(iii) X_n Y_n^{-1} →d Xc^{-1} provided c ≠ 0.

Proofs: See van der Vaart (1998).


Theorem 22 X_n →d X if and only if

(i) E[f(X_n)] → E[f(X)] as n → ∞ for all real-valued continuous and bounded
functions f;

(ii) φ_{X_n}(t) − φ_X(t) → 0 as n → ∞ for all t ∈ R^d (where φ is the characteristic
function);

(iii) λ'X_n →d λ'X for all λ ∈ R^d (this is known as the Cramér-Wold device).

Theorem 23 Let X_i, i = 1, ..., n, be iid.

(i) If E[|X_i|] < ∞, then X̄ →p E[X_i] := μ.

(ii) If E[|X_i|²] < ∞, then √n(X̄ − μ) →d N(0, Σ), where Σ = Var[X_i].

Proof. First,

φ_{X̄}(t) = E[exp(i t'X̄)] = Π_{i=1}^n E[exp(i (t/n)'X_i)] = φ_{X_i}^n(t/n).

By E[|X_i|] < ∞, φ_{X_i}(t) is differentiable around zero (by dominated convergence) and

φ_{X_i}(t) = φ_{X_i}(0) + t'φ̇_{X_i}(0) + o(t).

Thus, since φ_{X_i}(0) = 1 and φ̇_{X_i}(0) = iμ,

φ_{X̄}(t) = (1 + i t'μ/n + o(1/n))^n → exp(i t'μ).

The limit exp(i t'μ) is the characteristic function of the constant μ, and the result follows
from Theorem 22(ii).
The proof of the second part is similar, but with a second-order Taylor expansion.
Take without loss of generality μ = 0. Then, since φ_{X_i}(0) = 1, φ̇_{X_i}(0) = 0,
φ̈_{X_i}(0) = −Σ, and φ_{√n X̄}(t) = φ_{X_i}^n(t/√n),

φ_{√n X̄}(t) = (1 − t'Σt/(2n) + o(|t|²/n))^n → exp(−t'Σt/2).

The right hand side is the characteristic function of a N(0, Σ). Recall the characteristic
function of a multivariate normal N(μ, Σ) is exp(i t'μ − t'Σt/2).


Problem. Use the previous results to establish the convergence of the t-ratio

t_n = √n (X̄ − μ) / S_n,

where X̄ is the sample mean of iid r.v.'s with finite second moments and S_n is the
sample standard deviation, i.e. S_n² = (n − 1)^{-1} Σ_{i=1}^n (X_i − X̄)².

Solution. By the standard CLT above,

√n (X̄ − μ) →d N(0, σ²),

where σ² = Var(X_i). Write

S_n² = [n/(n − 1)] [ (1/n) Σ_{i=1}^n X_i² − (X̄)² ].

Define the sequence Z_n = (n/(n − 1), \overline{X²}, X̄), where \overline{X²} = n^{-1} Σ_{i=1}^n X_i². By the LLN,
Z_n →p (1, E[X_i²], E[X_i]). By the Continuous Mapping Theorem (CMT),

S_n² = g(Z_n) →p 1 · (E[X_i²] − (E[X_i])²) = Var(X_i) = σ².

Again, by the CMT, S_n →p σ. By Slutsky's Theorem,

t_n = √n(X̄ − μ)/S_n →d N(0, σ²)/σ =d N(0, 1),

where =d denotes equality in distribution.

Problem. Explain the relation between the previous result and that used in undergraduate
statistics (the t-ratio follows a Student t distribution with n − 1 degrees of
freedom).
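A Monte Carlo sketch of the t-ratio result: for non-normal iid data the exact Student t(n − 1) distribution no longer applies, but the standard normal approximation still works well for moderate n. The exponential design, sample size and number of replications are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(6)
    mu = 1.0                                         # mean of an Exponential(1) population
    reps, n = 20_000, 200

    t_stats = np.empty(reps)
    for r in range(reps):
        X = rng.exponential(scale=1.0, size=n)       # non-normal iid data with mean mu
        t_stats[r] = np.sqrt(n) * (X.mean() - mu) / X.std(ddof=1)

    # If t_n ->d N(0,1), these should be close to 0, 1 and 0.05, respectively.
    print(t_stats.mean(), t_stats.std(), np.mean(np.abs(t_stats) > 1.96))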

Definition 24 (Uniform integrability) A sequence of r.v.'s {X_n, n = 1, 2, ...} is
uniformly integrable (u.i.) if:

lim_{c→∞} sup_{n≥1} E[|X_n| 1(|X_n| > c)] = 0.

Some useful results:

(i) If for some δ > 0, sup_{n≥1} E[|X_n|^{1+δ}] < ∞, then {X_n, n = 1, 2, ...} is u.i.


(ii) If {X_n, n = 1, 2, ...} are identically distributed (i.d.) and E[|X_1|] < ∞, then
{X_n, n = 1, 2, ...} is u.i.
Proof. lim_{c→∞} sup_{n≥1} E[|X_n| 1(|X_n| > c)] = lim_{c→∞} E[|X_1| 1(|X_1| > c)] = 0,
where the last equality follows by the Dominated Convergence Theorem (DCT)
applied to any sequence c_n ↑ ∞ and f_n(x) = |x| 1(|x| > c_n) ≤ |x| =: g(x).

(iii) {X_n, n = 1, 2, ...} is u.i. ⟹ sup_{n≥1} E[|X_n|] < ∞.

Theorem 25 Let f : R^k → R be measurable and continuous at every point in a
set C. Let X_n →d X, with X taking its values in C. Then E[f(X_n)] → E[f(X)] if
and only if {f(X_n), n = 1, 2, ...} is u.i.

Stochastic o and O symbols

Definition 26 We say that {X_n, n = 1, 2, ...} is bounded in probability if and only if
for each ε > 0 there exist a finite constant C_ε and an integer n_ε such that

P(|X_n| > C_ε) < ε, for all n > n_ε,

and we write X_n = O_P(1). If a_n is a nonrandom positive sequence and X_n/a_n =
O_P(1), we write X_n = O_P(a_n).

Definition 27 If X_n →p 0 we write X_n = o_P(1), and similarly if X_n/a_n = o_P(1),
we write X_n = o_P(a_n).

(a) X_n = o_P(a_n) ⟹ X_n = O_P(a_n).

(b) If X_n = O_P(a_n) and a_n/b_n → 0 as n → ∞, then X_n = o_P(b_n).

(c) If X_n = O_P(a_n) and Y_n = O_P(b_n), then:

(i) X_n + Y_n = O_P(max(a_n, b_n));

(ii) X_n Y_n = O_P(a_n b_n).

(d) Part (c) is satisfied substituting O_P by o_P.

(e) If E[|X_n|^r] < ∞, then X_n = O_P((E[|X_n|^r])^{1/r}).


(f) X_n →d X ⟹ X_n = O_P(1).
Proof. Without loss of generality choose C_ε a continuity point of the cdfs of X_n
and X; by convergence in distribution we have that there exists an n_ε such
that for all n > n_ε,

P(|X_n| > C_ε) ≤ P(|X| > C_ε) + ε/2.

We can always make P(|X| > C_ε) < ε/2 by choosing C_ε sufficiently large.

(g) (Generalization of the Heine-Borel Theorem) X_n = O_P(1) ⟹ there exists a
subsequence n_j with X_{n_j} →d X as j → ∞, for some X (possibly depending on
the subsequence).

Problem. Use the calculus of O_P and o_P to find the probability limit of S_n² (as an
alternative to the CMT used before). Furthermore, establish the asymptotic distribution
of √n(S_n² − σ²), providing sufficient conditions for the validity of the convergence.

Solution. By several applications of the LLN and O_P(1)o_P(1) = o_P(1),

S_n² = [1/(n−1)] Σ_{i=1}^n (X_i − μ)² + [2/(n−1)] (μ − X̄) Σ_{i=1}^n (X_i − μ) + [n/(n−1)] (μ − X̄)²
     = [1/(n−1)] Σ_{i=1}^n (X_i − μ)² + o_P(1)o_P(1) + O_P(1)o_P(1)
     = (1/n) Σ_{i=1}^n (X_i − μ)² + [1/((n−1)n)] Σ_{i=1}^n (X_i − μ)² + o_P(1)
     = (1/n) Σ_{i=1}^n (X_i − μ)² + o_P(1)
     = σ² + o_P(1).

Furthermore, by the same calculations,

√n(S_n² − σ²) = (1/√n) Σ_{i=1}^n [(X_i − μ)² − σ²]
              + [1/(√n(n−1))] Σ_{i=1}^n (X_i − μ)² + √n(μ − X̄) [2/(n−1)] Σ_{i=1}^n (X_i − μ) + [n/(n−1)] √n(μ − X̄)²
            = (1/√n) Σ_{i=1}^n [(X_i − μ)² − σ²] + O_P(n^{-1/2}) + O_P(1)o_P(1) + O_P(n^{-1/2})O_P(1)
            = (1/√n) Σ_{i=1}^n [(X_i − μ)² − σ²] + o_P(1).

Define Z_i = (X_i − μ)² − σ², and note that E[Z_i] = 0 and, under the assumption that
E[X_i⁴] < ∞, it follows that

E[Z_i²] = E[(X_i − μ)⁴] − σ⁴ < ∞.

Thus, by the CLT,

(1/√n) Σ_{i=1}^n [(X_i − μ)² − σ²] →d N(0, E[Z_i²]).

By Slutsky's theorem then

√n(S_n² − σ²) →d N(0, E[Z_i²]).


3 Law of Large Numbers and Central Limit Theorems

Given a sequence {X_i}_{i=1}^n of r.v.'s define the empirical (random) measure

P_n(A) = (1/n) Σ_{i=1}^n 1(X_i ∈ A) = (1/n) Σ_{i=1}^n δ_{X_i}(A)

for a Borel set A of R^d, where δ_{X_i}(A) is the point mass or Dirac measure,
δ_{X_i}(A) = 1 if X_i ∈ A and zero otherwise. A related quantity is the indicator
function 1_A(x) = 1 if x ∈ A and zero otherwise. The following simple but important
relationship holds:

δ_x(A) = ∫ 1_A(y) dδ_x = 1_A(x).

Two important quantities associated with P_n are:

(i) The empirical cdf

F_n(x) := P_n((-∞, x]) = (1/n) Σ_{i=1}^n 1(X_i ≤ x).

(ii) The empirical expectation operator

E_n[X] := ∫ x dP_n = (1/n) Σ_{i=1}^n ∫ x dδ_{X_i} = (1/n) Σ_{i=1}^n X_i.

Laws of Large Numbers (LLN) are theorems establishing the convergence in
probability or a.s. of sample means E_n[X] to population means E[X].
Central Limit Theorems (CLT) are theorems establishing the convergence in
distribution of a_n(E_n[X] − E[X]) to a proper r.v., for suitable sequences a_n ↑ ∞ as
n → ∞.
We shall study these laws under different regularity conditions. Informally
speaking, for these laws to hold we need to restrict the dependence and the moments
of the processes under consideration. From now on we will use the empirical mean
notation E_n to denote sample means.


3.0.1 Independent observations

Theorem 28 Let {X_i}_{i=1}^∞ be independent with sup_{i≥1} |E[X_i]| = 0. If {X_i}_{i=1}^∞ are u.i.,
then

E_n[X] →_1 0.

Theorem 29 (Khinchine WLLN) Let {X_i}_{i=1}^∞ be independent and identically distributed
(iid) with E[|X|] < ∞; then

E_n[X] →p E[X].

The Kolmogorov "three series" theorem establishes necessary and sufficient
conditions for a.s. convergence under independence; see Davidson (1994, p. 311).

To permit sufficient generality we consider triangular arrays of r.v.'s {X_{nt}, t =
1, ..., n, n ≥ 1}, having zero mean and variances σ²_{nt}. Define S_n = Σ_{t=1}^n X_{nt}.

Theorem 30 (Lindeberg-Feller CLT for triangular arrays) Assume (i) E[X_{nt}] =
0 for all t, n; (ii) Σ_{t=1}^n V[X_{nt}] = 1 for all n; (iii) X_{nt} and X_{ns} are independent if t ≠ s, for all n;
and (iv) (Lindeberg condition)

Σ_{t=1}^n E[X_{nt}² 1(|X_{nt}| > ε)] → 0, for all ε > 0, as n → ∞.

Then

S_n →d X ~ N(0, 1).

The Liapunov CLT replaces (iv) by the sufficient condition

Σ_{t=1}^n E[|X_{nt}|^{2+δ}] → 0, for some δ > 0, as n → ∞.

Lindeberg-Lévy CLT: {X_{nt} : t = 1, ..., n} iid with E[n X_{nt}²] ≤ C < ∞. The
classical CLT corresponds to X_{nt} = (X_t − μ)/(σ√n) for iid {X_t}.


3.0.2 Dependent observations

Theorem 31 (Chebyshev's LLN) If {X_i}_{i=1}^n satisfy

E[E_n[X]] → E[X] and |V[E_n[X]]| → 0 as n → ∞,

then

E_n[X] →_2 E[X].

Proof. Trivial.

A useful version of the previous theorem is:

Theorem 32 Let {X_t, t = 1, ..., n} satisfy E[X_t] = μ_t and Cov(X_t, X_s) := R_{ts}.
Then

(1/n²) Σ_{t=1}^n Σ_{s=1}^n tr(R_{ts}) → 0 ⟺ X̄_n − μ̄ →_2 0,

where X̄_n = n^{-1} Σ_{t=1}^n X_t and μ̄ = n^{-1} Σ_{t=1}^n μ_t.

Theorem 33 (Ergodic Theorem) Let {X_t} be a strictly stationary and ergodic
process with E[|X_1|] < ∞. Then

E_n[X] → E[X_1] a.s.

Virtually all standard asymptotic theory of time series processes hinges on the
ergodicity property. For this reason, applied econometricians frequently assume that
the time series of interest are ergodic. The following central limit theorem is an
extension of the Lindeberg-Lévy central limit theorem to stationary and ergodic
martingale difference sequences.

Theorem 34 (Martingale LLN) Let {X_i, F_i} be an mds with variance sequence
{σ_i²}, S_n = a_n^{-1} Σ_{i=1}^n X_i and {a_i} a positive sequence with a_i ↑ ∞. If Σ_{i=1}^∞ σ_i²/a_i² <
∞, then

S_n → 0 a.s.

Theorem 35 (Mixing LLN) Let {X_i} be a strictly stationary and α-mixing process
with E[|X_1|] < ∞. Then

E_n[X] → E[X_1] a.s.

Ref: Taniguchi and Kakizawa (2000, Section 1.3).


Theorem 36 (Martingale CLT)

(i) (Ergodic, strictly stationary sequences, Billingsley 1961). Let {X_t} be a strictly
stationary and ergodic mds with E[X_1 X_1'] = Σ. Then

n^{-1/2} Σ_{t=1}^n X_t →d X ~ N(0, Σ).

(ii) Let {X_{ni}, F_{ni}} be an mds array with (unconditional) variance sequence {σ²_{ni}}, with
Σ_{i=1}^n σ²_{ni} = 1. If

(a) Σ_{i=1}^n X_{ni}² →p 1,
(b) max_{1≤i≤n} |X_{ni}| →p 0,

then

S_n = Σ_{i=1}^n X_{ni} →d X ~ N(0, 1).

Theorem 37 (Strong Mixing CLT) Assume that {X_t} is a strictly stationary
and α-mixing process with E[X_1] = 0 and Σ = Γ(0) + 2 Σ_{j=1}^∞ Γ(j) is positive definite
and finite. Then

n^{-1/2} Σ_{t=1}^n X_t →d X ~ N(0, Σ)

if one of the following conditions holds:

(i) E[|X_1|^δ] < ∞ and Σ_{j=1}^∞ α(j)^{1−2/δ} < ∞ for some constant δ > 2;

(ii) P(|X_1| < c) = 1 for some constant c > 0 and Σ_{j=1}^∞ α(j) < ∞.


3.1 The Delta Method

The Delta Method is useful to establish the asymptotic normality (AN) of a smooth
transformation of an AN estimator.

Theorem 38 (Delta Method) Suppose we have

n^{1/2}(X_n − μ_0) →d Z ~ N(0, Σ),

and g : R^d → R^s is a vector-valued function which is continuously differentiable in
a neighbourhood of μ_0; then

n^{1/2}(g(X_n) − g(μ_0)) →d AZ ~ N(0, AΣA'),

where A = ∂g(μ_0)/∂x'.

Example: The sample variance of n iid observations X_1, ..., X_n is

S_n² = n^{-1} Σ_{t=1}^n (X_t − E_n[X])² = E_n[X²] − (E_n[X])² := g(E_n[X], E_n[X²]),

where g(x, y) = y − x². By the multivariate CLT,

√n (E_n[X] − E[X], E_n[X²] − E[X²])' →d N_2(0, V), with

V = [ E[X²] − (E[X])²       E[X³] − E[X]E[X²]
      E[X³] − E[X]E[X²]     E[X⁴] − (E[X²])²  ].

The function g is differentiable at (E[X], E[X²])' with derivative (−2E[X], 1)'. Hence,
if (T_1, T_2) is the vector that possesses the normal distribution of the last display, by
the Delta Method,

√n(S_n² − σ²) →d −2E[X] T_1 + T_2.

In fact, if E[X] = 0 the limit distribution is a normal r.v. with mean zero and variance
E[X⁴] − (E[X²])² (show this). A more direct proof of this result (without using the
Delta Method) is based on

S_n² = n^{-1} Σ_{t=1}^n (X_t − E[X])² − (E_n[X] − E[X])² = n^{-1} Σ_{t=1}^n (X_t − E[X])² + o_P(n^{-1/2}).
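A simulation sketch of the Delta Method example: the Monte Carlo variance of √n(S_n² − σ²) is compared with the asymptotic variance Var[(X − E[X])²] = E[(X − E[X])⁴] − σ⁴ implied by either derivation. The chi-square design and sample sizes are arbitrary.

    import numpy as np

    rng = np.random.default_rng(7)
    reps, n = 20_000, 500
    df = 3                                            # chi-square(3) population, so E[X] != 0
    sigma2 = 2.0 * df                                 # population variance of chi-square(df)
    mu4 = (3.0 + 12.0 / df) * sigma2**2               # central 4th moment (kurtosis is 3 + 12/df)
    asym_var = mu4 - sigma2**2                        # Var[(X - E[X])^2], the Delta Method variance

    stats = np.empty(reps)
    for r in range(reps):
        X = rng.chisquare(df, size=n)
        S2 = np.mean((X - X.mean())**2)               # the (1/n) sample variance used in the example
        stats[r] = np.sqrt(n) * (S2 - sigma2)

    print(stats.var(), asym_var)                      # Monte Carlo vs asymptotic variance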


Part II
Statistics

4 Statistical Models: Identification and Specification

Suppose we have a sample X_1, ..., X_n from a distribution P that belongs to a statistical
model P on the sample space (X, A). A statistical model is a collection of
probabilities. Here, P is called the population and n the sample size. In parametric
modelling we assume P = {P_θ : θ ∈ Θ} where Θ is a subset of Euclidean space
R^p. For example, P can be a normal probability model for a univariate X, with
θ = (μ, σ²) ∈ Θ := R × (0, ∞). In nonparametric econometrics, Θ is a subset (of
infinite dimension) of an infinite-dimensional metric space. In any case, we say the
model P is correctly specified if P ∈ P, that is, there exists θ_0 ∈ Θ such that
P = P_{θ_0}. From now on a subscript 0 means that the parameter is the one that generated
the true data generating process P (i.e. P = P_{θ_0}). If θ_0 ∈ Θ is unique, in the
sense that if P = P_θ then θ = θ_0, we say that θ_0 is identified. The identified set is

Θ_0(P) = {θ_0 ∈ Θ : P = P_{θ_0}}.

An equivalent characterization of identification is that Θ_0(P) = {θ_0}.

Identification can be defined without reference to correct specification. A model
P = {P_θ : θ ∈ Θ}, not necessarily parametric, is identifiable iff for all θ_1, θ_2 ∈ Θ, θ_1 ≠
θ_2 implies P_{θ_1} ≠ P_{θ_2}. This is a global property of the model and depends of course
on the parameter space Θ. Under misspecification, one describes θ_0 as a pseudo-true
value (White 1982), but its definition depends on the metric chosen to measure the
discrepancy between P and the model P, i.e. θ_0(d) = arg min_{θ∈Θ} d(P, P_θ), for a
metric d, provided such a minimum exists. One could define Θ_{0,d}(P) = {θ_0 ∈ Θ :
θ_0 ∈ arg min_{θ∈Θ} d(P, P_θ)}, and then define identification under misspecification as
Θ_{0,d}(P) = {θ_0}. However, as is customary in econometrics, we shall assume correct
specification when dealing with identification in most of our examples.
There are many examples of parametric models in statistics (e.g. the normal distribution;
see Shao's book for more examples). Parametric models are useful, but
inferences within the parametric setting are in general not robust to misspecification.
That is, if P does not belong to P, then inferences for the pseudo parameter θ_0, if any, will be
generally invalid. Nonparametric models are more robust to misspecification than
parametric models, but they generally lead to less precise fits. In semiparametric

econometrics, where P = {P_{θ,η} : θ ∈ Θ, η ∈ H}, Θ ⊂ R^p and H is a subset of
a metric space, one is typically interested in inference on θ in the presence of the
unknown nuisance parameter η ∈ H. Semiparametric models offer a compromise
between parametric models and nonparametric models.
These definitions of parametric, nonparametric and semiparametric models are
rather informal, but they suffice for our purposes. We typically assume the existence
of a σ-finite measure μ dominating the probability measures of P, and parametrize
the model by its densities with respect to (wrt) μ, i.e. P = {p_{θ,η} = dP_{θ,η}/dμ : θ ∈
Θ, η ∈ H}. We assume the model is correctly specified, that is, there exist θ_0 ∈ Θ
and η_0 ∈ H such that P = P_{θ_0, η_0}. Furthermore, we often assume such θ_0 ∈ Θ and
η_0 ∈ H are unique: this is identification. Much recent emphasis in econometrics is
given to nonparametric identification (identification when P is large). Identification
is logically the first and most important step in statistical modelling. We shall see
some identification results, but much of the discussion will proceed to estimation and
inference (hypothesis testing and confidence intervals), taking model specification
and identification as given.

5 Empirical Measures and Sufficiency

A common statistical problem is: given an iid sample X_1, ..., X_n from an unknown
distribution P that belongs to a statistical model P on the sample space (X, A), learn
about some aspect of P, say θ(P). An example is θ(P) = E[X_i]. When a model is
identified, P = P_θ has a unique solution θ_0(P) and the mapping θ(P) = θ_0(P) might
be implicit. In some sense identification implies that estimating P is sufficient for
inference. A natural "estimator" for P is the empirical measure

P_n(A) := (1/n) Σ_{i=1}^n δ_{X_i}(A) = (1/n) Σ_{i=1}^n 1_A(X_i),

with the associated empirical cdf

F_n(x) := (1/n) Σ_{i=1}^n 1_{(-∞,x]}(X_i).

These are random measures and cdfs. Given a realization x_1, ..., x_n of X_1, ..., X_n, the
empirical measure conditional on the data becomes

P_n(A) := (1/n) Σ_{i=1}^n δ_{x_i}(A),


which is a standard (discrete) measure whose cdf, with some abuse of notation, is also
denoted by F_n and is a distribution function putting mass 1/n at each point x_i (when
there are no ties). That is, conditional on the data, P_n(·) is a multinomial probability
measure.
Assume univariate data. For a fixed x ∈ R, {1(X_i ≤ x)}_{i=1}^n is a sequence of
independent and identically distributed (iid) Bernoulli variables. The mean and
variance of 1(X_i ≤ x) are, respectively, F(x) and F(x)(1 − F(x)).³ As an estimator of
F(x), F_n(x) is unbiased and consistent in mean square. Furthermore, by Hoeffding's
inequality, for any x and ε > 0,

P(|F_n(x) − F(x)| > ε) ≤ 2 e^{−2nε²},

which implies convergence in probability (and the possibility of constructing finite
sample confidence intervals for F(x); more on this below). Since F_n(x) is the sample
mean of {1(X_i ≤ x)}_{i=1}^n, the strong Law of Large Numbers (LLN) yields, for fixed
x ∈ R,

F_n(x) →_{a.s.} F(x) := E[1_{(−∞,x]}(X_i)].     (6)

Moreover, also for fixed x ∈ R, the standard Central Limit Theorem (CLT) yields

ν_n(x) := √n (F_n(x) − F(x)) →_d N(0, F(x)(1 − F(x))).     (7)

This result can easily be extended to multivariate convergence of (ν_n(x_1), ν_n(x_2), ..., ν_n(x_m))
for arbitrary points (x_1, x_2, ..., x_m) ∈ R^m and m < ∞. The r.v. (ν_n(x_1), ν_n(x_2), ..., ν_n(x_m))
converges in distribution to a multivariate normal with zero mean vector and covariance
matrix with (j, k)-th element (a ∧ b := min(a, b))

F(x_j ∧ x_k) − F(x_j)F(x_k).

This is the so-called convergence of the finite-dimensional distributions of ν_n (fidis).


The pointwise convergence in (6) was extended to a uniform convergence in the
following celebrated result:

Theorem 39 (Glivenko-Cantelli, 1933) For {X_i} iid r.v.'s (with an arbitrary cdf F!)

D_n := ||F_n − F||_∞ := sup_{x∈R} |F_n(x) − F(x)| →_{a.s.} 0.

³E[1(X_i ≤ x)] = P(X_i ≤ x) = F(x) and V[1(X_i ≤ x)] = E[1²(X_i ≤ x)] − (E[1(X_i ≤ x)])² = F(x) − F²(x).


The Dvoretzky-Kiefer-Wolfowitz (DKW) inequality, which provides a uniform version
of Hoeffding's inequality, gives a bound on the rate of convergence to zero: for any
ε > 0,

P( sup_{x∈R} |F_n(x) − F(x)| > ε ) ≤ 2 e^{−2nε²}.
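As a quick illustration, here is a minimal Python sketch of the empirical cdf together with the finite-sample DKW confidence band; the sample, evaluation grid and confidence level are illustrative choices.

```python
import numpy as np

def ecdf(sample, x):
    """Empirical cdf F_n evaluated at each point of x."""
    sample = np.sort(sample)
    return np.searchsorted(sample, x, side="right") / sample.size

rng = np.random.default_rng(0)
X = rng.normal(size=200)                 # illustrative sample, F = N(0,1)
grid = np.linspace(-3, 3, 7)

Fn = ecdf(X, grid)
alpha = 0.05
# DKW: P(sup|Fn - F| > eps) <= 2 exp(-2 n eps^2) = alpha
eps = np.sqrt(np.log(2 / alpha) / (2 * X.size))
band = np.c_[np.clip(Fn - eps, 0, 1), np.clip(Fn + eps, 0, 1)]
print(np.round(np.c_[grid, Fn, band], 3))
```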

For any cdf F, define the quantile function F^{−1}(u) = inf{t ∈ R : F(t) ≥ u}, u ∈
(0, 1). This is a generalization of the concept of inverse. Since F is non-decreasing,
has left limits and is right continuous, it can be shown that F^{−1} is non-decreasing,
has right limits and is left continuous. Furthermore, from the definition of F^{−1},
for each t and any r.v. U,

{F^{−1}(U) ≤ t} = {U ≤ F(t)}.     (8)

This implies that if U is a r.v. with uniform distribution, then F^{−1}(U) is distributed
as F (even when F is discontinuous). This result is useful to generate random
numbers from F, provided F^{−1} can be explicitly computed. Also useful (and more
difficult to prove) is that for any random variable X with cdf F, we can find a
uniformly distributed r.v. U such that X = F^{−1}(U) with probability one (try to
understand this with a plot). This result applied to each X_i implies X_i = F^{−1}(U_i)
for i = 1, ..., n, for an independent sample U_1, ..., U_n from a uniform distribution on
(0,1). Then, from (8), for all x,

F_n(x) = G_n(F(x)),     (9)

where G_n denotes the ecdf of U_1, ..., U_n.
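As an illustration of how (8) is used to generate random numbers, here is a minimal Python sketch based on an exponential distribution, whose quantile function is explicit; the distribution, rate and sample size are illustrative choices.

```python
import numpy as np

def exp_quantile(u, lam=2.0):
    """F^{-1}(u) = -log(1-u)/lam for F(t) = 1 - exp(-lam*t), t >= 0."""
    return -np.log1p(-u) / lam

rng = np.random.default_rng(1)
U = rng.uniform(size=100_000)
X = exp_quantile(U)          # distributed as Exp(lam) by the quantile transform

# sanity check: empirical mean should be close to 1/lam = 0.5
print(X.mean())
```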


Suppose that F is continuous, so its range is all of [0, 1]. Then the supremum
distance of the standard empirical process satisfies

sup_{x∈R} |ν_n(x)| = sup_{x∈R} √n |G_n(F(x)) − F(x)| = sup_{u∈[0,1]} √n |G_n(u) − u|,

and hence becomes distribution free. That is, its finite sample distribution does not depend
on F and is fully known. Although it is possible to develop statistical theory in finite
samples under certain conditions (e.g. exponential families), most of the discussion
in this course will be based on asymptotic arguments because finite sample distributions
are in general unknown. To learn about exponential families I recommend reading
Section 2.1.3 in Shao. Many well known distributions are members of the exponential
family, including the normal distribution, binomial and multinomial, among others.


Note that the Glivenko-Cantelli theorem holds for an arbitrary cdf F. It generalizes
the LLN (it is a Uniform LLN, in short ULLN). Moreover, this theorem justifies
the analog principle in econometrics: θ(F) can be estimated by the corresponding
functional of the empirical measure F_n, θ(F_n), when the latter is well-defined. An
important class of such functionals are linear functionals or integrals

θ(F) = ∫ φ(x) dF,

which are naturally estimated by the empirical integral

θ(F_n) = ∫ φ(x) dF_n
       = ∫ φ(x) d( (1/n) Σ_{i=1}^n δ_{X_i} )
       = (1/n) Σ_{i=1}^n ∫ φ(x) dδ_{X_i}
       = (1/n) Σ_{i=1}^n φ(X_i).

To understand the last equality, note these two basic equalities

δ_z(A) = ∫ 1_A(x) dδ_z = 1_A(z)

(where the first follows from a property of indicator functions while the second from
a property of Dirac measures).
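A minimal Python sketch of the analog principle for linear functionals, where θ(F) = ∫φ dF is estimated by the sample average of φ(X_i); the function φ and the data-generating distribution are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.gamma(shape=3.0, scale=1.0, size=5000)   # illustrative sample

phi = lambda x: x**2                             # linear functional theta(F) = E[X^2]
theta_hat = phi(X).mean()                        # theta(F_n) = (1/n) sum phi(X_i)
print(theta_hat)                                 # population value is 12 for Gamma(3,1)
```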
In many cases of interest the functional θ_0 = θ(F) is implicit rather than explicit.
For example,

θ(F) = arg min_{θ∈Θ} Q_F(θ).

This more complicated case is quite general and is investigated in the next sections.
In some sense the empirical cdf F_n is a sufficient statistic. Assume there are
no ties. The empirical cdf has jumps at the realizations of the order statistics, x_{(i)},
where x_{(1)} < x_{(2)} < ··· < x_{(n)} are realizations of X_{(1)} < X_{(2)} < ··· < X_{(n)}. In fact,
note

F_n(x_{(j)}) = j/n


and R_j = n F_n(X_{(j)}) are the ranks. As we shall show below, the order statistics are
sufficient statistics for P. What do we mean by this?
Denote Xn = (X_1, ..., X_n). A statistic T_n(Xn) is a measurable mapping that is
known if Xn is known. It seems inefficient to code n numbers (or vectors) in a function
F_n, so often researchers try to reduce the dimensionality and look for statistics with
reduced range. What do we mean by sufficiency?
A statistic T(Xn) provides a reduction of the σ-field σ(Xn). Does such a reduction
result in any loss of information concerning the unknown population? If a statistic
T(Xn) is fully as informative as the original sample Xn, then statistical analyses can
be done using T(Xn), which is simpler than Xn. The next concept describes what we
mean by fully informative.

Definition 40 Let Xn be a sample from an unknown population P ∈ P, where P is
a statistical model. A statistic T(Xn) is said to be sufficient for P ∈ P (or for θ ∈ Θ
when P = {P_θ : θ ∈ Θ} is a parametric family) iff the conditional distribution of Xn
given T(Xn) is known (does not depend on P or θ).

Once we observe Xn and compute a sufficient statistic T(Xn), the original data
Xn do not contain any further information concerning the unknown population
P (since its conditional distribution is unrelated to P) and can be discarded.

A sufficient statistic T(Xn) contains all the information about P contained in Xn
and provides a reduction of the data if T is not one-to-one.

The concept of sufficiency depends on the given family P.

If T is sufficient for P ∈ P, then T is also sufficient for P ∈ P_0 ⊂ P but not
necessarily sufficient for P ∈ P_1 ⊃ P.

Example 41 Suppose that Xn = (X_1, ..., X_n) and X_1, ..., X_n are i.i.d. from the binomial
distribution with the p.d.f. (w.r.t. the counting measure)

f_θ(z) = θ^z (1 − θ)^{1−z} I_{{0,1}}(z),  z ∈ R,  θ ∈ (0, 1).

Consider the statistic T(Xn) = Σ_{i=1}^n X_i, which is the number of ones in Xn. For any
realization x of Xn, x is a sequence of n ones and zeros. T contains all information
about θ, since θ is the probability of an occurrence of a one in x and, given T = t,
what is left in the data set x is the redundant information about the positions of the t
ones. To show T is sufficient for P_θ, we compute P(Xn = x | T = t). Let t = 0, 1, ..., n
and B_t = {(x_1, ..., x_n) : x_i = 0, 1, Σ_{i=1}^n x_i = t}.


If x ∉ B_t, then P(Xn = x, T = t) = P(∅) = 0.

If x ∈ B_t, then

P(Xn = x, T = t) = Π_{i=1}^n P(X_i = x_i) = θ^t (1 − θ)^{n−t} Π_{i=1}^n I_{{0,1}}(x_i).

Also

P(T = t) = \binom{n}{t} θ^t (1 − θ)^{n−t} I_{{0,1,...,n}}(t).

Then

P(Xn = x | T = t) = P(Xn = x, T = t) / P(T = t) = \binom{n}{t}^{−1} I_{B_t}(x)

is a known p.d.f. (does not depend on θ).

Hence T(Xn) is sufficient for θ ∈ (0, 1) according to the definition.

How to find a sufficient statistic? Finding a sufficient statistic by means of the
definition is not convenient. It involves guessing a statistic T that might be sufficient
and computing the conditional distribution of X given T = t. For families of
populations having p.d.f.'s, a simple way of finding sufficient statistics is to use the
factorization theorem.

Theorem 42 Suppose that Xn is a sample from P ∈ P and P is a family of probability
measures on (R^n, B^n) dominated by a σ-finite measure ν. Then T(Xn) is
sufficient for P ∈ P iff there are nonnegative Borel functions h (which does not depend
on P) on (R^n, B^n) and g_P (which depends on P) on the range of T such that

(dP/dν)(x) = g_P(T(x)) h(x).

Example 43 If P is an exponential family, then Theorem 2.2 can be applied with

g_θ(t) = exp{[η(θ)]'t − ξ(θ)},

i.e., T is a sufficient statistic for θ ∈ Θ.

In Example 2.10 the joint distribution of X is in an exponential family with
T(Xn) = Σ_{i=1}^n X_i.
Hence, we can conclude that T is sufficient for θ ∈ (0, 1) without computing the
conditional distribution of X given T.


Example 44 Let Xn = (X_1, ..., X_n) and X_1, ..., X_n be i.i.d. random variables having
a distribution P ∈ P, where P is the family of distributions on R having Lebesgue
p.d.f.'s.
Let X_{(1)}, ..., X_{(n)} be the order statistics. Note that the joint p.d.f. of X is

f(x_1) ··· f(x_n) = f(x_{(1)}) ··· f(x_{(n)}).

Hence, T(Xn) = (X_{(1)}, ..., X_{(n)}) is sufficient for P ∈ P.

The order statistics can be shown to be sufficient even when P is not dominated
by any σ-finite measure, but Theorem 2.2 is not applicable (see Shao Exercise 31 in
§2.6).


6 Statistical Decision Theory


The basic elements of statistical decision theory are:

X: a sample from a population P 2 P

Decision: an action we take after observing X

A: the set of allowable actions

(A; A): the action space

X : the range of X

Decision rule: a measurable function (a statistic) d from (X ; FX ) to (A; A)

If X is observed, then we take the action d(X) 2 A

The criterion for choosing decision rules is given by a loss function:

De…nition 45 Loss function L(P; a): a function from P A to [0; 1). L(P; a) is
Borel for each P . If X = x is observed and our decision rule is d, then our “loss" is
L(P; d(x)).

It is di¢ cult to compare L(P; d1 (X)) and L(P; d2 (X)) for two decision rules, d1 and
d2 , since both of them are random. This motivates the de…nition of Risk.

Definition 46 The Risk of d at P is the average (expected) loss defined as

R_d(P) = E[L(P, d(X))] = ∫_X L(P, d(x)) dP_X(x).

If P is a parametric family indexed by θ, the loss and risk are denoted by L(θ, a) and
R_d(θ).

For decision rules d_1 and d_2, d_1 is as good as d_2 iff

R_{d_1}(P) ≤ R_{d_2}(P)  ∀P ∈ P,

and is better than d_2 if, in addition, R_{d_1}(P) < R_{d_2}(P) for at least one P ∈ P.

Two decision rules d_1 and d_2 are equivalent iff R_{d_1}(P) = R_{d_2}(P) for all P ∈ P.


Example 47 Let X = (X_1, ..., X_n) be a vector of iid measurements for a parameter
θ ∈ R. We want to estimate θ. Action space: (A, A) = (R, B). A common loss
function in this problem is the squared error loss L(P, a) = (θ − a)², a ∈ A. For
example, let d(X) = X̄, the sample mean. The loss for X̄ is (X̄ − θ)². The risk
R_{X̄}(P) is called the Mean Squared Error (MSE) of X̄.
If the population has mean μ and variance σ² < ∞, then

R_{X̄}(P) = E(θ − X̄)²
        = (θ − E X̄)² + E(E X̄ − X̄)²
        = (θ − E X̄)² + Var(X̄)
        = (θ − μ)² + σ²/n
        = Bias² + Var.

If θ is in fact the mean μ of the population, then

R_{X̄}(P) = σ²/n

is an increasing function of the population variance σ² and a decreasing function of
the sample size n.
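The bias-variance decomposition of the risk above can be checked by simulation; a minimal Python sketch follows, where the population and the value θ are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 50, 20_000
mu, sigma = 1.0, 2.0        # population mean and standard deviation (illustrative)
theta = 1.3                 # parameter value whose squared-error risk we evaluate

xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)   # many draws of the sample mean
mse_mc = ((theta - xbar) ** 2).mean()                       # Monte Carlo risk
mse_theory = (theta - mu) ** 2 + sigma ** 2 / n             # Bias^2 + Var
print(mse_mc, mse_theory)
```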

In an estimation problem, a decision rule d is called an estimator. The following


example describes another type of important problem called hypothesis testing.

Example 48 Let P be a family of distributions, P_0 ⊂ P, and P_1 = {P ∈ P : P ∉ P_0}.
A hypothesis testing problem can be formulated as that of deciding which of the
following two statements is true:

H_0 : P ∈ P_0  versus  H_1 : P ∈ P_1.

Here, H_0 is called the null hypothesis and H_1 is called the alternative hypothesis.
The action space for this problem contains only two elements, i.e., A = {0, 1},
where 0 is the action of accepting H_0 and 1 is the action of rejecting H_0. A decision
rule is called a test.
Since a test d(X) is a function from X to {0, 1}, d(X) must have the form I_C(X),
where C ∈ F_X is called the rejection region or critical region for testing H_0 versus
H_1. L(P, a) = 0 if a correct decision is made and 1 if an incorrect decision is made,
i.e., L(P, j) = 0 for P ∈ P_j and L(P, j) = 1 otherwise, j = 0, 1.
Under this loss, the risk is

R_d(P) = P(d(X) = 1) = P(X ∈ C) if P ∈ P_0,  and  R_d(P) = P(d(X) = 0) = P(X ∉ C) if P ∈ P_1.


An example of a graph of Rd (P ) is Figure 2.2 of Shao (p127). The 0-1 loss implies
that the loss for two types of incorrect decisions (accepting H0 when P 2 P1 and
rejecting H0 when P 2 P0 ) are the same. In some cases, one might assume unequal
losses: L(P; j) = 0 for P 2 Pj , L(P; 0) = c0 when P 2 P1 , and L(P; 1) = c1 when
P 2 P0 .
If d is as good as any other rule in D, a class of allowable decision rules, then d
is D-optimal (or optimal if D contains all possible rules). In most applications it is
not possible to …nd a decision that is best uniformly in P 2 P: For this reason, we
relax the de…nition of “best” and consider di¤erent concepts such as admissibility
(introduced later). To …nd admissible rules it is convenient to enlarge (or better,
convexify) the class of decision rules to include randomized decision rules.
Definition 49 A randomized decision rule is a function δ on X × A such that, for
every A ∈ A, δ(·, A) is a Borel function and, for every x ∈ X, δ(x, ·) is a probability
measure on (A, A).
If X = x is observed, we have a distribution of actions: δ(x, ·).
A nonrandomized decision rule d previously discussed can be viewed as a special
randomized decision rule with δ(x, {a}) = I_{{a}}(d(x)), a ∈ A, x ∈ X.
To choose an action in A when a randomized rule δ is used, we need to simulate
a pseudorandom element of A according to δ(x, ·).
Thus, an alternative way to describe a randomized rule is to specify the method
of simulating the action from A for each x ∈ X.
The loss function for a randomized rule δ is defined as

L(P, δ, x) = ∫_A L(P, a) dδ(x, a),

which reduces to the same loss function we discussed when δ is a nonrandomized
rule.
The risk of a randomized rule δ is then

R_δ(P) = E[L(P, δ, X)] = ∫_X ∫_A L(P, a) dδ(x, a) dP_X(x).

Definition 50 Let D be a class of decision rules (randomized or nonrandomized).
A decision rule d ∈ D is called D-admissible (or admissible when D contains all
possible rules) iff there does not exist any rule δ ∈ D that is better than d (in terms of
the risk).


If a decision rule d is inadmissible, then there exists a rule better than d, and d
should not be used in principle.

However, an admissible decision rule is not necessarily good. Admissibility is
the absence of a negative attribute, rather than possession of a positive attribute.
For example, in an estimation problem a silly estimator d(X) ≡ a constant may
be admissible.

If d is D-optimal, then it is D-admissible.

If d is D-optimal and d0 is D-admissible, then d0 is also D-optimal and is


equivalent to d .

If there are two D-admissible rules that are not equivalent, then there does not
exist any D-optimal rule.

How to check admissibility will be discussed in Chapter 4 in Shao.

Suppose that we have a su¢ cient statistic d(X) for P 2 P. Intuitively, our
decision rule should be a function of d. This is not true in general, but the following
result indicates that this is true if randomized decision rules are allowed.

Proposition 51 Let d(X) be a sufficient statistic for P ∈ P and let δ_0 be a decision
rule.
Then

δ_1(t, A) = E[δ_0(X, A) | d = t],

which is a randomized decision rule depending only on d, is equivalent to δ_0 if
R_{δ_0}(P) < ∞ for any P ∈ P.

Note that this Proposition does not imply that δ_0 is inadmissible.

If δ_0 is a nonrandomized rule,

δ_1(t, A) = E[I_A(δ_0(X)) | d = t] = P(δ_0(X) ∈ A | d = t)

is still a randomized rule, unless δ_0(X) = h(d(X)) a.s. P for some Borel function
h.

Hence, this Proposition does not apply to situations where randomized rules
are not allowed.


The following result tells us when nonrandomized rules are all we need and when
decision rules that are not functions of sufficient statistics are inadmissible.

Theorem 52 Suppose that A is a convex subset of R^k and that for any P ∈ P,
L(P, a) is a convex function of a.

(i) Let δ be a randomized rule satisfying ∫_A ||a|| dδ(x, a) < ∞ for any x ∈ X and
let d_1(x) = ∫_A a dδ(x, a).
Then L(P, d_1(x)) ≤ L(P, δ, x) (or L(P, d_1(x)) < L(P, δ, x) if L is strictly convex
in a) for any x ∈ X and P ∈ P.

(ii) (Rao-Blackwell theorem). Let T be a sufficient statistic for P ∈ P, d_0 ∈ R^k
be a nonrandomized rule satisfying E||d_0|| < ∞, and d_1 = E[d_0(X)|T]. Then
R_{d_1}(P) ≤ R_{d_0}(P) for any P ∈ P. If L is strictly convex in a and d_0 is not a
function of T, then d_0 is inadmissible.

How to find a decision rule? The concepts of admissibility and sufficiency help us
to eliminate some decision rules. However, usually there are still too many rules
left after the elimination of some rules according to admissibility and sufficiency.
Although one is typically interested in a D-optimal rule, frequently it does not exist,
if D is either too large or too small.

Example 53 Let X_1, ..., X_n be i.i.d. random variables from a population P ∈ P
that is the family of populations having finite mean μ and variance σ². Consider the
estimation of μ (A = R) under the squared error loss. It will be shown below that if
we let D be the class of all possible estimators, then there is no D-optimal rule. Next,
let D_1 be the class of all linear functions in X = (X_1, ..., X_n), i.e., d(X) = Σ_{i=1}^n c_i X_i
with known c_i ∈ R, i = 1, ..., n. Then,

R_d(P) = μ² ( Σ_{i=1}^n c_i − 1 )² + σ² Σ_{i=1}^n c_i².     (10)

We now show that there is no D_1-optimal rule, i.e., there does not exist d* =
Σ_{i=1}^n c_i* X_i such that R_{d*}(P) ≤ R_d(P) for any P ∈ P and d ∈ D_1.
If there is such a d*, then (c_1*, ..., c_n*) is a minimum of the function of (c_1, ..., c_n)
on the right-hand side of (10).
Then c_1*, ..., c_n* must be the same and equal to μ²/(σ² + nμ²), which depends on
P, i.e., d* is not a statistic.


Consider now a subclass D_2 ⊂ D_1 with c_i's satisfying Σ_{i=1}^n c_i = 1. From (10),
R_d(P) = σ² Σ_{i=1}^n c_i² if d ∈ D_2. Minimizing σ² Σ_{i=1}^n c_i² subject to Σ_{i=1}^n c_i = 1 leads to
c_i = n^{−1}. Thus, the sample mean X̄ is D_2-optimal. There may not be any optimal
rule if we consider a small class of rules. For example, if D_3 contains all the rules
in D_2 except X̄, then one can show that there is no D_3-optimal rule.

In view of the fact that an optimal rule often does not exist, statisticians adopt
two common approaches to choose a decision rule. The first approach is to define
a class D of decision rules that have some desirable properties (statistical and/or
nonstatistical) and then try to find the best rule in D. In the previous example,
for instance, any estimator d in D_2 has the property that d is linear in X and
E[d(X)] = μ. In a general estimation problem, we can use the following concept.

Definition 54 In an estimation problem, the bias of an estimator d(X) of a real-valued
parameter ϑ of the unknown population is defined to be

b_d(P) = E[d(X)] − ϑ

(denoted by b_d(θ) when P is in a parametric family indexed by θ). An estimator
d(X) is said to be unbiased for ϑ iff b_d(P) = 0 for any P ∈ P.

Another property we may consider is invariance:

Consider a class of transformations (such as unit changes).

Consider rules that are not affected by the transformations (invariance).

Try to find the best rule within the class of invariant rules.

Details are omitted (see textbook).

The second approach to finding a good decision rule is to consider some characteristic
R_d of R_d(P), for a given decision rule d, and then minimize R_d over d ∈ D.

The following are two popular ways to carry out this idea. The first method is
the Bayes rule. Consider an average of R_d(P) over P ∈ P:

r_d(Π) = ∫_P R_d(P) dΠ(P),

where Π is a known probability measure on (P, F_P) with an appropriate σ-field F_P.
r_d(Π) is called the Bayes risk of d w.r.t. Π. If d* ∈ D and r_{d*}(Π) ≤ r_d(Π) for any

d ∈ D, then d* is called a D-Bayes rule (or Bayes rule when D contains all possible
rules) w.r.t. Π.
The second method is the minimax rule. Consider the worst situation, i.e.,
sup_{P∈P} R_d(P). If d* ∈ D and

sup_{P∈P} R_{d*}(P) ≤ sup_{P∈P} R_d(P)

for any d ∈ D, then d* is called a D-minimax rule (or minimax rule when D contains
all possible rules).

Example 55 Consider the estimation of θ ∈ R under loss L(θ, a) = (θ − a)² and

r_d(Π) = ∫_R E_θ(θ − d(X))² dΠ(θ),

which is equivalent to E(θ̃ − d(X))², where θ̃ is random and distributed as Π and,
given θ̃ = θ, the conditional distribution of X is P_θ. Then, the problem can be viewed
as a prediction problem for θ̃ using functions of X. The best predictor is E(θ̃|X),
which is the D-Bayes rule w.r.t. Π with D being the class of rules d(X) satisfying
E[d(X)]² < ∞ for any θ.
We usually try to find a Bayes rule or a minimax rule in a parametric problem where
P = P_θ for a θ ∈ R^k. A minimax rule in general may be difficult to obtain. Bayes
and minimax rules are further discussed in Chapter 4 of Shao. I recommend reading
Chapter 2 in Young and Smith (2005) to get some insights into the finite parameter
space case.
We have seen estimation and testing within the context of statistical decision
theory, but other statistical problems such as confidence intervals, classification or
rankings can also be motivated from the statistical decision point of view. For an
illustration on the problem of classification see Problem 93 in Chapter 2 of Shao. We
now move into estimation in more detail.

7 Statistical Inference: Estimation


7.1 Examples
Many estimators can be viewed as minimizers (or maximizers) of a stochastic function.
In many cases there is an explicit solution, but not in many others. Let Q_n(θ)
be a stochastic scalar function of a p × 1 vector θ and the sample size n. Let Θ ⊂ R^p
denote the parameter space and let

θ̂ = θ̂_n = arg min_{θ∈Θ} Q_n(θ).


For early references on nonlinear estimation see R.A. Fisher (1921, 1925), Wald
(1949), Huber (1967), Jennrich (1969) and Malinvaud (1970). For a general treatment
of this topic see Amemiya (1973, 1985) and Newey and McFadden (1994). We
now provide some important examples to motivate the general theory. Throughout,
we use the notation E_n[g(w)] = n^{−1} Σ_{i=1}^n g(w_i) to denote the empirical expectation
operator based on a sample {w_i}_{i=1}^n of size n, for a measurable function g.

Example 56 (OLS Estimate of Multiple Regression) The linear regression model
is the most popular model in applied work. This model is generally estimated by least
squares regression, that is, taking

Q_n(β) = E_n[(y − x'β)²] = (1/n) Σ_{i=1}^n (y_i − x_i'β)²,  so
β̂_n = E_n[xx']^{−1} E_n[xy],

provided E_n[xx'] is non-singular. In this example we obtain a closed form for the
so-called Ordinary Least Squares (OLS) estimator.
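A minimal Python sketch of the OLS closed form β̂_n = E_n[xx']^{−1} E_n[xy]; the simulated design is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x = np.c_[np.ones(n), rng.normal(size=(n, 2))]   # regressors including an intercept
beta0 = np.array([1.0, 2.0, -0.5])
y = x @ beta0 + rng.normal(size=n)

Sxx = x.T @ x / n                                # E_n[x x']
Sxy = x.T @ y / n                                # E_n[x y]
beta_hat = np.linalg.solve(Sxx, Sxy)             # OLS, provided Sxx is non-singular
print(beta_hat)
```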

Example 57 (Robust Estimation, Huber (1967)) More robust estimators for
location and regression parameters can be constructed by taking

Q_n(θ) = E_n[ρ(y − θ'x)]

for a given function ρ that is typically chosen to reduce sensitivity to outlying
observations: "Robust estimation", see Huber (1967).

Example 58 (Lasso) An estimator that has gained a lot of attention recently is the
Lasso of Tibshirani (1996). This estimator corresponds to the objective function

Q_n(β) = E_n[(y − x'β)²] + λ_n Σ_{j=1}^p |β_j|,

where β = (β_1, ..., β_p)' and λ_n is a converging-to-zero sequence of positive numbers.
This objective function is convex. I suggest you see graphically for p = 2 what
the estimator is doing, by plotting the set of points (β_1, β_2) on the plane such that
|β_1| + |β_2| ≤ b, together with level sets for E_n[(y − x'β)²]. The Lasso estimator has
become popular when p is large, and the previous plot tells you why (model selection
and estimation are done simultaneously).


Example 59 (Ridge) An estimator that is popular when E_n[xx'] is close to being
singular is the Ridge estimator, which minimizes the objective function

Q_n(β) = E_n[(y − x'β)²] + λ_n Σ_{j=1}^p β_j²,

where β = (β_1, ..., β_p)' and λ_n is a converging-to-zero sequence of positive numbers.
Solving the first order condition we obtain a closed form solution for the estimator

β̂_n = (E_n[xx'] + λ_n I_p)^{−1} E_n[xy].

Note that for λ_n = 0 we recover OLS. Ridge estimators do not do model selection
(see the plot for p = 2 for illustration).
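A minimal Python sketch of the Ridge closed form above; the nearly collinear design and the values of λ_n are illustrative choices.

```python
import numpy as np

def ridge(x, y, lam):
    """Closed-form ridge estimator (E_n[xx'] + lam I)^{-1} E_n[xy]."""
    n, p = x.shape
    Sxx = x.T @ x / n
    Sxy = x.T @ y / n
    return np.linalg.solve(Sxx + lam * np.eye(p), Sxy)

rng = np.random.default_rng(5)
x = rng.normal(size=(200, 3))
x[:, 2] = x[:, 1] + 0.01 * rng.normal(size=200)      # nearly collinear regressors
y = x @ np.array([1.0, 1.0, 0.0]) + rng.normal(size=200)

print(ridge(x, y, lam=0.0))   # lambda = 0 gives OLS, erratic under near-collinearity
print(ridge(x, y, lam=0.1))   # ridge shrinks the coefficients and stabilizes them
```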

Example 60 (MLE, CML, PMLE or QMLE) The OLS estimate in the previous
example can be seen as a Conditional Maximum Likelihood (CML) estimate
of β_0 under the parametric assumption

{y_i, x_i} is iid,  y_i | x_i ~ N(x_i'β_0, σ²).

More generally, the CML estimate maximizes the conditional (log) likelihood

Q_n(β, σ²) = E_n[ℓ(y|x; β, σ²)] = E_n[log f(y|x; β, σ²)] = (1/n) Σ_{i=1}^n log f(y_i|x_i; β, σ²),

where ℓ(y|x; β, σ²) = log f(y|x; β, σ²). This objective function considers only a part of
the complete likelihood

f(y_i, x_i; β, σ², γ) = f(y_i|x_i; β, σ²) f(x_i; γ),

where f(x_i; γ) is the marginal pdf of x_i, so the log likelihood of the data {y_i, x_i} is

log f(y_i, x_i; β, σ², γ) = log f(y_i|x_i; β, σ²) + log f(x_i; γ).

The maximum likelihood estimator (MLE) of (β, σ², γ) maximizes

Q_n(β, σ², γ) = E_n[log f(y, x; β, σ², γ)].

The CML of β (and σ²) only maximizes the first term, ignoring the second one.
If (β, σ²) and γ are functionally unrelated then the CML and the joint (or full) ML
estimates of (β, σ²) are numerically the same.


If they are related (e.g. one parameter appears in both distributions) then they
are no longer numerically equal, and intuitively the CML misses information that
could be obtained from the marginal distribution, and in general is less efficient. In
many cases this is the price we pay when we do not know, or do not want to specify, f(x; γ).
If the true underlying distribution is other than the one specified (e.g. the Gaussian
distribution), the resulting estimator is called the Pseudo-MLE or Quasi-MLE, and it
might be consistent and asymptotically normal under fairly weak regularity conditions,
although not efficient; see Gourieroux, Monfort and Trognon (1984, Econometrica).
That is, the QMLE allows for some misspecification.
Models with limited dependent variables and endogenous selection are examples
of models that are often estimated by MLE. This approach is often criticized because
of its strong distributional assumptions.
The normal distribution is a special case of an exponential family. Exponential
families will be discussed in class.
In general θ̂_n is only implicitly defined, due to the special form of the objective
function, in many cases because of the nonlinearity of the model, but in others
because of the nature of the estimate.
Example 61 (Estimation of GARCH models for financial data: QMLE) The
GARCH(1,1) model is the most popular model in financial econometrics, and it is
used, for instance, in modeling market risk by financial institutions. It is defined as

Y_t = √(h_t) ε_t,  h_t = w_0 + α_0 Y²_{t−1} + β_0 h_{t−1},  t ∈ Z,

where ε_t is a sequence of strictly stationary and ergodic random variables satisfying

E[ε²_t | F_{t−1}] = 1 almost surely (a.s.),     (11)

and w_0 > 0, α_0 ≥ 0, β_0 ≥ 0. Here F_{t−1} is the σ-field generated by (Y_{t−1}, Y_{t−2}, ...).
Define the vector of parameters θ = (w, α, β)' and the parameter space Θ ⊂ (0, +∞) ×
[0, +∞)². The true parameter value is unknown, and it is denoted by θ_0 = (w_0, α_0, β_0)'.
The prime denotes transposition. The problem we tackle is, given a sample of size
n of Y_t, {Y_t}_{t=1}^n say, to estimate the parameter θ_0. A QMLE estimator is defined as
any measurable solution θ̂_n such that

θ̂_n = arg min_{θ∈Θ} Q_n(θ) ≡ n^{−1} Σ_{t=1}^n ℓ̃_t(θ),  ℓ̃_t(θ) = Y²_t / σ̃²_t(θ) + log σ̃²_t(θ),     (12)

where the σ̃²_t are defined recursively, for t ≥ 1, by

σ̃²_t ≡ σ̃²_t(θ) = w + α Y²_{t−1} + β σ̃²_{t−1}(θ),


starting with initial values chosen as, for instance, Y²_0 = σ̃²_0 = Y²_1. It can be shown that
the initial values do not matter for the asymptotic properties of the QMLE. If the
true innovation's distribution is Gaussian, the estimator is consistent and efficient.
If the distribution is not Gaussian but (11) and other mild conditions hold, then the
QMLE is consistent and asymptotically normal, although it is in general not efficient;
see Escanciano (2008) and references therein.
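A minimal Python sketch of the Gaussian QMLE objective in (12), with the variance recursion started at σ̃²_0 = Y²_1; the simulated data, parameter values and optimizer are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

def garch_qmle_objective(theta, y):
    """Q_n(theta) = n^{-1} sum_t [ y_t^2 / s2_t + log s2_t ] for a GARCH(1,1)."""
    w, a, b = theta
    if w <= 0 or a < 0 or b < 0:
        return np.inf
    s2 = np.empty_like(y)
    s2[0] = y[0] ** 2                      # initial value (asymptotically irrelevant)
    for t in range(1, y.size):
        s2[t] = w + a * y[t - 1] ** 2 + b * s2[t - 1]
    return np.mean(y ** 2 / s2 + np.log(s2))

# illustrative simulated GARCH(1,1) data with Gaussian innovations
rng = np.random.default_rng(6)
w0, a0, b0 = 0.1, 0.1, 0.8
y = np.empty(2000)
h = w0 / (1 - a0 - b0)
for t in range(y.size):
    y[t] = np.sqrt(h) * rng.normal()
    h = w0 + a0 * y[t] ** 2 + b0 * h

res = minimize(garch_qmle_objective, x0=[0.05, 0.05, 0.5], args=(y,), method="Nelder-Mead")
print(res.x)    # should be roughly (0.1, 0.1, 0.8)
```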

Example 62 (CML in Probit Binary Choice) Let the scalar dependent variable
y denote participation in the labor market, that is, y = 1 if the individual participates
and y = 0 otherwise. Let x be some explanatory variables including gender,
education, age, etc. We are interested in modeling the conditional probability of labor
participation given the d-dimensional vector of variables x. We assume the model

P(y = 1|x) = Φ(x'θ_0),

where Φ is the standard Gaussian cumulative distribution function (cdf) and θ_0 is a
vector of unknown parameters in R^d. This specification holds if the data is generated
from

y = 1(x'θ_0 − ε ≥ 0),

where ε|x ~ N(0, 1) (check this). We aim to estimate θ_0 by the CML estimator.
The conditional likelihood is f(y_i|x_i; θ) = Φ(x_i'θ)^{y_i} [1 − Φ(x_i'θ)]^{1−y_i}. Then, the CML
estimator is

θ̂_n = arg min_{θ∈Θ} Q_n(θ) ≡ E_n[−y_i log Φ(x_i'θ) − (1 − y_i) log(1 − Φ(x_i'θ))].
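A minimal Python sketch of the Probit CML objective Q_n(θ) above; the simulated data, starting value and optimizer are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def probit_negloglik(theta, y, x):
    """Q_n(theta) = E_n[ -y log Phi(x'theta) - (1-y) log(1 - Phi(x'theta)) ]."""
    p = norm.cdf(x @ theta)
    p = np.clip(p, 1e-10, 1 - 1e-10)       # numerical safeguard away from 0 and 1
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(7)
n = 1000
x = np.c_[np.ones(n), rng.normal(size=n)]
theta0 = np.array([0.5, 1.0])
y = (x @ theta0 - rng.normal(size=n) >= 0).astype(float)   # y = 1(x'theta0 - eps >= 0)

res = minimize(probit_negloglik, x0=np.zeros(2), args=(y, x), method="BFGS")
print(res.x)    # roughly (0.5, 1.0)
```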

Example 63 (Nonlinear LS Regression) In many applications the linearity
assumption of regression is not realistic, e.g. binary choice models. A general nonlinear
regression specification is given, for some θ_0 ∈ Θ, by

y = m(θ_0, x) + v,  where m(θ_0, x) = E[y|x].

As an example, m(θ, x) = α + βe^{x}, θ = (α, β)'. In general, the true value θ_0 is
identified by the condition m(θ_0, x) = E[y|x], which is equivalent to saying that θ_0 is
the only value that satisfies

θ_0 = arg min_{θ∈Θ} E[(y − m(θ, x))²].

By the analog principle the NLS estimate minimizes

Q_n(θ) = E_n[(y − m(θ, x))²] = (1/n) Σ_{i=1}^n (y_i − m(θ, x_i))².


This method can be applied, for instance, to the Probit model, as an alternative to
CML. Another example of a nonlinear specification that is often used with positive
dependent variables is the Poisson regression model

E[y|x] = m(θ_0, x) = exp(θ_0'x).

Example 64 (Linear quantile regression) There are many applications where
one is interested in low or high ranges of a distribution rather than in its central
part. For instance, labor economists are interested in individuals with lower income
or higher unemployment duration. To model this type of situation, consider a
"regression" model with iid data {y_i, x_i}_{i=1}^n,

y_i = θ_0'x_i + v_i,

where P[v_i ≤ 0 | x_i] = τ a.s., for some τ ∈ (0, 1). Then, θ_0'x_i becomes the τ-th
conditional quantile of the distribution of y_i given x_i (prove this). The most popular
estimator of θ_0 under this framework is the Quantile Regression Estimator (QRE),
proposed by Koenker and Bassett (1978). The QRE is defined as any solution
θ̂_{KB,n}(τ) minimizing

θ ↦ Q_n(θ) = Σ_{i=1}^n ρ_τ(y_i − θ'x_i)

with respect to θ ∈ R^p, where ρ_τ(ε) = (τ − 1(ε ≤ 0))ε is the check function. The Least
Absolute Deviation (LAD) estimator corresponds to the median, τ = 0.5. Rather than
relying on a single measure of conditional location, the quantile regression approach
allows the researcher to explore a range of conditional quantile functions, thereby
providing a more complete analysis of the conditional dependence structure of the
variables under consideration. Quantile regression allows labor economists to investigate,
for example, the effect of education on different parts of the wage distribution (note
θ_0 = θ_0(τ) may change with τ).
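A minimal Python sketch of the Koenker-Bassett objective with the check function ρ_τ; the location-shift data are an illustrative choice, and a derivative-free minimizer is used because Q_n is convex but not smooth.

```python
import numpy as np
from scipy.optimize import minimize

def check_loss(e, tau):
    """rho_tau(e) = (tau - 1(e <= 0)) * e, the check function."""
    return (tau - (e <= 0)) * e

def qr_objective(beta, y, x, tau):
    """Q_n(beta) = sum_i rho_tau(y_i - x_i'beta)."""
    return check_loss(y - x @ beta, tau).sum()

rng = np.random.default_rng(8)
n = 2000
x = np.c_[np.ones(n), rng.uniform(size=n)]
y = x @ np.array([1.0, 2.0]) + rng.normal(size=n)   # illustrative location model

tau = 0.9
res = minimize(qr_objective, x0=np.zeros(2), args=(y, x, tau), method="Nelder-Mead")
print(res.x)   # intercept roughly 1 + z_{0.9} (about 2.28), slope roughly 2
```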

Example 65 (Method of Moments (MM)) Consider a parametric model P =
{P_θ : θ ∈ Θ} for the data w, where Θ is a subset of the Euclidean space R^p. We have
two ways to estimate a p-dimensional moment E[φ(w)]: nonparametric, E_n[φ(w)],
and parametric, E_θ[φ(w)] = h(θ). The idea of the Method of Moments is to match
these two: find the estimator θ̂_n such that

E_n[φ(w)] = h(θ̂_n).

Assuming that h is invertible, with inverse h^{−1}, the estimator is given by h^{−1}(E_n[φ(w)]) =
θ̂_n. We can then apply the Delta method to investigate the asymptotic properties of


θ̂_n. Many times φ(w) involves the first p moments. For example, for a scalar w,
φ(w) = (w, w², ..., w^p). This is a special case of a Zero estimator that solves the
sample analog of the following orthogonality conditions

E[m] = E[m(w, θ_0)] = 0,   (m is p × 1)

(where we can assume that w is ergodic stationary). In this situation, we can estimate
θ_0 by the so-called MM estimate θ̂_n^{MM}, which is a solution of

E_n[m(w, θ̂_n^{MM})] = 0.

The first case above corresponds to m(w, θ_0) = φ(w) − h(θ_0).

Example 66 (Nonlinear GMM estimates, Hansen, 1982) Hansen (1982) extended
the MM estimator to the overidentified case where there are more moments
than parameters. First order conditions in many economic models often lead to Euler
equations or conditional moment restrictions. In turn, these conditions lead to the
following orthogonality conditions

E[m] = E[m(w, θ_0)] = 0,   (m is q × 1)

(where we can assume that w is ergodic stationary). In this situation, we can estimate
θ_0 by the so-called GMM estimate θ̂_n^{GMM}(Ŵ_n), which is a solution of

arg min_{θ∈Θ} Q_n(θ) ≡ E_n[m(w, θ)]' Ŵ_n E_n[m(w, θ)],

for a symmetric matrix Ŵ_n, where

E_n[m(w, θ)] = (1/n) Σ_{i=1}^n m(w_i, θ).

For example, in Consumption-Based Asset Pricing Models, Euler equations with
power utility lead to the (q × 1) moment

m(w, θ_0) = ( 1 − δ (c_{t+1}/c_t)^{−γ} (1 + r_{t+1}) ) x_t,

where θ_0 = (δ, γ). Details on this application are provided below. Another application
that falls into the modern GMM framework is nonlinear IV, as in Sargan (1959).
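A minimal Python sketch of the GMM quadratic form Q_n(θ) = E_n[m(w,θ)]' Ŵ_n E_n[m(w,θ)]. As an illustrative choice (not the consumption-based moment above) it uses a linear instrumental variables moment m(w,θ) = z(y − x'θ) with q = 3 instruments, p = 2 parameters, and Ŵ_n = I as a first-step weight matrix.

```python
import numpy as np
from scipy.optimize import minimize

def gmm_objective(theta, y, x, z, W):
    """Q_n(theta) = gbar' W gbar with gbar = E_n[ z (y - x'theta) ]."""
    gbar = (z * (y - x @ theta)[:, None]).mean(axis=0)
    return gbar @ W @ gbar

rng = np.random.default_rng(9)
n = 3000
z = np.c_[np.ones(n), rng.normal(size=(n, 2))]                       # instruments
u = rng.normal(size=n)
x = np.c_[np.ones(n), z[:, 1] + z[:, 2] + u + rng.normal(size=n)]    # endogenous regressor
y = x @ np.array([1.0, 0.5]) + u

W = np.eye(3)                                                        # first-step weight matrix
res = minimize(gmm_objective, x0=np.zeros(2), args=(y, x, z, W), method="BFGS")
print(res.x)    # roughly (1.0, 0.5)
```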


The first examples differ from Example 66 in that the objective functions in
the first ones are sample means, while in the last one it is a quadratic form of such
sample means.
The first ones, including NLS and (C)ML estimates, which minimize (or maximize)

Q_n(θ) = E_n[g(w, θ)]

for some function g of the available random sample of w, are generally denoted as
M-Estimates (from Maximum likelihood-like), by contrast with the quadratic form
case.
Other estimates, such as minimum distance, are of the type arg min_θ Q_n(θ) ≡ g_n(θ)' W g_n(θ),
where g_n(θ) is not necessarily a sample mean. Examples of minimum distance estimators
include the minimum chi-square methods for discrete data, as well as estimators
for simultaneous equation models in Rothenberg (1973) and panel data in Chamberlain
(1982). For a more recent application to estimation of dynamic games see, e.g.,
Pesendorfer and Schmidt-Dengler (2008, RevStud).
Another large class of estimates is defined by the empirical equality E_n[g(w, θ)] = 0.
These are often called Z-estimates (from Zero) or estimating equation estimates.
Notice that by considering Q_n(θ) = ||E_n[g(w, θ)]|| these are particular
cases of extremum estimates.

7.2 Asymptotic properties of Extremum Estimates


In this section we provide a general asymptotic theory for extremum estimates. When
^n is explicitly de…ned, we can …nd its statistical properties directly by means of law
of large numbers and central limit theorems, see e.g. the analysis of OLS. Properties
of implicitly de…ned ^n are harder to obtain. We will provide a general theory for
establishing the asymptotic distribution for a generic extremum estimator ^n , but it
can be helpful to keep in mind a particular example such as the NLS estimator. The
…rst step in our general problem of inference is identi…cation. This step answers
the question: What is ^n estimating? (what is the estimand?) One (abstract)
approach that can be useful is to look at arg min 2 Qn ( ) as a mapping acting on
the function Qn ( ): If Qn converges to Q in a suitable sense, then one expects that
^n = arg min 2 Qn ( ) will converge to 0 = arg min 2 Q( ) by some “continuity”
argument. Identi…cation in our setting boils down to proving that Q( ) has a unique
minimizer, i.e. 0 = arg min 2 Q( ): Once we have shown identi…cation, we proceed
to provide arguments for the convergence of ^n to 0 by showing convergence of Qn ( )
to Q( ): When ^n converges in probability to 0 we say ^n is consistent for 0 .


A concept of convergence from Q_n to Q is pointwise convergence, i.e. Q_n(θ) =
Q(θ) + o_P(1) for each θ ∈ Θ. This convergence often follows from the Law of Large
Numbers (LLN, see the Appendix for discussion on the LLN). The following example
shows that pointwise convergence does not suffice for the arg min to be "continuous".

Example 67 Define the sequence of deterministic functions

Q_n(θ) = −θ² / (θ² + (1 − nθ)²)   if 0 ≤ θ < 1,
Q_n(θ) = −1/2                      if θ = 1.

The pointwise limit of this function is

Q(θ) = 0      if 0 ≤ θ < 1,
Q(θ) = −1/2   if θ = 1.

One can easily check that θ_n = 1/n = arg min_{θ∈Θ} Q_n(θ) → 0 ≠ 1 = arg min_{θ∈Θ} Q(θ).

We shall see below that uniform convergence and some additional assumptions will
be sufficient for consistency. Uniform convergence in probability means: ∀c > 0,

Pr( sup_{θ∈Θ} |Q_n(θ) − Q(θ)| > c ) → 0.

Uniform convergence is stronger than pointwise convergence.

Example 68 (NLS Probit) Consider the Probit model for the binary outcome y
(e.g. working) and regressors x,

E[y|x] = P(y = 1|x) = Φ(x'θ_0),

where Φ is the standard Gaussian cumulative distribution function (cdf) and θ_0 is a
vector of unknown parameters in R^d. The parameters can be estimated by CML as
above, but also by NLS as

θ̂_n = arg min_{θ∈Θ} Q_n(θ) ≡ E_n[(y_i − Φ(x_i'θ))²].

Assume the data {w_i = (y_i, x_i)} is iid. Then, the LLN implies for each θ ∈ R^d

Q_n(θ) ≡ E_n[(y_i − Φ(x_i'θ))²] →_p Q(θ) ≡ E[(y_i − Φ(x_i'θ))²].

The LLN can be applied because {(y_i − Φ(x_i'θ))²} are iid and bounded. But how do
we prove the consistency of θ̂_n? First, we need to show identification (meaning?).

58
J.C. Escanciano Econometrics I Fall 2020

7.2.1 Consistency
Existence. A first question is whether such an estimate θ̂_n exists. A continuous
function Q_n of θ, measurable wrt the data, and a compact Θ guarantee this. We will
always assume that there are no problems of measurability and that such a solution
exists. See Jennrich (1969) for further details on this issue.
We first provide conditions under which θ̂_n is consistent for a "true value" θ_0.
Henceforth, the function Q is the pointwise limit of Q_n, i.e. Q_n(θ) →_p Q(θ) for
each θ ∈ Θ. This is our first general result on consistency:

Theorem 69 Assume that θ̂_n = arg min_{θ∈Θ} Q_n(θ) and

(i) (Θ, d) is a metric space.
(ii) [Identification] For all ε > 0,

inf_{d(θ,θ_0)>ε} Q(θ) > Q(θ_0),

that is, Q(θ) is uniquely minimized on Θ at θ_0 ∈ Θ and the minimum is well
separated;
(iii) [Uniform convergence] Q_n(θ) converges uniformly in probability to Q(θ).
Let (i)-(iii) hold. Then, θ̂_n →_p θ_0.

Proof. Choose ε > 0; by (ii), there exists a δ > 0 such that d(θ, θ_0) > ε implies
Q(θ) ≥ Q(θ_0) + δ, which in turn implies |Q(θ) − Q(θ_0)| ≥ δ. Thus,

Pr( d(θ̂_n, θ_0) > ε ) ≤ Pr( Q(θ̂_n) − Q(θ_0) ≥ δ ).     (13)

Then, we have to show that the RHS converges to zero, which is equivalent to
Q(θ̂_n) →_p Q(θ_0). Now,

Q(θ̂_n) − Q(θ_0) = Q(θ̂_n) − Q_n(θ̂_n) + Q_n(θ̂_n) − Q(θ_0)     (14)
              ≤ sup_{θ∈Θ} |Q_n(θ) − Q(θ)| + Q_n(θ_0) − Q(θ_0)     (15)
              ≤ 2 sup_{θ∈Θ} |Q_n(θ) − Q(θ)| →_p 0,     (16)

where the inequality in (15) uses Q_n(θ̂_n) ≤ Q_n(θ_0), since θ̂_n minimizes Q_n.

The conditions of THM. 69 can be relaxed. For instance, an "almost" minimum,
Q_n(θ̂_n) ≤ inf_{θ∈Θ} Q_n(θ) + o_P(1), also satisfies the theorem. This is important in some
applications, see the maximum score estimator of Manski (1975) or the simulated

59
J.C. Escanciano Econometrics I Fall 2020

method of moment estimators of Pakes (1986) and McFadden (1989). We have
implicitly assumed in the previous theorem that the estimator exists.
The parameter space Θ can be infinite-dimensional and d any metric on it.
Condition (ii) is implied by: (i) (Θ, d) is a compact set of a normed space, (ii)
Q(θ) is lower semicontinuous; (iii) Q(θ) ≥ Q(θ_0) with equality if and only if θ = θ_0.
Proof: By the Weierstrass Theorem, (i) and (ii) imply existence of a minimum, and (iii)
guarantees it is unique. Identification has to be analysed on a case-by-case basis.
Note that for the consistency theorem we do not need iid observations. Condition
(iii) is a uniform convergence in probability condition. If Θ is compact, this can
be deduced in general from a ULLN or by equicontinuity arguments, see the Appendix.
Summarizing: when (Θ, d) is compact, in order to prove consistency it suffices
to prove uniform convergence of Q_n(θ) to Q(θ) and that Q(θ) is continuous and has a
unique minimum. These assumptions are sufficient for consistency but not necessary.
The next theorem does not assume compactness of Θ, but it requires convexity. For its
proof see Newey and McFadden (1994). A function Q_n(θ) is convex if for all θ_1,
θ_2 ∈ Θ and all t ∈ [0, 1], Q_n(tθ_1 + (1 − t)θ_2) ≤ tQ_n(θ_1) + (1 − t)Q_n(θ_2).

Theorem 70 If there is a function Q(θ) such that (i) Q(θ) is uniquely minimized
at θ_0; (ii) θ_0 is an element of the interior of a convex set Θ and Q_n(θ) is convex; and
(iii) Q_n(θ) →_p Q(θ) for all θ ∈ Θ; then θ̂_n exists with probability approaching one and θ̂_n →_p θ_0.

In this theorem the required convergence of Q_n is only pointwise, at the price of
the convexity. The latter result applies to Probit CML, regression quantile estimators
and the Lasso.
Since many objective functions are not convex, we return to the general case and
analyze in more detail the case of M-estimates, providing primitive conditions for
assumptions (ii) and (iii) of THM. 69. Assume then Q_n(θ) = E_n[g(w, θ)]. The
following result provides simple sufficient conditions for a ULLN. Henceforth, the
meaning of "strictly stationary and ergodic" is that the standard (i.e. pointwise)
LLN applies. The next result is a key result for this course. We will refer to it as the
Main ULLN.

Theorem 71 If {w_t} are iid or strictly stationary and ergodic, g(w_t, θ) is continuous
at each θ ∈ Θ with probability one, Θ is compact in a metric space, |g(w_t, θ)| ≤ d(w_t)
for all θ ∈ Θ, with E[|d(w_t)|] < ∞, then the ULLN holds and the limit is continuous.
That is,

sup_{θ∈Θ} |E_n[g(w, θ)] − E[g(w, θ)]| →_p 0,

and Q(θ) = E[g(w, θ)] is continuous.


Then, we have the following useful corollary, which applies to MLE, CML, NLS,
among others. Its proof is left to the reader. It will be referred to as the Main
Consistency Result. The example that follows illustrates how the main consistency
result is applied.

Corollary 72 Let θ̂_n = arg min_{θ∈Θ} E_n[g(w, θ)]. Assume that

(i) (Θ, d) is a compact metric space.
(ii) Q(θ) ≥ Q(θ_0) with equality if and only if θ = θ_0.
(iii) {w_t} are iid or strictly stationary and ergodic, g(w_t, θ) is continuous at each
θ ∈ Θ with probability one and |g(w_t, θ)| ≤ d(w_t) for all θ ∈ Θ, with E[|d(w_t)|] < ∞.
Let (i)-(iii) hold. Then θ̂_n →_p θ_0.

Example 73 (NLS Probit, cont.) We show the consistency of the NLS θ̂_n for the
Probit model using Corollary 72. Assume Θ is a compact subset of R^d. First, we show
identification of the nonlinear model. Note

Q(θ) = E[(y_i − Φ(x_i'θ))²]
     = E[(y_i − Φ(x_i'θ_0))²] + E[(Φ(x_i'θ_0) − Φ(x_i'θ))²],

where the cross-product is zero by the orthogonality condition E[y − Φ(x'θ_0)|x] = 0,
since by iterated expectations

E[(y_i − Φ(x_i'θ_0))(Φ(x_i'θ_0) − Φ(x_i'θ))] = E[ E[y − Φ(x'θ_0)|x] (Φ(x_i'θ_0) − Φ(x_i'θ)) ] = 0.

Now, strict monotonicity of Φ yields

E[(Φ(x_i'θ_0) − Φ(x_i'θ))²] = 0  ⟹  Φ(x_i'θ_0) = Φ(x_i'θ) a.s.
                              ⟹  x_i'θ_0 = x_i'θ a.s.
                              ⟹  E[x_i x_i'] θ_0 = E[x_i x_i'] θ
                              ⟹  θ_0 = θ,

if E[x_i x_i'] is non-singular. Therefore, identification follows if there is no perfect
multicollinearity. Next, we check that

g(w_i, θ) = (y_i − Φ(x_i'θ))²

satisfies the conditions of Corollary 72. First, it is continuous in θ for all w_i. Second,
|g(w, θ)| ≤ d(w) ≡ 1, since y is binary and Φ is a probability. Then, by Corollary 72
consistency of the NLS θ̂_n follows. Once we have established consistency of θ̂_n,
consistency of other parameters of interest, such as the probability of participation
or partial effects, follows from the continuous mapping theorem.


Example 74 (MLE cont.) Suppose that fwi g are iid with pdf f (wi ; ) and (i) if
6= 0 then f (wi ; ) 6= f (wi ; 0 ); (ii) 0 2 ; which is compact; (iii) ln f (wi ; ) is
continuous at each 2 with probability one; (iv) E[sup 2 jln f (w; )j] < 1: Then
^M LE !p 0 : The identi…cation condition (ii) here follows from Jensen’s inequal-
ity (how?). To see how, by the strict version of Jensen’s inequality, if f (wi ; ) 6=
f (wi ; 0 )
Q( 0 ) Q( ) = E[ln f (w; )] E[ln f (w; 0 )]
f (w; )
= E ln
f (w; 0 )
f (w; )
< ln E
f (w; 0 )
= ln(1) = 0:
Consider, for example, an exponential density with parameter > 0; f (w; ) =
exp( w) for w > 0: This density satis…es the consistency conditions, with a
compact set of (0; 1) (verify this).
Example 75 (Quantile Regression, cont.) The quantile regression objective func-
tion
1X
n
0
7 ! Qn ( ) = (yi xi )
n i=1
is convex, where recall (") = (1(" 0) )": Then, for consistency it su¢ ces to
show that (i) Q( ) is uniquely minimized at 0 ; (ii) 0 is an element of the interior
of a convex set and (iii) Qn ( ) !p Q( ) for all 2 : Write
0
Q( ) = E [ (yi xi )]
= Q( 0 ) + Q( ) Q( 0 )
0 0 0
= Q( 0 ) + E [ (1("0 xi ) )("0 xi ) + (1("0 0) )("0 xi )]
Z 0
0
= Q( 0 ) + E (" xi )f ("jxi )d" ;
0
xi
0
where ( )= 0 ; "0 = y 0 x and f ("jx) is the conditional density of "0
given x: By Dominated Convergence and the Leibniz’s rule, for S( ) = Q( ) Q( 0 );
Z 0
@S( ) 0
= E (" xi )f ("jxi )j"= 0 xi xi xi f ("jxi )d"
@ ( )0 xi
Z 0
= E xi f ("jxi )d"
( )0 xi


and, evaluated at = 0;

@ 2 S( ) 0
0 = E [xi xi f (0jxi )] :
@ @
Note f (0jxi ) can be also expressed as the conditional density of yi given xi evaluated at
the quantile 0 xi : Thus, under the assumption that E [xi x0i f (0jxi )] is positive de…nite,
the strictly convex function Q( ) has a unique minimum at = 0 : We conclude that
the QRE is consistent by the convexity theorem.

7.2.2 Asymptotic Normality

We now describe conditions under which θ̂_n is asymptotically normal. The next
assumption imposes smoothness of the objective function.

ASS. 1 .
(i) θ_0 is an interior point of Θ ⊂ R^p.
(ii) Q_n(θ) is twice continuously differentiable in a neighborhood of θ_0 and

√n (∂/∂θ) Q_n(θ_0) →_d N(0, D),

θ̃_n →_p θ_0  ⟹  (∂²/∂θ∂θ') Q_n(θ̃_n) →_p E > 0 nonstochastic.     (17)

(iii) θ̂_n →_p θ_0.

Theorem 76 (Asymptotic normality) Let ASS. 1 (i)-(iii) hold. Then

√n (θ̂_n − θ_0) →_d N(0, E^{−1} D E^{−1}).

Remark 1: In many cases, such as maximum likelihood, E = D, so

√n (θ̂_n − θ_0) →_d N(0, D^{−1}).

But in many other cases, e.g. QMLE, IV estimates, GMM estimates, E ≠ D.

Remark 2: A su¢ cient condition for the second display of 1(ii) is the uniform
convergence of the Hessian @ 2 Qn ( )=@ @ 0 and the continuity of the limit. More


generally, if sup 2 jGn ( ) G( )j !p 0; G( ) is continuous at 0 and ^n !p 0


with ^n ; 0 2 ; then
Gn (^n ) G( 0 ) !p 0:
This is so because

Gn (^n ) G( 0 ) Gn (^n ) G(^n ) + G(^n ) G( 0 )

sup jGn ( ) G( )j + G(^n ) G( 0 ) :


2

This remark is very useful in many applications and we shall use it extensively in
these notes. Notice that we only need ^n 2 with probability tending to one for
the previous argument to work and that no rates of convergence of ^n are needed.
In a typical application
1X
n
^
Gn ( n ) = m(zi ; ^n );
n i=1

for some function m: The law of large numbers cannot be directly applied to Gn (^n )
because fm(zi ; ^n )gni=1 are generally dependent. The trick above suggests a solution
to this “problem” by applying a ULLN (from our Main ULLN typically) and using
the consistency of ^n and continuity of the limit (a typical by-product of the Main
ULLN). Thus, 1(ii) follows from a ULLN and the consistency theorem.

Proof of the Asymptotic Normality THM. Let


p
d = 0
(E 1 DE 1 ) 1=2 n(^ 0 );
0
=1
^
I = I( interior to ):

We must show
0
Pr(d x) ! Pr(X x); for X N (0; 1); 8 ; = 1:

But

Pr(d x) = Pr(d xjI = 1) Pr(I = 1)


+ Pr(d xjI = 0) Pr(I = 0)
= Pr(d xjI = 1)
+ fPr(d xjI = 0) Pr(d xjI = 1)g Pr(I = 0)


where
Pr(I = 0) = Pr( ^ 0 ) some > 0; by (i)
! 0 by (ii)
) Pr(d x) = Pr(d xjI = 1) + o(1) uniformly in x:
By the mean value theorem
p @ p @ p
0= n Qn ( ^ ) = n Qn ( 0 ) + Fi ( i ) n(^ 0 ); i = 1; : : : ; p
@ i @ i

where i is the i-th element of and i


0
^ 0 and
2 3
F1 ( )
6 .. 7 @2
F( ) = 4 . 5 = Q( ):
@ i@ 0
Fp ( )

But ^ !p 0 ) i
!p 0; i = 1; : : : ; p:
) Fi ( i ) !p i -th row of E:
p @ p
) 0= n Qn 0 ) + (E + op (1)) n(^ 0)
@ i
p p @
) E 1 (E + op (1)) n(^ 0) = E 1 n Qn ( 0 )
@ i
!d N (0; E 1 DE 1 ))
p
) n(^ 1
0 ) !d N (0; E DE )
1

The following corollary is useful. Its proof is left to the reader. Define

D = E[ (∂g(w_t, θ_0)/∂θ) (∂g(w_t, θ_0)/∂θ') ]

and

E = E[ ∂²g(w_t, θ_0)/∂θ∂θ' ].

The following result is extremely useful in this course, and will be referred to as the
Main Asymptotic Normality result. It is used to establish our third main step
for inference (proving asymptotic normality, after identification and consistency).

Corollary 77 Let θ̂_n = arg min_{θ∈Θ} E_n[g(w, θ)]. Assume that

(i) (Θ, d) is a compact metric space.
(ii) Q(θ) ≥ Q(θ_0) with equality if and only if θ = θ_0. The matrix E is non-singular.
(iii) {w_t} are iid; g(w_t, θ) is continuous in θ a.s. and twice continuously differentiable
in a neighborhood Θ_0 of θ_0; |g(w_t, θ)| ≤ d(w_t) for all θ ∈ Θ, with E[|d(w_t)|] < ∞.
(iv) |∂²g(w_t, θ)/∂θ∂θ'| ≤ h(w_t) for all θ ∈ Θ_0, with E[|h(w_t)|] < ∞ and
E[|∂g(w_t, θ_0)/∂θ|²] < ∞.
Then, √n (θ̂_n − θ_0) →_d N(0, E^{−1} D E^{−1}).

Example 78 (NLS Probit, cont.) We now show the AN of the NLS θ̂_n for the
Probit model. Assume Θ is a compact subset of R^d. We have shown before that if
E[x_i x_i'] is non-singular then θ̂_n is consistent for θ_0. Next, we check that

g(w_i, θ) = (y_i − Φ(x_i'θ))²

satisfies the conditions of Corollary 77. First, it is twice continuously differentiable
for all θ and w_i, with

∂g(w_i, θ)/∂θ = −2(y_i − Φ(x_i'θ)) φ(x_i'θ) x_i,
∂²g(w_i, θ)/∂θ∂θ' = 2φ²(x_i'θ) x_i x_i' − 2(y_i − Φ(x_i'θ)) φ̇(x_i'θ) x_i x_i',

where φ is the standard normal density and φ̇ its derivative. Then, since φ and φ̇ are
bounded we can take h(w) ≡ C|x|², for a positive constant C. Thus, the NLS θ̂_n is
CAN,

√n (θ̂_n − θ_0) →_d N(0, E^{−1} D E^{−1}),

with

D = E[ 4 (y_i − Φ(x_i'θ_0))² φ²(x_i'θ_0) x_i x_i' ]

and (by the orthogonality condition)

E = E[ 2 φ²(x_i'θ_0) x_i x_i' ].
Example 79 (MLE cont.) Suppose that fwi g are iid with pdf f (wi ; ) and (i)
the conditions for consistency hold;(ii) 0 2 int( ),with compact;(iii) f (wi ; ) is
twice continuously di¤erentiable and f (wi ; ) > 0 in a neighborhood N of 0 ; (iv)
E[sup 2 jr log f (wi ; )j] < 1;
Z
sup jr log f (w; )j dw < 1;
2N
Z
[sup jr log f (w; )j]dw < 1;
2N


then the MLE is AN. The function s(w; ) =r log f (w; ) is called score. The
variance D =E s(w; 0 )s(w; 0 )0 is called the Fisher Information matrix. We shall
show in the next pages that D
p = ^ E = E [r log f (w;1 0 )] (this is the Information
Equality). Thus, for MLE n( n 0 ) !d N(0; I ); where I = D = E: The
Fisher Information plays a key role in statistics. Under regularity conditions, I 1
represents the variance e¢ ciency bound (the smallest possible variance for a regu-
lar estimator of 0 ): Heuristically,
p regular estimators are estimators for which the
convergence in distribution of n(^n 0 ) holds uniformly (locally, around the true
distribution). The Hodges’estimator is not regular (so there is no contradiction with
the previous statement on e¢ ciency).

7.2.3 Asymptotic Variance Estimation


The asymptotic variance V = E^{−1}DE^{−1} provides a measure of the asymptotic dispersion
of the estimator. The fourth main step in inference is estimation of asymptotic
variances. This estimation is important for at least two reasons. It provides an
estimate of the dispersion of the estimator and it can be used to compute confidence
intervals and test statistics. The estimation of D generally depends on the serial
dependence of the data. In the iid case, if √n ∂Q_n(θ_0)/∂θ can be expressed in terms
of standardized sample means, then natural estimates of the asymptotic variance of
√n ∂Q_n(θ_0)/∂θ are available. Denote the estimate of D by D̂. A natural estimate
of E for smooth objective functions is Ê = ∂²Q_n(θ̂_n)/∂θ∂θ'. Then, we can estimate
V by V̂ = Ê^{−1} D̂ Ê^{−1}.

Example 80 (NLS Probit, cont.) For the Probit NLS θ̂_n,

D = E[ 4 (y_i − Φ(x_i'θ_0))² φ²(x_i'θ_0) x_i x_i' ]

and

E = E[ 2 φ²(x_i'θ_0) x_i x_i' ].

Therefore, V̂ = Ê^{−1} D̂ Ê^{−1} with

D̂ = E_n[ 4 (y_i − Φ(x_i'θ̂_n))² φ²(x_i'θ̂_n) x_i x_i' ]

and

Ê = E_n[ 2 φ²(x_i'θ̂_n) x_i x_i' ].
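A minimal Python sketch of the plug-in estimates D̂, Ê and the sandwich V̂ = Ê^{−1}D̂Ê^{−1} for the Probit NLS example; the simulated data are an illustrative choice, and for brevity the true value is plugged in where θ̂_n would be used.

```python
import numpy as np
from scipy.stats import norm

def probit_nls_sandwich(theta_hat, y, x):
    """V_hat = E_hat^{-1} D_hat E_hat^{-1} (plug-in sandwich) for Probit NLS."""
    xb = x @ theta_hat
    p, phi = norm.cdf(xb), norm.pdf(xb)
    xx = x[:, :, None] * x[:, None, :]                       # stacked x_i x_i' matrices
    D_hat = (4 * ((y - p) ** 2 * phi ** 2)[:, None, None] * xx).mean(axis=0)
    E_hat = (2 * (phi ** 2)[:, None, None] * xx).mean(axis=0)
    E_inv = np.linalg.inv(E_hat)
    return E_inv @ D_hat @ E_inv

# illustrative use with simulated data, using the true value in place of theta_hat
rng = np.random.default_rng(10)
x = np.c_[np.ones(500), rng.normal(size=500)]
theta = np.array([0.2, 0.8])
y = (rng.uniform(size=500) <= norm.cdf(x @ theta)).astype(float)
V = probit_nls_sandwich(theta, y, x)
print(np.sqrt(np.diag(V / 500)))    # standard errors: sqrt(diag(V_hat)/n)
```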


7.2.4 Conditional Maximum Likelihood


Applications of CML abound in the literature; see Chapter 15 in Wooldridge for
illustration in the context of limited dependent variables models. In this application
of the general theory Qn ( ) = Ln ( ) = En [ `i ( )] ; where `i ( ) = log f (yi jxi ; ):
Assume that is a compact set in an Euclidean space. Suppose that if 6= 0 ;
then f (yi jxi ; ) 6= f (yi jxi ; 0 ): Also, assume that f (yi jxi ; ) is twice continuously
di¤erentiable with respect to a.s. and f (yi jxi ; ) > 0 a.s. in a neighborhood N of
0 ; 0 2 int( ) and

E[sup jlog f (yi jxi ; )j] < 1; E[sup jr log f (yi jxi ; )j] < 1;
2 2
Z
sup jr f (yjxi ; )j dy < 1; (18)
2N
Z
[sup jr f (yjxi ; )j]dy < 1; (19)
2N

Finally, we assume that D :=V [s( 0 )] is non-singular, where s( ) =r `( ) = r log f (yj ; x) :


Under these conditions the CML estimator is asymptotically normal, as we show
now. First, by the consistency theorem (corollary for M-estimates), the CML esti-
mate ^CM L is consistent.
We de…ne the score process
@
Sn ( ) : = Ln ( ) = En [s( )] = En [s( ;y; x)]
@
which satis…es under the conditions above E [Sn ( 0 )jx] = E [s( 0 )jx] = 0 a.s., and

n1=2 Sn ( 0 ) !d N(0; D);

where D :=V [s( 0 )] : To see that E [Sn ( 0 )jx] = E [s( 0 )jx] = 0 note that,
Z
E [s( )jx] = s( ;y; x)f (yj ; x) dy
Z
= r f (yj ; x) dy
Z
= r f (yj ; x) dy = 0;
| {z }
=1; 8

for all ; where the exchange of integration and di¤erentiation is allowed by the uni-
form integrability condition (18), so for = 0 ; E [s( 0 )jx] = 0: In dynamic models,


x denotes the information at time t 1; and s( 0 ) follows a martingale di¤erence


sequence.
On the other hand, we can show that D = E because

Hn ( ) :=r Ln ( ) := En [h( ;y; x)]

is the Hessian (which is de…nite negative because we are now maximizing), so that
we obtain the so-called information equality

E [Hn ( )jx] = V [s( )jx]

because
=0
z
Z }| { Z
0
r s( ;y; x)f (yj ; x) dy = r0 fs( ;y; x)f (yj ; x)g dy
Z Z
= h( ;yi ; x)f (yj ; x) dy + s( ;y; x)s( ;y; x)0 f (yj ; x) dy

= E [h( )jx] + E s( ;y; x)s( ;y; x)0 jx = 0

and in particular for = 0 (so E = E) we obtain D :=E s( 0 )s( 0 )0 V [s( 0 )] =


E [Hn ( 0 ] := E; and
p
n(^CM L 0) !d N (0; E 1 ):

The su¢ cient conditions given here for AN of CML are not optimal. They can be
relaxed using the concept of functional derivatives (Frechet derivative) of the square
root of the density, but this is beyond the scope of these notes.

Example 81 (CML in Probit Binary Choice) The previous arguments apply to


CML, and in particular to Probit CML. Here

f (yi jxi ; 0) = (x0i yi


0 ) [1 (x0i 0 )]
1 yi
:

Then, f (yi jxi ; 0) = f (yi jxi ; ) implies

(x0i 0) = (x0i );

which in turn implies x0i 0 = x0i : Multiplying this by xi both sides, and taking
expectations we conclude 0 = provided

E [xi x0i ] is p.d.


This shows identi…cation. We proceed to prove consistency. Here

g(wi ; ) = yi log (x0i ) (1 yi ) log (1 (x0i ))

is continuous in for all wi : Verifying the dominance conditions is a little bit more
involved than for the NLS. Note that by the Mean Value Theorem

jlog (x0i )j = log (0) + (x0i )x0i


jlog (0)j + (x0i ) jx0i j
jln 2j + C 1 + x0i jx0i j
jln 2j + C j j jxi j + jxi j2 j j2
jln 2j + M jxi j + jxi j2 ;

where (u) = (u)= (u) is the derivative of log (u); is some intermediate point
between and zero, and C and M are constants. In these inequalities we have used
that (u) C j1 + uj and that is compact (and hence bounded). A similar bound
holds for log (1 (x0i )). Therefore, the moment g(wi ; ) satis…es the conditions
of the ULLN, and consistency follows. For asymptotic normality, we compute deriv-
atives (after some computation)

@Qn ( )
= En [yi xi (x0i ) (1 yi )xi ( x0i )]
@
@ 2 Qn ( ) h i
= En _ (x0i )y + _ ( x0i )(1 yi ) xi x0i :
@ @ 0
Then, by the CLT
p @
n Qn ( 0 ) = Sn ( 0 ) !d N(0; D);
@
where
2
(x0i )xi x0i
D =E :
(x0i ) (1 (x0i ))
We shall apply Theorem 11 with

g(wi ; ) = _ (x0 )y + _ ( x0 )(1 yi ) xi x0i :


i i

This function is continuous for all wp1, and noting that _ (u) is bounded, we have

jg(wi ; )j C jxi j2 :


Then, by Theorem 11 and the remark after the AN Theorem, we have


2
@2 (x0i )xi x0i
Qn (~n ) !p E =E :
@ @ 0 (x0i ) (1 (x0i ))

Thus, the Probit MLE satis…es


p
n(^CM L 0) !d N (0; E 1 ):

Remark 1 Under correct speci…cation of the density, the asymptotic variance of


MLE and CMLE simpli…es to the inverse of the Fisher Information. Without correct
speci…cation, but under identi…cation as in QMLE, the asymptotic variance is the
general expression E 1 DE 1 : Thus, QMLE requires robust standard errors for valid
inference.

Example 82 (MM, cont.) Assume $h(\theta) = E_\theta[\varphi(w)]$ is one-to-one on an open set $\Theta$ of the Euclidean space $\mathbb{R}^p$ and continuously differentiable at $\theta_0$ with nonsingular derivative $\dot h(\theta_0)$. Moreover, assume $E[|\varphi(w)|^2] < \infty$. Then, the MM estimator, defined by $h(\hat\theta_{MM}) = \mathbb{E}_n[\varphi(w)]$, exists with probability tending to one and satisfies
$$\sqrt{n}(\hat\theta_{MM} - \theta_0) \to_d N\big(0,\ \dot h^{-1}(\theta_0)\, E[\varphi(w)\varphi'(w)]\, \dot h^{-1}(\theta_0)'\big).$$
This result follows from the Delta Method applied to the inverse of $h(\theta)$ around $\theta_0$ (see Theorem 4.1 in van der Vaart (1998)). The result could also be obtained from a Taylor expansion argument in the empirical equation
$$\mathbb{E}_n[m(w, \hat\theta_n^{MM})] = 0$$
corresponding to $m(w, \theta_0) = \varphi(w) - h(\theta_0)$ (provide the argument). MLE in exponential families is a method of moments estimator, so the theory above could also be used to show asymptotic normality of MLE in exponential families.


7.3 Numerical Optimization


In this lecture we discuss techniques to optimize a function $Q_n(\theta)$ of a $p$-dimensional vector $\theta$ over a given set of values of $\theta$, $\Theta$. That is, we wish to find
$$\hat\theta_n = \arg\min_{\theta\in\Theta} Q_n(\theta).$$
We first discuss concentration techniques, which can be useful when the dimension $p$ is large, or when a closed-form solution exists for a subcomponent of $\hat\theta_n$.

Concentration

Suppose we partition $\theta$ as $\theta = (\theta_1', \theta_2')'$, where $\theta_1$ is $p_1\times 1$, $\theta_2$ is $p_2\times 1$ and $p_1 + p_2 = p$. Suppose that for given $\theta_1$ we can obtain an explicit formula for the optimizing value of $\theta_2$; that is,
$$\hat\theta_{2n}(\theta_1) = \arg\min_{\theta_2} Q_n(\theta_1, \theta_2)$$
can be written
$$\hat\theta_{2n}(\theta_1) = g_n(\theta_1)$$
for a given function $g_n$.

Now form the concentrated function
$$R_n(\theta_1) = Q_n\big(\theta_1, \hat\theta_{2n}(\theta_1)\big),$$
and, partitioning $\hat\theta_n = (\hat\theta_{n1}', \hat\theta_{2n}')'$, we have
$$\hat\theta_{n1} = \arg\min_{\theta_1} R_n(\theta_1), \qquad \hat\theta_{2n} = g_n(\hat\theta_{n1}).$$
The dimension of the numerical optimization has been reduced from $p$ to $p_1$. This is particularly useful if "search" methods are subsequently to be used. The disadvantage: $R_n(\theta_1)$ is typically more complicated than $Q_n(\theta)$.

Example 83 (Nonlinear least squares).
$$y = h(\theta_1, z)'\theta_2 + v.$$

E.g.,
$$y = \alpha + \beta e^{\gamma z} + v = (\alpha, \beta)\begin{pmatrix}1\\ e^{\gamma z}\end{pmatrix} + v = \theta_2' h(\theta_1) + v,$$
with $\theta_1 = \gamma$ and $\theta_2 = (\alpha, \beta)'$. Then
$$Q_n(\theta) = \sum_{i=1}^n \big(y_i - \theta_2' h_i(\theta_1)\big)^2,$$
$$\hat\theta_{2n}(\theta_1) = \left\{\sum_{i=1}^n h_i(\theta_1) h_i(\theta_1)'\right\}^{-1}\sum_{i=1}^n h_i(\theta_1) y_i = g_n(\theta_1),$$
$$R_n(\theta_1) = \sum_{i=1}^n \big(y_i - \hat\theta_{2n}'(\theta_1) h_i(\theta_1)\big)^2 = y'\Big[I - h(\theta_1)\{h'(\theta_1)h(\theta_1)\}^{-1} h'(\theta_1)\Big] y,$$
where
$$y = \begin{pmatrix}y_1\\ \vdots\\ y_n\end{pmatrix}, \qquad h(\theta_1) = \begin{pmatrix}h_1'(\theta_1)\\ \vdots\\ h_n'(\theta_1)\end{pmatrix}.$$
We can do a similar thing in Gaussian maximum likelihood estimation of
$$y_i = \theta_{2a}' h(\theta_1) + v_i, \qquad v_i \sim N(0, \theta_{2b}), \qquad \theta_2 = (\theta_{2a}', \theta_{2b})',$$
i.e. we can concentrate out the disturbance variance $\theta_{2b}$ as well as the scale parameters $\theta_{2a}$.


Search Methods
We have been implicitly assuming that $\Theta$ contains an uncountable infinity of points. Suppose that $A \subset \Theta$ is a finite set of $N < \infty$ points. Let
$$\hat\theta_N = \arg\min_{\theta\in A} Q_n(\theta).$$
If $Q_n(\theta)$ is sufficiently smooth and $A$ is sufficiently "representative" of $\Theta$, we might expect $\hat\theta_N \approx \hat\theta_n$.
At least we might hope that ^N is good enough to start an iterative procedure.
However the larger p; the larger N is likely to be, so the computations can be ex-
pensive.

1. Grid Search: choose $A$ a priori, for example equally spaced points.

2. Random Search. Choose A by means of random number generators. A


uniform distribution could be used, but alternatively a distribution with a
greater density around the suspected ^n :

3. Alternating Search.
Put $\theta = (\theta_1, \theta_2, \ldots, \theta_p)'$. Fix $\theta_2, \ldots, \theta_p$ and search over $\theta_1$. Then, for the optimizing $\theta_1$ and the previous $\theta_3, \ldots, \theta_p$, search over $\theta_2$, etc.

There are more sophisticated methods than these.

Example 84 (Nonlinear least squares).
$$y = h(\theta_1)'\theta_2 + v.$$
In Example 83 we can concentrate out $\theta_2$ and then search over $R_n(\theta_1)$, where $\theta_1$ is scalar.
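The following Python sketch (not from the notes) combines concentration with a grid search as in Examples 83–84. The model $y = \alpha + \beta e^{\theta_1 z} + v$, the grid, and all names are illustrative assumptions; for each candidate $\theta_1$ the linear part $\theta_2 = (\alpha, \beta)'$ is concentrated out by OLS, and only the scalar $\theta_1$ is searched.

```python
# Sketch (illustrative): concentration plus grid search for
# y = a + b*exp(theta1*z) + v.  For fixed theta1 the model is linear in
# theta2 = (a, b)', so theta2 is concentrated out by OLS.
import numpy as np

rng = np.random.default_rng(1)
n = 300
z = rng.uniform(0, 2, n)
y = 1.0 + 2.0 * np.exp(-0.7 * z) + 0.1 * rng.normal(size=n)

def concentrated_ssr(theta1):
    H = np.column_stack([np.ones(n), np.exp(theta1 * z)])   # h(theta1)
    theta2 = np.linalg.lstsq(H, y, rcond=None)[0]            # hat{theta}_2n(theta1)
    resid = y - H @ theta2
    return resid @ resid, theta2                              # R_n(theta1), g_n(theta1)

grid = np.linspace(-3, 3, 601)                                # the set A
ssr = [concentrated_ssr(t)[0] for t in grid]
theta1_hat = grid[int(np.argmin(ssr))]
theta2_hat = concentrated_ssr(theta1_hat)[1]
print(theta1_hat, theta2_hat)
```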


7.3.1 Newton-Raphson
We now assume that $\hat\theta_n$ also satisfies
$$f_n(\hat\theta_n) = \frac{\partial Q_n(\hat\theta_n)}{\partial\theta} = 0.$$
Suppose that $\theta_{n(k)}$ is close to $\hat\theta_n$. By Taylor's Theorem,
$$0 = f_n(\hat\theta_n) = f_n(\theta_{n(k)}) + F_n(\theta_{n(k)})(\hat\theta_n - \theta_{n(k)}) + O\big(\|\hat\theta_n - \theta_{n(k)}\|^2\big),$$
under regularity conditions, where
$$F_n(\theta) = \frac{\partial f_n(\theta)}{\partial\theta'} = \frac{\partial^2 Q_n(\theta)}{\partial\theta\,\partial\theta'}.$$
Thus
$$0 \approx f_n(\theta_{n(k)}) + F_n(\theta_{n(k)})(\hat\theta_n - \theta_{n(k)}),$$
or
$$\hat\theta_n - \theta_{n(k)} \approx -F_n(\theta_{n(k)})^{-1} f_n(\theta_{n(k)})$$
(with $F_n$ nonsingular). This suggests forming an iterative sequence
$$\theta_{n(k+1)} = \theta_{n(k)} - F_n(\theta_{n(k)})^{-1} f_n(\theta_{n(k)}), \qquad k = 1, 2, \ldots,$$
given an initial $\theta_{n(1)}$. One hopes that the sequence $\{\theta_{n(k)}\}$ converges to $\hat\theta_n$. When $Q_n(\theta)$ is quadratic it trivially converges to $\hat\theta_n$ (i.e. $\theta_{n(2)} = \hat\theta_n$ for any $\theta_{n(1)}$). When it converges, it can be shown under suitable conditions that
$$\|\hat\theta_n - \theta_{n(k+1)}\| \le C\,\|\hat\theta_n - \theta_{n(k)}\|^2,$$
i.e. quadratic convergence (we treat $Q_n(\theta)$ here as if it were nonstochastic).


Example 85 (Nonlinear least squares).
$$y_i = \mu_i(\theta) + v_i,$$
$$Q_n(\theta) = \frac{1}{n}\sum_{i=1}^n \big(y_i - \mu_i(\theta)\big)^2,$$
$$f_n(\theta) = -\frac{2}{n}\sum_{i=1}^n \big(y_i - \mu_i(\theta)\big)\frac{\partial\mu_i(\theta)}{\partial\theta},$$
$$F_n(\theta) = \frac{2}{n}\sum_{i=1}^n \frac{\partial\mu_i(\theta)}{\partial\theta}\frac{\partial\mu_i(\theta)}{\partial\theta'} - \frac{2}{n}\sum_{i=1}^n \big(y_i - \mu_i(\theta)\big)\frac{\partial^2\mu_i(\theta)}{\partial\theta\,\partial\theta'}.$$

However the Newton-Raphson iterations need not converge. In general it is


possible that Qn ( n(k+1) ) can be greater than Qn ( n(k) ); and Fn ( n(k) ) may not be
positive de…nite.
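A small Python sketch of the Newton-Raphson iterations for the scalar NLS model $y_i = e^{\theta z_i} + v_i$ may make the recursion concrete. It is illustrative only: the model, the starting value and the simulated data are assumptions, and the derivatives are those of Example 85 specialized to this case.

```python
# Sketch: Newton-Raphson for scalar NLS y_i = exp(theta*z_i) + v_i (illustrative).
import numpy as np

rng = np.random.default_rng(2)
n = 200
z = rng.uniform(0, 1, n)
y = np.exp(0.5 * z) + 0.05 * rng.normal(size=n)

def Q(theta):            # Q_n(theta) = (1/n) sum (y_i - mu_i(theta))^2
    return np.mean((y - np.exp(theta * z)) ** 2)

def f(theta):            # f_n(theta), first derivative
    mu = np.exp(theta * z)
    return -2 * np.mean((y - mu) * z * mu)

def F(theta):            # F_n(theta), second derivative
    mu = np.exp(theta * z)
    return 2 * np.mean((z * mu) ** 2) - 2 * np.mean((y - mu) * z**2 * mu)

theta = 0.0              # initial value theta_{n(1)}
for k in range(20):
    step = f(theta) / F(theta)
    theta -= step        # theta_{n(k+1)} = theta_{n(k)} - F^{-1} f
    if abs(step) < 1e-10:
        break
print(theta, Q(theta))
```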


7.3.2 Gauss-Newton
Suppose $Q_n(\theta)$ is of the form
$$Q_n(\theta) = \frac{1}{2}\, r_n(\theta)' r_n(\theta), \qquad r_n(\theta)\ \ N\times 1,$$
for a vector $r_n(\theta) = (r_{n1}(\theta), \ldots, r_{nN}(\theta))'$. Now
$$f_n(\theta) = \frac{\partial r_n'(\theta)}{\partial\theta}\, r_n(\theta) = \sum_{i=1}^N \frac{\partial r_{ni}(\theta)}{\partial\theta}\, r_{ni}(\theta),$$
$$F_n(\theta) = \frac{\partial r_n'(\theta)}{\partial\theta}\frac{\partial r_n(\theta)}{\partial\theta'} + \underbrace{\sum_{i=1}^N \frac{\partial^2 r_{ni}(\theta)}{\partial\theta\,\partial\theta'}\, r_{ni}(\theta)}_{(*)}.$$
Because $Q_n(\theta)$ is here a quadratic form, we might expect that the $r_{ni}(\theta)$ are "small" for $\theta$ near $\hat\theta_n$. Thus we might omit the term $(*)$ and consider
$$\theta^G_{n(k+1)} = \theta^G_{n(k)} - G_n(\theta^G_{n(k)})^{-1} f_n(\theta^G_{n(k)}), \qquad G_n(\theta) = \frac{\partial r_n'(\theta)}{\partial\theta}\frac{\partial r_n(\theta)}{\partial\theta'}.$$
Advantages:
Advantages:

1. Gn ( ) is psd always.

2. We do not need to calculate 2nd derivatives of rn ( ):

On the other hand, Gauss-Newton tends to converge more slowly than Newton-
Raphson. GN can be used in many econometric problems.
The Score Method replaces the Hessian $G_n(\theta^G_{n(k)})$ by (minus) the Fisher information matrix $I(\theta^G_{n(k)})$ (i.e., its expectation).


Example 86 (Nonlinear least squares)

In the NLS example
$$G_n(\theta) = \frac{2}{n}\sum_{i=1}^n \frac{\partial\mu_i(\theta)}{\partial\theta}\frac{\partial\mu_i(\theta)}{\partial\theta'}.$$
The term
$$-\frac{2}{n}\sum_{i=1}^n \frac{\partial^2\mu_i(\theta)}{\partial\theta\,\partial\theta'}\big(y_i - \mu_i(\theta)\big)$$
can be neglected (note that this has zero mean at $\theta_0$, the true value of $\theta$).

Typically there can be a variety of matrices which also approximate Fn ( )


and also use only 1st derivatives. They may exploit the particular structure of the
problem and the statistical assumptions. There are also modi…cations of GN which
have better numerical properties.


7.4 Extensions to non-smooth objective functions4


The previous theory does not apply to Lasso or quantile regression. We briefly discuss extensions to these cases.

7.4.1 Lasso estimator


We follow Knight and Fu (2000, Annals of Statistics). Assume the Lasso estimator $\hat\beta_n$ minimizes the objective function
$$Q_n(\beta) = \mathbb{E}_n\big[(y - x'\beta)^2\big] + \frac{\lambda_n}{n}\sum_{j=1}^p |\beta_j|,$$
where $\beta = (\beta_1, \ldots, \beta_p)'$ and $\lambda_n$ is a sequence of positive numbers such that
$$\frac{\lambda_n}{\sqrt{n}} \to \lambda_0 \ge 0.$$

Define the stochastic function
$$V(u) = -2u'Z + u'\Sigma u + \lambda_0 \sum_{j=1}^p \big[u_j\,\mathrm{sgn}(\beta_{0j})\,1(\beta_{0j}\neq 0) + |u_j|\,1(\beta_{0j} = 0)\big],$$
where $\Sigma = E[x_i x_i']$ is assumed to be positive definite and $Z \sim N(0, \sigma^2\Sigma)$, with $\varepsilon_0 = y - x'\beta_0$ independent of $x$ and with variance $\sigma^2$. Then, Theorem 2 in Knight and Fu (2000, Annals of Statistics) shows
$$\sqrt{n}(\hat\beta_n - \beta_0) \to_d \arg\min_u V(u).$$
Note the asymptotic distribution is implicitly defined, and is non-Gaussian when $\lambda_0 > 0$. What happens if $\lambda_0 = 0$?
The proof of the last convergence is beyond the scope of this course, but a heuristic argument is as follows. Write $\varepsilon = y - x'\beta = \varepsilon_0 - x'(\beta - \beta_0)$. Denoting $u = \sqrt{n}(\beta - \beta_0)$, we can write a centered version of the objective function as
$$V_n(u) = n\,\mathbb{E}_n\Big[\big(\varepsilon_0 - x'u/\sqrt{n}\big)^2 - \varepsilon_0^2\Big] + \lambda_n\sum_{j=1}^p\Big[\big|\beta_{0j} + u_j/\sqrt{n}\big| - |\beta_{0j}|\Big].$$

4
This material does not go into the exam.


It can be verified that
$$n\,\mathbb{E}_n\Big[\big(\varepsilon_0 - x'u/\sqrt{n}\big)^2 - \varepsilon_0^2\Big] \to_d -2u'Z + u'\Sigma u,$$
while
$$\lambda_n\sum_{j=1}^p\Big[\big|\beta_{0j} + u_j/\sqrt{n}\big| - |\beta_{0j}|\Big] \to \lambda_0\sum_{j=1}^p\big[u_j\,\mathrm{sgn}(\beta_{0j})\,1(\beta_{0j}\neq 0) + |u_j|\,1(\beta_{0j} = 0)\big].$$
Thus, $V_n(u) \to_d V(u)$ and, since $V_n(u)$ is convex and $V(u)$ has a unique minimum, it follows that
$$\sqrt{n}(\hat\beta_n - \beta_0) = \arg\min_u V_n(u) \to_d \arg\min_u V(u).$$

7.4.2 Quantile Regression


The quantile regression estimator minimizes the objective function
$$\theta \mapsto Q_n(\theta) = \frac{1}{n}\sum_{i=1}^n \rho_\tau(y_i - x_i'\theta).$$
Define $\varepsilon_0 = y - x'\theta_0$ and let $f_\varepsilon(\varepsilon\mid x)$ be the conditional density of $\varepsilon_0$ given $x$. Define
$$V_0(\tau) = E\big[x_i x_i' f_\varepsilon(0\mid x_i)\big].$$
Note $f_\varepsilon(0\mid x_i)$ can also be expressed as the conditional density of $y_i$ given $x_i$ evaluated at the quantile $\theta_0'x_i$. The proof of the following result can be found in Angrist et al. (2006, Econometrica).
Theorem 87 Assume $\{y_i, x_i\}$ are iid; $f_\varepsilon(\varepsilon\mid x_i)$ exists, is bounded and uniformly continuous around 0; $V_0(\tau)$ is positive definite; $E[|x|^p] < \infty$ for some $2 < p < \infty$. Then,
$$\sqrt{n}(\hat\theta_n - \theta_0) \to_d N\big(0, \Sigma(\tau)\big),$$
where
$$\Sigma(\tau) = V_0^{-1}(\tau)\, K(\tau)\, V_0^{-1}(\tau)$$
and
$$K(\tau) = E\big[\big(1(\varepsilon_0 \le 0) - \tau\big)^2 x x'\big].$$
If the quantile model is correctly specified, then $\Sigma(\tau) = \tau(1-\tau)\, V_0^{-1}(\tau) E[xx'] V_0^{-1}(\tau)$.
How do we estimate ( )? This is a di¢ cult problem. See Escanciano and Goh
(Quantile-Regression Inference With Adaptive Control of Size, with Chuan Goh,
Journal of the American Statistical Association, 114, 382-393, 2019) for a recent
proposal.


8 Statistical Inference: Hypothesis Testing


8.1 Finite Sample Theory
We introduce some basic concepts of testing. Let X be a sample from a population
P in a model P. Based on the observed X, we test a given hypothesis
H0 : P 2 P0 vs H1 : P 2 P1
where P0 and P1 are two disjoint subsets of P and P0 [ P1 = P.
A test for a hypothesis is a statistic T (X) taking values in [0; 1]. When X = x is
observed, we reject H0 with probability T (x). If T (X) = 1 or 0 a.s. P, then T (X) is
a nonrandomized test; otherwise T (X) is randomized.
For a given test $T(X)$, the power function of $T(X)$ is defined to be
$$\beta_T(P) = E_P[T(X)], \qquad P\in\mathcal{P},$$
which is the type I error probability of T (X) when P 2 P0 and one minus the type
II error probability of T (X) when P 2 P1 .
With a sample of a …xed size, we are not able to minimize the two error proba-
bilities simultaneously. Our approach involves maximizing the power T (P ) for all
P 2 P1 (i.e., minimizing the type II error probability) over all tests T satisfying
sup T (P ) ;
P 2P0

where 2 [0; 1] is a given level of signi…cance. The left-hand side of the last expres-
sion is defined to be the size of $T$. The level of significance $\alpha$ is often small (e.g. 0.1, 0.05 or 0.01), so a type I error is considered a more serious error than a type II error.
Example 88 (Normal variables) Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$ random variables. Suppose that $H_0: \mu \le 0$ and $H_1: \mu > 0$. A popular test is the t-test
$$T(X) = 1(t_n > c),$$
where
$$t_n = \frac{\sqrt{n}\,\bar X}{S}$$
is the t-test statistic and $c$ is a constant to be determined so that $\sup_{P\in\mathcal{P}_0}\beta_T(P) \le \alpha$. Note $t_n = Z_n + \delta_n$, where $Z_n = \sqrt{n}(\bar X - \mu)/S$ and $\delta_n = \sqrt{n}\mu/S$. Then,
$$\sup_{P\in\mathcal{P}_0}\beta_T(P) = \sup_{P\in\mathcal{P}_0}E_P[T(X)] = \sup_{P\in\mathcal{P}_0}P(Z_n + \delta_n > c) = P_0(Z_n > c),$$


where $P_0$ denotes $P$ when $\mu = 0$. Thus, we need to compute $c$ so that $P_0(Z_n > c) \le \alpha$, or $\alpha \ge 1 - F_{0,n}(c)$, where $F_{0,n}$ is the cdf of $Z_n$ (which is known: the cdf of a Student-t distribution with $n-1$ degrees of freedom). Then,
$$c = c_{n,\alpha} = F_{0,n}^{-1}(1-\alpha) = \inf\{u:\ 1 - F_{0,n}(u) \le \alpha\},$$
which again is known. The t-test then rejects $H_0$ at significance level $\alpha$ if $t_n > c_{n,\alpha}$. The critical region is $S(\alpha) = \{x:\ t_n(x) > c_{n,\alpha}\}$. The p-value is defined as
$$\hat p = \inf\{u:\ t_n > c_{n,u}\} = P_{Z_n}(Z_n > t_n) = 1 - F_{0,n}(t_n),$$
so, equivalently, $t_n > c_{n,\alpha}$ iff $\hat p < \alpha$.

A test $T_*$ of size $\alpha$ is a uniformly most powerful (UMP) test if and only if $\beta_{T_*}(P) \ge \beta_T(P)$ for all $P\in\mathcal{P}_1$ and all tests $T$ of level $\alpha$. Searching over all measurable functions of $X$ can be simplified if there is a sufficient statistic for $P\in\mathcal{P}$. This follows because if $U(X)$ is a sufficient statistic for $P\in\mathcal{P}$, then for any test $T(X)$, $E(T\mid U)$ has the same power function as $T$ and, therefore, to find a UMP test we may consider tests that are functions of $U$ only. The next theorem is the celebrated Neyman-Pearson Lemma (for its proof see Shao).

Theorem 89 Suppose that $\mathcal{P}_0 = \{P_0\}$ and $\mathcal{P}_1 = \{P_1\}$. Let $f_j$ be the p.d.f. of $P_j$ w.r.t. a $\sigma$-finite measure $\nu$ (e.g., $\nu = P_0 + P_1$), $j = 0, 1$.
(i) Existence of a UMP test: For every $\alpha$, there exists a UMP test of size $\alpha$, which is
$$T_*(X) = \begin{cases} 1 & \text{if } f_1(X) > c f_0(X) \\ \gamma & \text{if } f_1(X) = c f_0(X) \\ 0 & \text{if } f_1(X) < c f_0(X),\end{cases}$$
where $\gamma\in[0,1]$ and $c\ge 0$ are some constants chosen so that $E[T_*(X)] = \alpha$ when $P = P_0$ ($c = \infty$ is allowed).
(ii) Uniqueness: If $T$ is a UMP test of size $\alpha$, then
$$T(X) = \begin{cases} 1 & \text{if } f_1(X) > c f_0(X) \\ 0 & \text{if } f_1(X) < c f_0(X)\end{cases} \quad \text{a.s. } \mathcal{P}.$$

Theorem 89 shows that when both H0 and H1 are simple (a hypothesis is


simple i¤ the corresponding set of populations contains exactly one element),
there exists a UMP test that can be determined by Theorem 89 uniquely (a.s.
P) except on the set B = fx : f1 (x) = cf0 (x)g.


If (B) = 0, then we have a unique non-randomized UMP test; otherwise UMP


tests are randomized on the set B and the randomization is necessary for UMP
tests to have the given size :
We can always choose a UMP test that is constant on B.

Example 90 (Bernoulli variables) Let $X_1, \ldots, X_n$ be i.i.d. binary random variables with $p = P(X_1 = 1)$. Suppose that $H_0: p = p_0$ and $H_1: p = p_1$, where $0 < p_0 < p_1 < 1$. By Theorem 89, a UMP test of size $\alpha$ is
$$T_*(Y) = \begin{cases} 1 & \text{if } \ell(Y) > c \\ \gamma & \text{if } \ell(Y) = c \\ 0 & \text{if } \ell(Y) < c,\end{cases}$$
where $Y = \sum_{i=1}^n X_i$ and
$$\ell(Y) = \left(\frac{p_1}{p_0}\right)^Y\left(\frac{1-p_1}{1-p_0}\right)^{n-Y}.$$
Since $\ell(Y)$ is increasing in $Y$, there is an integer $m > 0$ such that
$$T_*(Y) = \begin{cases} 1 & \text{if } Y > m \\ \gamma & \text{if } Y = m \\ 0 & \text{if } Y < m,\end{cases}$$
where $m$ and $\gamma$ satisfy $\alpha = E_0[T_*(Y)] = P_0(Y > m) + \gamma P_0(Y = m)$. Since $Y$ has the binomial distribution $Bi(p, n)$, we can determine $m$ and $\gamma$ from
$$\alpha = \sum_{j=m+1}^n \binom{n}{j} p_0^j (1-p_0)^{n-j} + \gamma\binom{n}{m} p_0^m (1-p_0)^{n-m}.$$

Unless
$$\alpha = \sum_{j=m+1}^n \binom{n}{j} p_0^j (1-p_0)^{n-j}$$
for some integer $m$, in which case we can choose $\gamma = 0$, the UMP test $T_*$ is a randomized test.
An interesting phenomenon in this example is that the UMP test $T_*$ does not depend on $p_1$. In such a case, $T_*$ is in fact a UMP test for testing $H_0: p = p_0$ versus $H_1: p > p_0$. This last property generalizes to families with a monotone likelihood ratio property (see Shao for the definition).
An interesting application of this example is to Backtesting in risk management.
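The following Python sketch (not part of the notes) solves numerically for $m$ and $\gamma$ in the randomized UMP test of Example 90 and evaluates its power; the values of $n$, $p_0$, $p_1$ and $\alpha$ are illustrative assumptions.

```python
# Sketch: m and gamma for the randomized UMP test of H0: p = p0 vs H1: p > p0,
# using the Binomial(n, p0) distribution of Y (illustrative parameter values).
import numpy as np
from scipy.stats import binom

n, p0, alpha = 20, 0.4, 0.05
tail = 1 - binom.cdf(np.arange(n + 1), n, p0)      # P_{p0}(Y > m) for m = 0,...,n
m = int(np.argmax(tail <= alpha))                  # smallest m with P(Y > m) <= alpha
gamma = (alpha - tail[m]) / binom.pmf(m, n, p0)    # randomization probability at Y = m
print(m, gamma)

def power(p1):                                     # beta_{T*}(p1) for p1 > p0
    return (1 - binom.cdf(m, n, p1)) + gamma * binom.pmf(m, n, p1)
print(power(0.6))
```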


For two-sided hypotheses, UMP tests are rare. Imposing unbiasedness helps with real-valued parameters. More generally, for multivariate parameters, UMP tests typically do not exist, and further restrictions (such as invariance) need to be introduced to define UMP tests.

Example 91 (Normal variables, cont.) Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$ random variables, with $\sigma^2$ known. Suppose that $H_0: \mu = 0$ and $H_1: \mu = \mu_1$, $\mu_1 > 0$. Compute the likelihood ratio as
$$\frac{f_1(X)}{f_0(X)} = \exp\left(\frac{\mu_1}{\sigma^2}\, Y - \frac{n\mu_1^2}{2\sigma^2}\right),$$
where $Y = \sum_{i=1}^n X_i$. Thus, a UMP test of size $\alpha$ is
$$T_*(Y) = \begin{cases} 1 & \text{if } \ell(Y) > c \\ \gamma & \text{if } \ell(Y) = c \\ 0 & \text{if } \ell(Y) < c,\end{cases}$$
where now
$$\ell(Y) = \exp\left(\frac{\mu_1}{\sigma^2}\, Y - \frac{n\mu_1^2}{2\sigma^2}\right).$$
Since $\ell(Y)$ is increasing in $Y$, there is a constant $m > 0$ such that
$$T_*(Y) = \begin{cases} 1 & \text{if } Y > m \\ \gamma & \text{if } Y = m \\ 0 & \text{if } Y < m,\end{cases}$$
where $m$ and $\gamma$ satisfy $\alpha = E_0[T_*(Y)] = P_0(Y > m) + \gamma P_0(Y = m)$. Since $Y$ has a normal distribution $N(n\mu, n\sigma^2)$, which is continuous, we can set $\gamma = 0$ and take $m$ as the corresponding quantile of the normal distribution. The UMP test is non-randomized and given by $T_*(Y) = 1(t_n > z_\alpha)$, where $t_n = \sqrt{n}\bar X/\sigma$ and $z_\alpha = \Phi^{-1}(1-\alpha)$. As in the Bernoulli case, the UMP test $T_*$ does not depend on $\mu_1$ and hence is also a UMP test for testing $H_0: \mu = 0$ against $H_1: \mu > 0$. Define $Z_n = \sqrt{n}(\bar X - \mu)/\sigma$ and $\delta_n = \sqrt{n}\mu/\sigma$. The power function of the UMP test is
$$\beta_{T_*}(\mu) = E_\mu[T_*(X)] = P(Z_n + \delta_n > z_\alpha) = 1 - \Phi(z_\alpha - \delta_n),$$
which is an increasing function of $\delta_n$. Note $\beta_{T_*}(0) = \alpha$, $\beta_{T_*}(\mu) < \alpha$ if $\mu < 0$ and $\beta_{T_*}(\mu) > \alpha$ if $\mu > 0$.
If the testing problem is two-sided, $H_0: \mu = 0$ and $H_1: \mu \neq 0$, the previous optimal test is not UMP anymore. If we incorporate the additional restriction of unbiasedness, i.e. $\beta_T(\mu) \ge \alpha$ if $\mu \neq 0$, then the test rejecting for large absolute values of $t_n$ is UMP within the class of unbiased tests of level $\alpha$.


Example 92 Let $X_1, \ldots, X_n$ be i.i.d. random variables from a one-parameter exponential distribution with unknown mean $\theta$, and pdf $f(x) = \theta^{-1}e^{-x/\theta}$, $x > 0$ and $\theta > 0$.

1. (a) Derive the most powerful test of size = 0:05 for testing H0 : = 0
against H1 : = 1 ; where 0 and 1 are known constants, 0 < 0 < 1 .
A complete answer should include an explicit description
Pn of the critical
region, using a known distribution. [Hint: (2= ) i=1 Xi has a chi-square
distribution with 2n degrees of freedom. You do not need to prove this
result.]
(b) Argue that the test obtained in (a) is also a uniformly most powerful size
0.05 test for testing H0 : = 0 against H1 : > 0 . Express the power
of the test in terms of the CDF, say Gn (), of a chi-square r.v. with 2n
degrees of freedom. Is the power a decreasing or increasing function of ?
Explain.
(c) Propose an asymptotic test for the problem in (b) with limiting size
based on the CLT and the fact that V ar(X1 ) = 2 and show that this
test is consistent.
SOL: (a) By the Neyman-Pearson Lemma, a UMP test of size $\alpha$ is
$$T_*(X) = \begin{cases} 1 & \text{if } f_1(X) > c f_0(X) \\ \gamma & \text{if } f_1(X) = c f_0(X) \\ 0 & \text{if } f_1(X) < c f_0(X),\end{cases}$$
where
$$f_j(X) = \theta_j^{-n}\exp\left(-\frac{Y}{\theta_j}\right), \qquad j = 0, 1,$$
and $Y = \sum_{i=1}^n X_i$. Thus, since the distribution of the LR $\frac{f_1(X)}{f_0(X)}$ is continuous and a monotone increasing function of $Y$, we deduce there is a constant $m$ such that
$$T_*(X) = T_*(Y) = \begin{cases} 1 & \text{if } Y > m \\ 0 & \text{if } Y < m,\end{cases}$$
where $m$ satisfies $\alpha = E_0[T_*(Y)] = P_0(Y > m)$. By the hint and
$$\alpha = P_0(2Y/\theta_0 > 2m/\theta_0),$$
it follows that $2m/\theta_0 = \chi^2_{2n,1-\alpha}$, or $m = \theta_0\,\chi^2_{2n,1-\alpha}/2$. The rejection region is $Y > \theta_0\,\chi^2_{2n,1-\alpha}/2$.


(b) Since the UMP test $T_*$ does not depend on $\theta_1$, this test is also a UMP test for testing $H_0: \theta = \theta_0$ against $H_1: \theta > \theta_0$. The power function of the UMP test is
$$\beta_{T_*}(\theta) = E_\theta[T_*(X)] = P(2Y/\theta > 2m/\theta) = 1 - G_n\big(\chi^2_{2n,1-\alpha}\,\theta_0/\theta\big),$$
which is an increasing function of $\theta$. Note $\beta_{T_*}(\theta_0) = \alpha$, $\beta_{T_*}(\theta) < \alpha$ if $\theta < \theta_0$ and $\beta_{T_*}(\theta) > \alpha$ if $\theta > \theta_0$.
(c) From the CLT,
$$\sqrt{n}\,\frac{\bar X - \theta}{\theta} \to_d N(0, 1).$$
Thus an asymptotic test rejects at level $\alpha$ if
$$\bar X > \theta_0 + \theta_0 z_{0.95}/\sqrt{n}.$$
To show consistency, note that under $H_1: \theta > \theta_0$,
$$P\big(\bar X > \theta_0 + \theta_0 z_{0.95}/\sqrt{n}\big) = P\left(\sqrt{n}\,\frac{\bar X - \theta}{\theta} > -\frac{\sqrt{n}(\theta - \theta_0)}{\theta} + \frac{\theta_0 z_{0.95}}{\theta}\right) \to 1 - \Phi(-\infty) = 1 \quad\text{as } n\to\infty.$$

Remark 2 For composite null and composite alternatives, UMP tests are harder to find. The following important example illustrates this point.
1. Let $X_1, \ldots, X_n$ be i.i.d. $N_k(\mu, \Sigma)$ random variables, with $\Sigma$ positive definite and known.

(a) Let $c$ be a fixed vector in $\mathbb{R}^k$, and fix $\mu_1$ with $c'\mu_1 > 0$. (i) Derive the most powerful test of size $\alpha = 0.05$ for testing $H_0: \mu = \mu_0$ against $H_1: \mu = \mu_1$, where $\mu_0 = \mu_1 - (c'\mu_1/c'\Sigma c)\,\Sigma c$. (ii) Argue that the test obtained is also a uniformly most powerful size 0.05 test for testing $H_0: c'\mu = 0$ against $H_1: c'\mu > 0$.
(b) Consider the two-sample problem where we have independent observations, $Z_1, \ldots, Z_n$ i.i.d. $N(\mu_1, \sigma_1^2)$ and $Y_1, \ldots, Y_n$ i.i.d. $N(\mu_2, \sigma_2^2)$ (treatment and control groups, respectively). (i) Assuming $\sigma_1^2$ and $\sigma_2^2$ are known, derive the most powerful test of size $\alpha = 0.05$ for testing no treatment effect, $H_0: \mu_1 = \mu_2$, against $H_1: \mu_1 > \mu_2$. [Hint: use part (a) for a specific vector $c$.] (ii) For future planning of an experiment, you want to compute the minimal sample size needed to achieve a power of at least $1-\beta$ for an alternative of size $\Delta = \mu_1 - \mu_2 > 0$; how would you compute such a sample size? What happens with this sample size when $\Delta$ is very small?


SOL: (a) (i) Computing the likelihood ratio and using $\mu_0 = \mu_1 - (c'\mu_1/c'\Sigma c)\,\Sigma c$ we obtain
$$\frac{f_1(X)}{f_0(X)} = \exp\left(\frac{c'\mu_1}{c'\Sigma c}\, c'Y - \frac{n(c'\mu_1)^2}{2\,c'\Sigma c}\right),$$
where $Y = \sum_{i=1}^n X_i$. The UMP test thus rejects for large values of $c'Y$; precisely $T_*(Y) = 1(t_n > z_\alpha)$, where $t_n = \sqrt{n}\,c'\bar X/\sqrt{c'\Sigma c}$. Since $c'Y$ has a normal distribution $N(nc'\mu, nc'\Sigma c)$, which is continuous, we can set $\gamma = 0$ and take the corresponding quantile of the normal distribution. (ii) The UMP test $T_*$ does not depend on $\mu_0$ and $\mu_1$, $c'\mu_0 = 0$, and hence it is also a UMP test for testing $H_0: c'\mu = 0$ vs $H_1: c'\mu > 0$.
(b) (i) Define $X = (Z, Y)$. By independence of $Z$ and $Y$, $X$ is a bivariate normal with mean $\mu = (\mu_1, \mu_2)$ and diagonal variance-covariance matrix (with known diagonal elements $\sigma_1^2$ and $\sigma_2^2$). Taking $c = (1, -1)'$ and applying part (a), we conclude that the UMP test for testing $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 > \mu_2$ is $T_*(Y) = 1(t_n > z_\alpha)$, where $t_n = \sqrt{n}(\bar Z - \bar Y)/\sqrt{\sigma_1^2 + \sigma_2^2}$. (ii) Define $Z_n = \sqrt{n}(\bar Z - \bar Y - \Delta)/\sqrt{\sigma_1^2 + \sigma_2^2}$ and $\delta_n = \sqrt{n}\Delta/\sqrt{\sigma_1^2 + \sigma_2^2}$. The power function of the UMP test is
$$\beta_{T_*}(\Delta) = E_\Delta[T_*(X)] = P(Z_n + \delta_n > z_\alpha) = 1 - \Phi(z_\alpha - \delta_n).$$
Thus, to achieve a power of at least $1-\beta$ for an alternative of size $\Delta = \mu_1 - \mu_2 > 0$, we need $\Phi(z_\alpha - \delta_n) \le \beta$, or $z_\alpha - \delta_n \le \Phi^{-1}(\beta) = -z_\beta$. This happens if $n$ is larger than or equal to
$$[z_\alpha + z_\beta]^2\,\frac{\sigma_1^2 + \sigma_2^2}{\Delta^2}.$$
When $\Delta$ is very small, the required sample size is very large (unless $\sigma_1^2 + \sigma_2^2$ is very small too).

Example 93 (Nonparametric test for the mean) Let $X_1, \ldots, X_n$ be i.i.d. random variables with finite variance. We want to test $H_0: \mu = 0$ vs $H_1: \mu = \mu_1$, $\mu_1 > 0$, but we are not willing to assume a parametric distribution for the $X$s. That is, we relax the normal distribution assumption. We could use the t-statistic as before, but when we try to compute $c$ such that the test has level $\alpha$,
$$P_0(t_n > c) \le \alpha,$$
this $c = J_n^{-1}(1-\alpha; P_0)$ is unknown, as the quantile $J_n^{-1}(1-\alpha; P_0)$ of the distribution $J_n(x; P_0) = P_0(t_n \le x)$ depends on $P_0$, which is unknown (we only know its mean is

zero, but other aspects of the distribution that may a¤ect Jn are unknown). As we
do not know the …nite sample distribution of the test statistic, we are not even able
to de…ne the test. How do we address this problem? Resorting to asymptotic theory.
Although $J_n(x; P_0) = P_0(t_n \le x)$ is not known, from the CLT we know
$$J_n(x; P_0) \to \Phi(x) \quad\text{as } n\to\infty.$$
Thus, choosing $c = \Phi^{-1}(1-\alpha) \equiv z_\alpha$, the last convergence implies
$$\lim_{n\to\infty} P_0(t_n > c) = \alpha.$$
Note the test $T_n = 1(t_n > z_\alpha)$ does not necessarily have level $\alpha$, but we say it is a uniformly asymptotic level $\alpha$ test if
$$\limsup_{n\to\infty}\ \sup_{P\in\mathcal{P}_0}\beta_T(P) \le \alpha.$$
With a composite null this may be hard to achieve, and we simply require a pointwise asymptotic level $\alpha$ test, i.e. for each $P\in\mathcal{P}_0$,
$$\lim_{n\to\infty}\beta_T(P) \le \alpha.$$
For example, the t-test is pointwise asymptotic level $\alpha$ (show this).

8.2 Asymptotic Tests


We focus now on “parametric” tests with large sample justi…cation. Parametric in
the sense that null and alternative hypotheses can be written in terms of a …nite-
dimensional parameter. These apply to a wide variety of models, often under gen-
eral conditions, and useful …nite-sample justi…cation can be given only in special
circumstances. The properties of the tests follow largely from the properties of point
estimates already described in class (LSE, MLE).

Let
$$\theta = \begin{pmatrix}\theta_1 \\ \theta_2\end{pmatrix}\ \begin{matrix}q\times 1\\ s\times 1\end{matrix}, \qquad p = q + s.$$
True value:
$$\theta_0 = \begin{pmatrix}\theta_{01}\\ \theta_{02}\end{pmatrix}.$$


Consider the null hypothesis
$$H_0: \theta_{01} = 0 \qquad\begin{cases}\text{Composite if } q < p\\ \text{Simple if } q = p.\end{cases}\qquad (20)$$
Alternative hypothesis:
$$H_1: \theta_{01} \neq 0.$$
The parameter $\theta$ needn't be the "natural" parameters. Let $\psi$ be the "natural" parameters and suppose we test
$$H_0: g(\psi_0) = 0$$
for smooth $q\times 1$ $g$. Put
$$\theta_1 = g(\psi), \qquad \theta_2 = \psi_2, \qquad \theta = \begin{pmatrix}\theta_1\\ \theta_2\end{pmatrix}.$$
We can invert these to get $\psi$ in terms of $\theta$.

Example 94 $H_0: R\psi_0 = r$, $\mathrm{rank}(R) = q$,
$$\theta_1 = R\psi - r, \qquad \theta_2 = \psi_2$$
$$\Rightarrow\ \theta = \begin{pmatrix}R\\ (O\ \ I_s)\end{pmatrix}\psi - \begin{pmatrix}r\\ 0\end{pmatrix} \quad\Rightarrow\quad \psi = \begin{pmatrix}R\\ (O\ \ I_s)\end{pmatrix}^{-1}\left[\theta + \begin{pmatrix}r\\ 0\end{pmatrix}\right].$$
Thus there is no real loss of generality in $H_0$ (20). Note that in applied problems there may be a sequence of tests: e.g. test $(\theta_{11}, \theta_{12})' = 0$; if rejected, test $\theta_{11} = 0$. So the specification of $\theta$ may change.

Definition 95 (Power function) For a test statistic $\hat\tau_n$, suppose that we reject $H_0$ (20) when
$$\hat\tau_n > c.$$
Then
$$\pi_{n,c}(\theta_{01}) = \Pr(\hat\tau_n > c\mid\theta_{01})$$
is the power function for $\hat\tau_n$.


Definition 96 (Consistency) The test in DEF. 95 is consistent iff
$$\pi_{n,c}(\theta_{01}) \to 1 \ \text{ as } n\to\infty, \qquad \forall\,\theta_{01}\neq 0,\ \forall\, c > 0.$$

We will discuss a trio of asymptotic tests: the Wald, Lagrange Multiplier (LM,
also called Rao or Score) and Likelihood Ratio (LR)- that can be used for testing
the null hypothesis. The three statistics are asymptotically equivalent in that they
share the same asymptotic distribution ( 2 ) and their properties can be extended to
more general situations, see Econometrics II. Some of the results are more generally
valid for asymptotic normal estimators, as is the case, for instance, for Wald Tests.

8.2.1 Generalized Wald Tests


These tests are the extension of the classical t-test to the multivariate setting.
ASS. 2 $\hat\theta_n = \begin{pmatrix}\hat\theta_{1n}\\ \hat\theta_{2n}\end{pmatrix}$ estimates $\theta_0$ such that
$$\sqrt{n}(\hat\theta_n - \theta_0) \to_d N(0, A) \qquad \forall\,\theta_0\in\Theta,$$
where
$$A = A(\theta_0) > 0, \qquad A(\theta) = \begin{pmatrix}A_{11}(\theta) & A_{12}(\theta)\\ A_{21}(\theta) & A_{22}(\theta)\end{pmatrix}\ \begin{matrix}q\\ s\end{matrix}.$$

ASS. 3 Under $H_0$,
$$\hat A_{11,n} \to_p A_{11}\begin{pmatrix}0\\ \theta_{02}\end{pmatrix} = A_{11}^0.$$

Example 97 (NLS) For the LSE, $A$ is estimated by $\hat A = \hat E^{-1}\hat D\hat E^{-1}$, where $\hat E = \mathbb{E}_n[x_i x_i']$ and $\hat D = \mathbb{E}_n[\hat\varepsilon^2 x_i x_i']$.

Example 98 (MLE) For the MLE, $A$ is estimated by $\hat A = \hat I^{-1}$, where $\hat I$ is a consistent estimator of the Fisher information matrix, e.g. $\hat I := \mathbb{E}_n[s(\hat\theta_n)s(\hat\theta_n)']$.


Definition 99 (Wald Test) The (Generalized) Wald statistic is
$$W_n = n\,\hat\theta_{1n}'\,\hat A_{11,n}^{-1}\,\hat\theta_{1n}.$$
Remark: note that under $H_0$, $\theta_{01} = 0$.

There can be many choices of $\hat\theta_n$ in a given problem, and for a given $\hat\theta_n$ many choices of $\hat A_{11,n}$. The variance estimate $\hat A_{11,n}$ may or may not embody $H_0$; if so, it depends on $\hat\theta_{2n}$ only. If not, we may have $\hat A_{11,n} \to_p A_{11}(\theta_0)$ whether or not $H_0$ holds.

Theorem 100 Under ASS. 2 and 3,
$$W_n \to_d \chi^2_q.$$
PROOF. $H_0$ and ASS. 2 imply $\sqrt{n}\,\hat\theta_{1n} = \sqrt{n}(\hat\theta_{1n} - \theta_{01}) \to_d N(0, A_{11}^0)$, so
$$n\,\hat\theta_{1n}'(A_{11}^0)^{-1}\hat\theta_{1n} \to_d \chi^2_q.$$
Then
$$W_n = n\,\hat\theta_{1n}'\hat A_{11,n}^{-1}\hat\theta_{1n} = n\,\hat\theta_{1n}'(A_{11}^0)^{-1}\hat\theta_{1n} + \sqrt{n}\,\hat\theta_{1n}'\big[\hat A_{11,n}^{-1} - (A_{11}^0)^{-1}\big]\sqrt{n}\,\hat\theta_{1n} = n\,\hat\theta_{1n}'(A_{11}^0)^{-1}\hat\theta_{1n} + O_p(1)o_p(1)O_p(1) \to_d \chi^2_q.$$

An asymptotically valid $\alpha$-test is thus
$$\text{Reject } H_0 \text{ iff } W_n > \chi^2_{q,1-\alpha}, \qquad (21)$$
where $\chi^2_{q,1-\alpha}$ is the $(1-\alpha)$ quantile of the $\chi^2_q$ distribution, i.e.
$$\alpha = \int_{\chi^2_{q,1-\alpha}}^{\infty} \mathrm{pdf}(\chi^2_q)\, d\chi^2_q.$$
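As an illustration, the following Python sketch (not part of the notes) computes a Wald statistic for the exclusion of one regressor in OLS, using a heteroskedasticity-robust sandwich variance so that ASS. 2 holds under weak conditions. The data, the restriction matrix and all names are assumptions for illustration.

```python
# Sketch: Wald test of H0: beta_1 = 0 in OLS with robust variance (illustrative).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
n = 400
x = rng.normal(size=(n, 2))
W = np.column_stack([np.ones(n), x])
y = 1.0 + 0.0 * x[:, 0] + 0.5 * x[:, 1] + rng.normal(size=n)   # H0 true for x1

beta_hat = np.linalg.lstsq(W, y, rcond=None)[0]
u = y - W @ beta_hat
E = W.T @ W / n
D = (W * u[:, None]**2).T @ W / n
A = np.linalg.inv(E) @ D @ np.linalg.inv(E) / n     # Var(beta_hat) = E^{-1} D E^{-1} / n

R = np.array([[0.0, 1.0, 0.0]])                     # selects the coefficient on x1 (q = 1)
theta1 = R @ beta_hat
Wn = float(theta1.T @ np.linalg.inv(R @ A @ R.T) @ theta1)
print(Wn, 1 - chi2.cdf(Wn, df=1))                   # statistic and p-value
```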

Theorem 101 Under ASS. 2 and 3,
$$\pi_{n,c}(0) \to \alpha \ \text{ as } n\to\infty, \quad\text{for } c = \chi^2_{q,1-\alpha}.$$


ASS. 4
$$\hat A_{11} \to_p A_{11} > 0, \qquad \forall\,\theta_0.$$

Theorem 102 Under ASS. 2 and 4, (21) is a consistent test, $\forall\,\alpha$.

PROOF:
$$\sqrt{n}(\hat\theta_{1n} - \theta_{01}) \to_d N(0, A_{11}^0) \ \Rightarrow\ \sqrt{n}\,\hat\theta_{1n} = \sqrt{n}\,\theta_{01} + O_p(1).$$
By ASS. 4,
$$W_n = \big(\sqrt{n}\,\theta_{01} + O_p(1)\big)'\big[(A_{11})^{-1} + o_p(1)\big]\big(\sqrt{n}\,\theta_{01} + O_p(1)\big) = n\,\theta_{01}'(A_{11})^{-1}\theta_{01} + O_p(n^{1/2}),$$
so
$$\pi_c(\theta_{01}) = \Pr(W_n > c\mid\theta_{01}) = \Pr\big(n\,\theta_{01}'(A_{11})^{-1}\theta_{01} + O_p(n^{1/2}) > c\mid\theta_{01}\big) \to 1,$$
because $n\,\theta_{01}'(A_{11})^{-1}\theta_{01} > 0$ and dominates $O_p(n^{1/2})$ as $n\to\infty$.

Remark: As mentioned above, $\hat A_{11}$ may converge to a different value under $H_1$ if it is based on $H_0$, whereas an estimate $\hat A_{11}$ not based on $H_0$ may satisfy $\hat A_{11,n} \to_p A_{11}(\theta_0)$ for all $\theta_0$.

A General Example: Efficient Extremum Estimator


Henceforth, we consider extremum estimates
$$\hat\theta = \hat\theta_n = \arg\min_{\Theta} Q_n(\theta).$$
Examples of $Q_n(\theta)$ are the sum of squared residuals for LSE or minus the log-likelihood for MLE. Put
$$Q_{\theta,n}(\theta) = \frac{\partial Q_n(\theta)}{\partial\theta}, \qquad Q_{\theta\theta,n}(\theta) = \frac{\partial^2 Q_n(\theta)}{\partial\theta\,\partial\theta'},$$


where
$$\sqrt{n}\,Q_{\theta,n}(\theta_0) \to_d N(0, D), \qquad Q_{\theta\theta,n}(\theta_0) \to_p E.$$
Under suitable conditions,
$$\sqrt{n}(\hat\theta_n - \theta_0) \to_d N(0, E^{-1}DE^{-1}),$$
and $\hat\theta_n$ is efficient if $E = D$, when
$$\sqrt{n}(\hat\theta_n - \theta_0) \to_d N(0, D^{-1}).$$
This is true in some cases, e.g. MLE under correct specification, but it is not true in others, e.g. LSE. Henceforth we will assume for simplicity in the arguments that "$E = D$". For Wald tests the theory follows in general, but for LM and LR tests we need $E = D$. Therefore, we shall consider Assumptions 2 and 3 with $A = D^{-1}$.
A word on notation:
$$D = \begin{pmatrix}D_{11} & D_{12}\\ D_{21} & D_{22}\end{pmatrix}, \qquad D^{-1} = \begin{pmatrix}D^{11} & D^{12}\\ D^{21} & D^{22}\end{pmatrix}.$$
Notice that, e.g., $D^{11} = \big(D_{11} - D_{12}D_{22}^{-1}D_{21}\big)^{-1}$.

Under this set-up, the Wald statistic is
$$W_n = n\,\hat\theta_{1n}'\,\big(\hat D_n^{11}\big)^{-1}\,\hat\theta_{1n}.$$

The asymptotic theory of the Wald test follows as in the previous section. Let us restate our findings on the Wald test in the context of extremum estimators:

ASS. 5
$$\sqrt{n}\,Q_{\theta,n}(\theta_0) \to_d N(0, D); \qquad \theta_n \to_p \theta_0 \ \Rightarrow\ Q_{\theta\theta,n}(\theta_n) \to_p D, \ \forall\,\theta_n.$$

Theorem 103 Under ASS 3, 5 and $H_0$,
$$W_n \to_d \chi^2_q.$$


PROOF:
$$\sqrt{n}(\hat\theta_n - \theta_0) \to_d N(0, D^{-1}) \ \Rightarrow\ \sqrt{n}(\hat\theta_{1n} - \theta_{01}) \to_d N(0, D^{11}).$$

Theorem 104 Under ASS 3, 5, $W_n$ provides a consistent test.

PROOF:
$$\sqrt{n}\,\hat\theta_{1n} = \sqrt{n}\,\theta_{01} + O_p(1) \ \to_p \infty, \quad\text{as in Theorem 102.}$$


8.2.2 Generalized Lagrange Multiplier Tests


In the W test we use an unrestricted $\hat\theta_n$, then examine whether it is consistent with the restrictions in $H_0$. Sometimes it is easier to estimate $\theta_0$ under $H_0$. The LM test makes use of this.
Let
$$\tilde\theta_n = \arg\min_{H_0} Q_n(\theta), \qquad \tilde\theta_n = \begin{pmatrix}0\\ \tilde\theta_{2n}\end{pmatrix}, \quad\text{i.e. } Q_{2n}(\tilde\theta_n) = 0, \qquad Q_{in}(\theta) = \frac{\partial Q_n(\theta)}{\partial\theta_i}.$$
Put
$$L_n(\theta) = Q_n(\theta) - \theta_1'\lambda, \qquad \frac{\partial L_n(\theta)}{\partial\theta_1} = Q_{1n}(\theta) - \lambda \ \Rightarrow\ \tilde\lambda_n = Q_{1n}(\tilde\theta_n) \quad (\tilde\theta_n \text{ satisfies } H_0),$$
$$\frac{\partial L_n(\tilde\theta_n)}{\partial\theta_2} = Q_{2n}(\tilde\theta_n) = 0.$$
If $H_0$ is true, the constraints used in finding $\tilde\theta_n$ are superfluous. Equivalently, does $\tilde\theta_n$ approximately satisfy the conditions for a minimum, $Q_{1n}(\tilde\theta_n) \approx 0$?

ASS. 6 Under $H_0$,
$$\tilde D_n^{11} \to_p D^{11}\begin{pmatrix}0\\ \theta_{02}\end{pmatrix} = D^{11}_0.$$

Definition 105 (LM Statistic) The LM statistic is
$$LM_n = n\,\tilde\lambda_n'\,\tilde D_n^{11}\,\tilde\lambda_n.$$
Remark: since $Q_{2n}(\tilde\theta_n) = 0$, the LM test is also called the Score test:
$$LM_n = n\,Q_{\theta,n}(\tilde\theta_n)'\,\tilde D_n^{-1}\,Q_{\theta,n}(\tilde\theta_n).$$


Theorem 106 Under ASS 5, 6 and $H_0$,
$$LM_n \to_d \chi^2_q.$$
PROOF: Write $Q_{in}^0 = Q_{in}(\theta_0)$ and $Q_{\theta\theta,n}^0 = Q_{\theta\theta,n}(\theta_0) = \begin{pmatrix}Q_{11,n}^0 & Q_{12,n}^0\\ Q_{21,n}^0 & Q_{22,n}^0\end{pmatrix}$. Expanding,
$$\tilde\lambda_n = Q_{1n}(\tilde\theta_n) \approx Q_{1n}^0 + Q_{12,n}(\tilde\theta_{2n} - \theta_{02}),$$
$$0 = Q_{2n}(\tilde\theta_n) \approx Q_{2n}^0 + Q_{22,n}(\tilde\theta_{2n} - \theta_{02}) \ \Rightarrow\ \tilde\theta_{2n} - \theta_{02} \approx -Q_{22,n}^{-1}Q_{2n}^0.$$
Therefore
$$\sqrt{n}\,\tilde\lambda_n \approx \big(I,\ -Q_{12,n}Q_{22,n}^{-1}\big)\sqrt{n}\,Q_{\theta,n}^0 \ \to_d\ N\!\left(0,\ \big(I,\ -D_{12}D_{22}^{-1}\big)\begin{pmatrix}D_{11} & D_{12}\\ D_{21} & D_{22}\end{pmatrix}\begin{pmatrix}I\\ -D_{22}^{-1}D_{21}\end{pmatrix}\right) = N\big(0, D_{11} - D_{12}D_{22}^{-1}D_{21}\big)\Big|_{H_0} = N\big(0, (D^{11}_0)^{-1}\big).$$
Hence
$$LM_n = n\,\tilde\lambda_n' D^{11}_0\tilde\lambda_n + n\,\tilde\lambda_n'\big(\tilde D_n^{11} - D^{11}_0\big)\tilde\lambda_n \to_d \chi^2_q.$$

ASS. 7 $\tilde D_n^{11} \to_p D^{11}$, $\forall\,\theta_0$.

Theorem 107 Under ASS 5, 7, LM provides a consistent test.

PROOF:
$$\sqrt{n}\,\tilde\lambda_n = \big(I,\ -Q_{12,n}Q_{22,n}^{-1}\big)\sqrt{n}\,Q_{\theta,n}\begin{pmatrix}0\\ \theta_{02}\end{pmatrix} = \big(I,\ -Q_{12,n}Q_{22,n}^{-1}\big)\sqrt{n}\left[Q_{\theta,n}\begin{pmatrix}\theta_{01}\\ \theta_{02}\end{pmatrix} - \frac{\partial}{\partial\theta_1}Q_{\theta,n}\begin{pmatrix}\bar\theta_{01}\\ \theta_{02}\end{pmatrix}\theta_{01}\right] = O_p(1) + O_p(1)\,\sqrt{n}\,\theta_{01} \ \to_p \infty.$$


8.2.3 Generalized Likelihood Ratio Principle


For a simple null and a simple alternative, Theorem 89 gives the UMP test rejecting when
$$\frac{f_1(X)}{f_0(X)} > c,$$
which can be shown to be equivalent to
$$\frac{f_0(X)}{\max(f_0(X), f_1(X))} < c'$$
for some $c'$. We can write this as
$$\ell(X) = \frac{\sup_{H_0} f_\theta(X)}{\sup_{\Theta} f_\theta(X)} < c'.$$
Applying the logarithm, this is equivalent to rejecting for small values of $\sup_{H_0}\log f_\theta(X) - \sup_{\Theta}\log f_\theta(X)$, or, equivalently, large values of $[\inf_{H_0} Q_n(\theta)] - [\inf_{\Theta} Q_n(\theta)]$, where $Q_n(\theta) = -\mathbb{E}_n[\log f_\theta(X)]$. This motivates the following definition:

Definition 108 (LR statistic) The LR-like test statistic is
$$LR_n = 2n\big(Q_n(\tilde\theta_n) - Q_n(\hat\theta_n)\big).$$

Theorem 109 Under ASS 5 and $H_0$,
$$LR_n \to_d \chi^2_q.$$

PROOF:
$$LR_n = 2n\big(Q_n(\tilde\theta_n) - Q_n(\hat\theta_n)\big) = 2n\Big(Q_n(\hat\theta_n) + Q_{\theta,n}(\hat\theta_n)'(\tilde\theta_n - \hat\theta_n) + \tfrac12(\tilde\theta_n - \hat\theta_n)'Q_{\theta\theta,n}(\bar\theta_n)(\tilde\theta_n - \hat\theta_n) - Q_n(\hat\theta_n)\Big) = n\,(\tilde\theta_n - \hat\theta_n)'\,Q_{\theta\theta,n}(\bar\theta_n)\,(\tilde\theta_n - \hat\theta_n),$$
using $Q_{\theta,n}(\hat\theta_n) = 0$. Expanding the first-order conditions,
$$0 = Q_{1n}(\hat\theta_n) \approx Q_{1n}^0 + Q_{11,n}\hat\theta_{1n} + Q_{12,n}(\hat\theta_{2n} - \theta_{02}), \qquad (22)$$
$$0 = Q_{2n}(\hat\theta_n) \approx Q_{2n}^0 + Q_{21,n}\hat\theta_{1n} + Q_{22,n}(\hat\theta_{2n} - \theta_{02}), \qquad (23)$$
$$0 = Q_{2n}(\tilde\theta_n) \approx Q_{2n}^0 + \tilde Q_{22,n}(\tilde\theta_{2n} - \theta_{02}). \qquad (24)$$


From (23) and (24),
$$0 \approx Q_{21,n}\hat\theta_{1n} + Q_{22,n}(\hat\theta_{2n} - \tilde\theta_{2n}) + (Q_{22,n} - \tilde Q_{22,n})(\tilde\theta_{2n} - \theta_{02})$$
$$\Rightarrow\ \sqrt{n}(\hat\theta_{2n} - \tilde\theta_{2n}) = -Q_{22,n}^{-1}Q_{21,n}\,\sqrt{n}\,\hat\theta_{1n} + o_p(1)$$
$$\Rightarrow\ \sqrt{n}(\hat\theta_n - \tilde\theta_n) = \begin{pmatrix}\sqrt{n}\,\hat\theta_{1n}\\ \sqrt{n}(\hat\theta_{2n} - \tilde\theta_{2n})\end{pmatrix} = \begin{pmatrix}I\\ -Q_{22,n}^{-1}Q_{21,n}\end{pmatrix}\sqrt{n}\,\hat\theta_{1n} + o_p(1) = \begin{pmatrix}I\\ -D_{22}^{-1}D_{21}\end{pmatrix}\sqrt{n}\,\hat\theta_{1n} + o_p(1)$$
under $H_0$. Also $\sqrt{n}\,\hat\theta_{1n} \to_d N(0, D^{11}_0)$. Thus
$$LR_n = n\,(\tilde\theta_n - \hat\theta_n)'Q_{\theta\theta,n}(\bar\theta_n)(\tilde\theta_n - \hat\theta_n) = n\,(\tilde\theta_n - \hat\theta_n)'D(\tilde\theta_n - \hat\theta_n) + o_p(1).$$
But
$$\big(I,\ -D_{12}D_{22}^{-1}\big)\begin{pmatrix}D_{11} & D_{12}\\ D_{21} & D_{22}\end{pmatrix}\begin{pmatrix}I\\ -D_{22}^{-1}D_{21}\end{pmatrix} = \big(D^{11}_0\big)^{-1},$$
so
$$LR_n = n\,\hat\theta_{1n}'\big(D^{11}_0\big)^{-1}\hat\theta_{1n} + o_p(1) \to_d \chi^2_q.$$

Theorem 110 Under ASS 5, $LR_n$ provides a consistent test.

PROOF:
$$\sqrt{n}(\hat\theta_n - \tilde\theta_n) = \begin{pmatrix}I\\ -Q_{22,n}^{-1}Q_{21,n}\end{pmatrix}\Big\{\sqrt{n}(\hat\theta_{1n} - \theta_{01}) + \sqrt{n}\,\theta_{01}\Big\}.$$
The first term in brackets is $O_p(1)$; the second diverges to $\infty$.


Some Remarks
Which test to use?

– The W test only uses estimators under the alternative, LM under the null, and LR under both the null and the alternative.
– In the theory above, LM and LR require the key assumption $E = D$.

These tests can be robustified against misspecification, i.e. they can also be applied to cases where $E \neq D$ if they are suitably modified, as the following important application illustrates.

8.3 When does a policy or treatment have signi…cant e¤ects?


Suppose we want to evaluate a policy or treatment on an outcome of interest. Let D
be the treatment indicator (=1 if treated, 0 otherwise), Y (1) be the outcome under
treatment, Y (0) be the outcome without treatment. We only observe (Y; X; D);
with Y = Y (1) D + Y (0) (1 D). Assume (Y (1) ; Y (0)) and D are independent
conditional on $X$. This assumption is called unconfoundedness or selection on observables. Under this condition, the average treatment effect $E[Y_i(1) - Y_i(0)]$ is
identi…ed, as we now show. First, note that Yi (1) Yi (0) is not identi…ed for any
individual. Nevertheless, using the selection on observables assumption and the law
of iterated expectations

E[Yi (1) Yi (0)] = E[E[Yi (1)jXi ] E[Yi (0)jXi ]]


= E[E[Yi (1)jXi ; Di = 1] E[Yi (0)jXi ; Di = 0]]
= E[E[Yi jXi ; Di = 1] E[Yi jXi ; Di = 0]];

where the second equality uses the independence, and the third the de…nition of Yi :
To illustrate these points, suppose we run the regression
$$Y_i = \alpha_0 + \gamma_0 D_i + \beta_0' X_i + \varepsilon_i,$$
where $E[\varepsilon_i\mid X_i, D_i] = 0$. What is the interpretation of $\gamma_0$ in this model? Note the potential outcomes in this model are
$$Y_i(1) = \alpha_0 + \gamma_0 + \beta_0' X_i + \varepsilon_i, \qquad Y_i(0) = \alpha_0 + \beta_0' X_i + \varepsilon_i.$$


Thus, $\gamma_0 = Y_i(1) - Y_i(0)$, independently of $i$ (homogeneous treatment effects). Similarly,
$$E[Y_i\mid X_i, D_i] = \alpha_0 + \gamma_0 D_i + \beta_0' X_i,$$
and, given $Y_i = Y_i(1)D_i + Y_i(0)(1 - D_i)$,
$$E[Y_i\mid X_i, D_i = 1] = E[Y(1)\mid X_i] = \alpha_0 + \gamma_0 + \beta_0' X_i, \qquad (25)$$
$$E[Y_i\mid X_i, D_i = 0] = E[Y(0)\mid X_i] = \alpha_0 + \beta_0' X_i. \qquad (26)$$
Thus, subtracting (26) from (25) delivers
$$\gamma_0 = E[Y(1)\mid X_i] - E[Y(0)\mid X_i] = E[Y(1) - Y(0)\mid X_i]. \qquad (27)$$
Again this is a homogeneity (in observables) of conditional treatment effects. Taking expectations on both sides of (27), we have
$$\gamma_0 = E[Y(1) - Y(0)].$$
That is, $\gamma_0$ can be interpreted as the average treatment effect. The first identification above is nonparametric, while the second uses additional functional form assumptions and is less appealing.
We now provide the expression for a robust (to heteroskedasticity) LM test based on the OLS objective function for the hypotheses
$$H_0: \gamma_0 = 0 \quad\text{vs}\quad H_1: \gamma_0 \neq 0.$$
Strictly speaking, the previous theory for the LM test does not apply to this problem because this is a situation where $E \neq D$, even under conditional homoskedasticity. We can, however, construct a version of the LM test that is robust to conditional heteroskedasticity as follows (see Wooldridge, page 60, for discussion). Write the regression as
$$Y_i = \theta_0' W_i + \varepsilon_i,$$
where $W_i = (1, D_i, X_i')'$. Then the objective function of OLS is
$$Q_n(\theta) = \frac{1}{2n}\sum_{i=1}^n (Y_i - \theta'W_i)^2,$$
and
$$Q_{\theta n}(\theta) = \frac{\partial}{\partial\theta} Q_n(\theta) = -\frac{1}{n}\sum_{i=1}^n (Y_i - \theta'W_i)\,W_i.$$


Let
$$\tilde\theta_n = \arg\min_{\gamma = 0} Q_n(\theta);$$
that is, $\tilde\theta_n = (\hat\alpha, 0, \hat\beta')'$, where $\hat\alpha$ and $\hat\beta$ are the OLS estimates of the restricted regression
$$Y_i = \alpha_0 + \beta_0' X_i + \varepsilon_i.$$
Then the Lagrange multiplier test statistic can be constructed as
$$LM_n = n\,Q_{\theta n}(\tilde\theta_n)'\,\tilde D_n^{-1}\,Q_{\theta n}(\tilde\theta_n),$$
where $\tilde D_n$ is some consistent estimate of $D$, the asymptotic variance of $\sqrt{n}\,Q_{\theta n}(\tilde\theta_n)$.
This expression can be simplified as follows. Note that, by definition of the restricted OLS estimator $\tilde\theta_n$,
$$\sqrt{n}\,Q_{\theta n}(\tilde\theta_n) = \begin{pmatrix}0\\ -\frac{1}{\sqrt n}\sum_{i=1}^n (Y_i - \tilde\theta_n'W_i)D_i\\ 0\end{pmatrix}.$$
Hence, we need to compute the asymptotic variance of the middle term. It can be shown that under the null
$$\frac{1}{\sqrt n}\sum_{i=1}^n (Y_i - \tilde\theta_n'W_i)D_i = \frac{1}{\sqrt n}\sum_{i=1}^n (Y_i - \theta_0'W_i)D_i^r + o_p(1) \ \to_d\ N(0, B),$$
where $D_i^r$ is the population OLS error in the regression of $D_i$ against 1 and $X_i$, and $B = E[\varepsilon_i^2 (D_i^r)^2]$. To get some intuition on the last display, note we can always write, by a least squares projection,
$$D_i = \pi_0 + \pi_1' X_i + D_i^r,$$

where, by construction, $D_i^r$ is uncorrelated with $X_i$ and has zero mean. On the other hand, we know from the FOC of OLS and, substituting from the last display, that
$$\frac{1}{\sqrt n}\sum_{i=1}^n (Y_i - \tilde\theta_n'W_i)D_i = \frac{1}{\sqrt n}\sum_{i=1}^n (Y_i - \tilde\theta_n'W_i)D_i^r,$$
but $(Y_i - \tilde\theta_n'W_i) = \varepsilon_i - (\tilde\theta_n - \theta_0)'W_i$, and thus
$$\frac{1}{\sqrt n}\sum_{i=1}^n (Y_i - \tilde\theta_n'W_i)D_i^r = \frac{1}{\sqrt n}\sum_{i=1}^n \varepsilon_i D_i^r - \sqrt{n}(\tilde\theta_n - \theta_0)'\,\frac{1}{n}\sum_{i=1}^n W_i D_i^r$$


and
$$\sqrt{n}(\tilde\theta_n - \theta_0)'\,\frac{1}{n}\sum_{i=1}^n W_i D_i^r \ \to_p\ 0,$$
since $E[W_i D_i^r] = (0, E[D_i D_i^r], 0)'$ and $\tilde\theta_n - \theta_0$ has a zero in the component corresponding to $D_i$ under $H_0$.
Denote by $\hat D_i^r$ the OLS residual in the regression of $D_i$ against 1 and $X_i$. Then, the heteroskedasticity-robust LM statistic is given as
$$LM_n = \left(\frac{1}{\sqrt n}\sum_{i=1}^n \tilde\varepsilon_i D_i\right)'\left(\frac{1}{n}\sum_{i=1}^n \tilde\varepsilon_i^2 (\hat D_i^r)^2\right)^{-1}\left(\frac{1}{\sqrt n}\sum_{i=1}^n \tilde\varepsilon_i D_i\right).$$
Since, under $H_0$,
$$\frac{1}{n}\sum_{i=1}^n \tilde\varepsilon_i^2 (\hat D_i^r)^2 \ \to_p\ B,$$
it follows that, under $H_0$,
$$LM_n \to_d \chi^2_1.$$

The LM test can be applied to Lasso and other machine-learning methods if it is implemented in the form
$$LM_n = \left(\frac{1}{\sqrt n}\sum_{i=1}^n \tilde\varepsilon_i \hat D_i^r\right)'\left(\frac{1}{n}\sum_{i=1}^n \tilde\varepsilon_i^2 (\hat D_i^r)^2\right)^{-1}\left(\frac{1}{\sqrt n}\sum_{i=1}^n \tilde\varepsilon_i \hat D_i^r\right),$$
where now both $\tilde\varepsilon_i$ and $\hat D_i^r$ can be Lasso residuals.

Example 111 (Instrumental Variables, LATE.) An alternative to using selec-


tion on observables is Instrumental Variables. But, what is IV estimating in general
(i.e. nonparametrically)? We have seen this exercise with OLS. For IV a recom-
mended reading is Imbens and Angrist (1994, Econometrica). In the setting of the
last example, assume that we also observe a variable Z; in addition to Y and the
treatment D: Similarly to Yi (d); denote by D(z) the potential treatment variable for
an individual with Zi = z: Assume that (Yi (1) ; Yi (0) ; Di (z)) and Zi are independent,
and that the propensity score p(z) = E[Di jZi = z] is not constant. Then, take two


points in the support of $Z$, say $z$ and $w$ with $p(z) > p(w)$; then
$$E[Y_i\mid Z_i = z] - E[Y_i\mid Z_i = w] = E[Y_i(1)D_i(z) + Y_i(0)(1 - D_i(z))\mid Z_i = z] - E[Y_i(1)D_i(w) + Y_i(0)(1 - D_i(w))\mid Z_i = w]$$
$$= E\big[(D_i(z) - D_i(w))(Y_i(1) - Y_i(0))\big]$$
$$= \Pr\big(D_i(z) - D_i(w) = 1\big)\,E[Y_i(1) - Y_i(0)\mid D_i(z) - D_i(w) = 1] - \Pr\big(D_i(z) - D_i(w) = -1\big)\,E[Y_i(1) - Y_i(0)\mid D_i(z) - D_i(w) = -1].$$

This equation highlights the identification problem arising in the use of IV as a causal estimand. This motivated the Monotonicity assumption: for all $z, w$ in the support of $Z$, either $D_i(z) \ge D_i(w)$ for all $i$, or $D_i(z) \le D_i(w)$ for all $i$. Under these conditions,
$$\frac{E[Y_i\mid Z_i = z] - E[Y_i\mid Z_i = w]}{E[D_i\mid Z_i = z] - E[D_i\mid Z_i = w]} = E\big[Y_i(1) - Y_i(0)\mid D_i(z) \neq D_i(w)\big].$$
The right-hand side is called the Local Average Treatment Effect (LATE), and has a causal interpretation. If $Z$ is binary, $z = 1$ and $w = 0$, then, writing
$$E[Y_i\mid Z_i] = a_0 + b_0 Z_i \quad\text{and}\quad E[D_i\mid Z_i] = c_0 + d_0 Z_i,$$
$$\gamma_0 = \frac{E[Y_i\mid Z_i = 1] - E[Y_i\mid Z_i = 0]}{E[D_i\mid Z_i = 1] - E[D_i\mid Z_i = 0]} = \frac{b_0}{d_0} = \frac{\mathrm{Cov}(Y_i, Z_i)}{\mathrm{Cov}(D_i, Z_i)},$$
which is the IV estimand $\gamma_0$ that solves the equation
$$\mathrm{Cov}(Y_i - \alpha_0 - \gamma_0 D_i,\ Z_i) = 0.$$

This is a Method of Moments estimator with moments
$$\mathbb{E}_n[m(w, \hat\theta_n^{IV})] = 0,$$
where $\theta_0 = (\alpha_0, \gamma_0)'$ and $m(w, \theta_0) = \big(Y_i - \alpha_0 - \gamma_0 D_i,\ (Y_i - \alpha_0 - \gamma_0 D_i)Z_i\big)'$. In fact, the standard IV estimator is
$$\hat\gamma_n^{IV} = \frac{\mathrm{Cov}_n(Y_i, Z_i)}{\mathrm{Cov}_n(D_i, Z_i)} \quad\text{and}\quad \hat\alpha_n^{IV} = \mathbb{E}_n[Y] - \hat\gamma_n^{IV}\,\mathbb{E}_n[D].$$


We can test hypotheses about causal effects, such as $\gamma_0 = 0$, as discussed above. Unfortunately, the interpretation of IV when the endogenous variable $D_i$ is continuous is much more complicated.


9 Statistical Inference: Con…dence Intervals


We have discussed methods to identify and estimate a parameter and to test
hypotheses about it. Now, we discuss methods to obtain con…dence sets for : Let
the data X = (X1 ; :::; Xn ) have probability measure P; and let P be a class of
probability measures for P: We have three de…nitions for con…dence sets.

Definition 112 $C_n$ is a finite sample $1-\alpha$ confidence set if
$$\inf_{P\in\mathcal{P}} P(\theta\in C_n) \ge 1-\alpha \quad\text{for all } n.$$

Definition 113 $C_n$ is a uniform asymptotic $1-\alpha$ confidence set if
$$\liminf_{n\to\infty}\ \inf_{P\in\mathcal{P}} P(\theta\in C_n) \ge 1-\alpha.$$

Definition 114 $C_n$ is a pointwise asymptotic $1-\alpha$ confidence set if
$$\liminf_{n\to\infty} P(\theta\in C_n) \ge 1-\alpha \quad\text{for every } P\in\mathcal{P}.$$

Ideally, we would like finite sample confidence sets, but these are hard to find. If $C_n$ is a uniform asymptotic confidence set, then the following is true: for any $\varepsilon > 0$ there exists $n(\varepsilon)$ such that the coverage of $C_n$ (i.e. $P(\theta\in C_n)$) is at least $1-\alpha-\varepsilon$ for all $n > n(\varepsilon)$. With a pointwise asymptotic confidence set, there may not exist a finite $n(\varepsilon)$. Unfortunately, commonly used confidence sets are pointwise asymptotic.
The typical construction of a pointwise asymptotic $1-\alpha$ confidence interval is
$$\hat\theta_n \pm z_{\alpha/2}\, se(\hat\theta_n),$$
where $z_\alpha = \Phi^{-1}(1-\alpha)$, which is based on the asymptotic normality result for the univariate case, i.e.
$$\frac{\hat\theta_n - \theta_0}{se(\hat\theta_n)} \to_d N(0, 1).$$
This is often obtained by Slutsky's Theorem from
$$\sqrt{n}(\hat\theta_n - \theta_0) \to_d N(0, V)$$


and $se(\hat\theta_n) = \sqrt{\hat V/n}$, with $\hat V \to_P V$. For example, for MLE, $se(\hat\theta_n) = I_n(\hat\theta_n)^{-1/2}$, where $I_n(\hat\theta_n)$ is the Fisher information for the full sample ($I_n(\hat\theta_n) = nI(\hat\theta_n)$).
We can extend this to the multivariate case. Suppose we can prove the asymptotic normality result
$$\sqrt{n}(\hat\theta_n - \theta_0) \to_d N(0, \mathbf{V})$$
and the consistency of an asymptotic variance estimator
$$\hat{\mathbf{V}} \to_p \mathbf{V}.$$
Then, a pointwise asymptotic $1-\alpha$ confidence set is given by the ellipsoid
$$C_n = \Big\{\theta\in\Theta:\ n\,(\hat\theta_n - \theta)'\hat{\mathbf{V}}^{-1}(\hat\theta_n - \theta) \le \chi^2_{p,\alpha}\Big\}, \qquad (28)$$
where $\chi^2_{p,\alpha}$ is the $1-\alpha$ quantile of the chi-square distribution with $p$ degrees of freedom. This confidence set is a special case of a more general principle. Let $T_n(\theta)$ be a test statistic for $H_0: \theta_0 = \theta$ vs $H_1: \theta_0 \neq \theta$, such that
$$T_n(\theta) \to_d \chi^2_p \quad\text{under } H_0.$$
Then, the following is a pointwise asymptotic $1-\alpha$ confidence set (show this):
$$C_n = \big\{\theta\in\Theta:\ T_n(\theta) \le \chi^2_{p,\alpha}\big\}.$$
The construction in (28) is based on the Wald test statistic $T_n(\theta)$, but other tests such as LM or LR could be used to construct pointwise asymptotic $1-\alpha$ confidence sets.

Example 115 (Bernoulli variables) Let $X_1, \ldots, X_n$ be i.i.d. binary random variables with $p = P(X_1 = 1)$. Let $\hat p_n$ denote the sample mean $\bar X_n$. The central limit theorem leads to
$$\sqrt{n}(\hat p_n - p) \to_d N\big(0, p(1-p)\big).$$
Thus, a pointwise asymptotic $1-\alpha$ confidence interval for $p$ is
$$\hat p_n \pm z_{\alpha/2}\, se(\hat p_n)$$
with
$$se(\hat p_n) = \sqrt{\frac{\hat p_n(1-\hat p_n)}{n}}.$$


On the other hand, using Hoeffding's inequality, which says
$$P\big(|\bar X_n - p| > \varepsilon\big) \le 2\exp(-2n\varepsilon^2),$$
one can show that
$$\bar X_n \pm \sqrt{\frac{1}{2n}\log(2/\alpha)}$$
forms a $1-\alpha$ finite sample confidence interval.
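The following Python sketch (not part of the notes) contrasts the two intervals just described on simulated Bernoulli data; $n$, $p$ and $\alpha$ are illustrative assumptions.

```python
# Sketch: asymptotic (Wald) vs Hoeffding finite-sample intervals for a
# Bernoulli mean (illustrative values of n, p, alpha).
import numpy as np

rng = np.random.default_rng(8)
n, p, alpha = 100, 0.3, 0.05
x = rng.binomial(1, p, size=n)
p_hat = x.mean()

se = np.sqrt(p_hat * (1 - p_hat) / n)
wald = (p_hat - 1.96 * se, p_hat + 1.96 * se)          # pointwise asymptotic CI

eps = np.sqrt(np.log(2 / alpha) / (2 * n))             # Hoeffding half-width
hoeffding = (p_hat - eps, p_hat + eps)                  # finite-sample CI
print(wald, hoeffding)                                  # Hoeffding is wider but exact
```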


10 Monte Carlo and Bootstrap Methods


10.1 Monte Carlo
Important background for this chapter is the Glivenko-Cantelli Theorem, which shows that if $\{X_1, \ldots, X_n\}$ is a sequence of iid random variables with cdf $F$ (probability $P$), then the empirical distribution function $F_n(x) = \mathbb{E}_n[1(X\le x)]$ is a uniformly consistent estimator of $F(x)$. Let $\theta$ be a parameter of interest and $T_n = T_n(X_1, \ldots, X_n; \theta)$ be a statistic (wlg we consider real-valued quantities). The cdf of $T_n$ is
$$J_n(x; F) = P(T_n \le x).$$
While the asymptotic distribution of Tn might be known, the exact (…nite sample)
distribution Jn is generally unknown and depends on the underlying cdf F:

Example 116 (t-statistic) Let $X_1, \ldots, X_n$ be a sequence of iid r.v.'s, distributed as $F$ with mean $\mu$ and finite variance, and set $T_n = s_n^{-1}\sqrt{n}(\hat\mu_n - \mu)$, where $\hat\mu_n$ is the sample mean, $\hat\mu_n = \mathbb{E}_n[X]$, and $s_n$ is the sample standard deviation. By the CLT the asymptotic distribution of $T_n$ is $N(0,1)$, but $J_n(x; F)$ is in general unknown and depends on $F$.

Monte Carlo simulation uses numerical simulation to compute Jn (x; F ) for se-
lected choices of F: This is useful to investigate the performance of Tn for reasonable
situations and sample sizes. The basic idea is that for any given F; the distribution
function Jn (x; F ) can be approximated through simulation.
The method of Monte Carlo is quite simple to describe. The researcher chooses
F (the underlying cdf) and the sample size n: A true value for is implied for the
choice of F: The following algorithm is conducted:

Step 1: A sample of size n is generated from F : X1 ; :::; Xn :

Step 2: The test statistic is computed with the previous data, Tn = Tn (X1 ; :::; Xn ; ):

Step 3: Repeat Step 1 and Step 2 B times, where B is a large number, getting
B values of Tn ; Tn1 ; :::; TnB say. Typically, we set B = 1000 or B = 5000:

Step 4: Approximate the distribution Jn (x; F ) with the empirical distribution


of Tn1 ; :::; TnB .

For step 1, most computer packages have procedures for generating random num-
bers from many well known distributions. In any case, you can always use that


$F^{-1}(U)$ is distributed as $F$ if $U \sim U[0,1]$. In Step 3 the replications must be in-


dependently drawn. With the B values Tn1 ; :::; TnB we can estimate any feature of
the unknown …nite sample distribution of Tn : The researcher must select the number
of Monte Carlo replications B: A larger B results in more precise estimates of the
features of interest of Jn ; but requires more computational time. In practice, there-
fore, the choice of B is often guided by the computational demands of the statistical
procedure. In many cases it is a straightforward matter to calculate standard errors.
If the standard error is too large to make reliable inference, then B will have to
increase.

Example 117 (t-statistic, cont.) Suppose we are interested in the Type I error associated with an asymptotic 5% two-sided t-test. We can compute
$$\hat P = \frac{1}{B}\sum_{b=1}^B 1\big(|T_{nb}| \ge 1.96\big),$$
the percentage of the simulated t-ratios which exceed the asymptotic 5% critical value. The r.v.'s $1(|T_{nb}| \ge 1.96)$ are iid Bernoulli distributed. The sample average $\hat P$ is therefore an unbiased estimator of $P$ with standard error $s(\hat P) = \sqrt{P(1-P)/B}$, which can be estimated by $\hat s(\hat P) = \sqrt{\hat P(1-\hat P)/B}$, or using a hypothesized value. For example, if we are assessing an asymptotic 5% test, then we can set $s(\hat P) = \sqrt{0.05(0.95)/B} \approx 0.22/\sqrt{B}$, which for $B = 100$, 1000 and 5000 gives, respectively, $s(\hat P) = 0.022$, 0.007 and 0.003.
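The following Python sketch (not from the notes) runs the Monte Carlo algorithm above for the two-sided 5% t-test with a non-normal population; the choices of $F$, $n$ and $B$ are illustrative assumptions kept small for speed.

```python
# Sketch: Monte Carlo estimate of the size of the two-sided 5% t-test
# under a centered exponential population (illustrative design).
import numpy as np

rng = np.random.default_rng(3)
n, B = 50, 1000
reject = 0
for b in range(B):
    x = rng.exponential(scale=1.0, size=n) - 1.0      # Step 1: draw from F (mean 0)
    t = np.sqrt(n) * x.mean() / x.std(ddof=1)          # Step 2: t-statistic
    reject += (abs(t) > 1.96)                          # Steps 3-4: record rejections
p_hat = reject / B                                     # empirical size hat{P}
se_p = np.sqrt(p_hat * (1 - p_hat) / B)                # its simulation standard error
print(p_hat, se_p)
```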

The typical purpose of a Monte Carlo simulation is to investigate the perfor-


mance of a statistical procedure (estimator or test) in realistic settings. Monte Carlo
simulations can be carried out to estimate the bias, the MSE or the variance of an
estimator. For test statistics, Monte Carlo can be used to study the empirical power
and size performance of tests. Clearly the performance will depend on n and F: It
is therefore useful to conduct a variety of experiments, for a selection of choices of n
and F:

10.2 Bootstrap Methods


Let be a parameter of interest and Tn = Tn (X1 ; :::; Xn ; ) be a statistic (wlg we
consider real-valued quantities). The cdf of Tn is

$$J_n(x; F) = P(T_n \le x).$$


While the asymptotic distribution of Tn might be known, the exact (…nite sample)
distribution Jn is generally unknown and depends on the underlying cdf F:
Asymptotic inference is based on approximating the cdf Jn (x; F ) with J(x; F ) =
limn!1 Jn (x; F ): When J(x; F ) = J(x) does not depend on F; we say that Tn is
asymptotically pivotal and use the distribution function J for inferential purposes.
In a seminal contribution, Efron (1979) proposed the bootstrap, which makes a different approximation. The unknown $F$ is replaced by an estimate $F_n$ (e.g. the empirical cdf) and plugged into $J_n(x; F)$ to obtain
$$J_n^*(x) = J_n(x; F_n).$$
The intuition behind this approximation is that if Glivenko-Cantelli holds, $F_n \approx F$, and if $J_n(x; F)$ is smooth enough in $F$, it is expected that $J_n(x; F_n) \approx J_n(x; F)$. Notice that the bootstrap distribution $J_n^*(x)$ is a random distribution, as it depends on the sample through the estimator $F_n$.
The nonparametric bootstrap is obtained when the empirical cdf Fn is used.
Since the empirical cdf Fn is a multinomial (with n support points), in principle the
distribution Jn (x) can be calculated by direct methods. In practice such calculation
is computationally infeasible unless n is very small. The popular alternative approach
is to use Monte Carlo simulations to approximate the distribution. The following
algorithm is conducted:

Step 1: A sample of size $n$ is generated from $F_n$: $X_1^*, \ldots, X_n^*$.

Step 2: The test statistic is computed with the previous data, $T_n^* = T_n(X_1^*, \ldots, X_n^*; \hat\theta_n)$.

Step 3: Repeat Step 1 and Step 2 $B$ times, where $B$ is a large number, getting $B$ values of $T_n^*$, say $T_{n1}^*, \ldots, T_{nB}^*$. Typically, we set $B = 1000$ or $B = 5000$.

Step 4: Approximate the distribution $J_n(x; F)$ with the empirical distribution of $T_{n1}^*, \ldots, T_{nB}^*$.

In Step 1, since $F_n$ is a discrete probability measure putting mass $1/n$ at each sample point, sampling from $F_n$ is equivalent to random sampling from the observed data with replacement. In consequence, a bootstrap sample $X_1^*, \ldots, X_n^*$ will necessarily have some ties and multiple values, which is generally not a problem.
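The Python sketch below (not part of the notes) implements Steps 1–4 for the studentized mean and turns the bootstrap quantiles into a percentile-t confidence interval; the data-generating distribution and the choice of $B$ are illustrative assumptions.

```python
# Sketch: nonparametric bootstrap of the studentized mean (Steps 1-4),
# with a percentile-t confidence interval (illustrative data).
import numpy as np

rng = np.random.default_rng(4)
x = rng.gamma(shape=2.0, scale=1.0, size=100)        # observed sample
theta_hat = x.mean()
n, B = len(x), 2000

def t_stat(sample, theta):
    return np.sqrt(len(sample)) * (sample.mean() - theta) / sample.std(ddof=1)

t_boot = np.empty(B)
for b in range(B):
    xb = rng.choice(x, size=n, replace=True)          # Step 1: draw from F_n
    t_boot[b] = t_stat(xb, theta_hat)                 # Step 2: T_n* centred at theta_hat

# Step 4: bootstrap quantiles approximate J_n(x, F); percentile-t interval
q_lo, q_hi = np.quantile(t_boot, [0.025, 0.975])
s = x.std(ddof=1) / np.sqrt(n)
ci = (theta_hat - q_hi * s, theta_hat - q_lo * s)
print(ci)
```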
A theory for the determination of the number of bootstrap replications B has
been developed by Andrews and Buchinsky (2000).
Other nonparametric bootstrap methods make use of smoothed nonparametric
estimations of F: There are examples where the nonparametric (edf based) bootstrap


is invalid but the smoothed nonparametric bootstrap is valid, see e.g the maximum
score estimator of Manski (1975).
In parametric problems, $F = F(\theta)$ is estimated through $F(\hat\theta)$, for a consistent estimator $\hat\theta$ of $\theta$. This is called the parametric bootstrap.
Other bootstrap methods are speci…c to the problem at hand, e.g. bootstrap
for regression models such as the wild bootstrap (WB).
Why bootstrap methods?
Bootstrap methods can permit statistical inference when conventional methods
such as standard error computation are di¢ cult to implement.
Bootstrap methods can be “better” than asymptotic approximations. Provide
re…nements that can lead to better approximation in …nite-samples.
Does the bootstrap approximation fail? Yes, sometimes. There are exam-
ples where the bootstrap approximation fails. There are, however, general theo-
rems proving the consistency of the bootstrap under mild smoothness conditions.
That is, the intuition is that if $F_n \approx F$ and $J_n(x; F)$ is smooth enough in $F$, then $J_n(x; F_n) \approx J_n(x; F)$.
Examples of applications of the bootstrap: con…dence intervals, estimation of
standard deviations, bias reduction and hypothesis testing, to mention a few.

10.2.1 Bias reduction. Estimation of standard deviations. Con…dence


intervals.
Let $T_n(\theta) = \hat\theta - \theta$. The bias of $\hat\theta$ is
$$\tau_n = E[\hat\theta - \theta] = E[T_n(\theta)].$$
The bootstrap estimate of $\tau_n$ is
$$\tau_n^* = E^*[T_n(\hat\theta)] = E^*[\hat\theta^* - \hat\theta],$$
where $\hat\theta^*$ is computed with the bootstrap sample $X_1^*, \ldots, X_n^*$. The symbol $E^*$ stands for expectation with respect to the bootstrap sample (i.e. conditional on the original sample). $\tau_n^*$ is estimated by the simulation described previously by
$$\hat\tau_n^* = \frac{1}{B}\sum_{b=1}^B T_{nb}^* = \frac{1}{B}\sum_{b=1}^B \hat\theta_b^* - \hat\theta = \overline{\hat\theta^*} - \hat\theta.$$
If $\hat\theta$ is biased, it might be desirable to construct a bias-corrected estimator. Ideally, this would be
$$\tilde\theta = \hat\theta - \tau_n,$$


but $\tau_n$ is unknown. The estimated bootstrap bias-corrected estimator is
$$\tilde\theta^* = \hat\theta - \hat\tau_n^* = \hat\theta - \big(\overline{\hat\theta^*} - \hat\theta\big) = 2\hat\theta - \overline{\hat\theta^*}.$$
Let $T_n = \hat\theta$. The variance of $\hat\theta$ is
$$V_n = E\big[(T_n - E[T_n])^2\big].$$
Let $T_n^* = \hat\theta^*$. It has variance
$$V_n^* = E^*\big[(T_n^* - E^*[T_n^*])^2\big].$$
The simulation estimate is
$$\hat V_n^* = \frac{1}{B}\sum_{b=1}^B \Big(\hat\theta_{nb}^* - \overline{\hat\theta_{nb}^*}\Big)^2.$$
A bootstrap standard error for $\hat\theta$ is the square root of the last display.
Let $\theta$ be a parameter of interest and $T_n = T_n(X_1, \ldots, X_n; \theta)$ be a test statistic. The cdf of $T_n$ is
$$J_n(x; F) = P(T_n(\theta) \le x).$$
For a distribution function $J_n(x; F)$, let $q_n(\alpha; F)$ denote its $\alpha$-quantile, i.e.
$$q_n(\alpha; F) = \inf_x\{J_n(x; F) \ge \alpha\}.$$
How do we construct confidence intervals (CI) for $\theta_0$? Choose $c_1$ and $c_2$ such that
$$P\big(c_1 \le T_n(\theta) \le c_2\big) = 1-\alpha.$$
Examples:
$$\begin{array}{lll} c_1 = -\infty & c_2 = q_n(1-\alpha; F) & \text{One sided}\\ c_1 = q_n(\alpha; F) & c_2 = +\infty & \text{One sided}\\ c_1 = q_n(\alpha/2; F) & c_2 = q_n(1-\alpha/2; F) & \text{Two sided}\end{array}$$
If $T_n(\theta)$ is a pivot, use $q_n(\alpha; F) \equiv q_n(\alpha)$. If $T_n(\theta)$ is an asymptotic pivot, use $q(\alpha) = \lim_{n\to\infty} q_n(\alpha; F)$. Bootstrap methods use $\hat q_n(\alpha; F_n)$, a simulated value of $q_n(\alpha; F_n)$. There are general results that guarantee the validity of the bootstrap approximation for CI computation, in the sense that the one-sided coverage error $(CE_n)$, $P(T_n(\theta) \le$


$\hat q_n(1-\alpha; F_n)) - [1-\alpha]$, converges to $0$ as $n\to\infty$. These quantities are defined analogously for two-sided CI. In this case we say the bootstrap is consistent (note: the concept of consistency of the bootstrap changes from application to application).
Def: The coverage probability (for a one-sided bootstrap CI) is $P\big(T_n(\theta) \le \hat q_n(1-\alpha; F_n)\big)$.
The next table compares the $CE$ for asymptotic and bootstrap based tests. Norm stands for the asymptotic (normal) quantile, BB for the non-studentized bootstrap ($T_n(\theta) = \sqrt{n}(\bar X_n - \theta)$) and SB for the studentized bootstrap ($T_n(\theta) = \sqrt{n}\,s_n^{-1}(\bar X_n - \theta)$).

$$\begin{array}{lccc} CE_n & \text{Norm} & \text{BB} & \text{SB}\\ \text{One Sided} & O(n^{-1/2}) & O(n^{-1/2}) & O(n^{-1})\\ \text{Two Sided} & O(n^{-1}) & O(n^{-1}) & O(n^{-1})\end{array}$$

Alternatively, we can get symmetric two-sided bootstrap CI based on $T_n(\theta) = \sqrt{n}\,\big|s_n^{-1}(\bar X_n - \theta)\big|$, i.e.
$$CI = \big\{\theta:\ \sqrt{n}\,\big|s_n^{-1}(\bar X_n - \theta)\big| \le \hat q_n(1-\alpha; F_n)\big\}.$$
In the latter case $CE_n = O(n^{-2})$. These rates of convergence can be computed using Edgeworth expansions. These expansions are beyond the scope of these notes. Note that the bootstrap provides asymptotic refinements relative to asymptotic-based CI in terms of the convergence of the CE. So, a general rule for the bootstrap is: when possible, use asymptotic pivots and symmetric two-sided tests.

10.2.2 Hypothesis testing.


Suppose we want to test

H0 : P 2 P0 vs H1 : P 2 P1 ;

and we have a test statistic Tn ; set up in a way that large values of Tn indicate H1 :
More speci…cally, it is common to have

(i) Tn !d T; under H0 ; with T a continuous distribution.

(ii) Tn !P 1; under H1 .
Let $t_n := \sqrt{n}\,s_n^{-1}(\bar X_n - \mu_0)$. Then, if $\mu = E[X]$, an example is
$$H_0: \mu = \mu_0 \quad\text{vs}\quad H_1: \mu > \mu_0;$$


then set $T_n = t_n$. We reject for large values of $T_n$; the critical region is $\{T_n > c\}$, where $c$ is computed such that
$$\text{Type I error} = P(T_n > c\mid H_0) = \alpha.$$
So $c = q_n(1-\alpha; F)$.
Asymptotic theory for pivotal tests: $c = q_\infty(1-\alpha) = \lim_{n\to\infty} q_n(1-\alpha; F)$.
Bootstrap: $c = \hat q_n(1-\alpha; Q_n)$, where $Q_n$ is an estimate of the true $F$ that imposes the null hypothesis $H_0$. Notice that in hypothesis testing the choice $\hat q_n(1-\alpha; F_n)$, with $F_n$ the empirical cdf, leads to an inconsistent test. For instance, in the previous example assume $\mu_0 = 0$, $\sigma^2 = 1$ (known) and $T_n = \sqrt{n}\bar X_n$. Here $Z$ and $z(1-\alpha)$ represent a standard normal r.v. and its $(1-\alpha)$ quantile, respectively. Then,
$$P\big(T_n > \hat q_n(1-\alpha; F_n)\mid H_0\big) \to P\big(Z > z(1-\alpha)\big) = \alpha,$$
and
$$P\big(T_n > \hat q_n(1-\alpha; F_n)\mid H_1\big) \to P\big(Z + \sqrt{n}\mu > z(1-\alpha) + \sqrt{n}\mu\big) = \alpha.$$
Therefore, the bootstrap test has no power under the alternative! To solve this problem we impose the null in the estimation of the unknown $F$. In this example, we take $Q_n$ as the empirical cdf of the centered sample $X_1 - \bar X_n, \ldots, X_n - \bar X_n$. Then,
$$P\big(T_n > \hat q_n(1-\alpha; Q_n)\mid H_0\big) \to P\big(Z > z(1-\alpha)\big) = \alpha,$$
and
$$P\big(T_n > \hat q_n(1-\alpha; Q_n)\mid H_1\big) \to P\big(Z + \sqrt{n}\mu > z(1-\alpha)\big) \to 1.$$
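The Python sketch below (not from the notes) illustrates this point: the bootstrap critical value is computed from the recentred sample, which imposes the null; the data-generating values and $B$ are illustrative assumptions.

```python
# Sketch: bootstrap test of H0: mu = 0 vs H1: mu > 0 that imposes the null
# by recentring the sample before resampling (illustrative data under H1).
import numpy as np

rng = np.random.default_rng(9)
x = rng.normal(loc=0.3, scale=1.0, size=80)
n, B, alpha = len(x), 999, 0.05
Tn = np.sqrt(n) * x.mean() / x.std(ddof=1)         # studentized statistic

xc = x - x.mean()                                   # impose H0: centred sample (Q_n)
Tb = np.empty(B)
for b in range(B):
    xb = rng.choice(xc, size=n, replace=True)
    Tb[b] = np.sqrt(n) * xb.mean() / xb.std(ddof=1)
crit = np.quantile(Tb, 1 - alpha)                   # hat{q}_n(1 - alpha; Q_n)
print(Tn, crit, Tn > crit)                          # reject H0 if Tn > crit
```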

10.2.3 Other resampling methods: Subsampling


There are many extensions of the nonparametric bootstrap. For time series sequences
bootstrap methods become more challenging. The block bootstrap is the oldest
and best known bootstrap method for dependent data. Methods such as the sta-
tionary bootstrap of Politis and Romano (1994) or the subsampling method are
very general and valid under mild regularity conditions. In this section we introduce
the subsampling approximation. The subsampling is more general than the boot-
strap approach, in the sense that it works under more general circumstances than
bootstrap does, but it is less e¢ cient than bootstrap (in terms of CE), when the
latter works.


Let $T_n$ have cdf $J_n(x; F)$ and $J_\infty(x; F) = \lim_{n\to\infty} J_n(x; F)$. Write $T_n = T_n(X_1, \ldots, X_n)$. Let $T_{b,i} = T_b(X_i, \ldots, X_{i+b-1})$ be the statistic computed with the subsample $(X_i, \ldots, X_{i+b-1})$ of size $b$. We note that each subsample of size $b$ (taken without replacement from the original data) is indeed a sample of size $b$ from the true DGP. Hence, it is clear that one can approximate the sampling distribution $J_n(x; F)$ using the distribution of the values of $T_{b,i}$ computed over the $n - b + 1$ different subsamples of size $b$ in a time series context, or using the $\binom{n}{b}$ possible subsets with $b$ elements taken from the original sample in an iid setup. That is, we approximate $J_n(x; F)$ by
$$J_{n,b}(x) = \frac{1}{n-b+1}\sum_{i=1}^{n-b+1} 1(T_{b,i} \le x).$$
Suppose that we are testing a hypothesis and large values of $T_n$ indicate rejection. Let $c_{n,1-\alpha,b}$ be the $(1-\alpha)$-th sample quantile of $J_{n,b}$, i.e.,
$$c_{n,1-\alpha,b} = \inf\{x:\ J_{n,b}(x) \ge 1-\alpha\}.$$
Thus, the subsampling test rejects the null hypothesis if $T_n > c_{n,1-\alpha,b}$.
Politis, Romano and Wolf (1999) showed the validity of the subsampling proce-
dure for strong mixing processes.
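A minimal Python sketch of the subsampling approximation follows (not part of the notes). It uses the $n-b+1$ consecutive blocks for simplicity, even though the data here are iid; the statistic, block size and data-generating distribution are illustrative assumptions.

```python
# Sketch: subsampling critical value for a studentized-mean test
# using n - b + 1 consecutive blocks (illustrative choices of n and b).
import numpy as np

rng = np.random.default_rng(10)
x = rng.standard_t(df=5, size=400)
n, b, alpha = len(x), 40, 0.05

def T(sample, mu):                                  # root: sqrt(len)*(mean - mu)/sd
    return np.sqrt(len(sample)) * (sample.mean() - mu) / sample.std(ddof=1)

theta_hat = x.mean()
Tb = np.array([T(x[i:i + b], theta_hat) for i in range(n - b + 1)])   # T_{b,i}
crit = np.quantile(Tb, 1 - alpha)                   # c_{n, 1-alpha, b}
print(T(x, 0.0), crit, T(x, 0.0) > crit)            # test of H0: mu = 0
```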

10.2.4 Wild Bootstrap: Nonlinear Regression


For some $\theta_0 \in \Theta$, let
$$y = m(\theta_0, z) + v,$$
where $m(\theta_0, z) = E[y\mid z]$. We have a sample $\{(y_1, z_1), \ldots, (y_n, z_n)\}$ of size $n$ to make inference on $\theta_0$.
The standard bootstrap in this context is to estimate the model and obtain residuals $\{\hat v_1, \ldots, \hat v_n\}$, then bootstrap from the empirical distribution of the centered residuals $\{\hat v_1 - \bar{\hat v}, \ldots, \hat v_n - \bar{\hat v}\}$, where $\bar{\hat v} = n^{-1}\sum_{i=1}^n \hat v_i$, to obtain bootstrap data $\{(y_1^*, z_1), \ldots, (y_n^*, z_n)\}$ according to
$$y_i^* = m(\hat\theta_n, z_i) + \hat v_i^*.$$

Not surprisingly, this bootstrap does not work under heteroskedasticity (or, in general, under dependence between $z$ and $v$). A more general bootstrap in this context is the wild bootstrap (WB), introduced in Wu (1986) and Liu (1988). The bootstrap data are obtained from the following algorithm:

1) Estimate the original model and obtain the residuals $\{\hat v_1, \ldots, \hat v_n\}$.


2) Generate WB residuals according to $\hat v_i^* = \hat v_i w_i$ for $i = 1, \ldots, n$, where $\{w_i: 1\le i\le n\}$ is a sequence of independent random variables (r.v.'s) with zero mean and unit variance, also independent of the original sample.

3) Generate bootstrap data for the dependent variable $y_i^*$ according to
$$y_i^* = m(\hat\theta_n, z_i) + \hat v_i^*.$$

Examples of $\{w_i\}$ sequences are i.i.d. Bernoulli variates with
$$P\big(w_i = 0.5(1-\sqrt{5})\big) = b, \qquad P\big(w_i = 0.5(1+\sqrt{5})\big) = 1-b, \qquad (29)$$
where $b = (1+\sqrt{5})/(2\sqrt{5})$, or $P(w_i = 1) = 0.5$ and $P(w_i = -1) = 0.5$. See Stute, González-Manteiga and Presedo-Quindimil (1998) for the use and theoretical validity of this bootstrap for specification testing.
Notice that the WB imposes the null hypothesis on the bootstrap sample, regardless of whether the null is true or not. That is,
$$E^*[\hat v_i^*\mid z] = \hat v_i\, E[w_i\mid z] = 0 \quad a.s.$$
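The Python sketch below (not part of the notes) implements the wild bootstrap for a linear regression with heteroskedastic errors, using the two-point weights in (29); the model, sample size and number of replications are illustrative assumptions.

```python
# Sketch: wild bootstrap standard errors for y = a + b*z + v with
# heteroskedastic errors, two-point weights as in (29) (illustrative).
import numpy as np

rng = np.random.default_rng(5)
n = 200
z = rng.normal(size=n)
v = (1 + 0.5 * np.abs(z)) * rng.normal(size=n)        # heteroskedastic errors
y = 1.0 + 2.0 * z + v

X = np.column_stack([np.ones(n), z])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]        # step 1: fit, get residuals
resid = y - X @ beta_hat

# Two-point distribution of (29): zero mean, unit variance
a1, a2 = 0.5 * (1 - np.sqrt(5)), 0.5 * (1 + np.sqrt(5))
b_prob = (1 + np.sqrt(5)) / (2 * np.sqrt(5))

B = 999
beta_star = np.empty((B, 2))
for rep in range(B):
    w = np.where(rng.uniform(size=n) < b_prob, a1, a2)  # step 2: WB weights w_i
    y_star = X @ beta_hat + resid * w                   # step 3: bootstrap sample
    beta_star[rep] = np.linalg.lstsq(X, y_star, rcond=None)[0]
print(beta_star.std(axis=0))                            # WB standard errors
```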
