
Econometrics

Hélène Huber

Maison des Sciences Economiques


Office 415
helene.huber@univ-paris1.fr

PSME
2015-2016

1/74
Outline

"Non spherical disturbances" : GLS


Heteroskedasticity
Autocorrelation

Endogeneity and Instrumental variables


Framework
Random explanatory variables
Asymptotics
Non-exogeneity
Definition
Possible causes
The IV estimator
IV in practice
Two-stage least squares
Diagnostics on the IV procedure
Examples

2/74
What if hypotheses do not hold ?

I Heteroskedasticity : error terms do not have the same variance


(ex : size effect for firms) → mostly cross sections
I Autocorrelation : error term from period t is correlated to error
term of period t − 1, and/or t − 2 ... → mostly time series
In that case, V(u) ≠ σ2IN : OLS is still unbiased but less precise.
Furthermore, all the tests we derived are no longer valid, since
they rely on homoskedasticity assumptions. We thus have to
modify the estimation to draw reliable inference about parameter
values.

4/74
The problem with OLS

Let’s use the following model : y = Xb + u


I Assume that V (u) = σ 2 Ω, with Ω any suitable matrix (must
be symmetric and positive definite)
I E (b̂ols ) = b : still unbiased
I But V(b̂ols) = σ2(X′X)−1X′ΩX(X′X)−1
I If we run an OLS regression with software, we will get b̂
computed as b̂ = (X′X)−1X′y (right) but V̂(b̂) computed as
V̂(b̂) = σ̂2(X′X)−1 (wrong)
I The software will thus give us unbiased estimates, but wrong
variances, so all the tests we may run are wrong (T-test,
F-test)

5/74
How can we correct for these problems ?

I We need to transform the model so as to get homoskedastic


and independent error terms
I We can use generalized least squares, that will replace
ordinary least squares
I We will eventually end up re-weighting the model with the
help of the variance matrix of error terms (see modified Chow
test : same concept)

6/74
Generalized least squares

Let’s use the following model : y = Xb + u


I Assume that V (u) = σ 2 Ω
I Ω is symmetric and positive definite, so Ω−1 exists and Ω−1/2
exists as well
I Ω−1/2 Ω−1/2 = Ω−1
I Let’s multiply the model on the left-hand side by Ω−1/2 :
Ω−1/2 y = Ω−1/2 Xb + Ω−1/2 u
I The error terms of this "new" model are homoskedastic and
uncorrelated : V (Ω−1/2 u) = σ 2 IN
I We can use the transformed model for the estimation of b and
most of all for inference

7/74
The GLS estimator

I Using OLS on the transformed model, we get the GLS


estimator
I b̂gls = (X′Ω−1X)−1X′Ω−1y
I It is obtained by minimizing the sum of squared residuals
weighted by Ω−1
I GLS are unbiased
I The "true" variance that can be used for tests is the
following : V (b̂gls ) = σ 2 (X 0 Ω−1 X )−1
I Notice that the transformed model does not necessarily have a
meaning : it is only useful for the estimation procedure
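
A minimal numpy sketch of this estimator, on simulated data with an
assumed (known) diagonal Ω — in practice Ω must be estimated, which
is the FGLS topic below :

import numpy as np

rng = np.random.default_rng(0)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])  # constant + 1 regressor
omega_diag = np.exp(X[:, 1])                 # assumed heteroskedasticity pattern
y = X @ np.array([1.0, 2.0]) + rng.normal(size=N) * np.sqrt(omega_diag)

Oinv = np.diag(1.0 / omega_diag)             # Omega^{-1} (diagonal case)
b_gls = np.linalg.solve(X.T @ Oinv @ X, X.T @ Oinv @ y)
# variance usable for tests : sigma^2 (X' Omega^{-1} X)^{-1}
print(b_gls)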

8/74
Using GLS

Since GLS amounts to OLS computed on a model that has nice


properties, we can implement all the tests we described earlier
(T-test, F-test)

9/74
The FGLS estimator

I GLS have nice properties


I The problem is however that we do not know the true Ω : we
thus have to use an estimator of Ω and we thus turn to the
feasible generalized least squares FGLS
I We cannot estimate all the variances and covariances of the
N error terms, since we only have N observations
I We’ll have to estimate these variances and covariances in
some way (see later)

10/74
Remarks

I FGLS and GLS are asymptotically equivalent (when sample


size goes to infinity)
I FGLS have good properties only asymptotically
I FGLS are usually biased with a small sample
I Thus, in small samples, we cannot be sure that FGLS
outperform OLS

11/74
Heteroskedasticity

I Heteroskedasticity usually occurs with cross sections


I Let’s assume error terms are uncorrelated but do not have the
same variance (ex : households with very different income
levels)
I The variance matrix of error terms is a diagonal matrix σ 2 Ω
I It is thus easy to compute Ω−1
I This is sometimes called the Weighted Least Squares
estimator (WLS) because each observation is weighted
proportionally to the inverse of the standard error of its
corresponding error term
I It is as if we put a smaller weight on high-income households

13/74
Re-weighting matrix

   
Ω = diag(a1, a2, . . . , aN), so Ω−1 = diag(1/a1, 1/a2, . . . , 1/aN) (1)

And the re-weighting matrix is :

Ω−1/2 = diag(1/√a1, 1/√a2, . . . , 1/√aN) (2)
14/74
Testing for heteroskedasticity

I First, simply looking at residuals


I Then, various tests are available
I Breusch-Pagan test : regress the squared residuals on all the
explanatory variables
I White test : regress the squared residuals on all the
explanatory variables raw, squared, and their cross-products
I If we find some parameters significantly different from 0, then
we conclude there is heteroskedasticity
I We will eventually use these regressions to estimate Ω and
lead the weighted regression
Stata command : estat hettest, bpagan, whitetst
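
A hand-rolled numpy sketch of the Breusch-Pagan idea : regress the
squared OLS residuals on the explanatory variables and test the slopes.
This uses the N·R2 (Koenker) variant of the statistic; Stata's estat
hettest implements closely related forms, and all simulated numbers
here are illustrative.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N = 500
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
y = 1 + 2 * x + rng.normal(size=N) * np.exp(0.5 * x)   # heteroskedastic errors

b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
u2 = (y - X @ b_ols) ** 2                   # squared OLS residuals

g, *_ = np.linalg.lstsq(X, u2, rcond=None)  # auxiliary regression on the X's
r2 = 1 - ((u2 - X @ g) ** 2).sum() / ((u2 - u2.mean()) ** 2).sum()
lm = N * r2                                 # LM statistic, chi2(1) under H0
print(lm, stats.chi2.sf(lm, df=1))          # small p-value => heteroskedasticity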

15/74
Weighted least squares in practice (1)

Let’s consider a heteroskedastic model y = Xb + u


I The variance matrix of error terms is a diagonal with N
unknown elements V (ui )
I Assume that V (u) depends on variables X :
V (u) = σ 2 . exp(X γ) (we drop i’s)
I The exponential ensures everything is always positive
I The squared residuals are the empirical counterparts of the
error variances because V (ui ) = E (ui2 )
I So this equation becomes empirically : û 2 = σ 2 .exp(X γ) so
that log(û 2 ) = X γ + constant
I The error variances can now be estimated

16/74
Weighted least squares in practice (2)

What to do :
1. Estimate the original regression y = Xb + u with OLS, save
residuals, compute their square
2. Using plots or any other method, pick the right X ’s for the
auxiliary regression : log(û 2 ) = X γ + constant
3. Compute the predicted residual squared from the auxiliary
regression (need to get back to levels from logs)
4. For each individual i, divide every term of the original model
by the square root of that prediction
5. Run OLS on transformed data
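
A numpy sketch of these five steps, under the assumed variance model
V(u) = σ2 · exp(Xγ) from the previous slide (simulated data; in real
work the choice of X’s in the auxiliary regression comes from step 2) :

import numpy as np

rng = np.random.default_rng(2)
N = 500
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
y = 1 + 2 * x + rng.normal(size=N) * np.exp(0.5 * x)

# 1. OLS, save residuals, square them
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
log_u2 = np.log((y - X @ b_ols) ** 2)

# 2.-3. auxiliary regression log(u^2) = X gamma + cst, predict in levels
g, *_ = np.linalg.lstsq(X, log_u2, rcond=None)
v_hat = np.exp(X @ g)                     # predicted error variances (up to scale)

# 4. divide every term by the square root of the prediction
w = 1.0 / np.sqrt(v_hat)
Xw, yw = X * w[:, None], y * w

# 5. OLS on the transformed data = FGLS
b_fgls, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
print(b_ols, b_fgls)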

17/74
Should we always use FGLS ?

I FGLS do not have nice small sample properties, while OLS do


I But using OLS gives us wrong standard errors for estimates
I We could use OLS and compute corrected standard errors for
them
I We can use the White (1980) matrix to run tests

18/74
The White matrix

White (1980) proved that a consistent estimator for the variance
matrix of parameters is :

V̂(b̂ols) = (X′X)−1 (Σt ût2 Xt Xt′) (X′X)−1

Where the sum runs over t = 1, . . . , N and the û’s are the residuals
from the OLS regression.
This matrix can be used as an estimate of the true variance of OLS
estimators, and thus for tests. They are called "White" or "robust"
standard errors (because they are robust to heteroskedasticity),
and can be computed directly by software (option "robust").
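
A numpy sketch of the sandwich formula (no small-sample degrees-of-
freedom correction; packaged versions such as Stata’s robust option
apply one, so figures differ slightly) :

import numpy as np

rng = np.random.default_rng(3)
N = 500
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
y = 1 + 2 * x + rng.normal(size=N) * np.exp(0.5 * x)

b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
u = y - X @ b_ols

bread = np.linalg.inv(X.T @ X)
meat = (X * (u ** 2)[:, None]).T @ X      # sum_t u_t^2 x_t x_t'
V_white = bread @ meat @ bread            # robust variance of b_ols
print(np.sqrt(np.diag(V_white)))          # "White" / robust standard errors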

19/74
Autocorrelation : a preview

I Will be treated in detail in the Time Series chapter


I Error term from period t is correlated to error term of period
t − 1, and/or t − 2 ...
I First-order autocorrelation : ut = ρut−1 + εt
I Where |ρ| < 1 and ε is a "white noise"
I "White noise" means ∀t, E(εt) = 0 and V(εt) = σε2, with all εt
uncorrelated to each other and to past u’s

21/74
First-order autocorrelation

The variance of error terms σ2Ω is no longer a diagonal matrix
because Ω is no longer the identity matrix : the (i, j) element of Ω
is ρ|i−j| , i.e. ones on the diagonal, ρ on the first off-diagonals,
ρ2 on the second, . . . , and ρN−1 in the corners. (3)
And we have : E(ut) = 0, V(ut) = σ2 = σε2/(1 − ρ2).
The correlation coefficient between ut−h and ut is ρ|h| .
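
For illustration, this Ω is easy to build in numpy from its (i, j)
element (ρ and N chosen arbitrarily) :

import numpy as np

rho, N = 0.6, 5                                       # illustrative values
i = np.arange(N)
Omega = rho ** np.abs(i[:, None] - i[None, :])        # (i, j) element: rho^|i-j|
print(Omega)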

22/74
Considering random explanatory variables

I In the basic framework, explanatory variables X are usually


considered non random
I This amounts to reasoning conditionally on X , i.e. as if the X
were fixed
I An important hypothesis was that X and u were independent,
which is always the case if X is fixed
I But the X ’s can be considered as random variables
I Conditionally on X , all the previous results hold

25/74
Hypotheses for estimation

1. E (u) = 0
2. ∀t, s, Xt is random and uncorrelated to us
3. Rk(X ) = k
4. E(uu′) = σ2IN
5. Error terms are iid (0, σ2)
6. plim(X′X/N) = VX , which is a positive definite matrix (when
N goes to infinity, the X variables always keep some variance)

26/74
Properties of OLS : unbiasedness

I We know that b̂ols = (X′X)−1X′y = b + (X′X)−1X′u
I So E(b̂ols) = b + E[(X′X)−1X′u]
I We get : E[(X′X)−1X′u] = EX [Eu [(X′X)−1X′u / X ]]
= EX [(X′X)−1X′ Eu [u / X ]] and Eu [u / X ] = 0
I b̂ols is still unbiased
I In the same way, we get an expression for the variance of b̂ols :
I V(b̂ols) = σ2 E[(X′X)−1]

27/74
Properties of OLS : consistency

plim(b̂ols) = b + [plim(X′X/N)]−1 plim(X′u/N) = b + VX−1 · 0 = b
We will see in the next section that the asymptotic distribution of
b̂ is a normal, as expected.

28/74
Unconditional inference

I We know that b̂ols = b + (X′X)−1X′u


I If we assume that the u are normal, b̂ols is normal only
conditionally on X
I But if we assume that X is random, there is no reason why
b̂ols should be normal so in principle, we cannot get the
Student and Fisher distributions we use for tests
I However, it can be shown that usual tests still work

29/74
Sketch of proof
Recall that for any parameter bj , the usual Student t-statistic is
the following :

tj = b̂j / σ̂b̂j

I Conditionally on X , tj followed a Student distribution TN−k


I But in fact the distribution TN−k does not depend on the X
variables : so we can state that tj follows a TN−k distribution
unconditionally
I The same kind of result can be proven for the F distribution
I Conclusion : all the tests we used are still valid
I We only need the u to be normal and the X can follow any
distribution
I The fact that these test statistics follow a given distribution
that does not depend on X makes them pivotal

30/74
General properties

I If we assume that the Xi are iid 1 and the u’s are normal,
then :
I OLS estimators are asymptotically normal and consistent
I So if the sample is large enough for this to hold, then all the
previous results on T and F tests remain the same
I We can thus run the usual T- and F-tests
I If the u are not normal but only iid, we will use the
asymptotic version of the usual T-test and F-tests if the
sample is large enough
I In that case, Normal replaces Student and χ2 replaces Fisher

31/74

1. independent and identically distributed


Asymptotics : OLS estimators

We wish now to enlarge our framework to the case where u’s are
only iid (and not necessarily normal) and where X ’s can be random
(and not necessarily fixed). In what follows, we will thus use the
following asymptotic results, obtained under the usual model
hypotheses :

plim(b̂ols) = b + [plim(X′X/N)]−1 plim(X′u/N) = b + VX−1 · 0 = b

plim(σ̂ols2) = σ2

√N(b̂ols − b) → N (0, σ2VX−1)

32/74
Asymptotics : tests (1)

Tests rely on the following property :

f = (Cb̂ols − Cb)′ [C(X′X)−1C′]−1 (Cb̂ols − Cb) / σols2 → χ2r

Where, as before in the finite sample case, σols2 is unknown. But
since by assumption we work with a very large sample, σ̂ols2 is a
consistent estimate of σols2. We can thus replace σols2 by its
consistent estimate, without having to replace the χ2 with a
Fisher, as in the finite sample case where σols2 was imperfectly
estimated. This leads to :

f = (Cb̂ols − Cb)′ [C(X′X)−1C′]−1 (Cb̂ols − Cb) / σ̂ols2 → χ2r

33/74
Asymptotics : tests (2)

So if we test H0 : Cb = c, the following test statistic fa (called the
Wald statistic) converges in distribution to a χ2r (where
Rk(C ) = r , i.e. r is the number of constraints we test) and should
be compared to the quantiles of this distribution for the test :

fa = (Cb̂ols − c)′ [C(X′X)−1C′]−1 (Cb̂ols − c) / σ̂ols2

In practice, we can get this statistic from the default Fisher F
statistic that Stata provides, thanks to the fact that under H0 ,
fa = rF (compare it to the classic Fisher statistic seen before).

34/74
Asymptotics : tests (3)

I For the significance test of one parameter, we find that under


H0 , fa is equal to the usual Student t, squared
I Since fa converges in distribution to a χ21 , its square root
becomes a N (0, 1)
I We end up comparing the regular Student t provided by Stata
to a standardized normal
I It was already the case before, if the sample size was large
enough, since when N increases, the Student distribution
becomes a normal
I So, there are no big changes wrt the basic framework we are
used to, but all this was necessary for the IV framework : IV
only works when the sample size is large, so using asymptotic
results is mandatory

35/74
Non exogenous explanatory variables

I Let’s use the following model : yt = Xt′b + ut


I Assume that E(Xt′ut) ≠ 0 : some explanatory variables are
correlated to the current error term
I In this case, plim(X′u/N) ≠ 0
I OLS is biased and inconsistent

plim(b̂ols) = b + [plim(X′X/N)]−1 plim(X′u/N) ≠ b
Even if we increase the size of the sample, we won’t get the right
value for b

37/74
Exogeneity : definitions

I Strong exogeneity : ∀t, s, E(Xt′us) = 0


I X and u are never correlated
I Exogeneity : ∀t, E(Xt′ut) = 0
I X and u are uncorrelated for any given time period
I X predetermined : ∀s ≥ t, E(Xt′us) = 0
I X uncorrelated to current and future u
This class focuses on cross-sections so on the second case.

38/74
Omitted variables
Let’s assume the true model is the following :

yt = Xt′b + wt d + vt
If we omit wt and estimate the following model instead :

yt = Xt′b + ut

I wt is included in ut
I If wt is correlated to X , then E(Xt′wt) ≠ 0 and thus
E(Xt′ut) ≠ 0 : OLS is biased
I Remark : in the fixed X case, we would get the same result
using the Frisch-Waugh theorem (see Dormont)
I When a variable that is correlated to the X ’s is omitted,
estimators suffer from an omitted variable bias
I Estimators are inconsistent
39/74
Variables measured with error
Let’s use a very simple one-variable theoretical model :

yt∗ = xt∗ .b + vt
Assume that x is measured with error. To run the estimation, we
have access to values yt and xt : yt = yt∗ and xt = xt∗ + et , et
being a white noise.

The estimated model is thus :

yt = xt .b + ut
With ut = vt − bet . We thus have the following :

E(xt ut) = E[(xt∗ + et)(vt − b·et)] = −b·σe2 ≠ 0


OLS is thus inconsistent, and it can be shown that it is biased
towards 0 : the influence of x on y is underestimated.
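
A quick simulation of this attenuation bias (illustrative numbers) :
with classical measurement error, the OLS slope converges to
b · σ2x∗ /(σ2x∗ + σe2), which is smaller than b in absolute value.

import numpy as np

rng = np.random.default_rng(4)
N = 100_000
b = 2.0
x_star = rng.normal(size=N)              # true regressor, variance 1
e = rng.normal(scale=0.7, size=N)        # measurement error, variance 0.49
y = b * x_star + rng.normal(size=N)
x = x_star + e                           # observed, mismeasured regressor

b_hat = np.cov(x, y, ddof=0)[0, 1] / np.var(x)   # one-variable OLS slope
print(b_hat, b * 1.0 / (1.0 + 0.49))     # both close to 2/1.49, about 1.34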
40/74
Simultaneous equations

Say we have the following system of equations :


I Yt = a + bXt + ut (1)
I Xt = Yt + Zt (2)
I Because of equation (1), Yt is endogenous and because of
equation (2), Xt is endogenous too
I At the system level, only Zt is exogenous
I To get back to what we are used to, we should rewrite
endogenous variables as functions of exogenous variables only

41/74
Simultaneous equations cont’d

The system can be rewritten in the following way :


I Yt = a/(1−b) + [b/(1−b)] Zt + [1/(1−b)] ut
I Xt = a/(1−b) + [1/(1−b)] Zt + [1/(1−b)] ut
Calling µ the "new" residual, we get back to a usual framework.
Notice that Xt appears to be a function of ut : in equation (1) of
the first system, it is thus correlated to the error term. A basic
hypothesis of OLS is violated (cov(Xt , ut) ≠ 0) and if we estimate
(1) without taking into account the information provided by (2),
estimates will be inconsistent.

42/74
Autocorrelation with lagged dependent variable
Assume we use the following model :

yt = b0 + b1 xt + b2 yt−1 + ut
With :

ut = ρut−1 + εt
We can then write :
I yt = b0 + b1 xt + b2 yt−1 + ρut−1 + εt (1)
I And at the same time :
I yt−1 = b0 + b1 xt−1 + b2 yt−2 + ut−1 (2)
I Equation (2) shows that yt−1 depends on ut−1 , so yt−1 is
correlated to the error term ρut−1 + εt in equation (1)
I The lagged outcome variable is thus not exogenous in the
estimated model
I It can be shown that it is exogenous if error terms are not
autocorrelated
I How to test for this : Durbin’s h statistic (see Dormont)

43/74
General issue in policy evaluation

I Say we want to evaluate the impact of a policy on people’s


wages (ex : a specific program helping the unemployed)
I A model describing the wage outcome is
Yi = a + bXi + cPi + ui , where X comprises individual
characteristics and P is a dummy variable indicating whether
the individual was assigned to the program
I If the policy is not randomized, i.e. if the fact of being
assigned to the program depends on unobserved individual
characteristics, then P is correlated to u and the impact of the
policy cannot be estimated consistently
I Intuitive reason : those who were assigned to the program
are not comparable to the ones who were not, so the latter
cannot be considered as a valid control group for the former
I This is called a selection effect
44/74
Consequences

In the case of non-exogeneity (also called endogeneity) of some


explanatory variables :
I OLS estimators are biased : on average, we do not get the
true value of parameters
I OLS are inconsistent : even if we increase the size of the
sample, the bias does not go to zero and we will never get the
true value of parameters
We thus have to use another technique of estimation, using
auxiliary variables : the instrumental variables technique.

45/74
Instrumental variables
Let’s use the following model, where (to make things simple) all
X ’s are endogenous, except the constant (can’t be endogenous by
definition) :

y = X .b + u
Let’s consider a set of variables Z(N,p) different from the X ’s but
including the constant, with the following properties :
I E(Zt′ut) = 0 : variables Z are exogenous
I Rk(Z ) = p
I plim(Z′X/N) = VZX with VZX a non-null matrix of dimension
(p, k) and rank k
I plim(Z′Z/N) = VZ with VZ a finite positive definite matrix
of dimension (p, p)
Means : variables Z need to be both exogenous and correlated to
X , and the number of IV, p, is at least the number of explanatory
variables X (p ≥ k).

47/74
The estimator
The original model is :

y = X .b + u
I Assume we regress y , u and every X on variables Z (k + 2
regressions)
I We compute the predictions, respectively ỹ , ũ and X̃
I We use these new elements in the model instead :

ỹ = X̃ .b + ũ
I Let’s call PZ the orthogonal projection matrix, that projects
on L(Z ) : PZ = Z(Z′Z)−1Z′
I PZ is such that it is symmetric (PZ′ = PZ ) and idempotent
(PZ PZ = PZ )
I It can be shown that ỹ = PZ y , ũ = PZ u and X̃ = PZ X
I It’s as if we had premultiplied the original model by PZ
I The IV estimator is : b̂iv = (X̃′X̃)−1X̃′ỹ = (X′PZ X)−1X′PZ y

48/74
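A numpy sketch of this formula on simulated data, with one
endogenous regressor, a constant, and two excluded instruments
(all numbers illustrative) :

import numpy as np

rng = np.random.default_rng(5)
N = 2000
z1, z2 = rng.normal(size=N), rng.normal(size=N)
v = rng.normal(size=N)
x = 0.8 * z1 + 0.5 * z2 + v               # endogenous : shares v with u
y = 1.0 + 2.0 * x + 0.7 * v + rng.normal(size=N)

X = np.column_stack([np.ones(N), x])
Z = np.column_stack([np.ones(N), z1, z2]) # instruments include the constant

PZX = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)        # P_Z X = Z(Z'Z)^{-1}Z'X
b_iv = np.linalg.solve(PZX.T @ X, PZX.T @ y)       # (X'P_Z X)^{-1} X'P_Z y
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(b_ols, b_iv)   # OLS slope biased away from 2, IV close to 2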
Remark

I To get to the estimated model, we multiply everything on the


left y and X by PZ
I What would happen if we multiplied only X and not y ?
I b̂iv = (X̃′X̃)−1X̃′y = (X′PZ X)−1X′PZ y
I It does not change anything as regards b̂iv : in practice, we
transform only X and leave y unchanged

49/74
Intuition (1)

I OLS on the original model y = X .b + u are inconsistent


because variables X are correlated to u
I We can’t replace the X ’s by other variables because the model
should remain unchanged
I We would like to get rid of this correlation, while keeping the
information provided by X
I Imagine a regression where each variable in X would be
explained by the set of variables Z , say : ∀j, xj = Zd + e
I A prediction for xj would be : x̂j = Z d̂
I If this regression is good enough, x̂j is close to the true xj
I And x̂j is a linear combination of variables Z that are
exogenous : it is thus exogenous as well
I We can thus replace xj by x̂j in the original model
50/74
Intuition (2)

I We can also explain this algebraically


I To remove the correlation between X and u, we project the
model onto the vector space L(Z ), spanned by the columns of
Z , that are at the same time exogenous and correlated to X
I This amounts to keeping from X only the exogenous
information, uncorrelated to the error terms
I The more the Z are correlated to the X (and the more
numerous the Z ’s are), the more precise the estimator is
because the information loss is minimal
I PZ y is the prediction from the regression of y on variables Z ,
same for PZ X

51/74
Remarks

I If p = k (same number of instruments Z and of explanatory


variables X ), we get b̂iv = (Z′X)−1Z′y
I Proof : rewriting b̂iv , given that in this case, matrix Z′X is
square and invertible
I Even if this expression is simple, it is not recommended to
choose this minimal number of instruments

52/74
Generalization (1)

I What if some variables in X are endogenous, and others


exogenous ?
I Let’s call X1 the set of the exogenous ones and X2 the set of
the endogenous ones : the matrix notation would be
X = (X1 , X2 ) (again, the constant is always exogenous)
I In that case, variables X1 can be used as instruments in
addition to the Z
I Let’s call W this extended set of instruments : W = (Z , X1 )
I To compute the IV estimator, we premultiply X by PW
I PW X = PW (X1 , X2 ) = (PW X1 , PW X2 ) = (X1 , PW X2 )
I Indeed, since X1 belongs to W , it’s being projected onto itself
and remains unchanged

53/74
Generalization (2)

I Conclusion : we will use X1 as additional instruments because


they will remain unchanged in the transformed model, and will
help to instrument the X2
I We only need to make sure that there are at least as many
"new" instruments Z as endogenous variables X2 , and not rely
only on the exogenous X1
I Z are sometimes called excluded instruments
I We realize here that the constant, exogenous by nature,
belongs to the exogenous X1 and that we had been right to
use it as an instrument in the first place
I That’s what we did when describing the first stage regression,
that contain a constant by default

54/74
Important : final notation

I Since the exogenous variables from the set of variables X will


always be used as instruments, we don’t usually bother
mentioning it
I We usually call Z the global matrix of instruments that
contains both the excluded instruments and the
variables from X that are exogenous
I So that the hypotheses and formulae are the same as the ones
presented in the beginning
I We only need to make sure that there are at least as many
excluded instruments as endogenous variables

55/74
Properties of the IV estimator (1)

I b̂iv is biased in small sample


I We thus cannot derive a general expression for its
variance-covariance matrix
I But it is consistent (when sample size goes to infinity)
I It is asymptotically normal
√N(b̂iv − b) → N(0, σ2(V′ZX VZ−1 VZX)−1)

56/74
Properties of the IV estimator (2)

I There is no general expression for its variance-covariance


matrix, but we can derive its asymptotic variance-covariance
matrix that is a consistent estimator of its "true"
variance-covariance matrix

V̂asympt(√N(b̂iv − b)) = σ̂iv2 (X′PZ X/N)−1

with σ̂iv2 = SSRiv/N, where SSRiv = û′û and û = y − Xb̂iv
I When OLS is consistent, IV is less precise than OLS
I Precision increases with the number of instruments

57/74
Tests on regression parameters

I We do not have convenient properties of the estimator in the


finite sample case
I We can only run asymptotic tests : Wald tests, similar to
F-tests, but when sample size is large
I Example of a test for r linear constraints on parameters :

(Cb̂iv − Cb)′ [C(X′PZ X)−1C′]−1 (Cb̂iv − Cb) / σ̂iv2 → χ2r
With r the number of linear restrictions : C is of size (r , k).
Proof : taking the expression of the asymptotic normality of b̂iv ,
taking its quadratic form and "dividing" it by its variance will give
a χ2 distribution (the "true" variance is replaced by its consistent
estimator).
58/74
How to run IV estimation

I Stata : "ivregress" command


I This amounts to running two-stage least squares
I Intuition : first, regress the y and X ’s on variables Z , then use
the predictions in the model instead of the original values

60/74
Two-stage least squares (1)

I Run k + 1 regressions, to get PZ y and PZ X


I Estimate OLS on the transformed model PZ y = PZ Xb + PZ u
I We thus get b̂iv = (X′PZ X)−1X′PZ y
I The first k + 1 regressions can be used to assess the relevance
of the instruments (they have to be correlated enough to the
X ’s)
I Remark : this provides the same values if we do not replace y
by PZ y

61/74
Two-stage least squares (2)

I Warning : if we do this procedure "by hand", running 2 OLS


regressions, instead of running the convenient procedure with
the software, the s.e.’s of the second regression cannot be
used for tests on the coefficients
I Reason : in the second stage equation, residuals are computed
as : û = PZ y − PZ X b̂iv
I Whereas they should be computed as û = y − X b̂iv
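
A sketch of the difference, reusing the simulated data from the IV
sketch above : same point estimate, but the naive second-stage
residuals û = PZ y − PZ X b̂iv understate the residual variance
(û′PZ û ≤ û′û) and hence the standard errors.

import numpy as np

rng = np.random.default_rng(5)
N = 2000
z1, z2 = rng.normal(size=N), rng.normal(size=N)
v = rng.normal(size=N)
x = 0.8 * z1 + 0.5 * z2 + v
y = 1.0 + 2.0 * x + 0.7 * v + rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
Z = np.column_stack([np.ones(N), z1, z2])

PZX = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
PZy = Z @ np.linalg.solve(Z.T @ Z, Z.T @ y)
b_iv = np.linalg.solve(PZX.T @ X, PZX.T @ y)

u_wrong = PZy - PZX @ b_iv                # naive second-stage residuals
u_right = y - X @ b_iv                    # residuals to use for inference
V = np.linalg.inv(PZX.T @ X)              # (X'P_Z X)^{-1}
se = lambda u: np.sqrt(np.diag((u @ u / N) * V))
print(se(u_wrong), se(u_right))           # the first ones are too small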

62/74
Exogeneity test (1)

I We test H0 : E(X′u) = 0
I This is called the "Hausman test" or "Durbin-Wu-Hausman
test"
I If H0 is true, then both OLS and IV estimators are consistent
I If H0 is false, only the IV estimator is consistent
I The test is based on the difference between b̂iv and b̂ols
I They are asymptotically normal : if we compute the difference
between the two, take its quadratic form and "divide" it by its
variance matrix, we will get a χ2 distribution

63/74
Exogeneity test (2)

I One could think that the parameter of this χ2 is the number


of tested variables (those potentially endogenous), just like a
Wald test
I But looking closely at the test statistic :
I H = (b̂iv − b̂ols)′ [V(b̂iv) − V(b̂ols)]−1 (b̂iv − b̂ols)
I This parameter is in fact equal to the rank of the following
matrix : V(b̂iv) − V(b̂ols)
I So sometimes the software can’t run the test (negative H,
small samples, etc.)
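
With a single potentially endogenous regressor, the statistic reduces
to a scalar contrast on that coefficient; a numpy sketch on the
simulated IV data (σ2 taken from the IV residuals, one common
convention — implementations differ) :

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
N = 2000
z1, z2 = rng.normal(size=N), rng.normal(size=N)
v = rng.normal(size=N)
x = 0.8 * z1 + 0.5 * z2 + v
y = 1.0 + 2.0 * x + 0.7 * v + rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
Z = np.column_stack([np.ones(N), z1, z2])

PZX = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
b_iv = np.linalg.solve(PZX.T @ X, PZX.T @ y)
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

s2 = ((y - X @ b_iv) ** 2).mean()                 # sigma^2 under H0
V_iv = s2 * np.linalg.inv(PZX.T @ X)
V_ols = s2 * np.linalg.inv(X.T @ X)

H = (b_iv[1] - b_ols[1]) ** 2 / (V_iv[1, 1] - V_ols[1, 1])
print(H, stats.chi2.sf(H, df=1))                  # rejects : x is endogenous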

64/74
The augmented regression

I Consider model y = Xb + u, where a subset of variables


belonging to X might be endogenous : x
I Let’s call Z the instruments, some belonging to X (in fact the
X without the x ) and some not
I Consider the augmented model : y = Xb + (MZ x)c + ε
I MZ x are the residuals of the regression of x on Z (MZ = IN − PZ )
I The b̂ of this "augmented" regression is equal to the IV
estimator of the original model
I Testing c = 0 amounts to testing exogeneity of the x (it is
equivalent to the Hausman test)
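
A numpy sketch of the augmented regression on the same simulated
data : the coefficients on X reproduce b̂iv (Frisch-Waugh), and a
significant c flags endogeneity of x.

import numpy as np

rng = np.random.default_rng(5)
N = 2000
z1, z2 = rng.normal(size=N), rng.normal(size=N)
v = rng.normal(size=N)
x = 0.8 * z1 + 0.5 * z2 + v
y = 1.0 + 2.0 * x + 0.7 * v + rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
Z = np.column_stack([np.ones(N), z1, z2])

x_res = x - Z @ np.linalg.solve(Z.T @ Z, Z.T @ x)  # M_Z x : first-stage residuals
A = np.column_stack([X, x_res])                    # augmented regressors
coefs = np.linalg.solve(A.T @ A, A.T @ y)
print(coefs[:2])   # equal to the IV estimates (constant, b)
print(coefs[2])    # c : testing c = 0 is the exogeneity test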

65/74
Proofs

I b̂aug = b̂iv and the equivalence of the tests : both follow from
the Frisch-Waugh theorem
I Remark : this augmented model has no theoretical meaning,
it’s only a tool

66/74
Selecting convenient instruments

I Sargan test : H0 : E(Z′u) = 0
I Also called : test of overidentifying restrictions (Stata : overid
command)
Under H0 :

û′PZ û / s2 → χ2p−k

with û = y − Xb̂iv and s2 = û′û/N.
û′PZ û is the sum of the squared predicted values from the
regression of û on Z .
Remark : when p = k, the statistic is always zero because
b̂iv = (Z′X)−1Z′y and Z′û = 0. So we cannot run the test with
the minimum number of instruments.
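
A numpy sketch of the Sargan statistic on the simulated data (p = 3
instruments including the constant, k = 2 regressors, hence 1
overidentifying restriction; instruments are valid by construction
here, so the test should not reject) :

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
N = 2000
z1, z2 = rng.normal(size=N), rng.normal(size=N)
v = rng.normal(size=N)
x = 0.8 * z1 + 0.5 * z2 + v
y = 1.0 + 2.0 * x + 0.7 * v + rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
Z = np.column_stack([np.ones(N), z1, z2])

PZX = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
b_iv = np.linalg.solve(PZX.T @ X, PZX.T @ y)
u = y - X @ b_iv

Pu = Z @ np.linalg.solve(Z.T @ Z, Z.T @ u)   # prediction of u-hat on Z
sargan = (u @ Pu) / (u @ u / N)              # u'P_Z u / s^2
print(sargan, stats.chi2.sf(sargan, df=1))   # df = p - k = 1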
67/74
The problem with weak instruments

I If instruments are too weakly correlated to the X ’s, even if we


increase the number of observations, there can be an
important bias in estimations
I Plus, the estimator has a nonnormal sampling distribution
which makes statistical inference meaningless
I The weak instrument problem is increased with many
instruments, so drop the weakest ones and use the most
relevant ones
I A way to measure how instruments are correlated to
potentially endogenous variables is to run the regression
explaining the latter by the former and check its goodness of
fit
I A criterion can be the global F statistic of that first-stage
regression : if F < 10, then instruments are weak
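
A sketch of this diagnostic : regress the potentially endogenous
variable on the full instrument set and compute the F statistic for
the excluded instruments (the first stage is strong by construction
in this simulated example) :

import numpy as np

rng = np.random.default_rng(5)
N = 2000
z1, z2 = rng.normal(size=N), rng.normal(size=N)
x = 0.8 * z1 + 0.5 * z2 + rng.normal(size=N)   # first-stage relationship

Z = np.column_stack([np.ones(N), z1, z2])      # constant + excluded instruments
g = np.linalg.solve(Z.T @ Z, Z.T @ x)
ssr_u = ((x - Z @ g) ** 2).sum()               # unrestricted SSR

ssr_r = ((x - x.mean()) ** 2).sum()            # restricted : constant only
q, dof = 2, N - 3                              # 2 excluded instruments tested
F = ((ssr_r - ssr_u) / q) / (ssr_u / dof)
print(F)                                       # weak instruments if F < 10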
68/74
Returns on education
Angrist-Krueger (1991), "Does Compulsory School Attendance
Affect Schooling and Earnings", Quarterly Journal of Economics
I Goal : find causal link between education and wages, and we
know that such a wage equation is subject to endogeneity
I In the US, compulsory education starts the year pupils turn 6,
and ends when the pupil turns 16
I Pupils born in January : begin school at 6 years 8 months old
I Pupils born in December : begin school at 5 years 9 months
old
I For pupils who stop school at 16, pupils born in December
study almost one year more than those born in January
I Quarter of birth is thus correlated to schooling, but random so
uncorrelated to individual ability : good instrument for
schooling
69/74
Quarter of birth and education (figure)

70/74
Quarter of birth and earnings (figure)

71/74
Results (figure)

72/74
Method

I The authors use as instruments the first quarter of birth,


interacted with regional dummies
I Result : OLS underestimates the impact of education on
earnings
I Even better method : use of randomized trials
I Ex : Moving to Opportunity, RAND Health Insurance
Experiment
I Many other methods in policy evaluation : difference in
difference, matching, etc

73/74
Are IV estimates reliable ?
I The goal of IV estimation is to bring to light real causality
and not mere correlation between outcome y and explanatory
variables X
I In treatment evaluation, we want to assess the Average
Treatment Effect
I So we instrument the dummy indicating whether the
individual was assigned to the treatment group because we
suspect that choosing a treatment is not exogenous : it is
likely to be correlated to unobserved heterogeneity
I However, it can be shown that in doing so, we estimate a
Local Average Treatment Effect, i.e. we estimate the impact
of the treatment only on the individuals for which the
instrument indeed has an impact (called compliers)
I In the Angrist study, compliers represent only 2% of the
sample
74/74
