4. Proposed solutions

(a) Introduce exogenous variables

If exogenous variables are added (to a first-order autoregressive process), the bias in the OLS
estimator is reduced in magnitude but remains positive. The coefficients on the exogenous
variables are biased towards zero. (The direction of bias for a pth order AR process is
difficult to identify a priori.)

The LSDV estimator remains biased if exogenous variables are added to (2), for small T.

(b) Instrumental variable methods

(i) Anderson-Hsiao

A standard technique for dealing with variables that are correlated with the error term is to
instrument them.

Taking first differences eliminates the $\alpha_i$, which were the source of the bias in the OLS
estimator. This gives:

$$(y_{it} - y_{i,t-1}) = \gamma (y_{i,t-1} - y_{i,t-2}) + (u_{it} - u_{i,t-1}), \quad i = 1, \dots, N, \; t = 1, \dots, T \tag{15}$$

Now we need to instrument $\Delta y_{i,t-1} = (y_{i,t-1} - y_{i,t-2})$, which is still clearly
correlated with the error $(u_{it} - u_{i,t-1})$. The second lag of the level, $y_{i,t-2}$, and the
first difference of this second lag, $\Delta y_{i,t-2} = (y_{i,t-2} - y_{i,t-3})$, are possible
instruments, since they are both correlated with $(y_{i,t-1} - y_{i,t-2})$ but are uncorrelated
with $(u_{it} - u_{i,t-1})$, as long as the $u_{it}$ themselves are not serially correlated.

Both the resulting instrumental variables estimators (known as Anderson-Hsiao):

$$\hat{\gamma}_{IV} = \frac{\sum_{i=1}^{N} \sum_{t=3}^{T} (y_{it} - y_{i,t-1})(y_{i,t-2} - y_{i,t-3})}{\sum_{i=1}^{N} \sum_{t=3}^{T} (y_{i,t-1} - y_{i,t-2})(y_{i,t-2} - y_{i,t-3})} \tag{16}$$

$$\hat{\gamma}_{IV} = \frac{\sum_{i=1}^{N} \sum_{t=2}^{T} (y_{it} - y_{i,t-1})\, y_{i,t-2}}{\sum_{i=1}^{N} \sum_{t=2}^{T} (y_{i,t-1} - y_{i,t-2})\, y_{i,t-2}} \tag{17}$$

are consistent when $N \to \infty$ or $T \to \infty$ or both.

· Instrumenting with the second lag of the level, (17), has the advantage over instrumenting
with the second lagged difference, (16), that only two time periods are required,
rather than at least three.
· When $T \geq 3$, the choice of instrument can be based on correlations between
$(y_{it} - y_{i,t-1})$ and each of $(y_{i,t-2} - y_{i,t-3})$ and $y_{i,t-2}$.
· It has been found that the estimator resulting from instrumenting using differences
$(y_{i,t-2} - y_{i,t-3})$ has a singularity point and very large variances over a significant
range of parameter values. Instrumenting using levels does not lead to the singularity problem,
and results in much smaller variances, and so is preferable.
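
The two Anderson-Hsiao estimators can be sketched on simulated data. This is a minimal
illustration, not a definitive implementation: the simulation design (IID standard normal
errors and fixed effects, a burn-in of 50 periods) and the function names are assumptions
made here, and the first column of the array is treated as the first observed period.

```python
import numpy as np


def simulate_ar1_panel(n_units, n_periods, gamma, seed=0, burn_in=50):
    """Simulate y_it = gamma * y_i,t-1 + alpha_i + u_it with IID N(0, 1) u_it (assumed design)."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(size=n_units)           # fixed effects
    y = alpha / (1.0 - gamma)                  # start each unit at its own long-run mean
    panel = np.empty((n_units, n_periods))
    for step in range(burn_in + n_periods):
        y = gamma * y + alpha + rng.normal(size=n_units)
        if step >= burn_in:                    # keep only the last n_periods draws
            panel[:, step - burn_in] = y
    return panel


def anderson_hsiao_levels(y):
    """IV estimate of gamma using the level y_i,t-2 as the instrument."""
    dy = y[:, 2:] - y[:, 1:-1]                 # y_it - y_i,t-1
    dy_lag = y[:, 1:-1] - y[:, :-2]            # y_i,t-1 - y_i,t-2
    z = y[:, :-2]                              # y_i,t-2
    return np.sum(dy * z) / np.sum(dy_lag * z)


def anderson_hsiao_differences(y):
    """IV estimate of gamma using the difference y_i,t-2 - y_i,t-3 as the instrument."""
    dy = y[:, 3:] - y[:, 2:-1]                 # y_it - y_i,t-1
    dy_lag = y[:, 2:-1] - y[:, 1:-2]           # y_i,t-1 - y_i,t-2
    z = y[:, 1:-2] - y[:, :-3]                 # y_i,t-2 - y_i,t-3
    return np.sum(dy * z) / np.sum(dy_lag * z)


if __name__ == "__main__":
    y = simulate_ar1_panel(n_units=20000, n_periods=6, gamma=0.5)
    print("levels instrument:     ", anderson_hsiao_levels(y))
    print("differences instrument:", anderson_hsiao_differences(y))
```

The differenced instrument uses one fewer period per unit, which is the cost noted in the
first bullet above.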

(ii) Arellano-Bond

The Anderson-Hsiao instrumental variables estimator may be consistent, but it is not efficient
because it does not take into account all the available moment restrictions. (Moment
restrictions are restrictions on the covariances between regressors and the error term.
Regressors may be orthogonal to the error term, in which case we are justified in imposing,
or using, orthogonality restrictions that the covariance between regressor and error is zero.)

Arellano and Bond (1991) argue that a more efficient estimator results from the use of
additional instruments whose validity is based on orthogonality between lagged values of the
dependent variable $y_{it}$ and the errors $u_{it}$. The Arellano-Bond estimator is now widely used
in short dynamic panels, not least due to the fact that they wrote a Gauss-based regression
package, DPD, which gives the standard OLS, fixed effects (Within or differences), random
effects estimators, plus their own. (See below for a schematic discussion of the estimators
available in DPD.)

Take (15) above, the first-differenced simple AR(1) model with no regressors. At t=3, the
first period we observe the relationship in (15),

$$(y_{i3} - y_{i2}) = \gamma (y_{i2} - y_{i1}) + (u_{i3} - u_{i2}) \tag{18}$$

[NB 3 is t, 2 is t-1, 1 is t-2]

$y_{i1}$ is a valid instrument for $(y_{i2} - y_{i1})$, since these are highly correlated, and
$y_{i1}$ is not correlated with $(u_{i3} - u_{i2})$ unless the $u_{it}$ are serially correlated.
At t=4,

$$(y_{i4} - y_{i3}) = \gamma (y_{i3} - y_{i2}) + (u_{i4} - u_{i3}) \tag{19}$$

Here $y_{i2}$ and $y_{i1}$ are both valid instruments: neither is correlated with
$(u_{i4} - u_{i3})$ unless the $u_{it}$ are serially correlated. Proceeding in this manner, we
can see that at T, the valid instrument set is $(y_{i1}, y_{i2}, \dots, y_{i,T-2})$.

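The matrix of instruments is $W = [W_1', \dots, W_N']'$, where

$$W_i = \begin{bmatrix} [y_{i1}] & & & 0 \\ & [y_{i1}, y_{i2}] & & \\ & & \ddots & \\ 0 & & & [y_{i1}, \dots, y_{i,T-2}] \end{bmatrix}$$

[NB The top row of this matrix refers to t=3, and the last to t=T: it has (T-2) block rows and
(T-2) block columns, i.e. T-2 rows and (T-2)(T-1)/2 columns in all.]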

The moment conditions are given by

$$E[(u_{it} - u_{i,t-1})\, y_{i,t-j}] = 0, \quad j = 2, \dots, t-1, \; t = 3, \dots, T \tag{20}$$

or, in vector form,

$$E(W_i' \Delta u_i) = 0, \qquad \Delta u_i = (u_{i3} - u_{i2}, \dots, u_{iT} - u_{i,T-1})'$$

There are $m = (T-2)(T-1)/2$ linear moment restrictions for $T \geq 3$.
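
The block-diagonal layout of $W_i$ and the count of moment restrictions can be checked with a
short sketch. The function name `instrument_matrix` is hypothetical, and the input is assumed
to hold the levels $y_{i1}, \dots, y_{iT}$ for a single unit.

```python
import numpy as np


def instrument_matrix(y_i):
    """Build W_i for one unit from its levels y_i1, ..., y_iT (1-D array of length T)."""
    y_i = np.asarray(y_i, dtype=float)
    n_periods = len(y_i)
    n_rows = n_periods - 2                          # one row per equation, t = 3, ..., T
    n_cols = (n_periods - 2) * (n_periods - 1) // 2
    w = np.zeros((n_rows, n_cols))
    col = 0
    for row in range(n_rows):                       # row 0 is the t = 3 equation
        width = row + 1                             # instruments y_i1, ..., y_i,t-2
        w[row, col:col + width] = y_i[:width]
        col += width
    return w


if __name__ == "__main__":
    print(instrument_matrix([1.0, 2.0, 3.0, 4.0, 5.0]))
```

Each row holds the instruments for one differenced equation, so the number of columns equals
the number of moment restrictions m.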

Premultiplying (15) (here written in vector form) by $W'$ gives

$$W' \Delta y = W' (\Delta y_{-1}) \gamma + W' \Delta u \tag{23}$$

Performing generalised least squares (GLS) on (23) gives the Arellano-Bond (1991)
preliminary one-step consistent estimator:

$$\hat{\gamma}_1 = [(\Delta y_{-1})' W (W' (I_N \otimes G) W)^{-1} W' (\Delta y_{-1})]^{-1} [(\Delta y_{-1})' W (W' (I_N \otimes G) W)^{-1} W' (\Delta y)] \tag{24}$$

where $W' (I_N \otimes G) W = \sum_{i=1}^{N} W_i' G W_i$ and $G$ is a (T-2) square matrix with twos
in the main diagonal, minus ones in the first subdiagonals and zeros otherwise:

$$G = \begin{bmatrix} 2 & -1 & 0 & \cdots & 0 & 0 \\ -1 & 2 & -1 & \cdots & 0 & 0 \\ 0 & -1 & 2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 2 & -1 \\ 0 & 0 & 0 & \cdots & -1 & 2 \end{bmatrix}$$

Arellano and Bond also put forward a consistent 2-step generalised method of moments
(GMM) estimator:

$$\hat{\gamma}_2 = [(\Delta y_{-1})' W V_N^{-1} W' (\Delta y_{-1})]^{-1} [(\Delta y_{-1})' W V_N^{-1} W' (\Delta y)] \tag{25}$$

where $V_N = \sum_{i=1}^{N} W_i' (\Delta u_i)(\Delta u_i)' W_i$, and in practice the differenced
residuals from the preliminary one-step consistent estimator $\hat{\gamma}_1$ are used in place
of $\Delta u$.

· Why should we use $\hat{\gamma}_2$ rather than $\hat{\gamma}_1$? Because $\hat{\gamma}_2$ does
not rely on knowledge about the distribution of the components of $\varepsilon_{it}$, $\alpha_i$
and $u_{it}$; $\hat{\gamma}_2$ and $\hat{\gamma}_1$ are asymptotically equivalent if the $u_{it}$
are IID(0, $\sigma_u^2$). (Nor does $\hat{\gamma}_2$ require knowledge about initial
conditions, $y_{i0}$.)
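
The one-step and two-step estimators can be sketched together for the AR(1) model with no
regressors. This is an illustrative sketch under assumptions made here (simulated data with
IID standard normal errors and fixed effects, hypothetical function names), not a
reproduction of DPD.

```python
import numpy as np


def simulate_ar1_panel(n_units, n_periods, gamma, seed=0, burn_in=50):
    """Simulate y_it = gamma * y_i,t-1 + alpha_i + u_it with IID N(0, 1) u_it (assumed design)."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(size=n_units)
    y = alpha / (1.0 - gamma)
    panel = np.empty((n_units, n_periods))
    for step in range(burn_in + n_periods):
        y = gamma * y + alpha + rng.normal(size=n_units)
        if step >= burn_in:
            panel[:, step - burn_in] = y
    return panel


def instrument_matrix(y_i):
    """Block-diagonal W_i with row t holding y_i1, ..., y_i,t-2."""
    n_rows = len(y_i) - 2
    w = np.zeros((n_rows, n_rows * (n_rows + 1) // 2))
    col = 0
    for row in range(n_rows):
        w[row, col:col + row + 1] = y_i[:row + 1]
        col += row + 1
    return w


def arellano_bond(y):
    """Return (one-step, two-step) estimates of gamma for the AR(1) model."""
    n_periods = y.shape[1]
    n = n_periods - 2
    g = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)    # the G matrix
    ws = [instrument_matrix(y_i) for y_i in y]
    dy = y[:, 2:] - y[:, 1:-1]                                # Delta y_it, t = 3, ..., T
    dy_lag = y[:, 1:-1] - y[:, :-2]                           # Delta y_i,t-1
    w_dy = sum(w.T @ d for w, d in zip(ws, dy))               # W' Delta y
    w_dy_lag = sum(w.T @ d for w, d in zip(ws, dy_lag))       # W' Delta y_{-1}

    def solve(weight):
        a = np.linalg.inv(weight)
        return (w_dy_lag @ a @ w_dy) / (w_dy_lag @ a @ w_dy_lag)

    gamma_1 = solve(sum(w.T @ g @ w for w in ws))             # one-step weighting
    resid = dy - gamma_1 * dy_lag                             # one-step differenced residuals
    v_n = sum(np.outer(w.T @ r, w.T @ r) for w, r in zip(ws, resid))
    gamma_2 = solve(v_n)                                      # two-step weighting
    return gamma_1, gamma_2


if __name__ == "__main__":
    y = simulate_ar1_panel(n_units=3000, n_periods=6, gamma=0.5, seed=1)
    print(arellano_bond(y))
```

With IID errors, as simulated here, the two estimates should be close, in line with their
asymptotic equivalence in that case.
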
What happens if we have additional independent explanatory variables in our equation?

Specifically, assume there are K additional independent explanatory variables, so we revert to
equation (1):

$$y_{it} = \gamma y_{i,t-1} + x_{it}' \beta + \varepsilon_{it}, \quad i = 1, \dots, N, \; t = 1, \dots, T \tag{1}$$

where $\varepsilon_{it} = \alpha_i + u_{it}$ and $\beta$ and $x_{it}$ are $K \times 1$.

The two-step estimators for $\gamma$ and $\beta$ are given by:

$$\begin{pmatrix} \hat{\gamma} \\ \hat{\beta} \end{pmatrix} = [(\Delta y_{-1}, \Delta X)' W V_N^{-1} W' (\Delta y_{-1}, \Delta X)]^{-1} [(\Delta y_{-1}, \Delta X)' W V_N^{-1} W' (\Delta y)] \tag{26}$$

where $\Delta X$ is the $N(T-2) \times K$ matrix of observations on $\Delta x_{it}$. The one-step
estimator is obtained if $V_N^{-1}$ is replaced by $(W' (I_N \otimes G) W)^{-1}$ (cf. (25) and
(24) above).

The instrument matrix $W$ can be expanded to take advantage of the additional independent
explanatory variables. The instrument matrix that is optimal (i.e. efficient) differs
according to whether the additional explanatory variables are correlated with the fixed
effects or not, and whether they are predetermined or strictly exogenous.

If the $x_{it}$ are all correlated with the fixed effects $\alpha_i$:

1. If the $x_{it}$ are predetermined, then future values of these regressors are correlated with
the current error, i.e. $E(x_{it} u_{is}) \neq 0$ for $s<t$ and 0 otherwise. Then we can use
$x_{it}$ as instruments up to the same date as our error term. In that case, at time s, only
$x_{i1}, \dots, x_{i,s-1}$ are valid instruments in the first-differenced version of (1) (only
up to $x_{i,s-1}$ because the differenced error includes $u_{i,s-1}$). Then the optimal
instrument matrix will be:

$$W_i = \begin{bmatrix} [y_{i1}, x_{i1}', x_{i2}'] & & & 0 \\ & [y_{i1}, y_{i2}, x_{i1}', x_{i2}', x_{i3}'] & & \\ & & \ddots & \\ 0 & & & [y_{i1}, \dots, y_{i,T-2}, x_{i1}', \dots, x_{i,T-1}'] \end{bmatrix} \begin{matrix} \leftarrow t = 3 \\ \leftarrow t = 4 \\ \vdots \\ \leftarrow t = T \end{matrix}$$

2. If the $x_{it}$ are strictly exogenous, i.e. $E[x_{it} (u_{is} - u_{i,s-1})] = 0$ for all t,s,
then all the $x_{is}$ are valid instruments, so all will appear in all elements of the leading
diagonal of the optimal instrument matrix:

$$W_i = \begin{bmatrix} [y_{i1}, x_{i1}', \dots, x_{iT}'] & & & 0 \\ & [y_{i1}, y_{i2}, x_{i1}', \dots, x_{iT}'] & & \\ & & \ddots & \\ 0 & & & [y_{i1}, \dots, y_{i,T-2}, x_{i1}', \dots, x_{iT}'] \end{bmatrix} \begin{matrix} \leftarrow t = 3 \\ \leftarrow t = 4 \\ \vdots \\ \leftarrow t = T \end{matrix}$$

3. The optimal instrument matrix when $x_{it}$ includes both predetermined and strictly
exogenous variables should be obvious.
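
The instrument sets in cases 1 and 2 can be listed period by period. The sketch below only
generates labels; the function name `instrument_labels` and the label format are
hypothetical, and a single x variable is assumed for readability.

```python
def instrument_labels(t, n_periods, x_type):
    """Labels of the valid instruments for the differenced equation at period t (3 <= t <= T)."""
    if not 3 <= t <= n_periods:
        raise ValueError("t must lie between 3 and T")
    y_part = [f"y{s}" for s in range(1, t - 1)]               # y_1, ..., y_{t-2}
    if x_type == "predetermined":
        x_part = [f"x{s}" for s in range(1, t)]               # x_1, ..., x_{t-1}
    elif x_type == "strictly_exogenous":
        x_part = [f"x{s}" for s in range(1, n_periods + 1)]   # x_1, ..., x_T
    else:
        raise ValueError("x_type must be 'predetermined' or 'strictly_exogenous'")
    return y_part + x_part


if __name__ == "__main__":
    for t in range(3, 6):
        print(t, instrument_labels(t, 5, "predetermined"))
        print(t, instrument_labels(t, 5, "strictly_exogenous"))
```

For the mixed case in 3, the same idea applies variable by variable: each x contributes
either its lags up to t-1 or its full history, depending on its type.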

If at least some of the $x_{it}$ are not correlated with the fixed effects $\alpha_i$:

We can exploit that lack of correlation by estimating levels as well as differenced versions of
the equations, and using extra restrictions in the levels equations. Specifically, let a subset
$x_{1it}$ of $x_{it}$ be uncorrelated with $\alpha_i$.

4. If $x_{1it}$ are predetermined, observations on $x_{1it}$ up to and including t=s are valid
instruments for the levels equation at t=s. So at t=2 - the first levels equation we can
estimate - we can use $x_{1i1}$ and $x_{1i2}$ as additional instruments for the levels equation,
and for t=3,...,T we can use $x_{1it}$ as an additional instrument. All other restrictions that
could be placed on the levels equations have effectively already been imposed through the
restrictions placed on the equations in differences (i.e. the additional restrictions are
redundant). For example, at t=3, the only additional instrument we can get from the $x_{it}$
is $x_{1i3}$, since $x_{1i1}$ and $x_{1i2}$ are already used in the difference equation for t=3.
This means that there are T extra restrictions in the levels equations:

$$E(\varepsilon_{i2} x_{1i1}) = 0 \quad \text{and} \quad E(\varepsilon_{it} x_{1it}) = 0, \quad t = 2, \dots, T.$$

In estimation, the levels equations from t=2 to t=T are stacked under the equations in
differences (which themselves run from t=3 to t=T). The optimal instrument matrix can
then be written:

$$W_i^{+} = \begin{bmatrix} W_i & & & & 0 \\ & [x_{1i1}', x_{1i2}'] & & & \\ & & x_{1i3}' & & \\ & & & \ddots & \\ 0 & & & & x_{1iT}' \end{bmatrix} \begin{matrix} \leftarrow t = 3, \dots, T \text{ difference equations} \\ \leftarrow t = 2 \text{ levels equation} \\ \leftarrow t = 3 \text{ levels equation} \\ \vdots \\ \leftarrow t = T \text{ levels equation} \end{matrix}$$

5. If $x_{1it}$ are strictly exogenous, observations on $x_{1it}$ for all t become valid
instruments in the levels equations. But we still only have T extra restrictions:

$$E\left[ (1/T) \sum_{s=1}^{T} u_{is} x_{1it} \right] = 0, \quad t = 1, \dots, T,$$

given those already exploited for the equations in first differences. This implies that the
2-step estimator would combine the (T-1) first difference equations and the average level

Tests for the validity of the GMM estimator

The GMM estimator is consistent if there is no second-order serial correlation in the error
term of the first-differenced equation: it requires $E[\Delta u_{it} \Delta u_{i,t-2}] = 0$. A
test for the validity of the instruments (and the moment restrictions) is a test of
second-order serial correlation in these residuals. The test statistic is

$$m_2 = \frac{\Delta \hat{u}_{-2}' \, \Delta \hat{u}_{*}}{\hat{u}^{1/2}} \overset{asy}{\sim} N(0,1)$$

under the null $E[\Delta u_{it} \Delta u_{i,t-2}] = 0$. $m_2$ is only defined for $T \geq 5$, since
it involves differenced residuals two periods apart. The $\hat{u}$ have a completely hideous
formula that need not concern us but is given in Arellano and Bond equation (9), p.284. (The
$m_2$ test might not reject if the residuals in levels follow a random walk, as well as if the
errors in levels are not serially correlated of order one. To exclude the former (when OLS as
well as GMM would be consistent), you could, for example, check that the first-differenced
residuals have first-order serial correlation.)

The most common test of the instruments is Sargan’s (1958) test of over-identifying
restrictions (reference: Sargan, J. D. (1958), “The estimation of economic relationships using
instrumental variables”, Econometrica, vol.26, 393-415).

$$s = \Delta \hat{u}' W \left[ \sum_{i=1}^{N} W_i' (\Delta \hat{u}_i)(\Delta \hat{u}_i)' W_i \right]^{-1} W' (\Delta \hat{u}) \overset{asy}{\sim} \chi^2_{p-K-1}$$

where p is the number of columns in the instrument matrix, and $\Delta \hat{u}$ are the
residuals from the 2-step estimation of (26). When T=4, for example, the Sargan statistic tests
two linear combinations of the three moment restrictions available:

$$E(\Delta u_{i3} \, y_{i1}) = E(\Delta u_{i4} \, y_{i1}) = E(\Delta u_{i4} \, y_{i2}) = 0.$$

In this case, the Sargan test is available when the $m_2$ test is not (there are no differenced
residuals 2 periods apart, as involved in the $m_2$ test).
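
The Sargan statistic can be sketched for the AR(1) model with no regressors (K=0). The code
follows the formula as written above, building the middle matrix from the two-step
residuals; the simulated data and all function names are assumptions made for illustration.

```python
import numpy as np


def simulate_ar1_panel(n_units, n_periods, gamma, seed=0, burn_in=50):
    """Simulate y_it = gamma * y_i,t-1 + alpha_i + u_it with IID N(0, 1) u_it (assumed design)."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(size=n_units)
    y = alpha / (1.0 - gamma)
    panel = np.empty((n_units, n_periods))
    for step in range(burn_in + n_periods):
        y = gamma * y + alpha + rng.normal(size=n_units)
        if step >= burn_in:
            panel[:, step - burn_in] = y
    return panel


def instrument_matrix(y_i):
    """Block-diagonal W_i with row t holding y_i1, ..., y_i,t-2."""
    n_rows = len(y_i) - 2
    w = np.zeros((n_rows, n_rows * (n_rows + 1) // 2))
    col = 0
    for row in range(n_rows):
        w[row, col:col + row + 1] = y_i[:row + 1]
        col += row + 1
    return w


def sargan_test(y, n_regressors=0):
    """Return (two-step gamma, Sargan statistic s, degrees of freedom p - K - 1)."""
    n = y.shape[1] - 2
    g = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    ws = [instrument_matrix(y_i) for y_i in y]
    dy = y[:, 2:] - y[:, 1:-1]
    dy_lag = y[:, 1:-1] - y[:, :-2]
    w_dy = sum(w.T @ d for w, d in zip(ws, dy))
    w_dy_lag = sum(w.T @ d for w, d in zip(ws, dy_lag))

    def solve(weight):
        a = np.linalg.inv(weight)
        return (w_dy_lag @ a @ w_dy) / (w_dy_lag @ a @ w_dy_lag)

    def outer_sum(resid):
        return sum(np.outer(w.T @ r, w.T @ r) for w, r in zip(ws, resid))

    gamma_1 = solve(sum(w.T @ g @ w for w in ws))             # one-step estimate
    gamma_2 = solve(outer_sum(dy - gamma_1 * dy_lag))         # two-step estimate
    resid_2 = dy - gamma_2 * dy_lag                           # two-step differenced residuals
    moments = sum(w.T @ r for w, r in zip(ws, resid_2))       # W' Delta u-hat
    s = moments @ np.linalg.solve(outer_sum(resid_2), moments)
    dof = ws[0].shape[1] - n_regressors - 1
    return gamma_2, s, dof


if __name__ == "__main__":
    y = simulate_ar1_panel(n_units=3000, n_periods=6, gamma=0.5, seed=2)
    print(sargan_test(y))
```

The statistic would be compared with a chi-squared critical value for the returned degrees
of freedom; with T=6 and no regressors there are 10 instruments and 9 over-identifying
restrictions.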

A related possibility is the Sargan difference test, given by:

$$ds = s - s_I \sim \chi^2_{p - p_I} \tag{29}$$

if the errors in levels are not serially correlated. $s_I$ is defined as the Sargan test above,
except that only the instruments that remain valid when the errors in levels are MA(1) are
included in the W matrix. This test clearly requires the econometrician to have a good idea
about which regressors are strictly exogenous: it is these which will be used as instruments in
the regression that forms the basis for the $s_I$ test. $s_I$ will not reject when errors in
levels are MA(1) as well as when they are MA(0), but ds will reject when errors in levels are
serially correlated, i.e. are MA(1). Since s should already show if serial correlation in
levels is present, the ds test can be regarded as a back-up.

A final possibility is a Hausman test.

$$h = (\hat{\delta}_I - \hat{\delta})' [\widehat{avar}(\hat{\delta}_I) - \widehat{avar}(\hat{\delta})]^{-} (\hat{\delta}_I - \hat{\delta}) \overset{asy}{\sim} \chi^2_r \tag{30}$$

where $\delta$ denotes the coefficient vector and r = rank avar$(\hat{\delta}_I - \hat{\delta})$;
if all variables except the lagged dependent variable are strictly exogenous, then r=1.
$[\,\cdot\,]^{-}$ indicates a generalised inverse. Like the previous test, the h test also
requires a clear idea about which columns of the X matrix are strictly exogenous.

Application: Employment equations for UK companies, 1979-84 (Arellano and Bond,


Arellano and Bond (1991) (AB) apply various procedures to estimate an employment
equation for an unbalanced panel of 140 UK companies for which they have at least 7
continuous observations (on employment, real product wage, capital stock) between 1976
and 1984. (They have 7 observations on 103 companies, 8 on 23 and 9 on 14. Three observations
per company are lost through taking first differences and including lags, so estimation is
over 1979-84. 611 observations are used in estimation.)

Their basic equation is:

$$n_{it} = \gamma_1 n_{i,t-1} + \gamma_2 n_{i,t-2} + \beta'(L) x_{it} + \lambda_t + \alpha_i + u_{it} \tag{31}$$

where $n_{it}$ is the natural log of domestic employment in company i at the end of year t
(which is the accounting year, so varies across companies). They include time-specific
effects $\lambda_t$, where t is the calendar year in which the accounting year ends.

xit include the log of the real product wage (pay bill per employee divided by industry
price, adjusted by average weekly hours worked in manufacturing industries), the log of an
inflation-adjusted estimate of the company’s capital stock (Gross fixed assets), and the log of
industry output (value added), each company having been classified into one of 9 sub-sectors
of manufacturing according to their main product by sales.

The equation can be motivated along the lines of Layard and Nickell’s work. With zero
adjustment costs, a price-setting firm facing a constant elasticity demand curve would choose
to set employment according to a log-linear labour demand equation of the form:

$$n^{*}_{it} = \theta_0 + \theta_1 w_{it} + \theta_2 k_{it} + \theta_3 \sigma^{e}_{it} + \alpha_i \tag{32}$$

where $\theta_1 < 0$, $\theta_2 > 0$ and $\theta_3 > 0$. $w_{it}$ is the log of the real product
wage, $k_{it}$ is the log of gross capital, and $\sigma^{e}_{it}$ is a measure of the expected
demand for the firm’s product relative to potential output (industry output captures industry
demand shocks in the estimated equation (31), and time dummies capture aggregate demand
shocks). If it is costly to change employment, actual employment $n_{it}$ will deviate from
$n^{*}_{it}$ in the short run, suggesting a lag structure as in (31).

AB’s Table 4 shows GMM estimates of equation (31) (based on first differences) [we will
ignore column (d) since it is not relevant for our purposes]. Table 5 shows other estimates
(the two Anderson-Hsiao estimators, OLS and Within-groups). Employment adjustment
does appear to take 2 years, employment responds negatively to current wage rises, and
positively to (industry) output shocks, and is higher, the higher is the capital stock. Column
(c) is AB’s preferred specification, and suggests a long-run wage elasticity of -0.24 (but
s.e.=0.28), and a long-run elasticity w.r.t. capital of 0.7 (s.e.=0.14). Employment appears
affected by changes in industry output (0.890 0.875), which accords with the Layard-
Nickell interpretation that employment responds to movements in demand relative to
potential output.
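
For reference, a long-run elasticity in a model with two lags of the dependent variable is
obtained by setting all dated values equal to their steady-state levels. A short worked
formula, writing the coefficients on the two employment lags as $\gamma_1$ and $\gamma_2$ and
the lag polynomial on a regressor as $\beta(L) = \beta_0 + \beta_1 L + \dots$ (notation
assumed here):

```latex
\begin{aligned}
n &= \gamma_1 n + \gamma_2 n + \beta(1)\, x + \text{const} \\
\Rightarrow \quad \frac{\partial n}{\partial x}\bigg|_{\text{long run}}
  &= \frac{\beta(1)}{1 - \gamma_1 - \gamma_2},
  \qquad \beta(1) = \beta_0 + \beta_1 + \dots
\end{aligned}
```

The long-run elasticities quoted above are ratios of this kind, which is why their standard
errors differ from those of the individual coefficients.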

Columns (a1) and (a2) instrument the lags of the dependent variable with the efficient levels
of employment as in the $W_i$ matrix above. Despite assuming the other regressors
are exogenous, AB don’t exploit any additional restrictions (which would be as in 2. above if
the regressors were correlated with the individual effect, and as in 5. if they were not). (a1)
shows 1-step estimates and (a2) shows 2-step estimates. Increased efficiency resulting from
the 2nd step might be shown in the roughly 30%-lower (asymptotic) standard errors, but AB
also refer to simulation results in which they found 2-step standard errors to be biased
downwards (in finite samples) by around 20% (see AB p.285). Column (b) omits
insignificant dynamics from the 2-step model, with little change in the long-run properties.

Column (c) allows for the fact that the real wage and capital stock may be endogenous. In
principle, these would each be instrumented with own (t-2)-and-earlier lags. In practice, only
(t-2) and (t-3) lags are used as instruments due to computational complexity and relatively
small sample size, but additional instruments are used (lags of company sales and

The instrument tests discussed above are reported under the coefficient estimates. None of
the m2, the Sargan s, and the difference-Sargan ds tests reject the null of serially uncorrelated
errors in the levels equations for the 2-step GMM estimator. The s and ds tests reject for the
1-step estimator, but AB’s simulation suggested these tests reject too often in the presence of
heteroskedasticity. The Hausman test rejects for both 1- and 2-step, but again, in
simulations, AB found this test over-rejects.

AB hypothesise that the rejections reflect the fact that some of the regressors that have been
assumed exogenous are in fact endogenous. When these variables - wages and capital - are
instrumented (column (c)), none of the tests reject the necessary null of no serial correlation
in the levels disturbances.

Column (e) of Table 5 reports the Anderson-Hsiao estimator with the differenced lagged
dependent variable instrumented with the own differenced third lag. The number of
observations falls as one further observation per individual is lost (estimation is over 1980-
84). Column (f) reports the other Anderson-Hsiao estimator, using the third lag of the level
as the instrument. In both cases, the estimates are poorly determined: there appears to be a
large gain in efficiency through using the additional instruments in the AB GMM procedure.

Column (g) reports OLS estimates (these are over 1978-84 as one observation is gained).
The lagged dependent variable coefficient is biased upward, as we would expect in the
presence of firm-specific effects (which we have of course been assuming).

Column (h) reports Within-groups estimates (again, a year is gained). Surprisingly, the
coefficient on the lagged dependent variable is greater than that using the GMM estimators
(we would expect it to be biased downwards in the presence of fixed effects). AB point out
that the endogeneity of some regressors could cloud the comparison between Within-groups
and GMM.
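
The expected directions of bias (OLS upward, Within-groups downward) can be reproduced in a
small simulation of a pure AR(1) panel with fixed effects. This is a sketch on artificial
data, not AB's employment data; the simulation design and function names are assumptions.

```python
import numpy as np


def simulate_ar1_panel(n_units, n_periods, gamma, seed=0, burn_in=50):
    """Simulate y_it = gamma * y_i,t-1 + alpha_i + u_it with IID N(0, 1) u_it (assumed design)."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(size=n_units)
    y = alpha / (1.0 - gamma)
    panel = np.empty((n_units, n_periods))
    for step in range(burn_in + n_periods):
        y = gamma * y + alpha + rng.normal(size=n_units)
        if step >= burn_in:
            panel[:, step - burn_in] = y
    return panel


def pooled_ols(y):
    """OLS of y_it on y_i,t-1 with a common intercept, ignoring the fixed effects."""
    x = y[:, :-1].ravel()
    z = y[:, 1:].ravel()
    x_c = x - x.mean()
    z_c = z - z.mean()
    return (x_c @ z_c) / (x_c @ x_c)


def within_groups(y):
    """OLS after subtracting each unit's time-mean from both sides."""
    x = y[:, :-1]
    z = y[:, 1:]
    x_d = x - x.mean(axis=1, keepdims=True)
    z_d = z - z.mean(axis=1, keepdims=True)
    return np.sum(x_d * z_d) / np.sum(x_d * x_d)


if __name__ == "__main__":
    y = simulate_ar1_panel(n_units=5000, n_periods=6, gamma=0.5, seed=3)
    print("pooled OLS:   ", pooled_ols(y))
    print("within-groups:", within_groups(y))
```

With a short panel the two estimates bracket the true coefficient, which is the pattern one
would expect when all other regressors are exogenous.
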
Using DPD

The above relates directly to the estimators available in DPD. The package gives the options:

State form of model

- type 0 for levels
1 for first differences
2 for orthogonal deviations
3 for combined first differences and levels
4 for combined first differences and average level
5 for combined orthogonal deviations and levels
6 for combined orthogonal deviations and average level
7 for within groups
8 for error components generalised least squares

Model 0 - levels
is OLS. Use this if you have a static model and don’t want to allow intercepts to vary across

Model 1 - first differences

is OLS on first differences, i.e. the fixed effects model transformed using first differencing to
remove the fixed effects. Use this if you have a static model in which you think intercepts
vary across individuals and are non-random. Arguments for fixed rather than random
include: my regressors are correlated with the error term; I am content to make inferences
conditional on the set of individuals in my dataset (e.g. I can talk happily about what matters
for OECD countries, and I don’t want to make inferences about all countries in the world);
everyone else uses the fixed effects model too.

Model 7 - within groups

is OLS on data demeaned by the Within transformation, i.e. the fixed effects model
transformed by subtracting time-means to eliminate the fixed effects. Use this if you have a
static model in which you think intercepts vary across individuals and are non-random.
Rationales for using this model are as for Model 1, plus additional efficiency since you don’t
lose a time period through differencing.

Model 2 - orthogonal deviations

is OLS on data demeaned by the orthogonal deviations transformation, i.e. the fixed effects
model transformed by subtracting forward time-means to eliminate the fixed effects (see
Note below for more detail on orthogonal deviations). Use this if you have a static model in
which you think intercepts vary across individuals and are non-random. Rationales for using
this model are as for Model 7. In addition, it has computational advantages: it reduces the
size of the computational problem of calculating the instrumental variables estimators.
[Arellano and Bover (1995), “Another look at the instrumental variables estimation of error-
component models”, Journal of Econometrics, vol.68, 29-51, has more detail and

Model 8 - error components generalised least squares

is GLS. Use this if you have a static model in which you think intercepts vary across
individuals and are random. Model 8 is inconsistent unless all regressors are strictly
exogenous. Arguments for random rather than fixed effects include: I want to draw
implications for the whole population, rather than make inferences conditional on my

Model 3 - combined first differences and levels

is a version of the Arellano-Bond estimator. Use this if you have a dynamic model in which
you think intercepts vary across individuals and are non-random, and if your model includes
other independent regressors which are not correlated with the fixed effects, but not all of
which are strictly exogenous (i.e. some are predetermined). gmm() gives asymptotically
efficient instruments for these regressors.

Model 4 - combined first differences and average level

is a version of the Arellano-Bond estimator. Use this if you have a dynamic model in which
you think intercepts vary across individuals and are non-random, and if your model includes
other independent regressors, and all of these regressors are not correlated with the fixed
effects and are strictly exogenous.

Model 5 - combined orthogonal deviations and levels

is a version of the Arellano-Bond estimator.

Model 6 - combined orthogonal deviations and average level

is a version of the Arellano-Bond estimator.

DPD then gives the option of various forms of constant term:

Select from choice of constants to be included

- type 0 for none
1 for time dummies only
2 for time dummies interacted with industry dummies
and if you chose model 0 or any of 3-8 you can also choose:
3 for constant only
4 for industry dummies only
5 for time dummies and industry dummies

For models 0-2 and 7 you are given the choice of robust test statistics and 2-step estimates (a
yes/no decision). 1-step estimates involve a known G matrix. Estimates using first
differences use the G matrix defined above. Estimates using levels or orthogonal deviations
use the identity matrix in place of this G matrix. If 1-step residuals are heteroskedastic,
efficiency will be increased by using these residuals in a second step.

Models 3-6 and 8 imply 2-step and robust estimates. Models 3-6 will automatically use the
instrument matrix you define in the DPD command program.

Note: Orthogonal deviations

An orthogonal deviation $x^{*}_{it}$ is given by:

$$x^{*}_{it} = \left[ x_{it} - \frac{x_{i,t+1} + \dots + x_{iT}}{T-t} \right] \left( \frac{T-t}{T-t+1} \right)^{1/2}, \quad t = 1, \dots, T-1$$

An orthogonal deviation is the deviation of the observation from the average of future
observations in the sample, $x_{it} - \frac{x_{i,t+1} + \dots + x_{iT}}{T-t}$; this deviation is
then weighted to standardise the variance, by multiplying by
$\left( \frac{T-t}{T-t+1} \right)^{1/2}$. If the original errors are IID, so will be the
errors using orthogonal deviations. Note that an estimator that uses orthogonal deviations is,
like that using deviations from the full-sample mean, sometimes known as a within-groups
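
The orthogonal deviations transformation can be written as a (T-1) x T matrix applied to
each unit's series. The sketch below builds that matrix and checks the two properties used
above: it removes anything constant over time, and it maps IID errors into IID errors. The
function name is hypothetical.

```python
import numpy as np


def orthogonal_deviations_matrix(n_periods):
    """(T-1) x T matrix A such that A @ x gives the orthogonal deviations of x."""
    a = np.zeros((n_periods - 1, n_periods))
    for row in range(n_periods - 1):           # row 0 corresponds to t = 1
        remaining = n_periods - (row + 1)      # number of future observations, T - t
        weight = np.sqrt(remaining / (remaining + 1))
        a[row, row] = weight                   # weight on x_it
        a[row, row + 1:] = -weight / remaining # minus the weighted future average
    return a


if __name__ == "__main__":
    a = orthogonal_deviations_matrix(5)
    print(np.round(a @ a.T, 12))               # covariance of transformed IID errors
    print(np.round(a @ np.ones(5), 12))        # effect on a fixed effect
```

Because the rows of the matrix are orthonormal, transformed errors keep the same variance
and stay uncorrelated, and because each row sums to zero, the fixed effects drop out.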

