ESE 302   Tony E. Smith

NOTES ON THE AUTOCORRELATION PROBLEM

The following notes illustrate the problem of temporally autocorrelated regression residuals that
may arise when using time-series data (and represent the most common violation of the
independent-residual assumption in regression modeling). Here the Durbin-Watson statistic is
shown to provide a diagnostic tool for identifying temporal autocorrelation, and two-stage least
squares is shown to be one possible method for removing this effect.

This development will utilize the sales forecasting data set, Sales.jmp, in the class directory. The
question of interest for this particular data set is whether per capita income levels can be used to
predict retail sales. To answer this question, data on annual retail sales, sales, and annual per
capita income, pci, in the US have been collected for a period of 15 years. A (Fit Y by X)
regression of sales on pci yields the following results:

Figure 1. Initial Regression of Sales on PCI

The parameter estimates are seen to be very significant, and the R-squared value is quite
impressive. But observe that the residuals appear to exhibit a “cyclical” pattern about the


regression line, suggesting that they are not independent, in the sense that neighbors of positive
residuals tend to be positive, and similarly, that neighbors of negative residuals tend to be
negative. This cyclical dependency can be seen even more clearly by plotting the residuals of this
regression (click the red triangle next to Linear Fit, and select Plot Residuals). In Figure 2 below,
only two of the four resulting plots are shown:

[Residual by X plot: sales residuals (roughly -1.0 to 1.0) plotted against pci (16 to 24).
Residual by Row plot: sales residuals plotted against row number (0 to 15).]

Figure 2. Residual Plots

The top plot, Residual by X, essentially flattens the regression line to a horizontal base line and
plots the size of each regression deviation about this line. But since the rows of the data table are
ordered by year, it is the second plot, Residual by Row, which actually shows that these residuals
are exhibiting a cyclical pattern in time. The reason why this pattern also appears in the upper
plot is that the explanatory variable, pci, happens to exhibit the same ordering, i.e., per capita
incomes are uniformly increasing over this 15-year period. More generally, however, such
patterns may not be apparent when simply plotting residuals against explanatory variables. This
is why the Residual by Row plot is so useful for detecting temporal autocorrelation. (Notice also
that this plot works equally well for multiple regressions since there is exactly one residual for
each time point, no matter how many explanatory variables are used.) But even when looking at
such plots, the presence of temporally autocorrelated residuals may not always be this obvious,
especially when key explanatory variables are missing. So it is desirable to develop statistical
tests for identifying significant autocorrelation effects.

1. Durbin-Watson Statistic

The single most commonly used test is the Durbin-Watson test. This test is based on the simple
observation that if residuals are autocorrelated, then neighboring residuals should tend to be more
similar in value than arbitrary pairs of residuals. This suggests that sums of squared differences


between neighboring residuals should tend to be small relative to sums of squared residuals
themselves. More formally, if for any given set of residuals, (ε t : t = 1,.., T ) , the ratio

(1)  D = Σ_{t=2}^{T} (ε_t − ε_{t−1})² / Σ_{t=1}^{T} ε_t²

is designated as the Durbin-Watson statistic, D, then values of D for temporally autocorrelated


residuals should tend to be small. This suggests that a test for such autocorrelation effects can be
constructed in terms of this statistic. But before doing so, notice that these regression residuals
are indexed by “t” to denote different time periods. While other types of orderings may in some
cases be relevant, we focus exclusively on time orderings. Observe also that the summation in
the numerator of D starts at t = 2 , since period t − 1 is not defined for t = 1 . (More generally, for
any sequence of T values, there are only T − 1 successive differences of these values.)
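
For readers who want to compute D directly, the following is a minimal Python/NumPy sketch of expression (1). Python is not part of the original JMP/Matlab workflow, and the residual series shown is purely illustrative.

    import numpy as np

    def durbin_watson(resid):
        """Durbin-Watson statistic of expression (1): the sum of squared successive
        differences of the residuals divided by the sum of squared residuals."""
        resid = np.asarray(resid, dtype=float)
        num = np.sum(np.diff(resid) ** 2)   # sum over t = 2,..,T of (e_t - e_{t-1})^2
        den = np.sum(resid ** 2)            # sum over t = 1,..,T of e_t^2
        return num / den

    # Purely illustrative residual series (not the Sales.jmp residuals)
    example_resid = np.array([0.3, 0.5, 0.4, -0.1, -0.4, -0.2, 0.1, 0.6])
    print(durbin_watson(example_resid))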

To construct a test of autocorrelation based on D, one begins (as always) by asking how D would
behave under the null hypothesis of “no autocorrelation”. In other words, what would the
distribution of D look like if the residuals (ε t : t = 1,.., T ) satisfied the standard regression
assumptions that

(2)  ε_t ~ iid N(0, σ_ε²) ,  t = 1,…,T

While it is difficult to characterize this distribution explicitly, one can in fact calculate its mean
explicitly as follows. Under hypothesis (2) it can be shown 1 that values of the ratio, D, are
statistically independent of the values of its denominator, Σ_{t=1}^{T} ε_t², so that by (1),

(3)  Σ_{t=2}^{T} (ε_t − ε_{t−1})² = D · Σ_{t=1}^{T} ε_t²

     ⇒  E[ Σ_{t=2}^{T} (ε_t − ε_{t−1})² ] = E[ D · Σ_{t=1}^{T} ε_t² ] = E(D) · E[ Σ_{t=1}^{T} ε_t² ]

     ⇒  E(D) = E[ Σ_{t=2}^{T} (ε_t − ε_{t−1})² ] / E[ Σ_{t=1}^{T} ε_t² ]
Thus it suffices to calculate the means of the numerator and denominator separately, as follows.
Turning first to the denominator, and noting from (2) that

(4)  σ_ε² = var(ε_t) = E(ε_t²) − [E(ε_t)]² = E(ε_t²) ,  t = 1,…,T

Footnote 1: This follows from the celebrated Koopmans-Pitman Theorem, which is best understood in terms of Koopmans' original proof (p. 18) in Koopmans, T.C. (1942), "Serial Correlation and Quadratic Forms in Normal Variables", Annals of Mathematical Statistics, 13: 14-33.


we see that

(5)  E[ Σ_{t=1}^{T} ε_t² ] = Σ_{t=1}^{T} E(ε_t²) = Σ_{t=1}^{T} σ_ε² = T σ_ε²

Similarly, by expanding the numerator and recalling from independence that

(6)  0 = cov(ε_t, ε_{t′}) = E(ε_t ε_{t′}) − E(ε_t) E(ε_{t′}) = E(ε_t ε_{t′})

for all distinct time periods, t and t ′ , we see that

(7)  E[ Σ_{t=2}^{T} (ε_t − ε_{t−1})² ] = Σ_{t=2}^{T} E[(ε_t − ε_{t−1})²]

                                      = Σ_{t=2}^{T} E[ε_t² − 2 ε_t ε_{t−1} + ε_{t−1}²]

                                      = Σ_{t=2}^{T} E(ε_t²) − 2 Σ_{t=2}^{T} E(ε_t ε_{t−1}) + Σ_{t=2}^{T} E(ε_{t−1}²)

                                      = Σ_{t=2}^{T} σ_ε² − 2 Σ_{t=2}^{T} (0) + Σ_{t=2}^{T} σ_ε²

                                      = 2(T − 1) σ_ε²

Thus it follows from (3), (5) and (7) that

2(T − 1) σ ε2  T −1
(8)=
E ( D) = 2 
Tσ ε2
 T 

So under hypothesis (2) we see that (for any reasonably sized T )

(9)  E(D) ≈ 2

Moreover, for positively autocorrelated residuals we have already seen that squared
differences, (ε_t − ε_{t−1})², should tend to be small, so that each E[(ε_t − ε_{t−1})²] in the numerator of
(1) should also be small. Thus for positively correlated residuals, E(D) should lie between 0 and
2. Finally, if residuals are negatively correlated, so that neighbors tend to have opposite signs, the
same argument suggests that the mean squared differences, E[(ε_t − ε_{t−1})²], should tend to be large,
so that E(D) lies above 2. While negative autocorrelation is of far less interest for our purposes, it
is worth noting that in the extreme case where ε_t ≡ −ε_{t−1} [so that ε_t − ε_{t−1} ≡ 2 ε_t], we can actually
approximate (1) as follows:


(10)  E(D) = E[ Σ_{t=2}^{T} (2 ε_t)² / Σ_{t=1}^{T} ε_t² ] = 4 E[ Σ_{t=2}^{T} ε_t² / Σ_{t=1}^{T} ε_t² ] ≈ 4(1) = 4

So in summary, the mean behavior of D can be neatly summarized as follows:

[Number line: E(D) = 0 (positive correlation), E(D) = 2 (independent), E(D) = 4 (negative correlation)]

Figure 3. Range of E(D) values
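
As a quick numerical check on this classification (an added illustration, not part of the original notes), one can simulate independent, positively autocorrelated, and negatively autocorrelated residual series and average the resulting D values. The sketch below assumes NumPy and uses an AR(1) generator with the stationary start introduced in expression (19) below.

    import numpy as np

    rng = np.random.default_rng(0)
    T = 200

    def dw(e):
        # Durbin-Watson statistic of expression (1)
        return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

    def ar1(rho, T, rng):
        # AR(1) residual series with a stationary start, as in expression (19) below
        e = np.empty(T)
        e[0] = rng.normal() / np.sqrt(1 - rho ** 2)
        for t in range(1, T):
            e[t] = rho * e[t - 1] + rng.normal()
        return e

    d_iid = np.mean([dw(rng.normal(size=T)) for _ in range(2000)])   # about 2(T-1)/T
    d_pos = np.mean([dw(ar1(0.8, T, rng)) for _ in range(2000)])     # well below 2
    d_neg = np.mean([dw(ar1(-0.8, T, rng)) for _ in range(2000)])    # well above 2
    print(d_iid, d_pos, d_neg)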

2. Durbin-Watson Test

Using this statistic, we now construct an explicit test for autocorrelation based on the null
hypothesis in (2) above. To do so, we begin by estimating this statistic in the obvious way,
namely by using the estimated residuals, (εˆt : t = 1,.., T ) , obtained from the regression in Figure 1
above. This yields the corresponding test statistic,

(11)  d = Σ_{t=2}^{T} (ε̂_t − ε̂_{t−1})² / Σ_{t=1}^{T} ε̂_t²

While one can of course save these residuals and construct this statistic explicitly, it is not
surprising that this construction is available in JMP. Here it is important to emphasize that while
the Fit Y by X option was useful for plotting residuals in alternative ways, the Durbin-Watson
test is only available in the Fit Model option, which of course allows simple as well as multiple
regressions. So the first task here is to redo this regression using Fit Model. Having done so, the
Durbin-Watson test can be accessed by clicking the red triangle next to Response Sales, and using
the path Row Diagnostics > Durbin Watson test. The result will now appear at the bottom of
the regression tableau, as shown in Figure 4 below:

Figure 4. Durbin Watson Test



Here the number on the left is the value of the Durbin-Watson test statistic, d = 0.8034. Thus we
see that d is less than 2, and indeed, is closer to 0 than it is to 2. From the arguments above, this
certainly suggests that these residuals are positively autocorrelated -- but we have yet to develop
an actual test of this assertion.

To do so, it is important to emphasize that even under the null hypothesis in (2), the distribution of
the test statistic, d, in (11) is much more complex than the distribution of D in (1). In particular,
since each residual estimate, εˆt , depends explicitly on the values of the explanatory data, pci, as
well as the sales data, the distribution of d must also depend on this data. In fact this distribution
is so complex, that it can only be estimated by simulation methods. In the present case, this
amounts to sampling many realizations of (ε t : t = 1,.., T ) from the joint normal distribution in
(2), and computing the corresponding values of d for each such realization, as shown by the
histogram of 1000 simulated d values in Figure 5 below: 2

[Histogram of the 1000 simulated d values, spanning roughly 0.5 to 4, with the observed value d = 0.8034 marked on the d axis.]

Figure 5. Durbin-Watson P-values

Notice also that the sample mean of this d-distribution is larger than 2 (about 2.15 in this
case). This would appear to contradict (8), which is slightly less than 2. But, as stated above, the
distribution of the estimator, d, is different from that of D (and in particular, depends on the
given data values).

Given this simulated distribution, the p-value for a one-sided test of hypothesis (2) can now be
calculated by determining the fraction of d samples which do not exceed the observed value,
d = 0.8034, shown in the Figure. This fraction, which can be seen to be very small, is in this case
given by:

(12)  p-value = 0.0014

Footnote 2: It can be shown that the sampled values of d are independent of the value of σ_ε² in hypothesis (2), which is here set equal to 1. The actual simulation was programmed and implemented in Matlab.
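
The following is a rough Python sketch of the same simulation idea (the original simulation was done in Matlab, as noted in Footnote 2). The 15 equally spaced pci values are a stand-in assumption rather than the actual class data, and σ is set to 1 as in the footnote.

    import numpy as np

    rng = np.random.default_rng(1)

    # Stand-in for the 15 annual pci values (an assumption, not the Sales.jmp data)
    x = np.linspace(16, 24, 15)
    X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept

    def dw(e):
        # Durbin-Watson statistic, as in expression (11)
        return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

    def ols_resid(y, X):
        # OLS residuals y - X * beta_hat
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return y - X @ beta

    # Null distribution of d: draw iid N(0,1) errors, regress them on the fixed
    # design, and compute d from the estimated residuals
    d_null = np.array([dw(ols_resid(rng.normal(size=len(x)), X)) for _ in range(10000)])

    d_obs = 0.8034                          # observed value from Figure 4
    p_value = np.mean(d_null <= d_obs)      # one-sided p-value: fraction of d <= d_obs
    print(d_null.mean(), p_value)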


To display such a p-value in JMP, one must click the red triangle next to Durbin-Watson in
Figure 4, and then click on “Significant P Value” to obtain:

Figure 6. Durbin-Watson P-value Display

Notice that the resulting value, 0.0011, is slightly different from (12) because a more complex
exact procedure is used in JMP. 3 But such small variations have little effect on the result, namely
that these residuals are significantly positively autocorrelated.

3. Consequences of Autocorrelation

Before proceeding, it is important to ask what effect autocorrelation has on the regression results.
To explore this question graphically, the left panel in Figure 7 below illustrates a linear model
with autocorrelated residuals very similar to the present example, where y = sales and where the
explanatory variable, x = pci, is increasing in time (so that the x-y plot reveals the
autocorrelation).

Figure 7. Underestimation of Residual Variance

If one estimates this particular model with linear regression, then since the sum of squared
residuals is always minimized by definition, the resulting regression line will tend to be closer to

Footnote 3: The calculation of this p-value in JMP involves the (approximate) integration of a certain complex-valued integral transform. If you want further details, look at the SAS online documentation of their autoreg function at https://support.sas.com/documentation/cdl/en/etsug/68148/HTML/default/viewer.htm#etsug_autoreg_details27.htm, since these details are not available in the JMP documentation. All that is said in the JMP documentation is that "the computation of this exact probability can be memory and time-intensive if there are many observations". This is why the calculation of this exact p-value is made "optional" in their Durbin-Watson test.


the data points, as shown in the right panel. So it should be clear that this procedure will tend to
underestimate the actual sum of squared residuals, and thus produce a root-mean-square
estimate, σˆ ε , which underestimates the true standard error, σ ε , of the residuals in the left panel.
But since the standard error of the slope estimator, βˆ1 , for the regression in Figure 1 was shown
in class to have the form,

(13)  σ(β̂_1) = σ_ε / √( Σ_t (x_t − x̄)² )

it follows that the standard error estimator,

(14)  s_1 = σ̂_ε / √( Σ_t (x_t − x̄)² )

will also underestimate (13). (A more precise analysis of this issue is given in Appendix 1
below.) Finally, since this estimator appears in the denominator of the associated t-ratio,
t1 = βˆ1 / s1 [and since the numerator, βˆ1 , continues to provide an unbiased estimate of its true
value, β1 ], it follows that t1 will tend to be too large – thus inflating the significance of pci. This
simple illustration underscores the most important practical problem with temporal
autocorrelation, namely its tendency to make regression results look too significant. So to draw
meaningful inferences about the significance of explanatory variables, it is important to correct
for such autocorrelation effects.
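
The size of this effect can be illustrated by simulation. The sketch below (an added illustration under assumed parameter values, not part of the original notes) generates data from a simple regression with AR(1) errors over an increasing x-series, and compares the average nominal standard error in (14) with the actual sampling spread of the slope estimates.

    import numpy as np

    rng = np.random.default_rng(2)
    T, rho, beta0, beta1 = 15, 0.8, 1.0, 0.5        # assumed illustrative values
    x = np.linspace(16, 24, T)                      # increasing x-series, as with pci
    X = np.column_stack([np.ones(T), x])

    def ar1_errors(rho, T, rng):
        u = np.empty(T)
        u[0] = rng.normal() / np.sqrt(1 - rho ** 2)  # stationary start, as in (19)
        for t in range(1, T):
            u[t] = rho * u[t - 1] + rng.normal()
        return u

    slopes, nominal_se = [], []
    for _ in range(5000):
        y = beta0 + beta1 * x + ar1_errors(rho, T, rng)
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        e = y - X @ b
        s2 = np.sum(e ** 2) / (T - 2)                # usual residual variance estimate
        slopes.append(b[1])
        nominal_se.append(np.sqrt(s2 / np.sum((x - x.mean()) ** 2)))   # expression (14)

    print(np.std(slopes))        # actual sampling spread of the slope estimates
    print(np.mean(nominal_se))   # average nominal standard error -- noticeably smaller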

4. The First-Order Autocorrelation Model

While the Durbin-Watson test provides a fairly general method for detecting autocorrelation,
there is no equally general method for correcting this problem. The difficulty is that
autocorrelation itself can take many forms. But there is nonetheless one simple probabilistic
model of autocorrelation which is sufficiently robust to allow a relatively general correction
procedure to be developed. This model, known as the first-order autocorrelation model [or
AR(1) model ] postulates that time dependencies between residuals are very local in nature. For
our present purposes, it is convenient to formalize this notion in a general multivariate setting as
follows. For any temporal regression model,

(15)  y_t = β_0 + Σ_{j=1}^{k} β_j x_{tj} + u_t ,  t = 1,…,T

the residuals (ut : t = 1,.., T ) are said to exhibit first-order autocorrelation whenever they are
related in the following way

(16)  u_t = ρ u_{t−1} + ε_t ,  t = 2,…,T


where (ε t : t = 1,.., T ) is a sequence of iid normal random variables [as in (2) above]. So all
dependencies of ut on the past (ut −1 , ut −2 ,...) are assumed to be fully captured by the previous
period, ut −1 . The parameter, ρ , determines the sign of this autocorrelation effect, and is thus
designated the coefficient of autocorrelation. The additional randomness term, ε t , is by
construction independent of the past, and is usually referred to as the innovation occurring in
period t. Finally if ρ = 0 , then the resulting process of innovations yields precisely the standard
regression case. So in this setting, hypothesis (2) is equivalent to the (null) hypothesis that ρ = 0 .

Observe however that condition (16) is not fully complete, and in particular, says nothing about
the initial period, u1 . It will turn out that the specification of u1 has no effect on the correction
scheme we will employ below (and for this reason, is often set equal to ε1 for completeness).
However it does make a difference from a theoretical perspective, as can be seen as follows. If
we set u1 = ε1 , then it follows from (16) that

(17)  u_2 = ρ u_1 + ε_2 = ρ ε_1 + ε_2  ⇒  var(u_2) = ρ² σ_ε² + σ_ε² > σ_ε² = var(u_1)

Similarly, residual variances must continue to increase each period, which makes very little
sense from a behavioral viewpoint. So it is much more natural to suppose that the variances of
these correlated residuals remain constant over time, as is implicit in the standard regression
assumption (2). This “steady state” assumption is easily modeled by observing that if this
constant variance is denoted by σ_u², then by the same argument as in (17) it follows that

(18)  var(u_2) = ρ² var(u_1) + var(ε_2)  ⇒  σ_u² = ρ² σ_u² + σ_ε²

      ⇒  (1 − ρ²) σ_u² = σ_ε²

      ⇒  σ_u² = σ_ε² / (1 − ρ²)

Notice that (18) is only meaningful under the "stationarity condition" that |ρ| < 1. Given this
condition, it follows that if we now set u_1 = ε_1 / √(1 − ρ²), then by definition the steady-state
relation in (18) is automatically satisfied. 4 So the complete temporal regression model for our
purposes is (15) together with the condition that

(19)  u_t = ε_1 / √(1 − ρ²)       for t = 1
      u_t = ρ u_{t−1} + ε_t       for t > 1

Footnote 4: Alternatively, if one allows residual variance to keep increasing over successive time periods, then by extending the argument in expression (17), it can be shown that if |ρ| < 1, then the limiting residual variance is given by σ_u² = [Σ_{t=1}^{∞} (ρ²)^{t−1}] σ_ε² = σ_ε² + ρ² σ_ε² + (ρ²)² σ_ε² + ⋯ = [1/(1 − ρ²)] σ_ε², which is exactly (18). So an equivalent interpretation of this steady state is that the observed data is part of a temporal sequence that started "long ago".


for some sequence of innovations as in (2).
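
A small simulation may help fix ideas here. The sketch below (an added illustration with an assumed ρ and σ_ε) generates many short series from (19) and checks that the variance of u_t stays constant across periods at the steady-state value σ_ε²/(1 − ρ²) derived in (18).

    import numpy as np

    rng = np.random.default_rng(3)
    rho, sigma_eps, T, reps = 0.7, 1.0, 6, 20000     # assumed illustrative values

    # Simulate model (19) many times in parallel
    u = np.empty((reps, T))
    u[:, 0] = rng.normal(scale=sigma_eps, size=reps) / np.sqrt(1 - rho ** 2)   # t = 1
    for t in range(1, T):
        u[:, t] = rho * u[:, t - 1] + rng.normal(scale=sigma_eps, size=reps)   # t > 1

    print(u.var(axis=0))                        # roughly constant across all periods
    print(sigma_eps ** 2 / (1 - rho ** 2))      # steady-state variance from (18)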

5. Correcting for Autocorrelation

If autocorrelation effects are assumed to be first-order in nature, then a natural correction
procedure can be motivated by observing that if the value of the autocorrelation coefficient, ρ,
were known, then such effects could easily be eliminated as follows. Starting with the regression
model in (15), we can construct a new model with essentially the same beta coefficients by
considering the following lagged variables:

(20)  Z_t = Y_t − ρ Y_{t−1} ,  t = 2,…,T

(21)  w_{tj} = x_{tj} − ρ x_{t−1,j} ,  t = 2,…,T ,  j = 1,…,k

With these definitions, it follows from (15) together with (16) that

(22)  Z_t = Y_t − ρ Y_{t−1} = ( β_0 + Σ_{j=1}^{k} β_j x_{tj} + u_t ) − ρ ( β_0 + Σ_{j=1}^{k} β_j x_{t−1,j} + u_{t−1} )

           = (1 − ρ) β_0 + Σ_{j=1}^{k} β_j ( x_{tj} − ρ x_{t−1,j} ) + ( u_t − ρ u_{t−1} )

      ⇒  Z_t = α_0 + Σ_{j=1}^{k} β_j w_{tj} + ε_t ,  t = 2,…,T

where α_0 = (1 − ρ) β_0. The key point to observe here is that by (16) the "innovation" residuals
(ε_t : t = 2,…,T) exhibit no autocorrelation. So by this change of variables, we obtain a standard
linear regression model involving the same slope coefficients, (β_j : j = 1,…,k), and a simple
(known) multiple of the original intercept, β_0:

(23)  Z_t = α_0 + Σ_{j=1}^{k} β_j w_{tj} + ε_t ,  t = 2,…,T

(24)  ε_t ~ iid N(0, σ_ε²)

While this standard model is not operational without knowing the value of ρ, it nonetheless
suggests that a good approximation can be obtained by finding a reasonable estimate, ρ̂, of ρ.

Here a natural estimate of ρ can be obtained as follows. If for any given set of data,
(y_t, x_{t1},…,x_{tk}), t = 1,…,T, it is true that the residual estimates


(25)  û_t = y_t − ŷ_t = y_t − ( β̂_0 + Σ_{j=1}^{k} β̂_j x_{tj} )
)
resulting from regression (15) are first-order autocorrelated, then by replacing each ut in (16)
with its estimate, uˆt , in (25), it is reasonable to suppose that these estimates should satisfy the
relation:

(26)  û_t = ρ û_{t−1} + ε_t ,  t = 2,…,T

If so, then (26) itself constitutes a "no-intercept" regression model that can be used to estimate
ρ, with the resulting estimate, ρ̃, given by

(27)  ρ̃ = Σ_{t=2}^{T} û_t û_{t−1} / Σ_{t=2}^{T} û_t²

While this regression procedure suffers from certain theoretical problems, 5 it can be shown that
(27) yields a consistent estimator of ρ under quite general conditions. However, the estimator
often used in practice (and in particular in JMP) is a slight modification of (27) in which û_1² is
added to the denominator to obtain: 6

(28)  ρ̂ = Σ_{t=2}^{T} û_t û_{t−1} / Σ_{t=1}^{T} û_t²

This is precisely the Autocorrelation estimate, ρ̂ = 0.4821, shown next to the Durbin-Watson
statistic in Figure 4 above. One conceptual advantage of this estimator is its direct relation to the
sample estimator, d, of the Durbin-Watson statistic in (1), which can be seen by expanding d as
follows (using û_t rather than ε̂_t):

(29)  d = Σ_{t=2}^{T} (û_t − û_{t−1})² / Σ_{t=1}^{T} û_t²

        = Σ_{t=2}^{T} û_t² / Σ_{t=1}^{T} û_t²  +  Σ_{t=2}^{T} û_{t−1}² / Σ_{t=1}^{T} û_t²  −  2 Σ_{t=2}^{T} û_t û_{t−1} / Σ_{t=1}^{T} û_t²

        ≈ 1 + 1 − 2 ρ̂

      ⇒  d ≈ 2(1 − ρ̂)

Footnote 5: In particular, regression model (26) suffers from the "endogeneity" problem that each û_t serves as both the dependent variable in one equation, û_t = ρ û_{t−1} + ε_t, and the independent (explanatory) variable in another equation, û_{t+1} = ρ û_t + ε_{t+1}.

Footnote 6: Note that ρ̂ is also a consistent estimator, since ρ̂ = (ρ̃)(Σ_{t=2}^{T} û_t² / Σ_{t=1}^{T} û_t²) → (ρ)(1) = ρ in probability.


So within the context of the first-order autocorrelation model above, the (more general) Durbin-
Watson statistic, d, is essentially an equivalent form of this autocorrelation statistic, ρ̂ . In
particular, with respect to the classification scheme in Figure 3 above, we now have the
approximate correspondence:

(30)  ρ̂ = 1   ↔  d = 0
      ρ̂ = 0   ↔  d = 2
      ρ̂ = −1  ↔  d = 4

which is seen to strengthen the interpretation of this classification scheme in terms of
corresponding levels of autocorrelation.
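
A minimal sketch of these two statistics side by side (an added Python illustration using a made-up residual series, not the Sales.jmp residuals) shows the approximation d ≈ 2(1 − ρ̂) of expression (29) directly:

    import numpy as np

    def rho_hat(u_hat):
        # Autocorrelation estimate of expression (28)
        u_hat = np.asarray(u_hat, dtype=float)
        return np.sum(u_hat[1:] * u_hat[:-1]) / np.sum(u_hat ** 2)

    def dw(u_hat):
        # Durbin-Watson statistic, as in expression (11)
        u_hat = np.asarray(u_hat, dtype=float)
        return np.sum(np.diff(u_hat) ** 2) / np.sum(u_hat ** 2)

    # Made-up residual series (purely illustrative)
    u = np.array([0.4, 0.6, 0.3, -0.2, -0.5, -0.3, 0.0, 0.4, 0.5, 0.1])
    print(rho_hat(u), dw(u), 2 * (1 - rho_hat(u)))   # d should be close to 2(1 - rho_hat)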

Given this estimator of ρ , we can now approximate the “corrected” model in [(23),(24)] above
by simply replacing ρ with ρ̂ in expressions (20) and (21) above to obtain the corresponding
approximate lag variables,

(31)  Ẑ_t = Y_t − ρ̂ Y_{t−1} ,  t = 2,…,T

(32)  ŵ_{tj} = x_{tj} − ρ̂ x_{t−1,j} ,  t = 2,…,T ,  j = 1,…,k

Finally, using these variables, one can proceed to estimate the corresponding approximated
version of model (23):

(33)  Ẑ_t = α_0 + Σ_{j=1}^{k} β_j ŵ_{tj} + ε_t ,  t = 2,…,T

The resulting slope estimates, ( βˆ1 ,.., βˆk ) , in this two-stage regression procedure can now be
used to estimate ( β1 ,.., β k ) in model (15) above. Similarly, the intercept estimate, α̂ 0 , can be
used to estimate β 0 by setting

(34)  β̂_0 = α̂_0 / (1 − ρ̂)

But before doing so, the key question to be addressed is whether or not the autocorrelation effects
in (16) have been effectively removed by this procedure.
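
Before turning to the JMP illustration, the whole two-stage procedure (essentially a single Cochrane-Orcutt-type iteration) can be summarized in a short Python sketch. The sales-like and pci-like series below are synthetic stand-ins for the Sales.jmp columns, and the helper names are hypothetical.

    import numpy as np

    def ols(y, X):
        # OLS coefficients and residuals for design matrix X
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        return b, y - X @ b

    rng = np.random.default_rng(4)
    T = 15
    x = np.linspace(16, 24, T)                        # stand-in for pci (assumption)
    u = np.empty(T)
    u[0] = 0.3 * rng.normal() / np.sqrt(1 - 0.5 ** 2) # stationary start
    for t in range(1, T):
        u[t] = 0.5 * u[t - 1] + 0.3 * rng.normal()    # AR(1) errors
    y = 2.0 + 0.8 * x + u                             # stand-in for sales (assumption)

    # Stage 1: original regression (15) and autocorrelation estimate (28)
    X1 = np.column_stack([np.ones(T), x])
    b1, uhat = ols(y, X1)
    rho = np.sum(uhat[1:] * uhat[:-1]) / np.sum(uhat ** 2)

    # Stage 2: lagged differences (31)-(32) and regression (33) on T-1 observations
    z = y[1:] - rho * y[:-1]                          # Z_t = Y_t - rho_hat * Y_{t-1}
    w = x[1:] - rho * x[:-1]                          # w_t = x_t - rho_hat * x_{t-1}
    X2 = np.column_stack([np.ones(T - 1), w])
    b2, ehat = ols(z, X2)

    beta1_hat = b2[1]                                 # slope estimate of beta_1
    beta0_hat = b2[0] / (1 - rho)                     # intercept recovered as in (34)
    d_new = np.sum(np.diff(ehat) ** 2) / np.sum(ehat ** 2)   # recheck Durbin-Watson
    print(rho, beta1_hat, beta0_hat, d_new)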

This can of course be checked by a second application of the Durbin-Watson test, which we now
illustrate in terms of the Sales.jmp example above. To do so, we start by recording the rho_hat
estimate, 0.4821 , as a new column in the data set, and then construct the transformed variables


(31) and (32) in terms of rho_hat. In particular, the transformed sales variable is here denoted by
d_sales (for weighted sales difference), and is constructed as in Figure 8 below, where the Lag
operator (under Row functions) identifies the sales value in the previous row.

Figure 8. Transformed Sales Variable

The transformed pci variable, d_pci, is constructed in a similar manner, and Fit Model is now
used to regress d_sales on d_pci, yielding the results in Figure 9 below:

Figure 9. Regression of Transformed Variables

Notice first that the cyclical pattern of residuals in Figure 1 above appears to have been
substantially reduced, though not completely removed. This is typical of such a two-stage
regression procedure. Next observe that while d_pci is still seen to be a very significant
predictor of d_sales, a comparison of the t Ratio in Figures 1 and 9 shows that this level of
significance has indeed been reduced (along with the value of RSquare). This is precisely the
expected effect of accounting for the higher variance of βˆ1 , as detailed in Appendix 1.


The Durbin-Watson results for this transformed data set are shown in Figure 10 below, along with
a Residual-by-Row plot showing the sequence of these transformed residuals in time.

Figure 10. Durbin-Watson Test plus Residual Plot

Note first from the Durbin-Watson test that the significance, 0.1160, of autocorrelation is now
considerably less, and indeed, is no longer even weakly significant (by the usual “prob < 0.10”
standard). This is borne out by the Residual-by-Row plot in which the strong cyclical relation
seen in Figure 1 above is no longer evident. So in the present example, it does appear that this
two-stage regression procedure has been successful in eliminating (or at least significantly
reducing) the effects of autocorrelation. This means that one can have much more faith in these
new estimates and significance levels for the beta coefficients.

However, it should be emphasized that autocorrelation effects are not always so easily rectified.
Indeed, while the significance of autocorrelation is almost always reduced by this two-stage
procedure, it may nonetheless continue to be quite significant (say reduced from .001 to .01).
This suggests that it might be worthwhile repeating the above procedure by “differencing the
differences” in the second stage. Such a three-stage regression procedure is detailed in
Appendix 2 below. However, it should also be emphasized that there are dangers in doing so.
Roughly speaking, taking differences of data series tends to produce a “rougher” series with
larger variance (as we have seen). So taking second differences will tend to increase variance
even further. Thus, while autocorrelation effects may be reduced further, this additional variance
may render all beta coefficients insignificant as well. In such cases, all that can be concluded is
that autocorrelation effects are so strong that no substantive relations between the dependent and
explanatory variables can be identified. With this in mind, we now consider an alternative
approach in which autocorrelation effects are captured by additional explanatory variables.


6. An Explanatory Approach to Autocorrelation

In the Sales.jmp example above, it is of interest to ask whether the cyclical fluctuations in residuals
might actually be explained by other means. In the present case, it is well known that economic
fluctuations (business cycles) often influence spending behavior in a manner which is more rapid
than overall changes in per capita income. Because this appears to be consistent with the
fluctuations observed in these sales residuals, it is of interest to ask whether there are other
explanatory variables that might capture such phenomena. An obvious choice here is the
unemployment rate, which is perhaps the single best indicator of the state of the economy.
Annual unemployment rates, ur, are readily available at the national level, and are included in
Sales.jmp. If this new variable is added to the regression, then the results obtained using Fit Model
are shown in Figure 11 below.

Figure 11. Regression Results with Unemployment Added

Note first from the RSquare Adj value that the addition of this unemployment variable has
substantially improved the overall fit of the model, and in particular that its highly significant
negative beta coefficient suggests that sales are indeed substantially reduced during periods of
high unemployment. Moreover, we see from the associated Durbin-Watson test that all
autocorrelation effects have been effectively removed from this regression (as also seen by the
Residual-by-Row plot). So in this example, we have not only succeeded in eliminating
autocorrelation effects, but have also obtained a sharper explanation of the original sales data.
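
In code, this explanatory approach is simply a multiple regression with the extra column followed by another Durbin-Watson check. The sketch below uses synthetic stand-ins for the sales, pci, and ur columns (the actual data are not reproduced here), so the numbers are purely illustrative of the pattern.

    import numpy as np

    def dw(e):
        # Durbin-Watson statistic
        return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

    def fit_and_dw(y, X):
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        e = y - X @ b
        return b, dw(e)

    rng = np.random.default_rng(5)
    T = 15
    pci = np.linspace(16, 24, T)                                            # stand-in (assumption)
    ur = 6 + 2.0 * np.sin(np.linspace(0, 4, T)) + 0.2 * rng.normal(size=T)  # cyclical stand-in
    sales = 2 + 0.8 * pci - 0.3 * ur + 0.05 * rng.normal(size=T)            # stand-in (assumption)

    X_small = np.column_stack([np.ones(T), pci])
    X_full = np.column_stack([np.ones(T), pci, ur])
    print(fit_and_dw(sales, X_small)[1])   # omitting the cyclical ur pushes d well below 2
    print(fit_and_dw(sales, X_full)[1])    # with ur included, d moves much closer to 2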

However, it should be emphasized that such simple explanations of autocorrelation effects are
often not available. So the real power of the “black box” two-stage regression procedure above
is that it is always applicable, even when sources of autocorrelation effects are complex, or
perhaps not known at all.


Appendix 1. Consequences of Autocorrelation for Regression

To analyze the consequences of autocorrelation, we focus on the variance of the slope estimate,
βˆ1 , for a simple regression,

(35)  Y_t = β_0 + β_1 x_t + u_t ,  t = 1,…,T

with autocorrelated errors satisfying

(36)  u_t = ε_1 / √(1 − ρ²)       for t = 1
      u_t = ρ u_{t−1} + ε_t       for t = 2,…,T

where |ρ| < 1 and ε_t ~ iid N(0, σ_ε²), t = 1,…,T. To do so, recall first that β̂_1 is a linear estimator of
the form,

(37)  β̂_1 = Σ_{t=1}^{T} w_t Y_t

where

(38)  w_t = (x_t − x̄) / Σ_{i=1}^{T} (x_i − x̄)² ,  t = 1,…,T

As was shown in class, this implies that β̂_1 is an unbiased estimator of β_1, regardless of the
presence of autocorrelation. However, the estimated variance of β̂_1 depends on the assumption
that the errors (u_t : t = 1,…,T) are independently and identically distributed with mean zero
and variance σ_u², so that, as shown in class,

(39)  var(β̂_1) = Σ_{t=1}^{T} w_t² var(Y_t) = σ_u² Σ_{t=1}^{T} w_t² = σ_u² / Σ_t (x_t − x̄)²

With these observations, our main objective is to show that the true variance of βˆ1 tends to be
much larger than (39) in the presence of autocorrelation. To do so, we start by observing that if
the Yt ′s are not independent, and in particular have autocorrelated errors, then the variance of the
sum of random variables in (37) now takes the more general form:

(40)  var(β̂_1) = Σ_{t=1}^{T} w_t² var(Y_t) + Σ_{t=1}^{T} Σ_{s≠t} w_t w_s cov(Y_t, Y_s)

To analyze this expression further, observe next that


(41)  cov(Y_t, Y_s) = cov(u_t, u_s) = E(u_t u_s) − E(u_t) E(u_s) = E(u_t u_s)

In particular, this implies that for each t = 2,.., T ,

(42)  cov(u_t, u_{t−1}) = E(u_t u_{t−1}) = E[(ρ u_{t−1} + ε_t) u_{t−1}] = ρ E(u_{t−1}²) + E(ε_t u_{t−1})

But since ut −1 is a function of (ε t −1 , ε t −2 ,...) , it follows that ε t and ut −1 are independent, so that

(43)  cov(u_t, u_{t−1}) = ρ E(u_{t−1}²) = ρ var(u_{t−1}) ,  t = 2,…,T

Moreover, the stationarity argument in (18) shows that

(44)  var(u_t) = σ_u² ,  t = 1,2,…,T

so that expression (43) takes the simpler form,

(45)  cov(u_t, u_{t−1}) = ρ σ_u² ,  t = 2,…,T

By applying the same argument to ut and ut −2 , we see that

(46)  cov(u_t, u_{t−2}) = E(u_t u_{t−2}) = E[(ρ u_{t−1} + ε_t) u_{t−2}] = E[(ρ (ρ u_{t−2} + ε_{t−1}) + ε_t) u_{t−2}]

                        = ρ² E(u_{t−2}²) + ρ E(u_{t−2} ε_{t−1}) + E(u_{t−2} ε_t) = ρ² E(u_{t−2}²) + 0 + 0

                        = ρ² σ_u²

Thus, by recursive applications of the same argument, it can readily be shown that

(47)  cov(u_t, u_{t−k}) = ρ^k σ_u² ,  k = 1,…,t−1 ,  t = 2,…,T

and thus that the second term in (40) can be given the more explicit form,

(48)  Σ_{t=1}^{T} Σ_{s≠t} w_t w_s cov(Y_t, Y_s) = 2 Σ_{t=1}^{T} Σ_{s<t} w_t w_s cov(Y_t, Y_s)

                                               = 2 Σ_{t=1}^{T} Σ_{k=1}^{t−1} w_t w_{t−k} cov(Y_t, Y_{t−k}) = 2 Σ_{t=1}^{T} Σ_{k=1}^{t−1} w_t w_{t−k} ρ^k σ_u²

Finally, since var(Y_t) = var(u_t) = σ_u² by (41), it follows that (40) can be simplified to

(49)  var(β̂_1) = Σ_{t=1}^{T} w_t² σ_u² + 2 Σ_{t=1}^{T} Σ_{k=1}^{t−1} w_t w_{t−k} ρ^k σ_u²


To evaluate the w coefficients in this expression, it is convenient to replace the denominator in
(38) by the leverage, L = Σ_t (x_t − x̄)², so that (49) becomes

(50)  var(β̂_1) = Σ_{t=1}^{T} [(x_t − x̄)² / L²] σ_u² + 2 Σ_{t=1}^{T} Σ_{k=1}^{t−1} [(x_t − x̄)(x_{t−k} − x̄) / L²] ρ^k σ_u²

               = σ_u² [ L/L² + 2 Σ_{t=1}^{T} Σ_{k=1}^{t−1} ρ^k (x_t − x̄)(x_{t−k} − x̄) / L² ]

               = (σ_u² / L) [ 1 + 2 Σ_{t=1}^{T} Σ_{k=1}^{t−1} ρ^k (x_t − x̄)(x_{t−k} − x̄) / L ]

Finally, by replacing leverage, L, with its explicit form, we obtain the key result: 7

(51)  var(β̂_1) = [ σ_u² / Σ_t (x_t − x̄)² ] [ 1 + 2 Σ_{t=1}^{T} Σ_{k=1}^{t−1} ρ^k (x_t − x̄)(x_{t−k} − x̄) / Σ_t (x_t − x̄)² ]

By comparing this expression with (39), it becomes clear that the presence of autocorrelation
will tend to increase the variance of β̂_1 whenever the summation in brackets is positive. But
recall that autocorrelation is almost invariably positive, so that ρ > 0. Moreover, since the vast
majority of x-processes also exhibit some degree of positive autocorrelation, one can expect
the cross products, (x_t − x̄)(x_{t−k} − x̄), to be positive for small k. Finally, since the (geometrically)
decreasing weights, ρ^k, ensure that these small-lag terms dominate this summation, it follows
that the sum will be positive in most cases, and thus that the true variance of β̂_1 will be
substantially larger than expression (39). 8
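
A direct numerical reading of (51) is straightforward: the sketch below (an added illustration with an assumed ρ and a simple increasing x-series) computes the bracketed inflation factor, i.e., the factor by which the naive variance (39) understates the true variance of β̂_1.

    import numpy as np

    def inflation_factor(x, rho):
        """Bracketed term of expression (51):
        1 + 2 * sum_{t,k} rho^k (x_t - xbar)(x_{t-k} - xbar) / sum_t (x_t - xbar)^2."""
        x = np.asarray(x, dtype=float)
        dev = x - x.mean()
        L = np.sum(dev ** 2)
        total = 0.0
        for t in range(1, len(x)):            # t = 2,..,T in the notes' 1-based indexing
            for k in range(1, t + 1):         # k = 1,..,t-1 in 1-based indexing
                total += (rho ** k) * dev[t] * dev[t - k]
        return 1 + 2 * total / L

    x = np.linspace(16, 24, 15)               # increasing series, as with pci (assumption)
    print(inflation_factor(x, 0.48))          # true var(beta_1_hat) = (39) times this factor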

Footnote 7: This expression is essentially the same as expression (12.2.8) in Gujarati, D.N. and D.C. Porter (2009) Basic Econometrics, 5th Ed., Chapter 12 (available online at https://www.academia.edu/15273562/). The only difference is that their notation expresses the explanatory variable, x, in deviation form (i.e., as deviations from the sample mean).

Footnote 8: A more detailed examination of the potential size of this effect is given in the Gujarati-Porter reference above.


Appendix 2. Extension to Three-Stage Least Squares

As mentioned in the text, the two-stage least squares procedure for removing temporal
autocorrelation effects can be extended to a three-stage procedure. This procedure can be
formalized as follows.

Three-Stage Model

(52)  Y_t = β_0 + Σ_{j=1}^{k} β_j x_{jt} + ε_t ,  t = 1,…,T

(53)  ε_t = ρ ε_{t−1} + u_t ,  t = 2,…,T

(54)  u_t = λ u_{t−1} + v_t ,  t = 2,…,T

(55)  v_t ~ iid N(0, σ²) ,  t = 1,…,T

The key point here is that the “second-order” residuals, ut , are no longer assumed to be
independent. Rather, they now depend on previous residual values as well. In this three-stage
model, there are two autocorrelation parameters, ρ and λ , and the resulting “third-order”
residuals, vt , are now assumed to be iid normal [expression (55)].

Three-Stage Procedure

As in the two-stage procedure, we start by letting,

(56)  Y_t^(1) = Y_t − ρ Y_{t−1} ,  t = 2,…,T

(57)  x_{jt}^(1) = x_{jt} − ρ x_{j,t−1} ,  t = 2,…,T

so that

(58)  Y_t^(1) = ( β_0 + Σ_{j=1}^{k} β_j x_{jt} + ε_t ) − ρ ( β_0 + Σ_{j=1}^{k} β_j x_{j,t−1} + ε_{t−1} )

              = (1 − ρ) β_0 + Σ_{j=1}^{k} β_j ( x_{jt} − ρ x_{j,t−1} ) + ( ε_t − ρ ε_{t−1} )

              = (1 − ρ) β_0 + Σ_{j=1}^{k} β_j x_{jt}^(1) + u_t ,  t = 2,…,T

Now proceed to stage three by letting

(59)  Y_t^(2) = Y_t^(1) − λ Y_{t−1}^(1)


(60)  x_{jt}^(2) = x_{jt}^(1) − λ x_{j,t−1}^(1) ,  t = 3,…,T

so that

(61)  Y_t^(2) = [ (1 − ρ) β_0 + Σ_{j=1}^{k} β_j x_{jt}^(1) + u_t ] − λ [ (1 − ρ) β_0 + Σ_{j=1}^{k} β_j x_{j,t−1}^(1) + u_{t−1} ]

              = (1 − λ)(1 − ρ) β_0 + Σ_{j=1}^{k} β_j ( x_{jt}^(1) − λ x_{j,t−1}^(1) ) + ( u_t − λ u_{t−1} )

              = (1 − λ)(1 − ρ) β_0 + Σ_{j=1}^{k} β_j x_{jt}^(2) + v_t ,  t = 3,…,T

So if ρ and λ were known, then by (55) we see that the resulting regression in (61) with T − 2
samples has removed all autocorrelation effects. Moreover, since the slope values, β j , in (61)
are the same as in (52), these initial values can be estimated using (61).

Finally, to estimate the unknown parameters, ρ and λ , we start with the two-stage procedure
and obtain a set of estimated regression residuals, εˆt , and a corresponding (modified) ρ
estimate,

(62)  ρ̂ = Σ_{t=3}^{T} ε̂_t ε̂_{t−1} / Σ_{t=2}^{T} ε̂_t²

If the residuals, û_t = ε̂_t − ρ̂ ε̂_{t−1}, are uncorrelated (by Durbin-Watson), then we may assume that
λ = 0 in (54) and stop. Otherwise, we proceed to estimate λ by the regression:

(63)  û_t = λ û_{t−1} + v_t ,  v_t ~ iid N(0, σ²) ,  t = 3,…,T

and obtain the corresponding (modified) least squares estimate

(64)  λ̂ = Σ_{t=3}^{T} û_t û_{t−1} / Σ_{t=2}^{T} û_t²

which is equivalent to iterating the two-stage least squares procedure in JMP.

If the residuals, uˆt − λˆuˆt −1 , are uncorrelated (again by Durbin-Watson), then this procedure has
been successful. Otherwise, one could in principle proceed to a “fourth stage”. However, as I
said in class, so much additional “differencing noise” has already been introduced, that the beta
estimates of interest in such a fourth stage are not likely to remain significant.
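
For completeness, a schematic Python sketch of this three-stage recipe is given below. The series are synthetic stand-ins generated from model (52)-(55) with assumed values ρ = 0.7 and λ = 0.4, and the helper functions are hypothetical.

    import numpy as np

    def ols_resid(y, X):
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        return b, y - X @ b

    def dw(e):
        return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

    def lag_transform(v, r):
        # v_t - r * v_{t-1}, dropping the first available observation
        return v[1:] - r * v[:-1]

    # Synthetic data from model (52)-(55) with assumed rho = 0.7 and lambda = 0.4
    rng = np.random.default_rng(6)
    T = 40
    x = np.cumsum(rng.normal(size=T)) + 20       # synthetic explanatory series (assumption)
    v = rng.normal(size=T)
    u = np.empty(T); u[0] = v[0]
    for t in range(1, T):
        u[t] = 0.4 * u[t - 1] + v[t]             # second-order residuals, as in (54)
    eps = np.empty(T); eps[0] = u[0]
    for t in range(1, T):
        eps[t] = 0.7 * eps[t - 1] + u[t]         # first-order residuals, as in (53)
    y = 1.0 + 0.5 * x + eps                      # observations, as in (52)

    # Stage 1: original regression and rho estimate as in (62)
    X0 = np.column_stack([np.ones(T), x])
    _, e_hat = ols_resid(y, X0)
    rho = np.sum(e_hat[2:] * e_hat[1:-1]) / np.sum(e_hat[1:] ** 2)

    # Stage 2: transform as in (56)-(57), regress, and take residuals u_hat
    y1, x1 = lag_transform(y, rho), lag_transform(x, rho)
    X1 = np.column_stack([np.ones(T - 1), x1])
    _, u_hat = ols_resid(y1, X1)

    # If u_hat still looks autocorrelated (Durbin-Watson far from 2),
    # estimate lambda analogously to (64)
    lam = np.sum(u_hat[1:] * u_hat[:-1]) / np.sum(u_hat ** 2)

    # Stage 3: transform once more as in (59)-(60) and regress as in (61)
    y2, x2 = lag_transform(y1, lam), lag_transform(x1, lam)
    X2 = np.column_stack([np.ones(T - 2), x2])
    b2, v_hat = ols_resid(y2, X2)

    print(rho, lam, b2[1], dw(v_hat))            # slope estimate and final Durbin-Watson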


Note finally that expression (52) in this three-stage model looks exactly like the original multiple
regression model. So expressions (53) through (55) simply elaborate the error structure of this
model. Moreover, if one formally “initializes” this model by including the error variables ε1 and
u1 , and replacing (55) by

(65)  ε_1, u_1, v_1,…,v_T ~ iid N(0, σ²)

then it follows in particular that E (ε1 ) = 0 and that

(66)  ε_2 = ρ ε_1 + u_2 = ρ ε_1 + (λ u_1 + v_2)  ⇒  E(ε_2) = ρ E(ε_1) + λ E(u_1) + E(v_2) = 0

Proceeding by induction, one may similarly verify that E (ε t ) = 0 for all t = 1,.., T , and thus that
the conditional expectation in (52) has the familiar form:

(67)  E(Y_t | x_{1t},…,x_{kt}) = β_0 + Σ_{j=1}^{k} β_j x_{jt} ,  t = 1,…,T

So all β j values continue to have the same interpretation, i.e., the expected change in Yt
resulting from a unit change in x jt , all else being equal.

