
Chapter 4.

Gauss-Markov Model
4.1 Model Assumptions
So far we've approached the linear model only as a method of mathematical
approximation. In this chapter, we pose the Gauss-Markov model which embodies the most
common assumptions for the statistical approach to the linear model, leading to the Gauss-
Markov Theorem. The Gauss-Markov model takes the form
yœXbe (4.1)
where y is the (N by 1) vector of observed responses, and X is the (N by p) known design
matrix. As before, the coefficient vector b is unknown and to be determined or estimated.
The main features of the Gauss-Markov model are the assumptions on the error e:
E(e) = 0  and  Cov(e) = σ²I_N.                                     (4.2)
The notation for expectation and covariances above can be rewritten component by
component:
( E(e) )_i = ith component of E(e) = E(ei)
( Cov(e) )_ij = (i,j)th element of the covariance matrix = Cov(ei, ej),
so that the Gauss-Markov assumptions can be rewritten as
  E(ei) = 0,  i = 1, ..., N
  Cov(ei, ej) = σ²  for i = j
              = 0   for i ≠ j;
that is, the errors in the model have a zero mean, constant variance, and are uncorrelated. An
alternative view of the Gauss-Markov model does not employ the error vector e:
E(y) = Xb,   Cov(y) = σ²I_N.
The assumptions in the Gauss-Markov model are easily acceptable for most practical
problems and deviations from these assumptions will be considered in more detail later.
Before we get to the Gauss-Markov Theorem, we will need some simple tools for
conveniently working with means and variances of vector random variables, in particular, a
linear combination of variables aᵀy where a is a fixed vector. The rules are:
  i)   E(aᵀy) = aᵀE(y)
  ii)  Var(aᵀy) = aᵀCov(y)a
  iii) Cov(aᵀy, cᵀy) = aᵀCov(y)c, for fixed a, c
  iv)  Cov(Aᵀy) = AᵀCov(y)A, for fixed matrix A.
The key result is (iii), from which (ii) and (iv) follow easily, and just algebraic bookkeeping is necessary:
  Cov(aᵀy, cᵀy) = Cov(Σᵢ ai yi, Σⱼ cj yj) = Σᵢ ai Cov(yi, Σⱼ cj yj) = Σᵢ Σⱼ ai cj Cov(yi, yj) = aᵀCov(y)c.

Example 4.1. Variance and covariance calculations.

Let Cov(y) = [ 4  2  4 ]        [  1 ]        [ 2 ]
             [ 2  5  0 ],   c = [ -1 ],   a = [ 0 ];   then we have
             [ 4  0 25 ]        [  2 ]        [ 1 ]

Var(y1 - y2 + 2y3) = Var(cᵀy) = cᵀCov(y)c
  = [ 1 -1 2 ] [ 4  2  4 ] [  1 ]  =  [ 1 -1 2 ] [ 10 ]  =  121.   Also
               [ 2  5  0 ] [ -1 ]                [ -3 ]
               [ 4  0 25 ] [  2 ]                [ 54 ]

Cov(2y1 + y3, y1 - y2 + 2y3) = Cov(aᵀy, cᵀy) = aᵀCov(y)c
  = [ 2 0 1 ] [ 4  2  4 ] [  1 ]  =  [ 2 0 1 ] [ 10 ]  =  74.
              [ 2  5  0 ] [ -1 ]               [ -3 ]
              [ 4  0 25 ] [  2 ]               [ 54 ]
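These small matrix calculations are easy to check numerically. Below is a minimal sketch (Python with numpy is an arbitrary choice of tool; the matrix and vectors are those of Example 4.1):

```python
import numpy as np

# Covariance matrix and coefficient vectors from Example 4.1
V = np.array([[4.0, 2.0, 4.0],
              [2.0, 5.0, 0.0],
              [4.0, 0.0, 25.0]])
c = np.array([1.0, -1.0, 2.0])   # y1 - y2 + 2*y3
a = np.array([2.0, 0.0, 1.0])    # 2*y1 + y3

var_cty = c @ V @ c              # Var(c'y)       -> 121.0
cov_aty_cty = a @ V @ c          # Cov(a'y, c'y)  -> 74.0
print(var_cty, cov_aty_cty)
```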

Example 4.2. Variance of a least squares estimator. Let y = Xb + e, with the Gauss-Markov assumptions on e, so that Cov(y) = σ²I_N, and let λᵀb be an estimable function. Then the variance of the least squares estimator follows the calculation (see Exercise 4.2)
  Var(λᵀb̂) = Var(λᵀ(XᵀX)^g Xᵀy) = λᵀ(XᵀX)^g Xᵀ Cov(y) X [(XᵀX)^g]ᵀ λ
           = λᵀ(XᵀX)^g Xᵀ (σ²I_N) X [(XᵀX)^g]ᵀ λ = σ² λᵀ(XᵀX)^g λ.

Example 4.3. In the simple linear regression case in Example 3.1, we solved the normal equations and found the usual slope estimate
  b̂2 = Σᵢ(xi - x̄)yi / Σᵢ(xi - x̄)².
To compute the mean and variance of b̂2, we have two routes. One route is to treat b̂2 as a linear combination of the yi's:
  b̂2 = Σᵢ(xi - x̄)yi / Σᵢ(xi - x̄)² = Σᵢ[(xi - x̄)/Sxx] yi,   where Sxx = Σᵢ(xi - x̄)².
Then E(b̂2) = Σᵢ[(xi - x̄)/Sxx] E(yi) = Σᵢ[(xi - x̄)/Sxx](b1 + b2 xi) = [Σᵢ(xi - x̄)xi / Sxx] b2 = b2, and the usual slope estimate is unbiased. Its variance can be found from the following algebra:
  Var(b̂2) = Σᵢ Σⱼ [(xi - x̄)/Sxx][(xj - x̄)/Sxx] Cov(yi, yj)
          = Σᵢ[(xi - x̄)/Sxx]² Var(yi) = σ² Σᵢ(xi - x̄)²/Sxx² = σ²/Sxx,
employing the usual assumptions of constant variance and uncorrelated observations. The other route would be to follow the algebra in Example 4.2 above, and find the (2,2) element of
  σ²(XᵀX)⁻¹ = σ² [ N     Σxi  ]⁻¹
                 [ Σxi   Σxi² ],
whose (2,2) element is σ² N/(N Sxx) = σ²/Sxx.
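The two routes in Example 4.3 can also be compared numerically. The sketch below uses assumed values (x = 1, ..., 10 and σ² = 1 are arbitrary choices) and checks that the linear-combination route and the (2,2) element of σ²(XᵀX)⁻¹ agree:

```python
import numpy as np

sigma2 = 1.0                      # assumed error variance
x = np.arange(1.0, 11.0)          # arbitrary covariate values
N = len(x)
X = np.column_stack([np.ones(N), x])

# Route 1: slope estimate as a linear combination sum(w_i * y_i)
w = (x - x.mean()) / np.sum((x - x.mean())**2)
var_route1 = sigma2 * np.sum(w**2)            # = sigma^2 / Sxx

# Route 2: (2,2) element of sigma^2 (X'X)^{-1}
var_route2 = sigma2 * np.linalg.inv(X.T @ X)[1, 1]

print(np.isclose(var_route1, var_route2))     # True
```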

4.2. The Gauss-Markov Theorem


The goal throughout this chapter has been to show that the least squares estimators derived back in Section 3.2 are the 'best' estimators in some sense. The Gauss-Markov Model that we've been talking about consists of just the set of assumptions that are sufficient. Recall that a linear estimator takes the form c + aᵀy, and will be unbiased for estimable λᵀb when
  E(c + aᵀy) = λᵀb
for all b, which leads to c = 0 and λ = Xᵀa. In our One-Way ANOVA Model, yij = μ + αi + eij, then y11, (y11 + 2y12)/3, and ȳ1· are all unbiased estimators of μ + α1. In Simple Linear Regression, yi = β0 + β1 xi + ei, with xi = i, say, then y2 - y1 or y3 - y2 are both unbiased estimators of β1. Which is a better estimator, and how shall we measure? When two estimators are both unbiased, the estimator with smaller variance is better, since variance is a measure of variability around its mean. Our least squares estimator of λᵀb is λᵀb̂ where b̂ is a solution to the normal equations. Recall that if we construct all solutions to the normal equations,
  b̂(z) = (XᵀX)^g Xᵀy + (I - (XᵀX)^g XᵀX)z,
then for estimable λᵀb, λᵀb̂(z) is constant for all values of z -- all solutions to the normal equations lead to the same least squares estimator.

Theorem 4.1. (Gauss-Markov Theorem) Under the assumptions of the Gauss-Markov Model,
  y = Xb + e, where E(e) = 0 and Cov(e) = σ²I_N,
if λᵀb is estimable, then λᵀb̂ is the best (minimum variance) linear unbiased estimator (BLUE) of λᵀb, where b̂ solves the normal equations
  XᵀXb = Xᵀy.
Proof: Suppose c + dᵀy is another unbiased estimator of λᵀb. Then c = 0 and dᵀX = λᵀ, since E(c + dᵀy) = c + dᵀXb = λᵀb for all b. Now,
  Var(c + dᵀy) = Var(dᵀy) = Var(λᵀb̂ + dᵀy - λᵀb̂)
    = Var(λᵀb̂) + Var(dᵀy - λᵀb̂) + 2 Cov(λᵀb̂, dᵀy - λᵀb̂)
    = Var(λᵀb̂) + Var(dᵀy - λᵀb̂) + 2 λᵀ(XᵀX)^g Xᵀ(σ²I_N)(d - X[(XᵀX)^g]ᵀλ)
    = Var(λᵀb̂) + Var(dᵀy - λᵀb̂) + 2σ² λᵀ(XᵀX)^g (Xᵀd - XᵀX[(XᵀX)^g]ᵀλ)
    = Var(λᵀb̂) + Var(dᵀy - λᵀb̂) + 2σ² λᵀ(XᵀX)^g (λ - λ)
    = Var(λᵀb̂) + Var(dᵀy - λᵀb̂).
Since λᵀb is estimable, we have λ ∈ V(Xᵀ) and XᵀX[(XᵀX)^g]ᵀ is a projection onto it, hence XᵀX[(XᵀX)^g]ᵀλ = λ. So Var(dᵀy) ≥ Var(λᵀb̂), with equality iff
  Var(dᵀy - λᵀb̂) = Var((d - X[(XᵀX)^g]ᵀλ)ᵀ y) = σ² ‖d - X[(XᵀX)^g]ᵀλ‖² = 0.
So equality occurs (and an estimator with equal variance) iff d = X[(XᵀX)^g]ᵀλ, or dᵀy = λᵀb̂. In other words, the best linear unbiased estimator is unique. ∎

Note that in the proof above, the crucial step is showing
  Cov(λᵀb̂, dᵀy - λᵀb̂) = 0, where dᵀX = λᵀ.
Now what is dᵀy - λᵀb̂ estimating? Notice that
  E(dᵀy - λᵀb̂) = dᵀXb - λᵀ(XᵀX)^g XᵀXb = λᵀb - λᵀb = 0,
so that dᵀy - λᵀb̂ is an unbiased estimator of zero, and the Best Linear Unbiased Estimator λᵀb̂ is uncorrelated with it.

Result 4.1. The BLUE λᵀb̂ of estimable λᵀb is uncorrelated with all unbiased estimators of zero.
Proof: First, characterize unbiased estimators of zero as c + aᵀy such that
  E(c + aᵀy) = c + aᵀXb = 0 for all b,
or c = 0 and Xᵀa = 0, or a ∈ N(Xᵀ). Computing the covariance between λᵀb̂ and aᵀy we have
  Cov(λᵀb̂, aᵀy) = λᵀ(XᵀX)^g Xᵀ Cov(y) a = σ² λᵀ(XᵀX)^g Xᵀa = 0. ∎

The Gauss-Markov Theorem can be extended to the vector case. Let the columns λ(j) of the matrix Λ be linearly independent, such that the λ(j)ᵀb are linearly independent estimable functions; then
  Cov(Λᵀb̂) = Cov(Λᵀ(XᵀX)^g Xᵀy) = σ² Λᵀ(XᵀX)^g XᵀX [(XᵀX)^g]ᵀ Λ = σ² Λᵀ(XᵀX)^g Λ.
If we have any unbiased estimator of Λᵀb, say Cᵀy, that is, E(Cᵀy) = Λᵀb for all b, then the difference in their covariance matrices
  Cov(Cᵀy) - Cov(Λᵀb̂)
is nonnegative definite. If X has full column rank, then applying Λ = I gives Cov(b̂) = σ²(XᵀX)⁻¹. Moreover, in the full column rank model, the Gauss-Markov Theorem says that if b̃ is any other linear unbiased estimator of b, then
  Cov(b̃) - Cov(b̂)
is nonnegative definite.
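A quick simulation illustrates the theorem for the simple linear regression comparison above (xi = i): both the least squares slope and y3 - y2 are unbiased for β1, but the latter has much larger variance. The coefficient values and sample size below are arbitrary assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, beta0, beta1, sigma = 6, 1.0, 2.0, 1.0   # assumed toy values
x = np.arange(1.0, N + 1)                   # x_i = i as in the text
X = np.column_stack([np.ones(N), x])

ols_slopes, diff_estimates = [], []
for _ in range(20000):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, N)
    b_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    ols_slopes.append(b_hat[1])             # least squares slope
    diff_estimates.append(y[2] - y[1])      # y3 - y2, also unbiased for beta1

print(np.var(ols_slopes), np.var(diff_estimates))
# OLS slope variance ~ sigma^2/Sxx = 1/17.5; y3 - y2 has variance 2*sigma^2 = 2
```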

4.3. Variance Estimation


Throughout this discussion, our focus has been on estimating linear functions of the coefficients b, and no attention has been paid to the other unknown parameter σ². So far, we've used P_X y to estimate Xb, as another version of the normal equations is Xb̂ = P_X y. As the reader might guess, we will use (I - P_X)y, or, more specifically, its sum of squares SSE = ‖(I - P_X)y‖², to estimate σ². To construct an unbiased estimator for σ², we will use the following lemma.

Lemma 4.1. Let Z be a vector random variable with E(Z) = μ and Cov(Z) = Σ. Then
  E(ZᵀAZ) = μᵀAμ + tr(AΣ).
Proof: Note that
  E[(Zi - μi)(Zj - μj)] = Σij = E(Zi Zj) - μi E(Zj) - μj E(Zi) + μi μj,
so that
  [ E(ZZᵀ) ]ij = E[(Zi - μi)(Zj - μj)] + μi μj = Σij + μi μj,
or E(ZZᵀ) = μμᵀ + Σ. Now using the linearity of the trace and expectation operations, we have
  E(ZᵀAZ) = E(tr(ZᵀAZ)) = E(tr(AZZᵀ)) = tr(A E(ZZᵀ)) = tr(A(μμᵀ + Σ))
          = tr(Aμμᵀ) + tr(AΣ) = μᵀAμ + tr(AΣ). ∎

Result 4.2. Consider the Gauss-Markov Model given in (4.1) and (4.2). An unbiased estimator of σ² is σ̂² = SSE/(N - r), where SSE = êᵀê = yᵀ(I - P_X)y and r = rank(X).
Proof: Since E(y) = Xb and Cov(y) = σ²I_N, applying Lemma 4.1 above yields
  E(yᵀ(I - P_X)y) = (Xb)ᵀ(I - P_X)Xb + tr((I - P_X)(σ²I_N)) = σ²(N - r).
Dividing by the degrees of freedom parameter tr(I - P_X) = N - r associated with SSE = yᵀ(I - P_X)y gives the desired result. ∎
If we examine the regression sum of squares, SSR = ŷᵀŷ = yᵀP_X y, we find
  E(SSR) = E(yᵀP_X y) = bᵀXᵀP_X Xb + tr(P_X σ²I_N) = ‖Xb‖² + rσ².
Little else can be said about σ̂² without further distributional assumptions, even when the ei are iid (independent, identically distributed), since its variance depends on the third and fourth moments of the underlying distribution. The algebra for computing Var(σ̂²) in terms of the third and fourth moments is given in an addendum to this chapter.
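A short simulation sketch of Result 4.2 (the design, coefficients, and σ² = 4 below are arbitrary assumptions): the average of SSE/(N - r) over many simulated data sets should be close to σ².

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma2 = 30, 4.0                            # assumed sample size and error variance
X = np.column_stack([np.ones(N), rng.normal(size=N)])
b = np.array([1.0, 2.0])
r = np.linalg.matrix_rank(X)
M = np.eye(N) - X @ np.linalg.pinv(X.T @ X) @ X.T   # I - P_X

estimates = []
for _ in range(5000):
    y = X @ b + rng.normal(0.0, np.sqrt(sigma2), N)
    sse = y @ M @ y                            # SSE = y'(I - P_X)y
    estimates.append(sse / (N - r))            # sigma-hat^2 = SSE/(N - r)

print(np.mean(estimates))                      # close to sigma2 = 4.0
```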

4.4. Implications of Model Selection


In the application of linear models, sometimes the appropriate statistical model is not
obvious. Even in designed experiments, researchers will debate whether to include certain interactions in the model. In the case of observational studies with continuous covariates, the
problem of model selection becomes quite difficult and leads to many diagnostics and
proposed methodologies (see, e.g. Rawlings, Pantula, & Dickey, 1998). In the context of this
book, however, we are concerned with the theoretical aspects that drive the practical model
selection. At this point an important distinction must be made between the underlying true
model for the data, and the model that the researcher is employing. In practice, of course, we
cannot know the true model.
The issue of model selection is often described as steering between two unpleasant
situations: overfitting, that is, including explanatory variables which are not needed, and
underfitting, not including an important explanatory variable. We will focus on the
implications of these two situations.
In the case of underfitting, also referred to as misspecification, we can write the model for the truth as
  y = Xb + δ + e                                                    (4.3)
where X is the design matrix for the model that the researcher is using, and δ includes the omitted variables and their coefficients. Assume, as usual, E(e) = 0 and Cov(e) = σ²I_N, and notice that if E(e) were not zero, δ would capture its effect. Consider first the least squares estimators:
  E(λᵀb̂) = λᵀ(XᵀX)^g Xᵀ E(y) = λᵀ(XᵀX)^g Xᵀ(Xb + δ) = λᵀb + aᵀP_X δ,
where λ = Xᵀa. When the model is misspecified, the least squares estimators are biased:
  E(λᵀb̂) - λᵀb = aᵀP_X δ = λᵀ(XᵀX)^g Xᵀδ,
where the bias above depends on:
  - the magnitude of the misspecified effect δ,
  - how much of that effect lies in V(X), namely P_X δ, and
  - how much of it relates to the function at hand, λᵀ(XᵀX)^g Xᵀδ.
If the missing signal δ is orthogonal to V(X), then P_X δ = 0 and the estimation of coefficients will be unaffected. If the misspecification is due to omitted explanatory variables that are uncorrelated with those included in X, that is, Xᵀδ = 0, the estimates are not biased.

The estimation of the variance is also affected by misspecification:
  E(yᵀ(I - P_X)y) = (Xb + δ)ᵀ(I - P_X)(Xb + δ) + tr((I - P_X)(σ²I_N))
                  = δᵀ(I - P_X)δ + σ²(N - r).                       (4.4)
Therefore, the bias in the variance estimate disappears only when the misspecification disappears. Note that the bias in σ̂² is zero if and only if
  δᵀ(I - P_X)δ = 0, or δ ∈ V(X), or δ = Xd for some d.
In that case, the model is then
  y = Xb + Xd + e = X(b + d) + e.
One view of misspecification is that the signal δ that is not accounted for is partitioned into two pieces:
  y = Xb + δ + e = P_X y + (I - P_X)y = (Xb + P_X δ + P_X e) + (I - P_X)(δ + e).
One piece, P_X δ, lies in V(X) and affects the estimation of λᵀb; the other part, (I - P_X)δ, affects the estimation of the variance.

Example 4.4. Simple misspecification. Suppose yi = β0 + β1 xi + ei, but the covariate xi is ignored and we just estimate the mean and variance. Here we have δi = β1 xi and X = 1_N, so that β̂0 = ȳ and E(ȳ) = β0 + β1 x̄. As for the variance, our usual variance estimate is σ̂² = Σᵢ(yi - ȳ)²/(N - 1). Applying (4.4), the size of the bias,
  E(σ̂²) - σ² = β1² Σᵢ(xi - x̄)²/(N - 1),
depends on the size of the departure of the true mean response from the fitted model, which assumes a constant mean.

Example 4.5. Electricity problem. Consider the analysis of electricity consumption, using the following model as the true model for households:
  billi = β0 + β1 incomei + β2 personsi + β3 areai + ei             (4.5)
where billi = monthly electric bill for household i, incomei = monthly disposable income, personsi = number in household, and areai = heated living area of the home or apartment. In such analyses income may not be available, so consider the consequences for estimating the regression coefficients β2 and β3 when income is dropped from the model:
  E(billi) = β0 + β2 personsi + β3 areai.                           (4.6)
In this situation, rows of X contain [ 1  personsi  areai ], b = [ β0  β2  β3 ]ᵀ, and δi = β1 incomei. Another approach is to construct a regression model for the missing variable income, using the remaining variables as explanatory variables:
  incomei = γ0 + γ1 personsi + γ2 areai + fi.                       (4.7)
Combining the true model (4.5) and the expression above for the missing variable (4.7), the misspecified model we are fitting (4.6) really now becomes
  billi = (β0 + β1γ0) + (β2 + β1γ1) personsi + (β3 + β1γ2) areai + (ei + β1 fi).   (4.8)
If income is not related to persons, then γ1 = 0, and the estimate of β2 remains unbiased; however, if income is related to area, then E(β̂3) = β3 + β1γ2, with the size of the bias depending on the importance of the missing variable (β1) and the strength of its relationship with the variable of interest (γ2). Also note that the expression for the error in the misspecified model (4.8), ei + β1 fi, contains both the original error ei and the effect of the misspecification.
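The effect described in Example 4.5 is easy to reproduce by simulation. In the sketch below all numerical values (coefficients, the relation of income to area, error scales) are invented for illustration; the point is only that the fitted area coefficient drifts toward β3 + β1γ2 when income is omitted:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 2000
# Assumed toy coefficients for the true model (4.5); not values from the text
beta0, beta1, beta2, beta3 = 10.0, 0.05, 5.0, 0.02

persons = rng.integers(1, 6, N).astype(float)
area = 50.0 + 25.0 * persons + rng.normal(0.0, 15.0, N)
# income related to area: gamma2 = 8.0 in the auxiliary regression (4.7)
income = 500.0 + 8.0 * area + rng.normal(0.0, 100.0, N)
bill = beta0 + beta1*income + beta2*persons + beta3*area + rng.normal(0.0, 5.0, N)

# Fit the misspecified model (4.6) that drops income
X = np.column_stack([np.ones(N), persons, area])
b_hat = np.linalg.lstsq(X, bill, rcond=None)[0]

print(b_hat[2], beta3 + beta1 * 8.0)   # area coefficient is biased toward beta3 + beta1*gamma2
```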

The case of overfitting can be viewed in the following context, by partitioning the
explanatory variables into two groups or blocks:

  y = X1 b1 + X2 b2 + e,
where the second group of explanatory variables is not needed, since b2 = 0. To make comparisons, let us simplify matters and assume the Gauss-Markov model and that both X1 and X = (X1 X2) have full column rank. Using the smaller model with design matrix X1 leads to the least squares estimator we will denote as b̃1, constructed as
  b̃1 = (X1ᵀX1)⁻¹ X1ᵀy.
Using previous results, we can show that this estimator is unbiased, E(b̃1) = b1, and its covariance matrix is
  Cov(b̃1) = σ² (X1ᵀX1)⁻¹.
Including the second block of explanatory variables in the model that the researcher fits leads to the least squares estimators
  [ b̂1 ]   [ X1ᵀX1   X1ᵀX2 ]⁻¹ [ X1ᵀy ]
  [ b̂2 ] = [ X2ᵀX1   X2ᵀX2 ]   [ X2ᵀy ].
Again, it is easy to show similar results for this estimator:
  E [ b̂1 ] = [ b1 ]    and    Cov [ b̂1 ] = σ² [ X1ᵀX1   X1ᵀX2 ]⁻¹
    [ b̂2 ]   [ 0  ]             [ b̂2 ]       [ X2ᵀX1   X2ᵀX2 ]  .
Using Exercise A.74 (partitioned inverse) we can show
  Cov(b̂1) = σ²(X1ᵀX1)⁻¹ + σ²(X1ᵀX1)⁻¹X1ᵀX2 [X2ᵀ(I - P_X1)X2]⁻¹ X2ᵀX1(X1ᵀX1)⁻¹,      (4.9)
so that the penalty for including the second block of variables X2 is increased variance in the coefficient estimators.
Denoting rank(X1) = r1 and rank(X) = r, the variance estimators are both unbiased,
  E[yᵀ(I - P_X1)y]/(N - r1) = σ²
  E[yᵀ(I - P_X)y]/(N - r) = σ²,
since there is no misspecification, and the only difference between the two estimators is in the degrees of freedom. Except for some applications where error degrees of freedom are scarce, there is little lost in variance estimation by overfitting.

Example 4.6. No intercept. Suppose we have the simple linear regression problem, yi = β0 + β1 xi + ei, but β0 = 0. The variance of the least squares estimate of the slope where we include the intercept is σ²/Σᵢ(xi - x̄)². If we drop the intercept, the slope estimator is simply Σᵢ xi yi / Σᵢ xi² and its variance is σ²/Σᵢ xi², which is smaller, since Σᵢ xi² ≥ Σᵢ(xi - x̄)². See also Exercise 4.7.
Returning to the coefficient estimation, examination of the difference of the two covariance matrices shows the effect of including the second block of variables:
  Cov(b̂1) - Cov(b̃1) = σ²(X1ᵀX1)⁻¹X1ᵀX2 [X2ᵀ(I - P_X1)X2]⁻¹ X2ᵀX1(X1ᵀX1)⁻¹.
If the second block of explanatory variables is orthogonal to the first, that is, X1ᵀX2 = 0, then the estimators not only have the same variance, they are the same estimators, as XᵀX becomes block diagonal. As X2 gets closer to V(X1), then X2ᵀ(I - P_X1)X2 gets smaller and, when inverted, causes Cov(b̂1) to explode. This condition, known as multicollinearity, is the other feared consequence in model selection. One signal for severe multicollinearity is the appearance of unexpected signs of coefficient estimators due to their wild inaccuracy. Without multicollinearity, and with enough degrees of freedom to estimate the variance, there is little to be lost in overfitting. With multicollinearity, overfitting can be catastrophic.
One measure of the effect of multicollinearity is called the variance inflation factor or VIF. If the explanatory variables were mutually orthogonal, then, following, say, Example 4.3, Var(b̂j) = σ²/Sxx where Sxx = Σᵢ(Xij - X̄.j)². In practice, when the explanatory variables are not orthogonal, the defining expression is
  Var(b̂j) = (VIF) × σ²/Sxx.                                         (4.10)
This relationship arises from the other form of the partitioned inverse result (Exercise A.74). Without loss of generality, consider j = 1, and partition the design matrix X as above, with just the first column in X1 and all of the other columns in X2. Then, employing the other form of the partitioned inverse result, we have
  Cov(b̂1) = σ² [X1ᵀ(I - P_X2)X1]⁻¹.                                  (4.11)
Putting (4.10) and (4.11) together, we find that
  VIF = Sxx / [X1ᵀ(I - P_X2)X1] = 1/(1 - Rj²),                        (4.12)
where Rj² is what we would have for R² if we took column j as the response vector and employed the remaining explanatory variables. Clearly, when an explanatory variable can itself be closely approximated by a linear combination of the other variables, then Rj² will be close to one and the VIF very large, indicating a serious multicollinearity problem.
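The two expressions for the VIF in (4.10)-(4.12) can be checked numerically. The sketch below uses two centered, correlated covariates (the correlation 0.9 is an arbitrary choice) and computes the VIF both from the inverse of XᵀX and from 1/(1 - Rj²):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200
x1 = rng.normal(size=N)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.normal(size=N)   # correlated with x1
x1 -= x1.mean(); x2 -= x2.mean()                            # center the covariates
X = np.column_stack([x1, x2])

# Route 1: VIF from (4.10)-(4.11); sigma^2 cancels out of the ratio
Sxx = np.sum(x1**2)
var_b1 = np.linalg.inv(X.T @ X)[0, 0]          # Var(b1-hat) / sigma^2
vif_1 = var_b1 * Sxx

# Route 2: VIF = 1/(1 - R_j^2), regressing column 1 on the remaining column(s)
coef = np.linalg.lstsq(X[:, 1:], x1, rcond=None)[0]
resid = x1 - X[:, 1:] @ coef
r2 = 1 - np.sum(resid**2) / np.sum(x1**2)
vif_2 = 1.0 / (1.0 - r2)

print(vif_1, vif_2)        # equal; both approximately 1/(1 - 0.9^2) = 5.26
```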
The quantity X2ᵀ(I - P_X1)X2 is employed by some algorithms, e.g., the sweep operator in SAS (see, e.g., Goodnight (1979) or Monahan (2001)), to determine the rank of X or XᵀX in regression problems. In practice, this works quite well, although the finite precision of floating point arithmetic limits its effectiveness. When most of the columns of the design matrix X are composed of 0, 1, or some small integers, the computed elements of X2ᵀ(I - P_X1)X2 can be zero or nearly so. However, in the case of continuous covariates, X2ᵀ(I - P_X1)X2 will rarely be zero, even in the case of perfect collinearity, due to the effects of rounding error. In practice, most computer software for regression determines rank by testing whether X2ᵀ(I - P_X1)X2, or something similar, is close to zero. While the mathematics presented here appears to proceed smoothly in the case of dependence in the columns of X -- by using the generalized inverse when the inverse doesn't exist -- the reality is that the effect is dramatic: certain components or functions of b are no longer estimable. The practical matter is that the rank of the design matrix should be known in advance by the researcher. If the computer software determines a smaller rank than expected, then either an unexpected dependency or a catastrophic multicollinearity problem exists. Finding a larger rank than expected indicates the inability of the software to detect dependence. (See Exercises 4.15, 4.16.)

4.5 The Aitken Model and Generalized Least Squares


The Aitken model is a slight extension of the Gauss-Markov Model in that only the moment assumptions on the errors differ. The Aitken Model takes the form
  y = Xb + e, where E(e) = 0, but Cov(e) = σ²V,
where the matrix V is a KNOWN positive definite matrix. In this way, it is similar to the Gauss-Markov Model in that the covariance is known up to a scalar σ². If V = I, then we have the Gauss-Markov model. In practice, however, V is usually unknown, or has just a few unknown parameters; this case will be addressed later. Under an Aitken model, the least squares estimator λᵀb̂ of an estimable function λᵀb may no longer be the BLUE for λᵀb. We will now construct a generalized least squares (GLS) estimator of λᵀb and show that it is the BLUE for λᵀb.
The crucial step in GLS is the construction of a square root of the unscaled covariance matrix, that is, finding R such that RVRᵀ = I_N. As discussed in Appendix A, there are two approaches for constructing a square root of a positive definite matrix, the Cholesky factorization and the spectral decomposition. The Cholesky factorization writes V as the product of a lower triangular matrix L and its transpose, V = LLᵀ. Taking this route, use R = L⁻¹. The spectral decomposition uses the eigenvector-eigenvalue decomposition V = QΛQᵀ, where Λ is the diagonal matrix of eigenvalues and Q is the (orthogonal) matrix of eigenvectors stacked as columns. This route suggests taking R = QΛ^(-1/2)Qᵀ, so that the 'square root' matrix is symmetric in this case. Note that we will insist that V be positive definite, so that L or Λ is nonsingular. If V were singular, then there would be a linear combination of observations with zero variance, and this case would more properly be treated as a linear constraint.
Using the matrix R, we can reformulate the Aitken Model using a transformed response variable
  z = Ry = RXb + Re, or
  z = Ub + f, where E(f) = 0 and Cov(f) = σ²I_N,                     (4.13)
which looks just like the Gauss-Markov Model. Now we can tackle all of the same issues as before, and then transform back to the Aitken model:
a) Estimability. The linear function λᵀb is estimable if λ is in the column space of the transpose of the design matrix. Here this means λ ∈ V(Uᵀ) = V(XᵀRᵀ) = V(Xᵀ), since R is nonsingular. From another viewpoint, estimability did not involve the second moment anyway, so estimability should not be affected by the fact that V is not a constant diagonal matrix.
b) Linear estimator. Note that any estimator g + hᵀz that is linear in z is also a linear estimator in y, namely g + hᵀRy = g + aᵀy, and vice versa. In other words, the class of linear estimators in z is the same as the class of linear estimators in y.
c) Generalized least squares estimators. In the transformed model, the normal equations are
  UᵀUb = Uᵀz                                                        (4.14)
and so the least squares estimator from (4.13) solves (4.14) above. However, these normal equations can be easily rewritten as
  (RX)ᵀ(RX)b = (RX)ᵀ(Ry), or
  XᵀV⁻¹Xb = XᵀV⁻¹y,                                                  (4.15)
which are known as the Aitken equations, and a solution to (4.15) will be denoted by b̂_GLS, a generalized least squares estimator of b. When needed for clarity, a solution to the usual normal equations XᵀXb = Xᵀy will be denoted by b̂_OLS, for ordinary least squares. From Section 4.2, we should expect that λᵀb̂_GLS is the BLUE for λᵀb, but this will be examined further.

Theorem 4.2. (Aitken's Theorem) Consider the Aitken Model given by y = Xb + e, where E(e) = 0 and Cov(e) = σ²V, where V is a known positive definite matrix. If λᵀb is estimable, then λᵀb̂_GLS is the BLUE for λᵀb.
Proof: Follows from (a), (b), (c), and the Gauss-Markov Theorem. ∎

From Chapter 2, we know that b̂_GLS minimizes a sum of squares, in particular,
  ‖z - Ub‖² = ‖R(y - Xb)‖² = (y - Xb)ᵀV⁻¹(y - Xb),
so that this criterion is often called weighted least squares or generalized least squares. In the simplest case, say, simple linear regression with V diagonal, we have
  (y - Xb)ᵀV⁻¹(y - Xb) = Σᵢ (1/Vii)(yi - β0 - β1 xi)²
and the name should be apparent.
d) Estimation of σ². Since the transformed model follows the Gauss-Markov assumptions, the estimator constructed in Section 4.3 is the natural unbiased estimator for σ²:
  σ̂²_GLS = (z - Ub̂_GLS)ᵀ(z - Ub̂_GLS)/(N - r) = (y - Xb̂_GLS)ᵀV⁻¹(y - Xb̂_GLS)/(N - r).
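As a concrete illustration of steps (a)-(d), here is a minimal GLS sketch along the Cholesky route; the function name gls_fit and all numerical values are assumptions for illustration, not from the text. It is applied to a heteroskedastic model of the kind discussed in Example 4.7 below:

```python
import numpy as np

def gls_fit(X, y, V):
    # Cholesky route from Section 4.5: V = L L', take R = L^{-1},
    # then z = Ry and U = RX satisfy the Gauss-Markov assumptions.
    L = np.linalg.cholesky(V)                    # lower triangular, V = L L'
    U = np.linalg.solve(L, X)                    # U = L^{-1} X
    z = np.linalg.solve(L, y)                    # z = L^{-1} y
    b_gls, *_ = np.linalg.lstsq(U, z, rcond=None)
    resid = z - U @ b_gls
    r = np.linalg.matrix_rank(X)
    sigma2_gls = resid @ resid / (len(y) - r)    # estimator of sigma^2 from step (d)
    return b_gls, sigma2_gls

# Heteroskedastic toy model: Var(e_i) = sigma^2 * x_i^2, so V = diag(x_i^2)
rng = np.random.default_rng(4)
x = rng.uniform(1.0, 3.0, 50)
y = 2.0 * x + x * rng.normal(0.0, 0.5, 50)       # beta = 2, sigma = 0.5 (assumed)
b_gls, s2 = gls_fit(x.reshape(-1, 1), y, np.diag(x**2))
print(b_gls, s2)                                  # slope near 2, s2 near 0.25
```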

Example 4.7. Heteroskedasticity.

Consider the simple linear regression model through the origin with heteroskedastic (different variances) errors, with the variances proportional to the squares of the explanatory variables:
  yi = β xi + ei,
where E(ei) = 0, Var(ei) = σ²xi², the ei are uncorrelated, and xi ≠ 0. Notice that
  Var(ei/xi) = σ²,
so that the obvious step is to transform by dividing by xi:
  zi = yi/xi = β + ei/xi.
The BLUE of β, then, is β̂_GLS = z̄ = (1/N) Σᵢ(yi/xi), and Var(β̂_GLS) = σ²/N. For comparison, β̂_OLS = Σᵢ xi yi / Σᵢ xi², and see Exercise 4.4.

Example 4.8. Autoregressive errors.

Suppose we have the usual multiple regression model
  yi = xiᵀb + ei,
where the errors have the usual zero mean, E(ei) = 0, but the covariance structure is induced by the model
  ei = ρ ei-1 + ai,
where the ai's are uncorrelated with zero mean and variance σ². Then it can be shown that Var(ei) = σ²/(1 - ρ²) and the covariance matrix of the original errors ei is given by

  Cov(y) = Cov(e) = σ²V = (σ²/(1 - ρ²)) [ 1        ρ        ρ²      ...  ρ^(N-1) ]
                                        [ ρ        1        ρ       ...  ρ^(N-2) ]
                                        [ ρ²       ρ        1       ...    ...   ]
                                        [ ...                                ρ   ]
                                        [ ρ^(N-1)  ...               ρ       1   ]

so that Vij = ρ^|i-j| / (1 - ρ²). This error structure is known as a first-order autoregressive model. The following transformation, known as the Cochrane-Orcutt transformation, restores the usual Gauss-Markov assumptions:
  z1 = √(1 - ρ²) y1 = √(1 - ρ²) x1ᵀb + √(1 - ρ²) e1
  zi = yi - ρ yi-1 = xiᵀb - ρ xi-1ᵀb + ei - ρ ei-1
     = (xi - ρ xi-1)ᵀb + ai = uiᵀb + ai   for i = 2, ..., N.
The single parameter ρ of this model is not usually known, and the usual approach is to begin with ordinary least squares to estimate b, estimate ρ from the residuals, do GLS with the estimated ρ to reestimate b, reestimate ρ, and iterate until convergence. This procedure is known as estimated generalized least squares (EGLS) since ρ is estimated.
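A sketch of the Cochrane-Orcutt transformation for a known ρ follows (the function name is an assumption for illustration, and X and y are taken to be float arrays); ordinary least squares applied to the transformed pair then satisfies the Gauss-Markov assumptions:

```python
import numpy as np

def cochrane_orcutt_transform(X, y, rho):
    # Transformation from Example 4.8, with rho assumed known:
    # z_1 = sqrt(1 - rho^2) y_1,  z_i = y_i - rho*y_{i-1}, and likewise for rows of X.
    z = np.empty_like(y)
    U = np.empty_like(X)
    c = np.sqrt(1.0 - rho**2)
    z[0], U[0] = c * y[0], c * X[0]
    z[1:] = y[1:] - rho * y[:-1]
    U[1:] = X[1:] - rho * X[:-1]
    return U, z   # OLS on (U, z) now satisfies the Gauss-Markov assumptions
```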
Finding the BLUE estimator under the Aitken model really is not difficult, as you've seen; all that was necessary was to transform y and X to z and U so that the Gauss-Markov assumptions held. A more interesting pursuit, however, is the set of conditions that make the usual b̂_OLS BLUE under the Aitken model assumptions.

Result 4.3. (Generalization of Result 4.1) The estimator tᵀy is the BLUE for E(tᵀy) iff tᵀy is uncorrelated with all unbiased estimators of zero.
Proof: (If) Let aᵀy be another unbiased estimator of E(tᵀy), that is, E(aᵀy) = E(tᵀy). Then
  Var(aᵀy) = Var(tᵀy + aᵀy - tᵀy)
           = Var(tᵀy) + Var(aᵀy - tᵀy) + 2 Cov(tᵀy, aᵀy - tᵀy).
Since aᵀy - tᵀy is an unbiased estimator of zero, the covariance term drops out of the equation above, leading to
  Var(aᵀy) = Var(tᵀy) + Var(aᵀy - tᵀy) ≥ Var(tᵀy).
(Only if) Suppose there is another unbiased estimator of zero, hᵀy, so that E(hᵀy) = 0 where h ≠ 0, and let Cov(tᵀy, hᵀy) = c and Var(hᵀy) = d. Then consider the estimator of E(tᵀy) given by
  aᵀy = tᵀy - (c/d) hᵀy.
This estimator is also unbiased for E(tᵀy). Its variance is
  Var(aᵀy) = Var(tᵀy) + (c/d)² Var(hᵀy) - 2(c/d) Cov(tᵀy, hᵀy)
           = Var(tᵀy) - c²/d ≤ Var(tᵀy).
So if tᵀy is the BLUE, then c = 0; otherwise the estimator aᵀy constructed above will also be unbiased and have smaller variance. ∎

Corollary 4.1. Under the Aitken Model, the estimator tᵀy is the BLUE for E(tᵀy) iff Vt ∈ V(X).
Proof: From the preceding Result 4.3, tᵀy is the BLUE for E(tᵀy) iff Cov(tᵀy, hᵀy) = 0 for all h such that E(hᵀy) = 0. Note that if hᵀy is an unbiased estimator of zero, then we have
  E(hᵀy) = hᵀXb = 0 for all b.
This means hᵀX = 0, or h ∈ N(Xᵀ). Now Cov(tᵀy, hᵀy) = σ² tᵀVh, and this is zero iff h is orthogonal to Vt; since this must hold for all h ∈ N(Xᵀ), the condition is Vt ∈ V(X). ∎

Corollary 4.2. Under the Aitken Model, λᵀb̂_OLS = aᵀP_X y (that is, λ = Xᵀa) is the BLUE for an estimable λᵀb iff VP_X a = Xq for some q.
Proof: Let t = P_X a in Corollary 4.1. ∎

Result 4.4. Under the Aitken Model, all OLS estimators are BLUE (that is, each λᵀb̂_OLS is the BLUE for the corresponding estimable λᵀb) iff there exists a matrix Q such that VX = XQ.
Proof: (If VX = XQ) First write λᵀb̂_OLS = λᵀ(XᵀX)^g Xᵀy = tᵀy for t = X[(XᵀX)^g]ᵀλ, so that λᵀb̂_OLS is the BLUE for λᵀb iff Vt ∈ V(X). Now if VX = XQ, then
  Vt = VX[(XᵀX)^g]ᵀλ = XQ[(XᵀX)^g]ᵀλ ∈ V(X),
and employ Corollary 4.2.
(Only if) Now if each λᵀb̂_OLS is the BLUE for the corresponding λᵀb, take λ(j) as column j of XᵀX, so that t(j) = X[(XᵀX)^g]ᵀλ(j); then from Corollary 4.2, Vt(j) = Xq(j) for some q(j). Stacking these columns side by side to form matrices T and Q, we have
  VT = VX[(XᵀX)^g]ᵀXᵀX = VX = XQ. ∎

Example 4.9. Consider the regression problem y = Xb + e with an equicorrelated covariance structure, that is, Cov(e) = V = σ²I_N + τ²1_N1_Nᵀ, and with an intercept and the following partitioning:
  X = [ 1_N  X* ],   b = [ b1 ]   (b1 scalar, b2 of length p - 1).
                         [ b2 ]
In this case, is λᵀb̂_OLS the BLUE for estimable λᵀb? It will be if we can find Q such that VX = XQ:
  VX = (σ²I_N + τ²1_N1_Nᵀ)[ 1_N  X* ] = [ σ²1_N + τ²1_N N,   σ²X* + τ²1_N1_NᵀX* ]
  XQ = [ 1_N  X* ] [ Q11  Q12 ] = [ 1_N Q11 + X*Q21,   1_N Q12 + X*Q22 ].
                   [ Q21  Q22 ]
Matching the first entries suggests choosing Q11 = σ² + Nτ² and Q21 = 0, and matching the second yields Q12 = τ²1_NᵀX* and Q22 = σ²I_{p-1}. See also Exercise 4.8.
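The conclusion of Example 4.9 can also be verified numerically: with an intercept and equicorrelated errors, the OLS and GLS solutions coincide. All of the numbers in this sketch are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(6)
N, sigma2, tau2 = 25, 1.0, 0.5
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
V = sigma2 * np.eye(N) + tau2 * np.ones((N, N))   # equicorrelated covariance
y = rng.normal(size=N)                            # any response vector works for this identity
Vi = np.linalg.inv(V)

b_ols = np.linalg.solve(X.T @ X, X.T @ y)
b_gls = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)
print(np.allclose(b_ols, b_gls))                  # True
```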

Example 4.10. Seemingly Unrelated Regression. Suppose we have m individuals, each with n responses following regression models:
  y(i) = X(i) b(i) + e(i),  i = 1, ..., m,
where y(i) and e(i) are n × 1, X(i) is n × p, and b(i) is p × 1. The covariances in the errors e(i) tie these regressions together:
  Cov(e(i), e(j)) = σij I_n.
For example, the individuals may be companies, and the responses quarterly sales, which would be contemporaneously correlated. We can write this as one large linear model by combining these pieces:

  y = [ y(1) ]    X = [ X(1)   0    ...   0   ]    b = [ b(1) ]    σ²V = [ σ11 I_n  σ12 I_n  ...  σ1m I_n ]
      [ y(2) ]        [  0   X(2)   ...   0   ]        [ b(2) ]          [ σ21 I_n  σ22 I_n  ...  σ2m I_n ]
      [  ... ]        [ ...                ...]        [  ... ]          [  ...                       ... ]
      [ y(m) ]        [  0     0    ... X(m)  ]        [ b(m) ]          [ σm1 I_n  σm2 I_n  ...  σmm I_n ]

What are the best estimators in this case? Or, more specifically, when are the least squares estimators BLUE? In general, the least squares estimators are not always the best. However, some specific cases are interesting.
If σij = 0 for i ≠ j, then the problem completely decouples into m individual least squares problems and, not surprisingly, the least squares estimators are BLUE. The other interesting case has X(i) = X(1), that is, the design matrices are the same for each company. In this case, we can show VX = XQ with Q having the same block pattern as V but with blocks σij I_p, so the least squares estimators are BLUE. This latter situation is known as multivariate regression, although it is usually written differently.

4.6. Summary

1) The Gauss-Markov assumptions on the errors in a linear model are introduced. They
specify that the errors have zero mean, are uncorrelated and have constant variance.

2) The Gauss-Markov Theorem says that the least squares estimator λᵀb̂ has the smallest variance of all linear unbiased estimators of an estimable function λᵀb when the Gauss-Markov assumptions hold.

3) The estimator σ̂² = SSE/(N - r) = yᵀ(I - P_X)y/(N - r) is an unbiased estimator of the variance parameter σ².

4) The consequences of underfitting or misspecification and overfitting are evaluated.

5) Generalized least squares estimators are introduced for cases where the Gauss-Markov
assumptions on the errors may not hold.

4.7. Exercises
1) Suppose the random variable Yi represents the number of votes that a candidate receives in
county i. A reasonable model would be that Yi would be independent binomial random
variables with parameters ni = number of voters in county i and p = Pr(voter correctly votes for candidate).
a) What are E(Yi ) and Var(Yi )?
b) Write this as a linear model.
c) Find the BLUE of p in this situation.

2) Under the Gauss-Markov Model, show that for λ such that λᵀb is estimable,
Var(λᵀb̂) = σ² λᵀ(XᵀX)^g λ does not depend on the choice of generalized inverse (XᵀX)^g.

3) In Example 4.7 (heteroskedasticity) find Var(β̂_OLS) and compare it to Var(β̂_GLS).
OLS GLS

4) Consider the heteroskedasticity situation in Example 4.7, but suppose Var(ei) = σ²xi where xi > 0. Find Var(β̂_OLS) and compare it to Var(β̂_GLS).

5) Find the Cholesky factor of V in Example 4.7.

6) Suppose we have the simple linear regression model


  yi = β0 + β1 xi + ei,  i = 1, ..., N,
where the ei are uncorrelated and Var(ei) = σ². Consider the instrumental variables estimator of the slope
  b̃1 = Σᵢ(zi - z̄)(yi - ȳ) / Σᵢ(zi - z̄)(xi - x̄),
where z1, ..., zN are known constants.
a) Is b̃1 an unbiased estimator of the slope parameter β1?
b) Find the variance of b̃1.
c) We know that taking zi = xi gives our familiar least squares estimator b̂1. Find the ratio of the two variances, Var(b̂1)/Var(b̃1), and show that it is less than or equal to 1 (b̂1 is BLUE).
You may find the following results useful:
  Σᵢ(xi - x̄) = Σᵢ(zi - z̄) = 0,
  Σᵢ(xi - x̄)(yi - ȳ) = Σᵢ(xi - x̄)yi = Σᵢ xi(yi - ȳ).

7) In a reversal of Example 4.6, suppose we have the simple linear regression problem, yi = β0 + β1 xi + ei, with the usual Gauss-Markov assumptions. Compute the bias in the slope estimator Σᵢ xi yi / Σᵢ xi² when β0 ≠ 0.
8) (Compare with Example 4.9 (equicorrelation)) Consider the linear model y = Xb + e where e = u + Z1_N (Z is a scalar random variable), Cov(u) = σ²I_N, Var(Z) = τ², and Z and u are uncorrelated. Find V and derive conditions under which the OLS estimator of every estimable function is BLUE.

9) Suppose yi = β0 + β1 xi + β2 xi² + ei, where E(ei) = 0, Var(ei) = σ², and the ei are independent. But suppose we fit a simple linear regression model: E(yi) = β0 + β1 xi. To illustrate, consider the following simple situation: xi = i, i = 1, 2, ..., n = 8, with β0 = 2, β1 = 3. Consider also various values of β2, say, -2 to 4 by ones. (Hint: I suggest using PROC REG for computations.)
a) Compute the bias in the least squares estimators β̂0, β̂1.
b) Compute the bias in our usual variance estimate σ̂².
c) Would your results change if the values of β0 and β1 were changed? Explain.

10) Consider the simple linear regression problem, yi = β0 + β1 xi + ei, for i = 1, ..., n = 4, with xi = i and the Gauss-Markov assumptions on the ei. A direct route for getting the BLUE would be to construct ALL linear unbiased estimators and then directly minimize the variance. Write the unbiased estimators as Σᵢ ai yi and focus on estimating the slope β1.
a) Write the two linear equations in the ai's that express the constraints so that Σᵢ ai yi is an unbiased estimator of β1.
b) Construct the family of solutions to the equations in (a). (Hint: you'll need two z's.) This will parameterize all unbiased estimators with just two parameters.
c) Compute the variance of the estimators in (b) and minimize the variance. You should get a familiar solution.

11) Prove the Gauss-Markov Theorem directly, that is, by constructing all linear estimators aᵀy which are unbiased for λᵀb (find a family of solutions a(z)), and then minimizing the variance σ²aᵀa.

12) Show that if R is square and nonsingular and RVRᵀ = I, then V⁻¹ = RᵀR. Do you need the assumption that R is square? nonsingular?

13) Let Q = X(XᵀV⁻¹X)^g XᵀV⁻¹. Show that Q is a projection onto V(X). (This is difficult without the hint: first factor V = LLᵀ with Cholesky and work with a symmetric version Q* = L⁻¹QL.)

14) Let σ²V be the N × N covariance matrix for a first-order moving average process:
  Vij = 1 + α²  if i = j
  Vij = α       if |i - j| = 1
  Vij = 0       if |i - j| > 1.
Notice that V is banded with zeros outside of three bands: on the diagonal, and just above and below the diagonal.
a) Show that the Cholesky factor of V is also banded (lower triangular, with nonzeros only on the diagonal and just below the diagonal).
b) Find the limit of the two nonzero elements in row N as N → ∞.

15) Consider the multiple regression problem including an intercept with the following list of explanatory variables:
  c1 = cos(2πi/7)     s1 = sin(2πi/7)
  c2 = cos(2π·2i/7)   s2 = sin(2π·2i/7)
  c3 = cos(2π·3i/7)   s3 = sin(2π·3i/7)
  c4 = cos(2π·4i/7)   s4 = sin(2π·4i/7)
  c5 = cos(2π·5i/7)   s5 = sin(2π·5i/7)
  c6 = cos(2π·6i/7)   s6 = sin(2π·6i/7)
for i = 1, ..., N.
a) Show that the last six variables (c4, s4, ..., s6) are linearly dependent on the first six (c1, s1, ..., s3) and an intercept.
b) Test whether the regression software that you commonly use can detect dependencies among the explanatory variables, using 3.1416 as your approximation for π, and various values of N.
c) Repeat this exercise with a cruder approximation, 3.14, for π.
d) Repeat this exercise with 4 in place of 7 (that is, 2πi/4, 4πi/4, etc.).

16) Using the variables (1, c1, s1, c2, s2, c3, s3) from Exercise 15 above, is there any
multicollinearity problem?

17) Consider the usual simple linear regression situation,
  E(yi) = α + β xi  for i = 1, ..., 5, with xi = i,
  Var(yi) = σ² and the yi independent.
Note the simple form of xi and that we have only 5 observations.
a) Find the least squares estimator β̂ and express it in terms of tᵀy by explicitly giving t.
b) Show that β̂ is unbiased and find its variance.
c) Show that γ̂ = (y4 - y2)/2 and η̂ = (y5 - y1)/4 are also unbiased for β, and also find the variance of each.
Consider now another estimator of the slope parameter β, the estimator δ̂ = c γ̂ + (1 - c) η̂.
d) Find the variance of δ̂ in terms of c and σ².
e) Find the value of c that minimizes the variance found in part (d).
f) Suppose we knew that α = 0; would β̂ still be BLUE?

18) Under Gauss-Markov assumptions, show that if Cov(aᵀy, dᵀê) = 0 for all d, then aᵀy is the BLUE for its expectation.

19) Under the Aitken Model, with Cov(e) = σ²V, if Cov(aᵀy, dᵀê) = 0 for all d, then is aᵀy still the BLUE for its expectation?

20) Prove Aitken's Theorem (Theorem 4.2) directly. Consider the Aitken Model, and let λᵀb be estimable. Let aᵀy be a linear unbiased estimator of λᵀb. Show that
  Var(aᵀy) = Var(λᵀb̂_GLS) + Var(aᵀy - λᵀb̂_GLS).

21) Let Yi, i = 1, ..., N be iid exponential(λ), that is, each has density f(y) = λ⁻¹ e^(-y/λ) for y > 0.
a) Find E(Yi^k) for k = 1, 2, 3, 4.
b) Find the variance of the usual variance estimator, Var( Σᵢ(Yi - Ȳ)²/(N - 1) ).

22) Recall the analysis of the US population in Exercise 3.23. Find the VIFs for each of the coefficients in the uncentered model, using t = 1790 through 2000. If you center using c = 1890, what happens to the VIFs?

23) Prove that if V is positive definite, then V(XᵀV⁻¹X) = V(Xᵀ).

24) Ridge regression is a technique that has been recommended by some statisticians to address multicollinearity problems arising in multiple regression. In our usual linear models framework with E(y) = Xb and Cov(y) = σ²I_N, the ridge regression estimator takes the form
  b̃ = (XᵀX + kI_p)⁻¹ Xᵀy
where k > 0. Assume here that X has full column rank, that is, rank(X) = p.
a) Find E(b̃).
b) Is λᵀb̃ an unbiased estimator of λᵀb?
c) Find Cov(b̃).
d) Mean squared error is commonly used to assess the quality of an estimator. For the ridge regression estimator b̃, find its mean squared error E(‖b̃ - b‖²) = E((b̃ - b)ᵀ(b̃ - b)).
Consider applying ridge regression to a multiple regression problem with two centered covariates (Σxi = Σzi = 0), taking the form
  Xb = [ 1   x1   z1 ] [ β0 ]
       [ 1   x2   z2 ] [ β1 ]
       [ ...  ...  ...] [ β2 ]
       [ 1   xN   zN ]
yielding the inner product matrix
  XᵀX = [ N      0       0     ]
        [ 0     Σxi²    Σxizi  ]
        [ 0     Σxizi   Σzi²   ].
For simplicity, suppose N = 10, Σxi² = Σzi² = 5, and Σxizi = 4.
e) Find the covariance matrix of our usual least squares estimator b̂ = (XᵀX)⁻¹Xᵀy in this situation.
f) Find a value of k such that one component of b̃ (your choice of component) has smaller variance than the corresponding least squares estimator.
g) Part (f) looks like it violates the Gauss-Markov Theorem. Does it?

4.10. References

D. Cochrane and G. H. Orcutt (1949) 'Applications of least squares regression to relationships containing autocorrelated error terms,' Journal of the American Statistical Association, 44: 32-61.

J. H. Goodnight (1979) 'A Tutorial on the Sweep Operator,' The American Statistician 33:149-
158.

J. F. Monahan (2001) Numerical Methods of Statistics, Cambridge University Press.

J. O. Rawlings, S. G. Pantula, and D. A. Dickey (1998) Applied Regression Analysis: A Research Tool, Springer-Verlag.

4.11. Addendum -- Variance of variance estimator


Computing Var(σ̂²) requires considerable detail, although the result is none too deep. The main result will be given with some generality.

Result 4.9. Let P be a symmetric matrix, μ a fixed vector, and e a random vector whose components ei are iid with the following four moments: E(ei) = 0, Var(ei) = E(ei²) = σ², E(ei³) = γ3, E(ei⁴) = γ4. Then
  Var((μ + e)ᵀP(μ + e)) = 4σ²(μᵀP²μ) + 4γ3 Σᵢ Pii Σⱼ Pij μj + 2σ⁴ Σ_{i≠j} Pij² + (γ4 - σ⁴) Σᵢ Pii².
Proof: Begin with
  Var((μ + e)ᵀP(μ + e)) = E[(μ + e)ᵀP(μ + e)(μ + e)ᵀP(μ + e)] - {E[(μ + e)ᵀP(μ + e)]}².
Note that E[(μ + e)ᵀP(μ + e)] = μᵀPμ + σ²tr(P) from Lemma 4.1. Now write out all 16 terms of E[(μ + e)ᵀP(μ + e)(μ + e)ᵀP(μ + e)] as
  E[μᵀPμ μᵀPμ] = (μᵀPμ)²
  E[μᵀPμ eᵀPμ] = 0
  E[μᵀPμ μᵀPe] = 0
  E[μᵀPμ eᵀPe] = (μᵀPμ) E[eᵀPe] = (μᵀPμ) σ² tr(P)
  E[eᵀPμ μᵀPμ] = 0
  E[eᵀPμ eᵀPμ] = Var(μᵀPe) = E[eᵀPμ μᵀPe] = σ²(μᵀP²μ)
  E[eᵀPμ μᵀPe] = σ²(μᵀP²μ)
  E[eᵀPμ eᵀPe] = E[μᵀPe eᵀPe]  (see below)
  E[μᵀPe μᵀPμ] = 0
  E[μᵀPe eᵀPμ] = σ²(μᵀP²μ)
  E[μᵀPe μᵀPe] = σ²(μᵀP²μ)
  E[μᵀPe eᵀPe] = (see below)
  E[eᵀPe μᵀPμ] = (μᵀPμ) E[eᵀPe] = (μᵀPμ) σ² tr(P)
  E[eᵀPe eᵀPμ] = E[μᵀPe eᵀPe]  (see below)
  E[eᵀPe μᵀPe] = E[μᵀPe eᵀPe]  (see below)
  E[eᵀPe eᵀPe] = (see below)

Only two difficult expressions remain: E[μᵀPe eᵀPe] and E[eᵀPe eᵀPe]. For both we need the following algebra:
  E[aᵀPb cᵀPd] = Σᵢ Σⱼ Σₖ Σₗ E[ai bj ck dl] Pij Pkl.
We're interested in all cases where the indices coincide:
  iiii (all same):       n terms
  iiij (one differs):    n(n-1) of each of (iiij, iiji, ijii, jiii)
  iijj (two pairs):      n(n-1) of each of (iijj, ijij, ijji)
  iijk (one pair):       n(n-1)(n-2) of each of (iijk, ijik, ijki, jiik, jiki, jkii)
  ijkl (all different):  n(n-1)(n-2)(n-3) terms.

For E[μᵀPe eᵀPe] = Σᵢ Σⱼ Σₖ Σₗ E[μi ej ek el] Pij Pkl we have
  iiii:  Σᵢ E[μi ei ei ei] Pii Pii = γ3 Σᵢ μi Pii²
  iiij:  Σ_{i≠j} E[μj ei ei ei] Pji Pii = γ3 Σ_{i≠j} μj Pji Pii.
Note that the other three one-differs cases are zero, for example, E[μi ej ei ei Pij Pii] = 0. This extends to the other three possible combinations (two pairs, one pair, all differ). Adding these two pieces produces
  E[μᵀPe eᵀPe] = γ3 Σᵢ μi Pii² + γ3 Σ_{i≠j} μj Pji Pii = γ3 Σᵢ Pii Σⱼ Pij μj,
which does not appear to simplify further.

For E[eᵀPe eᵀPe] = Σᵢ Σⱼ Σₖ Σₗ E[ei ej ek el] Pij Pkl we have
  iiii:  Σᵢ E[ei ei ei ei] Pii Pii = γ4 Σᵢ Pii²
  iiij:  Σ_{i≠j} E[ej ei ei ei] Pji Pii = 0  (all four cases: jiii, ijii, iiji, iiij)
  iijj:  Σ_{i≠j} E[ei ei ej ej] Pii Pjj = σ⁴ Σ_{i≠j} Pii Pjj
  ijij:  Σ_{i≠j} E[ei ej ei ej] Pij Pij = σ⁴ Σ_{i≠j} Pij²
  ijji:  Σ_{i≠j} E[ei ej ej ei] Pij Pji = σ⁴ Σ_{i≠j} Pij²,
and the other cases are all zero. Gathering up the pieces produces the following:
  Var((μ + e)ᵀP(μ + e))
    = E[(μ + e)ᵀP(μ + e)(μ + e)ᵀP(μ + e)] - {E[(μ + e)ᵀP(μ + e)]}²
    = (μᵀPμ)² + 2(μᵀPμ)σ²tr(P) + 4σ²(μᵀP²μ) + 4γ3 Σᵢ Pii Σⱼ Pij μj
      + σ⁴ Σ_{i≠j}(Pii Pjj + 2Pij²) + γ4 Σᵢ Pii² - (μᵀPμ + σ²tr(P))(μᵀPμ + σ²tr(P))
    = 4σ²(μᵀP²μ) + 4γ3 Σᵢ Pii Σⱼ Pij μj + σ⁴ Σ_{i≠j}(Pii Pjj + 2Pij²) - σ⁴tr(P)² + γ4 Σᵢ Pii²
    = 4σ²(μᵀP²μ) + 4γ3 Σᵢ Pii Σⱼ Pij μj + 2σ⁴ Σ_{i≠j} Pij² + (γ4 - σ⁴) Σᵢ Pii².
The last step employs
  tr(P)² = (Σᵢ Pii)² = Σᵢ Pii² + Σ_{i≠j} Pii Pjj. ∎

Corollary 4.10. Let P be symmetric and idempotent, and let the components of the random vector e be iid with mean zero and four moments existing; then
  Var(eᵀPe) = 2σ⁴ Σ_{i≠j} Pij² + (γ4 - σ⁴) Σᵢ Pii².

Corollary 4.11. Let P be symmetric and γ3 = 0, γ4 = 3σ⁴ (which hold if the ei are iid Normal(0, σ²)); then
  Var((μ + e)ᵀP(μ + e)) = 4σ²(μᵀP²μ) + 2σ⁴ tr(P²).
Proof: Here γ3 = 0 and γ4 = 3σ⁴, so starting with
  Var((μ + e)ᵀP(μ + e)) = 4σ²(μᵀP²μ) + 2σ⁴ Σ_{i≠j} Pij² + Σᵢ(3σ⁴ - σ⁴)Pii²,
now use
  tr(P²) = Σᵢ Σⱼ Pij² = Σᵢ Pii² + Σ_{i≠j} Pij². ∎
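Corollary 4.10 can be checked by simulation. The sketch below uses centered exponential errors, so that γ3 and γ4 are nonzero (the central moments σ² = λ² and γ4 = 9λ⁴ connect with Exercise 21), and compares the Monte Carlo variance of eᵀ(I - P_X)e with the formula; all numerical settings are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 20
X = np.column_stack([np.ones(N), np.arange(1.0, N + 1)])
P = np.eye(N) - X @ np.linalg.inv(X.T @ X) @ X.T    # I - P_X, symmetric idempotent

lam = 1.0                                           # exponential(lambda), mean lam
sigma2 = lam**2                                     # central second moment of e = Y - lam
gamma4 = 9.0 * lam**4                               # central fourth moment of the exponential
off_diag_sq = np.sum(P**2) - np.sum(np.diag(P)**2)  # sum of P_ij^2 over i != j
formula = 2*sigma2**2*off_diag_sq + (gamma4 - sigma2**2)*np.sum(np.diag(P)**2)

sims = []
for _ in range(100000):
    e = rng.exponential(lam, N) - lam               # iid, mean zero
    sims.append(e @ P @ e)
print(np.var(sims), formula)                        # the two should be close
```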

JFM -- 05 January 2006

