Chapter 4. Gauss-Markov Model
4.1 Model Assumptions
So far we've approached the linear model only as a method of mathematical
approximation. In this chapter, we pose the Gauss-Markov model which embodies the most
common assumptions for the statistical approach to the linear model, leading to the Gauss-
Markov Theorem. The Gauss-Markov model takes the form
y = Xb + e    (4.1)
where y is the (N by 1) vector of observed responses, and X is the (N by p) known design
matrix. As before, the coefficient vector b is unknown and to be determined or estimated.
The main features of the Gauss-Markov model are the assumptions on the error e:
E(e) = 0 and Cov(e) = σ²I_N.    (4.2)
The notation for expectation and covariances above can be rewritten component by
component:
[E(e)]ᵢ = ith component of E(e) = E(eᵢ)
[Cov(e)]ᵢⱼ = (i,j)th element of the covariance matrix = Cov(eᵢ, eⱼ),
so that the Gauss-Markov assumptions can be rewritten as
E(eᵢ) = 0, i = 1, ..., N
Cov(eᵢ, eⱼ) = σ² for i = j, and 0 for i ≠ j,
that is, the errors in the model have a zero mean, constant variance, and are uncorrelated. An
alternative view of the Gauss-Markov model does not employ the error vector e:
E(y) = Xb, Cov(y) = σ²I_N.
The assumptions in the Gauss-Markov model are easily acceptable for most practical
problems and deviations from these assumptions will be considered in more detail later.
Before we get to the Gauss-Markov Theorem, we will need some simple tools for
conveniently working with means and variances of vector random variables, in particular, a
linear combination aᵀy where a is a fixed vector. The rules are:
i) E(aᵀy) = aᵀE(y)
ii) Var(aᵀy) = aᵀ Cov(y) a
iii) Cov(aᵀy, cᵀy) = aᵀ Cov(y) c, for fixed vectors a, c
iv) Cov(Aᵀy) = Aᵀ Cov(y) A, for fixed matrix A.
The key result is (iii), from which (ii) and (iv) follow easily; the rest is just algebraic bookkeeping.
Example 4.1. Suppose Cov(y) is the matrix below and c = (1, −1, 2)ᵀ. Then

Var(y₁ − y₂ + 2y₃) = Var(cᵀy) = cᵀ Cov(y) c
 = [1 −1 2] [ 4  2  4 ] [ 1 ]
            [ 2  5  0 ] [−1 ]
            [ 4  0 25 ] [ 2 ]
 = [1 −1 2] [ 10 −3 54 ]ᵀ = 121.
Chapter 4, page 1
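Rule (ii) can be checked numerically. A minimal Monte Carlo sketch (NumPy; the normal distribution and sample size are incidental choices) using the covariance matrix and vector c of Example 4.1:

```python
import numpy as np

# Check rule (ii), Var(c'y) = c' Cov(y) c, for the covariance matrix
# and the vector c used in Example 4.1.
rng = np.random.default_rng(0)
Sigma = np.array([[4., 2., 4.],
                  [2., 5., 0.],
                  [4., 0., 25.]])
c = np.array([1., -1., 2.])

exact = c @ Sigma @ c                      # 121, as in Example 4.1
draws = rng.multivariate_normal(np.zeros(3), Sigma, size=200_000)
mc = (draws @ c).var()                     # sample variance of c'y
print(exact, mc)
```

The simulated variance of cᵀy agrees with the algebraic value 121 up to Monte Carlo error.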
Example 4.2. Variance of a least squares estimator. Let y = Xb + e, with the Gauss-Markov
assumptions on e, so that Cov(y) = σ²I_N, and let λᵀb be an estimable function. Then the
variance of the least squares estimator follows the calculation (see Exercise 4.2)

Var(λᵀb̂) = Var(λᵀ(XᵀX)⁻Xᵀy) = λᵀ(XᵀX)⁻Xᵀ Cov(y) X(XᵀX)⁻ᵀλ
 = λᵀ(XᵀX)⁻Xᵀ(σ²I_N)X(XᵀX)⁻ᵀλ = σ²λᵀ(XᵀX)⁻λ.
Example 4.3. In the simple linear regression case in Example 3.1, we solved the normal
equations and found the usual slope estimate

b̂₂ = Σᵢ₌₁ᴺ (xᵢ − x̄)yᵢ / Σᵢ₌₁ᴺ (xᵢ − x̄)².

To compute the mean and variance of b̂₂, we have two routes. One route is to treat b̂₂ as a
linear combination of the yᵢ's:

b̂₂ = Σ(xᵢ − x̄)yᵢ / Σ(xᵢ − x̄)² = Σ[(xᵢ − x̄)/Sxx]yᵢ, where Sxx = Σ(xᵢ − x̄)².

Then

E(b̂₂) = Σ[(xᵢ − x̄)/Sxx]E(yᵢ) = Σ[(xᵢ − x̄)/Sxx](b₁ + b₂xᵢ) = [Σ(xᵢ − x̄)xᵢ/Sxx]b₂ = b₂,

and the usual slope estimate is unbiased. Its variance can be found from the following algebra:

Var(b̂₂) = Σᵢ Σⱼ [(xᵢ − x̄)/Sxx][(xⱼ − x̄)/Sxx] Cov(yᵢ, yⱼ) = σ²Σ(xᵢ − x̄)²/Sxx² = σ²/Sxx,

since Cov(yᵢ, yⱼ) = σ² for i = j and 0 for i ≠ j. The other route applies rule (iv) to the full
coefficient vector:

Cov(b̂) = σ²(XᵀX)⁻¹ = (σ²/(N·Sxx)) [  Σxᵢ²  −Σxᵢ ]
                                   [ −Σxᵢ    N   ]

whose (2,2) element is again Var(b̂₂) = σ²/Sxx.
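Both conclusions of Example 4.3 (unbiasedness and Var(b̂₂) = σ²/Sxx) can be checked by simulation; the design points and parameter values below are illustrative.

```python
import numpy as np

# Monte Carlo check of Example 4.3: under Gauss-Markov errors the
# least squares slope has mean b2 and variance sigma^2 / Sxx.
rng = np.random.default_rng(1)
x = np.arange(1.0, 11.0)                 # N = 10 design points
b1, b2, sigma = 2.0, 3.0, 1.5
Sxx = ((x - x.mean()) ** 2).sum()

reps = 100_000
y = b1 + b2 * x + sigma * rng.standard_normal((reps, x.size))
w = (x - x.mean()) / Sxx                 # weights: b2hat = sum of w_i y_i
b2hat = y @ w

print(b2hat.mean(), b2hat.var(), sigma**2 / Sxx)
```

The empirical mean is close to b₂ = 3 and the empirical variance is close to σ²/Sxx.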
for all b, which leads to c = 0 and λ = Xᵀa. In our One-Way ANOVA Model,
yᵢⱼ = μ + αᵢ + eᵢⱼ, the estimators y₁₁, (y₁₁ + 2y₁₂)/3, and ȳ₁. are all unbiased estimators of
μ + α₁. In simple linear regression, yᵢ = β₀ + β₁xᵢ + eᵢ, with xᵢ = i, say, both y₂ − y₁ and
y₃ − y₂ are unbiased estimators of β₁. Which is the better estimator, and how shall we measure?
When two estimators are both unbiased, the estimator with smaller variance is better, since
variance is a measure of variability around the mean. Our least squares estimator of λᵀb is
λᵀb̂, where b̂ is a solution to the normal equations. Recall that if we construct all solutions to
the normal equations,

b̂(z) = (XᵀX)⁻Xᵀy + (I − (XᵀX)⁻XᵀX)z,

then for estimable λᵀb, λᵀb̂(z) is constant for all values of z -- all solutions to the normal
equations lead to the same least squares estimator.
Theorem 4.1. (Gauss-Markov Theorem) Under the assumptions of the Gauss-Markov Model,
y = Xb + e, where E(e) = 0 and Cov(e) = σ²I_N,
if λᵀb is estimable, then λᵀb̂ is the best (minimum variance) linear unbiased estimator
(BLUE) of λᵀb, where b̂ solves the normal equations
XᵀXb = Xᵀy.
Proof: Suppose c + dᵀy is another unbiased estimator of λᵀb. Then c = 0 and dᵀX = λᵀ,
since E(c + dᵀy) = c + dᵀXb = λᵀb for all b. Now,

Var(c + dᵀy) = Var(dᵀy) = Var(λᵀb̂ + (dᵀy − λᵀb̂))
 = Var(λᵀb̂) + Var(dᵀy − λᵀb̂) + 2 Cov(λᵀb̂, dᵀy − λᵀb̂)
 = Var(λᵀb̂) + Var(dᵀy − λᵀb̂) + 2 λᵀ(XᵀX)⁻Xᵀ(σ²I_N)(d − X(XᵀX)⁻ᵀλ)
 = Var(λᵀb̂) + Var(dᵀy − λᵀb̂) + 2σ²λᵀ(XᵀX)⁻(Xᵀd − XᵀX(XᵀX)⁻ᵀλ).

Since Xᵀd = λ and, for estimable λ, XᵀX(XᵀX)⁻ᵀλ = λ, the covariance term vanishes, so

Var(c + dᵀy) = Var(λᵀb̂) + Var(dᵀy − λᵀb̂) ≥ Var(λᵀb̂).
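A small numerical illustration of the theorem: in simple linear regression, both the least squares slope and the endpoint-difference estimator (y_N − y₁)/(x_N − x₁) are linear unbiased estimators of the slope, and under Gauss-Markov errors Var(wᵀy) = σ²wᵀw, so comparing wᵀw compares variances. The design points are illustrative.

```python
import numpy as np

# Two linear unbiased estimators of the slope; the least squares one
# (BLUE, by Theorem 4.1) has the smaller variance.
x = np.arange(1.0, 9.0)                       # N = 8 design points
Sxx = ((x - x.mean()) ** 2).sum()
w_ols = (x - x.mean()) / Sxx                  # OLS slope weights
w_diff = np.zeros_like(x)                     # (y_N - y_1)/(x_N - x_1)
w_diff[0] = -1.0 / (x[-1] - x[0])
w_diff[-1] = 1.0 / (x[-1] - x[0])

# unbiasedness constraints: sum(w) = 0 and sum(w * x) = 1
print(w_ols.sum(), w_ols @ x, w_diff.sum(), w_diff @ x)
# Var(w'y) = sigma^2 * w'w under Gauss-Markov; compare w'w
print(w_ols @ w_ols, w_diff @ w_diff)
```

Both weight vectors satisfy the unbiasedness constraints, and wᵀw is smaller for the least squares weights.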
Result 4.1. The BLUE λᵀb̂ of estimable λᵀb is uncorrelated with all unbiased estimators of
zero.
Proof: First, characterize unbiased estimators of zero as c + aᵀy such that
E(c + aᵀy) = c + aᵀXb = 0 for all b,
or c = 0 and Xᵀa = 0, that is, a ∈ N(Xᵀ). Computing the covariance between λᵀb̂ and aᵀy we
have
Cov(λᵀb̂, aᵀy) = λᵀ(XᵀX)⁻Xᵀ Cov(y) a = σ²λᵀ(XᵀX)⁻Xᵀa = 0.
The Gauss-Markov Theorem can be extended to the vector case. Let the columns λ⁽ʲ⁾ of the
matrix Λ be linearly independent, such that the λ⁽ʲ⁾ᵀb are linearly independent estimable functions;
then

Cov(Λᵀb̂) = Cov(Λᵀ(XᵀX)⁻Xᵀy) = σ²Λᵀ(XᵀX)⁻XᵀX(XᵀX)⁻ᵀΛ = σ²Λᵀ(XᵀX)⁻Λ.

If we have any set of unbiased estimators of Λᵀb, say Cᵀy, that is, E(Cᵀy) = Λᵀb for all b, then
the difference in their covariance matrices

Cov(Cᵀy) − Cov(Λᵀb̂)

is nonnegative definite. If X has full column rank, applying Λ = I gives
Cov(b̂) = σ²(XᵀX)⁻¹. Moreover, in the full column rank model, the Gauss-Markov Theorem
says that if b̃ is any other unbiased estimator of b, then

Cov(b̃) − Cov(b̂)

is nonnegative definite.
To construct an unbiased estimator for σ², as the reader might guess, we will use the residual
vector (I − P_X)y or, more specifically, its sum of squares SSE = ‖(I − P_X)y‖².
Lemma 4.1. Let Z be a vector random variable with E(Z) = μ and Cov(Z) = Σ. Then
E(ZᵀAZ) = μᵀAμ + tr(AΣ).
Proof: Note that
E[(Zᵢ − μᵢ)(Zⱼ − μⱼ)] = Σᵢⱼ = E(ZᵢZⱼ) − μᵢE(Zⱼ) − μⱼE(Zᵢ) + μᵢμⱼ,
so that
[E(ZZᵀ)]ᵢⱼ = E[(Zᵢ − μᵢ)(Zⱼ − μⱼ)] + μᵢμⱼ = Σᵢⱼ + μᵢμⱼ,
or E(ZZᵀ) = μμᵀ + Σ. Now, using the linearity of the trace and expectation operations, we have
E(ZᵀAZ) = E(tr(ZᵀAZ)) = E(tr(AZZᵀ)) = tr(A E(ZZᵀ)) = tr(A(μμᵀ + Σ))
 = tr(Aμμᵀ) + tr(AΣ) = μᵀAμ + tr(AΣ).
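The trace identity of Lemma 4.1 is easy to check numerically; the mean, covariance, and matrix A below are arbitrary illustrative choices.

```python
import numpy as np

# Monte Carlo check of Lemma 4.1: E(Z'AZ) = mu'A mu + tr(A Sigma).
rng = np.random.default_rng(2)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0]])

exact = mu @ A @ mu + np.trace(A @ Sigma)
Z = rng.multivariate_normal(mu, Sigma, size=400_000)
mc = np.einsum('ni,ij,nj->n', Z, A, Z).mean()   # average of Z'AZ
print(exact, mc)
```

Note that the lemma does not require A to be symmetric.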
Result 4.2. Consider the Gauss-Markov Model given in (4.1) and (4.2). An unbiased
estimator of σ² is σ̂² = SSE/(N − r), where SSE = êᵀê = yᵀ(I − P_X)y and r = rank(X).
Proof: Since E(y) = Xb and Cov(y) = σ²I_N, applying Lemma 4.1 above yields
E(yᵀ(I − P_X)y) = (Xb)ᵀ(I − P_X)Xb + tr((I − P_X)(σ²I_N)) = σ²(N − r).
Dividing by the degrees of freedom parameter tr(I − P_X) = N − r associated with
SSE = yᵀ(I − P_X)y gives the desired result.
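A simulation sketch of Result 4.2, using a small one-way layout whose design matrix is rank deficient (so r < p); the group means and σ are made-up values.

```python
import numpy as np

# Check Result 4.2: SSE/(N - r) is unbiased for sigma^2, with a
# rank-deficient one-way ANOVA design (overall mean + 3 indicators).
rng = np.random.default_rng(3)
N, sigma = 9, 2.0
X = np.column_stack([np.ones(N), np.repeat(np.eye(3), 3, axis=0)])
r = np.linalg.matrix_rank(X)                 # 3, not 4
P = X @ np.linalg.pinv(X)                    # projection onto C(X)
b = np.array([1.0, 0.5, -0.5, 2.0])

y = X @ b + sigma * rng.standard_normal((50_000, N))
SSE = ((y - y @ P.T) ** 2).sum(axis=1)
print((SSE / (N - r)).mean(), sigma**2)
```

The average of SSE/(N − r) across replications is close to σ² = 4.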
If we examine the regression sum of squares, SSR = ŷᵀŷ = yᵀP_X y, we find
E(SSR) = E(yᵀP_X y) = bᵀXᵀP_X Xb + tr(P_X σ²I_N) = ‖Xb‖² + rσ².
Little else can be said about σ̂² without further distributional assumptions, even when the
eᵢ are iid (independent, identically distributed), since its variance depends on the third and
fourth moments of the underlying distribution. The algebra for computing Var(σ̂²) in terms of
the third and fourth moments is given in an addendum to this chapter.
The estimation of the variance is also affected by misspecification:
E(yᵀ(I − P_X)y) = (Xb + δ)ᵀ(I − P_X)(Xb + δ) + tr((I − P_X)(σ²I_N))
 = δᵀ(I − P_X)δ + σ²(N − r).    (4.4)
Therefore, the bias in the variance estimate disappears only when the misspecification
disappears. Note that the bias in σ̂² is zero if and only if
δᵀ(I − P_X)δ = 0, or δ ∈ C(X), or δ = Xd for some d.
In that case, the model is then
y = Xb + Xd + e = X(b + d) + e.
One view of misspecification is that the signal δ that is not accounted for is partitioned into
two pieces:
y = Xb + δ + e = P_X y + (I − P_X)y = (Xb + P_X δ + P_X e) + (I − P_X)(δ + e).
One piece, P_X δ, lies in C(X) and affects the estimation of λᵀb; the other part, (I − P_X)δ,
affects the estimation of the variance.
Example 4.4. Simple misspecification. Suppose yᵢ = β₀ + β₁xᵢ + eᵢ, but the covariate xᵢ is
ignored and we just estimate the mean and variance. Here we have δᵢ = β₁xᵢ and X = 1, so
that β̂₀ = ȳ and E(ȳ) = β₀ + β₁x̄. As for the variance, our usual variance estimate is
σ̂² = Σ(yᵢ − ȳ)²/(N − 1). Applying (4.4), the size of the bias is
δᵀ(I − P_X)δ/(N − 1) = β₁²Σ(xᵢ − x̄)²/(N − 1).
Example 4.5. Electricity problem. Consider the analysis of electricity consumption, using the
following model as the true model for households:
billᵢ = β₀ + β₁incomeᵢ + β₂personsᵢ + β₃areaᵢ + eᵢ    (4.5)
where billᵢ = monthly electric bill for household i, incomeᵢ = monthly disposable income,
personsᵢ = number in household, and areaᵢ = heated living area of home or apartment. In such
analyses, income is often not available, so consider the consequences for estimating
the regression coefficients β₂ and β₃ when income is dropped from the model:
E(billᵢ) = β₀ + β₂personsᵢ + β₃areaᵢ.    (4.6)
In this situation, rows of X contain [1 personsᵢ areaᵢ], bᵀ = [β₀ β₂ β₃], and
δᵢ = β₁incomeᵢ. Another approach is to construct a regression model for the missing variable
income, using the remaining variables as explanatory variables:
incomeᵢ = γ₀ + γ₁personsᵢ + γ₂areaᵢ + fᵢ.    (4.7)
Combining the true model (4.5) and the expression above for the missing variable (4.7), the
misspecified model we are fitting (4.6) really becomes
billᵢ = (β₀ + β₁γ₀) + (β₂ + β₁γ₁)personsᵢ + (β₃ + β₁γ₂)areaᵢ + (eᵢ + β₁fᵢ).    (4.8)
If income is not related to persons, then γ₁ = 0, and the estimate of β₂ remains unbiased;
however, if income is related to area, then E(β̂₃) = β₃ + β₁γ₂, with the size of the bias
depending on the importance of the missing variable (β₁) and the strength of its relationship
with the variable of interest (γ₂). Also note that the error in the misspecified model (4.8),
eᵢ + β₁fᵢ, contains both the original error eᵢ and the effect of the misspecification.
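The structure of (4.8) can be demonstrated by simulation. All the numbers below (coefficients, ranges, variable names) are invented for illustration; income is built to depend on area but not on persons, so dropping it biases only the area coefficient.

```python
import numpy as np

# Omitted-variable bias as in Example 4.5: drop a covariate ("income")
# that is correlated with a kept one ("area") but not with "persons".
rng = np.random.default_rng(4)
N = 200_000
persons = rng.integers(1, 6, N).astype(float)
area = rng.uniform(40.0, 200.0, N)
g0, g1, g2 = 5.0, 0.0, 0.02                # income model; gamma1 = 0
income = g0 + g1 * persons + g2 * area + rng.standard_normal(N)
b0, b1, b2, b3 = 10.0, 3.0, 4.0, 0.1
bill = b0 + b1 * income + b2 * persons + b3 * area + rng.standard_normal(N)

X = np.column_stack([np.ones(N), persons, area])   # income dropped
coef = np.linalg.lstsq(X, bill, rcond=None)[0]
print(coef)    # approx (b0 + b1*g0, b2, b3 + b1*g2), per (4.8)
```

The fitted persons coefficient stays near β₂ = 4, while the area coefficient is pulled to β₃ + β₁γ₂ = 0.16.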
The case of overfitting can be viewed in the following context, by partitioning the
explanatory variables into two groups or blocks:
y = X₁b₁ + X₂b₂ + e
where the second group of explanatory variables is not needed, since b₂ = 0. To make
comparisons, let us simplify matters and assume the Gauss-Markov model and that both X₁
and X = (X₁ X₂) have full column rank. Using the smaller model, with design matrix X₁,
leads to the least squares estimator we will denote as b̃₁, constructed as

b̃₁ = (X₁ᵀX₁)⁻¹X₁ᵀy.

Using previous results, we can show that this estimator is unbiased, E(b̃₁) = b₁, and its
covariance matrix is

Cov(b̃₁) = σ²(X₁ᵀX₁)⁻¹.
Including the second block of explanatory variables in the model that the researcher fits
leads to the least squares estimators

[ b̂₁ ]   [ X₁ᵀX₁  X₁ᵀX₂ ]⁻¹ [ X₁ᵀy ]
[ b̂₂ ] = [ X₂ᵀX₁  X₂ᵀX₂ ]   [ X₂ᵀy ].

Again, it is easy to show similar results for this estimator:

E [ b̂₁ ] = [ b₁ ]  and  Cov( [ b̂₁ ] ) = σ² [ X₁ᵀX₁  X₁ᵀX₂ ]⁻¹
  [ b̂₂ ]   [ 0  ]        ( [ b̂₂ ] )       [ X₂ᵀX₁  X₂ᵀX₂ ]  .

Using Exercise A.74 (partitioned inverse) we can show

Cov(b̂₁) = σ²(X₁ᵀX₁)⁻¹ + σ²(X₁ᵀX₁)⁻¹X₁ᵀX₂[X₂ᵀ(I − P_X₁)X₂]⁻¹X₂ᵀX₁(X₁ᵀX₁)⁻¹,    (4.9)

so that the penalty for including the second block of variables X₂ is increased variance in
the coefficient estimators.
Denoting rank(X₁) = r₁ and rank(X) = r, the variance estimators are both
unbiased,
E[yᵀ(I − P_X₁)y]/(N − r₁) = σ²
E[yᵀ(I − P_X)y]/(N − r) = σ²,
since there is no misspecification, and the only difference between the two estimators is in
the degrees of freedom. Except for some applications where error degrees of freedom are
scarce, there is little lost in variance estimation by overfitting.
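The coefficient-variance penalty (4.9), in contrast, can be substantial. A sketch with a made-up design: the unneeded column x₂ is built to be correlated with x₁, and the diagonal of Cov(b̂₁)/σ² grows accordingly.

```python
import numpy as np

# Overfitting penalty (4.9): adding an unneeded block X2 correlated
# with X1 inflates Cov(b1hat); the difference is nonnegative definite.
rng = np.random.default_rng(5)
N = 50
x1 = rng.uniform(0.0, 10.0, N)
x2 = x1 + rng.standard_normal(N)       # correlated with x1, but b2 = 0
X1 = np.column_stack([np.ones(N), x1])
X = np.column_stack([X1, x2])

cov_small = np.linalg.inv(X1.T @ X1)           # Cov(b1tilde) / sigma^2
cov_big = np.linalg.inv(X.T @ X)[:2, :2]       # Cov(b1hat) / sigma^2
print(np.diag(cov_small), np.diag(cov_big))
```

Every diagonal entry (each coefficient variance) is at least as large in the overfitted model.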
Example 4.6. No intercept. Suppose we have the simple linear regression problem,
yᵢ = β₀ + β₁xᵢ + eᵢ, but β₀ = 0. The variance of the least squares estimate of the slope,
where we include the intercept, is σ²/Σ(xᵢ − x̄)². If we drop the intercept, the slope estimator
is simply Σxᵢyᵢ/Σxᵢ², and its variance is σ²/Σxᵢ², which is smaller, since Σxᵢ² ≥ Σ(xᵢ − x̄)².
See also Exercise 4.7.
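A one-line check of the comparison in Example 4.6, for an arbitrary set of design points:

```python
import numpy as np

# Example 4.6: with beta0 = 0 known, the no-intercept slope estimator
# has the smaller variance, since sum(x^2) >= sum((x - xbar)^2).
x = np.arange(1.0, 9.0)
var_with_intercept = 1.0 / ((x - x.mean()) ** 2).sum()   # times sigma^2
var_no_intercept = 1.0 / (x ** 2).sum()                  # times sigma^2
print(var_with_intercept, var_no_intercept)
```
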
Returning to coefficient estimation, examination of the difference of the two
covariance matrices shows the effect of including the second block of variables:

Cov(b̂₁) − Cov(b̃₁) = σ²(X₁ᵀX₁)⁻¹X₁ᵀX₂[X₂ᵀ(I − P_X₁)X₂]⁻¹X₂ᵀX₁(X₁ᵀX₁)⁻¹.

If the second block of explanatory variables is orthogonal to the first, that is, X₁ᵀX₂ = 0, then
the estimators not only have the same variance, they are the same estimators, as XᵀX becomes
block diagonal. As X₂ gets closer to C(X₁), X₂ᵀ(I − P_X₁)X₂ gets smaller and, when
inverted, causes Cov(b̂₁) to explode. This condition, known as multicollinearity, is the
other feared consequence in model selection. One signal of severe multicollinearity is the
appearance of unexpected signs of coefficient estimators due to their wild inaccuracy.
Without multicollinearity, and with enough degrees of freedom to estimate the variance, there
is little to be lost in overfitting. With multicollinearity, overfitting can be catastrophic.
One measure of the effect of multicollinearity is called the variance inflation factor, or
VIF. If the explanatory variables were mutually orthogonal, then, following, say, Example
4.3, Var(b̂ⱼ) = σ²/Sxx, where Sxx = Σᵢ(Xᵢⱼ − X̄.ⱼ)². In practice, when the explanatory
variables are not orthogonal, the defining expression is

Var(b̂ⱼ) = (VIF) × σ²/Sxx.    (4.10)

This relationship arises from the other form of the partitioned inverse result (Exercise A.74).
Without loss of generality, consider j = 1, and partition the design matrix X as above, with
just the first column in X₁ and all of the other columns in X₂. Then, employing the other form
of the partitioned inverse result, we have

Cov(b̂₁) = σ²[X₁ᵀ(I − P_X₂)X₁]⁻¹.    (4.11)

Putting (4.10) and (4.11) together, we find that

VIF = Sxx / (X₁ᵀ(I − P_X₂)X₁) = 1/(1 − Rⱼ²),    (4.12)

where Rⱼ² is what we would have for R² if we took column j as the response vector and
employed the remaining explanatory variables. Clearly, when an explanatory variable can
itself be closely approximated by a linear combination of the other variables, Rⱼ² will be close
to one and the VIF very large, indicating a serious multicollinearity problem.
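The two expressions in (4.12) can be computed side by side; the collinear design below is fabricated so that the first column is nearly a linear combination of the others.

```python
import numpy as np

# VIF two ways, per (4.10)-(4.12): from the partitioned-inverse form
# and as 1/(1 - Rj^2), for a deliberately collinear made-up design.
rng = np.random.default_rng(6)
N = 100
u = rng.standard_normal(N)
x1 = u + 0.1 * rng.standard_normal(N)       # x1 nearly a copy of u
x2 = u + 0.1 * rng.standard_normal(N)       # ... and so is x2
X2 = np.column_stack([np.ones(N), x2])      # intercept + other column
P2 = X2 @ np.linalg.pinv(X2)

Sxx = ((x1 - x1.mean()) ** 2).sum()
vif_partition = Sxx / (x1 @ (np.eye(N) - P2) @ x1)

resid = x1 - P2 @ x1                        # regress x1 on X2
R2 = 1.0 - (resid @ resid) / Sxx
vif_r2 = 1.0 / (1.0 - R2)
print(vif_partition, vif_r2)                # equal, and far above 1
```

The two formulas agree exactly, and the VIF is large because x₁ is well predicted by x₂ and the intercept.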
The quantity X₂ᵀ(I − P_X₁)X₂ is employed by some algorithms, e.g., the sweep operator
in SAS (see, e.g., Goodnight (1979) or Monahan (2001)), to determine the rank of X or XᵀX in
regression problems. In practice, this works quite well, although the finite precision of
floating point arithmetic limits its effectiveness. When most of the columns of the design
matrix X are composed of 0, 1, or some small integer, the computed elements of
X₂ᵀ(I − P_X₁)X₂ can be zero or nearly so. However, in the case of continuous covariates,
X₂ᵀ(I − P_X₁)X₂ will rarely be zero even in the case of perfect collinearity, due to the effects of
rounding error. In practice, most computer software for regression determines rank by testing
whether X₂ᵀ(I − P_X₁)X₂, or something similar, is close to zero. While the mathematics
presented here appears to proceed smoothly in the case of dependence in the columns of X --
by using the generalized inverse when the inverse doesn't exist -- the reality is that the effect is
dramatic: certain components or functions of b are no longer estimable. The practical matter
is that the rank of the design matrix should be known in advance by the researcher. If the
computer software determines a smaller rank than expected, then either an unexpected
dependency or a catastrophic multicollinearity problem exists. Finding a larger rank than
expected indicates the inability of the software to detect dependence. (See Exercises 4.15,
4.16.)
where the matrix V is a KNOWN positive definite matrix. In this way, it is similar to the
Gauss-Markov Model in that the covariance is known up to a scalar σ². If V = I, then we
have the Gauss-Markov model. In practice, however, V is usually unknown, or has just a few
unknown parameters; this case will be addressed later. In an Aitken model, the least squares
estimator λᵀb̂ of an estimable function λᵀb may no longer be the BLUE for λᵀb. We will now
construct a generalized least squares (GLS) estimator of λᵀb and show that it is the BLUE for
λᵀb.
The crucial step in GLS is the construction of a square root of the unscaled covariance
matrix, that is, finding R such that RVRᵀ = I_N. As discussed in Appendix A, there are two
approaches for constructing a square root of a positive definite matrix: the Cholesky
factorization and the spectral decomposition. The Cholesky factorization writes V as the
product of a lower triangular matrix L and its transpose, V = LLᵀ. Taking this route, use
R = L⁻¹. The spectral decomposition uses the eigenvector-eigenvalue decomposition
V = QΛQᵀ, where Λ is the diagonal matrix of eigenvalues and Q is the (orthogonal) matrix of
eigenvectors stacked as columns. This route suggests taking R = QΛ^(−1/2)Qᵀ, so that the
'square root' matrix is symmetric in this case. Note that we insist that V be positive definite,
so that L or Λ is nonsingular. If V were singular, then there would be a linear combination of
observations with zero variance, and this case would more properly be treated as a linear
constraint.
Using the matrix R, we can reformulate the Aitken Model using a transformed response
variable:
z = Ry = RXb + Re, or
z = Ub + f, where E(f) = 0 and Cov(f) = σ²I_N,    (4.13)
which looks just like the Gauss-Markov Model. Now we can tackle all of the same issues as
before, and then transform back to the Aitken model:
a) Estimability. The linear function λᵀb is estimable if λ is in the column space of the
transpose of the design matrix. Here this means λ ∈ C(Uᵀ) = C(XᵀRᵀ) = C(Xᵀ), since R is
nonsingular. From another viewpoint, estimability did not involve the second moment
anyway, so estimability should not be affected by the fact that V is not a constant diagonal
matrix.
b) Linear estimators. Note that any estimator g + hᵀz that is linear in z is also a
linear estimator g + hᵀRy = g + aᵀy in y, and vice versa. In other words, the class of linear
estimators in z is the same as the class of linear estimators in y.
c) Generalized Least Squares Estimators. In the transformed model, the normal
equations are
UᵀUb = Uᵀz    (4.14)
and the least squares estimator from (4.13) solves (4.14) above. However, these normal
equations can easily be rewritten as
(RX)ᵀ(RX)b = (RX)ᵀ(Ry), or
XᵀV⁻¹Xb = XᵀV⁻¹y,    (4.15)
which are known as the Aitken equations; the solution to (4.15) will be denoted as b̂_GLS, a
generalized least squares estimator of b. When needed for clarity, the solution to the usual
normal equations XᵀXb = Xᵀy will be denoted by b̂_OLS, for ordinary least squares. From
Section 4.1, we should expect that λᵀb̂_GLS is BLUE for λᵀb, but this will be examined
further.
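The whitening route (4.13)-(4.15) can be sketched directly; the heteroskedastic model below, with Var(eᵢ) proportional to xᵢ², is a made-up example.

```python
import numpy as np

# GLS by whitening: transform with R = L^{-1} from the Cholesky factor
# V = L L', solve OLS in the transformed model, and compare with the
# Aitken equations (4.15).
rng = np.random.default_rng(7)
N = 400
x = rng.uniform(1.0, 5.0, N)
X = np.column_stack([np.ones(N), x])
V = np.diag(x ** 2)                      # Var(e_i) proportional to x_i^2
b_true = np.array([1.0, 2.0])
y = X @ b_true + x * rng.standard_normal(N)

L = np.linalg.cholesky(V)
R = np.linalg.inv(L)                     # R V R' = I
z, U = R @ y, R @ X
b_gls = np.linalg.lstsq(U, z, rcond=None)[0]

# equivalently, solve X' V^{-1} X b = X' V^{-1} y directly
Vinv = np.linalg.inv(V)
b_aitken = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
print(b_gls, b_aitken)                   # agree to rounding error
```

Both routes give the same estimator, as the algebra leading to (4.15) promises.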
Theorem 4.2. (Aitken's Theorem) Consider the Aitken Model given by y = Xb + e, where
E(e) = 0 and Cov(e) = σ²V, with V a known positive definite matrix. If λᵀb is
estimable, then λᵀb̂_GLS is the BLUE for λᵀb.
Proof: Follows from (a), (b), (c), and the Gauss-Markov Theorem.
From Chapter 2, we know that b̂_GLS minimizes a sum of squares, in particular,

‖z − Ub‖² = ‖R(y − Xb)‖² = (y − Xb)ᵀV⁻¹(y − Xb),

so this approach is often called weighted least squares or generalized least squares. When V
is diagonal, this sum of squares becomes Σᵢ(yᵢ − xᵢᵀb)²/Vᵢᵢ, and the name should be apparent.
d) Estimation of σ². Since the transformed model follows the Gauss-Markov
assumptions, the estimator constructed in Section 4.3 is the natural unbiased estimator for σ²:

σ̂²_GLS = (z − Ub̂_GLS)ᵀ(z − Ub̂_GLS)/(N − r) = (y − Xb̂_GLS)ᵀV⁻¹(y − Xb̂_GLS)/(N − r).
For example, suppose yᵢ = βxᵢ + eᵢ with Var(eᵢ) = σ²xᵢ²; dividing by xᵢ transforms to
zᵢ = yᵢ/xᵢ = β + eᵢ/xᵢ,
which satisfies the Gauss-Markov assumptions. The BLUE of β, then, is
β̂_GLS = z̄ = N⁻¹Σ(yᵢ/xᵢ), and Var(β̂_GLS) = σ²/N.
Another important case is the first-order autoregressive error structure, with

V = (1/(1 − ρ²)) [ 1        ρ      ρ²    ...  ρ^(N−1) ]
                 [ ρ        1      ρ     ...  ρ^(N−2) ]
                 [ ...                        ...     ]
                 [ ρ^(N−1)  ...               1       ]

so that Vᵢⱼ = ρ^|i−j|/(1 − ρ²). This error structure is known as a first-order autoregressive
model. The following transformation, known as the Cochrane-Orcutt transformation, restores
the usual Gauss-Markov assumptions:

z₁ = √(1 − ρ²) y₁ = √(1 − ρ²) x₁ᵀb + √(1 − ρ²) e₁
z_t = y_t − ρy_(t−1) = (x_t − ρx_(t−1))ᵀb + (e_t − ρe_(t−1)), for t = 2, ..., N.
The single parameter ρ of this model is not usually known, and the usual approach is to begin
with ordinary least squares to estimate b, estimate ρ from the residuals, do GLS with the
estimated ρ to reestimate b, reestimate ρ, and iterate until convergence. This procedure is
known as estimated generalized least squares (EGLS) since ρ is estimated.
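The effect of the Cochrane-Orcutt transformation can be seen by applying it to a simulated AR(1) error series; ρ is treated as known here, whereas EGLS would estimate it from residuals.

```python
import numpy as np

# Cochrane-Orcutt whitening of a simulated AR(1) series: the
# transformed series z should have constant variance sigma^2 and no
# lag-1 correlation.
rng = np.random.default_rng(8)
N, rho, sigma = 5000, 0.7, 1.0
e = np.empty(N)
e[0] = rng.standard_normal() * sigma / np.sqrt(1 - rho**2)  # stationary start
for t in range(1, N):
    e[t] = rho * e[t - 1] + sigma * rng.standard_normal()

z = np.empty(N)
z[0] = np.sqrt(1 - rho**2) * e[0]
z[1:] = e[1:] - rho * e[:-1]
lag1 = np.corrcoef(z[:-1], z[1:])[0, 1]
print(z.var(), lag1)
```

The sample variance of z is close to σ² = 1 and the lag-1 correlation is close to zero, as the transformation intends.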
Finding the BLUE estimator under the Aitken model really is not difficult, as you've
seen; all that was necessary was to transform y and X to z and U so that the Gauss-Markov
assumptions held. A more interesting pursuit, however, is the set of conditions that make the
usual b̂_OLS BLUE under the Aitken model assumptions.
Result 4.3. (Generalization of Result 4.1) The estimator tᵀy is the BLUE for E(tᵀy) iff tᵀy is
uncorrelated with all unbiased estimators of zero.
Proof: (If) Let aᵀy be another unbiased estimator of E(tᵀy), that is, E(aᵀy) = E(tᵀy). Then
Var(aᵀy) = Var(tᵀy + (aᵀy − tᵀy))
 = Var(tᵀy) + Var(aᵀy − tᵀy) + 2 Cov(tᵀy, aᵀy − tᵀy).
Since aᵀy − tᵀy is an unbiased estimator of zero, the covariance term drops out of the
equation above, leading to
Var(aᵀy) = Var(tᵀy) + Var(aᵀy − tᵀy) ≥ Var(tᵀy).
(Only if) Suppose there is another unbiased estimator of zero, hᵀy, so that E(hᵀy) = 0 where
h ≠ 0, and let Cov(tᵀy, hᵀy) = c and Var(hᵀy) = d. Then consider the estimator of E(tᵀy)
given by
aᵀy = tᵀy − (c/d)hᵀy.
This estimator is also unbiased for E(tᵀy). Its variance is
Var(aᵀy) = Var(tᵀy) + (c/d)²Var(hᵀy) − 2(c/d)Cov(tᵀy, hᵀy)
 = Var(tᵀy) − c²/d ≤ Var(tᵀy).
So if tᵀy is the BLUE, then c = 0; otherwise the estimator aᵀy constructed above will have
smaller variance.
Corollary 4.1. Under the Aitken Model, the estimator tᵀy is the BLUE for E(tᵀy) iff
Vt ∈ C(X).
Proof: From the preceding Result 4.3, tᵀy is the BLUE for E(tᵀy) iff Cov(tᵀy, hᵀy) = 0 for
all h such that E(hᵀy) = 0. Note that if hᵀy is an unbiased estimator of zero, then we have
E(hᵀy) = hᵀXb = 0 for all b.
This means hᵀX = 0, or h ∈ N(Xᵀ). Now Cov(tᵀy, hᵀy) = σ²tᵀVh, and this is zero for all
such h iff Vt is orthogonal to N(Xᵀ), that is, iff Vt ∈ C(X).

Corollary 4.2. Under the Aitken Model, λᵀb̂_OLS = aᵀP_X y (that is, λ = Xᵀa) is the BLUE for
an estimable λᵀb iff VP_X a = Xq for some q.
Proof: Let t = P_X a in Corollary 4.1.
^
Result 4.4. Under the Aitken Model, all OLS estimators are BLUE (that is, each -T b OLS is the
T
BLUE for the corresponding estimable - b) iff there exists a matrix Q such that VX œ XQ.
Chapter 4, page 11
Proof: (If VX œ XQ) First write -T b ^ T T g T T T g
OLS œ - (X X) X y œ t y for t œ X (X X) - so that
-T b ^ T
OLS is the BLUE for - b iff Vt − V(X). Now if VX œ XQ, then
Vt œ VX(XT X)g œ XQ(XT X)g - − V(X)
and employ Corollary 4.2.
(Only if) Now if -T b ^ T (j) T
OLS is the BLUE for - b, take - as column j of X X, so that
t(j) œ X(XT X)g -(j) , then from Corollary 4.2, Vt(j) œ Xq(j) . Stacking these columns side by
side to form matrices T and Q, we have
VT œ VX(XT X)g XT X œ VX œ XQ.
Matching the first entries suggests choosing Q₁₁ = σ² + Nτ² and Q₂₁ = 0, and matching the
second yields Q₁₂ = τ²1ᵀX* and Q₂₂ = σ²I. See also Exercise 4.8.
Example 4.10. Seemingly Unrelated Regressions. Suppose we have m individuals, each with
n responses following regression models:
y⁽ⁱ⁾ = X⁽ⁱ⁾b⁽ⁱ⁾ + e⁽ⁱ⁾, i = 1, ..., m,
where y⁽ⁱ⁾ and e⁽ⁱ⁾ are n × 1, X⁽ⁱ⁾ is n × p, and b⁽ⁱ⁾ is p × 1. The covariances in the errors e⁽ⁱ⁾
tie these regressions together:
Cov(e⁽ⁱ⁾, e⁽ʲ⁾) = σᵢⱼIₙ.
For example, the individuals may be companies, and the responses are quarterly sales, which
would be contemporaneously correlated. We can write this as one large linear model by
combining these pieces:

    [ y⁽¹⁾ ]      [ X⁽¹⁾  0    ...  0    ]      [ b⁽¹⁾ ]        [ σ₁₁Iₙ  ...  σ₁ₘIₙ ]
y = [ y⁽²⁾ ], X = [ 0     X⁽²⁾ ...  0    ], b = [ b⁽²⁾ ], σ²V = [ σ₂₁Iₙ  ...  σ₂ₘIₙ ]
    [ ...  ]      [ ...              ... ]      [ ...  ]        [ ...          ...  ]
    [ y⁽ᵐ⁾ ]      [ 0     0    ...  X⁽ᵐ⁾ ]      [ b⁽ᵐ⁾ ]        [ σₘ₁Iₙ  ...  σₘₘIₙ ].
4.6. Summary
1) The Gauss-Markov assumptions on the errors in a linear model are introduced. They
specify that the errors have zero mean, are uncorrelated, and have constant variance.
2) The Gauss-Markov Theorem says that the least squares estimator λᵀb̂ has the smallest
variance among all linear unbiased estimators of an estimable function λᵀb when the Gauss-
Markov assumptions hold.
3) The estimator σ̂² = SSE/(N − r) = yᵀ(I − P_X)y/(N − r) is an unbiased estimator of the
variance parameter σ².
4) Generalized least squares estimators are introduced for cases where the Gauss-Markov
assumptions on the errors may not hold.
4.7. Exercises
1) Suppose the random variable Yᵢ represents the number of votes that a candidate receives in
county i. A reasonable model would be that the Yᵢ are independent binomial random
variables with parameters nᵢ = number of voters in county i and p = Pr(voter correctly votes
for candidate).
a) What are E(Yi ) and Var(Yi )?
b) Write this as a linear model.
c) Find the BLUE of p in this situation.
2) Under the Gauss-Markov Model, show that for λ such that λᵀb is estimable,
Var(λᵀb̂) = σ²λᵀ(XᵀX)⁻λ does not depend on the choice of generalized inverse (XᵀX)⁻.
4) Consider the heteroskedasticity situation in Example 4.6, but suppose Var(eᵢ) = σ²xᵢ, where
xᵢ > 0. Find Var(b̂_OLS) and compare it to Var(b̂_GLS).
b̃₁ = Σᵢ₌₁ᴺ (zᵢ − z̄)(yᵢ − ȳ) / Σᵢ₌₁ᴺ (zᵢ − z̄)(xᵢ − x̄)
where z₁, ..., z_N are known constants.
a) Is b̃₁ an unbiased estimator of the slope parameter β₁?
b) Find the variance of b̃₁.
c) We know that taking zᵢ = xᵢ gives our familiar least squares estimator b̂₁. Find the ratio of
the two variances, Var(b̂₁)/Var(b̃₁), and show that it is less than or equal to 1 (b̂₁ is BLUE).
You may find the following results useful: Σ(xᵢ − x̄) = Σ(zᵢ − z̄) = 0.
7) In a reversal of Example 4.6, suppose we have the simple linear regression problem, with
yᵢ = β₀ + β₁xᵢ + eᵢ.
8) (Compare with Example 4.9 (equicorrelation).) Consider the linear model y = Xb + e,
where e = u + Z1 (Z is a scalar random variable), Cov(u) = σ²I_N, Var(Z) = τ², and Z
and u are uncorrelated. Find V and derive conditions under which the OLS estimator of every
estimable function is BLUE.
9) Suppose yᵢ = β₀ + β₁xᵢ + β₂xᵢ² + eᵢ, where E(eᵢ) = 0, Var(eᵢ) = σ², and the eᵢ are
independent. But suppose we fit a simple linear regression model: E(yᵢ) = β₀ + β₁xᵢ. To
illustrate, consider the following simple situation: xᵢ = i, i = 1, 2, ..., n = 8, with β₀ = 2,
β₁ = 3. Consider also various values of β₂, say, −2 to 4 by ones. (Hint: I suggest using PROC
REG for computations.)
a) Compute the bias in the least squares estimators β̂₀, β̂₁.
b) Compute the bias in our usual variance estimate σ̂².
c) Would your results change if the values of β₀ and β₁ were changed? Explain.
10) Consider the simple linear regression problem, yᵢ = β₀ + β₁xᵢ + eᵢ, for i = 1, ..., 4 = n,
with xᵢ = i and the Gauss-Markov assumptions on eᵢ. A direct route for getting the BLUE
would be to construct ALL linear unbiased estimators and then directly minimize the variance.
Write the unbiased estimators as Σaᵢyᵢ and focus on estimating the slope β₁.
a) Write the two linear equations in the aᵢ's that express the constraints so that Σᵢ₌₁ⁿ aᵢyᵢ is an
unbiased estimator of β₁.
11) Prove the Gauss-Markov Theorem directly, that is, by constructing all linear estimators
aᵀy which are unbiased for λᵀb (find a family of solutions a(z)), and then minimizing the
variance σ²aᵀa.
12) Show that if R is square and nonsingular and RVRᵀ = I, then V⁻¹ = RᵀR. Do you need
the assumption that R is square? Nonsingular?
13) Let Q = X(XᵀV⁻¹X)⁻XᵀV⁻¹. Show that Q is a projection onto C(X). (This is difficult
without the hint: first factor V = LLᵀ with Cholesky and work with a symmetric version
Q* = L⁻¹QL.)
14) Let σ²V be the N × N covariance matrix for a first-order moving average process:
Vᵢⱼ = 1 + θ²  if i = j
Vᵢⱼ = θ       if |i − j| = 1
Vᵢⱼ = 0       if |i − j| > 1.
Notice that V is banded, with zeros outside of three bands: the diagonal and the bands directly
above and below the diagonal.
a) Show that the Cholesky factor of V is also banded (lower triangular, with nonzeros on the
diagonal and directly below the diagonal).
b) Find the limit of the two nonzero elements in row N as N → ∞.
15) Consider the multiple regression problem including an intercept with the following list of
explanatory variables:
c1 = cos(2πi/7)     s1 = sin(2πi/7)
c2 = cos(2π·2i/7)   s2 = sin(2π·2i/7)
c3 = cos(2π·3i/7)   s3 = sin(2π·3i/7)
c4 = cos(2π·4i/7)   s4 = sin(2π·4i/7)
c5 = cos(2π·5i/7)   s5 = sin(2π·5i/7)
c6 = cos(2π·6i/7)   s6 = sin(2π·6i/7)
for i = 1, ..., N.
a) Show that the last six variables (c4, s4, ..., s6) are linearly dependent on the first six (c1, s1,
..., s3) and an intercept.
b) Test whether the regression software that you commonly use can detect dependencies
among the explanatory variables, using 3.1416 as your approximation for π, and various
values of N.
c) Repeat this exercise with a cruder approximation, 3.14, for π.
d) Repeat this exercise with 4 in place of 7 (that is, 2πi/4, 4πi/4, etc.).
16) Using the variables (1, c1, s1, c2, s2, c3, s3) from Exercise 15 above, is there any
multicollinearity problem?
c) Show that γ̂ = (y₄ − y₂)/2 and δ̂ = (y₅ − y₁)/4 are also unbiased for β, and also find the
variance of each.
Consider now another estimator of the slope parameter β, the estimator θ̂ = cγ̂ + (1 − c)δ̂.
d) Find the variance of θ̂ in terms of c and σ².
e) Find the value of c that minimizes the variance found in part d).
f) Suppose we knew that α = 0; would β̂ still be BLUE?
18) Under Gauss-Markov assumptions, show that if Cov(aᵀy, dᵀê) = 0 for all d, then aᵀy is
the BLUE for its expectation.
19) Under the Aitken Model, with Cov(e) = σ²V, if Cov(aᵀy, dᵀê) = 0 for all d, then is aᵀy
still the BLUE for its expectation?
20) Prove Aitken's Theorem (Theorem 4.2) directly. Consider the Aitken Model, and let λᵀb
be estimable. Let aᵀy be a linear unbiased estimator of λᵀb. Show that
Var(aᵀy) = Var(λᵀb̂_GLS) + Var(aᵀy − λᵀb̂_GLS).
21) Let Yᵢ, i = 1, ..., N be iid exponential(λ), that is, each has density f(y) = λ⁻¹e^(−y/λ) for
y > 0.
a) Find E(Yᵢᵏ) for k = 1, 2, 3, 4.
b) Find the variance of the usual variance estimator, Var(Σ(Yᵢ − Ȳ)²/(N − 1)).
22) Recall the analysis of the US population in Exercise 3.23. Find the VIFs for each of the
coefficients in the uncentered model, using t = 1790 through 2000. If you center using
c = 1890, what happens to the VIFs?
24) Ridge regression is a technique that has been recommended by some statisticians to
address multicollinearity problems arising in multiple regression. In our usual linear models
framework with E(y) = Xb and Cov(y) = σ²I_N, the ridge regression estimator takes the form

b̃ = (XᵀX + kI_p)⁻¹Xᵀy

where k > 0. Assume here that X has full column rank, that is, rank(X) = p.
a) Find E(b̃).
b) Is λᵀb̃ an unbiased estimator of λᵀb?
c) Find Cov(b̃).
d) Mean squared error is commonly used to assess the quality of an estimator. For the ridge
regression estimator b̃, find its mean squared error
E(‖b̃ − b‖²) = E((b̃ − b)ᵀ(b̃ − b)).
     [ 1  x₁  z₁ ] [ β₀ ]
Xb = [ 1  x₂  z₂ ] [ β₁ ]
     [ ...  ...  ] [ β₂ ]
     [ 1  x_N z_N ]
yielding the inner product matrix

       [ N    0      0     ]
XᵀX =  [ 0    Σxᵢ²   Σxᵢzᵢ ]
       [ 0    Σxᵢzᵢ  Σzᵢ²  ].
4.10. References
J. H. Goodnight (1979) 'A Tutorial on the Sweep Operator,' The American Statistician 33:149-
158.
Result 4.9. Let P be a symmetric matrix, and e a random vector whose components eᵢ are iid
with the following four moments: E(eᵢ) = 0, Var(eᵢ) = E(eᵢ²) = σ², E(eᵢ³) = γ₃, E(eᵢ⁴) = γ₄.
Then

Var((μ + e)ᵀP(μ + e)) = 4σ²μᵀP²μ + 4γ₃Σᵢ(Pμ)ᵢPᵢᵢ + 2σ⁴Σ_{i≠j}Pᵢⱼ² + (γ₄ − σ⁴)ΣᵢPᵢᵢ².

Proof: Begin with
Var((μ + e)ᵀP(μ + e)) = E[(μ + e)ᵀP(μ + e)(μ + e)ᵀP(μ + e)] − {E[(μ + e)ᵀP(μ + e)]}².
Note that E[(μ + e)ᵀP(μ + e)] = μᵀPμ + σ²tr(P) from Lemma 4.1. Now write out all 16
terms of E[(μ + e)ᵀP(μ + e)(μ + e)ᵀP(μ + e)], beginning with
E[μᵀPμ·μᵀPμ] = (μᵀPμ)²
E[μᵀPμ·eᵀPμ] = 0
E[μᵀPμ·μᵀPe] = 0
E[μᵀPμ·eᵀPe] = (μᵀPμ)E[eᵀPe] = (μᵀPμ)σ²tr(P).
Only two difficult expressions remain: E[μᵀPe·eᵀPe] and E[eᵀPe·eᵀPe]. For both we need to
sort the terms E[eᵢeⱼeₖeₗ] by the pattern of their indices: all four equal, three equal, two pairs,
one pair, and all different. Only the patterns with no unpaired index contribute; for example,
terms of the form
Σᵢⱼ μⱼE[eᵢ³]PⱼᵢPᵢᵢ = γ₃Σᵢ(Pμ)ᵢPᵢᵢ
produce the third-moment piece, while terms with an unpaired index, such as
E[μᵢeⱼeᵢeᵢ]PᵢⱼPᵢᵢ for j ≠ i, are zero. Collecting all of the pieces yields the expression above,
which does not appear to simplify further.

Corollary 4.10. Let P be symmetric and idempotent, and let the components of the random
vector e be iid with γ₃ = 0 and γ₄ = 3σ⁴ (as with normally distributed errors). Then

Var((μ + e)ᵀP(μ + e)) = 4σ²μᵀP²μ + 2σ⁴Σ_{i≠j}Pᵢⱼ² + (3σ⁴ − σ⁴)ΣᵢPᵢᵢ²
 = 4σ²μᵀPμ + 2σ⁴tr(P).