
STAT 714

LINEAR STATISTICAL MODELS


Fall, 2010
Lecture Notes
Joshua M. Tebbs
Department of Statistics
The University of South Carolina
Contents

1 Examples of the General Linear Model 1

2 The Linear Least Squares Problem 13
2.1 Least squares estimation 13
2.2 Geometric considerations 15
2.3 Reparameterization 25

3 Estimability and Least Squares Estimators 28
3.1 Introduction 28
3.2 Estimability 28
3.2.1 One-way ANOVA 35
3.2.2 Two-way crossed ANOVA with no interaction 37
3.2.3 Two-way crossed ANOVA with interaction 39
3.3 Reparameterization 40
3.4 Forcing least squares solutions using linear constraints 46

4 The Gauss-Markov Model 54
4.1 Introduction 54
4.2 The Gauss-Markov Theorem 55
4.3 Estimation of σ² in the GM model 57
4.4 Implications of model selection 60
4.4.1 Underfitting (Misspecification) 60
4.4.2 Overfitting 61
4.5 The Aitken model and generalized least squares 63

5 Distributional Theory 68
5.1 Introduction 68
5.2 Multivariate normal distribution 69
5.2.1 Probability density function 69
5.2.2 Moment generating functions 70
5.2.3 Properties 72
5.2.4 Less-than-full-rank normal distributions 73
5.2.5 Independence results 74
5.2.6 Conditional distributions 76
5.3 Noncentral χ² distribution 77
5.4 Noncentral F distribution 79
5.5 Distributions of quadratic forms 81
5.6 Independence of quadratic forms 85
5.7 Cochran's Theorem 89

6 Statistical Inference 95
6.1 Estimation 95
6.2 Testing models 97
6.3 Testing linear parametric functions 103
6.4 Testing models versus testing linear parametric functions 107
6.5 Likelihood ratio tests 109
6.5.1 Constrained estimation 109
6.5.2 Testing procedure 111
6.6 Confidence intervals 113
6.6.1 Single intervals 113
6.6.2 Multiple intervals 115
1 Examples of the General Linear Model
Complementary reading from Monahan: Chapter 1.
INTRODUCTION: Linear models are models that are linear in their parameters. The
general form of a linear model is given by

Y = Xβ + ε,

where Y is an n × 1 vector of observed responses, X is an n × p (design) matrix of fixed
constants, β is a p × 1 vector of fixed but unknown parameters, and ε is an n × 1 vector
of (unobserved) random errors. The model is called a linear model because the mean of
the response vector Y is linear in the unknown parameter β.

SCOPE: Several models commonly used in statistics are examples of the general linear
model Y = Xβ + ε. These include, but are not limited to, linear regression models and
analysis of variance (ANOVA) models. Regression models generally refer to those for
which X is full rank, while ANOVA models refer to those for which X consists of zeros
and ones.

GENERAL CLASSES OF LINEAR MODELS:

Model I: Least squares model: Y = Xβ + ε. This model makes no assumptions
on ε. The parameter space is Θ = {β : β ∈ R^p}.

Model II: Gauss-Markov model: Y = Xβ + ε, where E(ε) = 0 and cov(ε) = σ²I.
The parameter space is Θ = {(β, σ²) : (β, σ²) ∈ R^p × R^+}.

Model III: Aitken model: Y = Xβ + ε, where E(ε) = 0 and cov(ε) = σ²V, V
known. The parameter space is Θ = {(β, σ²) : (β, σ²) ∈ R^p × R^+}.

Model IV: General linear mixed model: Y = Xβ + ε, where E(ε) = 0 and
cov(ε) = Σ = Σ(θ). The parameter space is Θ = {(β, θ) : (β, θ) ∈ R^p × Ω},
where Ω is the set of all values of θ for which Σ(θ) is positive definite.
GAUSS-MARKOV MODEL: Consider the linear model Y = Xβ + ε, where E(ε) = 0
and cov(ε) = σ²I. This model is treated extensively in Chapter 4. We now highlight
special cases of this model.
Example 1.1. One-sample problem. Suppose that Y_1, Y_2, ..., Y_n is an iid sample with
mean μ and variance σ² > 0. If ε_1, ε_2, ..., ε_n are iid with mean E(ε_i) = 0 and common
variance σ², we can write the GM model Y = Xβ + ε, where

Y_{n×1} = (Y_1, Y_2, ..., Y_n)',   X_{n×1} = (1, 1, ..., 1)' = 1_n,   β_{1×1} = μ,   ε_{n×1} = (ε_1, ε_2, ..., ε_n)'.

Note that E(ε) = 0 and cov(ε) = σ²I.
Example 1.2. Simple linear regression. Consider the model where a response variable
Y is linearly related to an independent variable x via

Y_i = β_0 + β_1 x_i + ε_i,

for i = 1, 2, ..., n, where the ε_i are uncorrelated random variables with mean 0 and
common variance σ² > 0. If x_1, x_2, ..., x_n are fixed constants, measured without error,
then this is a GM model Y = Xβ + ε with

Y_{n×1} = (Y_1, Y_2, ..., Y_n)',   X_{n×2} with ith row (1, x_i),   β_{2×1} = (β_0, β_1)',   ε_{n×1} = (ε_1, ε_2, ..., ε_n)'.

Note that E(ε) = 0 and cov(ε) = σ²I.
Example 1.3. Multiple linear regression. Suppose that a response variable Y is linearly
related to several independent variables, say, x_1, x_2, ..., x_k via

Y_i = β_0 + β_1 x_{i1} + β_2 x_{i2} + ··· + β_k x_{ik} + ε_i,

for i = 1, 2, ..., n, where the ε_i are uncorrelated random variables with mean 0 and common
variance σ² > 0. If the independent variables are fixed constants, measured without
error, then this model is a special GM model Y = Xβ + ε where

Y = (Y_1, Y_2, ..., Y_n)',   X_{n×p} with ith row (1, x_{i1}, x_{i2}, ..., x_{ik}),
β_{p×1} = (β_0, β_1, ..., β_k)',   ε = (ε_1, ε_2, ..., ε_n)',

and p = k + 1. Note that E(ε) = 0 and cov(ε) = σ²I.
Example 1.4. One-way ANOVA. Consider an experiment that is performed to compare
a ≥ 2 treatments. For the ith treatment level, suppose that n_i experimental units are
selected at random and assigned to the ith treatment. Consider the model

Y_ij = μ + α_i + ε_ij,

for i = 1, 2, ..., a and j = 1, 2, ..., n_i, where the random errors ε_ij are uncorrelated random
variables with zero mean and common variance σ² > 0. If the a treatment effects
α_1, α_2, ..., α_a are best regarded as fixed constants, then this model is a special case of the
GM model Y = Xβ + ε. To see this, note that with n = Σ_{i=1}^a n_i,

Y_{n×1} = (Y_11, Y_12, ..., Y_{a n_a})',

X_{n×p} = [ 1_{n_1}  1_{n_1}  0_{n_1}  ...  0_{n_1}
            1_{n_2}  0_{n_2}  1_{n_2}  ...  0_{n_2}
              ...      ...      ...          ...
            1_{n_a}  0_{n_a}  0_{n_a}  ...  1_{n_a} ],

β_{p×1} = (μ, α_1, α_2, ..., α_a)',

where p = a + 1 and ε_{n×1} = (ε_11, ε_12, ..., ε_{a n_a})', and where 1_{n_i} is an n_i × 1 vector of ones
and 0_{n_i} is an n_i × 1 vector of zeros. Note that E(ε) = 0 and cov(ε) = σ²I.

NOTE: In Example 1.4, note that the first column of X is the sum of the last a columns;
i.e., there is a linear dependence in the columns of X. From results in linear algebra,
we know that X is not of full column rank. In fact, the rank of X is r = a, one less
than the number of columns p = a + 1. This is a common characteristic of ANOVA
models; namely, their X matrices are not of full column rank. On the other hand,
(linear) regression models are models of the form Y = Xβ + ε, where X is of full column
rank; see Examples 1.2 and 1.3.
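As a quick numerical illustration of this rank deficiency (a sketch in Python/NumPy, not
part of the original notes; the group sizes below are arbitrary), one can build X for a = 3
treatments and confirm that r(X) = a while p = a + 1:

import numpy as np

# One-way ANOVA design matrix X = [1_n | group indicators] for a = 3 groups,
# with illustrative group sizes n_i = (2, 3, 2).
n_i = [2, 3, 2]
blocks = []
for i, ni in enumerate(n_i):
    indicator = np.zeros((ni, len(n_i)))
    indicator[:, i] = 1.0
    blocks.append(np.hstack([np.ones((ni, 1)), indicator]))
X = np.vstack(blocks)

# The first column equals the sum of the last a columns, so X is rank deficient:
# rank r = a = 3, one less than the number of columns p = a + 1 = 4.
print(X)
print("p =", X.shape[1], " rank(X) =", np.linalg.matrix_rank(X))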
Example 1.5. Two-way nested ANOVA. Consider an experiment with two factors,
where one factor, say, Factor B, is nested within Factor A. In other words, every level
of B appears with exactly one level of Factor A. A statistical model for this situation is

Y_ijk = μ + α_i + β_ij + ε_ijk,

for i = 1, 2, ..., a, j = 1, 2, ..., b_i, and k = 1, 2, ..., n_ij. In this model, μ denotes the overall
mean, α_i represents the effect due to the ith level of A, and β_ij represents the effect
of the jth level of B, nested within the ith level of A. If all parameters are fixed, and
the random errors ε_ijk are uncorrelated random variables with zero mean and constant
unknown variance σ² > 0, then this is a special GM model Y = Xβ + ε. For example,
with a = 3, b = 2, and n_ij = 2, we have

Y = (Y_111, Y_112, Y_121, Y_122, Y_211, Y_212, Y_221, Y_222, Y_311, Y_312, Y_321, Y_322)',

X = [ 1 1 0 0 1 0 0 0 0 0
      1 1 0 0 1 0 0 0 0 0
      1 1 0 0 0 1 0 0 0 0
      1 1 0 0 0 1 0 0 0 0
      1 0 1 0 0 0 1 0 0 0
      1 0 1 0 0 0 1 0 0 0
      1 0 1 0 0 0 0 1 0 0
      1 0 1 0 0 0 0 1 0 0
      1 0 0 1 0 0 0 0 1 0
      1 0 0 1 0 0 0 0 1 0
      1 0 0 1 0 0 0 0 0 1
      1 0 0 1 0 0 0 0 0 1 ],

β = (μ, α_1, α_2, α_3, β_11, β_12, β_21, β_22, β_31, β_32)',

and ε = (ε_111, ε_112, ..., ε_322)'. Note that E(ε) = 0 and cov(ε) = σ²I. The X matrix is not
of full column rank. The rank of X is r = 6 and there are p = 10 columns.
Example 1.6. Two-way crossed ANOVA with interaction. Consider an experiment with
two factors (A and B), where Factor A has a levels and Factor B has b levels. In general,
we say that factors A and B are crossed if every level of A occurs in combination with
every level of B. Consider the two-factor (crossed) ANOVA model given by

Y_ijk = μ + α_i + β_j + γ_ij + ε_ijk,

for i = 1, 2, ..., a, j = 1, 2, ..., b, and k = 1, 2, ..., n_ij, where the random errors ε_ijk are
uncorrelated random variables with zero mean and constant unknown variance σ² > 0.
If all the parameters are fixed, this is a special GM model Y = Xβ + ε. For example,
with a = 3, b = 2, and n_ij = 3,

Y = (Y_111, Y_112, Y_113, Y_121, Y_122, Y_123, Y_211, ..., Y_323)',

X = [ 1 1 0 0 1 0 1 0 0 0 0 0
      1 1 0 0 1 0 1 0 0 0 0 0
      1 1 0 0 1 0 1 0 0 0 0 0
      1 1 0 0 0 1 0 1 0 0 0 0
      1 1 0 0 0 1 0 1 0 0 0 0
      1 1 0 0 0 1 0 1 0 0 0 0
      1 0 1 0 1 0 0 0 1 0 0 0
      1 0 1 0 1 0 0 0 1 0 0 0
      1 0 1 0 1 0 0 0 1 0 0 0
      1 0 1 0 0 1 0 0 0 1 0 0
      1 0 1 0 0 1 0 0 0 1 0 0
      1 0 1 0 0 1 0 0 0 1 0 0
      1 0 0 1 1 0 0 0 0 0 1 0
      1 0 0 1 1 0 0 0 0 0 1 0
      1 0 0 1 1 0 0 0 0 0 1 0
      1 0 0 1 0 1 0 0 0 0 0 1
      1 0 0 1 0 1 0 0 0 0 0 1
      1 0 0 1 0 1 0 0 0 0 0 1 ],

β = (μ, α_1, α_2, α_3, β_1, β_2, γ_11, γ_12, γ_21, γ_22, γ_31, γ_32)',

and ε = (ε_111, ε_112, ..., ε_323)'. Note that E(ε) = 0 and cov(ε) = σ²I. The X matrix is not
of full column rank. The rank of X is r = 6 and there are p = 12 columns.
Example 1.7. Two-way crossed ANOVA without interaction. Consider an experiment
with two factors (A and B), where Factor A has a levels and Factor B has b levels. The
two-way crossed model without interaction is given by

Y_ijk = μ + α_i + β_j + ε_ijk,

for i = 1, 2, ..., a, j = 1, 2, ..., b, and k = 1, 2, ..., n_ij, where the random errors ε_ijk are
uncorrelated random variables with zero mean and common variance σ² > 0. Note that
the no-interaction model is a special case of the interaction model in Example 1.6 when
H_0: γ_11 = γ_12 = ··· = γ_32 = 0 is true. That is, the no-interaction model is a reduced
version of the interaction model. With a = 3, b = 2, and n_ij = 3 as before, we have

Y = (Y_111, Y_112, Y_113, Y_121, Y_122, Y_123, Y_211, ..., Y_323)',

X = [ 1 1 0 0 1 0
      1 1 0 0 1 0
      1 1 0 0 1 0
      1 1 0 0 0 1
      1 1 0 0 0 1
      1 1 0 0 0 1
      1 0 1 0 1 0
      1 0 1 0 1 0
      1 0 1 0 1 0
      1 0 1 0 0 1
      1 0 1 0 0 1
      1 0 1 0 0 1
      1 0 0 1 1 0
      1 0 0 1 1 0
      1 0 0 1 1 0
      1 0 0 1 0 1
      1 0 0 1 0 1
      1 0 0 1 0 1 ],

β = (μ, α_1, α_2, α_3, β_1, β_2)',

and ε = (ε_111, ε_112, ..., ε_323)'. Note that E(ε) = 0 and cov(ε) = σ²I. The X matrix is not
of full column rank. The rank of X is r = 4 and there are p = 6 columns. Also note that
the design matrix for the no-interaction model is the same as the design matrix for the
interaction model, except that the last 6 columns are removed.
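Because the factors are crossed with equal replication, these design matrices can be
assembled from Kronecker products. The following Python/NumPy sketch (an illustration
added here, not from the notes) rebuilds the a = 3, b = 2, n_ij = 3 matrices of Examples
1.6 and 1.7 and confirms the ranks r = 6 and r = 4 quoted above:

import numpy as np

a, b, n = 3, 2, 3                      # levels of A, levels of B, replicates per cell
one_n = np.ones((n, 1))

col_mu = np.kron(np.ones((a * b, 1)), one_n)                        # intercept column
col_alpha = np.kron(np.kron(np.eye(a), np.ones((b, 1))), one_n)     # A indicators
col_beta = np.kron(np.kron(np.ones((a, 1)), np.eye(b)), one_n)      # B indicators
col_gamma = np.kron(np.eye(a * b), one_n)                           # A*B cell indicators

X_int = np.hstack([col_mu, col_alpha, col_beta, col_gamma])  # interaction model (Ex. 1.6)
X_add = np.hstack([col_mu, col_alpha, col_beta])             # no-interaction model (Ex. 1.7)

print(X_int.shape, np.linalg.matrix_rank(X_int))   # (18, 12), rank 6
print(X_add.shape, np.linalg.matrix_rank(X_add))   # (18, 6),  rank 4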
Example 1.8. Analysis of covariance. Consider an experiment to compare a ≥ 2
treatments after adjusting for the effects of a covariate x. A model for the analysis of
covariance is given by

Y_ij = μ + α_i + β_i x_ij + ε_ij,

for i = 1, 2, ..., a, j = 1, 2, ..., n_i, where the random errors ε_ij are uncorrelated random
variables with zero mean and common variance σ² > 0. In this model, μ represents the
overall mean, α_i represents the (fixed) effect of receiving the ith treatment (disregarding
the covariates), and β_i denotes the slope of the line that relates Y to x for the ith
treatment. Note that this model allows the treatment slopes to be different. The x_ij's
are assumed to be fixed values measured without error.

NOTE: The analysis of covariance (ANCOVA) model is a special GM model Y = Xβ + ε.
For example, with a = 3 and n_1 = n_2 = n_3 = 3, we have

Y = (Y_11, Y_12, Y_13, Y_21, Y_22, Y_23, Y_31, Y_32, Y_33)',

X = [ 1 1 0 0 x_11 0    0
      1 1 0 0 x_12 0    0
      1 1 0 0 x_13 0    0
      1 0 1 0 0    x_21 0
      1 0 1 0 0    x_22 0
      1 0 1 0 0    x_23 0
      1 0 0 1 0    0    x_31
      1 0 0 1 0    0    x_32
      1 0 0 1 0    0    x_33 ],

β = (μ, α_1, α_2, α_3, β_1, β_2, β_3)',   ε = (ε_11, ε_12, ..., ε_33)'.

Note that E(ε) = 0 and cov(ε) = σ²I. The X matrix is not of full column rank. If there
are no linear dependencies among the last 3 columns, the rank of X is r = 6 and there
are p = 7 columns.
REDUCED MODEL: Consider the ANCOVA model in Example 1.8 which allows for
unequal slopes. If β_1 = β_2 = ··· = β_a = β; that is, all slopes are equal, then the ANCOVA
model reduces to

Y_ij = μ + α_i + β x_ij + ε_ij.

That is, the common-slopes ANCOVA model is a reduced version of the model that
allows for different slopes. Assuming the same error structure, this reduced ANCOVA
model is also a special GM model Y = Xβ + ε. With a = 3 and n_1 = n_2 = n_3 = 3, as
before, we have

Y = (Y_11, Y_12, ..., Y_33)',

X = [ 1 1 0 0 x_11
      1 1 0 0 x_12
      1 1 0 0 x_13
      1 0 1 0 x_21
      1 0 1 0 x_22
      1 0 1 0 x_23
      1 0 0 1 x_31
      1 0 0 1 x_32
      1 0 0 1 x_33 ],

β = (μ, α_1, α_2, α_3, β)',   ε = (ε_11, ε_12, ..., ε_33)'.

As long as at least one of the x_ij's is different, the rank of X is r = 4 and there are p = 5
columns.
GOAL: We now provide examples of linear models of the form Y = Xβ + ε that are not
GM models.

TERMINOLOGY: A factor of classification is said to be random if it has an infinitely
large number of levels and the levels included in the experiment can be viewed as a
random sample from the population of possible levels.
Example 1.9. One-way random effects ANOVA. Consider the model

Y_ij = μ + α_i + ε_ij,

for i = 1, 2, ..., a and j = 1, 2, ..., n_i, where the treatment effects α_1, α_2, ..., α_a are best
regarded as random; e.g., the a levels of the factor of interest are drawn from a large
population of possible levels, and the random errors ε_ij are uncorrelated random variables
with zero mean and common variance σ² > 0. For concreteness, let a = 4 and n_i = 3.
The model Y = Xβ + ε looks like

Y = (Y_11, Y_12, Y_13, Y_21, ..., Y_43)' = 1_12 μ + Z_1 ε_1 + ε_2 = Xβ + Z_1 ε_1 + ε_2,

where

Z_1 = [ 1_3  0_3  0_3  0_3
        0_3  1_3  0_3  0_3
        0_3  0_3  1_3  0_3
        0_3  0_3  0_3  1_3 ],   ε_1 = (α_1, α_2, α_3, α_4)',   ε_2 = (ε_11, ε_12, ..., ε_43)',

and where we identify X = 1_12, β = μ, and ε = Z_1 ε_1 + ε_2. This is not a GM model because

cov(ε) = cov(Z_1 ε_1 + ε_2) = Z_1 cov(ε_1) Z_1' + cov(ε_2) = Z_1 cov(ε_1) Z_1' + σ²I,

provided that the α_i's and the errors ε_ij are uncorrelated. Note that cov(ε) ≠ σ²I.
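To see the non-GM covariance structure numerically, here is a short Python/NumPy
sketch (illustrative only; the variance components σ²_α = 2 and σ² = 1 are made up) for
the a = 4, n_i = 3 layout above:

import numpy as np

a, n_i = 4, 3
sigma2_alpha, sigma2 = 2.0, 1.0        # hypothetical variance components

Z1 = np.kron(np.eye(a), np.ones((n_i, 1)))        # 12 x 4 random-effects design
cov_eps = sigma2_alpha * Z1 @ Z1.T + sigma2 * np.eye(a * n_i)

# cov(eps) is block diagonal with 3 x 3 blocks sigma2_alpha*J + sigma2*I,
# so it is not of the Gauss-Markov form sigma^2 * I.
print(cov_eps[:3, :3])
print("equal to sigma^2 * I?", np.allclose(cov_eps, sigma2 * np.eye(a * n_i)))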
Example 1.10. Two-factor mixed model. Consider an experiment with two factors (A
and B), where Factor A is fixed and has a levels and Factor B is random with b levels.
A statistical model for this situation is given by

Y_ijk = μ + α_i + β_j + ε_ijk,

for i = 1, 2, ..., a, j = 1, 2, ..., b, and k = 1, 2, ..., n_ij. The α_i's are best regarded as fixed
and the β_j's are best regarded as random. This model assumes no interaction.

APPLICATION: In a randomized block experiment, b blocks may have been selected
randomly from a large collection of available blocks. If the goal is to make a statement
about the large population of blocks (and not those b blocks in the experiment), then
blocks are considered as random. The treatment effects α_1, α_2, ..., α_a are regarded as
fixed constants if the a treatments are the only ones of interest.
NOTE: For concreteness, suppose that a = 2, b = 4, and n_ij = 1. We can write the
model above as

Y = (Y_11, Y_12, Y_13, Y_14, Y_21, Y_22, Y_23, Y_24)' = Xβ + Z_1 ε_1 + ε_2,

where

X = [ 1_4  1_4  0_4          Z_1 = [ I_4
      1_4  0_4  1_4 ],               I_4 ],

β = (μ, α_1, α_2)',   ε_1 = (β_1, β_2, β_3, β_4)',   ε_2 = (ε_11, ε_12, ..., ε_24)'.

NOTE: If the α_i's are best regarded as random as well, then we have

Y = 1_8 μ + Z_1 ε_1 + Z_2 ε_2 + ε_3,

where

Z_1 = [ 1_4  0_4          Z_2 = [ I_4
        0_4  1_4 ],               I_4 ],

ε_1 = (α_1, α_2)',   ε_2 = (β_1, β_2, β_3, β_4)',   ε_3 = (ε_11, ε_12, ..., ε_24)'.

This model is also known as a random effects or variance component model.
GENERAL FORM: A linear mixed model can be expressed generally as

Y = Xβ + Z_1 ε_1 + Z_2 ε_2 + ··· + Z_k ε_k,

where Z_1, Z_2, ..., Z_k are known matrices (typically Z_k = I) and ε_1, ε_2, ..., ε_k are uncorrelated
random vectors with uncorrelated components.
Example 1.11. Time series models. When measurements are taken over time, the GM
model may not be appropriate because observations are likely correlated. A linear model
of the form Y = Xβ + ε, where E(ε) = 0 and cov(ε) = σ²V, V known, may be more
appropriate. The general form of V is chosen to model the correlation of the observed
responses. For example, consider the statistical model

Y_t = β_0 + β_1 t + ε_t,

for t = 1, 2, ..., n, where ε_t = φ ε_{t-1} + a_t, a_t ~ iid N(0, σ²), and |φ| < 1 (this is a
stationarity condition). This is called a simple linear trend model where the error
process {ε_t : t = 1, 2, ..., n} follows an autoregressive model of order 1, AR(1). It is easy
to show that E(ε_t) = 0, for all t, and that cov(ε_t, ε_s) = σ² φ^{|t-s|}, for all t and s. Therefore,
if n = 5,

V = σ² [ 1    φ    φ²   φ³   φ⁴
         φ    1    φ    φ²   φ³
         φ²   φ    1    φ    φ²
         φ³   φ²   φ    1    φ
         φ⁴   φ³   φ²   φ    1 ].
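For any n, the matrix V can be formed by exponentiating |t − s|. The Python/NumPy
lines below are a sketch (the value φ = 0.6 is an arbitrary illustration) that rebuilds the
n = 5 matrix displayed above:

import numpy as np

n, phi = 5, 0.6                              # illustrative choices
t = np.arange(1, n + 1)
V = phi ** np.abs(t[:, None] - t[None, :])   # (t, s) entry is phi^|t-s|
print(V)                                     # multiply by sigma^2 to get cov(eps)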
Example 1.12. Random coefficient models. Suppose that t measurements are taken
(over time) on n individuals and consider the model

Y_ij = x_ij' β_i + ε_ij,

for i = 1, 2, ..., n and j = 1, 2, ..., t; that is, the different p × 1 regression parameters β_i
are subject-specific. If the individuals are considered to be a random sample, then we
can treat β_1, β_2, ..., β_n as iid random vectors with mean β and p × p covariance matrix
Σ_β, say. We can write this model as

Y_ij = x_ij' β_i + ε_ij = x_ij' β  (fixed)  +  x_ij'(β_i − β) + ε_ij  (random).

If the β_i's are independent of the ε_ij's, note that

var(Y_ij) = x_ij' Σ_β x_ij + σ² ≠ σ².
Example 1.13. Measurement error models. Consider the statistical model

Y_i = β_0 + β_1 X_i + ε_i,

where ε_i ~ iid N(0, σ²_ε). The X_i's are not observed exactly; instead, they are measured
with non-negligible error so that

W_i = X_i + U_i,

where U_i ~ iid N(0, σ²_U). Here,

Observed data: (Y_i, W_i)
Not observed: (X_i, ε_i, U_i)
Unknown parameters: (β_0, β_1, σ²_ε, σ²_U).

As a frame of reference, suppose that Y is a continuous measurement of lung function
in small children and that X denotes the long-term exposure to NO_2. It is unlikely that
X can be measured exactly; instead, the surrogate W, the amount of NO_2 recorded at a
clinic visit, is more likely to be observed. Note that the model above can be rewritten as

Y_i = β_0 + β_1(W_i − U_i) + ε_i = β_0 + β_1 W_i + (ε_i − β_1 U_i) = β_0 + β_1 W_i + ε*_i.

Because the W_i's are not fixed in advance, we would at least need E(ε*_i | W_i) = 0 for this
to be a GM linear model. However, note that

E(ε*_i | W_i) = E(ε_i − β_1 U_i | X_i + U_i)
             = E(ε_i | X_i + U_i) − β_1 E(U_i | X_i + U_i).

The first term is zero if ε_i is independent of both X_i and U_i. The second term generally
is not zero (unless β_1 = 0, of course) because U_i and X_i + U_i are correlated. Therefore,
this cannot be a GM model.
2 The Linear Least Squares Problem
Complementary reading from Monahan: Chapter 2 (except Section 2.4).
INTRODUCTION: Consider the general linear model

Y = Xβ + ε,

where Y is an n × 1 vector of observed responses, X is an n × p matrix of fixed constants,
β is a p × 1 vector of fixed but unknown parameters, and ε is an n × 1 vector of random
errors. If E(ε) = 0, then

E(Y) = E(Xβ + ε) = Xβ.

Since β is unknown, all we really know is that E(Y) = Xβ ∈ C(X). To estimate E(Y),
it seems natural to take the vector in C(X) that is closest to Y.
2.1 Least squares estimation
DEFINITION: An estimate β̂ is a least squares estimate of β if Xβ̂ is the vector in
C(X) that is closest to Y. In other words, β̂ is a least squares estimate of β if

β̂ = arg min_{β ∈ R^p} (Y − Xβ)'(Y − Xβ).

LEAST SQUARES: Let β = (β_1, β_2, ..., β_p)' and define the error sum of squares

Q(β) = (Y − Xβ)'(Y − Xβ),

the squared distance from Y to Xβ. The point where Q(β) is minimized satisfies
∂Q(β)/∂β = 0, or, in other words,

( ∂Q(β)/∂β_1, ∂Q(β)/∂β_2, ..., ∂Q(β)/∂β_p )' = (0, 0, ..., 0)'.

This minimization problem can be tackled either algebraically or geometrically.
Result 2.1. Let a and b be p × 1 vectors and A be a p × p matrix of constants. Then

∂(a'b)/∂b = a   and   ∂(b'Ab)/∂b = (A + A')b.

Proof. See Monahan, pp 14.

NOTE: In Result 2.1, note that ∂(b'Ab)/∂b = 2Ab if A is symmetric.
NORMAL EQUATIONS: Simple calculations show that

Q(β) = (Y − Xβ)'(Y − Xβ) = Y'Y − 2Y'Xβ + β'X'Xβ.

Using Result 2.1, we have

∂Q(β)/∂β = −2X'Y + 2X'Xβ,

because X'X is symmetric. Setting this expression equal to 0 and rearranging gives

X'Xβ = X'Y.

These are the normal equations. If X'X is nonsingular, then the unique least squares
estimator of β is

β̂ = (X'X)^{-1} X'Y.

When X'X is singular, which can happen in ANOVA models (see Chapter 1), there can
be multiple solutions to the normal equations. Having already proved algebraically that
the normal equations are consistent, we know that the general form of the least squares
solution is

β̂ = (X'X)^- X'Y + [I − (X'X)^- X'X]z,

for z ∈ R^p, where (X'X)^- is a generalized inverse of X'X.
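The Python/NumPy sketch below (not part of the notes; it uses a rank-deficient one-way
ANOVA design and fabricated data) illustrates this formula: the Moore-Penrose inverse
serves as one choice of (X'X)^-, the resulting solution satisfies the normal equations, and
adding a component from N(X'X) produces another solution with the same fitted values.

import numpy as np

rng = np.random.default_rng(0)
# Rank-deficient one-way ANOVA design: a = 3 groups, 2 observations each.
X = np.kron(np.eye(3), np.ones((2, 1)))
X = np.hstack([np.ones((6, 1)), X])            # p = 4, rank 3
Y = rng.normal(size=6)                         # fabricated responses

XtX, XtY = X.T @ X, X.T @ Y
G = np.linalg.pinv(XtX)                        # one generalized inverse of X'X
beta_hat = G @ XtY
print("solves normal equations:", np.allclose(XtX @ beta_hat, XtY))

# Another solution: add any z-component through I - (X'X)^- X'X.
z = rng.normal(size=4)
beta_tilde = beta_hat + (np.eye(4) - G @ XtX) @ z
print("also solves them:", np.allclose(XtX @ beta_tilde, XtY))
print("same fitted values:", np.allclose(X @ beta_hat, X @ beta_tilde))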
2.2 Geometric considerations
CONSISTENCY: Recall that a linear system Ax = c is consistent if there exists an x*
such that Ax* = c; that is, if c ∈ C(A). Applying this definition to

X'Xβ = X'Y,

the normal equations are consistent if X'Y ∈ C(X'X). Clearly, X'Y ∈ C(X'). Thus, we'll
be able to establish consistency (geometrically) if we can show that C(X'X) = C(X').
Result 2.2. N(X'X) = N(X).
Proof. Suppose that w ∈ N(X). Then Xw = 0 and X'Xw = 0 so that w ∈ N(X'X).
Suppose that w ∈ N(X'X). Then X'Xw = 0 and w'X'Xw = 0. Thus, ||Xw||² = 0,
which implies that Xw = 0; i.e., w ∈ N(X).
Result 2.3. Suppose that S_1 and T_1 are orthogonal complements, as well as S_2 and T_2.
If S_1 ⊂ S_2, then T_2 ⊂ T_1.
Proof. See Monahan, pp 244.
CONSISTENCY: We use the previous two results to show that C(X'X) = C(X'). Take
S_1 = N(X'X), T_1 = C(X'X), S_2 = N(X), and T_2 = C(X'). We know that S_1 and
T_1 (S_2 and T_2) are orthogonal complements. Because N(X'X) ⊂ N(X), the last result
guarantees C(X') ⊂ C(X'X). But, C(X'X) ⊂ C(X') trivially, so we're done. Note also
that

C(X'X) = C(X') ⟹ r(X'X) = r(X') = r(X).

NOTE: We now state a result that characterizes all solutions to the normal equations.
Result 2.4. Q(β) = (Y − Xβ)'(Y − Xβ) is minimized at β̂ if and only if β̂ is a solution
to the normal equations.
Proof. (⇐) Suppose that β̂ is a solution to the normal equations. Then,

Q(β) = (Y − Xβ)'(Y − Xβ)
     = (Y − Xβ̂ + Xβ̂ − Xβ)'(Y − Xβ̂ + Xβ̂ − Xβ)
     = (Y − Xβ̂)'(Y − Xβ̂) + (Xβ̂ − Xβ)'(Xβ̂ − Xβ),

since the cross product term 2(Xβ̂ − Xβ)'(Y − Xβ̂) = 0; verify this using the fact that β̂
solves the normal equations. Thus, we have shown that Q(β) = Q(β̂) + z'z, where
z = Xβ̂ − Xβ. Therefore, Q(β) ≥ Q(β̂) for all β and, hence, β̂ minimizes Q(β). (⇒)
Now, suppose that β̃ minimizes Q(β). We already know that Q(β̃) ≥ Q(β̂), where
β̂ = (X'X)^- X'Y solves the normal equations, but also Q(β̃) ≤ Q(β̂) because β̃ minimizes Q(β).
Thus, Q(β̃) = Q(β̂). But because Q(β̃) = Q(β̂) + z'z, where z = Xβ̂ − Xβ̃, it must be
true that z = Xβ̂ − Xβ̃ = 0; that is, Xβ̃ = Xβ̂. Thus,

X'Xβ̃ = X'Xβ̂ = X'Y,

since β̂ is a solution to the normal equations. This shows that β̃ is also a solution to the
normal equations.
INVARIANCE: In proving the last result, we have discovered a very important fact;
namely, if β̂ and β̃ both solve the normal equations, then Xβ̂ = Xβ̃. In other words,
Xβ̂ is invariant to the choice of solution β̂.
NOTE: The following result ties least squares estimation to the notion of a perpendicular
projection matrix. It also produces a general formula for the matrix.

Result 2.5. An estimate β̂ is a least squares estimate if and only if Xβ̂ = MY, where
M is the perpendicular projection matrix onto C(X).
Proof. We will show that

(Y − Xβ)'(Y − Xβ) = (Y − MY)'(Y − MY) + (MY − Xβ)'(MY − Xβ).

Both terms on the right hand side are nonnegative, and the first term does not involve
β. Thus, (Y − Xβ)'(Y − Xβ) is minimized by minimizing (MY − Xβ)'(MY − Xβ), the
squared distance between MY and Xβ. This distance is zero if and only if MY = Xβ,
which proves the result. Now to show the above equation:

(Y − Xβ)'(Y − Xβ) = (Y − MY + MY − Xβ)'(Y − MY + MY − Xβ)
  = (Y − MY)'(Y − MY) + (Y − MY)'(MY − Xβ)   (*)
    + (MY − Xβ)'(Y − MY)   (**)
    + (MY − Xβ)'(MY − Xβ).

It suffices to show that (*) and (**) are zero. To show that (*) is zero, note that

(Y − MY)'(MY − Xβ) = Y'(I − M)(MY − Xβ) = [(I − M)Y]'(MY − Xβ) = 0,

because (I − M)Y ∈ N(X') and MY − Xβ ∈ C(X). Similarly, (**) = 0 as well.
Result 2.6. The perpendicular projection matrix onto C(X) is given by

M = X(X'X)^- X'.

Proof. We know that β̂ = (X'X)^- X'Y is a solution to the normal equations, so it is a
least squares estimate. But, by Result 2.5, we know Xβ̂ = MY. Because perpendicular
projection matrices are unique, M = X(X'X)^- X' as claimed.

NOTATION: Monahan uses P_X to denote the perpendicular projection matrix onto
C(X). We will henceforth do the same; that is,

P_X = X(X'X)^- X'.
PROPERTIES: Let P_X denote the perpendicular projection matrix onto C(X). Then

(a) P_X is idempotent
(b) P_X projects onto C(X)
(c) P_X is invariant to the choice of (X'X)^-
(d) P_X is symmetric
(e) P_X is unique.

We have already proven (a), (b), (d), and (e); see Matrix Algebra Review 5. Part (c) must
be true; otherwise, part (e) would not hold. However, we can prove (c) more rigorously.
Result 2.7. If (X'X)_1^- and (X'X)_2^- are generalized inverses of X'X, then

1. X(X'X)_1^- X'X = X(X'X)_2^- X'X = X
2. X(X'X)_1^- X' = X(X'X)_2^- X'.

Proof. For v ∈ R^n, let v = v_1 + v_2, where v_1 ∈ C(X) and v_2 ⊥ C(X). Since v_1 ∈ C(X),
we know that v_1 = Xd, for some vector d. Then,

v'X(X'X)_1^- X'X = v_1'X(X'X)_1^- X'X = d'X'X(X'X)_1^- X'X = d'X'X = v'X,

since v_2 ⊥ C(X). Since v and (X'X)_1^- were arbitrary, we have shown the first part. To
show the second part, note that

X(X'X)_1^- X'v = X(X'X)_1^- X'Xd = X(X'X)_2^- X'Xd = X(X'X)_2^- X'v.

Since v is arbitrary, the second part follows as well.
Result 2.8. Suppose X is n × p with rank r ≤ p, and let P_X be the perpendicular
projection matrix onto C(X). Then r(P_X) = r(X) = r and r(I − P_X) = n − r.
Proof. Note that P_X is n × n. We know that C(P_X) = C(X), so the first part is obvious.
To show the second part, recall that I − P_X is the perpendicular projection matrix onto
N(X'), so it is idempotent. Thus,

r(I − P_X) = tr(I − P_X) = tr(I) − tr(P_X) = n − r(P_X) = n − r,

because the trace operator is linear and because P_X is idempotent as well.
SUMMARY: Consider the linear model Y = Xβ + ε, where E(ε) = 0; in what follows,
the cov(ε) = σ²I assumption is not needed. We have shown that a least squares estimate
of β is given by

β̂ = (X'X)^- X'Y.

This solution is not unique (unless X'X is nonsingular). However,

P_X Y = Xβ̂ ≡ Ŷ

is unique. We call Ŷ the vector of fitted values. Geometrically, Ŷ is the point in C(X)
that is closest to Y. Now, recall that I − P_X is the perpendicular projection matrix onto
N(X'). Note that

(I − P_X)Y = Y − P_X Y = Y − Ŷ ≡ e.

We call e the vector of residuals. Note that e ∈ N(X'). Because C(X) and N(X') are
orthogonal complements, we know that Y can be uniquely decomposed as

Y = Ŷ + e.

We also know that Ŷ and e are orthogonal vectors. Finally, note that

Y'Y = Y'IY = Y'(P_X + I − P_X)Y
    = Y'P_X Y + Y'(I − P_X)Y
    = Y'P_X P_X Y + Y'(I − P_X)(I − P_X)Y
    = Ŷ'Ŷ + e'e,

since P_X and I − P_X are both symmetric and idempotent; i.e., they are both perpendicular
projection matrices (but onto orthogonal spaces). This orthogonal decomposition of Y'Y
is often given in a tabular display called an analysis of variance (ANOVA) table.
ANOVA TABLE: Suppose that Y is n × 1, X is n × p with rank r ≤ p, β is p × 1, and
ε is n × 1. An ANOVA table looks like

Source      df       SS
Model       r        Ŷ'Ŷ = Y'P_X Y
Residual    n − r    e'e = Y'(I − P_X)Y
Total       n        Y'Y = Y'IY

It is interesting to note that the sum of squares column, abbreviated SS, catalogues
3 quadratic forms, Y'P_X Y, Y'(I − P_X)Y, and Y'IY. The degrees of freedom column,
abbreviated df, catalogues the ranks of the associated quadratic form matrices; i.e.,

r(P_X) = r
r(I − P_X) = n − r
r(I) = n.

The quantity Y'P_X Y is called the (uncorrected) model sum of squares, Y'(I − P_X)Y
is called the residual sum of squares, and Y'Y is called the (uncorrected) total sum of
squares.
NOTE: The following visualization analogy is taken liberally from Christensen (2002).
VISUALIZATION: One can think about the geometry of least squares estimation in
three dimensions (i.e., when n = 3). Consider your kitchen table and take one corner of
the table to be the origin. Take C(X) as the two dimensional subspace determined by the
surface of the table, and let Y be any vector originating at the origin; i.e., any point in
R³. The linear model says that E(Y) = Xβ, which just says that E(Y) is somewhere on
the table. The least squares estimate Ŷ = Xβ̂ = P_X Y is the perpendicular projection
of Y onto the surface of the table. The residual vector e = (I − P_X)Y is the vector
starting at the origin, perpendicular to the surface of the table, that reaches the same
height as Y. Another way to think of the residual vector is to first connect Y and
P_X Y with a line segment (that is perpendicular to the surface of the table). Then,
shift the line segment along the surface (keeping it perpendicular) until the line segment
has one end at the origin. The residual vector e is the perpendicular projection of Y
onto C(I − P_X) = N(X'); that is, the projection onto the orthogonal complement of the
table surface. The orthogonal complement C(I − P_X) is the one-dimensional space in
the vertical direction that goes through the origin. Once you have these vectors in place,
sums of squares arise from the Pythagorean Theorem.
A SIMPLE PPM: Suppose Y_1, Y_2, ..., Y_n are iid with mean E(Y_i) = μ. In terms of the
general linear model, we can write Y = Xβ + ε, where

Y = (Y_1, Y_2, ..., Y_n)',   X = 1 = (1, 1, ..., 1)',   β = μ,   ε = (ε_1, ε_2, ..., ε_n)'.

The perpendicular projection matrix onto C(X) is given by

P_1 = 1(1'1)^- 1' = n^{-1} 11' = n^{-1} J,

where J is the n × n matrix of ones. Note that

P_1 Y = n^{-1} JY = Ȳ1,

where Ȳ = n^{-1} Σ_{i=1}^n Y_i. The perpendicular projection matrix P_1 projects Y onto the
space

C(P_1) = {z ∈ R^n : z = (a, a, ..., a)'; a ∈ R}.

Note that r(P_1) = 1. Note also that

(I − P_1)Y = Y − P_1 Y = Y − Ȳ1 = (Y_1 − Ȳ, Y_2 − Ȳ, ..., Y_n − Ȳ)',

the vector which contains the deviations from the mean. The perpendicular projection
matrix I − P_1 projects Y onto

C(I − P_1) = { z ∈ R^n : z = (a_1, a_2, ..., a_n)'; a_i ∈ R, Σ_{i=1}^n a_i = 0 }.

Note that r(I − P_1) = n − 1.
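As a small numerical check of these facts (a Python/NumPy sketch with arbitrary data,
not from the notes): P_1 Y returns a vector whose entries are all Ȳ, (I − P_1)Y returns
the deviations from the mean, and the ranks are 1 and n − 1.

import numpy as np

n = 5
Y = np.array([3.0, 1.0, 4.0, 1.0, 5.0])         # arbitrary data for illustration
P1 = np.ones((n, n)) / n                        # n^{-1} J

print(P1 @ Y)                # every entry is Ybar
print((np.eye(n) - P1) @ Y)  # deviations Y_i - Ybar, which sum to zero
print(np.linalg.matrix_rank(P1), np.linalg.matrix_rank(np.eye(n) - P1))  # 1 and n-1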
REMARK: The matrix P_1 plays an important role in linear models, and here is why.
Most linear models, when written out in non-matrix notation, contain an intercept
term. For example, in simple linear regression,

Y_i = β_0 + β_1 x_i + ε_i,

or in ANOVA-type models like

Y_ijk = μ + α_i + β_j + γ_ij + ε_ijk,

the intercept terms are β_0 and μ, respectively. In the corresponding design matrices, the
first column of X is 1. If we discard the other terms like β_1 x_i and α_i + β_j + γ_ij in the
models above, then we have a reduced model of the form Y_i = μ + ε_i; that is, a model
that relates Y_i to its overall mean, or, in matrix notation, Y = 1μ + ε. The perpendicular
projection matrix onto C(1) is P_1 and

Y'P_1 Y = Y'P_1 P_1 Y = (P_1 Y)'(P_1 Y) = nȲ².
This is the model sum of squares for the model Y_i = μ + ε_i; that is, Y'P_1 Y is the sum of
squares that arises from fitting the overall mean μ. Now, consider a general linear model
of the form Y = Xβ + ε, where E(ε) = 0, and suppose that the first column of X is 1.
In general, we know that

Y'Y = Y'IY = Y'P_X Y + Y'(I − P_X)Y.

Subtracting Y'P_1 Y from both sides, we get

Y'(I − P_1)Y = Y'(P_X − P_1)Y + Y'(I − P_X)Y.

The quantity Y'(I − P_1)Y is called the corrected total sum of squares and the quantity
Y'(P_X − P_1)Y is called the corrected model sum of squares. The term "corrected"
is understood to mean that we have removed the effects of fitting the mean. This is
important because this is the sum of squares breakdown that is commonly used; i.e.,

Source               df       SS
Model (Corrected)    r − 1    Y'(P_X − P_1)Y
Residual             n − r    Y'(I − P_X)Y
Total (Corrected)    n − 1    Y'(I − P_1)Y

In ANOVA models, the corrected model sum of squares Y'(P_X − P_1)Y is often broken
down further into smaller components which correspond to different parts; e.g., orthogonal
contrasts, main effects, interaction terms, etc. Finally, the degrees of freedom are
simply the corresponding ranks of P_X − P_1, I − P_X, and I − P_1.
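The corrected decomposition is easy to verify numerically. The Python/NumPy sketch
below (illustrative; a simple linear regression with fabricated data) computes the three
quadratic forms and checks that the corrected total sum of squares equals the corrected
model sum of squares plus the residual sum of squares.

import numpy as np

rng = np.random.default_rng(1)
n = 10
x = np.arange(1.0, n + 1)
Y = 2.0 + 0.5 * x + rng.normal(size=n)          # fabricated responses for illustration

X = np.column_stack([np.ones(n), x])
PX = X @ np.linalg.pinv(X.T @ X) @ X.T          # perpendicular projection onto C(X)
P1 = np.ones((n, n)) / n                        # perpendicular projection onto C(1)

ss_model_corr = Y @ (PX - P1) @ Y
ss_resid = Y @ (np.eye(n) - PX) @ Y
ss_total_corr = Y @ (np.eye(n) - P1) @ Y
print(np.isclose(ss_total_corr, ss_model_corr + ss_resid))   # True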
NOTE: In the general linear model Y = Xβ + ε, the residual vector from the least
squares fit e = (I − P_X)Y ∈ N(X'), so e'X = 0; that is, the residuals in a least squares
fit are orthogonal to the columns of X, since the columns of X are in C(X). Note that if
1 ∈ C(X), which is true of all linear models with an intercept term, then

e'1 = Σ_{i=1}^n e_i = 0;

that is, the sum of the residuals from a least squares fit is zero. This is not necessarily
true of models for which 1 ∉ C(X).
Result 2.9. If C(W) ⊂ C(X), then P_X − P_W is the perpendicular projection matrix
onto C[(I − P_W)X].
Proof. It suffices to show that (a) P_X − P_W is symmetric and idempotent and that (b)
C(P_X − P_W) = C[(I − P_W)X]. First note that P_X P_W = P_W because the columns of
P_W are in C(W) ⊂ C(X). By symmetry, P_W P_X = P_W. Now,

(P_X − P_W)(P_X − P_W) = P_X² − P_X P_W − P_W P_X + P_W² = P_X − P_W − P_W + P_W = P_X − P_W.

Thus, P_X − P_W is idempotent. Also, (P_X − P_W)' = P_X' − P_W' = P_X − P_W, so P_X − P_W
is symmetric. Thus, P_X − P_W is a perpendicular projection matrix onto C(P_X − P_W).
Suppose that v ∈ C(P_X − P_W); i.e., v = (P_X − P_W)d, for some d. Write d = d_1 + d_2,
where d_1 ∈ C(X) and d_2 ∈ N(X'); that is, d_1 = Xa, for some a, and X'd_2 = 0. Then,

v = (P_X − P_W)(d_1 + d_2)
  = (P_X − P_W)(Xa + d_2)
  = P_X Xa + P_X d_2 − P_W Xa − P_W d_2
  = Xa + 0 − P_W Xa − 0
  = (I − P_W)Xa ∈ C[(I − P_W)X].

Thus, C(P_X − P_W) ⊂ C[(I − P_W)X]. Now, suppose that w ∈ C[(I − P_W)X]. Then
w = (I − P_W)Xc, for some c. Thus,

w = Xc − P_W Xc = P_X Xc − P_W Xc = (P_X − P_W)Xc ∈ C(P_X − P_W).

This shows that C[(I − P_W)X] ⊂ C(P_X − P_W).

TERMINOLOGY: Suppose that V is a vector space and that S is a subspace of V; i.e.,
S ⊂ V. The subspace

S_V^⊥ = {z ∈ V : z ⊥ S}

is called the orthogonal complement of S with respect to V. If V = R^n, then S_V^⊥ = S^⊥
is simply referred to as the orthogonal complement of S.
Result 2.10. If C(W) ⊂ C(X), then C(P_X − P_W) = C[(I − P_W)X] is the orthogonal
complement of C(P_W) with respect to C(P_X); that is,

C(P_X − P_W) = C(P_W)^⊥_{C(P_X)}.

Proof. C(P_X − P_W) ⊥ C(P_W) because (P_X − P_W)P_W = P_X P_W − P_W² = P_W − P_W = 0.
Because C(P_X − P_W) ⊂ C(P_X), C(P_X − P_W) is contained in the orthogonal complement
of C(P_W) with respect to C(P_X). Now suppose that v ∈ C(P_X) and v ⊥ C(P_W). Then,

v = P_X v = (P_X − P_W)v + P_W v = (P_X − P_W)v ∈ C(P_X − P_W),

showing that the orthogonal complement of C(P_W) with respect to C(P_X) is contained
in C(P_X − P_W).
REMARK: The preceding two results are important for hypothesis testing in linear
models. Consider the linear models

Y = Xβ + ε   and   Y = Wγ + ε,

where C(W) ⊂ C(X). As we will learn later, the condition C(W) ⊂ C(X) implies that
Y = Wγ + ε is a reduced model when compared to Y = Xβ + ε, sometimes called
the full model. If E(ε) = 0, then, if the full model is correct,

E(P_X Y) = P_X E(Y) = P_X Xβ = Xβ ∈ C(X).

Similarly, if the reduced model is correct, E(P_W Y) = Wγ ∈ C(W). Note that if
the reduced model Y = Wγ + ε is correct, then the full model Y = Xβ + ε is also
correct since C(W) ⊂ C(X). Thus, if the reduced model is correct, P_X Y and P_W Y
are attempting to estimate the same thing and their difference (P_X − P_W)Y should be
small. On the other hand, if the reduced model is not correct, but the full model is, then
P_X Y and P_W Y are estimating different things and one would expect (P_X − P_W)Y to
be large. The question about whether or not to accept the reduced model as plausible
thus hinges on deciding whether or not (P_X − P_W)Y, the (perpendicular) projection of
Y onto C(P_X − P_W) = C(P_W)^⊥_{C(P_X)}, is large or small.
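The Python/NumPy sketch below (fabricated data, added here for illustration) makes
this concrete: W contains only an intercept, X adds a covariate, and the length of
(P_X − P_W)Y is small when the reduced model generated the data but large when only
the full model holds.

import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.normal(size=n)
W = np.ones((n, 1))                              # reduced model: intercept only
X = np.column_stack([np.ones(n), x])             # full model, C(W) subset of C(X)

PW = W @ np.linalg.pinv(W.T @ W) @ W.T
PX = X @ np.linalg.pinv(X.T @ X) @ X.T

Y_reduced_true = 1.0 + rng.normal(size=n)        # reduced model holds
Y_full_true = 1.0 + 3.0 * x + rng.normal(size=n) # only the full model holds

print(np.linalg.norm((PX - PW) @ Y_reduced_true))   # small
print(np.linalg.norm((PX - PW) @ Y_full_true))      # large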
2.3 Reparameterization
REMARK: For estimation in the general linear model Y = Xβ + ε, where E(ε) = 0,
we can only learn about β through Xβ ∈ C(X). Thus, the crucial item needed is
P_X, the perpendicular projection matrix onto C(X). For convenience, we call C(X) the
estimation space. P_X is the perpendicular projection matrix onto the estimation space.
We call N(X') the error space. I − P_X is the perpendicular projection matrix onto the
error space.

IMPORTANT: Any two linear models with the same estimation space are really the
same model; the models are said to be reparameterizations of each other. Any two
such models will give the same predicted values, the same residuals, the same ANOVA
table, etc. In particular, suppose that we have two linear models:

Y = Xβ + ε   and   Y = Wγ + ε.

If C(X) = C(W), then P_X does not depend on which of X or W is used; it depends only
on C(X) = C(W). As we will find out, the least-squares estimate of E(Y) is

Ŷ = P_X Y = Xβ̂ = Wγ̂.

IMPLICATION: The parameters β in the model Y = Xβ + ε, where E(ε) = 0, are
not really all that crucial. Because of this, it is standard to reparameterize linear models
(i.e., change the parameters) to exploit computational advantages, as we will soon see.
The essence of the model is that E(Y) ∈ C(X). As long as we do not change C(X), the
design matrix X and the corresponding model parameters can be altered in a manner
suitable to our liking.
EXAMPLE: Recall the simple linear regression model from Chapter 1 given by

Y_i = β_0 + β_1 x_i + ε_i,

for i = 1, 2, ..., n. Although not critical for this discussion, we will assume that ε_1, ε_2, ..., ε_n
are uncorrelated random variables with mean 0 and common variance σ² > 0. Recall
that, in matrix notation,

Y_{n×1} = (Y_1, Y_2, ..., Y_n)',   X_{n×2} with ith row (1, x_i),   β_{2×1} = (β_0, β_1)',   ε_{n×1} = (ε_1, ε_2, ..., ε_n)'.

As long as (x_1, x_2, ..., x_n)' is not a multiple of 1_n and at least one x_i ≠ 0, then r(X) = 2
and (X'X)^{-1} exists. Straightforward calculations show that

X'X = [ n         Σ_i x_i
        Σ_i x_i   Σ_i x_i² ],

(X'X)^{-1} = [ 1/n + x̄²/Σ_i(x_i − x̄)²    −x̄/Σ_i(x_i − x̄)²
              −x̄/Σ_i(x_i − x̄)²           1/Σ_i(x_i − x̄)² ],

and

X'Y = ( Σ_i Y_i,  Σ_i x_i Y_i )'.

Thus, the (unique) least squares estimator is given by

β̂ = (X'X)^{-1} X'Y = ( β̂_0, β̂_1 )' = ( Ȳ − β̂_1 x̄,  Σ_i(x_i − x̄)(Y_i − Ȳ)/Σ_i(x_i − x̄)² )'.

For the simple linear regression model, it can be shown (verify!) that the perpendicular
projection matrix P_X = X(X'X)^{-1}X' has (i, j)th entry

(P_X)_{ij} = 1/n + (x_i − x̄)(x_j − x̄)/Σ_k(x_k − x̄)².

A reparameterization of the simple linear regression model Y_i = β_0 + β_1 x_i + ε_i is

Y_i = γ_0 + γ_1(x_i − x̄) + ε_i,

or Y = Wγ + ε, where

Y_{n×1} = (Y_1, Y_2, ..., Y_n)',   W_{n×2} with ith row (1, x_i − x̄),   γ_{2×1} = (γ_0, γ_1)',   ε_{n×1} = (ε_1, ε_2, ..., ε_n)'.
To see why this is a reparameterized model, note that if we define

U = [ 1  −x̄
      0   1 ],

then W = XU and X = WU^{-1} (verify!) so that C(X) = C(W). Moreover, E(Y) =
Xβ = Wγ = XUγ. Taking P* = (X'X)^{-1}X' leads to β = P*Xβ = P*XUγ = Uγ; i.e.,

β = ( β_0, β_1 )' = ( γ_0 − γ_1 x̄, γ_1 )' = Uγ.

To find the least-squares estimator for γ in the reparameterized model, observe that

W'W = [ n  0                          (W'W)^{-1} = [ 1/n   0
        0  Σ_i(x_i − x̄)² ]   and                    0     1/Σ_i(x_i − x̄)² ].

Note that (W'W)^{-1} is diagonal; this is one of the benefits to working with this
parameterization. The least squares estimator of γ is given by

γ̂ = (W'W)^{-1}W'Y = ( γ̂_0, γ̂_1 )' = ( Ȳ,  Σ_i(x_i − x̄)(Y_i − Ȳ)/Σ_i(x_i − x̄)² )',

which is different than β̂. However, it can be shown directly (verify!) that the perpendicular
projection matrix onto C(W) is P_W = W(W'W)^{-1}W', whose (i, j)th entry is

1/n + (x_i − x̄)(x_j − x̄)/Σ_k(x_k − x̄)²,

which is the same as P_X. Thus, the fitted values will be the same; i.e., Ŷ = P_X Y =
Xβ̂ = Wγ̂ = P_W Y, and the analysis will be the same under both parameterizations.
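Here is a brief numerical check of this claim (a Python/NumPy sketch with arbitrary
x values and fabricated responses): centering the covariate changes the coefficient
estimates but leaves the projection matrix, and hence the fitted values, unchanged.

import numpy as np

rng = np.random.default_rng(3)
x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
Y = 1.0 + 2.0 * x + rng.normal(size=x.size)           # fabricated responses

X = np.column_stack([np.ones_like(x), x])             # original parameterization
W = np.column_stack([np.ones_like(x), x - x.mean()])  # centered reparameterization

PX = X @ np.linalg.inv(X.T @ X) @ X.T
PW = W @ np.linalg.inv(W.T @ W) @ W.T
print(np.allclose(PX, PW))                            # same projection matrix
print(np.allclose(PX @ Y, PW @ Y))                    # same fitted values

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
gamma_hat = np.linalg.solve(W.T @ W, W.T @ Y)
print(beta_hat, gamma_hat)                            # different coefficient estimates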
Exercise: Show that the one-way fixed effects ANOVA model Y_ij = μ + α_i + ε_ij, for
i = 1, 2, ..., a and j = 1, 2, ..., n_i, and the cell means model Y_ij = μ_i + ε_ij are
reparameterizations of each other. Does one parameterization confer advantages over the other?
3 Estimability and Least Squares Estimators
Complementary reading from Monahan: Chapter 3 (except Section 3.9).
3.1 Introduction
REMARK: Estimability is one of the most important concepts in linear models. Consider
the general linear model

Y = Xβ + ε,

where E(ε) = 0. In our discussion that follows, the assumption cov(ε) = σ²I is not
needed. Suppose that X is n × p with rank r ≤ p. If r = p (as in regression models), then
estimability concerns vanish as β is estimated uniquely by β̂ = (X'X)^{-1}X'Y. If r < p
(a common characteristic of ANOVA models), then β cannot be estimated uniquely.
However, even if β is not estimable, certain functions of β may be estimable.
3.2 Estimability
DEFINITIONS:

1. An estimator t(Y) is said to be unbiased for λ'β iff E{t(Y)} = λ'β, for all β.

2. An estimator t(Y) is said to be a linear estimator in Y iff t(Y) = c + a'Y, for
c ∈ R and a = (a_1, a_2, ..., a_n)', a_i ∈ R.

3. A function λ'β is said to be (linearly) estimable iff there exists a linear unbiased
estimator for it. Otherwise, λ'β is nonestimable.
Result 3.1. Under the model assumptions Y = Xβ + ε, where E(ε) = 0, a linear
function λ'β is estimable iff there exists a vector a such that λ' = a'X; that is,
λ' ∈ R(X).
Proof. (⇐) Suppose that there exists a vector a such that λ' = a'X. Then, E(a'Y) =
a'Xβ = λ'β, for all β. Therefore, a'Y is a linear unbiased estimator of λ'β and hence
λ'β is estimable. (⇒) Suppose that λ'β is estimable. Then, there exists an estimator
c + a'Y that is unbiased for it; that is, E(c + a'Y) = λ'β, for all β. Note that E(c + a'Y) =
c + a'Xβ, so λ'β = c + a'Xβ, for all β. Taking β = 0 shows that c = 0. Successively
taking β to be the standard unit vectors convinces us that λ' = a'X; i.e., λ' ∈ R(X).
Example 3.1. Consider the one-way fixed effects ANOVA model

Y_ij = μ + α_i + ε_ij,

for i = 1, 2, ..., a and j = 1, 2, ..., n_i, where E(ε_ij) = 0. Take a = 3 and n_i = 2 so that

Y = (Y_11, Y_12, Y_21, Y_22, Y_31, Y_32)',

X = [ 1 1 0 0
      1 1 0 0
      1 0 1 0
      1 0 1 0
      1 0 0 1
      1 0 0 1 ],

and β = (μ, α_1, α_2, α_3)'.

Note that r(X) = 3, so X is not of full rank; i.e., β is not uniquely estimable. Consider
the following parametric functions λ'β:

Parameter                         λ'                          λ' ∈ R(X)?   Estimable?
λ_1'β = μ                         λ_1' = (1, 0, 0, 0)          no           no
λ_2'β = α_1                       λ_2' = (0, 1, 0, 0)          no           no
λ_3'β = μ + α_1                   λ_3' = (1, 1, 0, 0)          yes          yes
λ_4'β = α_1 − α_2                 λ_4' = (0, 1, −1, 0)         yes          yes
λ_5'β = α_1 − (α_2 + α_3)/2       λ_5' = (0, 1, −1/2, −1/2)    yes          yes

Because λ_3'β = μ + α_1, λ_4'β = α_1 − α_2, and λ_5'β = α_1 − (α_2 + α_3)/2 are (linearly)
estimable, there must exist linear unbiased estimators for them. Note that

E(Ȳ_1+) = E{(Y_11 + Y_12)/2} = (1/2)(μ + α_1) + (1/2)(μ + α_1) = μ + α_1 = λ_3'β

and that Ȳ_1+ = c + a'Y, where c = 0 and a' = (1/2, 1/2, 0, 0, 0, 0). Also,

E(Ȳ_1+ − Ȳ_2+) = (μ + α_1) − (μ + α_2) = α_1 − α_2 = λ_4'β

and that Ȳ_1+ − Ȳ_2+ = c + a'Y, where c = 0 and a' = (1/2, 1/2, −1/2, −1/2, 0, 0). Finally,

E{Ȳ_1+ − (Ȳ_2+ + Ȳ_3+)/2} = (μ + α_1) − (1/2){(μ + α_2) + (μ + α_3)}
                           = α_1 − (α_2 + α_3)/2 = λ_5'β.

Note that

Ȳ_1+ − (Ȳ_2+ + Ȳ_3+)/2 = c + a'Y,

where c = 0 and a' = (1/2, 1/2, −1/4, −1/4, −1/4, −1/4).
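A convenient numerical test of whether λ' ∈ R(X) is to compare r(X) with the rank of
X augmented by the row λ'. The Python/NumPy sketch below (added for illustration,
not part of the notes) applies this test to λ_1 through λ_5 from the table above:

import numpy as np

# Design matrix from Example 3.1 (a = 3, n_i = 2).
X = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 0, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1]], dtype=float)

lambdas = {"mu": [1, 0, 0, 0],
           "alpha1": [0, 1, 0, 0],
           "mu+alpha1": [1, 1, 0, 0],
           "alpha1-alpha2": [0, 1, -1, 0],
           "alpha1-(alpha2+alpha3)/2": [0, 1, -0.5, -0.5]}

r = np.linalg.matrix_rank(X)
for name, lam in lambdas.items():
    augmented = np.vstack([X, lam])
    estimable = np.linalg.matrix_rank(augmented) == r   # lam lies in the row space?
    print(f"{name}: estimable = {estimable}")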


REMARKS:

1. The elements of the vector Xβ are estimable.

2. If λ_1'β, λ_2'β, ..., λ_k'β are estimable, then any linear combination of them; i.e.,
Σ_{i=1}^k d_i λ_i'β, where d_i ∈ R, is also estimable.

3. If X is n × p and r(X) = p, then R(X) = R^p and λ'β is estimable for all λ.

1
,

2
, ...,

k
are said to be linearly independent
if
1
,
2
, ...,
k
comprise a set of linearly independent vectors; i.e., = (
1

2

k
)
has rank k.
Result 3.2. Under the model assumptions Y = Xβ + ε, where E(ε) = 0, we can
always find r = r(X) linearly independent estimable functions. Moreover, no collection
of estimable functions can contain more than r linearly independent functions.
Proof. Let x_i' denote the ith row of X, for i = 1, 2, ..., n. Clearly, x_1'β, x_2'β, ..., x_n'β are
estimable. Because r(X) = r, we can select r linearly independent rows of X; the
corresponding r functions x_i'β are linearly independent. Now, let Λ'β = (λ_1'β, λ_2'β, ..., λ_k'β)'
be any collection of estimable functions. Then, λ_i' ∈ R(X), for i = 1, 2, ..., k, and hence
there exists a matrix A such that Λ' = A'X. Therefore, r(Λ') = r(A'X) ≤ r(X) = r.
Hence, there can be at most r linearly independent estimable functions.
DEFINITION: A least squares estimator of an estimable function λ'β is λ'β̂, where
β̂ = (X'X)^- X'Y is any solution to the normal equations.
Result 3.3. Under the model assumptions Y = Xβ + ε, where E(ε) = 0, if λ'β is
estimable, then λ'β̂ = λ'β̃ for any two solutions β̂ and β̃ to the normal equations.
Proof. Suppose that λ'β is estimable. Then λ' = a'X, for some a. From Result 2.5,

λ'β̂ = a'Xβ̂ = a'P_X Y
λ'β̃ = a'Xβ̃ = a'P_X Y.

This proves the result.
Alternate proof. If β̂ and β̃ both solve the normal equations, then X'X(β̂ − β̃) = 0; that
is, β̂ − β̃ ∈ N(X'X) = N(X). If λ'β is estimable, then λ' ∈ R(X) ⟹ λ ∈ C(X') ⟹
λ ⊥ N(X). Thus, λ'(β̂ − β̃) = 0; i.e., λ'β̂ = λ'β̃.

IMPLICATION: Least squares estimators of (linearly) estimable functions are invariant
to the choice of generalized inverse used to solve the normal equations.
Example 3.2. In Example 3.1, we considered the one-way fixed effects ANOVA model
Y_ij = μ + α_i + ε_ij, for i = 1, 2, 3 and j = 1, 2. For this model, it is easy to show that

X'X = [ 6 2 2 2
        2 2 0 0
        2 0 2 0
        2 0 0 2 ]

and r(X'X) = 3. Here are two generalized inverses of X'X:

(X'X)_1^- = [ 0  0    0    0            (X'X)_2^- = [  1/2  −1/2  −1/2  0
              0  1/2  0    0                          −1/2   1     1/2  0
              0  0    1/2  0                          −1/2   1/2   1    0
              0  0    0    1/2 ],                       0    0     0    0 ].

Note that

X'Y = ( Y_11 + Y_12 + Y_21 + Y_22 + Y_31 + Y_32,  Y_11 + Y_12,  Y_21 + Y_22,  Y_31 + Y_32 )'.

Two least squares solutions (verify!) are thus

β̂ = (X'X)_1^- X'Y = ( 0, Ȳ_1+, Ȳ_2+, Ȳ_3+ )'

and

β̃ = (X'X)_2^- X'Y = ( Ȳ_3+, Ȳ_1+ − Ȳ_3+, Ȳ_2+ − Ȳ_3+, 0 )'.

Recall our estimable functions from Example 3.1:

Parameter                         λ'                          λ' ∈ R(X)?   Estimable?
λ_3'β = μ + α_1                   λ_3' = (1, 1, 0, 0)          yes          yes
λ_4'β = α_1 − α_2                 λ_4' = (0, 1, −1, 0)         yes          yes
λ_5'β = α_1 − (α_2 + α_3)/2       λ_5' = (0, 1, −1/2, −1/2)    yes          yes

Note that

for λ_3'β = μ + α_1, the (unique) least squares estimator is Ȳ_1+.

for λ_4'β = α_1 − α_2, the (unique) least squares estimator is Ȳ_1+ − Ȳ_2+.

for λ_5'β = α_1 − (α_2 + α_3)/2, the (unique) least squares estimator is Ȳ_1+ − (Ȳ_2+ + Ȳ_3+)/2.

Finally, note that these three estimable functions are linearly independent since

Λ = ( λ_3  λ_4  λ_5 ) = [ 1   0    0
                          1   1    1
                          0  −1  −1/2
                          0   0  −1/2 ]

has rank r(Λ) = 3. Of course, more estimable functions λ'β can be found, but we can
find no more linearly independent estimable functions because r(X) = 3.
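The invariance claimed in Result 3.3 can be confirmed directly on Example 3.2. In the
Python/NumPy sketch below (fabricated responses, illustrative only), the two
generalized inverses give different solutions, yet identical values of λ'β̂ for the estimable λ's.

import numpy as np

X = np.kron(np.eye(3), np.ones((2, 1)))
X = np.hstack([np.ones((6, 1)), X])              # Example 3.1/3.2 design, rank 3
rng = np.random.default_rng(4)
Y = rng.normal(size=6)                           # fabricated responses

# The two generalized inverses of X'X displayed in Example 3.2.
G1 = np.diag([0.0, 0.5, 0.5, 0.5])
G2 = np.array([[ 0.5, -0.5, -0.5, 0.0],
               [-0.5,  1.0,  0.5, 0.0],
               [-0.5,  0.5,  1.0, 0.0],
               [ 0.0,  0.0,  0.0, 0.0]])

XtX, XtY = X.T @ X, X.T @ Y
for G in (G1, G2):
    print("generalized inverse?", np.allclose(XtX @ G @ XtX, XtX))

beta_hat, beta_tilde = G1 @ XtY, G2 @ XtY
print(beta_hat, beta_tilde)                      # two different solutions
for lam in ([1, 1, 0, 0], [0, 1, -1, 0], [0, 1, -0.5, -0.5]):
    lam = np.array(lam)
    print(lam @ beta_hat, lam @ beta_tilde)      # identical estimates (up to rounding)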
Result 3.4. Under the model assumptions Y = Xβ + ε, where E(ε) = 0, the least
squares estimator λ'β̂ of an estimable function λ'β is a linear unbiased estimator of λ'β.
Proof. Suppose that β̂ solves the normal equations. We know (by definition) that λ'β̂ is
the least squares estimator of λ'β. Note that

λ'β̂ = λ'{(X'X)^- X'Y + [I − (X'X)^- X'X]z}
    = λ'(X'X)^- X'Y + λ'[I − (X'X)^- X'X]z.

Also, λ'β is estimable by assumption, so λ' ∈ R(X) ⟹ λ ∈ C(X') ⟹ λ ⊥ N(X). Result
MAR5.2 says that [I − (X'X)^- X'X]z ∈ N(X'X) = N(X), so λ'[I − (X'X)^- X'X]z =
0. Thus, λ'β̂ = λ'(X'X)^- X'Y, which is a linear estimator in Y. We now show that λ'β̂
is unbiased. Because λ'β is estimable, λ' ∈ R(X) ⟹ λ' = a'X, for some a. Thus,

E(λ'β̂) = E{λ'(X'X)^- X'Y} = λ'(X'X)^- X'E(Y)
       = λ'(X'X)^- X'Xβ
       = a'X(X'X)^- X'Xβ
       = a'P_X Xβ = a'Xβ = λ'β.
SUMMARY: Consider the linear model Y = Xβ + ε, where E(ε) = 0. From the
definition, we know that λ'β is estimable iff there exists a linear unbiased estimator for
it, so if we can find a linear estimator c + a'Y whose expectation equals λ'β, for all β, then
λ'β is estimable. From Result 3.1, we know that λ'β is estimable iff λ' ∈ R(X). Thus,
if λ' can be expressed as a linear combination of the rows of X, then λ'β is estimable.
IMPORTANT: Here is a commonly-used method of finding necessary and sufficient
conditions for estimability in linear models with E(ε) = 0. Suppose that X is n × p
with rank r < p. We know that λ'β is estimable iff λ' ∈ R(X).

Typically, when we find the rank of X, we find r linearly independent columns of
X and express the remaining s = p − r columns as linear combinations of the r
linearly independent columns of X. Suppose that c_1, c_2, ..., c_s satisfy Xc_i = 0, for
i = 1, 2, ..., s, that is, c_i ∈ N(X), for i = 1, 2, ..., s. If {c_1, c_2, ..., c_s} forms a basis
for N(X); i.e., c_1, c_2, ..., c_s are linearly independent, then

λ'c_1 = 0
λ'c_2 = 0
  ...
λ'c_s = 0

are necessary and sufficient conditions for λ'β to be estimable.

REMARK: There are two spaces of interest: C(X') (equivalently, R(X)) and N(X). If X
is n × p with rank r < p, then dim{C(X')} = r and dim{N(X)} = s = p − r. Therefore, if
c_1, c_2, ..., c_s are linearly independent, then {c_1, c_2, ..., c_s} must be a basis for N(X). But,

λ'β estimable ⟺ λ' ∈ R(X) ⟺ λ ∈ C(X')
             ⟺ λ is orthogonal to every vector in N(X)
             ⟺ λ is orthogonal to c_1, c_2, ..., c_s
             ⟺ λ'c_i = 0, i = 1, 2, ..., s.

Therefore, λ'β is estimable iff λ'c_i = 0, for i = 1, 2, ..., s, where c_1, c_2, ..., c_s are s linearly
independent vectors satisfying Xc_i = 0.
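This criterion is also easy to apply numerically: take a basis for N(X) (for example,
from the singular value decomposition) and check λ'c_i = 0 for each basis vector. The
Python/NumPy sketch below (an illustration, reusing the design matrix of Example 3.1)
does exactly that:

import numpy as np

X = np.array([[1, 1, 0, 0], [1, 1, 0, 0],
              [1, 0, 1, 0], [1, 0, 1, 0],
              [1, 0, 0, 1], [1, 0, 0, 1]], dtype=float)

# Basis for N(X): right singular vectors whose singular values are (numerically) zero.
U, s, Vt = np.linalg.svd(X)
null_basis = Vt[np.linalg.matrix_rank(X):]       # s = p - r rows, each a c_i' with Xc_i = 0

def estimable(lam, tol=1e-10):
    # lam'beta is estimable iff lam is orthogonal to every null-space basis vector
    return np.all(np.abs(null_basis @ np.asarray(lam, dtype=float)) < tol)

print(estimable([0, 1, 0, 0]))        # alpha_1 alone: False
print(estimable([1, 1, 0, 0]))        # mu + alpha_1: True
print(estimable([0, 1, -1, 0]))       # alpha_1 - alpha_2: True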
TERMINOLOGY: A set of linear functions {λ_1'β, λ_2'β, ..., λ_k'β} is said to be jointly
nonestimable if the only linear combination of λ_1'β, λ_2'β, ..., λ_k'β that is estimable is
the trivial one; i.e., 0. These types of functions are useful in non-full-rank linear models
and are associated with side conditions.
3.2.1 One-way ANOVA
GENERAL CASE: Consider the one-way fixed effects ANOVA model Y_ij = μ + α_i + ε_ij,
for i = 1, 2, ..., a and j = 1, 2, ..., n_i, where E(ε_ij) = 0. In matrix form, X and β are

X_{n×p} = [ 1_{n_1}  1_{n_1}  0_{n_1}  ...  0_{n_1}
            1_{n_2}  0_{n_2}  1_{n_2}  ...  0_{n_2}
              ...      ...      ...          ...
            1_{n_a}  0_{n_a}  0_{n_a}  ...  1_{n_a} ]   and   β_{p×1} = (μ, α_1, α_2, ..., α_a)',

where p = a + 1 and n = Σ_i n_i. Note that the last a columns of X are linearly independent
and the first column is the sum of the last a columns. Hence, r(X) = r = a and
s = p − r = 1. With c_1 = (1, −1_a')', note that Xc_1 = 0 so {c_1} forms a basis for N(X).
Thus, the necessary and sufficient condition for λ'β = λ_0 μ + Σ_{i=1}^a λ_i α_i to be estimable
is

λ'c_1 = 0 ⟺ λ_0 = Σ_{i=1}^a λ_i.

Here are some examples of estimable functions:

1. μ + α_i
2. α_i − α_k
3. any contrast in the α's; i.e., Σ_{i=1}^a λ_i α_i, where Σ_{i=1}^a λ_i = 0.

Here are some examples of nonestimable functions:

1. μ
2. α_i
3. Σ_{i=1}^a n_i α_i.

There is only s = 1 jointly nonestimable function. Later we will learn that jointly
nonestimable functions can be used to force particular solutions to the normal equations.
The following are examples of sets of linearly independent estimable functions (verify!):

1. {μ + α_1, μ + α_2, ..., μ + α_a}
2. {μ + α_1, α_1 − α_2, ..., α_1 − α_a}.
LEAST SQUARES ESTIMATES: We now wish to calculate the least squares estimates
of estimable functions. Note that X

X and one generalized inverse of X

X is given by
X

X =
_
_
_
_
_
_
_
_
_
_
_
n n
1
n
2
n
a
n
1
n
1
0 0
n
2
0 n
2
0
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
n
a
0 0 n
a
_
_
_
_
_
_
_
_
_
_
_
and (X

X)

=
_
_
_
_
_
_
_
_
_
_
_
0 0 0 0
0 1/n
1
0 0
0 0 1/n
2
0
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
0 0 0 1/n
a
_
_
_
_
_
_
_
_
_
_
_
For this generalized inverse, the least squares estimate is

= (X

X)

Y =
_
_
_
_
_
_
_
_
_
_
_
0 0 0 0
0 1/n
1
0 0
0 0 1/n
2
0
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
0 0 0 1/n
a
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_

j
Y
ij

j
Y
1j

j
Y
2j
.
.
.

j
Y
aj
_
_
_
_
_
_
_
_
_
_
_
=
_
_
_
_
_
_
_
_
_
_
_
0
Y
1+
Y
2+
.
.
.
Y
a+
_
_
_
_
_
_
_
_
_
_
_
.
REMARK: We know that this solution is not unique; had we used a dierent generalized
inverse above, we would have gotten a dierent least squares estimate of . However, least
squares estimates of estimable functions

are invariant to the choice of generalized


inverse, so our choice of (X

X)

above is as good as any other. From this solution, we


have the unique least squares estimates:
Estimable function,

Least squares estimate,

+
i
Y
i+

k
Y
i+
Y
k+

a
i=1

i
, where

a
i=1

i
= 0

a
i=1

i
Y
i+
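The invariance of λ'β̂ to the choice of generalized inverse is easy to check numerically. Below is a small sketch (not part of the original notes) using NumPy and made-up group sizes; it compares the g-inverse displayed above with the Moore-Penrose inverse of X'X.

```python
import numpy as np

rng = np.random.default_rng(714)
n_i = [4, 3, 5]                      # hypothetical group sizes n_1, n_2, n_3
a = len(n_i)
y = rng.normal(size=sum(n_i))

# Design matrix for Y_ij = mu + alpha_i + e_ij (intercept + group indicators)
X = np.column_stack([np.ones(sum(n_i))] +
                    [np.repeat(np.eye(a)[:, i], n_i) for i in range(a)])

# Generalized inverse from the notes: diag(0, 1/n_1, ..., 1/n_a)
G1 = np.diag([0.0] + [1.0 / ni for ni in n_i])
# Another generalized inverse: the Moore-Penrose inverse of X'X
G2 = np.linalg.pinv(X.T @ X)

b1 = G1 @ X.T @ y                    # one least squares solution
b2 = G2 @ X.T @ y                    # a different least squares solution

lam = np.array([1, 1, 0, 0])         # lambda'beta = mu + alpha_1 (estimable)
print(np.allclose(lam @ b1, lam @ b2))   # True: the estimate is invariant
print(np.allclose(b1, b2))               # False: beta-hat itself is not unique
```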
3.2.2 Two-way crossed ANOVA with no interaction

GENERAL CASE: Consider the two-way fixed effects (crossed) ANOVA model

    Y_ijk = μ + αᵢ + βⱼ + ε_ijk,

for i = 1, 2, ..., a, j = 1, 2, ..., b, and k = 1, 2, ..., n_ij, where E(ε_ijk) = 0. For ease of presentation, we take n_ij = 1 so there is no need for a k subscript; that is, we can rewrite the model as Y_ij = μ + αᵢ + βⱼ + ε_ij. In matrix form, X and β are

    X_{n×p} = [ 1_b  1_b  0_b  ...  0_b  I_b ]      β_{p×1} = (μ, α₁, ..., α_a, β₁, ..., β_b)',
              [ 1_b  0_b  1_b  ...  0_b  I_b ]
              [ ...                          ]
              [ 1_b  0_b  0_b  ...  1_b  I_b ]

where p = a + b + 1 and n = ab. Note that the first column is the sum of the last b columns. The 2nd column is the sum of the last b columns minus the sum of columns 3 through a + 1. The remaining columns are linearly independent. Thus, we have s = 2 linear dependencies so that r(X) = a + b − 1. The dimension of N(X) is s = 2. Taking

    c₁ = (1, −1_a', 0_b')'   and   c₂ = (1, 0_a', −1_b')'

produces Xc₁ = Xc₂ = 0. Since c₁ and c₂ are linearly independent; i.e., neither is a multiple of the other, {c₁, c₂} is a basis for N(X). Thus, necessary and sufficient conditions for λ'β to be estimable are

    λ'c₁ = 0  ⟺  λ₀ = Σ_{i=1}^a λᵢ
    λ'c₂ = 0  ⟺  λ₀ = Σ_{j=1}^b λ_{a+j}.

Here are some examples of estimable functions:
1. μ + αᵢ + βⱼ
2. αᵢ − α_k
3. βⱼ − β_k
4. any contrast in the α's; i.e., Σ_{i=1}^a λᵢαᵢ, where Σ_{i=1}^a λᵢ = 0
5. any contrast in the β's; i.e., Σ_{j=1}^b λ_{a+j}βⱼ, where Σ_{j=1}^b λ_{a+j} = 0.

Here are some examples of nonestimable functions:
1. μ
2. αᵢ
3. βⱼ
4. Σ_{i=1}^a αᵢ
5. Σ_{j=1}^b βⱼ.

We can find s = 2 jointly nonestimable functions. Examples of sets of jointly nonestimable functions are
1. {α_a, β_b}
2. {Σᵢ αᵢ, Σⱼ βⱼ}.

A set of linearly independent estimable functions (verify!) is
1. {μ + α₁ + β₁, α₁ − α₂, ..., α₁ − α_a, β₁ − β₂, ..., β₁ − β_b}.

NOTE: When replication occurs; i.e., when n_ij > 1, for all i and j, our estimability findings are unchanged. Replication does not change R(X). We obtain the following least squares estimates:
    Estimable function, λ'β                     Least squares estimate, λ'β̂
    μ + αᵢ + βⱼ                                 Ȳ_{i++} + Ȳ_{+j+} − Ȳ_{+++}
    αᵢ − α_l                                    Ȳ_{i++} − Ȳ_{l++}
    βⱼ − β_l                                    Ȳ_{+j+} − Ȳ_{+l+}
    Σ_{i=1}^a cᵢαᵢ, with Σ_{i=1}^a cᵢ = 0        Σ_{i=1}^a cᵢ Ȳ_{i++}
    Σ_{j=1}^b dⱼβⱼ, with Σ_{j=1}^b dⱼ = 0        Σ_{j=1}^b dⱼ Ȳ_{+j+}

These formulae are still technically correct when n_ij = 1. When some n_ij = 0, i.e., there are missing cells, estimability may be affected; see Monahan, pp 46-48.
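As a quick check of these estimability conditions, one can verify numerically that λ'β is estimable exactly when λ lies in the row space of X, equivalently when λ'c₁ = λ'c₂ = 0. The sketch below is my own illustration (hypothetical a = 3, b = 2, n_ij = 1) and tests a few of the functions listed above.

```python
import numpy as np

a, b = 3, 2
# Additive two-way layout with n_ij = 1: columns are (mu, alpha_1..a, beta_1..b)
rows = [np.concatenate(([1.0], np.eye(a)[i], np.eye(b)[j]))
        for i in range(a) for j in range(b)]
X = np.array(rows)

def estimable(lam, X, tol=1e-10):
    # lambda'beta is estimable iff lambda' is in R(X); projecting lambda onto
    # the row space of X (using X^+ X) must recover lambda
    P_rowspace = np.linalg.pinv(X) @ X
    return np.allclose(P_rowspace @ lam, lam, atol=tol)

mu_a1_b1   = np.concatenate(([1.0], np.eye(a)[0], np.eye(b)[0]))   # mu+alpha_1+beta_1
a_contrast = np.concatenate(([0.0], [1, -1, 0], np.zeros(b)))      # alpha_1-alpha_2
a_alone    = np.concatenate(([0.0], np.eye(a)[0], np.zeros(b)))    # alpha_1

print(estimable(mu_a1_b1, X))    # True
print(estimable(a_contrast, X))  # True
print(estimable(a_alone, X))     # False
```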
3.2.3 Two-way crossed ANOVA with interaction

GENERAL CASE: Consider the two-way fixed effects (crossed) ANOVA model

    Y_ijk = μ + αᵢ + βⱼ + γᵢⱼ + ε_ijk,

for i = 1, 2, ..., a, j = 1, 2, ..., b, and k = 1, 2, ..., n_ij, where E(ε_ijk) = 0.

SPECIAL CASE: With a = 3, b = 2, and n_ij = 2, X and β are

    X = [ 1 1 0 0 1 0 1 0 0 0 0 0 ]      β = (μ, α₁, α₂, α₃, β₁, β₂, γ₁₁, γ₁₂, γ₂₁, γ₂₂, γ₃₁, γ₃₂)'.
        [ 1 1 0 0 1 0 1 0 0 0 0 0 ]
        [ 1 1 0 0 0 1 0 1 0 0 0 0 ]
        [ 1 1 0 0 0 1 0 1 0 0 0 0 ]
        [ 1 0 1 0 1 0 0 0 1 0 0 0 ]
        [ 1 0 1 0 1 0 0 0 1 0 0 0 ]
        [ 1 0 1 0 0 1 0 0 0 1 0 0 ]
        [ 1 0 1 0 0 1 0 0 0 1 0 0 ]
        [ 1 0 0 1 1 0 0 0 0 0 1 0 ]
        [ 1 0 0 1 1 0 0 0 0 0 1 0 ]
        [ 1 0 0 1 0 1 0 0 0 0 0 1 ]
        [ 1 0 0 1 0 1 0 0 0 0 0 1 ]

There are p = 12 parameters. The last six columns of X are linearly independent, and the other columns can be written as linear combinations of the last six columns, so r(X) = 6 and s = p − r = 6. To determine which functions λ'β are estimable, we need to find a basis for N(X). One basis {c₁, c₂, ..., c₆} is

    c₁ = (1, −1, −1, −1,  0,  0,  0, 0,  0, 0,  0,  0)'
    c₂ = (1,  0,  0,  0, −1, −1,  0, 0,  0, 0,  0,  0)'
    c₃ = (0,  1,  0,  0,  0,  0, −1, −1, 0, 0,  0,  0)'
    c₄ = (0,  0,  1,  0,  0,  0,  0, 0, −1, −1, 0,  0)'
    c₅ = (0,  0,  0,  0,  1,  0, −1, 0, −1, 0, −1,  0)'
    c₆ = (1, −1, −1,  0, −1,  0,  1, 0,  1, 0,  0, −1)'.

Functions λ'β must satisfy λ'cᵢ = 0, for each i = 1, 2, ..., 6, to be estimable. It should be obvious that neither the main effect terms nor the interaction terms; i.e., αᵢ, βⱼ, γᵢⱼ, are estimable on their own. The six cell means μ + αᵢ + βⱼ + γᵢⱼ are, but these are not that interesting. No longer are contrasts in the α's or β's estimable. Indeed, interaction makes the analysis more difficult.
3.3 Reparameterization

SETTING: Consider the general linear model

    Model GL: Y = Xβ + ε, where E(ε) = 0.

Assume that X is n × p with rank r ≤ p. Suppose that W is an n × t matrix such that C(W) = C(X). Then, we know that there exist matrices T_{p×t} and S_{p×t} such that W = XT and X = WS'. Note that Xβ = WS'β = Wγ, where γ = S'β. The model

    Model GL-R: Y = Wγ + ε, where E(ε) = 0,

is called a reparameterization of Model GL.

REMARK: Since Xβ = WS'β = Wγ = XTγ, we might suspect that the estimation of an estimable function λ'β under Model GL should be essentially the same as the estimation of λ'Tγ under Model GL-R (and that estimation of an estimable function q'γ under Model GL-R should be essentially the same as estimation of q'S'β under Model GL). The upshot of the following results is that, in determining a least squares estimate of an estimable function λ'β, we can work with either Model GL or Model GL-R. The actual nature of these conjectured relationships is now made precise.

Result 3.5. Consider Models GL and GL-R with C(W) = C(X).
1. P_W = P_X.
2. If γ̂ is any solution to the normal equations W'Wγ = W'Y associated with Model GL-R, then β̂ = Tγ̂ is a solution to the normal equations X'Xβ = X'Y associated with Model GL.
3. If λ'β is estimable under Model GL and if γ̂ is any solution to the normal equations W'Wγ = W'Y associated with Model GL-R, then λ'Tγ̂ is the least squares estimate of λ'β.
4. If q'γ is estimable under Model GL-R; i.e., if q' ∈ R(W), then q'S'β is estimable under Model GL and its least squares estimate is given by q'γ̂, where γ̂ is any solution to the normal equations W'Wγ = W'Y.

Proof.
1. P_W = P_X since perpendicular projection matrices onto C(W) = C(X) are unique.
2. Note that

    X'XTγ̂ = X'Wγ̂ = X'P_W Y = X'P_X Y = X'Y.

Hence, Tγ̂ is a solution to the normal equations X'Xβ = X'Y.
3. This follows from (2), since the least squares estimate is invariant to the choice of the solution to the normal equations.
4. If q' ∈ R(W), then q' = a'W, for some a. Then, q'S' = a'WS' = a'X ∈ R(X), so that q'S'β is estimable under Model GL. From (3), we know the least squares estimate of q'S'β is q'S'Tγ̂. But,

    q'S'Tγ̂ = a'WS'Tγ̂ = a'XTγ̂ = a'Wγ̂ = q'γ̂.

WARNING: The converse to (4) is not true; i.e., q'S'β being estimable under Model GL doesn't necessarily imply that q'γ is estimable under Model GL-R. See Monahan, pp 52.

TERMINOLOGY: Because C(W) = C(X) and r(X) = r, W_{n×t} must have at least r columns. If W has exactly r columns; i.e., if t = r, then the reparameterization of Model GL is called a full rank reparameterization. If, in addition, W'W is diagonal, the reparameterization of Model GL is called an orthogonal reparameterization; see, e.g., the centered linear regression model in Section 2 (notes).

NOTE: A full rank reparameterization always exists; just delete the columns of X that are linearly dependent on the others. In a full rank reparameterization, (W'W)^{-1} exists, so the normal equations W'Wγ = W'Y have a unique solution; i.e., γ̂ = (W'W)^{-1}W'Y.

DISCUSSION: There are two (opposing) points of view concerning the utility of full rank reparameterizations.
- Some argue that, since making inferences about q'γ under the full rank reparameterized model (Model GL-R) is equivalent to making inferences about q'S'β in the possibly-less-than-full-rank original model (Model GL), the inclusion of the possibility that the design matrix has less than full column rank causes a needless complication in linear model theory.
- The opposing argument is that, since the computations required to deal with the reparameterized model are essentially the same as those required to handle the original model, we might as well allow for less-than-full-rank models in the first place.
- I tend to favor the latter point of view; to me, there is no reason not to include less-than-full-rank models as long as you know what you can and can not estimate.
Example 3.3. Consider the one-way fixed effects ANOVA model

    Y_ij = μ + αᵢ + ε_ij,

for i = 1, 2, ..., a and j = 1, 2, ..., n_i, where E(ε_ij) = 0. In matrix form, X and β are

    X_{n×p} = [ 1_{n₁}  1_{n₁}  0_{n₁}  ...  0_{n₁} ]      β_{p×1} = (μ, α₁, α₂, ..., α_a)',
              [ 1_{n₂}  0_{n₂}  1_{n₂}  ...  0_{n₂} ]
              [   ...                              ]
              [ 1_{n_a} 0_{n_a} 0_{n_a} ...  1_{n_a} ]

where p = a + 1 and n = Σᵢ nᵢ. This is not a full rank model since the first column is the sum of the last a columns; i.e., r(X) = a.

Reparameterization 1: Deleting the first column of X, we have

    W_{n×t} = [ 1_{n₁}  0_{n₁}  ...  0_{n₁} ]      γ_{t×1} = (μ + α₁, μ + α₂, ..., μ + α_a)' ≡ (μ₁, μ₂, ..., μ_a)',
              [ 0_{n₂}  1_{n₂}  ...  0_{n₂} ]
              [   ...                      ]
              [ 0_{n_a} 0_{n_a} ...  1_{n_a} ]

where t = a and μᵢ = E(Y_ij) = μ + αᵢ. This is called the cell-means model and is written Y_ij = μᵢ + ε_ij. This is a full rank reparameterization with C(W) = C(X). The least squares estimate of γ is

    γ̂ = (W'W)^{-1}W'Y = ( Ȳ_{1+}, Ȳ_{2+}, ..., Ȳ_{a+} )'.

Exercise: What are the matrices T and S associated with this reparameterization?
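Result 3.5 can be checked numerically for this example. The sketch below is an illustration I have added (hypothetical group sizes, simulated data); it confirms that the cell-means design W and the effects design X span the same column space, so P_W = P_X and the fitted values agree.

```python
import numpy as np

rng = np.random.default_rng(1)
n_i = [3, 4, 2]
n, a = sum(n_i), len(n_i)
y = rng.normal(size=n)

# Cell-means design W = group indicators; effects design X = [1 | W]
W = np.column_stack([np.repeat(np.eye(a)[:, i], n_i) for i in range(a)])
X = np.column_stack([np.ones(n), W])

P = lambda A: A @ np.linalg.pinv(A)      # perpendicular projection onto C(A)
print(np.allclose(P(X), P(W)))           # True: C(W) = C(X), so P_W = P_X

gamma_hat = np.linalg.solve(W.T @ W, W.T @ y)   # (W'W)^{-1} W'Y = group sample means
print(np.allclose(P(X) @ y, W @ gamma_hat))     # fitted values agree under either model
```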
Reparameterization 2: Deleting the last column of X, we have

    W_{n×t} = [ 1_{n₁}      1_{n₁}      0_{n₁}      ...  0_{n₁}      ]      γ_{t×1} = (μ + α_a, α₁ − α_a, α₂ − α_a, ..., α_{a−1} − α_a)',
              [ 1_{n₂}      0_{n₂}      1_{n₂}      ...  0_{n₂}      ]
              [   ...                                                ]
              [ 1_{n_{a−1}} 0_{n_{a−1}} 0_{n_{a−1}} ...  1_{n_{a−1}} ]
              [ 1_{n_a}     0_{n_a}     0_{n_a}     ...  0_{n_a}     ]

where t = a. This is called the cell-reference model (what SAS uses by default). This is a full rank reparameterization with C(W) = C(X). The least squares estimate of γ is

    γ̂ = (W'W)^{-1}W'Y = ( Ȳ_{a+}, Ȳ_{1+} − Ȳ_{a+}, Ȳ_{2+} − Ȳ_{a+}, ..., Ȳ_{(a−1)+} − Ȳ_{a+} )'.

Reparameterization 3: Another reparameterization of the effects model uses

    W_{n×t} = [ 1_{n₁}      1_{n₁}       0_{n₁}      ...  0_{n₁}      ]      γ_{t×1} = (μ + ᾱ, α₁ − ᾱ, α₂ − ᾱ, ..., α_{a−1} − ᾱ)',
              [ 1_{n₂}      0_{n₂}       1_{n₂}      ...  0_{n₂}      ]
              [   ...                                                 ]
              [ 1_{n_{a−1}} 0_{n_{a−1}}  0_{n_{a−1}} ...  1_{n_{a−1}} ]
              [ 1_{n_a}     −1_{n_a}     −1_{n_a}    ...  −1_{n_a}    ]

where t = a and ᾱ = a^{-1}Σᵢ αᵢ. This is called the deviations from the mean model. This is a full rank reparameterization with C(W) = C(X).
Example 3.4. Two part multiple linear regression model. Consider the linear model Y = Xβ + ε, where E(ε) = 0. Suppose that X is full rank. Write X = (X₁ X₂) and β = (β₁', β₂')' so that the model can be written as

    Model GL: Y = X₁β₁ + X₂β₂ + ε.

Now, set W₁ = X₁ and W₂ = (I − P_{X₁})X₂, where P_{X₁} = X₁(X₁'X₁)^{-1}X₁' is the perpendicular projection matrix onto C(X₁). A reparameterized version of Model GL is

    Model GL-R: Y = W₁γ₁ + W₂γ₂ + ε,

where E(ε) = 0, W = (W₁ W₂) and

    γ = ( γ₁ ) = ( β₁ + (X₁'X₁)^{-1}X₁'X₂β₂ )
        ( γ₂ )   (            β₂           ).

With this reparameterization, note that

    W'W = [ X₁'X₁            0           ]      and      W'Y = [ X₁'Y              ]
          [ 0      X₂'(I − P_{X₁})X₂     ]                     [ X₂'(I − P_{X₁})Y  ]

so that

    γ̂ = (W'W)^{-1}W'Y = ( (X₁'X₁)^{-1}X₁'Y                                 ) ≡ ( γ̂₁ )
                        ( {X₂'(I − P_{X₁})X₂}^{-1}X₂'(I − P_{X₁})Y          )   ( γ̂₂ ).

In this reparameterization, W₂ can be thought of as the residual from regressing (each column of) X₂ on X₁. A further calculation shows that

    β̂ = (X'X)^{-1}X'Y = ( β̂₁ ) = ( γ̂₁ − (X₁'X₁)^{-1}X₁'X₂γ̂₂ )
                        ( β̂₂ )   (           γ̂₂             ),

where note that (X₁'X₁)^{-1}X₁'X₂ is the estimate obtained from regressing X₂ on X₁. Furthermore, the estimate β̂₂ = γ̂₂ can be thought of as the estimate obtained from regressing Y on W₂ = (I − P_{X₁})X₂.
APPLICATION: Consider the two part full-rank regression model Y = X₁β₁ + x₂β₂ + ε, where E(ε) = 0. Suppose that X₂ = x₂ is n × 1 and that β₂ is a scalar. Consider two different models:

    Reduced model: Y = X₁β₁ + ε
    Full model:    Y = X₁β₁ + x₂β₂ + ε.

We use the term reduced model since C(X₁) ⊂ C(X₁, x₂). Consider the full model Y = X₁β₁ + x₂β₂ + ε and premultiply by I − P_{X₁} to obtain

    (I − P_{X₁})Y = (I − P_{X₁})X₁β₁ + β₂(I − P_{X₁})x₂ + (I − P_{X₁})ε
                  = β₂(I − P_{X₁})x₂ + ε*,

where ε* = (I − P_{X₁})ε. Now, note that

    (I − P_{X₁})Y = Y − P_{X₁}Y ≡ e_{Y|X₁},

say, are the residuals from regressing Y on X₁. Similarly, (I − P_{X₁})x₂ ≡ e_{x₂|X₁} are the residuals from regressing x₂ on X₁. We thus have the following induced linear model

    e_{Y|X₁} = β₂ e_{x₂|X₁} + ε*,

where E(ε*) = 0. The plot of e_{Y|X₁} versus e_{x₂|X₁} is called an added-variable plot (or partial regression plot). It displays the relationship between Y and x₂, after adjusting for the effects of X₁ being in the model.

- If a linear trend exists in this plot, this suggests that x₂ enters into the (full) model linearly. This plot can also be useful for detecting outliers and high leverage points.
- On the down side, added-variable plots only look at one predictor at a time, so one can not assess multicollinearity; that is, if the predictor x₂ is close to C(X₁), this may not be detected in the plot.
- The slope of the least squares regression line for the added-variable plot is

    β̂₂ = [{(I − P_{X₁})x₂}'(I − P_{X₁})x₂]^{-1}{(I − P_{X₁})x₂}'(I − P_{X₁})Y
       = {x₂'(I − P_{X₁})x₂}^{-1}x₂'(I − P_{X₁})Y.

This is equal to the least squares estimate of β₂ in the full model.
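The equality between the added-variable slope and the full-model estimate of β₂ is easy to verify numerically. Here is a short sketch of my own (simulated data, hypothetical coefficient values) that computes the slope both ways.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one predictor
x2 = rng.normal(size=n)
y = X1 @ np.array([1.0, 2.0]) + 0.5 * x2 + rng.normal(scale=0.3, size=n)

P1 = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T     # perpendicular projection onto C(X1)
e_y = y - P1 @ y                              # residuals of Y on X1
e_x2 = x2 - P1 @ x2                           # residuals of x2 on X1

slope_avp = (e_x2 @ e_y) / (e_x2 @ e_x2)      # slope in the added-variable plot

X_full = np.column_stack([X1, x2])
beta_full = np.linalg.solve(X_full.T @ X_full, X_full.T @ y)

print(np.allclose(slope_avp, beta_full[-1]))  # True: same estimate of beta_2
```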
3.4 Forcing least squares solutions using linear constraints

REVIEW: Consider our general linear model Y = Xβ + ε, where E(ε) = 0, and X is an n × p matrix with rank r. The normal equations are X'Xβ = X'Y.
- If r = p, then a unique least squares solution exists; i.e., β̂ = (X'X)^{-1}X'Y.
- If r < p, then a least squares solution is β̂ = (X'X)^-X'Y. This solution is not unique; its value depends on which generalized inverse (X'X)^- is used.

Example 3.5. Consider the one-way fixed effects ANOVA model Y_ij = μ + αᵢ + ε_ij, for i = 1, 2, ..., a and j = 1, 2, ..., n_i, where E(ε_ij) = 0. The normal equations X'Xβ = X'Y are

    [ n    n₁   n₂   ...  n_a ] [ μ   ]   [ Σ_{ij} Y_ij ]
    [ n₁   n₁   0    ...  0   ] [ α₁  ]   [ Σ_j Y_{1j}  ]
    [ n₂   0    n₂   ...  0   ] [ α₂  ] = [ Σ_j Y_{2j}  ]
    [ ...                     ] [ ... ]   [    ...      ]
    [ n_a  0    0    ...  n_a ] [ α_a ]   [ Σ_j Y_{aj}  ]

or, written another way,

    nμ + Σ_{i=1}^a nᵢαᵢ = Y_{++}
    nᵢμ + nᵢαᵢ = Y_{i+},  i = 1, 2, ..., a,

where Y_{i+} = Σ_j Y_ij, for i = 1, 2, ..., a, and Y_{++} = Σ_{ij} Y_ij. This set of equations has no unique solution. However, from our discussion on generalized inverses (and consideration of this model), we know that
- if we set μ = 0, then we get the solution μ̂ = 0 and α̂ᵢ = Ȳ_{i+}, for i = 1, 2, ..., a.
- if we set Σ_{i=1}^a nᵢαᵢ = 0, then we get the solution μ̂ = Ȳ_{++} and α̂ᵢ = Ȳ_{i+} − Ȳ_{++}, for i = 1, 2, ..., a.
- if we set another nonestimable function equal to 0, we'll get a different solution to the normal equations.

REMARK: Equations like μ = 0 and Σ_{i=1}^a nᵢαᵢ = 0 are used to force a particular solution to the normal equations and are called side conditions. Different side conditions produce different least squares solutions. We know that in the one-way ANOVA model, the parameters μ and αᵢ, for i = 1, 2, ..., a, are not estimable (individually). Imposing side conditions does not change this. My feeling is that when we attach side conditions to force a unique solution, we are doing nothing more than solving a mathematical problem that isn't relevant. After all, estimable functions λ'β have least squares estimates that do not depend on which side condition was used, and these are the only functions we should ever be concerned with.
REMARK: We have seen similar results for the two-way crossed ANOVA model. In general, what and how many conditions should we use to force a particular solution to the normal equations? Mathematically, we are interested in imposing additional linear restrictions of the form Cβ = 0, where the matrix C does not depend on Y.

TERMINOLOGY: We say that the system of equations Ax = q is compatible if

    c'A = 0 ⟹ c'q = 0;  i.e.,  c ∈ N(A') ⟹ c'q = 0.

Result 3.6. The system Ax = q is consistent if and only if it is compatible.
Proof. If Ax = q is consistent, then Ax* = q, for some x*. Hence, for any c such that c'A = 0, we have c'q = c'Ax* = 0, so Ax = q is compatible. If Ax = q is compatible, then for any c ∈ N(A') = C(I − P_A), we have 0 = c'q = q'c = q'(I − P_A)z, for all z. Successively taking z to be the standard unit vectors, we have q'(I − P_A) = 0 ⟹ (I − P_A)q = 0 ⟹ q = A(A'A)^-A'q ⟹ q = Ax*, where x* = (A'A)^-A'q. Thus, Ax = q is consistent.

AUGMENTED NORMAL EQUATIONS: We now consider adjoining the set of equations Cβ = 0 to the normal equations; that is, we consider the new set of equations

    ( X'X ) β = ( X'Y )
    (  C  )     (  0  ).

These are called the augmented normal equations. When we add the constraint Cβ = 0, we want these equations to be consistent for all Y. We now would like to find a sufficient condition for consistency. Suppose that w ∈ R(X'X) ∩ R(C). Note that

    w ∈ R(X'X) ⟹ w = X'Xv₁, for some v₁
    w ∈ R(C)   ⟹ w = C'v₂, for some v₂.

Thus,

    0 = w − w = X'Xv₁ − C'v₂  ⟹  0 = v₁'X'X − v₂'C  ⟹  0 = (v₁'  −v₂') ( X'X ) = v' ( X'X )
                                                                       (  C  )      (  C  ),

where v' = (v₁', −v₂'). We want C chosen so that

    ( X'X ) β = ( X'Y )
    (  C  )     (  0  )

is consistent, or equivalently from Result 3.6, is compatible. Compatibility occurs when

    v' ( X'X ) = 0  ⟹  v' ( X'Y ) = 0.
       (  C  )            (  0  )

Thus, we need v₁'X'Y = 0, for all Y. Successively taking Y to be the standard unit vectors, for i = 1, 2, ..., n, convinces us that v₁'X' = 0 ⟺ Xv₁ = 0 ⟺ X'Xv₁ = 0 ⟹ w = 0. Thus, the augmented normal equations are consistent when R(X'X) ∩ R(C) = {0}. Since R(X'X) = R(X), a sufficient condition for consistency is R(X) ∩ R(C) = {0}. Now, consider the parametric function δ'Cβ, for some δ. We know that δ'Cβ is estimable if and only if δ'C ∈ R(X). However, clearly δ'C ∈ R(C). Thus, δ'Cβ is estimable if and only if δ'C = 0. In other words, writing

    C = ( c₁' )
        ( c₂' )
        ( ... )
        ( c_s' ),

the set of functions {c₁'β, c₂'β, ..., c_s'β} is jointly nonestimable. Therefore, we can set a collection of jointly nonestimable functions equal to zero and augment the normal equations so that they remain consistent. We get a unique solution if

    r ( X'X ) = p.
      (  C  )

Because R(X'X) ∩ R(C) = {0},

    p = r ( X'X ) = r(X'X) + r(C) = r + r(C),
          (  C  )

showing that we need r(C) = s = p − r.

SUMMARY: To augment the normal equations, we can find a set of s jointly nonestimable functions {c₁'β, c₂'β, ..., c_s'β} with

    r(C) = r ( c₁' ) = s.
             ( c₂' )
             ( ... )
             ( c_s' )

Then,

    ( X'X ) β = ( X'Y )
    (  C  )     (  0  )

is consistent and has a unique solution.
Example 3.5 (continued). Consider the one-way fixed effects ANOVA model

    Y_ij = μ + αᵢ + ε_ij,

for i = 1, 2, ..., a and j = 1, 2, ..., n_i, where E(ε_ij) = 0. The normal equations X'Xβ = X'Y are

    [ n    n₁   n₂   ...  n_a ] [ μ   ]   [ Σ_{ij} Y_ij ]
    [ n₁   n₁   0    ...  0   ] [ α₁  ]   [ Σ_j Y_{1j}  ]
    [ n₂   0    n₂   ...  0   ] [ α₂  ] = [ Σ_j Y_{2j}  ]
    [ ...                     ] [ ... ]   [    ...      ]
    [ n_a  0    0    ...  n_a ] [ α_a ]   [ Σ_j Y_{aj}  ].

We know that r(X) = r = a < p (this system can not be solved uniquely) and that s = p − r = (a + 1) − a = 1. Thus, to augment the normal equations, we need to find s = 1 (jointly) nonestimable function. Take c₁' = (1, 0, 0, ..., 0), which produces

    c₁'β = (1 0 0 ... 0)(μ, α₁, α₂, ..., α_a)' = μ.

For this choice of c₁, the augmented normal equations are

    ( X'X ) β = ( X'Y ),  that is,   [ n    n₁   n₂   ...  n_a ] [ μ   ]   [ Σ_{ij} Y_ij ]
    ( c₁' )     (  0  )              [ n₁   n₁   0    ...  0   ] [ α₁  ]   [ Σ_j Y_{1j}  ]
                                     [ n₂   0    n₂   ...  0   ] [ α₂  ] = [ Σ_j Y_{2j}  ]
                                     [ ...                     ] [ ... ]   [    ...      ]
                                     [ n_a  0    0    ...  n_a ] [ α_a ]   [ Σ_j Y_{aj}  ]
                                     [ 1    0    0    ...  0   ]           [     0       ].

Solving this (now full rank) system produces the unique solution

    μ̂ = 0
    α̂ᵢ = Ȳ_{i+},  i = 1, 2, ..., a.

You'll note that this choice of c₁ used to augment the normal equations corresponds to specifying the side condition μ = 0.

Exercise. Redo this example using (a) the side condition Σ_{i=1}^a nᵢαᵢ = 0, (b) the side condition α_a = 0 (what SAS does), and (c) using another side condition.
Example 3.6. Consider the two-way fixed effects (crossed) ANOVA model

    Y_ij = μ + αᵢ + βⱼ + ε_ij,

for i = 1, 2, ..., a and j = 1, 2, ..., b, where E(ε_ij) = 0. For purposes of illustration, let's take a = b = 3, so that n = ab = 9 and p = a + b + 1 = 7. In matrix form, X and β are

    X_{9×7} = [ 1 1 0 0 1 0 0 ]      β_{7×1} = (μ, α₁, α₂, α₃, β₁, β₂, β₃)'.
              [ 1 1 0 0 0 1 0 ]
              [ 1 1 0 0 0 0 1 ]
              [ 1 0 1 0 1 0 0 ]
              [ 1 0 1 0 0 1 0 ]
              [ 1 0 1 0 0 0 1 ]
              [ 1 0 0 1 1 0 0 ]
              [ 1 0 0 1 0 1 0 ]
              [ 1 0 0 1 0 0 1 ]

We see that r(X) = r = 5 so that s = p − r = 7 − 5 = 2. The normal equations X'Xβ = X'Y are

    [ 9 3 3 3 3 3 3 ] [ μ  ]   [ Σ_{ij} Y_ij ]
    [ 3 3 0 0 1 1 1 ] [ α₁ ]   [ Σ_j Y_{1j}  ]
    [ 3 0 3 0 1 1 1 ] [ α₂ ]   [ Σ_j Y_{2j}  ]
    [ 3 0 0 3 1 1 1 ] [ α₃ ] = [ Σ_j Y_{3j}  ]
    [ 3 1 1 1 3 0 0 ] [ β₁ ]   [ Σ_i Y_{i1}  ]
    [ 3 1 1 1 0 3 0 ] [ β₂ ]   [ Σ_i Y_{i2}  ]
    [ 3 1 1 1 0 0 3 ] [ β₃ ]   [ Σ_i Y_{i3}  ].

This system does not have a unique solution. To augment the normal equations, we will need a set of s = 2 linearly independent jointly nonestimable functions. From Section 3.2.2, one example of such a set is {Σᵢ αᵢ, Σⱼ βⱼ}. For this choice, our matrix C is

    C = ( c₁' ) = [ 0 1 1 1 0 0 0 ]
        ( c₂' )   [ 0 0 0 0 1 1 1 ].

Thus, the augmented normal equations become

    ( X'X ) β = ( X'Y ),  that is,   [ 9 3 3 3 3 3 3 ] [ μ  ]   [ Σ_{ij} Y_ij ]
    (  C  )     (  0  )              [ 3 3 0 0 1 1 1 ] [ α₁ ]   [ Σ_j Y_{1j}  ]
                                     [ 3 0 3 0 1 1 1 ] [ α₂ ]   [ Σ_j Y_{2j}  ]
                                     [ 3 0 0 3 1 1 1 ] [ α₃ ] = [ Σ_j Y_{3j}  ]
                                     [ 3 1 1 1 3 0 0 ] [ β₁ ]   [ Σ_i Y_{i1}  ]
                                     [ 3 1 1 1 0 3 0 ] [ β₂ ]   [ Σ_i Y_{i2}  ]
                                     [ 3 1 1 1 0 0 3 ] [ β₃ ]   [ Σ_i Y_{i3}  ]
                                     [ 0 1 1 1 0 0 0 ]          [     0       ]
                                     [ 0 0 0 0 1 1 1 ]          [     0       ].

Solving this system produces the estimates of μ, αᵢ, and βⱼ under the side conditions Σᵢ αᵢ = Σⱼ βⱼ = 0. These estimates are

    μ̂ = Ȳ_{++}
    α̂ᵢ = Ȳ_{i+} − Ȳ_{++},  i = 1, 2, 3
    β̂ⱼ = Ȳ_{+j} − Ȳ_{++},  j = 1, 2, 3.

Exercise. Redo this example using (a) the side conditions α_a = 0 and β_b = 0 (what SAS does) and (b) using another set of side conditions.
QUESTION: In general, can we give a mathematical form for the particular solution? Note that we are now solving

    ( X'X ) β = ( X'Y ),
    (  C  )     (  0  )

which is equivalent to

    ( X'X  ) β = ( X'Y )
    ( C'C  )     (  0  )

since Cβ = 0 iff C'Cβ = 0. Thus, any solution to this system must also satisfy

    (X'X + C'C)β = X'Y.

But,

    r(X'X + C'C) = r( (X'  C') ( X ) ) = r ( X ) = p,
                               ( C )       ( C )

that is, X'X + C'C is nonsingular. Hence, the unique solution to the augmented normal equations must be

    β̂ = (X'X + C'C)^{-1}X'Y.

So, imposing s = p − r conditions Cβ = 0, where the elements of Cβ are jointly nonestimable, yields a particular solution to the normal equations. Finally, note that by Result 2.5 (notes),

    Xβ̂ = X(X'X + C'C)^{-1}X'Y = P_X Y,

which shows that

    P_X = X(X'X + C'C)^{-1}X'

is the perpendicular projection matrix onto C(X). This shows that (X'X + C'C)^{-1} is a (non-singular) generalized inverse of X'X.
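The closed form β̂ = (X'X + C'C)^{-1}X'Y is easy to verify numerically. The sketch below is my own illustration (a one-way layout with hypothetical group sizes and the side condition μ = 0); it checks that this β̂ solves the normal equations and that X(X'X + C'C)^{-1}X' reproduces P_X.

```python
import numpy as np

rng = np.random.default_rng(3)
n_i = [3, 5, 4]
n, a = sum(n_i), len(n_i)
y = rng.normal(size=n)
X = np.column_stack([np.ones(n)] +
                    [np.repeat(np.eye(a)[:, i], n_i) for i in range(a)])

C = np.array([[1.0, 0, 0, 0]])            # side condition mu = 0 (s = 1 row)
M = X.T @ X + C.T @ C                     # nonsingular when R(X) and R(C) meet only at 0
beta_hat = np.linalg.solve(M, X.T @ y)

# beta_hat solves the normal equations ...
print(np.allclose(X.T @ X @ beta_hat, X.T @ y))        # True
# ... with mu-hat = 0 and alpha_i-hat = group means, matching the side condition
print(np.round(beta_hat, 3))
# X(X'X + C'C)^{-1}X' is the perpendicular projection matrix onto C(X)
P_X = X @ np.linalg.pinv(X)
print(np.allclose(X @ np.linalg.inv(M) @ X.T, P_X))    # True
```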
4 The Gauss-Markov Model

Complementary reading from Monahan: Chapter 4 (except Section 4.7).

4.1 Introduction

REVIEW: Consider the general linear model Y = Xβ + ε, where E(ε) = 0.
- A linear estimator t(Y) = c + a'Y is said to be unbiased for λ'β if and only if E{t(Y)} = λ'β, for all β. We have seen this implies that c = 0 and λ' ∈ R(X); i.e., λ'β is estimable.
- When λ'β is estimable, it is possible to find several estimators that are unbiased for λ'β. For example, in the one-way (fixed effects) ANOVA model Y_ij = μ + αᵢ + ε_ij, with E(ε_ij) = 0, Y_11, (Y_11 + Y_12)/2, and Ȳ_{1+} are each unbiased estimators of λ'β = μ + α₁ (there are others too).
- If λ'β is estimable, then the (ordinary) least squares estimator λ'β̂, where β̂ is any solution to the normal equations X'Xβ = X'Y, is unbiased for λ'β. To see this, recall that

    λ'β̂ = λ'(X'X)^-X'Y = a'P_X Y,

where λ' = a'X, for some a, that is, λ' ∈ R(X), and P_X is the perpendicular projection matrix onto C(X). Thus,

    E(λ'β̂) = E(a'P_X Y) = a'P_X E(Y) = a'P_X Xβ = a'Xβ = λ'β.

GOAL: Among all linear unbiased estimators for λ'β, we want to find the best linear unbiased estimator in the sense that it has the smallest variance. We will show that the least squares estimator λ'β̂ is the best linear unbiased estimator (BLUE) of λ'β, provided that λ'β is estimable and cov(ε) = σ²I.
4.2 The Gauss-Markov Theorem

Result 4.1. Consider the Gauss-Markov model Y = Xβ + ε, where E(ε) = 0 and cov(ε) = σ²I. Suppose that λ'β is estimable and let β̂ denote any solution to the normal equations X'Xβ = X'Y. The (ordinary) least squares estimator λ'β̂ is the best linear unbiased estimator (BLUE) of λ'β; that is, the variance of λ'β̂ is uniformly less than that of any other linear unbiased estimator of λ'β.
Proof. Suppose that θ̂ = c + a'Y is another linear unbiased estimator of λ'β. From Result 3.1, we know that c = 0 and that λ' = a'X. Thus, θ̂ = a'Y, where λ' = a'X. Now, write θ̂ = λ'β̂ + (θ̂ − λ'β̂). Note that

    var(θ̂) = var[λ'β̂ + (θ̂ − λ'β̂)]
           = var(λ'β̂) + var(θ̂ − λ'β̂) + 2cov(λ'β̂, θ̂ − λ'β̂),

where var(θ̂ − λ'β̂) ≥ 0. We now show that cov(λ'β̂, θ̂ − λ'β̂) = 0. Recalling that λ'β̂ = a'P_X Y, we have

    cov(λ'β̂, θ̂ − λ'β̂) = cov(a'P_X Y, a'Y − a'P_X Y)
                       = cov[a'P_X Y, a'(I − P_X)Y]
                       = a'P_X cov(Y, Y)[a'(I − P_X)]'
                       = σ² a'P_X(I − P_X)a = 0,

since P_X(I − P_X) = 0. Thus, var(θ̂) ≥ var(λ'β̂), showing that λ'β̂ has variance no larger than that of θ̂. Equality results when var(θ̂ − λ'β̂) = 0. However, if var(θ̂ − λ'β̂) = 0, then because E(θ̂ − λ'β̂) = 0 as well, θ̂ − λ'β̂ is a degenerate random variable at 0; i.e., pr(θ̂ = λ'β̂) = 1. This establishes uniqueness.

MULTIVARIATE CASE: Suppose that we wish to estimate simultaneously k estimable linear functions

    Λ'β = (λ₁'β, λ₂'β, ..., λ_k'β)',

where λᵢ' = aᵢ'X, for some aᵢ; i.e., λᵢ' ∈ R(X), for i = 1, 2, ..., k. We say that Λ'β is estimable if and only if λᵢ'β, i = 1, 2, ..., k, are each estimable. Put another way, Λ'β is estimable if and only if Λ' = A'X, for some matrix A.

Result 4.2. Consider the Gauss-Markov model Y = Xβ + ε, where E(ε) = 0 and cov(ε) = σ²I. Suppose that Λ'β is any k-dimensional estimable vector and that c + A'Y is any vector of linear unbiased estimators of the elements of Λ'β. Let β̂ denote any solution to the normal equations. Then, the matrix cov(c + A'Y) − cov(Λ'β̂) is nonnegative definite.
Proof. It suffices to show that x'[cov(c + A'Y) − cov(Λ'β̂)]x ≥ 0, for all x. Note that

    x'[cov(c + A'Y) − cov(Λ'β̂)]x = x'cov(c + A'Y)x − x'cov(Λ'β̂)x
                                  = var(x'c + x'A'Y) − var(x'Λ'β̂).

But x'Λ'β = x'A'Xβ (a scalar) is estimable since x'A'X ∈ R(X). Also, x'c + x'A'Y = x'(c + A'Y) is a linear unbiased estimator of x'Λ'β. The least squares estimator of x'Λ'β is x'Λ'β̂. Thus, by Result 4.1, var(x'c + x'A'Y) − var(x'Λ'β̂) ≥ 0.
OBSERVATION: Consider the Gauss-Markov linear model Y = Xβ + ε, where E(ε) = 0 and cov(ε) = σ²I. If X is full rank, then X'X is nonsingular and every linear combination of β is estimable. The (ordinary) least squares estimator of β is β̂ = (X'X)^{-1}X'Y. It is unbiased and

    cov(β̂) = cov[(X'X)^{-1}X'Y] = (X'X)^{-1}X'cov(Y)[(X'X)^{-1}X']'
            = (X'X)^{-1}X'σ²IX(X'X)^{-1} = σ²(X'X)^{-1}.

Note that this is not correct if X is less than full rank.

Example 4.1. Recall the simple linear regression model

    Y_i = β₀ + β₁x_i + ε_i,

for i = 1, 2, ..., n, where ε₁, ε₂, ..., ε_n are uncorrelated random variables with mean 0 and common variance σ² > 0 (these are the Gauss-Markov assumptions). Recall that, in matrix notation,

    Y = (Y₁, Y₂, ..., Y_n)',   X = [ 1  x₁ ],   β = ( β₀ ),   ε = (ε₁, ε₂, ..., ε_n)'.
                                   [ 1  x₂ ]        ( β₁ )
                                   [ ...   ]
                                   [ 1  x_n ]

The least squares estimator of β is

    β̂ = (X'X)^{-1}X'Y = ( β̂₀ ) = ( Ȳ − β̂₁x̄                              )
                        ( β̂₁ )   ( Σᵢ(xᵢ − x̄)(Yᵢ − Ȳ) / Σᵢ(xᵢ − x̄)²      ).

The covariance matrix of β̂ is

    cov(β̂) = σ²(X'X)^{-1} = σ² [ 1/n + x̄²/Σᵢ(xᵢ − x̄)²     −x̄/Σᵢ(xᵢ − x̄)²  ]
                               [ −x̄/Σᵢ(xᵢ − x̄)²            1/Σᵢ(xᵢ − x̄)²  ].
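For concreteness, here is a small sketch of my own (simulated data, hypothetical true coefficients) that computes (X'X)^{-1} directly and checks it against the closed-form entries displayed above; cov(β̂) is then σ² times this matrix.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

Sxx = np.sum((x - x.mean()) ** 2)
# Closed-form entries of (X'X)^{-1} for simple linear regression
closed = np.array([[1 / n + x.mean() ** 2 / Sxx, -x.mean() / Sxx],
                   [-x.mean() / Sxx,              1 / Sxx]])
print(np.allclose(XtX_inv, closed))   # True: cov(beta-hat) = sigma^2 * this matrix
```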
4.3 Estimation of σ² in the GM model

REVIEW: Consider the Gauss-Markov model Y = Xβ + ε, where E(ε) = 0 and cov(ε) = σ²I. The best linear unbiased estimator (BLUE) for any estimable function λ'β is λ'β̂, where β̂ is any solution to the normal equations. Clearly, E(Y) = Xβ is estimable and the BLUE of E(Y) is

    Xβ̂ = X(X'X)^-X'Y = P_X Y = Ŷ,

the perpendicular projection of Y onto C(X); that is, the fitted values from the least squares fit. The residuals are given by

    e = Y − Ŷ = Y − P_X Y = (I − P_X)Y,

the perpendicular projection of Y onto N(X'). Recall that the residual sum of squares is

    Q(β̂) = (Y − Xβ̂)'(Y − Xβ̂) = e'e = Y'(I − P_X)Y.

We now turn our attention to estimating σ².

Result 4.3. Suppose that Z is a random vector with mean E(Z) = μ and covariance matrix cov(Z) = Σ. Let A be nonrandom. Then

    E(Z'AZ) = μ'Aμ + tr(AΣ).

Proof. Note that Z'AZ is a scalar random variable; hence, Z'AZ = tr(Z'AZ). Also, recall that expectation E(·) and tr(·) are linear operators. Finally, recall that tr(AB) = tr(BA) for conformable A and B. Now,

    E(Z'AZ) = E[tr(Z'AZ)] = E[tr(AZZ')]
            = tr[AE(ZZ')]
            = tr[A(Σ + μμ')]
            = tr(AΣ) + tr(Aμμ')
            = tr(AΣ) + tr(μ'Aμ) = tr(AΣ) + μ'Aμ.

REMARK: Finding var(Z'AZ) is more difficult; see Section 4.9 in Monahan. Considerable simplification results when Z follows a multivariate normal distribution.

APPLICATION: We now find an unbiased estimator of σ² under the GM model. Suppose that Y is n × 1 and X is n × p with rank r ≤ p. Note that E(Y) = Xβ. Applying Result 4.3 directly with A = I − P_X, we have

    E[Y'(I − P_X)Y] = (Xβ)'(I − P_X)Xβ + tr[(I − P_X)σ²I]
                    = 0 + σ²[tr(I) − tr(P_X)]
                    = σ²[n − r(P_X)] = σ²(n − r),

since (I − P_X)X = 0. Thus,

    σ̂² = (n − r)^{-1}Y'(I − P_X)Y

is an unbiased estimator of σ² in the GM model. In non-matrix notation,

    σ̂² = (n − r)^{-1} Σ_{i=1}^n (Yᵢ − Ŷᵢ)²,

where Ŷᵢ is the least squares fitted value of Yᵢ.
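A quick Monte Carlo check of the unbiasedness argument (my own sketch, with an assumed design and true σ² = 2): simulate many Gauss-Markov data sets, compute σ̂² = Y'(I − P_X)Y/(n − r) each time, and compare the average to σ².

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma2 = 40, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
r = np.linalg.matrix_rank(X)
P_X = X @ np.linalg.pinv(X)              # perpendicular projection onto C(X)
beta = np.array([1.0, -1.0, 0.5])

est = []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    est.append(y @ (np.eye(n) - P_X) @ y / (n - r))

print(np.mean(est))   # close to sigma2 = 2.0, up to Monte Carlo error
```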
ANOVA: Consider the Gauss-Markov model Y = Xβ + ε, where E(ε) = 0 and cov(ε) = σ²I. Suppose that Y is n × 1 and X is n × p with rank r ≤ p. Recall from Chapter 2 the basic form of an ANOVA table (with corrected sums of squares):

    Source              df      SS                       MS                 F
    Model (Corrected)   r − 1   SSR = Y'(P_X − P_1)Y     MSR = SSR/(r−1)    F = MSR/MSE
    Residual            n − r   SSE = Y'(I − P_X)Y       MSE = SSE/(n−r)
    Total (Corrected)   n − 1   SST = Y'(I − P_1)Y

NOTES:
- The degrees of freedom associated with each SS is the rank of its appropriate perpendicular projection matrix; that is, r(P_X − P_1) = r − 1 and r(I − P_X) = n − r.
- Note that

    cov(Ŷ, e) = cov[P_X Y, (I − P_X)Y] = P_X σ²I(I − P_X) = 0.

That is, the least squares fitted values are uncorrelated with the residuals.
- We have just shown that E(MSE) = σ². If Xβ ∉ C(1), that is, the independent variables in X add to the model (beyond an intercept, for example), then

    E(SSR) = E[Y'(P_X − P_1)Y] = (Xβ)'(P_X − P_1)Xβ + tr[(P_X − P_1)σ²I]
           = (Xβ)'(P_X − P_1)Xβ + σ²r(P_X − P_1)
           = (Xβ)'(P_X − P_1)Xβ + (r − 1)σ².

Thus,

    E(MSR) = (r − 1)^{-1}E(SSR) = σ² + (r − 1)^{-1}(Xβ)'(P_X − P_1)Xβ.

- If Xβ ∈ C(1), that is, the independent variables in X add nothing to the model, then (Xβ)'(P_X − P_1)Xβ = 0 and MSR and MSE are both unbiased estimators of σ². If this is true, F should be close to 1. Large values of F occur when (Xβ)'(P_X − P_1)Xβ is large, that is, when Xβ is far away from C(1), that is, when the independent variables in X are more relevant in explaining E(Y).
4.4 Implications of model selection

REMARK: Consider the linear model Y = Xβ + ε, where Y is n × 1 and X is n × p with rank r ≤ p. We now investigate two issues: underfitting (model misspecification) and overfitting. We assume that E(ε) = 0 and cov(ε) = σ²I, that is, our usual Gauss-Markov assumptions.

4.4.1 Underfitting (Misspecification)

UNDERFITTING: Suppose that, in truth, the correct model for Y is

    Y = Xβ + Wδ + ε,

where E(ε) = 0 and cov(ε) = σ²I. The vector η = Wδ includes the variables and coefficients missing from X. If the analyst uses Y = Xβ + ε to describe the data, s/he is missing important variables that are in W; that is, the analyst is misspecifying the true model by underfitting. We now examine the effect of underfitting on (a) least squares estimates of estimable functions and (b) the estimate of the error variance σ².

CONSEQUENCES: Suppose that λ'β is estimable under Y = Xβ + ε, where E(ε) = 0 and cov(ε) = σ²I; i.e., λ' = a'X, for some vector a. The least squares estimator of λ'β is given by λ'β̂ = λ'(X'X)^-X'Y. If Wδ = 0, then Y = Xβ + ε is the correct model and E(λ'β̂) = λ'β. If Wδ ≠ 0, then, under the correct model,

    E(λ'β̂) = E[λ'(X'X)^-X'Y] = λ'(X'X)^-X'E(Y)
            = λ'(X'X)^-X'(Xβ + Wδ)
            = a'X(X'X)^-X'Xβ + a'X(X'X)^-X'Wδ
            = a'P_X Xβ + a'P_X Wδ
            = λ'β + a'P_X Wδ,

showing that λ'β̂ is no longer unbiased, in general. The amount of the bias depends on where Wδ is. If η = Wδ is orthogonal to C(X), then P_X Wδ = 0 and the estimation of λ'β with λ'β̂ is unaffected. Otherwise, P_X Wδ ≠ 0 and the estimate of λ'β is biased.

CONSEQUENCES: Now, let's turn to the estimation of σ². Under the correct model,

    E[Y'(I − P_X)Y] = (Xβ + Wδ)'(I − P_X)(Xβ + Wδ) + tr[(I − P_X)σ²I]
                    = (Wδ)'(I − P_X)Wδ + σ²(n − r),

where r = r(X). Thus,

    E(MSE) = σ² + (n − r)^{-1}(Wδ)'(I − P_X)Wδ,

that is, σ̂² = MSE is unbiased if and only if Wδ ∈ C(X).

4.4.2 Overfitting

OVERFITTING: Suppose that, in truth, the correct model for Y is

    Y = X₁β₁ + ε,

where E(ε) = 0 and cov(ε) = σ²I, but, instead, we fit

    Y = X₁β₁ + X₂β₂ + ε,

that is, the extra variables in X₂ are not needed; i.e., β₂ = 0. Set X = [X₁ X₂] and suppose that X and X₁ have full column rank (i.e., a regression setting). The least squares estimator of β₁ under the true model is β̃₁ = (X₁'X₁)^{-1}X₁'Y. We know that

    E(β̃₁) = β₁   and   cov(β̃₁) = σ²(X₁'X₁)^{-1}.

On the other hand, the normal equations associated with the larger (unnecessarily large) model are

    X'Xβ = X'Y  ⟺  [ X₁'X₁  X₁'X₂ ] ( β₁ ) = ( X₁'Y )
                   [ X₂'X₁  X₂'X₂ ] ( β₂ )   ( X₂'Y )

and the least squares estimator of β is

    β̂ = ( β̂₁ ) = [ X₁'X₁  X₁'X₂ ]^{-1} ( X₁'Y )
        ( β̂₂ )   [ X₂'X₁  X₂'X₂ ]      ( X₂'Y ).

The least squares estimator β̂ is still unbiased in the unnecessarily large model, that is,

    E(β̂₁) = β₁   and   E(β̂₂) = 0.

Thus, we assess the impact of overfitting by looking at cov(β̂₁). Under the unnecessarily large model,

    cov(β̂₁) = σ²(X₁'X₁)^{-1} + σ²(X₁'X₁)^{-1}X₁'X₂[X₂'(I − P_{X₁})X₂]^{-1}X₂'X₁(X₁'X₁)^{-1};

see Exercise A.72 (pp 268) in Monahan. Thus,

    cov(β̂₁) − cov(β̃₁) = σ²(X₁'X₁)^{-1}X₁'X₂[X₂'(I − P_{X₁})X₂]^{-1}X₂'X₁(X₁'X₁)^{-1}.

- If the columns of X₂ are each orthogonal to C(X₁), then X₁'X₂ = 0 and

    X'X = [ X₁'X₁  X₁'X₂ ] = [ X₁'X₁    0    ]
          [ X₂'X₁  X₂'X₂ ]   [   0    X₂'X₂  ];

i.e., X'X is block diagonal, and β̂₁ = (X₁'X₁)^{-1}X₁'Y = β̃₁. This would mean that using the unnecessarily large model has no effect on our estimate of β₁. However, the precision with which we can estimate σ² is affected since r(I − P_X) < r(I − P_{X₁}); that is, we have fewer residual degrees of freedom.
- If the columns of X₂ are not all orthogonal to C(X₁), then cov(β̂₁) ≠ σ²(X₁'X₁)^{-1}. Furthermore, as X₂ gets closer to C(X₁), X₂'(I − P_{X₁})X₂ gets smaller. This makes [X₂'(I − P_{X₁})X₂]^{-1} larger, which makes cov(β̂₁) larger.
- Multicollinearity occurs when X₂ is close to C(X₁). Severe multicollinearity can greatly inflate the variances of the least squares estimates. In turn, this can have a deleterious effect on inference (e.g., confidence intervals too wide, hypothesis tests with no power, predicted values with little precision, etc.). Various diagnostic measures exist to assess multicollinearity (e.g., VIFs, condition numbers, etc.); see the discussion in Monahan, pp 80-82.
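The variance inflation described above can be seen directly in σ²(X'X)^{-1}. The sketch below is my own illustration (simulated predictors, assumed σ² = 1); it compares the variance of the coefficient on x₁ when x₂ is essentially unrelated to x₁ versus when x₂ is nearly in C(X₁).

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma2 = 100, 1.0
x1 = rng.normal(size=n)
X1 = np.column_stack([np.ones(n), x1])

def slope_var(x2):
    # Variance of the coefficient on x1 in the model [1, x1, x2],
    # read off the diagonal of sigma^2 (X'X)^{-1}
    X = np.column_stack([X1, x2])
    return sigma2 * np.linalg.inv(X.T @ X)[1, 1]

x2_orth = rng.normal(size=n)                   # essentially unrelated to x1
x2_near = x1 + rng.normal(scale=0.05, size=n)  # nearly collinear with x1

print(slope_var(x2_orth))   # small
print(slope_var(x2_near))   # much larger: multicollinearity inflates the variance
```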
4.5 The Aitken model and generalized least squares

TERMINOLOGY: The general linear model

    Y = Xβ + ε,

where E(ε) = 0 and cov(ε) = σ²V, V known, is called the Aitken model. It is more flexible than the Gauss-Markov (GM) model, because the analyst can incorporate different correlation structures with the observed responses. The GM model is a special case of the Aitken model with V = I; that is, where responses are uncorrelated.

REMARK: In practice, V is rarely known; rather, it must be estimated. We will discuss this later. We assume that V is positive definite (pd), and, hence, nonsingular, for reasons that will soon be obvious. Generalizations are possible; see Christensen (Chapter 10).

RECALL: Because V is symmetric, we can write V in its Spectral Decomposition; i.e.,

    V = QDQ',

where Q is orthogonal and D is the diagonal matrix consisting of λ₁, λ₂, ..., λ_n, the eigenvalues of V. Because V is pd, we know that λᵢ > 0, for each i = 1, 2, ..., n. The symmetric square root of V is

    V^{1/2} = QD^{1/2}Q',

where D^{1/2} = diag(√λ₁, √λ₂, ..., √λ_n). Note that V^{1/2}V^{1/2} = V and that V^{-1} = V^{-1/2}V^{-1/2}, where

    V^{-1/2} = QD^{-1/2}Q'   and   D^{-1/2} = diag(1/√λ₁, 1/√λ₂, ..., 1/√λ_n).

TRANSFORMATION: Consider the Aitken model Y = Xβ + ε, where E(ε) = 0 and cov(ε) = σ²V, where V is known. Premultiplying by V^{-1/2}, we get

    V^{-1/2}Y = V^{-1/2}Xβ + V^{-1/2}ε,  i.e.,  Y* = Uβ + ε*,

where Y* = V^{-1/2}Y, U = V^{-1/2}X, and ε* = V^{-1/2}ε. It is easy to show that Y* = Uβ + ε* is now a GM model. To see this, note that

    E(ε*) = V^{-1/2}E(ε) = V^{-1/2}0 = 0

and

    cov(ε*) = V^{-1/2}cov(ε)V^{-1/2} = V^{-1/2}σ²VV^{-1/2} = σ²I.

Note also that R(X) = R(U), because V^{-1/2} is nonsingular. This means that λ'β is estimable in the Aitken model if and only if λ'β is estimable in the transformed GM model. The covariance structure on ε does not affect estimability.

AITKEN EQUATIONS: In the transformed model, the normal equations are U'Uβ = U'Y*. However, note that

    U'Uβ = U'Y*  ⟺  (V^{-1/2}X)'V^{-1/2}Xβ = (V^{-1/2}X)'V^{-1/2}Y  ⟺  X'V^{-1}Xβ = X'V^{-1}Y

in the Aitken model. The equations

    X'V^{-1}Xβ = X'V^{-1}Y

are called the Aitken equations. These should be compared with the normal equations

    X'Xβ = X'Y

in the GM model. In general, we will denote by β̂_GLS and β̂_OLS the solutions to the Aitken and normal equations, respectively. GLS stands for generalized least squares. OLS stands for ordinary least squares.

GENERALIZED LEAST SQUARES: Any solution β̂_GLS to the Aitken equations is called a generalized least squares (GLS) estimator of β. It is not necessarily unique (unless X is full rank). The solution β̂_GLS minimizes

    Q*(β) = (Y* − Uβ)'(Y* − Uβ) = (Y − Xβ)'V^{-1}(Y − Xβ).

When X is full rank, the unique GLS estimator is

    β̂_GLS = (X'V^{-1}X)^{-1}X'V^{-1}Y.

When X is not full rank, a GLS estimator is

    β̂_GLS = (X'V^{-1}X)^-X'V^{-1}Y.

NOTE: If V is diagonal; i.e., V = diag(v₁, v₂, ..., v_n), then

    Q*(β) = (Y − Xβ)'V^{-1}(Y − Xβ) = Σ_{i=1}^n wᵢ(Yᵢ − xᵢ'β)²,

where wᵢ = 1/vᵢ and xᵢ' is the ith row of X. In this situation, β̂_GLS is called the weighted least squares estimator.

Result 4.4. Consider the Aitken model Y = Xβ + ε, where E(ε) = 0 and cov(ε) = σ²V, where V is known. If λ'β is estimable, then λ'β̂_GLS is the BLUE for λ'β.
Proof. Applying the Gauss-Markov Theorem to the transformed model Y* = Uβ + ε*, the GLS estimator λ'β̂_GLS is the BLUE of λ'β among all linear unbiased estimators involving Y* = V^{-1/2}Y. However, any linear estimator in Y can be obtained from Y* because V^{-1/2} is invertible. Thus, λ'β̂_GLS is the BLUE.

REMARK: If X is full rank, then estimability concerns vanish (as in the GM model) and β̂_GLS = (X'V^{-1}X)^{-1}X'V^{-1}Y is unique. In this case, straightforward calculations show that E(β̂_GLS) = β and

    cov(β̂_GLS) = σ²(X'V^{-1}X)^{-1}.
Example 4.2. Heteroscedastic regression through the origin. Consider the regression model Yᵢ = βxᵢ + εᵢ, for i = 1, 2, ..., n, where E(εᵢ) = 0, var(εᵢ) = σ²g²(xᵢ), for some real function g(·), and cov(εᵢ, εⱼ) = 0, for i ≠ j. For this model,

    Y = (Y₁, Y₂, ..., Y_n)',   X = (x₁, x₂, ..., x_n)',   and   V = diag{g²(x₁), g²(x₂), ..., g²(x_n)}.

The OLS estimator of β is given by

    β̂_OLS = (X'X)^{-1}X'Y = Σ_{i=1}^n xᵢYᵢ / Σ_{i=1}^n xᵢ².

The GLS estimator of β is given by

    β̂_GLS = (X'V^{-1}X)^{-1}X'V^{-1}Y = Σ_{i=1}^n xᵢYᵢ/g²(xᵢ) / Σ_{i=1}^n xᵢ²/g²(xᵢ).

Which one is better? Both of these estimators are unbiased, so we turn to the variances. Straightforward calculations show that

    var(β̂_OLS) = σ² Σ_{i=1}^n xᵢ²g²(xᵢ) / ( Σ_{i=1}^n xᵢ² )²
    var(β̂_GLS) = σ² / Σ_{i=1}^n xᵢ²/g²(xᵢ).

We are thus left to compare

    Σ_{i=1}^n xᵢ²g²(xᵢ) / ( Σ_{i=1}^n xᵢ² )²   with   1 / Σ_{i=1}^n xᵢ²/g²(xᵢ).

Write xᵢ² = uᵢvᵢ, where uᵢ = xᵢg(xᵢ) and vᵢ = xᵢ/g(xᵢ). Applying the Cauchy-Schwarz inequality, we get

    ( Σ_{i=1}^n xᵢ² )² = ( Σ_{i=1}^n uᵢvᵢ )² ≤ Σ_{i=1}^n uᵢ² Σ_{i=1}^n vᵢ² = Σ_{i=1}^n xᵢ²g²(xᵢ) Σ_{i=1}^n xᵢ²/g²(xᵢ).

Thus,

    1 / Σ_{i=1}^n xᵢ²/g²(xᵢ)  ≤  Σ_{i=1}^n xᵢ²g²(xᵢ) / ( Σ_{i=1}^n xᵢ² )²   ⟹   var(β̂_GLS) ≤ var(β̂_OLS).

This result should not be surprising; after all, we know that β̂_GLS is BLUE.
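The inequality var(β̂_GLS) ≤ var(β̂_OLS) can be checked directly from the two variance formulas. Here is a short sketch of my own, with an assumed variance function g(x) = x (so var(εᵢ) = σ²xᵢ²) and simulated design points.

```python
import numpy as np

rng = np.random.default_rng(7)
n, sigma2 = 25, 1.0
x = rng.uniform(1, 5, size=n)
g2 = x ** 2                      # assumed variance function: var(eps_i) = sigma^2 * x_i^2

var_ols = sigma2 * np.sum(x ** 2 * g2) / np.sum(x ** 2) ** 2
var_gls = sigma2 / np.sum(x ** 2 / g2)

print(var_gls <= var_ols)        # True, as guaranteed by Cauchy-Schwarz / BLUE-ness
print(var_ols, var_gls)
```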
Result 4.5. An estimate β̂ is a generalized least squares estimate if and only if Xβ̂ = AY, where A = X(X'V^{-1}X)^-X'V^{-1}.
Proof. The GLS estimate; i.e., the OLS estimate in the transformed model Y* = Uβ + ε*, where Y* = V^{-1/2}Y, U = V^{-1/2}X, and ε* = V^{-1/2}ε, satisfies

    V^{-1/2}X[(V^{-1/2}X)'V^{-1/2}X]^-(V^{-1/2}X)'V^{-1/2}Y = V^{-1/2}Xβ̂,

by Result 2.5. Multiplying through by V^{1/2} and simplifying gives the result.

Result 4.6. A = X(X'V^{-1}X)^-X'V^{-1} is a projection matrix onto C(X).
Proof. We need to show that
(a) A is idempotent
(b) Aw ∈ C(X), for any w
(c) Az = z, for all z ∈ C(X).
The perpendicular projection matrix onto C(V^{-1/2}X) is

    V^{-1/2}X[(V^{-1/2}X)'V^{-1/2}X]^-(V^{-1/2}X)',

which implies that

    V^{-1/2}X[(V^{-1/2}X)'V^{-1/2}X]^-(V^{-1/2}X)'V^{-1/2}X = V^{-1/2}X.

This can also be written as

    V^{-1/2}AX = V^{-1/2}X.

Premultiplying by V^{1/2} gives AX = X. Thus,

    AA = AX(X'V^{-1}X)^-X'V^{-1} = X(X'V^{-1}X)^-X'V^{-1} = A,

showing that A is idempotent. To show (b), note Aw = X(X'V^{-1}X)^-X'V^{-1}w ∈ C(X). To show (c), it suffices to show C(A) = C(X). But, A = X(X'V^{-1}X)^-X'V^{-1} implies that C(A) ⊆ C(X) and AX = X implies that C(X) ⊆ C(A).

Result 4.7. In the Aitken model, if C(VX) ⊆ C(X), then the GLS and OLS estimates will be equal; i.e., OLS estimates will be BLUE in the Aitken model.
Proof. The proof proceeds by showing that A = X(X'V^{-1}X)^-X'V^{-1} is the perpendicular projection matrix onto C(X) when C(VX) ⊆ C(X). We already know that A is a projection matrix onto C(X). Thus, all we have to show is that if w ⊥ C(X), then Aw = 0. If V is nonsingular, then r(VX) = r(X). The only way this and C(VX) ⊆ C(X) holds is if C(VX) = C(X), in which case VXB₁ = X and VX = XB₂, for some matrices B₁ and B₂. Multiplying through by V^{-1} gives XB₁ = V^{-1}X and X = V^{-1}XB₂. Thus, C(V^{-1}X) = C(X) and C(V^{-1}X)^⊥ = C(X)^⊥. If w ⊥ C(X), then w ⊥ C(V^{-1}X); i.e., w ∈ N(X'V^{-1}). Since Aw = X(X'V^{-1}X)^-X'V^{-1}w = 0, we are done.
5 Distributional Theory

Complementary reading from Monahan: Chapter 5.

5.1 Introduction

PREVIEW: Consider the Gauss-Markov linear model Y = Xβ + ε, where Y is n × 1, X is n × p with rank r ≤ p, β is p × 1, and ε is n × 1 with E(ε) = 0 and cov(ε) = σ²I. In addition to the first two moment assumptions, it is common to assume that ε follows a multivariate normal distribution. This additional assumption allows us to formally pursue various questions dealing with inference. In addition to the multivariate normal distribution, we will also examine noncentral distributions and quadratic forms.

RECALL: If Z ∼ N(0, 1), then the probability density function (pdf) of Z is given by

    f_Z(z) = (1/√(2π)) e^{−z²/2} I(z ∈ R).

The N(μ, σ²) family is a location-scale family generated by the standard density f_Z(z).

TERMINOLOGY: The collection of pdfs

    LS(f) = { f_X(·|μ, σ) : f_X(x|μ, σ) = (1/σ) f_Z((x − μ)/σ); μ ∈ R, σ > 0 }

is a location-scale family generated by f_Z(z); see Casella and Berger, Chapter 3. That is, if Z ∼ f_Z(z), then

    X = σZ + μ  ∼  f_X(x|μ, σ) = (1/σ) f_Z((x − μ)/σ).

APPLICATION: With the standard normal density f_Z(z), it is easy to see that

    f_X(x|μ, σ) = (1/σ) f_Z((x − μ)/σ) = (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)} I(x ∈ R).

That is, any normal random variable X ∼ N(μ, σ²) may be obtained by transforming Z ∼ N(0, 1) via X = σZ + μ.
5.2 Multivariate normal distribution

5.2.1 Probability density function

STARTING POINT: Suppose that Z₁, Z₂, ..., Z_p are iid standard normal random variables. The joint pdf of Z = (Z₁, Z₂, ..., Z_p)' is given by

    f_Z(z) = Π_{i=1}^p f_Z(zᵢ) = (1/√(2π))^p e^{−Σ_{i=1}^p zᵢ²/2} Π_{i=1}^p I(zᵢ ∈ R)
           = (2π)^{−p/2} exp(−z'z/2) I(z ∈ R^p).

If Z has pdf f_Z(z), we say that Z has a standard multivariate normal distribution; i.e., a multivariate normal distribution with mean 0_{p×1} and covariance matrix I_p. We write Z ∼ N_p(0, I).

MULTIVARIATE NORMAL DISTRIBUTION: Suppose that Z ∼ N_p(0, I). Suppose that V is symmetric and positive definite (and, hence, nonsingular) and let V^{1/2} be the symmetric square root of V. Define the transformation

    Y = V^{1/2}Z + μ,

where Y and μ are both p × 1. Note that

    E(Y) = E(V^{1/2}Z + μ) = μ,

since E(Z) = 0, and

    cov(Y) = cov(V^{1/2}Z + μ) = V^{1/2}cov(Z)V^{1/2} = V,

since cov(Z) = I. The transformation y = g(z) = V^{1/2}z + μ is linear in z (and hence, one-to-one) and the pdf of Y can be found using a transformation. The inverse transformation is z = g^{-1}(y) = V^{-1/2}(y − μ). The Jacobian of the inverse transformation is

    |∂g^{-1}(y)/∂y| = |V^{-1/2}|,

where |A| denotes the determinant of A. The matrix V^{-1/2} is pd; thus, its determinant is always positive. Thus, for y ∈ R^p,

    f_Y(y) = f_Z{g^{-1}(y)}|V^{-1/2}|
           = |V|^{-1/2} f_Z{V^{-1/2}(y − μ)}
           = (2π)^{-p/2}|V|^{-1/2} exp[−{V^{-1/2}(y − μ)}'V^{-1/2}(y − μ)/2]
           = (2π)^{-p/2}|V|^{-1/2} exp{−(y − μ)'V^{-1}(y − μ)/2}.

If Y ∼ f_Y(y), we say that Y has a multivariate normal distribution with mean μ and covariance matrix V. We write Y ∼ N_p(μ, V).

IMPORTANT: In the preceding derivation, we assumed V to be pd (hence, nonsingular). If V is singular, then the distribution of Y is concentrated in a subspace of R^p, with dimension r(V). In this situation, the density function of Y does not exist.
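The construction Y = V^{1/2}Z + μ also gives a recipe for simulating from N_p(μ, V). The sketch below (added for illustration, with an assumed mean vector and covariance matrix) builds the symmetric square root from the spectral decomposition and checks the sample mean and covariance.

```python
import numpy as np

rng = np.random.default_rng(8)
mu = np.array([1.0, -2.0, 0.5])
V = np.array([[2.0, 0.5, 0.3],
              [0.5, 1.0, 0.2],
              [0.3, 0.2, 1.5]])      # assumed symmetric positive definite V

# Symmetric square root V^{1/2} = Q D^{1/2} Q' from the spectral decomposition
evals, Q = np.linalg.eigh(V)
V_half = Q @ np.diag(np.sqrt(evals)) @ Q.T

Z = rng.normal(size=(100000, 3))     # rows are iid N_3(0, I) draws
Y = Z @ V_half + mu                  # rows are N_3(mu, V) draws (V_half is symmetric)

print(np.round(Y.mean(axis=0), 2))             # approximately mu
print(np.round(np.cov(Y, rowvar=False), 2))    # approximately V
```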
5.2.2 Moment generating functions

REVIEW: Suppose that X is a random variable with cumulative distribution function F_X(x) = P(X ≤ x). If E(e^{tX}) < ∞ for all |t| < δ, δ > 0, then

    M_X(t) = E(e^{tX}) = ∫_R e^{tx} dF_X(x)

is defined for all t in an open neighborhood about zero. The function M_X(t) is called the moment generating function (mgf) of X.

Result 5.1.
1. If M_X(t) exists, then E(|X|^j) < ∞, for all j ≥ 1; that is, the moment generating function characterizes an infinite set of moments.
2. M_X(0) = 1.
3. The jth moment of X is given by

    E(X^j) = ∂^j M_X(t)/∂t^j, evaluated at t = 0.

4. Uniqueness. If X₁ ∼ M_{X₁}(t), X₂ ∼ M_{X₂}(t), and M_{X₁}(t) = M_{X₂}(t) for all t in an open neighborhood about zero, then F_{X₁}(x) = F_{X₂}(x) for all x.
5. If X₁, X₂, ..., X_n are independent random variables with mgfs M_{Xᵢ}(t), i = 1, 2, ..., n, and Y = a₀ + Σ_{i=1}^n aᵢXᵢ, then

    M_Y(t) = e^{a₀t} Π_{i=1}^n M_{Xᵢ}(aᵢt).

Result 5.2.
1. If X ∼ N(μ, σ²), then M_X(t) = exp(μt + t²σ²/2), for all t ∈ R.
2. If X ∼ N(μ, σ²), then Y = a + bX ∼ N(a + bμ, b²σ²).
3. If X ∼ N(μ, σ²), then Z = (X − μ)/σ ∼ N(0, 1).

TERMINOLOGY: Define the random vector X = (X₁, X₂, ..., X_p)' and let t = (t₁, t₂, ..., t_p)'. The moment generating function for X is given by

    M_X(t) = E{exp(t'X)} = ∫_{R^p} exp(t'x) dF_X(x),

provided that E{exp(t'X)} < ∞, for all ||t|| < δ, δ > 0.

Result 5.3.
1. If M_X(t) exists, then M_{Xᵢ}(tᵢ) = M_X(tᵢ*), where tᵢ* = (0, ..., 0, tᵢ, 0, ..., 0)'. This implies that E(|Xᵢ|^j) < ∞, for all j ≥ 1.
2. The expected value of X is

    E(X) = ∂M_X(t)/∂t, evaluated at t = 0.

3. The p × p second moment matrix is

    E(XX') = ∂²M_X(t)/∂t∂t', evaluated at t = 0.

Thus,

    E(X_r X_s) = ∂²M_X(t)/∂t_r∂t_s, evaluated at t_r = t_s = 0.

4. Uniqueness. If X₁ and X₂ are random vectors with M_{X₁}(t) = M_{X₂}(t) for all t in an open neighborhood about zero, then F_{X₁}(x) = F_{X₂}(x) for all x.
5. If X₁, X₂, ..., X_n are independent random vectors, and

    Y = a₀ + Σ_{i=1}^n AᵢXᵢ,

for conformable a₀ and Aᵢ; i = 1, 2, ..., n, then

    M_Y(t) = exp(a₀'t) Π_{i=1}^n M_{Xᵢ}(Aᵢ't).

6. Let X = (X₁', X₂', ..., X_m')' and suppose that M_X(t) exists. Let M_{Xᵢ}(tᵢ) denote the mgf of Xᵢ. Then, X₁, X₂, ..., X_m are independent if and only if

    M_X(t) = Π_{i=1}^m M_{Xᵢ}(tᵢ)

for all t = (t₁', t₂', ..., t_m')' in an open neighborhood about zero.

Result 5.4. If Y ∼ N_p(μ, V), then M_Y(t) = exp(t'μ + t'Vt/2).
Proof. Exercise.
5.2.3 Properties

Result 5.5. Let Y ∼ N_p(μ, V). Let a be p × 1, b be k × 1, and A be k × p. Then
1. X = a'Y ∼ N(a'μ, a'Va).
2. X = AY + b ∼ N_k(Aμ + b, AVA').

Result 5.6. If Y ∼ N_p(μ, V), then any r × 1 subvector of Y has an r-variate normal distribution with the same means, variances, and covariances as the original distribution.
Proof. Partition Y = (Y₁', Y₂')', where Y₁ is r × 1. Partition μ = (μ₁', μ₂')' and

    V = [ V₁₁  V₁₂ ]
        [ V₂₁  V₂₂ ]

accordingly. Define A = (I_r 0), where 0 is r × (p − r). Since Y₁ = AY is a linear function of Y, it is normally distributed. Since Aμ = μ₁ and AVA' = V₁₁, we are done.

COROLLARY: If Y ∼ N_p(μ, V), then Yᵢ ∼ N(μᵢ, σᵢ²), for i = 1, 2, ..., p.

WARNING: Joint normality implies marginal normality. That is, if Y₁ and Y₂ are jointly normal, then they are marginally normal. However, if Y₁ and Y₂ are marginally normal, this does not necessarily mean that they are jointly normal.

APPLICATION: Consider the general linear model

    Y = Xβ + ε,

where ε ∼ N_n(0, σ²I). Note that E(Y) = Xβ and that V = cov(Y) = σ²I. Furthermore, because Y is a linear combination of ε, it is also normally distributed; i.e., Y ∼ N_n(Xβ, σ²I). With P_X = X(X'X)^-X', we know that Ŷ = P_X Y and e = (I − P_X)Y. Now,

    E(Ŷ) = E(P_X Y) = P_X E(Y) = P_X Xβ = Xβ

and

    cov(Ŷ) = cov(P_X Y) = P_X cov(Y)P_X' = σ²P_X IP_X = σ²P_X,

since P_X is symmetric and idempotent. Also, Ŷ = P_X Y is a linear combination of Y, so it also has a normal distribution. Putting everything together, we have

    Ŷ ∼ N_n(Xβ, σ²P_X).

Exercise: Show that e ∼ N_n{0, σ²(I − P_X)}.
5.2.4 Less-than-full-rank normal distributions

TERMINOLOGY: The random vector Y_{p×1} ∼ N_p(μ, V) is said to have a p-variate normal distribution with rank k if Y has the same distribution as

    μ_{p×1} + Γ_{p×k}Z_{k×1},

where ΓΓ' = V, r(V) = k < p, and Z ∼ N_k(0, I).

Example 5.1. Suppose that k = 1, Z₁ ∼ N(0, 1), and Y = (Y₁, Y₂)', where

    Y = ( 0 ) + ( γ₁ ) Z₁ = ( γ₁Z₁ ),
        ( 0 )   ( γ₂ )      ( γ₂Z₁ )

where Γ = (γ₁, γ₂)' and r(Γ) = 1. Since r(Γ) = 1, this means that at least one of γ₁ and γ₂ is not equal to zero. Without loss, take γ₁ ≠ 0, in which case

    Y₂ = (γ₂/γ₁)Y₁.

Note that E(Y) = 0 = (0, 0)' and

    cov(Y) = E(YY') = E ( γ₁²Z₁²    γ₁γ₂Z₁² ) = ( γ₁²    γ₁γ₂ ) = ΓΓ' = V.
                        ( γ₁γ₂Z₁²   γ₂²Z₁²  )   ( γ₁γ₂   γ₂²  )

Note that |V| = 0. Thus, Y_{2×1} is a random vector with all of its probability mass located in the linear subspace {(y₁, y₂) : y₂ = γ₂y₁/γ₁}. Since r(V) = 1 < 2, Y does not have a density function.
5.2.5 Independence results

Result 5.7. Suppose that Y ∼ N(μ, V), where

    Y = (Y₁', Y₂', ..., Y_m')',   μ = (μ₁', μ₂', ..., μ_m')',   and   V = [ V₁₁  V₁₂  ...  V₁ₘ ]
                                                                         [ V₂₁  V₂₂  ...  V₂ₘ ]
                                                                         [ ...               ]
                                                                         [ Vₘ₁  Vₘ₂  ...  Vₘₘ ].

Then, Y₁, Y₂, ..., Y_m are jointly independent if and only if Vᵢⱼ = 0, for all i ≠ j.
Proof. (⟹) Suppose Y₁, Y₂, ..., Y_m are jointly independent. For all i ≠ j,

    Vᵢⱼ = E{(Yᵢ − μᵢ)(Yⱼ − μⱼ)'} = E(Yᵢ − μᵢ)E{(Yⱼ − μⱼ)'} = 0.

(⟸) Suppose that Vᵢⱼ = 0 for all i ≠ j, and let t = (t₁', t₂', ..., t_m')'. Note that

    t'Vt = Σ_{i=1}^m tᵢ'Vᵢᵢtᵢ   and   t'μ = Σ_{i=1}^m tᵢ'μᵢ.

Thus,

    M_Y(t) = exp(t'μ + t'Vt/2)
           = exp( Σ_{i=1}^m tᵢ'μᵢ + Σ_{i=1}^m tᵢ'Vᵢᵢtᵢ/2 )
           = Π_{i=1}^m exp(tᵢ'μᵢ + tᵢ'Vᵢᵢtᵢ/2) = Π_{i=1}^m M_{Yᵢ}(tᵢ).

Result 5.8. Suppose that X ∼ N(μ, Σ), and let Y₁ = a₁ + B₁X and Y₂ = a₂ + B₂X, for nonrandom conformable aᵢ and Bᵢ; i = 1, 2. Then, Y₁ and Y₂ are independent if and only if B₁ΣB₂' = 0.
Proof. Write Y = (Y₁', Y₂')' as

    ( Y₁ ) = ( a₁ ) + ( B₁ ) X = a + BX.
    ( Y₂ )   ( a₂ )   ( B₂ )

Thus, Y is a linear combination of X; hence, Y follows a multivariate normal distribution (i.e., Y₁ and Y₂ are jointly normal). Also, cov(Y₁, Y₂) = cov(B₁X, B₂X) = B₁ΣB₂'. Now simply apply Result 5.7.

REMARK: If X₁ ∼ N(μ₁, Σ₁), X₂ ∼ N(μ₂, Σ₂), and cov(X₁, X₂) = 0, this does not necessarily mean that X₁ and X₂ are independent! We need X = (X₁', X₂')' to be jointly normal.

APPLICATION: Consider the general linear model

    Y = Xβ + ε,

where ε ∼ N_n(0, σ²I). We have already seen that Y ∼ N_n(Xβ, σ²I). Also, note that with P_X = X(X'X)^-X',

    ( Ŷ ) = (   P_X   ) Y,
    ( e )   ( I − P_X )

a linear combination of Y. Thus, Ŷ and e are jointly normal. By the last result, we know that Ŷ and e are independent since

    cov(Ŷ, e) = P_X σ²I(I − P_X)' = 0.

That is, the fitted values and residuals from the least-squares fit are independent. This explains why residual plots that display nonrandom patterns are consistent with a violation of our model assumptions.
5.2.6 Conditional distributions

RECALL: Suppose that (X, Y)' ∼ N₂(μ, Σ), where

    μ = ( μ_X )   and   Σ = [ σ_X²       ρσ_Xσ_Y ]
        ( μ_Y )             [ ρσ_Xσ_Y    σ_Y²    ]

and ρ = corr(X, Y). The conditional distribution of Y, given X, is also normally distributed; more precisely,

    Y|{X = x} ∼ N( μ_Y + ρ(σ_Y/σ_X)(x − μ_X),  σ_Y²(1 − ρ²) ).

It is important to see that the conditional mean E(Y|X = x) is a linear function of x. Note also that the conditional variance var(Y|X = x) is free of x.

EXTENSION: We wish to extend the previous result to random vectors. In particular, suppose that X and Y are jointly multivariate normal with Σ_XY ≠ 0. That is, suppose

    ( X ) ∼ N ( ( μ_X ),  [ Σ_X    Σ_XY ] ),
    ( Y )     ( ( μ_Y )   [ Σ_YX   Σ_Y  ] )

and assume that Σ_X is nonsingular. The conditional distribution of Y given X is

    Y|{X = x} ∼ N(μ_{Y|X}, Σ_{Y|X}),

where

    μ_{Y|X} = μ_Y + Σ_YX Σ_X^{-1}(x − μ_X)

and

    Σ_{Y|X} = Σ_Y − Σ_YX Σ_X^{-1} Σ_XY.

Again, the conditional mean μ_{Y|X} is a linear function of x and the conditional covariance matrix Σ_{Y|X} is free of x.
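These formulas are easy to exercise numerically. The sketch below is my own (an assumed joint covariance with X two-dimensional and Y scalar); it checks the consequence of the conditional formulas that Y − μ_Y − Σ_YX Σ_X^{-1}(X − μ_X) has variance Σ_{Y|X} and is uncorrelated with X.

```python
import numpy as np

rng = np.random.default_rng(9)
# Assumed joint covariance blocks for (X, Y), X 2-dimensional, Y 1-dimensional
S_X  = np.array([[1.0, 0.3], [0.3, 2.0]])
S_XY = np.array([[0.4], [0.6]])
S_Y  = np.array([[1.5]])
mu_X, mu_Y = np.array([0.0, 1.0]), np.array([2.0])

B = S_XY.T @ np.linalg.inv(S_X)              # Sigma_YX Sigma_X^{-1}
S_cond = S_Y - B @ S_XY                      # conditional covariance (free of x)

# Simulate jointly normal (X, Y) and check the claims above
V = np.block([[S_X, S_XY], [S_XY.T, S_Y]])
L = np.linalg.cholesky(V)
draws = rng.normal(size=(200000, 3)) @ L.T + np.concatenate([mu_X, mu_Y])
X, Y = draws[:, :2], draws[:, 2:]
resid = Y - mu_Y - (X - mu_X) @ B.T

print(round(resid.var(ddof=1), 3), round(S_cond[0, 0], 3))   # nearly equal
print(np.round(np.cov(np.column_stack([X, resid]), rowvar=False)[:2, 2], 3))  # both near 0
```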
PAGE 76
CHAPTER 5 STAT 714, J. TEBBS
5.3 Noncentral
2
distribution
RECALL: Suppose that U
2
n
; that is, U has a (central)
2
distribution with n > 0
degrees of freedom. The pdf of U is given by
f
U
(u|n) =
1
(
n
2
)2
n/2
u
n
2
1
e
u/2
I(u > 0).
The
2
n
family of distributions is a gamma(, ) subfamily with shape parameter = n/2
and scale parameter = 2. Note that E(U) = n, var(U) = 2n, and M
U
(t) = (12t)
n/2
,
for t < 1/2.
RECALL: If Z
1
, Z
2
, ..., Z
n
are iid N(0, 1), then U
1
= Z
2
1

2
1
and
Z

Z =
n

i=1
Z
2
i

2
n
.
Proof. Exercise.
TERMINOLOGY: A univariate random variable V is said to have a noncentral
2
distribution with degrees of freedom n > 0 and noncentrality parameter > 0 if it has
the pdf
f
V
(v|n, ) =

j=0
_
e

j
j!
_
1
(
n+2j
2
)2
(n+2j)/2
v
n+2j
2
1
e
v/2
I(v > 0)
. .
f
U
(v|n+2j)
.
We write V
2
n
(). When = 0, the
2
n
() distribution reduces to the central
2
n
distribution. In the
2
n
() pdf, notice that e

j
/j! is the jth term of a Poisson pmf with
parameter > 0.
Result 5.9. If V|W ~ χ²_{n+2W} and W ~ Poisson(λ), then V ~ χ²_n(λ).
Proof. Note that

    f_V(v) = ∑_{j=0}^∞ f_{V,W}(v, j)
           = ∑_{j=0}^∞ f_W(j) f_{V|W}(v|j)
           = ∑_{j=0}^∞ (e^{−λ}λ^j/j!) 1/{Γ((n+2j)/2) 2^{(n+2j)/2}} v^{(n+2j)/2 − 1} e^{−v/2} I(v > 0),

which is the χ²_n(λ) pdf.
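A quick simulation sketch of Result 5.9 (the sample size and parameter values are arbitrary, not from the notes): draw W from a Poisson(λ), then V from a central χ² with n + 2W degrees of freedom, and compare with scipy's noncentral χ², whose noncentrality nc equals 2λ in this parameterization.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(714)
    n, lam, reps = 4, 1.5, 200_000           # df and noncentrality (notes' lambda)

    W = rng.poisson(lam, size=reps)          # Poisson mixing variable
    V = rng.chisquare(df=n + 2 * W)          # central chi-square given W

    print(V.mean(), n + 2 * lam)             # E(V) = n + 2*lambda
    print(V.var(), 2 * n + 8 * lam)          # var(V) = 2n + 8*lambda
    print(stats.kstest(V, stats.ncx2(df=n, nc=2 * lam).cdf).statistic)  # small KS distance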
Result 5.10. If V ~ χ²_n(λ), then

    M_V(t) = (1 − 2t)^{−n/2} exp{ 2λt/(1 − 2t) },

for t < 1/2.
Proof. The mgf of V, by definition, is M_V(t) = E(e^{tV}). Using an iterated expectation, we can write, for t < 1/2,

    M_V(t) = E(e^{tV}) = E{E(e^{tV}|W)},

where W ~ Poisson(λ). Note that E(e^{tV}|W) is the conditional mgf of V, given W. We know that V|W ~ χ²_{n+2W}; thus, E(e^{tV}|W) = (1 − 2t)^{−(n+2W)/2} and

    E{E(e^{tV}|W)} = ∑_{j=0}^∞ (1 − 2t)^{−(n+2j)/2} (e^{−λ}λ^j/j!)
                   = e^{−λ}(1 − 2t)^{−n/2} ∑_{j=0}^∞ (λ^j/j!)(1 − 2t)^{−j}
                   = e^{−λ}(1 − 2t)^{−n/2} ∑_{j=0}^∞ {λ/(1 − 2t)}^j / j!
                   = e^{−λ}(1 − 2t)^{−n/2} exp{ λ/(1 − 2t) }
                   = (1 − 2t)^{−n/2} exp{ 2λt/(1 − 2t) }.
MEAN AND VARIANCE: If V ~ χ²_n(λ), then

    E(V) = n + 2λ  and  var(V) = 2n + 8λ.

Result 5.11. If Y ~ N(μ, 1), then U = Y² ~ χ²_1(λ), where λ = μ²/2.
Outline of the proof. The proof proceeds by finding the mgf of U and showing that it equals

    M_U(t) = (1 − 2t)^{−n/2} exp{ 2λt/(1 − 2t) },

with n = 1 and λ = μ²/2. Note that

    M_U(t) = E(e^{tU}) = E(e^{tY²}) = ∫_R e^{ty²} (1/√(2π)) e^{−(y−μ)²/2} dy.

Now, combine the exponents in the integrand, square out the (y − μ)² term, combine like terms, complete the square, and collapse the expression to (1 − 2t)^{−1/2} exp{μ²t/(1 − 2t)} times some normal density that is integrated over R.
Result 5.12. If U_1, U_2, ..., U_m are independent random variables, where U_i ~ χ²_{n_i}(λ_i); i = 1, 2, ..., m, then U = ∑_i U_i ~ χ²_n(λ), where n = ∑_i n_i and λ = ∑_i λ_i.
Result 5.13. Suppose that V ~ χ²_n(λ). For fixed n and c > 0, the quantity P_λ(V > c) is a strictly increasing function of λ.
Proof. See Monahan, pp 106-108.
IMPLICATION: If V_1 ~ χ²_n(λ_1) and V_2 ~ χ²_n(λ_2), where λ_2 > λ_1, then pr(V_2 > c) > pr(V_1 > c). That is, V_2 is (strictly) stochastically greater than V_1, written V_2 >_st V_1. Note that

    V_2 >_st V_1  ⟺  F_{V_2}(v) < F_{V_1}(v)  ⟺  S_{V_2}(v) > S_{V_1}(v),

for all v, where F_{V_i}(·) denotes the cdf of V_i and S_{V_i}(·) = 1 − F_{V_i}(·) denotes the survivor function of V_i.
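The monotonicity in Result 5.13 is easy to check numerically; the sketch below (cutoff and degrees of freedom are arbitrary choices) evaluates the survivor function of χ²_n(λ) over a grid of λ values, remembering that scipy's nc is 2λ in the notes' parameterization.

    import numpy as np
    from scipy import stats

    n, c = 5, 11.07                           # df and an arbitrary cutoff
    lams = np.linspace(0.5, 5.0, 10)          # noncentrality values (notes' lambda)
    surv = [stats.ncx2.sf(c, df=n, nc=2 * lam) for lam in lams]
    print(np.all(np.diff(surv) > 0))          # True: P(V > c) increases in lambda
    print(dict(zip(lams.round(1), np.round(surv, 4))))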
5.4 Noncentral F distribution

RECALL: A univariate random variable W is said to have a (central) F distribution with degrees of freedom n_1 > 0 and n_2 > 0 if it has the pdf

    f_W(w|n_1, n_2) = { Γ((n_1+n_2)/2) (n_1/n_2)^{n_1/2} w^{(n_1−2)/2} } / { Γ(n_1/2) Γ(n_2/2) } × (1 + n_1w/n_2)^{−(n_1+n_2)/2} I(w > 0).

We write W ~ F_{n_1,n_2}. The moment generating function for the F distribution does not exist in closed form.

RECALL: If U_1 and U_2 are independent central χ² random variables with degrees of freedom n_1 and n_2, respectively, then

    W = (U_1/n_1)/(U_2/n_2) ~ F_{n_1,n_2}.
TERMINOLOGY: A univariate random variable W is said to have a noncentral F distribution with degrees of freedom n_1 > 0 and n_2 > 0 and noncentrality parameter λ > 0 if it has the pdf

    f_W(w|n_1, n_2, λ) = ∑_{j=0}^∞ (e^{−λ}λ^j/j!) { Γ((n_1+2j+n_2)/2) ((n_1+2j)/n_2)^{(n_1+2j)/2} w^{(n_1+2j−2)/2} } / { Γ((n_1+2j)/2) Γ(n_2/2) } × (1 + n_1w/n_2)^{−(n_1+2j+n_2)/2} I(w > 0).

We write W ~ F_{n_1,n_2}(λ). When λ = 0, the noncentral F distribution reduces to the central F distribution.
MEAN AND VARIANCE: If W ~ F_{n_1,n_2}(λ), then

    E(W) = { n_2/(n_2 − 2) } { 1 + 2λ/n_1 }

and

    var(W) = { 2n_2² / (n_1²(n_2 − 2)) } [ (n_1 + 2λ)²/{(n_2 − 2)(n_2 − 4)} + (n_1 + 4λ)/(n_2 − 4) ].

E(W) exists only when n_2 > 2 and var(W) exists only when n_2 > 4. The moment generating function for the noncentral F distribution does not exist in closed form.
Result 5.14. If U_1 and U_2 are independent random variables with U_1 ~ χ²_{n_1}(λ) and U_2 ~ χ²_{n_2}, then

    W = (U_1/n_1)/(U_2/n_2) ~ F_{n_1,n_2}(λ).

Proof. See Searle, pp 51-52.
Result 5.15. Suppose that W ~ F_{n_1,n_2}(λ). For fixed n_1, n_2, and c > 0, the quantity P_λ(W > c) is a strictly increasing function of λ. That is, if W_1 ~ F_{n_1,n_2}(λ_1) and W_2 ~ F_{n_1,n_2}(λ_2), where λ_2 > λ_1, then pr(W_2 > c) > pr(W_1 > c); i.e., W_2 >_st W_1.
REMARK: The fact that the noncentral F distribution tends to be larger than the central F distribution is the basis for many of the tests used in linear models. Typically, test statistics are used that have a central F distribution if the null hypothesis is true and a noncentral F distribution if the null hypothesis is not true. Since the noncentral F distribution tends to be larger, large values of the test statistic are consistent with the alternative hypothesis. Thus, the form of an appropriate rejection region is to reject H_0 for large values of the test statistic. The power of the test is the probability of the rejection region (whose cutoff is determined under H_0) computed under the relevant noncentral F distribution. Noncentral F distributions are available in most software packages.
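As a hedged illustration of how a power calculation uses Result 5.15 (degrees of freedom, level, and the λ grid below are arbitrary choices, not values from the notes), one can compute the rejection cutoff from the central F distribution and then evaluate the rejection probability under noncentral F distributions; scipy's noncentrality nc corresponds to 2λ in the notes' parameterization.

    import numpy as np
    from scipy import stats

    n1, n2, alpha = 3, 20, 0.05
    c = stats.f.ppf(1 - alpha, n1, n2)            # cutoff F_{n1,n2,alpha}, set under H0
    print("size under H0:", stats.f.sf(c, n1, n2))  # = alpha

    for lam in [0.5, 1.0, 2.0, 4.0]:              # notes' lambda; scipy's nc = 2*lambda
        power = stats.ncf.sf(c, n1, n2, nc=2 * lam)
        print(f"lambda = {lam:3.1f}  power = {power:.4f}")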
5.5 Distributions of quadratic forms

GOAL: We would like to find the distribution of Y'AY, where Y ~ N_p(μ, V). We will obtain this distribution by taking steps. Result 5.16 is a very small step. Result 5.17 is a large step, and Result 5.18 is the finish line. There is no harm in assuming that A is symmetric.

Result 5.16. Suppose that Y ~ N_p(μ, I) and define

    W = Y'Y = Y'IY = ∑_{i=1}^p Y_i².

Result 5.11 says Y_i² ~ χ²_1(μ_i²/2), for i = 1, 2, ..., p. Thus, from Result 5.12,

    Y'Y = ∑_{i=1}^p Y_i² ~ χ²_p(μ'μ/2).
LEMMA: The p × p symmetric matrix A is idempotent of rank s if and only if there exists a p × s matrix P_1 such that (a) A = P_1P_1' and (b) P_1'P_1 = I_s.
Proof. (⇐) Suppose that A = P_1P_1' and P_1'P_1 = I_s. Clearly, A is symmetric. Also,

    A² = P_1P_1'P_1P_1' = P_1P_1' = A.

Note also that r(A) = tr(A) = tr(P_1P_1') = tr(P_1'P_1) = tr(I_s) = s. Now, to go the other way (⇒), suppose that A is a symmetric, idempotent matrix of rank s. The spectral decomposition of A is given by A = QDQ', where D = diag(λ_1, λ_2, ..., λ_p) and Q is orthogonal. Since A is idempotent, we know that s of the eigenvalues λ_1, λ_2, ..., λ_p are equal to 1 and the other p − s eigenvalues are equal to 0. Thus, we can write

    A = QDQ' = [P_1 P_2][I_s 0; 0 0][P_1'; P_2'] = P_1P_1'.

Thus, we have shown that (a) holds. To show that (b) holds, note that because Q is orthogonal,

    I_p = Q'Q = [P_1'; P_2'][P_1 P_2] = [P_1'P_1  P_1'P_2; P_2'P_1  P_2'P_2].

It is easy to convince yourself that P_1'P_1 is an identity matrix. Its dimension is s × s because tr(P_1'P_1) = tr(P_1P_1') = tr(A) = r(A), which equals s by assumption.
Result 5.17. Suppose that Y ~ N_p(μ, I). If A is idempotent of rank s, then Y'AY ~ χ²_s(λ), where λ = (1/2)μ'Aμ.
Proof. Suppose that Y ~ N_p(μ, I) and that A is idempotent of rank s. By the last lemma, we know that A = P_1P_1', where P_1 is p × s, and P_1'P_1 = I_s. Thus,

    Y'AY = Y'P_1P_1'Y = X'X,

where X = P_1'Y. Since Y ~ N_p(μ, I), and since X = P_1'Y is a linear combination of Y, we know that

    X ~ N_s(P_1'μ, P_1'IP_1) ≡ N_s(P_1'μ, I_s).

Result 5.16 says that Y'AY = X'X ~ χ²_s{(P_1'μ)'(P_1'μ)/2}. But,

    (1/2)(P_1'μ)'(P_1'μ) = (1/2)μ'P_1P_1'μ = (1/2)μ'Aμ.
Result 5.18. Suppose that Y ~ N_p(μ, V), where r(V) = p. If AV is idempotent of rank s, then Y'AY ~ χ²_s(λ), where λ = (1/2)μ'Aμ.
Proof. Since Y ~ N_p(μ, V), and since X = V^{−1/2}Y is a linear combination of Y, we know that

    X ~ N_p(V^{−1/2}μ, V^{−1/2}VV^{−1/2}) ≡ N_p(V^{−1/2}μ, I_p).

Now,

    Y'AY = X'V^{1/2}AV^{1/2}X = X'BX,

where B = V^{1/2}AV^{1/2}. Recall that V^{1/2} is the symmetric square root of V. From Result 5.17, we know that Y'AY = X'BX ~ χ²_s(λ) if B is idempotent of rank s. However, note that

    r(B) = r(V^{1/2}AV^{1/2}) = r(A) = r(AV) = s,

since AV has rank s (by assumption) and V and V^{1/2} are both nonsingular. Also, AV is idempotent by assumption so that

    AV = AVAV  ⟹  A = AVA  ⟹  V^{1/2}AV^{1/2} = V^{1/2}AV^{1/2}V^{1/2}AV^{1/2}  ⟹  B = BB.

Thus, B is idempotent of rank s. This implies that Y'AY = X'BX ~ χ²_s(λ). Noting that

    λ = (1/2)(V^{−1/2}μ)'B(V^{−1/2}μ) = (1/2)μ'V^{−1/2}V^{1/2}AV^{1/2}V^{−1/2}μ = (1/2)μ'Aμ

completes the argument.
Example 5.2. Suppose that Y = (Y_1, Y_2, ..., Y_n)' ~ N_n(μ1, σ²I), so that μ = μ1 and V = σ²I, where 1 is n × 1 and I is n × n. The statistic

    (n − 1)S² = ∑_{i=1}^n (Y_i − Ȳ)² = Y'(I − n^{−1}J)Y = Y'AY,

where A = I − n^{−1}J. Thus, consider the quantity

    (n − 1)S²/σ² = Y'BY,

where B = σ^{−2}A = σ^{−2}(I − n^{−1}J). Note that BV = σ^{−2}(I − n^{−1}J)σ²I = I − n^{−1}J = A, which is idempotent with rank

    r(BV) = tr(BV) = tr(A) = tr(I − n^{−1}J) = n − n^{−1}n = n − 1.

Result 5.18 says that (n − 1)S²/σ² = Y'BY ~ χ²_{n−1}(λ), where λ = (1/2)μ'Bμ. However,

    λ = (1/2)μ'Bμ = (1/2)(μ1)'σ^{−2}(I − n^{−1}J)μ1 = 0,

since 1 ∈ C(1) and I − n^{−1}J is the ppm onto C(1)^⊥. Thus,

    (n − 1)S²/σ² = Y'BY ~ χ²_{n−1},

a central χ² distribution with n − 1 degrees of freedom.
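A numerical sketch of Example 5.2 (n, μ, and σ² below are arbitrary illustrative values): check that BV is idempotent with trace n − 1 and that the noncentrality (1/2)μ'Bμ vanishes.

    import numpy as np

    n, mu, sigma2 = 6, 3.0, 1.7
    ones = np.ones((n, 1))
    J = ones @ ones.T
    A = np.eye(n) - J / n                    # I - n^{-1} J
    B = A / sigma2
    V = sigma2 * np.eye(n)

    BV = B @ V
    print(np.allclose(BV @ BV, BV))          # BV idempotent
    print(np.isclose(np.trace(BV), n - 1))   # rank (= trace) is n - 1
    mu_vec = mu * np.ones(n)
    print(np.isclose(mu_vec @ B @ mu_vec / 2, 0.0))   # noncentrality lambda = 0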
Example 5.3. Consider the general linear model

    Y = Xβ + ε,

where X is n × p with rank r ≤ p and ε ~ N_n(0, σ²I). Let P_X = X(X'X)⁻X' denote the perpendicular projection matrix onto C(X). We know that Y ~ N_n(Xβ, σ²I). Consider the (uncorrected) partitioning of the sums of squares given by

    Y'Y = Y'P_XY + Y'(I − P_X)Y.

We first consider the residual sum of squares Y'(I − P_X)Y. Dividing this quantity by σ², we get

    Y'(I − P_X)Y/σ² = Y'{σ^{−2}(I − P_X)}Y = Y'AY,

where A = σ^{−2}(I − P_X). With V = σ²I, note that

    AV = σ^{−2}(I − P_X)σ²I = I − P_X,

an idempotent matrix with rank

    r(I − P_X) = tr(I − P_X) = tr(I) − tr(P_X) = tr(I) − r(P_X) = n − r,

since r(P_X) = r(X) = r, by assumption. Result 5.18 says that

    Y'(I − P_X)Y/σ² = Y'AY ~ χ²_{n−r}(λ),

where λ = (1/2)μ'Aμ. However,

    λ = (1/2)μ'Aμ = (1/2)(Xβ)'σ^{−2}(I − P_X)Xβ = 0,

because Xβ ∈ C(X) and I − P_X projects onto the orthogonal complement N(X'). Thus, we have shown that Y'(I − P_X)Y/σ² ~ χ²_{n−r}, a central χ² distribution with n − r degrees of freedom.

Now, we turn our attention to the (uncorrected) model sum of squares Y'P_XY. Dividing this quantity by σ², we get

    Y'P_XY/σ² = Y'(σ^{−2}P_X)Y = Y'BY,

where B = σ^{−2}P_X. With V = σ²I, note that

    BV = σ^{−2}P_X σ²I = P_X,

an idempotent matrix with rank r(P_X) = r(X) = r. Result 5.18 says that

    Y'P_XY/σ² = Y'BY ~ χ²_r(λ),

where λ = (1/2)μ'Bμ. Note that

    λ = (1/2)μ'Bμ = (1/2)(Xβ)'σ^{−2}P_XXβ = (Xβ)'Xβ/2σ².

That is, Y'P_XY/σ² has a noncentral χ² distribution with r degrees of freedom and noncentrality parameter λ = (Xβ)'Xβ/2σ².

In the last calculation, note that λ = (Xβ)'Xβ/2σ² = 0 iff Xβ = 0. In this case, both quadratic forms Y'(I − P_X)Y/σ² and Y'P_XY/σ² have central χ² distributions.
5.6 Independence of quadratic forms

GOALS: In this subsection, we consider two problems. With Y ~ N(μ, V), we would like to establish sufficient conditions for

    (a) Y'AY and BY to be independent, and
    (b) Y'AY and Y'BY to be independent.

Result 5.19. Suppose that Y ~ N_p(μ, V). If BVA = 0, then Y'AY and BY are independent.
Proof. We may assume that A is symmetric. Write A = QDQ', where D = diag(λ_1, λ_2, ..., λ_p) and Q is orthogonal. We know that s ≤ p of the eigenvalues λ_1, λ_2, ..., λ_p are nonzero, where s = r(A). We can thus write

    A = QDQ' = [P_1 P_2][D_1 0; 0 0][P_1'; P_2'] = P_1D_1P_1',

where D_1 = diag(λ_1, λ_2, ..., λ_s). Thus,

    Y'AY = Y'P_1D_1P_1'Y = X'D_1X,

where X = P_1'Y. Notice that

    (BY; X) = (B; P_1')Y ~ N( (Bμ; P_1'μ), [BVB'  BVP_1; P_1'VB'  P_1'VP_1] ).

Suppose that BVA = 0. Then,

    0 = BVA = BVP_1D_1P_1', and post-multiplying by P_1 gives 0 = BVP_1D_1P_1'P_1.

But, because Q is orthogonal,

    I_p = Q'Q = [P_1'; P_2'][P_1 P_2] = [P_1'P_1  P_1'P_2; P_2'P_1  P_2'P_2].

This implies that P_1'P_1 = I_s, and, thus,

    0 = BVP_1D_1P_1'P_1 = BVP_1D_1  ⟹  0 = BVP_1D_1D_1^{−1} = BVP_1.

Therefore, cov(BY, X) = 0; that is, X and BY are independent. But Y'AY = X'D_1X, a function of X. Thus, Y'AY and BY are independent as well.
Example 5.4. Suppose that Y = (Y_1, Y_2, ..., Y_n)' ~ N_n(μ1, σ²I), where 1 is n × 1 and I is n × n, so that μ = μ1 and V = σ²I. Recall that

    (n − 1)S² = Y'(I − n^{−1}J)Y = Y'AY,

where A = I − n^{−1}J. Also,

    Ȳ = n^{−1}1'Y = BY,

where B = n^{−1}1'. These two statistics are independent because

    BVA = n^{−1}1'σ²I(I − n^{−1}J) = σ²n^{−1}1'(I − n^{−1}J) = 0,

because I − n^{−1}J is the ppm onto C(1)^⊥. Since functions of independent statistics are also independent, Ȳ and S² are also independent.
Result 5.20. Suppose that Y ~ N_p(μ, V). If BVA = 0, then Y'AY and Y'BY are independent.
Proof. Write A and B in their spectral decompositions; that is, write

    A = PDP' = [P_1 P_2][D_1 0; 0 0][P_1'; P_2'] = P_1D_1P_1',

where D_1 = diag(λ_1, λ_2, ..., λ_s) and s = r(A). Similarly, write

    B = QRQ' = [Q_1 Q_2][R_1 0; 0 0][Q_1'; Q_2'] = Q_1R_1Q_1',

where R_1 = diag(γ_1, γ_2, ..., γ_t) and t = r(B). Since P and Q are orthogonal, this implies that P_1'P_1 = I_s and Q_1'Q_1 = I_t. Suppose that BVA = 0. Then,

    0 = BVA = Q_1R_1Q_1'VP_1D_1P_1'
      ⟹  0 = Q_1'Q_1R_1Q_1'VP_1D_1P_1'P_1 = R_1Q_1'VP_1D_1
      ⟹  0 = R_1^{−1}R_1Q_1'VP_1D_1D_1^{−1} = Q_1'VP_1 = cov(Q_1'Y, P_1'Y).

Now,

    (P_1'; Q_1')Y ~ N( (P_1'μ; Q_1'μ), [P_1'VP_1  0; 0  Q_1'VQ_1] ).

That is, P_1'Y and Q_1'Y are jointly normal and uncorrelated; thus, they are independent. So are Y'P_1D_1P_1'Y and Y'Q_1R_1Q_1'Y. But A = P_1D_1P_1' and B = Q_1R_1Q_1', so we are done.
Example 5.5. Consider the general linear model

    Y = Xβ + ε,

where X is n × p with rank r ≤ p and ε ~ N_n(0, σ²I). Let P_X = X(X'X)⁻X' denote the perpendicular projection matrix onto C(X). We know that Y ~ N_n(Xβ, σ²I). In Example 5.3, we showed that

    Y'(I − P_X)Y/σ² ~ χ²_{n−r},

and that

    Y'P_XY/σ² ~ χ²_r(λ),

where λ = (Xβ)'Xβ/2σ². Note that

    Y'(I − P_X)Y/σ² = Y'{σ^{−2}(I − P_X)}Y = Y'AY,

where A = σ^{−2}(I − P_X). Also,

    Y'P_XY/σ² = Y'(σ^{−2}P_X)Y = Y'BY,

where B = σ^{−2}P_X. Applying Result 5.20, we have

    BVA = σ^{−2}P_X σ²I σ^{−2}(I − P_X) = 0,

that is, Y'(I − P_X)Y/σ² and Y'P_XY/σ² are independent quadratic forms. Thus, the statistic

    F = { Y'P_XY/r } / { Y'(I − P_X)Y/(n − r) }
      = { σ^{−2}Y'P_XY/r } / { σ^{−2}Y'(I − P_X)Y/(n − r) } ~ F_{r,n−r}{ (Xβ)'Xβ/2σ² },

a noncentral F distribution with degrees of freedom r (numerator) and n − r (denominator) and noncentrality parameter λ = (Xβ)'Xβ/2σ².

OBSERVATIONS:
• Note that if Xβ = 0, then F ~ F_{r,n−r} since the noncentrality parameter λ = 0.
• On the other hand, as the length of Xβ gets larger, so does λ. This shifts the noncentral F_{r,n−r}{(Xβ)'Xβ/2σ²} distribution to the right, because the noncentral F distribution is stochastically increasing in its noncentrality parameter.
• Therefore, large values of F are consistent with large values of ||Xβ||.
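Here is a minimal simulation sketch of Example 5.5 (the design matrix, β, σ, and replication count are made-up values): form the projection-based F statistic and compare its Monte Carlo mean with the noncentral F mean implied by the theory (scipy's nc equals 2λ = (Xβ)'Xβ/σ²).

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(714)
    n, sigma = 40, 1.5
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # rank r = 3 (assumption)
    beta = np.array([0.5, 0.3, -0.2])
    r = np.linalg.matrix_rank(X)
    PX = X @ np.linalg.pinv(X.T @ X) @ X.T

    def f_stat(y):
        num = y @ PX @ y / r
        den = y @ (np.eye(n) - PX) @ y / (n - r)
        return num / den

    Fs = np.array([f_stat(X @ beta + sigma * rng.normal(size=n)) for _ in range(5000)])
    lam = (X @ beta) @ (X @ beta) / (2 * sigma**2)               # notes' lambda
    print(Fs.mean(), stats.ncf(r, n - r, 2 * lam).mean())        # Monte Carlo vs theory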
5.7 Cochran's Theorem

REMARK: An important general notion in linear models is that sums of squares like Y'P_XY and Y'Y can be broken down into sums of squares of smaller pieces. We now discuss Cochran's Theorem (Result 5.21), which serves to explain why this is possible.

Result 5.21. Suppose that Y ~ N_n(μ, σ²I). Suppose that A_1, A_2, ..., A_k are n × n symmetric and idempotent matrices, where r(A_i) = s_i, for i = 1, 2, ..., k. If A_1 + A_2 + ··· + A_k = I_n, then Y'A_1Y/σ², Y'A_2Y/σ², ..., Y'A_kY/σ² follow independent χ²_{s_i}(λ_i) distributions, where λ_i = μ'A_iμ/2σ², for i = 1, 2, ..., k, and ∑_{i=1}^k s_i = n.
Outline of proof. If A_1 + A_2 + ··· + A_k = I_n, then A_i idempotent implies that both (a) A_iA_j = 0, for i ≠ j, and (b) ∑_{i=1}^k s_i = n hold. That Y'A_iY/σ² ~ χ²_{s_i}(λ_i) follows from Result 5.18 with V = σ²I. Because A_iA_j = 0, Result 5.20 with V = σ²I guarantees that Y'A_iY/σ² and Y'A_jY/σ² are independent.
IMPORTANCE: We now show how Cochran's Theorem can be used to deduce the joint distribution of the sums of squares in an analysis of variance. Suppose that we partition the design matrix X and the parameter vector β in Y = Xβ + ε into k + 1 parts, so that

    Y = (X_0 X_1 ··· X_k)(β_0', β_1', ..., β_k')' + ε,

where the dimensions of X_i and β_i are n × p_i and p_i × 1, respectively, and ∑_{i=0}^k p_i = p. We can now write this as a k + 1 part model (the full model):

    Y = X_0β_0 + X_1β_1 + ··· + X_kβ_k + ε.

Now consider fitting each of the k submodels

    Y = X_0β_0 + ε
    Y = X_0β_0 + X_1β_1 + ε
      ...
    Y = X_0β_0 + X_1β_1 + ··· + X_{k−1}β_{k−1} + ε,

and let R(β_0, β_1, ..., β_i) denote the regression (model) sum of squares from fitting the ith submodel, for i = 0, 1, ..., k; that is,

    R(β_0, β_1, ..., β_i) = Y'(X_0 X_1 ··· X_i)[(X_0 X_1 ··· X_i)'(X_0 X_1 ··· X_i)]⁻(X_0 X_1 ··· X_i)'Y = Y'P_{X*_i}Y,

where

    P_{X*_i} = (X_0 X_1 ··· X_i)[(X_0 X_1 ··· X_i)'(X_0 X_1 ··· X_i)]⁻(X_0 X_1 ··· X_i)'

is the perpendicular projection matrix onto C(X*_i), where X*_i = (X_0 X_1 ··· X_i) and i = 0, 1, ..., k. Clearly,

    C(X*_0) ⊂ C(X*_1) ⊂ ··· ⊂ C(X*_{k−1}) ⊂ C(X).

We can now partition the total sums of squares as

    Y'Y = R(β_0) + [R(β_0, β_1) − R(β_0)] + [R(β_0, β_1, β_2) − R(β_0, β_1)] + ···
          + [R(β_0, β_1, ..., β_k) − R(β_0, β_1, ..., β_{k−1})] + [Y'Y − R(β_0, β_1, ..., β_k)]
        = Y'A_0Y + Y'A_1Y + Y'A_2Y + ··· + Y'A_kY + Y'A_{k+1}Y,

where

    A_0 = P_{X*_0}
    A_i = P_{X*_i} − P_{X*_{i−1}},  i = 1, 2, ..., k − 1,
    A_k = P_X − P_{X*_{k−1}}
    A_{k+1} = I − P_X.
Note that A_0 + A_1 + A_2 + ··· + A_{k+1} = I. Note also that the A_i matrices are symmetric, for i = 0, 1, ..., k + 1, and that

    A_i² = (P_{X*_i} − P_{X*_{i−1}})(P_{X*_i} − P_{X*_{i−1}})
         = P_{X*_i}² − P_{X*_i}P_{X*_{i−1}} − P_{X*_{i−1}}P_{X*_i} + P_{X*_{i−1}}²
         = P_{X*_i} − P_{X*_{i−1}} = A_i,

for i = 1, 2, ..., k, since P_{X*_i}P_{X*_{i−1}} = P_{X*_{i−1}} and P_{X*_{i−1}}P_{X*_i} = P_{X*_{i−1}}. Thus, A_i is idempotent for i = 1, 2, ..., k. However, clearly A_0 = P_{X*_0} = X_0(X_0'X_0)⁻X_0' and A_{k+1} = I − P_X are also idempotent. Let

    s_0 = r(A_0) = r(X_0)
    s_i = r(A_i) = tr(A_i) = tr(P_{X*_i}) − tr(P_{X*_{i−1}}) = r(P_{X*_i}) − r(P_{X*_{i−1}}),  i = 1, 2, ..., k,
    s_{k+1} = r(A_{k+1}) = n − r(X).

It is easy to see that ∑_{i=0}^{k+1} s_i = n.
APPLICATION: Consider the Gauss-Markov model Y = Xβ + ε, where ε ~ N_n(0, σ²I) so that Y ~ N_n(Xβ, σ²I). Cochran's Theorem applies and we have

    (1/σ²)Y'A_0Y ~ χ²_{s_0}[λ_0 = (Xβ)'A_0Xβ/2σ²]
    (1/σ²)Y'A_iY ~ χ²_{s_i}[λ_i = (Xβ)'A_iXβ/2σ²],  i = 1, 2, ..., k,
    (1/σ²)Y'A_{k+1}Y = (1/σ²)Y'(I − P_X)Y ~ χ²_{s_{k+1}},

where s_{k+1} = n − r(X). Note that the last quadratic form follows a central χ² distribution because λ_{k+1} = (Xβ)'(I − P_X)Xβ/2σ² = 0. Cochran's Theorem also guarantees that the quadratic forms Y'A_0Y/σ², Y'A_1Y/σ², ..., Y'A_kY/σ², Y'A_{k+1}Y/σ² are independent.
ANOVA TABLE: The quadratic forms Y'A_iY, for i = 0, 1, ..., k + 1, and the degrees of freedom s_i = r(A_i) are often presented in the following ANOVA table:

    Source                          df        SS                                        Noncentrality
    β_0                             s_0       R(β_0)                                    λ_0 = (Xβ)'A_0Xβ/2σ²
    β_1 (after β_0)                 s_1       R(β_0, β_1) − R(β_0)                      λ_1 = (Xβ)'A_1Xβ/2σ²
    β_2 (after β_0, β_1)            s_2       R(β_0, β_1, β_2) − R(β_0, β_1)            λ_2 = (Xβ)'A_2Xβ/2σ²
    ...                             ...       ...                                       ...
    β_k (after β_0, ..., β_{k−1})   s_k       R(β_0, ..., β_k) − R(β_0, ..., β_{k−1})   λ_k = (Xβ)'A_kXβ/2σ²
    Residual                        s_{k+1}   Y'Y − R(β_0, ..., β_k)                    λ_{k+1} = 0
    Total                           n         Y'Y                                       (Xβ)'Xβ/2σ²

Note that if X_0 = 1, then β_0 = μ and R(β_0) = Y'P_1Y = nȲ². The R(·) notation will come in handy when we talk about hypothesis testing later. The sums of squares R(β_0), R(β_0, β_1) − R(β_0), ..., R(β_0, ..., β_k) − R(β_0, ..., β_{k−1}) are called the sequential sums of squares. These correspond to the Type I sums of squares printed out by SAS in the ANOVA and GLM procedures. We will also use the notation

    R(β_i | β_0, β_1, ..., β_{i−1}) = R(β_0, β_1, ..., β_i) − R(β_0, β_1, ..., β_{i−1}).
Example 5.6. Consider the one-way (fixed effects) analysis of variance model

    Y_ij = μ + τ_i + ε_ij,

for i = 1, 2, ..., a and j = 1, 2, ..., n_i, where the ε_ij are iid N(0, σ²) random variables. In matrix form, Y, X, and β are

    Y (n × 1) = (Y_11, Y_12, ..., Y_{a n_a})',

    X (n × p) = [ 1_{n_1}  1_{n_1}  0_{n_1}  ···  0_{n_1} ]
                [ 1_{n_2}  0_{n_2}  1_{n_2}  ···  0_{n_2} ]
                [   ...      ...      ...    ···    ...   ]
                [ 1_{n_a}  0_{n_a}  0_{n_a}  ···  1_{n_a} ],

    β (p × 1) = (μ, τ_1, τ_2, ..., τ_a)',

where p = a + 1 and n = ∑_i n_i. Note that we can write

    X = (X_0 X_1),

where X_0 = 1,

    X_1 = [ 1_{n_1}  0_{n_1}  ···  0_{n_1} ]
          [ 0_{n_2}  1_{n_2}  ···  0_{n_2} ]
          [   ...      ...    ···    ...   ]
          [ 0_{n_a}  0_{n_a}  ···  1_{n_a} ],

and β = (β_0, β_1')', where β_0 = μ and β_1 = (τ_1, τ_2, ..., τ_a)'. That is, we can express this model in the form

    Y = X_0β_0 + X_1β_1 + ε.

The submodel is

    Y = X_0β_0 + ε,
where X_0 = 1 and β_0 = μ. We have

    A_0 = P_1
    A_1 = P_X − P_1
    A_2 = I − P_X.

These matrices are clearly symmetric and idempotent. Also, note that

    s_0 + s_1 + s_2 = r(P_1) + r(P_X − P_1) + r(I − P_X) = 1 + (a − 1) + (n − a) = n

and that A_0 + A_1 + A_2 = I. Therefore, Cochran's Theorem applies and we have

    (1/σ²)Y'P_1Y ~ χ²_1[λ_0 = (Xβ)'P_1Xβ/2σ²]
    (1/σ²)Y'(P_X − P_1)Y ~ χ²_{a−1}[λ_1 = (Xβ)'(P_X − P_1)Xβ/2σ²]
    (1/σ²)Y'(I − P_X)Y ~ χ²_{n−a}.

Cochran's Theorem also guarantees that the quadratic forms Y'P_1Y/σ², Y'(P_X − P_1)Y/σ², and Y'(I − P_X)Y/σ² are independent. The sums of squares, using our new notation, are

    Y'P_1Y = R(μ)
    Y'(P_X − P_1)Y = R(μ, τ_1, ..., τ_a) − R(μ) ≡ R(τ_1, ..., τ_a | μ)
    Y'(I − P_X)Y = Y'Y − R(μ, τ_1, ..., τ_a).
The ANOVA table for this one-way analysis of variance model is

    Source                       df      SS                            Noncentrality
    μ                            1       R(μ)                          λ_0 = (Xβ)'P_1Xβ/2σ²
    τ_1, ..., τ_a (after μ)      a − 1   R(μ, τ_1, ..., τ_a) − R(μ)    λ_1 = (Xβ)'(P_X − P_1)Xβ/2σ²
    Residual                     n − a   Y'Y − R(μ, τ_1, ..., τ_a)     0
    Total                        n       Y'Y                           (Xβ)'Xβ/2σ²
F STATISTIC: Because

    (1/σ²)Y'(P_X − P_1)Y ~ χ²_{a−1}(λ_1)

and

    (1/σ²)Y'(I − P_X)Y ~ χ²_{n−a},

and because these two quadratic forms are independent, it follows that

    F = { Y'(P_X − P_1)Y/(a − 1) } / { Y'(I − P_X)Y/(n − a) } ~ F_{a−1,n−a}(λ_1).

This is the usual F statistic to test H_0: τ_1 = τ_2 = ··· = τ_a = 0.

• If H_0 is true, then Xβ ∈ C(1) and λ_1 = (Xβ)'(P_X − P_1)Xβ/2σ² = 0 since P_X − P_1 is the ppm onto C(1)^⊥ ∩ C(X). In this case, F ~ F_{a−1,n−a}, a central F distribution. A level α rejection region is therefore RR = {F : F > F_{a−1,n−a,α}}.

• If H_0 is not true, then Xβ ∉ C(1) and λ_1 = (Xβ)'(P_X − P_1)Xβ/2σ² > 0. In this case, F ~ F_{a−1,n−a}(λ_1), which is stochastically larger than the central F_{a−1,n−a} distribution. Therefore, if H_0 is not true, we would expect F to be large.
EXPECTED MEAN SQUARES: We already know that

    MSE = (n − a)^{−1}Y'(I − P_X)Y

is an unbiased estimator of σ², i.e., E(MSE) = σ². Recall that if V ~ χ²_n(λ), then E(V) = n + 2λ. Therefore,

    E{ (1/σ²)Y'(P_X − P_1)Y } = (a − 1) + (Xβ)'(P_X − P_1)Xβ/σ²

and

    E(MSR) = E{ (a − 1)^{−1}Y'(P_X − P_1)Y }
           = σ² + (Xβ)'(P_X − P_1)Xβ/(a − 1)
           = σ² + ∑_{i=1}^a n_i(τ_i − τ̄_+)²/(a − 1),

where τ̄_+ = a^{−1}∑_{i=1}^a τ_i. Again, note that if H_0: τ_1 = τ_2 = ··· = τ_a = 0 is true, then MSR is also an unbiased estimator of σ². Therefore, values of

    F = { Y'(P_X − P_1)Y/(a − 1) } / { Y'(I − P_X)Y/(n − a) }

should be close to 1 when H_0 is true and larger than 1 otherwise.
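A small sketch of the one-way ANOVA F statistic built from projection matrices (group sizes, effects, and σ below are arbitrary), checking that it matches the classical between/within mean-square ratio.

    import numpy as np

    rng = np.random.default_rng(714)
    ni = [4, 6, 5]                            # arbitrary group sizes
    mu, tau, sigma = 10.0, np.array([0.0, 1.0, -0.5]), 1.0
    a, n = len(ni), sum(ni)

    groups = np.repeat(np.arange(a), ni)
    X = np.column_stack([np.ones(n), (groups[:, None] == np.arange(a)).astype(float)])
    y = mu + tau[groups] + sigma * rng.normal(size=n)

    P1 = np.full((n, n), 1.0 / n)
    PX = X @ np.linalg.pinv(X.T @ X) @ X.T
    F_proj = (y @ (PX - P1) @ y / (a - 1)) / (y @ (np.eye(n) - PX) @ y / (n - a))

    # Classical between/within computation for comparison
    ybar = y.mean()
    gbar = np.array([y[groups == i].mean() for i in range(a)])
    ssb = sum(ni[i] * (gbar[i] - ybar) ** 2 for i in range(a))
    ssw = sum(((y[groups == i] - gbar[i]) ** 2).sum() for i in range(a))
    print(F_proj, (ssb / (a - 1)) / (ssw / (n - a)))   # identical up to rounding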
6 Statistical Inference

Complementary reading from Monahan: Chapter 6 (and revisit Sections 3.9 and 4.7).

6.1 Estimation

PREVIEW: Consider the general linear model Y = Xβ + ε, where X is n × p with rank r ≤ p and ε ~ N_n(0, σ²I). Note that this is the usual Gauss-Markov model with the additional assumption of normality. With this additional assumption, we can now rigorously pursue questions that deal with statistical inference. We start by examining minimum variance unbiased estimation and maximum likelihood estimation.

SUFFICIENCY: Under the assumptions stated above, we know that Y ~ N_n(Xβ, σ²I). Set θ = (β', σ²)'. The pdf of Y, for all y ∈ R^n, is given by

    f_Y(y|θ) = (2π)^{−n/2}(σ²)^{−n/2} exp{−(y − Xβ)'(y − Xβ)/2σ²}
             = (2π)^{−n/2}(σ²)^{−n/2} exp{−y'y/2σ² + y'Xβ/σ² − (Xβ)'Xβ/2σ²}
             = (2π)^{−n/2}(σ²)^{−n/2} exp{−(Xβ)'Xβ/2σ²} exp{−y'y/2σ² + β'X'y/σ²}
             = h(y)c(θ) exp{w_1(θ)t_1(y) + w_2(θ)'t_2(y)},

where h(y) = (2π)^{−n/2}I(y ∈ R^n), c(θ) = (σ²)^{−n/2} exp{−(Xβ)'Xβ/2σ²}, and

    w_1(θ) = −1/2σ²,  t_1(y) = y'y,
    w_2(θ) = β/σ²,    t_2(y) = X'y,

that is, Y has pdf in the exponential family (see Casella and Berger, Chapter 3). The family is full rank (i.e., it is not curved), so we know that T(Y) = (Y'Y, X'Y) is a complete sufficient statistic for θ. We also know that minimum variance unbiased estimators (MVUEs) of functions of θ are unbiased functions of T(Y).
Result 6.1. Consider the general linear model Y = Xβ + ε, where X is n × p with rank r ≤ p and ε ~ N_n(0, σ²I). The MVUE for an estimable function λ'β is given by λ'β̂, where β̂ = (X'X)⁻X'Y is any solution to the normal equations. The MVUE for σ² is

    MSE = (n − r)^{−1}Y'(I − P_X)Y = (n − r)^{−1}(Y'Y − β̂'X'Y),

where P_X is the perpendicular projection matrix onto C(X).
Proof. Both λ'β̂ and MSE are unbiased estimators of λ'β and σ², respectively. These estimators are also functions of T(Y) = (Y'Y, X'Y), the complete sufficient statistic. Thus, each estimator is the MVUE for its expected value.
MAXIMUM LIKELIHOOD: Consider the general linear model Y = Xβ + ε, where X is n × p with rank r ≤ p and ε ~ N_n(0, σ²I). The likelihood function for θ = (β', σ²)' is

    L(θ|y) = L(β, σ²|y) = (2π)^{−n/2}(σ²)^{−n/2} exp{−(y − Xβ)'(y − Xβ)/2σ²}.

Maximum likelihood estimators for β and σ² are found by maximizing

    log L(β, σ²|y) = −(n/2) log(2π) − (n/2) log σ² − (y − Xβ)'(y − Xβ)/2σ²

with respect to β and σ². For every value of σ², maximizing the loglikelihood is the same as minimizing Q(β) = (y − Xβ)'(y − Xβ); that is, the least squares estimator

    β̂ = (X'X)⁻X'Y

is also an MLE. Now substitute (y − Xβ̂)'(y − Xβ̂) = y'(I − P_X)y in for Q(β) and differentiate with respect to σ². The MLE of σ² is

    σ̂²_MLE = n^{−1}Y'(I − P_X)Y.

Note that the MLE for σ² is biased. The MLE is rarely used in practice; MSE is the conventional estimator for σ².
INVARIANCE: Under the normal GM model, the MLE for an estimable function λ'β is λ'β̂, where β̂ is any solution to the normal equations. This is true because of the invariance property of maximum likelihood estimators (see, e.g., Casella and Berger, Chapter 7). If λ'β is estimable, recall that λ'β̂ is unique even if β̂ is not.
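To illustrate the bias comparison numerically (the design, β, σ², and replication count are arbitrary choices), the sketch below simulates repeatedly from a normal GM model and compares the averages of σ̂²_MLE = n^{-1}Y'(I − P_X)Y and MSE = (n − r)^{-1}Y'(I − P_X)Y with the true σ².

    import numpy as np

    rng = np.random.default_rng(714)
    n, sigma2, reps = 15, 4.0, 20_000
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    beta = np.array([1.0, 2.0, -1.0])
    r = np.linalg.matrix_rank(X)
    M = np.eye(n) - X @ np.linalg.pinv(X.T @ X) @ X.T     # I - P_X

    rss = np.array([(lambda y: y @ M @ y)(X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n))
                    for _ in range(reps)])
    print("average of MLE:", (rss / n).mean(), "(biased low)")
    print("average of MSE:", (rss / (n - r)).mean(), "(unbiased; target =", sigma2, ")")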
6.2 Testing models

PREVIEW: We now provide a general discussion on testing reduced versus full models within a Gauss-Markov linear model framework. Assuming normality will allow us to derive the sampling distribution of the resulting test statistic.

PROBLEM: Consider the linear model

    Y = Xβ + ε,

where r(X) = r ≤ p, E(ε) = 0, and cov(ε) = σ²I. Note that these are our usual GM model assumptions. For the purposes of this discussion, we assume that this model (the full model) is a correct model for the data. Consider also the linear model

    Y = Wγ + ε,

where E(ε) = 0, cov(ε) = σ²I and C(W) ⊂ C(X). We call this a reduced model because the estimation space is smaller than in the full model. Our goal is to test whether or not the reduced model is also correct.

• If the reduced model is also correct, there is no reason not to use it. Smaller models are easier to interpret and fewer degrees of freedom are spent in estimating σ². Thus, there are practical and statistical advantages to using the reduced model if it is also correct.

• Hypothesis testing in linear models essentially reduces to putting a constraint on the estimation space C(X) in the full model. If C(W) = C(X), then the W model is a reparameterization of the X model and there is nothing to test.
RECALL: Let P_W and P_X denote the perpendicular projection matrices onto C(W) and C(X), respectively. Because C(W) ⊂ C(X), we know that P_X − P_W is the ppm onto C(P_X − P_W) = C(W)^⊥ ∩ C(X).

GEOMETRY: In a general reduced-versus-full model testing framework, we start by assuming the full model Y = Xβ + ε is essentially correct so that E(Y) = Xβ ∈ C(X). If the reduced model is also correct, then E(Y) = Wγ ∈ C(W) ⊂ C(X). Geometrically, performing a reduced-versus-full model test therefore requires the analyst to decide whether E(Y) is more likely to be in C(W) or in C(X) but not C(W). Under the full model, our estimate for E(Y) = Xβ is P_XY. Under the reduced model, our estimate for E(Y) = Wγ is P_WY.

• If the reduced model is correct, then P_XY and P_WY are estimates of the same thing, and P_XY − P_WY = (P_X − P_W)Y should be small.

• If the reduced model is not correct, then P_XY and P_WY are estimating different things, and P_XY − P_WY = (P_X − P_W)Y should be large.

• The decision about reduced model adequacy therefore hinges on assessing whether (P_X − P_W)Y is large or small. Note that (P_X − P_W)Y is the perpendicular projection of Y onto C(W)^⊥ ∩ C(X).
MOTIVATION: An obvious measure of the size of (P_X − P_W)Y is its squared length, that is,

    {(P_X − P_W)Y}'(P_X − P_W)Y = Y'(P_X − P_W)Y.

However, the length of (P_X − P_W)Y is also related to the sizes of C(X) and C(W). We therefore adjust for these sizes by using

    Y'(P_X − P_W)Y/r(P_X − P_W).

We now compute the expectation of this quantity when the reduced model is/is not correct. For notational simplicity, set r* = r(P_X − P_W). When the reduced model is correct, then

    E{Y'(P_X − P_W)Y/r*} = (1/r*)[ (Wγ)'(P_X − P_W)Wγ + tr{(P_X − P_W)σ²I} ]
                         = (1/r*){σ² tr(P_X − P_W)}
                         = (1/r*)(r*σ²) = σ².

This is correct because (P_X − P_W)Wγ = 0 and tr(P_X − P_W) = r(P_X − P_W) = r*. Thus, if the reduced model is correct, Y'(P_X − P_W)Y/r* is an unbiased estimator of σ².
When the reduced model is not correct, then

    E{Y'(P_X − P_W)Y/r*} = (1/r*)[ (Xβ)'(P_X − P_W)Xβ + tr{(P_X − P_W)σ²I} ]
                         = (1/r*){(Xβ)'(P_X − P_W)Xβ + r*σ²}
                         = σ² + (Xβ)'(P_X − P_W)Xβ/r*.

Thus, if the reduced model is not correct, Y'(P_X − P_W)Y/r* is estimating something larger than σ². Of course, σ² is unknown, so it must be estimated. Because the full model is assumed to be correct,

    MSE = (n − r)^{−1}Y'(I − P_X)Y,

the MSE from the full model, is an unbiased estimator of σ².
TEST STATISTIC: To test the reduced model versus the full model, we use

    F = { Y'(P_X − P_W)Y/r* } / MSE.

Using only our GM model assumptions (i.e., not necessarily assuming normality), we can surmise the following:

• When the reduced model is correct, the numerator and denominator of F are both unbiased estimators of σ², so F should be close to 1.

• When the reduced model is not correct, the numerator in F is estimating something larger than σ², so F should be larger than 1. Thus, values of F much larger than 1 are not consistent with the reduced model being correct.

• Values of F much smaller than 1 may mean something drastically different; see Christensen (2003).
OBSERVATIONS: In the numerator of F, note that

    Y'(P_X − P_W)Y = Y'P_XY − Y'P_WY = Y'(P_X − P_1)Y − Y'(P_W − P_1)Y,

which is the difference in the regression (model) sums of squares, corrected or uncorrected, from fitting the two models. Also, the term

    r* = r(P_X − P_W) = tr(P_X − P_W) = tr(P_X) − tr(P_W) = r(P_X) − r(P_W) = r − r_0,

say, where r_0 = r(P_W) = r(W). Thus, r* = r − r_0 is the difference in the ranks of the X and W matrices. This also equals the difference in the model degrees of freedom from the two ANOVA tables.

REMARK: You will note that we have formulated a perfectly sensible strategy for testing reduced versus full models while avoiding the question, "What is the distribution of F?" Our entire argument is based on first and second moment assumptions, that is, E(ε) = 0 and cov(ε) = σ²I, the GM assumptions. We now address the distributional question.
DISTRIBUTION OF F: To derive the sampling distribution of

    F = { Y'(P_X − P_W)Y/r* } / MSE,

we require that ε ~ N_n(0, σ²I), from which it follows that Y ~ N_n(Xβ, σ²I). First, we handle the denominator MSE = Y'(I − P_X)Y/(n − r). In Example 5.3 (notes), we showed that

    Y'(I − P_X)Y/σ² ~ χ²_{n−r}.

This distributional result holds regardless of whether or not the reduced model is correct. Now, we turn our attention to the numerator. Take A = σ^{−2}(P_X − P_W) and consider the quadratic form

    Y'AY = Y'(P_X − P_W)Y/σ².

With V = σ²I, the matrix

    AV = σ^{−2}(P_X − P_W)σ²I = P_X − P_W

is idempotent with rank r(P_X − P_W) = r*. Therefore, we know that Y'AY ~ χ²_{r*}(λ), where

    λ = (1/2)μ'Aμ = (1/2σ²)(Xβ)'(P_X − P_W)Xβ.

Now, we make the following observations:

• If the reduced model is correct and Xβ ∈ C(W), then (P_X − P_W)Xβ = 0 because P_X − P_W projects onto C(W)^⊥ ∩ C(X). This means that the noncentrality parameter λ = 0 and Y'(P_X − P_W)Y/σ² ~ χ²_{r*}, a central χ² distribution.

• If the reduced model is not correct and Xβ ∉ C(W), then (P_X − P_W)Xβ ≠ 0 and λ > 0. In this event, Y'(P_X − P_W)Y/σ² ~ χ²_{r*}(λ), a noncentral χ² distribution with noncentrality parameter λ.

• Regardless of whether or not the reduced model is correct, the quadratic forms Y'(P_X − P_W)Y and Y'(I − P_X)Y are independent since

    (P_X − P_W)σ²I(I − P_X) = σ²(P_X − P_W)(I − P_X) = 0.
CONCLUSION: Putting this all together, we have that

    F = { Y'(P_X − P_W)Y/r* } / MSE
      = { σ^{−2}Y'(P_X − P_W)Y/r* } / { σ^{−2}Y'(I − P_X)Y/(n − r) } ~ F_{r*,n−r}(λ),

where r* = r − r_0 and

    λ = (1/2σ²)(Xβ)'(P_X − P_W)Xβ.

If the reduced model is correct, that is, if Xβ ∈ C(W), then λ = 0 and F ~ F_{r*,n−r}, a central F distribution. Note also that if the reduced model is correct,

    E(F) = (n − r)/(n − r − 2) ≈ 1.

This reaffirms our (model free) assertion that values of F close to 1 are consistent with the reduced model being correct. Because the noncentral F family is stochastically increasing in λ, larger values of F are consistent with the reduced model not being correct.
SUMMARY: Consider the linear model Y = Xβ + ε, where X is n × p with rank r ≤ p and ε ~ N_n(0, σ²I). Suppose that we would like to test

    H_0: Y = Wγ + ε
    versus
    H_1: Y = Xβ + ε,

where C(W) ⊂ C(X). A level α rejection region is

    RR = {F : F > F_{r*,n−r,α}},

where r* = r − r_0, r_0 = r(W), and F_{r*,n−r,α} is the upper α quantile of the F_{r*,n−r} distribution.
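A hedged sketch of this reduced-versus-full test in code (all data below are simulated with made-up values): fit both models via projection matrices and compute F with its p-value from the central F reference distribution.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(714)
    n = 30
    x1, x2 = rng.normal(size=n), rng.normal(size=n)
    y = 1.0 + 0.8 * x1 + 0.0 * x2 + rng.normal(scale=1.2, size=n)   # x2 inert in this example

    X = np.column_stack([np.ones(n), x1, x2])    # full model
    W = np.column_stack([np.ones(n), x1])        # reduced model, C(W) in C(X)

    def ppm(M):
        return M @ np.linalg.pinv(M.T @ M) @ M.T

    PX, PW = ppm(X), ppm(W)
    r, r0 = np.linalg.matrix_rank(X), np.linalg.matrix_rank(W)
    rstar = r - r0

    F = (y @ (PX - PW) @ y / rstar) / (y @ (np.eye(n) - PX) @ y / (n - r))
    print(F, stats.f.sf(F, rstar, n - r))        # F statistic and p-value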
Example 6.1. Consider the simple linear regression model

    Y_i = β_0 + β_1(x_i − x̄) + ε_i,

for i = 1, 2, ..., n, where the ε_i are iid N(0, σ²). In matrix notation, Y = Xβ + ε with

    Y = (Y_1, Y_2, ..., Y_n)',  X = [ 1  x_1 − x̄ ; 1  x_2 − x̄ ; ... ; 1  x_n − x̄ ],
    β = (β_0, β_1)',  ε = (ε_1, ε_2, ..., ε_n)',

where ε ~ N_n(0, σ²I). Suppose that we would like to test whether the reduced model

    Y_i = β_0 + ε_i,

for i = 1, 2, ..., n, also holds. In matrix notation, the reduced model can be expressed as Y = Wγ + ε with

    W = (1, 1, ..., 1)' = 1  and  γ = β_0,

where ε ~ N_n(0, σ²I) and 1 is an n × 1 vector of ones. Note that C(W) ⊂ C(X) with r_0 = 1, r = 2, and r* = r − r_0 = 1. When the reduced model is correct,

    F = { Y'(P_X − P_W)Y/r* } / MSE ~ F_{1,n−2},

where MSE is the mean-squared error from the full model. When the reduced model is not correct, F ~ F_{1,n−2}(λ), where

    λ = (1/2σ²)(Xβ)'(P_X − P_W)Xβ = β_1² ∑_{i=1}^n (x_i − x̄)²/2σ².

Exercises: (a) Verify that this expression for the noncentrality parameter is correct. (b) Suppose that n is even and the values of x_i can be selected anywhere in the interval (d_1, d_2). How should we choose the x_i values to maximize the power of a level α test?
6.3 Testing linear parametric functions

PROBLEM: Consider our usual Gauss-Markov linear model with normal errors; i.e., Y = Xβ + ε, where X is n × p with rank r ≤ p and ε ~ N_n(0, σ²I). We now consider the problem of testing

    H_0: K'β = m
    versus
    H_1: K'β ≠ m,

where K is a p × s matrix with r(K) = s and m is s × 1.
Example 6.2. Consider the regression model Y_i = β_0 + β_1x_{i1} + β_2x_{i2} + β_3x_{i3} + β_4x_{i4} + ε_i, for i = 1, 2, ..., n. Express each hypothesis in the form H_0: K'β = m:

    1. H_0: β_1 = 0
    2. H_0: β_3 = β_4 = 0
    3. H_0: β_1 + β_3 = 1, β_2 − β_4 = 1
    4. H_0: β_2 = β_3 = β_4.
Example 6.3. Consider the analysis of variance model Y_ij = μ + τ_i + ε_ij, for i = 1, 2, 3, 4 and j = 1, 2, ..., n_i. Express each hypothesis in the form H_0: K'β = m:

    1. H_0: μ + τ_1 = 5, τ_3 − τ_4 = 1
    2. H_0: τ_1 − τ_2 = τ_3 − τ_4
    3. H_0: τ_1 = (1/3)(τ_2 + τ_3 + τ_4).
TERMINOLOGY: The general linear hypothesis H_0: K'β = m is said to be testable iff K has full column rank and each component of K'β is estimable. In other words, K'β contains s linearly independent estimable functions. Otherwise, H_0: K'β = m is said to be nontestable.
GOAL: Our goal is to develop a test for H_0: K'β = m, K'β estimable, in the Gauss-Markov model with normal errors. We start by noting that the BLUE of K'β is K'β̂, where β̂ is a least squares estimator of β. Also,

    K'β̂ = K'(X'X)⁻X'Y,

a linear function of Y, so K'β̂ follows an s-variate normal distribution with mean

    E(K'β̂) = K'β

and covariance matrix

    cov(K'β̂) = K'cov(β̂)K = σ²K'(X'X)⁻X'X(X'X)⁻K = σ²K'(X'X)⁻K = σ²H,

where H = K'(X'X)⁻K. That is, we have shown K'β̂ ~ N_s(K'β, σ²H).
NOTE: In the calculation above, note that K'(X'X)⁻X'X(X'X)⁻K = K'(X'X)⁻K only because K'β is estimable; i.e., K' = A'X for some A. It is also true that H is nonsingular.

LEMMA: If K'β is estimable, then H is nonsingular.
Proof. First note that H is an s × s matrix. We can write

    H = K'(X'X)⁻K = A'P_XA = A'P_XP_XA = (P_XA)'(P_XA),

since K' = A'X for some A (this follows since K'β is estimable). Therefore,

    r(H) = r{(P_XA)'(P_XA)} = r(P_XA)

and

    s = r(K) = r(X'A) = r(X'P_XA) ≤ r(P_XA) = r[X(X'X)⁻X'A] ≤ r(X'A) = s.

Therefore, r(H) = r(P_XA) = s, showing that H is nonsingular.
IMPLICATION: The lemma above is important, because it convinces us that the distribution of K'β̂ is full rank. Subtracting m, we have

    K'β̂ − m ~ N_s(K'β − m, σ²H).
F STATISTIC: Now, consider the quadratic form

    (K'β̂ − m)'(σ²H)^{−1}(K'β̂ − m).

Note that (σ²H)^{−1}σ²H = I_s, an idempotent matrix with rank s. Therefore,

    (K'β̂ − m)'(σ²H)^{−1}(K'β̂ − m) ~ χ²_s(λ),

where the noncentrality parameter

    λ = (1/2σ²)(K'β − m)'H^{−1}(K'β − m).

We have already shown that Y'(I − P_X)Y/σ² ~ χ²_{n−r}. Also,

    (K'β̂ − m)'(σ²H)^{−1}(K'β̂ − m)  and  Y'(I − P_X)Y/σ²

are independent (verify!). Thus, the ratio

    F = { (K'β̂ − m)'H^{−1}(K'β̂ − m)/s } / MSE
      = { (K'β̂ − m)'H^{−1}(K'β̂ − m)/s } / { Y'(I − P_X)Y/(n − r) }
      = { (K'β̂ − m)'(σ²H)^{−1}(K'β̂ − m)/s } / { σ^{−2}Y'(I − P_X)Y/(n − r) } ~ F_{s,n−r}(λ),

where λ = (K'β − m)'H^{−1}(K'β − m)/2σ². Note that if H_0: K'β = m is true, the noncentrality parameter λ = 0 and F ~ F_{s,n−r}. Therefore, a level α rejection region for the test of H_0: K'β = m versus H_1: K'β ≠ m is

    RR = {F : F > F_{s,n−r,α}}.
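A sketch of this test in code (the design, K, m, and data below are fabricated for illustration): estimate β by least squares, form H = K'(X'X)⁻K, and compute F with its p-value.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(714)
    n = 25
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # full-rank design
    beta_true = np.array([1.0, 0.5, 0.0, 0.0])
    y = X @ beta_true + rng.normal(size=n)

    XtX_inv = np.linalg.pinv(X.T @ X)          # generalized inverse of X'X (here a true inverse)
    beta_hat = XtX_inv @ X.T @ y
    r = np.linalg.matrix_rank(X)
    mse = y @ (np.eye(n) - X @ XtX_inv @ X.T) @ y / (n - r)

    # H0: beta_2 = beta_3 = 0, written as K'beta = m with s = 2 (an illustrative choice)
    K = np.zeros((4, 2)); K[2, 0] = 1.0; K[3, 1] = 1.0
    m = np.zeros(2)
    s = K.shape[1]

    H = K.T @ XtX_inv @ K
    diff = K.T @ beta_hat - m
    F = diff @ np.linalg.solve(H, diff) / s / mse
    print(F, stats.f.sf(F, s, n - r))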
SCALAR CASE: We now consider the special case of testing H_0: K'β = m when r(K) = 1, that is, K'β is a scalar estimable function. This function (and hypothesis) is perhaps more appropriately written as H_0: k'β = m to emphasize that k is a p × 1 vector and m is a scalar. Often, k is chosen in a way so that m = 0 (e.g., testing a contrast in an ANOVA model, etc.). The hypotheses in Example 6.2 (#1) and Example 6.3 (#3) are of this form. Testing a scalar hypothesis is a mere special case of the general test we have just derived. However, additional flexibility results in the scalar case; in particular, we can test for one-sided alternatives like H_1: k'β > m or H_1: k'β < m. We first discuss one more noncentral distribution.

TERMINOLOGY: Suppose Z ~ N(ν, 1) and V ~ χ²_k. If Z and V are independent, then

    T = Z/√(V/k)

follows a noncentral t distribution with k degrees of freedom and noncentrality parameter ν. We write T ~ t_k(ν). If ν = 0, the t_k(ν) distribution reduces to a central t distribution with k degrees of freedom. Note that T ~ t_k(ν) implies T² ~ F_{1,k}(ν²/2).
TEST PROCEDURE: Consider the Gauss-Markov linear model with normal errors; i.e., Y = Xβ + ε, where X is n × p with rank r ≤ p and ε ~ N_n(0, σ²I). Suppose that k'β is estimable; i.e., k' ∈ R(X), and that our goal is to test

    H_0: k'β = m
    versus
    H_1: k'β ≠ m

or versus a suitable one-sided alternative H_1. The BLUE for k'β is k'β̂, where β̂ is a least squares estimator. Straightforward calculations show that

    k'β̂ ~ N{k'β, σ²k'(X'X)⁻k}  ⟹  Z = (k'β̂ − k'β)/√{σ²k'(X'X)⁻k} ~ N(0, 1).

We know that V = Y'(I − P_X)Y/σ² ~ χ²_{n−r} and that Z and V are independent (verify!). Thus,

    T = { (k'β̂ − k'β)/√{σ²k'(X'X)⁻k} } / √{ σ^{−2}Y'(I − P_X)Y/(n − r) }
      = (k'β̂ − k'β)/√{MSE k'(X'X)⁻k} ~ t_{n−r}.

When H_0: k'β = m is true, the statistic

    T = (k'β̂ − m)/√{MSE k'(X'X)⁻k} ~ t_{n−r}.

Therefore, a level α rejection region, when H_1 is two-sided, is

    RR = {T : |T| > t_{n−r,α/2}}.

One-sided tests use rejection regions that are suitably adjusted. When H_0 is not true, T ~ t_{n−r}(ν), where

    ν = (k'β − m)/√{σ²k'(X'X)⁻k}.

This distribution is of interest for power and sample size calculations.
6.4 Testing models versus testing linear parametric functions

SUMMARY: Under our Gauss-Markov linear model Y = Xβ + ε, where X is n × p with rank r ≤ p and ε ~ N_n(0, σ²I), we have presented F statistics to test (a) a reduced model versus a full model in Section 6.2 and (b) a hypothesis defined by H_0: K'β = m, for K'β estimable, in Section 6.3. In fact, testing models and testing linear parametric functions essentially is the same thing, as we now demonstrate. For simplicity, we take m = 0, although the following argument can be generalized.

DISCUSSION: Consider the general linear model Y = Xβ + ε, where X is n × p with rank r ≤ p and β ∈ R^p. Consider the testable hypothesis H_0: K'β = 0. We can write this hypothesis in the following way:

    H_0: Y = Xβ + ε and K'β = 0.

We now find a reduced model that corresponds to this hypothesis. Note that K'β = 0 holds if and only if β ⟂ C(K). To identify the reduced model, pick a matrix U such that C(U) = C(K)^⊥. We then have

    K'β = 0  ⟺  β ⟂ C(K)  ⟺  β ∈ C(U)  ⟺  β = Uγ,

for some vector γ. Substituting β = Uγ into the linear model Y = Xβ + ε gives the reduced model Y = XUγ + ε, or letting W = XU, our hypothesis above can be written

    H_0: Y = Wγ + ε,

where C(W) ⊂ C(X).
OBSERVATION: When K'β is estimable, that is, when K' = D'X for some n × s matrix D, we can find the perpendicular projection matrix for testing H_0: K'β = 0 in terms of D and P_X. From Section 6.2, recall that the numerator sum of squares to test the reduced model Y = Wγ + ε versus the full model Y = Xβ + ε is Y'(P_X − P_W)Y, where P_X − P_W is the ppm onto C(P_X − P_W) = C(W)^⊥ ∩ C(X). For testing the estimable function K'β = 0, we now show that the ppm onto C(P_XD) is also the ppm onto C(W)^⊥ ∩ C(X); i.e., that C(P_XD) = C(W)^⊥ ∩ C(X).
PROPOSITION: C(P_X − P_W) = C(W)^⊥ ∩ C(X) = C(XU)^⊥ ∩ C(X) = C(P_XD).
Proof. We showed C(P_X − P_W) = C(W)^⊥ ∩ C(X) in Chapter 2 and W = XU, so the second equality is obvious. Suppose that v ∈ C(XU)^⊥ ∩ C(X). Then, v'XU = 0, so that X'v ⟂ C(U). Because C(U) = C(K)^⊥, we know that X'v ∈ C(K) = C(X'D), since K' = D'X. Thus,

    v = P_Xv = X(X'X)⁻X'v ∈ C[X(X'X)⁻X'D] = C(P_XD).

Suppose that v ∈ C(P_XD). Clearly, v ∈ C(X). Also, v = P_XDd, for some d, and

    v'XU = d'D'P_XXU = d'D'XU = d'K'U = 0,

because C(U) = C(K)^⊥. Thus, v ∈ C(XU)^⊥ ∩ C(X).
IMPLICATION: It follows immediately that the numerator sum of squares for testing the reduced model Y = Wγ + ε versus the full model Y = Xβ + ε is Y'M_{P_XD}Y, where

    M_{P_XD} = P_XD[(P_XD)'(P_XD)]⁻(P_XD)' = P_XD(D'P_XD)⁻D'P_X

is the ppm onto C(P_XD). If ε ~ N_n(0, σ²I), the resulting test statistic

    F = { Y'M_{P_XD}Y/r(M_{P_XD}) } / { Y'(I − P_X)Y/r(I − P_X) } ~ F_{r(M_{P_XD}), r(I−P_X)}(λ),

where the noncentrality parameter

    λ = (1/2σ²)(Xβ)'M_{P_XD}Xβ.
GOAL: Our goal now is to show that the F statistic above is the same F statistic we derived in Section 6.3 with m = 0, that is,

    F = { (K'β̂)'H^{−1}K'β̂/s } / { Y'(I − P_X)Y/(n − r) }.

Recall that this statistic was derived for the testable hypothesis H_0: K'β = 0. First, we show that r(M_{P_XD}) = s, where, recall, s = r(K). To do this, it suffices to show that r(K) = r(P_XD). Because K'β is estimable, we know that K' = D'X, for some D. Writing K = X'D, we see that for any vector a,

    X'Da = 0  ⟺  Da ⟂ C(X),

which occurs iff P_XDa = 0. Note that the X'Da = 0 ⟺ P_XDa = 0 equivalence implies that

    N(X'D) = N(P_XD)  ⟺  C(D'X)^⊥ = C(D'P_X)^⊥  ⟺  C(D'X) = C(D'P_X)

so that r(D'X) = r(D'P_X). But K' = D'X, so r(K) = r(D'X) = r(D'P_X) = r(P_XD), which is what we set out to prove. Now, consider the quadratic form Y'M_{P_XD}Y, and let β̂ denote a least squares estimator of β. Result 2.5 says that Xβ̂ = P_XY, so that K'β̂ = D'Xβ̂ = D'P_XY. Substitution gives

    Y'M_{P_XD}Y = Y'P_XD(D'P_XD)⁻D'P_XY
                = (K'β̂)'[D'X(X'X)⁻X'D]⁻K'β̂
                = (K'β̂)'[K'(X'X)⁻K]⁻K'β̂.

Recalling that H = K'(X'X)⁻K and that H is nonsingular (when K'β is estimable) should convince you that the numerator sums of squares in

    F = { Y'M_{P_XD}Y/r(M_{P_XD}) } / { Y'(I − P_X)Y/r(I − P_X) }

and

    F = { (K'β̂)'H^{−1}K'β̂/s } / { Y'(I − P_X)Y/(n − r) }

are equal. We already showed that r(M_{P_XD}) = s, and because r(I − P_X) = n − r, we are done.
6.5 Likelihood ratio tests

6.5.1 Constrained estimation

REMARK: In the linear model Y = Xβ + ε, where E(ε) = 0 (note the minimal assumptions), we have, up until now, allowed the p × 1 parameter vector β to take on any value in R^p; that is, we have made no restrictions on the parameters in β. We now consider the case where β is restricted to the subspace of R^p consisting of values of β that satisfy P'β = δ, where P is a p × q matrix and δ is q × 1. To avoid technical difficulties, we will assume that the system P'β = δ is consistent; i.e., δ ∈ C(P'). Otherwise, the set {β ∈ R^p : P'β = δ} could be empty.
PROBLEM: In the linear model Y = Xβ + ε, where E(ε) = 0, we would like to minimize

    Q(β) = (Y − Xβ)'(Y − Xβ)

subject to the constraint that P'β = δ. Essentially, this requires us to find the minimum value of Q(β) over the linear subspace {β ∈ R^p : P'β = δ}. This is a restricted minimization problem and standard Lagrangian methods apply; see Appendix B in Monahan. The Lagrangian a(β, θ) is a function of β and the Lagrange multipliers in θ and can be written as

    a(β, θ) = (Y − Xβ)'(Y − Xβ) + 2θ'(P'β − δ).

Taking partial derivatives, we have

    ∂a(β, θ)/∂β = −2X'Y + 2X'Xβ + 2Pθ
    ∂a(β, θ)/∂θ = 2(P'β − δ).

Setting these equal to zero leads to the restricted normal equations (RNEs), that is,

    [ X'X  P ; P'  0 ][ β ; θ ] = [ X'Y ; δ ].

Denote by β̂_H and θ̂_H the solutions to the RNEs, respectively. The solution β̂_H is called a restricted least squares estimator.
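A small numerical sketch of restricted least squares via the RNE block system (the design, the constraint P'β = δ, and the data are invented for illustration): solve the block system and check that the constraint holds at the solution.

    import numpy as np

    rng = np.random.default_rng(714)
    n, p = 20, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

    # Constraint P'beta = delta: here beta_1 + beta_2 = 0 (q = 1), an arbitrary example
    P = np.array([[0.0], [1.0], [1.0]])
    delta = np.array([0.0])
    q = P.shape[1]

    # Restricted normal equations: [[X'X, P], [P', 0]] [beta; theta] = [X'y; delta]
    top = np.hstack([X.T @ X, P])
    bot = np.hstack([P.T, np.zeros((q, q))])
    rhs = np.concatenate([X.T @ y, delta])
    sol = np.linalg.solve(np.vstack([top, bot]), rhs)
    beta_H, theta_H = sol[:p], sol[p:]
    print(beta_H, P.T @ beta_H)        # constrained estimate; constraint satisfied (about 0)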
DISCUSSION: We now present some facts regarding this restricted linear model and its (restricted) least squares estimator. We have proven all of these facts for the unrestricted model; restricted versions of the proofs are all in Monahan.

1. The restricted normal equations are consistent; see Result 3.8, Monahan (pp 62-63).

2. A solution β̂_H minimizes Q(β) over the set T = {β ∈ R^p : P'β = δ}; see Result 3.9, Monahan (pp 63).

3. The function λ'β is estimable in the restricted model

       Y = Xβ + ε,  E(ε) = 0,  P'β = δ,

   if and only if λ' = a'X + d'P', for some a and d, that is, λ' lies in the row space of the matrix formed by stacking X on top of P'. See Result 3.7, Monahan (pp 60).

4. If λ'β is estimable in the unrestricted model, i.e., the model without the linear restriction, then λ'β is estimable in the restricted model. The converse is not true.

5. Under the GM model assumptions, if λ'β is estimable in the restricted model, then λ'β̂_H is the BLUE of λ'β in the restricted model. See Result 4.5, Monahan (pp 89-90).
6.5.2 Testing procedure

SETTING: Consider the Gauss-Markov linear model Y = Xβ + ε, where X is n × p with rank r ≤ p and ε ~ N_n(0, σ²I). We are interested in deriving the likelihood ratio test (LRT) for

    H_0: K'β = m
    versus
    H_1: K'β ≠ m,

where K is a p × s matrix with r(K) = s and m is s × 1. We assume that H_0: K'β = m is testable, that is, K'β is estimable.

RECALL: A likelihood ratio testing procedure is intuitive. One simply compares the maximized likelihood over the restricted parameter space (that is, the space under H_0) to the maximized likelihood over the entire parameter space. If the former is small when compared to the latter, then there is a large amount of evidence against H_0.
DERIVATION: Under our model assumptions, we know that Y ~ N_n(Xβ, σ²I). The likelihood function for θ = (β', σ²)' is

    L(θ|y) = L(β, σ²|y) = (2πσ²)^{−n/2} exp{−Q(β)/2σ²},

where Q(β) = (y − Xβ)'(y − Xβ). The unrestricted parameter space is

    Θ = {θ : β ∈ R^p, σ² ∈ R^+}.

The restricted parameter space, that is, the parameter space under H_0: K'β = m, is

    Θ_0 = {θ : β ∈ R^p, K'β = m, σ² ∈ R^+}.

The likelihood ratio statistic is

    λ(Y) = sup_{θ∈Θ_0} L(θ|Y) / sup_{θ∈Θ} L(θ|Y).

We reject the null hypothesis H_0 for small values of λ = λ(Y). Thus, to perform a level α test, reject H_0 when λ < c, where c ∈ (0, 1) is chosen to satisfy P_{H_0}{λ(Y) ≤ c} = α.

We have seen (Section 6.1) that the unrestricted MLEs of β and σ² are

    β̂ = (X'X)⁻X'Y  and  σ̂² = Q(β̂)/n.

Similarly, maximizing L(θ|y) over Θ_0 produces the solutions β̂_H and σ̂²_H = Q(β̂_H)/n, where β̂_H is any solution to

    [ X'X  K ; K'  0 ][ β ; θ ] = [ X'Y ; m ],

the restricted normal equations. Algebra shows that

    λ = L(β̂_H, σ̂²_H|Y) / L(β̂, σ̂²|Y) = ( σ̂²/σ̂²_H )^{n/2} = { Q(β̂)/Q(β̂_H) }^{n/2}.

More algebra shows that

    { Q(β̂)/Q(β̂_H) }^{n/2} < c  ⟺  { Q(β̂_H) − Q(β̂) }/s / { Q(β̂)/(n − r) } > c*,

where s = r(K) and c* = s^{−1}(n − r)(c^{−2/n} − 1). Furthermore, Monahan's Theorem 6.1 (pp 139-140) shows that when K'β is estimable,

    Q(β̂_H) − Q(β̂) = (K'β̂ − m)'H^{−1}(K'β̂ − m),

where H = K'(X'X)⁻K. Applying this result, and noting that Q(β̂)/(n − r) = MSE, we see that

    { Q(β̂_H) − Q(β̂) }/s / { Q(β̂)/(n − r) } > c*  ⟺  F = { (K'β̂ − m)'H^{−1}(K'β̂ − m)/s } / MSE > c*.

That is, the LRT specifies that we reject H_0 when F is large. Choosing c* = F_{s,n−r,α} provides a level α test. Therefore, under the Gauss-Markov model with normal errors, the LRT for H_0: K'β = m is the same test as that in Section 6.3.
6.6 Confidence intervals

6.6.1 Single intervals

PROBLEM: Consider the Gauss-Markov linear model Y = Xβ + ε, where X is n × p with rank r ≤ p and ε ~ N_n(0, σ²I). Suppose that λ'β is estimable, that is, λ' = a'X, for some vector a. Our goal is to write a 100(1 − α) percent confidence interval for λ'β.
DERIVATION: We start with the obvious point estimator λ'β̂, the least squares estimator (and MLE) of λ'β. Under our model assumptions, we know that

    λ'β̂ ~ N{λ'β, σ²λ'(X'X)⁻λ}

and, hence,

    Z = (λ'β̂ − λ'β)/√{σ²λ'(X'X)⁻λ} ~ N(0, 1).

If σ² was known, our work would be done as Z is a pivot. More likely, this is not the case, so we must estimate it. An obvious point estimator for σ² is MSE, where

    MSE = (n − r)^{−1}Y'(I − P_X)Y.

We consider the quantity

    T = (λ'β̂ − λ'β)/√{MSE λ'(X'X)⁻λ}

and subsequently show that T ~ t_{n−r}. Note that

    T = { (λ'β̂ − λ'β)/√{σ²λ'(X'X)⁻λ} } / √{ (n − r)^{−1}Y'(I − P_X)Y/σ² } ~ N(0, 1)/√{χ²_{n−r}/(n − r)}.

To verify that T ~ t_{n−r}, it remains only to show that Z and Y'(I − P_X)Y/σ² are independent, or equivalently, that λ'β̂ and Y'(I − P_X)Y are, since Z is a function of λ'β̂ and since σ² is not random. Note that

    λ'β̂ = a'X(X'X)⁻X'Y = a'P_XY,

a linear function of Y. Using Result 5.19, λ'β̂ and Y'(I − P_X)Y are independent since a'P_X σ²I(I − P_X) = 0. Thus, T ~ t_{n−r}; i.e., T is a pivot, so that

    pr{ −t_{n−r,α/2} < (λ'β̂ − λ'β)/√{MSE λ'(X'X)⁻λ} < t_{n−r,α/2} } = 1 − α.

Algebra shows that this probability statement is the same as

    pr{ λ'β̂ − t_{n−r,α/2}√{MSE λ'(X'X)⁻λ} < λ'β < λ'β̂ + t_{n−r,α/2}√{MSE λ'(X'X)⁻λ} } = 1 − α,

showing that

    λ'β̂ ± t_{n−r,α/2}√{MSE λ'(X'X)⁻λ}

is a 100(1 − α) percent confidence interval for λ'β.
Example 6.4. Recall the simple linear regression model

    Y_i = β_0 + β_1x_i + ε_i,

for i = 1, 2, ..., n, where ε_1, ε_2, ..., ε_n are iid N(0, σ²). Recall also that the least squares estimator of β = (β_0, β_1)' is

    β̂ = (X'X)^{−1}X'Y = ( β̂_0, β̂_1 )',  with  β̂_0 = Ȳ − β̂_1x̄  and  β̂_1 = ∑_i(x_i − x̄)(Y_i − Ȳ)/∑_i(x_i − x̄)²,

and that the covariance matrix of β̂ is

    cov(β̂) = σ²(X'X)^{−1} = σ² [ 1/n + x̄²/∑_i(x_i − x̄)²     −x̄/∑_i(x_i − x̄)² ;
                                  −x̄/∑_i(x_i − x̄)²           1/∑_i(x_i − x̄)²  ].

We now consider the problem of writing a 100(1 − α) percent confidence interval for

    E(Y|x = x_0) = β_0 + β_1x_0,

the mean response of Y when x = x_0. Note that E(Y|x = x_0) = β_0 + β_1x_0 = λ'β, where λ' = (1 x_0). Also, λ'β is estimable because this is a regression model, so our previous work applies. The least squares estimator (and MLE) of E(Y|x = x_0) is β̂_0 + β̂_1x_0. Straightforward algebra (verify!) shows that

    λ'(X'X)^{−1}λ = (1 x_0) [ 1/n + x̄²/∑_i(x_i − x̄)²     −x̄/∑_i(x_i − x̄)² ;
                              −x̄/∑_i(x_i − x̄)²           1/∑_i(x_i − x̄)²  ] (1, x_0)'
                  = 1/n + (x_0 − x̄)²/∑_i(x_i − x̄)².

Thus, a 100(1 − α) percent confidence interval for λ'β = E(Y|x = x_0) is

    (β̂_0 + β̂_1x_0) ± t_{n−2,α/2} √{ MSE [ 1/n + (x_0 − x̄)²/∑_i(x_i − x̄)² ] }.
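A brief sketch of Example 6.4 in code (the data are simulated with arbitrary values): compute the interval for E(Y|x = x_0) both from the general formula λ'β̂ ± t_{n−r,α/2}√(MSE λ'(X'X)⁻λ) and from the closed-form expression above, to confirm they agree.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(714)
    n, x0, alpha = 20, 1.2, 0.05
    x = rng.uniform(-2, 2, size=n)
    y = 0.7 + 1.5 * x + rng.normal(scale=0.8, size=n)

    X = np.column_stack([np.ones(n), x])
    XtX_inv = np.linalg.inv(X.T @ X)
    bhat = XtX_inv @ X.T @ y
    mse = ((y - X @ bhat) ** 2).sum() / (n - 2)
    lam = np.array([1.0, x0])
    tcrit = stats.t.ppf(1 - alpha / 2, n - 2)
    est = lam @ bhat

    # General formula
    half = tcrit * np.sqrt(mse * lam @ XtX_inv @ lam)
    print(est - half, est + half)

    # Closed-form expression for simple linear regression
    se2 = mse * (1 / n + (x0 - x.mean()) ** 2 / ((x - x.mean()) ** 2).sum())
    print(est - tcrit * np.sqrt(se2), est + tcrit * np.sqrt(se2))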
6.6.2 Multiple intervals

PROBLEM: Consider the Gauss-Markov linear model Y = Xβ + ε, where X is n × p with rank r ≤ p and ε ~ N_n(0, σ²I). We now consider the problem of writing simultaneous confidence intervals for the k estimable functions λ_1'β, λ_2'β, ..., λ_k'β. Let the p × k matrix Λ = (λ_1 λ_2 ··· λ_k) and consider

    Λ'β = (λ_1'β, λ_2'β, ..., λ_k'β)'.

Because Λ'β is estimable, it follows that

    Λ'β̂ ~ N_k(Λ'β, σ²H),

where H = Λ'(X'X)⁻Λ. Furthermore, because λ_1'β̂, ..., λ_k'β̂ are jointly normal, we have that

    λ_j'β̂ ~ N(λ_j'β, σ²h_jj),

where h_jj is the jth diagonal element of H. Using our previous results, we know that

    λ_j'β̂ ± t_{n−r,α/2}√(σ̂²h_jj),

where σ̂² = MSE, is a 100(1 − α) percent confidence interval for λ_j'β, that is,

    pr{ λ_j'β̂ − t_{n−r,α/2}√(σ̂²h_jj) < λ_j'β < λ_j'β̂ + t_{n−r,α/2}√(σ̂²h_jj) } = 1 − α.

This statement is true for a single interval.
SIMULTANEOUS COVERAGE: To investigate the simultaneous coverage probability of the set of intervals

    { λ_j'β̂ ± t_{n−r,α/2}√(σ̂²h_jj), j = 1, 2, ..., k },

let E_j denote the event that interval j contains λ_j'β, that is, pr(E_j) = 1 − α, for j = 1, 2, ..., k. The probability that each of the k intervals includes its target λ_j'β is

    pr( ∩_{j=1}^k E_j ) = 1 − pr( ∪_{j=1}^k E_j^c ),

by DeMorgan's Law. In turn, Boole's Inequality says that

    pr( ∪_{j=1}^k E_j^c ) ≤ ∑_{j=1}^k pr(E_j^c) = kα.

Thus, the probability that each interval contains its intended target is

    pr( ∩_{j=1}^k E_j ) ≥ 1 − kα.

Obviously, this lower bound 1 − kα can be quite a bit lower than 1 − α; that is, the simultaneous coverage probability of the set of intervals

    { λ_j'β̂ ± t_{n−r,α/2}√(σ̂²h_jj), j = 1, 2, ..., k }

can be much lower than the single interval coverage probability.
GOAL: We would like the set of intervals {λ_j'β̂ ± d√(σ̂²h_jj), j = 1, 2, ..., k} to have a simultaneous coverage probability of at least 1 − α. Here, d represents a probability point that guarantees the desired simultaneous coverage. Because taking d = t_{n−r,α/2} does not guarantee this minimum, we need to take d to be larger.

BONFERRONI: From the argument on the last page, it is clear that if one takes d = t_{n−r,α/2k}, then

    pr( ∩_{j=1}^k E_j ) ≥ 1 − k(α/k) = 1 − α.

Thus, 100(1 − α) percent simultaneous confidence intervals for λ_1'β, λ_2'β, ..., λ_k'β are

    λ_j'β̂ ± t_{n−r,α/2k}√(σ̂²h_jj)  for j = 1, 2, ..., k.
SCHEFFÉ: The idea behind Scheffé's approach is to consider an arbitrary linear combination of Λ'β, say, u'Λ'β, and construct a confidence interval

    C(u, d) = ( u'Λ'β̂ − d√(σ̂²u'Hu), u'Λ'β̂ + d√(σ̂²u'Hu) ),

where d is chosen so that

    pr{u'Λ'β ∈ C(u, d), for all u} = 1 − α.

Since d is chosen in this way, one guarantees the necessary simultaneous coverage probability for all possible linear combinations of Λ'β (an infinite number of combinations). Clearly, the desired simultaneous coverage is then conferred for the k functions of interest λ_j'β, j = 1, 2, ..., k; these functions result from taking u to be the standard unit vectors. The argument in Monahan (pp 144) shows that d = (kF_{k,n−r,α})^{1/2}.
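As a closing sketch (hypothetical numbers throughout), here is how the three probability points compare for k intervals; the Bonferroni and Scheffé multipliers both exceed the per-interval t point, with Scheffé paying for protection over all linear combinations.

    import numpy as np
    from scipy import stats

    alpha, k, n, r = 0.05, 4, 30, 3          # arbitrary: 4 intervals, n - r = 27 error df
    df = n - r

    d_single = stats.t.ppf(1 - alpha / 2, df)                # per-interval t point
    d_bonf = stats.t.ppf(1 - alpha / (2 * k), df)            # Bonferroni: alpha/(2k)
    d_scheffe = np.sqrt(k * stats.f.ppf(1 - alpha, k, df))   # Scheffe: (k F_{k,n-r,alpha})^{1/2}
    print(d_single, d_bonf, d_scheffe)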