
Chapter 8

Model selection and assessment

This chapter is partly based on the book ‘The Elements of Statistical Learning’
by Hastie, Tibshirani and Friedman.

8.1 Terminology

• ‘Model selection’ is choosing the ‘best’ model among several candidates by
comparing estimates of model performance. Having chosen a final model, ‘model
assessment’ is estimating its generalization error (or prediction error) on new
data.

• We are given a ‘training data set’ T , consisting of N pairs ( X1 , y1 ), · · · , ( X_N , y_N ),
often grouped together into a matrix and a vector T = ( X, y). Suppose we fit
y = f ( X ) + e using the training dataset T . Our aim is to use this training set to
find the function f̂ ( X ) that best estimates f ( X ).

• Loss function: Finding the ‘best’ estimate implies we need a measure of how
‘good’ an estimate is. One such measure, called a loss function, is denoted by
L(y, ŷ). Given X and the true y, this function tells us how ‘good’ our prediction
ŷ = f̂ ( X ) is. The squared loss function is L(y, ŷ) = (y − ŷ)² and the absolute loss
function is L(y, ŷ) = |y − ŷ|.

• We would like a model with a small value of the loss function, but this loss
cannot be evaluated directly, since a model must be selected before we observe
new data.

• Definition 1. Test error (or generalization error)

Err_T = E( L(y, f̂ ( X )) | T ),

where ( X, y) are randomly drawn from their joint distribution (population).
This is the prediction error over an independent test sample. Here the training
set T is fixed, and the test error refers to the error made by f̂ for this specific
training set.

• Definition 2. Expected prediction error (or expected test error)

Err = E( L(y, f̂ ( X )) ) = E( Err_T ).

Note that the above expectation averages over everything that is random, in-
cluding the randomness in the training set that produced f̂.

 Err is more amenable to statistical analysis, as most methods effectively esti-
mate the expected error rather than Err_T .

• Definition 3. Training error: Training error is the average loss over the training
sample,

err = N^{-1} Σ_{i=1}^N L(y_i , f̂ ( x_i )),

where f̂ ( x_i ) is the fitted value obtained using {( x_i , y_i )}_{i=1}^N .

 Note that OLS minimizes this training error under the squared loss function.
Typically, the training error will understate the generalization error (why?).
Whereas the ‘best’ model lies at some intermediate level of complexity, a naive
minimization of the training error would lead us to choose the most complex
model possible instead, as the small simulation below illustrates.
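 A minimal simulation sketch of this point (not from the notes; the cubic ‘true’ function, the noise level and the degree grid are arbitrary choices): polynomials of increasing degree drive the training error down monotonically, while the error on an independent test sample eventually rises.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 50
    f = lambda x: x - 0.5 * x**3                    # assumed 'true' regression function
    x = rng.uniform(-2, 2, N)
    y = f(x) + rng.normal(0, 1, N)                  # training set T
    x_new = rng.uniform(-2, 2, 5000)                # large independent test sample
    y_new = f(x_new) + rng.normal(0, 1, 5000)

    for degree in [1, 3, 6, 10]:
        beta = np.polyfit(x, y, degree)             # OLS fit of a polynomial of this degree
        train_err = np.mean((y - np.polyval(beta, x))**2)         # err
        test_err = np.mean((y_new - np.polyval(beta, x_new))**2)  # estimate of Err_T
        print(f"degree {degree:2d}: err = {train_err:.3f}, test error = {test_err:.3f}")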



• Definition 4. Expected prediction error at X = x0

Err( x0 ) = E_{y_NEW, T} { L(y_NEW , f̂ ( x0 )) | X = x0 }.

For the squared loss function,

Err( x0 ) = E_{y_NEW, T} [ (y_NEW − f̂ ( x0 ))² | X = x0 ]
         = E_{y_NEW, T} [ (y_NEW − f ( x0 ) + f ( x0 ) − E_T { f̂ ( x0 )} + E_T { f̂ ( x0 )} − f̂ ( x0 ))² | X = x0 ]
         = σ_e² + [ E{ f̂ ( x0 )} − f ( x0 )]² + E_T [ f̂ ( x0 ) − E f̂ ( x0 )]²
         = σ_e² + bias²( f̂ ( x0 )) + var( f̂ ( x0 )).

In the context of the usual linear model (i.e. with the design matrix X given),

Err( x0 ) = E_{y_NEW, T} [ (y_NEW − x0 β̂ )² | X = x0 ]
         = σ_e² + [ E{ f̂ ( x0 )} − f ( x0 )]² + x0 ( X^T X )^{-1} x0^T σ_e².

While the variance part (the third term) changes with x0 , its average (with x0
taken to be each of the sample values x_i ) is ((p + 1)/N) σ_e². Then, we have

N^{-1} Σ_{i=1}^N Err( x_i ) = σ_e² + N^{-1} Σ_{i=1}^N [ f ( x_i ) − E( f̂ ( x_i ))]² + ((p + 1)/N) σ_e²,

where the model complexity is directly related to the number of parameters. (A Monte Carlo
check of this decomposition is sketched below.)
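 A small Monte Carlo sketch of the decomposition (purely illustrative, not from the notes; the quadratic truth, the straight-line fit and the point x0 = 0.9 are arbitrary choices): it estimates Err(x0), the squared bias and the variance by repeatedly redrawing the training set, and checks that the three-term identity holds.

    import numpy as np

    rng = np.random.default_rng(1)
    sigma = 1.0
    f = lambda x: 1 + 2 * x + x**2                  # assumed true mean function
    x_train = np.linspace(0, 1, 20)
    x0 = 0.9                                        # point at which Err(x0) is decomposed

    fhat_x0, sq_err = [], []
    for _ in range(20000):                          # average over T and y_NEW
        y = f(x_train) + rng.normal(0, sigma, x_train.size)
        b = np.polyfit(x_train, y, 1)               # deliberately misspecified straight-line fit
        pred = np.polyval(b, x0)
        fhat_x0.append(pred)
        sq_err.append((f(x0) + rng.normal(0, sigma) - pred) ** 2)

    fhat_x0 = np.array(fhat_x0)
    bias2 = (fhat_x0.mean() - f(x0)) ** 2
    var = fhat_x0.var()
    print("Err(x0), Monte Carlo       :", np.mean(sq_err))
    print("sigma^2 + bias^2 + variance:", sigma**2 + bias2 + var)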

• Definition 5. In-sample error:

In-sample error is the average prediction error evaluated at x0 = x1 , x2 , · · · , x_N .
It is obtained when new responses are observed for the training set features
given X,

Err_in = N^{-1} Σ_{i=1}^N E_{y_i^NEW} { L(y_i^NEW , f̂ ( x_i )) | T }.

 Please note that there are two versions of in-sample error. Another version is
N^{-1} Σ_{i=1}^N Err( x_i ). Note that the expectation in the first version is under y_i^NEW ,
while the expectation in the second version is under y_NEW and T .

 Err_T can be thought of as extra-sample error since the test input vectors do
not need to coincide with the training input vectors.

8.2 Bias and variance decomposition

• Bias-variance trade-off: For the squared loss function, recall that we have

Err( x0 ) = σ_e² + bias²( f̂ ( x0 )) + var( f̂ ( x0 )),

which consists of three components.

 The irreducible error σ_e² in y_NEW : this is the variance of the target around
its true mean f ( x0 ). Even if our statistical model represented f exactly, this error
would remain, so it cannot be avoided (hence ‘irreducible’); i.e. σ_e² is due to y_NEW .

 Variance of the fitted model: var( f̂ ( x0 )) is the expected squared deviation of
f̂ ( x0 ) around its mean. This is a random source of error in the fitted model.

 Bias: bias²( f̂ ( x0 )) represents the amount by which the average of our
estimate differs from the true mean. It is a systematic source of error arising
from misspecification of the model.

 Typically, the more complex we make the model f̂, the lower the (squared)
bias but the higher the variance. For example, consider the linear model example
introduced earlier. Recall that var( f̂ ( x0 )) = x0 ( X^T X )^{-1} x0^T σ_e². When we consider
the average of the expected prediction error over the training sample points, the variance
component becomes ((p + 1)/N) σ_e², which becomes larger as the number of parameters
increases.

• Example: In a reduced model with fewer parameters, we anticipate
smaller variance and larger bias. Consequently, the in-sample error could increase
or decrease. To see what is going on, consider the following example.

Denoting X from a new sample by X̃, let us compare the expectation of the gen-
eralization error of the following two models, Err1 of Model 1, y = X1 β1* + e*,
and Err2 of Model 2, y = X1 β1 + X2 β2 + e, when the true model is Model 2:

Err1 = σ² + var( X̃1 β̂1* ) + ( X̃1 ( β1 + Aβ2 ) − X̃1 β1 − X̃2 β2 )²
     = σ² + var( X̃1 β̂1* ) + {( X̃1 A − X̃2 ) β2 }²
     = σ² + σ² X̃1 ( X1^T X1 )^{-1} X̃1^T + {( X̃1 A − X̃2 ) β2 }²,

where A = ( X1^T X1 )^{-1} X1^T X2 , and

Err2 = σ² + var( X̃ β̂ )
     = σ² + σ² X̃1 ( X1^T X1 )^{-1} X̃1^T + σ² ( X̃1 A − X̃2 ) B^{-1} ( X̃1 A − X̃2 )^T ,

where B = X2^T X2 − X2^T X1 ( X1^T X1 )^{-1} X1^T X2 .

If
{( X̃1 A − X̃2 ) β2 }² ≤ σ² ( X̃1 A − X̃2 ) B^{-1} ( X̃1 A − X̃2 )^T ,

model 1 (the incorrect model) will have the smaller prediction error at X̃, and hence the
smaller in-sample error when this holds at each training input.

 This result reveals the startling fact that even though minimising the training error
would lead us to the model with the least bias (model 2), we may intentionally choose
a biased model (model 1) so as to improve prediction. This illustrates that there
is an optimal model complexity, determined by the bias-variance trade-off, that gives
the minimum generalization error. A judicious way to find the model with the best
prediction is therefore to estimate Err accurately and choose the model that minimizes
that quantity. The sketch below gives a small numerical illustration.
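 A minimal Monte Carlo sketch of this trade-off (illustrative only; the simulated design, β2 = 0.1 and the noise level are my own choices): the reduced model that wrongly drops X2 attains a smaller in-sample error than the true model, because its extra bias is outweighed by the variance it saves.

    import numpy as np

    rng = np.random.default_rng(2)
    N, sigma = 30, 1.0
    X1 = np.column_stack([np.ones(N), rng.normal(size=N)])   # intercept + one regressor
    X2 = rng.normal(size=(N, 1))                              # extra regressor in the true model
    Xfull = np.hstack([X1, X2])
    mu = X1 @ np.array([1.0, 2.0]) + X2 @ np.array([0.1])     # true mean: beta2 = 0.1 is weak

    # Hat matrices of the reduced (biased) model and the full (true) model
    H1 = X1 @ np.linalg.solve(X1.T @ X1, X1.T)
    H2 = Xfull @ np.linalg.solve(Xfull.T @ Xfull, Xfull.T)

    err1, err2 = [], []
    for _ in range(20000):
        y = mu + rng.normal(0, sigma, N)          # training responses
        y_new = mu + rng.normal(0, sigma, N)      # new responses at the same inputs
        err1.append(np.mean((y_new - H1 @ y) ** 2))
        err2.append(np.mean((y_new - H2 @ y) ** 2))

    print("in-sample error, Model 1 (X2 dropped):", np.mean(err1))
    print("in-sample error, Model 2 (true model):", np.mean(err2))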

8.3 Optimism of the training error rate

• Optimism: the optimism is defined as the difference between Err_in and the training
error err:

op = Err_in − err.

 Typically, the optimism is positive since err is usually biased downward as an
estimate of Err_in .

• Expected optimism (or average optimism): the expectation is taken over the training set
outcome values,

ω = E_y (op) = E_y { Err_in − err }.

 The expected optimism ω is usually estimated instead of op.

• The motivation for defining the expected optimism is to estimate the in-sample error as
the sum of the estimated expected optimism and the training error.

8.4 Estimators of in-sample error

The general form of the in-sample estimates is

Êrr_in = err + ω̂.

• There are two strategies to estimate the in-sample error, direct and indirect
estimation.

• Indirect estimation:

Recall ω = E_y ( Err_in − err ), where

err = N^{-1} Σ_{i=1}^N (y_i − f̂ ( x_i ))²
    = N^{-1} Σ_{i=1}^N [ y_i − f ( x_i ) + f ( x_i ) − E{ f̂ ( x_i )} + E{ f̂ ( x_i )} − f̂ ( x_i ) ]².

Then,

E_y (err) = (1/N) Σ_i { σ_e² + var(ŷ_i ) + ( f ( x_i ) − E(ŷ_i ))² − 2 cov(y_i , ŷ_i ) }.

Also,

Err_in = (1/N) Σ_i { σ_e² + (ŷ_i − E(ŷ_i ))² + ( f ( x_i ) − E(ŷ_i ))² − 2 (ŷ_i − E(ŷ_i ))( f ( x_i ) − E(ŷ_i )) },

E_y ( Err_in ) = (1/N) Σ_i { σ_e² + var(ŷ_i ) + ( f ( x_i ) − E(ŷ_i ))² }.
i

Thus,

ω = E(op) = E_y ( Err_in ) − E_y (err) = (2/N) Σ_i cov(y_i , ŷ_i ).

For linear models, Σ_{i=1}^N cov(y_i , ŷ_i ) = tr{cov(y, Hy)} = σ_e² ( p + 1), which essen-
tially represents the complexity of a model.

Since E( Err_in ) = E(err) + 2σ_e² ( p + 1)/N, the quantity err + 2σ̂_e² ( p + 1)/N is an unbiased
estimator of the in-sample error when we have an unbiased estimator of σ_e².

 Methods like Mallows’ C_p , the Akaike Information Criterion and the Bayesian
Information Criterion estimate Err_in by estimating the expected optimism and
then adding it to err (which can easily be calculated using the data in T ).

 Mallows’ C_p : Definition

C_p = err + 2 ((p + 1)/N) σ̂_e²,

where N is the number of observations in the training set, p is the number
of variables in the model, and σ̂_e² is obtained from the full model.

The C_p statistic simply adds the estimate of the optimism to the training error to
obtain an estimate of the in-sample error. This also coincides with the approximate
GCV for large N (see the end of this chapter).

In linear regression analysis, Mallows’ C_p is often defined instead as SSE/σ̂_e² + 2( p + 1) − N,
where SSE comes from the model under consideration. If that model is the full
model, C_p = p + 1. Note that the two expressions are equivalent, since

err + 2 ((p + 1)/N) σ̂_e² = (1/N) SSE + 2 ((p + 1)/N) σ̂_e²
                         = (1/N) σ̂_e² { SSE/σ̂_e² + 2( p + 1) }
                         = σ̂_e² + (1/N) σ̂_e² { SSE/σ̂_e² + 2( p + 1) − N }.

A small numerical sketch of both forms is given below.
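 A minimal sketch (illustrative only; the simulated design, the five nested candidate models and the helper fit_sse are my own choices, not from the notes) computing both forms of C_p, with σ̂_e² taken from the full model:

    import numpy as np

    rng = np.random.default_rng(3)
    N = 100
    X = rng.normal(size=(N, 5))                                 # candidate predictors
    y = 1 + 2 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=N)    # only two of them matter

    def fit_sse(Xd, y):
        """Add an intercept, fit OLS, return the SSE and the number of slopes p."""
        Xd = np.column_stack([np.ones(len(y)), Xd])
        beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
        resid = y - Xd @ beta
        return resid @ resid, Xd.shape[1] - 1

    sse_full, p_full = fit_sse(X, y)                 # sigma^2 is estimated from the full model
    sigma2_hat = sse_full / (N - p_full - 1)

    for p in range(1, 6):                            # nested models using the first p columns
        sse, _ = fit_sse(X[:, :p], y)
        cp_err = sse / N + 2 * (p + 1) / N * sigma2_hat        # err + 2(p+1)/N * sigma^2
        cp_classic = sse / sigma2_hat + 2 * (p + 1) - N        # SSE/sigma^2 + 2(p+1) - N
        print(f"p={p}: Cp (err form) {cp_err:.3f}, Cp (classical form) {cp_classic:.2f}")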

• Direct estimation: cross-validation: The reason that err understates Err is that
the same data are used both to fit the model and to assess its goodness of fit. Cross-
validation (Stone 1974) avoids this problem by splitting the data into K separate
segments, fitting the model using K − 1 of these segments and assessing the
goodness of fit on the remaining segment. Since the held-out segment of data is ‘new’
(not used to fit the model), cross-validation attempts to estimate the
generalization error directly. Below we discuss the case K = N, ‘leave-one-out’
cross-validation, which leaves out one observation at a time and fits the model N times;
it is a multiple of the PRESS statistic we discussed in the previous chapter:

LOCV = N^{-1} PRESS = N^{-1} Σ_{i=1}^N (y_i − X_i β̂_(i) )² = N^{-1} Σ_{i=1}^N ( e_i /(1 − h_ii ) )².

(A numerical check of this single-fit shortcut is sketched below.)
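 A minimal sketch (illustrative only; the simulated design and coefficients are arbitrary) verifying that the shortcut e_i/(1 − h_ii) from a single fit reproduces explicit leave-one-out refitting:

    import numpy as np

    rng = np.random.default_rng(4)
    N = 40
    X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])   # design with intercept
    y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=N)

    # Shortcut: residuals and leverages from a single OLS fit
    H = X @ np.linalg.solve(X.T @ X, X.T)
    e = y - H @ y
    h = np.diag(H)
    loocv_shortcut = np.mean((e / (1 - h)) ** 2)

    # Explicit leave-one-out: N separate fits
    errs = []
    for i in range(N):
        mask = np.arange(N) != i
        beta_i, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        errs.append((y[i] - X[i] @ beta_i) ** 2)
    loocv_explicit = np.mean(errs)

    print(loocv_shortcut, loocv_explicit)            # the two agree up to rounding error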

 Is LOCV an unbiased estimator of Err_in ? To check this, write the residual from the
considered model as e_i = u_i ( I − H )y = η_i + u_i ( I − H )e, so that E(e_i²) = η_i² +
E(e^T ( I − H ) K_i ( I − H ) e) = η_i² + σ²(1 − h_ii ), where η_i = u_i ( I − H ) Xβ. (u_i and K_i
are defined in 7.4: u_i is a row vector of 0’s except that the element in the i-th
position is 1, and K_i = u_i^T u_i .) Note that if the considered model is the true model,
then η_i = 0 since ( I − H ) X = 0; here H comes from the considered model while X is the
design of the true model.

E( N × LOCV ) = Σ_{i=1}^N ( η_i² + σ²(1 − h_ii ) ) / (1 − h_ii )²,

E( N × Err_in ) = E{ Σ_{i=1}^N E_{y_i^NEW} (y_i^NEW − X_i β̂ )² } = Σ_{i=1}^N { η_i² + (1 + h_ii ) σ² }.

In fact,

E( N × LOCV ) − E( N × Err_in ) = Σ_{i=1}^N ( η_i² + σ²(1 − h_ii ) ) / (1 − h_ii )² − Σ_{i=1}^N { η_i² + (1 + h_ii ) σ² }
                               = Σ_{i=1}^N η_i² h_ii (2 − h_ii ) / (1 − h_ii )² + Σ_{i=1}^N ( h_ii² / (1 − h_ii ) ) σ² > 0,

since 0 < h_ii < 1; thus LOCV overestimates Err_in on average.

 Sometimes it is easier to calculate tr( H ) than the individual h_ii ’s. The generalized
cross-validation score is defined by replacing h_ii with its average, tr( H )/N:

GCV = N^{-1} Σ_{i=1}^N ( e_i / (1 − tr( H )/N ) )².

For large N, using the approximation 1/(1 − x)² ≈ 1 + 2x and tr( H ) = p + 1 for the linear model,

GCV = N^{-1} Σ_{i=1}^N ( e_i / (1 − ( p + 1)/N ) )² ≈ N^{-1} e^T e + { N^{-1} e^T e } · 2( p + 1)/N,

which is approximately Mallows’ C_p if we regard e^T e/N as an estimator of σ_e². A small
numerical sketch follows.
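 A minimal sketch (illustrative only; the simulated design is arbitrary) computing GCV and the corresponding C_p-type quantity with σ̂_e² = e^T e/N; for moderate p and large N the two are close:

    import numpy as np

    rng = np.random.default_rng(5)
    N, p = 200, 4
    X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
    y = X @ rng.normal(size=p + 1) + rng.normal(size=N)

    H = X @ np.linalg.solve(X.T @ X, X.T)
    e = y - H @ y
    trH = np.trace(H)                                # equals p + 1 here

    gcv = np.mean((e / (1 - trH / N)) ** 2)
    sigma2_hat = e @ e / N                           # e'e/N as the estimator of sigma_e^2
    cp_like = e @ e / N + 2 * (p + 1) / N * sigma2_hat

    print("GCV:", gcv, "  Cp with sigma^2 = e'e/N:", cp_like)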


Chapter 9

Analysis of variance

9.1 Non-full-rank models

• The analysis of variance model can be posed as a linear model, but it is a little tricky
since the design matrix may not be of full rank.

• Example: Consider the following one-way analysis of variance model.

yij = µ + τi + eij ,

for i = 1, 2 and j = 1, 2, 3. i.e.

y11 = µ + τ1 + e11 , y12 = µ + τ1 + e12 , y13 = µ + τ1 + e13


y21 = µ + τ2 + e21 , y22 = µ + τ2 + e22 , y23 = µ + τ2 + e23 .

 µ can be considered as an overall mean without the effect of the treatments,
and τ_i is the contribution of treatment i to the mean.

 In the textbook example, two chemical additives for increasing the mileage of gasoline
are compared. The response is the mileage (miles per gallon).

 In matrix form, we can write

Y = \begin{pmatrix} y_{11}\\ y_{12}\\ y_{13}\\ y_{21}\\ y_{22}\\ y_{23} \end{pmatrix}
  = \begin{pmatrix} 1 & 1 & 0\\ 1 & 1 & 0\\ 1 & 1 & 0\\ 1 & 0 & 1\\ 1 & 0 & 1\\ 1 & 0 & 1 \end{pmatrix}
    \begin{pmatrix} \mu\\ \tau_1\\ \tau_2 \end{pmatrix}
  + \begin{pmatrix} e_{11}\\ e_{12}\\ e_{13}\\ e_{21}\\ e_{22}\\ e_{23} \end{pmatrix}
  = Xβ + e,    (9.1)

where β = (µ, τ1 , τ2 )^T . Note that in this application ( X^T X )^{-1} does not exist.
Also, β = (15, 1, 3), (10, 6, 8) and (25, −9, −7) all give the same fit; that is, µ, τ1 and τ2 are
not unique and therefore cannot be estimated uniquely. With three parameters
and rank( X ) = 2, the model is said to be overparameterized. Note that increasing
the number of observations (replications) for each of the two additives will not
change the rank of X.

• Lack of estimability of β: The model can be written in the usual way, Y = Xβ + e,
E(Y) = Xβ, var(Y) = σ² I_n , and the normal equations are X^T Xβ = X^T Y. However, due
to the lack of full rank in X, X^T X is not invertible!

We can use a g-inverse of X^T X, which yields

β̂ = (X^T X)^− X^T Y.

Recall that a g-inverse is not unique, so β̂ is not unique. Such a β̂ is not unbiased, since
E( β̂ ) = (X^T X)^− X^T Xβ ≠ β. Then, does an unbiased estimator of β exist? That
is, does a p × n matrix A exist satisfying E(AY) = AXβ = β? The answer is no. To
see this, note that β = E( AY ) = AXβ, so the question boils down to whether
there exists A such that AX = I_p . But using Theorem 2.4(i) from the textbook
and rank( X ) < p in (9.1), we have rank( AX ) ≤ rank( X ) < p. Hence AX cannot
be equal to I_p . A small numerical sketch of the g-inverse solution follows.
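 A minimal numerical sketch (the mileage values are made up, chosen so that the two group means are 16 and 18, matching the numbers quoted later for this example): X^T X is singular, different g-inverses give different solutions β̂, but the fitted values and estimable combinations agree.

    import numpy as np

    # Design matrix of the one-way ANOVA example (9.1): two treatments, three replicates
    X = np.array([[1, 1, 0]] * 3 + [[1, 0, 1]] * 3, dtype=float)
    y = np.array([14.0, 16.0, 18.0, 17.0, 18.0, 19.0])   # made-up mileages, group means 16 and 18

    XtX = X.T @ X
    print(np.linalg.matrix_rank(XtX))        # 2 < 3, so (X'X)^{-1} does not exist

    # Two different generalized inverses of X'X
    G1 = np.diag([0.0, 1/3, 1/3])            # one convenient g-inverse (used later in (9.3))
    G2 = np.linalg.pinv(XtX)                 # Moore-Penrose inverse, another g-inverse

    beta1 = G1 @ X.T @ y
    beta2 = G2 @ X.T @ y
    print(beta1, beta2)                      # two different solutions of the normal equations
    print(np.allclose(X @ beta1, X @ beta2)) # ... with identical fitted values
    print(beta1[1] - beta1[2], beta2[1] - beta2[2])   # tau1 - tau2 = -2 under both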

• There are two common ways to handle an overparameterized model:

(1) Dealing with estimable linear combination: Work with unique and well
defined (estimable) linear combinations of parameters.

(2) Converting to a full rank problem: There are a few ways to convert the
problem into a familiar full rank problem. (2.1) Reparameterization: redefine
the model using a smaller number of new parameters that are unique. (2.2)

Imposing restriction (side conditions): place constraints on the parameters so


that they become unique.

 Example of (2.1): µ1 = µ + τ1 and µ2 = µ + τ2 ,

Y = \begin{pmatrix} y_{11}\\ y_{12}\\ y_{13}\\ y_{21}\\ y_{22}\\ y_{23} \end{pmatrix}
  = \begin{pmatrix} 1 & 0\\ 1 & 0\\ 1 & 0\\ 0 & 1\\ 0 & 1\\ 0 & 1 \end{pmatrix}
    \begin{pmatrix} \mu_1\\ \mu_2 \end{pmatrix}
  + \begin{pmatrix} e_{11}\\ e_{12}\\ e_{13}\\ e_{21}\\ e_{22}\\ e_{23} \end{pmatrix}    (9.2)

 Example of (2.2): τ1 + τ2 = 0,

Y = \begin{pmatrix} y_{11}\\ y_{12}\\ y_{13}\\ y_{21}\\ y_{22}\\ y_{23} \end{pmatrix}
  = \begin{pmatrix} 1 & 1\\ 1 & 1\\ 1 & 1\\ 1 & -1\\ 1 & -1\\ 1 & -1 \end{pmatrix}
    \begin{pmatrix} \mu\\ \tau_1 \end{pmatrix}
  + \begin{pmatrix} e_{11}\\ e_{12}\\ e_{13}\\ e_{21}\\ e_{22}\\ e_{23} \end{pmatrix}

 Example of (1): As we examine the parameters in the model illustrated in


(9.1), we see some linear combinations that are unique. For example, τ1 − τ2 =
−2, µ + τ1 = 16, and µ + τ2 = 18, remain the same for all alternative values of
µ, τ1 , and τ2 . Such unique linear combinations can be estimated.

 These approaches are related.

 Illustration: Related approaches: For the model (9.1), we have

X^T X = \begin{pmatrix} 6 & 3 & 3\\ 3 & 3 & 0\\ 3 & 0 & 3 \end{pmatrix}  and  X^T Y = \begin{pmatrix} y_{..}\\ y_{1.}\\ y_{2.} \end{pmatrix},

where y_.. = Σ_{i=1}^2 Σ_{j=1}^3 y_ij and y_i. = Σ_{j=1}^3 y_ij .

With (X^T X)^− = \begin{pmatrix} 0 & 0 & 0\\ 0 & 1/3 & 0\\ 0 & 0 & 1/3 \end{pmatrix}  we have

β̂ = (X^T X)^− X^T Y = \begin{pmatrix} 0\\ \bar{y}_{1.}\\ \bar{y}_{2.} \end{pmatrix}.    (9.3)

We can easily check that

E( β̂ ) = (X^T X)^− X^T Xβ = \begin{pmatrix} 0\\ \mu + \tau_1\\ \mu + \tau_2 \end{pmatrix}.

We obtain identical results with the reparameterized model (9.2):

\begin{pmatrix} \hat{\mu}_1\\ \hat{\mu}_2 \end{pmatrix} = (X^T X)^{-1} X^T Y
  = \begin{pmatrix} 1/3 & 0\\ 0 & 1/3 \end{pmatrix} \begin{pmatrix} y_{1.}\\ y_{2.} \end{pmatrix}
  = \begin{pmatrix} \bar{y}_{1.}\\ \bar{y}_{2.} \end{pmatrix},

where X now denotes the full-rank design matrix in (9.2).

9.2 Estimable function of β

If we cannot estimate β, the next logical question is whether there is a linear
combination of β, say λ^T β, that is estimable. If one exists, how can we tell whether a linear
combination is estimable or not? The theorems below provide answers to these
questions.

• Definition of estimable functions of β: A linear function of parameters λ T β is

said to be estimable if there exists a vector a such that E( a T y) = λ T β.

• Theorem 12.2b. In the model Y = Xβ + e, where E(Y) = Xβ and X is n × p of


rank k < p ≤ n, the linear function λ T β is estimable if and only if any one of
the following equivalent conditions holds:

(i) λ T is a linear combination of the rows of X; that is, there exists a vector a
such that

aT X = λT .

(ii) λ T is a linear combination of the rows of X T X or λ is a linear combination


of the columns of X T X, that is, there exists a vector r such that

r T X T X = λ T or X T Xr = λ

(iii) λ or λ T satisfies

X T X(X T X)− λ = λ or λ T (X T X)− X T X = λ T (9.4)

(Proof) We only give the ‘if’ part.

(i) If there exists a vector a such that λ T = a T X, then, using this vector a, we
have E(a T Y) = a T E(Y) = a T Xβ = λ T β.

(ii) If there exists a solution r for X T Xr = λ, then, by defining a = Xr, we obtain


E(a T Y) = E(r T X T Y) = r T X T E(Y ) = r T X T Xβ = λ T β.

(iii) If X T X(X T X)− λ = λ, then (X T X)− λ is a solution to X T Xr = λ.

• Theorem 12.3a. Let λ T β be an estimable function of β in the model Y = Xβ + e,


where E(Y) = Xβ and X is n × p of rank k < p ≤ n. Let βb be any solution to the

normal equations X T X βb = X T Y, and let r be any solution to X T Xr = λ. Then


the two estimators λ T βb and r T X T Y have the following properties:
(i) E(λ T βb) = E(r T X T Y) = λ T β.

(ii) λ T βb is equal to r T X T Y for any βb or any r.


(iii) λ T βb and r T X T Y are invariant to the choice of βb or r.

• Example: Model (9.1). We can show that τ1 − τ2 is estimable since we can write

X^T Xr = \begin{pmatrix} 6 & 3 & 3\\ 3 & 3 & 0\\ 3 & 0 & 3 \end{pmatrix} r = \begin{pmatrix} 0\\ 1\\ -1 \end{pmatrix} = λ,

with r^T = (0, 1/3, −1/3). Then the estimator of τ1 − τ2 is r^T X^T Y = ȳ_1. − ȳ_2. . We can verify
that λ^T β = τ1 − τ2 = E(ȳ_1. − ȳ_2. ) = E(r^T X^T Y) = r^T X^T Xβ.

To obtain the same result using λ^T β̂, we first find a solution to the normal equations
X^T X β̂ = X^T Y as in (9.3) and obtain λ^T β̂ = ȳ_1. − ȳ_2. . The estimability condition can also be
checked numerically, as in the sketch below.
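 A minimal sketch (illustrative only; the helper is_estimable is my own naming) applying the criterion of Theorem 12.2b(iii), X^T X (X^T X)^− λ = λ, to a few candidate λ’s for the design in (9.1):

    import numpy as np

    X = np.array([[1, 1, 0]] * 3 + [[1, 0, 1]] * 3, dtype=float)
    XtX = X.T @ X
    G = np.linalg.pinv(XtX)                  # any generalized inverse of X'X will do

    def is_estimable(lam):
        """Theorem 12.2b(iii): lambda' beta is estimable iff X'X (X'X)^- lambda = lambda."""
        return np.allclose(XtX @ G @ lam, lam)

    print(is_estimable(np.array([0.0, 1.0, -1.0])))   # tau1 - tau2 : True
    print(is_estimable(np.array([1.0, 1.0, 0.0])))    # mu + tau1   : True
    print(is_estimable(np.array([0.0, 1.0, 0.0])))    # tau1 alone  : False
    print(is_estimable(np.array([0.0, 1.0, 1.0])))    # tau1 + tau2 : False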



• Theorem 12.3b. Let λ^T β be an estimable function in the model Y = Xβ + e, where X
is n × p of rank k < p ≤ n and cov(Y) = σ² I. Let r be any solution to X^T Xr = λ,
and let β̂ be any solution to X^T X β̂ = X^T Y. Then the variance of λ^T β̂ or r^T X^T Y
has the following properties:

(i) var(r^T X^T Y) = σ² r^T X^T Xr = σ² r^T λ.

(ii) var(λ^T β̂ ) = σ² λ^T (X^T X)^− λ.

(iii) var(λ^T β̂ ) is unique, that is, invariant to the choice of r or (X^T X)^− .

(Proof)

(i) var(r^T X^T Y) = r^T X^T cov(Y) Xr = r^T X^T (σ² I) Xr = σ² r^T X^T Xr = σ² r^T λ.

(ii) var(λ^T β̂ ) = λ^T cov( β̂ )λ = σ² λ^T (X^T X)^− X^T X(X^T X)^− λ.
By Theorem 12.2b(iii), λ^T (X^T X)^− X^T X = λ^T , and therefore var(λ^T β̂ ) = σ² λ^T (X^T X)^− λ.

(iii) To show that r^T λ is invariant to r, let r1 and r2 be such that X^T Xr1 = λ and
X^T Xr2 = λ. Multiplying these two equations on the left by r2^T and r1^T respectively, we
obtain r2^T λ = r1^T λ.

To show that λ^T (X^T X)^− λ is invariant to the choice of (X^T X)^− , let G1 and G2 be
two generalized inverses of X^T X. Then by Theorem 2.8c(v), we have

XG1 X^T = XG2 X^T .

Pre- and post-multiplying both sides by a^T and a, where a is such that a^T X = λ^T [see
Theorem 12.2b(i)], we obtain a^T XG1 X^T a = a^T XG2 X^T a, which yields λ^T G1 λ = λ^T G2 λ.

• Theorem 12.3c. If λ1^T β and λ2^T β are two estimable functions in the model Y =
Xβ + e, where X is n × p of rank k < p ≤ n and cov(Y) = σ² I, then the covariance
of their estimators is given by

cov(λ1^T β̂, λ2^T β̂ ) = σ² r1^T λ2 = σ² λ1^T r2 = σ² λ1^T (X^T X)^− λ2 ,

where X^T Xr1 = λ1 and X^T Xr2 = λ2 .

• Theorem 12.3e. Let s² = SSE/(n − k ), where SSE = (Y − X β̂ )^T (Y − X β̂ ) = Y^T (I −
X(X^T X)^− X^T )Y. For s² defined for the non-full-rank model, we have the follow-
ing properties: (i) E(s²) = σ². (ii) s² is invariant to the choice of β̂ or to the choice
of generalized inverse ( X^T X )^− .

(Proof)

E(SSE) = E(Y^T (I − X(X^T X)^− X^T )Y)
       = tr((I − X(X^T X)^− X^T ) σ² I) + β^T X^T (I − X(X^T X)^− X^T )Xβ
       = ( n − k ) σ²,

since (I − X(X^T X)^− X^T )X = 0 and tr(I − X(X^T X)^− X^T ) = n − rank(X) = n − k.

• Theorem 12.3f. If Y ∼ N_n (Xβ, σ² I), where X is n × p of rank k, p < n, then
maximum likelihood estimators of β and σ² are given by β̂_MLE = (X^T X)^− X^T Y and
σ̂² = n^{-1} (Y − X β̂ )^T (Y − X β̂ ).

• Theorem 12.3g. If Y is N_n (Xβ, σ² I), where X is n × p of rank k < p ≤ n, then the
maximum likelihood estimator β̂ and s² (the bias-corrected variance estimator) have the
following properties:
(i) β̂ is N_p [(X^T X)^− X^T Xβ, σ² (X^T X)^− X^T X(X^T X)^− ].
(ii) (n − k )s²/σ² is χ²(n − k).
(iii) β̂ and s² are independent.

9.3 Formalizing reparametrization

• In reparameterization, we transform the non-full-rank model Y = Xβ + e, where
X is n × p of rank k, into the full-rank model Y = Zγ + e, where Z is n × k and
γ = Uβ is a set of k linearly independent estimable functions of β. Note that
U is a k × p matrix of rank k < p, so the matrix UU^T is nonsingular, and we write
X = ZU. We can see that Xβ = ZUβ = Zγ. Since XU^T = ZUU^T , we can write

Z = XU^T (UU^T )^{-1} .

Note that Z is of full rank, since rank(Z) ≥ rank(ZU) = rank(X) = k, and we
obtain γ̂ = (Z^T Z)^{-1} Z^T Y. Since Zγ = Xβ, we have Zγ̂ = X β̂, and (Y − X β̂ )^T (Y −
X β̂ ) = (Y − Zγ̂ )^T ( Y − Zγ̂ ).

• Uβ = γ is only one possible set of k linearly independent estimable functions;
nevertheless, the estimator of an estimable function does not change with
the choice of U. Let Vβ = δ be another set of linearly independent estimable
functions. Then there exists a matrix W such that Y = Wδ + e, and an estimable
function can be expressed as λ^T β = b^T γ = c^T δ. Hence λ^T β̂ = b^T γ̂ = c^T δ̂,
yielding the same estimator for differently expressed linear combinations. A small
numerical sketch of the reparameterization follows.
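 A minimal sketch of the construction (the data are the same made-up mileages used earlier; U is one particular choice): with γ = (µ + τ1, µ + τ2)^T, the matrix Z = XU^T(UU^T)^{-1} reproduces the cell-mean design in (9.2), and the fitted values agree with those of the non-full-rank model.

    import numpy as np

    X = np.array([[1, 1, 0]] * 3 + [[1, 0, 1]] * 3, dtype=float)
    y = np.array([14.0, 16.0, 18.0, 17.0, 18.0, 19.0])   # same made-up data as before

    # U picks gamma = (mu + tau1, mu + tau2), k = 2 linearly independent estimable functions
    U = np.array([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0]])
    Z = X @ U.T @ np.linalg.inv(U @ U.T)       # Z = X U'(UU')^{-1}, full column rank
    print(Z)                                    # the indicator design of (9.2)

    gamma_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)
    print(gamma_hat)                            # the group means (16, 18)

    # Fitted values agree with those from any solution of the non-full-rank normal equations
    beta_any = np.linalg.pinv(X.T @ X) @ X.T @ y
    print(np.allclose(Z @ gamma_hat, X @ beta_any))   # True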

9.4 Side conditions

• Side conditions provide linear constraints that make the parameters unique and
estimable. (9.3) illustrates how the side condition τ1 + τ2 = 0 handles the
rank deficiency of X. Then, can we impose any such side condition? Should
side conditions satisfy certain requirements? This is the topic of this section.

• Requirement for side conditions: The goal of the constraint is to remedy the rank de-
ficiency of X. Consider a linear constraint Tβ = 0. If a side condition were an
estimable function of β, then it could be expressed as a linear combination of
the rows of X^T X, which would contribute nothing new to obtaining a solution vec-
tor β̂; thus it would not help relieve the rank-deficiency problem. Therefore side
conditions should be nonestimable functions of β. Furthermore, to make up the
rank deficiency, T should be a ( p − k ) × p matrix with rank p − k. Combining the
two equations Y = Xβ + e and 0 = Tβ + 0, we can write

\begin{pmatrix} Y\\ 0 \end{pmatrix} = \begin{pmatrix} X\\ T \end{pmatrix} β + \begin{pmatrix} e\\ 0 \end{pmatrix},

where \begin{pmatrix} X\\ T \end{pmatrix} is an (n + p − k ) × p matrix with rank p. Then the minimizer of the
sum of squares should satisfy

\begin{pmatrix} X\\ T \end{pmatrix}^T \begin{pmatrix} X\\ T \end{pmatrix} β̂ = \begin{pmatrix} X\\ T \end{pmatrix}^T \begin{pmatrix} Y\\ 0 \end{pmatrix},

i.e. (X^T X + T^T T) β̂ = X^T Y. Recall that X is n × p of rank k, p ≤ n, and T is ( p − k ) × p
of rank p − k such that Tβ is nonestimable. If T is defined to satisfy this
condition, the solution that satisfies X^T Xβ = X^T Y and Tβ = 0 simultaneously
is unique, and

β̂ = [X^T X + T^T T]^{-1} (X^T , T^T ) \begin{pmatrix} Y\\ 0 \end{pmatrix} = ( X^T X + T^T T )^{-1} X^T Y.

• Example: We go back to (9.3) and verify the result. The model is y_ij = µ + τ_i + e_ij ,
i = 1, 2, j = 1, 2, 3, and τ1 + τ2 is non-estimable. Setting τ1 + τ2 = 0, i.e. (0, 1, 1) β = 0,
we take T = (0, 1, 1). Then

X^T X + T^T T = \begin{pmatrix} 6 & 3 & 3\\ 3 & 4 & 1\\ 3 & 1 & 4 \end{pmatrix},

(X^T X + T^T T)^{-1} = \frac{1}{12} \begin{pmatrix} 5 & -3 & -3\\ -3 & 5 & 1\\ -3 & 1 & 5 \end{pmatrix},

X^T Y = \begin{pmatrix} y_{..}\\ y_{1.}\\ y_{2.} \end{pmatrix},

β̂ = (X^T X + T^T T)^{-1} X^T Y = \begin{pmatrix} \bar{y}_{..}\\ \bar{y}_{1.} - \bar{y}_{..}\\ \bar{y}_{2.} - \bar{y}_{..} \end{pmatrix}.

The solution simultaneously satisfies X^T X β̂ = X^T Y and T β̂ = (0, 1, 1) β̂ = 0, as the sketch
below confirms numerically.
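 A minimal numerical check of this example (same made-up mileage data as before):

    import numpy as np

    X = np.array([[1, 1, 0]] * 3 + [[1, 0, 1]] * 3, dtype=float)
    y = np.array([14.0, 16.0, 18.0, 17.0, 18.0, 19.0])   # same made-up data as before
    T = np.array([[0.0, 1.0, 1.0]])                       # side condition tau1 + tau2 = 0

    M = X.T @ X + T.T @ T                                 # now nonsingular
    beta_hat = np.linalg.solve(M, X.T @ y)
    print(beta_hat)         # (ybar.., ybar1. - ybar.., ybar2. - ybar..) = (17, -1, 1)

    # The constrained solution still solves the normal equations and obeys the constraint
    print(np.allclose(X.T @ X @ beta_hat, X.T @ y))       # True
    print(np.allclose(T @ beta_hat, 0.0))                 # True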

9.5 Testing hypothesis

• Testable hypothesis

 Definition: It can be shown that unless a hypothesis can be expressed in


terms of estimable functions, it cannot be tested. This leads to the notion of

testable hypothesis. A hypothesis is said to be testable if it can be expressed


in terms of estimable functions. For example, a hypothesis such as H0 : β 1 =
β 2 = · · · = β q is said to be testable if there exists a set of linearly independent

estimable functions λ1T β, · · · , λtT β such that H0 is true if and only if λ1T β =
0, · · · , λtT β = 0.

 Illustration: Suppose that we have the model yij = µ + αi + β j + eij , i = 1, 2, 3,


j = 1, 2, 3, and a hypothesis of interest is H0 : α1 = α2 = α3 . By taking linear

combinations of the rows of Xβ, we can obtain the two linearly independent
estimable functions α1 − α2 and α1 + α2 − 2α3 . The hypothesis H0 : α1 = α2 = α3
is true if and only if α1 − α2 and α1 + α2 − 2α3 are simultaneously equal to zero

(see Problem 12.21). Therefore, H0 is a testable hypothesis and is equivalent to


testing α1 − α2 = 0 and α1 + α2 − 2α3 = 0.

• Hypothesis testing after reparameterization: We can reparameterize to a full-
rank representation such as

y = Xβ + e = Zγ + e = Z1 γ1 + Z2 γ2 + e,

where γ1 = (λ1^T β, · · · , λ_t^T β)^T and γ2 = (λ_{t+1}^T β, · · · , λ_k^T β)^T are such that γ1 and γ2
are jointly linearly independent.

We can use the method discussed in Chapter 7 for testing H0 : γ1 = 0. Specif-
ically, SS(γ1 | γ2 ) = y^T ( H − H2 )y = γ̂^T Z^T y − γ̂2^T Z2^T y, and the corresponding
F-statistic is

F = ( SS(γ1 | γ2 )/t ) / ( SSE/(n − k ) ).

• General linear hypothesis for testable hypothesis:

Theorem 12.7b

If Y is Nn (Xβ, σ2 I), where X is n × p of rank k < p ≤ n, if C is m × p of rank


m ≤ k such that Cβ is a set of m linearly independent estimable functions, and
if βb = (X T X)− X T Y, then

(i) C(X T X)− C T is nonsingular.


(ii) C β̂ is N_m [Cβ, σ² C(X^T X)^− C^T ].
(iii) SSH/σ² = (C β̂ )^T [C(X^T X)^− C^T ]^{-1} C β̂ / σ² is χ²(m, λ),

where λ = (Cβ)^T [C(X^T X)^− C^T ]^{-1} Cβ / (2σ²).



(iv) SSE/σ2 = Y T [I − X(X T X)− X T ]Y/σ2 is χ2 (n − k ).


(v) SSH and SSE are independent.

(Proof)
(i) Since

c1T β
 
 c2T β 
Cβ = 
 
.. 
 . 

cm

is a set of m linearly independent estimable functions, then by Theorem 12.2b(iii)

we have ciT (X T X)− X T X = ciT for i = 1, 2, · · · , m. Hence

C(X T X)− X T X = C. (9.5)

Using Theorem 2.4(i)

rank(C) ≤ rank[C(X T X)− X T ] ≤ rank(C).

Hence rank[C(X T X)− X T ] = rank(C) = m. Now, since rank(AA T ) = rank(A),


we can write
rank(C) = rank[C(X^T X)^− X^T ]
        = rank{ [C(X^T X)^− X^T ][C(X^T X)^− X^T ]^T }
        = rank[C(X^T X)^− X^T X(X^T X)^− C^T ].

By (9.5), C(X T X)− X T X = C, and we have

rank(C) = rank[C(X T X)− C T ].

Thus the m × m matrix C(X T X)− C T is nonsingular.

(ii)

E(C βb) = CE( βb) = C(X T X)− X T Xβ = Cβ.



var(C β̂ ) = C var( β̂ ) C^T
          = C ( X^T X )^− X^T X ( X^T X )^− C^T σ²
          = C ( X^T X )^− C^T σ².

(iii), (iv) and (v) can be shown similarly as before. A numerical sketch of the test in
Theorem 12.7b is given below.
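 A minimal sketch (same made-up data; H0: τ1 = τ2 is testable because τ1 − τ2 is estimable) computing SSH, SSE and the F statistic of Theorem 12.7b:

    import numpy as np

    X = np.array([[1, 1, 0]] * 3 + [[1, 0, 1]] * 3, dtype=float)
    y = np.array([14.0, 16.0, 18.0, 17.0, 18.0, 19.0])
    n, k = X.shape[0], np.linalg.matrix_rank(X)              # n = 6, k = 2

    G = np.linalg.pinv(X.T @ X)                              # a generalized inverse of X'X
    beta_hat = G @ X.T @ y
    C = np.array([[0.0, 1.0, -1.0]])                         # H0: tau1 - tau2 = 0 (testable)
    m = C.shape[0]

    SSH = (C @ beta_hat) @ np.linalg.solve(C @ G @ C.T, C @ beta_hat)
    SSE = y @ (np.eye(n) - X @ G @ X.T) @ y
    F = (SSH / m) / (SSE / (n - k))
    print(SSH, SSE, F)        # compare F with an F(m, n - k) = F(1, 4) critical value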
