Appliedstat 2017 Chapter 8 9
This chapter is partly based on the book ‘The Elements of Statistical Learning’
by Hastie, Tibshirani and Friedman.
8.1 Terminology
• ‘Model selection’ is the task of choosing the ‘best’ model among several candidates by comparing estimates of their performance. Having chosen a final model, ‘model assessment’ is the task of estimating its generalization error (or prediction error) on new data.
• Loss function: Finding the ‘best’ estimate implies that we need a measure of how ‘good’ an estimate is. One such measure, called a loss function, is denoted by $L(y, \hat y)$. Given $X$ and the true $y$, this function tells us how ‘good’ our prediction $\hat y = \hat f(X)$ is. The squared loss function is $L(y, \hat y) = (y - \hat y)^2$ and the absolute loss function is $L(y, \hat y) = |y - \hat y|$.
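As a quick numerical illustration (not part of the original notes; the numbers and the use of numpy are assumptions), the two loss functions can be evaluated elementwise for a vector of predictions:

import numpy as np

y = np.array([3.0, -1.0, 2.5])        # observed responses (made-up numbers)
y_hat = np.array([2.5, 0.0, 2.0])     # predictions from some fitted model

squared_loss = (y - y_hat) ** 2       # L(y, yhat) = (y - yhat)^2, elementwise
absolute_loss = np.abs(y - y_hat)     # L(y, yhat) = |y - yhat|
print(squared_loss.mean(), absolute_loss.mean())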
• We would like a model with a small value of the loss function, but the loss function cannot be evaluated directly, since a model has to be selected before new data are observed. Instead, we work with the generalization error (expected prediction error)
$$Err = E\{L(Y, \hat f(X))\}.$$
Note that the above expectation averages over everything that is random, including the randomness in the training sample that produced $\hat f$.
• Definition 3. Training error: Training error is the average loss over the training sample,
$$\overline{err} = N^{-1}\sum_{i=1}^{N} L(y_i, \hat f(x_i)).$$
Note that OLS minimizes this training error under the squared loss function. Typically, the training error understates the generalization error (why?). Whereas the ‘best’ model lies at some intermediate level of complexity, naively minimizing the training error would lead us to choose the most complex model.
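As an illustrative sketch (not from the notes; simulated data and numpy are assumptions), fitting polynomials of increasing degree to one fixed training sample shows that the training error under squared loss can only decrease as the model grows more complex:

import numpy as np

rng = np.random.default_rng(0)
N = 30
x = np.linspace(0.0, 1.0, N)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)   # true f plus noise

for degree in (1, 3, 5, 7):
    coef = np.polyfit(x, y, deg=degree)       # OLS fit of a degree-d polynomial
    y_hat = np.polyval(coef, x)
    err = np.mean((y - y_hat) ** 2)           # training error under squared loss
    print(f"degree {degree}: training error = {err:.4f}")    # decreases with degree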
In the context of usual linear models (i.e., assume the design matrix $X$ is given), the variance part (the third term) changes with $x_0$, but its average (with $x_0$ taken to be each of the sample values $x_i$) is $\frac{p+1}{N}\sigma_e^2$. Then we have
$$N^{-1}\sum_{i=1}^{N} Err(x_i) = \sigma_e^2 + N^{-1}\sum_{i=1}^{N}\bigl[f(x_i) - E(\hat f(x_i))\bigr]^2 + \frac{p+1}{N}\sigma_e^2.$$
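Since the variance of the fit at $x_i$ is $x_i(X^TX)^{-1}x_i^T\sigma_e^2$ (see below), its sample average is $\sigma_e^2\,tr(H)/N = (p+1)\sigma_e^2/N$. A minimal numerical check (simulated design and numpy assumed, not from the notes):

import numpy as np

rng = np.random.default_rng(1)
N, p = 50, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # intercept + p predictors
sigma2 = 2.0                                                 # assumed error variance

H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
avg_variance = sigma2 * np.trace(H) / N       # (1/N) sum_i x_i (X'X)^{-1} x_i' sigma^2
print(avg_variance, (p + 1) * sigma2 / N)     # both equal (p+1) sigma^2 / N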
• In-sample error: It is obtained when new responses are observed for the training-set features, given $X$:
$$Err_{in} = \frac{1}{N}\sum_{i=1}^{N} E_{y_i^{NEW}}\bigl\{L(y_i^{NEW}, \hat f(x_i)) \mid \mathcal{T}\bigr\}.$$
Please note that there are two versions of in-sample error. Another version is $N^{-1}\sum_{i=1}^{N} Err(x_i)$. The expectation in the first version is under $y_i^{NEW}$ only, while the expectation in the second version is under both $y^{NEW}$ and $\mathcal{T}$.
ErrT can be thought of as extra-sample error since the test input vectors do
not need to coincide with the training input vectors.
• Bias-variance trade-off: For the squared loss function, recall that we have the decomposition
$$Err(x_0) = \sigma_e^2 + \bigl[f(x_0) - E(\hat f(x_0))\bigr]^2 + var(\hat f(x_0)) = \sigma_e^2 + bias^2(\hat f(x_0)) + var(\hat f(x_0)).$$
The irreducible error $\sigma_e^2$ in $y^{NEW}$: It is the variance of the target around its true mean $f(x_0)$. Even if our statistical model represented the real underlying model $f$ exactly, this error would remain, so it cannot be avoided.
Bias: $bias^2(\hat f(x_0))$ represents the amount by which the average of our estimate differs from the true mean. It is a systematic source of error arising from misspecification of the model.
Variance: $var(\hat f(x_0))$ measures how much the estimate $\hat f(x_0)$ varies around its own mean as the training sample changes.
Typically, the more complex we make the model $\hat f$, the lower the (squared) bias but the higher the variance. For example, consider the linear model example introduced earlier. Recall that $var(\hat f(x_0)) = x_0(X^TX)^{-1}x_0^T\sigma_e^2$. When we consider the average of the expected prediction error over the training sample, the variance component becomes $\frac{p+1}{N}\sigma_e^2$, which becomes larger as the number of parameters increases.
Denoting the design matrix from a new sample by $\tilde X$, let us compare the expectations of the generalization errors under the two models. If
$$\{(\tilde X_1 A - \tilde X_2)\beta_2\}^2 \le \sigma^2 (\tilde X_1 A - \tilde X_2) B^{-1} (\tilde X_1 A - \tilde X_2)^T,$$
then the smaller model (model 1) has expected prediction error no larger than that of the full model (model 2). This result reveals the startling fact that even though minimizing the training error leads to the model with the least bias (model 2), we may intentionally choose a biased model (model 1) so as to improve prediction. This illustrates that, through the bias-variance trade-off, there is an optimal model complexity that gives the minimum generalization error. A judicious way to find the model with the best prediction is therefore to estimate $Err$ accurately and choose the model that minimizes that quantity.
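The following Monte Carlo sketch illustrates this point under assumed values (a true model with a deliberately small coefficient on $x_2$); it is only illustrative and not taken from the notes. Model 1 omits $x_2$ (biased, lower variance), while model 2 uses both predictors (unbiased, higher variance):

import numpy as np

rng = np.random.default_rng(2)
N, sigma = 20, 1.0
beta1, beta2 = 2.0, 0.15       # beta2 is deliberately small

def one_replication():
    x1, x2 = rng.normal(size=N), rng.normal(size=N)
    y = beta1 * x1 + beta2 * x2 + rng.normal(scale=sigma, size=N)
    x1n, x2n = rng.normal(size=N), rng.normal(size=N)                  # new inputs
    yn = beta1 * x1n + beta2 * x2n + rng.normal(scale=sigma, size=N)   # new responses
    X1, X1n = x1[:, None], x1n[:, None]                                # model 1: omits x2
    X2, X2n = np.column_stack([x1, x2]), np.column_stack([x1n, x2n])   # model 2: full
    b1 = np.linalg.lstsq(X1, y, rcond=None)[0]
    b2 = np.linalg.lstsq(X2, y, rcond=None)[0]
    return np.mean((yn - X1n @ b1) ** 2), np.mean((yn - X2n @ b2) ** 2)

errs = np.array([one_replication() for _ in range(2000)])
print("average prediction error, biased model 1:", errs[:, 0].mean())
print("average prediction error, full   model 2:", errs[:, 1].mean())

With these particular settings the smaller, biased model tends to come out ahead; of course the comparison can flip if the omitted coefficient is large relative to the noise and sample size.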
• Optimism: The optimism is defined as the difference between $Err_{in}$ and the training error $\overline{err}$:
$$op = Err_{in} - \overline{err}.$$
• Expected optimism (or average optimism): The expectation is over the training-set outcome values,
$$\omega = E_y(op).$$
The in-sample error can then be estimated as
$$\widehat{Err}_{in} = \overline{err} + \hat\omega.$$
• There are two strategies to estimate the in-sample error, direct and indirect
estimation.
• Indirect estimation:
$$\overline{err} = N^{-1}\sum_{i=1}^{N}\bigl(y_i - \hat f(x_i)\bigr)^2 = N^{-1}\sum_{i=1}^{N}\bigl[y_i - f(x_i) + f(x_i) - E\{\hat f(x_i)\} + E\{\hat f(x_i)\} - \hat f(x_i)\bigr]^2.$$
Then,
$$E_y(\overline{err}) = \frac{1}{N}\sum_i\Bigl[\sigma_e^2 + var(\hat y_i) + \bigl(f(x_i) - E(\hat y_i)\bigr)^2 - 2\,cov(y_i, \hat y_i)\Bigr].$$
Also,
$$Err_{in} = \frac{1}{N}\sum_i\Bigl[\sigma_e^2 + \bigl(\hat y_i - E(\hat y_i)\bigr)^2 + \bigl(f(x_i) - E(\hat y_i)\bigr)^2 - 2\bigl(\hat y_i - E(\hat y_i)\bigr)\bigl(f(x_i) - E(\hat y_i)\bigr)\Bigr],$$
$$E_y(Err_{in}) = \frac{1}{N}\sum_i\Bigl[\sigma_e^2 + var(\hat y_i) + \bigl(f(x_i) - E(\hat y_i)\bigr)^2\Bigr].$$
Thus, the expected optimism is
$$\omega = E_y(Err_{in}) - E_y(\overline{err}) = \frac{2}{N}\sum_{i=1}^{N} cov(y_i, \hat y_i).$$
For linear models, $\sum_{i=1}^{N} cov(y_i, \hat y_i) = tr\{cov(y, Hy)\} = \sigma_e^2(p+1)$, which essentially says that the optimism grows linearly with the number of parameters. Methods like Mallows' $C_p$, the Akaike information criterion (AIC), and the Bayesian information criterion (BIC) estimate $Err_{in}$ by estimating the expected optimism and then adding it to $\overline{err}$ (which can easily be calculated from the data in $\mathcal{T}$).
Mallows' $C_p$: Definition:
$$C_p = \overline{err} + 2\,\frac{p+1}{N}\,\hat\sigma_e^2,$$
where $N$ is the number of observations in the training set, $p$ is the number of variables in the model, and $\hat\sigma_e^2$ is estimated from the full model.
The $C_p$ statistic simply adds the estimate of the optimism to the training error to obtain an estimate of the in-sample error. This coincides, up to an affine transformation, with the classical definition used in linear regression analysis, where Mallows' $C_p$ is defined as $SSE/\hat\sigma_e^2 + 2(p+1) - N$, with $SSE$ computed from the model under consideration and $\hat\sigma_e^2$ from the full model. (If the model under consideration is the full model itself, then $SSE/\hat\sigma_e^2 = N - p - 1$ and the classical $C_p$ equals $p+1$.) Indeed,
$$\overline{err} + 2\,\frac{p+1}{N}\,\hat\sigma_e^2 = \frac{1}{N}SSE + 2\,\frac{p+1}{N}\,\hat\sigma_e^2 = \frac{1}{N}\hat\sigma_e^2\Bigl\{\frac{SSE}{\hat\sigma_e^2} + 2(p+1)\Bigr\} = \hat\sigma_e^2 + \frac{1}{N}\hat\sigma_e^2\Bigl\{\frac{SSE}{\hat\sigma_e^2} + 2(p+1) - N\Bigr\}.$$
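A small numerical sketch (simulated data, numpy assumed, not from the notes) that computes $C_p$ as defined above for a sequence of nested submodels and checks the identity just derived:

import numpy as np

rng = np.random.default_rng(3)
N = 60
X_all = np.column_stack([np.ones(N), rng.normal(size=(N, 5))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5, 0.0, 0.0])    # last two predictors are pure noise
y = X_all @ beta_true + rng.normal(size=N)

# sigma_e^2 estimated from the full model (intercept + 5 predictors)
resid_full = y - X_all @ np.linalg.lstsq(X_all, y, rcond=None)[0]
sigma2_hat = resid_full @ resid_full / (N - X_all.shape[1])

for p in range(1, 6):                            # submodels with the first p predictors
    X = X_all[:, : p + 1]
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    SSE = resid @ resid
    err = SSE / N                                # training error
    cp = err + 2 * (p + 1) * sigma2_hat / N      # C_p as defined in the notes
    cp_classic = SSE / sigma2_hat + 2 * (p + 1) - N          # classical regression form
    print(p, round(cp, 4), round(sigma2_hat + sigma2_hat * cp_classic / N, 4))  # identical columns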
• Direct estimation: cross-validation. The reason that $\overline{err}$ understates $Err$ is that the same data are used both to fit the model and to assess its goodness of fit. Cross-validation (Stone 1974) addresses this problem by splitting the data into $K$ separate segments, fitting the model using $K-1$ of the segments and assessing the goodness of fit on the held-out segment, with each segment held out in turn. Since the held-out data are 'new' (not used to fit the model), cross-validation attempts to estimate the generalization error directly. Below we discuss the case $K = N$, that is, 'leave-one-out' cross-validation (LOOCV), in which one observation is left out at a time and the model is fitted $N$ times; the LOOCV criterion is a multiple of the PRESS statistic discussed in the previous chapter ($N \times LOOCV = PRESS$).
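The following sketch (simulated data, numpy assumed, not from the notes) verifies numerically that, for a linear model, explicit leave-one-out refitting agrees with the hat-matrix shortcut $LOOCV = N^{-1}\sum_i \{e_i/(1-h_{ii})\}^2 = PRESS/N$:

import numpy as np

rng = np.random.default_rng(4)
N = 40
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=N)

# explicit leave-one-out: refit the model N times
loo = []
for i in range(N):
    keep = np.arange(N) != i
    b = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    loo.append((y[i] - X[i] @ b) ** 2)
loocv_refit = np.mean(loo)

# hat-matrix shortcut: LOOCV = (1/N) sum_i (e_i / (1 - h_ii))^2 = PRESS / N
H = X @ np.linalg.inv(X.T @ X) @ X.T
e = y - H @ y
loocv_press = np.mean((e / (1 - np.diag(H))) ** 2)
print(loocv_refit, loocv_press)                  # agree up to floating-point error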
(The notation follows Section 7.4: $u_i$ is a row vector of zeros except that the element in the $i$-th position is 1, and $K_i = u_i^T u_i$.) Note that if the model under consideration is the true model, then $\eta_i = 0$ since $(I - H)X = 0$; here $H$ is from the model under consideration while $X$ is from the true model.
$$E(N \times LOCV) = \sum_{i=1}^{N} \frac{\eta_i^2 + \sigma^2(1 - h_{ii})}{(1 - h_{ii})^2},$$
$$E(N \times Err_{in}) = E\Bigl\{\sum_{i=1}^{N} E_{y_i^{NEW}}\bigl(y_i^{NEW} - X_i\hat\beta\bigr)^2\Bigr\} = \sum_{i=1}^{N}(1 + h_{ii})\sigma^2.$$
In fact,
$$E(N \times LOCV) - E(N \times Err_{in}) = \sum_{i=1}^{N}\frac{\eta_i^2 + \sigma^2(1 - h_{ii})}{(1 - h_{ii})^2} - \sum_{i=1}^{N}(1 + h_{ii})\sigma^2 = \sum_{i=1}^{N}\frac{\eta_i^2}{(1 - h_{ii})^2} + \sum_{i=1}^{N}\frac{h_{ii}^2}{1 - h_{ii}}\,\sigma^2 > 0,$$
so the leave-one-out estimate tends to overestimate the in-sample error.
The generalized cross-validation (GCV) criterion is
$$GCV = N^{-1}\sum_{i=1}^{N}\frac{e_i^2}{\{1 - (p+1)/N\}^2} \approx N^{-1}e^Te + \frac{2(p+1)}{N}\{N^{-1}e^Te\}.$$
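A brief numerical sketch (simulated data, numpy assumed, not from the notes) comparing GCV with the first-order approximation above:

import numpy as np

rng = np.random.default_rng(5)
N, p = 40, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, 2.0, -0.5, 0.0]) + rng.normal(size=N)

H = X @ np.linalg.inv(X.T @ X) @ X.T
e = y - H @ y
gcv = np.mean(e ** 2) / (1 - (p + 1) / N) ** 2          # exact GCV
approx = np.mean(e ** 2) * (1 + 2 * (p + 1) / N)        # first-order expansion
print(gcv, approx)                                      # close when (p+1)/N is small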
9 Analysis of Variance
• The analysis of variance model can be posed as a linear model, but it is a little tricky since the design matrix may not be of full rank:
$$y_{ij} = \mu + \tau_i + e_{ij}, \qquad i = 1, 2, \; j = 1, 2, 3.$$
In the textbook example, two chemical additives for increasing the mileage of gasoline are compared. The response is the mileage (miles per gallon).
$$Y = \begin{pmatrix} y_{11}\\ y_{12}\\ y_{13}\\ y_{21}\\ y_{22}\\ y_{23}\end{pmatrix} = \begin{pmatrix} 1 & 1 & 0\\ 1 & 1 & 0\\ 1 & 1 & 0\\ 1 & 0 & 1\\ 1 & 0 & 1\\ 1 & 0 & 1\end{pmatrix}\begin{pmatrix} \mu\\ \tau_1\\ \tau_2\end{pmatrix} + \begin{pmatrix} e_{11}\\ e_{12}\\ e_{13}\\ e_{21}\\ e_{22}\\ e_{23}\end{pmatrix} = X\beta + e, \tag{9.1}$$
where $\beta = (\mu, \tau_1, \tau_2)^T$. Note that in this application $(X^TX)^{-1}$ does not exist. Also, $\beta = (15, 1, 3)$, $(10, 6, 8)$ and $(25, -9, -7)$ all give the same fit; that is, $\mu$, $\tau_1$ and $\tau_2$ are not unique and therefore cannot be estimated uniquely. With three parameters and $rank(X) = 2$, the model is said to be overparameterized. Note that increasing the number of observations (replications) for each of the two additives will not remove this rank deficiency. A solution of the normal equations can still be obtained through a generalized inverse,
$$\hat\beta = (X^TX)^{-}X^TY.$$
Recall that a g-inverse is not unique, so $\hat\beta$ is not unique. Such a $\hat\beta$ is not unbiased, since $E(\hat\beta) = (X^TX)^{-}X^TX\beta \ne \beta$. Then, does an unbiased estimator of $\beta$ exist? That is, does a $p \times n$ matrix $A$ exist satisfying $E(AY) = AX\beta = \beta$? The answer is no. To see this, note that $\beta = E(AY) = AX\beta$ must hold for all $\beta$, so the question boils down to whether there exists $A$ such that $AX = I_p$. But using Theorem 2.4(i) from the textbook, and $rank(X) < p$ in (9.1), we have $rank(AX) \le rank(X) < p$. Hence $AX$ cannot be equal to $I_p$.
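A small numerical sketch of this point (the mileage values are made up; only the design follows (9.1)): two different generalized inverses give two different $\hat\beta$, but the same fitted values $X\hat\beta$:

import numpy as np

# one-way ANOVA design from (9.1): two additives, three observations each
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)
y = np.array([16.0, 15.0, 17.0, 18.5, 17.5, 18.0])   # made-up mileages

XtX = X.T @ X
G1 = np.linalg.pinv(XtX)                 # Moore-Penrose generalized inverse
G2 = np.zeros((3, 3))                    # another g-inverse: invert a nonsingular block
G2[1:, 1:] = np.linalg.inv(XtX[1:, 1:])

for G in (G1, G2):
    assert np.allclose(XtX @ G @ XtX, XtX)           # G really is a g-inverse
    b = G @ X.T @ y                                  # one solution of the normal equations
    print("beta_hat:", np.round(b, 3), "fitted:", np.round(X @ b, 3))
# the two beta_hat vectors differ, but the fitted values X @ beta_hat are identical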
(1) Dealing with estimable linear combinations: work with unique and well-defined (estimable) linear combinations of the parameters.
(2) Converting to a full-rank problem: there are a few ways to convert the problem into a familiar full-rank problem. (2.1) Reparameterization: redefine the model using a smaller number of new parameters that are unique. (2.2) Side conditions: impose linear constraints on the parameters so that they become unique.
Example of (2.2): impose $\tau_1 + \tau_2 = 0$, so that $\tau_2 = -\tau_1$ and the model can be written as
$$Y = \begin{pmatrix} y_{11}\\ y_{12}\\ y_{13}\\ y_{21}\\ y_{22}\\ y_{23}\end{pmatrix} = \begin{pmatrix} 1 & 1\\ 1 & 1\\ 1 & 1\\ 1 & -1\\ 1 & -1\\ 1 & -1\end{pmatrix}\begin{pmatrix} \mu\\ \tau_1\end{pmatrix} + \begin{pmatrix} e_{11}\\ e_{12}\\ e_{13}\\ e_{21}\\ e_{22}\\ e_{23}\end{pmatrix}.$$
Similarly, under the cell-means reparameterization of (2.1), $y_{ij} = \mu_i + e_{ij}$, the design matrix is of full rank and the least squares estimate is
$$\begin{pmatrix}\hat\mu_1\\ \hat\mu_2\end{pmatrix} = (X^TX)^{-1}X^TY = \begin{pmatrix}\tfrac13 & 0\\ 0 & \tfrac13\end{pmatrix}\begin{pmatrix} y_{1\cdot}\\ y_{2\cdot}\end{pmatrix} = \begin{pmatrix}\bar y_{1\cdot}\\ \bar y_{2\cdot}\end{pmatrix}.$$
• Estimability: $\lambda^T\beta$ is estimable if and only if any one of the following equivalent conditions holds.
(i) $\lambda^T$ is a linear combination of the rows of $X$; that is, there exists a vector $a$ such that $a^TX = \lambda^T$.
(ii) $\lambda^T$ is a linear combination of the rows of $X^TX$; that is, there exists a vector $r$ such that $r^TX^TX = \lambda^T$, or equivalently $X^TXr = \lambda$.
(iii) $\lambda$ or $\lambda^T$ satisfies $X^TX(X^TX)^{-}\lambda = \lambda$, or equivalently $\lambda^T(X^TX)^{-}X^TX = \lambda^T$.
(i) If there exists a vector $a$ such that $\lambda^T = a^TX$, then, using this vector $a$, we have $E(a^TY) = a^TE(Y) = a^TX\beta = \lambda^T\beta$, so $\lambda^T\beta$ is estimable.
(Proof)
$$XG_1X^T = XG_2X^T.$$
• Theorem 12.3c. If $\lambda_1^T\beta$ and $\lambda_2^T\beta$ are two estimable functions in the model $Y = X\beta + e$, then
$$cov(\lambda_1^T\hat\beta, \lambda_2^T\hat\beta) = \sigma^2 r_1^T\lambda_2 = \sigma^2 \lambda_1^T r_2 = \sigma^2 \lambda_1^T(X^TX)^{-}\lambda_2.$$
• Theorem 12.3e. Let $s^2 = \frac{SSE}{n-k}$, where $SSE = (Y - X\hat\beta)^T(Y - X\hat\beta) = Y^T(I - X(X^TX)^{-}X^T)Y$. For $s^2$ defined in this way for the non-full-rank model, we have $E(s^2) = \sigma^2$.
(Proof)
• If, in addition, $Y$ is $N_n(X\beta, \sigma^2 I)$, then $\hat\beta$ and $s^2$ have the following properties:
(i) $\hat\beta$ is $N_p[(X^TX)^{-}X^TX\beta,\ \sigma^2(X^TX)^{-}X^TX(X^TX)^{-}]$.
(ii) $(n-k)s^2/\sigma^2$ is $\chi^2(n-k)$.
• Reparameterization: let $\gamma = U\beta$, where the $k$ rows of $U$ are linearly independent estimable functions, so that $X = ZU$ and $X\beta = Z\gamma$ with
$$Z = XU^T(UU^T)^{-1}.$$
Note that $Z$ is of full rank, since $rank(Z) \ge rank(ZU) = rank(X) = k$, and we obtain $\hat\gamma = (Z^TZ)^{-1}Z^TY$. Since $Z\gamma = X\beta$, we have $Z\hat\gamma = X\hat\beta$, and $(Y - X\hat\beta)^T(Y - X\hat\beta) = (Y - Z\hat\gamma)^T(Y - Z\hat\gamma)$.
• Side conditions provide linear constraints that make the parameters unique and estimable. (9.3) illustrates how the side condition $\tau_1 + \tau_2 = 0$ handles the rank deficiency of $X$. Then, can we impose just any such side conditions, or should side conditions satisfy certain requirements? This is the topic of this section.
• Requirement for side conditions: the goal of the constraint is to resolve the rank deficiency of $X$ so that the parameter vector $\beta$ can be estimated uniquely. If the constraint consists of estimable functions (linear combinations of the rows of $X$), it adds no information beyond the model itself; thus, it does not help relieve the rank-deficiency problem. Therefore, the side conditions $T\beta = 0$ must be nonestimable, chosen so that the stacked matrix $(X^T, T^T)^T$ has full column rank. The estimator then solves the augmented normal equations, i.e., $(X^TX + T^TT)\hat\beta = X^TY$. Recall that $X$ is $n \times p$ of rank $k$, $p \le n$, and $T$ is a $(p-k) \times p$ matrix whose rows are linearly independent of each other and of the rows of $X$. Then
$$\hat\beta = [X^TX + T^TT]^{-1}(X^T, T^T)\begin{pmatrix} Y\\ 0\end{pmatrix} = (X^TX + T^TT)^{-1}X^TY.$$
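A numerical sketch (made-up responses, same design as (9.1), numpy assumed) of the side-condition estimator $\hat\beta = (X^TX + T^TT)^{-1}X^TY$ with $T = (0, 1, 1)$, i.e., $\tau_1 + \tau_2 = 0$:

import numpy as np

# same two-additive design as (9.1), with made-up responses (group means 16 and 18)
X = np.repeat(np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]), 3, axis=0)
y = np.array([16.0, 15.0, 17.0, 18.5, 17.5, 18.0])

T = np.array([[0.0, 1.0, 1.0]])             # side condition: tau_1 + tau_2 = 0
A = X.T @ X + T.T @ T                       # augmented cross-product matrix (nonsingular)
beta_hat = np.linalg.solve(A, X.T @ y)      # (X'X + T'T)^{-1} X'Y
print(np.round(beta_hat, 3))                # mu, tau_1, tau_2 with tau_1 + tau_2 = 0
print(np.round(X @ beta_hat, 3))            # fitted values are the group means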
• Example: We go back to (9.3) and verify the result. The model is $y_{ij} = \mu + \tau_i + e_{ij}$, $i = 1, 2$, $j = 1, 2, 3$.
• Testable hypothesis: a hypothesis $H_0$ is said to be testable if there exist linearly independent estimable functions $\lambda_1^T\beta, \cdots, \lambda_t^T\beta$ such that $H_0$ is true if and only if $\lambda_1^T\beta = 0, \cdots, \lambda_t^T\beta = 0$.
For example, by taking linear combinations of the rows of $X\beta$, we can obtain the two linearly independent estimable functions $\alpha_1 - \alpha_2$ and $\alpha_1 + \alpha_2 - 2\alpha_3$. The hypothesis $H_0: \alpha_1 = \alpha_2 = \alpha_3$ is true if and only if $\alpha_1 - \alpha_2$ and $\alpha_1 + \alpha_2 - 2\alpha_3$ are simultaneously equal to zero, so $H_0$ is testable.
Writing the model as
$$y = X\beta + e = Z\gamma + e = Z_1\gamma_1 + Z_2\gamma_2 + e,$$
the F statistic is
$$F = \frac{SS(\gamma_1 \mid \gamma_2)/t}{SSE/(n-k)}.$$
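A short sketch (simulated one-way layout with made-up group means, numpy assumed, not from the notes) of this full-versus-reduced-model F statistic for testing equality of the group means:

import numpy as np

rng = np.random.default_rng(6)
n_per, groups = 8, 3
y = np.concatenate([rng.normal(loc=m, size=n_per) for m in (5.0, 5.5, 7.0)])  # made-up means
g = np.repeat(np.arange(groups), n_per)
n = y.size

Z_full = np.zeros((n, groups)); Z_full[np.arange(n), g] = 1.0   # full model: one mean per group
Z_red = np.ones((n, 1))                                         # reduced model: common mean

def sse(Z):
    b = np.linalg.lstsq(Z, y, rcond=None)[0]
    r = y - Z @ b
    return r @ r

k = groups                    # rank of the full design
t = groups - 1                # number of independent estimable functions tested
SS_gain = sse(Z_red) - sse(Z_full)              # SS(gamma_1 | gamma_2)
F = (SS_gain / t) / (sse(Z_full) / (n - k))
print("F =", round(F, 3))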
Theorem 12.7b
(Proof)
(i) Since
$$C\beta = \begin{pmatrix} c_1^T\beta\\ c_2^T\beta\\ \vdots\\ c_m^T\beta\end{pmatrix},$$
(ii)