Appliedstat 2017 Chapter 8 9
This chapter is partly based on the book ‘The Elements of Statistical Learning’
by Hastie, Tibshirani and Friedman.
8.1 Terminology
• ‘Model selection’ is the task of choosing the ‘best’ model among several candidates by comparing estimates of their performance. Having chosen a final model, ‘model assessment’ is the task of estimating its generalization error (or prediction error) on new data.
• Loss function: Finding the ‘best’ estimate implies that we need a measure of how ‘good’ an estimate is. One such measure, called a loss function, is denoted by $L(y, \hat y)$. Given $X$ and the true $y$, this function tells us how ‘good’ our prediction $\hat y = \hat f(X)$ is. The squared loss function is $L(y, \hat y) = (y - \hat y)^2$ and the absolute loss function is $L(y, \hat y) = |y - \hat y|$.
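As a quick numerical illustration (not part of the original notes; the numbers and the use of numpy are assumptions), the two loss functions can be evaluated elementwise for a vector of predictions:

import numpy as np

y = np.array([3.0, -1.0, 2.5])        # observed responses (made-up numbers)
y_hat = np.array([2.5, 0.0, 2.0])     # predictions from some fitted model

squared_loss = (y - y_hat) ** 2       # L(y, yhat) = (y - yhat)^2, elementwise
absolute_loss = np.abs(y - y_hat)     # L(y, yhat) = |y - yhat|
print(squared_loss.mean(), absolute_loss.mean())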
• We would like a model with a small value of the loss function, but the loss function cannot be evaluated directly, since a model has to be selected before new data are observed. Instead, we work with the generalization error (expected prediction error)
$$Err = E\{L(Y, \hat f(X))\}.$$
Note that the above expectation averages over everything that is random, including the randomness in the training sample that produced $\hat f$.
• Definition 3. Training error: Training error is the average loss over the training sample,
$$\overline{err} = N^{-1}\sum_{i=1}^{N} L(y_i, \hat f(x_i)).$$
Note that OLS minimizes this training error under the squared loss function. Typically, the training error understates the generalization error (why?). Whereas the ‘best’ model lies at some intermediate level of complexity, naively minimizing the training error would lead us to choose the most complex model.
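As an illustrative sketch (not from the notes; simulated data and numpy are assumptions), fitting polynomials of increasing degree to one fixed training sample shows that the training error under squared loss can only decrease as the model grows more complex:

import numpy as np

rng = np.random.default_rng(0)
N = 30
x = np.linspace(0.0, 1.0, N)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)   # true f plus noise

for degree in (1, 3, 5, 7):
    coef = np.polyfit(x, y, deg=degree)       # OLS fit of a degree-d polynomial
    y_hat = np.polyval(coef, x)
    err = np.mean((y - y_hat) ** 2)           # training error under squared loss
    print(f"degree {degree}: training error = {err:.4f}")    # decreases with degree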
In the context of usual linear models (i.e., assume the design matrix $X$ is given), the variance part (the third term) changes with $x_0$, but its average (with $x_0$ taken to be each of the sample values $x_i$) is $\frac{p+1}{N}\sigma_e^2$. Then we have
$$N^{-1}\sum_{i=1}^{N} Err(x_i) = \sigma_e^2 + N^{-1}\sum_{i=1}^{N}\bigl[f(x_i) - E(\hat f(x_i))\bigr]^2 + \frac{p+1}{N}\sigma_e^2.$$
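Since the variance of the fit at $x_i$ is $x_i(X^TX)^{-1}x_i^T\sigma_e^2$ (see below), its sample average is $\sigma_e^2\,tr(H)/N = (p+1)\sigma_e^2/N$. A minimal numerical check (simulated design and numpy assumed, not from the notes):

import numpy as np

rng = np.random.default_rng(1)
N, p = 50, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # intercept + p predictors
sigma2 = 2.0                                                 # assumed error variance

H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
avg_variance = sigma2 * np.trace(H) / N       # (1/N) sum_i x_i (X'X)^{-1} x_i' sigma^2
print(avg_variance, (p + 1) * sigma2 / N)     # both equal (p+1) sigma^2 / N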
• In-sample error: It is obtained when new responses are observed for the training-set features, given $X$:
$$Err_{in} = \frac{1}{N}\sum_{i=1}^{N} E_{y_i^{NEW}}\bigl\{L(y_i^{NEW}, \hat f(x_i)) \mid \mathcal{T}\bigr\}.$$
Please note that there are two versions of in-sample error. Another version is $N^{-1}\sum_{i=1}^{N} Err(x_i)$. The expectation in the first version is under $y_i^{NEW}$ only, while the expectation in the second version is under both $y^{NEW}$ and $\mathcal{T}$.
ErrT can be thought of as extra-sample error since the test input vectors do
not need to coincide with the training input vectors.
• Bias-variance trade-off: For the squared loss function, recall that we have the decomposition
$$Err(x_0) = \sigma_e^2 + \bigl[f(x_0) - E(\hat f(x_0))\bigr]^2 + var(\hat f(x_0)) = \sigma_e^2 + bias^2(\hat f(x_0)) + var(\hat f(x_0)).$$
The irreducible error $\sigma_e^2$ in $y^{NEW}$: It is the variance of the target around its true mean $f(x_0)$. Even if our statistical model represented the real underlying model $f$ exactly, this error would remain, so it cannot be avoided.
Bias: $bias^2(\hat f(x_0))$ represents the amount by which the average of our estimate differs from the true mean. It is a systematic source of error arising from misspecification of the model.
Variance: $var(\hat f(x_0))$ measures how much the estimate $\hat f(x_0)$ varies around its own mean as the training sample changes.
Typically, the more complex we make the model $\hat f$, the lower the (squared) bias but the higher the variance. For example, consider the linear model example introduced earlier. Recall that $var(\hat f(x_0)) = x_0(X^TX)^{-1}x_0^T\sigma_e^2$. When we consider the average of the expected prediction error over the training sample, the variance component becomes $\frac{p+1}{N}\sigma_e^2$, which becomes larger as the number of parameters increases.
Denoting the design matrix from a new sample by $\tilde X$, let us compare the expectations of the generalization errors under the two models. If
$$\{(\tilde X_1 A - \tilde X_2)\beta_2\}^2 \le \sigma^2 (\tilde X_1 A - \tilde X_2) B^{-1} (\tilde X_1 A - \tilde X_2)^T,$$
then the smaller model (model 1) has expected prediction error no larger than that of the full model (model 2). This result reveals the startling fact that even though minimizing the training error leads to the model with the least bias (model 2), we may intentionally choose a biased model (model 1) so as to improve prediction. This illustrates that, through the bias-variance trade-off, there is an optimal model complexity that gives the minimum generalization error. A judicious way to find the model with the best prediction is therefore to estimate $Err$ accurately and choose the model that minimizes that quantity.
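The following Monte Carlo sketch illustrates this point under assumed values (a true model with a deliberately small coefficient on $x_2$); it is only illustrative and not taken from the notes. Model 1 omits $x_2$ (biased, lower variance), while model 2 uses both predictors (unbiased, higher variance):

import numpy as np

rng = np.random.default_rng(2)
N, sigma = 20, 1.0
beta1, beta2 = 2.0, 0.15       # beta2 is deliberately small

def one_replication():
    x1, x2 = rng.normal(size=N), rng.normal(size=N)
    y = beta1 * x1 + beta2 * x2 + rng.normal(scale=sigma, size=N)
    x1n, x2n = rng.normal(size=N), rng.normal(size=N)                  # new inputs
    yn = beta1 * x1n + beta2 * x2n + rng.normal(scale=sigma, size=N)   # new responses
    X1, X1n = x1[:, None], x1n[:, None]                                # model 1: omits x2
    X2, X2n = np.column_stack([x1, x2]), np.column_stack([x1n, x2n])   # model 2: full
    b1 = np.linalg.lstsq(X1, y, rcond=None)[0]
    b2 = np.linalg.lstsq(X2, y, rcond=None)[0]
    return np.mean((yn - X1n @ b1) ** 2), np.mean((yn - X2n @ b2) ** 2)

errs = np.array([one_replication() for _ in range(2000)])
print("average prediction error, biased model 1:", errs[:, 0].mean())
print("average prediction error, full   model 2:", errs[:, 1].mean())

With these particular settings the smaller, biased model tends to come out ahead; of course the comparison can flip if the omitted coefficient is large relative to the noise and sample size.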
• Optimism: The optimism is defined as the difference between $Err_{in}$ and the training error $\overline{err}$:
$$op = Err_{in} - \overline{err}.$$
• Expected optimism (or average optimism): The expectation is over the training-set outcome values,
$$\omega = E_y(op).$$
The in-sample error can then be estimated as
$$\widehat{Err}_{in} = \overline{err} + \hat\omega.$$
• There are two strategies to estimate the in-sample error, direct and indirect
estimation.
• Indirect estimation:
$$\overline{err} = N^{-1}\sum_{i=1}^{N}\bigl(y_i - \hat f(x_i)\bigr)^2 = N^{-1}\sum_{i=1}^{N}\bigl[y_i - f(x_i) + f(x_i) - E\{\hat f(x_i)\} + E\{\hat f(x_i)\} - \hat f(x_i)\bigr]^2.$$
Then,
$$E_y(\overline{err}) = \frac{1}{N}\sum_i\Bigl[\sigma_e^2 + var(\hat y_i) + \bigl(f(x_i) - E(\hat y_i)\bigr)^2 - 2\,cov(y_i, \hat y_i)\Bigr].$$
Also,
$$Err_{in} = \frac{1}{N}\sum_i\Bigl[\sigma_e^2 + \bigl(\hat y_i - E(\hat y_i)\bigr)^2 + \bigl(f(x_i) - E(\hat y_i)\bigr)^2 - 2\bigl(\hat y_i - E(\hat y_i)\bigr)\bigl(f(x_i) - E(\hat y_i)\bigr)\Bigr],$$
$$E_y(Err_{in}) = \frac{1}{N}\sum_i\Bigl[\sigma_e^2 + var(\hat y_i) + \bigl(f(x_i) - E(\hat y_i)\bigr)^2\Bigr].$$
Thus, the expected optimism is
$$\omega = E_y(Err_{in}) - E_y(\overline{err}) = \frac{2}{N}\sum_{i=1}^{N} cov(y_i, \hat y_i).$$
For linear models, $\sum_{i=1}^{N} cov(y_i, \hat y_i) = tr\{cov(y, Hy)\} = \sigma_e^2(p+1)$, which essentially says that the optimism grows linearly with the number of parameters. Methods like Mallows' $C_p$, the Akaike information criterion (AIC), and the Bayesian information criterion (BIC) estimate $Err_{in}$ by estimating the expected optimism and then adding it to $\overline{err}$ (which can easily be calculated from the data in $\mathcal{T}$).
Mallows' $C_p$: Definition:
$$C_p = \overline{err} + 2\,\frac{p+1}{N}\,\hat\sigma_e^2,$$
where $N$ is the number of observations in the training set, $p$ is the number of variables in the model, and $\hat\sigma_e^2$ is estimated from the full model.
The $C_p$ statistic simply adds the estimate of the optimism to the training error to obtain an estimate of the in-sample error. This coincides, up to an affine transformation, with the classical definition used in linear regression analysis, where Mallows' $C_p$ is defined as $SSE/\hat\sigma_e^2 + 2(p+1) - N$, with $SSE$ computed from the model under consideration and $\hat\sigma_e^2$ from the full model. (If the model under consideration is the full model itself, then $SSE/\hat\sigma_e^2 = N - p - 1$ and the classical $C_p$ equals $p+1$.) Indeed,
$$\overline{err} + 2\,\frac{p+1}{N}\,\hat\sigma_e^2 = \frac{1}{N}SSE + 2\,\frac{p+1}{N}\,\hat\sigma_e^2 = \frac{1}{N}\hat\sigma_e^2\Bigl\{\frac{SSE}{\hat\sigma_e^2} + 2(p+1)\Bigr\} = \hat\sigma_e^2 + \frac{1}{N}\hat\sigma_e^2\Bigl\{\frac{SSE}{\hat\sigma_e^2} + 2(p+1) - N\Bigr\}.$$
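A small numerical sketch (simulated data, numpy assumed, not from the notes) that computes $C_p$ as defined above for a sequence of nested submodels and checks the identity just derived:

import numpy as np

rng = np.random.default_rng(3)
N = 60
X_all = np.column_stack([np.ones(N), rng.normal(size=(N, 5))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5, 0.0, 0.0])    # last two predictors are pure noise
y = X_all @ beta_true + rng.normal(size=N)

# sigma_e^2 estimated from the full model (intercept + 5 predictors)
resid_full = y - X_all @ np.linalg.lstsq(X_all, y, rcond=None)[0]
sigma2_hat = resid_full @ resid_full / (N - X_all.shape[1])

for p in range(1, 6):                            # submodels with the first p predictors
    X = X_all[:, : p + 1]
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    SSE = resid @ resid
    err = SSE / N                                # training error
    cp = err + 2 * (p + 1) * sigma2_hat / N      # C_p as defined in the notes
    cp_classic = SSE / sigma2_hat + 2 * (p + 1) - N          # classical regression form
    print(p, round(cp, 4), round(sigma2_hat + sigma2_hat * cp_classic / N, 4))  # identical columns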
• Direct estimation: cross-validation. The reason that $\overline{err}$ understates $Err$ is that the same data are used both to fit the model and to assess its goodness of fit. Cross-validation (Stone 1974) addresses this problem by splitting the data into $K$ separate segments, fitting the model using $K-1$ of the segments and assessing the goodness of fit on the held-out segment, with each segment held out in turn. Since the held-out data are 'new' (not used to fit the model), cross-validation attempts to estimate the generalization error directly. Below we discuss the case $K = N$, that is, 'leave-one-out' cross-validation (LOOCV), in which one observation is left out at a time and the model is fitted $N$ times; the LOOCV criterion is a multiple of the PRESS statistic discussed in the previous chapter ($N \times LOOCV = PRESS$).
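The following sketch (simulated data, numpy assumed, not from the notes) verifies numerically that, for a linear model, explicit leave-one-out refitting agrees with the hat-matrix shortcut $LOOCV = N^{-1}\sum_i \{e_i/(1-h_{ii})\}^2 = PRESS/N$:

import numpy as np

rng = np.random.default_rng(4)
N = 40
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=N)

# explicit leave-one-out: refit the model N times
loo = []
for i in range(N):
    keep = np.arange(N) != i
    b = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    loo.append((y[i] - X[i] @ b) ** 2)
loocv_refit = np.mean(loo)

# hat-matrix shortcut: LOOCV = (1/N) sum_i (e_i / (1 - h_ii))^2 = PRESS / N
H = X @ np.linalg.inv(X.T @ X) @ X.T
e = y - H @ y
loocv_press = np.mean((e / (1 - np.diag(H))) ** 2)
print(loocv_refit, loocv_press)                  # agree up to floating-point error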
(The notation follows Section 7.4: $u_i$ is a row vector of zeros except that the element in the $i$-th position is 1, and $K_i = u_i^T u_i$.) Note that if the model under consideration is the true model, then $\eta_i = 0$ since $(I - H)X = 0$; here $H$ is from the model under consideration while $X$ is from the true model.
$$E(N \times LOCV) = \sum_{i=1}^{N} \frac{\eta_i^2 + \sigma^2(1 - h_{ii})}{(1 - h_{ii})^2},$$
$$E(N \times Err_{in}) = E\Bigl\{\sum_{i=1}^{N} E_{y_i^{NEW}}\bigl(y_i^{NEW} - X_i\hat\beta\bigr)^2\Bigr\} = \sum_{i=1}^{N}(1 + h_{ii})\sigma^2.$$
In fact,
$$E(N \times LOCV) - E(N \times Err_{in}) = \sum_{i=1}^{N}\frac{\eta_i^2 + \sigma^2(1 - h_{ii})}{(1 - h_{ii})^2} - \sum_{i=1}^{N}(1 + h_{ii})\sigma^2 = \sum_{i=1}^{N}\frac{\eta_i^2}{(1 - h_{ii})^2} + \sum_{i=1}^{N}\frac{h_{ii}^2}{1 - h_{ii}}\,\sigma^2 > 0,$$
so the leave-one-out estimate tends to overestimate the in-sample error.
The generalized cross-validation (GCV) criterion is
$$GCV = N^{-1}\sum_{i=1}^{N}\frac{e_i^2}{\{1 - (p+1)/N\}^2} \approx N^{-1}e^Te + \frac{2(p+1)}{N}\{N^{-1}e^Te\}.$$
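A brief numerical sketch (simulated data, numpy assumed, not from the notes) comparing GCV with the first-order approximation above:

import numpy as np

rng = np.random.default_rng(5)
N, p = 40, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, 2.0, -0.5, 0.0]) + rng.normal(size=N)

H = X @ np.linalg.inv(X.T @ X) @ X.T
e = y - H @ y
gcv = np.mean(e ** 2) / (1 - (p + 1) / N) ** 2          # exact GCV
approx = np.mean(e ** 2) * (1 + 2 * (p + 1) / N)        # first-order expansion
print(gcv, approx)                                      # close when (p+1)/N is small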
9 Analysis of Variance
• The analysis of variance model can be posed as a linear model, but it is a little tricky since the design matrix may not be of full rank:
$$y_{ij} = \mu + \tau_i + e_{ij}, \qquad i = 1, 2, \; j = 1, 2, 3.$$
In the textbook example, two chemical additives for increasing the mileage of gasoline are compared. The response is the mileage (miles per gallon).
$$Y = \begin{pmatrix} y_{11}\\ y_{12}\\ y_{13}\\ y_{21}\\ y_{22}\\ y_{23}\end{pmatrix} = \begin{pmatrix} 1 & 1 & 0\\ 1 & 1 & 0\\ 1 & 1 & 0\\ 1 & 0 & 1\\ 1 & 0 & 1\\ 1 & 0 & 1\end{pmatrix}\begin{pmatrix} \mu\\ \tau_1\\ \tau_2\end{pmatrix} + \begin{pmatrix} e_{11}\\ e_{12}\\ e_{13}\\ e_{21}\\ e_{22}\\ e_{23}\end{pmatrix} = X\beta + e, \tag{9.1}$$
where $\beta = (\mu, \tau_1, \tau_2)^T$. Note that in this application $(X^TX)^{-1}$ does not exist. Also, $\beta = (15, 1, 3)$, $(10, 6, 8)$ and $(25, -9, -7)$ all give the same fit; that is, $\mu$, $\tau_1$ and $\tau_2$ are not unique and therefore cannot be estimated uniquely. With three parameters and $rank(X) = 2$, the model is said to be overparameterized. Note that increasing the number of observations (replications) for each of the two additives will not remove this rank deficiency. A solution of the normal equations can still be obtained through a generalized inverse,
$$\hat\beta = (X^TX)^{-}X^TY.$$
Recall that a g-inverse is not unique, so $\hat\beta$ is not unique. Such a $\hat\beta$ is not unbiased, since $E(\hat\beta) = (X^TX)^{-}X^TX\beta \ne \beta$. Then, does an unbiased estimator of $\beta$ exist? That is, does a $p \times n$ matrix $A$ exist satisfying $E(AY) = AX\beta = \beta$? The answer is no. To see this, note that $\beta = E(AY) = AX\beta$ must hold for all $\beta$, so the question boils down to whether there exists $A$ such that $AX = I_p$. But using Theorem 2.4(i) from the textbook, and $rank(X) < p$ in (9.1), we have $rank(AX) \le rank(X) < p$. Hence $AX$ cannot be equal to $I_p$.
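A small numerical sketch of this point (the mileage values are made up; only the design follows (9.1)): two different generalized inverses give two different $\hat\beta$, but the same fitted values $X\hat\beta$:

import numpy as np

# one-way ANOVA design from (9.1): two additives, three observations each
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)
y = np.array([16.0, 15.0, 17.0, 18.5, 17.5, 18.0])   # made-up mileages

XtX = X.T @ X
G1 = np.linalg.pinv(XtX)                 # Moore-Penrose generalized inverse
G2 = np.zeros((3, 3))                    # another g-inverse: invert a nonsingular block
G2[1:, 1:] = np.linalg.inv(XtX[1:, 1:])

for G in (G1, G2):
    assert np.allclose(XtX @ G @ XtX, XtX)           # G really is a g-inverse
    b = G @ X.T @ y                                  # one solution of the normal equations
    print("beta_hat:", np.round(b, 3), "fitted:", np.round(X @ b, 3))
# the two beta_hat vectors differ, but the fitted values X @ beta_hat are identical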
(1) Dealing with estimable linear combinations: work with unique and well-defined (estimable) linear combinations of the parameters.
(2) Converting to a full-rank problem: there are a few ways to convert the problem into a familiar full-rank problem. (2.1) Reparameterization: redefine the model using a smaller number of new parameters that are unique. (2.2) Side conditions: impose linear constraints on the parameters so that they become unique.
Example of (2.2): impose $\tau_1 + \tau_2 = 0$, so that $\tau_2 = -\tau_1$ and the model can be written as
$$Y = \begin{pmatrix} y_{11}\\ y_{12}\\ y_{13}\\ y_{21}\\ y_{22}\\ y_{23}\end{pmatrix} = \begin{pmatrix} 1 & 1\\ 1 & 1\\ 1 & 1\\ 1 & -1\\ 1 & -1\\ 1 & -1\end{pmatrix}\begin{pmatrix} \mu\\ \tau_1\end{pmatrix} + \begin{pmatrix} e_{11}\\ e_{12}\\ e_{13}\\ e_{21}\\ e_{22}\\ e_{23}\end{pmatrix}.$$
Similarly, under the cell-means reparameterization of (2.1), $y_{ij} = \mu_i + e_{ij}$, the design matrix is of full rank and the least squares estimate is
$$\begin{pmatrix}\hat\mu_1\\ \hat\mu_2\end{pmatrix} = (X^TX)^{-1}X^TY = \begin{pmatrix}\tfrac13 & 0\\ 0 & \tfrac13\end{pmatrix}\begin{pmatrix} y_{1\cdot}\\ y_{2\cdot}\end{pmatrix} = \begin{pmatrix}\bar y_{1\cdot}\\ \bar y_{2\cdot}\end{pmatrix}.$$
• Estimability: $\lambda^T\beta$ is estimable if and only if any one of the following equivalent conditions holds.
(i) $\lambda^T$ is a linear combination of the rows of $X$; that is, there exists a vector $a$ such that $a^TX = \lambda^T$.
(ii) $\lambda^T$ is a linear combination of the rows of $X^TX$; that is, there exists a vector $r$ such that $r^TX^TX = \lambda^T$, or equivalently $X^TXr = \lambda$.
(iii) $\lambda$ or $\lambda^T$ satisfies $X^TX(X^TX)^{-}\lambda = \lambda$, or equivalently $\lambda^T(X^TX)^{-}X^TX = \lambda^T$.
(i) If there exists a vector $a$ such that $\lambda^T = a^TX$, then, using this vector $a$, we have $E(a^TY) = a^TE(Y) = a^TX\beta = \lambda^T\beta$, so $\lambda^T\beta$ is estimable.
(Proof)
$$XG_1X^T = XG_2X^T.$$
• Theorem 12.3c. If $\lambda_1^T\beta$ and $\lambda_2^T\beta$ are two estimable functions in the model $Y = X\beta + e$, then
$$cov(\lambda_1^T\hat\beta, \lambda_2^T\hat\beta) = \sigma^2 r_1^T\lambda_2 = \sigma^2 \lambda_1^T r_2 = \sigma^2 \lambda_1^T(X^TX)^{-}\lambda_2.$$
• Theorem 12.3e. Let $s^2 = \frac{SSE}{n-k}$, where $SSE = (Y - X\hat\beta)^T(Y - X\hat\beta) = Y^T(I - X(X^TX)^{-}X^T)Y$. For $s^2$ defined in this way for the non-full-rank model, we have $E(s^2) = \sigma^2$.
(Proof)
• If, in addition, $Y$ is $N_n(X\beta, \sigma^2 I)$, then $\hat\beta$ and $s^2$ have the following properties:
(i) $\hat\beta$ is $N_p[(X^TX)^{-}X^TX\beta,\ \sigma^2(X^TX)^{-}X^TX(X^TX)^{-}]$.
(ii) $(n-k)s^2/\sigma^2$ is $\chi^2(n-k)$.
• Reparameterization: let $\gamma = U\beta$, where the $k$ rows of $U$ are linearly independent estimable functions, so that $X = ZU$ and $X\beta = Z\gamma$ with
$$Z = XU^T(UU^T)^{-1}.$$
Note that $Z$ is of full rank, since $rank(Z) \ge rank(ZU) = rank(X) = k$, and we obtain $\hat\gamma = (Z^TZ)^{-1}Z^TY$. Since $Z\gamma = X\beta$, we have $Z\hat\gamma = X\hat\beta$, and $(Y - X\hat\beta)^T(Y - X\hat\beta) = (Y - Z\hat\gamma)^T(Y - Z\hat\gamma)$.
• Side conditions provide linear constraints that make the parameters unique and estimable. (9.3) illustrates how the side condition $\tau_1 + \tau_2 = 0$ handles the rank deficiency of $X$. Then, can we impose just any such side conditions, or should side conditions satisfy certain requirements? This is the topic of this section.
• Requirement for side conditions: the goal of the constraint is to resolve the rank deficiency of $X$ so that the parameter vector $\beta$ can be estimated uniquely. If the constraint consists of estimable functions (linear combinations of the rows of $X$), it adds no information beyond the model itself; thus, it does not help relieve the rank-deficiency problem. Therefore, the side conditions $T\beta = 0$ must be nonestimable, chosen so that the stacked matrix $(X^T, T^T)^T$ has full column rank. The estimator then solves the augmented normal equations, i.e., $(X^TX + T^TT)\hat\beta = X^TY$. Recall that $X$ is $n \times p$ of rank $k$, $p \le n$, and $T$ is a $(p-k) \times p$ matrix whose rows are linearly independent of each other and of the rows of $X$. Then
$$\hat\beta = [X^TX + T^TT]^{-1}(X^T, T^T)\begin{pmatrix} Y\\ 0\end{pmatrix} = (X^TX + T^TT)^{-1}X^TY.$$
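A numerical sketch (made-up responses, same design as (9.1), numpy assumed) of the side-condition estimator $\hat\beta = (X^TX + T^TT)^{-1}X^TY$ with $T = (0, 1, 1)$, i.e., $\tau_1 + \tau_2 = 0$:

import numpy as np

# same two-additive design as (9.1), with made-up responses (group means 16 and 18)
X = np.repeat(np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]), 3, axis=0)
y = np.array([16.0, 15.0, 17.0, 18.5, 17.5, 18.0])

T = np.array([[0.0, 1.0, 1.0]])             # side condition: tau_1 + tau_2 = 0
A = X.T @ X + T.T @ T                       # augmented cross-product matrix (nonsingular)
beta_hat = np.linalg.solve(A, X.T @ y)      # (X'X + T'T)^{-1} X'Y
print(np.round(beta_hat, 3))                # mu, tau_1, tau_2 with tau_1 + tau_2 = 0
print(np.round(X @ beta_hat, 3))            # fitted values are the group means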
• Example: We go back to (9.3) and verify the result. The model is $y_{ij} = \mu + \tau_i + e_{ij}$, $i = 1, 2$, $j = 1, 2, 3$.
• Testable hypothesis: a hypothesis $H_0$ is said to be testable if there exist linearly independent estimable functions $\lambda_1^T\beta, \cdots, \lambda_t^T\beta$ such that $H_0$ is true if and only if $\lambda_1^T\beta = 0, \cdots, \lambda_t^T\beta = 0$.
For example, by taking linear combinations of the rows of $X\beta$, we can obtain the two linearly independent estimable functions $\alpha_1 - \alpha_2$ and $\alpha_1 + \alpha_2 - 2\alpha_3$. The hypothesis $H_0: \alpha_1 = \alpha_2 = \alpha_3$ is true if and only if $\alpha_1 - \alpha_2$ and $\alpha_1 + \alpha_2 - 2\alpha_3$ are simultaneously equal to zero, so $H_0$ is testable.
Writing the model as
$$y = X\beta + e = Z\gamma + e = Z_1\gamma_1 + Z_2\gamma_2 + e,$$
the F statistic is
$$F = \frac{SS(\gamma_1 \mid \gamma_2)/t}{SSE/(n-k)}.$$
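A short sketch (simulated one-way layout with made-up group means, numpy assumed, not from the notes) of this full-versus-reduced-model F statistic for testing equality of the group means:

import numpy as np

rng = np.random.default_rng(6)
n_per, groups = 8, 3
y = np.concatenate([rng.normal(loc=m, size=n_per) for m in (5.0, 5.5, 7.0)])  # made-up means
g = np.repeat(np.arange(groups), n_per)
n = y.size

Z_full = np.zeros((n, groups)); Z_full[np.arange(n), g] = 1.0   # full model: one mean per group
Z_red = np.ones((n, 1))                                         # reduced model: common mean

def sse(Z):
    b = np.linalg.lstsq(Z, y, rcond=None)[0]
    r = y - Z @ b
    return r @ r

k = groups                    # rank of the full design
t = groups - 1                # number of independent estimable functions tested
SS_gain = sse(Z_red) - sse(Z_full)              # SS(gamma_1 | gamma_2)
F = (SS_gain / t) / (sse(Z_full) / (n - k))
print("F =", round(F, 3))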
Theorem 12.7b
(Proof)
(i) Since
$$C\beta = \begin{pmatrix} c_1^T\beta\\ c_2^T\beta\\ \vdots\\ c_m^T\beta\end{pmatrix},$$
(ii)