
Statistics 203: Introduction to Regression and Analysis of Variance

Model Selection: General Techniques

Jonathan Taylor
Today

■ Outlier detection / simultaneous inference.
■ Goals of model selection.
■ Criteria to compare models.
■ (Some) model selection strategies.
Crude outlier detection test

■ If a studentized residual is large, the observation may be an outlier.
■ Problem: if $n$ is large and we "threshold" at $t_{1-\alpha/2,\,n-p-1}$, we will get many outliers by chance even if the model is correct.
■ Solution: Bonferroni correction, threshold at $t_{1-\alpha/(2n),\,n-p-1}$.
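A minimal R sketch of this test, assuming a hypothetical fitted lm() object `fit`; rstudent() returns the externally studentized residuals:

```r
## Crude vs. Bonferroni outlier detection for a hypothetical lm() fit.
n <- nobs(fit)
p <- length(coef(fit))                    # number of coefficients
ti <- rstudent(fit)                       # studentized residuals ~ t(n - p - 1)
alpha <- 0.05
crude <- qt(1 - alpha / 2, df = n - p - 1)         # naive threshold
bonf  <- qt(1 - alpha / (2 * n), df = n - p - 1)   # Bonferroni threshold
which(abs(ti) > bonf)                     # observations flagged as outliers
```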
Bonferroni correction

■ If we are doing many $t$ (or other) tests, say $m > 1$, we can control the overall false positive rate at $\alpha$ by testing each one at level $\alpha/m$.
■ Proof:
$$
\begin{aligned}
P(\text{at least one false positive})
&= P\Big(\bigcup_{i=1}^{m} \big\{ |T_i| \ge t_{1-\alpha/(2m),\,n-p-1} \big\}\Big) \\
&\le \sum_{i=1}^{m} P\big( |T_i| \ge t_{1-\alpha/(2m),\,n-p-1} \big)
 = \sum_{i=1}^{m} \frac{\alpha}{m} = \alpha.
\end{aligned}
$$
■ Known as "simultaneous inference": controlling the overall false positive rate at $\alpha$ while performing many tests.
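In R, the built-in p.adjust() applies this correction; a small illustration with made-up p-values:

```r
## Bonferroni adjustment of m = 4 hypothetical p-values.
pvals <- c(0.003, 0.020, 0.400, 0.650)
p.adjust(pvals, method = "bonferroni")        # each p-value multiplied by m (capped at 1)
p.adjust(pvals, method = "bonferroni") < 0.05 # rejections with overall rate held at 0.05
```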
Simultaneous inference for β

■ Another common situation in which simultaneous inference arises is inference about the entire coefficient vector $\beta$.
■ Using the facts that
$$
\hat\beta \sim N\big(\beta,\ \sigma^2 (X^t X)^{-1}\big), \qquad
\hat\sigma^2 \sim \sigma^2 \cdot \frac{\chi^2_{n-p}}{n-p},
$$
along with $\hat\beta \perp \hat\sigma^2$, leads to
$$
\frac{(\hat\beta - \beta)^t (X^t X)(\hat\beta - \beta)/p}{\hat\sigma^2}
\sim \frac{\chi^2_{p}/p}{\chi^2_{n-p}/(n-p)} \sim F_{p,\,n-p}.
$$
■ $(1-\alpha) \cdot 100\%$ simultaneous confidence region:
$$
\Big\{ \beta : (\hat\beta - \beta)^t (X^t X)(\hat\beta - \beta) \le p\,\hat\sigma^2\, F_{p,\,n-p,\,1-\alpha} \Big\}
$$
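A sketch of checking whether a candidate vector beta0 lies in this region, assuming a hypothetical data frame `dat` with response `y` and predictors `x1`, `x2`:

```r
## Membership check for the (1 - alpha) simultaneous confidence region.
fit <- lm(y ~ x1 + x2, data = dat)
X <- model.matrix(fit)
n <- nrow(X); p <- ncol(X)
beta.hat <- coef(fit)
sigma2.hat <- sum(residuals(fit)^2) / (n - p)    # usual unbiased estimate

in.region <- function(beta0, alpha = 0.05) {
  d <- beta.hat - beta0
  stat <- drop(t(d) %*% crossprod(X) %*% d)      # (bhat - b)' X'X (bhat - b)
  stat <= p * sigma2.hat * qf(1 - alpha, p, n - p)
}
in.region(rep(0, p))   # e.g., is beta = 0 inside the 95% region?
```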
Model selection: goals

■ When we have many predictors (with many possible interactions), it can be difficult to find a good model.
■ Which main effects do we include?
■ Which interactions do we include?
■ Model selection tries to "simplify" this task.
Model selection: general

■ This is an "unsolved" problem in statistics: there are no magic procedures to get you the "best model."
■ In some sense, model selection is "data mining."
■ Data miners / machine learners often work with very many predictors.
Model selection: strategies

■ To "implement" this, we need:
  ◆ a criterion or benchmark to compare two models.
  ◆ a search strategy.
■ With a limited number of predictors, it is possible to search all possible models.
Possible criteria

■ $R^2$: not a good criterion. It always increases with model size, so the "optimum" is to take the biggest model.
■ Adjusted $R^2$: better. It "penalizes" bigger models.
■ Mallow's $C_p$.
■ Akaike's Information Criterion (AIC), Schwarz's BIC.
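A quick illustration of the first two points, assuming hypothetical nested lm() fits on a data frame `dat`:

```r
## R^2 never decreases when a predictor is added; adjusted R^2 can.
small <- lm(y ~ x1, data = dat)
big   <- lm(y ~ x1 + x2, data = dat)
summary(small)$r.squared      # always <= the next line
summary(big)$r.squared
summary(small)$adj.r.squared  # may exceed the next line
summary(big)$adj.r.squared
```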
Mallow's Cp

■ $C_p(M) = \dfrac{SSE(M)}{\hat\sigma^2} - n + 2 \cdot p(M)$.
■ $\hat\sigma^2 = SSE(F)/df_F$ is the "best" estimate of $\sigma^2$ we have (use the fullest model).
■ $SSE(M) = \|Y - \hat Y_M\|^2$ is the SSE of the model $M$.
■ $p(M)$ is the number of predictors in $M$, or the degrees of freedom used up by the model.
■ Based on an estimate of
$$
\frac{1}{\sigma^2} \sum_{i=1}^{n} E\big[(\hat Y_i - E(Y_i))^2\big]
= \frac{1}{\sigma^2} \sum_{i=1}^{n} \Big( \big(E(Y_i - \hat Y_i)\big)^2 + \mathrm{Var}(\hat Y_i) \Big)
$$
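A minimal sketch of computing $C_p(M)$ by hand, assuming `sub` is a candidate lm() fit and `full` is the fullest model on the same hypothetical data:

```r
## Mallow's Cp for a submodel, estimating sigma^2 from the fullest model.
cp <- function(sub, full) {
  sigma2.hat <- sum(residuals(full)^2) / df.residual(full)  # SSE(F) / df_F
  sse <- sum(residuals(sub)^2)                              # SSE(M)
  p <- length(coef(sub))                                    # df used by M
  sse / sigma2.hat - nobs(sub) + 2 * p
}
```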
AIC & BIC

■ Mallow's $C_p$ is (almost) a special case of the Akaike Information Criterion (AIC):
$$
AIC(M) = -2 \log L(M) + 2 \cdot p(M).
$$
■ $L(M)$ is the likelihood function of the parameters in model $M$ evaluated at the MLE (maximum likelihood estimators).
■ Schwarz's Bayesian Information Criterion (BIC):
$$
BIC(M) = -2 \log L(M) + p(M) \cdot \log n.
$$
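In R, AIC() and BIC() evaluate these criteria directly for a fitted model; note that for lm() fits R counts $\sigma^2$ as an estimated parameter, so its parameter count is the number of coefficients plus one. Hypothetical usage:

```r
fit <- lm(y ~ x1 + x2, data = dat)  # hypothetical data
AIC(fit)  # -2 log L + 2 * (number of estimated parameters)
BIC(fit)  # -2 log L + log(n) * (number of estimated parameters)
```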
Maximum likelihood estimation

■ If the model is correct, then the log-likelihood of $(\beta, \sigma)$ is
$$
\log L(\beta, \sigma \mid X, Y) = -\frac{n}{2}\big(\log(2\pi) + \log \sigma^2\big) - \frac{1}{2\sigma^2}\|Y - X\beta\|^2,
$$
where $Y$ is the vector of observed responses.
■ The MLE for $\beta$ is the same as the least squares estimate, because the first term does not depend on $\beta$: maximizing over $\beta$ amounts to minimizing $\|Y - X\beta\|^2$.
■ MLE for $\sigma^2$: setting the derivative to zero at $(\hat\beta, \hat\sigma)$,
$$
\frac{\partial}{\partial \sigma^2} \log L(\beta, \sigma) \Big|_{\hat\beta, \hat\sigma}
= -\frac{n}{2\hat\sigma^2} + \frac{1}{2\hat\sigma^4}\|Y - X\hat\beta\|^2 = 0.
$$
■ Solving for $\sigma^2$:
$$
\hat\sigma^2_{MLE} = \frac{1}{n}\|Y - X\hat\beta\|^2 = \frac{1}{n} SSE(M).
$$
Note that the MLE is biased: it divides $SSE(M)$ by $n$ rather than $n - p$.
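A small numerical check of the bias point, again with a hypothetical fit:

```r
fit <- lm(y ~ x1 + x2, data = dat)  # hypothetical data
sse <- sum(residuals(fit)^2)
sse / nobs(fit)          # MLE of sigma^2: divides by n (biased)
sse / df.residual(fit)   # unbiased estimate: divides by n - p
summary(fit)$sigma^2     # agrees with the unbiased version
```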
AIC for a linear model

Using $\hat\beta_{MLE} = \hat\beta$ and
$$
\hat\sigma^2_{MLE} = \frac{1}{n} SSE(M),
$$
we see that the AIC of a multiple linear regression model is
$$
AIC(M) = n\big(\log(2\pi) + \log(SSE(M)) - \log(n)\big) + n + 2\,(p(M) + 1),
$$
where the $+1$ counts $\sigma^2$ as an estimated parameter.
■ If $\sigma^2$ is known, then
$$
AIC(M) = n\big(\log(2\pi) + \log(\sigma^2)\big) + \frac{SSE(M)}{\sigma^2} + 2\,p(M),
$$
which is (almost) $C_p(M) + K_n$ for a constant $K_n$ that does not depend on the model $M$.
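A hedged check that this formula reproduces R's built-in AIC() (which counts $\sigma^2$ as a parameter), on hypothetical data:

```r
fit <- lm(y ~ x1 + x2, data = dat)  # hypothetical data
n <- nobs(fit)
sse <- sum(residuals(fit)^2)
p <- length(coef(fit))
n * (log(2 * pi) + log(sse) - log(n)) + n + 2 * (p + 1)  # by-hand AIC
AIC(fit)                                                 # should agree
```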
Search strategies

■ "Best subset": search all possible models and take the one with the highest $R^2_a$ (adjusted $R^2$) or lowest $C_p$.
■ Stepwise (forward, backward, or both): useful when the number of predictors is large. Choose an initial model and be "greedy."
■ "Greedy" means always take the biggest jump (up or down) in your selected criterion.
Implementations in R

■ "Best subset": use the function leaps (from the leaps package). Works only for multiple linear regression models.
■ Stepwise: use the function step. Works for any model for which AIC is defined. In multiple linear regression, AIC is (almost) a linear function of $C_p$.
■ A sketch of both follows below.
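This sketch assumes a hypothetical data frame `dat` whose first column is a numeric response `y` and whose remaining columns are numeric predictors:

```r
library(leaps)  # provides leaps(); install.packages("leaps") if needed

## Best subset, ranked by Cp.
out <- leaps(x = dat[, -1], y = dat$y, method = "Cp")
out$which[which.min(out$Cp), ]   # predictors in the lowest-Cp model

## Greedy stepwise search by AIC, starting from the full model.
full <- lm(y ~ ., data = dat)
step(full, direction = "both")
```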
Caveats

■ Many other "criteria" have been proposed.
■ Some work well for some types of data, others for other types.
■ These criteria are not "direct measures" of predictive power.
■ Later we will see cross-validation, which gives an estimate of predictive power.
