
Statistics 203: Introduction to Regression and Analysis of Variance

Model Selection: General Techniques

Jonathan Taylor
Today

■ Outlier detection / simultaneous inference.
■ Goals of model selection.
■ Criteria to compare models.
■ (Some) model selection strategies.
Crude outlier detection test

■ If a studentized residual is large, the observation may be an outlier.
■ Problem: if $n$ is large and we "threshold" at $t_{1-\alpha/2,\,n-p-1}$, we will get many outliers by chance even if the model is correct.
■ Solution: Bonferroni correction, threshold at $t_{1-\alpha/(2n),\,n-p-1}$.
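A minimal R sketch of this test, assuming a hypothetical fitted lm() object `fit`; rstudent() returns the externally studentized residuals:

```r
## Crude vs. Bonferroni outlier detection for a hypothetical lm() fit.
n <- nobs(fit)
p <- length(coef(fit))                    # number of coefficients
ti <- rstudent(fit)                       # studentized residuals ~ t(n - p - 1)
alpha <- 0.05
crude <- qt(1 - alpha / 2, df = n - p - 1)         # naive threshold
bonf  <- qt(1 - alpha / (2 * n), df = n - p - 1)   # Bonferroni threshold
which(abs(ti) > bonf)                     # observations flagged as outliers
```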
Bonferroni correction

■ If we are doing many $t$ (or other) tests, say $m > 1$, we can control the overall false positive rate at $\alpha$ by testing each one at level $\alpha/m$.
■ Proof:
$$
\begin{aligned}
P(\text{at least one false positive})
&= P\Big(\bigcup_{i=1}^{m} \big\{ |T_i| \ge t_{1-\alpha/(2m),\,n-p-1} \big\}\Big) \\
&\le \sum_{i=1}^{m} P\big( |T_i| \ge t_{1-\alpha/(2m),\,n-p-1} \big)
 = \sum_{i=1}^{m} \frac{\alpha}{m} = \alpha.
\end{aligned}
$$
■ Known as "simultaneous inference": controlling the overall false positive rate at $\alpha$ while performing many tests.
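In R, the built-in p.adjust() applies this correction; a small illustration with made-up p-values:

```r
## Bonferroni adjustment of m = 4 hypothetical p-values.
pvals <- c(0.003, 0.020, 0.400, 0.650)
p.adjust(pvals, method = "bonferroni")        # each p-value multiplied by m (capped at 1)
p.adjust(pvals, method = "bonferroni") < 0.05 # rejections with overall rate held at 0.05
```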
Simultaneous inference for β

■ Another common situation in which simultaneous inference arises is inference about the entire coefficient vector $\beta$.
■ Using the facts that
$$
\hat\beta \sim N\big(\beta,\ \sigma^2 (X^t X)^{-1}\big), \qquad
\hat\sigma^2 \sim \sigma^2 \cdot \frac{\chi^2_{n-p}}{n-p},
$$
along with $\hat\beta \perp \hat\sigma^2$, leads to
$$
\frac{(\hat\beta - \beta)^t (X^t X)(\hat\beta - \beta)/p}{\hat\sigma^2}
\sim \frac{\chi^2_{p}/p}{\chi^2_{n-p}/(n-p)} \sim F_{p,\,n-p}.
$$
■ $(1-\alpha) \cdot 100\%$ simultaneous confidence region:
$$
\Big\{ \beta : (\hat\beta - \beta)^t (X^t X)(\hat\beta - \beta) \le p\,\hat\sigma^2\, F_{p,\,n-p,\,1-\alpha} \Big\}
$$
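A sketch of checking whether a candidate vector beta0 lies in this region, assuming a hypothetical data frame `dat` with response `y` and predictors `x1`, `x2`:

```r
## Membership check for the (1 - alpha) simultaneous confidence region.
fit <- lm(y ~ x1 + x2, data = dat)
X <- model.matrix(fit)
n <- nrow(X); p <- ncol(X)
beta.hat <- coef(fit)
sigma2.hat <- sum(residuals(fit)^2) / (n - p)    # usual unbiased estimate

in.region <- function(beta0, alpha = 0.05) {
  d <- beta.hat - beta0
  stat <- drop(t(d) %*% crossprod(X) %*% d)      # (bhat - b)' X'X (bhat - b)
  stat <= p * sigma2.hat * qf(1 - alpha, p, n - p)
}
in.region(rep(0, p))   # e.g., is beta = 0 inside the 95% region?
```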
Model selection: goals

■ When we have many predictors (with many possible interactions), it can be difficult to find a good model.
■ Which main effects do we include?
■ Which interactions do we include?
■ Model selection tries to "simplify" this task.
Model selection: general

■ This is an "unsolved" problem in statistics: there are no magic procedures to get you the "best model."
■ In some sense, model selection is "data mining."
■ Data miners / machine learners often work with very many predictors.
Model selection: strategies

■ To "implement" this, we need:
  ◆ a criterion or benchmark to compare two models.
  ◆ a search strategy.
■ With a limited number of predictors, it is possible to search all possible models.
Possible criteria

■ $R^2$: not a good criterion. It always increases with model size, so the "optimum" is to take the biggest model.
■ Adjusted $R^2$: better. It "penalizes" bigger models.
■ Mallow's $C_p$.
■ Akaike's Information Criterion (AIC), Schwarz's BIC.
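A quick illustration of the first two points, assuming hypothetical nested lm() fits on a data frame `dat`:

```r
## R^2 never decreases when a predictor is added; adjusted R^2 can.
small <- lm(y ~ x1, data = dat)
big   <- lm(y ~ x1 + x2, data = dat)
summary(small)$r.squared      # always <= the next line
summary(big)$r.squared
summary(small)$adj.r.squared  # may exceed the next line
summary(big)$adj.r.squared
```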
Mallow's Cp

■ $C_p(M) = \dfrac{SSE(M)}{\hat\sigma^2} - n + 2 \cdot p(M)$.
■ $\hat\sigma^2 = SSE(F)/df_F$ is the "best" estimate of $\sigma^2$ we have (use the fullest model).
■ $SSE(M) = \|Y - \hat Y_M\|^2$ is the SSE of the model $M$.
■ $p(M)$ is the number of predictors in $M$, or the degrees of freedom used up by the model.
■ Based on an estimate of
$$
\frac{1}{\sigma^2} \sum_{i=1}^{n} E\big[(\hat Y_i - E(Y_i))^2\big]
= \frac{1}{\sigma^2} \sum_{i=1}^{n} \Big( \big(E(Y_i - \hat Y_i)\big)^2 + \mathrm{Var}(\hat Y_i) \Big)
$$
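A minimal sketch of computing $C_p(M)$ by hand, assuming `sub` is a candidate lm() fit and `full` is the fullest model on the same hypothetical data:

```r
## Mallow's Cp for a submodel, estimating sigma^2 from the fullest model.
cp <- function(sub, full) {
  sigma2.hat <- sum(residuals(full)^2) / df.residual(full)  # SSE(F) / df_F
  sse <- sum(residuals(sub)^2)                              # SSE(M)
  p <- length(coef(sub))                                    # df used by M
  sse / sigma2.hat - nobs(sub) + 2 * p
}
```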
AIC & BIC

■ Mallow's $C_p$ is (almost) a special case of the Akaike Information Criterion (AIC):
$$
AIC(M) = -2 \log L(M) + 2 \cdot p(M).
$$
■ $L(M)$ is the likelihood function of the parameters in model $M$ evaluated at the MLE (maximum likelihood estimators).
■ Schwarz's Bayesian Information Criterion (BIC):
$$
BIC(M) = -2 \log L(M) + p(M) \cdot \log n.
$$
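In R, AIC() and BIC() evaluate these criteria directly for a fitted model; note that for lm() fits R counts $\sigma^2$ as an estimated parameter, so its parameter count is the number of coefficients plus one. Hypothetical usage:

```r
fit <- lm(y ~ x1 + x2, data = dat)  # hypothetical data
AIC(fit)  # -2 log L + 2 * (number of estimated parameters)
BIC(fit)  # -2 log L + log(n) * (number of estimated parameters)
```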
Maximum likelihood estimation

■ If the model is correct, then the log-likelihood of $(\beta, \sigma)$ is
$$
\log L(\beta, \sigma \mid X, Y) = -\frac{n}{2}\big(\log(2\pi) + \log \sigma^2\big) - \frac{1}{2\sigma^2}\|Y - X\beta\|^2,
$$
where $Y$ is the vector of observed responses.
■ The MLE for $\beta$ is the same as the least squares estimate, because the first term does not depend on $\beta$: maximizing over $\beta$ amounts to minimizing $\|Y - X\beta\|^2$.
■ MLE for $\sigma^2$: setting the derivative to zero at $(\hat\beta, \hat\sigma)$,
$$
\frac{\partial}{\partial \sigma^2} \log L(\beta, \sigma) \Big|_{\hat\beta, \hat\sigma}
= -\frac{n}{2\hat\sigma^2} + \frac{1}{2\hat\sigma^4}\|Y - X\hat\beta\|^2 = 0.
$$
■ Solving for $\sigma^2$:
$$
\hat\sigma^2_{MLE} = \frac{1}{n}\|Y - X\hat\beta\|^2 = \frac{1}{n} SSE(M).
$$
Note that the MLE is biased: it divides $SSE(M)$ by $n$ rather than $n - p$.
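A small numerical check of the bias point, again with a hypothetical fit:

```r
fit <- lm(y ~ x1 + x2, data = dat)  # hypothetical data
sse <- sum(residuals(fit)^2)
sse / nobs(fit)          # MLE of sigma^2: divides by n (biased)
sse / df.residual(fit)   # unbiased estimate: divides by n - p
summary(fit)$sigma^2     # agrees with the unbiased version
```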
AIC for a linear model

Using $\hat\beta_{MLE} = \hat\beta$ and
$$
\hat\sigma^2_{MLE} = \frac{1}{n} SSE(M),
$$
we see that the AIC of a multiple linear regression model is
$$
AIC(M) = n\big(\log(2\pi) + \log(SSE(M)) - \log(n)\big) + n + 2\,(p(M) + 1),
$$
where the $+1$ counts $\sigma^2$ as an estimated parameter.
■ If $\sigma^2$ is known, then
$$
AIC(M) = n\big(\log(2\pi) + \log(\sigma^2)\big) + \frac{SSE(M)}{\sigma^2} + 2\,p(M),
$$
which is (almost) $C_p(M) + K_n$ for a constant $K_n$ that does not depend on the model $M$.
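A hedged check that this formula reproduces R's built-in AIC() (which counts $\sigma^2$ as a parameter), on hypothetical data:

```r
fit <- lm(y ~ x1 + x2, data = dat)  # hypothetical data
n <- nobs(fit)
sse <- sum(residuals(fit)^2)
p <- length(coef(fit))
n * (log(2 * pi) + log(sse) - log(n)) + n + 2 * (p + 1)  # by-hand AIC
AIC(fit)                                                 # should agree
```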
Search strategies

■ "Best subset": search all possible models and take the one with the highest $R^2_a$ (adjusted $R^2$) or lowest $C_p$.
■ Stepwise (forward, backward, or both): useful when the number of predictors is large. Choose an initial model and be "greedy."
■ "Greedy" means always take the biggest jump (up or down) in your selected criterion.
Implementations in R

■ "Best subset": use the function leaps (from the leaps package). Works only for multiple linear regression models.
■ Stepwise: use the function step. Works for any model for which AIC is defined. In multiple linear regression, AIC is (almost) a linear function of $C_p$.
■ A sketch of both follows below.
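This sketch assumes a hypothetical data frame `dat` whose first column is a numeric response `y` and whose remaining columns are numeric predictors:

```r
library(leaps)  # provides leaps(); install.packages("leaps") if needed

## Best subset, ranked by Cp.
out <- leaps(x = dat[, -1], y = dat$y, method = "Cp")
out$which[which.min(out$Cp), ]   # predictors in the lowest-Cp model

## Greedy stepwise search by AIC, starting from the full model.
full <- lm(y ~ ., data = dat)
step(full, direction = "both")
```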
Caveats

■ Many other "criteria" have been proposed.
■ Some work well for some types of data, others for other types.
■ These criteria are not "direct measures" of predictive power.
■ Later we will see cross-validation, which gives an estimate of predictive power.
