Generalized Linear Models: Project 2
R00323024
Kuan-Wei Tseng (曾冠瑋)

1. Kullback-Leibler Distance
The Kullback-Leibler distance, or Kullback-Leibler information, is a
measure of the discrepancy between two distributions. It is defined as
$$
K = \min_{\theta \in \Theta} E\left[\ln \frac{f(y|x;\theta_0)}{f(y|x;\theta)}\right]
  = E\left[\ln \frac{f(y|x;\theta_0)}{f(y|x;\theta^{*})}\right],
$$
where f(y|x; θ0) denotes the “true” conditional density of y given
x, and θ∗ is the parameter value attaining the minimum.
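As a concrete illustration (not part of the original note), take both densities to be normal with a common variance σ² and means μ0 (true) and μ; the Kullback-Leibler distance then has the closed form
$$
E_{\theta_0}\!\left[\ln\frac{\phi(y;\mu_0,\sigma^2)}{\phi(y;\mu,\sigma^2)}\right] = \frac{(\mu-\mu_0)^2}{2\sigma^2},
$$
which is zero exactly when μ = μ0 and grows as the candidate mean moves away from the true one.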
The Kullback-Leibler information can be used as a criterion for
model selection. For a given model with estimated parameters θ̂,
its Kullback-Leibler distance is
" !#
f (y|x; θ0 )
K̂ = E ln .
f (y|x; θ̂)

Then we choose the model with the minimum Kullback-Leibler distance among competing models.
However, the true conditional density and the distribution of
x are never known in practice. Consider an approximation by the
sample counterparts,
$$
\tilde K = \frac{1}{n}\sum_{i=1}^{n} \ln f(y_i|x_i;\theta_0) - \frac{1}{n}\sum_{i=1}^{n} \ln f(y_i|x_i;\hat\theta).
$$

Although the first term is unknown, it is constant across all competing models. Hence, choosing the model with the minimum Kullback-Leibler distance is equivalent to choosing the model with the maximum value of
$$
\frac{1}{n}\sum_{i=1}^{n} \ln f(y_i|x_i;\hat\theta)
$$

is chosen.
This criterion has a disadvantage similar to that of R²: the log-likelihood
is non-decreasing in the number of independent variables, so the
criterion tends to favor large models and overfitting may occur.
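The following sketch (simulated data with hypothetical variable names, not part of the original note) illustrates the point for a normal linear model fitted by ordinary least squares: the maximized log-likelihood never decreases as irrelevant regressors are added.

import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)        # true model uses only x1
noise_covariates = rng.normal(size=(n, 5))     # irrelevant regressors

def gaussian_loglik(y, X):
    """Maximized log-likelihood of a normal linear model fitted by OLS."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / len(y)            # MLE of the error variance
    return -0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1.0)

X = np.column_stack([np.ones(n), x1])
for k in range(noise_covariates.shape[1] + 1):
    Xk = np.column_stack([X, noise_covariates[:, :k]])
    # the printed value is non-decreasing in k
    print(k, "extra noise regressors:", round(gaussian_loglik(y, Xk), 3))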

2. Akaike Information Criterion


The idea of the Akaike information criterion (AIC) is to minimize the
Kullback-Leibler distance while penalizing high-dimensional models:
one seeks a corrected quantity K̃∗ whose expectation is asymptotically
identical to that of K̂, i.e., E(K̃∗ − K̂) → 0, with dim(θ) entering as the penalty.
By definition,
$$
n\tilde K = \sum_{i=1}^{n} \ln f(y_i|x_i;\theta_0) - \sum_{i=1}^{n} \ln f(y_i|x_i;\hat\theta).
$$

Using a Taylor expansion of the first term around θ̂ as an approximation,
$$
n\tilde K = \sum_{i=1}^{n} \frac{\partial \ln f(y_i|x_i;\hat\theta)}{\partial\theta'}\,(\theta_0-\hat\theta)
 + \frac{1}{2}\,(\theta_0-\hat\theta)'\left[\sum_{i=1}^{n} \frac{\partial^2 \ln f(y_i|x_i;\hat\theta)}{\partial\theta\,\partial\theta'}\right](\theta_0-\hat\theta) + o_p(1),
$$

where the first term is zero by the likelihood equation, and the averaged Hessian
n⁻¹ Σᵢ ∂²ln f(yᵢ|xᵢ; θ̂)/∂θ∂θ′ converges in probability to −I(θ0). Thus, nK̃
is asymptotically equivalent to
$$
-\frac{n}{2}\,(\hat\theta-\theta_0)'\, I(\theta_0)\,(\hat\theta-\theta_0),
$$
where I(θ0) is the Fisher information matrix. We obtain that
$$
-2n\tilde K \xrightarrow{d} \chi^2_p,
$$
where p = dim(θ̂). Hence, the asymptotic expectation of nK̃ is
−p/2.

Similarly, from the Taylor expansion of nK̂ around θ0 ,
 
$$
n\hat K = -n\,E\!\left[\frac{\partial \ln f(y|x;\theta_0)}{\partial\theta'}\right](\hat\theta-\theta_0)
 - \frac{n}{2}\,(\hat\theta-\theta_0)'\,E\!\left[\frac{\partial^2 \ln f(y|x;\theta_0)}{\partial\theta\,\partial\theta'}\right](\hat\theta-\theta_0) + o_p(1)
 = \frac{n}{2}\,(\hat\theta-\theta_0)'\, I(\theta_0)\,(\hat\theta-\theta_0) + o_p(1),
$$

since the first term is zero because the score has zero expectation at θ0,
and −E[∂²ln f(y|x; θ0)/∂θ∂θ′] = I(θ0). Hence 2nK̂ is asymptotically
distributed as χ²p, and the asymptotic expectation of nK̂ is p/2.
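Both limiting results rest on the standard asymptotic normality of the maximum likelihood estimator; making this step explicit (a standard fact, spelled out here for completeness),
$$
\sqrt{n}\,(\hat\theta - \theta_0) \xrightarrow{d} N\!\bigl(0,\; I(\theta_0)^{-1}\bigr)
\quad\Longrightarrow\quad
n\,(\hat\theta - \theta_0)'\, I(\theta_0)\,(\hat\theta - \theta_0) \xrightarrow{d} \chi^2_p,
$$
so that −2nK̃ and 2nK̂ both converge in distribution to χ²p; since E(χ²p) = p, dividing by −2 and 2 gives the asymptotic expectations −p/2 and p/2, respectively.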
By the argument above, the asymptotic expectation of the difference
between nK̃ and nK̂ is −p. To find a K̃∗ such that E(nK̃∗ − nK̂) = 0
in the limiting sense, consider
$$
\tilde K^{*} = \tilde K + \frac{p}{n}.
$$
Therefore, the Akaike information criterion is defined as
$$
\mathrm{AIC} = \sum_{i=1}^{n} \ln f(y_i|x_i;\hat\theta) - p = \ell(\hat\theta) - p.
$$

Then we choose the model with the maximum value. We see that
the AIC is a criterion based on minimizing the Kullback-Leibler
distance but with a penalty term for the number of estimated parameters.
Another form of the AIC is that the model with the minimum
value of
$$
\mathrm{AIC} = -2\sum_{i=1}^{n} \ln f(y_i|x_i;\hat\theta) + 2p = -2\,\ell(\hat\theta) + 2p
$$

is chosen. In this form, the first term −2ℓ(θ̂) can be replaced directly
by other asymptotically equivalent test statistics, such as the Wald or
score statistics.
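A small numerical sketch (hypothetical log-likelihood values, not part of the original note) showing that the two conventions rank models identically:

def aic_max_form(loglik, p):
    # "maximize" convention: AIC = l(theta_hat) - p
    return loglik - p

def aic_min_form(loglik, p):
    # "minimize" convention: AIC = -2 l(theta_hat) + 2 p
    return -2.0 * loglik + 2.0 * p

# two candidate models with hypothetical fitted log-likelihoods
loglik_small, p_small = -310.2, 3
loglik_large, p_large = -309.8, 8

# both comparisons prefer the same (here, the smaller) model
print(aic_max_form(loglik_small, p_small) > aic_max_form(loglik_large, p_large))  # True
print(aic_min_form(loglik_small, p_small) < aic_min_form(loglik_large, p_large))  # True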
The AIC can be applied to both nested and non-nested models.
However, the AIC procedure cannot be used as a hypothesis test:
its numerical value does not indicate whether a model is true. Moreover,
in some cases, such as two nested normal linear models with M1 ⊂ M2
and M1 true, the probability of choosing M1 does not converge to one.
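To see why (a standard argument, added here for completeness): if M1 ⊂ M2 with p2 − p1 extra parameters and M1 is true, then under the usual regularity conditions
$$
2\bigl(\ell_2(\hat\theta_2) - \ell_1(\hat\theta_1)\bigr) \xrightarrow{d} \chi^2_{p_2-p_1},
$$
and the AIC selects M2 exactly when this statistic exceeds 2(p2 − p1). Hence
$$
P(\text{choose } M_2) \longrightarrow P\bigl(\chi^2_{p_2-p_1} > 2(p_2-p_1)\bigr) > 0,
$$
so the probability of choosing the true model M1 stays bounded away from one.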

3. Bayesian Information Criterion (Schwarz Criterion)
Another criterion for model selection is the Bayesian information
criterion (BIC), also known as the Schwarz criterion. Schwarz (1978)
derived the criterion from an asymptotic approximation to the Bayesian
marginal likelihood, and showed that it is valid in the sense that,
asymptotically, it does not depend on the prior distribution.
The BIC is defined as
$$
\mathrm{BIC} = \ell(\hat\theta) - \frac{p}{2}\,\ln n,
$$

where ℓ(θ̂) is the log-likelihood, p = dim(θ̂), and n is the number of
observations. Like the AIC, it can also be presented as
$$
\mathrm{BIC} = -2\,\ell(\hat\theta) + p\,\ln n.
$$

The penalty per estimated parameter in the BIC is (1/2) ln n instead of
1 in the AIC. This means that the BIC leans more toward low-dimensional
models than the AIC whenever ln n > 2, that is, when the sample size
is at least 8.
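The following sketch (hypothetical fitted log-likelihoods, not part of the original note) makes the contrast between the two penalties concrete:

from math import log

def aic(loglik, p):
    # "minimize" form: -2 l(theta_hat) + 2 p
    return -2.0 * loglik + 2.0 * p

def bic(loglik, p, n):
    # "minimize" form: -2 l(theta_hat) + p ln n
    return -2.0 * loglik + p * log(n)

n = 200
loglik_small, p_small = -310.2, 3   # hypothetical small model
loglik_large, p_large = -305.0, 8   # hypothetical large model

print("AIC prefers:", "small" if aic(loglik_small, p_small) < aic(loglik_large, p_large) else "large")
print("BIC prefers:", "small" if bic(loglik_small, p_small, n) < bic(loglik_large, p_large, n) else "large")
# Here the AIC picks the larger model, while the BIC, whose per-parameter
# penalty ln(200) ~ 5.3 exceeds the AIC's 2, picks the smaller one.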
