
CQF ML Lab

Estimating Default Probability with Logistic Regression

MLE Methodology: a reminder from linear regression to logit

1. When the assumption of Normality of residuals holds, ε_t iid N(0, σ²), the linear
regression y_t = β̂ x_t + ε_t has MLE properties. That means the estimated coefficients β̂ are

(a) consistent (i.e., close to the unknown true parameters β with low tolerance) and
(b) asymptotically efficient (i.e., their variance is known and minimised).

Estimates β̂, in fact, maximise the following joint likelihood:

L = (1/√(2πσ²))^T exp[ −(1/2)( ε₁²/σ² + ε₂²/σ² + · · · + ε_T²/σ² ) ]

Substituting ε_t = y_t − β̂ x_t and taking the log gives, for an individual observation,

log L_t = −(1/2) log 2πσ² − (1/2)(y_t − β̂ x_t)²/σ²

The log-likelihood for a regression model is the sum of contributions from each observation, log L = Σ_{t=1}^{T} log L_t. Numerical MLE (eg, by Excel Solver) varies β̂ to maximise log L.

The log-likelihood is maximised by minimising the sum of squared errors ε_t² = (y_t − β̂ x_t)².
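The minimal sketch below (not part of the original document) illustrates this point in Python, assuming numpy and scipy are available; the simulated data and variable names are purely illustrative. It maximises log L numerically, analogous to the Excel Solver approach, by minimising the SSE, and compares the result with the closed-form ‘Normal equations’ solution derived in the Exercise below.

import numpy as np
from scipy.optimize import minimize

# Illustrative data: y_t = 1.0 + 2.0 x_t + eps_t, eps_t ~ N(0, 0.5^2)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=200)

# Numerical MLE: maximising log L is equivalent to minimising the sum of squared errors
def sse(beta):
    b0, b1 = beta
    return np.sum((y - (b0 + b1 * x)) ** 2)

res = minimize(sse, x0=np.array([0.0, 0.0]))

# Closed-form 'Normal equations' solution for comparison
b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_hat = y.mean() - b1_hat * x.mean()

print("numerical  (b0, b1):", res.x)
print("analytical (b0, b1):", b0_hat, b1_hat)   # the two should agree closely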

Exercise: consider a case with the intercept and one variable, L* = Σ_{t=1}^{T} (y_t − (β₀ + β₁ x_t))². Apply the first-order optimisation condition, i.e., find the partial derivatives and set

∂L*/∂β₀ = 0   and   ∂L*/∂β₁ = 0

Obtain a system of two ‘Normal equations’ and confirm the well-known analytical solution for the regression coefficients

β₁ = Σ(x_t − x̄)(y_t − ȳ) / Σ(x_t − x̄)²   and   β₀ = ȳ − β₁ x̄.

This document was prepared by CQF Faculty, Dr Richard Diamond, at r.diamond@fitchlearning.com

2. MLE principles. For a known distribution, we can estimate the parameter θ that maximises the probability mass. The parameter name here is generic to the Exponential Family (Normal, Chi-squared, Binomial, and even Poisson). The parameter is the mean of the distribution, which is θ = µ for the Normal and θ = p, the probability, for the Bernoulli with {0, 1} outcomes.
A. We treat observations Y = [y₁, y₂, . . . , y_N]′ as independently and identically distributed. Their joint density (PDF) is simply the probability of observing the events together – by the multiplication rule of independent probabilities.

f(y₁, y₂, . . . , y_N; θ) = f(y₁; θ) × f(y₂; θ) × · · · × f(y_N; θ)

= Π_{i=1}^{N_obs} f(y_i; θ)

B. The log of the product becomes the sum of log-likelihood contributions:

ℓ_{Y₁,...,Y_N}(θ) = log Π_{i=1}^{N_obs} f(y_i; θ) = Σ_{i=1}^{N_obs} log f(y_i; θ)

Example. Let’s find an analytical solution to the maximum likelihood estimation problem
for a mix of independent but identically distributed Bernoulli variables.
The joint probability density of many default/no default events y_i ∈ {0, 1} is

f(y₁, y₂, . . . , y_n; θ) = Π_{i=1}^{n} θ^{y_i} (1 − θ)^{1−y_i}
= θ^{nȳ} (1 − θ)^{n(1−ȳ)}
= [ θ^{ȳ} (1 − θ)^{(1−ȳ)} ]^{n}

The log-likelihood function for the identically distributed Bernoulli variables (n draws) is

ℓ_{Y₁,...,Y_N}(θ) = log [ θ^{ȳ} (1 − θ)^{(1−ȳ)} ]^{n}
= n [ ȳ log θ + (1 − ȳ) log(1 − θ) ]

In general, the log-likelihood function for any of the Exponential family distributions follows a convenient linear form ℓ(θ) = a(ȳ)b(θ) + c(θ) + d(ȳ).
Analytical solution to the optimisation is found by equating the first derivative wrt θ to zero,

(∂/∂θ) ℓ_{Y₁,...,Y_N}(θ) = n [ ȳ/θ − (1 − ȳ)/(1 − θ) ]

Setting the result equal to zero in order to find the θ̂ that gives argmax ℓ_{Y₁,...,Y_N}(θ)

 
n [ ȳ/θ̂ − (1 − ȳ)/(1 − θ̂) ] = 0

ȳ (1 − θ̂) = (1 − ȳ) θ̂

θ̂ = ȳ

This is consistent with what we know: the average probability of default θ̂ = ȳ is the parameter that maximises the likelihood. We fitted all observations to one parameter, namely ‘location’. For the indicator variable y_i = 0, 1 the population probability of default is PD = E[Y | X].
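A quick numerical confirmation (not part of the original text), assuming numpy and scipy and an illustrative set of 0/1 outcomes: the value of θ that maximises the Bernoulli log-likelihood coincides with the sample average ȳ.

import numpy as np
from scipy.optimize import minimize_scalar

# 0/1 default indicators (illustrative data)
y = np.array([0, 0, 1, 0, 1, 0, 0, 0, 1, 0])

# Negative Bernoulli log-likelihood: -sum[ y_i log(theta) + (1 - y_i) log(1 - theta) ]
def neg_loglik(theta):
    return -np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")

print("theta_hat:", res.x)      # numerical maximiser
print("y_bar    :", y.mean())   # analytical MLE, should match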

3. Information Matrix might be a new term for you, but it is at the core of any regression (linear or logistic) because its inverse carries the squared standard errors of the coefficients on the diagonal. Standard errors are needed to check for significance – whether the true coefficients are zero or not.
Example. Let's consider results for multiple observations that are independent but follow the same-located Normal density f(y_i; µ). The joint likelihood is the product of the pdfs

L = f(y₁; µ) × f(y₂; µ) × · · · × f(y_T; µ)

= (1/√(2πσ²))^T exp[ −(1/2)( (y₁ − µ)²/σ² + (y₂ − µ)²/σ² + · · · + (y_T − µ)²/σ² ) ]

The log-likelihood of the joint density simplifies to

log L = −(T/2) ln 2πσ² − (1/(2σ²)) [ (y₁ − µ)² + (y₂ − µ)² + · · · + (y_T − µ)² ]

∂ log L/∂µ = (1/σ²) [ (y₁ − µ) + (y₂ − µ) + · · · + (y_T − µ) ]
(a) Equating ∂ log L/∂µ = 0 gives the usual formula for the arithmetic average – i.e., for the Normal density the sample mean is the MLE estimate.
(b) Let's keep differentiating wrt µ,

∂² log L/∂µ∂µ = −T/σ²

The Information matrix for any Exponential family distribution is defined as the negative expectation of the Hessian; in the canonical form µ is equal to the distribution parameter θ

I = −E[ ∂² log L/∂θ∂θ ] = T/σ²

The inverse of the information matrix, I⁻¹ = σ²/T, is easily recognised as the squared standard error (σ/√T)². If T → ∞ there is no estimation error!
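A small illustration (not in the original document), assuming numpy and simulated data with a known σ: the standard error implied by the inverse information, √(I⁻¹) = σ/√T, shrinks as the sample size T grows.

import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0

for T in (100, 10_000, 1_000_000):
    y = rng.normal(loc=5.0, scale=sigma, size=T)
    mu_hat = y.mean()                 # MLE of the location parameter
    info = T / sigma**2               # Information I = T / sigma^2
    se = np.sqrt(1.0 / info)          # sqrt of I^{-1} = sigma / sqrt(T)
    print(T, round(mu_hat, 4), round(se, 4))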

4. Specific to the Logistic Regression, the Information Matrix has been solved as

I = −E[ ∂² log L/∂β_j ∂β_k ] = X′PX

where log L is the log-likelihood of the Binomial density and P is a diagonal matrix of p_i(1 − p_i).

I_{j,k} = Σ_{i=1}^{N_obs} p_i (1 − p_i) x_{i,j} x_{i,k}

This element-wise calculation of I with P as a vector reveals the ‘condensing’ of Binomial variance.

(a) This numerical result for the computation of the Information Matrix has been obtained by differentiating Equation (1) as in the lecture, where

∂ log L/∂β_j = Σ_{i=1}^{N_obs} (y_i − p̂_i) x_{i,j}

(b) MLE estimates are found by equating a system of such equations to zero and finding the β̂ that satisfy them under the link function p_i = Λ(X_i′ β̂). The root-finding is done numerically by a scheme such as Newton–Raphson.

After the estimates β̂ are obtained, the asymptotic covariance matrix I⁻¹ is calculated. Its diagonal provides the squared standard errors used in significance testing.

β̂ ∼ N(β, I⁻¹).
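Below is a minimal sketch of these mechanics in Python (not part of the original lab; numpy and the simulated data are assumptions). It runs Newton–Raphson iterations with the score X′(y − p) and information I = X′PX, then reads the standard errors off the diagonal of I⁻¹.

import numpy as np

rng = np.random.default_rng(42)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one feature
true_beta = np.array([-1.0, 0.8])
p_true = 1.0 / (1.0 + np.exp(-X @ true_beta))
y = rng.binomial(1, p_true)                              # 0/1 default indicators

beta = np.zeros(X.shape[1])
for _ in range(25):                                      # Newton-Raphson iterations
    p = 1.0 / (1.0 + np.exp(-X @ beta))                  # link: p_i = Lambda(X_i' beta)
    score = X.T @ (y - p)                                # d logL / d beta_j
    W = p * (1 - p)
    info = X.T @ (X * W[:, None])                        # I = X' P X
    step = np.linalg.solve(info, score)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

cov = np.linalg.inv(info)                                # asymptotic covariance I^{-1}
se = np.sqrt(np.diag(cov))                               # standard errors
print("beta_hat:", beta)
print("std err :", se)
print("z-stats :", beta / se)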


Likelihood Ratio tests: are two regression models different?

5. The MLE framework allows for model selection by hypothesis testing, which involves comparing the log-likelihoods of two models, rather than simply throwing out insignificant parameters (which is bad statistical practice).
We are testing for deviance (difference, distance) in the specific case of nested models/regressions: our restricted model with dropped features (insignificant coefficients) is nested within the unrestricted model. The restricted model has joint likelihood PDF f(y; θ̂₀), and likewise for the unrestricted.

Deviance = 2( log f(y; θ̂) − log f(y; θ̂₀) )

A better fit minimises the Deviance, implying minimal entropy (more on that below).
Too large a Deviance is actually bad news because it means that the two models are too different and can't be used interchangeably.
Please see the Corporate PD.xlsm file for how simple it is to properly compare the full regression model vs its restricted version.

Similar ideas from the MLE approach to statistics and econometrics:

• The Likelihood Ratio test tests the difference between two log-likelihoods using the Chi-square distribution.
• The Wald test tests the difference between two means (θ₁ − θ₂) using the F-distribution.
• The Lagrange Multiplier test checks if ∂ log L/∂θ = 0.

The null hypothesis is that the restricted model is the ‘true’ model: ie, the same data is just as likely to produce this restricted regression.
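As a short illustration of the Deviance/Likelihood Ratio test (not in the original document; scipy is assumed and the log-likelihood values below are hypothetical), the Deviance is compared against a Chi-square distribution whose degrees of freedom equal the number of restricted (dropped) coefficients.

from scipy.stats import chi2

# Hypothetical maximised log-likelihoods of the two nested models
loglik_unrestricted = -112.4   # full model
loglik_restricted = -115.1     # model with 2 coefficients dropped
dropped = 2                    # number of restrictions

deviance = 2 * (loglik_unrestricted - loglik_restricted)
p_value = chi2.sf(deviance, df=dropped)   # survival function = 1 - CDF

print("Deviance:", deviance)
print("p-value :", p_value)
# A small p-value rejects the null that the restricted model is the 'true' model.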

6. The Likelihood Ratio test can be represented within the more general approach of Entropy Pooling – you will recognise the entropy formula below as the more general expression from which the Deviance formula came.
For two probability distributions, a generic f̃(y) and a reference f(y), the relative information entropy is defined as

ε[f̃(y), f(y)] = ∫_ℝ f̃(y) [ log f̃(y) − log f(y) ] dy

Imposing restrictions on f̃(y) makes it deviate from f(y). Entropy ε is a natural measure of how distorted a modified/restricted distribution f̃(y) is with respect to the reference f(y).
Compare with the Deviance formula for the LR test

Deviance = 2( log f(y; θ̂) − log f(y; θ̂₀) ).
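A brief numerical illustration (not in the original text), assuming scipy and the illustrative distributions below: for discrete supports the relative information entropy above is the Kullback–Leibler divergence, available as scipy.stats.entropy.

import numpy as np
from scipy.stats import entropy

# Reference distribution f(y) and a restricted/distorted version f_tilde(y)
f = np.array([0.2, 0.3, 0.3, 0.2])
f_tilde = np.array([0.1, 0.4, 0.4, 0.1])

# Relative information entropy eps[f_tilde, f] = sum f_tilde * (log f_tilde - log f)
eps = entropy(f_tilde, f)
print("relative entropy:", eps)   # zero only when the two distributions coincide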

Exponential Family

Most familiar distributions belong to the Exponential Family: Normal, Chi-squared, Binomial, Poisson, Gamma, Beta... but not Student's t. It is always possible to express any PDF of a distribution from the Exponential Family in canonical form using the key parameter θ.

f(y; θ) = exp[ a(y) b(θ) + c(θ) + d(y) ]

An interesting fact: the variance of any of the Exponential family's distributions can be expressed using V(θ), a function of θ = E[y], which is known in a general form

Var[y] = V(θ) φ

Because of the known general expression for Var[y], we do not need to know the actual distribution of y_i to conduct Quasi-MLE (maximisation of the quasi-likelihood). To put it another way, the similarity of canonical representations for all Exponential PDFs makes it so that even if we choose a wrong distribution for y_i we are still likely to obtain robust estimates β̂ for y_i = X_i′ β!

1. The Poisson Distribution is the easiest to represent canonically. Its single intensity parameter λ is substituted by θ. The distribution for a Poisson random variable is

f(y; θ) = θ^y e^{−θ} / y! = exp( y log θ − θ − log y! )

a(y) = y
b(θ) = log θ   is the choice for the link function
c(θ) = −θ
d(y) = −log y!
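A quick sanity check (not part of the original document), assuming scipy and an illustrative θ and y: the canonical decomposition a(y)b(θ) + c(θ) + d(y) reproduces the Poisson log-pmf.

import numpy as np
from scipy.stats import poisson
from scipy.special import gammaln

theta, y = 3.5, 4

# Canonical form: a(y) b(theta) + c(theta) + d(y)
log_pmf_canonical = y * np.log(theta) - theta - gammaln(y + 1)   # gammaln(y+1) = log(y!)

print(log_pmf_canonical, poisson.logpmf(y, theta))   # the two values should match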

2. The Binomial Distribution also has parameter θ ≡ p but the workings are nuanced

f(y; θ) = C(n, y) θ^y (1 − θ)^{n−y}

where C(n, y) denotes the binomial coefficient. Applying the log first and the exponent second gives the canonical form as

f(y; θ) = exp[ y log( θ/(1 − θ) ) + n log(1 − θ) + log C(n, y) ]

a(y) = y
b(θ) = log( θ/(1 − θ) )   the choice for the link function
c(θ) = n log(1 − θ)
d(y) = log C(n, y)
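Similarly (not in the original text; scipy and the values of n, θ, y are assumptions), the Binomial canonical form can be checked against scipy, and b(θ) = log(θ/(1 − θ)) is exactly the logit link used by the logistic regression above.

import numpy as np
from scipy.stats import binom
from scipy.special import comb

n, theta, y = 10, 0.3, 4

# Canonical form: y * logit(theta) + n * log(1 - theta) + log C(n, y)
logit = np.log(theta / (1 - theta))
log_pmf_canonical = y * logit + n * np.log(1 - theta) + np.log(comb(n, y))

print(log_pmf_canonical, binom.logpmf(y, n, theta))   # should match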

3. For the Normal Distribution, the parameter θ matches the mean (also called the location) µ. The Exponential Family representation takes a few lines to work out; it is omitted here so that the task remains usable in the exam.
