Lecture 14 - Logistic and Softmax Regression
Learning (2): Logistic and Softmax Regression
(Recall: the decision boundary of LR, w^⊤ x = 0, is a linear hyperplane.)

The negative log-likelihood simplifies to the "cross-entropy" loss, a popular loss function for classification:

NLL(w) = ∑_{n=1}^N −[ y_n log μ_n + (1 − y_n) log(1 − μ_n) ]

The loss is very large if μ_n is close to 1 and y_n is close to 0, or vice-versa.

Plugging in μ_n = σ(w^⊤ x_n) and simplifying:

NLL(w) = −∑_{n=1}^N [ y_n w^⊤ x_n − log(1 + exp(w^⊤ x_n)) ]

Good news: for LR, the NLL is convex. However, there is no closed-form expression for its minimizer, so iterative optimization is needed (gradient or Hessian based). Exercise: try working out the gradient of the NLL and notice the expression's form.
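To make the two expressions above concrete, here is a minimal NumPy sketch (not from the slides; function and variable names are illustrative) of the cross-entropy NLL and its gradient for labels in {0, 1}:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def nll(w, X, y):
        """Cross-entropy NLL for logistic regression with labels y in {0, 1}."""
        mu = sigmoid(X @ w)                      # mu_n = sigma(w^T x_n)
        return -np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

    def nll_grad(w, X, y):
        """Gradient of the NLL: sum_n (mu_n - y_n) x_n = X^T (mu - y)."""
        mu = sigmoid(X @ w)
        return X.T @ (mu - y)

Note the gradient's "prediction error times input" form, X^⊤(μ − y), the same structure as the least-squares gradient; this is likely what the exercise above is hinting at.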
An Alternate Notation
If we assume the labels are -1/+1 (not 0/1), the likelihood can be written as

p(y_n | w, x_n) = 1 / (1 + exp(−y_n w^⊤ x_n)) = σ(y_n w^⊤ x_n)

This is a slightly more convenient notation: a single expression gives the probabilities of both possible label values (it works because σ(−z) = 1 − σ(z)).
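As a quick sanity check of this notation (an illustrative snippet, reusing the sigmoid helper sketched earlier):

    # sigma(-z) = 1 - sigma(z), so one expression covers both labels
    w = np.array([0.5, -1.0])
    x = np.array([2.0, 1.0])
    p_plus = sigmoid(+1 * w @ x)     # p(y = +1 | x)
    p_minus = sigmoid(-1 * w @ x)    # p(y = -1 | x)
    assert np.isclose(p_plus + p_minus, 1.0)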
MAP Estimation for Logistic Regression
Need a prior on the weight vector
Just like probabilistic linear regression, can use a zero-mean Gaussian prior
p(w) = N(w | 0, λ⁻¹ I_D) ∝ exp(−(λ/2) w^⊤ w)

The MAP objective is then the NLL minus the log of the prior, i.e. NLL(w) + (λ/2) w^⊤ w (up to a constant), which is the ℓ2-regularized cross-entropy loss.
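Under this prior, a minimal sketch of MAP estimation by gradient descent (reusing the nll and nll_grad helpers above; the step size and iteration count are illustrative, not from the slides):

    def map_objective(w, X, y, lam):
        """NLL minus log prior (up to a constant): NLL(w) + (lam/2) w^T w."""
        return nll(w, X, y) + 0.5 * lam * w @ w

    def map_grad(w, X, y, lam):
        return nll_grad(w, X, y) + lam * w

    def fit_map(X, y, lam=1.0, lr=0.1, iters=500):
        """Plain gradient descent on the MAP objective."""
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            w -= lr * map_grad(w, X, y, lam)
        return w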
Fully Bayesian Inference for Logistic Regression
Doing fully Bayesian inference would require computing the posterior
p(w | X, y) = p(w) p(y | X, w) / p(y | X) = [ p(w) ∏_{n=1}^N p(y_n | w, x_n) ] / ∫ p(w) ∏_{n=1}^N p(y_n | w, x_n) dw

Here the prior p(w) is Gaussian and each likelihood term p(y_n | w, x_n) is Bernoulli. Unfortunately, Gaussian and Bernoulli are not conjugate with each other, so an analytic expression for the posterior can't be obtained, unlike probabilistic linear regression.
Need to approximate the posterior in this case
We will use a simple approximation called the Laplace approximation. It approximates the posterior of w by a Gaussian whose mean is the MAP solution and whose covariance matrix is the inverse of the Hessian (the Hessian being the second derivative of the negative log-posterior of the LR model). More advanced posterior approximation methods, like MCMC and variational inference, can also be employed (beyond the scope of CS771).
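A minimal sketch of the Laplace approximation under these assumptions (reusing the fit_map and sigmoid helpers from the earlier sketches; for LR with the Gaussian prior above, the Hessian of the negative log-posterior is X^⊤ S X + λ I with S = diag(μ_n(1 − μ_n))):

    def laplace_approx(X, y, lam=1.0):
        """Gaussian approximation of the LR posterior:
        mean = MAP solution, covariance = inverse Hessian at the MAP."""
        w_map = fit_map(X, y, lam)
        mu = sigmoid(X @ w_map)
        S = np.diag(mu * (1 - mu))                      # Bernoulli variances
        H = X.T @ S @ X + lam * np.eye(X.shape[1])      # Hessian of neg. log-posterior
        return w_map, np.linalg.inv(H)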
Posterior for LR: An Illustration
Can sample from the posterior of the LR model
Each sample will give a weight vector defining a hyperplane separator (see the sketch below).
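For instance (an illustrative snippet, assuming the laplace_approx helper above and some hypothetical training data X, y):

    w_map, cov = laplace_approx(X, y, lam=1.0)
    w_samples = np.random.multivariate_normal(w_map, cov, size=10)
    # each row of w_samples defines one plausible hyperplane w^T x = 0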
Logistic Regression: Predictive Distribution
When using the MLE/MAP solution ŵ_opt, we can use the plug-in predictive distribution

p(y_* = 1 | x_*, X, y) ≈ p(y_* = 1 | ŵ_opt, x_*) = σ(ŵ_opt^⊤ x_*)

When using fully Bayesian inference, we must compute the posterior predictive

p(y_* = 1 | x_*, X, y) = ∫ p(y_* = 1 | w, x_*) p(w | X, y) dw = ∫ σ(w^⊤ x_*) p(w | X, y) dw

A Monte-Carlo approximation of this integral is one possible way: generate samples w_1, …, w_M from the Gaussian approximation of the posterior and average the per-sample predictions, (1/M) ∑_{m=1}^M σ(w_m^⊤ x_*).
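A minimal sketch of that Monte-Carlo average (reusing the sigmoid helper and the Laplace mean/covariance from above; names are illustrative):

    def predictive_prob(x_new, w_map, cov, num_samples=1000):
        """Monte-Carlo estimate of p(y = 1 | x_new) under the Laplace posterior."""
        w_samples = np.random.multivariate_normal(w_map, cov, size=num_samples)
        return sigmoid(w_samples @ x_new).mean()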
LR: Plug-in Prediction vs Posterior Averaging
Multiclass Logistic (a.k.a. Softmax) Regression
Also called multinoulli/multinomial regression: basically, LR for K classes.

In this case y_n ∈ {1, 2, …, K}, each class k has a weight vector w_k (collected as W), and the label probabilities are defined by the softmax function

p(y_n = k | x_n, W) = exp(w_k^⊤ x_n) / ∑_{ℓ=1}^K exp(w_ℓ^⊤ x_n) = μ_{nk}

Also note that ∑_{k=1}^K μ_{nk} = 1 for any input x_n.
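A minimal sketch of the softmax computation (illustrative; the max-subtraction is a standard numerical-stability trick, not something from the slides):

    def softmax_probs(W, x):
        """Class probabilities mu_k = exp(w_k^T x) / sum_l exp(w_l^T x).
        W has one weight vector per class as its rows."""
        scores = W @ x
        scores = scores - scores.max()    # numerical stability; probabilities unchanged
        p = np.exp(scores)
        return p / p.sum()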