Lecture 14 - Logistic and Softmax Regression


Probabilistic Models for Supervised Learning (2):
Logistic and Softmax Regression

CS771: Introduction to Machine Learning
Piyush Rai

Logistic Regression (LR)

Note: LR and its multi-class extension (known as "softmax regression") are both very widely used. The word "regression" is a misnomer; both are classification models.

 A probabilistic model for binary classification
 Learns the PMF of the output label given the input, i.e., $p(y \mid \mathbf{x})$
 A discriminative model: does not model the inputs (only the relationship between $\mathbf{x}$ and $y$)
 Uses the sigmoid function to define the conditional probability of $y$ being 1

$$\mu(\mathbf{x}) = p(y = 1 \mid \mathbf{w}, \mathbf{x}) = \sigma(\mathbf{w}^\top\mathbf{x}) = \frac{1}{1+\exp(-\mathbf{w}^\top\mathbf{x})} = \frac{\exp(\mathbf{w}^\top\mathbf{x})}{1+\exp(\mathbf{w}^\top\mathbf{x})}$$

[Figure: the sigmoid $\sigma(z)$ plotted against $z$, rising from 0 through 0.5 at $z = 0$ towards 1. The score $\mathbf{w}^\top\mathbf{x}$ itself is a linear model.]
 Here $\mathbf{w}^\top\mathbf{x}$ is the score for input $\mathbf{x}$. The sigmoid turns it into a probability
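
To make the mapping concrete, here is a minimal NumPy sketch (not from the slides; the weight vector and input are made-up values) of turning the linear score $\mathbf{w}^\top\mathbf{x}$ into a probability with the sigmoid:

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z)), as defined on the slide
    return 1.0 / (1.0 + np.exp(-z))

# Made-up weight vector and input (D = 3), purely for illustration
w = np.array([0.5, -1.2, 2.0])
x = np.array([1.0, 0.3, 0.8])

score = w @ x            # the linear score w^T x
mu = sigmoid(score)      # p(y = 1 | w, x)
print(f"score = {score:.3f}, p(y=1|w,x) = {mu:.3f}")
```
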
LR: Decision Boundary
 At the decision boundary, both classes are equiprobable: $p(y=1 \mid \mathbf{w},\mathbf{x}) = p(y=0 \mid \mathbf{w},\mathbf{x})$, which gives $\mathbf{w}^\top\mathbf{x} = 0$, a linear hyperplane
 Very large positive $\mathbf{w}^\top\mathbf{x}$ means $p(y=1 \mid \mathbf{w},\mathbf{x})$ is close to 1
 Very large negative $\mathbf{w}^\top\mathbf{x}$ means $p(y=0 \mid \mathbf{w},\mathbf{x})$ is close to 1
 At the decision boundary, $\mathbf{w}^\top\mathbf{x} = 0$ implies $p(y=1 \mid \mathbf{w},\mathbf{x}) = p(y=0 \mid \mathbf{w},\mathbf{x}) = 0.5$
MLE for Logistic Regression
 Likelihood (PMF of each input's label) is Bernoulli with probability $\mu_n = \sigma(\mathbf{w}^\top\mathbf{x}_n)$ (labels assumed to be 0/1, not -1/+1)

$$p(y_n \mid \mathbf{w}, \mathbf{x}_n) = \text{Bernoulli}(\mu_n) = \mu_n^{y_n}(1-\mu_n)^{1-y_n}$$
 Overall likelihood, assuming i.i.d. observations:

$$p(\mathbf{y} \mid \mathbf{w}, \mathbf{X}) = \prod_{n=1}^{N} p(y_n \mid \mathbf{w}, \mathbf{x}_n) = \prod_{n=1}^{N} \mu_n^{y_n}(1-\mu_n)^{1-y_n}$$
 The negative log-likelihood simplifies to the "cross-entropy" loss (a popular loss function for classification):

$$\text{NLL}(\mathbf{w}) = \sum_{n=1}^{N} -\left[ y_n \log \mu_n + (1 - y_n)\log(1 - \mu_n) \right]$$

Note: the loss is very large if $\mu_n$ is close to 1 and $y_n$ is close to 0, or vice-versa.
 Plugging in $\mu_n = \sigma(\mathbf{w}^\top\mathbf{x}_n)$ and simplifying:

$$\text{NLL}(\mathbf{w}) = -\sum_{n=1}^{N} \left[ y_n \mathbf{w}^\top\mathbf{x}_n - \log(1+\exp(\mathbf{w}^\top\mathbf{x}_n)) \right]$$

Good news: for LR, the NLL is convex. However, there is no closed-form expression for its minimizer, so iterative optimization is needed (gradient or Hessian based). Exercise: try working out the gradient of the NLL and notice the form of the expression.
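
As a companion to the exercise: the gradient works out to $\sum_{n}(\mu_n - y_n)\mathbf{x}_n = \mathbf{X}^\top(\boldsymbol{\mu} - \mathbf{y})$. Below is a minimal sketch (not from the slides) of batch gradient descent on this NLL, using made-up synthetic data and a hand-picked step size:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    # NLL(w) = -sum_n [ y_n w^T x_n - log(1 + exp(w^T x_n)) ]
    scores = X @ w
    return -np.sum(y * scores - np.log1p(np.exp(scores)))

def nll_grad(w, X, y):
    # Gradient of the NLL: X^T (mu - y), with mu_n = sigmoid(w^T x_n)
    return X.T @ (sigmoid(X @ w) - y)

# Made-up synthetic data: N = 200 inputs, D = 2 features, 0/1 labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w_true = np.array([2.0, -1.0])
y = (sigmoid(X @ w_true) > rng.uniform(size=200)).astype(float)

# Batch gradient descent with a hand-picked step size (illustration only)
w = np.zeros(2)
step = 0.01
for _ in range(2000):
    w -= step * nll_grad(w, X, y)

print("estimated w:", w, " NLL:", nll(w, X, y))
```
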
An Alternate Notation
 If we assume the labels to be -1/+1 (not 0/1), the likelihood can be written as

$$p(y_n \mid \mathbf{w}, \mathbf{x}_n) = \frac{1}{1+\exp(-y_n \mathbf{w}^\top\mathbf{x}_n)} = \sigma(y_n \mathbf{w}^\top\mathbf{x}_n)$$

 Slightly more convenient notation: a single expression gives the probabilities of both possible label values
 In this case, the total negative log-likelihood will be

$$\text{NLL}(\mathbf{w}) = \sum_{n=1}^{N} \log(1+\exp(-y_n \mathbf{w}^\top\mathbf{x}_n))$$
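
A small numerical check (not from the slides; toy data) that the -1/+1 form $\sum_n \log(1+\exp(-y_n\mathbf{w}^\top\mathbf{x}_n))$ agrees with the 0/1 cross-entropy form of the NLL:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))        # toy inputs
w = rng.normal(size=3)             # toy weights
y01 = rng.integers(0, 2, size=5)   # labels in {0, 1}
ypm = 2 * y01 - 1                  # the same labels in {-1, +1}

mu = sigmoid(X @ w)
nll_01 = -np.sum(y01 * np.log(mu) + (1 - y01) * np.log(1 - mu))  # cross-entropy form
nll_pm = np.sum(np.log1p(np.exp(-ypm * (X @ w))))                # -1/+1 form

print(nll_01, nll_pm)  # the two values agree up to floating-point error
```
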
MAP Estimation for Logistic Regression
 Need a prior on the weight vector
 Just like probabilistic linear regression, can use a zero-mean Gaussian prior

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \lambda^{-1}\mathbf{I}_D) \propto \exp\left(-\frac{\lambda}{2}\mathbf{w}^\top\mathbf{w}\right)$$

 The MAP objective (log of the posterior) will be log-likelihood + log of prior; equivalently, we minimize NLL minus the log of the prior
 Therefore the MAP solution (ignoring terms that don't depend on $\mathbf{w}$) will be

$$\hat{\mathbf{w}}_{MAP} = \arg\min_{\mathbf{w}} \; \text{NLL}(\mathbf{w}) + \frac{\lambda}{2}\mathbf{w}^\top\mathbf{w}$$

Good news: this is a convex objective.
 Just like MLE case, no closed form solution. Iterative opt methods needed
 Highly efficient solvers (both first and second order) exist for MLE/MAP estimation for LR
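
A minimal sketch (not from the slides) of MAP estimation by gradient descent: compared with the MLE sketch above, the only change is the extra $\lambda\mathbf{w}$ term in the gradient, coming from the $\frac{\lambda}{2}\mathbf{w}^\top\mathbf{w}$ regularizer. Data and step size are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_grad(w, X, y, lam):
    # Gradient of NLL(w) + (lam/2) w^T w  =  X^T (mu - y) + lam * w
    return X.T @ (sigmoid(X @ w) - y) + lam * w

# Made-up data: N = 200, D = 2, labels in {0, 1}
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([2.0, -1.0]) + 0.5 * rng.normal(size=200) > 0).astype(float)

# Gradient descent on the (convex) MAP objective
w, lam, step = np.zeros(2), 1.0, 0.01
for _ in range(2000):
    w -= step * map_grad(w, X, y, lam)

print("w_MAP:", w)
```

In practice, the highly efficient solvers mentioned above include off-the-shelf implementations such as scikit-learn's LogisticRegression, whose C parameter is an inverse regularization strength (roughly playing the role of $1/\lambda$ here).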

Fully Bayesian Inference for Logistic Regression
 Doing fully Bayesian inference would require computing the posterior

$$p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) = \frac{p(\mathbf{w})\, p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})}{p(\mathbf{y} \mid \mathbf{X})} = \frac{p(\mathbf{w}) \prod_{n=1}^{N} p(y_n \mid \mathbf{w}, \mathbf{x}_n)}{\int p(\mathbf{w}) \prod_{n=1}^{N} p(y_n \mid \mathbf{w}, \mathbf{x}_n)\, d\mathbf{w}}$$

Unfortunately, the Gaussian (prior) and the Bernoulli (likelihood) are not conjugate with each other, so an analytic expression for the posterior can't be obtained, unlike probabilistic linear regression.

 Need to approximate the posterior in this case
 We will use a simple approximation called the Laplace approximation: it approximates the posterior of $\mathbf{w}$ by a Gaussian whose mean is the MAP solution and whose covariance matrix is the inverse of the Hessian (the second-derivative matrix of the negative log-posterior of the LR model); see the sketch below
 Can also employ more advanced posterior approximation methods, like MCMC and variational inference (beyond the scope of CS771)
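
A minimal sketch (not from the slides) of the Laplace approximation for LR: find $\mathbf{w}_{MAP}$ by gradient descent, then use the Hessian of the negative log-posterior, which for this model is $\mathbf{X}^\top\mathrm{diag}(\mu_n(1-\mu_n))\mathbf{X} + \lambda\mathbf{I}$, to get the Gaussian's covariance. The data, step size, and iteration count are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_approx(X, y, lam=1.0, step=0.01, iters=3000):
    """Gaussian (Laplace) approximation N(w_MAP, H^{-1}) to the LR posterior."""
    N, D = X.shape
    w = np.zeros(D)
    # 1) Find w_MAP by gradient descent on NLL(w) + (lam/2) w^T w
    for _ in range(iters):
        mu = sigmoid(X @ w)
        w -= step * (X.T @ (mu - y) + lam * w)
    # 2) Hessian of the negative log-posterior at w_MAP:
    #    H = X^T diag(mu * (1 - mu)) X + lam * I
    mu = sigmoid(X @ w)
    H = (X * (mu * (1 - mu))[:, None]).T @ X + lam * np.eye(D)
    return w, np.linalg.inv(H)  # mean and covariance of the Gaussian approximation

# Made-up data: N = 100, D = 2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X @ np.array([1.5, -2.0]) > 0).astype(float)

w_map, Sigma = laplace_approx(X, y)
print("posterior mean (w_MAP):", w_map)
print("posterior covariance:\n", Sigma)
```
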
Posterior for LR: An Illustration
 Can sample from the posterior of the LR model
 Each sample will give a weight vector defining a hyperplane separator
 Not all separators are equally good; their goodness depends on their posterior probabilities
 When making predictions, we can still use all of them, but weighted by their importance based on their posterior probabilities
 That's exactly what we do when computing the predictive distribution
Logistic Regression: Predictive Distribution
 When using the MLE/MAP solution $\hat{\mathbf{w}}_{opt}$, we can use the plug-in predictive distribution

$$p(y_* = 1 \mid \mathbf{x}_*, \mathbf{X}, \mathbf{y}) \approx p(y_* = 1 \mid \hat{\mathbf{w}}_{opt}, \mathbf{x}_*) = \sigma(\hat{\mathbf{w}}_{opt}^\top \mathbf{x}_*)$$

 When using fully Bayesian inference, we must compute the posterior predictive distribution by averaging over the posterior of $\mathbf{w}$. This integral is not tractable (the sigmoid and the Gaussian from the Laplace approximation are not conjugate), so it must be approximated
 Monte Carlo approximation of this integral is one possible way: generate samples $\mathbf{w}^{(1)}, \dots, \mathbf{w}^{(S)}$ from the Gaussian approximation of the posterior and use $p(y_* = 1 \mid \mathbf{x}_*, \mathbf{X}, \mathbf{y}) \approx \frac{1}{S}\sum_{s=1}^{S}\sigma(\mathbf{w}^{(s)\top}\mathbf{x}_*)$ (see the sketch below)
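
A minimal sketch (not from the slides) of the plug-in prediction versus the Monte Carlo posterior predictive; the Laplace-approximation mean and covariance below are made-up numbers standing in for the output of the previous sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up Laplace-approximation output (mean and covariance of the posterior)
w_map = np.array([1.4, -1.9])
Sigma = np.array([[0.05, 0.01],
                  [0.01, 0.08]])

x_star = np.array([0.7, -0.2])  # test input
rng = np.random.default_rng(0)

# Plug-in prediction: use the single MAP weight vector
p_plugin = sigmoid(w_map @ x_star)

# Monte Carlo posterior predictive: average the sigmoid over S posterior samples
S = 1000
w_samples = rng.multivariate_normal(w_map, Sigma, size=S)  # shape (S, D)
p_bayes = sigmoid(w_samples @ x_star).mean()

print(f"plug-in: {p_plugin:.3f}, posterior-averaged: {p_bayes:.3f}")
```
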
LR: Plug-in Prediction vs Posterior Averaging

 Posterior averaging is like using an ensemble of models. In this example, each model is a linear classifier, but the ensemble-like effect results in nonlinear boundaries
Multiclass Logistic (a.k.a. Softmax) Regression
 Also called multinoulli/multinomial regression: basically, LR for $K$ classes
 In this case, $y_n \in \{1, 2, \dots, K\}$, and the label probabilities are defined by the softmax function

$$p(y_n = k \mid \mathbf{x}_n, \mathbf{W}) = \frac{\exp(\mathbf{w}_k^\top \mathbf{x}_n)}{\sum_{\ell=1}^{K} \exp(\mathbf{w}_\ell^\top \mathbf{x}_n)} = \mu_{nk}$$

Note: $\sum_{k=1}^{K} \mu_{nk} = 1$ for any input.

 There are $K$ weight vectors (one per class), each $D$-dimensional, collected in $\mathbf{W}$
 Each likelihood is a multinoulli distribution, so the total likelihood is

$$p(\mathbf{y} \mid \mathbf{X}, \mathbf{W}) = \prod_{n=1}^{N} \prod_{\ell=1}^{K} \mu_{n\ell}^{y_{n\ell}}$$

Notation: $y_{n\ell} = 1$ if the true class of $\mathbf{x}_n$ is $\ell$, and $y_{n\ell} = 0$ otherwise.

 Can do MLE/MAP/fully Bayesian estimation for $\mathbf{W}$, similar to the LR model (a small sketch of the softmax probabilities and the multinoulli NLL follows below)
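
A minimal sketch (not from the slides) of the softmax probabilities $\mu_{nk}$ and the corresponding multinoulli NLL on made-up data:

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax; subtracting the row max is for numerical stability
    z = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Made-up data: N = 4 inputs, D = 3 features, K = 3 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 3))   # one D-dimensional weight vector per class (columns)
y = np.array([0, 2, 1, 2])    # true class labels

mu = softmax(X @ W)           # mu[n, k] = p(y_n = k | x_n, W); each row sums to 1

# Multinoulli NLL: -sum_n log mu[n, y_n] (only the true class's probability contributes)
nll = -np.sum(np.log(mu[np.arange(len(y)), y]))
print("row sums:", mu.sum(axis=1))  # all equal to 1, as noted on the slide
print("NLL:", nll)
```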


Coming up next
 Generative models for supervised learning

