Lecture 14 - Logistic and Softmax Regression


Probabilistic Models for Supervised Learning (2):
Logistic and Softmax Regression

CS771: Introduction to Machine Learning
Piyush Rai

Logistic Regression (LR)

Note: LR and its multi-class extension (known as "softmax regression") are both very widely used. The word "regression" is a misnomer; both are classification models.

 A probabilistic model for binary classification
 Learns the PMF of the output label given the input, i.e., $p(y \mid \mathbf{x})$
 A discriminative model: does not model the inputs (only the relationship between $\mathbf{x}$ and $y$)
 Uses the sigmoid function to define the conditional probability of $y$ being 1

$$\mu(\mathbf{x}) = p(y = 1 \mid \mathbf{w}, \mathbf{x}) = \sigma(\mathbf{w}^\top\mathbf{x}) = \frac{1}{1+\exp(-\mathbf{w}^\top\mathbf{x})} = \frac{\exp(\mathbf{w}^\top\mathbf{x})}{1+\exp(\mathbf{w}^\top\mathbf{x})}$$

[Figure: the sigmoid $\sigma(z)$ plotted against $z$, rising from 0 through 0.5 at $z = 0$ towards 1. The score $\mathbf{w}^\top\mathbf{x}$ itself is a linear model.]
 Here $\mathbf{w}^\top\mathbf{x}$ is the score for input $\mathbf{x}$. The sigmoid turns it into a probability
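
To make the mapping concrete, here is a minimal NumPy sketch (not from the slides; the weight vector and input are made-up values) of turning the linear score $\mathbf{w}^\top\mathbf{x}$ into a probability with the sigmoid:

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z)), as defined on the slide
    return 1.0 / (1.0 + np.exp(-z))

# Made-up weight vector and input (D = 3), purely for illustration
w = np.array([0.5, -1.2, 2.0])
x = np.array([1.0, 0.3, 0.8])

score = w @ x            # the linear score w^T x
mu = sigmoid(score)      # p(y = 1 | w, x)
print(f"score = {score:.3f}, p(y=1|w,x) = {mu:.3f}")
```
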
LR: Decision Boundary
 At the decision boundary, both classes are equiprobable: $p(y=1 \mid \mathbf{w},\mathbf{x}) = p(y=0 \mid \mathbf{w},\mathbf{x})$, which gives $\mathbf{w}^\top\mathbf{x} = 0$, a linear hyperplane
 Very large positive $\mathbf{w}^\top\mathbf{x}$ means $p(y=1 \mid \mathbf{w},\mathbf{x})$ is close to 1
 Very large negative $\mathbf{w}^\top\mathbf{x}$ means $p(y=0 \mid \mathbf{w},\mathbf{x})$ is close to 1
 At the decision boundary, $\mathbf{w}^\top\mathbf{x} = 0$ implies $p(y=1 \mid \mathbf{w},\mathbf{x}) = p(y=0 \mid \mathbf{w},\mathbf{x}) = 0.5$
MLE for Logistic Regression
 Likelihood (PMF of each input's label) is Bernoulli with probability $\mu_n = \sigma(\mathbf{w}^\top\mathbf{x}_n)$ (labels assumed to be 0/1, not -1/+1)

$$p(y_n \mid \mathbf{w}, \mathbf{x}_n) = \text{Bernoulli}(\mu_n) = \mu_n^{y_n}(1-\mu_n)^{1-y_n}$$
 Overall likelihood, assuming i.i.d. observations:

$$p(\mathbf{y} \mid \mathbf{w}, \mathbf{X}) = \prod_{n=1}^{N} p(y_n \mid \mathbf{w}, \mathbf{x}_n) = \prod_{n=1}^{N} \mu_n^{y_n}(1-\mu_n)^{1-y_n}$$
 The negative log-likelihood simplifies to the "cross-entropy" loss (a popular loss function for classification):

$$\text{NLL}(\mathbf{w}) = \sum_{n=1}^{N} -\left[ y_n \log \mu_n + (1 - y_n)\log(1 - \mu_n) \right]$$

Note: the loss is very large if $\mu_n$ is close to 1 and $y_n$ is close to 0, or vice-versa.
 Plugging in $\mu_n = \sigma(\mathbf{w}^\top\mathbf{x}_n)$ and simplifying:

$$\text{NLL}(\mathbf{w}) = -\sum_{n=1}^{N} \left[ y_n \mathbf{w}^\top\mathbf{x}_n - \log(1+\exp(\mathbf{w}^\top\mathbf{x}_n)) \right]$$

Good news: for LR, the NLL is convex. However, there is no closed-form expression for its minimizer, so iterative optimization is needed (gradient or Hessian based). Exercise: try working out the gradient of the NLL and notice the form of the expression.
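
As a companion to the exercise: the gradient works out to $\sum_{n}(\mu_n - y_n)\mathbf{x}_n = \mathbf{X}^\top(\boldsymbol{\mu} - \mathbf{y})$. Below is a minimal sketch (not from the slides) of batch gradient descent on this NLL, using made-up synthetic data and a hand-picked step size:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    # NLL(w) = -sum_n [ y_n w^T x_n - log(1 + exp(w^T x_n)) ]
    scores = X @ w
    return -np.sum(y * scores - np.log1p(np.exp(scores)))

def nll_grad(w, X, y):
    # Gradient of the NLL: X^T (mu - y), with mu_n = sigmoid(w^T x_n)
    return X.T @ (sigmoid(X @ w) - y)

# Made-up synthetic data: N = 200 inputs, D = 2 features, 0/1 labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w_true = np.array([2.0, -1.0])
y = (sigmoid(X @ w_true) > rng.uniform(size=200)).astype(float)

# Batch gradient descent with a hand-picked step size (illustration only)
w = np.zeros(2)
step = 0.01
for _ in range(2000):
    w -= step * nll_grad(w, X, y)

print("estimated w:", w, " NLL:", nll(w, X, y))
```
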
An Alternate Notation
 If we assume the labels to be -1/+1 (not 0/1), the likelihood can be written as

$$p(y_n \mid \mathbf{w}, \mathbf{x}_n) = \frac{1}{1+\exp(-y_n \mathbf{w}^\top\mathbf{x}_n)} = \sigma(y_n \mathbf{w}^\top\mathbf{x}_n)$$

 Slightly more convenient notation: a single expression gives the probabilities of both possible label values
 In this case, the total negative log-likelihood will be

$$\text{NLL}(\mathbf{w}) = \sum_{n=1}^{N} \log(1+\exp(-y_n \mathbf{w}^\top\mathbf{x}_n))$$
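
A small numerical check (not from the slides; toy data) that the -1/+1 form $\sum_n \log(1+\exp(-y_n\mathbf{w}^\top\mathbf{x}_n))$ agrees with the 0/1 cross-entropy form of the NLL:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))        # toy inputs
w = rng.normal(size=3)             # toy weights
y01 = rng.integers(0, 2, size=5)   # labels in {0, 1}
ypm = 2 * y01 - 1                  # the same labels in {-1, +1}

mu = sigmoid(X @ w)
nll_01 = -np.sum(y01 * np.log(mu) + (1 - y01) * np.log(1 - mu))  # cross-entropy form
nll_pm = np.sum(np.log1p(np.exp(-ypm * (X @ w))))                # -1/+1 form

print(nll_01, nll_pm)  # the two values agree up to floating-point error
```
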
MAP Estimation for Logistic Regression
 Need a prior on the weight vector
 Just like probabilistic linear regression, can use a zero-mean Gaussian prior

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \lambda^{-1}\mathbf{I}_D) \propto \exp\left(-\frac{\lambda}{2}\mathbf{w}^\top\mathbf{w}\right)$$

 The MAP objective (log of the posterior) will be log-likelihood + log of prior; equivalently, we minimize NLL minus the log of the prior
 Therefore the MAP solution (ignoring terms that don't depend on $\mathbf{w}$) will be

$$\hat{\mathbf{w}}_{MAP} = \arg\min_{\mathbf{w}} \; \text{NLL}(\mathbf{w}) + \frac{\lambda}{2}\mathbf{w}^\top\mathbf{w}$$

Good news: this is a convex objective.
 Just like MLE case, no closed form solution. Iterative opt methods needed
 Highly efficient solvers (both first and second order) exist for MLE/MAP estimation for LR
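
A minimal sketch (not from the slides) of MAP estimation by gradient descent: compared with the MLE sketch above, the only change is the extra $\lambda\mathbf{w}$ term in the gradient, coming from the $\frac{\lambda}{2}\mathbf{w}^\top\mathbf{w}$ regularizer. Data and step size are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_grad(w, X, y, lam):
    # Gradient of NLL(w) + (lam/2) w^T w  =  X^T (mu - y) + lam * w
    return X.T @ (sigmoid(X @ w) - y) + lam * w

# Made-up data: N = 200, D = 2, labels in {0, 1}
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([2.0, -1.0]) + 0.5 * rng.normal(size=200) > 0).astype(float)

# Gradient descent on the (convex) MAP objective
w, lam, step = np.zeros(2), 1.0, 0.01
for _ in range(2000):
    w -= step * map_grad(w, X, y, lam)

print("w_MAP:", w)
```

In practice, the highly efficient solvers mentioned above include off-the-shelf implementations such as scikit-learn's LogisticRegression, whose C parameter is an inverse regularization strength (roughly playing the role of $1/\lambda$ here).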

Fully Bayesian Inference for Logistic Regression
 Doing fully Bayesian inference would require computing the posterior

$$p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) = \frac{p(\mathbf{w})\, p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})}{p(\mathbf{y} \mid \mathbf{X})} = \frac{p(\mathbf{w}) \prod_{n=1}^{N} p(y_n \mid \mathbf{w}, \mathbf{x}_n)}{\int p(\mathbf{w}) \prod_{n=1}^{N} p(y_n \mid \mathbf{w}, \mathbf{x}_n)\, d\mathbf{w}}$$

Unfortunately, the Gaussian (prior) and the Bernoulli (likelihood) are not conjugate with each other, so an analytic expression for the posterior can't be obtained, unlike probabilistic linear regression.

 Need to approximate the posterior in this case
 We will use a simple approximation called the Laplace approximation: it approximates the posterior of $\mathbf{w}$ by a Gaussian whose mean is the MAP solution and whose covariance matrix is the inverse of the Hessian (the second-derivative matrix of the negative log-posterior of the LR model); see the sketch below
 Can also employ more advanced posterior approximation methods, like MCMC and variational inference (beyond the scope of CS771)
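
A minimal sketch (not from the slides) of the Laplace approximation for LR: find $\mathbf{w}_{MAP}$ by gradient descent, then use the Hessian of the negative log-posterior, which for this model is $\mathbf{X}^\top\mathrm{diag}(\mu_n(1-\mu_n))\mathbf{X} + \lambda\mathbf{I}$, to get the Gaussian's covariance. The data, step size, and iteration count are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_approx(X, y, lam=1.0, step=0.01, iters=3000):
    """Gaussian (Laplace) approximation N(w_MAP, H^{-1}) to the LR posterior."""
    N, D = X.shape
    w = np.zeros(D)
    # 1) Find w_MAP by gradient descent on NLL(w) + (lam/2) w^T w
    for _ in range(iters):
        mu = sigmoid(X @ w)
        w -= step * (X.T @ (mu - y) + lam * w)
    # 2) Hessian of the negative log-posterior at w_MAP:
    #    H = X^T diag(mu * (1 - mu)) X + lam * I
    mu = sigmoid(X @ w)
    H = (X * (mu * (1 - mu))[:, None]).T @ X + lam * np.eye(D)
    return w, np.linalg.inv(H)  # mean and covariance of the Gaussian approximation

# Made-up data: N = 100, D = 2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X @ np.array([1.5, -2.0]) > 0).astype(float)

w_map, Sigma = laplace_approx(X, y)
print("posterior mean (w_MAP):", w_map)
print("posterior covariance:\n", Sigma)
```
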
Posterior for LR: An Illustration
 Can sample from the posterior of the LR model
 Each sample will give a weight vector defining a hyperplane separator
 Not all separators are equally good; their goodness depends on their posterior probabilities
 When making predictions, we can still use all of them, but weighted by their importance based on their posterior probabilities
 That's exactly what we do when computing the predictive distribution
Logistic Regression: Predictive Distribution
 When using the MLE/MAP solution $\hat{\mathbf{w}}_{opt}$, we can use the plug-in predictive distribution

$$p(y_* = 1 \mid \mathbf{x}_*, \mathbf{X}, \mathbf{y}) \approx p(y_* = 1 \mid \hat{\mathbf{w}}_{opt}, \mathbf{x}_*) = \sigma(\hat{\mathbf{w}}_{opt}^\top \mathbf{x}_*)$$

 When using fully Bayesian inference, we must compute the posterior predictive distribution by averaging over the posterior of $\mathbf{w}$. This integral is not tractable (the sigmoid and the Gaussian from the Laplace approximation are not conjugate), so it must be approximated
 Monte Carlo approximation of this integral is one possible way: generate samples $\mathbf{w}^{(1)}, \dots, \mathbf{w}^{(S)}$ from the Gaussian approximation of the posterior and use $p(y_* = 1 \mid \mathbf{x}_*, \mathbf{X}, \mathbf{y}) \approx \frac{1}{S}\sum_{s=1}^{S}\sigma(\mathbf{w}^{(s)\top}\mathbf{x}_*)$ (see the sketch below)
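
A minimal sketch (not from the slides) of the plug-in prediction versus the Monte Carlo posterior predictive; the Laplace-approximation mean and covariance below are made-up numbers standing in for the output of the previous sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up Laplace-approximation output (mean and covariance of the posterior)
w_map = np.array([1.4, -1.9])
Sigma = np.array([[0.05, 0.01],
                  [0.01, 0.08]])

x_star = np.array([0.7, -0.2])  # test input
rng = np.random.default_rng(0)

# Plug-in prediction: use the single MAP weight vector
p_plugin = sigmoid(w_map @ x_star)

# Monte Carlo posterior predictive: average the sigmoid over S posterior samples
S = 1000
w_samples = rng.multivariate_normal(w_map, Sigma, size=S)  # shape (S, D)
p_bayes = sigmoid(w_samples @ x_star).mean()

print(f"plug-in: {p_plugin:.3f}, posterior-averaged: {p_bayes:.3f}")
```
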
LR: Plug-in Prediction vs Posterior Averaging

 Posterior averaging is like using an ensemble of models. In this example, each model is a linear classifier, but the ensemble-like effect results in nonlinear boundaries
Multiclass Logistic (a.k.a. Softmax) Regression
 Also called multinoulli/multinomial regression: basically, LR for $K$ classes
 In this case, $y_n \in \{1, 2, \dots, K\}$, and the label probabilities are defined by the softmax function

$$p(y_n = k \mid \mathbf{x}_n, \mathbf{W}) = \frac{\exp(\mathbf{w}_k^\top \mathbf{x}_n)}{\sum_{\ell=1}^{K} \exp(\mathbf{w}_\ell^\top \mathbf{x}_n)} = \mu_{nk}$$

Note: $\sum_{k=1}^{K} \mu_{nk} = 1$ for any input.

 There are $K$ weight vectors (one per class), each $D$-dimensional, collected in $\mathbf{W}$
 Each likelihood is a multinoulli distribution, so the total likelihood is

$$p(\mathbf{y} \mid \mathbf{X}, \mathbf{W}) = \prod_{n=1}^{N} \prod_{\ell=1}^{K} \mu_{n\ell}^{y_{n\ell}}$$

Notation: $y_{n\ell} = 1$ if the true class of $\mathbf{x}_n$ is $\ell$, and $y_{n\ell} = 0$ otherwise.

 Can do MLE/MAP/fully Bayesian estimation for $\mathbf{W}$, similar to the LR model (a small sketch of the softmax probabilities and the multinoulli NLL follows below)
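
A minimal sketch (not from the slides) of the softmax probabilities $\mu_{nk}$ and the corresponding multinoulli NLL on made-up data:

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax; subtracting the row max is for numerical stability
    z = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Made-up data: N = 4 inputs, D = 3 features, K = 3 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 3))   # one D-dimensional weight vector per class (columns)
y = np.array([0, 2, 1, 2])    # true class labels

mu = softmax(X @ W)           # mu[n, k] = p(y_n = k | x_n, W); each row sums to 1

# Multinoulli NLL: -sum_n log mu[n, y_n] (only the true class's probability contributes)
nll = -np.sum(np.log(mu[np.arange(len(y)), y]))
print("row sums:", mu.sum(axis=1))  # all equal to 1, as noted on the slide
print("NLL:", nll)
```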


Coming up next
 Generative models for supervised learning

