Lec 05


Logistic Regression

Jia-Bin Huang
ECE-5424G / CS-5824 Virginia Tech Spring 2019
Administrative
• Please start HW 1 early!

• Questions are welcome!


Two principles for estimating parameters
• Maximum Likelihood Estimate (MLE)
Choose the parameter θ that maximizes the probability of the observed data

• Maximum a posteriori estimation (MAP)


Choose the parameter θ that is most probable given the prior probability and the
data

Slide credit: Tom Mitchell
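A compact sketch of the two principles in symbols (using θ for the parameters and D for the observed data; this notation is mine, not the slide's):

    % MLE: pick the theta under which the observed data are most probable
    \hat{\theta}_{MLE} = \arg\max_\theta P(D \mid \theta)

    % MAP: weight the likelihood by a prior P(theta) over the parameters
    \hat{\theta}_{MAP} = \arg\max_\theta P(\theta \mid D) = \arg\max_\theta P(D \mid \theta)\, P(\theta)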


Naïve Bayes classifier
• Want to learn P(Y | X1, …, Xn)
• But that requires estimating one parameter per configuration of X1, …, Xn (order 2^n for n Boolean features)...

• How about applying Bayes rule?

• P(X1, …, Xn | Y): Need 2(2^n − 1) parameters
• P(Y): Need 1 parameter

• Apply conditional independence assumption

• P(Xi | Y): Need 2n parameters
Naïve Bayes classifier
• Bayes rule:
	P(Y = y_k | X1, …, Xn) = P(Y = y_k) P(X1, …, Xn | Y = y_k) / P(X1, …, Xn)

• Assume conditional independence among the Xi's:
	P(Y = y_k | X1, …, Xn) ∝ P(Y = y_k) ∏i P(Xi | Y = y_k)

• Pick the most probable Y:
	Y_new ← argmax over y_k of P(Y = y_k) ∏i P(Xi_new | Y = y_k)

Slide credit: Tom Mitchell


Example

(Worked example on the slide: the parameters are estimated from a small training set, then a test example is classified by applying Bayes rule with the conditional independence assumption and comparing the two class scores; the numerical details are shown on the slide.)
Naïve Bayes algorithm – discrete
• For each value y_k
	Estimate π_k ≡ P(Y = y_k)
	For each value x_ij of each attribute X_i
	Estimate θ_ijk ≡ P(X_i = x_ij | Y = y_k)

• Classify (X_new): Y_new ← argmax over y_k of π_k ∏_i P(X_i^new | Y = y_k)

Slide credit: Tom Mitchell
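A minimal Python sketch of this discrete Naïve Bayes procedure (the function names, dictionary layout, and optional smoothing constant are illustrative assumptions, not from the slides):

    import numpy as np

    def train_nb(X, y, n_values, smoothing=0.0):
        """Estimate P(Y = k) and P(X_i = v | Y = k) by counting (MLE when smoothing = 0)."""
        classes = np.unique(y)
        d = X.shape[1]
        priors = {k: np.mean(y == k) for k in classes}
        cond = {}  # cond[(i, v, k)] = P(X_i = v | Y = k)
        for k in classes:
            Xk = X[y == k]
            for i in range(d):
                for v in range(n_values[i]):
                    cond[(i, v, k)] = (np.sum(Xk[:, i] == v) + smoothing) / \
                                      (len(Xk) + smoothing * n_values[i])
        return priors, cond

    def classify_nb(x, priors, cond):
        """Pick argmax_k P(Y = k) * prod_i P(X_i = x_i | Y = k), computed in log space."""
        scores = {k: np.log(p) + sum(np.log(cond[(i, v, k)]) for i, v in enumerate(x))
                  for k, p in priors.items()}
        return max(scores, key=scores.get)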


Estimating parameters: discrete
• Maximum likelihood estimates (MLE): use the observed relative frequencies
	P̂(Y = y_k) = #D{Y = y_k} / |D|
	P̂(X_i = x_ij | Y = y_k) = #D{X_i = x_ij ∧ Y = y_k} / #D{Y = y_k}

Slide credit: Tom Mitchell


• F = 1 iff you live in Fox Ridge
• S = 1 iff you watched the Super Bowl last night
• D = 1 iff you drive to VT
• G = 1 iff you went to the gym in the last month

P(F | S, D, G) ∝ P(F) P(S | F) P(D | F) P(G | F)
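As an illustration only, the posterior over F can be computed by multiplying the factors and normalizing. All probability values below are made up for the sake of the example, not taken from the lecture:

    # Hypothetical parameters: P(F = 1), and P(S = 1 | F), P(D = 1 | F), P(G = 1 | F)
    p_f = {1: 0.3, 0: 0.7}
    p_s = {1: 0.6, 0: 0.4}
    p_d = {1: 0.2, 0: 0.8}
    p_g = {1: 0.5, 0: 0.3}

    # Observed evidence: S = 1, D = 0, G = 1
    joint = {f: p_f[f] * p_s[f] * (1 - p_d[f]) * p_g[f] for f in (0, 1)}
    posterior = {f: joint[f] / sum(joint.values()) for f in (0, 1)}
    print(posterior)  # P(F | S = 1, D = 0, G = 1), normalized over F = 0, 1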


Naïve Bayes: Subtlety #1
• Often the Xᵢ are not really conditionally independent

• Naïve Bayes often works pretty well anyway


• Often gives the right classification, even when the estimated probabilities are not right
  [Domingos & Pazzani, 1996]

• What is the effect on the estimated P(Y | X)?


• What if we have two copies of a feature: Xᵢ = X_k?

Slide credit: Tom Mitchell


Naïve Bayes: Subtlety #2
MLE estimate for P(Xᵢ = xᵢ | Y = y_k) might be zero.
(for example, Xᵢ = birthdate, xᵢ = Feb_4_1995)

• Why worry about just one parameter out of many?

• What can we do to address this?


• MAP estimates (adding “imaginary” examples)
Slide credit: Tom Mitchell
Estimating parameters: discrete
• Maximum likelihood estimates (MLE): the relative-frequency estimates above

• MAP estimates (Dirichlet priors): add "imaginary" examples to every value before normalizing (see the sketch below)

Slide credit: Tom Mitchell
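A sketch of the two estimates in symbols. The notation here is mine: #D{...} counts training examples, l is the number of imaginary examples added per value, and J_i is the number of values attribute X_i can take:

    % MLE: empirical relative frequencies
    \hat{P}(X_i = x_{ij} \mid Y = y_k) = \frac{\#D\{X_i = x_{ij} \wedge Y = y_k\}}{\#D\{Y = y_k\}}

    % MAP with a Dirichlet prior: equivalent to adding l "imaginary" examples per value
    \hat{P}(X_i = x_{ij} \mid Y = y_k) = \frac{\#D\{X_i = x_{ij} \wedge Y = y_k\} + l}{\#D\{Y = y_k\} + l\,J_i}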


What if we have continuous Xᵢ?
• Gaussian Naïve Bayes (GNB): assume P(Xᵢ | Y = y_k) is Gaussian, N(μ_ik, σ_ik)

• Additional assumptions on the variance σ_ik, that it:
	• is independent of Y (σ_i)
	• is independent of Xᵢ (σ_k)
	• is independent of both (σ)

Slide credit: Tom Mitchell


Naïve Bayes algorithm – continuous
• For each value y_k
	Estimate π_k ≡ P(Y = y_k)
	For each attribute Xᵢ, estimate the
	class-conditional mean μ_ik and variance σ_ik

• Classify (X_new): Y_new ← argmax over y_k of π_k ∏_i N(X_i^new; μ_ik, σ_ik)

Slide credit: Tom Mitchell
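A minimal Gaussian Naïve Bayes sketch in Python, keeping a separate mean and variance per class and per feature (the names and the small variance floor are illustrative assumptions):

    import numpy as np

    def train_gnb(X, y):
        """Per-class prior, plus per-class/per-feature mean and variance."""
        params = {}
        for k in np.unique(y):
            Xk = X[y == k]
            params[k] = (len(Xk) / len(X),        # prior P(Y = k)
                         Xk.mean(axis=0),          # mu_ik
                         Xk.var(axis=0) + 1e-9)    # sigma^2_ik (small floor for stability)
        return params

    def classify_gnb(x, params):
        """argmax_k  log P(Y = k) + sum_i log N(x_i; mu_ik, sigma^2_ik)."""
        def log_score(prior, mu, var):
            return np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        return max(params, key=lambda k: log_score(*params[k]))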


Things to remember
• Probability basics
• Conditional probability, joint probability, Bayes rule

• Estimating parameters from data


• Maximum likelihood (ML): maximize P(data | θ)
• Maximum a posteriori estimation (MAP): maximize P(θ | data)

• Naive Bayes
Logistic Regression
• Hypothesis representation

• Cost function

• Logistic regression with gradient descent

• Regularization

• Multi-class classification
Logistic Regression
• Hypothesis representation

• Cost function

• Logistic regression with gradient descent

• Regularization

• Multi-class classification
(Plot: Malignant? (1 = Yes, 0 = No) vs. Tumor Size, with a linear fit hθ(x) = θᵀx through the labeled points)

• Threshold classifier output at 0.5


• If hθ(x) ≥ 0.5, predict "y = 1"
• If hθ(x) < 0.5, predict "y = 0"

Slide credit: Andrew Ng


Classification: y = 0 or y = 1

hθ(x) = θᵀx  (from linear regression)

hθ(x) can be > 1 or < 0

Logistic regression: 0 ≤ hθ(x) ≤ 1

Logistic regression is actually for classification

Slide credit: Andrew Ng


Hypothesis representation
• Want 0 ≤ hθ(x) ≤ 1

	hθ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))

	where g(z) = 1 / (1 + e^(−z))

• Sigmoid function
• Logistic function

(Plot: the sigmoid g(z) vs. z, rising from 0 to 1)
Slide credit: Andrew Ng
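A one-line sketch of this hypothesis in Python (it assumes the feature vector x already includes the intercept term x₀ = 1):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def h(theta, x):
        """Logistic regression hypothesis: the estimated P(y = 1 | x; theta)."""
        return sigmoid(theta @ x)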
Interpretation of hypothesis output
• hθ(x) = estimated probability that y = 1 on input x

• Example: If hθ(x) = 0.7

• Tell the patient there is a 70% chance of the tumor being malignant

Slide credit: Andrew Ng


Logistic regression
(Plot: the sigmoid g(z), with z = θᵀx)

Suppose we predict "y = 1" if hθ(x) ≥ 0.5, i.e., θᵀx ≥ 0

        and predict "y = 0" if hθ(x) < 0.5, i.e., θᵀx < 0

Slide credit: Andrew Ng


Decision boundary

(Plot: training examples plotted by Tumor Size and Age, separated by a linear decision boundary)

E.g., hθ(x) = g(θ₀ + θ₁x₁ + θ₂x₂)

• Predict "y = 1" if θ₀ + θ₁x₁ + θ₂x₂ ≥ 0

Slide credit: Andrew Ng


E.g., hθ(x) = g(θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₂²)

• Predict "y = 1" if θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₂² ≥ 0 (a non-linear decision boundary)

Slide credit: Andrew Ng
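A small sketch of these decision rules in Python; the parameter vectors below are hypothetical examples chosen only to give one concrete linear and one concrete circular boundary:

    import numpy as np

    def predict(theta, x):
        """Predict y = 1 exactly when theta^T x >= 0 (equivalently, h_theta(x) >= 0.5)."""
        return int(x @ theta >= 0)

    # Linear boundary with hypothetical theta = [-3, 1, 1]: predicts y = 1 where x1 + x2 >= 3
    print(predict(np.array([-3.0, 1.0, 1.0]), np.array([1.0, 2.0, 2.0])))          # 1

    # Non-linear boundary with squared features x = [1, x1, x2, x1^2, x2^2] and
    # hypothetical theta = [-1, 0, 0, 1, 1]: predicts y = 1 outside the unit circle
    print(predict(np.array([-1.0, 0.0, 0.0, 1.0, 1.0]),
                  np.array([1.0, 0.5, 0.5, 0.25, 0.25])))                          # 0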


Where does the form come from?
• Logistic regression hypothesis representation

• Consider learning f: X → Y, where

	• X is a vector of real-valued features ⟨X₁, …, Xₙ⟩
	• Y is Boolean
	• Assume all Xᵢ are conditionally independent given Y
	• Model P(Xᵢ | Y = y_k) as Gaussian N(μ_ik, σᵢ)
	• Model P(Y) as Bernoulli(π)

What is P(Y | X)?	Slide credit: Tom Mitchell


Applying Bayes rule
• Apply Bayes rule and divide numerator and denominator by P(Y = 1) P(X | Y = 1):

	P(Y = 1 | X) = P(Y = 1) P(X | Y = 1) / [ P(Y = 1) P(X | Y = 1) + P(Y = 0) P(X | Y = 0) ]
	             = 1 / ( 1 + exp( ln [ P(Y = 0) P(X | Y = 0) / ( P(Y = 1) P(X | Y = 1) ) ] ) )

• Apply the conditional independence assumption and plug in the Gaussian class-conditionals

	P(Xᵢ | Y = y_k) = (1 / √(2π σᵢ²)) exp( −(Xᵢ − μ_ik)² / (2σᵢ²) )

	which collapses to the logistic form

	P(Y = 1 | X) = 1 / ( 1 + exp( ln((1 − π)/π) + Σᵢ [ ((μ_i0 − μ_i1)/σᵢ²) Xᵢ + (μ_i1² − μ_i0²)/(2σᵢ²) ] ) )

	i.e., 1 / (1 + e^(−θᵀx)) for a suitable θ that is linear in the features.

Slide credit: Tom Mitchell


Logistic Regression
• Hypothesis representation

• Cost function

• Logistic regression with gradient descent

• Regularization

• Multi-class classification
Training set with m examples: {(x⁽¹⁾, y⁽¹⁾), (x⁽²⁾, y⁽²⁾), …, (x⁽ᵐ⁾, y⁽ᵐ⁾)}, y ∈ {0, 1}

How to choose the parameters θ?

Slide credit: Andrew Ng


Cost function for Linear Regression

	J(θ) = (1 / 2m) Σᵢ₌₁ᵐ ( hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾ )²   (squared error; non-convex once hθ is the sigmoid)
Slide credit: Andrew Ng


Cost function for Logistic Regression

	Cost(hθ(x), y) = { −log(hθ(x))        if y = 1
	                 { −log(1 − hθ(x))    if y = 0

(Plots: Cost vs. hθ(x) on [0, 1], one curve for y = 1 and one for y = 0)

Slide credit: Andrew Ng


Logistic regression cost function

	J(θ) = −(1/m) Σᵢ₌₁ᵐ [ y⁽ⁱ⁾ log(hθ(x⁽ⁱ⁾)) + (1 − y⁽ⁱ⁾) log(1 − hθ(x⁽ⁱ⁾)) ]

• If y = 1: only the −log(hθ(x)) term contributes
• If y = 0: only the −log(1 − hθ(x)) term contributes

Slide credit: Andrew Ng
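A vectorized sketch of this cost in Python (the small eps inside the logs is my addition for numerical safety, not part of the slide's formula):

    import numpy as np

    def cost(theta, X, y, eps=1e-12):
        """J(theta) = -(1/m) * sum( y*log(h) + (1-y)*log(1-h) )."""
        h = 1.0 / (1.0 + np.exp(-X @ theta))
        m = len(y)
        return -(y @ np.log(h + eps) + (1 - y) @ np.log(1 - h + eps)) / m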


Logistic regression

Learning: fit the parameters θ by minimizing J(θ).   Prediction: given a new x,

output hθ(x) = 1 / (1 + e^(−θᵀx)), the estimated P(y = 1 | x; θ)

Slide credit: Andrew Ng


Where does the cost come from?
• Training set with m examples

• Maximum likelihood estimate for parameter θ: maximize P(data | θ)

• Maximum conditional likelihood estimate for parameter θ: maximize ∏ᵢ P(y⁽ⁱ⁾ | x⁽ⁱ⁾; θ)

Slide credit: Tom Mitchell


• Goal: choose θ to maximize the conditional likelihood of the training data

• Training data D = {(x⁽¹⁾, y⁽¹⁾), …, (x⁽ᵐ⁾, y⁽ᵐ⁾)}
• Data likelihood: ∏ᵢ P(x⁽ⁱ⁾, y⁽ⁱ⁾; θ)
• Data conditional likelihood: ∏ᵢ P(y⁽ⁱ⁾ | x⁽ⁱ⁾; θ)

Slide credit: Tom Mitchell


Expressing the conditional log-likelihood

	ℓ(θ) = Σᵢ₌₁ᵐ [ y⁽ⁱ⁾ log(hθ(x⁽ⁱ⁾)) + (1 − y⁽ⁱ⁾) log(1 − hθ(x⁽ⁱ⁾)) ]

	Maximizing ℓ(θ) is exactly minimizing the sum of the per-example costs

	Cost(hθ(x), y) = { −log(hθ(x))        if y = 1
	                 { −log(1 − hθ(x))    if y = 0
Logistic Regression
• Hypothesis representation

• Cost function

• Logistic regression with gradient descent

• Regularization

• Multi-class classification
Gradient descent

Goal: min over θ of J(θ)
Good news: J(θ) is a convex function!
Bad news: no analytical solution

Repeat {
	θⱼ := θⱼ − α (∂/∂θⱼ) J(θ)
	(Simultaneously update all θⱼ)
}

	(∂/∂θⱼ) J(θ) = (1/m) Σᵢ₌₁ᵐ ( hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾ ) xⱼ⁽ⁱ⁾
Slide credit: Andrew Ng
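A minimal batch gradient descent sketch for this update (the learning rate and iteration count are illustrative choices, not values from the lecture):

    import numpy as np

    def gradient_descent(X, y, alpha=0.1, iters=1000):
        """Repeat theta_j := theta_j - alpha * (1/m) * sum_i (h(x_i) - y_i) * x_ij."""
        m, d = X.shape
        theta = np.zeros(d)
        for _ in range(iters):
            h = 1.0 / (1.0 + np.exp(-X @ theta))
            theta -= alpha * (X.T @ (h - y)) / m   # simultaneous update of all theta_j
        return theta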
Gradient descent

Goal: min over θ of J(θ)

Repeat {
	θⱼ := θⱼ − α (1/m) Σᵢ₌₁ᵐ ( hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾ ) xⱼ⁽ⁱ⁾
	(Simultaneously update all θⱼ)
}

Slide credit: Andrew Ng


Gradient descent for Linear Regression
Repeat {
	θⱼ := θⱼ − α (1/m) Σᵢ₌₁ᵐ ( hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾ ) xⱼ⁽ⁱ⁾
}
	with hθ(x) = θᵀx

Gradient descent for Logistic Regression


Repeat {
	θⱼ := θⱼ − α (1/m) Σᵢ₌₁ᵐ ( hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾ ) xⱼ⁽ⁱ⁾
}
	with hθ(x) = 1 / (1 + e^(−θᵀx))

The update rule looks identical; only the hypothesis hθ(x) has changed.
Slide credit: Andrew Ng
Logistic Regression
• Hypothesis representation

• Cost function

• Logistic regression with gradient descent

• Regularization

• Multi-class classification
How about MAP?
• Maximum conditional likelihood estimate (MCLE): maximize ∏ᵢ P(y⁽ⁱ⁾ | x⁽ⁱ⁾; θ)

• Maximum conditional a posteriori estimate (MCAP): maximize P(θ) ∏ᵢ P(y⁽ⁱ⁾ | x⁽ⁱ⁾; θ)

Prior P(θ)
• Common choice of P(θ):
	• Normal distribution, zero mean, identity covariance
	• "Pushes" parameters towards zero
	• Corresponds to ℓ₂ regularization
	• Helps avoid very large weights and overfitting

Slide credit: Tom Mitchell
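A sketch of the two objectives in symbols; with a zero-mean Gaussian prior the MCAP objective picks up an L2 penalty, where λ (my notation) is determined by the prior variance:

    % MCLE: maximize the conditional log-likelihood
    \hat{\theta}_{MCLE} = \arg\max_\theta \sum_{i=1}^{m} \ln P(y^{(i)} \mid x^{(i)}; \theta)

    % MCAP with a zero-mean Gaussian prior on theta
    \hat{\theta}_{MCAP} = \arg\max_\theta \Big[ \sum_{i=1}^{m} \ln P(y^{(i)} \mid x^{(i)}; \theta) - \frac{\lambda}{2}\|\theta\|_2^2 \Big]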


MLE vs. MAP
• Maximum conditional likelihood estimate (MCLE)

• Maximum conditional a posteriori estimate (MCAP)


Logistic Regression
• Hypothesis representation

• Cost function

• Logistic regression with gradient descent

• Regularization

• Multi-class classification
Multi-class classification
• Email foldering/tagging: Work, Friends, Family, Hobby

• Medical diagnosis: Not ill, Cold, Flu

• Weather: Sunny, Cloudy, Rain, Snow

Slide credit: Andrew Ng


Binary classification vs. multiclass classification

(Plots: data in the (x₁, x₂) plane; two classes on the left, three classes on the right)
One-vs-all (one-vs-rest)

(Plots: the three-class data set split into three binary problems, one per class, fit by classifiers hθ⁽¹⁾(x), hθ⁽²⁾(x), hθ⁽³⁾(x))

	hθ⁽ⁱ⁾(x) = P(y = i | x; θ)   (i = 1, 2, 3)
Slide credit: Andrew Ng
One-vs-all
• Train a logistic regression classifier hθ⁽ⁱ⁾(x) for each class i
  to predict the probability that y = i

• Given a new input x, pick the class i that maximizes hθ⁽ⁱ⁾(x)

Slide credit: Andrew Ng
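A one-vs-all sketch in Python; it reuses the hypothetical gradient_descent helper sketched earlier, so that function is assumed to be in scope:

    import numpy as np

    def train_one_vs_all(X, y, classes, alpha=0.1, iters=1000):
        """Fit one logistic regression per class c, treating y == c as the positive class."""
        return {c: gradient_descent(X, (y == c).astype(float), alpha, iters) for c in classes}

    def predict_one_vs_all(x, thetas):
        """Pick the class whose classifier gives the largest estimated P(y = c | x; theta)."""
        return max(thetas, key=lambda c: 1.0 / (1.0 + np.exp(-x @ thetas[c])))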


Generative Approach (Ex: Naïve Bayes)
• Estimate P(X | Y) and P(Y)
• Prediction: ŷ = argmax over y of P(Y = y) P(X | Y = y)

Discriminative Approach (Ex: Logistic regression)
• Estimate P(Y | X) directly
  (Or a discriminant function: e.g., SVM)
• Prediction: ŷ = argmax over y of P(Y = y | X)
Further readings
• Tom M. Mitchell
Generative and discriminative classifiers: Naïve Bayes and Logistic Regression
http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf

• Andrew Ng, Michael Jordan
On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes
http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf
Things to remember
• Hypothesis representation
	hθ(x) = 1 / (1 + e^(−θᵀx))

• Cost function
	Cost(hθ(x), y) = { −log(hθ(x))        if y = 1
	                 { −log(1 − hθ(x))    if y = 0

• Logistic regression with gradient descent
	θⱼ := θⱼ − α (1/m) Σᵢ₌₁ᵐ ( hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾ ) xⱼ⁽ⁱ⁾

• Regularization
	θⱼ := θⱼ − αλθⱼ − α (1/m) Σᵢ₌₁ᵐ ( hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾ ) xⱼ⁽ⁱ⁾

• Multi-class classification
Coming up…
• Regularization

• Support Vector Machine
