
Logistic Regression

These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made
their course materials freely available online. Feel free to reuse or adapt these slides for your own academic
purposes, provided that you include proper attribution. Please send comments and corrections to Eric.
Robot Image Credit: Viktoriya Sukhanova © 123RF.com
Classification Based on Probability
•  Instead of just predicting the class, give the probability
of the instance being that class
–  i.e., learn p(y | x)

•  Comparison to the perceptron:
–  The perceptron doesn't produce a probability estimate
–  The perceptron (like other discriminative classifiers) is only
interested in producing a discriminative model

•  Recall that:
$0 \le p(\text{event}) \le 1$
$p(\text{event}) + p(\neg \text{event}) = 1$

Logistic Regression
•  Takes a probabilistic approach to learning
discriminative functions (i.e., a classifier)
•  $h_\theta(x)$ should give $p(y = 1 \mid x; \theta)$
–  Want $0 \le h_\theta(x) \le 1$
(Can't just use linear regression with a threshold)

•  Logistic regression model:

Logistic / Sigmoid Function:
$$g(z) = \frac{1}{1 + e^{-z}}$$
$$h_\theta(x) = g(\theta^\top x) = \frac{1}{1 + e^{-\theta^\top x}}$$

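A minimal NumPy sketch of the sigmoid and hypothesis above (the function names `sigmoid` and `hypothesis` are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic / sigmoid function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x), the estimated p(y = 1 | x; theta).

    theta and x are 1-D arrays of the same length, with x[0] = 1
    for the intercept term.
    """
    return sigmoid(theta @ x)
```
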
Interpretation of Hypothesis Output
$h_\theta(x)$ = estimated $p(y = 1 \mid x; \theta)$

Example: Cancer diagnosis from tumor size

$$x = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} = \begin{bmatrix} 1 \\ \text{tumorSize} \end{bmatrix}$$

$h_\theta(x) = 0.7$
→ Tell the patient that there is a 70% chance of the tumor being malignant

Note that: $p(y = 0 \mid x; \theta) + p(y = 1 \mid x; \theta) = 1$

Therefore, $p(y = 0 \mid x; \theta) = 1 - p(y = 1 \mid x; \theta)$

Based on example by Andrew Ng

Another Interpretation
•  Equivalently, logistic regression assumes that
$$\log \frac{p(y = 1 \mid x; \theta)}{p(y = 0 \mid x; \theta)} = \theta_0 + \theta_1 x_1 + \ldots + \theta_d x_d$$
(the left-hand side is the log of the odds of y = 1)

Side Note: the odds in favor of an event is the quantity
p / (1 − p), where p is the probability of the event

E.g., if I toss a fair die, what are the odds that I will get a 6?
(p = 1/6, so the odds are (1/6) / (5/6) = 1/5)

•  In other words, logistic regression assumes that the
log odds is a linear function of x

Based on slide by Xiaoli Fern



Logistic Regression

$$h_\theta(x) = g(\theta^\top x) \qquad g(z) = \frac{1}{1 + e^{-z}}$$

$\theta^\top x$ should take large negative values for negative instances
and large positive values for positive instances

•  Assume a threshold and...
–  Predict y = 1 if $h_\theta(x) \ge 0.5$
–  Predict y = 0 if $h_\theta(x) < 0.5$

Based on slide by Andrew Ng

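A small sketch of this decision rule (the function name `predict` and the (n, d+1) design-matrix convention are assumptions for illustration):

```python
import numpy as np

def predict(theta, X, threshold=0.5):
    """Predict y = 1 when h_theta(x) >= threshold, else y = 0.

    X is an (n, d+1) design matrix whose first column is all ones.
    """
    probs = 1.0 / (1.0 + np.exp(-X @ theta))  # h_theta(x) for each row
    return (probs >= threshold).astype(int)
```
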
Non-Linear Decision Boundary
•  Can apply basis function expansion to the features, same
as with linear regression:

$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \;\longrightarrow\; \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ x_1 x_2 \\ x_1^2 \\ x_2^2 \\ x_1^2 x_2 \\ x_1 x_2^2 \\ \vdots \end{bmatrix}$$

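One possible implementation of such an expansion for two features; the exact set of monomials here just mirrors the vector above and is not prescribed by the slides:

```python
import numpy as np

def expand_features(x1, x2):
    """Map (x1, x2) to a vector of polynomial basis features.

    This particular set of monomials mirrors the expansion shown above;
    in practice the degree would be chosen to suit the problem.
    """
    return np.array([1.0, x1, x2, x1 * x2,
                     x1**2, x2**2, x1**2 * x2, x1 * x2**2])
```
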
Logistic Regression
•  Given $\left\{ \left(x^{(1)}, y^{(1)}\right), \left(x^{(2)}, y^{(2)}\right), \ldots, \left(x^{(n)}, y^{(n)}\right) \right\}$
where $x^{(i)} \in \mathbb{R}^d$, $y^{(i)} \in \{0, 1\}$

•  Model: $h_\theta(x) = g(\theta^\top x)$ with $g(z) = \frac{1}{1 + e^{-z}}$

$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} \qquad x^\top = \begin{bmatrix} 1 & x_1 & \ldots & x_d \end{bmatrix}$$

Logistic Regression Objective Function
•  Can't just use squared loss as in linear regression:
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$$

–  Using the logistic regression model
$$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$$
results in a non-convex optimization

Deriving the Cost Function via
Maximum Likelihood Estimation
•  Likelihood of the data is given by:
$$l(\theta) = \prod_{i=1}^{n} p\!\left(y^{(i)} \mid x^{(i)}; \theta\right)$$

•  So, looking for the θ that maximizes the likelihood:
$$\theta_{\mathrm{MLE}} = \arg\max_\theta l(\theta) = \arg\max_\theta \prod_{i=1}^{n} p\!\left(y^{(i)} \mid x^{(i)}; \theta\right)$$

•  Can take the log without changing the solution:
$$\theta_{\mathrm{MLE}} = \arg\max_\theta \log \prod_{i=1}^{n} p\!\left(y^{(i)} \mid x^{(i)}; \theta\right) = \arg\max_\theta \sum_{i=1}^{n} \log p\!\left(y^{(i)} \mid x^{(i)}; \theta\right)$$

Deriving the Cost Function via
Maximum Likelihood Estimation
•  Expand as follows:
$$\theta_{\mathrm{MLE}} = \arg\max_\theta \sum_{i=1}^{n} \log p\!\left(y^{(i)} \mid x^{(i)}; \theta\right)$$
$$= \arg\max_\theta \sum_{i=1}^{n} \left[ y^{(i)} \log p\!\left(y^{(i)} = 1 \mid x^{(i)}; \theta\right) + \left(1 - y^{(i)}\right) \log\!\left(1 - p\!\left(y^{(i)} = 1 \mid x^{(i)}; \theta\right)\right) \right]$$

•  Substitute in the model, and take the negative to yield the

Logistic regression objective:
$$\min_\theta J(\theta)$$
$$J(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\!\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\!\left(1 - h_\theta\!\left(x^{(i)}\right)\right) \right]$$

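A minimal NumPy sketch of this objective (the name `logistic_cost` and the small `eps` clamp for numerical safety are my additions, not part of the slides):

```python
import numpy as np

def logistic_cost(theta, X, y, eps=1e-12):
    """Negative log-likelihood J(theta) for logistic regression.

    X: (n, d+1) design matrix with a leading column of ones.
    y: (n,) array of 0/1 labels.
    """
    h = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x^(i)) for all i
    h = np.clip(h, eps, 1.0 - eps)         # avoid log(0)
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```
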
Intuition Behind the Objective
$$J(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\!\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\!\left(1 - h_\theta\!\left(x^{(i)}\right)\right) \right]$$

•  Cost of a single instance:
$$\mathrm{cost}\!\left(h_\theta(x), y\right) = \begin{cases} -\log\!\left(h_\theta(x)\right) & \text{if } y = 1 \\ -\log\!\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases}$$

•  Can re-write the objective function as
$$J(\theta) = \sum_{i=1}^{n} \mathrm{cost}\!\left(h_\theta\!\left(x^{(i)}\right), y^{(i)}\right)$$

Compare to linear regression: $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$

Intuition Behind the Objective
$$\mathrm{cost}\!\left(h_\theta(x), y\right) = \begin{cases} -\log\!\left(h_\theta(x)\right) & \text{if } y = 1 \\ -\log\!\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases}$$

Aside: Recall the plot of log(z)
[Figure: plot of log(z)]

Intuition Behind the Objective
$$\mathrm{cost}\!\left(h_\theta(x), y\right) = \begin{cases} -\log\!\left(h_\theta(x)\right) & \text{if } y = 1 \\ -\log\!\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases}$$

If y = 1:
•  Cost = 0 if the prediction is correct
•  As $h_\theta(x) \to 0$, cost $\to \infty$
•  Captures the intuition that larger
mistakes should get larger penalties
–  e.g., predict $h_\theta(x) = 0$, but y = 1

[Figure: cost as a function of $h_\theta(x)$ when y = 1]

Based on example by Andrew Ng

Intuition Behind the Objective
$$\mathrm{cost}\!\left(h_\theta(x), y\right) = \begin{cases} -\log\!\left(h_\theta(x)\right) & \text{if } y = 1 \\ -\log\!\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases}$$

If y = 0:
•  Cost = 0 if the prediction is correct
•  As $\left(1 - h_\theta(x)\right) \to 0$, cost $\to \infty$
•  Captures the intuition that larger
mistakes should get larger penalties

[Figure: cost as a function of $h_\theta(x)$ when y = 0]

Based on example by Andrew Ng

Regularized Logistic Regression
$$J(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\!\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\!\left(1 - h_\theta\!\left(x^{(i)}\right)\right) \right]$$

•  We can regularize logistic regression exactly as before:
$$J_{\mathrm{regularized}}(\theta) = J(\theta) + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2 = J(\theta) + \frac{\lambda}{2} \left\| \theta_{[1:d]} \right\|_2^2$$

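A sketch of the regularized objective under the same assumptions as the earlier cost sketch; note that the intercept $\theta_0$ is left unpenalized, matching the sum from j = 1 to d:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) plus the L2 penalty (lambda / 2) * ||theta[1:]||^2."""
    h = 1.0 / (1.0 + np.exp(-X @ theta))
    h = np.clip(h, 1e-12, 1.0 - 1e-12)
    nll = -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
    return nll + 0.5 * lam * np.sum(theta[1:] ** 2)  # do not penalize theta_0
```
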
Gradient Descent for Logistic Regression
$$J_{\mathrm{reg}}(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\!\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\!\left(1 - h_\theta\!\left(x^{(i)}\right)\right) \right] + \frac{\lambda}{2} \left\| \theta_{[1:d]} \right\|_2^2$$

Want $\min_\theta J(\theta)$

•  Initialize $\theta$
•  Repeat until convergence (simultaneous update for j = 0 ... d)
$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

Use the natural logarithm (ln = log_e) to cancel with the exp() in $h_\theta(x)$

Gradient Descent for Logistic Regression
$$J_{\mathrm{reg}}(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\!\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\!\left(1 - h_\theta\!\left(x^{(i)}\right)\right) \right] + \frac{\lambda}{2} \left\| \theta_{[1:d]} \right\|_2^2$$

Want $\min_\theta J(\theta)$

•  Initialize $\theta$
•  Repeat until convergence (simultaneous update for j = 0 ... d)
$$\theta_0 \leftarrow \theta_0 - \alpha \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)$$
$$\theta_j \leftarrow \theta_j - \alpha \left[ \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} + \lambda \theta_j \right]$$

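A compact vectorized sketch of these updates (the function name, learning rate, and iteration count are illustrative choices, not from the slides):

```python
import numpy as np

def gradient_descent(X, y, lam=0.0, alpha=0.01, num_iters=1000):
    """Batch gradient descent for (regularized) logistic regression.

    X: (n, d+1) design matrix with a leading column of ones.
    y: (n,) array of 0/1 labels.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        h = 1.0 / (1.0 + np.exp(-X @ theta))  # predictions h_theta(x^(i))
        grad = X.T @ (h - y)                  # unregularized gradient
        grad[1:] += lam * theta[1:]           # penalize all but theta_0
        theta = theta - alpha * grad          # simultaneous update
    return theta
```
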
Gradient Descent for Logistic Regression
•  Initialize $\theta$
•  Repeat until convergence (simultaneous update for j = 0 ... d)
$$\theta_0 \leftarrow \theta_0 - \alpha \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)$$
$$\theta_j \leftarrow \theta_j - \alpha \left[ \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} + \lambda \theta_j \right]$$

This looks IDENTICAL to linear regression!!!
•  Ignoring the 1/n constant
•  However, the form of the model is very different:
$$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$$

Multi-Class Classification
Binary classification:          Multi-class classification:
[Figures: binary vs. multi-class datasets plotted on axes $x_1$ and $x_2$]

Disease diagnosis: healthy / cold / flu / pneumonia

Object classification: desk / chair / monitor / bookcase

Multi-Class Logistic Regression
•  For 2 classes:
$$h_\theta(x) = \frac{1}{1 + \exp(-\theta^\top x)} = \frac{\exp(\theta^\top x)}{1 + \exp(\theta^\top x)}$$
(in the right-hand form, the 1 in the denominator is the weight assigned
to y = 0, and $\exp(\theta^\top x)$ is the weight assigned to y = 1)

•  For C classes {1, ..., C}:
$$p(y = c \mid x; \theta_1, \ldots, \theta_C) = \frac{\exp(\theta_c^\top x)}{\sum_{c'=1}^{C} \exp(\theta_{c'}^\top x)}$$
–  Called the softmax function

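A small sketch of the softmax function above, written with the usual max-subtraction trick for numerical stability (that trick is my addition, not from the slides):

```python
import numpy as np

def softmax_probs(Theta, x):
    """p(y = c | x) for each class c.

    Theta: (C, d+1) matrix whose rows are the per-class parameter vectors.
    x:     (d+1,) feature vector with x[0] = 1.
    """
    scores = Theta @ x            # theta_c^T x for each class c
    scores = scores - scores.max()  # stability: does not change the result
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()
```
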
Multi-Class Logistic Regression
Split into One vs Rest:
[Figure: the multi-class data on axes $x_1$ and $x_2$, split into one-vs-rest problems]

•  Train a logistic regression classifier for each class c
to predict the probability that y = c with
$$h_c(x) = \frac{\exp(\theta_c^\top x)}{\sum_{c'=1}^{C} \exp(\theta_{c'}^\top x)}$$

Implementing Multi-Class
Logistic Regression
•  Use $h_c(x) = \dfrac{\exp(\theta_c^\top x)}{\sum_{c'=1}^{C} \exp(\theta_{c'}^\top x)}$ as the model for class c

•  Gradient descent simultaneously updates all parameters
for all of the models
–  Same derivative as before, just with the above $h_c(x)$

•  Predict the class label as the most probable label:
$$\hat{y} = \arg\max_c \, h_c(x)$$
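A minimal prediction sketch under the same per-class parameterization as the hypothetical `softmax_probs` above (the matrix layout is an assumption; since softmax is monotone in the scores, the argmax of $\theta_c^\top x$ gives the same label as the argmax of $h_c(x)$):

```python
import numpy as np

def predict_class(Theta, X):
    """Return the most probable class index for each row of X.

    Theta: (C, d+1) matrix of per-class parameters.
    X:     (n, d+1) design matrix with a leading column of ones.
    """
    scores = X @ Theta.T            # (n, C) matrix of theta_c^T x values
    return np.argmax(scores, axis=1)
```
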
