
Logistic Regression

These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made
their course materials freely available online. Feel free to reuse or adapt these slides for your own academic
purposes, provided that you include proper attribution. Please send comments and corrections to Eric.
Robot Image Credit: Viktoriya Sukhanova © 123RF.com
Classification Based on Probability
•  Instead of just predicting the class, give the probability
of the instance being that class
–  i.e., learn p(y | x)

•  Comparison to the perceptron:
–  The perceptron doesn't produce a probability estimate
–  The perceptron (like other discriminative classifiers) is only
interested in producing a discriminative model

•  Recall that:
$0 \le p(\text{event}) \le 1$
$p(\text{event}) + p(\neg \text{event}) = 1$

Logistic Regression
•  Takes a probabilistic approach to learning
discriminative functions (i.e., a classifier)
•  $h_\theta(x)$ should give $p(y = 1 \mid x; \theta)$
–  Want $0 \le h_\theta(x) \le 1$
(Can't just use linear regression with a threshold)

•  Logistic regression model:

Logistic / Sigmoid Function:
$$g(z) = \frac{1}{1 + e^{-z}}$$
$$h_\theta(x) = g(\theta^\top x) = \frac{1}{1 + e^{-\theta^\top x}}$$

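A minimal NumPy sketch of the sigmoid and hypothesis above (the function names `sigmoid` and `hypothesis` are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic / sigmoid function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x), the estimated p(y = 1 | x; theta).

    theta and x are 1-D arrays of the same length, with x[0] = 1
    for the intercept term.
    """
    return sigmoid(theta @ x)
```
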
Interpretation of Hypothesis Output
$h_\theta(x)$ = estimated $p(y = 1 \mid x; \theta)$

Example: Cancer diagnosis from tumor size

$$x = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} = \begin{bmatrix} 1 \\ \text{tumorSize} \end{bmatrix}$$

$h_\theta(x) = 0.7$
→ Tell the patient that there is a 70% chance of the tumor being malignant

Note that: $p(y = 0 \mid x; \theta) + p(y = 1 \mid x; \theta) = 1$

Therefore, $p(y = 0 \mid x; \theta) = 1 - p(y = 1 \mid x; \theta)$

Based on example by Andrew Ng

Another Interpretation
•  Equivalently, logistic regression assumes that
$$\log \frac{p(y = 1 \mid x; \theta)}{p(y = 0 \mid x; \theta)} = \theta_0 + \theta_1 x_1 + \ldots + \theta_d x_d$$
(the left-hand side is the log of the odds of y = 1)

Side Note: the odds in favor of an event is the quantity
p / (1 − p), where p is the probability of the event

E.g., if I toss a fair die, what are the odds that I will get a 6?
(p = 1/6, so the odds are (1/6) / (5/6) = 1/5)

•  In other words, logistic regression assumes that the
log odds is a linear function of x

Based on slide by Xiaoli Fern



Logistic Regression

$$h_\theta(x) = g(\theta^\top x) \qquad g(z) = \frac{1}{1 + e^{-z}}$$

$\theta^\top x$ should take large negative values for negative instances
and large positive values for positive instances

•  Assume a threshold and...
–  Predict y = 1 if $h_\theta(x) \ge 0.5$
–  Predict y = 0 if $h_\theta(x) < 0.5$

Based on slide by Andrew Ng

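A small sketch of this decision rule (the function name `predict` and the (n, d+1) design-matrix convention are assumptions for illustration):

```python
import numpy as np

def predict(theta, X, threshold=0.5):
    """Predict y = 1 when h_theta(x) >= threshold, else y = 0.

    X is an (n, d+1) design matrix whose first column is all ones.
    """
    probs = 1.0 / (1.0 + np.exp(-X @ theta))  # h_theta(x) for each row
    return (probs >= threshold).astype(int)
```
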
Non-Linear Decision Boundary
•  Can apply basis function expansion to the features, same
as with linear regression:

$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \;\longrightarrow\; \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ x_1 x_2 \\ x_1^2 \\ x_2^2 \\ x_1^2 x_2 \\ x_1 x_2^2 \\ \vdots \end{bmatrix}$$

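One possible implementation of such an expansion for two features; the exact set of monomials here just mirrors the vector above and is not prescribed by the slides:

```python
import numpy as np

def expand_features(x1, x2):
    """Map (x1, x2) to a vector of polynomial basis features.

    This particular set of monomials mirrors the expansion shown above;
    in practice the degree would be chosen to suit the problem.
    """
    return np.array([1.0, x1, x2, x1 * x2,
                     x1**2, x2**2, x1**2 * x2, x1 * x2**2])
```
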
Logistic Regression
•  Given $\left\{ \left(x^{(1)}, y^{(1)}\right), \left(x^{(2)}, y^{(2)}\right), \ldots, \left(x^{(n)}, y^{(n)}\right) \right\}$
where $x^{(i)} \in \mathbb{R}^d$, $y^{(i)} \in \{0, 1\}$

•  Model: $h_\theta(x) = g(\theta^\top x)$ with $g(z) = \frac{1}{1 + e^{-z}}$

$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} \qquad x^\top = \begin{bmatrix} 1 & x_1 & \ldots & x_d \end{bmatrix}$$

Logistic Regression Objective Function
•  Can't just use squared loss as in linear regression:
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$$

–  Using the logistic regression model
$$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$$
results in a non-convex optimization

Deriving the Cost Function via
Maximum Likelihood Estimation
•  Likelihood of the data is given by:
$$l(\theta) = \prod_{i=1}^{n} p\!\left(y^{(i)} \mid x^{(i)}; \theta\right)$$

•  So, looking for the θ that maximizes the likelihood:
$$\theta_{\mathrm{MLE}} = \arg\max_\theta l(\theta) = \arg\max_\theta \prod_{i=1}^{n} p\!\left(y^{(i)} \mid x^{(i)}; \theta\right)$$

•  Can take the log without changing the solution:
$$\theta_{\mathrm{MLE}} = \arg\max_\theta \log \prod_{i=1}^{n} p\!\left(y^{(i)} \mid x^{(i)}; \theta\right) = \arg\max_\theta \sum_{i=1}^{n} \log p\!\left(y^{(i)} \mid x^{(i)}; \theta\right)$$

Deriving the Cost Function via
Maximum Likelihood Estimation
•  Expand as follows:
$$\theta_{\mathrm{MLE}} = \arg\max_\theta \sum_{i=1}^{n} \log p\!\left(y^{(i)} \mid x^{(i)}; \theta\right)$$
$$= \arg\max_\theta \sum_{i=1}^{n} \left[ y^{(i)} \log p\!\left(y^{(i)} = 1 \mid x^{(i)}; \theta\right) + \left(1 - y^{(i)}\right) \log\!\left(1 - p\!\left(y^{(i)} = 1 \mid x^{(i)}; \theta\right)\right) \right]$$

•  Substitute in the model, and take the negative to yield the

Logistic regression objective:
$$\min_\theta J(\theta)$$
$$J(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\!\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\!\left(1 - h_\theta\!\left(x^{(i)}\right)\right) \right]$$

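A minimal NumPy sketch of this objective (the name `logistic_cost` and the small `eps` clamp for numerical safety are my additions, not part of the slides):

```python
import numpy as np

def logistic_cost(theta, X, y, eps=1e-12):
    """Negative log-likelihood J(theta) for logistic regression.

    X: (n, d+1) design matrix with a leading column of ones.
    y: (n,) array of 0/1 labels.
    """
    h = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x^(i)) for all i
    h = np.clip(h, eps, 1.0 - eps)         # avoid log(0)
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```
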
Intuition Behind the Objective
$$J(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\!\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\!\left(1 - h_\theta\!\left(x^{(i)}\right)\right) \right]$$

•  Cost of a single instance:
$$\mathrm{cost}\!\left(h_\theta(x), y\right) = \begin{cases} -\log\!\left(h_\theta(x)\right) & \text{if } y = 1 \\ -\log\!\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases}$$

•  Can re-write the objective function as
$$J(\theta) = \sum_{i=1}^{n} \mathrm{cost}\!\left(h_\theta\!\left(x^{(i)}\right), y^{(i)}\right)$$

Compare to linear regression: $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$

Intuition Behind the Objective
$$\mathrm{cost}\!\left(h_\theta(x), y\right) = \begin{cases} -\log\!\left(h_\theta(x)\right) & \text{if } y = 1 \\ -\log\!\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases}$$

Aside: Recall the plot of log(z)
[Figure: plot of log(z)]

Intuition Behind the Objective
$$\mathrm{cost}\!\left(h_\theta(x), y\right) = \begin{cases} -\log\!\left(h_\theta(x)\right) & \text{if } y = 1 \\ -\log\!\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases}$$

If y = 1:
•  Cost = 0 if the prediction is correct
•  As $h_\theta(x) \to 0$, cost $\to \infty$
•  Captures the intuition that larger
mistakes should get larger penalties
–  e.g., predict $h_\theta(x) = 0$, but y = 1

[Figure: cost as a function of $h_\theta(x)$ when y = 1]

Based on example by Andrew Ng

Intuition Behind the Objective
$$\mathrm{cost}\!\left(h_\theta(x), y\right) = \begin{cases} -\log\!\left(h_\theta(x)\right) & \text{if } y = 1 \\ -\log\!\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases}$$

If y = 0:
•  Cost = 0 if the prediction is correct
•  As $\left(1 - h_\theta(x)\right) \to 0$, cost $\to \infty$
•  Captures the intuition that larger
mistakes should get larger penalties

[Figure: cost as a function of $h_\theta(x)$ when y = 0]

Based on example by Andrew Ng

Regularized Logistic Regression
$$J(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\!\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\!\left(1 - h_\theta\!\left(x^{(i)}\right)\right) \right]$$

•  We can regularize logistic regression exactly as before:
$$J_{\mathrm{regularized}}(\theta) = J(\theta) + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2 = J(\theta) + \frac{\lambda}{2} \left\| \theta_{[1:d]} \right\|_2^2$$

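A sketch of the regularized objective under the same assumptions as the earlier cost sketch; note that the intercept $\theta_0$ is left unpenalized, matching the sum from j = 1 to d:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) plus the L2 penalty (lambda / 2) * ||theta[1:]||^2."""
    h = 1.0 / (1.0 + np.exp(-X @ theta))
    h = np.clip(h, 1e-12, 1.0 - 1e-12)
    nll = -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
    return nll + 0.5 * lam * np.sum(theta[1:] ** 2)  # do not penalize theta_0
```
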
Gradient Descent for Logistic Regression
$$J_{\mathrm{reg}}(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\!\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\!\left(1 - h_\theta\!\left(x^{(i)}\right)\right) \right] + \frac{\lambda}{2} \left\| \theta_{[1:d]} \right\|_2^2$$

Want $\min_\theta J(\theta)$

•  Initialize $\theta$
•  Repeat until convergence (simultaneous update for j = 0 ... d)
$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

Use the natural logarithm (ln = log_e) to cancel with the exp() in $h_\theta(x)$

Gradient Descent for Logistic Regression
$$J_{\mathrm{reg}}(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\!\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\!\left(1 - h_\theta\!\left(x^{(i)}\right)\right) \right] + \frac{\lambda}{2} \left\| \theta_{[1:d]} \right\|_2^2$$

Want $\min_\theta J(\theta)$

•  Initialize $\theta$
•  Repeat until convergence (simultaneous update for j = 0 ... d)
$$\theta_0 \leftarrow \theta_0 - \alpha \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)$$
$$\theta_j \leftarrow \theta_j - \alpha \left[ \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} + \lambda \theta_j \right]$$

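A compact vectorized sketch of these updates (the function name, learning rate, and iteration count are illustrative choices, not from the slides):

```python
import numpy as np

def gradient_descent(X, y, lam=0.0, alpha=0.01, num_iters=1000):
    """Batch gradient descent for (regularized) logistic regression.

    X: (n, d+1) design matrix with a leading column of ones.
    y: (n,) array of 0/1 labels.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        h = 1.0 / (1.0 + np.exp(-X @ theta))  # predictions h_theta(x^(i))
        grad = X.T @ (h - y)                  # unregularized gradient
        grad[1:] += lam * theta[1:]           # penalize all but theta_0
        theta = theta - alpha * grad          # simultaneous update
    return theta
```
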
Gradient Descent for Logistic Regression
•  Initialize $\theta$
•  Repeat until convergence (simultaneous update for j = 0 ... d)
$$\theta_0 \leftarrow \theta_0 - \alpha \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)$$
$$\theta_j \leftarrow \theta_j - \alpha \left[ \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} + \lambda \theta_j \right]$$

This looks IDENTICAL to linear regression!!!
•  Ignoring the 1/n constant
•  However, the form of the model is very different:
$$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$$

Multi-Class Classification
Binary classification:          Multi-class classification:
[Figures: binary vs. multi-class datasets plotted on axes $x_1$ and $x_2$]

Disease diagnosis: healthy / cold / flu / pneumonia

Object classification: desk / chair / monitor / bookcase

Multi-Class Logistic Regression
•  For 2 classes:
$$h_\theta(x) = \frac{1}{1 + \exp(-\theta^\top x)} = \frac{\exp(\theta^\top x)}{1 + \exp(\theta^\top x)}$$
(in the right-hand form, the 1 in the denominator is the weight assigned
to y = 0, and $\exp(\theta^\top x)$ is the weight assigned to y = 1)

•  For C classes {1, ..., C}:
$$p(y = c \mid x; \theta_1, \ldots, \theta_C) = \frac{\exp(\theta_c^\top x)}{\sum_{c'=1}^{C} \exp(\theta_{c'}^\top x)}$$
–  Called the softmax function

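A small sketch of the softmax function above, written with the usual max-subtraction trick for numerical stability (that trick is my addition, not from the slides):

```python
import numpy as np

def softmax_probs(Theta, x):
    """p(y = c | x) for each class c.

    Theta: (C, d+1) matrix whose rows are the per-class parameter vectors.
    x:     (d+1,) feature vector with x[0] = 1.
    """
    scores = Theta @ x            # theta_c^T x for each class c
    scores = scores - scores.max()  # stability: does not change the result
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()
```
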
Multi-Class Logistic Regression
Split into One vs Rest:
[Figure: the multi-class data on axes $x_1$ and $x_2$, split into one-vs-rest problems]

•  Train a logistic regression classifier for each class c
to predict the probability that y = c with
$$h_c(x) = \frac{\exp(\theta_c^\top x)}{\sum_{c'=1}^{C} \exp(\theta_{c'}^\top x)}$$

Implementing Multi-Class
Logistic Regression
•  Use $h_c(x) = \dfrac{\exp(\theta_c^\top x)}{\sum_{c'=1}^{C} \exp(\theta_{c'}^\top x)}$ as the model for class c

•  Gradient descent simultaneously updates all parameters
for all of the models
–  Same derivative as before, just with the above $h_c(x)$

•  Predict the class label as the most probable label:
$$\hat{y} = \arg\max_c \, h_c(x)$$
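A minimal prediction sketch under the same per-class parameterization as the hypothetical `softmax_probs` above (the matrix layout is an assumption; since softmax is monotone in the scores, the argmax of $\theta_c^\top x$ gives the same label as the argmax of $h_c(x)$):

```python
import numpy as np

def predict_class(Theta, X):
    """Return the most probable class index for each row of X.

    Theta: (C, d+1) matrix of per-class parameters.
    X:     (n, d+1) design matrix with a leading column of ones.
    """
    scores = X @ Theta.T            # (n, C) matrix of theta_c^T x values
    return np.argmax(scores, axis=1)
```
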
