
Department of Computer Engineering and Applications

Logistic Regression

Gunjan Bharadwaj
Assistant Professor
Dept of CEA
Agenda
• Classification
• Introduction to Logistic Regression
• Gradient Descent
• Multi-class Classification
Classification
• The classification problem is just like the regression problem, except
that the values we now want to predict take on only a small number
of discrete values.
• For now, we will focus on the binary classification problem in
which y can take on only two values, 0 and 1.
– For instance, if we are trying to build a spam classifier for email, then x(i) may be
some features of a piece of email, and y may be 1 if it is a piece of spam mail,
and 0 otherwise.
• Hence, y ∈ {0,1}. 0 is also called the negative class, and 1 the
positive class, and they are sometimes also denoted by the
symbols “-” and “+”.
• Given x(i), the corresponding y(i) is also called the label for the
training example.
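As an illustrative sketch (the numbers and feature choices below are made up, not taken from the slides), such a binary data set can be stored as a feature matrix X and a 0/1 label vector y:

```python
import numpy as np

# Hypothetical toy spam data set: each row of X holds features x^(i) of one
# email (e.g. two word-frequency counts), and y[i] is the label of example i.
X = np.array([[3.0, 0.5],
              [0.2, 2.0],
              [4.0, 1.0]])
y = np.array([1, 0, 1])   # 1 = spam (positive class), 0 = not spam (negative class)
```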
Logistic Regression Model
• We could approach the classification problem ignoring the fact that y is
discrete-valued, and use our old linear regression algorithm to try to predict
y given x.
• However, it is easy to construct examples where this method performs very
poorly.
[Figure: binary outcome y (0 or 1) plotted against x for 0 ≤ x ≤ 10, illustrating a linear regression fit to binary-labelled data.]
Hypothesis Representation
Linear vs Logistic Regression
Logistic Regression Model
• Intuitively, it also doesn't make sense for h_θ(x) to take values larger than 1 or smaller than 0 when we know that y ∈ {0, 1}.
• To fix this, let's change the form of our hypotheses h_θ(x) so that 0 ≤ h_θ(x) ≤ 1.
• This is accomplished by plugging θᵀx into the logistic function.
Logistic Function / Sigmoid Function
h_θ(x) = g(θᵀx)
z = θᵀx
g(z) = 1 / (1 + e^(−z))
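A minimal NumPy sketch of this hypothesis (the function names `sigmoid` and `hypothesis` are just illustrative):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); always lies strictly between 0 and 1."""
    return sigmoid(np.dot(theta, x))
```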
An Example
Let us consider the problem of predicting whether a tumor is malignant or benign, given the size of the tumor.
An Example
An Example
Suppose we want to predict, from data x about a tumor, whether it is malignant (y=1) or benign
(y=0). Our logistic regression classifier outputs, for a specific tumor, hθ(x)=P(y=1∣x;θ)=0.7, so
we estimate that there is a 70% chance of this tumor being malignant. What should be our
estimate for P(y=0∣x;θ), the probability the tumor is benign?

1. P(y=0∣x;θ) = 0.3
2. P(y=0∣x;θ) = 0.7
3. P(y=0∣x;θ) = 0.7²
4. P(y=0∣x;θ) = 0.3 × 0.7

Correct option: option 1, since P(y=0∣x;θ) = 1 − P(y=1∣x;θ) = 1 − 0.7 = 0.3.


Logistic Regression
h_θ(x) = g(θᵀx)
z = θᵀx
g(z) = 1 / (1 + e^(−z))

• This hypothesis outputs an estimate of the probability that y = 1; the probability that y = 0 is its complement.
• So it predicts in the following way:
  ▪ y = 1 when h_θ(x) ≥ 0.5
  ▪ y = 0 when h_θ(x) < 0.5
Logistic Regression
• h_θ(x) = g(z) = 1 / (1 + e^(−z)), where z = θᵀx.
• g(z) ≥ 0.5 if z ≥ 0, and g(z) < 0.5 if z < 0.
This can be rewritten as:
h_θ(x) = g(θᵀx) ≥ 0.5 whenever θᵀx ≥ 0
Similarly, h_θ(x) = g(θᵀx) < 0.5 whenever θᵀx < 0
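A short sketch of the resulting prediction rule (the helper name `predict` is illustrative); because g(z) ≥ 0.5 exactly when z ≥ 0, the sign of θᵀx alone determines the prediction:

```python
import numpy as np

def predict(theta, x):
    """Predict y = 1 when h_theta(x) >= 0.5, i.e. when theta^T x >= 0."""
    z = np.dot(theta, x)
    # No need to evaluate the sigmoid: g(z) >= 0.5 if and only if z >= 0.
    return 1 if z >= 0 else 0
```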
Decision Boundary
• We now learn how to find the curve/line/shape that separates the two classes (binary classification). This shape is called a decision boundary.
• We also need methods for finding the values of the parameters θ.

Here we show data points having two output labels, and we try to find a decision boundary which separates them.

• We can easily deduce from the distribution of the points in the graph that a linear equation will be able to separate them, so we can use a hypothesis involving the two features x1 and x2: h_θ(x) = g(θ₀ + θ₁x1 + θ₂x2).
• We will have to optimize the values of these parameters by some algorithm such as gradient descent.
• Let us choose the parameters of the hypothesis as −3, 1, 1.
• Let us see how to find the line separating this distribution into two different classes.
Decision Boundary

• So now let's assume the values of the parameters are −3, 1, 1.
• The equation then becomes h_θ(x) = g(−3 + x1 + x2).
• We know that the output y = 1 is predicted when h_θ(x) ≥ 0.5, which requires g(z) ≥ 0.5, and therefore z = θᵀx ≥ 0.
• We can rewrite this as: −3 + x1 + x2 ≥ 0 for the hypothesis to predict y = 1.
• Subsequently, −3 + x1 + x2 < 0 predicts y = 0.
• And −3 + x1 + x2 = 0 gives the decision boundary itself.
• This forms the line shown in the graph (the magenta line), with the y = 1 region on one side and the y = 0 region on the other.
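A small numerical sketch of this particular boundary (the sample points are made up for illustration):

```python
import numpy as np

theta = np.array([-3.0, 1.0, 1.0])   # [theta_0, theta_1, theta_2]

def predict(x1, x2):
    # z = theta_0 + theta_1*x1 + theta_2*x2; predict y = 1 when z >= 0
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return 1 if z >= 0 else 0

print(predict(3, 3))   # -3 + 3 + 3 =  3 >= 0  -> predicts y = 1
print(predict(1, 1))   # -3 + 1 + 1 = -1 <  0  -> predicts y = 0
# Points satisfying x1 + x2 = 3 lie exactly on the decision boundary.
```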
Problem
Consider logistic regression with two features x1​ and x2​.
Suppose 𝜃0 = 5, 𝜃1 = −1 , 𝜃2 = 0 , so that ℎ𝜃 (𝑥) = 𝑔(5 − 𝑥1 ). Which of these
shows the decision boundary of ℎ𝜃 (𝑥)?

[Options a)–d): four plots of candidate decision boundaries.]

Correct Ans - a (setting 5 − x1 = 0 gives the vertical line x1 = 5, with y = 1 predicted on the side where x1 ≤ 5).
Cost Function
• We cannot use the same cost function that we
use for linear regression because the Logistic
Function will cause the output to be wavy,
causing many local optima. In other words, it
will not be a convex function.
Cost Function
Cost Function
When y = 1, we get the following plot for J(θ) vs h_θ(x):

If our correct answer 'y' is 1, then:
• If our hypothesis function outputs 1, the cost will be 0.
• If our hypothesis approaches 0, the cost will approach infinity.
Cost Function
Similarly, when y = 0, we get the following plot for J(θ) vs h_θ(x):

If our correct answer 'y' is 0, then:
• If our hypothesis function outputs 0, the cost will be 0.
• If our hypothesis approaches 1, the cost will approach infinity.
Cost Function
Combining both,
we get a cost function curve as given below:
Cost Function

• If our correct answer 'y' is 0, then
  – If our hypothesis function outputs 0, then the cost function will be 0.
  – If our hypothesis approaches 1, then the cost function will approach infinity.
• If our correct answer 'y' is 1, then
  – The cost function will be 0 if our hypothesis function outputs 1.
  – If our hypothesis approaches 0, then the cost function will approach infinity.
• Note that writing the cost function in this way guarantees that J(θ)
is convex for logistic regression.
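A brief sketch of this per-example cost, illustrating the four cases above numerically (the helper name `example_cost` and the clipping constant are assumptions added for numerical safety):

```python
import numpy as np

def example_cost(h, y, eps=1e-12):
    """Cost(h_theta(x), y): -log(h) when y = 1, -log(1 - h) when y = 0."""
    h = np.clip(h, eps, 1 - eps)            # keep log() away from 0
    return -np.log(h) if y == 1 else -np.log(1 - h)

print(example_cost(0.999, 1))   # near 0: hypothesis agrees with y = 1
print(example_cost(0.001, 1))   # very large: hypothesis near 0 while y = 1
print(example_cost(0.001, 0))   # near 0: hypothesis agrees with y = 0
print(example_cost(0.999, 0))   # very large: hypothesis near 1 while y = 0
```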
Problem
In logistic regression, the cost function for our hypothesis outputting (predicting) h_θ(x) on a training example that has label y ∈ {0,1} is:

Cost(h_θ(x), y) = −log(h_θ(x))       if y = 1
Cost(h_θ(x), y) = −log(1 − h_θ(x))   if y = 0

Which of the following are true? Check all that apply.
a) If h_θ(x) = y, then Cost(h_θ(x), y) = 0 (for y = 0 and y = 1).
b) If y = 0, then Cost(h_θ(x), y) → ∞ as h_θ(x) → 1.
c) If y = 0, then Cost(h_θ(x), y) → ∞ as h_θ(x) → 0.
d) Regardless of whether y = 0 or y = 1, if h_θ(x) = 0.5, then Cost(h_θ(x), y) > 0.

Correct Ans – a, b and d


Logistic Regression
Cost Function

We can compress our cost function's two conditional cases into one case:

Cost(h_θ(x), y) = −y·log(h_θ(x)) − (1 − y)·log(1 − h_θ(x))

Logistic Regression
Cost Function
• We can write our cost function as follows:

J(θ) = −(1/m) Σᵢ₌₁ᵐ [ y⁽ⁱ⁾ log(h_θ(x⁽ⁱ⁾)) + (1 − y⁽ⁱ⁾) log(1 − h_θ(x⁽ⁱ⁾)) ]
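A hedged vectorized sketch of J(θ) (the function name, the array layout of X and y, and the clipping constant are assumptions):

```python
import numpy as np

def cost_J(theta, X, y, eps=1e-12):
    """J(theta) = -(1/m) * sum( y*log(h) + (1 - y)*log(1 - h) )."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x^(i)) for every example
    h = np.clip(h, eps, 1 - eps)           # numerical safety for the logs
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```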
Problem
Gradient Descent
• Remember the general form of gradient descent is:

  Repeat {  θⱼ := θⱼ − α · ∂J(θ)/∂θⱼ  }   (simultaneously updating all θⱼ)

• We can work out the derivative part using calculus to get:

  Repeat {  θⱼ := θⱼ − (α/m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · xⱼ⁽ⁱ⁾  }
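A minimal batch gradient descent sketch implementing this update (the function name, learning rate, and iteration count are assumptions; X is assumed to carry a leading column of ones for the intercept term):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Batch gradient descent for logistic regression.
    X has shape (m, n) with a leading column of ones; y has shape (m,)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x^(i)) for all examples
        grad = (X.T @ (h - y)) / m             # (1/m) * sum_i (h - y) * x_j^(i)
        theta -= alpha * grad                  # simultaneous update of every theta_j
    return theta
```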


Multiclass Classification
Examples
• Email foldering/Tagging:
– Work, Friends, Family, Hobby

• Medical Illness:
– Not ill, Cold, Flu

• Weather:
– Sunny, Cloudy, Rainy, Snowy
Multi-Class Classification

• Now we will approach the classification of data when we have more than two categories. Instead of y = {0,1} we will expand our definition so that y = {0,1,...,n}.
Multiclass Classification

• Takes triangles as class 1, i.e. the positive class, and the rest as class 0, i.e. the negative class.
• Takes squares as class 1, i.e. the positive class, and the rest as class 0, i.e. the negative class.
• Takes crosses as class 1, i.e. the positive class, and the rest as class 0, i.e. the negative class.
Multi-Class Classification
Since y = {0,1,...,n}, we divide our problem into n+1 binary classification problems; in each one, we predict the probability that 'y' is a member of one of our classes.
Multi Class Classification
• We are basically choosing one class and then
lumping all the others into a single second
class. We do this repeatedly, applying binary
logistic regression to each case, and then use
the hypothesis that returned the highest value
as our prediction.
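A minimal one-vs-all sketch of this procedure; it reuses the `gradient_descent` routine from the earlier sketch, and the function names and data layout are likewise assumptions:

```python
import numpy as np

def one_vs_all_train(X, y, num_classes, alpha=0.1, iters=1000):
    """Train one binary logistic regression classifier per class (labels 0..num_classes-1)."""
    thetas = []
    for c in range(num_classes):
        y_binary = (y == c).astype(float)        # current class -> 1, every other class -> 0
        thetas.append(gradient_descent(X, y_binary, alpha, iters))  # from the earlier sketch
    return np.array(thetas)                      # shape: (num_classes, n)

def one_vs_all_predict(thetas, x):
    """Pick the class whose classifier returns the highest h_theta(x)."""
    scores = 1.0 / (1.0 + np.exp(-thetas @ x))
    return int(np.argmax(scores))
```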
Multi Class Classification
Problem
Suppose you have a multi-class classification problem with k classes (so y∈{1,2,…,k}).
Using the 1-vs.-all method, how many different logistic regression classifiers will you
end up training?

a) k−1

b) k

c) k+1

d) Approximately log2​(k)

Ans - b (one classifier per class, so k classifiers in total)
