
ARTIFICIAL INTELLIGENCE
MACHINE LEARNING

Lecture 5 - AI - ML: Logistic Regression
Dr. Aicha BOUTORH
2020/2021

Classification

§ Classification predictive modeling involves assigning a class label to input examples.

§ Binary classification refers to predicting one of two classes.

§ Multi-class classification involves predicting one of more than two classes.

§ Multi-label classification involves predicting one or more classes for each example.

§ Imbalanced classification refers to classification tasks where the distribution of examples across the classes is not equal.

Logistic Regression

Goal: take an input vector x and assign it to one of two classes, y ∈ {0, 1}.

Ø 0: Negative Class (e.g. benign tumor, not-spam email, …)

Ø 1: Positive Class (e.g. malignant tumor, spam email, …)

Linear Classifier

§ Given examples (x⁽ⁱ⁾, y⁽ⁱ⁾), learn a classifier that is able to predict y* given a new point x*.

§ It should generalize well to new x*.

§ Example:
Ø x1: Fish Weight
Ø x2: Fish Length
Ø y: Fish Species

Figure Source: Logistic Regression, Dr. Patras, Hospedales

h_θ(x) = θᵀx

[Figure: h_θ(x) = θᵀx plotted against x, with a threshold at 0.5]

Threshold the classifier output h_θ(x) at 0.5:

Ø If h_θ(x) ≥ 0.5 then predict y = 1

Ø If h_θ(x) < 0.5 then predict y = 0

Ø A binary linear classifier is a classifier that separates two classes using a line, a plane, or a hyperplane.

Ø Classification: y = 0 or y = 1

Ø Logistic Regression: 0 ≤ h_θ(x) ≤ 1


Linear vs Logistic Regression

§ Linear and Logistic Regression use different hypothesis / representation / model assumptions:

Ø Linear Regression: h_θ(x) ∈ (-∞, +∞)

Ø Logistic Regression: h_θ(x) ∈ [0, 1]

h_θ(x) = g(θᵀx)

h_θ(x) = 1 / (1 + e^(−θᵀx))

g(z) = 1 / (1 + e^(−z))

Logistic Regression: 0 ≤ h_θ(x) ≤ 1

g(z) is called the Sigmoid Function, or Logistic Function.

[Figure: sigmoid curve of g(z), crossing 0.5 at z = 0]
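A minimal sketch of this hypothesis in code (assuming NumPy; the function names are illustrative, not from the lecture):

    import numpy as np

    def sigmoid(z):
        """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z)): maps any real z into (0, 1)."""
        return 1.0 / (1.0 + np.exp(-z))

    def hypothesis(theta, x):
        """h_theta(x) = g(theta^T x): the predicted probability that y = 1."""
        return sigmoid(np.dot(theta, x))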
Interpretation of Hypothesis Output

Ø h_θ(x) estimates the probability that y = 1 on input x.

Ø Example: xᵀ = [x0, x1] = [1, TumorSize]

If h_θ(x) = 0.8, then predict y = 1.

§ It signifies an 80% chance of the tumor being malignant.

☞ h_θ(x) = P(y = 1 | x; θ), the probability that y = 1, given x, parametrized by θ.

Ø P(y = 1 | x; θ) + P(y = 0 | x; θ) = 1

Ø P(y = 0 | x; θ) = 1 − P(y = 1 | x; θ)  (here, the probability of the tumor being benign is 20%)
v Sigmoid Function, also called the Logistic Function

The function g(z) maps any real number to the (0, 1) interval, making a real-valued function better suited for classification.

h_θ(x) gives the probability that the output is 1. The probability that the prediction is 0 is the complement of the probability that it is 1.

h_θ(x) = g(θᵀx) = P(y = 1 | x; θ)

[Figure: sigmoid curve; h_θ(x) ≥ 0.5 when θᵀx ≥ 0, and h_θ(x) < 0.5 when θᵀx < 0]

Ø Predict y = 1 if h_θ(x) ≥ 0.5 ☞ θᵀx ≥ 0

Ø Predict y = 0 if h_θ(x) < 0.5 ☞ θᵀx < 0
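A one-line sketch of this decision rule (reusing the hypothetical sigmoid/hypothesis helpers sketched earlier):

    def predict(theta, x):
        """Return 1 if h_theta(x) >= 0.5 (equivalently, theta^T x >= 0), else 0."""
        return 1 if hypothesis(theta, x) >= 0.5 else 0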

Decision Boundary

§ The Decision Boundary is the boundary between the two classes, where:

P(y = 1 | x; θ) = P(y = 0 | x; θ) = 0.5

h_θ(x) = 1 / (1 + e^(−θᵀx)) = 0.5

1 + e^(−θᵀx) = 2

θᵀx = 0

Decision Boundary

v Example: θ = [−3, 1, 1]ᵀ

Ø h_θ(x) = g(θᵀx) = g(θ0·x0 + θ1·x1 + θ2·x2)

Ø θᵀx = −3 + x1 + x2

Ø Predict y = 1 if θᵀx ≥ 0, i.e. if −3 + x1 + x2 ≥ 0, i.e. if x1 + x2 ≥ 3

Ø Predict y = 0 if x1 + x2 < 3

Ø Decision Boundary: x1 + x2 = 3 ☞ h_θ(x) = 0.5

Even if the data set is taken away, the decision boundary stays the same: the regions where we predict y = 1 versus y = 0 are a property of the hypothesis and its parameters, not a property of the data set.

Source: ML – Andrew Ng
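As an illustrative check of this example in code (the point x is a hypothetical value I chose; the sigmoid is as sketched above):

    import numpy as np

    theta = np.array([-3.0, 1.0, 1.0])  # theta from the slide
    x = np.array([1.0, 2.5, 1.0])       # x0 = 1 (bias), x1 = 2.5, x2 = 1.0 (hypothetical point)

    z = np.dot(theta, x)                # theta^T x = -3 + 2.5 + 1.0 = 0.5
    prob = 1.0 / (1.0 + np.exp(-z))     # h_theta(x) ≈ 0.62
    print(int(prob >= 0.5))             # 1, since x1 + x2 = 3.5 >= 3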

Non-Linear Decision Boundary

A more complex example: given a training set as presented in the figure, how can logistic regression fit this sort of data?

For example, the hypothesis may look like this:

h_θ(x) = g(θ0·x0 + θ1·x1 + θ2·x2 + θ3·x1² + θ4·x2²)

Two extra features were added: x1 squared and x2 squared.

Source: ML – Andrew Ng

Non-Linear Decision Boundary

v Take θ = [−1, 0, 0, 1, 1]ᵀ

Ø h_θ(x) = g(θ0·x0 + θ1·x1 + θ2·x2 + θ3·x1² + θ4·x2²)

Ø h_θ(x) = g(−1 + x1² + x2²)

Ø Predict y = 1 if −1 + x1² + x2² ≥ 0, i.e. if x1² + x2² ≥ 1

Ø Decision Boundary: x1² + x2² = 1 ☞ h_θ(x) = 0.5

By adding these more complex, polynomial terms to the features, more complex decision boundaries are obtained. The classifier does not just try to separate the positive and negative examples with a straight line; here the decision boundary is a circle.

Ø Once again, the decision boundary is a property of the hypothesis and its parameters, not of the training set. As long as the parameter vector θ is given, it defines the decision boundary, which here is the circle.

Ø The training set is used to fit the parameters θ, but not to define the decision boundary.

Example 2: with higher-order terms, even more complex boundaries are possible:

h_θ(x) = g(θ0·x0 + θ1·x1 + θ2·x2 + θ3·x1² + θ4·x1²·x2 + θ5·x1²·x2² + θ6·x1³·x2)

Source: ML – Andrew Ng
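A short sketch of the circle example's feature mapping (the helper name is illustrative, not from the lecture):

    import numpy as np

    def circle_features(x1, x2):
        """Map (x1, x2) to [1, x1, x2, x1^2, x2^2], the features in the slide's hypothesis."""
        return np.array([1.0, x1, x2, x1**2, x2**2])

    theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])          # theta from the slide
    print(np.dot(theta, circle_features(0.5, 0.5)) >= 0)  # False: inside the unit circle, y = 0
    print(np.dot(theta, circle_features(1.0, 1.0)) >= 0)  # True: x1^2 + x2^2 = 2 >= 1, y = 1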


Ø Training set: {(x⁽¹⁾, y⁽¹⁾), (x⁽²⁾, y⁽²⁾), …, (x⁽ᵐ⁾, y⁽ᵐ⁾)}

Ø m examples in the dataset

Ø x = [x0, x1, x2, …, xn]ᵀ ∈ ℝⁿ⁺¹, with x0 = 1 and y ∈ {0, 1}

Ø How to choose the parameters θ?


Cost Function

§ Linear Regression: J(θ) = (1/m) Σᵢ₌₁ᵐ ½ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

§ Logistic Regression: J(θ) = (1/m) Σᵢ₌₁ᵐ Cost(h_θ(x⁽ⁱ⁾), y⁽ⁱ⁾)

§ With h_θ(x) = 1 / (1 + e^(−θᵀx)), what should Cost(h_θ(x⁽ⁱ⁾), y⁽ⁱ⁾) be?

The Cost: Negative Log Likelihood

§ Cost(h_θ(x), y) = − log(P(y = 1 | x; θ))     if y = 1

§ Cost(h_θ(x), y) = − log(1 − P(y = 1 | x; θ)) if y = 0

Logistic Regression Cost Function

Ø The cost is the penalty that the algorithm pays for the value of h_θ(x) (a number like 0.8, the predicted value), relative to the value of the label y.

v The cost is:

Ø − log(h_θ(x))     if y = 1

Ø − log(1 − h_θ(x)) if y = 0
Logistic Regression Cost Function

For y = 1:

Ø Cost = 0 if y = 1 and h_θ(x) = 1

Ø Cost → ∞ as h_θ(x) → 0

Ø If h_θ(x) = 0 but y = 1, the learning algorithm will be penalized by a very large cost.

Figure Source: https://www.geeksforgeeks.org/ml-cost-function-in-logistic-regression/
Logistic Regression Cost Function

For y = 0:

Ø The curve goes to plus infinity as h_θ(x) goes to 1.

Ø If the label y = 0 but the hypothesis predicted y = 1, then the algorithm pays a very large cost.

Figure Source: https://www.geeksforgeeks.org/ml-cost-function-in-logistic-regression/


q Rather than writing out this cost function as two separate cases, y = 1 and y = 0, the function can be simplified and the two lines compressed into one equation.

q This makes it more convenient to write out the cost function and derive gradient descent.

Cost(h_θ(x), y) = − y·log(h_θ(x)) − (1 − y)·log(1 − h_θ(x))

Ø If y = 1: Cost(h_θ(x), y) = − log(h_θ(x))

Ø If y = 0: Cost(h_θ(x), y) = − log(1 − h_θ(x))
Logistic Regression Cost Function

§ J(θ) = (1/m) Σᵢ₌₁ᵐ Cost(h_θ(x⁽ⁱ⁾), y⁽ⁱ⁾)

§ J(θ) = −(1/m) [ Σᵢ₌₁ᵐ y⁽ⁱ⁾ log(h_θ(x⁽ⁱ⁾)) + (1 − y⁽ⁱ⁾) log(1 − h_θ(x⁽ⁱ⁾)) ]

§ Parameters θ: find the θ that minimizes the cost J(θ).

§ To make a prediction given a new x, output h_θ(x) = 1 / (1 + e^(−θᵀx)).
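A vectorized sketch of this cost (assuming NumPy; X is an m×(n+1) design matrix with a leading column of ones, y a vector of 0/1 labels; the eps guard against log(0) is my addition, not from the lecture):

    import numpy as np

    def cost(theta, X, y, eps=1e-12):
        """Cross-entropy cost J(theta), averaged over the m training examples."""
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))  # h_theta(x^(i)) for all i at once
        return -np.mean(y * np.log(h + eps) + (1.0 - y) * np.log(1.0 - h + eps))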

Gradient Descent

§ J(θ) = −(1/m) [ Σᵢ₌₁ᵐ y⁽ⁱ⁾ log(h_θ(x⁽ⁱ⁾)) + (1 − y⁽ⁱ⁾) log(1 − h_θ(x⁽ⁱ⁾)) ]

with h_θ(x) = 1 / (1 + e^(−θᵀx))

§ Repeat until convergence, simultaneously updating every parameter:

θⱼ := θⱼ − α · ∂J(θ)/∂θⱼ = θⱼ − α · (1/m) Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾

Figure Source: Logistic Regression, Dr. Patras, Hospedales
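A minimal batch gradient descent sketch under the same assumptions (the learning rate alpha and the iteration count are illustrative choices, not values from the lecture):

    import numpy as np

    def gradient_descent(X, y, alpha=0.1, n_iters=1000):
        """Repeat the simultaneous update theta_j := theta_j - alpha * dJ/dtheta_j."""
        theta = np.zeros(X.shape[1])
        for _ in range(n_iters):
            h = 1.0 / (1.0 + np.exp(-(X @ theta)))         # predictions for all m examples
            theta -= alpha * (X.T @ (h - y)) / X.shape[0]  # vectorized gradient of J(theta)
        return theta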


Multiclass Classification

§ Multiclass classification involves predicting one of more than two classes, y ∈ {1, 2, 3, …, K} for K possible classes.

§ Example: Email classification (K = 5)

Ø Work ☞ y = 1
Ø Friends ☞ y = 2
Ø Family ☞ y = 3
Ø Contacts ☞ y = 4
Ø Services ☞ y = 5

From Binary to Multiclass

§ We know how to model the logistic function and learn binary classifiers (K = 2) with gradient descent.

§ What about K > 2, multi-class problems?

Source: Andrew Ng

One-vs-All

§ Use K classifiers, each solving a two-class problem of separating class k from all the others.

§ For each class k, a logistic regression classifier h_θ⁽ᵏ⁾(x) is trained to predict the probability that y = k, i.e. P(y = k | x; θ).

§ To make a prediction for a new instance x (input), select the class k that maximizes h_θ⁽ᵏ⁾(x), as in the sketch below.
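A sketch of One-vs-All built on the hypothetical gradient_descent helper above (classes assumed to be numbered 1..K):

    import numpy as np

    def train_one_vs_all(X, y, K, alpha=0.1, n_iters=1000):
        """Train K binary classifiers; classifier k treats class k as 1, the rest as 0."""
        return np.array([gradient_descent(X, (y == k).astype(float), alpha, n_iters)
                         for k in range(1, K + 1)])

    def predict_one_vs_all(thetas, x):
        """Select the class k whose classifier h_theta^(k)(x) is largest."""
        probs = 1.0 / (1.0 + np.exp(-(thetas @ x)))
        return int(np.argmax(probs)) + 1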
One-vs-All: K Classes

[Figure: one binary decision boundary per class k, separating class k from all others]

Figure Source: Logistic Regression, Dr. Patras, Hospedales


Summary

§ Logistic Regression uses the sigmoid function to predict a probability in [0, 1], suitable for classification.

§ h_θ(x) gives the probability that the output is 1.

§ The decision boundary is a property of the hypothesis and its parameters, not a property of the data set.

§ The cost is the penalty that the algorithm pays for the value of h_θ(x) relative to the value of the label y.

§ Multi-class classification involves predicting one of K classes, where K > 2.

§ One-vs-All selects the class k that maximizes h_θ⁽ᵏ⁾(x) over the K trained logistic regression classifiers.


References

§ Introduction to Machine Learning, Andrew Ng, Stanford University.

§ Introduction to Machine Learning with Python, Andreas C. Müller and Sarah Guido. O'Reilly; 2017. ISBN 978-1-449-36941-5.

§ Applied Machine Learning in Python, Kevyn Collins-Thompson, University of Michigan.

§ The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Hastie T, Tibshirani R, Friedman J. Springer Science & Business Media; 2009.

§ An Introduction to Statistical Learning, with Applications in R. James G, Witten D, Hastie T, Tibshirani R. Springer; 2013.

§ Mathematics for Machine Learning. Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong; 2020.