
Prediction methods and Machine learning

Session II

Pierre Michel
pierre.michel@univ-amu.fr

M2 EBDS

2021
1. Logistic regression


1.1 Introduction to classification


Classification problems
Some examples:

• spam detection in e-mails (ham or spam)
• fraud detection in online transactions (fraudulent, safe)
• tumor diagnosis (malignant, benign)
• heart rhythm diagnosis (normal sinus rhythm, atrial fibrillation)
• ...

In a classification problem, the variable to be explained (or target variable) y is categorical. We write:

• Binary classification: y ∈ {0, 1}
  – 0: “negative class” (e.g. normal sinus rhythm)
  – 1: “positive class” (e.g. atrial fibrillation)
• Multiclass classification: y ∈ {0, 1, 2, ..., K}, K ≥ 2


Example of binary classification

[Figure: malignant tumor? (1: yes, 0: no) plotted against tumor size.]

Let hθ(x) = θᵀx; we define the following classification rule:


• if hθ (x) ≥ 0.5, we predict y = 1
• if hθ (x) < 0.5, we predict y = 0

Choice of the function hθ

The use of a linear model for classification is not recommended.

In binary classification, y = 0 or y = 1, but in the previous example hθ(x) can take values greater than 1 or less than 0.

For logistic regression, we will therefore choose a function hθ that satisfies:

0 ≤ hθ(x) ≤ 1


1.2 Representation of the model


Logistic regression model

We want a function hθ that satisfies 0 ≤ hθ(x) ≤ 1.


We set:

hθ(x) = g(θᵀx)

where g(z) = 1 / (1 + e^(−z)) is the sigmoid (or logistic) function, that is:

hθ(x) = 1 / (1 + e^(−θᵀx))
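As an illustration (not from the original slides), here is a minimal NumPy sketch of the logistic function and of hθ; the example values of θ and x are arbitrary:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x); x is assumed to already contain the intercept term."""
    return sigmoid(np.dot(theta, x))

theta = np.array([0.0, 1.0])   # arbitrary parameters (theta_0, theta_1)
x = np.array([1.0, 2.0])       # one observation: intercept 1 and a single feature x_1 = 2
print(h(theta, x))             # about 0.88, read as the estimated P(y = 1 | x; theta)
```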


Logistic function

g(z) = 1 / (1 + e^(−z))

[Figure: plot of the logistic function g(z) for z between −4 and 4; the curve increases from 0 to 1 and equals 0.5 at z = 0.]

Interpretation of hθ (x)

hθ(x) is the estimated probability that y = 1 for the observation x.

Example: hθ(x) = 0.8 means that the individual has an 80% chance of having a malignant tumor.

We write:

hθ(x) = P(y = 1 | x; θ)

with

P(y = 0 | x; θ) + P(y = 1 | x; θ) = 1, i.e. P(y = 0 | x; θ) = 1 − P(y = 1 | x; θ)


1.3 Decision threshold


Classification rule

hθ(x) = g(θᵀx) = P(y = 1 | x; θ), where g(z) = 1 / (1 + e^(−z))

Classification rule:

• Predict y = 1 if hθ(x) ≥ 0.5
  – since g(z) ≥ 0.5 when z ≥ 0
  – we have g(θᵀx) ≥ 0.5 when θᵀx ≥ 0
• Predict y = 0 if hθ(x) < 0.5
  – since g(z) < 0.5 when z < 0
  – we have g(θᵀx) < 0.5 when θᵀx < 0
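A small sketch of this rule (our own helper, not from the slides): thresholding hθ(x) at 0.5 is the same as thresholding θᵀx at 0.

```python
import numpy as np

def predict(theta, X, threshold=0.5):
    """0/1 predictions for a matrix X with one observation per row (intercept included)."""
    probs = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x) for each row
    return (probs >= threshold).astype(int)    # with threshold = 0.5 this equals (X @ theta >= 0)
```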


Example: hθ (x) = g(θ0 + θ1 x1 + θ2 x2 )


[Figure: observations with y = 0 and y = 1 in the (x1, x2) plane; the decision boundary θ0 + θ1x1 + θ2x2 = 0 is a straight line.]


Non-linear example:
hθ(x) = g(θ0 + θ1x1 + θ2x2 + θ3x1² + θ4x2²)
[Figure: observations with y = 0 and y = 1 in the (x1, x2) plane; the decision boundary is non-linear (here roughly circular).]


1.4 Cost function and gradient descent


Reminder: probability

We have:

P (y = 1|x; θ) = hθ (x)
P (y = 0|x; θ) = 1 − hθ (x)

This can be written compactly as:

p(y | x; θ) = hθ(x)^y (1 − hθ(x))^(1−y)


Maximum likelihood estimation


We look for the value of θ that maximizes the likelihood:

L(θ) = ∏_{i=1}^{n} p(y^(i) | x^(i); θ)
     = ∏_{i=1}^{n} hθ(x^(i))^(y^(i)) (1 − hθ(x^(i)))^(1 − y^(i))

We consider instead the log-likelihood:

l(θ) = log L(θ)
     = ∑_{i=1}^{n} [ y^(i) log hθ(x^(i)) + (1 − y^(i)) log(1 − hθ(x^(i))) ]


Cost function J(θ)

We define the cost function J(θ) as follows:

J(θ) = −(1/n) ∑_{i=1}^{n} [ y^(i) log hθ(x^(i)) + (1 − y^(i)) log(1 − hθ(x^(i))) ]

We then look for the vector of parameters θ that minimizes J(θ).
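A possible NumPy version of J(θ) (the clipping of the probabilities is our own addition, to avoid log(0)):

```python
import numpy as np

def cost(theta, X, y, eps=1e-12):
    """Cost J(theta): average binary cross-entropy over the n observations."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x^(i)) for every row of X
    p = np.clip(p, eps, 1.0 - eps)         # keep the probabilities away from 0 and 1
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```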


Stochastic gradient descent

Repeat until convergence {
  for i = 1 to n {
    θj := θj − α (hθ(x^(i)) − y^(i)) x_j^(i)   (simultaneously for every j)
  }
}

Note: the update rule has the same form as for linear regression; only hθ differs.
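A sketch of this stochastic gradient descent in NumPy; the learning rate and the fixed number of epochs (used here instead of a convergence test) are illustrative choices:

```python
import numpy as np

def sgd_logistic(X, y, alpha=0.1, n_epochs=100):
    """Stochastic gradient descent for logistic regression: one update per observation."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_epochs):                              # "repeat until convergence"
        for i in range(n):                                 # from i = 1 to n
            h_i = 1.0 / (1.0 + np.exp(-X[i] @ theta))      # h_theta(x^(i))
            theta -= alpha * (h_i - y[i]) * X[i]           # theta_j := theta_j - alpha (h - y) x_j^(i)
    return theta
```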


Other optimization algorithms

Gradient descent¹ is not the only optimization algorithm that can be used to minimize the cost function.

Other, more sophisticated algorithms (conjugate gradient, L-BFGS, ...)² do not require choosing the learning rate α and are often faster.

¹ https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
² https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
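For example, with scikit-learn (the library referenced in the footnotes), a logistic regression can be fitted with the L-BFGS solver; the toy dataset below is only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression(solver="lbfgs", max_iter=1000)   # L-BFGS: no learning rate alpha to choose
clf.fit(X, y)
print(clf.predict(X[:5]))                                 # predicted classes
print(clf.predict_proba(X[:5]))                           # estimated P(y = 0 | x) and P(y = 1 | x)
```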

1.5 Multiclass classification


Introduction

In a multiclass classification problem we have:

y ∈ {1, 2, ..., K}, K > 2

Example:

• pattern recognition in images (see the MNIST dataset³)

³ https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html

Binary and multiclass classification

[Figure: left panel “Binary (K = 2)”: observations with y = 0 and y = 1 in the (x1, x2) plane; right panel “One-versus-rest (K > 2)”: observations with y = 1, y = 2 and y = 3.]


“One versus all” (OVR)

For each class k (k = 1, 2, ..., K), we define:

hθ^(k)(x) = P(y = k | x; θ)

Idea of the OVR method:

• we estimate a function hθ^(k)(x) (a logistic regression) for each class k, which predicts the probability that y = k
• for a new observation x, we predict the class that maximizes this function:

ŷ = argmax_k hθ^(k)(x)
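As a sketch, the OVR strategy is available in scikit-learn; here one logistic regression is fitted per class of the iris dataset (the dataset choice is ours):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)                             # K = 3 classes
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))  # one h^(k) per class
ovr.fit(X, y)
print(ovr.predict(X[:3]))                                     # class maximizing h^(k)(x)
```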


2. Regularization


2.1 The overfitting problem


Example: linear regression


[Figure: housing price versus size, with polynomial fits of increasing degree d.]

In this example, d = 1 leads to underfitting (high bias), while d = 4 leads to overfitting (high variance).

Overfitting

We speak of overfitting when:

• the fit of the model on the training data is satisfactory
• but the model fails to generalize (predict) to new observations


Solutions

Two options:

1. Reduce the number of variables (variable/model selection)
2. Regularization (reduce the values of the parameters θj)

Note: regularization is recommended in high dimension.


2.2 Cost function


Idea of regularization

[Figure: price (y) versus surface area (x), with polynomial fits of degree 2 and degree 4.]

Let hθ(x) = θ0 + θ1x + θ2x² + θ3x³ + θ4x⁴. We penalize the parameters θ3 and θ4:

min_θ J(θ) + λθ3² + λθ4², with λ > 0

Regularization
We obtain small values for the parameters θ1, ..., θp:

• we obtain a simpler function hθ(x)
• there is less risk of overfitting

We define the regularized cost function as follows:

J(θ) = ∑_{i=1}^{n} (y^(i) − hθ(x^(i)))² + λ ∑_{j=1}^{p} θj²

Note: if the value of λ (the regularization parameter) is too large, the parameters θ1, θ2, ..., θp all tend towards 0; in other words we obtain hθ(x) = θ0.
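A minimal sketch of this regularized cost in NumPy, assuming (as in the note above) that the intercept θ0 is not penalized:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Sum of squared errors plus lambda times the sum of the squared theta_j, j >= 1."""
    residuals = y - X @ theta                      # y^(i) - h_theta(x^(i)) for a linear model
    return np.sum(residuals ** 2) + lam * np.sum(theta[1:] ** 2)
```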

3. Evaluation of a Machine Learning model


3.1 Prediction error


How to test a model?

The evaluation of a Machine Learning model is done with a test sample.


The test sample is a subset of observations from the dataset that is not
used for model estimation.


Solutions for large prediction errors

In case of poor predictions, several solutions can be considered:

• collect more training observations
• use fewer explanatory variables
• add additional variables to the model
• add non-linearities (e.g. polynomial regression)
• change the value of the regularization parameter λ

These changes can improve the performance of the model.


3.2 Evaluation of a model


Overfitting

A small training error does not guarantee a good model: the model may struggle to generalize to new observations (the test sample).


Training and test samples (train-test split)

The dataset is split into two sub-samples:

• the training sample: {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n))}
• the test sample: {(x_test^(1), y_test^(1)), ..., (x_test^(n_test), y_test^(n_test))}

Note: n_test is the number of observations in the test sample.
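With scikit-learn, this split is typically done with train_test_split; the 70% / 30% proportions mentioned below are used here as an example (X and y are assumed to be already loaded):

```python
from sklearn.model_selection import train_test_split

# 70% of the observations for training, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
```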


Training/Testing for linear regression

1. The parameter θ is estimated using the training sample:

min_θ J(θ)

2. We compute the mean squared error (MSE) on the test sample:

MSE_test = (1/n_test) ∑_{i=1}^{n_test} (hθ(x_test^(i)) − y_test^(i))²

Note: we put a higher proportion of observations in the training sample (e.g. 70%) than in the test sample (e.g. 30%).
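A short sketch of these two steps for linear regression, reusing the split above (X_train, X_test, y_train, y_test are assumed to exist):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

reg = LinearRegression().fit(X_train, y_train)              # step 1: estimate theta on the training sample
mse_test = mean_squared_error(y_test, reg.predict(X_test))  # step 2: MSE on the test sample
print(mse_test)
```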


Training/Test for logistic regression

1. We estimate the parameter θ using the training sample:

min_θ J(θ)

2. We compute the classification error on the test sample:

err = (1/n_test) ∑_{i=1}^{n_test} I(y^(i) ≠ ŷ^(i))

with ŷ^(i) = 1 if hθ(x^(i)) ≥ 0.5, and ŷ^(i) = 0 otherwise.

3.3 Model selection (training, validation, test)


Example: polynomial regression

d = 1: hθ(x) = θ0 + θ1x
d = 2: hθ(x) = θ0 + θ1x + θ2x²
d = 3: hθ(x) = θ0 + θ1x + θ2x² + θ3x³
...
d = 10: hθ(x) = θ0 + θ1x + θ2x² + ... + θ10x¹⁰

For each value of d, we obtain a vector of parameters noted θ^(d).
We can compute the MSE on the test sample for each value of d: MSE^(d).
Problem: the hyperparameter d is then chosen using the same test sample, which leads to overfitting on the test sample.


Cross-validation

The dataset is separated into three sub-samples:

• the training sample: {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n))}
• the validation sample: {(x_cv^(1), y_cv^(1)), ..., (x_cv^(n_cv), y_cv^(n_cv))}
• the test sample: {(x_test^(1), y_test^(1)), ..., (x_test^(n_test), y_test^(n_test))}

Note: n_cv is the number of observations in the validation sample.


Model selection
d = 1: hθ(x) = θ0 + θ1x
d = 2: hθ(x) = θ0 + θ1x + θ2x²
d = 3: hθ(x) = θ0 + θ1x + θ2x² + θ3x³
...
d = 10: hθ(x) = θ0 + θ1x + θ2x² + ... + θ10x¹⁰

For each value of d, we obtain a vector of parameters noted θ^(d).
We compute the MSE on the validation sample for each value of d: MSE_cv^(d), and we choose the value of d that minimizes MSE_cv^(d).
The generalization error is then computed on the test sample, with the chosen value of d.
When doing regularization, the choice of the regularization parameter λ is the most important one.
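A sketch of this selection procedure for the polynomial degree d, assuming training, validation and test splits (X_train, y_train, X_val, y_val, X_test, y_test) have already been built:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

val_mse = {}
for d in range(1, 11):
    model = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    model.fit(X_train, y_train)                                    # theta^(d) estimated on the training sample
    val_mse[d] = mean_squared_error(y_val, model.predict(X_val))   # MSE_cv^(d) on the validation sample

best_d = min(val_mse, key=val_mse.get)                             # d minimizing the validation MSE
final = make_pipeline(PolynomialFeatures(degree=best_d), LinearRegression()).fit(X_train, y_train)
test_mse = mean_squared_error(y_test, final.predict(X_test))       # generalization error on the test sample
```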

3.4 Bias and variance


Learning error and validation error

• Learning error: MSE_train = (1/n) ∑_{i=1}^{n} (hθ(x^(i)) − y^(i))²
• Validation error: MSE_cv = (1/n_cv) ∑_{i=1}^{n_cv} (hθ(x_cv^(i)) − y_cv^(i))²

When the validation error is high, is it a problem of bias (underfitting) or of variance (overfitting)?


Bias versus Variance

• Bias (underfitting)
  – MSE_train is high
  – MSE_train ≈ MSE_cv
• Variance (overfitting)
  – MSE_train is low
  – MSE_cv >> MSE_train

Remark: when doing regularization, the choice of the hyperparameter λ can also be made by cross-validation.


Solutions in case of large prediction error

In case of poor predictions, several solutions are possible:

• collect more training observations
• use fewer explanatory variables
• add additional variables to the model
• add non-linearities (e.g. polynomial regression)
• change the value of the regularization parameter λ

These changes can improve the performance of the model.


3.5 Error metrics for classification


Classification example

We have medical data on n patients, and we train a classifier to predict whether a patient has cancer:



y = 1 if the patient has cancer, and y = 0 otherwise.

We obtain a test error of 1%.

Problem: only 0.5% of patients have cancer, so the trivial classifier that always predicts y = 0 already achieves an error of 0.5%. The raw error rate is therefore misleading when classes are imbalanced.


Confusion matrix

                          Predicted class
                          0            1            Total
Observed class   0        TN           FP           TN + FP
                 1        FN           TP           FN + TP
                 Total    TN + FN      FP + TP      n

• TN: true negatives
• FP: false positives
• FN: false negatives
• TP: true positives
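With scikit-learn, this matrix can be obtained as follows (y_test and y_hat as in the earlier sketches); rows are observed classes, columns are predicted classes:

```python
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_hat)   # [[TN, FP],
                                       #  [FN, TP]]
tn, fp, fn, tp = cm.ravel()
```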


Precision and recall

• Precision:

  Precision = TP / (TP + FP)

• Recall:

  Recall = TP / (TP + FN)


Trade-off precision/recall

In logistic regression we have 0 ≤ hθ(x) ≤ 1.

Predict ŷ = 1 if hθ(x) ≥ 0.5, and ŷ = 0 otherwise.

• If we increase the threshold: higher precision, lower recall
• If we decrease the threshold: lower precision, higher recall (see the sketch below)
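A sketch of this trade-off: moving the threshold away from 0.5 changes which observations are predicted positive (clf is assumed to be a fitted logistic regression):

```python
probs = clf.predict_proba(X_test)[:, 1]      # h_theta(x): estimated P(y = 1 | x)
y_hat_strict = (probs >= 0.9).astype(int)    # higher threshold: fewer positives, higher precision, lower recall
y_hat_loose = (probs >= 0.1).astype(int)     # lower threshold: more positives, lower precision, higher recall
```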


F1 score

The F1 score is the harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
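These three metrics are available in scikit-learn (y_test and y_hat as above):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_test, y_hat)   # TP / (TP + FP)
recall = recall_score(y_test, y_hat)         # TP / (TP + FN)
f1 = f1_score(y_test, y_hat)                 # 2 * Precision * Recall / (Precision + Recall)
```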
