
Prediction methods and Machine learning

Session II

Pierre Michel
pierre.michel@univ-amu.fr

M2 EBDS

2021
1. Logistic regression


1.1 Introduction to classification


Classification problems
Some examples:

• spam detection in e-mails (ham or spam)
• fraud detection in online transactions (fraudulent, safe)
• tumor diagnosis (malignant, benign)
• heart rhythm diagnosis (normal sinus rhythm, atrial fibrillation)
• ...

In a classification problem, the variable to be explained (or target variable) y is categorical. We write:

• Binary classification: y ∈ {0, 1}
  – 0: “negative class” (e.g. normal sinus rhythm)
  – 1: “positive class” (e.g. atrial fibrillation)
• Multiclass classification: y ∈ {0, 1, 2, ..., K}, K ≥ 2


Example of binary classification

[Figure: malignant tumor? (1: yes, 0: no) plotted against tumor size.]

Let hθ(x) = θᵀx; we define the following classification rule:


• if hθ (x) ≥ 0.5, we predict y = 1
• if hθ (x) < 0.5, we predict y = 0

Choice of the function hθ

The use of a linear model for classification is not recommended.

In binary classification, y = 0 or y = 1, but in the previous example hθ(x) can take values greater than 1 or less than 0.

For logistic regression, we will therefore choose a function hθ that satisfies:

0 ≤ hθ(x) ≤ 1


1.2 Representation of the model


Logistic regression model

We want a function hθ that satisfies 0 ≤ hθ(x) ≤ 1.


We set:

hθ(x) = g(θᵀx)

where g(z) = 1 / (1 + e^(−z)) is the sigmoid (or logistic) function, that is:

hθ(x) = 1 / (1 + e^(−θᵀx))
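As an illustration (not from the original slides), here is a minimal NumPy sketch of the logistic function and of hθ; the example values of θ and x are arbitrary:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x); x is assumed to already contain the intercept term."""
    return sigmoid(np.dot(theta, x))

theta = np.array([0.0, 1.0])   # arbitrary parameters (theta_0, theta_1)
x = np.array([1.0, 2.0])       # one observation: intercept 1 and a single feature x_1 = 2
print(h(theta, x))             # about 0.88, read as the estimated P(y = 1 | x; theta)
```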


Logistic function

g(z) = 1 / (1 + e^(−z))

[Figure: plot of the logistic function g(z) for z between −4 and 4; the curve increases from 0 to 1 and equals 0.5 at z = 0.]

Interpretation of hθ (x)

hθ(x) is the estimated probability that y = 1 for the observation x.

Example: hθ(x) = 0.8 means that the individual has an 80% chance of having a malignant tumor.

We write:

hθ(x) = P(y = 1 | x; θ)

with

P(y = 0 | x; θ) + P(y = 1 | x; θ) = 1, i.e. P(y = 0 | x; θ) = 1 − P(y = 1 | x; θ)


1.3 Decision threshold


Classification rule

hθ(x) = g(θᵀx) = P(y = 1 | x; θ), where g(z) = 1 / (1 + e^(−z))

Classification rule:

• Predict y = 1 if hθ(x) ≥ 0.5
  – since g(z) ≥ 0.5 when z ≥ 0
  – we have g(θᵀx) ≥ 0.5 when θᵀx ≥ 0
• Predict y = 0 if hθ(x) < 0.5
  – since g(z) < 0.5 when z < 0
  – we have g(θᵀx) < 0.5 when θᵀx < 0
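A small sketch of this rule (our own helper, not from the slides): thresholding hθ(x) at 0.5 is the same as thresholding θᵀx at 0.

```python
import numpy as np

def predict(theta, X, threshold=0.5):
    """0/1 predictions for a matrix X with one observation per row (intercept included)."""
    probs = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x) for each row
    return (probs >= threshold).astype(int)    # with threshold = 0.5 this equals (X @ theta >= 0)
```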


Example: hθ (x) = g(θ0 + θ1 x1 + θ2 x2 )


[Figure: observations with y = 0 and y = 1 in the (x1, x2) plane; the decision boundary θ0 + θ1x1 + θ2x2 = 0 is a straight line.]


Non-linear example:
hθ(x) = g(θ0 + θ1x1 + θ2x2 + θ3x1² + θ4x2²)
[Figure: observations with y = 0 and y = 1 in the (x1, x2) plane; the decision boundary is non-linear (here roughly circular).]


1.4 Cost function and gradient descent


Reminder: probability

We have:

P (y = 1|x; θ) = hθ (x)
P (y = 0|x; θ) = 1 − hθ (x)

This can be written compactly as:

p(y | x; θ) = hθ(x)^y (1 − hθ(x))^(1−y)


Maximum likelihood estimation


We look for the value of θ that maximizes the likelihood:

L(θ) = ∏_{i=1}^{n} p(y^(i) | x^(i); θ)
     = ∏_{i=1}^{n} hθ(x^(i))^(y^(i)) (1 − hθ(x^(i)))^(1 − y^(i))

We consider instead the log-likelihood:

l(θ) = log L(θ)
     = ∑_{i=1}^{n} [ y^(i) log hθ(x^(i)) + (1 − y^(i)) log(1 − hθ(x^(i))) ]


Cost function J(θ)

We define the cost function J(θ) as follows:

J(θ) = −(1/n) ∑_{i=1}^{n} [ y^(i) log hθ(x^(i)) + (1 − y^(i)) log(1 − hθ(x^(i))) ]

We then look for the vector of parameters θ that minimizes J(θ).
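A possible NumPy version of J(θ) (the clipping of the probabilities is our own addition, to avoid log(0)):

```python
import numpy as np

def cost(theta, X, y, eps=1e-12):
    """Cost J(theta): average binary cross-entropy over the n observations."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x^(i)) for every row of X
    p = np.clip(p, eps, 1.0 - eps)         # keep the probabilities away from 0 and 1
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```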


Stochastic gradient descent

Repeat until convergence {
  for i = 1 to n {
    θj := θj − α (hθ(x^(i)) − y^(i)) x_j^(i)   (simultaneously for every j)
  }
}

Note: the update rule has the same form as for linear regression; only hθ differs.
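A sketch of this stochastic gradient descent in NumPy; the learning rate and the fixed number of epochs (used here instead of a convergence test) are illustrative choices:

```python
import numpy as np

def sgd_logistic(X, y, alpha=0.1, n_epochs=100):
    """Stochastic gradient descent for logistic regression: one update per observation."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_epochs):                              # "repeat until convergence"
        for i in range(n):                                 # from i = 1 to n
            h_i = 1.0 / (1.0 + np.exp(-X[i] @ theta))      # h_theta(x^(i))
            theta -= alpha * (h_i - y[i]) * X[i]           # theta_j := theta_j - alpha (h - y) x_j^(i)
    return theta
```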


Other optimization algorithms

Gradient descent¹ is not the only optimization algorithm that can be used to minimize the cost function.

Other, more sophisticated algorithms (conjugate gradient, L-BFGS, ...)² do not require choosing the learning rate α and are often faster.

¹ https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
² https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
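For example, with scikit-learn (the library referenced in the footnotes), a logistic regression can be fitted with the L-BFGS solver; the toy dataset below is only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression(solver="lbfgs", max_iter=1000)   # L-BFGS: no learning rate alpha to choose
clf.fit(X, y)
print(clf.predict(X[:5]))                                 # predicted classes
print(clf.predict_proba(X[:5]))                           # estimated P(y = 0 | x) and P(y = 1 | x)
```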

1.5 Multiclass classification


Introduction

In a multiclass classification problem we have:

y ∈ {1, 2, ..., K}, K > 2

Example:

• pattern recognition in images (see the MNIST dataset³)

³ https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html

Binary and multiclass classification

[Figure: left panel “Binary (K = 2)”: observations with y = 0 and y = 1 in the (x1, x2) plane; right panel “One-versus-rest (K > 2)”: observations with y = 1, y = 2 and y = 3.]


“One versus all” (OVR)

For each class k (k = 1, 2, ..., K), we define:

hθ^(k)(x) = P(y = k | x; θ)

Idea of the OVR method:

• we estimate a function hθ^(k)(x) (a logistic regression) for each class k, which predicts the probability that y = k
• for a new observation x, we predict the class that maximizes this function:

ŷ = argmax_k hθ^(k)(x)
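As a sketch, the OVR strategy is available in scikit-learn; here one logistic regression is fitted per class of the iris dataset (the dataset choice is ours):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)                             # K = 3 classes
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))  # one h^(k) per class
ovr.fit(X, y)
print(ovr.predict(X[:3]))                                     # class maximizing h^(k)(x)
```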


2. Regularization


2.1 The overfitting problem


Example: linear regression


[Figure: housing price versus size, with polynomial fits of increasing degree d.]

In this example, d = 1 leads to underfitting (high bias), while d = 4 leads to overfitting (high variance).

Overfitting

We speak of overfitting when:

• the fit of the model on the training data is satisfactory
• but the model fails to generalize (predict) to new observations


Solutions

Two options:

1. Reduce the number of variables (variable/model selection)
2. Regularization (reduce the values of the parameters θj)

Note: regularization is recommended in high dimension.


2.2 Cost function


Idea of regularization

[Figure: price (y) versus surface area (x), with polynomial fits of degree 2 and degree 4.]

Let hθ(x) = θ0 + θ1x + θ2x² + θ3x³ + θ4x⁴. We penalize the parameters θ3 and θ4:

min_θ J(θ) + λθ3² + λθ4², with λ > 0

Regularization
We obtain small values for the parameters θ1, ..., θp:

• we obtain a simpler function hθ(x)
• there is less risk of overfitting

We define the regularized cost function as follows:

J(θ) = ∑_{i=1}^{n} (y^(i) − hθ(x^(i)))² + λ ∑_{j=1}^{p} θj²

Note: if the value of λ (the regularization parameter) is too large, the parameters θ1, θ2, ..., θp all tend towards 0; in other words we obtain hθ(x) = θ0.
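A minimal sketch of this regularized cost in NumPy, assuming (as in the note above) that the intercept θ0 is not penalized:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Sum of squared errors plus lambda times the sum of the squared theta_j, j >= 1."""
    residuals = y - X @ theta                      # y^(i) - h_theta(x^(i)) for a linear model
    return np.sum(residuals ** 2) + lam * np.sum(theta[1:] ** 2)
```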

3. Evaluation of a Machine Learning model


3.1 Prediction error


How to test a model?

The evaluation of a Machine Learning model is done with a test sample.


The test sample is a subset of observations from the dataset that is not
used for model estimation.


Solutions for large prediction errors

In case of poor predictions, several solutions can be considered:

• collect more training observations
• use fewer explanatory variables
• add additional variables to the model
• add non-linearities (e.g. polynomial regression)
• change the value of the regularization parameter λ

These changes can improve the performance of the model.


3.2 Evaluation of a model


Overfitting

A small training error does not guarantee a good model: the model may struggle to generalize to new observations (the test sample).


Training and test samples (train-test split)

The dataset is split into two sub-samples:

• the training sample: {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n))}
• the test sample: {(x_test^(1), y_test^(1)), ..., (x_test^(n_test), y_test^(n_test))}

Note: n_test is the number of observations in the test sample.
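With scikit-learn, this split is typically done with train_test_split; the 70% / 30% proportions mentioned below are used here as an example (X and y are assumed to be already loaded):

```python
from sklearn.model_selection import train_test_split

# 70% of the observations for training, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
```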


Training/Testing for linear regression

1. The parameter θ is estimated using the training sample:

min_θ J(θ)

2. We compute the mean squared error (MSE) on the test sample:

MSE_test = (1/n_test) ∑_{i=1}^{n_test} (hθ(x_test^(i)) − y_test^(i))²

Note: we put a higher proportion of observations in the training sample (e.g. 70%) than in the test sample (e.g. 30%).
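A short sketch of these two steps for linear regression, reusing the split above (X_train, X_test, y_train, y_test are assumed to exist):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

reg = LinearRegression().fit(X_train, y_train)              # step 1: estimate theta on the training sample
mse_test = mean_squared_error(y_test, reg.predict(X_test))  # step 2: MSE on the test sample
print(mse_test)
```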


Training/Test for logistic regression

1. We estimate the parameter θ using the training sample:

min_θ J(θ)

2. We compute the classification error on the test sample:

err = (1/n_test) ∑_{i=1}^{n_test} I(y^(i) ≠ ŷ^(i))

with ŷ^(i) = 1 if hθ(x^(i)) ≥ 0.5, and ŷ^(i) = 0 otherwise.

3.3 Model selection (training, validation, test)


Example: polynomial regression

d = 1: hθ(x) = θ0 + θ1x
d = 2: hθ(x) = θ0 + θ1x + θ2x²
d = 3: hθ(x) = θ0 + θ1x + θ2x² + θ3x³
...
d = 10: hθ(x) = θ0 + θ1x + θ2x² + ... + θ10x¹⁰

For each value of d, we obtain a vector of parameters noted θ^(d).
We can compute the MSE on the test sample for each value of d: MSE^(d).
Problem: the hyperparameter d is then chosen using the same test sample, which leads to overfitting on the test sample.


Cross-validation

The dataset is separated into three sub-samples:

• the training sample: {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n))}
• the validation sample: {(x_cv^(1), y_cv^(1)), ..., (x_cv^(n_cv), y_cv^(n_cv))}
• the test sample: {(x_test^(1), y_test^(1)), ..., (x_test^(n_test), y_test^(n_test))}

Note: n_cv is the number of observations in the validation sample.


Model selection
d = 1: hθ(x) = θ0 + θ1x
d = 2: hθ(x) = θ0 + θ1x + θ2x²
d = 3: hθ(x) = θ0 + θ1x + θ2x² + θ3x³
...
d = 10: hθ(x) = θ0 + θ1x + θ2x² + ... + θ10x¹⁰

For each value of d, we obtain a vector of parameters noted θ^(d).
We compute the MSE on the validation sample for each value of d: MSE_cv^(d), and we choose the value of d that minimizes MSE_cv^(d).
The generalization error is then computed on the test sample, with the chosen value of d.
When doing regularization, the choice of the regularization parameter λ is the most important one.
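A sketch of this selection procedure for the polynomial degree d, assuming training, validation and test splits (X_train, y_train, X_val, y_val, X_test, y_test) have already been built:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

val_mse = {}
for d in range(1, 11):
    model = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    model.fit(X_train, y_train)                                    # theta^(d) estimated on the training sample
    val_mse[d] = mean_squared_error(y_val, model.predict(X_val))   # MSE_cv^(d) on the validation sample

best_d = min(val_mse, key=val_mse.get)                             # d minimizing the validation MSE
final = make_pipeline(PolynomialFeatures(degree=best_d), LinearRegression()).fit(X_train, y_train)
test_mse = mean_squared_error(y_test, final.predict(X_test))       # generalization error on the test sample
```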

3.4 Bias and variance


Learning error and validation error

• Learning error: MSE_train = (1/n) ∑_{i=1}^{n} (hθ(x^(i)) − y^(i))²
• Validation error: MSE_cv = (1/n_cv) ∑_{i=1}^{n_cv} (hθ(x_cv^(i)) − y_cv^(i))²

When the validation error is high, is it a problem of bias (underfitting) or of variance (overfitting)?


Bias versus Variance

• Bias (underfitting)
  – MSE_train is high
  – MSE_train ≈ MSE_cv
• Variance (overfitting)
  – MSE_train is low
  – MSE_cv >> MSE_train

Remark: when doing regularization, the choice of the hyperparameter λ can also be made by cross-validation.


Solutions in case of large prediction error

In case of poor predictions, several solutions are possible:

• collect more training observations
• use fewer explanatory variables
• add additional variables to the model
• add non-linearities (e.g. polynomial regression)
• change the value of the regularization parameter λ

These changes can improve the performance of the model.


3.5 Error metrics for classification


Classification example

We have medical data on n patients, and we train a classifier to predict whether a patient has cancer:



y = 1 if the patient has cancer, and y = 0 otherwise.

We obtain a test error of 1%.

Problem: only 0.5% of patients have cancer, so the trivial classifier that always predicts y = 0 already achieves an error of 0.5%. The raw error rate is therefore misleading when classes are imbalanced.


Confusion matrix

                          Predicted class
                          0            1            Total
Observed class   0        TN           FP           TN + FP
                 1        FN           TP           FN + TP
                 Total    TN + FN      FP + TP      n

• TN: true negatives
• FP: false positives
• FN: false negatives
• TP: true positives
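With scikit-learn, this matrix can be obtained as follows (y_test and y_hat as in the earlier sketches); rows are observed classes, columns are predicted classes:

```python
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_hat)   # [[TN, FP],
                                       #  [FN, TP]]
tn, fp, fn, tp = cm.ravel()
```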


Precision and recall

• Precision:

  Precision = TP / (TP + FP)

• Recall:

  Recall = TP / (TP + FN)


Trade-off precision/recall

In logistic regression we have 0 ≤ hθ(x) ≤ 1.

Predict ŷ = 1 if hθ(x) ≥ 0.5, and ŷ = 0 otherwise.

• If we increase the threshold: higher precision, lower recall
• If we decrease the threshold: lower precision, higher recall (see the sketch below)
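A sketch of this trade-off: moving the threshold away from 0.5 changes which observations are predicted positive (clf is assumed to be a fitted logistic regression):

```python
probs = clf.predict_proba(X_test)[:, 1]      # h_theta(x): estimated P(y = 1 | x)
y_hat_strict = (probs >= 0.9).astype(int)    # higher threshold: fewer positives, higher precision, lower recall
y_hat_loose = (probs >= 0.1).astype(int)     # lower threshold: more positives, lower precision, higher recall
```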


F1 score

The F1 score is the harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
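These three metrics are available in scikit-learn (y_test and y_hat as above):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_test, y_hat)   # TP / (TP + FP)
recall = recall_score(y_test, y_hat)         # TP / (TP + FN)
f1 = f1_score(y_test, y_hat)                 # 2 * Precision * Recall / (Precision + Recall)
```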
