
Ho Chi Minh University of Banking

Department of Economic Mathematics

Machine Learning
Model Evaluation

Vuong Trong Nhân (nhanvt@hub.edu.vn)


Outline

 1. Metrics for Classification


 2. Metrics for Regression
 3. Metrics for Clustering

2
1. Metrics for Classification

3
Evaluation for Classification

4
Metrics for Classification

 Accuracy score
 Confusion matrix
 Precision and Recall
 F1 score
 ROC curve
 Area Under the Curve

5
Accuracy Metrics

 Accuracy:


 The ratio between the number of correctly predicted points and the total
number of points in the test set

import numpy as np
from sklearn.metrics import accuracy_score
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 0, 2, 1, 1, 0, 2, 1, 2])

print('accuracy = ', accuracy_score(y_true, y_pred))


#accuracy = 0.6

6
Limitation of Accuracy

 Consider a binary classification problem


 Number of Class 0 examples = 9990
 Number of Class 1 examples = 10
 If we predict everything as class 0, the accuracy is 9990/10000 = 99.9%

 Solution:
 Weighted Accuracy = (w_TP*TP + w_TN*TN) / (w_TP*TP + w_FP*FP + w_TN*TN + w_FN*FN)

 Other metrics: precision, recall, F1-score, …
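
A quick sketch of the imbalance example above; balanced_accuracy_score is used here as one readily available alternative in scikit-learn (it averages the per-class recalls), not the exact weighted-accuracy formula from the slide:

import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# 9990 examples of class 0 and 10 examples of class 1
y_true = np.array([0] * 9990 + [1] * 10)
y_pred = np.zeros(10000, dtype=int)   # a model that predicts everything as class 0

print(accuracy_score(y_true, y_pred))           # 0.999 -> looks excellent
print(balanced_accuracy_score(y_true, y_pred))  # 0.5   -> no better than chance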

7
Confusion Matrix

 Shows the performance of an algorithm, especially its predictive capability,


 rather than how fast it can classify or build models, or how well it scales.

Predicted Class
Actual Positive Negative
Class Positive True Positive (TP) False Negative (FN)

Negative False Positive (FP) True Negative (TN)

8
Confusion Matrix

 Imagine a study evaluating a test that screens people


for a disease. Each person taking the test either has or
does not have the disease. The test outcome can be
positive or negative.
 The test results for each subject may or may not match
the subject's actual status. In that setting:

 True positive: Sick people correctly identified as sick


 False positive: Healthy people incorrectly identified as sick
 True negative: Healthy people correctly identified as healthy
 False negative: Sick people incorrectly identified as healthy

9
Confusion Matrix

https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers 10
Confusion Matrix
Predicted Class
Positive Negative
Actual Positive True Positive (TP) False Negative (FN) (Type II error)
Class Negative False Positive (FP) (Type I error) True Negative (TN)

 Accuracy
= (TP+TN) / (TP+FP+TN+FN)
 Precision
= TP / (TP + FP)
 Recall
= TP / (TP + FN)
 F1-score = 2 * precision * recall / (precision + recall)
= 2TP / (2TP + FP + FN)
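
A minimal sketch of these metrics in scikit-learn, using a small made-up binary example (values chosen for illustration):

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# rows = actual class, columns = predicted class;
# note: scikit-learn orders the classes as [0, 1], so TN is in the top-left cell
print(confusion_matrix(y_true, y_pred))
# [[5 1]
#  [1 3]]
print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 3 / 4 = 0.75
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 3 / 4 = 0.75
print(f1_score(y_true, y_pred))          # 2TP / (2TP + FP + FN) = 0.75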
11
Precision-Recall

12
F1-score
The F1 score is the harmonic mean of the precision and recall
F1-score ∈ (0, 1]

precision   recall   F1-score
1           1        1
0.1         0.1      0.1
0.5         0.5      0.5
1           0.1      0.182
0.3         0.8      0.436

F1-score:
(precision = 0.5, recall = 0.5) is better than (precision = 0.3, recall = 0.8)
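
The table values follow directly from the harmonic-mean formula; the helper below is just an illustrative one-liner:

# F1 as the harmonic mean of precision and recall
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f1(0.5, 0.5))   # 0.5
print(f1(1.0, 0.1))   # 0.1818...
print(f1(0.3, 0.8))   # 0.4364... -> lower than 0.5, so (0.5, 0.5) wins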

13
Type I and II error

14
Normalized confusion matrix

Predicted Class
Positive Negative
Actual Positive TPR = TP / (TP + FN) FNR = FN / (TP + FN)
Class
Negative FPR = FP / (FP + TN) TNR = TN / (FP + TN)

The False Positive Rate is also called the False Alarm Rate.

The False Negative Rate is also called the Miss Detection Rate.

In mine detection, "better a false alarm than a miss": we can accept a high
False Alarm Rate in order to achieve a low Miss Detection Rate.
In spam filtering, sending an important email to the trash by mistake is more
serious than letting a spam email through as a normal one.
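
The same normalized rates can be obtained from scikit-learn by normalizing the confusion matrix over the actual (true) classes; a short sketch using the toy binary arrays from the earlier example:

from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# normalize='true' divides each row by the number of actual examples in that class,
# giving [[TNR, FPR], [FNR, TPR]] with scikit-learn's [0, 1] class ordering
print(confusion_matrix(y_true, y_pred, normalize='true'))
# [[0.833 0.167]
#  [0.25  0.75 ]]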
15
ROC curve

 Receiver Operating Characteristic


 Graphical approach for displaying the tradeoff
between true positive rate (TPR) and false positive
rate (FPR) of a classifier
o TPR = positives correctly classified/total positives
o FPR = negatives incorrectly classified/total negatives
 TPR on y-axis and FPR on x-axis

16
ROC curve

 Points of interest (TPR, FPR)


 (0, 0): everything is classified as negative
 (1, 1): everything is classified as positive
 (1, 0): perfect (ideal) classifier

 Diagonal line
 Random guessing (50%)

 Area Under the Curve (AUC)


 Measures how good the model is on average
 Useful for comparing against other methods
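
A minimal sketch with scikit-learn, using made-up probability scores; roc_curve returns the (FPR, TPR) points of the curve and roc_auc_score the area under it:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9])  # predicted P(class = 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
print(fpr, tpr)
print('AUC =', roc_auc_score(y_true, y_score))      # AUC = 0.875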

17
For multi-class classification
Micro-average

Macro-average
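
A short sketch of the difference, reusing the multi-class arrays from the accuracy example: micro-averaging pools the TP/FP/FN counts of all classes before computing the metric, while macro-averaging computes the metric per class and then takes the unweighted mean.

from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 0, 2, 1, 1, 0, 2, 1, 2]

# micro: aggregate TP, FP, FN over all classes, then compute the metric once
print(f1_score(y_true, y_pred, average='micro'))  # 0.6 (equals accuracy here)
# macro: compute the metric per class, then take the unweighted mean
print(f1_score(y_true, y_pred, average='macro'))  # ~0.603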

20
2. Metrics for Regression

21
2.1. Bias

The sum of the residuals is sometimes referred to as the bias, where

residual = actual - prediction

As the residuals can be both positive (prediction is smaller than the actual
value) and negative (prediction is larger than the actual value), bias generally
tells you whether your predictions were higher or lower than the actuals.

However, as residuals of opposing signs offset each other, you can obtain a
model that generates predictions with a very low bias while not being accurate
at all.
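
A small numeric sketch (made-up values) of this pitfall: the residuals cancel out, so the bias is zero even though every individual prediction is off by 10.

import numpy as np

y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([ 0.0, 30.0, 20.0, 50.0])   # every prediction is off by 10

residuals = y_true - y_pred      # residual = actual - prediction
print(residuals.sum())           # 0.0  -> "unbiased", yet clearly inaccurate
print(np.abs(residuals).mean())  # 10.0 -> the average error is actually large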

Figure 1 presents the relationship between a target variable (y) and a single feature (x)
22
2.2. Mean squared error (MSE)

Pros:
 MSE uses the mean (instead of the sum) to keep the metric independent of
the dataset size.
 As the residuals are squared, MSE puts a significantly heavier penalty on
large errors. Some of those might be outliers, so MSE is not robust to their
presence.
 The metric is useful for optimization algorithms.
Cons:
 MSE is not measured in the original units, which can make it harder to
interpret.
 MSE cannot be used to compare the performance between different datasets.
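
A minimal sketch with scikit-learn (made-up values), together with the equivalent hand computation:

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

print(mean_squared_error(y_true, y_pred))   # 0.375
print(np.mean((y_true - y_pred) ** 2))      # same value, computed by hand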

23
2.3. Root mean squared error
Root mean squared error (RMSE)

 Pros:
 Taking the square root of the MSE brings the metric back to the scale of the target
variable, so it is easier to interpret and understand.
 Cons:
 However, take caution: one fact that is often overlooked is that although
RMSE is on the same scale as the target, an RMSE of 10 does not actually
mean you are off by 10 units on average.
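
A short sketch (made-up values) of the caveat above: a single large error already pushes the RMSE to 10 even though the average absolute error is only 5.

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([10.0, 10.0, 10.0, 10.0])
y_pred = np.array([10.0, 10.0, 10.0, 30.0])   # one large error of 20

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse)                            # 10.0
print(np.abs(y_true - y_pred).mean())  # 5.0 -> average absolute error is only 5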

24
https://developer.nvidia.com/
2.4. Mean absolute error (MAE)

 Pros:
 Due to the lack of squaring, the metric is expressed at the same scale as
the target variable, making it easier to interpret.
 All errors are treated equally, so the metric is robust to outliers.
 Cons:
 Absolute value disregards the direction of the errors, so underforecasting
= overforecasting.
 Similar to MSE and RMSE, MAE is also scale-dependent, so you cannot
compare it between different datasets.
 When you optimize for MAE, the prediction should be above the actual value
as often as it is below it. That means you are effectively looking for the
median; that is, a value that splits the dataset into two equal parts.
 As the formula contains absolute values, MAE is not easily differentiable.
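
A sketch (made-up values, including an outlier) of both the metric call and the median property mentioned above: as a constant forecast, the median of the targets yields a lower MAE than the mean.

import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # note the outlier

# MAE of predicting the median vs. the mean, as constant forecasts
median_pred = np.full_like(y_true, np.median(y_true))  # 3.0
mean_pred   = np.full_like(y_true, np.mean(y_true))    # 22.0

print(mean_absolute_error(y_true, median_pred))  # 20.2 -> lower
print(mean_absolute_error(y_true, mean_pred))    # 31.2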
25
2.5. R-squared (R²)

Measure: how well your model fits the data

RSS : the residual sum of squares


TSS : the total sum of squares

26
2.5. R-squared
Measure: how well your model fits the data
• RSS : the residual sum of squares
• TSS : the total sum of squares

 Pros:
 Model Fit Assessment & Model Comparisons
o A higher R-squared means a better fit.
 Helps in Feature Selection
o If adding a variable improves R-squared a lot,
it's likely a good predictor.
 Cons:
 Sensitive to Outliers
 Depends on Sample Size
 Does not distinguish between different types of
relationships
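
A minimal sketch with scikit-learn (made-up values), computing R² both by hand as 1 - RSS/TSS and via r2_score:

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

rss = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
tss = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
print(1 - rss / tss)                          # 0.9486...
print(r2_score(y_true, y_pred))               # same value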
27
2.6. Some other metrics
 Mean squared log error (MSLE)
 Root mean squared log error (RMSLE)
 Symmetric mean absolute percentage error (sMAPE)
…

28
3. Metrics for Clustering

29
Rand index (RI)
 Given the knowledge of the ground-truth class assignments labels_true and
our clustering algorithm's assignments of the same samples labels_pred, the
(adjusted or unadjusted) Rand index measures the similarity of the two
assignments, ignoring permutations.
 If C is a ground truth class assignment and K the clustering, let us
define a and b as:
 a the number of pairs of elements that are in the same set in C and in the
same set in K
 b the number of pairs of elements that are in different sets in C and in
different sets in K

The (unadjusted) Rand index is then RI = (a + b) / C, where
C = n_samples(n_samples - 1)/2 is the total number of possible pairs in the dataset.

30
Adjusted Rand index (ARI)
 However, the Rand index does not guarantee that
random label assignments will get a value close to zero
(esp. if the number of clusters is in the same order of
magnitude as the number of samples).

 To counter this effect, we can discount the expected RI (E[RI]) of random
labelings by defining the adjusted Rand index as follows:

ARI = (RI - E[RI]) / (max(RI) - E[RI])

31
Rand index (RI) & Adjusted Rand Index (ARI)

Rand index is a function that measures the similarity of the two assignments,
ignoring permutations:

The Rand index does not guarantee a value close to 0.0 for a random labelling.

The adjusted Rand index corrects for chance and will give such a
baseline.
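
A minimal sketch with scikit-learn; the labelings are made up, and the last call shows that permuting the cluster labels does not change the score:

from sklearn.metrics import rand_score, adjusted_rand_score

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

print(rand_score(labels_true, labels_pred))           # 0.666...
print(adjusted_rand_score(labels_true, labels_pred))  # 0.2424...

# the score is invariant to permuting the cluster labels
print(adjusted_rand_score(labels_true, [1, 1, 0, 0, 3, 3]))  # same ARI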

32
Silhouette Score
 The Silhouette Coefficient (sklearn.metrics.silhouette_score) is an
example of such an evaluation, where a higher Silhouette Coefficient
score relates to a model with better defined clusters. The Silhouette
Coefficient is defined for each sample and is composed of two scores:

•a: The mean distance between a sample and all other


points in the same class.
•b: The mean distance between a sample and all other
points in the next nearest cluster.

 The Silhouette Coefficient s for a single sample is then given as:
s = (b - a) / max(a, b)
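
A short sketch with scikit-learn on made-up 2-D points; KMeans is used here only to produce a labeling, and silhouette_score returns the mean coefficient over all samples:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# two well-separated blobs of toy points
X = np.array([[1, 1], [1.2, 0.8], [0.9, 1.1],
              [8, 8], [8.2, 7.9], [7.8, 8.1]])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))   # close to 1 -> well-defined clusters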

33
Some other metrics

 Mutual Information based scores


 Homogeneity, completeness and V-measure
 Fowlkes-Mallows scores
 Calinski-Harabasz Index
 Davies-Bouldin Index
 Contingency Matrix
 Pair Confusion Matrix

34
References:
 https://scikit-
learn.org/stable/modules/clustering.html#clustering-
evaluation

35