03 - Model Evaluation Comparison
2022/23
What is a model?
Evaluation methods and metrics
1
AGENDA
1. Supervised Learning
2. What is a model?
3. Evaluation Methods
4. Evaluation Metrics
2
1 Supervised Learning
What is Supervised Learning?
Machine learning techniques that use labeled datasets to train algorithms with the purpose of classifying data or predicting outcomes.
3
1 Supervised Learning
Supervised Learning
Regression vs Classification
4
1.1 Methodologies in Data Science
Regression
[Figure: regression example – predicting an unknown numeric target for a record with attribute values 95, T2, Arroios, Yes, 3]
5
1.1 Methodologies in Data Science
Regression
6
1.2 Methodologies in Data Science
Classification
[Figure: classification example – predicting an unknown class label for a record with attribute values 3627, F, 93, 185, 0]
7
1.2 Methodologies in Data Science
Classification
9
1.3 Methodologies in Data Science
Classification + Regression
10
1.3 Methodologies in Data Science
Classification + Regression
12
2.1 Models
Developing a model
- As with any other computer task, modeling requires a “program” providing detailed instructions
- These instructions are typically mathematical equations, which characterize the relationship between inputs and outputs
13
2.2 Models
Types of models
[Figure: types of MODELS]
How can we define these instructions and equations?
14
2.2.1 Models
Fixed models
- Closed-form equations that define how the outputs are derived from the inputs
- Because all their characteristics are fixed when the equations are derived, we refer to them as fixed models
- Suitable for simple and fully understood problems
Example:
Compute how long it takes an apple to hit the ground on Earth:

$t = \sqrt{\frac{2h}{9.8}}$
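A minimal sketch of this fixed model as code (units assumed: height in metres, time in seconds):

```python
import math

def fall_time_earth(height_m: float) -> float:
    """Fixed model: time for an object to fall from height_m on Earth.
    Everything, including g = 9.8 m/s^2, is fixed in the equation."""
    return math.sqrt(2 * height_m / 9.8)

print(fall_time_earth(2.0))  # ~0.64 s for an apple falling from 2 m
```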
Most problems are too complex and / or not sufficiently understood for us to use fixed models.
15
2.2.2 Models
Parametric models
Example:
Compute how long it takes an apple to hit the ground on Mars:

$t = \sqrt{\frac{2h}{g}}$

where $g$ is the rate of acceleration.

[Figure: block diagram – INPUTS (in this case the input vector has only one component) → model → OUTPUTS]
17
2.2.2 Models
Parametric models
A value should be selected for g so that the model produces estimates close to the measured times when presented with the corresponding heights as input.
The estimates will only be close, not exact, due to inaccuracies in the structure of the model or to measurement errors.
In this case it is easy to define a good value for g, but this is not generally possible in most problems, which involve complex relationships and multiple variables.
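A minimal sketch of how the parameter g could be estimated from data with a least-squares fit (the heights and measured times below are made-up values for illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

def fall_time(h, g):
    """Parametric model: t = sqrt(2h / g), with g as the parameter to estimate."""
    return np.sqrt(2 * h / g)

# Hypothetical measurements: heights (m) and measured fall times (s)
heights = np.array([1.0, 2.0, 5.0, 10.0])
times = np.array([0.74, 1.04, 1.63, 2.33])

(g_hat,), _ = curve_fit(fall_time, heights, times, p0=[9.8])
print(f"Estimated g: {g_hat:.2f} m/s^2")  # close to Mars gravity (~3.7)
```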
18
2.2.2 Models
Parametric models
19
2.2.3 Models
Non-Parametric models
- Models that rely heavily on data, instead of human expertise, can be called data-driven models
- Used in complex problems where the relationship between inputs and outputs is not known
20
2.2.3 Models
Non-Parametric models
- The basic premise is that relations consistently occurring in the dataset will recur in future observations
21
2.2.3 Models
Non-Parametric models
- In real use cases, most problems are too complex for a fixed or parametric model
- However, you should always start with the simplest model and only increase complexity when it is needed
22
2.2.4 Models
Parametric vs Non-Parametric models
[Figure: deductive reasoning goes from the general (a rule) to the particular (examples); inductive reasoning goes from the particular (examples) to the general (a rule)]
23
2.2.4 Models
Parametric vs Non-Parametric models
Parametric model
“A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model. No matter how much data you throw at a parametric model, it won’t change its mind about how many parameters it needs.”
Nonparametric model
“Nonparametric methods are good when you have a lot of data and no prior knowledge, and when you don’t want to worry too much about choosing just the right features.”
Russell, Stuart J. Artificial Intelligence: A Modern Approach. Upper Saddle River, N.J.: Prentice Hall, 2010.
24
2.2.4 Models
Parametric vs Non-Parametric models
[Figure: trade-off between human expertise and complexity – a parametric model such as linear regression assumes known distributions in the data, while a non-parametric model such as a non-linear SVM requires more training data than the parametric models]
28
2.3.1 Models
The no free lunch theorem
“If an algorithm performs better than random search on some class of problems, then it must perform worse than random search on the remaining problems.”
- No single machine learning algorithm is universally the best-performing algorithm for all problems!
- You need to understand the trade-offs
29
2.3.2 Models
The problem of complexity
- Sometimes it is not feasible to collect enough data, or to allow enough processing time, to explore the whole search space
- Choosing the right features is a way of simplifying the problem and reducing the “search space”
- Important attributes vs redundant attributes: which variables are relevant for the task?
30
2.3.2 Models
The problem of complexity
31
2.3.3 Models
The concept of separability
32
2.3.4 Models
The concept of overfitting
If we allow too much complexity, the model will “memorize” the training data,
instead of extracting useful relationships
OVERFITTING
33
2.3.4 Models
The concept of overfitting
Memorizing vs Understanding
34
2.3.4 Models
The concept of overfitting
Underfitting vs Overfitting
[Figure: example in a classification problem]
35
2.3.4 Models
The concept of overfitting
Underfitting vs Overfitting
[Figure: example in a regression problem]
36
2.3.4 Models
The concept of overfitting
37
2.3.4 Models
The concept of overfitting
Training Error
38
2.3.4 Models
The concept of overfitting
VALIDATION SET – The bigger it is, the better the estimate of the optimal training.
TEST SET – The bigger it is, the better the estimate of the performance of the classifier on unseen data.
39
3 Model evaluation
40
3 Model evaluation
41
3 Model evaluation
[Figure: the Dataset is used to train the Model, which is then evaluated on the same Dataset]
We can estimate and evaluate the performance of the model on the training data
- However, estimates based only on the training data are not good indicators of the performance of the model on unseen data
- New data will probably not be exactly the same as the training data – these estimates suffer from overfitting
42
3.1 Model evaluation
The hold-out method
(Some authors advocate 80% train and 20% validation – it depends on the size of the dataset)
43
3.1 Model evaluation
The hold-out method
When there is plenty of data, we can split our dataset into training, validation and test sets:

[Figure: the Dataset split into training, validation and test sets before building the Model]

RULE OF THUMB
- 70 % for the training set
- 15 % for the validation set
- 15 % for the test set
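A minimal sketch of the 70 / 15 / 15 split with scikit-learn (the dataset is synthetic; train_test_split only produces one split, so it is applied twice):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)

# First keep 70 % for training and put 30 % aside
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=42)

# Then split the remaining 30 % in half: 15 % validation, 15 % test
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```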
44
3.1 Model evaluation
The hold-out method
Stratified sampling
Sample each class independently such that each split has the same ratio of any given class

[Figure: a Dataset with 27 % Class 0 and 73 % Class 1 keeps these proportions in each split]
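A minimal sketch of a stratified hold-out split (the stratify argument keeps the class ratio in both parts; the imbalanced dataset is synthetic):

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic dataset with roughly 27 % / 73 % classes, just for illustration
X, y = make_classification(n_samples=1000, weights=[0.27, 0.73], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

print(Counter(y_train), Counter(y_test))  # both keep ~27 % / 73 %
```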
46
3.1 Model evaluation
The hold-out method
47
3.2 Model evaluation
K-Fold Cross Validation
Step 1 – Split the data into k subsets of equal size
Step 2 – Each turn, use one subset for testing / validation and the remainder for training
Step 3 – Average all estimates to get a final value

[Figure: 5-fold cross-validation – in each of the 5 iterations one fold is used as TEST and the remaining four as TRAIN]
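A minimal sketch of 5-fold cross-validation with scikit-learn (the data and the logistic regression model are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Steps 1 and 2: split into k = 5 folds and rotate the test fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy")

# Step 3: average the per-fold estimates
print(scores.mean(), scores.std())
```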
48
3.2 Model evaluation
K-Fold Cross Validation
SOME TIPS
- Extensive experiments have shown that this is the best choice to get an accurate estimate
- Big datasets can make use of more folds (and in that way get a more robust estimate)
49
3.3 Model evaluation
Other options
- Leave-One-Out Cross-Validation: the extreme case of k-fold cross-validation where k equals the number of observations, so each example is used exactly once as the test set
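A minimal sketch of leave-one-out cross-validation (illustrative data and model; with n observations it trains the model n times):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=50, random_state=0)

# One fold per observation: 50 rounds, each testing a single example
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(scores.mean())
```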
50
3.4 Model evaluation
Compare two models
Training and validating on different splits, and computing the standard deviation of the resulting scores, will help you compare the models
(you can even go further and use a hypothesis test to prove or disprove a measurable hypothesis)
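A minimal sketch of comparing two models on the same cross-validation splits and looking at mean ± standard deviation (models and data chosen only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)  # identical splits for both models

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("decision tree", DecisionTreeClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```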
51
3.4 Model evaluation
Compare two models
- Easier to understand
52
3.4 Model evaluation
Wrap-up
- Use test sets and the hold-out method for “large” data
- Don’t use test data for parameter tuning – use separate validation data
53
4 Model evaluation
54
4.1 Model evaluation
Metrics for classification
55
4.1 Model evaluation
Metrics for classification
                          True Class
                          Positive              Negative
Predicted   Positive      TP                    FP (Type I Error)
Class       Negative      FN (Type II Error)    TN
56
4.1 Model evaluation
Metrics for classification
                          True Class
                          Goat      Not Goat
Predicted   Goat          1
Class       Not Goat
TRUE POSITIVE
57
4.1 Model evaluation
Metrics for classification
                          True Class
                          Goat      Not Goat
Predicted   Goat          1         1
Class       Not Goat
FALSE POSITIVE
58
4.1 Model evaluation
Metrics for classification
                          True Class
                          Goat      Not Goat
Predicted   Goat          1         1
Class       Not Goat      1
FALSE NEGATIVE
59
4.1 Model evaluation
Metrics for classification
                          True Class
                          Goat      Not Goat
Predicted   Goat          1         1
Class       Not Goat      1         1
TRUE NEGATIVE
60
4.1 Model evaluation
Metrics for classification
                          True Class
                          Goat      Not Goat
Predicted   Goat          30        4
Class       Not Goat      6         20
...
61
4.1.1 Model evaluation
Metrics for classification - Accuracy
Accuracy
Proportion of events correctly identified (positive or negative) out of all events

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$

                          True Class
                          Positive    Negative
Predicted   Positive      TP          FP
Class       Negative      FN          TN
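A minimal sketch of the confusion matrix and accuracy with scikit-learn, on two small illustrative label vectors (note that scikit-learn puts the true class on the rows, unlike the slide):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 1, 0, 0, 1, 0, 1, 0]   # 1 = positive, 0 = negative
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows = true class, columns = predicted class

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy_score(y_true, y_pred))     # 6 correct out of 8 -> 0.75
```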
62
4.1.2 Model evaluation
Metrics for classification – Error Rate
63
4.1.3 Model evaluation
Metrics for classification – Precision
Precision
Proportion of correctly identified positive events out of all events identified as positive

$Precision = \frac{TP}{TP + FP}$

                          True Class
                          Positive    Negative
Predicted   Positive      TP          FP
Class       Negative      FN          TN

                          True Class
                          Goat      Not Goat
Predicted   Goat          30        4
Class       Not Goat      6         20

$Precision = \frac{30}{30 + 4} = \frac{30}{34} = 0.882$

- Email spam detection: a false positive means that an e-mail that is not spam has been identified as spam. The user might lose important emails if the precision is not high.
64
4.1.4 Model evaluation
Metrics for classification – Recall
Recall
Proportion of actual positive events that were correctly identified: $Recall = \frac{TP}{TP + FN}$
- Sick patient detection: if a sick patient (actual positive) goes through the test and is predicted as not sick (false negative), the cost will be extremely high if the sickness is contagious
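A minimal sketch that computes precision and recall for the goat example, rebuilding the labels from the counts TP = 30, FP = 4, FN = 6, TN = 20:

```python
from sklearn.metrics import precision_score, recall_score

# 1 = Goat, 0 = Not Goat
y_true = [1] * 30 + [0] * 4 + [1] * 6 + [0] * 20
y_pred = [1] * 30 + [1] * 4 + [0] * 6 + [0] * 20

print(precision_score(y_true, y_pred))  # 30 / (30 + 4) = 0.882
print(recall_score(y_true, y_pred))     # 30 / (30 + 6) = 0.833
```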
65
4.1.5 Model evaluation
Metrics for classification – Specificity
66
4.1.6 Model evaluation
Metrics for classification – The problem for imbalanced datasets
Example 2
- We have a dataset with 10000 images. Only 10 have goats.
- What is the accuracy of the model?

                          True Class
                          Goat      Not Goat
Predicted   Goat          1         0
Class       Not Goat      9         9990

$Accuracy = \frac{1 + 9990}{1 + 9990 + 0 + 9} = \frac{9991}{10000} = 0.9991$

This seems like an almost perfect model!
But in reality, the model is not able to identify goats well, which is our main goal!
The recall is equal to 0.1!
The problem of imbalanced datasets
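A minimal sketch reproducing these numbers, where accuracy looks excellent but recall exposes the problem (labels rebuilt from the counts on the slide):

```python
from sklearn.metrics import accuracy_score, recall_score

# 10 goat images (1) and 9990 non-goat images (0); the model finds only 1 goat
y_true = [1] * 10 + [0] * 9990
y_pred = [1] * 1 + [0] * 9 + [0] * 9990

print(accuracy_score(y_true, y_pred))  # 0.9991 – looks almost perfect
print(recall_score(y_true, y_pred))    # 0.1   – only 1 of the 10 goats is found
```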
67
4.1.6 Model evaluation
Metrics for classification – The problem for imbalanced datasets
68
4.1.7 Model evaluation
Metrics for classification – Wrap up (some hints and examples)
- F1 Score is a good measure if you seek a balance between precision and recall
- Accuracy is a good measure only if the cost of False Positive and False Negative is equal and the dataset is balanced
69
4.1.8 Model evaluation
Cutoff for classification
- Most predictive algorithms classify via a 2-step process. For each record:
  1. Compute a score or probability of belonging to the positive class
  2. Compare it to a cutoff value and assign the record to a class accordingly
- We can use different cutoff values (typically, error rate is lowest for cutoff = 0.5)
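A minimal sketch of the two-step process with different cutoffs applied to predicted probabilities (model and data are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]        # step 1: probability of the positive class

for cutoff in (0.3, 0.5, 0.7):                   # step 2: compare to a cutoff value
    y_pred = (proba >= cutoff).astype(int)
    print(cutoff, accuracy_score(y_test, y_pred), recall_score(y_test, y_pred))
```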
70
4.1.8 Model evaluation
Metrics for classification – ROC Curve
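A minimal sketch of computing the ROC curve (one TPR / FPR point per cutoff) and its AUC with scikit-learn, again on illustrative data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, proba)  # one point per cutoff value
print(roc_auc_score(y_test, proba))              # area under the ROC curve
```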
71
4.2 Model evaluation
Metrics for Regression
72
4.2 Model evaluation
Metrics for Regression
73
4.2 Model evaluation
Metrics for Regression
Model 1
y    ŷ    |y − ŷ|
3    4    1
6    5    1
4    6    2
8    7    1
12   11   1
5    5    0

Model 2
y    ŷ    |y − ŷ|
3    4    1
6    5    1
4    6    2
8    7    1
12   11   1
5    14   9
74
4.2.1 Model evaluation
Metrics for Regression – Mean Absolute Error (MAE)
$MAE(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$

Model 1
y    ŷ    |y − ŷ|
3    4    1
6    5    1
4    6    2
8    7    1
12   11   1
5    5    0

Model 2
y    ŷ    |y − ŷ|
3    4    1
6    5    1
4    6    2
8    7    1
12   11   1
5    14   9

Model 1: $MAE = \frac{1+1+2+1+1+0}{6} = 1$
Model 2: $MAE = \frac{1+1+2+1+1+9}{6} = 2.5$

This measures the average absolute distance between the real data and the predicted data, but it fails to punish large errors in prediction.
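A minimal sketch reproducing these MAE values with scikit-learn:

```python
from sklearn.metrics import mean_absolute_error

y_true = [3, 6, 4, 8, 12, 5]
pred_model1 = [4, 5, 6, 7, 11, 5]
pred_model2 = [4, 5, 6, 7, 11, 14]

print(mean_absolute_error(y_true, pred_model1))  # 1.0
print(mean_absolute_error(y_true, pred_model2))  # 2.5
```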
75
4.2.2 Model evaluation
Metrics for Regression – Mean Squared Error (MSE)
$MSE(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Model 1
y    ŷ    (y − ŷ)²
3    4    1
6    5    1
4    6    4
8    7    1
12   11   1
5    5    0

Model 2
y    ŷ    (y − ŷ)²
3    4    1
6    5    1
4    6    4
8    7    1
12   11   1
5    14   81

Model 1: $MSE = \frac{1+1+4+1+1+0}{6} = 1.33$
Model 2: $MSE = \frac{1+1+4+1+1+81}{6} = 14.83$

This measures the average squared distance between the real data and the predicted data. Here, larger errors are well noted (better than MAE).

$RMSE(y, \hat{y}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
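A minimal sketch reproducing the MSE values, with the corresponding RMSE obtained by taking the square root:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = [3, 6, 4, 8, 12, 5]
pred_model1 = [4, 5, 6, 7, 11, 5]
pred_model2 = [4, 5, 6, 7, 11, 14]

for name, pred in [("Model 1", pred_model1), ("Model 2", pred_model2)]:
    mse = mean_squared_error(y_true, pred)
    print(name, round(mse, 2), round(np.sqrt(mse), 2))  # MSE 1.33 and 14.83
```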
76
4.2.4 Model evaluation
Metrics for Regression - R Square (R2 ) / Coefficient of determination
$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$

Model 1 ($\bar{y} = 6.33$)
y    ŷ    (y − ŷ)²   (y − ȳ)²
3    4    1          11.09
6    5    1          0.11
4    6    4          5.43
8    7    1          2.79
12   11   1          32.15
5    5    0          1.77
SUM        8          53.34

Model 1: $R^2 = 1 - \frac{8}{53.34} = 0.85$

Interpretation: 85 % of the variance in the dependent variable is explained by the model
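A minimal sketch reproducing this R² with scikit-learn:

```python
from sklearn.metrics import r2_score

y_true = [3, 6, 4, 8, 12, 5]
pred_model1 = [4, 5, 6, 7, 11, 5]

print(r2_score(y_true, pred_model1))  # ~0.85
```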
77
4.2.4 Model evaluation
Metrics for Regression – Adjusted R-Square (Adjusted R2 )
$Adjusted\ R^2(y, \hat{y}) = 1 - \frac{(1 - R^2)(N - 1)}{N - p - 1}$

where $R^2$ is the R-Squared, $N$ is the sample size (number of observations) and $p$ is the number of predictors (independent variables).
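A minimal sketch of the adjusted R² formula (R² = 0.85 and N = 6 come from the previous slide; p = 2 predictors is an assumed value for illustration):

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2)(N - 1) / (N - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# R^2 and N from the worked example; p = 2 is only an assumption
print(adjusted_r2(0.85, 6, 2))  # 0.75
```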
78
4.2.4 Model evaluation
Metrics for Regression – Wrap Up
- MSE / RMSE has the benefit of penalizing large errors more, so it can be more appropriate in some cases, for example when large errors are particularly costly
- If being off by 10 is just twice as bad as being off by 5, then use MAE
- Use adjusted R-Squared when comparing the performance of models where the number of predictors is different
79
Thank you!
80