03 - Model Evaluation Comparison
2022/23
What is a model?
Evaluation methods and metrics
1
AGENDA
1. Supervised Learning
2. What is a model?
3. Evaluation Methods
4. Evaluation Metrics
2
1 Supervised Learning
What is Supervised Learning?
Machine learning techniques that use labeled datasets to train algorithms with the purpose of classifying data or predicting outcomes.
3
1 Supervised Learning
Supervised Learning
Regression vs Classification
4
1.1 Methodologies in Data Science
Regression
[Figure: regression example – predicting an unknown numeric target for a record with attribute values 95, T2, Arroios, Yes, 3]
5
1.1 Methodologies in Data Science
Regression
6
1.2 Methodologies in Data Science
Classification
[Figure: classification example – predicting an unknown class label for a record with attribute values 3627, F, 93, 185, 0]
7
1.2 Methodologies in Data Science
Classification
9
1.3 Methodologies in Data Science
Classification + Regression
10
1.3 Methodologies in Data Science
Classification + Regression
12
2.1 Models
Developing a model
- As with any other computer task, modeling requires a “program” providing detailed instructions
- These instructions are typically mathematical equations, which characterize the relationship between inputs and outputs
13
2.2 Models
Types of models
[Figure: types of MODELS]
How can we define these instructions and equations?
14
2.2.1 Models
Fixed models
- Closed-form equations that define how the outputs are derived from the inputs
- Because all their characteristics are fixed when the equations are derived, we refer to them as fixed models
- Suitable for simple and fully understood problems
Example:
Compute how long it takes an apple to hit the ground on Earth:

$t = \sqrt{\frac{2h}{9.8}}$
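A minimal sketch of this fixed model as code (units assumed: height in metres, time in seconds):

```python
import math

def fall_time_earth(height_m: float) -> float:
    """Fixed model: time for an object to fall from height_m on Earth.
    Everything, including g = 9.8 m/s^2, is fixed in the equation."""
    return math.sqrt(2 * height_m / 9.8)

print(fall_time_earth(2.0))  # ~0.64 s for an apple falling from 2 m
```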
Most problems are too complex and / or not sufficiently understood for us to use fixed models.
15
2.2.2 Models
Parametric models
Example:
Compute how long it takes an apple to hit the ground on Mars:

$t = \sqrt{\frac{2h}{g}}$

where $g$ is the rate of acceleration.

[Figure: block diagram – INPUTS (in this case the input vector has only one component) → model → OUTPUTS]
17
2.2.2 Models
Parametric models
A value should be selected for g so that the model produces estimates close to the measured times when presented with the corresponding heights as input.
The estimates will only be close, not exact, due to inaccuracies in the structure of the model or to measurement errors.
In this case it is easy to define a good value for g, but this is not generally possible in most problems, which involve complex relationships and multiple variables.
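A minimal sketch of how the parameter g could be estimated from data with a least-squares fit (the heights and measured times below are made-up values for illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

def fall_time(h, g):
    """Parametric model: t = sqrt(2h / g), with g as the parameter to estimate."""
    return np.sqrt(2 * h / g)

# Hypothetical measurements: heights (m) and measured fall times (s)
heights = np.array([1.0, 2.0, 5.0, 10.0])
times = np.array([0.74, 1.04, 1.63, 2.33])

(g_hat,), _ = curve_fit(fall_time, heights, times, p0=[9.8])
print(f"Estimated g: {g_hat:.2f} m/s^2")  # close to Mars gravity (~3.7)
```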
18
2.2.2 Models
Parametric models
19
2.2.3 Models
Non-Parametric models
- Models that rely heavily on data, instead of human expertise, can be called data-driven models
- Used in complex problems where the relationship between inputs and outputs is not known
20
2.2.3 Models
Non-Parametric models
- The basic premise is that relations consistently occurring in the dataset will recur in future observations
21
2.2.3 Models
Non-Parametric models
- In real use cases, most problems are too complex for a fixed or parametric model
- However, you should always start with the simplest model and only increase complexity when it is needed
22
2.2.4 Models
Parametric vs Non-Parametric models
[Figure: deductive reasoning goes from the general (a rule) to the particular (examples); inductive reasoning goes from the particular (examples) to the general (a rule)]
23
2.2.4 Models
Parametric vs Non-Parametric models
Parametric model
“A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model. No matter how much data you throw at a parametric model, it won’t change its mind about how many parameters it needs.”
Nonparametric model
“Nonparametric methods are good when you have a lot of data and no prior knowledge, and when you don’t want to worry too much about choosing just the right features.”
Russell, Stuart J. Artificial Intelligence: A Modern Approach. Upper Saddle River, N.J.: Prentice Hall, 2010.
24
2.2.4 Models
Parametric vs Non-Parametric models
[Figure: trade-off between human expertise and complexity – a parametric model such as linear regression assumes known distributions in the data, while a non-parametric model such as a non-linear SVM requires more training data than the parametric models]
28
2.3.1 Models
The no free lunch theorem
“If an algorithm performs better than random search on some class of problems, then it must perform worse than random search on the remaining problems.”
- No single machine learning algorithm is universally the best-performing algorithm for all problems!
- You need to understand the trade-offs
29
2.3.2 Models
The problem of complexity
- Sometimes it is not feasible to collect enough data, or to allow enough processing time, to explore the whole search space
- Choosing the right features is a way of simplifying the problem and reducing the “search space”
- Important attributes vs redundant attributes: which variables are relevant for the task?
30
2.3.2 Models
The problem of complexity
31
2.3.3 Models
The concept of separability
32
2.3.4 Models
The concept of overfitting
If we allow too much complexity, the model will “memorize” the training data,
instead of extracting useful relationships
OVERFITTING
33
2.3.4 Models
The concept of overfitting
Memorizing vs Understanding
34
2.3.4 Models
The concept of overfitting
Underfitting vs Overfitting
[Figure: example in a classification problem]
35
2.3.4 Models
The concept of overfitting
Underfitting vs Overfitting
[Figure: example in a regression problem]
36
2.3.4 Models
The concept of overfitting
37
2.3.4 Models
The concept of overfitting
Training Error
38
2.3.4 Models
The concept of overfitting
VALIDATION SET – The bigger it is, the better the estimate of the optimal training.
TEST SET – The bigger it is, the better the estimate of the performance of the classifier on unseen data.
39
3 Model evaluation
40
3 Model evaluation
41
3 Model evaluation
[Figure: the Dataset is used to train the Model, which is then evaluated on the same Dataset]
We can estimate and evaluate the performance of the model on the training data
- However, estimates based only on the training data are not good indicators of the performance of the model on unseen data
- New data will probably not be exactly the same as the training data – these estimates suffer from overfitting
42
3.1 Model evaluation
The hold-out method
(Some authors advocate 80% train and 20% validation – it depends on the size of the dataset)
43
3.1 Model evaluation
The hold-out method
When there is plenty of data, we can split our dataset into training, validation and test sets:

[Figure: the Dataset split into training, validation and test sets before building the Model]

RULE OF THUMB
- 70 % for the training set
- 15 % for the validation set
- 15 % for the test set
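A minimal sketch of the 70 / 15 / 15 split with scikit-learn (the dataset is synthetic; train_test_split only produces one split, so it is applied twice):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)

# First keep 70 % for training and put 30 % aside
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=42)

# Then split the remaining 30 % in half: 15 % validation, 15 % test
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```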
44
3.1 Model evaluation
The hold-out method
Stratified sampling
Sample each class independently such that each split has the same ratio of any given class

[Figure: a Dataset with 27 % Class 0 and 73 % Class 1 keeps these proportions in each split]
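A minimal sketch of a stratified hold-out split (the stratify argument keeps the class ratio in both parts; the imbalanced dataset is synthetic):

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic dataset with roughly 27 % / 73 % classes, just for illustration
X, y = make_classification(n_samples=1000, weights=[0.27, 0.73], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

print(Counter(y_train), Counter(y_test))  # both keep ~27 % / 73 %
```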
46
3.1 Model evaluation
The hold-out method
47
3.2 Model evaluation
K-Fold Cross Validation
Step 1 – Split the data into k subsets of equal size
Step 2 – Each turn, use one subset for testing / validation and the remainder for training
Step 3 – Average all estimates to get a final value

[Figure: 5-fold cross-validation – in each of the 5 iterations one fold is used as TEST and the remaining four as TRAIN]
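A minimal sketch of 5-fold cross-validation with scikit-learn (the data and the logistic regression model are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Steps 1 and 2: split into k = 5 folds and rotate the test fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy")

# Step 3: average the per-fold estimates
print(scores.mean(), scores.std())
```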
48
3.2 Model evaluation
K-Fold Cross Validation
SOME TIPS
- Extensive experiments have shown that this is the best choice to get an accurate estimate
- Big datasets can make use of more folds (and in that way get a more robust estimate)
49
3.3 Model evaluation
Other options
- Leave-One-Out Cross-Validation: the extreme case of k-fold cross-validation where k equals the number of observations, so each example is used exactly once as the test set
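A minimal sketch of leave-one-out cross-validation (illustrative data and model; with n observations it trains the model n times):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=50, random_state=0)

# One fold per observation: 50 rounds, each testing a single example
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(scores.mean())
```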
50
3.4 Model evaluation
Compare two models
Training and validating on different splits, and computing the standard deviation of the resulting scores, will help you compare the models
(you can even go further and use a hypothesis test to prove or disprove a measurable hypothesis)
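A minimal sketch of comparing two models on the same cross-validation splits and looking at mean ± standard deviation (models and data chosen only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)  # identical splits for both models

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("decision tree", DecisionTreeClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```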
51
3.4 Model evaluation
Compare two models
- Easier to understand
52
3.4 Model evaluation
Wrap-up
- Use test sets and the hold-out method for “large” data
- Don’t use test data for parameter tuning – use separate validation data
53
4 Model evaluation
54
4.1 Model evaluation
Metrics for classification
55
4.1 Model evaluation
Metrics for classification
                          True Class
                          Positive              Negative
Predicted   Positive      TP                    FP (Type I Error)
Class       Negative      FN (Type II Error)    TN
56
4.1 Model evaluation
Metrics for classification
                          True Class
                          Goat      Not Goat
Predicted   Goat          1
Class       Not Goat
TRUE POSITIVE
57
4.1 Model evaluation
Metrics for classification
                          True Class
                          Goat      Not Goat
Predicted   Goat          1         1
Class       Not Goat
FALSE POSITIVE
58
4.1 Model evaluation
Metrics for classification
                          True Class
                          Goat      Not Goat
Predicted   Goat          1         1
Class       Not Goat      1
FALSE NEGATIVE
59
4.1 Model evaluation
Metrics for classification
                          True Class
                          Goat      Not Goat
Predicted   Goat          1         1
Class       Not Goat      1         1
TRUE NEGATIVE
60
4.1 Model evaluation
Metrics for classification
                          True Class
                          Goat      Not Goat
Predicted   Goat          30        4
Class       Not Goat      6         20
...
61
4.1.1 Model evaluation
Metrics for classification - Accuracy
Accuracy
Proportion of events correctly identified (positive or negative) out of all events

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$

                          True Class
                          Positive    Negative
Predicted   Positive      TP          FP
Class       Negative      FN          TN
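A minimal sketch of the confusion matrix and accuracy with scikit-learn, on two small illustrative label vectors (note that scikit-learn puts the true class on the rows, unlike the slide):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 1, 0, 0, 1, 0, 1, 0]   # 1 = positive, 0 = negative
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows = true class, columns = predicted class

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy_score(y_true, y_pred))     # 6 correct out of 8 -> 0.75
```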
62
4.1.2 Model evaluation
Metrics for classification – Error Rate
63
4.1.3 Model evaluation
Metrics for classification – Precision
Precision
Proportion of correctly identified positive events out of all events identified as positive

$Precision = \frac{TP}{TP + FP}$

                          True Class
                          Positive    Negative
Predicted   Positive      TP          FP
Class       Negative      FN          TN

                          True Class
                          Goat      Not Goat
Predicted   Goat          30        4
Class       Not Goat      6         20

$Precision = \frac{30}{30 + 4} = \frac{30}{34} = 0.882$

- Email spam detection: a false positive means that an e-mail that is not spam has been identified as spam. The user might lose important emails if the precision is not high.
64
4.1.4 Model evaluation
Metrics for classification – Recall
Recall
Proportion of actual positive events that were correctly identified: $Recall = \frac{TP}{TP + FN}$
- Sick patient detection: if a sick patient (actual positive) goes through the test and is predicted as not sick (false negative), the cost will be extremely high if the sickness is contagious
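A minimal sketch that computes precision and recall for the goat example, rebuilding the labels from the counts TP = 30, FP = 4, FN = 6, TN = 20:

```python
from sklearn.metrics import precision_score, recall_score

# 1 = Goat, 0 = Not Goat
y_true = [1] * 30 + [0] * 4 + [1] * 6 + [0] * 20
y_pred = [1] * 30 + [1] * 4 + [0] * 6 + [0] * 20

print(precision_score(y_true, y_pred))  # 30 / (30 + 4) = 0.882
print(recall_score(y_true, y_pred))     # 30 / (30 + 6) = 0.833
```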
65
4.1.5 Model evaluation
Metrics for classification – Specificity
66
4.1.6 Model evaluation
Metrics for classification – The problem for imbalanced datasets
Example 2
- We have a dataset with 10000 images. Only 10 have goats.
- What is the accuracy of the model?

                          True Class
                          Goat      Not Goat
Predicted   Goat          1         0
Class       Not Goat      9         9990

$Accuracy = \frac{1 + 9990}{1 + 9990 + 0 + 9} = \frac{9991}{10000} = 0.9991$

This seems like an almost perfect model!
But in reality, the model is not able to identify goats well, which is our main goal!
The recall is equal to 0.1!
The problem of imbalanced datasets
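A minimal sketch reproducing these numbers, where accuracy looks excellent but recall exposes the problem (labels rebuilt from the counts on the slide):

```python
from sklearn.metrics import accuracy_score, recall_score

# 10 goat images (1) and 9990 non-goat images (0); the model finds only 1 goat
y_true = [1] * 10 + [0] * 9990
y_pred = [1] * 1 + [0] * 9 + [0] * 9990

print(accuracy_score(y_true, y_pred))  # 0.9991 – looks almost perfect
print(recall_score(y_true, y_pred))    # 0.1   – only 1 of the 10 goats is found
```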
67
4.1.6 Model evaluation
Metrics for classification – The problem for imbalanced datasets
68
4.1.7 Model evaluation
Metrics for classification – Wrap up (some hints and examples)
- F1 Score is a good measure if you seek a balance between precision and recall
- Accuracy is a good measure only if the cost of False Positive and False Negative is equal and the dataset is balanced
69
4.1.8 Model evaluation
Cutoff for classification
- Most predictive algorithms classify via a 2-step process. For each record:
  1. Compute a score or probability of belonging to the positive class
  2. Compare it to a cutoff value and assign the record to a class accordingly
- We can use different cutoff values (typically, error rate is lowest for cutoff = 0.5)
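A minimal sketch of the two-step process with different cutoffs applied to predicted probabilities (model and data are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]        # step 1: probability of the positive class

for cutoff in (0.3, 0.5, 0.7):                   # step 2: compare to a cutoff value
    y_pred = (proba >= cutoff).astype(int)
    print(cutoff, accuracy_score(y_test, y_pred), recall_score(y_test, y_pred))
```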
70
4.1.8 Model evaluation
Metrics for classification – ROC Curve
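A minimal sketch of computing the ROC curve (one TPR / FPR point per cutoff) and its AUC with scikit-learn, again on illustrative data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, proba)  # one point per cutoff value
print(roc_auc_score(y_test, proba))              # area under the ROC curve
```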
71
4.2 Model evaluation
Metrics for Regression
72
4.2 Model evaluation
Metrics for Regression
73
4.2 Model evaluation
Metrics for Regression
Model 1
y    ŷ    |y − ŷ|
3    4    1
6    5    1
4    6    2
8    7    1
12   11   1
5    5    0

Model 2
y    ŷ    |y − ŷ|
3    4    1
6    5    1
4    6    2
8    7    1
12   11   1
5    14   9
74
4.2.1 Model evaluation
Metrics for Regression – Mean Absolute Error (MAE)
$MAE(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$

Model 1
y    ŷ    |y − ŷ|
3    4    1
6    5    1
4    6    2
8    7    1
12   11   1
5    5    0

Model 2
y    ŷ    |y − ŷ|
3    4    1
6    5    1
4    6    2
8    7    1
12   11   1
5    14   9

Model 1: $MAE = \frac{1+1+2+1+1+0}{6} = 1$
Model 2: $MAE = \frac{1+1+2+1+1+9}{6} = 2.5$

This measures the average absolute distance between the real data and the predicted data, but it fails to punish large errors in prediction.
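A minimal sketch reproducing these MAE values with scikit-learn:

```python
from sklearn.metrics import mean_absolute_error

y_true = [3, 6, 4, 8, 12, 5]
pred_model1 = [4, 5, 6, 7, 11, 5]
pred_model2 = [4, 5, 6, 7, 11, 14]

print(mean_absolute_error(y_true, pred_model1))  # 1.0
print(mean_absolute_error(y_true, pred_model2))  # 2.5
```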
75
4.2.2 Model evaluation
Metrics for Regression – Mean Squared Error (MSE)
$MSE(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Model 1
y    ŷ    (y − ŷ)²
3    4    1
6    5    1
4    6    4
8    7    1
12   11   1
5    5    0

Model 2
y    ŷ    (y − ŷ)²
3    4    1
6    5    1
4    6    4
8    7    1
12   11   1
5    14   81

Model 1: $MSE = \frac{1+1+4+1+1+0}{6} = 1.33$
Model 2: $MSE = \frac{1+1+4+1+1+81}{6} = 14.83$

This measures the average squared distance between the real data and the predicted data. Here, larger errors are well noted (better than MAE).

$RMSE(y, \hat{y}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
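A minimal sketch reproducing the MSE values, with the corresponding RMSE obtained by taking the square root:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = [3, 6, 4, 8, 12, 5]
pred_model1 = [4, 5, 6, 7, 11, 5]
pred_model2 = [4, 5, 6, 7, 11, 14]

for name, pred in [("Model 1", pred_model1), ("Model 2", pred_model2)]:
    mse = mean_squared_error(y_true, pred)
    print(name, round(mse, 2), round(np.sqrt(mse), 2))  # MSE 1.33 and 14.83
```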
76
4.2.4 Model evaluation
Metrics for Regression - R Square (R2 ) / Coefficient of determination
$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$

Model 1 ($\bar{y} = 6.33$)
y    ŷ    (y − ŷ)²   (y − ȳ)²
3    4    1          11.09
6    5    1          0.11
4    6    4          5.43
8    7    1          2.79
12   11   1          32.15
5    5    0          1.77
SUM        8          53.34

Model 1: $R^2 = 1 - \frac{8}{53.34} = 0.85$

Interpretation: 85 % of the variance in the dependent variable is explained by the model
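A minimal sketch reproducing this R² with scikit-learn:

```python
from sklearn.metrics import r2_score

y_true = [3, 6, 4, 8, 12, 5]
pred_model1 = [4, 5, 6, 7, 11, 5]

print(r2_score(y_true, pred_model1))  # ~0.85
```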
77
4.2.4 Model evaluation
Metrics for Regression – Adjusted R-Square (Adjusted R2 )
$Adjusted\ R^2(y, \hat{y}) = 1 - \frac{(1 - R^2)(N - 1)}{N - p - 1}$

where $R^2$ is the R-Squared, $N$ is the sample size (number of observations) and $p$ is the number of predictors (independent variables).
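A minimal sketch of the adjusted R² formula (R² = 0.85 and N = 6 come from the previous slide; p = 2 predictors is an assumed value for illustration):

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2)(N - 1) / (N - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# R^2 and N from the worked example; p = 2 is only an assumption
print(adjusted_r2(0.85, 6, 2))  # 0.75
```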
78
4.2.4 Model evaluation
Metrics for Regression – Wrap Up
- MSE / RMSE has the benefit of penalizing large errors more, so it can be more appropriate in some cases, for example when large errors are particularly costly
- If being off by 10 is just twice as bad as being off by 5, then use MAE
- Use adjusted R-Squared when comparing the performance of models where the number of predictors is different
79
Thank you!
80