
Master in Information Management

2022/23

03
What is a model?
Evaluation methods and metrics

1
AGENDA

Supervised Learning

What is a model?

Evaluation Methods

Evaluation Metrics

2
1 Supervised Learning

What is
Supervised Learning?
Machine learning techniques that use labeled datasets to train algorithms, with the purpose of classifying data or predicting outcomes.

3
1 Supervised Learning

Supervised Learning

Regression vs Classification

Regression: the outcome variable is numerical.
Classification: the outcome variable is categorical.

4
1.1 Methodologies in Data Science
Regression

Predicting the price of a house

Area | Typology | Localization | Garage | Pollution Index | Price
75   | T2 | Alfama    | No  | 1 | 234 000
86   | T1 | Campolide | Yes | 2 | 210 000
93   | T2 | Santos    | No  | 2 | 254 000
134  | T4 | Arroios   | No  | 3 | 457 000
96   | T3 | Alfama    | No  | 1 | 367 000
54   | T2 | Alcântara | No  | 2 | 219 000
167  | T5 | Restelo   | Yes | 2 | 675 000

95   | T2 | Arroios   | Yes | 3 | ?

5
1.1 Methodologies in Data Science
Regression

Predicting stock market values

Day | Company | GDP acceleration | Inflation | … | Value
01  | VLU | 0.358  | -1.059 | … | 20.17
01  | RIT | 0.121  | 0.205  | … | 125.23
01  | XEZ | -0.121 | -0.075 | … | 98.76
02  | VLU | -0.256 | 0.172  | … | 20.19
02  | RIT | 0.232  | 1.212  | … | 145.87
02  | XEZ | 0.867  | -0.754 | … | 95.23

03  | VLU | …      | …      | … | ?

Which variables proxy for these changes remains unknown…

6
1.2 Methodologies in Data Science
Classification

Predicting the probability of someone having diabetes

ID    | Gender | Weight | Height | Pregnancies | Diabetes
12562 | M | 78  | 175 | 0 | 0
62735 | F | 66  | 155 | 3 | 1
84638 | F | 91  | 165 | 1 | 1
1232  | M | 89  | 187 | 0 | 0
13738 | M | 101 | 172 | 0 | 1
3274  | M | 81  | 179 | 0 | 0
3283  | F | 72  | 169 | 0 | 0

3627  | F | 93  | 185 | 0 | ?

7
1.2 Methodologies in Data Science
Classification

Predict which animal is in a photo


Figure  | Channel | Pixel_1 | Pixel_2 | Pixel_3 | … | Animal
Fig_1   | Red     | 126 | 134 | 136 | … |
Fig_1   | Green   | 222 | 211 | 207 | … | Cat
Fig_1   | Blue    | 198 | 198 | 196 | … |
Fig_2   | Red     | 35  | 31  | 31  | … |
Fig_2   | Green   | 89  | 88  | 87  | … | Goat
Fig_2   | Blue    | 145 | 147 | 149 | … |
…       | …       | …   | …   | …   | … | …

Figure  | Channel | Pixel_1 | Pixel_2 | Pixel_3 | … | Animal
Fig_156 | Red     | 234 | 236 | 237 | … |
Fig_156 | Green   | 176 | 176 | 175 | … | ?
Fig_156 | Blue    | 98  | 97  | 98  | … |
8
1.3 Methodologies in Data Science
Classification + Regression

Can we have both regression and classification in the same model?

9
1.3 Methodologies in Data Science
Classification + Regression

What do CAPTCHA tests have to do with self-driving cars?

10
1.3 Methodologies in Data Science
Classification + Regression

Classification (Class: Truck)

Regression (Coordinates: x1, y1, x2, y2)


11
2 What are models?

What are models?


The output of machine learning algorithms run on data.

12
2.1 Models
Developing a model

- As in any other computer task, modeling requires a “program” providing detailed instructions

- These instructions are typically mathematical equations, which characterize the relationship between inputs and outputs

- Formulating these equations is the central problem in modeling

13
2.2 Models
Types of models

MODELS
How can we define these instructions and equations?

Fixed Models → Parametric Models → Non-Parametric Models
(model driven → data driven)

14
2.2.1 Models
Fixed models

- Closed-form equations that define how the outputs are derived from the inputs
- Because all the characteristics are fixed when the equations are derived, we refer to them as fixed models
- Suitable for simple and fully understood problems

Example:
Compute how long it takes an apple to hit the ground on Earth:

t = √(2h / 9.8)

Most problems are too complex and/or not sufficiently understood for us to use fixed models.
15
2.2.2 Models
Parametric models

- You know enough about the problem to define the general relationship between inputs and outputs
- But some parameters are unspecified – those are chosen by examining data examples

Example:
Compute how long it takes an apple to hit the ground on Mars:

t = √(2h / g)

where g is the rate of acceleration due to gravity on Mars.

You need to estimate the parameter g!


16
2.2.2 Models
Parametric models

You need to estimate the parameter g!

Go to Mars and collect some data:

Height (h) | Falling time (t)
0.5 | 0.2
1.3 | 0.4
2.8 | 0.46
4   | 0.68
7.3 | 0.7

INPUTS → OUTPUTS (in this case the input vector has only one component)
17
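As an illustration only (not part of the slides), here is a minimal Python sketch of the parameter search described on the next slide: try candidate values of g and keep the one whose predicted falling times t = √(2h/g) are closest to the measured ones from the table above.

```python
import numpy as np

# (h, t) pairs taken from the slide's table
h = np.array([0.5, 1.3, 2.8, 4.0, 7.3])
t = np.array([0.20, 0.40, 0.46, 0.68, 0.70])

# Grid search over candidate values of g: keep the one that minimizes the
# mean squared difference between predicted and measured falling times.
candidates = np.linspace(1.0, 40.0, 4000)
errors = np.array([np.mean((np.sqrt(2 * h / g) - t) ** 2) for g in candidates])
g_best = candidates[errors.argmin()]
print(f"estimated g ≈ {g_best:.2f}")  # no candidate fits all points exactly (measurement noise)
```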
2.2.2 Models
Parametric models

A value should be selected for g so the model produces estimates close to the measured times when presented with the corresponding heights as input.

Search for the parameter that leads to the predictions that are closest to the observed data.
(In the figure, one candidate value of g seems to be the best model; the predictions never match the data exactly, due to inaccuracies in the structure of the model or measurement errors.)

In this case it is easy to define a good value for g, but this is not generally possible in most problems, which involve complex relationships and multiple variables.
18
2.2.2 Models
Parametric models

- A common example of a parametric model is linear regression.
- The assumption is that there is a linear relationship between inputs and outputs:

yᵢ = β₀ + β₁xᵢ₁ + … + βₖxᵢₖ + εᵢ ,  i = 1, …, n

19
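A minimal sketch (not from the slides) of fitting the β parameters of a linear regression with scikit-learn; the dataset and coefficient values below are made up purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # 100 examples, k = 3 input variables
# synthetic outputs with β0 = 2.0, β1 = 1.5, β2 = -0.5, β3 = 0, plus noise ε
y = 2.0 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)   # estimates β0 … βk from the examples
print("intercept (β0):", model.intercept_)
print("coefficients (β1..βk):", model.coef_)
```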
2.2.3 Models
Non-Parametric models

- Models that rely heavily on data, instead of human expertise, can be called data-driven models

- Used in complex problems where the relationship between inputs and outputs is not known

- Require large amounts of data

20
2.2.3 Models
Non-Parametric models

- The basic premise is that relations consistently occurring in the dataset will recur in future observations

- One important benefit is that non-parametric methods do not require a thorough understanding of the problem

- Models can be arbitrarily complex

- Closer to inductive reasoning

21
2.2.3 Models
Non-Parametric models

- In real use-cases, most problems are too complex to use a fixed or parametric model

- Thus, non-parametric models are the most used

- However, you should always start with the simplest model and only increase complexity when the problem requires it!

22
2.2.4 Models
Parametric vs Non-Parametric models

Inductive vs Deductive Reasoning

Deductive: from the general rule to particular examples
Inductive: from particular examples to the general rule

23
2.2.4 Models
Parametric vs Non-Parametric models

Parametric model

“A learning model that summarizes data with a set of parameters of fixed size (independent of the
number of training examples) is called a parametric model. No matter how much data you throw at a
parametric model, it won’t change its mind about how many parameters it needs.”

Nonparametric model

“Nonparametric methods are good when you have a lot of data and no prior knowledge, and when
you don’t want to worry too much about choosing just the right features.”

Russell, Stuart J. Artificial Intelligence: A Modern Approach. Upper Saddle River, N.J.: Prentice Hall, 2010.

24
2.2.4 Models
Parametric vs Non-Parametric models

Parametric models | assume known distributions in the data

Examples: Linear Regression, Logistic Regression, Perceptron, Naïve Bayes

Advantages:
- Simplicity – easier to understand and interpret
- Training speed – faster to train and learn from data
- Less training data – do not require a huge quantity of data for training and work well even if the fit to the data is not perfect

Limitations:
- Constrained – limited by the functional form chosen
- Limited complexity – better suited to less complex problems
- Poor fit – in practice the methods are unlikely to match the underlying mapping function, so they do not offer the best fit to the data
- In pre-processing – the analyst often spends considerable time transforming data so it follows some specific distribution (for example the normal distribution)
25
2.2.4 Models
Parametric vs Non-Parametric models

Nonparametric models | do not assume distributions in the data

Examples: KNN, Decision Trees, Non-linear SVM

Advantages:
- Flexibility – capable of fitting a large number of functional forms
- Power – no assumptions (or weak ones) about the underlying function; they learn from data
- Performance – can result in higher accuracy since they offer a better fit
- In pre-processing – there are no distribution assumptions, so time can be saved in preprocessing steps (e.g. no need to transform the data to a normal distribution)

Limitations:
- Training data – require more data than parametric models
- Slower – slower due to the several parameters needed to train
- Overfitting – higher risk of overfitting the training data
- Lack of interpretability – harder to explain why specific predictions are made in some of the algorithms
26
2.2.5 Models
Models Recap

Fixed Models → Parametric Models → Non-Parametric Models
(human expertise decreases, model complexity increases)


27
2.3 Models
When building non-parametric models…

- Which is the best algorithm to use?
  - The concept of the no-free-lunch theorem
- What is the right number of attributes to use?
  - The concept of complexity
- Is this problem a separable one?
  - The concept of separability
- Too much learning?
  - The concept of overfitting

28
2.3.1 Models
The no free lunch theorem

“If an algorithm performs better than random search on some class of problems
then it must perform worse than random search on the remaining problems.”

David Wolpert and William G. Macready


In: No Free Lunch Theorems for Optimization
https://ti.arc.nasa.gov/m/profile/dhw/papers/78.pdf

- No single machine learning algorithm is universally the best-performing algorithm for all problems!
- You need to understand the trade-offs

29
2.3.2 Models
The problem of complexity

- There are problems that are too complex to deal with

- Sometimes it is not feasible to collect enough data and to allow enough processing time to generate an optimal model

- Choosing the right features is a way of simplifying the problem and reducing the “search space”

- Important attributes vs redundant attributes: which are the variables that are relevant for the task?

30
2.3.2 Models
The problem of complexity

What is the right number of features to use?


Small number of attributes
- May be unable to distinguish between classes

Large number of attributes
- The most usual case
- The curse of dimensionality
  - As the number of dimensions increases, the space gets sparser and finding groups is more difficult
  - Hard to visualize, and some “strange” effects appear

31
2.3.3 Models
The concept of separability

Linearly Separable | Non-Linearly Separable | Inseparable

Linearly and non-linearly separable problems: it is possible to have no errors.
Inseparable problems: there will always be errors.

32
2.3.4 Models
The concept of overfitting

Models rely on training data to learn

If we allow too much complexity, the model will “memorize” the training data,
instead of extracting useful relationships

OVERFITTING

33
2.3.4 Models
The concept of overfitting

Memorizing vs Understanding

- Overfitting is like when someone memorizes exercises to pass an exam
- They will be biased towards the exercises they saw in class
- If they get a slightly different question in the exam, they won't know how to answer it

34
2.3.4 Models
The concept of overfitting

Underfitting vs Overfitting (in a classification problem)

Underfitting (too simple to explain the variance) | Appropriate fit | Overfitting (forcing the fit – too good to be true)

35
2.3.4 Models
The concept of overfitting

Underfitting vs Overfitting (in a regression problem)

Underfitting (too simple to explain the variance) | Appropriate fit | Overfitting (forcing the fit – too good to be true)

36
2.3.4 Models
The concept of overfitting

How to avoid overfitting?

How to prepare for the unknown?


- Keep some data aside!

DATA → TRAIN | VALIDATION | TEST

TRAIN: train the model
VALIDATION: choose the model's optimal training
TEST: estimate the model's performance

37
2.3.4 Models
The concept of overfitting

How to avoid overfitting?

(Figure: training error keeps decreasing as model complexity grows, while validation error reaches a minimum at the best complexity and then increases.)

38
2.3.4 Models
The concept of overfitting

How to split your data?

TRAINING SET – the bigger, the better the classifier.

VALIDATION SET – the bigger, the better the estimate of the optimal training.

TEST SET – the bigger, the better the estimate of the performance of the classifier on unseen data.

39
3 Model evaluation

What is model evaluation?


The evaluation of the performance of one or more predictive models, based on the use of a validation or test dataset.
Measuring how well the model performs a certain task.

40
3 Model evaluation

Two purposes: tuning and optimization, and estimating future performance.

41
3 Model evaluation

We can estimate and evaluate the performance of the model on the training data
- However, estimates based only on the training data are not good indicators of the performance of the model on unseen data
- New data will probably not be exactly the same as the training data – these estimates suffer from overfitting

This is not ideal!

42
3.1 Model evaluation
The hold-out method

One possible solution:

- Use an independent validation set
- Used when there is plenty of data

Dataset → Training set + Validation set

RULE OF THUMB
- 70 % for the training set
- 30 % for the validation set

(Some authors advocate 80% train and 20% validation – it depends on the size of the dataset)

43
3.1 Model evaluation
The hold-out method

When there is plenty of data we can split our dataset into training, validation and test sets:

Dataset → Training set + Validation set + Test set

RULE OF THUMB
- 70 % for the training set
- 15 % for the validation set
- 15 % for the test set

44
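A minimal sketch (not from the slides) of the hold-out split with scikit-learn, following the 70/15/15 rule of thumb above; the dataset here is a synthetic stand-in generated just for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)  # stand-in for your dataset

# Hold out 30% first, then split that half-and-half -> roughly 70 / 15 / 15.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```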
3.1 Model evaluation
The hold-out method

Question: What if your dataset is imbalanced?

Answer: Then, for some runs, certain classes may not be (well) represented. We need to ensure that the data partition does not change the original proportions of each class in the new datasets.

Stratified sampling
Sample each class independently, such that each dataset has the same ratio of any given class.

Dataset: 27 % Class 0, 73 % Class 1

Training set: 27 % Class 0, 73 % Class 1
Validation set: 27 % Class 0, 73 % Class 1
Test set: 27 % Class 0, 73 % Class 1
45
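A hedged sketch of stratified hold-out with scikit-learn (not from the slides): passing the labels to `stratify` keeps the 27% / 73% class proportions in every partition. The imbalanced dataset below is synthetic, just to make the example runnable.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# stand-in dataset with roughly 27% of class 0 and 73% of class 1
X, y = make_classification(n_samples=1000, weights=[0.27, 0.73], random_state=42)

# stratify=... keeps the original class proportions in each split
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)
```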
3.1 Model evaluation
The hold-out method

Repeated hold-out method

PROBLEMS OF THE HOLD-OUT METHOD
- It only provides a single estimate
  - SOLUTION: use a repeated hold-out method
- It requires a large enough dataset
  - SOLUTION: use K-Fold cross-validation

Repeated hold-out:
1. Run the hold-out method many times
2. Each run will have a different train and validation set
3. Each run will lead to slightly different metric values
- The average value allows a more robust estimate
- Compute the standard deviation to estimate variability

Still not optimal: the different test sets overlap, but we would like all instances of the data to be tested at least once.
Can we prevent overlapping?
46
3.1 Model evaluation
The hold-out method

PROBLEMS OF THE HOLD-OUT METHOD

- It only provides a single estimate

- SOLUTION : Use a repeated hold-out method

- It requires a large enough dataset

- SOLUTION : Use K-Fold cross-validation

47
3.2 Model evaluation
K-Fold Cross Validation

Step 1 – Split the data into k subsets of equal size
Step 2 – In each iteration, use one subset for testing / validation and the remainder for training
Step 3 – Average all estimates to get a final value

1st iteration: TEST  TRAIN TRAIN TRAIN TRAIN
2nd iteration: TRAIN TEST  TRAIN TRAIN TRAIN
3rd iteration: TRAIN TRAIN TEST  TRAIN TRAIN
4th iteration: TRAIN TRAIN TRAIN TEST  TRAIN
5th iteration: TRAIN TRAIN TRAIN TRAIN TEST

48
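A minimal sketch (assumed, not from the slides) of 5-fold stratified cross-validation with scikit-learn; the classifier and the synthetic dataset are placeholders chosen only to make the example runnable.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # stand-in dataset
clf = DecisionTreeClassifier(random_state=0)               # stand-in model

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # stratified folds reduce variance
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")  # one estimate per fold

print(scores.mean(), scores.std())  # average estimate and its variability
```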
3.2 Model evaluation
K-Fold Cross Validation

SOME TIPS

- Use a stratified sample (it reduces the variance)

- 10 folds is by far the most used

- Extensive experiments have shown that this is the best choice to get an accurate estimate

- Smaller datasets can demand a small number of folds

- Big datasets can make use of more folds (and in that way get a more robust estimate)

49
3.3 Model evaluation
Other options

- Use a repeated stratified cross-validation to further reduce variance

- Leave–One–Out Cross-Validation:

- CV with as many folds as instances

- Very expensive computationally

- Makes best use of the data

THE BEST METHOD DEPENDS ON EACH CASE !

50
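Leave-one-out cross-validation can be written the same way, a short sketch with a placeholder dataset and classifier (not from the slides):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, random_state=0)  # small stand-in dataset
clf = KNeighborsClassifier(n_neighbors=3)                  # stand-in model

# As many folds as instances: each example is used exactly once as the test set.
scores = cross_val_score(clf, X, y, cv=LeaveOneOut(), scoring="accuracy")
print(scores.mean())  # each fold scores 0 or 1, so the mean is the accuracy estimate
```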
3.4 Model evaluation
Compare two models

Step 1 Choose your metric


Step 2 Choose an evaluation method
Step 3 Run the evaluation for both models and collect metrics
Step 4 Compare metrics
Step 5 Pick the model with the best performance

How to be sure one model is always better than the other?


Using different datasets will lead to slightly different results

Training and validating on different splits and computing the standard deviation will help you
(you can even go further and use a hypothesis test to support or reject a measurable hypothesis)

51
3.4 Model evaluation
Compare two models

What is the best model?

- The simplest representation of knowledge
  - Easier to understand
  - Decision Trees vs Neural Networks
- The knowledge representation with the least error
- The simplest one…
  - Occam’s Razor – “the simplest explanation is usually the right one”

52
3.4 Model evaluation
Wrap-up

- Use test sets and the hold-out method for “large” data

- Use the cross-validation method for middle-sized data

- Use the leave-one-out method for small data

- Take advantage of stratified sampling

- Don’t use test data for parameter tuning – use separate validation data

53
4 Model evaluation

What are evaluation metrics?


An evaluation metric quantifies the performance of a predictive model

54
4.1 Model evaluation
Metrics for classification

Metrics for classification

Accuracy Error Rate Precision Recall Specificity …

55
4.1 Model evaluation
Metrics for classification

It all starts with …

                        True Class
                        Positive | Negative
Predicted   Positive |    TP     |    FP      (FP: Type I Error)
Class       Negative |    FN     |    TN      (FN: Type II Error)

In classification, predictions are either correct or wrong.
We can encode this in a confusion matrix.

56
4.1 Model evaluation
Metrics for classification

                     True Class
                     Goat | Not Goat
Predicted  Goat     |  1  |
Class      Not Goat |     |

Predicted Class: Goat
True Class: Goat

TRUE POSITIVE

57
4.1 Model evaluation
Metrics for classification

                     True Class
                     Goat | Not Goat
Predicted  Goat     |  1  |   1
Class      Not Goat |     |

Predicted Class: Goat
True Class: Not Goat

FALSE POSITIVE
58
4.1 Model evaluation
Metrics for classification

                     True Class
                     Goat | Not Goat
Predicted  Goat     |  1  |   1
Class      Not Goat |  1  |

Predicted Class: Not Goat
True Class: Goat

FALSE NEGATIVE
59
4.1 Model evaluation
Metrics for classification

                     True Class
                     Goat | Not Goat
Predicted  Goat     |  1  |   1
Class      Not Goat |  1  |   1

Predicted Class: Not Goat
True Class: Not Goat

TRUE NEGATIVE
60
4.1 Model evaluation
Metrics for classification

                     True Class
                     Goat | Not Goat
Predicted  Goat     |  30 |   4
Class      Not Goat |   6 |  20

...

61
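A short sketch (not from the slides) of building such a matrix with scikit-learn; note that `confusion_matrix` puts true classes in rows and predicted classes in columns, i.e. transposed relative to the slide's layout. The labels below are toy values just to illustrate the call.

```python
from sklearn.metrics import confusion_matrix

# toy labels: 1 = Goat, 0 = Not Goat
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm)  # rows = true class, columns = predicted class
```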
4.1.1 Model evaluation
Metrics for classification - Accuracy

Accuracy
Proportion of events correctly identified (positive or negative) out of all events

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Overall, how often is the classifier correct?

In our example (TP = 30, TN = 20, FP = 4, FN = 6):

Accuracy = (30 + 20) / (30 + 20 + 4 + 6) = 50 / 60 = 0.83

Seems good, right? Well… not always! (check slides 67 and 68)

62
4.1.2 Model evaluation
Metrics for classification – Error Rate

Error Rate / Misclassification Rate
Proportion of events incorrectly identified (positive or negative) out of all events

Error Rate = (FP + FN) / (TP + TN + FP + FN)

Overall, how often is the classifier wrong?

In our example (TP = 30, TN = 20, FP = 4, FN = 6):

Error Rate = (4 + 6) / (30 + 20 + 4 + 6) = 10 / 60 = 0.167

The lower the better!
63
4.1.3 Model evaluation
Metrics for classification – Precision

Precision
Proportion of correctly identified positive events out of all events identified as positive

Precision = TP / (TP + FP)

When it predicts positive, how often is the classifier correct?

In our example (TP = 30, FP = 4):

Precision = 30 / (30 + 4) = 30 / 34 = 0.882

- Email spam detection: a false positive means that an e-mail that is not spam has been identified as spam. The user might lose important emails if the precision is not high.
64
4.1.4 Model evaluation
Metrics for classification – Recall

Recall / Sensitivity / True Positive Rate (TPR)
Proportion of events identified as positive out of all positive events

Recall = TP / (TP + FN)

How well was the positive class predicted? Or…
When it's actually positive, how often does the classifier predict positive?

In our example (TP = 30, FN = 6):

Recall = 30 / (30 + 6) = 30 / 36 = 0.833

- Sick patient detection: if a sick patient (actual positive) goes through the test and is predicted as not sick (false negative), the cost will be extremely high if the sickness is contagious.
65
4.1.5 Model evaluation
Metrics for classification – Specificity

Specificity / True Negative Rate (TNR)
Proportion of events identified as negative out of all negative events

Specificity = TN / (FP + TN)

How well was the negative class predicted? Or…
When it's actually negative, how often does the classifier predict negative?

In our example (TN = 20, FP = 4):

Specificity = 20 / (4 + 20) = 20 / 24 = 0.833
66
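Putting the last few slides together, a small sketch computing every metric directly from the goat confusion matrix (TP = 30, FP = 4, FN = 6, TN = 20); the results reproduce the values above.

```python
TP, FP, FN, TN = 30, 4, 6, 20

accuracy    = (TP + TN) / (TP + TN + FP + FN)  # 0.833
error_rate  = (FP + FN) / (TP + TN + FP + FN)  # 0.167
precision   = TP / (TP + FP)                   # 0.882
recall      = TP / (TP + FN)                   # 0.833  (sensitivity / TPR)
specificity = TN / (FP + TN)                   # 0.833  (TNR)

print(accuracy, error_rate, precision, recall, specificity)
```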
4.1.6 Model evaluation
Metrics for classification – The problem for imbalanced datasets

Example 2
- We have a dataset with 10000 images. Only 10 have goats.
- What is the accuracy of the model?

                     True Class
                     Goat | Not Goat
Predicted  Goat     |  1  |    0
Class      Not Goat |  9  |  9990

Accuracy = (1 + 9990) / (1 + 9990 + 0 + 9) = 9991 / 10000 = 0.9991

This seems like an almost perfect model!
But in reality, the model is not able to identify goats well, which is our main goal!
The recall is equal to 0.1!

The problem of imbalanced datasets
67
4.1.6 Model evaluation
Metrics for classification – The problem for imbalanced datasets

                     True Class
                     Goat | Not Goat
Predicted  Goat     |  1  |    0
Class      Not Goat |  9  |  9990

Metrics for imbalanced datasets

Balanced Accuracy = (Sensitivity + Specificity) / 2
                  = ( TP/(TP+FN) + TN/(FP+TN) ) / 2
                  = ( 1/10 + 9990/9990 ) / 2 = 0.55

F1 Score = 2 × (precision × recall) / (precision + recall)
         = 2 × (1 × 0.1) / (1 + 0.1) = 0.18
68
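The same two computations as a small code sketch, using the imbalanced goat example (TP = 1, FP = 0, FN = 9, TN = 9990):

```python
TP, FP, FN, TN = 1, 0, 9, 9990

precision   = TP / (TP + FP)   # 1.0
recall      = TP / (TP + FN)   # 0.1  (sensitivity)
specificity = TN / (FP + TN)   # 1.0

balanced_accuracy = (recall + specificity) / 2              # 0.55
f1 = 2 * precision * recall / (precision + recall)          # 0.18

print(balanced_accuracy, f1)
```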
4.1.7 Model evaluation
Metrics for classification – Wrap up (some hints and examples)

- Precision is a good measure when the cost of False Positive is high

- Recall is a good measure when the cost of False Negative is high

- F1 Score is a good measure if you seek a balance between precision and recall

- Accuracy is a good measure only if the cost of False Positive and False Negative is equal and the dataset is balanced

69
4.1.8 Model evaluation
Cutoff for classification

- Most predictive algorithms classify via a 2-step process. For each record:

- Compute probability of belonging to class “1”

- Compare to cutoff value, and classify accordingly

- Default cutoff value is 0.5

- If >= 0.5, then classify as “1”

- If <0.5, classify as “0”

- We can use different cutoff values (typically, error rate is lowest for cutoff = 0.5)

70
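A hedged sketch of this two-step process with scikit-learn (not from the slides): the classifier and synthetic data are placeholders, and the 0.5 default can be replaced by any other cutoff.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# toy data just to illustrate the two-step process
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)

proba = clf.predict_proba(X_val)[:, 1]   # step 1: probability of belonging to class "1"
cutoff = 0.5                             # step 2: compare to the cutoff value
y_pred = (proba >= cutoff).astype(int)   # >= cutoff -> "1", otherwise "0"
```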
4.1.8 Model evaluation
Metrics for classification – ROC Curve

Area under the ROC Curve (“AUC”) is a useful metric

(Figure: ROC curve plotting the true positive rate against the false positive rate, together with a rule of thumb for interpreting AUC values.)
71
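A short sketch of computing AUC from predicted probabilities with scikit-learn; the labels and probabilities below are toy values chosen only to make the call concrete.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# toy example: true labels and predicted probabilities of class "1"
y_true = [0, 0, 1, 1, 0, 1]
proba  = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

auc = roc_auc_score(y_true, proba)               # area under the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, proba)  # one (FPR, TPR) point per cutoff
print(f"AUC = {auc:.2f}")
```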
4.2 Model evaluation
Metrics for Regression

Metrics for regression

MAE MSE RMSE R2 …

72
4.2 Model evaluation
Metrics for Regression

What about when the label is continuous?

We can no longer compute the confusion matrix…

Measure how far the prediction (ŷ) is from the true value (y)

73
4.2 Model evaluation
Metrics for Regression

Model 1                    Model 2
y  | ŷ  | |y − ŷ|          y  | ŷ  | |y − ŷ|
3  | 4  | 1                3  | 4  | 1
6  | 5  | 1                6  | 5  | 1
4  | 6  | 2                4  | 6  | 2
8  | 7  | 1                8  | 7  | 1
12 | 11 | 1                12 | 11 | 1
5  | 5  | 0                5  | 14 | 9
74
4.2.1 Model evaluation
Metrics for Regression – Mean Absolute Error (MAE)

MAE(y, ŷ) = (1/n) Σᵢ |yᵢ − ŷᵢ|

Model 1                    Model 2
y  | ŷ  | |y − ŷ|          y  | ŷ  | |y − ŷ|
3  | 4  | 1                3  | 4  | 1
6  | 5  | 1                6  | 5  | 1
4  | 6  | 2                4  | 6  | 2
8  | 7  | 1                8  | 7  | 1
12 | 11 | 1                12 | 11 | 1
5  | 5  | 0                5  | 14 | 9

Model 1: MAE = (1 + 1 + 2 + 1 + 1 + 0) / 6 = 1
Model 2: MAE = (1 + 1 + 2 + 1 + 1 + 9) / 6 = 2.5

MAE measures the average absolute distance between the real data and the predicted data, but it fails to punish large errors in prediction.
75
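The same calculation in code, a small sketch reproducing the slide's numbers (and the equivalent scikit-learn call):

```python
from sklearn.metrics import mean_absolute_error

y_true  = [3, 6, 4, 8, 12, 5]
y_pred1 = [4, 5, 6, 7, 11, 5]    # Model 1
y_pred2 = [4, 5, 6, 7, 11, 14]   # Model 2

mae1 = sum(abs(t - p) for t, p in zip(y_true, y_pred1)) / len(y_true)  # 1.0
mae2 = mean_absolute_error(y_true, y_pred2)                            # 2.5
print(mae1, mae2)
```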
4.2.2 Model evaluation
Metrics for Regression – Mean Squared Error (MSE)

MSE(y, ŷ) = (1/n) Σᵢ (yᵢ − ŷᵢ)²

Model 1                    Model 2
y  | ŷ  | (y − ŷ)²         y  | ŷ  | (y − ŷ)²
3  | 4  | 1                3  | 4  | 1
6  | 5  | 1                6  | 5  | 1
4  | 6  | 4                4  | 6  | 4
8  | 7  | 1                8  | 7  | 1
12 | 11 | 1                12 | 11 | 1
5  | 5  | 0                5  | 14 | 81

Model 1: MSE = (1 + 1 + 4 + 1 + 1 + 0) / 6 = 1.33
Model 2: MSE = (1 + 1 + 4 + 1 + 1 + 81) / 6 = 14.83

MSE measures the average squared distance between the real data and the predicted data. Here, larger errors are well noted (better than MAE).

RMSE(y, ŷ) = √( (1/n) Σᵢ (yᵢ − ŷᵢ)² )
76
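And the same for MSE / RMSE, a sketch reproducing the values above:

```python
from sklearn.metrics import mean_squared_error

y_true  = [3, 6, 4, 8, 12, 5]
y_pred1 = [4, 5, 6, 7, 11, 5]    # Model 1
y_pred2 = [4, 5, 6, 7, 11, 14]   # Model 2

mse1 = mean_squared_error(y_true, y_pred1)   # 1.33
mse2 = mean_squared_error(y_true, y_pred2)   # 14.83
rmse2 = mse2 ** 0.5                          # root mean squared error
print(mse1, mse2, rmse2)
```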
4.2.4 Model evaluation
Metrics for Regression - R Square (R2 ) / Coefficient of determination

R²(y, ŷ) = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)²

Model 1 (ȳ = 6.33)

y   | ŷ  | (y − ŷ)² | (y − ȳ)²
3   | 4  | 1        | 11.09
6   | 5  | 1        | 0.11
4   | 6  | 4        | 5.43
8   | 7  | 1        | 2.79
12  | 11 | 1        | 32.15
5   | 5  | 0        | 1.77
SUM |    | 8        | 53.34

R² = 1 − 8 / 53.34 = 0.85

Interpretation: 85 % of the variance in the dependent variable is explained by the model.
77
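A sketch of the R² computation for Model 1, matching the sums in the table above:

```python
y_true = [3, 6, 4, 8, 12, 5]
y_pred = [4, 5, 6, 7, 11, 5]     # Model 1

y_bar = sum(y_true) / len(y_true)                           # 6.33
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # 8
ss_tot = sum((t - y_bar) ** 2 for t in y_true)              # 53.33
r2 = 1 - ss_res / ss_tot                                    # 0.85
print(r2)
```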
4.2.4 Model evaluation
Metrics for Regression – Adjusted R-Square (Adjusted R2 )

When comparing models…

- As the number of independent variables increases, the value of R-Square will never decrease
- Reason: any independent variable has a tendency to correlate slightly with the dependent variable
- To compare models with a different number of independent variables, we can use the adjusted R-Square:

Adjusted R²(y, ŷ) = 1 − (1 − R²)(N − 1) / (N − p − 1)

where R² is the R-Squared, N is the sample size (number of observations) and p is the number of predictors (independent variables).
78
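The adjusted R² formula translated directly into code; a sketch using the R² and sample size from the previous slide, with an assumed p = 1 predictor chosen only for illustration.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-Square: penalizes R² by the number of predictors p given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# R² = 0.85 and n = 6 observations from the previous slide; p = 1 is an assumed value.
print(adjusted_r2(0.85, n=6, p=1))   # ≈ 0.81
```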
4.2.4 Model evaluation
Metrics for Regression – Wrap Up

- MSE / RMSE has the benefit of penalizing large errors more, so it can be more appropriate in some cases, for example, if being off by 10 is more than twice as bad as being off by 5.

- If being off by 10 is just twice as bad as being off by 5, then use MAE.

- Use adjusted R-Squared when comparing the performance of models where the number of predictors is different.

79
Thank you!

Address: Campus de Campolide, 1070-312 Lisboa, Portugal

Tel: +351 213 828 610 | Fax: +351 213 828 611

80
