
2nd edition

Evaluations
All models are wrong, but some are useful

Charles Parker
VP Algorithms, BigML, Inc

#MLSEV 2
My Model Is Wonderful

• I trained a model on my data and it seems really marvelous!
• How do you know for sure?
• To quantify your model’s performance, you must evaluate it
• This is not optional. If you don’t do this and do it right, you’ll have problems

#MLSEV 3
Proper Evaluation

• Choosing the right metric
• Testing on the right data (which might be harder than you think)
• Replicating your tests

#MLSEV 4
Metric Choice

#MLSEV 5
Proper Evaluation

• The most basic workflow for model evaluation (sketched in code below) is:
  • Split your data into two sets, training and testing
  • Train a model on the training data
  • Measure the “performance” of the model on the testing data
• If your training data is representative of what you will see in the future, that’s the performance you should get out of your model
• What do we mean by “performance”? This is where you come in.
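A minimal sketch of that workflow, using scikit-learn on a synthetic dataset rather than the BigML platform used in the talk; the dataset, model, and split sizes are all placeholders:

```python
# Minimal sketch of the hold-out workflow above, using scikit-learn on a
# synthetic dataset (a stand-in for "your data"; the talk itself uses BigML).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# 1. Split your data into two sets, training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# 2. Train a model on the training data
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# 3. Measure the "performance" of the model on the testing data
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```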

#MLSEV 6
Medical Testing Example

• Let’s say we develop an ML model that can diagnose a disease
• About 1 in 1000 people who are tested by the model turn out to have the disease
• Call the people who have the disease “sick” and the people who don’t have it “well”
• How well do we do on a test set?

#MLSEV 7
Some Terminology
We’ll define the sick people as “positive” and the well people as “negative”

• “True Positive”: You’re sick and the model diagnosed you as sick
• “False Positive”: You’re well, but the model diagnosed you as sick
• “True Negative”: You’re well, and the model diagnosed you as well
• “False Negative”: You’re sick, but the model diagnosed you as well

The model is correct in the “true” cases, and incorrect in the “false” cases
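As an illustrative sketch (the labels here are made up), these four counts can be read off a confusion matrix, for example with scikit-learn:

```python
# Counting the four outcomes with scikit-learn; labels here are made up.
from sklearn.metrics import confusion_matrix

y_true = ["sick", "sick", "well", "well", "well", "sick", "well", "well"]
y_pred = ["sick", "well", "well", "sick", "well", "sick", "well", "well"]

# With the label order [well, sick], confusion_matrix returns
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["well", "sick"]).ravel()
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")  # TP=2 FP=1 TN=4 FN=1
```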

#MLSEV 8
Accuracy
Accuracy = (TP + TN) / Total

• “Percentage correct” - like an exam
• If Accuracy = 1 then no mistakes
• If Accuracy = 0 then all mistakes
• Intuitive but not always useful
• Watch out for unbalanced classes!
  • Remember, only 1 in 1000 have the disease
  • A silly model which always predicts “well” is 99.9% accurate
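A small sketch of that trap on synthetic 1-in-1000 data, using scikit-learn’s DummyClassifier as the “silly model” (everything here is illustrative):

```python
# The unbalanced-class trap on synthetic 1-in-1000 data: a model that
# always says "well" is 99.9% accurate and completely useless.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
y = (rng.random(100_000) < 0.001).astype(int)  # 1 = sick, about 0.1% of people
X = rng.normal(size=(100_000, 5))              # features don't matter here

silly = DummyClassifier(strategy="constant", constant=0).fit(X, y)  # always "well"
print("accuracy:", accuracy_score(y, silly.predict(X)))             # ~0.999
```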

#MLSEV 9
Precision
Precision = TP / (TP + FP) = 0.6 in the pictured example

[Figure: sick and well people sorted into “Predicted Sick” and “Predicted Well” groups]

• How well did we do when we predicted someone was sick?
• A test with high precision has few false positives
• Precision of 1.0 indicates that everyone who we predict is sick is actually sick
• What about people who we predict are well?
#MLSEV 10
Recall
Recall = TP / (TP + FN) = 0.75 in the pictured example

[Figure: the same sick and well people sorted into “Predicted Sick” and “Predicted Well” groups]

• How well did we do when someone was actually sick?
• A test with high recall indicates few false negatives
• Recall of 1.0 indicates that everyone who was actually sick was correctly diagnosed
• But this doesn’t say anything about false positives!
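A tiny sketch of both definitions; the counts below are chosen to reproduce the numbers on the slides (0.6 and 0.75), not taken from real data:

```python
# Precision and recall from the outcome counts; the counts are chosen to
# reproduce the numbers on the slides (0.6 and 0.75), not from a real test set.
tp, fp, fn = 3, 2, 1

precision = tp / (tp + fp)  # of everyone we called sick, how many were sick?
recall = tp / (tp + fn)     # of everyone who was sick, how many did we catch?
print(precision, recall)    # 0.6 0.75
```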
#MLSEV 11
Trade Offs
• We can “trivially maximize” both measures
  • If you pick the sickest person and only label them sick and no one else, you can probably get perfect precision
  • If you label everyone sick, you are guaranteed perfect recall
• The unfortunate catch is that if you make one perfect, the other is terrible, so you want a model that has both high precision and high recall
• This is what quantities like the F1 score and the Phi Coefficient try to do
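A sketch of those single-number summaries with scikit-learn on made-up labels; matthews_corrcoef is the Phi Coefficient for the binary case:

```python
# Single-number summaries that reward both high precision and high recall,
# computed with scikit-learn on made-up labels (1 = sick).
from sklearn.metrics import precision_score, recall_score, f1_score, matthews_corrcoef

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]  # 3 TP, 1 FN, 2 FP, 4 TN

print("precision:", precision_score(y_true, y_pred))   # 0.6
print("recall:   ", recall_score(y_true, y_pred))      # 0.75
print("F1:       ", f1_score(y_true, y_pred))          # harmonic mean of the two
print("phi (MCC):", matthews_corrcoef(y_true, y_pred)) # correlation-style score
```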

#MLSEV 12
Cost Matrix

Cost matrix for the medical diagnosis problem:

                 Classified Sick    Classified Well
Actually Sick          0                  100
Actually Well          1                    0

• In many cases, the consequences of a true positive and a false positive are very different
• You can define “costs” for each type of mistake
• Total Cost = TP * TP_Cost + FP * FP_Cost + FN * FN_Cost + TN * TN_Cost
• Here, we are willing to accept lots of false positives in exchange for high recall
• What if a positive diagnosis resulted in expensive or painful treatment?
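A quick sketch of scoring a model by total cost with that matrix; the outcome counts below are hypothetical:

```python
# Scoring a model by total cost using the cost matrix above; the outcome
# counts are hypothetical test results.
costs = {"TP": 0, "FP": 1, "FN": 100, "TN": 0}
counts = {"TP": 30, "FP": 500, "FN": 5, "TN": 9465}

total_cost = sum(counts[k] * costs[k] for k in costs)
print("total cost:", total_cost)  # 500 * 1 + 5 * 100 = 1000
```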

#MLSEV 13
Operating Thresholds
• Most classifiers don’t output a prediction. Instead they give a “score” for each class
• The prediction you assign to an instance is usually a function of a threshold on this score (e.g., if the score is over 0.5, predict true)
• You can experiment with an ROC curve to see how your metrics will change if you change the threshold
  • Lowering the threshold means you are more likely to predict the positive class, which improves recall but introduces false positives
  • Increasing the threshold means you predict the positive class less often (you are more “picky”), which will probably increase precision but lower recall
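A sketch of that trade-off: train any scoring model, then sweep the threshold and watch precision and recall move in opposite directions. The data, model, and threshold values are all illustrative:

```python
# Sweeping the operating threshold on a scoring model; data, model, and
# threshold values are all illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # score for the positive class

for threshold in (0.2, 0.5, 0.8):
    y_pred = (scores >= threshold).astype(int)
    print(f"threshold {threshold}:",
          "precision", round(precision_score(y_test, y_pred), 2),
          "recall", round(recall_score(y_test, y_pred), 2))
```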
#MLSEV 14
ROC Curve Example
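The slide shows an ROC curve plot, which is not reproduced here. As an illustrative sketch (same synthetic setup as the previous one, not the presenter’s actual example), scikit-learn’s roc_curve computes one (false positive rate, true positive rate) point per candidate threshold:

```python
# Computing the points of an ROC curve with scikit-learn: one
# (false positive rate, true positive rate) pair per candidate threshold.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scores = RandomForestClassifier(random_state=0).fit(X_train, y_train).predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, scores)
print("number of ROC points:", len(fpr), " AUC:", roc_auc_score(y_test, scores))
```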

#MLSEV 15
Holding Out Data

#MLSEV 16
Why Hold Out Data?

• Why do we split the dataset into training and testing sets? Why do we always (always, always) test on data that the model training process did not see?
• Because machine learning algorithms are good at memorizing data
• We don’t care how well the model does on data it has already seen, because it probably won’t see that data again
• Holding out some of the data simulates the data the model will see in the future

#MLSEV 17
Memorization
Training data (plasma glucose, bmi, diabetes pedigree, age, diabetes):

  148   33.6   0.627   50   TRUE
   85   26.6   0.351   31   FALSE
  183   23.3   0.672   32   TRUE
   89   28.1   0.167   21   FALSE
  137   43.1   2.288   33   TRUE
  116   25.6   0.201   30   FALSE
   78   31.0   0.248   26   TRUE
  115   35.3   0.134   29   FALSE
  197   30.5   0.158   53   TRUE

Evaluating data (the same rows, with the label hidden):

  148   33.6   0.627   50   ?
   85   26.6   0.351   31   ?

• You don’t even need meaningful features; the person’s name would be enough
• “Oh right, Bob. I know him. Yes, he certainly has diabetes”
• As long as there are no duplicate names in the dataset, it’s a 100% accurate model
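A sketch of the same memorization effect on synthetic data standing in for the table above: an unconstrained decision tree is near-perfect on rows it has already seen and noticeably worse on rows it has not.

```python
# Memorization in action on synthetic data: an unconstrained decision tree
# looks great on its own training rows and much worse on held-out rows.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=8, flip_y=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)  # no depth limit
print("on data the model has seen:    ", accuracy_score(y_train, model.predict(X_train)))
print("on data the model has NOT seen:", accuracy_score(y_test, model.predict(X_test)))
```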

#MLSEV 18
Well, That Was Easy
• Okay, so I’m not testing on the training data, so I’m good, right? NO NO NO
• You also have to worry about information leakage between training and test data
• What is this? Let’s try to predict the daily closing price of the stock market
• What happens if you hold out 10 random days from your dataset?
• What if you hold out the last 10 days?
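A sketch of the two holdouts on a made-up daily price series; the random holdout leaks information because each held-out day’s neighbors stay in the training set:

```python
# Two ways to hold out 10 days from a (made-up) daily closing-price series.
# Random days leak information: each held-out day's neighbors stay in training.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
days = pd.date_range("2019-01-01", periods=500, freq="D")
prices = pd.DataFrame({"date": days, "close": 100 + np.cumsum(rng.normal(size=500))})

# Leaky: 10 random days scattered through the series
leaky_test = prices.sample(n=10, random_state=0).sort_values("date")

# Better: the last 10 days, with everything before them as training data
train, test = prices.iloc[:-10], prices.iloc[-10:]

print("random holdout:", leaky_test["date"].min().date(), "to", leaky_test["date"].max().date())
print("time-based holdout:", test["date"].min().date(), "to", test["date"].max().date())
```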
#MLSEV 19
Traps Everywhere!

• This is common when you have time-distributed data, but can also happen in other instances:
  • Let’s say we have a dataset of 10,000 pictures from 20 people, each labeled with the year in which it was taken
  • We want to predict the year from the image
  • What happens if we hold out random data?
  • Solution: Hold out users instead

#MLSEV 20
How Do We Avoid This?
• It’s a terrible problem, because if you make the mistake you will get results that are too good, and be inclined to believe them
• So be careful! Do you have:
  • Data where points can be grouped in time (by week or by month)?
  • Data where points can be grouped by user (each point is an action a user took)?
  • Data where points can be grouped by location (each point is a day of sales at a particular store)?
• Even if you’re suspicious that points from the group might leak information to one another, try a test where you hold out a few groups (months, users, locations) and train on the rest
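A sketch of a group-aware holdout with scikit-learn: whole people (or months, or stores) go to either the training set or the test set, never both. The data and group ids below are synthetic:

```python
# A group-aware holdout: no group appears in both training and test data.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 16))          # e.g., image features
y = rng.integers(2000, 2020, size=10_000)  # e.g., the year the picture was taken
groups = rng.integers(0, 20, size=10_000)  # which of the 20 people it came from

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

# No person appears in both sets, so person-specific cues can't leak
assert set(groups[train_idx]).isdisjoint(groups[test_idx])
print("held-out people:", sorted(set(groups[test_idx])))
```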

#MLSEV 21
Do It Again!

#MLSEV 22
One Test is Not Enough

• Even if you have a correct holdout, you still need to test more than once
• Every result you get from any test is a result of randomness
• Randomness from the data:
  • The dataset you have is a finite number of points drawn from an infinite distribution
  • The split you make between training and test data is done at random
• Randomness from the algorithm:
  • The ordering of the data might give different results
  • The best performing algorithms (random forests, deepnets) have randomness built in
• With just one result, you might get lucky
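A sketch of repeating the same evaluation while varying the seeds of both the split and the model; the dataset and model are illustrative:

```python
# Repeating the evaluation with different seeds for the split and the model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)

results = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    model = RandomForestClassifier(n_estimators=50, random_state=seed).fit(X_tr, y_tr)
    results.append(accuracy_score(y_te, model.predict(X_te)))

print("best single run:", max(results))  # the one you might be tempted to report
print("mean:", round(np.mean(results), 3), "std:", round(np.std(results), 3))
```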

#MLSEV 23
One Test is Not Enough

[Plot: a single evaluation marked “Really nice result!” on a performance axis]
#MLSEV 24
One Test is Not Enough
[Plot: the likelihood of each performance level over many runs; the “Really nice result!” from the previous slide was really just a lucky one]

#MLSEV 25
Comparing Models is Even Worse

#MLSEV 26
Comparing Models is Even Worse

#MLSEV 27
Comparing Models is Even Worse

[Plot: results plotted against the first digit of the random seed]
#MLSEV 28
Please, Sir, Can I Have Some More?
• Always do more than one test!
• For each test, try to vary all sources of randomness that you can (change the seeds of all random processes) to try to “experience” as much variance as you can
• Cross-validation (stratifying is great, Monte Carlo can be a useful simplification)
• Don’t just average the results! The variance is important!
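A sketch of stratified cross-validation that reports the spread of scores rather than just the average; the dataset and model are illustrative:

```python
# Stratified cross-validation, reporting the spread of scores, not just the mean.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv, scoring="f1")

# Don't just average the results! The variance is important!
print("F1 per fold:", np.round(scores, 3))
print("mean:", round(scores.mean(), 3), "std:", round(scores.std(), 3))
```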
#MLSEV 29
Summing Up

• Choose the metric that makes sense for your problem
• Use held-out data for testing and watch out for information leakage
• Always do more than one test, varying all sources of randomness that you have control over!

#MLSEV 30
