Evaluations
All models are wrong, but some are useful
Charles Parker
VP Algorithms, BigML, Inc.
My Model Is Wonderful
Proper Evaluation
Metric Choice
Proper Evaluation
Medical Testing Example
Some Terminology
We’ll define the sick people as “positive” and the well people as “negative”
• “True Positive”: You’re sick and the model diagnosed you as sick
• “False Positive”: You’re well, but the model diagnosed you as sick
• “True Negative”: You’re well, and the model diagnosed you as well
• “False Negative”: You’re sick, but the model diagnosed you as well
The model is correct in the “true” cases, and incorrect in the “false” cases
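To make the four terms concrete, here is a minimal Python sketch that counts each case from lists of actual and predicted labels. The ten diagnoses are invented for illustration; “sick” is the positive class.

```python
# Count the four confusion-matrix cells from actual vs. predicted labels.
# The ten diagnoses below are invented; "sick" is the positive class.
actual    = ["sick"] * 4 + ["well"] * 6
predicted = ["sick", "sick", "sick", "well",                   # the four actually-sick people
             "sick", "sick", "well", "well", "well", "well"]   # the six actually-well people

tp = sum(a == "sick" and p == "sick" for a, p in zip(actual, predicted))  # true positives
fp = sum(a == "well" and p == "sick" for a, p in zip(actual, predicted))  # false positives
tn = sum(a == "well" and p == "well" for a, p in zip(actual, predicted))  # true negatives
fn = sum(a == "sick" and p == "well" for a, p in zip(actual, predicted))  # false negatives

print(tp, fp, tn, fn)  # -> 3 2 4 1
```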
Accuracy

Accuracy = (TP + TN) / Total
Precision

Precision = TP / (TP + FP) = 0.6 in the example diagram

[Diagram: people split into a “Predicted Sick” group and a “Predicted Well” group, each person marked as actually sick or well]

• How well did we do when we predicted someone was sick?
• A test with high precision has few false positives
• Precision of 1.0 indicates that everyone who we predict is sick is actually sick
• What about people who we predict are well?
Recall

Recall = TP / (TP + FN) = 0.75 in the example diagram

[Diagram: the same people split into “Predicted Sick” and “Predicted Well”, each person marked as actually sick or well]

• How well did we do when someone was actually sick?
• A test with high recall has few false negatives
• Recall of 1.0 indicates that everyone who was actually sick was correctly diagnosed
• But this doesn’t say anything about false positives!
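Using the counts from the earlier sketch (chosen so they reproduce the precision of 0.6 and recall of 0.75 above), the three metrics are one line of arithmetic each:

```python
# Metrics computed from the confusion-matrix counts above.
tp, fp, tn, fn = 3, 2, 4, 1

accuracy  = (tp + tn) / (tp + fp + tn + fn)  # 0.70 -> fraction of all diagnoses that were correct
precision = tp / (tp + fp)                   # 0.60 -> of those predicted sick, how many really are
recall    = tp / (tp + fn)                   # 0.75 -> of those really sick, how many we caught

print(accuracy, precision, recall)
```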
Trade Offs

• We can “trivially maximize” either measure on its own (see the sketch below)
• If you pick the single sickest person and label only them sick (and no one else), you can probably get perfect precision
• If you label everyone sick, you are guaranteed perfect recall
Cost Matrix
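The slide’s cost matrix is an image; as a stand-in, here is a minimal sketch of the idea, under the assumption that different kinds of errors carry different costs. The cost values are invented: a missed sick person (false negative) is assumed to be far more expensive than an unnecessary follow-up test (false positive).

```python
# Expected cost of a classifier given counts of the four outcomes and a cost per outcome.
# Costs are invented for illustration.
counts = {"TP": 3, "FP": 2, "TN": 4, "FN": 1}
costs  = {"TP": 0, "FP": 10, "TN": 0, "FN": 100}

total_cost = sum(counts[k] * costs[k] for k in counts)
print(total_cost)  # 2*10 + 1*100 = 120
```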
Operating Thresholds

• Most classifiers don’t output just a hard prediction. Instead they give a “score” for each class
• The prediction you assign to an instance is usually a function of a threshold on this score (e.g., if the score is over 0.5, predict true)
• You can experiment with an ROC curve to see how your metrics will change if you change the threshold
  • Lowering the threshold means you are more likely to predict the positive class, which improves recall but introduces false positives
  • Increasing the threshold means you predict the positive class less often (you are more “picky”), which will probably increase precision but lower recall (see the sketch below)
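A small sketch of how the same scores give different precision and recall at different thresholds; the scores and labels are invented:

```python
# Each instance has a score for the positive class; the hard prediction is
# obtained by thresholding that score. Scores and labels are invented.
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]
actual = [1,    1,    0,    1,    1,    0,    0,    0   ]

def precision_recall_at(threshold):
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn)
    return precision, recall

for t in (0.3, 0.5, 0.7):
    print(t, precision_recall_at(t))
# Lower thresholds predict "positive" more often: recall goes up, precision goes down.
```

Sweeping the threshold over the whole range of scores is exactly what traces out a curve like the ROC curve on the next slide.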
ROC Curve Example
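The example curve itself is an image in the deck; as a hedged sketch, this is one way the points of such a curve could be computed from the same kind of scores (scikit-learn is an assumption here, not something the slides prescribe):

```python
# Compute ROC curve points (false positive rate vs. true positive rate) over a
# range of thresholds. The scores and labels are invented.
from sklearn.metrics import roc_curve, auc

actual = [1, 1, 0, 1, 1, 0, 0, 0]
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]

fpr, tpr, thresholds = roc_curve(actual, scores)
print(list(zip(thresholds, fpr, tpr)))  # one (threshold, FPR, TPR) point per threshold
print("AUC:", auc(fpr, tpr))            # area under the curve summarizes the trade-off
```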
Holding Out Data
Why Hold Out Data?

• Why do we split the dataset into training and testing sets? Why do we always (always, always) test on data that the model training process did not see?
• Because machine learning algorithms are good at memorizing data
• We don’t care how well the model does on data it has already seen, because it probably won’t see that data again
• Holding out some of the data simulates the data the model will see in the future (a minimal split is sketched below)
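A minimal sketch of a holdout split; scikit-learn, the synthetic dataset, and the 80/20 proportion are illustrative assumptions, not something the slides prescribe:

```python
# Hold out 20% of the rows as a test set the model never sees during training.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)  # synthetic stand-in data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier().fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))  # often near 1.0 (memorization)
print("test accuracy: ", model.score(X_test, y_test))    # the number that actually matters
```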
Memorization

Training:

  plasma glucose   bmi    diabetes pedigree   age   diabetes
  85               26.6   0.351               31    FALSE

Evaluating:

  plasma glucose   bmi    diabetes pedigree   age   diabetes
  85               26.6   0.351               31    ?

The evaluation row is identical to a row the model saw in training, so the model can “predict” FALSE simply by remembering it.
Well, That Was Easy
• Okay, so I’m not testing on the training data, so I’m good, right? NO NO NO
• You also have to worry about information leakage between training and test data
• What is this? Let’s try to predict the daily closing price of the stock market
• What happens if you hold out 10 random days from your dataset?
• What if you hold out the last 10 days? (A sketch of the two choices follows below)
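A sketch of the two holdout choices for time-ordered data; the “one row per trading day” setup is an assumption for illustration:

```python
# For time-ordered data, compare a random holdout with a "last N days" holdout.
import random

days = list(range(365))  # one row per trading day, already sorted by date

# Random holdout: 10 random days. Training data then surrounds every test day,
# so the model gets to "peek" at the market just before and after each test point.
random.seed(0)
random_test = set(random.sample(days, 10))
random_train = [d for d in days if d not in random_test]

# Chronological holdout: the last 10 days. Training only sees the past,
# which is what the model will face when it is actually used.
chrono_train, chrono_test = days[:-10], days[-10:]
```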
Traps Everywhere!
How Do We Avoid This?
• It’s a terrible problem, because if you make this mistake you will get results that are too good, and you will be inclined to believe them
• So be careful. Do you have:
  • Data where points can be grouped in time (by week or by month)?
  • Data where points can be grouped by user (each point is an action a user took)?
  • Data where points can be grouped by location (each point is a day of sales at a particular store)?
• If you even suspect that points from the same group might leak information to one another, try a test where you hold out a few whole groups (months, users, locations) and train on the rest (see the sketch below)
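One way to hold out whole groups is scikit-learn’s GroupShuffleSplit; this particular tool is an illustrative assumption, and any mechanism that keeps all of a group’s rows on one side of the split works:

```python
# Hold out whole groups (e.g., users) rather than individual rows, so that
# rows from the same group never appear on both sides of the split.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))           # synthetic features
y = rng.integers(0, 2, size=1000)        # synthetic labels
groups = rng.integers(0, 50, size=1000)  # e.g., 50 users; each row belongs to one

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups))

# No user appears in both sets:
assert set(groups[train_idx]).isdisjoint(set(groups[test_idx]))
```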
Do It Again!
One Test is Not Enough
• Even if you have a correct holdout, you still need to test more than once
• Every result you get from any test is partly a result of randomness (see the sketch below)
• Randomness from the Data:
  • The dataset you have is a finite number of points drawn from an infinite distribution
  • The split you make between training and test data is done at random
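A sketch of what that randomness looks like in practice: repeating the same experiment with different random splits gives a spread of scores, not a single number. The synthetic data and scikit-learn are assumptions for illustration.

```python
# Repeat the same train/test experiment with different random splits and look
# at the spread of scores instead of trusting any single run.
from statistics import mean, stdev
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # synthetic stand-in data

scores = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    scores.append(DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr).score(X_te, y_te))

print("mean:", mean(scores), "std dev:", stdev(scores), "range:", min(scores), max(scores))
```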
One Test is Not Enough

[Plot: likelihood vs. performance — a single test can return a result that looks good, “but really just a lucky one”]
Comparing Models is Even Worse

[Plots: evaluation results broken down by the first digit of the random seed]
Please, Sir, Can I Have Some More?
• Always do more than one test!
• For each test, try to vary all sources of randomness that you can (change the seeds of all random processes) to try to “experience” as much variance as you can
• Cross-validation (stratifying is great, Monte Carlo cross-validation can be a useful simplification)
• Don’t just average the results! The variance is important! (A sketch follows below)
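A sketch of a stratified cross-validation that reports the spread alongside the average; scikit-learn is an assumption here, and the point is the mean-plus-variance reporting rather than the particular library:

```python
# Stratified k-fold cross-validation: several train/test splits, each fold
# keeping the class proportions of the full dataset. Report mean AND spread.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # synthetic stand-in data

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

# Don't just average: the variance tells you how much to trust the average.
print("folds:", np.round(scores, 3))
print("mean: %.3f  std dev: %.3f" % (scores.mean(), scores.std()))
```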
Summing Up