
2nd edition

Evaluations
All models are wrong, but some are useful

Charles Parker
VP Algorithms, BigML, Inc

#MLSEV 2
My Model Is Wonderful

• I trained a model on my data and it seems really marvelous!
• How do you know for sure?
• To quantify your model’s performance, you must evaluate it
• This is not optional. If you don’t do this and do it right, you’ll have problems

#MLSEV 3
Proper Evaluation

• Choosing the right metric
• Testing on the right data (which might be harder than you think)
• Replicating your tests

#MLSEV 4
Metric Choice

#MLSEV 5
Proper Evaluation

• The most basic workflow for model evaluation (sketched in code below) is:
  • Split your data into two sets, training and testing
  • Train a model on the training data
  • Measure the “performance” of the model on the testing data
• If your training data is representative of what you will see in the future, that’s the performance you should get out of your model
• What do we mean by “performance”? This is where you come in.
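A minimal sketch of that workflow, using scikit-learn on a synthetic dataset rather than the BigML platform used in the talk; the dataset, model, and split sizes are all placeholders:

```python
# Minimal sketch of the hold-out workflow above, using scikit-learn on a
# synthetic dataset (a stand-in for "your data"; the talk itself uses BigML).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# 1. Split your data into two sets, training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# 2. Train a model on the training data
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# 3. Measure the "performance" of the model on the testing data
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```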

#MLSEV 6
Medical Testing Example

• Let’s say we develop an ML model that can diagnose a disease
• About 1 in 1000 people who are tested by the model turn out to have the disease
• Call the people who have the disease “sick” and the people who don’t have it “well”
• How well do we do on a test set?

#MLSEV 7
Some Terminology
We’ll define the sick people as “positive” and the well people as “negative”

• “True Positive”: You’re sick and the model diagnosed you as sick
• “False Positive”: You’re well, but the model diagnosed you as sick
• “True Negative”: You’re well, and the model diagnosed you as well
• “False Negative”: You’re sick, but the model diagnosed you as well

The model is correct in the “true” cases, and incorrect in the “false” cases
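As an illustrative sketch (the labels here are made up), these four counts can be read off a confusion matrix, for example with scikit-learn:

```python
# Counting the four outcomes with scikit-learn; labels here are made up.
from sklearn.metrics import confusion_matrix

y_true = ["sick", "sick", "well", "well", "well", "sick", "well", "well"]
y_pred = ["sick", "well", "well", "sick", "well", "sick", "well", "well"]

# With the label order [well, sick], confusion_matrix returns
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["well", "sick"]).ravel()
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")  # TP=2 FP=1 TN=4 FN=1
```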

#MLSEV 8
Accuracy
Accuracy = (TP + TN) / Total

• “Percentage correct” - like an exam
• If Accuracy = 1 then no mistakes
• If Accuracy = 0 then all mistakes
• Intuitive but not always useful
• Watch out for unbalanced classes!
  • Remember, only 1 in 1000 have the disease
  • A silly model which always predicts “well” is 99.9% accurate
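A small sketch of that trap on synthetic 1-in-1000 data, using scikit-learn’s DummyClassifier as the “silly model” (everything here is illustrative):

```python
# The unbalanced-class trap on synthetic 1-in-1000 data: a model that
# always says "well" is 99.9% accurate and completely useless.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
y = (rng.random(100_000) < 0.001).astype(int)  # 1 = sick, about 0.1% of people
X = rng.normal(size=(100_000, 5))              # features don't matter here

silly = DummyClassifier(strategy="constant", constant=0).fit(X, y)  # always "well"
print("accuracy:", accuracy_score(y, silly.predict(X)))             # ~0.999
```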

#MLSEV 9
Precision
Precision = TP / (TP + FP) = 0.6 in the pictured example

[Figure: sick and well people sorted into “Predicted Sick” and “Predicted Well” groups]

• How well did we do when we predicted someone was sick?
• A test with high precision has few false positives
• Precision of 1.0 indicates that everyone who we predict is sick is actually sick
• What about people who we predict are well?
#MLSEV 10
Recall
Recall = TP / (TP + FN) = 0.75 in the pictured example

[Figure: the same sick and well people sorted into “Predicted Sick” and “Predicted Well” groups]

• How well did we do when someone was actually sick?
• A test with high recall indicates few false negatives
• Recall of 1.0 indicates that everyone who was actually sick was correctly diagnosed
• But this doesn’t say anything about false positives!
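A tiny sketch of both definitions; the counts below are chosen to reproduce the numbers on the slides (0.6 and 0.75), not taken from real data:

```python
# Precision and recall from the outcome counts; the counts are chosen to
# reproduce the numbers on the slides (0.6 and 0.75), not from a real test set.
tp, fp, fn = 3, 2, 1

precision = tp / (tp + fp)  # of everyone we called sick, how many were sick?
recall = tp / (tp + fn)     # of everyone who was sick, how many did we catch?
print(precision, recall)    # 0.6 0.75
```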
#MLSEV 11
Trade Offs
• We can “trivially maximize” both measures
  • If you pick the sickest person and only label them sick and no one else, you can probably get perfect precision
  • If you label everyone sick, you are guaranteed perfect recall
• The unfortunate catch is that if you make one perfect, the other is terrible, so you want a model that has both high precision and high recall
• This is what quantities like the F1 score and the Phi Coefficient try to do
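A sketch of those single-number summaries with scikit-learn on made-up labels; matthews_corrcoef is the Phi Coefficient for the binary case:

```python
# Single-number summaries that reward both high precision and high recall,
# computed with scikit-learn on made-up labels (1 = sick).
from sklearn.metrics import precision_score, recall_score, f1_score, matthews_corrcoef

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]  # 3 TP, 1 FN, 2 FP, 4 TN

print("precision:", precision_score(y_true, y_pred))   # 0.6
print("recall:   ", recall_score(y_true, y_pred))      # 0.75
print("F1:       ", f1_score(y_true, y_pred))          # harmonic mean of the two
print("phi (MCC):", matthews_corrcoef(y_true, y_pred)) # correlation-style score
```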

#MLSEV 12
Cost Matrix

Cost matrix for the medical diagnosis problem:

                 Classified Sick    Classified Well
Actually Sick          0                  100
Actually Well          1                    0

• In many cases, the consequences of a true positive and a false positive are very different
• You can define “costs” for each type of mistake
• Total Cost = TP * TP_Cost + FP * FP_Cost + FN * FN_Cost + TN * TN_Cost
• Here, we are willing to accept lots of false positives in exchange for high recall
• What if a positive diagnosis resulted in expensive or painful treatment?
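A quick sketch of scoring a model by total cost with that matrix; the outcome counts below are hypothetical:

```python
# Scoring a model by total cost using the cost matrix above; the outcome
# counts are hypothetical test results.
costs = {"TP": 0, "FP": 1, "FN": 100, "TN": 0}
counts = {"TP": 30, "FP": 500, "FN": 5, "TN": 9465}

total_cost = sum(counts[k] * costs[k] for k in costs)
print("total cost:", total_cost)  # 500 * 1 + 5 * 100 = 1000
```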

#MLSEV 13
Operating Thresholds
• Most classifiers don’t output a prediction. Instead they give a “score” for each class
• The prediction you assign to an instance is usually a function of a threshold on this score (e.g., if the score is over 0.5, predict true)
• You can experiment with an ROC curve to see how your metrics will change if you change the threshold
  • Lowering the threshold means you are more likely to predict the positive class, which improves recall but introduces false positives
  • Increasing the threshold means you predict the positive class less often (you are more “picky”), which will probably increase precision but lower recall
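A sketch of that trade-off: train any scoring model, then sweep the threshold and watch precision and recall move in opposite directions. The data, model, and threshold values are all illustrative:

```python
# Sweeping the operating threshold on a scoring model; data, model, and
# threshold values are all illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # score for the positive class

for threshold in (0.2, 0.5, 0.8):
    y_pred = (scores >= threshold).astype(int)
    print(f"threshold {threshold}:",
          "precision", round(precision_score(y_test, y_pred), 2),
          "recall", round(recall_score(y_test, y_pred), 2))
```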
#MLSEV 14
ROC Curve Example
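The slide shows an ROC curve plot, which is not reproduced here. As an illustrative sketch (same synthetic setup as the previous one, not the presenter’s actual example), scikit-learn’s roc_curve computes one (false positive rate, true positive rate) point per candidate threshold:

```python
# Computing the points of an ROC curve with scikit-learn: one
# (false positive rate, true positive rate) pair per candidate threshold.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scores = RandomForestClassifier(random_state=0).fit(X_train, y_train).predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, scores)
print("number of ROC points:", len(fpr), " AUC:", roc_auc_score(y_test, scores))
```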

#MLSEV 15
Holding Out Data

#MLSEV 16
Why Hold Out Data?

• Why do we split the dataset into training and testing sets? Why do we always (always, always) test on data that the model training process did not see?
• Because machine learning algorithms are good at memorizing data
• We don’t care how well the model does on data it has already seen, because it probably won’t see that data again
• Holding out some of the data simulates the data the model will see in the future

#MLSEV 17
Memorization
Training data (plasma glucose, bmi, diabetes pedigree, age, diabetes):

  148   33.6   0.627   50   TRUE
   85   26.6   0.351   31   FALSE
  183   23.3   0.672   32   TRUE
   89   28.1   0.167   21   FALSE
  137   43.1   2.288   33   TRUE
  116   25.6   0.201   30   FALSE
   78   31.0   0.248   26   TRUE
  115   35.3   0.134   29   FALSE
  197   30.5   0.158   53   TRUE

Evaluating data (the same rows, with the label hidden):

  148   33.6   0.627   50   ?
   85   26.6   0.351   31   ?

• You don’t even need meaningful features; the person’s name would be enough
• “Oh right, Bob. I know him. Yes, he certainly has diabetes”
• As long as there are no duplicate names in the dataset, it’s a 100% accurate model
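A sketch of the same memorization effect on synthetic data standing in for the table above: an unconstrained decision tree is near-perfect on rows it has already seen and noticeably worse on rows it has not.

```python
# Memorization in action on synthetic data: an unconstrained decision tree
# looks great on its own training rows and much worse on held-out rows.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=8, flip_y=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)  # no depth limit
print("on data the model has seen:    ", accuracy_score(y_train, model.predict(X_train)))
print("on data the model has NOT seen:", accuracy_score(y_test, model.predict(X_test)))
```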

#MLSEV 18
Well, That Was Easy
• Okay, so I’m not testing on the training data, so I’m good, right? NO NO NO
• You also have to worry about information leakage between training and test data
• What is this? Let’s try to predict the daily closing price of the stock market
• What happens if you hold out 10 random days from your dataset?
• What if you hold out the last 10 days?
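A sketch of the two holdouts on a made-up daily price series; the random holdout leaks information because each held-out day’s neighbors stay in the training set:

```python
# Two ways to hold out 10 days from a (made-up) daily closing-price series.
# Random days leak information: each held-out day's neighbors stay in training.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
days = pd.date_range("2019-01-01", periods=500, freq="D")
prices = pd.DataFrame({"date": days, "close": 100 + np.cumsum(rng.normal(size=500))})

# Leaky: 10 random days scattered through the series
leaky_test = prices.sample(n=10, random_state=0).sort_values("date")

# Better: the last 10 days, with everything before them as training data
train, test = prices.iloc[:-10], prices.iloc[-10:]

print("random holdout:", leaky_test["date"].min().date(), "to", leaky_test["date"].max().date())
print("time-based holdout:", test["date"].min().date(), "to", test["date"].max().date())
```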
#MLSEV 19
Traps Everywhere!

• This is common when you have time-distributed data, but can also happen in other instances:
  • Let’s say we have a dataset of 10,000 pictures from 20 people, each labeled with the year in which it was taken
  • We want to predict the year from the image
  • What happens if we hold out random data?
  • Solution: Hold out users instead

#MLSEV 20
How Do We Avoid This?
• It’s a terrible problem, because if you make the mistake you will get results that are too good, and be inclined to believe them
• So be careful! Do you have:
  • Data where points can be grouped in time (by week or by month)?
  • Data where points can be grouped by user (each point is an action a user took)?
  • Data where points can be grouped by location (each point is a day of sales at a particular store)?
• Even if you’re suspicious that points from the group might leak information to one another, try a test where you hold out a few groups (months, users, locations) and train on the rest
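A sketch of a group-aware holdout with scikit-learn: whole people (or months, or stores) go to either the training set or the test set, never both. The data and group ids below are synthetic:

```python
# A group-aware holdout: no group appears in both training and test data.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 16))          # e.g., image features
y = rng.integers(2000, 2020, size=10_000)  # e.g., the year the picture was taken
groups = rng.integers(0, 20, size=10_000)  # which of the 20 people it came from

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

# No person appears in both sets, so person-specific cues can't leak
assert set(groups[train_idx]).isdisjoint(groups[test_idx])
print("held-out people:", sorted(set(groups[test_idx])))
```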

#MLSEV 21
Do It Again!

#MLSEV 22
One Test is Not Enough

• Even if you have a correct holdout, you still need to test more than once
• Every result you get from any test is a result of randomness
• Randomness from the data:
  • The dataset you have is a finite number of points drawn from an infinite distribution
  • The split you make between training and test data is done at random
• Randomness from the algorithm:
  • The ordering of the data might give different results
  • The best performing algorithms (random forests, deepnets) have randomness built in
• With just one result, you might get lucky
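A sketch of repeating the same evaluation while varying the seeds of both the split and the model; the dataset and model are illustrative:

```python
# Repeating the evaluation with different seeds for the split and the model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)

results = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    model = RandomForestClassifier(n_estimators=50, random_state=seed).fit(X_tr, y_tr)
    results.append(accuracy_score(y_te, model.predict(X_te)))

print("best single run:", max(results))  # the one you might be tempted to report
print("mean:", round(np.mean(results), 3), "std:", round(np.std(results), 3))
```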

#MLSEV 23
One Test is Not Enough

[Plot: a single evaluation marked “Really nice result!” on a performance axis]
#MLSEV 24
One Test is Not Enough
[Plot: the likelihood of each performance level over many runs; the “Really nice result!” from the previous slide was really just a lucky one]

#MLSEV 25
Comparing Models is Even Worse

#MLSEV 26
Comparing Models is Even Worse

#MLSEV 27
Comparing Models is Even Worse

[Plot: results plotted against the first digit of the random seed]
#MLSEV 28
Please, Sir, Can I Have Some More?
• Always do more than one test!
• For each test, try to vary all sources of randomness that you can (change the seeds of all random processes) to try to “experience” as much variance as you can
• Cross-validation (stratifying is great, Monte Carlo can be a useful simplification)
• Don’t just average the results! The variance is important!
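A sketch of stratified cross-validation that reports the spread of scores rather than just the average; the dataset and model are illustrative:

```python
# Stratified cross-validation, reporting the spread of scores, not just the mean.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv, scoring="f1")

# Don't just average the results! The variance is important!
print("F1 per fold:", np.round(scores, 3))
print("mean:", round(scores.mean(), 3), "std:", round(scores.std(), 3))
```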
#MLSEV 29
Summing Up

• Choose the metric that makes sense for your problem
• Use held-out data for testing and watch out for information leakage
• Always do more than one test, varying all sources of randomness that you have control over!

#MLSEV 30
