FSDL Berkeley Lecture 10: Testing and Explainability (pages 51–97)
https://ai.googleblog.com/2019/12/fairness-indicators-scalable.html
• E.g., an arbitrary other machine learning model, like a noisy cat classifier
The What-If Tool: Interactive Probing of Machine Learning Models
Automated Data Slicing for Model Validation: A Big Data - AI Integration Approach
• When you collect datasets to target specific edge cases (e.g., a "left hand turn" dataset)
• When you run your production model on multiple modalities (e.g., "San Francisco" and "Miami" datasets, English and French datasets)
• When you augment your training set with data not found in production (e.g., synthetic data)
| Slice | Age group | Old model | New model |
|---|---|---|---|
| New accounts | Age <18 | 87% / 89% / 90% | 87% / 89% / 90% |
| Frequent users | Age 18-45 | 85% / 87% / 79% | 85% / 87% / 79% |
| Churned users | Age 45+ | 92% / 95% / 90% | 50% / 36% / 70% |

Aggregate performance looks similar, but the new model regresses badly on the churned-users / age 45+ slice.
• Compare the new model to the previous model and to a fixed older model
• Set thresholds on the diffs between new and old for most metrics
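A minimal sketch of the per-slice comparison above, assuming a hypothetical pandas DataFrame of evaluation results (the file name, column names, and 2% regression threshold are all illustrative):

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical evaluation frame: one row per example, with slice
# metadata and predictions from the old and new models.
df = pd.read_parquet("eval_results.parquet")

for slice_col in ["user_type", "age_bucket"]:
    for slice_value, group in df.groupby(slice_col):
        old_acc = accuracy_score(group["label"], group["pred_old"])
        new_acc = accuracy_score(group["label"], group["pred_new"])
        # Flag slices where the new model regresses past a threshold.
        if old_acc - new_acc > 0.02:
            print(f"Regression on {slice_col}={slice_value}: "
                  f"{old_acc:.0%} -> {new_acc:.0%}")
```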
Shadow tests (online)
[Diagram: the online ML system — model inputs feed the serving system; the training system produces the model used by the prediction system; the new model runs in shadow alongside the serving path and its predictions are logged]
• Goals:
• Detect bugs in the production deployment
• Detect inconsistencies between the offline and online model
• Detect issues that appear on production data
• How?
• Run new model in production system, but don’t return predictions to users
• Save the data and run the offline model on it
• Inspect the prediction distributions for old vs. new and offline vs. online for inconsistencies
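A minimal sketch of that last distribution check, assuming prediction scores from both models have been logged to hypothetical .npy files (file names and the significance threshold are illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical logged prediction scores for the same traffic.
online_scores = np.load("online_model_scores.npy")   # current production model
shadow_scores = np.load("shadow_model_scores.npy")   # new model, shadow mode

# Two-sample Kolmogorov-Smirnov test: do the two models produce
# noticeably different prediction distributions on identical inputs?
stat, p_value = ks_2samp(online_scores, shadow_scores)
if p_value < 0.01:
    print(f"Prediction distributions diverge (KS={stat:.3f}, p={p_value:.1e})")
```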
A/B tests (online)
[Diagram: the same online ML system — model inputs, serving system, prediction system, training system, storage and preprocessing, labeling system, production data — with the traffic split happening at the serving system]
• How?
• Serve predictions from the new model to a small, randomized fraction of users; keep the rest on the old model
• Compare user and business metrics between the two cohorts
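A minimal sketch of a deterministic traffic split; the function name, experiment name, and 5% fraction are all illustrative:

```python
import hashlib

def in_treatment_group(user_id: str, experiment: str, fraction: float = 0.05) -> bool:
    """Assign a user to the B (new model) cohort.

    Hashing user_id together with the experiment name gives a stable,
    roughly uniform assignment without storing per-user state.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform in [0, 1]
    return bucket < fraction

# Example: about 5% of users see the new model's predictions.
print(in_treatment_group("user_42", "ranker_v2"))
```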
Monitoring (next time!)
[Diagram: the same online ML system, with monitoring attached to the serving system and its predictions; monitoring is the topic of the next lecture]
Labeling tests (online)
[Diagram: the same online ML system, with the labeling system highlighted]
• How?
• Spot-check labels as they are produced
• Aggregate labels from multiple labelers and measure agreement
• Certify labelers against a gold-standard set of labeled examples
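A minimal sketch of the gold-standard check, with illustrative labeler IDs, example IDs, and a hypothetical 90% certification threshold:

```python
from collections import defaultdict

# Hypothetical label stream: (labeler_id, example_id, label).
records = [("ann", "ex1", 1), ("ann", "ex2", 0),
           ("bob", "ex1", 0), ("bob", "ex2", 0)]
# Gold standard: trusted labels seeded into the labeling queue.
gold = {"ex1": 1, "ex2": 0}

hits, totals = defaultdict(int), defaultdict(int)
for labeler, example, label in records:
    if example in gold:                 # only score gold-standard items
        totals[labeler] += 1
        hits[labeler] += int(label == gold[example])

for labeler in totals:
    accuracy = hits[labeler] / totals[labeler]
    if accuracy < 0.9:                  # illustrative certification threshold
        print(f"Review labeler {labeler}: {accuracy:.0%} on gold set")
```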
[Diagram: the same online ML system, with the storage and preprocessing system highlighted — this is where expectation tests apply]
Expectation tests
• Goals:
• Catch data quality issues before they make their way into your pipeline
• How?
• Define expectations (assertions) about your data, e.g., with Great Expectations: https://greatexpectations.io/
• Define expectations manually, generate them automatically by profiling the data, or both
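A minimal sketch using Great Expectations' legacy pandas-style API (the file name, column names, and allowed values are illustrative, and method details vary across library versions):

```python
import great_expectations as ge

# Wrap an incoming batch of data (legacy pandas-style API).
df = ge.read_csv("incoming_batch.csv")

# Expectations about data quality, checked before the batch
# enters the training pipeline.
df.expect_column_values_to_not_be_null("user_id")
df.expect_column_values_to_be_between("age", min_value=0, max_value=120)
df.expect_column_values_to_be_in_set("country", ["US", "CA", "FR"])

results = df.validate()
if not results["success"]:
    raise ValueError("Data quality check failed")
```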
• Infrastructure tests
• Evaluation tests
• Expectation tests
• Software testing
1. Kim, Been, Rajiv Khanna, and Oluwasanmi O. Koyejo. "Examples are not enough, learn to criticize! Criticism for interpretability." Advances in Neural Information Processing Systems (2016).
2. Miller, Tim. "Explanation in artificial intelligence: Insights from the social sciences." arXiv preprint arXiv:1706.07269 (2017).
Use an interpretable family of models:
• Linear regression
• Logistic regression
• Generalized linear models
• Decision trees
Pro: interpretable and explainable (to the right user). Con: not very powerful.

Train a surrogate model:
• Train an interpretable surrogate model on your data and your original model's predictions
Pro: easy to use; general. Con: interpretable, but not explainable (the surrogate mimics the original model's predictions, not its reasoning).
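A minimal sketch of the surrogate approach with scikit-learn; the black-box model, toy data, and depth limit are all illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative black-box model and data.
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
blackbox = GradientBoostingClassifier().fit(X, y)

# Surrogate: a shallow tree trained on the black box's *predictions*,
# not on the true labels.
surrogate = DecisionTreeClassifier(max_depth=3).fit(X, blackbox.predict(X))

# Fidelity: how often the surrogate agrees with the black box.
fidelity = (surrogate.predict(X) == blackbox.predict(X)).mean()
print(f"Surrogate fidelity: {fidelity:.1%}")
print(export_text(surrogate))  # human-readable decision rules
```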
Feature importance (e.g., Shapley values / SHAP):
• Intuition: how much does the presence of this feature affect the value of the classifier, independent of what other features are present?
Pro: works for a variety of data; mathematically principled. Con: tricky to implement, and still doesn't provide explanations.
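A minimal sketch with the shap library; the toy data and model are illustrative, and TreeExplainer assumes a tree-ensemble model (KernelExplainer handles arbitrary black boxes):

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Illustrative model and data.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
model = GradientBoostingClassifier().fit(X, y)

# Shapley-value feature attributions, computed efficiently for trees.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global summary: which features move the model's output the most?
shap.summary_plot(shap_values, X)
```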
• How often will users be interacting with the model? If frequently, interpretability alone may be enough.
• Not really
Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead (https://arxiv.org/pdf/1811.10154.pdf)
• Interpretability methods like SHAP, LIME, etc. can help users reach the interpretability threshold faster