Arize Guide To Optimized Retraining
Fixed: Retraining on a set cadence (e.g., daily, weekly, monthly).
Dynamic: Ad hoc retraining triggered by model performance metrics.
While the fixed approach is straightforward to implement, it has drawbacks. Compute costs can be higher than necessary, frequent retraining can introduce inconsistencies from one model version to the next, and an infrequent schedule can leave the model stale. The dynamic approach can prevent models from going stale while optimizing compute costs.
While there are numerous approaches to retraining, Arize has compiled some recommended best practices for dynamic model retraining that will keep models healthier and more performant.
There is a wealth of tutorials for specific tooling. Here are a few:
Ready to get started? Take it a step further with Etsy’s take on stateful model retraining.
Retraining Strategy
Automating the retraining of a live machine learning model can be a complex task, but there
are some best practices that can help guide the design.
1. Metrics to trigger retraining: The metrics used to trigger retraining will depend on the specific model and use case. Each metric needs a threshold, which is used to trigger retraining when the model's performance falls below it.
This is where monitors can come into play. When a performance monitor fires in Arize, for
example, the Arize GraphQL API can be used to programmatically query the performance
and drift metrics to evaluate whether retraining is needed.
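As a sketch of this trigger, the check below queries a GraphQL endpoint and compares the returned metrics against thresholds. The endpoint URL, query fields (`accuracy`, `psi`), auth header, and the `should_retrain` helper are all illustrative assumptions, not the actual Arize schema; consult the Arize GraphQL API docs for the real field names and authentication.

```python
import json
import urllib.request

# NOTE: endpoint, query fields, and auth header are illustrative
# placeholders -- see the Arize GraphQL API docs for the real schema.
ARIZE_GRAPHQL_URL = "https://api.arize.com/graphql"

PERFORMANCE_QUERY = """
query ModelMetrics($modelId: ID!) {
  model(id: $modelId) {
    accuracy   # hypothetical performance metric field
    psi        # hypothetical drift metric field
  }
}
"""

def fetch_metrics(api_key: str, model_id: str) -> dict:
    """POST the query to the GraphQL endpoint and return the model's metrics."""
    payload = json.dumps(
        {"query": PERFORMANCE_QUERY, "variables": {"modelId": model_id}}
    ).encode()
    req = urllib.request.Request(
        ARIZE_GRAPHQL_URL,
        data=payload,
        headers={"Content-Type": "application/json", "x-api-key": api_key},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["data"]["model"]

def should_retrain(metrics: dict, accuracy_floor: float, psi_ceiling: float) -> bool:
    """Trigger retraining when performance falls below its threshold
    or drift rises above its threshold."""
    return metrics["accuracy"] < accuracy_floor or metrics["psi"] > psi_ceiling
```

In practice this decision function would run inside whatever automation receives the monitor alert (a webhook handler, a scheduled job, or a CI pipeline step).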
The baseline dataset used to calculate drift can be derived from either the training
dataset, or a window of production data.
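Whichever baseline is chosen, drift against it is commonly measured with a statistic such as the Population Stability Index (PSI). A minimal sketch over pre-binned proportions (the function name and the 0.2 rule of thumb are conventions, not part of this guide):

```python
import math

def psi(baseline_props, production_props, eps=1e-6):
    """Population Stability Index between two binned distributions.

    Each argument is a list of bin proportions summing to ~1; `eps`
    guards against log(0) when a bin is empty."""
    total = 0.0
    for b, p in zip(baseline_props, production_props):
        b = max(b, eps)
        p = max(p, eps)
        total += (p - b) * math.log(p / b)
    return total

# Identical distributions yield PSI = 0; a common rule of thumb
# treats PSI > 0.2 as significant drift worth investigating.
```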
Once a candidate model is retrained, there are several strategies for validating it before promotion:
• Human review
• Automated metric checks in CI/CD pipeline
• Champion vs. Challenger - serve production traffic to both models but only use the
prediction/response from the existing model (champion) in the application. The data
from the challenger model is stored for analysis but not used.
• A/B testing - split production traffic to both models for a fixed experimentation period.
Compare key metrics at the end of the experiment and decide which model to promote.
• Canary deployment - start by redirecting a small percentage of production traffic to the new model. Since it's in a production path, this catches real issues with the new model while limiting the impact to a small percentage of users. Gradually ramp up traffic until the new model receives 100%.
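As an illustration of the canary ramp-up, the routing sketch below sends a stable fraction of users to the new model by hashing the user id; the function name and hashing scheme are assumptions for illustration, not a prescribed implementation.

```python
import hashlib

def route_request(user_id: str, canary_fraction: float) -> str:
    """Route a stable fraction of traffic to the challenger model.

    Hashing the user id (rather than random sampling) keeps each
    user's routing consistent, and a user who reaches the challenger
    at a small canary fraction stays there as the fraction ramps up."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    bucket = (int(digest, 16) % 10_000) / 10_000.0  # stable value in [0, 1)
    return "challenger" if bucket < canary_fraction else "champion"
```

Ramping the canary is then just raising `canary_fraction` in steps (e.g., 0.01 → 0.1 → 0.5 → 1.0) while watching the challenger's metrics at each step.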
Measurable ROI
Overall, it’s important to have a clear understanding of your business requirements and
the problem you are trying to solve when determining the best approach for automating
the retraining of a live machine learning model. It’s also important to continuously
monitor the performance of the model and make adjustments to the retraining cadence
and metrics as needed.
Here, we propose a weekly cost calculation, although this calculator can be adapted to
a different cadence such as daily or monthly depending on the model’s purpose and
maintenance needs.
Retraining Cost = (compute cost for retraining + cost of storing new model) × frequency per week
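The calculation above can be expressed as a small helper; the function name and the example figures are illustrative, not benchmarks.

```python
def weekly_retraining_cost(
    compute_cost: float, storage_cost: float, runs_per_week: float
) -> float:
    """Weekly cost = (compute cost per retrain + cost of storing
    the new model) x retraining frequency per week."""
    return (compute_cost + storage_cost) * runs_per_week

# Example: $40 of compute plus $2 of storage per retrain, three runs a week.
cost = weekly_retraining_cost(40.0, 2.0, 3)  # → 126.0
```

Running the same helper with a daily cadence (`runs_per_week=7`) versus a weekly one (`runs_per_week=1`) makes the cost of over-frequent retraining concrete.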
(Figure: cost trade-off between a model that is retrained too frequently and one that is not retrained enough)
Transitioning from model retraining at fixed intervals to automated model retraining triggered by model performance offers numerous benefits for organizations, from lower compute costs at a time when cloud costs are increasing to better AI ROI from improved model performance.
Hopefully this guide provides a template for teams to take action.
Questions or thoughts? Feel free to reach out in the Arize Slack community.