Arize Guide To Optimized Retraining



Authors: Trevor LaViale, Claire Longo


Introduction

While the industry has invested heavily in processes and techniques for knowing when to deploy a model into production, there is arguably less collective knowledge on the equally important task of knowing when to retrain a model. In truth, knowing when to retrain a model is hard due to factors like delays in feedback or labels for live predictions. In practice, many practitioners simply retrain on a fixed schedule, or not at all, and hope for the best.

Based on direct experience working with customers with models in production topping billions of daily predictions, this guide is designed to help data scientists and machine learning engineering teams embrace automated retraining.



Approaches for Retraining

There are two core approaches to automated model retraining:

• Fixed: retraining on a set cadence (e.g., daily, weekly, monthly)
• Dynamic: ad hoc retraining triggered by model performance metrics

While the fixed approach is straightforward to implement, it has some drawbacks. Compute costs can be higher than necessary, frequent retraining can introduce inconsistencies from one model version to the next, and an infrequent retraining schedule can leave the model stale.

The dynamic approach can prevent models from going stale and optimize compute cost. While there are numerous approaches to retraining, Arize has compiled some recommended best practices for dynamic model retraining that will keep models healthy and performant.



There is a suite of tools that can be used to create a model retraining system. The diagram on the preceding page shows how an ML observability platform (i.e., Arize) can integrate into a generalized flow.

There is a wealth of tutorials for specific tooling. Here are a few:

• Automated model retraining with EventBridge
• Automated model retraining with Airflow
• Pachyderm example for a FinTech use case

Ready to get started? Take it a step further with Etsy’s take on stateful model retraining.

Retraining Strategy

Automating the retraining of a live machine learning model can be a complex task, but there
are some best practices that can help guide the design.

1. Metrics to trigger retraining: The metrics used to trigger retraining will depend on the
specific model and use case. Each metric needs a threshold, which is used to trigger
retraining when the performance of the model falls below it. This is where monitors can
come into play. When a performance monitor fires in Arize, for example, the Arize GraphQL
API can be used to programmatically query the performance and drift metrics to evaluate
whether retraining is needed.

Ideal metrics to trigger model retraining:


• Prediction (score or label) drift
• Performance metric degradation
• Performance metric degradation for specific segments/cohorts
• Feature drift
• Embeddings drift



Drift is a measure of the distance between two distributions. It is a meaningful metric
for triggering model retraining because it indicates how much your production data
has shifted from a baseline. Statistical drift can be measured with various drift metrics,
such as population stability index (PSI).

The baseline dataset used to calculate drift can be derived from either the training
dataset or a window of production data.
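To make the drift trigger concrete, here is a minimal sketch of a PSI-based check in Python. It assumes prediction scores are available for both a baseline dataset and a recent production window; the 0.25 threshold matches the example scenarios later in this guide, and the function names are illustrative rather than part of any specific platform API.

```python
import numpy as np

def psi(baseline: np.ndarray, production: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline and a production sample."""
    # Bin edges come from the baseline distribution (e.g., the training set).
    edges = np.histogram_bin_edges(baseline, bins=bins)
    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    prod_pct = np.histogram(production, bins=edges)[0] / len(production)
    # Clip empty bins to avoid division by zero / log(0). Production values that
    # fall outside the baseline range are simply ignored in this simple sketch.
    base_pct = np.clip(base_pct, 1e-6, None)
    prod_pct = np.clip(prod_pct, 1e-6, None)
    return float(np.sum((prod_pct - base_pct) * np.log(prod_pct / base_pct)))

# Trigger retraining when prediction-score drift exceeds a chosen threshold
# (0.25 PSI, matching the scenarios later in this guide).
PSI_THRESHOLD = 0.25

def should_retrain(baseline_scores, production_scores) -> bool:
    return psi(np.asarray(baseline_scores), np.asarray(production_scores)) > PSI_THRESHOLD
```

The same pattern applies to feature drift: compute the metric per feature against the baseline and trigger when any monitored feature crosses its threshold.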

2. Ensuring the new model is working


The new model will need to be tested or validated before promoting it to production to
replace the old one. There are a few recommended approaches here:

• Human review
• Automated metric checks in a CI/CD pipeline (a minimal check is sketched below)
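As a rough illustration of the second approach, the following sketch shows a validation gate that could run in a CI/CD pipeline before promotion. The holdout set, metric choice, and thresholds are assumptions for the example; in practice they should mirror the model's own evaluation criteria.

```python
import sys
from sklearn.metrics import accuracy_score

def validate_candidate(candidate_model, champion_model, X_holdout, y_holdout,
                       min_accuracy: float = 0.80) -> bool:
    """Promote only if the candidate clears an absolute floor and does not regress vs. the champion."""
    candidate_acc = accuracy_score(y_holdout, candidate_model.predict(X_holdout))
    champion_acc = accuracy_score(y_holdout, champion_model.predict(X_holdout))
    return candidate_acc >= min_accuracy and candidate_acc >= champion_acc

# In a CI/CD job, a non-zero exit code blocks promotion of the retrained model.
# if not validate_candidate(candidate, champion, X_holdout, y_holdout):
#     sys.exit(1)
```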

3. Strategy for promoting the new model


The strategy for promoting the new model will depend on the impact that the model
has on the business. In some cases, it may be appropriate to automatically replace the
old model with the new one. In other cases, the new model may need to be A/B
tested live before replacing the old model.

Some strategies for live model testing to consider are:

• Champion vs. Challenger - serve production traffic to both models but only use the
prediction/response from the existing model (champion) in the application. The data
from the challenger model is stored for analysis but not used.
• A/B testing - split production traffic to both models for a fixed experimentation period.
Compare key metrics at the end of the experiment and decide which model to promote.
• Canary deployment - start by redirecting a small percentage of production traffic to the
new model. Since it’s in a production path, this helps catch real issues with the new
model while limiting the impact to a small percentage of users. Ramp up the traffic until
the new model receives 100% of it (a routing sketch follows this list).
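For the canary and A/B patterns above, traffic is usually split deterministically so each user consistently sees one variant. Here is a minimal sketch of hash-based routing; the model objects, percentages, and function names are illustrative and not tied to any particular serving framework.

```python
import hashlib

def route_to_new_model(user_id: str, new_model_share: float) -> bool:
    """Deterministically send a fixed share of traffic to the new model.

    Hashing the user id keeps each user pinned to one variant for the whole
    experiment, which makes metric comparison between the models cleaner.
    """
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return bucket < new_model_share * 100

def predict(user_id, features, champion_model, challenger_model, new_model_share=0.05):
    # Canary ramp: e.g., 0.05 -> 0.25 -> 0.50 -> 1.0 as confidence in the new model grows.
    model = challenger_model if route_to_new_model(user_id, new_model_share) else champion_model
    return model.predict([features])[0]
```

A champion vs. challenger setup uses the same split, except both models score every request and only the champion's prediction is returned to the application.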

4. Retraining feedback loop data


Once we identify that the model needs to be retrained, the next step is to choose the
right dataset to retrain with. Here are some recommendations to ensure the new
training data will improve the model’s performance.
• If the model performs well overall but is failing to meet optimal performance criteria
on some segments, such as specific feature values or demographics, the new training
dataset should contain extra data points for these lower performing segments. A simple
upsampling strategy can be used to create a new training dataset that targets these
low-performing segments (see the sketch after this list).
• If the model is trained on a small timeslice, the training dataset may not accurately
capture and represent all possible patterns that will appear in the live production data. To
prevent this, avoid training the model on recent data alone. Instead, use a large sample of
historical data, and augment this with the latest data to add additional patterns for the
model to learn from.
• If your model architecture follows the transfer learning design, new data can simply be
added to the model during retraining, without losing the patterns that the model has
already learned from previous training data.
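As an example of the first recommendation, here is a minimal upsampling sketch. It assumes the training data is a pandas DataFrame and that the underperforming segment can be identified by a boolean mask; the column name and upsampling factor are hypothetical.

```python
import pandas as pd

def upsample_segment(train_df: pd.DataFrame, segment_mask: pd.Series,
                     factor: int = 3, seed: int = 42) -> pd.DataFrame:
    """Duplicate rows from a low-performing segment so it carries more weight in retraining."""
    segment_rows = train_df[segment_mask]
    extra = segment_rows.sample(n=len(segment_rows) * (factor - 1),
                                replace=True, random_state=seed)
    return pd.concat([train_df, extra], ignore_index=True)

# Example: the model underperforms for one product category (hypothetical column).
# mask = train_df["product_category"] == "furniture"
# augmented_train_df = upsample_segment(train_df, mask, factor=3)
```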



Arize dashboards are great for tracking and comparing live model performance during
these tests. Whether the model is tested as a shadow deployment, a live A/B test, or simply an
offline comparison, these dashboards offer a simple way to view a side-by-side model
comparison. The dashboards can also easily be shared with others to demonstrate model
performance improvements to stakeholders.

Measurable ROI

Overall, it’s important to have a clear understanding of your business requirements and
the problem you are trying to solve when determining the best approach for automating
the retraining of a live machine learning model. It’s also important to continuously
monitor the performance of the model and make adjustments to the retraining cadence
and metrics as needed.

Measuring Cost Impact


Although it is challenging to calculate direct ROI for some tasks in AI, the value of
optimized model retraining is simple, tangible, and possible to calculate directly. The
compute and storage costs for model training jobs are often already tracked as part of
cloud compute costs. Often, the business impact of a model can be calculated as well.



When optimizing retraining, we are considering both the retraining costs and the impact
of model performance on the business (“AI ROI”). We can weigh these against each
other to justify the cost of model retraining.

Here, we propose a weekly cost calculation, although it can be adapted to a different
cadence such as daily or monthly depending on the model’s purpose and maintenance needs.

Retraining Cost = (compute cost for retraining + cost of storing the new model) × frequency per week
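Expressed as code, the calculation might look like the sketch below; the dollar figures mirror the contrived scenarios that follow, with the storage cost folded into the per-run figure.

```python
def weekly_retraining_cost(compute_cost: float, storage_cost: float,
                           retrains_per_week: int) -> float:
    """Retraining Cost = (compute cost + storage cost) x frequency per week."""
    return (compute_cost + storage_cost) * retrains_per_week

# Scenario one below: daily retraining at $200 per run versus
# drift-triggered retraining twice per week.
old_cost = weekly_retraining_cost(200, 0, 7)   # $1,400 per week
new_cost = weekly_retraining_cost(200, 0, 2)   # $400 per week
savings = 1 - new_cost / old_cost              # ~0.71, i.e. roughly a 71% reduction
```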

SCENARIO ONE: The model is retrained too frequently

My model costs $200 to retrain. I retrain my model once per day. This model maintained a
steady average weekly accuracy of 85%. I set up a pipeline to automatically retrain based on
prediction score drift greater than 0.25 PSI and accuracy. Based on the new rule, my model
starts retraining only twice a week, and maintains that accuracy of 85%.

Comparison of weekly maintenance costs:
• Old model maintenance cost: 7 × $200 = $1,400
• New model maintenance cost: 2 × $200 = $400

That is roughly a 71% reduction in model maintenance costs. Although this is a simple,
contrived example, the magnitude of cost savings can be on this scale.

SCENARIO TWO: The model is not retrained enough

My model costs $200 to train. I retrain my model once per week. This model maintained a
steady average weekly accuracy of 65%. I set up a pipeline to automatically retrain based on
prediction score drift greater than 0.25 PSI. Based on the new rule, my model retrains twice a
week, and has achieved a better accuracy of 85%.

Comparison of weekly maintenance costs:
• Old model maintenance cost: 1 × $200 = $200 for 65% accuracy
• New model maintenance cost: 2 × $200 = $400 for 85% accuracy

For a higher price, better model performance has been achieved. This can be justified and
profitable if the AI ROI gains are higher than the retraining costs. Lack of frequent retraining
could have been leaving money on the table.



Conclusion

Transitioning from model retraining at fixed intervals to automated model retraining triggered
by model performance offers numerous benefits for organizations, from lower compute costs at
a time when cloud costs are increasing to better AI ROI from improved model performance.
Hopefully this guide provides a template for teams to take action.

Questions or thoughts? Feel free to reach out in the Arize Slack community.

To start your ML observability journey, sign up for a free account or schedule a demo.

For the latest on ML observability best practices and tips, sign up for our monthly newsletter, The Drift.
