In this task, Research presents our experiments and conclusions on the optimal retraining frequency for ML models (based on the system designed in T392305).
Description
Details
- Due Date: Aug 29 2025, 4:00 AM
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Stalled |  | fkaelin | T392305 [Request] Create a replicable system to determine the optimal retraining frequency for ML models |
| Resolved |  | diego | T399726 Provide a recommendation on the optimal retraining frequency for ML models |
Event Timeline
Recommendation 1
TL;DR (Revert Risk): Models should be retrained at least once a year
Main findings
- Precision decreases by approximately 1% per year.
- A Revert Risk model trained on 2024 data and used in 2025 would have:
  - ~1% higher precision than the same model trained on 2023 data
  - ~2% higher precision than one trained on 2022 data
- This decay in precision is consistent regardless of training data length (3 months, 1 year, or 2 years).
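As a back-of-the-envelope illustration (not part of the experiments), the findings above can be read as a linear decay of roughly one point of precision per year of training-data staleness. The base precision of 0.90 below is hypothetical:

```python
def expected_precision(base_precision, years_stale, decay_per_year=0.01):
    """Illustrative linear-decay model: precision drops ~1 point per year
    of training-data staleness, per the findings above. Purely a sketch;
    `base_precision` is a made-up starting value, not a measured one."""
    return base_precision - decay_per_year * years_stale

# A model evaluated in 2025, trained on progressively older data:
for train_year in (2024, 2023, 2022):
    print(train_year, round(expected_precision(0.90, 2025 - train_year), 2))
# 2024 → 0.89, 2023 → 0.88, 2022 → 0.87
```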
Experiment details
- Evaluation data: 1 month (May 2025) of *enwiki*
- Training data: Random samples (100K per experiment), from April 2015 to April 2025
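The per-experiment sampling step can be sketched as follows; `sample_training_set` and its inputs are hypothetical, but the 100K sample size matches the setup above:

```python
import random

def sample_training_set(rows, n=100_000, seed=0):
    """Draw a fixed-size random sample of labeled revisions for one
    experiment. `rows` is a hypothetical list of candidate revisions
    from the training window; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))

# Toy usage: sample 100 of 500 candidate rows.
subset = sample_training_set(list(range(500)), n=100)
```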
The plot shows precision for three different training lengths. Each dot represents the end of the training period. For example:
- A dot on April 2016 represents:
- February–April 2016 for the 3-month training period
- April 2015–April 2016 for the 1-year training period
- April 2014–April 2016 for the 2-year training period
For all three cases, we observe a similar pattern: precision decays by ~1% per year.
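The window construction described above can be sketched with a small helper. `training_windows` is hypothetical; the month offsets are chosen to reproduce the three example windows for a dot at April 2016 (the 3-month window is counted inclusively, February through April):

```python
from datetime import date

def month_shift(d, months):
    """Shift a date by a number of months, clamped to the first of the month."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)

# Offsets reproducing the examples in the post: a dot at April 2016 maps to
# Feb–Apr 2016, Apr 2015–Apr 2016, and Apr 2014–Apr 2016 respectively.
WINDOW_OFFSETS = {"3-month": -2, "1-year": -12, "2-year": -24}

def training_windows(end):
    """(start, end) training windows for a dot at month `end`."""
    return {name: (month_shift(end, off), end)
            for name, off in WINDOW_OFFSETS.items()}

windows = training_windows(date(2016, 4, 1))
```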
To examine the effect on tools that use an ad-hoc threshold, such as Automoderator, we also computed results at a threshold of 0.95.
As expected, results with a higher threshold are more sensitive to the training period, showing a decay of more than 15% over 10 years for longer training periods (2 years).
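A minimal sketch of how precision at a fixed threshold could be computed; the scores and labels below are toy values, not experiment data:

```python
def precision_at_threshold(scores, labels, threshold=0.95):
    """Precision among edits whose score meets the threshold.

    `scores` are model scores in [0, 1]; `labels` are ground-truth
    revert flags (1 = the edit was actually reverted). Hypothetical
    helper, not the evaluation code used in the experiments."""
    flagged = [y for s, y in zip(scores, labels) if s >= threshold]
    if not flagged:
        return None  # no predictions cleared the threshold
    return sum(flagged) / len(flagged)

# Toy example: three edits clear the 0.95 threshold, two were reverted.
scores = [0.99, 0.97, 0.40, 0.96, 0.10]
labels = [1, 1, 1, 0, 0]
print(precision_at_threshold(scores, labels))
```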
Limitations
- Results are based on a single evaluation dataset.
- Anecdotal evidence suggests similar patterns with other datasets.
- Running additional experiments is costly (around 55 experiments, ~5h each).
- Consistency across different training periods suggests generalizability, but ideally more experiments on new evaluation data should be run (ongoing).
- Experiments were conducted with the Revert Risk Language-Agnostic model. Results may vary for other models, though similar trends are expected.
Possible next steps
- Study the impact of data freshness for different wiki databases.
- Repeat the experiment with different evaluation datasets.
- Run experiments to determine the optimal length of the training period.
Thank you, @diego !! Resolving this task as the recommendation has been published. Thanks for all the work!

