
Provide a recommendation on the optimal retraining frequency for ML models
Closed, Resolved · Public

Description

In this task, Research presents our experiments and conclusions on the optimal retraining frequency for ML models (based on the system designed in T392305).

Details

Due Date
Aug 29 2025, 4:00 AM

Event Timeline

Isaac triaged this task as Medium priority.Jul 16 2025, 6:52 PM
Isaac set Due Date to Aug 29 2025, 4:00 AM.
diego renamed this task from Provide a recommendation on the optional frequency for ML models to Provide a recommendation on the optimal retraining frequency for ML models.Aug 15 2025, 4:22 PM

Recommendation 1

TL;DR (Revert Risk): Models should be retrained at least once a year

Main findings

  • Precision decreases by approximately 1% per year.
  • A Revert Risk model trained on 2024 data and used in 2025 would have:
    • 1% better precision than a model trained on 2023 data
    • 2% better precision than a model trained on 2022 data
  • This decay in precision is consistent regardless of training data length (3 months, 1 year, or 2 years).

Experiment details

  • Evaluation data: 1 month (May 2025) of enwiki
  • Training data: Random samples (100K per experiment), from April 2015 to April 2025

The plot shows precision for three different training lengths. Each dot represents the end of the training period. For example:

  • A dot on April 2016 represents:
    • February–April 2016 for the 3-month training period
    • April 2015–April 2016 for the 1-year training period
    • April 2014–April 2016 for the 2-year training period
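For illustration, the three training windows ending at a given month could be constructed as below. This is a minimal sketch, not the actual experiment pipeline; the function name and the month-offset convention are assumptions chosen to reproduce the April 2016 examples above.

```python
from datetime import date

def training_windows(end: date) -> dict:
    """Return (start, end) month pairs for the three training lengths.

    Offsets reproduce the examples above for end = April 2016:
    Feb-Apr 2016, Apr 2015-Apr 2016, Apr 2014-Apr 2016.
    """
    def months_back(d: date, months: int) -> date:
        # Step back a number of calendar months, keeping the 1st of the month.
        y, m = divmod(d.year * 12 + (d.month - 1) - months, 12)
        return date(y, m + 1, 1)

    offsets = {"3_months": 2, "1_year": 12, "2_years": 24}
    return {name: (months_back(end, k), end) for name, k in offsets.items()}

print(training_windows(date(2016, 4, 1))["2_years"][0])  # 2014-04-01
```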

For all three cases, we observe a similar pattern: precision decays by ~1% per year.
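The ~1% per year figure can be read off as the slope of a least-squares line through (model age, precision) points. A toy sketch with made-up precision values that merely illustrate the pattern (these are not the experiment's numbers):

```python
def slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

years_old  = [0, 1, 2, 3]                  # model age at evaluation time
precision  = [0.90, 0.89, 0.88, 0.87]      # illustrative values only
print(round(slope(years_old, precision), 3))  # -0.01, i.e. ~1% lost per year
```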

[Figure: plots(2).png — precision vs. end of training period for the three training lengths]

To examine the effect on tools that use an ad-hoc threshold, such as Automoderator, we computed results using a threshold of 0.95.

As expected, results with a higher threshold are more sensitive to the training period, showing a decay of more than 15% over 10 years for longer training periods (2 years).
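Precision at a fixed high threshold counts only the predictions whose score clears the cutoff. A minimal sketch of the metric; the scores and labels below are invented for illustration:

```python
def precision_at_threshold(scores, labels, threshold=0.95):
    """Fraction of threshold-clearing predictions that are true positives."""
    flagged = [label for score, label in zip(scores, labels) if score >= threshold]
    return sum(flagged) / len(flagged) if flagged else float("nan")

scores = [0.99, 0.97, 0.96, 0.80, 0.96]
labels = [1, 1, 0, 1, 1]   # 1 = the edit was actually reverted
print(precision_at_threshold(scores, labels))  # 0.75
```

Because only a small tail of predictions clears 0.95, this metric shifts more sharply than overall precision as the model ages, which matches the stronger decay observed above.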

[Figure: plot_retraining_95_thereshold.jpeg — precision over time at threshold 0.95]

Limitations

  • Results are based on a single evaluation dataset.
    • Anecdotal evidence suggests similar patterns with other datasets.
    • Running additional experiments is costly (around 55 experiments, ~5h each).
    • Consistency across different training periods suggests generalizability, but ideally more experiments on new evaluation data should be run (ongoing).
  • Experiments were conducted with the Revert Risk Language-Agnostic model. Results may vary for other models, though similar trends are expected.

Possible next steps

  • Study the impact of data freshness for different wiki databases.
  • Repeat the experiment with different evaluation datasets.
  • Run experiments to determine the optimal length of the training period.

Thank you, @diego!! Resolving this task as the recommendation has been published. Thanks for all the work!