Goal: Create a replicable system to determine the optimal retraining frequency for ML models by evaluating the impact of training data age on their performance (precision and recall). Test it with the Revert Risk models.
Task:
- Define a fixed and up-to-date test dataset: Use the most recent data available (example: data from March 2025).
- Execute a series of controlled retrainings: Train at least 10 Revert Risk models. In each iteration, vary the latest date of the training data, creating an increasing lag relative to the test data (example: from February 2025 back to April 2024).
- Monitor and report performance: For each trained model, record and analyze the precision and recall metrics on the fixed test dataset.
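The steps above can be sketched as a simple sweep over training-data cutoff dates. This is a minimal illustration, not the actual pipeline: `train_model` and `evaluate` are hypothetical stand-ins for the real Revert Risk training and evaluation code, and the dates mirror the examples in the task (fixed test data from March 2025; cutoffs from February 2025 back to April 2024, which yields 11 models, satisfying the "at least 10" requirement).

```python
from datetime import date

def month_range(start: date, end: date):
    """Yield the first day of each month from `end` back to `start`, inclusive."""
    y, m = end.year, end.month
    while (y, m) >= (start.year, start.month):
        yield date(y, m, 1)
        m -= 1
        if m == 0:
            y, m = y - 1, 12

def lag_sweep(train_model, evaluate, test_set):
    """Retrain with increasing lag and record metrics on the fixed test set.

    `train_model(train_data_until=...)` and `evaluate(model, test_set)`
    are hypothetical interfaces assumed for this sketch.
    """
    results = []
    for cutoff in month_range(date(2024, 4, 1), date(2025, 2, 1)):
        model = train_model(train_data_until=cutoff)
        precision, recall = evaluate(model, test_set)
        results.append({
            "cutoff": cutoff.isoformat(),
            "precision": precision,
            "recall": recall,
        })
    return results
```

Plotting precision and recall against the cutoff dates from `results` would then show how quickly performance degrades as the training data ages.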
Background: Research has been developing tooling to enable more thorough evaluation of LLMs for various tasks. For example, LLMPerf allows for benchmarking of resource usage by models, and T386448 established metrics for evaluating the quality of simple summaries. Through projects like SDS 1.2.1B (T377159), we have also developed some good practices around what types of modeling strategies to try when optimizing performance. A major missing piece in our strategy is guidance on how often to retrain the models that we do develop, in the context of (assumed) model and data drift. This task is oriented towards helping us understand the importance of that gap and develop recommendations for how to address it. The focus on "replicable system" is not about establishing actual infrastructure for this retraining, but rather a general framework (like LLMPerf) that can be applied to other models as well, even if our initial focus is Revert Risk.
Stakeholders: this is a Research task because we are still in the understanding/exploration/prototyping space. We will coordinate with ML Platform, though, since any frameworks we develop will hopefully be applicable to production use-cases as well.
Status: @MunizaA has created a configurable Airflow DAG that retrains the Revert Risk Language Agnostic model. Among other features, the system allows users to:
- Create multiple training datasets:
- Training periods: The system receives the training periods as input (e.g. monthly, with or without overlap).
- Data balancing: Balance data per language or label
- Outputs: The system outputs several metrics, such as F1, accuracy, and precision, and provides these results for different probability thresholds, allowing computation of precision @ recall X
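Given per-threshold outputs of the kind described above, "precision @ recall X" can be derived by picking, among all thresholds that still reach the target recall, the one with the best precision. A minimal sketch follows; the list-of-dicts shape of `threshold_metrics` is an assumption for illustration, not the DAG's actual output format.

```python
def precision_at_recall(threshold_metrics, target_recall):
    """Return the best precision achievable at recall >= target_recall.

    `threshold_metrics` is assumed to be a list of dicts, one per
    probability threshold, each with "precision" and "recall" keys.
    Returns None if no threshold reaches the target recall.
    """
    eligible = [m for m in threshold_metrics if m["recall"] >= target_recall]
    if not eligible:
        return None
    return max(m["precision"] for m in eligible)

# Example with made-up numbers: at recall >= 0.9, the best precision is 0.80.
metrics = [
    {"threshold": 0.3, "precision": 0.70, "recall": 0.95},
    {"threshold": 0.5, "precision": 0.80, "recall": 0.90},
    {"threshold": 0.7, "precision": 0.90, "recall": 0.60},
]
```

Comparing this single number across the retrained models gives one drift-sensitive summary metric per training-data cutoff.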