
[Request] Create a replicable system to determine the optimal retraining frequency for ML models
Open, Stalled, MediumPublic

Description

Goal: Create a replicable system to determine the optimal retraining frequency for ML models by evaluating the impact of training data age on their performance (precision and recall). Test it with the Revert Risk models.

Task:

  • Define a fixed and up-to-date test dataset: Utilize the most recent data available (example: data from March 2025).
  • Execute a series of controlled retrainings: Train at least 10 Revert Risk models. In each iteration, vary the latest date of the training data, creating an increasing lag relative to the test data (example: from February 2025 back to April 2024).
  • Monitor and report performance: For each trained model, record and analyze the precision and recall metrics on the fixed test dataset.
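The three steps above can be sketched as a simple experiment loop; `train_fn` and `eval_fn` are placeholders for the actual Revert Risk training and evaluation code, not part of any existing system:

```python
from datetime import date

def monthly_cutoffs(test_month: date, n: int) -> list[date]:
    """Return n training-data cut-off dates, walking back one month at a
    time from the month before the fixed test month."""
    cutoffs, year, month = [], test_month.year, test_month.month
    for _ in range(n):
        month -= 1
        if month == 0:
            month, year = 12, year - 1
        cutoffs.append(date(year, month, 1))
    return cutoffs

def run_lag_experiment(test_month: date, n_models: int, train_fn, eval_fn) -> list[dict]:
    """For each cut-off, train a model on data up to that date and record
    precision/recall on the fixed test dataset."""
    results = []
    for cutoff in monthly_cutoffs(test_month, n_models):
        model = train_fn(cutoff)            # placeholder: train on data <= cutoff
        precision, recall = eval_fn(model)  # placeholder: score on the fixed test set
        results.append({"cutoff": cutoff, "precision": precision, "recall": recall})
    return results
```

With a March 2025 test set and 11 iterations, the cut-offs run from February 2025 back to April 2024, matching the example above.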

Background: Research has been developing tooling to enable more thorough evaluation of LLMs for various tasks. For example, LLMPerf allows for benchmarking of resource usage by models and T386448 established metrics for evaluating the quality of simple summaries. Through projects like SDS 1.2.1B (T377159), we have also developed some good practices around what types of modeling strategies to try when trying to optimize performance. A major missing piece in our strategy is guidance on how often to retrain the models that we do develop, in the context of (assumed) model and data drift. This task is oriented towards helping us to understand the importance of that gap and recommendations for how to address it. The focus on "replicable system" is not about establishing actual infrastructure for this retraining but instead a general framework (like LLMPerf) that can be applied to other models as well even if our initial focus is Revert Risk.

Stakeholders: this is a Research task because we are still in the understanding/exploration/prototyping space. We will coordinate with ML Platform, though, since any frameworks we develop will hopefully be applicable to production use cases as well.

Status: @MunizaA has created a configurable Airflow DAG that retrains the Revert Risk Language Agnostic model. Among other features, the system can:

  • Create multiple training datasets:
    • Training periods: The system receives the training periods as input (e.g. monthly, with or without overlap).
    • Data balancing: Balance data per language or label.
  • Outputs: The system outputs several metrics such as F1, accuracy, and precision, and provides these results at different probability thresholds, allowing computation of precision @ recall X.
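For illustration, precision @ recall X can be derived from per-threshold scores roughly like this (a minimal sketch, not the DAG's actual implementation):

```python
def precision_at_recall(y_true: list[int], scores: list[float], target_recall: float) -> float:
    """Sweep probability thresholds and return the best precision among
    operating points whose recall is at least `target_recall`."""
    total_pos = sum(y_true)
    best = 0.0
    for t in sorted(set(scores), reverse=True):
        preds = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(preds, y_true))
        fp = sum(p and not y for p, y in zip(preds, y_true))
        recall = tp / total_pos if total_pos else 0.0
        if recall >= target_recall and tp + fp:
            best = max(best, tp / (tp + fp))
    return best
```

For example, with labels `[1, 1, 0, 1, 0]` and scores `[0.9, 0.8, 0.7, 0.6, 0.2]`, precision at recall 0.5 is 1.0 (threshold 0.8 catches two positives with no false positives).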

Event Timeline

Thanks @diego for putting this together! I'll work on prioritizing. A few thoughts / questions in the meantime to consider:

  • What factors should we hold steady? Presumably a uniform number of training examples? Are they randomly sampled from all data before the cut-off though or is there some sort of stratification by time or other approach that should be used?
  • Some stretch ideas once the basic system is working:
    • Do you want to explore strategies for reducing the impact of model drift? For instance, testing whether training the model on only the most recent data before a particular cut-off helps, as opposed to a more uniform distribution over time? Other data sampling strategies? I wonder if it would make sense to try different types of pre-trained models (for the multilingual revert risk) as well, to see if larger/newer base models are perhaps less sensitive to drift? I guess we probably can't find good one-to-one comparisons in that regard, though, so it's hard to know how much can be learned from switching out the base models.
    • I wonder if there's any reason to also explore knowledge cut-offs in the sense of when the article was created? Maybe new articles introduce new vocabulary that throws off the multilingual revert risk model and degrades performance? I wrote up some ideas around how to approach this idea in T383090 and you could just use article creation date.
    • I also would be curious to see if we could test whether you see stable patterns as far as which revisions the models get wrong -- e.g., the newest 3 models correctly predicting a particular test row but the oldest 7 models not?
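One way to check for such stable patterns, assuming each model's binary predictions on the shared test set are available (a sketch; all names are illustrative):

```python
def error_patterns(predictions_by_model: dict[str, list[int]], y_true: list[int]) -> list[set]:
    """For each test row, return the set of models that got it wrong,
    so we can check e.g. whether the oldest models share mistakes that
    the newest models do not make."""
    rows = []
    for i, truth in enumerate(y_true):
        wrong = {name for name, preds in predictions_by_model.items() if preds[i] != truth}
        rows.append(wrong)
    return rows
```

Grouping these per-row sets (e.g. counting rows where only the oldest models appear) would surface the "newest 3 right, oldest 7 wrong" pattern described above.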

Hi @diego, please associate one or more active project tags with this task (via the Add Action...Change Project Tags dropdown). That will allow others to see the task when looking at project workboards or searching for tasks in certain projects, and to get notified about it when watching a related project tag. Thanks!

Thanks @diego for putting this together! I'll work on prioritizing. A few thoughts / questions in the meantime to consider:

  • What factors should we hold steady? Presumably a uniform number of training examples? Are they randomly sampled from all data before the cut-off though or is there some sort of stratification by time or other approach that should be used?

Given that we are considering a system that we can reuse later, I think we should define these as parameters:

  • Label balance: balanced, real (random), or a desired balance (e.g. 0.80 False)
  • Max data: Undefined or fixed
  • Date: start and end date.
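As a sketch, those parameters could be bundled into something like the following (all names are hypothetical, not the DAG's actual config schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrainingDatasetParams:
    """Hypothetical parameter bundle for building one training dataset."""
    start_date: str                        # ISO date, inclusive
    end_date: str                          # ISO date, exclusive
    label_balance: Optional[float] = None  # None = real (random) distribution;
                                           # 0.5 = balanced; 0.8 = 80% False, etc.
    max_rows: Optional[int] = None         # None = undefined (take everything)
```

Keeping these as explicit parameters means every experiment variant (balanced vs. real labels, capped vs. uncapped data) is just a different config, not a code change.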
  • Some stretch ideas once the basic system is working:
    • Do you want to explore strategies for reducing the impact of model drift? For instance, testing whether training the model on only the most recent data before a particular cut-off helps, as opposed to a more uniform distribution over time? Other data sampling strategies? I wonder if it would make sense to try different types of pre-trained models (for the multilingual revert risk) as well, to see if larger/newer base models are perhaps less sensitive to drift? I guess we probably can't find good one-to-one comparisons in that regard, though, so it's hard to know how much can be learned from switching out the base models.

Great idea. I think with the parameters mentioned above we can run these experiments.

  • I wonder if there's any reason to also explore knowledge cut-offs in the sense of when the article was created? Maybe new articles introduce new vocabulary that throws off the multilingual revert risk model and degrades performance? I wrote up some ideas around how to approach this idea in T383090 and you could just use article creation date.

Very good point, I hadn't thought about it. Maybe we could have an optional parameter to filter by "create date". It might even be a good idea to add something more flexible, such as accepting an arbitrary SQL constraint as a parameter (does this sound good @MunizaA?)

  • I also would be curious to see if we could test whether you see stable patterns as far as which revisions the models get wrong -- e.g., the newest 3 models correctly predicting a particular test row but the oldest 7 models not?

For system design, I think this means that we want to be able to filter/group results by some parameters, for example getting results split by "is_anonymous".
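A minimal sketch of that kind of grouped reporting (illustrative names, not the system's actual API):

```python
from collections import defaultdict

def accuracy_by_group(rows: list[dict], group_key: str) -> dict:
    """rows: dicts with 'label', 'prediction', and arbitrary metadata fields.
    Returns accuracy computed separately per value of `group_key`,
    e.g. group_key='is_anonymous'."""
    hits: dict = defaultdict(int)
    totals: dict = defaultdict(int)
    for row in rows:
        g = row[group_key]
        totals[g] += 1
        hits[g] += int(row["label"] == row["prediction"])
    return {g: hits[g] / totals[g] for g in totals}
```

The same pattern extends to precision/recall per group, or to comparing per-group metrics across the differently-lagged models.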

Isaac triaged this task as High priority.May 13 2025, 1:11 PM
Isaac set Due Date to Jun 30 2025, 4:00 AM.
Isaac moved this task from Backlog to In Progress on the Research board.
fkaelin changed the task status from Open to In Progress.May 29 2025, 6:05 PM
fkaelin assigned this task to MunizaA.
  • kickoff meeting with ML team
  • choice of model: language agnostic revert risk model
  • output of work will be report / notebook
  • code will be added to research-datasets
  • @MunizaA spent the last week debugging the system, which is now ready to use.
  • The system was deployed and can be found in the Research Airflow instance.
  • I'm currently running a large experiment (12 different training datasets) to study the effect of data "freshness" on the Revert Risk models.

Next step:

  • Use the system to establish the optimal retraining period for Revert Risk models

Potential future work:

  • Expand this system to be used with other models (e.g. Reference Need, Tone Check)
  • @MunizaA spent the last week debugging the system, which is now ready to use.
  • The system was deployed and can be found in the Research Airflow instance.
  • I'm currently running a large experiment (12 different training datasets) to study the effect of data "freshness" on the Revert Risk models.

The large experiment failed. I'll need engineering help to understand and fix this error. I'm going to coordinate with @Miriam and @fkaelin to decide how to proceed with this issue.

Looking at the logs, the job seems to fail with timeouts and workers being removed from the pool - which often indicates that there are not enough resources available.

25/07/06 10:16:57 WARN TaskSetManager: Lost task 20.0 in stage 2.3 (TID 99049) (an-worker1117.eqiad.wmnet executor 281): FetchFailed(BlockManagerId(47, an-worker1174.eqiad.wmnet, 7337, None), shuffleId=1, mapIndex=399, mapId=8301, reduceId=391, message=
org.apache.spark.shuffle.FetchFailedException: Connecting to an-worker1174.eqiad.wmnet/10.64.165.4:7337 failed in the last 9500 ms, fail this connection directly

Looking at the difference in configuration between the last successful run and the failing one, the period date ranges are different (12 years vs. 2 years). Is the 12-year period intentional? That would mean a large job for the base features, and the Spark config would definitely need to be adjusted (it's not beefy at the moment). The config from the failing run vs. the last successful one:

{"snapshot":"2025-05","wikis":["enwiki"],"period":{"start":"2013-03-01T12:00:00Z","end":"2025-03-01T12:00:00Z"},"output":"/tmp/research/model_retraining_diego_2_months/base_features","partition_output":["wiki_db",10]}

vs

{"snapshot":"2025-05","wikis":["enwiki"],"period":{"start":"2023-04-01T12:00:00Z","end":"2025-04-01T12:00:00Z"},"output":"/tmp/research/model_retraining_diego_2_months/base_features","partition_output":["wiki_db",10]}

Hi!

Is the 12 year period intentional?

Yes, this is intentional. The previous runs were tests; this is the actual experiment we need to run (covering at least 10 years).

That creates a substantial dataset as part of the base features build (the wikitext and parent wikitext for every revision in these 10+ years), which could be around ~10 TB of data. Let's first try with a beefier Spark config, e.g. spark.sql.shuffle.partitions=4000, maxExecutors=129, --executor-cores 4 --executor-memory 24G. There are also timeout configs to play with, but that is not a fun place to be.

Airflow retries failed attempts, but if this job fails once after a number of hours, it is unlikely to succeed on the next retry, so you can mark it as failed. You can see failed attempts in the "logs" tab of the running DAGs, where you can also verify that the desired config is actually used in the spark-submit command.
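Assembled into a spark-submit invocation, the suggested settings would look roughly like this (the script name is a placeholder, and maxExecutors is assumed to go through dynamic allocation):

```shell
spark-submit \
  --master yarn \
  --conf spark.sql.shuffle.partitions=4000 \
  --conf spark.dynamicAllocation.maxExecutors=129 \
  --executor-cores 4 \
  --executor-memory 24G \
  base_features_job.py  # placeholder for the actual job script
```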

That creates a substantial dataset as part of the base features build (the wikitext and parent wikitext for every revision in these 10+ years), which could be around ~10 TB of data. Let's first try with a beefier Spark config, e.g. spark.sql.shuffle.partitions=4000, maxExecutors=129, --executor-cores 4 --executor-memory 24G. There are also timeout configs to play with, but that is not a fun place to be.

This solution keeps failing. The workaround I found was to run the experiment in chunks of two years (2013 to 2015, 2015 to 2017, ...) and then join the results. This is not optimal because it requires manually creating each chunk, but it at least solves the problem.
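The chunking could be scripted rather than created by hand; a small sketch:

```python
def year_chunks(start_year: int, end_year: int, step: int = 2) -> list[tuple[int, int]]:
    """Split [start_year, end_year) into consecutive chunks of `step` years,
    e.g. (2013, 2015), (2015, 2017), ..., one (start, end) pair per run."""
    return [(y, min(y + step, end_year)) for y in range(start_year, end_year, step)]
```

Each pair can then be fed to the DAG as a period config, and a final step joins the per-chunk outputs.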

Isaac changed Due Date from Jun 30 2025, 4:00 AM to Jul 31 2025, 4:00 AM.

There is buggy behavior:

current:

compute_intervals(start='2015-01-01',intervals=2,frequency="2Y",is_overlap_allowed=False)
[Period(start='2015-01-01', end=Timestamp('2015-12-31 00:00:00')),
 Period(start=Timestamp('2015-12-31 00:00:00'), end=Timestamp('2017-12-31 00:00:00'))]

expected

compute_intervals(start='2015-01-01',intervals=2,frequency="2Y",is_overlap_allowed=False)
[Period(start=Timestamp('2015-12-31 00:00:00'), end=Timestamp('2017-12-31 00:00:00')),
 Period(start=Timestamp('2017-12-31 00:00:00'), end=Timestamp('2019-12-31 00:00:00'))]

or (at least)

[Period(start=Timestamp('2015-01-31 00:00:00'), end=Timestamp('2017-12-31 00:00:00')),
 Period(start=Timestamp('2017-12-31 00:00:00'), end=Timestamp('2019-12-31 00:00:00'))]

workaround: align the start date with the period format (always append YS or YE)

compute_intervals(start='2015-01-01',intervals=2,frequency="2YS",is_overlap_allowed=False)

[Period(start='2015-01-01', end=Timestamp('2017-01-01 00:00:00')),
 Period(start=Timestamp('2017-01-01 00:00:00'), end=Timestamp('2019-01-01 00:00:00'))]

compute_intervals(start='2015-12-31',intervals=2,frequency="2YE",is_overlap_allowed=False)

[Period(start='2015-12-31', end=Timestamp('2017-12-31 00:00:00')),
 Period(start=Timestamp('2017-12-31 00:00:00'), end=Timestamp('2019-12-31 00:00:00'))]

possible solution:

replace the interval computation (the function and its call) with:

import pandas as pd
from datetime import datetime

# `Period` is the existing period class used by the DAG.
def compute_intervals_fixed(start: datetime, intervals: int, frequency: str, is_overlap_allowed: bool = True) -> list[Period]:
    if is_overlap_allowed:
        # All periods share the same start date; only the end moves forward.
        ends = pd.date_range(start=start, periods=intervals + 1, freq=frequency, inclusive="right")
        return [Period(start=start, end=end) for end in ends]
    else:
        # Consecutive, non-overlapping periods that share boundaries.
        boundaries = pd.date_range(start=start, periods=intervals + 1, freq=frequency, inclusive="both")
        return [Period(start=boundaries[i], end=boundaries[i + 1]) for i in range(intervals)]


periods = compute_intervals_fixed(
    start=datetime.strptime(params["start"], "%Y-%m-%dT%H:%M:%S%z"),
    intervals=params["intervals"],
    frequency=params["frequency"],
    is_overlap_allowed=params["overlapping"],
)
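A quick sanity check of the non-overlapping branch, using plain tuples in place of the project's Period class:

```python
import pandas as pd

def non_overlapping_periods(start: str, intervals: int, frequency: str) -> list[tuple]:
    """Mirrors the non-overlap branch above; (start, end) tuples stand in
    for Period objects. inclusive='both' is pandas' default, so it is omitted."""
    boundaries = pd.date_range(start=start, periods=intervals + 1, freq=frequency)
    return [(boundaries[i], boundaries[i + 1]) for i in range(intervals)]

periods = non_overlapping_periods("2015-01-01", 2, "2YS")
# Consecutive periods share a boundary and never overlap.
assert all(p[1] == q[0] for p, q in zip(periods, periods[1:]))
```

With frequency "2YS" this yields (2015-01-01, 2017-01-01) and (2017-01-01, 2019-01-01), matching the workaround output above.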

Reassigned this to Fabian as this is a Research engineering task, the Research Science part of this is captured as subtask.

Miriam changed the task status from In Progress to Stalled.Aug 28 2025, 1:46 PM
Miriam lowered the priority of this task from High to Medium.
Miriam removed Due Date which was set to Jul 31 2025, 4:00 AM.

Moving this back to Research backlog until a training infrastructure becomes available. We will work on tooling to standardise the structure of model training, and that will be captured in a separate task.