
[FY25-SDS1.4.2] Update research pipelines to use Dumps2
Closed, Resolved, Public

Description

Update the research pipelines to use the new content history dataset (aka dumps2).

Context: DE has stopped producing the dumps1 wikitext history datasets due to recurring issues. Until the affected research pipelines are updated, they will not produce any data (e.g. the knowledge gaps / content gap metrics).

Event Timeline

fkaelin renamed this task from Update research pipelines to use Dumps2 to [FY25-SDS1.4.2] Update research pipelines to use Dumps2. Apr 2 2025, 8:03 PM
fkaelin claimed this task.
fkaelin triaged this task as High priority.
fkaelin moved this task from Backlog to FY2024-25-Research-April-June on the Research board.

Weekly updates

  • The existing wikidiff dataset has been transformed to an incremental iceberg table (research.mediawiki_content_diff)
  • There are merge requests (1, 2) to finalize the airflow dag that updates the dataset daily based on the content history dataset

Weekly updates

  • The mediawiki content diff (formerly called wikidiff) dataset is in production and backfilled (datahub documentation, airflow dag)

Weekly updates

  • Started code changes to switch add-a-link to the content history dataset (T388146)
fkaelin changed the status of subtask T388144: Update reference risk pipelines for dumps2 from Stalled to In Progress.
fkaelin changed the status of subtask T388146: Update addalink pipeline for dumps2 from Open to In Progress.

Weekly updates

  • Code changes (MR) completed for add-a-link (T388146), reference quality (T388144) and article embedding (T390704) pipelines
  • Tested manually in notebooks, airflow dag updates in progress.

Weekly updates

  • Airflow dag updates were tested and deployed to production
  • The affected pipelines were backfilled and are now up-to-date
  • The remaining open tasks all depend on the content current dataset (wmf_content.mediawiki_content_current_v1). The pipelines have been tested with the wip dataset. When DE promotes this dataset to production, there will be an airflow sensor that the dags can await.

Weekly updates

  • Updates have been completed; all subtasks are resolved
  • This task will be resolved when wmf_content.mediawiki_content_current_v1 is in prod (T391279)

Final asana report:

Hypothesis: The research team will adopt wmf_content.mediawiki_content_history_v1 for all existing use cases in which they currently use the deprecated wmf.mediawiki_wikitext_history.

Confirm whether the hypothesis was supported or contradicted
The hypothesis is supported.

Briefly describe what was accomplished over the course of the hypothesis work (list of deliverables, links to documents, etc.)
Over the course of this hypothesis work we delivered the following:

  • Migrated all data pipelines maintained by research to use the new content history data sources (T385999)
  • The updated pipelines are: content diff (the difference between revision and parent revision), article quality model, knowledge gaps (i.e. content gap metrics), reference risk, revert risk model training, add-a-link model training, article llm embeddings
  • The corresponding airflow dags have been deployed, and back-filled where applicable.
  • Coordination with DE for new data sources and access patterns
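
For readers unfamiliar with the content diff mentioned above (the difference between a revision and its parent revision), the idea can be sketched roughly as follows. This is only an illustration, not the production pipeline: the `Revision` record, its field names, and the use of Python's `difflib` unified diff are assumptions made for the example, and the actual schema and diff format of research.mediawiki_content_diff may differ.

```python
import difflib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Revision:
    # Hypothetical record shape; the real content history schema may differ.
    revision_id: int
    parent_revision_id: Optional[int]
    text: str


def content_diff(revisions: list[Revision]) -> dict[int, list[str]]:
    """Map each revision id to a unified diff against its parent revision."""
    by_id = {rev.revision_id: rev for rev in revisions}
    diffs = {}
    for rev in revisions:
        parent = by_id.get(rev.parent_revision_id)
        # A revision without a known parent is diffed against empty text.
        parent_lines = parent.text.splitlines() if parent else []
        diffs[rev.revision_id] = list(
            difflib.unified_diff(
                parent_lines, rev.text.splitlines(), lineterm="", n=0
            )
        )
    return diffs


# Usage: two revisions of one page, the second adding a line to the first.
revisions = [
    Revision(1, None, "Hello world"),
    Revision(2, 1, "Hello world\nA second line"),
]
diffs = content_diff(revisions)
for line in diffs[2]:
    print(line)
```

Because each diff only needs a revision and its parent, a computation of this shape can in principle be run over just the newly arrived revisions, which is what makes a daily incremental update feasible.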

Major lessons

Migrating the research pipelines to an incremental content history has been a goal for a while, and it is great that we were able to accomplish this. It represented a substantial effort and illustrated the cost of maintenance for a small team like research (one swe, no sre resources). The benefits of this investment are two-fold. The immediate benefit is that implementing and maintaining pipelines that use mediawiki content is much cleaner and cheaper than it was previously. The longer-term benefit, and the main goal for research, is to be able to run incremental daily pipelines (the same cadence as the content history), but for that we also need an incremental mediawiki history dataset (currently generated as a monthly snapshot). Until such a data source is available, the output generated by most of the pipelines updated with this hypothesis will remain unchanged.