
Wikidiff dataset using content history
Closed, Resolved (Public)

Description

The wikidiff dataset is blocked because generation of the wikitext history (dumps1) datasets was stopped due to recurring production issues. Research uses the wikidiff dataset for ML training (revert risk models), for producing revert risk prediction datasets, for the risk observatory dashboard, and for edit types.

Based on the previous discussion in T358366#9831389, the best path forward for wikidiff is not clear yet. Eventually, such a diff column can hopefully be part of an official dataset produced by DE. In the meantime, research will transform the wikidiff pipeline into an incremental pipeline.

Pros:
1. Will decrease the latency from 1+ month to a few days.
2. Only one version of the dataset instead of full snapshots.

Cons:
1. No reconciliation after a daily update of the content history has been processed by wikidiff.
2. Will be computationally expensive.
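To make the incremental idea concrete, here is a minimal sketch of diffing one daily batch of revisions against their parents. It is not the actual pipeline: the `Revision` record, the `diff_daily_batch` function, and the in-memory `known_text` lookup are all hypothetical names chosen for illustration, and Python's standard `difflib` stands in for whatever diff algorithm wikidiff really uses.

```python
import difflib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Revision:
    # Hypothetical record shape; the real content history schema may differ.
    revision_id: int
    parent_id: Optional[int]
    text: str


def diff_daily_batch(batch: List[Revision], known_text: Dict[int, str]) -> Dict[int, str]:
    """Return a unified diff (as one string) for every revision in a daily batch.

    known_text maps revision_id -> wikitext for revisions seen on earlier days.
    """
    # Include the batch itself so parents created on the same day resolve.
    lookup = dict(known_text)
    for rev in batch:
        lookup[rev.revision_id] = rev.text

    diffs: Dict[int, str] = {}
    for rev in batch:
        # Assumption: a first revision or an unknown parent is diffed
        # against empty text, and is never revisited later.
        parent_text = lookup.get(rev.parent_id, "") if rev.parent_id is not None else ""
        lines = difflib.unified_diff(
            parent_text.splitlines(),
            rev.text.splitlines(),
            fromfile="parent",
            tofile="revision",
            lineterm="",
            n=0,  # no context lines, only the changed lines
        )
        diffs[rev.revision_id] = "\n".join(lines)
    return diffs


batch = [Revision(1, None, "a\nb"), Revision(2, 1, "a\nc")]
result = diff_daily_batch(batch, known_text={})
```

Each batch is processed once and only appended to the output, which is what keeps latency low and avoids full snapshots. It also illustrates the first con: a diff computed against a missing or later-corrected parent is not recomputed.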

Details

Due Date
Apr 18 2025, 4:00 AM

Event Timeline

Task is unassigned, unprioritized, and doesn't have a deadline. Moving to Backlog. cc @XiaoXiao-WMF.

fkaelin changed the task status from Open to In Progress. (Apr 1 2025, 12:54 PM)
fkaelin moved this task from Backlog to In Progress on the Research board.

This work is ongoing. As discussed with @XiaoXiao-WMF, moving this to In Progress and adding a deadline.

fkaelin set Due Date to Apr 4 2025, 4:00 AM. (Apr 1 2025, 12:54 PM)
fkaelin changed Due Date from Apr 4 2025, 4:00 AM to Apr 18 2025, 4:00 AM. (Apr 7 2025, 4:29 PM)
fkaelin claimed this task.
  • The research.mediawiki_content_diff dataset is documented on DataHub and backfilled for historical revisions.
  • The content_diff Airflow DAG is deployed and will run daily.