The [[ https://gitlab.wikimedia.org/repos/research/research-datasets/-/blob/main/src/research_datasets/wikidiff/pipeline.py | wikidiff ]] dataset is blocked because generation of the wikitext history (dumps1) datasets was stopped due to recurring production issues. Research uses the wikidiff dataset for ML training (revert risk models), for producing revert risk prediction datasets, for the risk observatory dashboard, and for edit types.
Based on the previous discussion in T358366#9831389, the best path forward for wikidiff is not yet clear. Eventually, such a diff column can hopefully become part of an official dataset produced by DE. In the meantime, Research will transform the wikidiff pipeline into an incremental pipeline.

Pros:
1. Decreases the latency from 1+ month to a few days.
2. Maintains only one version of the dataset instead of full snapshots.

Cons:
1. No reconciliation after a daily update of the content history has been processed by wikidiff.
2. Will be computationally expensive.
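
To illustrate what "incremental" could mean here, the following is a minimal plain-Python sketch, not the actual wikidiff implementation. The `Revision` fields, the `diff_daily_batch` function, and the in-memory `text_by_revision` store are all hypothetical names introduced for illustration, and `difflib` is only a stand-in for whatever diff method the real pipeline uses. Each daily batch is diffed against the already stored parent text, so earlier days are never recomputed.

```python
import difflib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Revision:
    # Hypothetical schema; the real content history columns may differ.
    page_id: int
    revision_id: int
    parent_revision_id: Optional[int]
    wikitext: str


def diff_daily_batch(batch, text_by_revision):
    """Diff each revision of one daily batch against its parent.

    `text_by_revision` maps revision_id -> wikitext for revisions seen so
    far and is updated in place, so the next batch can find its parents.
    """
    rows = []
    # Revision ids increase over time, so sorting puts parents first.
    for rev in sorted(batch, key=lambda r: r.revision_id):
        # A missing parent (e.g. page creation) is treated as empty text.
        parent_text = text_by_revision.get(rev.parent_revision_id, "")
        diff = "\n".join(
            difflib.unified_diff(
                parent_text.splitlines(),
                rev.wikitext.splitlines(),
                lineterm="",
                n=0,  # no context lines, only the changed lines
            )
        )
        rows.append(
            {"page_id": rev.page_id, "revision_id": rev.revision_id, "diff": diff}
        )
        text_by_revision[rev.revision_id] = rev.wikitext
    return rows


# Two consecutive "daily" batches for the same page.
store = {}
day1 = diff_daily_batch([Revision(1, 10, None, "a\nb")], store)
day2 = diff_daily_batch([Revision(1, 11, 10, "a\nc")], store)
```

Rows emitted for an earlier batch are never revisited in this sketch, which mirrors the reconciliation limitation listed under the cons.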