Incremental MediaWiki History Phase I
Open, In Progress, High, Public

Description

Incremental MediaWiki History

Summary

The wmf.mediawiki_history pipeline currently runs monthly against sqooped MariaDB snapshots from ~900 wikis. End-to-end wall-clock averages ~56 hours across the last five runs, of which ~46h is sqoop wait and ~10.6h is Spark compute. Product-side analytics initiatives (editor month, constructive edit rate, retained newcomers, UWER, etc.) need freshness measured in days, not weeks. This umbrella task tracks the work to deliver a daily-cadence MediaWiki history that covers the columns those metrics actually need, while preserving the existing monthly rebuild as the canonical correctness backstop.

Current Approach

We ship a new Iceberg table (mediawiki_history_incremental_v1, database TBD) populated by two separate Scala/Spark 3.5 writers: a daily events writer that lands rows from MediaWiki event streams (page_change_v1, revision_visibility_change, revision_change_tags, serversideaccountcreation, etc.) with source='events', and a monthly merge job triggered downstream of MediawikiHistoryRunner success that uses MERGE INTO ... WHEN NOT MATCHED BY SOURCE to atomically reconcile source='snapshot' rows without touching event rows. The table is partitioned by (source, days(event_timestamp)). Schema covers audit categories A+B with a 48-hour revert detection window. MediawikiHistoryRunner and the legacy wmf.mediawiki_history table are not touched — all current consumers continue to work, and the monthly rebuild serves as the reconcile mechanism. Full architectural plan in T424359.
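The reconcile behaviour described above can be modelled in a few lines of plain Python. This is only a sketch under stated assumptions, not the job's actual Scala/Spark code. The `Row` type, its `key` and `payload` fields, and the function name `reconcile_snapshot` are hypothetical. The choice of actions (update on match, insert when missing from the target, delete when missing from the new snapshot) is an assumption about how the MERGE clauses are used. What the sketch does take from the description is that only source='snapshot' rows are reconciled and event rows pass through unchanged.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Row:
    key: str      # hypothetical merge key (for example a wiki plus revision id)
    source: str   # 'events' or 'snapshot', as in the table's source column
    payload: str  # stand-in for all remaining columns


def reconcile_snapshot(target: List[Row], snapshot: List[Row]) -> List[Row]:
    """Model a monthly reconcile of snapshot rows (assumed semantics).

    Returns a new list instead of mutating `target`, loosely mirroring a
    single atomic table commit.
    """
    if any(r.source != "snapshot" for r in snapshot):
        raise ValueError("incoming rows must all have source='snapshot'")

    incoming = {r.key: r for r in snapshot}
    matched = set()
    result: List[Row] = []

    for row in target:
        if row.source != "snapshot":
            # Event rows are never touched by the reconcile.
            result.append(row)
        elif row.key in incoming:
            # Assumed WHEN MATCHED action: replace with the fresh snapshot row.
            result.append(incoming[row.key])
            matched.add(row.key)
        # Otherwise: assumed WHEN NOT MATCHED BY SOURCE action, the stale
        # snapshot row is dropped.

    for key, row in incoming.items():
        if key not in matched:
            # Assumed WHEN NOT MATCHED action: insert the new snapshot row.
            result.append(row)

    return result
```

In this model, a target holding one event row and two snapshot rows, reconciled against a new snapshot that updates one key and adds another, keeps the event row as-is, replaces the matched snapshot row, drops the stale one, and appends the new one.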

This is phase I of this project. We will follow up this work with T428273: Incremental MediaWiki History Phase II.

Related Objects

Status | Subtype | Assigned | Task
Open | | None |
Open | | None |
Open | | Ahoelzl |
In Progress | | xcollazo |
Open | | APizzata-WMF |
Resolved | | xcollazo |
Resolved | | BTullis |
Resolved | | xcollazo |
Resolved | | xcollazo |
Resolved | | xcollazo |
Resolved | | xcollazo |
Resolved | | xcollazo |
Resolved | | xcollazo |
Open | | APizzata-WMF |
Resolved | | xcollazo |
Resolved | | xcollazo |
Resolved | | xcollazo |
Resolved | | xcollazo |
Open | | APizzata-WMF |
Open | | APizzata-WMF |
Open | | JAllemandou |
Open | | JAllemandou |
Open | | JAllemandou |
Open | | JAllemandou |

Event Timeline

Change #1286481 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[analytics/refinery/source@master] Upgrade graphframes to 0.11.0 from Maven Central, drop Archiva repos

https://gerrit.wikimedia.org/r/1286481

xcollazo changed the task status from Open to In Progress. May 14 2026, 4:13 PM
xcollazo renamed this task from Incremental MediaWiki History to Incremental MediaWiki History Phase I. Thu, Jun 4, 8:56 PM