Incremental MediaWiki History
Summary
The wmf.mediawiki_history pipeline currently runs monthly against sqooped MariaDB snapshots from ~900 wikis. End-to-end wall-clock time averages ~56 hours across the last five runs, of which ~46h is sqoop wait and ~10.6h is Spark compute. Product-side analytics initiatives (editor month, constructive edit rate, retained newcomers, UWER, etc.) need freshness measured in days, not weeks. This umbrella task tracks the work to deliver a daily-cadence MediaWiki history that covers the columns those metrics actually need, while preserving the existing monthly rebuild as the canonical correctness backstop.
Current Approach
We will ship a new Iceberg table (mediawiki_history_incremental_v1, database TBD) populated by two separate Scala/Spark 3.5 writers. The first is a daily events writer that lands rows from MediaWiki event streams (page_change_v1, revision_visibility_change, revision_change_tags, serversideaccountcreation, etc.) with source='events'. The second is a monthly merge job, triggered downstream of MediawikiHistoryRunner success, that uses MERGE INTO ... WHEN NOT MATCHED BY SOURCE to atomically reconcile source='snapshot' rows without touching event rows. The table is partitioned by (source, days(event_timestamp)). The schema covers audit categories A+B with a 48-hour revert detection window. MediawikiHistoryRunner and the legacy wmf.mediawiki_history table are not touched: all current consumers continue to work, and the monthly rebuild serves as the reconcile mechanism. The full architectural plan is in T424359.
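To make the intended reconcile semantics concrete, here is a minimal in-memory model of the monthly merge, written in plain Python rather than Spark SQL. It is a sketch, not the actual job: the Row type, the event_id merge key, the payload field, and the reconcile_snapshot function are all hypothetical, and guarding every merge branch on source='snapshot' is an assumption about how event rows are left untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    source: str    # 'events' or 'snapshot', as in the planned table
    event_id: str  # hypothetical merge key; the real key is not specified here
    payload: str   # stand-in for the remaining columns


def reconcile_snapshot(target, snapshot):
    """Model MERGE INTO restricted to source='snapshot' target rows.

    Assumed branches:
      WHEN MATCHED               -> replace the target row with the snapshot row
      WHEN NOT MATCHED           -> insert the snapshot row
      WHEN NOT MATCHED BY SOURCE -> delete the stale snapshot row
    Rows with any other source (e.g. 'events') pass through unchanged.
    """
    incoming = {row.event_id: row for row in snapshot}
    matched = set()
    result = []
    for row in target:
        if row.source != "snapshot":
            result.append(row)                     # event rows are never touched
        elif row.event_id in incoming:
            result.append(incoming[row.event_id])  # matched: update in place
            matched.add(row.event_id)
        # else: snapshot row absent from the new snapshot, so it is dropped
    for key, row in incoming.items():
        if key not in matched:
            result.append(row)                     # not matched: insert
    return result


target = [
    Row("events", "e1", "edit"),
    Row("snapshot", "s1", "old"),
    Row("snapshot", "s2", "stale"),
]
snapshot = [
    Row("snapshot", "s1", "new"),
    Row("snapshot", "s3", "added"),
]
merged = reconcile_snapshot(target, snapshot)
```

In this example the events row survives, s1 is updated, s2 is deleted, and s3 is inserted. In the real job the three branches would run as a single MERGE statement, which is what makes the reconcile atomic from a reader's point of view.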
This is Phase I of the project. Follow-up work is tracked in T428273: Incremental MediaWiki History Phase II.