Background
The daily delta writer bounds revert detection to a 48-hour window. Any revision reverted more than 48h after it was written gets revision_is_identity_reverted=false and revision_seconds_to_identity_revert=NULL — regardless of the true revert time. The monthly pipeline has no such bound.
This was flagged in T425573 by multiple stakeholders. The UWER time-to-revert metric (T424713) filters on revision_seconds_to_identity_revert <= 7 * 86400 (7 days), so under the 48h bound it cannot see any revert that lands between 48h and 7d. The revert-window analysis found that 10–50% of reverts fall outside 48 hours, depending on the wiki.
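To make the interaction with the 7-day metric concrete, here is a minimal Python sketch of how a window-bounded writer would record the two revert fields. It is an illustration only, not the daily writer's actual code: the function names and the sample timestamps are hypothetical, and it assumes "reverted more than 48h after" means a strict greater-than comparison in seconds.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

DAILY_WINDOW_SECONDS = 48 * 3600    # the 48-hour bound described above
UWER_THRESHOLD_SECONDS = 7 * 86400  # threshold used by the time-to-revert metric


def revert_fields(
    written_at: datetime,
    reverted_at: Optional[datetime],
    window_seconds: Optional[int],
) -> Tuple[bool, Optional[int]]:
    """Return (revision_is_identity_reverted, revision_seconds_to_identity_revert)
    as a writer with the given detection window would record them.

    window_seconds=None models an unbounded window (assumed here to match
    the monthly pipeline's behaviour).
    """
    if reverted_at is None:
        return (False, None)
    seconds = int((reverted_at - written_at).total_seconds())
    if window_seconds is not None and seconds > window_seconds:
        # The revert happened, but outside the window: recorded as not reverted.
        return (False, None)
    return (True, seconds)


def counts_for_uwer(seconds_to_revert: Optional[int]) -> bool:
    """Mirror of the metric filter: revision_seconds_to_identity_revert <= 7 * 86400."""
    return seconds_to_revert is not None and seconds_to_revert <= UWER_THRESHOLD_SECONDS


# Hypothetical revision reverted three days after it was written.
written = datetime(2025, 1, 1, 0, 0)
reverted = written + timedelta(days=3)

daily_view = revert_fields(written, reverted, DAILY_WINDOW_SECONDS)
unbounded_view = revert_fields(written, reverted, None)
```

In this sketch the three-day revert satisfies the 7-day metric under the unbounded view but is invisible to it under the 48h view, which is the gap the ticket is concerned with.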
The cross-source complication
Extending the window also affects source='snapshot' rows. A reverting edit arriving on day T can target a revision that has already been promoted to source='snapshot' by the monthly reconcile:
- Jan 28: revision R arrives → source='events', revision_is_identity_reverted=false
- Feb 2: January monthly snapshot runs → R becomes source='snapshot', revision_is_identity_reverted=false
- Feb 3: revision R' arrives and reverts R → R's snapshot row still reads false
The daily writer must be able to patch revision_is_identity_reverted, revision_first_identity_reverting_revision_id, and revision_seconds_to_identity_revert on source='snapshot' rows in place. The source value is not changed — it still reflects how the row's primary content was generated. The patched revert fields will be documented as "best available as of last daily run, regardless of source."
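The timeline above can be sketched in Python as a small in-memory model of the in-place patch. This is a sketch under stated assumptions, not the MERGE itself: RevisionRow, patch_revert_fields, the revision IDs, the year, and the times of day are all hypothetical, and the table is modelled as a dict keyed by (wiki_db, revision_id) with no source component. It also assumes reverting edits are processed in arrival order, so the first patch applied is the first identity revert.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional, Tuple


@dataclass
class RevisionRow:
    wiki_db: str
    revision_id: int
    source: str  # 'events' or 'snapshot'
    revision_timestamp: datetime
    revision_is_identity_reverted: bool = False
    revision_first_identity_reverting_revision_id: Optional[int] = None
    revision_seconds_to_identity_revert: Optional[int] = None


Table = Dict[Tuple[str, int], RevisionRow]


def patch_revert_fields(
    table: Table,
    wiki_db: str,
    reverted_id: int,
    reverting_id: int,
    reverting_ts: datetime,
) -> bool:
    """Patch the three revert fields in place, whatever the row's source.

    Returns True if a row was updated. The source column is never touched.
    """
    row = table.get((wiki_db, reverted_id))
    if row is None or row.revision_is_identity_reverted:
        # Unknown revision, or already patched by an earlier (first) revert.
        return False
    row.revision_is_identity_reverted = True
    row.revision_first_identity_reverting_revision_id = reverting_id
    row.revision_seconds_to_identity_revert = int(
        (reverting_ts - row.revision_timestamp).total_seconds()
    )
    return True


# Jan 28: revision R arrives from the event stream.
table: Table = {}
r = RevisionRow("enwiki", 1001, "events", datetime(2025, 1, 28, 12, 0))
table[(r.wiki_db, r.revision_id)] = r

# Feb 2: the monthly reconcile promotes R to source='snapshot'.
r.source = "snapshot"

# Feb 3: R' arrives and reverts R; the daily writer patches the snapshot row.
patched = patch_revert_fields(table, "enwiki", 1001, 1002, datetime(2025, 2, 3, 12, 0))
```

After the patch the row still reads source='snapshot', while its revert fields reflect the Feb 3 revert, which matches the "best available as of last daily run, regardless of source" contract.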
Questions to answer
- Revert window distribution. Query wmf.mediawiki_history for a histogram of revision_seconds_to_identity_revert bucketed as: 0–48h / 48h–7d / 7d–30d / 30d–90d / 90d–1y / 1y+. This quantifies how much signal each candidate window captures.
- Cross-month revert frequency. How often does a reverting edit arrive in month M+1 for a revision written in month M? This determines how often the cross-source patch scenario occurs in practice.
- Window size vs. scan cost. Prototype the revert_seed CTE at each candidate window size — 48h / 7d / 30d / 90d / 1y / unbounded — and measure Spark runtime and shuffle size for enwiki + one medium wiki. The goal is to find the smallest window that captures the meaningful tail of reverts. Confirm Iceberg predicate pushdown is effective at each window size (EXPLAIN FORMATTED).
- MERGE key change feasibility. The current MERGE key is (source='events', wiki_db, revision_id). To patch snapshot rows the key must drop the source filter. Assess implications for write amplification on snapshot partitions and the snapshot merger contract.
- Retroactive UPDATE volume. Estimate how many rows (across both sources) need a MERGE UPDATE per daily run at each window size.
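For the revert window distribution question, the bucketing could be prototyped along the following lines before writing the production query. This is a minimal Python sketch, not the wmf.mediawiki_history query: the helper names are hypothetical, bucket upper bounds are assumed to be inclusive, "1y" is assumed to mean 365 days, and NULL values (never reverted) are skipped.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple

HOUR = 3600
DAY = 86400

# (label, inclusive upper bound in seconds); anything larger falls into "1y+".
BUCKETS: List[Tuple[str, int]] = [
    ("0-48h", 48 * HOUR),
    ("48h-7d", 7 * DAY),
    ("7d-30d", 30 * DAY),
    ("30d-90d", 90 * DAY),
    ("90d-1y", 365 * DAY),
]
LAST_BUCKET = "1y+"
LABELS = [label for label, _ in BUCKETS] + [LAST_BUCKET]


def bucket_for(seconds: int) -> str:
    """Map a revision_seconds_to_identity_revert value to its bucket label."""
    for label, upper in BUCKETS:
        if seconds <= upper:
            return label
    return LAST_BUCKET


def revert_histogram(values: Iterable[Optional[int]]) -> Dict[str, int]:
    """Count reverted revisions per bucket, skipping NULLs."""
    counts = Counter(bucket_for(s) for s in values if s is not None)
    return {label: counts.get(label, 0) for label in LABELS}


def cumulative_capture(hist: Dict[str, int]) -> Dict[str, float]:
    """Share of all reverts captured by a window ending at each bucket's upper bound."""
    total = sum(hist.values())
    running = 0
    shares: Dict[str, float] = {}
    for label in LABELS:
        running += hist[label]
        shares[label] = running / total if total else 0.0
    return shares


# Made-up sample values, only to exercise the bucketing.
sample = [3600, 172800, 172801, 8 * DAY, 100 * DAY, 400 * DAY, None]
hist = revert_histogram(sample)
capture = cumulative_capture(hist)
```

The cumulative shares are the figure that matters for choosing a cutoff: they show how much of the revert signal each candidate window would retain.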
Definition of done
A comment on this ticket with:
- Revert-window histogram + cross-month revert frequency estimate
- Runtime and shuffle measurements across all candidate window sizes, with a recommended cutoff
- Assessment of dropping source from the MERGE key
- Retroactive UPDATE row count estimate per day at the recommended window
- Go / no-go recommendation
Relationships
- Parent: T424350
- Informs: T425573 (revert window concern from @diego, @Tchanders, @Milimetric)
- Related: T424713 (time-to-revert metric that needs 7d window)