Page MenuHomePhabricator

Spike: full-history revert detection for `mediawiki_history_incremental_v1`
Closed, ResolvedPublic

Description

Background

The daily delta writer bounds revert detection to a 48-hour window. Any revision reverted more than 48h after it was written gets revision_is_identity_reverted=false and revision_seconds_to_identity_revert=NULL — regardless of the true revert time. The monthly pipeline has no such bound.

This was flagged in T425573 by multiple stakeholders: the UWER time-to-revert metric (T424713) uses revision_seconds_to_identity_revert <= 7 * 86400, and the revert-window analysis found 10–50% of reverts fall outside 48 hours depending on wiki.

The cross-source complication

Extending the window also affects source='snapshot' rows. A reverting edit arriving on day T can target a revision that has already been promoted to source='snapshot' by the monthly reconcile:

  • Jan 28: revision R arrives → source='events', revision_is_identity_reverted=false
  • Feb 2: January monthly snapshot runs → R becomes source='snapshot', revision_is_identity_reverted=false
  • Feb 3: revision R' arrives and reverts R → R's snapshot row still reads false

The daily writer must be able to patch revision_is_identity_reverted, revision_first_identity_reverting_revision_id, and revision_seconds_to_identity_revert on source='snapshot' rows in place. The source value is not changed — it still reflects how the row's primary content was generated. The patched revert fields will be documented as "best available as of last daily run, regardless of source."

Questions to answer

  1. Revert window distribution. Query wmf.mediawiki_history for a histogram of revision_seconds_to_identity_revert bucketed as: 0–48h / 48h–7d / 7d–30d / 30d–90d / 90d–1y / 1y+. This quantifies how much signal each candidate window captures.
  1. Cross-month revert frequency. How often does a reverting edit arrive in month M+1 for a revision written in month M? This determines how often the cross-source patch scenario occurs in practice.
  1. Window size vs. scan cost. Prototype the revert_seed CTE at each candidate window size — 48h / 7d / 30d / 90d / 1y / unbounded — and measure Spark runtime and shuffle size for enwiki + one medium wiki. The goal is to find the smallest window that captures the meaningful tail of reverts. Confirm Iceberg predicate pushdown is effective at each window size (EXPLAIN FORMATTED).
  1. MERGE key change feasibility. The current MERGE key is (source='events', wiki_db, revision_id). To patch snapshot rows the key must drop the source filter. Assess implications for write amplification on snapshot partitions and the snapshot merger contract.
  1. Retroactive UPDATE volume. Estimate how many rows (across both sources) need a MERGE UPDATE per daily run at each window size.

Definition of done

A comment on this ticket with:

  • Revert-window histogram + cross-month revert frequency estimate
  • Runtime and shuffle measurements across all candidate window sizes, with a recommended cutoff
  • Assessment of dropping source from the MERGE key
  • Retroactive UPDATE row count estimate per day at the recommended window
  • Go / no-go recommendation

Relationships

Event Timeline

Spike results

Queries run against wmf.mediawiki_history (full history, all wikis). Source: spike_T426469_revert_window.py on branch T424350-mwhistory-incremental-writers.

1. Revert-window histogram

The 48h window captures only 55% of all reverted revisions globally. Nearly half the signal is lost.

WindowReverted revisions% of totalCumulative %
0–48h257,574,70255.1%55.1%
48h–7d30,786,9996.6%61.7%
7d–30d39,269,8608.4%70.1%
30d–90d32,347,6776.9%77.1%
90d–1y44,515,3429.5%86.6%
1y+62,687,59213.4%100%

The large 1y+ bucket reflects old vandalism and bot spam cleaned up years later.

2. Cross-month revert frequency

35.4% of all reverts cross a month boundary (165M out of 467M reverted revisions). These are rows whose reverting edit arrives after the monthly reconcile has already promoted the reverted revision to source='snapshot', making them invisible to the daily writer under the current design.

Cross-month breakdown by months later (top of the distribution):

Months later% of cross-month revertsCumulative
0 (< 30 days, crosses month boundary)15.3%15.3%
112.4%27.7%
27.4%35.1%
35.4%40.5%

The distribution has a long tail well past 30 months. The top-30 wikis by cross-month rate are all Wiktionary projects (96–99.6%), likely driven by bulk bot edits reverted months/years later.

3. Window size vs. scan cost

Simulated the revert_seed → sha1_ranked → revert_annotations CTE on wmf.mediawiki_history. Because that table is snapshot-partitioned (not date-partitioned), scan cost is constant across window sizes — what varies is join/shuffle cost only.

Windowenwiki seed rowsenwiki timefrwiki seed rowsfrwiki time
48h331,36660s55,97950s
7d1,283,48758s203,40253s
30d5,482,14253s834,86949s
90d15,964,76854s2,621,78350s
1y62,930,41056s10,184,60856s
unbounded1,336,444,67489s232,734,77162s

Iceberg predicate pushdown confirmed. EXPLAIN FORMATTED on the Iceberg incremental table shows timestamp predicates pushed to the BatchScan layer — only the relevant date-partitioned files are opened. On the real Iceberg table, scan cost scales proportionally with window size. A 90d window scans ~45× more files than a 48h window; unbounded scans the full table history.

4. MERGE key change assessment

The 35.4% cross-month rate confirms that dropping source='events' from the MERGE key is necessary — one in three reverts targets a row already promoted to snapshot. Without this change the daily writer silently leaves those rows stale.

Write amplification estimate (enwiki):

  • Total snapshot rows: 1.34 billion (9,174 distinct days, Jan 2001 – Feb 2026)
  • Snapshot rows with cross-month reverts: 68.2 million (5.1%), spanning 9,044 distinct days

Write amplification per daily run is bounded by the new revert detections for that day, not the total historical backlog (see §5).

5. Retroactive UPDATE volume per daily run

At the recommended 90d window, enwiki cross-month reverted rows captured vs. other candidate windows:

Windowenwiki rows captured% of 68.2M cross-month total
48h642,1080.9%
7d2,102,0853.1%
30d10,079,87614.8%
90d22,779,83333.4%
1y42,015,03361.6%

Per-day UPDATE volume distribution across all wikis:

StatValue
Median16,022 / day
p9540,783 / day
Max1,927,251 / day
Average19,039 / day

The max is driven by rare spike days on enwiki (e.g. 2026-04-01: 1,018,511 — likely a mass revert operation, warrants investigation before go-live). At the median/p95 the daily MERGE UPDATE volume is operationally fine.

6. Schema change: two-tier revert fields

The spike informed an evolution of the revert field design. The previous approach used _within_48h variants; this spike motivates widening to 90d and making the two-tier split explicit:

Proposed: replace the four existing revert fields with eight, split into two explicit tiers:

Authoritative tier — for source='events' rows, written as NULL only when no revert has been detected within the 90d window (meaning "unknown; monthly may still find one outside the window"). When the daily writer detects a revert within 90 days — whether on a fresh insert or a cross-month back-patch — it sets the authoritative field to true at the same time as the bounded field. Populated authoritatively for source='snapshot' rows by the SnapshotMerger from wmf.mediawiki_history.

FieldNote
revision_is_identity_reverted
revision_seconds_to_identity_revert
revision_first_identity_reverting_revision_id
revision_is_identity_revert

Bounded tier — always populated for both sources. Computed by the daily writer using the 90d rolling revert_seed CTE; recomputed by the SnapshotMerger each monthly run.

FieldNote
revision_is_identity_reverted_within_90_days
revision_seconds_to_identity_revert_within_90_days
revision_first_identity_reverting_revision_id_within_90_days
revision_is_identity_revert_within_90_days

Valid states:

revision_is_identity_reverted_within_90_daysMeaning
NULLfalseevents row, no revert detected within 90d yet; monthly may still find one
truetrueevents or snapshot row, reverted within 90d
falsefalsesnapshot row, not reverted (as of last run)
truefalsesnapshot row, reverted >90d after writing — monthly catches this, daily window does not

The NULL / true combination is impossible: if the daily writer detects a revert within 90d it sets both tiers to true simultaneously, regardless of whether the row is source='events' or source='snapshot'. The false / true combination is also impossible for the same reason.

Who writes what:

  • DeltaWriter inserts (source='events'): bounded tier computed by the revert_seed CTE (always populated). If a revert is detected within 90d, both _within_90_days = true and authoritative revision_is_identity_reverted = true are written together. If no revert is detected, _within_90_days = false and the authoritative tier is NULL.
  • DeltaWriter back-patch (cross-month revert detected on a snapshot row, ≤90d): updates both tiers on the snapshot row in the same MERGE UPDATE — no semantic gap between the two tiers.
  • SnapshotMerger (monthly): authoritative tier projected directly from wmf.mediawiki_history. Bounded tier recomputed from monthly data: _reverted fields are trivially derived by applying the 90d filter to the existing revision_seconds_to_identity_revert column; _is_revert_within_90_days requires a self-join on revision_first_identity_reverting_revision_id (acceptable extra cost on the monthly run). Both tiers are overwritten with authoritative monthly values each run; the daily writer fills in between runs.

Consumer query patterns:

  • "Was this revision reverted?" (all rows, bounded): revision_is_identity_reverted_within_90_days = true
  • "Was this revision reverted within 7d?" (UWER): revision_is_identity_reverted_within_90_days = true AND revision_seconds_to_identity_revert_within_90_days <= 604800
  • "Authoritative answer, monthly quality": source = 'snapshot' AND revision_is_identity_reverted = true
  • "Best available across all rows": COALESCE(revision_is_identity_reverted, revision_is_identity_reverted_within_90_days)

7. Recommendation: GO, with 90d window

  • Window: 90d for the daily bounded tier. Captures 33% of cross-month reverts for enwiki (vs 0.9% at 48h), directly covers the T424713 7-day time-to-revert metric, and keeps daily scan cost proportional and bounded by Iceberg pushdown. The 1y+ tail adds large scan cost for diminishing returns.
  • Schema: rename the four _within_48h columns to _within_90_days and add the four authoritative counterparts, for eight revert fields total.
  • MERGE key: drop source='events' filter. Add WHEN MATCHED AND t.source = 'snapshot' guard in the daily writer to patch only the revert fields — primary content fields on snapshot rows are not touched.
  • Action items: (1) Change revert_seed lookback to INTERVAL 90 DAYS. (2) Update MERGE ON clause to (wiki_id, revision_id). (3) Add four _within_90_days columns to DDL and both writers. (4) Update SnapshotMerger projection to populate both tiers. (5) Investigate the 2026-04-01 enwiki spike (1M+ cross-month revert detections in a single day) before go-live.

Decision

Agreed with @JAllemandou that a 90d window makes sense: it covers todays and any potential future uses cases that can be achieved in an incremental way, and equally importantly, it bounds the calculation so that we can properly size the job and not have to do a full table scan.

Thanks @xcollazo for this data and the details around back-filling the revision tag fields in T425573: mediawiki_history_incremental_v1: schema specification for stakeholder review. I mention that here because I think the ability to filter on mw-reverted tag will be necessary for the metrics that folks are working on. The long-tail of revert times have a lot of false positives due to bugs in how we calculate reverts -- i.e. null-content revisions at the very edge of snapshots that match up against any previously-deleted revisions in the page's history and page moves/protections. I checked a sample from English Wikipedia and it was 50% false positives after 48 hours. Rerunning your queries but with the WHERE ARRAY_CONTAINS(revision_tags, 'mw-reverted') as a much more high-fidelity signal of what edits actually have been reverted since 2020 gives us:

  • The 48h window captures only 55% of all reverted revisions globally. Nearly half the signal is lost. becomes 70% within 48 hours (and 78% if you look just at Wikipedias). So just a quarter lost.
  • 35.4% of all reverts cross a month boundary becomes 19% cross-month (12% if you look just at Wikipedias).

A note on Wiktionary: those are often "true" reverts but they're completely unrelated to vandalism and so frankly shouldn't be counted in the metrics either -- it's just pages that had interwiki links added to them over the years and then eventually they were all removed when Wikidata became the standard. But also I think you'd see that the vast majority are many years old and it's not an ongoing issue if you look only at more recent data.

I'm mainly seeing this to document this issue, not to convince you to change your choice. If 90-days doesn't break out systems, that's okay as folks can always further filter but I do hope that long-term we prioritize fixing these bugs and simplify the revert fields again when there's more time to discuss and reach consensus.

Rerunning your queries but with the WHERE ARRAY_CONTAINS(revision_tags, 'mw-reverted') as a much more high-fidelity signal of what edits actually have been reverted since 2020 gives us:

Huh! In my revert spelunking in T423583, I learned that mw-reverted is only added on the 15 revisions prior to the reverting revision (or something like that).

I know nothing about revert stats but I'm surprised that revision mw-reverted tag adds this much fidelity.

Huh! In my revert spelunking in T423583, I learned that mw-reverted is only added on the 15 revisions prior to the reverting revision (or something like that).

Oh what a great task! Excellent details in there about revert mechanics (thank you). Good point that Mediawiki's heuristics are also semi-arbitrary so we're essentially choosing between two things right now:

  • Mediawiki: Only look 15 edits back to find which edits have been reverted.
  • Mediawiki History: Look back some amount of time to find potentially-reverted revisions. Currently all of time but now we're considering 90 days as an additional approach. I had advocated for 48 hours (the original proposal).

So which is better? Mediawiki's approach is semi-recent so it doesn't help with old data. I also don't think that once a mw-reverted tag is applied that it's ever removed. So e.g., if someone reverts an edit and then someone comes in and restores that edit by reverting the revert, the now-restored edit will still be marked as mw-reverted. Presumably this is partially why mw-reverted only goes 15 edits deep (otherwise very easy for someone to mess up a page's history by reverting the whole thing). But generally when I look at edits marked with mw-reverted, they are indeed reverted edits.

MWH's approach could be better than Mediawiki's as it's true that reverts sometimes stretch past 15 edits and we have more flexibility (at the monthly snapshot at least). But the current implementation has a number of bugs that introduce a lot of false positives, especially in "reverts" that happen at longer time-scales (they don't matter so much in the 0-48 hour range as so many of those reverts are actually real). Those would need to be fixed for me to recommend relying on MWH alone. The other question though is when a "reverted" edit actually corresponds with "bad" edit, which is what we usually think of when we think of revert and is how they're interpreted for metrics. Aaron had a paper on this but the tl;dr is that when an edit is reverted long after it's created, generally it's just because the norms or context changed (e.g., my example of no longer tracking sitelinks in the wikitext; an article no longer being under discussion for deletion; etc.) and not because the initial edit was actually bad. In that sense, I think the perfect heuristic for revert-as-marker-of-bad-edit is probably something like "no further than 15 revisions back or 48 hours ago", but I'm trying to not further complicate this :)