Analyze differences between checksum-based and revert-tag based reverts in mediawiki_history
Open, MediumPublic

Description

With the introduction of the "mw-reverted" tag in T254074, we now have two different approaches to identifying reverts in the MediaWiki History table:

  1. SHA1 checksums. Reverts identified this way are reflected in the revision_is_identity_reverted column.
  2. The mw-reverted tag. These reverts are identified through the tag being in the revision_tags column.
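
As a rough illustration of approach 1, checksum-based detection can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the actual mediawiki_history job: the function name `identity_reverts` and the input shape (one page's SHA1 values, oldest first) are made up for the example.

```python
def identity_reverts(sha1s):
    """Toy checksum-based revert detection for one page (hypothetical helper).

    sha1s: list of content checksums, oldest revision first.
    Returns (is_revert, is_reverted), two lists of booleans.
    """
    n = len(sha1s)
    is_revert = [False] * n
    is_reverted = [False] * n
    last_seen = {}  # checksum -> index of the most recent revision with it

    for i, sha1 in enumerate(sha1s):
        j = last_seen.get(sha1)
        if j is not None:
            # Revision i restores the content of revision j ...
            is_revert[i] = True
            # ... so everything strictly between them was undone.
            for k in range(j + 1, i):
                is_reverted[k] = True
        last_seen[sha1] = i
    return is_revert, is_reverted


revert_flags, reverted_flags = identity_reverts(["a", "b", "c", "a"])
print(revert_flags)    # [False, False, False, True]
print(reverted_flags)  # [False, True, True, False]
```

The two output lists correspond loosely to revision_is_identity_revert and revision_is_identity_reverted.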

Once the October 2020 snapshot of MediaWiki History is available in early November, someone from Product Analytics (most likely @nettrom_WMF) should run a comparison of edits and these revert detection approaches to identify where they overlap and where they differ.

Event Timeline

The expected difference is to have more (possibly many more) tag-based reverts than checksum-based reverts.

LGoto triaged this task as Medium priority.
LGoto moved this task from Triage to Current Quarter on the Product-Analytics board.

@JAllemandou : Yes, and I'm expecting to see some checksum-based reverts not having the tag because the tag only checks the last 15 edits.
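
To illustrate the point about the 15-edit window, here is a toy Python sketch. It only models the window effect with checksum matching; the real mw-reverted logic has more conditions than this. The function `finds_revert_target` and its `lookback` parameter are hypothetical names for the example.

```python
def finds_revert_target(sha1s, lookback=None):
    """Does the newest revision match an earlier one within the window?

    sha1s: checksums for one page, oldest first (hypothetical input shape).
    lookback: how many preceding revisions to inspect; None means all.
    """
    *earlier, newest = sha1s
    start = 0 if lookback is None else max(0, len(earlier) - lookback)
    return newest in earlier[start:]


# One good revision, 20 unrelated edits, then a restore of the good revision.
history = ["good"] + [f"edit{i}" for i in range(20)] + ["good"]

print(finds_revert_target(history))               # True: full-history match
print(finds_revert_target(history, lookback=15))  # False: target is out of range
```

A revert like this one would show up in the checksum-based column but would not be expected to carry the tag.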

I'd be interested in looking at how the number of reverts detected by each method relates to how actively a page is being edited.

LGoto lowered the priority of this task from Medium to Low. Nov 16 2020, 5:43 PM

We're quickly running out of time in Q2, so moving this to Q3.

kzimmerman raised the priority of this task from Low to Medium. Feb 3 2021, 11:21 PM

The Growth team runs updates every week, so they're using the mw-reverted tag in those notebooks. But we don't have a clear sense of how well mw-reverted maps to revision_is_identity_reverted, which is what we use to calculate key metrics. Being able to reconcile these numbers is important because:

  • We are increasingly trying to understand Product Teams' impact on high level metrics
  • We need to verify that the mw-reverted tag is providing the information we expect

We believe this is medium priority but should be reconsidered for Q4.

@Isaac has made a preliminary investigation into this for English Wikipedia from May 2022. Adding the query he used and the results here so we can build upon it.

WITH reverts AS (
    SELECT
      revision_is_identity_revert as mw_history,
      IF(ARRAY_CONTAINS(revision_tags, 'mw-undo'), 1, 0) AS undo,
      IF(ARRAY_CONTAINS(revision_tags, 'mw-rollback'), 1, 0) AS rollback,
      IF(ARRAY_CONTAINS(revision_tags, 'mw-manual-revert'), 1, 0) as manual
    FROM wmf.mediawiki_history
    WHERE
      snapshot = '2022-05'
      AND wiki_db = 'enwiki'
      AND page_namespace = 0
      AND NOT page_is_redirect
      AND event_type = 'create'
      AND event_entity = 'revision'
      AND event_timestamp >= '2022-05-01'
      AND event_timestamp < '2022-06-01'  
)
SELECT
  mw_history,
  undo,
  rollback,
  manual,
  COUNT(1) AS num_edits
FROM reverts
GROUP BY
  mw_history,
  undo,
  rollback,
  manual

Findings:

English Wikipedia May 2022 data (namespace 0; no redirects):
+----------+----+--------+------+--------+
|mw_history|undo|rollback|manual|# edits |
+----------+----+--------+------+--------+
|false     |0   |0       |0     |3100382 |  # edits that weren't reverts by any definition

# reverts captured in mediawiki history
# vast majority also show up as edit tags
# but some small proportion were manual and
# missed by mediawiki for some reason
|true      |1   |0       |0     |109548  |  # straightforward undos
|true      |0   |1       |0     |54032   |  # straightforward rollbacks
|true      |0   |0       |1     |49750   |  # straightforward manual reverts
|true      |0   |0       |0     |20827   |  # manual reverts found by mediawiki_history table but not identified by mediawiki

# reverts not captured in mediawiki history
# vast majority are undos where presumably the editor
# changed something too and so generated a unique hash
|false     |1   |0       |0     |7711    |  # undo
|false     |0   |0       |1     |8       |  # manual
|false     |0   |1       |0     |2       |  # rollback
+----------+----+--------+------+--------+
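
For anyone who wants to check the proportions, the counts in the table can be summarised with a few lines of plain Python. The dictionary is copied from the table; the variable names are just for this sketch.

```python
# (mw_history, undo, rollback, manual) -> number of edits, copied from the table
counts = {
    (False, 0, 0, 0): 3100382,
    (True, 1, 0, 0): 109548,
    (True, 0, 1, 0): 54032,
    (True, 0, 0, 1): 49750,
    (True, 0, 0, 0): 20827,
    (False, 1, 0, 0): 7711,
    (False, 0, 0, 1): 8,
    (False, 0, 1, 0): 2,
}

checksum_reverts = 0  # flagged by revision_is_identity_revert
tagged_reverts = 0    # carrying at least one of the three revert tags
tagged_only = 0       # tagged but not flagged by the checksum approach
for (mw_history, *tags), n in counts.items():
    if mw_history:
        checksum_reverts += n
    if any(tags):
        tagged_reverts += n
        if not mw_history:
            tagged_only += n

share_untagged = counts[(True, 0, 0, 0)] / checksum_reverts
share_tagged_only = tagged_only / tagged_reverts

print(f"checksum reverts: {checksum_reverts}, tagged reverts: {tagged_reverts}")
print(f"checksum reverts without a tag: {share_untagged:.1%}")
print(f"tagged reverts without a checksum match: {share_tagged_only:.1%}")
```

With these numbers, roughly 9% of checksum-based reverts carry no revert tag, and roughly 3.5% of tagged reverts have no checksum match.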

Thanks @nettrom_WMF! Seeing that this task is about reverted edits as opposed to reverting edits (what I calculated above), here's the corresponding query/data for that. tl;dr: mw-reverted is a pretty good proxy but is missing about 10% of what mediawiki_history identifies. That's not necessarily a bad thing, though, as some of these edits seem to be, e.g., part of an edit war where a patroller's revert was also reverted but really shouldn't have been (mw-reverted tries to account for this; mediawiki_history does not).

spark.sql("""
WITH reverted AS (
    SELECT
      revision_is_identity_reverted as mw_history,
      IF(ARRAY_CONTAINS(revision_tags, 'mw-reverted'), 1, 0) AS reverted
    FROM wmf.mediawiki_history
    WHERE
      snapshot = '2022-05'
      AND wiki_db = 'enwiki'
      AND page_namespace = 0
      AND NOT page_is_redirect
      AND event_type = 'create'
      AND event_entity = 'revision'
      AND event_timestamp >= '2022-05-01'
      AND event_timestamp < '2022-06-01'  
)
SELECT
  COUNT(1) AS num_edits,
  mw_history,
  reverted
FROM reverted
GROUP BY
  mw_history,
  reverted
ORDER BY
  mw_history,
  num_edits DESC
""").show(100, False)
+---------+----------+--------+
|# edits  |mw_history|reverted|
+---------+----------+--------+
|3037145  |false     |0       |  # most edits aren't reverted (yay!)

# most reverted edits are captured by mediawiki_history and mw-reverted tags (yay!)
|275158   |true      |1       |  

# ~10% of reverted edits missed by mw-reverted; likely because its criteria are much more complicated: https://www.mediawiki.org/wiki/Manual:Reverts#Conditions_for_execution
|22487    |true      |0       |  

# small number of edits are tagged as reverted but missed in mediawiki_history (~3%). this about matches the number of mw-undo edits that were presumably partial and therefore missed by mediawiki_history (7711 -- see data from previous comment). so I assume that's what is going on here -- i.e. mediawiki_history doesn't detect partially-reverted edits as reverted.
|7470     |false     |1       | 
+---------+----------+--------+
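
The shares quoted in the comments can be recomputed from the four counts with a short Python snippet (variable names are just for this sketch). The denominators are an assumption on my part: all edits flagged by mediawiki_history for the first share, and all edits tagged mw-reverted for the second.

```python
# (mw_history, reverted) -> number of edits, copied from the table
counts = {
    (False, 0): 3037145,
    (True, 1): 275158,
    (True, 0): 22487,
    (False, 1): 7470,
}

flagged_by_history = counts[(True, 1)] + counts[(True, 0)]
flagged_by_tag = counts[(True, 1)] + counts[(False, 1)]

# Reverted per mediawiki_history but not tagged mw-reverted
missed_by_tag = counts[(True, 0)] / flagged_by_history
# Tagged mw-reverted but not reverted per mediawiki_history
missed_by_history = counts[(False, 1)] / flagged_by_tag

print(f"missed by mw-reverted: {missed_by_tag:.1%}")
print(f"missed by mediawiki_history: {missed_by_history:.1%}")
```

Under those denominators the shares come out at roughly 7.6% and 2.6%, a little below the rounded ~10% and ~3% quoted above.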
Aklapper subscribed.

@nettrom_WMF: Removing task assignee as this open task has been assigned for more than two years - See the email sent to task assignee on February 22nd, 2023.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!

mpopov lowered the priority of this task from Medium to Low. Apr 11 2023, 5:17 PM
mpopov moved this task from Upcoming Quarter to Backlog on the Product-Analytics board.

Just wanted to link to my comment in T415140#11545827 here too, which is that I've noticed two potential bugs related to revision_is_identity_reverted in mediawiki history. Quick summary:

  • Page moves / protections seem to generate no-op edits for articles. Not sure if that's always been true or is a recent change, but it means you have subsequent edits with the same sha1 hash and that is captured as a revert. Logically, it's easy to fix (require matching sha1 sums to be separated by at least two edits and not next to each other). Hopefully not difficult to patch but I'm not familiar enough with the code to know.
  • Deleted revisions have null sha1 sums, and sometimes revisions that happen right after the snapshot cut-off do too (e.g., some revisions from the first minutes of January 2026 still appear in the December 2025 snapshot but with only partial data). These null sha1 sums can match up and trigger very long stretches of reverts that didn't happen. I suspect this is hopefully an easy fix again: just ignore null sha1 sums in any calculations.
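
The two fixes suggested above could look something like the following toy Python sketch. It is not the mediawiki_history code: `identity_reverts_patched` and its `min_gap` parameter are hypothetical, and it reads the first suggestion as "matching checksums must have at least one revision between them".

```python
def identity_reverts_patched(sha1s, min_gap=2):
    """Toy checksum-based revert detection with both proposed guards.

    sha1s: checksums for one page, oldest first; None marks a missing sha1.
    min_gap: minimum index distance between matching revisions
             (2 means at least one revision in between).
    Returns (is_revert, is_reverted), two lists of booleans.
    """
    n = len(sha1s)
    is_revert = [False] * n
    is_reverted = [False] * n
    last_seen = {}  # checksum -> index of the most recent revision with it

    for i, sha1 in enumerate(sha1s):
        if sha1 is None:
            continue  # guard 2: null checksums never match anything
        j = last_seen.get(sha1)
        if j is not None and i - j >= min_gap:
            # guard 1: adjacent identical revisions (no-op edits) are skipped
            is_revert[i] = True
            for k in range(j + 1, i):
                is_reverted[k] = True
        last_seen[sha1] = i
    return is_revert, is_reverted


# A no-op edit (index 1) followed by a real change and a real revert.
print(identity_reverts_patched(["a", "a", "b", "a"]))
# A run of null checksums no longer produces any reverts.
print(identity_reverts_patched([None, "a", None, None]))
```

One open question in this sketch is whether a null-sha1 revision sitting between two matching revisions should itself be marked as reverted; here it is.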

Flagging for @Ahoelzl : This could be something we wish to consider.

+1, having this working as designed could help the Monthly Active Moderators work thread.

Ahoelzl raised the priority of this task from Low to Medium. Wed, Mar 25, 12:01 AM
Ahoelzl removed a project: Data-Engineering-Icebox.
Ahoelzl moved this task from Tag with Icebox to Next Up on the Data-Engineering board.