
Analyze differences between checksum-based and revert-tag based reverts in mediawiki_history
Open, Low, Public

Description

With the introduction of the "mw-reverted" tag in T254074, we now have two different approaches to identifying reverts in the MediaWiki History table:

  1. SHA1 checksums. Reverts identified this way are reflected in the revision_is_identity_reverted column.
  2. The mw-reverted tag. These reverts are identified through the tag being in the revision_tags column.

Once the October 2020 snapshot of MediaWiki History is available in early November, someone from the Product Analytics team (most likely @nettrom_WMF) should compare the two revert detection approaches on the same set of edits to identify where they overlap and where they differ.
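As a rough illustration of what such a comparison boils down to, here is a minimal Python sketch that labels a single revision by which approach flags it as reverted. The function name, the category labels, and the sample rows are all hypothetical; only the column names and the mw-reverted tag come from this task.

```python
from collections import Counter


def classify_revert_detection(is_identity_reverted, revision_tags):
    """Label one revision by which approach flags it as reverted.

    is_identity_reverted: value of revision_is_identity_reverted (bool or None)
    revision_tags: value of revision_tags (list of strings or None)
    """
    tagged = "mw-reverted" in (revision_tags or [])
    checksum = bool(is_identity_reverted)
    if checksum and tagged:
        return "both"
    if checksum:
        return "checksum_only"
    if tagged:
        return "tag_only"
    return "neither"


# Hypothetical sample rows (not real data):
# (revision_is_identity_reverted, revision_tags)
sample_rows = [
    (True, ["mw-reverted"]),
    (True, []),
    (False, ["mw-reverted", "mobile edit"]),
    (False, None),
]

counts = Counter(classify_revert_detection(r, t) for r, t in sample_rows)
print(dict(counts))
```

The "checksum_only" and "tag_only" groups are the ones worth inspecting by hand, since they are where the two approaches disagree.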

Event Timeline

I expect the main difference to be more (possibly many more) tag-based reverts than checksum-based reverts.

LGoto triaged this task as Medium priority.
LGoto moved this task from Triage to Current Quarter on the Product-Analytics board.

@JAllemandou : Yes, and I'm expecting to see some checksum-based reverts not having the tag because the tag only checks the last 15 edits.
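To make the lookback point concrete, here is a simplified Python sketch of checksum matching with an optional limit on how many edits a revert may undo. It is only an illustration under stated assumptions: it is not the actual algorithm used by mediawiki_history or by the tagging code, and the function name, the `max_reverted` parameter, and the toy history are hypothetical. The limit of 15 is taken from the comment above.

```python
def find_identity_reverts(sha1s, max_reverted=None):
    """Map reverting revision index -> restored revision index.

    sha1s: content checksums of one page's revisions, oldest first.
    max_reverted: optional cap on the number of intervening (reverted)
        edits; None means no cap.
    """
    reverts = {}
    for i, sha1 in enumerate(sha1s):
        # Scan backwards, skipping the immediately preceding revision
        # so that at least one edit is actually being reverted.
        for j in range(i - 2, -1, -1):
            n_reverted = i - j - 1
            if max_reverted is not None and n_reverted > max_reverted:
                break
            if sha1s[j] == sha1:
                reverts[i] = j
                break
    return reverts


# Toy history: content "a", then 20 distinct edits, then a restore of "a".
history = ["a"] + [f"v{k}" for k in range(20)] + ["a"]

print(find_identity_reverts(history))                   # no cap
print(find_identity_reverts(history, max_reverted=15))  # capped at 15 edits
```

In this toy case the uncapped checksum match finds the restore of the original content, while the capped version does not, which is the kind of checksum-only revert expected above.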

I'd be interested in looking at how the number of reverts detected by each method relates to how actively a page is being edited.

LGoto lowered the priority of this task from Medium to Low. Nov 16 2020, 5:43 PM

We're quickly running out of time in Q2, so moving this to Q3.

kzimmerman raised the priority of this task from Low to Medium. Feb 3 2021, 11:21 PM

The Growth team runs updates every week, so they're using the mw-reverted tag in those notebooks. But we don't have a clear sense of how well mw-reverted maps to revision_is_identity_reverted, which is what we use to calculate key metrics. Being able to reconcile these numbers is important because:

  • We are increasingly trying to understand Product Teams' impact on high level metrics
  • We need to verify that the mw-reverted tag is providing the information we expect

We believe this is medium priority but should be reconsidered for Q4.

@Isaac has made a preliminary investigation into this for English Wikipedia from May 2022. Adding the query he used and the results here so we can build upon it.

WITH reverts AS (
    SELECT
      revision_is_identity_revert AS mw_history,
      IF(ARRAY_CONTAINS(revision_tags, 'mw-undo'), 1, 0) AS undo,
      IF(ARRAY_CONTAINS(revision_tags, 'mw-rollback'), 1, 0) AS rollback,
      IF(ARRAY_CONTAINS(revision_tags, 'mw-manual-revert'), 1, 0) AS manual
    FROM wmf.mediawiki_history
    WHERE
      snapshot = '2022-05'
      AND wiki_db = 'enwiki'
      AND page_namespace = 0
      AND NOT page_is_redirect
      AND event_type = 'create'
      AND event_entity = 'revision'
      AND event_timestamp >= '2022-05-01'
      AND event_timestamp < '2022-06-01'  
)
SELECT
  mw_history,
  undo,
  rollback,
  manual,
  COUNT(1) AS num_edits
FROM reverts
GROUP BY
  mw_history,
  undo,
  rollback,
  manual

Findings:

English Wikipedia May 2022 data (namespace 0; no redirects):
+----------+----+--------+------+--------+
|mw_history|undo|rollback|manual|# edits |
+----------+----+--------+------+--------+
|false     |0   |0       |0     |3100382 |  # edits that weren't reverts by any definition

# reverts captured in mediawiki history
# vast majority also show up as edit tags
# but some small proportion were manual and
# missed by mediawiki for some reason
|true      |1   |0       |0     |109548  |  # straightforward undos
|true      |0   |1       |0     |54032   |  # straightforward rollbacks
|true      |0   |0       |1     |49750   |  # straightforward manual reverts
|true      |0   |0       |0     |20827   |  # manual reverts found by mediawiki_history table but not identified by mediawiki

# reverts not captured in mediawiki history
# vast majority are undos where presumably the editor
# changed something too and so generated a unique hash
|false     |1   |0       |0     |7711    |  # undo
|false     |0   |0       |1     |8       |  # manual
|false     |0   |1       |0     |2       |  # rollback
+----------+----+--------+------+--------+
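For anyone who wants the proportions rather than raw counts, the following Python sketch recomputes them from the table above. The variable names are mine; the counts are copied from the table, and nothing here queries the data again.

```python
# Counts copied from the table above, keyed by
# (mw_history, undo, rollback, manual).
counts = {
    (False, 0, 0, 0): 3100382,
    (True, 1, 0, 0): 109548,
    (True, 0, 1, 0): 54032,
    (True, 0, 0, 1): 49750,
    (True, 0, 0, 0): 20827,
    (False, 1, 0, 0): 7711,
    (False, 0, 0, 1): 8,
    (False, 0, 1, 0): 2,
}

# Reverting edits according to revision_is_identity_revert.
identity_total = sum(n for key, n in counts.items() if key[0])
# Reverting edits according to any of the three revert tags.
tagged_total = sum(n for key, n in counts.items() if any(key[1:]))

identity_untagged = counts[(True, 0, 0, 0)]
tagged_not_identity = sum(
    n for key, n in counts.items() if not key[0] and any(key[1:])
)

untagged_share = identity_untagged / identity_total
missed_by_checksum_share = tagged_not_identity / tagged_total

print(f"identity reverts: {identity_total}")
print(f"  of which untagged: {untagged_share:.1%}")
print(f"tagged reverts: {tagged_total}")
print(f"  of which not identity reverts: {missed_by_checksum_share:.1%}")
```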

Thanks @nettrom_WMF! Seeing that this task is about reverted edits as opposed to reverting edits (what I calculated above), here's the corresponding query/data for that. tl;dr: mw-reverted is a pretty good proxy but is missing about 10% of what mediawiki_history identifies. That's not necessarily a bad thing, though, as some of these edits seem to be, e.g., part of an edit war where a patroller's revert was also reverted but really shouldn't have been (mw-reverted tries to account for this; mediawiki_history does not).

spark.sql("""
WITH reverted AS (
    SELECT
      revision_is_identity_reverted as mw_history,
      IF(ARRAY_CONTAINS(revision_tags, 'mw-reverted'), 1, 0) AS reverted
    FROM wmf.mediawiki_history
    WHERE
      snapshot = '2022-05'
      AND wiki_db = 'enwiki'
      AND page_namespace = 0
      AND NOT page_is_redirect
      AND event_type = 'create'
      AND event_entity = 'revision'
      AND event_timestamp >= '2022-05-01'
      AND event_timestamp < '2022-06-01'  
)
SELECT
  COUNT(1) AS num_edits,
  mw_history,
  reverted
FROM reverted
GROUP BY
  mw_history,
  reverted
ORDER BY
  mw_history,
  num_edits DESC
""").show(100, False)
+---------+----------+--------+
|# edits  |mw_history|reverted|
+---------+----------+--------+
|3037145  |false     |0       |  # most edits aren't reverted (yay!)

# most reverted edits are captured by mediawiki_history and mw-reverted tags (yay!)
|275158   |true      |1       |  

# ~10% of reverted edits missed by mw-reverted; likely because its criteria are much more complicated: https://www.mediawiki.org/wiki/Manual:Reverts#Conditions_for_execution
|22487    |true      |0       |  

# small number of edits are tagged as reverted but missed in mediawiki_history (~3%). this about matches the number of mw-undo edits that were presumably partial and therefore missed by mediawiki_history (7711 -- see data from previous comment). so I assume that's what is going on here -- i.e. mediawiki_history doesn't detect partially-reverted edits as reverted.
|7470     |false     |1       | 
+---------+----------+--------+
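The rough percentages in the annotations can be recomputed from the counts in the table. This is a small Python sketch using only those four numbers; the variable names are mine. On these counts the share of checksum-detected reverted edits without the tag comes out a bit under the "~10%" ballpark.

```python
# Counts copied from the table above.
both = 275158           # mw_history = true,  mw-reverted tag present
checksum_only = 22487   # mw_history = true,  no mw-reverted tag
tag_only = 7470         # mw_history = false, mw-reverted tag present
neither = 3037145       # not reverted by either definition

checksum_total = both + checksum_only   # revision_is_identity_reverted
tag_total = both + tag_only             # tagged mw-reverted

# Share of checksum-detected reverted edits that lack the tag.
missed_by_tag = checksum_only / checksum_total
# Share of tagged edits that the checksum approach does not flag.
missed_by_checksum = tag_only / tag_total
# Overlap relative to everything flagged by at least one approach.
agreement = both / (both + checksum_only + tag_only)

print(f"missed by mw-reverted: {missed_by_tag:.1%}")
print(f"missed by mediawiki_history: {missed_by_checksum:.1%}")
print(f"flagged by both, of all flagged: {agreement:.1%}")
```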
Aklapper subscribed.

@nettrom_WMF: Removing task assignee as this open task has been assigned for more than two years - See the email sent to task assignee on February 22nd, 2023.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!

mpopov lowered the priority of this task from Medium to Low. Apr 11 2023, 5:17 PM
mpopov moved this task from Upcoming Quarter to Backlog on the Product-Analytics board.