
Analyze differences between checksum-based and revert-tag based reverts in mediawiki_history
Open, Low, Public

Description

With the introduction of the "mw-reverted" tag in T254074, we now have two different approaches to identifying reverts in the MediaWiki History table:

  1. SHA1 checksums. Reverts identified this way are reflected in the revision_is_identity_reverted column.
  2. The mw-reverted tag. These reverts are identified through the tag being in the revision_tags column.
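
To make the comparison concrete, here is a minimal sketch of how each revision could be sorted into one of four buckets depending on which approach flags it as reverted. The column names `revision_is_identity_reverted` and `revision_tags` come from the description above; the function names, the bucket labels, and the sample rows are hypothetical and only illustrate the idea, assuming rows are available as Python dictionaries.

```python
from collections import Counter

REVERTED_TAG = "mw-reverted"


def classify_revision(revision_is_identity_reverted, revision_tags):
    """Return which revert-detection approach(es) flag this revision as reverted."""
    by_checksum = bool(revision_is_identity_reverted)
    # revision_tags may be missing (None) for revisions without any tags.
    by_tag = REVERTED_TAG in (revision_tags or [])
    if by_checksum and by_tag:
        return "both"
    if by_checksum:
        return "checksum_only"
    if by_tag:
        return "tag_only"
    return "neither"


def compare_methods(rows):
    """Count revisions per bucket for an iterable of dict-like rows."""
    return Counter(
        classify_revision(
            row["revision_is_identity_reverted"], row["revision_tags"]
        )
        for row in rows
    )


# Hypothetical sample rows, not real data.
sample_rows = [
    {"revision_is_identity_reverted": True, "revision_tags": ["mw-reverted"]},
    {"revision_is_identity_reverted": True, "revision_tags": []},
    {"revision_is_identity_reverted": False, "revision_tags": ["mw-reverted", "mobile edit"]},
    {"revision_is_identity_reverted": False, "revision_tags": None},
    {"revision_is_identity_reverted": True, "revision_tags": None},
]

if __name__ == "__main__":
    for bucket, count in sorted(compare_methods(sample_rows).items()):
        print(bucket, count)
```

The `checksum_only` and `tag_only` buckets are the ones of interest here, since they show where the two approaches disagree.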

Once the October 2020 snapshot of MediaWiki History is available in early November, someone from Product Analytics (most likely @nettrom_WMF) should compare the two revert-detection approaches across edits to identify where they overlap and where they differ.

Event Timeline

The expected difference is more (possibly many more) tag-based reverts than checksum-based reverts.

fdans moved this task from Incoming to Data Quality on the Analytics board. Oct 26 2020, 4:11 PM
LGoto assigned this task to nettrom_WMF. Oct 27 2020, 5:13 PM
LGoto triaged this task as Medium priority.
LGoto moved this task from Triage to Current Quarter on the Product-Analytics board.

@JAllemandou: Yes, and I expect some checksum-based reverts to lack the tag, because the tag only checks the last 15 edits.
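
As a rough illustration of why a limited lookback could miss reverts that a full-history checksum comparison catches, here is a simplified sketch. It is not MediaWiki's tagging logic nor the MediaWiki History implementation; `find_identity_reverts` and its `max_lookback` parameter are hypothetical, and the value 15 is taken from the comment above.

```python
def find_identity_reverts(sha1s, max_lookback=None):
    """Return (reverting_index, restored_index) pairs for one page history.

    sha1s is the page's revision checksums in chronological order. A revision
    counts as a revert when its checksum matches an earlier revision with at
    least one other revision in between. With max_lookback set, only that many
    preceding revisions are searched (an assumed simplification).
    """
    reverts = []
    for i, sha1 in enumerate(sha1s):
        start = 0 if max_lookback is None else max(0, i - max_lookback)
        # Search backwards, skipping the immediate predecessor (i - 1).
        for j in range(i - 2, start - 1, -1):
            if sha1s[j] == sha1:
                reverts.append((i, j))
                break
    return reverts


# Hypothetical history: 20 distinct edits, then a restore of the first revision.
history = ["a"] + [f"x{k}" for k in range(20)] + ["a"]

unlimited = find_identity_reverts(history)
windowed = find_identity_reverts(history, max_lookback=15)
```

In this made-up history the unlimited search finds the restore of the first revision, while the 15-revision window does not, which is the kind of checksum-only case the comparison should surface.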

I'd be interested in looking at how the number of reverts each method detects varies with how actively a page is being edited.

LGoto lowered the priority of this task from Medium to Low. Mon, Nov 16, 5:43 PM

We're quickly running out of time in Q2, so moving this to Q3.