Evaluate Time to Revert data in the datalake
Open, Medium, Public, Spike

Description

We are working on a metric that can measure Reduced Backlog or Reduced Workload for Moderators.

While moderator patrol counts seemed like a good candidate, anecdotally people are asking for additional changes before they feel comfortable using them, so patrol counts alone will likely not be enough.
At the Jan 20 Patrol Count Definition meeting, we tinkered with the idea of using a combination of [Edits] Time to Revert (within a certain threshold) + [Editors] no. of users who took one of the moderator actions as an indicator of workload.
In this task we will evaluate the time-related revert fields in mediawiki_history and other tables in our datalake to understand whether this can be easily calculated or whether additional work is required.
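
A minimal sketch of the [Edits] Time to Revert half of this idea, assuming each reverted revision carries the ID of its first reverting revision (the role played by revision_first_identity_reverting_revision_id in mediawiki_history). The function names, the in-memory row layout, and the 48-hour threshold are hypothetical and only for illustration; no threshold has been agreed on.

```python
from datetime import datetime, timedelta


def seconds_to_revert(revisions):
    """Map each reverted revision_id to the seconds until its first reverting revision."""
    ts_by_id = {r["revision_id"]: r["event_timestamp"] for r in revisions}
    deltas = {}
    for r in revisions:
        reverting_id = r.get("revision_first_identity_reverting_revision_id")
        # Skip edits that were never reverted, or whose reverting edit is not in the sample.
        if reverting_id is None or reverting_id not in ts_by_id:
            continue
        delta = ts_by_id[reverting_id] - r["event_timestamp"]
        deltas[r["revision_id"]] = delta.total_seconds()
    return deltas


def share_reverted_within(revisions, threshold):
    """Fraction of all edits in the sample that were reverted within `threshold`."""
    if not revisions:
        return 0.0
    limit = threshold.total_seconds()
    fast = sum(1 for s in seconds_to_revert(revisions).values() if s <= limit)
    return fast / len(revisions)


# Made-up rows, not real data.
sample = [
    {"revision_id": 1, "event_timestamp": datetime(2025, 10, 1, 12, 0),
     "revision_first_identity_reverting_revision_id": 2},
    {"revision_id": 2, "event_timestamp": datetime(2025, 10, 1, 12, 30),
     "revision_first_identity_reverting_revision_id": None},
    {"revision_id": 3, "event_timestamp": datetime(2025, 10, 2, 9, 0),
     "revision_first_identity_reverting_revision_id": 4},
    {"revision_id": 4, "event_timestamp": datetime(2025, 10, 5, 9, 0),
     "revision_first_identity_reverting_revision_id": None},
]

print(seconds_to_revert(sample))
print(share_reverted_within(sample, timedelta(hours=48)))
```

Here revision 1 is reverted after 30 minutes and revision 3 after three days, so only one of the four edits counts as reverted within the hypothetical 48-hour window.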

Event Timeline

Mayakp.wiki changed the subtype of this task from "Task" to "Spike".

@Mayakp.wiki for other purposes, I was looking at a sample of reverts pulled from mediawiki history (code) and had a few findings/realizations that are likely relevant here:

  • There's a bug in mediawiki history where it marks edits as reverts when in fact they're just page moves or changes in page protections. These actions trigger an edit in the history of the page but it has no difference from the previous edit so it looks like a revert based on the sha1 hash even though no edits were actually reverted. Hopefully this is easy for Data Engineering to fix by just ensuring that a "revert" doesn't just bring it back to the previous edit but actually undoes something. Example: page move or page protection.
  • You may already know this, but you'll probably want to exclude any reverts that were themselves reverted (i.e. have the mw-reverted tag), as they likely indicate an edit war, and then it's hard to know in the middle who is the good-faith actor.
  • I found at least one instance of the sha1 being marked as null (I think because it actually occurred in January 2026 for the December 2025 snapshot so presumably was only partially included). Unfortunately, in early October a revision had been deleted (also a null sha1) so every edit to that page across a period of three months was marked as reverted (revision_is_identity_reverted). In practice, I assume this happens anytime there are deleted edits spaced out though I haven't double-checked this.
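
The three caveats above could be applied as a post-filter along these lines. This is a rough sketch, not how the mediawiki history job works: the helper names and the in-memory row layout are hypothetical, and it assumes a page's revisions can be ordered by timestamp so that each flagged revert can be compared with the edit immediately before it.

```python
def is_usable_revert(rev, prev_rev):
    """Return True if a flagged identity revert survives the three caveats."""
    if not rev["revision_is_identity_revert"]:
        return False
    sha1 = rev["revision_text_sha1"]
    # Null sha1 (deleted or partially loaded revision): identity matching is unreliable.
    if sha1 is None:
        return False
    # Same content as the immediately preceding edit: nothing was undone
    # (e.g. a page move or a protection change).
    if prev_rev is not None and prev_rev["revision_text_sha1"] == sha1:
        return False
    # The revert was itself reverted: likely an edit war.
    if "mw-reverted" in (rev["revision_tags"] or []):
        return False
    return True


def usable_reverts(page_revisions):
    """Return the revision_ids of the reverts on one page that pass the filter."""
    ordered = sorted(page_revisions, key=lambda r: r["event_timestamp"])
    kept, prev = [], None
    for rev in ordered:
        if is_usable_revert(rev, prev):
            kept.append(rev["revision_id"])
        prev = rev
    return kept


def row(rev_id, ts, sha1, is_revert, tags):
    return {"revision_id": rev_id, "event_timestamp": ts, "revision_text_sha1": sha1,
            "revision_is_identity_revert": is_revert, "revision_tags": tags}


# Made-up history of a single page, not real data.
page = [
    row(10, "2025-10-01T00:00:00", "aaa", False, []),
    row(11, "2025-10-01T01:00:00", "bbb", False, []),
    row(12, "2025-10-01T02:00:00", "aaa", True, ["mw-undo"]),      # genuine revert
    row(13, "2025-10-01T03:00:00", "aaa", True, []),               # no-op, e.g. page move
    row(14, "2025-10-01T04:00:00", None, True, []),                # null sha1
    row(15, "2025-10-01T05:00:00", "ccc", False, []),
    row(16, "2025-10-01T06:00:00", "aaa", True, ["mw-rollback", "mw-reverted"]),
]

print(usable_reverts(page))
```

Only revision 12 is kept: 13 changes nothing relative to 12, 14 has no sha1 to trust, and 16 was itself reverted.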

My sense is that it's worthwhile to file these bugs and try to fix them. Alternatively, in the sample I'm looking at, I don't see actual reverts that were missed in the edit tags but caught by the mediawiki history job, so you could just fall back to using only the edit tags to determine reverts. My sense is that the mediawiki-history logic in part exists to cover the edits that happened before the edit tags existed, but in this case you're just looking at recent history and forward, so it's less useful.

# example of page where null sha1 + deleted edit led to 3 months of edits being tagged as reverted
df = spark.sql("""
SELECT
  revision_id,
  event_timestamp,
  revision_text_sha1,
  revision_text_bytes_diff,
  revision_is_identity_reverted,
  revision_first_identity_reverting_revision_id, 
  revision_is_identity_revert,
  revision_tags
FROM wmf.mediawiki_history
WHERE
  snapshot = "2025-12"
  AND wiki_db = "enwiki"
  AND page_id = 80838047
  AND event_entity = "revision"
  AND event_type = "create"
  AND event_timestamp >= "2025-10-01"
  AND page_namespace = 0
""").toPandas()