Page MenuHomePhabricator

Develop method for identifying reverts in EventBus data
Closed, ResolvedPublic

Description

The mediawiki_history dataset has a very convenient is_reverted field, but mediawiki_history arrives monolithically once a month. If we are trying to track revert rates closer to real time, our current best strategy is querying the API and using the mw-reverts package. However, this isn't very performant.

The EventBus data stream of revision creations (see: https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/mediawiki/revision/create/3.yaml) includes the revision's hash, which is the same thing that mw-reverts uses for revision detection. It would be useful to create a tool that operates directly on this data (with a to be determined time to revert cutoff), which would probably provide much better performance.

Event Timeline

Restricted Application changed the subtype of this task from "Deadline" to "Task". · View Herald TranscriptFeb 15 2019, 11:00 PM

I'll add this to our agenda for monthly Analytics/Product Analytics hangout for discussion

If we are trying to track revert rates closer to real time, our current best strategy is querying the API and using the mw-reverts package. However, this isn't very performant.

Indeed, but mwreverts also offers the option to use the (MySQL replica) database instead of the API, which should be much faster.
(The db option did not work on PAWS last time I tried to use it there. I filed https://github.com/mediawiki-utilities/python-mwreverts/issues/8 about this, @Halfak looked a bit into it and said it should work there too in principle, but would need some work fixing.)

Hmmmm, if MW can know if an edit is a revert via a revert tool (not just a copy/paste of old content), then I think we could include that fact in the event, e.g. is_revert: true or something.

Could we get here (from the UI) whether the user clicked the "revert" button, even? (per @Milimetric 's suggestion) and send that to the hook so the event data also has this information? This would not catch the totality of revisions but a big percentage of them, which, hey, it is a start.

Does anyone on Platform Engineering have any worlds of advice here?

Could we get here (from the UI) whether the user clicked the "revert" button, even? (per @Milimetric 's suggestion) and send that to the hook so the event data also has this information? This would not catch the totality of revisions but a big percentage of them, which, hey, it is a start.

Of course there are already the "undo" and "rollback" change tags, which are generated based on the usage of the corresponding UI elements. (Does EventBus data already include change tags now?) But I think there are several reasons why people have not been using them for reverts and instead relied on content-based revert detection, e.g. the fact that edits tagged "undo" might not actually be reverts because the user can modify the content before saving.

But I think there are several reasons why people have not been using them for reverts and instead relied on content-based revert detection, e.g. the fact that edits tagged "undo" might not actually be reverts because the user can modify the content before saving.

Another possibility is that the "rollback" and "undo" tags were only added a little over a year ago, so things written before that time wouldn't have been able to use them.

It also depends on your definition of "revert", whether you count those edited undoes or undoing of a revision older than the most recent while keeping the changes from later revisions.

@Anomie this is concerned with eventbus data that is real time, most of the data wanted here is less than couple months old as information wether edits where a revert exists on other datasets for data older than that. So (it seems) that having "rollback" or "undo" tags is actually a good measure of whether a revert has happened. In which case adding tags to this schema should be sufficient:

https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/mediawiki/revision/create/3.yaml

But I think there are several reasons why people have not been using them for reverts and instead relied on content-based revert detection, e.g. the fact that edits tagged "undo" might not actually be reverts because the user can modify the content before saving.

Another possibility is that the "rollback" and "undo" tags were only added a little over a year ago, so things written before that time wouldn't have been able to use them.

But before that, researchers and analysts were able to instead use the characteristic strings those actions leave in the edit summary (certainly a bit less reliable and convenient, but e.g. the Kittur et al. paper mentioned in https://meta.wikimedia.org/wiki/Research:Revert did something like this in 2007 already).

It also depends on your definition of "revert", whether you count those edited undoes or undoing of a revision older than the most recent while keeping the changes from later revisions.

Sure. There's a good overview in https://meta.wikimedia.org/wiki/Research:Revert , I understand this task is about what is called "identity revert" there because that is what is most practical and already implemented in mwreverts and in mediawiki_history.

@Anomie this is concerned with eventbus data that is real time, most of the data wanted here is less than couple months old as information wether edits where a revert exists on other datasets for data older than that. So (it seems) that having "rollback" or "undo" tags is actually a good measure of whether a revert has happened. In which case adding tags to this schema should be sufficient:

https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/mediawiki/revision/create/3.yaml

We may be getting a little offtopic here: I understand the main purpose of this task is to identify reverted edits (see the first sentence of the task description), like the existing field in mediawiki_history does, rather than reverting edits (admittedly the task title is a bit ambiguous).
That said, having EventBus data include rollback and undo tags could be valuable as well - I might have a use case for this soon when measuring AMC moderation actions (those include undos and rollbacks, cf. T213461, although I guess some of the other actions and tags involved may not be supported in EventBus yet).

We may be getting a little offtopic here: I understand the main purpose of this task is to identify reverted edits (see the first sentence of the task description), like the existing field in mediawiki_history does, rather than reverting edits (admittedly the task title is a bit ambiguous).

The event stream can of course only include information about whether an edit reverts another. That information could then be used to mark some revision as reverted in the EventBus data storage (wherever that may be, and however that may work).

MediaWiki exposes information about which revision was restored in two ways:

  1. implicitly, via the revisions SHA1 hash (plus the page ID). If the target store has an index over (page_id, rev_sha1), using this should be simple enough.
  2. explicitly, via the "original revision ID" which is set on the PageUpdater class and exposed via the PageContentSaveComplete and NewRevisionFromEditComplete hooks. Note that this is also set on null revisions, so if the original rev ID is the same as the parent ID, it's not an actual revert. This could be used if the target store has an index on the revision ID.

From this information, we can try to infer which revision(s) (it may be more than one!) was reverted, by looking at the range of revisions between the restored revision, and the one doing the restoring. This is a little tricky, since the revision sequence is defined by a combination of timestamp and revision ID, but should be doable.

Note that manual reverts (by clicking edit on an older revision, and then saving) is detected by hash, but the "original rev ID" is not set in this case. This could be done without too much trouble, but if I recall correctly, this may confuse some extensions.

Also note that "undo" actions are not covered by this - they are detected by the has method only if it was a "simple" revert of the most recent edit, without any "reverse patching". Information about which revision was undone is present in the PageUpdater class, but currently not exposed via any hooks. It's only used for setting the change tags.

@daniel thanks for explaining how "original rev id" works, I didn't know about that. It sounds like as it is implemented now it would only catch the same kind of reverts that sha-matching would help us catch, right? We do sha-matching revert detection as part of the mediawiki history reconstruction, where we take care to solve problems like overlapping or nested windows of matching shas (eg. histories of a page with revisions with shas like B...A...B...A...B). We could apply the same logic to real-time data here, but it would not be trivial.

What was interesting about what you said was that "original rev id" could be retrieved for other kinds of reverts, like partial reverts (where you're editing an old revision). This would be really cool. Could we maybe think about adding rev_original_id or something similar to the database (could be a separate table since we don't want to grow revision more)? I figure if we do that going forward, we don't mess with extensions that rely on the hook. And then we can backfill it historically using sha-matching. Later when we start analyzing content persistence we can even backfill partial reverts. Thoughts?

@Milimetric recording original revision ID would be possible, but if it's only relevant for analytics, it should probably be done in an analytics system, not core. Or by an extension, as a middle ground.

Note that this plays into the debate about the notion (or rather, multiple conflicting notions) of "parent revisions", see the lengthy discussion at T193690.

Well, maybe finding some way to put original revision ID in the hook without hurting other extensions might be needed then, let's consider that the "right way" to approach this and see what work-around we can come up with until such a time. Thanks for the pointer to the undelete RFC.

With the new EditResult class in place the task described here should be now possible. After an edit is saved, an EditResult object containing all revert-related information is created and passed to extensions using the PageSaveComplete hook.

I don't know how this data collection setup works, so I'm not sure if that's enough for this use case. If you need the EditResult object to be passed also somewhere else, please let me know :)

Great, then I think it should be rather easy :) Ping me if you have any issues with this, the feature is new and there still may be some bugs here and there.

Pchelolo subscribed.

So now that it's fairly easy to do we would welcome a volunteer contribution. Two things need to happen in order for this to work:

  1. A new minor version of the revision-create schema needs to be created in the schema repo to include rev_revert boolean, rev_revert_method enum, rev_revert_tags array and rev_is_exact_revert boolean. I am not sure all of these info needs to be in the event, we can discuss this in a review.
  2. EventBus extension needs to be updated to provide the info.

Change 615415 had a related patch set uploaded (by Ostrzyciel; owner: Ostrzyciel):
[schemas/event/primary@master] mediawiki/revision/create: Add revert-related fields

https://gerrit.wikimedia.org/r/615415

Change 617477 had a related patch set uploaded (by Ostrzyciel; owner: Ostrzyciel):
[mediawiki/extensions/EventBus@master] Revision create event: add revert-related fields

https://gerrit.wikimedia.org/r/617477

Change 615415 merged by Ppchelko:
[schemas/event/primary@master] mediawiki/revision/create: Add revert-related fields

https://gerrit.wikimedia.org/r/615415

Thank you for working on this, this is great. Both patches were merged, so now just need to wait for them to get deployed and verify them working.

@Ottomata please remind me - do we need to do anything with eventgate to make it pick up new schema versions or that's in the ancient past?

For production eventgate's like eventgate-main, yes, we need to rebuild and deploy the eventgate-wikimedia image, which has the specific schema repo git sha to clone in blubber.yaml.

We could consider removing this restriction. eventgate-analytics-external also has schema repos baked into the image, but it also has the remote one configured. It will fall back to using the remote one if a schema is not found in the local ones.

We should rebuild the image and update in beta and test. I'll work on this now @Pchelolo .

And yes thank you so much @Ostrzyciel this is so so great!

Change 618531 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventgate-wikimedia@master] blubber.yaml - bump schema repo versions to get revision/create/1.1.0

https://gerrit.wikimedia.org/r/618531

Change 617477 merged by jenkins-bot:
[mediawiki/extensions/EventBus@master] Revision create event: add revert-related fields

https://gerrit.wikimedia.org/r/617477

Change 618531 merged by Ottomata:
[eventgate-wikimedia@master] blubber.yaml - bump schema repo versions to get revision/create/1.1.0

https://gerrit.wikimedia.org/r/618531

Tested in beta, it works! I just _manually_ reverted some wikitext via an edit:

{
  "$schema": "/mediawiki/revision/create/1.1.0",
  "meta": {
    "uri": "https://meta.wikimedia.beta.wmflabs.org/wiki/TestOtto",
    "request_id": "Xyq7gKwQBHcAAA-tbk8AAAAG",
    "id": "28528caa-b416-4b29-b908-4a2aab0a246b",
    "dt": "2020-08-05T14:00:32Z",
    "domain": "meta.wikimedia.beta.wmflabs.org",
    "stream": "mediawiki.revision-create"
  },
  "database": "metawiki",
  "page_id": 6784,
  "page_title": "TestOtto",
  "page_namespace": 0,
  "rev_id": 9250,
  "rev_timestamp": "2020-08-05T14:00:32Z",
  "rev_sha1": "778ci8h2d9h2hvy02tq7bm1qm3fmplo",
  "rev_minor_edit": false,
  "rev_len": 4,
  "rev_content_model": "wikitext",
  "rev_content_format": "text/x-wiki",
  "performer": {
    "user_text": "Otto",
    "user_groups": [
      "*",
      "user"
    ],
    "user_is_bot": false,
    "user_id": 7496,
    "user_registration_dt": "2016-08-29T16:26:00Z",
    "user_edit_count": 0
  },
  "page_is_redirect": false,
  "rev_parent_id": 9249,
  "rev_content_changed": true,
  "rev_is_revert": true,
  "rev_revert_details": {
    "rev_reverted_revs": [
      9249
    ],
    "rev_revert_method": "manual",
    "rev_is_exact_revert": true,
    "rev_original_rev_id": 9248
  }
}

Change 618535 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-main - bump image version to get schema mediawiki/revision/create/1.1.0

https://gerrit.wikimedia.org/r/618535

Change 618535 merged by Ottomata:
[operations/deployment-charts@master] eventgate-main - bump image version to get schema mediawiki/revision/create/1.1.0

https://gerrit.wikimedia.org/r/618535

Ok, deployed in prod; should be ready for train deployment next week.

BTW: I can't get over how cool this is! This is something we've wanted for a long time, and @Ostrzyciel just made it happen!

Ostrzyciel claimed this task.

Thanks :)