
Develop method for identifying reverts in EventBus data
Open, NormalPublic

Description

The mediawiki_history dataset has a very convenient is_reverted field, but mediawiki_history arrives monolithically once a month. If we are trying to track revert rates closer to real time, our current best strategy is querying the API and using the mw-reverts package. However, this isn't very performant.

The EventBus data stream of revision creations (see: https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/mediawiki/revision/create/3.yaml) includes the revision's hash, which is the same thing that mw-reverts uses for revert detection. It would be useful to create a tool that operates directly on this data (with a to-be-determined time-to-revert cutoff), which would probably provide much better performance.
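
A minimal sketch of what such a tool could look like, assuming the events arrive in page-history order and expose `page_id`, `rev_id` and `rev_sha1` keys (check the linked schema for the exact field names). The function name and the `radius` cutoff (a count of revisions rather than a time window) are illustrative choices, not part of any existing package:

```python
from collections import defaultdict, deque


def detect_identity_reverts(events, radius=15):
    """Yield (reverting_rev_id, restored_rev_id, [reverted_rev_ids]).

    `events` is an iterable of dicts with `page_id`, `rev_id` and `rev_sha1`
    keys (assumed field names). `radius` caps how many revisions a single
    revert may undo; it stands in for the to-be-determined cutoff.
    """
    # Per page, remember only the most recent (rev_id, sha1) pairs.
    recent = defaultdict(lambda: deque(maxlen=radius + 1))
    for event in events:
        window = recent[event["page_id"]]
        history = list(window)
        # Search newest-to-oldest for an earlier revision with the same hash.
        for i in range(len(history) - 1, -1, -1):
            if history[i][1] == event["rev_sha1"]:
                reverted = [rev_id for rev_id, _ in history[i + 1:]]
                # Nothing in between means identical consecutive content,
                # which is not counted as a revert here.
                if reverted:
                    yield event["rev_id"], history[i][0], reverted
                break
        window.append((event["rev_id"], event["rev_sha1"]))


events = [
    {"page_id": 1, "rev_id": 1, "rev_sha1": "B"},
    {"page_id": 1, "rev_id": 2, "rev_sha1": "A"},
    {"page_id": 1, "rev_id": 3, "rev_sha1": "B"},  # restores rev 1, reverts rev 2
]
for reverting, restored, reverted in detect_identity_reverts(events):
    print(reverting, restored, reverted)
```

Because state is kept per page and bounded by `radius`, memory stays proportional to the number of recently edited pages. A time-based cutoff would need the revision timestamp in addition to the fields used here.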

Event Timeline

Restricted Application changed the subtype of this task from "Deadline" to "Task". · Feb 15 2019, 11:00 PM

I'll add this to our agenda for the monthly Analytics/Product Analytics hangout for discussion.

Tbayer added subscribers: Halfak, Tbayer. · Edited · Feb 20 2019, 8:23 PM

If we are trying to track revert rates closer to real time, our current best strategy is querying the API and using the mw-reverts package. However, this isn't very performant.

Indeed, but mwreverts also offers the option to use the (MySQL replica) database instead of the API, which should be much faster.
(The db option did not work on PAWS last time I tried to use it there. I filed https://github.com/mediawiki-utilities/python-mwreverts/issues/8 about this, @Halfak looked a bit into it and said it should work there too in principle, but would need some work fixing.)

Hmmmm, if MW can know if an edit is a revert via a revert tool (not just a copy/paste of old content), then I think we could include that fact in the event, e.g. is_revert: true or something.

Could we get here (from the UI) whether the user clicked the "revert" button, even? (per @Milimetric 's suggestion) and send that to the hook so the event data also has this information? This would not catch the totality of revisions but a big percentage of them, which, hey, it is a start.

Nuria updated the task description. · Mar 13 2019, 6:39 PM

Does anyone on the Core Platform Team have any words of advice here?

Tbayer added a comment. · Edited · Mar 13 2019, 9:31 PM

Could we get here (from the UI) whether the user clicked the "revert" button, even? (per @Milimetric 's suggestion) and send that to the hook so the event data also has this information? This would not catch the totality of revisions but a big percentage of them, which, hey, it is a start.

Of course there are already the "undo" and "rollback" change tags, which are generated based on the usage of the corresponding UI elements. (Does EventBus data already include change tags now?) But I think there are several reasons why people have not been using them for reverts and instead relied on content-based revert detection, e.g. the fact that edits tagged "undo" might not actually be reverts because the user can modify the content before saving.

Tbayer updated the task description. · Mar 13 2019, 9:35 PM
Anomie added a subscriber: Anomie. · Mar 15 2019, 3:14 PM

But I think there are several reasons why people have not been using them for reverts and instead relied on content-based revert detection, e.g. the fact that edits tagged "undo" might not actually be reverts because the user can modify the content before saving.

Another possibility is that the "rollback" and "undo" tags were only added a little over a year ago, so things written before that time wouldn't have been able to use them.

It also depends on your definition of "revert", whether you count those edited undoes or undoing of a revision older than the most recent while keeping the changes from later revisions.

Nuria added a comment. · Mar 15 2019, 3:59 PM

@Anomie this is concerned with EventBus data that is real time; most of the data wanted here is less than a couple of months old, as information on whether edits were reverts exists in other datasets for data older than that. So (it seems) that having "rollback" or "undo" tags is actually a good measure of whether a revert has happened. In which case adding tags to this schema should be sufficient:

https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/mediawiki/revision/create/3.yaml

But I think there are several reasons why people have not been using them for reverts and instead relied on content-based revert detection, e.g. the fact that edits tagged "undo" might not actually be reverts because the user can modify the content before saving.

Another possibility is that the "rollback" and "undo" tags were only added a little over a year ago, so things written before that time wouldn't have been able to use them.

But before that, researchers and analysts were able to instead use the characteristic strings those actions leave in the edit summary (certainly a bit less reliable and convenient, but e.g. the Kittur et al. paper mentioned in https://meta.wikimedia.org/wiki/Research:Revert did something like this in 2007 already).

It also depends on your definition of "revert", whether you count those edited undoes or undoing of a revision older than the most recent while keeping the changes from later revisions.

Sure. There's a good overview in https://meta.wikimedia.org/wiki/Research:Revert , I understand this task is about what is called "identity revert" there because that is what is most practical and already implemented in mwreverts and in mediawiki_history.

Tbayer updated the task description. · Mar 15 2019, 7:48 PM

@Anomie this is concerned with EventBus data that is real time; most of the data wanted here is less than a couple of months old, as information on whether edits were reverts exists in other datasets for data older than that. So (it seems) that having "rollback" or "undo" tags is actually a good measure of whether a revert has happened. In which case adding tags to this schema should be sufficient:
https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/mediawiki/revision/create/3.yaml

We may be getting a little offtopic here: I understand the main purpose of this task is to identify reverted edits (see the first sentence of the task description), like the existing field in mediawiki_history does, rather than reverting edits (admittedly the task title is a bit ambiguous).
That said, having EventBus data include rollback and undo tags could be valuable as well - I might have a use case for this soon when measuring AMC moderation actions (those include undos and rollbacks, cf. T213461, although I guess some of the other actions and tags involved may not be supported in EventBus yet).

daniel added a subscriber: daniel. · Mar 19 2019, 1:46 PM

We may be getting a little offtopic here: I understand the main purpose of this task is to identify reverted edits (see the first sentence of the task description), like the existing field in mediawiki_history does, rather than reverting edits (admittedly the task title is a bit ambiguous).

The event stream can of course only include information about whether an edit reverts another. That information could then be used to mark some revision as reverted in the EventBus data storage (wherever that may be, and however that may work).

MediaWiki exposes information about which revision was restored in two ways:

  1. implicitly, via the revision's SHA1 hash (plus the page ID). If the target store has an index over (page_id, rev_sha1), using this should be simple enough.
  2. explicitly, via the "original revision ID" which is set on the PageUpdater class and exposed via the PageContentSaveComplete and NewRevisionFromEditComplete hooks. Note that this is also set on null revisions, so if the original rev ID is the same as the parent ID, it's not an actual revert. This could be used if the target store has an index on the revision ID.
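
The null-revision caveat in option 2 could be handled roughly like this. This is only a sketch: the function name, the argument names and the returned labels are hypothetical, and it assumes the original revision ID has somehow been made available alongside the revision's parent ID:

```python
def classify_original_rev(parent_id, original_rev_id):
    """Interpret the "original revision ID" of a newly saved revision.

    Both arguments are revision IDs, or None when not set. The labels
    returned are illustrative only.
    """
    if original_rev_id is None:
        # Nothing was recorded; hash matching would be needed instead.
        return "not_set"
    if original_rev_id == parent_id:
        # The content is unchanged from the parent (null revision),
        # so this is not an actual revert.
        return "null_revision"
    # The content of an earlier revision was restored.
    return "restores_earlier_revision"


print(classify_original_rev(parent_id=120, original_rev_id=120))
print(classify_original_rev(parent_id=120, original_rev_id=117))
```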

From this information, we can try to infer which revision(s) (it may be more than one!) was reverted, by looking at the range of revisions between the restored revision, and the one doing the restoring. This is a little tricky, since the revision sequence is defined by a combination of timestamp and revision ID, but should be doable.
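
As a rough illustration of that inference, assuming each revision of the page is available as a dict with `rev_id` and `rev_timestamp` keys (assumed names, with timestamps in one sortable format such as ISO 8601 UTC), the reverted range could be computed like this. The helper name is hypothetical:

```python
def reverted_between(revisions, restored_id, restoring_id):
    """Return the IDs of revisions strictly between the restored revision
    and the revision doing the restoring, in page-history order.

    History order is taken to be (timestamp, revision ID), since revision
    IDs alone do not always follow the timestamps.
    """
    ordered = sorted(revisions, key=lambda r: (r["rev_timestamp"], r["rev_id"]))
    ids = [r["rev_id"] for r in ordered]
    start = ids.index(restored_id)   # raises ValueError if unknown
    end = ids.index(restoring_id)
    return ids[start + 1:end]


history = [
    {"rev_id": 10, "rev_timestamp": "2019-03-01T10:00:00Z"},
    {"rev_id": 12, "rev_timestamp": "2019-03-01T11:00:00Z"},
    {"rev_id": 11, "rev_timestamp": "2019-03-01T12:00:00Z"},  # ID out of order
    {"rev_id": 13, "rev_timestamp": "2019-03-01T12:00:00Z"},  # same timestamp
]
print(reverted_between(history, restored_id=10, restoring_id=13))
```

Sorting on the (timestamp, ID) pair keeps the result stable when two revisions share a timestamp, which sorting on either field alone would not guarantee.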

Note that manual reverts (made by clicking edit on an older revision, and then saving) are detected by hash, but the "original rev ID" is not set in this case. Setting it could be done without too much trouble, but if I recall correctly, this may confuse some extensions.

Also note that "undo" actions are not covered by this - they are detected by the hash method only if the undo was a "simple" revert of the most recent edit, without any "reverse patching". Information about which revision was undone is present in the PageUpdater class, but currently not exposed via any hooks. It's only used for setting the change tags.

@daniel thanks for explaining how "original rev id" works, I didn't know about that. It sounds like as it is implemented now it would only catch the same kind of reverts that sha-matching would help us catch, right? We do sha-matching revert detection as part of the mediawiki history reconstruction, where we take care to solve problems like overlapping or nested windows of matching shas (eg. histories of a page with revisions with shas like B...A...B...A...B). We could apply the same logic to real-time data here, but it would not be trivial.

What was interesting about what you said was that "original rev id" could be retrieved for other kinds of reverts, like partial reverts (where you're editing an old revision). This would be really cool. Could we maybe think about adding rev_original_id or something similar to the database (could be a separate table since we don't want to grow revision more)? I figure if we do that going forward, we don't mess with extensions that rely on the hook. And then we can backfill it historically using sha-matching. Later when we start analyzing content persistence we can even backfill partial reverts. Thoughts?

@Milimetric recording original revision ID would be possible, but if it's only relevant for analytics, it should probably be done in an analytics system, not core. Or by an extension, as a middle ground.

Note that this plays into the debate about the notion (or rather, multiple conflicting notions) of "parent revisions", see the lengthy discussion at T193690.

Well, maybe finding some way to put original revision ID in the hook without hurting other extensions might be needed then, let's consider that the "right way" to approach this and see what work-around we can come up with until such a time. Thanks for the pointer to the undelete RFC.