Page MenuHomePhabricator

Choose the definition of a "reverted" edit
Closed, ResolvedPublic

Description

This task is dedicated to establishing a definition of a reverted edit, as part of my GSoC 2020 project: T248775: Proposal: Add Reverted filter to RecentChanges Filters
I will later implement the chosen solution in MediaWiki's core.

There already a couple approaches to this problem out there, most notably the Echo extension and the mwreverts Python library, both discussed below. Here I tried to compile the various requirements and ideas posted on Phabricator by people interested in this feature, most notably:

Aspects

Important aspects we must consider when choosing the solution:

  • Whether it will satisfy Echo's requirements, i.e. more or less match its current behavior for revert notifications. It would be perfect if we could create a unified definition of a reverted edit, as it would significantly reduce confusion among users. We will also want to get rid of redundant revert detection code in Echo.
  • Overall accuracy. We will never cover all cases, but we should try to cover a reasonable amount of them.
  • Performance. A computationally expensive solution going through all diffs on a page is probably out of the question, we want to apply this tag right after the edit is saved.

Possible solutions

Below I list some of the solutions that were proposed for the problem over the years.

Exact SHA1 matching

Entries in the revision table have the rev_sha1 field, which, according to documentation, would be suitable for checking if the contents of all slots are equal in two revisions.

This is pretty much what the mwreverts Python library does for revert detection. The main difference is it does it completely off-wiki.

Pros
  • That's perfect for exact reverts, i.e. reverting the page to an exact previous state.
  • There probably won't be a lot of logic required to implement this.
  • According to this document, about 94% of all reverts would be covered. Please note that this is based on research done in 2007 and can be a bit outdated. Still, I think it's safe to assume this method would have reasonable accuracy.
Cons
  • Non-exact reverts (for example: user uses undo and makes some slight changes over it before saving) would not be handled.
  • Reverting an intermediate change with automatic conflict resolution can create an entirely new SHA1, thanks to being a mix of two revisions. This wouldn't be handled.
  • Other more complicated scenarios would not be handled, too.
  • This does not match Echo's behavior at all.

Without an index on the rev_sha1 field this would be very inefficient and have unpredictable performance. There are at least two solutions for that:

Add an index to the revision table

That would involve adding an additional index to the revision table, probably over rev_page and rev_sha1 fields.

Pros
  • Would allow for efficient checking for reverts on pages with very long history.
  • Would offer better accuracy than the second method described below.
  • The index may be useful for other possible features, such as looking up edits by their SHA1 through the API.
Cons
  • Requires a schema change and an additional index. Any additional index slows down the DB on inserts/deletes slightly, so we have to have a good reason to introduce one.
Use radius search

This approach (also described here) would involve linearly going through SHA1s of n most recent edits or edits that were made at most x hours ago.

Pros
  • The performance of this solution would be very predictable in most cases.
  • We could make the search radius a configuration variable to let sysadmins decide on the balance between performance and accuracy.
Cons
  • This wouldn't handle all cases. There are cases such as deep reverts that revert the page to a state it's been in even a few years ago. This is used rather rarely, though, and it is a very different rollback pattern.
  • Another case this solution wouldn't handle are extreme cases of spam, where a vandal creates tens of revisions on the same page and we use radius search based on revision count. This is also quite rare, though, as this would be a very inefficient thing to do for a vandal.

Echo-like

In this approach we would try to more-or-less mimic current behavior of Echo when detecting reverted edits. I do not use Echo very often, so the following is mostly based on reading Echo's code, mainly this and this hook. The current solution relies on some shady static variable, so it would be great if we could fix that along the way. I hope I understand the code correctly, but I may be wrong on the following pros and cons.

One possible improvement over the current implementation would be to support undoing multiple edits, as described in T153570.

Pros
  • It handles well undos done through the undo link, even if the undoing user later applies some changes over it. I think.
  • It handles the classic automatic rollback well, that is reverting one or multiple changes made by one user.
  • Using this approach we would have a unified revert definition and could reuse the code in Echo easily. That would probably require a hook that would get triggered whenever an edit (or a group of edits) is reverted.
  • It should handle reverting intermediate changes with automatic conflict resolution, as long as the reverting user uses the undo link.
  • Very predictable performance.
Cons
  • It does not handle manual reverts, that is reverting the page to a previous state not using undo or rollback links. That would mean reverts made with some gadgets would not be included.

Other

A method of identifying reverts by diffing was proposed, but I don't think it would be suitable for this application given its high computational requirements.

Hybrid

It is possible to combine two approaches described above to cover more cases and achieve higher accuracy.

I would propose using the Echo-like method as the first stage and if it does not give us any results, proceed to a SHA1-based method. I propose to use radius search based on edit count, for example we would only check the last 15 revisions of page's history.

In this approach it may also be valuable to separate in code exact and non-exact reverts, as was suggested in T152434.

Pros
  • It would handle the vast majority of cases out there, I think.
  • This would be mostly consistent with what Echo does. Actually, by being an extension to Echo's behavior, it can be considered an improvement upon it. We would meet the requirement of creating a unified revert definition.
  • This should also provide at least as good data as the mwreverts Python library, if I understand correctly. Data science people may be pleased to hear that.
Cons
  • This can significantly increase code complexity, so it may be wise to do some research into by how much would the accuracy actually improve before deciding on this.

Conclusion

Personally, I would go for the hybrid approach, as it covers the majority of cases while being compatible with the needs of most groups interested in the feature. If we choose radius search, it would also not require any DB changes.

Any input on the topic is very welcome! :)
I also tag here my mentors: @kostajh and @Catrope

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Stopping by after @kostajh asked if I might be interested in this. Great summary of possible methods and pros and cons, @Ostrzyciel !

One technical note: I'm fairly sure that the radius cutoff of 15 is based on Ekstrand and Riedl's paper "rv you’re dumb: Identifying Discarded Work in Wiki Article History". It's a fun paper that does some cool stuff, well worth a read. The idea of limiting how far back you want to go is partly to not run into those cases where some user vandalizes things by reverting back to something a couple of years old, which isn't helpful when you're trying to visualize edit histories. It's arguably not meaningful when we're doing revert detection for something like Recent Changes either.

I support the idea of the hybrid approach! Can I propose that it's implemented in a way that allows us to understand if either the Echo or SHA1 checksum methods resulted in a revert being detected? The Analytics Engineering team does something similar for bot detection in the MediaWiki History table in the Data Lake. The event_user_is_bot_by (and similar) column(s) is an array, so we not only learn that an account has been labelled as a bot, but also how. That allows us to only select bots identified through the user group, or their user name, or both, depending on what we're interested in. Something similar for revert detection could allow us easy access to learn more about revert behavior.

I've also shared this phab task with the Product Analytics team, as other analysts might have other use cases or ideas about how to do this.

Thank you @nettrom_WMF, that research paper is indeed very good. It also led me to thinking of cases when a reverted edit is reverted back, that would have to be handled nicely. I will have to think this through carefully, as I can think of many weird cases that would be tricky to solve, i.e. what happens if there is a revert war longer than 15 edits. This also leads me to think implementing this can require something more complicated than just a revision tag.

Can I propose that it's implemented in a way that allows us to understand if either the Echo or SHA1 checksum methods resulted in a revert being detected?

Sure! This was already previously proposed in some other task and would certainly be useful for analysts.

I thought a bit about revert wars and how would this feature work with them. Suppose we have a series of edits, in chronological order:

  1. User Alice
  2. User Bob
  3. User Alice, reverts to version 1
  4. User Bob, reverts to version 2
  5. User Alice, reverts to version 1
  6. User Bob, reverts to version 2

So Bob's version is the one that finally stays in the article. Here we have two possible approaches:

  1. Every time an edit is reverted, we mark it as reverted and that is permanent. So in this case when edit 3 is made, we mark 2 as reverted, when 4 is made we mark 3 as reverted, etc. In the end the only non-reverted edits would be 1 and 6.
    • That is quite logical and would be relatively simple to implement.
    • The only downside is that when we use some kind of condensed view that shows only non-reverted edits we will be left with edit 1 and edit 6 with a rather meaningless edit description (something like reverting Alice instead of added a description of twin-nosed elephants, for example). This also means that Bob's good edit (version 2) will be permanently marked as bad.
  2. The other approach would be to maintain some kind of revert chain. Every time an edit is reverted we would have to figure out whether it un-reverts some previous edit. For example with edit 4 we would remove the reverted mark from edit 2.
    • This would be nice, but it would also require a much more complicated algorithm. Possibly additional fields or even a DB table to keep track of revert relations between edits. I'm not sure it's worth it.
    • This would also add a ton of weird corner cases we would have to consider and I'm not sure it's a good idea to go down this particular rabbit hole.

I think a much saner approach would be to go with #1, to make the algorithm a little simpler to implement and understand by editors. I'm curious whether analytics people have any insights about that. Would something like this be sufficient? I think it should be relatively simple to reconstruct a revert chain later off-site and figure out which edits were good but reverted and then brought back.

  • The only downside is that when we use some kind of condensed view that shows only non-reverted edits we will be left with edit 1 and edit 6 with a rather meaningless edit description (something like reverting Alice instead of added a description of twin-nosed elephants, for example). This also means that Bob's good edit (version 2) will be permanently marked as bad.

My first reaction to this is that we shouldn't say that a "revert edit" is a judgment about good or bad, just a reversion of text to a previous version. So I don't think this is something to worry about, personally.

My first reaction to this is that we shouldn't say that a "revert edit" is a judgment about good or bad, just a reversion of text to a previous version. So I don't think this is something to worry about, personally.

You are right, that makes sense :)

I think I'll wrap it up here, I got some very useful feedback and had plenty of time to think it over. :) I collected the conclusions in T254074: Implement the reverted edit tag and will start implementing this soon, probably.

If you have any other ideas on the topic, they are still welcome :)