
Investigate using rev_sha1 to detect reverted edits
Closed, ResolvedPublic

Assigned To
Authored By
MusikAnimal
Nov 7 2017, 11:55 PM

Description

Using the rev_sha1 column in the revision table should be more reliable in detecting reverted edits. Currently we determine this by the edit summary and size of the following edit.

Using the SHA might also allow us to detect reverts that happened later on in the revision history.
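One way to picture the idea: two revisions with the same rev_sha1 have identical content, so when a revision's hash matches an earlier one, everything in between was effectively undone. The following is a minimal Python sketch under that assumption, not XTools code. The `Revision` tuple, the `find_reverts` function, and the sample hashes are hypothetical names made up for illustration.

```python
from typing import Dict, List, NamedTuple


class Revision(NamedTuple):
    rev_id: int
    user: str
    sha1: str  # stand-in for the rev_sha1 column (hypothetical sample values below)


def find_reverts(history: List[Revision]) -> Dict[int, List[int]]:
    """Map each reverting rev_id to the rev_ids whose changes it undid."""
    last_seen: Dict[str, int] = {}  # content hash -> index of latest revision with it
    reverts: Dict[int, List[int]] = {}
    for i, rev in enumerate(history):
        j = last_seen.get(rev.sha1)
        # A match further back than the previous revision means everything
        # in between was undone. A match on the previous revision is a null
        # edit, not a revert, so it is ignored.
        if j is not None and j < i - 1:
            reverts[rev.rev_id] = [r.rev_id for r in history[j + 1:i]]
        last_seen[rev.sha1] = i
    return reverts


history = [
    Revision(1, "Alice", "aaa"),
    Revision(2, "Vandal", "bbb"),
    Revision(3, "Vandal", "ccc"),
    Revision(4, "Patroller", "aaa"),  # restores the content of revision 1
]
print(find_reverts(history))  # {4: [2, 3]}
```

Because the lookup goes through the whole history, this also catches reverts that happen several edits later, at the cost of keeping one hash per distinct page state in memory.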

Event Timeline

Hello and happy new year!

From what I understand the problem is two-fold:

  1. I vandalize an article by adding some text. If my edit is not reverted immediately, it is counted as my contribution. (This is what you're describing.)
  2. I am a "vandalism hunter" and I see that person x deleted 200 KB of text. I revert person x's edit and the 200 KB of text are now counted as my contribution.

Is my observation correct?

Best, Johanna

> Hello and happy new year!

Happy new year to you! :)

> From what I understand the problem is two-fold:
>
>   1. I vandalize an article by adding some text. If my edit is not reverted immediately, it is counted as my contribution. (This is what you're describing.)

Correct. The logic only looks to see if content was reverted with the following edit, not later on.

>   2. I am a "vandalism hunter" and I see that person x deleted 200 KB of text. I revert person x's edit and the 200 KB of text are now counted as my contribution.

Yes, unless the 200 KB of text was a single edit. So say person X adds 150 KB of text, then with the next edit adds 50 KB, when I revert it ArticleInfo only knows to discount the 50 KB part, and not previous edits.

The science of reversion detection is proving to be very challenging. I have a solution mapped out in my head, but I think it would require putting all the edits to the page in memory, which will be a problem for pages that have a lot of revisions. As a compromise, I could increase the number of surrounding edits that it looks at. So instead of looking at a single edit on either side of a given edit, it could look at, say, 10 edits. That would help in a lot of cases, but there's still the possibility that someone vandalizes with 11 consecutive edits and they all get reverted. In that case ArticleInfo would again incorrectly discard only the first 10 edits as reverted content, and not the 11th.
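The bounded-lookback compromise could be sketched roughly as follows. This is a hedged illustration that assumes content hashes (such as rev_sha1) are compared instead of edit sizes; `reverted_in_window` and its `window` parameter are hypothetical names, not part of XTools. Only the most recent `window + 1` hashes are kept, so memory stays constant no matter how long the page history is.

```python
from collections import deque
from typing import Deque, List, Set, Tuple


def reverted_in_window(sha1s: List[str], window: int = 10) -> List[int]:
    """Return indexes of revisions undone by a later revision that restored
    earlier content, where one revert may undo at most `window` edits."""
    # Undoing `window` edits means matching the revision window + 1 steps back.
    recent: Deque[Tuple[int, str]] = deque(maxlen=window + 1)
    reverted: Set[int] = set()
    for i, sha in enumerate(sha1s):
        # Scan newest to oldest and stop at the nearest identical content.
        for j, old_sha in reversed(recent):
            if old_sha == sha:
                if j < i - 1:  # j == i - 1 would be a null edit
                    reverted.update(range(j + 1, i))
                break
        recent.append((i, sha))
    return sorted(reverted)


base = ["base"]
ten_bad_edits = base + [f"v{n}" for n in range(10)] + base
eleven_bad_edits = base + [f"v{n}" for n in range(11)] + base

print(reverted_in_window(ten_bad_edits, window=10))     # [1, 2, ..., 10]
print(reverted_in_window(eleven_bad_edits, window=10))  # []
```

Note that in this hash-based sketch a run of vandalism longer than the window is missed entirely, because the matching revision has already dropped out of the lookback buffer. That is a different failure mode from the partial discount described above, but it shows the same trade-off between memory use and accuracy.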

In related good news, I have found a suitable solution to show textshares. Expect that to be released soon! :)

Heyho,

> Yes, unless the 200 KB of text was a single edit. So say person X adds 150 KB of text, then with the next edit adds 50 KB, when I revert it ArticleInfo only knows to discount the 50 KB part, and not previous edits.

Thanks for your quick reply, but isn't this what was described in scenario 1? What I meant here in scenario 2 is that a vandal is deleting text and that the vandalism hunter is restoring the text.
E.g.: In this case 73,090 bytes were removed in a single edit and then restored in the next revision. Apparently, the person who restored the deleted text gets credited with the 73,090 bytes.

> The science of reversion detection is proving to be very challenging.

Oh yes, I believe that. Were you able to look into how WikiHistory deals with this challenge?

> In related good news, I have found a suitable solution to show textshares. Expect that to be released soon! :)

Whooo, awesome! I'm really looking forward to it.

Best,
Johanna

> Heyho,
>
>> Yes, unless the 200 KB of text was a single edit. So say person X adds 150 KB of text, then with the next edit adds 50 KB, when I revert it ArticleInfo only knows to discount the 50 KB part, and not previous edits.
>
> Thanks for your quick reply, but isn't this what was described in scenario 1? What I meant here in scenario 2 is that a vandal is deleting text and that the vandalism hunter is restoring the text.
> E.g.: In this case 73,090 bytes were removed in a single edit and then restored in the next revision. Apparently, the person who restored the deleted text gets credited with the 73,090 bytes.

Yes, XTools does do revert detection for surrounding edits, including content that was removed. However, this is done by looking at the edit summary to see if it says "revert", etc., and here we don't have the German translations :/ This is not really a problem for new edits (after December 2017), because edits are now automatically tagged as Rollback, Undo, etc. Still, XTools needs to use rev_sha1 to be more accurate. If we use rev_sha1, your example will correctly be counted as a revert.
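To illustrate why matching on the edit summary breaks down across languages, here is a small Python sketch. It is an assumption-laden illustration rather than the XTools logic: the keyword list and the `looks_like_revert` helper are hypothetical, the German summary is only a sample string, and `mw-rollback` / `mw-undo` are assumed to be the names of the Rollback and Undo tags mentioned above.

```python
import re
from typing import Iterable

# Hypothetical English-only keyword list, for illustration.
SUMMARY_PATTERN = re.compile(
    r"\b(revert(ed)?|undid|undo|rv|rollback)\b", re.IGNORECASE
)
# Assumed names of the automatically applied change tags.
REVERT_TAGS = {"mw-rollback", "mw-undo"}


def looks_like_revert(summary: str, tags: Iterable[str] = ()) -> bool:
    """Heuristic: trust a revert tag if present, else fall back to keywords."""
    if REVERT_TAGS.intersection(tags):
        return True
    return bool(SUMMARY_PATTERN.search(summary))


english = "Undid revision 123 by Example"
german = "Änderung 123 von Beispiel rückgängig gemacht"

print(looks_like_revert(english))                  # True
print(looks_like_revert(german))                   # False: no English keyword
print(looks_like_revert(german, tags=["mw-undo"]))  # True: the tag is language-independent
```

Neither signal inspects the page content itself, which is why comparing content hashes is the more reliable check for older, untagged edits.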

>> The science of reversion detection is proving to be very challenging.
>
> Oh yes, I believe that. Were you able to look into how WikiHistory deals with this challenge?

As far as I can tell, WikiHistory is not doing revert detection as we are, but rather computing attribution (aka textshares), correct? For example, we have the "text added" column, whereas WikiHistory appears to only have a "textshares" column. These are two different concepts. Anyway, we're now using WikiWho to get authorship attribution (textshares), e.g. https://xtools.wmflabs.org/articleinfo/de.wikipedia.org/Bottmingen#authorship. On that note, it seems WikiHistory and WikiWho are using two very different algorithms. WikiWho underwent extensive testing, showing it is 95% accurate, but I can't say for sure whether WikiHistory exceeds that.

As of fb7a7b9 we are using rev_sha1 to detect reverts. However, we're still only checking surrounding revisions. In other words, if a revert undid multiple edits, only one of those edits is counted as having been reverted. This is not perfect, but it's a big improvement over relying entirely on the edit summary, which is what we were doing before.

I'm working on T179995 and once that's done I'll get everything deployed.

MusikAnimal moved this task from Pending deployment to Complete on the XTools board.

As noted above, revert detection still isn't perfect, but it has improved significantly. For example:

> In this case 73,090 bytes were removed in a single edit and then restored in the next revision. Apparently, the person who restored the deleted text gets credited with the 73,090 bytes.

Fixed :) https://xtools.wmflabs.org/articleinfo/de.wikipedia.org/Requiem_%28Mozart%29

thank you MusikAnimal..........Ozzie A.