Expose more detailed diff information to the AbuseFilter
Open, Medium, Public, 2 Estimated Story Points

Description

If you go to https://test.wikipedia.org/w/index.php?title=User:Huji/diff&diff=387036&oldid=387035 you not only see which line changed, but you also see specifically which part of the lines was modified (in the HTML, it is highlighted using <del> and <ins> tags).

However, the information AbuseFilter shows for that diff (accessible at https://test.wikipedia.org/wiki/Special:AbuseFilter/examine/426401) is only at the line level. I can think of many cases when knowing the exact portions that were changed is useful for a filter.

I think a good way to do that is to have two new variables: edit_diff_added and edit_diff_removed. Each should be an array containing only the added or removed fragments (so for the example above, we would have edit_diff_added = ['some'] and edit_diff_removed = ['a']).
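To make the proposal concrete, here is a minimal sketch in Python of how word-level added/removed arrays could be derived from an old and a new line. It is not how AbuseFilter or MediaWiki's diff engine works; the function name `word_diff` and the sample sentences are hypothetical, and splitting on whitespace is an assumption about what counts as a "word".

```python
import difflib


def word_diff(old_line, new_line):
    """Return (added, removed) word lists for a single changed line.

    Hypothetical helper: words are whitespace-separated tokens.
    """
    old_words = old_line.split()
    new_words = new_line.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)

    added, removed = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" contributes to both lists; "delete" and "insert" to one each.
        if tag in ("replace", "delete"):
            removed.extend(old_words[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new_words[j1:j2])
    return added, removed


# Hypothetical stand-in for the linked diff.
added, removed = word_diff("This is a test", "This is some test")
print(added, removed)
```

With these sample inputs the two lists correspond to the edit_diff_added and edit_diff_removed values described above.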

A less optimal, but still workable solution is to literally expose the HTML of the diff as a variable (so one could look for a pattern like <ins>some</ins> in it).
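As a rough illustration of this fallback, the sketch below pulls the inline fragments out of a diff's HTML with regular expressions. The sample markup and the helper name `inline_changes` are assumptions for illustration only; the real diff HTML may carry different attributes or nesting, which is part of why this approach is more fragile than dedicated variables.

```python
import re

# Tolerate attributes on the tags, e.g. <ins class="...">.
INS_RE = re.compile(r"<ins\b[^>]*>(.*?)</ins>", re.DOTALL)
DEL_RE = re.compile(r"<del\b[^>]*>(.*?)</del>", re.DOTALL)


def inline_changes(diff_html):
    """Return (added, removed) fragments found in <ins>/<del> tags."""
    return INS_RE.findall(diff_html), DEL_RE.findall(diff_html)


# Hypothetical markup, not copied from the linked diff.
sample = 'This is <del class="x">a</del><ins class="x">some</ins> test'
print(inline_changes(sample))
```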

Event Timeline

Daimona subscribed.

This would indeed be useful, and I sent https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/AbuseFilter/+/431731/ for that a few months ago.
However, IIRC those variables would be pretty tricky: I think each added character would end up as its own element in the resulting array. I cannot check right now whether that's true, though (and I cannot add the task number on Gerrit), because I'm on mobile. And that patch needs tests anyway.
Edit: Nah, according to the commit message I wrote, each word would be a separate element in the array. Just as you said in the example!

Change 431731 had a related patch set uploaded (by Huji; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] Add word-level diff variables

https://gerrit.wikimedia.org/r/431731

Huji triaged this task as Medium priority. Apr 12 2019, 1:27 PM
Huji set the point value for this task to 2.
Huji moved this task from Backlog to Filtering features on the AbuseFilter board.

The main problem with this addition is that these variables would be very large whenever the diff itself is large, and they could flood the details view.

I think this may solve one of the problems that some wikis are having: subtle number vandalism (e.g. changing "There are 5 cats and dogs here" to "There are 4 cats and dogs here"). This is trivial to see when looking at a highlighted diff, but seemingly difficult to write a filter for. (Bright ideas welcome though!)
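One possible idea, sketched in Python rather than in AbuseFilter's own rule language: if word-level arrays like the proposed edit_diff_added and edit_diff_removed were available, a check could flag edits where every changed word is a number. The function name `looks_like_number_swap` and the number pattern are assumptions for illustration; a real filter would need to handle dates, units, and legitimate corrections.

```python
import re

# Digits, optionally with "." or "," separators (e.g. 1,000 or 3.5).
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def looks_like_number_swap(added, removed):
    """True if the edit only replaced numbers with other numbers."""
    if not added or len(added) != len(removed):
        return False
    return all(NUMBER_RE.fullmatch(word) for word in added + removed)


# Word-level arrays for the "5 cats" -> "4 cats" example.
print(looks_like_number_swap(["4"], ["5"]))
print(looks_like_number_swap(["some"], ["a"]))
```

The first call reports a match and the second does not, which is the distinction a line-level diff makes hard to express.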