
Add function to check if added content matches any regex in a list
Open, Medium, Public

Description

As many inappropriate edits add bad words to articles, many filters¹ are created to check if a user added an expression which was not previously in the page, like this:

bad_words := 'ba+d|real+y\s*bad|not?\s*goo+d|...';
something & added_lines irlike bad_words
          & !( removed_lines irlike bad_words )

However, this approach produces false negatives: a user can remove "baaad" while adding "really bad", and the edit will not be matched, because the combined regex matches both the added and the removed text. This becomes more likely as the regex for bad_words gains alternatives, since the user can then remove one of them and add any or all of the others undetected.
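
The false negative can be reproduced outside AbuseFilter. The following is a minimal Python sketch, assuming irlike behaves like a case-insensitive regex search; the helper name combined_filter_hits and the sample strings are illustrative only, and the elided '...' alternative of the pattern is left out.

```python
import re

# Combined pattern from the filter above (without the elided "..." alternative).
BAD_WORDS = r"ba+d|real+y\s*bad|not?\s*goo+d"


def combined_filter_hits(added_lines: str, removed_lines: str) -> bool:
    """Model: added_lines irlike bad_words & !(removed_lines irlike bad_words)."""
    added = re.search(BAD_WORDS, added_lines, re.IGNORECASE) is not None
    removed = re.search(BAD_WORDS, removed_lines, re.IGNORECASE) is not None
    return added and not removed


# A plain addition is caught.
print(combined_filter_hits("this is really bad", ""))
# Removing "baaad" while adding "really bad" slips through.
print(combined_filter_hits("this is really bad", "this was baaad"))
```

The first call reports a hit and the second does not, even though a new bad expression was added in both cases.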

An approach to fix that would be to write the filter as

bad_word1 := 'ba+d';
bad_word2 := 'real+y\s*bad';
bad_word3 := 'not?\s*goo+d';
bad_word... := '...';
something & (
    (    added_lines   irlike bad_word1
    & !( removed_lines irlike bad_word1 ) )
  | (    added_lines   irlike bad_word2
    & !( removed_lines irlike bad_word2 ) )
  | (    added_lines   irlike bad_word3
    & !( removed_lines irlike bad_word3 ) )
  | ...
)

but this is unnecessarily repetitive and makes the filter very long (and may increase the condition count more than it should?). It should be possible to check whether a user is adding some bad thing without all this trouble. Maybe a new function could be added, which would allow something like this:

bad_word_regexes := [ 'ba+d', 'real+y\s*bad', 'not?\s*goo+d', '...' ];
something & irlike_added_any( added_lines, removed_lines, bad_word_regexes )

(the name and syntax are just examples; feel free to suggest something better)
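
To make the intended semantics concrete, here is a hedged Python model of the proposed function. It is only a sketch of the behaviour described above, not an AbuseFilter implementation: the function does not exist yet, the argument order follows the example call, and case-insensitive matching is assumed by analogy with irlike.

```python
import re
from typing import Iterable


def irlike_added_any(added_lines: str, removed_lines: str,
                     regexes: Iterable[str]) -> bool:
    """Return True if any single regex matches added_lines but not removed_lines.

    Hypothetical model of the proposed function; matching is assumed to be
    case-insensitive, like irlike.
    """
    for pattern in regexes:
        in_added = re.search(pattern, added_lines, re.IGNORECASE) is not None
        in_removed = re.search(pattern, removed_lines, re.IGNORECASE) is not None
        if in_added and not in_removed:
            return True
    return False


# The elided '...' entry of the example list is left out.
bad_word_regexes = [r"ba+d", r"real+y\s*bad", r"not?\s*goo+d"]

# "baaad" is removed and "really bad" is added: the second regex matches only
# the added text, so the edit is flagged.
print(irlike_added_any("this is really bad", "this was baaad", bad_word_regexes))
```

Because each regex is checked against the added and removed text separately, removing one bad expression no longer hides the addition of a different one.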

¹Examples

Event Timeline

Daimona triaged this task as Medium priority. Nov 19 2017, 5:40 PM
Daimona moved this task from Backlog to Filtering features on the AbuseFilter board.
Daimona subscribed.

I know the problem. A possible solution, available within days, would be to use the newly implemented get_matches function (T179957), like this:

matched:=get_matches("any|bad|word|you want", added_lines)[0];
matched != false &
!(removed_lines irlike matched)

This way you know that the check will be performed on the same word, e.g. if the user adds "really bad" it will only check for that in removed_lines. I don't know if this fully solves the problem (that would otherwise need a brand new function), but it can surely fix some of those situations.
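
A rough Python model of this workaround may help show what it covers. It assumes that get_matches returns the first whole match at index 0 (or false when nothing matches) and that it matches case-sensitively; the helper name first_match_not_removed is hypothetical. One deliberate difference from the snippet above: the sketch escapes the matched text before reusing it as a pattern, so that matched text containing regex metacharacters is compared literally.

```python
import re


def first_match_not_removed(pattern: str, added_lines: str,
                            removed_lines: str) -> bool:
    """Model of: matched := get_matches(pattern, added_lines)[0];
    matched != false & !(removed_lines irlike matched)."""
    match = re.search(pattern, added_lines)
    if match is None:
        return False  # corresponds to matched == false
    matched = match.group(0)
    # Assumption: escape the matched text so it is compared literally.
    literal = re.escape(matched)
    return re.search(literal, removed_lines, re.IGNORECASE) is None


pattern = r"ba+d|real+y\s*bad|not?\s*goo+d"

# "really bad" is the first match in the added text and does not occur in the
# removed text, so the edit is flagged.
print(first_match_not_removed(pattern, "this is really bad", "this was baaad"))
# Only the first match is checked: here it is "bad", which was also removed,
# so the later "not good" goes unnoticed.
print(first_match_not_removed(pattern, "bad and not good", "bad"))
```

Under these assumptions the first edit is flagged and the second is not, which suggests the workaround handles the original example but still misses edits where the first matched expression happens to be among the removed text.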