Page MenuHomePhabricator

AbuseFilter should not cast arrays into strings
Open, NormalPublic

Description

Using the debugging tools:

  • 1 in [0, 11, 20] => true, yet 1 is not in the array
  • contains_any([0, 11, 20], 1) => true

etc.

Currently when looking for multiple namespaces (for example), it seems the only way to do it with a single condition is with regex, like article_namespace in "^(0|11|20)$". This is not very pretty, and probably (very very slightly) less performant because it uses the regex engine.

Arrays always get joined into a single string, which is not what you would expect from the syntax.

It should be noted fixing this will introduce regressions. There are many, many filters that do things like !("confirmed" in user_groups). Right now this covers "confirmed" and "autoconfirmed". Assuming user_groups is an array (which it logically it should be), you'd need to change this clause to something like !(contains_any(user_groups, "confirmed", "autoconfirmed")). We could run a query to find all the filters that use in, contains_any and contains in order to find out what needs updating. The rollout process would be painful, but the point is newcomers to the extension would expect arrays to act like arrays, and not a giant string.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptNov 21 2017, 4:48 AM
RP88 added a subscriber: RP88.Nov 21 2017, 9:13 AM

As a partial solution, there is https://gerrit.wikimedia.org/r/#/c/409312/, to be used in cases like the one with namespaces. It doesn't solve the problem, but it may be something to start with. Changing array behaviour to be sane would indeed be a pain in the neck for transition involved.

Dalba added a subscriber: Dalba.

@MusikAnimal the fix itself should be pretty easy: probably, we only need to change how arrays are handled in some functions (keywordIn, keywordContains, contains ...). However, as you pointed out, there would be a huge work of re-adapting each filter with the new standards (and properly warn the user of this change). What would you propose to face this? I mean, we'd need to precisely schedule what to do and how to make the transition less painful.

@MusikAnimal the fix itself should be pretty easy: probably, we only need to change how arrays are handled in some functions (keywordIn, keywordContains, contains ...). However, as you pointed out, there would be a huge work of re-adapting each filter with the new standards (and properly warn the user of this change). What would you propose to face this? I mean, we'd need to precisely schedule what to do and how to make the transition less painful.

A while back Community Tech made a big update to ccnorm and norm, and we queried production to find all the filters that use these functions and updated them ourselves (leaving a note in the filter description, of course). For a change as big as this task, we'd need to give lots of forewarning.

Many of the affected filters are critical, and tripped very often. We don't want any chance of false positives happening after deployment, so perhaps we could first update all the affected filters so that arrays are type casted into strings. So !("confirmed" in user_groups) could be changed to !("confirmed" in (string)user_groups). Then afterwards, filter authors are free to change their syntax to the new system, or leave the filters as-is.

Or, we could create new functions entirely, maybe in_array, array_contains, etc.? This seems subpar as the point is to remove the wonky handly of arrays for good, but honestly introducing something backwards-compatible might be better. in, contains and contains_any are used extensively (probably most filters on enwiki), and it will take filter managers time to remember to use the new syntax. That's a lot to ask of people.

You are totally right, on it.wiki we also have many filters relying on such flaky behaviour of lists. Talking about the correction itself, it would be painful but feasible. Although I don't have the rights, I'd be happy to help in converting every filter to the sane form. The real trouble would be to make users comfortable with the new syntax. I also thought about the new function, but sincerely I don't know if it would be somehow better. The only sure thing is, if we don't want new functions we should first correct every filter. And, finally, I also don't know whether it would be better to give a one-shot change (i.e. changing the behaviour of every function all at once) or to proceed progressively, e.g. first add new functions, then sanitize the old ones, or correcting them one by one.

Daimona triaged this task as Normal priority.EditedMar 25 2018, 2:55 PM

This is a point of the situation, please feel free to add anything I may have forgotten
Things we need to do

  • gather an exact list of array variables (and possibly update the list here)
  • fix every filter using variables from above and funcs/keywords from below, by changing variable to string(variable)
  • fix in, contains, like and rlike keyword
  • fix contains function (which is used for every flavour of contains/any-all, also the ccnormed ones)
  • add unit tests for changes above
  • inform users of the change. MassMessage to filter editors? Sitenotice on Special:AbuseFilter and subpages (like the note on Special:LintErrors)?
Huji added a subscriber: Huji.Mar 25 2018, 3:33 PM

Can you explain the part about converting variable to string(variable) with some examples?

Sure! Right now, if we do "confirmed" in user_groups, what the parser does is casting user_groups to string and checking for containment. To preserve the same behaviour, we should explicitly do what the parser already does, i.e. transform it to "confirmed" in string(user_groups). This way, the behaviour will be identical and it will resist the main fix this task aims to do.
Anyway, in the meanwhile I gathered a list for the point 1 above of array variables:

user_groups, user_rights, article_restrictions_edit, article_restrictions_move, article_restrictions_upload, article_restrictions_create, article_recent_contributors, added_lines, added_lines_pst, removed_lines, added_links, removed_links, old_links, all_links
MusikAnimal added a comment.EditedMar 25 2018, 4:12 PM

Yeah first changing everything to string(var)just ensures there are no broken filters between the time of deployment and changing everything to be array-like syntax. There are way too many filters that'd be affected by this change, we wouldn't be able to quickly update them after deployment.

In addition to T181024#4079406, our migration plan should include an extensive testing period prior to deployment, and thorough attempts to contact active filter managers (I can get a list of those usernames via a query, too).

Obviously the first step is creating the patch. I personally would like to see some unit tests go out with it, which I'm happy to help write :)

Many thanks for your help.

@MusikAnimal, yeah, but I'd avoid changing every variable to string(variable), but instead change only the will-be-affected ones. Unless, of course, it would require way too much work to do.
Right, we need lots of testings AND also we should make users aware of the change. What about a kind of editnotice only on Special:AbuseFilter and subpages? I should be able to both create the patch and add lots of unit tests, which are really needed in this case.
As for the query, since you have the access, could you also please take a look at this and this related commits and tell me if it would be possible for you to run the requested queries (explained by me together with -1s)? Otherwise we can't merge them.

Thinking about it, maybe we should also review and merge https://gerrit.wikimedia.org/r/#/c/411593/ to allow a search on the fly. Tomorrow I'll make a list of the queries we need for this and related tasks, then I'll start working on the patch.

Huji added a comment.Mar 25 2018, 7:58 PM

Thinking about it, maybe we should also review and merge https://gerrit.wikimedia.org/r/#/c/411593/ to allow a search on the fly. Tomorrow I'll make a list of the queries we need for this and related tasks, then I'll start working on the patch.

Certainly. I am rebuilding my test environment to have global filters in it so I can test the search in that regard.

@Huji many thanks. I think it shouldn't have problems for global filters, while it probably needs a special review for DB compatibility. And, of course, we first need to merge its dependency :-)

Well, I actually realised that a change like this wouldn't always be good. For instance, if we change "contains" (and it's similar for other keywords/functions), we'd need to write more convoluted filters. Let's say we want to check if the user added a badword: "added_lines contains 'badword'" won't work anymore, unless the word is on a separate line. And it would be uncomfortable to add "string()" to every filter (which would also increase the conditions used). So, before doing anything, we should probably understand what needs to be left as it is. We may either keep the cast for some variables (added_lines?) by making it when they're generated, or add totally new functions like "in_array", "array_contains" and so on. Also note that in any case the transition process will be way less complicated.

Well, I actually realised that a change like this wouldn't always be good. For instance, if we change "contains" (and it's similar for other keywords/functions), we'd need to write more convoluted filters. Let's say we want to check if the user added a badword: "added_lines contains 'badword'" won't work anymore, unless the word is on a separate line. And it would be uncomfortable to add "string()" to every filter (which would also increase the conditions used). So, before doing anything, we should probably understand what needs to be left as it is. We may either keep the cast for some variables (added_lines?) by making it when they're generated, or add totally new functions like "in_array", "array_contains" and so on. Also note that in any case the transition process will be way less complicated.

In my opinion, if the variable is an array, it should act like an array. Is there any benefit in added_lines and removed_lines being an array in the first place? I know the array type would match the diff view, but for the purposes of edit filters, it would seem this would most often be adverse. Or perhaps we could introduce new variables, added_text and removed_text?

No matter route we go, I feel like this is growing into a dramatic set of changes that will be difficult to adapt to. We should avoid a rabbit whole and figure out a long-term strategy, and introduce changes piecemeal, carefully notifying filter managers along the way of upcoming changes. added_text, removed_text, etc. (if we like the idea) seems like a good start. We could go ahead and update filters to use these instead of the _lines variables, and we'd be set up for a more smooth transition to proper array handling.

Perhaps we should invite broader input before making these decisions.

Well, added_and removed_lines make sense both as arrays and strings, for the reasons you explained. My proposal for new functions was in fact to greatly reduce transition pain. In any case, I definitely agree about waiting for broader input.

@MusikAnimal Nice, I'll translate your message on it.wiki as well.

How about a legacy directive that could be added to or warpped around the existing models (basically recasting in to arrays), would still require touching the filters but not so much "rewriting" them.

MusikAnimal added a comment.EditedApr 4 2018, 8:38 PM

We got a little feedback on enwiki (thank you also Xaosflux, good idea!). All things considered I'm going to !vote we not change the behaviour of in, contains, etc. The main reasons are:

  • We don't want to cause massive regressions
  • Filter managers would take a long time to adjust to the change
  • It is desirable to cast variables like added_lines and user_groups into strings

As I briefly mentioned above, I propose we go the route of creating array-casting functions like in_array, array_contains, array_contains_any. This will make everything backwards-compatible, and allow you to do things like array_contains(article_namespace, [0, 11, 20]) instead of the ugly regex.

Newcomers to AbuseFilter will just have to learn that in etc. will cast arrays into strings. Perhaps with the advent of in_array etc., it will be more clear which function they should use for their use case.

@MusikAnimal Nice solutions, I totally agree. The only point of avoiding regressions is probably enough to make this proposal the best one. So, let me recap:

  • Leave array to string conversion as it is.
  • Add specific keywords and functions to check for array containment and document them (in_array, array_contains and array_contains_any is our final list?)
  • Clearly document that "normal" functions cast array to strings and new ones don't.

Correct? If so, I'll start working on a patch.

I guess also array_contains_all, but otherwise what you've said sounds right to me. We've received only limited feedback about this task, and creating these new functions/keywords won't affect any existing functionality, so I think we're okay to proceed without further discussion. I'm sure array_contains_any in particular will become popular, since many filters check multiple namespaces and the like.

I'd like to take a moment to thank you for your contributions to AbuseFilter. This extension hasn't received much attention in a long time. So cheers to you!! 🍻

However, I worry things are changing too fast... We don't want too many patches to go out on the same deployment train, so that in case something goes wrong we'll have a better idea of what it is. I suppose there's nothing wrong with creating the patches, but we should be diligent about how many things get merged within a week's time. That being said I don't think this task in particular would cause any regressions, so probably okay to merge it whenever we want :) Thanks again!

@MusikAnimal Yeah, I think that's it. Also, is https://gerrit.wikimedia.org/r/#/c/409312/ still useful someway?

And yeah, I know AF has been neglected for a long time, but since it's actually one of the really, really useful extensions I felt like I had to do something to help it and everyone who uses it. I'll start working on a patch, since in fact I think we don't risk any regression with this.

Daimona claimed this task.Apr 5 2018, 11:30 AM

I'm writing the patch, however there's a problem we must solve first: T179238. Otherwise some cases return weird result because lists comparing is always false. Could you please review and eventually merge https://gerrit.wikimedia.org/r/421786?

Change 424298 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] [WIP] Add array-specific functions

https://gerrit.wikimedia.org/r/424298

Could you please review and eventually merge https://gerrit.wikimedia.org/r/421786?

I most certainly can, if not today then Friday, for sure. I'm a little overwhelmed by all the AbuseFilter patches up for review! Trying to catch up... currently there are a total of 16 in my inbox, all created within the past few days.

@MusikAnimal Of course, there's no hurry :-) Many thanks.

seth added a subscriber: seth.Feb 17 2019, 5:40 PM

Right now I had a problem with
added_links > 10
Because of the implicit type casting, the syntax is correct, but this code does not do what one could expect, i.e.,
length(added_links) > 10.

The code should be evaluated as incorrect.

@seth IMHO it shouldn't. What you're doing is comparing an array with an integer. Right, it doesn't really make sense, but I don't think the parser should throw for this. Such comparison is also legal in several programming languages. What we can do is write in the docs that added_links is an array of strings, so that people will know that added_links > 10 doesn't make sense.

seth added a comment.Feb 17 2019, 8:52 PM

Of course, it is legal in some programming languages, but then the array would be converted into something reasonable, e.g. into the length of the array (as in perl).
Do you know any common languages, where array arr in
arr > 3
would be converted to a string by default? And if so, will then the '3' also be converted to a string? And the '>' will do a lexicographical comparison than?
Im my opinion this is far to much magic and the abuse filter language should not be contra-intuitive for programmers.

First of all, the AbuseFilter language tends to be similar to PHP, mostly because that's the language of its parser.
I don't know many programming languages, so I cannot give a fully explanatory answer.
Anyway, in PHP arrays are always greater than numbers, and no cast is performed.
And yes, < and > can be used for lexicographical comparison in AF, too. IMHO this isn't counterintuitive, or at least not that much.

Huji added a comment.Feb 17 2019, 9:43 PM

Anyway, in PHP arrays are always greater than numbers, and no cast is performed.

Correct. Indeed, Arrays are greater than any other object, except when the other object is also an array in which case a more complex logic (based on the size and contents of the two arrays) is used. It is explained here: http://php.net/manual/en/language.operators.comparison.php

One could argue that the choices made by the creators of PHP are or are not the best choices; but irrespective of that, so long as abusefilter is mimicking those choices, I think it is fine.

seth added a comment.Feb 17 2019, 10:25 PM

Oops, ok, I understand. Then I totally agree with both of you. Sorry for the interruption and thanks for the explanation. I should have read the php manual.

Change 424298 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] [WIP] Add array-specific functions

https://gerrit.wikimedia.org/r/424298