Page MenuHomePhabricator

AbuseFilter should not cast arrays into strings
Open, LowPublic

Description

Using the debugging tools:

  • 1 in [0, 11, 20] => true, yet 1 is not in the array
  • contains_any([0, 11, 20], 1) => true

etc.

Currently when looking for multiple namespaces (for example), it seems the only way to do it with a single condition is with regex, like article_namespace in "^(0|11|20)$". This is not very pretty, and probably (very very slightly) less performant because it uses the regex engine.

Arrays always get joined into a single string, which is not what you would expect from the syntax.

It should be noted fixing this will introduce regressions. There are many, many filters that do things like !("confirmed" in user_groups). Right now this covers "confirmed" and "autoconfirmed". Assuming user_groups is an array (which it logically it should be), you'd need to change this clause to something like !(contains_any(user_groups, "confirmed", "autoconfirmed")). We could run a query to find all the filters that use in, contains_any and contains in order to find out what needs updating. The rollout process would be painful, but the point is newcomers to the extension would expect arrays to act like arrays, and not a giant string.

Event Timeline

As a partial solution, there is https://gerrit.wikimedia.org/r/#/c/409312/, to be used in cases like the one with namespaces. It doesn't solve the problem, but it may be something to start with. Changing array behaviour to be sane would indeed be a pain in the neck for transition involved.

@MusikAnimal the fix itself should be pretty easy: probably, we only need to change how arrays are handled in some functions (keywordIn, keywordContains, contains ...). However, as you pointed out, there would be a huge work of re-adapting each filter with the new standards (and properly warn the user of this change). What would you propose to face this? I mean, we'd need to precisely schedule what to do and how to make the transition less painful.

@MusikAnimal the fix itself should be pretty easy: probably, we only need to change how arrays are handled in some functions (keywordIn, keywordContains, contains ...). However, as you pointed out, there would be a huge work of re-adapting each filter with the new standards (and properly warn the user of this change). What would you propose to face this? I mean, we'd need to precisely schedule what to do and how to make the transition less painful.

A while back Community Tech made a big update to ccnorm and norm, and we queried production to find all the filters that use these functions and updated them ourselves (leaving a note in the filter description, of course). For a change as big as this task, we'd need to give lots of forewarning.

Many of the affected filters are critical, and tripped very often. We don't want any chance of false positives happening after deployment, so perhaps we could first update all the affected filters so that arrays are type casted into strings. So !("confirmed" in user_groups) could be changed to !("confirmed" in (string)user_groups). Then afterwards, filter authors are free to change their syntax to the new system, or leave the filters as-is.

Or, we could create new functions entirely, maybe in_array, array_contains, etc.? This seems subpar as the point is to remove the wonky handly of arrays for good, but honestly introducing something backwards-compatible might be better. in, contains and contains_any are used extensively (probably most filters on enwiki), and it will take filter managers time to remember to use the new syntax. That's a lot to ask of people.

You are totally right, on it.wiki we also have many filters relying on such flaky behaviour of lists. Talking about the correction itself, it would be painful but feasible. Although I don't have the rights, I'd be happy to help in converting every filter to the sane form. The real trouble would be to make users comfortable with the new syntax. I also thought about the new function, but sincerely I don't know if it would be somehow better. The only sure thing is, if we don't want new functions we should first correct every filter. And, finally, I also don't know whether it would be better to give a one-shot change (i.e. changing the behaviour of every function all at once) or to proceed progressively, e.g. first add new functions, then sanitize the old ones, or correcting them one by one.

Daimona triaged this task as Medium priority.EditedMar 25 2018, 2:55 PM

This is a point of the situation, please feel free to add anything I may have forgotten
Things we need to do

  • gather an exact list of array variables (and possibly update the list here)
  • fix every filter using variables from above and funcs/keywords from below, by changing variable to string(variable)
  • fix in, contains, like and rlike keyword
  • fix contains function (which is used for every flavour of contains/any-all, also the ccnormed ones)
  • add unit tests for changes above
  • inform users of the change. MassMessage to filter editors? Sitenotice on Special:AbuseFilter and subpages (like the note on Special:LintErrors)?

Can you explain the part about converting variable to string(variable) with some examples?

Sure! Right now, if we do "confirmed" in user_groups, what the parser does is casting user_groups to string and checking for containment. To preserve the same behaviour, we should explicitly do what the parser already does, i.e. transform it to "confirmed" in string(user_groups). This way, the behaviour will be identical and it will resist the main fix this task aims to do.
Anyway, in the meanwhile I gathered a list for the point 1 above of array variables:

user_groups, user_rights, article_restrictions_edit, article_restrictions_move, article_restrictions_upload, article_restrictions_create, article_recent_contributors, added_lines, added_lines_pst, removed_lines, added_links, removed_links, old_links, all_links

Yeah first changing everything to string(var)just ensures there are no broken filters between the time of deployment and changing everything to be array-like syntax. There are way too many filters that'd be affected by this change, we wouldn't be able to quickly update them after deployment.

In addition to T181024#4079406, our migration plan should include an extensive testing period prior to deployment, and thorough attempts to contact active filter managers (I can get a list of those usernames via a query, too).

Obviously the first step is creating the patch. I personally would like to see some unit tests go out with it, which I'm happy to help write :)

Many thanks for your help.

@MusikAnimal, yeah, but I'd avoid changing every variable to string(variable), but instead change only the will-be-affected ones. Unless, of course, it would require way too much work to do.
Right, we need lots of testings AND also we should make users aware of the change. What about a kind of editnotice only on Special:AbuseFilter and subpages? I should be able to both create the patch and add lots of unit tests, which are really needed in this case.
As for the query, since you have the access, could you also please take a look at this and this related commits and tell me if it would be possible for you to run the requested queries (explained by me together with -1s)? Otherwise we can't merge them.

Thinking about it, maybe we should also review and merge https://gerrit.wikimedia.org/r/#/c/411593/ to allow a search on the fly. Tomorrow I'll make a list of the queries we need for this and related tasks, then I'll start working on the patch.

Thinking about it, maybe we should also review and merge https://gerrit.wikimedia.org/r/#/c/411593/ to allow a search on the fly. Tomorrow I'll make a list of the queries we need for this and related tasks, then I'll start working on the patch.

Certainly. I am rebuilding my test environment to have global filters in it so I can test the search in that regard.

@Huji many thanks. I think it shouldn't have problems for global filters, while it probably needs a special review for DB compatibility. And, of course, we first need to merge its dependency :-)

Well, I actually realised that a change like this wouldn't always be good. For instance, if we change "contains" (and it's similar for other keywords/functions), we'd need to write more convoluted filters. Let's say we want to check if the user added a badword: "added_lines contains 'badword'" won't work anymore, unless the word is on a separate line. And it would be uncomfortable to add "string()" to every filter (which would also increase the conditions used). So, before doing anything, we should probably understand what needs to be left as it is. We may either keep the cast for some variables (added_lines?) by making it when they're generated, or add totally new functions like "in_array", "array_contains" and so on. Also note that in any case the transition process will be way less complicated.

Well, I actually realised that a change like this wouldn't always be good. For instance, if we change "contains" (and it's similar for other keywords/functions), we'd need to write more convoluted filters. Let's say we want to check if the user added a badword: "added_lines contains 'badword'" won't work anymore, unless the word is on a separate line. And it would be uncomfortable to add "string()" to every filter (which would also increase the conditions used). So, before doing anything, we should probably understand what needs to be left as it is. We may either keep the cast for some variables (added_lines?) by making it when they're generated, or add totally new functions like "in_array", "array_contains" and so on. Also note that in any case the transition process will be way less complicated.

In my opinion, if the variable is an array, it should act like an array. Is there any benefit in added_lines and removed_lines being an array in the first place? I know the array type would match the diff view, but for the purposes of edit filters, it would seem this would most often be adverse. Or perhaps we could introduce new variables, added_text and removed_text?

No matter route we go, I feel like this is growing into a dramatic set of changes that will be difficult to adapt to. We should avoid a rabbit whole and figure out a long-term strategy, and introduce changes piecemeal, carefully notifying filter managers along the way of upcoming changes. added_text, removed_text, etc. (if we like the idea) seems like a good start. We could go ahead and update filters to use these instead of the _lines variables, and we'd be set up for a more smooth transition to proper array handling.

Perhaps we should invite broader input before making these decisions.

Well, added_and removed_lines make sense both as arrays and strings, for the reasons you explained. My proposal for new functions was in fact to greatly reduce transition pain. In any case, I definitely agree about waiting for broader input.

@MusikAnimal Nice, I'll translate your message on it.wiki as well.

How about a legacy directive that could be added to or warpped around the existing models (basically recasting in to arrays), would still require touching the filters but not so much "rewriting" them.

We got a little feedback on enwiki (thank you also Xaosflux, good idea!). All things considered I'm going to !vote we not change the behaviour of in, contains, etc. The main reasons are:

  • We don't want to cause massive regressions
  • Filter managers would take a long time to adjust to the change
  • It is desirable to cast variables like added_lines and user_groups into strings

As I briefly mentioned above, I propose we go the route of creating array-casting functions like in_array, array_contains, array_contains_any. This will make everything backwards-compatible, and allow you to do things like array_contains(article_namespace, [0, 11, 20]) instead of the ugly regex.

Newcomers to AbuseFilter will just have to learn that in etc. will cast arrays into strings. Perhaps with the advent of in_array etc., it will be more clear which function they should use for their use case.

@MusikAnimal Nice solutions, I totally agree. The only point of avoiding regressions is probably enough to make this proposal the best one. So, let me recap:

  • Leave array to string conversion as it is.
  • Add specific keywords and functions to check for array containment and document them (in_array, array_contains and array_contains_any is our final list?)
  • Clearly document that "normal" functions cast array to strings and new ones don't.

Correct? If so, I'll start working on a patch.

I guess also array_contains_all, but otherwise what you've said sounds right to me. We've received only limited feedback about this task, and creating these new functions/keywords won't affect any existing functionality, so I think we're okay to proceed without further discussion. I'm sure array_contains_any in particular will become popular, since many filters check multiple namespaces and the like.

I'd like to take a moment to thank you for your contributions to AbuseFilter. This extension hasn't received much attention in a long time. So cheers to you!! 🍻

However, I worry things are changing too fast... We don't want too many patches to go out on the same deployment train, so that in case something goes wrong we'll have a better idea of what it is. I suppose there's nothing wrong with creating the patches, but we should be diligent about how many things get merged within a week's time. That being said I don't think this task in particular would cause any regressions, so probably okay to merge it whenever we want :) Thanks again!

@MusikAnimal Yeah, I think that's it. Also, is https://gerrit.wikimedia.org/r/#/c/409312/ still useful someway?

And yeah, I know AF has been neglected for a long time, but since it's actually one of the really, really useful extensions I felt like I had to do something to help it and everyone who uses it. I'll start working on a patch, since in fact I think we don't risk any regression with this.

I'm writing the patch, however there's a problem we must solve first: T179238. Otherwise some cases return weird result because lists comparing is always false. Could you please review and eventually merge https://gerrit.wikimedia.org/r/421786?

Change 424298 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] [WIP] Add array-specific functions

https://gerrit.wikimedia.org/r/424298

Could you please review and eventually merge https://gerrit.wikimedia.org/r/421786?

I most certainly can, if not today then Friday, for sure. I'm a little overwhelmed by all the AbuseFilter patches up for review! Trying to catch up... currently there are a total of 16 in my inbox, all created within the past few days.

@MusikAnimal Of course, there's no hurry :-) Many thanks.

Right now I had a problem with
added_links > 10
Because of the implicit type casting, the syntax is correct, but this code does not do what one could expect, i.e.,
length(added_links) > 10.

The code should be evaluated as incorrect.

@seth IMHO it shouldn't. What you're doing is comparing an array with an integer. Right, it doesn't really make sense, but I don't think the parser should throw for this. Such comparison is also legal in several programming languages. What we can do is write in the docs that added_links is an array of strings, so that people will know that added_links > 10 doesn't make sense.

Of course, it is legal in some programming languages, but then the array would be converted into something reasonable, e.g. into the length of the array (as in perl).
Do you know any common languages, where array arr in
arr > 3
would be converted to a string by default? And if so, will then the '3' also be converted to a string? And the '>' will do a lexicographical comparison than?
Im my opinion this is far to much magic and the abuse filter language should not be contra-intuitive for programmers.

First of all, the AbuseFilter language tends to be similar to PHP, mostly because that's the language of its parser.
I don't know many programming languages, so I cannot give a fully explanatory answer.
Anyway, in PHP arrays are always greater than numbers, and no cast is performed.
And yes, < and > can be used for lexicographical comparison in AF, too. IMHO this isn't counterintuitive, or at least not that much.

Anyway, in PHP arrays are always greater than numbers, and no cast is performed.

Correct. Indeed, Arrays are greater than any other object, except when the other object is also an array in which case a more complex logic (based on the size and contents of the two arrays) is used. It is explained here: http://php.net/manual/en/language.operators.comparison.php

One could argue that the choices made by the creators of PHP are or are not the best choices; but irrespective of that, so long as abusefilter is mimicking those choices, I think it is fine.

Oops, ok, I understand. Then I totally agree with both of you. Sorry for the interruption and thanks for the explanation. I should have read the php manual.

Change 424298 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] [WIP] Add array-specific functions

https://gerrit.wikimedia.org/r/424298

There are a couple of filters in various Wikimedia projects that currently have bugs due to this problem (e.g., https://cs.wikipedia.org/w/index.php?diff=17857577). Is there anything I can help to progress the above patch?

@Nullzero that example is really not suffering from the lack of a workaround, they could just use:

equals_to_any(page_namespace, 0,12)

For that filter

(edit: Just saw you already gave that advice!)

There are a couple of filters in various Wikimedia projects that currently have bugs due to this problem (e.g., https://cs.wikipedia.org/w/index.php?diff=17857577). Is there anything I can help to progress the above patch?

I'm afraid it won't be easy. Not the backend change, that is pretty straightforward. We have various problems here:

  • Arrays were introduced in a very incomplete form; some related bugs are still to be fixed
  • People started using arrays in their incomplete form; this included implicit string casts, so that's not something we can easily get rid of
  • Sometimes it's an advantage to have string casts. For instance, 'confirmed' in user_groups can check for both 'autoconfirmed' and 'confirmed'

So, I think this is not really something that we can fix in the short term.

Many thanks to everyone working on this and to @Nullzero in particular for the helpful notice!

I've added a short cautionary note to Extension:AbuseFilter/Rules format#Arrays. Feel free to correct/expand as you deem necessary.

@kerberizer Thanks! I had written a note in that section some time ago, at the line starting with 1 in my_array. However, I do realize that it was pretty hard to find (in fact, I couldn't find it right now...). I think your note is perfectly fine as-is.

@Daimona isn't it possible to merge the array specific functions right now (which should be completely backward-compatible, since it's a new feature). Then, ask global sysops to fix all filters in the "obvious" cases by changing in to the new features. For non-obvious case, obviously we should let local sysops handle them.

I misunderstood about how the confirmed group works, so let me write this part again:

Several uses of 'confirmed' in user_groups should be able to translate to the array specific functions easily. I ran a query and found that a lot of filters have the pattern !('autoconfirmed' in user_groups) & !('confirmed' in user_groups). This means that filter writers really intend for in to work with array, because if they had the substring semantics in their mind, they would not have written the left conjunct.

@Daimona isn't it possible to merge the array specific functions right now (which should be completely backward-compatible, since it's a new feature).

Sure. Although before merging, I'd like to refurbish that patch and see whether we really need all of those functions.

Then, ask global sysops to fix all filters in the "obvious" cases by changing in to the new features.

Global sysop isn't enough, they cannot act on most wikis.

For non-obvious case, obviously we should let local sysops handle them.

My main worry is for what could happen *after* the migration. If people are still used to the old syntax, they may create wrong filters without noticing.

Several uses of 'confirmed' in user_groups should be able to translate to the array specific functions easily.

For sure. Right now, that's just a clean and short (and hacky) way to check for both groups.

I ran a query and found that a lot of filters have the pattern !('autoconfirmed' in user_groups) & !('confirmed' in user_groups). This means that filter writers really intend for in to work with array, because if they had the substring semantics in their mind, they would not have written the left conjunct.

No doubt about that. These implicit casts have never been an intended feature, but rather a bug, caused by the "weak" implementation of arrays that I was speaking about above; and also the fact the arrays were introduced roughly one year after the creation of AbuseFilter (rEABF128ae5983b27c5109ff78b41217b220134107729) -- previously, all array-like variables were strings. Hence the bug became an undocumented feature, which survived until today. People inevitably found out how it worked and what the benefits were, in such a way that it's become common practice. I don't know how common, but for sure, it's not something that we can take lightly.

Daimona lowered the priority of this task from Medium to Low.

While I'd love to implement this change, this is almost impossible due to current usage.

I see the code page_namespace in "^(0|11|20)$" more often than I'd want to (probably because of the note here). This is wrong: in does not test regex patterns, only literal strings. The proper way to check namespaces is equals_to_any(page_namespace, 0, 11, 20).

Other than that, it's too bad that current functions do not support arrays. For example, the new function ip_in_ranges could accept either many strings or an array of strings; rlike could test against one long pattern or an array of patterns (as if they're joined by |), which would improve readability; and so on. This shouldn't be hard to implement in all existing function, I believe: branch after testing for first argument type.

I would also support adding new keywords and functions. Mainly because of the BC concerns discussed here, and also because it's really a different processing when working with an array: instead of strpos() on bulk data, we are doing in_array() on a list.

I would suggest the following API:

Keywords:
42 in_array [41, 42, 43]
[41, 42, 43] array_contains 42

Functions:
array_contains_any([41, 42, 43], 99, 42)
array_contains_all([41, 42, 43], 41, 42, 43)

for example:

data := [41, 42, 43];

/* Keywords */
42 in_array data
data array_contains 42

/* Functions */
array_contains_any(data, 99, 42)
array_contains_all(data, 41, 42, 43)

See how it nicely mimics the existing keywords and functions (that's also what makes me think it's the way to go).