Page MenuHomePhabricator

Shared variables for AbuseFilter
Open, Needs TriagePublic

Description

for example, to store regex with bad words, and such variable could be used in many filters

--AS (talk) 22:20, 15 June 2015 (UTC)

This card tracks a proposal from the 2015 Community Wishlist Survey: https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey

This proposal received 6 support votes, and was ranked #84 out of 107 proposals. https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey/Moderation_and_admin_tools#Shared_variables_for_AbuseFilter

Event Timeline

DannyH raised the priority of this task from to Needs Triage.
DannyH updated the task description. (Show Details)
DannyH subscribed.
IMPORTANT: If you are a community developer interested in working on this task: The Wikimedia Hackathon 2016 (Jerusalem, March 31 - April 3) focuses on #Community-Wishlist-Survey projects. There is some budget for sponsoring volunteer developers. THE DEADLINE TO REQUEST TRAVEL SPONSORSHIP IS TODAY, JANUARY 21. Exceptions can be made for developers focusing on Community Wishlist projects until the end of Sunday 24, but not beyond. If you or someone you know is interested, please REGISTER NOW.

I took a look a this. There are two main "difficulties" in approaching this task:

  1. Store global variables in a convenient place. However, this isn't such a big problem: a solution like the one in https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/AbuseFilter/+/420757/ should do the trick (basically we pass info from AbuseFilter class, which provides a higher abstraction level, to AbuseFilterParser, which only handles single filters). Actually, the patch would be very similar to that: we add a class member to AbuseFilter to store global variables, add a function like set_global_var, one like get_global_var, add this special use case to setUserVariable and we're done.
  2. Be able to distinguish between a variable not being set and one not yet computed. For instance, if I define a global variable in filter 100 and try to get it in filter 2, when the parser examines filter 2 the global variable won't be available. Either we solve T132846 (or find a better way to specify execution order), or I think the only solution would be to enqueue the current filter after all the others. Then, when we'll have parsed all filters in ascending order and get back to filter with lower ID which were enqueued, we stop forgiving and throw an exception if the global variable still doesn't exist. This would again require some new logic to allow enqueuing filters, and the real order of execution wouldn't be nor clear, nor customizable. As a side note, this will still produce errors at runtime: in fact, when you create a new filter, the others aren't (obviously) executed, so we have to forgive non-existent global variables. But if such variable isn't actually set (or maybe it was set in a filter that was disabled), it will give a runtime error. This could have been fine if we had a page for problematic executions like the one introduced in https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/AbuseFilter/+/429464/, but until we don't have it there'll be a way too high risk to prevent filters from being executed.

TL;DR: The patch itself shouldn't be too hard to write, although it involves adding a considerable amount of brand new logic. However, we need to deal with the order of execution of the filters and other related problems that may potentially cause a big havoc.

Change 441630 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] [WIP] Add global variables shared between filters

https://gerrit.wikimedia.org/r/441630

I personally worry about the concept of filter priorities (see comments at T198005). I think ideally we'd have a dedicated place to define shared variables that is NOT within a filter.

So I could go to Special:AbuseFilter/variables (something like that), and create new variables similar to how you create filters. That way I have access to runtime variables like user_name (which I think can be interpolated in variables?). This gets computed by AbuseFilter before any actual filters are ran. This means you don't need to worry about the "priorities" or execution order of filters -- the variables are already defined.

A new page would surely be neat. My first thought is that we may be adding even too much pages (if counting T193064 and subtasks), which could potentially end up to confuse users. Maybe something user-friendly like a dedicated page but not as big?

Yeah, a small page for defining global variables I think is the easiest to wrap your head around as a filter author. You'd also have all the shared variables in one place, so you know what's available and can be used in your new filter. This approach also means we could tackle this ticket independently of T186960, or have to worry about having your filter's priorities set correctly.

The other thing I worried about was naming conflicts. On enwiki, a lot of variables have the same name: "stringy", "abuseStr", "nope" (weird choices, but not the point :). So referencing another filter's variable would have to be namespaced ("5.stringy", "15.nope", etc.). That's a bit complicated to me, and there's the possibility you'd still end up with unnecessary duplication, where 5.stringy and 15.nope actually do the same thing -- you just weren't aware because you didn't take the time to check what variables are already available (understandably, because there are a lot of filters to check).

I'll change the patch to use a new page, then. I suppose it's fine to make it visible only for those with abusefilter-modify right, huh? As for naming conflicts, having a global variable named "stringy" shouldn't happen :-) You can't see the content of the variable from another filter, so having such a name and using it as a global is already wrong. As for namespaces, people could just give more explicit name, like "badwords_1" :-)

I'll change the patch to use a new page, then.

Sweet, that's just my opinion of course but I think this will be easier to work with. How will the interface work? If it's easier to implement, I think a plain textarea will be fine:

myGlobalVar1 := "bad word|" + user_name + "|bad word2";
myGlobalVar2 := "foo bar";
...

Is that weird?

I suppose it's fine to make it visible only for those with abusefilter-modify right, huh?

Good point, there could be private things in there, so I would say abusefilter-view-private. That way users with read-only rights can still see how the filters work.

Yes, I think a plain textarea would be the best solution: first, I don't think there'll be many global variables, so the textarea shouldn't become too clogged with stuff. Second, adding brand new forms (like for filters: a button "create new" with a textarea for the variable and a field for its name) requires much more coding. Third, like I was saying above, let's keep things minimal and simple, like a single textarea is.
I can see a single problem: what about versioning? We'd need to implement a custom history, which would require a dedicated DB table and a lot of work to do. Also, to store our global variables we'd need a new table anyway, and it seems a bit overkill to use it only for our variables list. What shall we do?
And yes, probably we should check for view-private OR modify right.

I can see a single problem: what about versioning?

Grr! That is a problem. Maybe invisibly, on the backend, Special:AbuseFilter/variables is actually filter 0. The only thing different is the checkboxes for the actions, and tags, etc., are all hidden. All you get is a single textarea (and a check syntax button). AbuseFilter knows any variables set in filter 0 are global. You'd also add logic to make AbuseFilter ignore it for logging, since it is used only for variable definition.

Hacky... but maybe it could work?

Well, I also thought of using Special:AbuseFilter/0 :-) Actually, we may call it "Special:AbuseFilter/config" and let it open to further customization in case we'll add new, similar features. It would be a bit hacky, true, and also need a DB change, but should work just well. If no other problems will rise, I'll start working on it tomorrow.

Change 442078 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] [WIP] Add a page for filter config and shared global variables

https://gerrit.wikimedia.org/r/442078

@MusikAnimal the patch above is the implementation with a dedicated page. It requires much more changes than the other one, but provides a more stable result, which causes less troubles to users and may be used for other global options in the future.

Change 442078 abandoned by Daimona Eaytoy:
[mediawiki/extensions/AbuseFilter@master] [WIP] Add a page for filter config and shared global variables

Reason:

https://gerrit.wikimedia.org/r/442078

Change 441630 abandoned by Daimona Eaytoy:
[mediawiki/extensions/AbuseFilter@master] [WIP] Add global variables shared between filters

Reason:

https://gerrit.wikimedia.org/r/441630

Removing task assignee due to inactivity, as this open task has been assigned to the same person for more than two years (see the emails sent to the task assignee on Oct27 and Nov23). Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome.
(See https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator.)

This is part of the overhaul.

I haven't really thought about this, but a few random ideas:

  • VariableHolder should have a separate property for global variables; retrieving a global variable should happen via a dedicated method, not the ones for "usual" variables
  • There are various things that might make this difficult: the order in which filters are executed, checking syntax of filters that use global variables, etc.
    • A possible solution (partly explored in my 2 y.o. patch is to have a dedicated interface for declaring global variables (a sort of "filter 0" as proposed above), that would always be executed when checking syntax of a filter.
    • Any variable set there would be global
    • Retrieving a global should probably happen with a special syntax, e.g. get_global_variable( name ), not just by direct access
  • I'm unsure about the current capability of the parser to check for unset variables etc. T260903 would probably help, but it won't happen as part of the overhaul

My ideas are related to privacy. We have different privilege levels for viewing and editing public and private filters. If someone cannot see/edit private filters, but can create a public filter, they should probably see the shared variables. So they would have to be public. Per description:

to store regex with bad words

I believe such filters are usually hidden from the public, especially because of such a list of words. So we might think about more sets of variables for different audiences, which would complicate things not only in the code, but also for maintainers. Or make them private but then their usage in public filters could be problematic.

Retrieving a global should probably happen with a special syntax, e.g. get_global_variable( name ), not just by direct access

This explicit query would be my preference. However, static analysis is more difficult. (Maybe we could enforce a convention that ordinary variables are in lower case while shared variables are in upper case, like constants in PHP.)

My ideas are related to privacy. We have different privilege levels for viewing and editing public and private filters. If someone cannot see/edit private filters, but can create a public filter, they should probably see the shared variables. So they would have to be public. Per description:

to store regex with bad words

I believe such filters are usually hidden from the public, especially because of such a list of words. So we might think about more sets of variables for different audiences, which would complicate things not only in the code, but also for maintainers. Or make them private but then their usage in public filters could be problematic.

Very good point. All in all, I think that having two sets might be a good idea, one for "public" variables, one for "private" variables. These might be two edit boxes on the same page. At which point, we'd have to be very careful and check permissions before parsing any code (also e.g. from /test or /tools); perhaps this is doable by injecting a list of allowed global variable from the current user into the parser (so the parser itself is not aware of permissions).

Alternatively, we might implement a single visibility flavour for now, which would have to be public (easier to do, but perhaps less useful?).

Also, having different sets has two additional disadvantages:

  • The DB schema. The approach below wouldn't work easily.
  • The "public editor" couldn't use anything from the "private editor", assuming this is something that people might want.

Retrieving a global should probably happen with a special syntax, e.g. get_global_variable( name ), not just by direct access

This explicit query would be my preference. However, static analysis is more difficult. (Maybe we could enforce a convention that ordinary variables are in lower case while shared variables are in upper case, like constants in PHP.)

Heh, static analysis is going to be difficult anyway. Some possibilities:

  • get_global_variable. Pro: totally BC, intention of the code is obvious. Con: static analysis is difficult [1]
  • Uppercase convention. Pro: easier code in the AF parser, filters are also easy to read. Con: not backwards-compatible, in theory. Enforcing case might be complicated, too. (Also need to ensure that no naming conflict can happen)
  • "global" keyword. Pro: intention is quite clear. Con: very hard to implement due to scoping. I put this option here, but I don't really think it's feasible.

Another good question is what the DB schema should look like. Like proposed above (and implemented in an early POC), putting global variables in a fake "filter 0" would be a simple option that also denies the need for schema changes. Also, we'd likely want a complete filter-like textbox for rules, so we'd only need a DB row per set of variables, which is kinda overkill for a dedicated table.

Also, with some of the backend code now being revamped, it should be possible to implement this in a way that is transparent to the callers, because only FilterLookup and the Api query module are accessing the schema.

Note, as I wrote above, that this wouldn't work really well with different sets of shared variables.


[1] - Might be mitigated if we do like set_var, requiring the first argument (var name) to be a literal string. I think that would keep things easy enough.

putting global variables in a fake "filter 0" would be a simple option that also denies the need for schema changes

I was wondering whether we could mimic the way how we store afl_var_dump for this.

putting global variables in a fake "filter 0" would be a simple option that also denies the need for schema changes

I was wondering whether we could mimic the way how we store afl_var_dump for this.

To an extent, yes: storing data to BlobStore (aka text table + ExternalStore) might be viable, the problem is that we'd have to keep a reference to the insert ID (aka the ID of the entry in the text table), and that ID should be stored somewhere in an AF table. And to store more than one ID we'd need more than one row, which takes us back to the start.