Page MenuHomePhabricator

Add subroutines to AbuseFilter
Open, NormalPublic

Description

As discussed in the Community health initiative, it would be useful to add subroutines, i.e. the possibility to check from within a filter if another filter was tripped by the same edit. This would open the doors to a great improvement in performance by avoiding redundancies.
I was also thinking about a partial way to achieve this. If I'm right, AbuseFilter tests filters ordering them by number. Thus, we may limit this functionality to only checking previous filter (=filters with smaller number). This might be further extended by solving T132846 and related task (adding a custom execution order would be fine too) in order to give full control to the users.

Event Timeline

Daimona created this task.Feb 10 2018, 1:40 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 10 2018, 1:40 PM
Daimona triaged this task as Normal priority.
Huji added a subscriber: Huji.Feb 10 2018, 2:15 PM

Change 441621 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] [WIP] Add a function to check if another filter matched the action

https://gerrit.wikimedia.org/r/441621

Daimona claimed this task.Jun 24 2018, 1:23 PM

AbuseFilter tests filters ordering them by number. Thus, we may limit this functionality to only checking previous filter (=filters with smaller number).

What if we force the evaluation of a given filter if we detect it hasn't been evaluated? That way AF doesn't need a particular order. Of course, it should check for circular dependencies and max recursion level to avoid loops.

Example:

  1. Filters are evaluated in order
  2. Filter $1 wants to check if filter $2 matched.
  3. Filter $2 hasn't been evaluated yet. Call the function to evaluate filter $2.
  4. Filter $2 gets evaluated, etc. Execution returns to filter $1
  5. Next filter to evaluate is $2. Since it has been evaluated already, there's nothing to do. Evaluate the next filter...

@Ciencia_Al_Poder Yes, I also had that idea. However, I think that we'd better allow users to specify the execution order and avoid setting one without explicitly reporting it. This would give users more control and simplify coding the feature.

MusikAnimal added a comment.EditedJun 24 2018, 6:36 PM

Ciencia_Al_Poder's proposal is basically what I was thinking with T198005#4309946. I was suggesting we have input fields at the top of the editing interface to specify what filters are "required", but if AbuseFilter can be that smart to just "cache" the result of filters, such that everything is done at run-time, then I think your proposed workflow is probably better.

So an example filter for the above workflow:

!("confirmed" in user_groups) &
check_filter(2) &
"foo" in added_lines

AbuseFilter will evaluate filter 2 at runtime and cache the result (true or false), so it's ready to go when it encounters any other filter that has check_filter(2). That seems feasible to implement, no?

  1. Next filter to evaluate is $2. Since it has been evaluated already, there's nothing to do. Evaluate the next filter...

It seems unlikely, but there's a possibility $1 won't take action but $2 will. So I probably wouldn't have AbuseFilter skip $2 altogether, but if implementation allows (described above), AbuseFilter will already know what the result of $2 is and hence can jump straight to the actions it's supposed to take.

Yes, it's feasible, and it's also possible to execute every filter only once, since actions to execute are determined basing on the IDs of the matched filters. I'd rather avoid giving full order control to AF itself, while manually adding requirements is what we're talking about in T198005.

I think that we'd better allow users to specify the execution order and avoid setting one without explicitly reporting it. This would give users more control and simplify coding the feature.

The execution order currently doesn't affect anything AFAIK. Filters could be run in reverse or random order and they should work the same. They're (currently) independent from each other.

If users need to manually choose the execution order because of this, it introduces a path of errors on the end users, when they change execution order and break dependencies. This should be handled transparently.

I'd code it to work recursively: When it finds a check_filter statement (or whatever you call it), it gets the filter execution result, or evaluates it if it hasn't been evaluated yet, and then continue evaluating the current filter.

It seems unlikely, but there's a possibility $1 won't take action but $2 will. So I probably wouldn't have AbuseFilter skip $2 altogether, but if implementation allows (described above), AbuseFilter will already know what the result of $2 is and hence can jump straight to the actions it's supposed to take.

When $2 executes because it's in use by $1, I was thinking to execute it completely (performing any configured actions), not only evaluate if it matches or not. But yes, deferring the action until it goes through the normal execution order should also be fine. Either way shouldn't affect anything. In fact, ideally, AF should first check only if filters match, and after all filters have been evaluated it should perform the action in batch. I'm not sure if AF is currently coded to do this, or it's performing actions on every filter evaluation when it matches.

The execution order currently doesn't affect anything AFAIK. Filters could be run in reverse or random order and they should work the same. They're (currently) independent from each other.

This is true from a syntax validity POV. However, changing the order could change what we decide to show to the user, see for instance the original report on T132846. Making filters auto-require themselves would throw away the possibility to customize their order and thus solve problems like that one.

When $2 executes because it's in use by $1, I was thinking to execute it completely (performing any configured actions), not only evaluate if it matches or not.

That's actually what we already do: first execute all filters, record the ID of the ones that matched, then query abuse_filter_action with a JOIN on those IDs and progressively execute actions. Nothing would need to be changed from this perspective.

Note about this feature: it'd be cool to make this function accept arrays or be variadic (like contains_any/all). For instance, add two functions:

matched_filter_any([1,2,3,...])
matched_filter_all([1,2,3,...])