
Change $wgAbuseFilterConditionLimit for wikimedia commons
Closed, Resolved (Public)

Description

Of the last 3,003 actions, 2 (0.07%) have reached the condition limit of 1,000, and 25 (0.83%) have matched one of the filters currently enabled.

Can we have $wgAbuseFilterConditionLimit raised, say by 500 conditions?

We have to run a lot of powerful filters on Commons to prevent abuse.

I also checked all enabled filters on Commons, and as far as I can see we need them all.

Event Timeline

Poyekhali triaged this task as Medium priority. (Apr 13 2016, 3:35 AM)
Poyekhali subscribed.
matmarex raised the priority of this task from Medium to High.
matmarex subscribed.

(Claiming this task as a reminder to myself to review the filters after T132200. I'm hoping we'll end up marking this "invalid". :) )

I took a look today and I'm confident that this won't be necessary. :D

Commons is now safely below the condition limit without any filter changes, due to my recent changes to the way conditions are counted (T43693 and T132190). These made the number of conditions used by a filter lower in almost all cases (and closer to the truth, too; see some examples in the commit message of https://gerrit.wikimedia.org/r/#/c/282477/).

There are 63 enabled filters on Commons, and according to the data, almost every one of them executes in under 1 ms (on average) and consumes only 1 or 2 conditions (on average). Most are closer to 0.15 ms and use just one condition. There are a few outliers; I'll look into them some more.

I looked at some slower filters, and I have two observations that seem not to be documented anywhere prominent (at least not at https://www.mediawiki.org/wiki/Extension:AbuseFilter/Conditions).

  • When checking for occurrences of multiple strings in text (common in filters detecting spam), it is a lot faster to use contains_any(text, 'a', 'b', 'c') or at least text rlike 'a|b|c' than a separate test for each string. Really, it's ridiculous how much faster it is. Always do it this way.
  • All user_* variables except for user_name potentially require a database query, so computing them is more expensive than variables like action and article_namespace. They probably shouldn't be used as the first condition of a filter. (This might decrease or increase condition count, depending on whether the new order causes the matching to finish earlier or later, but should improve the actual performance.)
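To make the first point concrete, here is a minimal Python sketch of the three ways of writing the same check. It is only a model: the function names `separate_tests` and `rlike_any` and the sample strings are made up for illustration, and Python will not reproduce the speed difference seen in AbuseFilter, which comes from how the filter interpreter evaluates each form. The sketch only shows that the combined forms give the same answer as the chain of separate tests.

```python
import re

def separate_tests(text, needles):
    # One substring test per needle, like a chain of separate "contains"
    # checks in a filter (the pattern the comment above advises against).
    for needle in needles:
        if needle in text:
            return True
    return False

def contains_any(text, *needles):
    # A single call that takes every needle at once; models the shape of
    # contains_any(text, 'a', 'b', 'c').
    return any(needle in text for needle in needles)

def rlike_any(text, needles):
    # A single regex with alternation; models text rlike 'a|b|c'.
    # Needles are escaped so they are matched literally.
    pattern = "|".join(re.escape(n) for n in needles)
    return re.search(pattern, text) is not None

# Hypothetical spam phrases and sample texts, for illustration only.
needles = ("buy now", "cheap pills", "casino")
samples = ["Visit our casino today", "A photo of a bridge", ""]

results = [
    (separate_tests(s, needles), contains_any(s, *needles), rlike_any(s, needles))
    for s in samples
]
for sample, row in zip(samples, results):
    print(repr(sample), row)
```

All three forms agree on every sample, so switching an existing filter to the combined form should not change what it matches, as long as the strings are plain literals.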

I edited the following filters (see https://commons.wikimedia.org/wiki/Special:AbuseFilter/history) – note that some are private and only Commons sysops will be able to see the diffs:

(I also made another mistaken edit to filter 69, which I reverted.)

I tried not to touch filters where the improvements weren't obvious. There seemed to be many where user_groups or user_editcount is used in the first condition, but they all looked very quick anyway.

@matmarex: You're the best! Thanks! :-) Very happy.

@matmarex: I didn't get the reasoning behind the edit to filter 69:

(2016-04-17) Reordered rules for performance. The checks for user_name/global_user_groups are almost never hit, so putting them first doesn't let us skip the regexes. Let's just put the regexes first. --Matma Rex

If they are almost never hit, isn't it better for short-circuit purposes to leave them as the first conditions? Isn't it a problem that the (new|old)_wikitext can be quite large and will now be tested more often against the regexes?

What I mean is that e.g. "OTRS-member" in global_user_groups is almost never true (most users are not OTRS members); !("OTRS-member" in global_user_groups) (negated, as it appears in the filter) is almost always true and doesn't let us short-circuit the filter. Perhaps I should have said that they are almost always hit?
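The reasoning above can be sketched in Python as a toy model of left-to-right AND evaluation. Everything here is an assumption made for illustration: `run_filter`, the check names, the sample edit and the `ticket` regex are hypothetical and are not taken from filter 69. The point is only that a guard which is almost always true never stops evaluation early, so it adds work without saving any.

```python
import re

def run_filter(checks, edit):
    """Evaluate checks left to right with AND semantics, stopping at the
    first check that is false. Returns (matched, names of checks run)."""
    evaluated = []
    for name, check in checks:
        evaluated.append(name)
        if not check(edit):
            return False, evaluated
    return True, evaluated

# Negated group check: true for nearly every user, since most users are
# not OTRS members.
not_otrs = (
    "not_otrs_member",
    lambda e: "OTRS-member" not in e["global_user_groups"],
)
# Hypothetical regex check on the new wikitext: false for most edits.
wikitext_regex = (
    "wikitext_regex",
    lambda e: re.search(r"ticket", e["new_wikitext"]) is not None,
)

# A typical edit: ordinary user, wikitext that does not match the regex.
typical_edit = {"global_user_groups": [], "new_wikitext": "A photo of a bridge"}

guard_first = run_filter([not_otrs, wikitext_regex], typical_edit)
regex_first = run_filter([wikitext_regex, not_otrs], typical_edit)

print(guard_first)  # both checks run before the filter gives up
print(regex_first)  # the regex fails first, so the group check is skipped
```

In this model the regex runs for the typical edit under either ordering; putting it first only removes the group check that would have passed anyway.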

Again:

Of the last 3,574 actions, 1 (0.03%) has reached the condition limit of 1,000, and 22 (0.62%) have matched one of the filters currently enabled.

:-(

I'd suggest disabling some infrequently used filters:

The action check could be put first in some of these:

Some more work along the lines of what MatmaRex said above is in order, although it would be useful to know which filters are consuming most conditions.

And I disabled 107.

But it wouldn't hurt to have a few more conditions :-). Commons has a small community, and thus we need more filters than other wikis.

Change 294363 had a related patch set uploaded (by Bartosz Dziewoński):
Set $wgAbuseFilterConditionLimit = 2000 for commonswiki

https://gerrit.wikimedia.org/r/294363

Change 294363 merged by jenkins-bot:
Set $wgAbuseFilterConditionLimit = 2000 for commonswiki

https://gerrit.wikimedia.org/r/294363

Mentioned in SAL [2016-06-14T23:09:45Z] <ori@tin> Synchronized wmf-config/abusefilter.php: I4e5e4d227: Set $wgAbuseFilterConditionLimit = 2000 for commonswiki (T132048) (duration: 00m 28s)

For the record, the condition limit has been recently increased to 2000 for all wikis: T309609.