Page MenuHomePhabricator

Filter is slower than expected
Closed, ResolvedPublic

Description

AbuseFilter profiling data on fawiki indicates that filter 13 has an average execution time of 1.8 ms which is one of the slowest. This is weird, given that we have really complex filters with tons of ccnrom()s and rlikes that run faster. The current rule of this filter is shown below:

Filter 13
! "bot" in user_groups &
! "sysop" in user_groups &
page_namespace == 0 & 
page_recent_contributors == "" &
new_size < 512 &
! contains_any(
  lcase(new_wikitext), 
  "#تغییرمسیر", 
  "#تغییر_مسیر", 
  "#redirect", 
  "{{softredirect}}", 
  "{{حذف سریع}}"
) &
! 'ابهام‌زدایی' in new_html

There are only two other filters with a prolonged average execution time (i.e. > 2 ms, compared to ~100 active filters, most of which with average execution times ranging around 0.2 to 0.8 ms): filter 176 which also uses contains_any and filter 198 which is uses ccnorm. The logic of these filters is pretty simple and is shown below:

Filter 176
page_recent_contributors == "" &
action == "upload" &
old_size == 0 & 
! "autopatrol" in user_rights &
contains_any(added_lines, "isna.ir") /* Beta */
Filter 198
(
    added_lines rlike 'تالشستان'
    |
    ccnorm(added_lines) rlike ccnorm('تالشستان')
)
&
page_namespace == 0
&
(
    user_age == 0
    |
    user_editcount < 100
)

The fact that filters this simple have such a long average execution time is concerning.

Event Timeline

Huji created this task.Sep 4 2019, 9:52 PM
Restricted Application added subscribers: Liuxinyu970226, Aklapper. · View Herald TranscriptSep 4 2019, 9:52 PM
Huji updated the task description. (Show Details)Sep 4 2019, 9:53 PM

So... First of all, recent changes to the parser (i.e. hoisting variables inside skipped branches) have increased the execution time for all filters, and this can also be seen on logstash, where we started getting a bit more entries. For fawiki, filter 13 was never reported in the last 7 days. And actually, there's no filter consistently reported as slow, so everything's fine.

What I think could make those filters slow is using page_recent_contributors, which is retrieved via a DB query; but anyway, 2ms cannot be considered slow.

page_recent_contributors == "" and as the first thing to be checked? We can do better...

@Daimona I agree that 2ms on its own isn't slow, but considering Persian Wikipedia has over 150 active filters, the numbers add up, the main reason this issue is being worked on locally is the concern that the filters are adding up to 0.5 to 1.5 seconds to the time it takes for saving an edit which isn't good. Will page_age be a faster method than page_recent_contributor for checking if the page is new? As far as I know, the fastest variable to check is old_size but that's unreliable. (though could be used in many cases)

You could use page_id === 0, if you need a quick and mostly-reliable solution. page_age would be more reliable, but it'd still require a DB query. Please tell me if you have other questions and/or want to know what filters are the slowest (although there aren't really slow filters) and the like.

Huji added a comment.EditedSep 6 2019, 1:06 PM

@matej_suchanek @Daimona thanks for the quick response. What is your thought on Filter 198? Do you think ccnorm() is the culprit? (We can obviously make that the last rule if so.)

Also, where in the AF parser code do we have the logic that runs the conditions one by one? It is implied in what you said above that conditions are checked sequentially and if one of them is enough for the entire filter rule to not match an action, then the next conditions are ignored. Where is this incorporated into code?

One more question: I am modifying these filters based on the above recommendations. Will this reset the profiling data? Or will the profiling data be based on a moving window? Or will it continue to contain the old data from before the changes I am making (and therefore, it would take a while for the average to go down)?

Lastly, is it possible to know the cost of each specific function? (I am guessing no, but thought I should still ask)

@matej_suchanek @Daimona thanks for the quick response. What is your thought on Filter 198? Do you think ccnorm() is the culprit? (We can obviously make that the last rule if so.)

I think it's good as it is, ccnorm shouldn't be too slow. user_age and user_editcount *could*, but they're already placed at the end.

Also, where in the AF parser code do we have the logic that runs the conditions one by one? It is implied in what you said above that conditions are checked sequentially and if one of them is enough for the entire filter rule to not match an action, then the next conditions are ignored. Where is this incorporated into code?

Short-circuit happens in various places... For the old parser, you can find them pretty easily by grepping mShortCircuit, and also the phrase "doesn't matter" (search), which is curiously present in other short-circuited places. For the new parser, it's a little bit more complicated, but see the end of this comment.

One more question: I am modifying these filters based on the above recommendations. Will this reset the profiling data? Or will the profiling data be based on a moving window? Or will it continue to contain the old data from before the changes I am making (and therefore, it would take a while for the average to go down)?

Yes, the profiling data will be reset right after inserting the new DB row, see here. As an aside, we claim to collect data for "the last 10000 actions", but that's not really it. Instead, we collect data for "10000 actions at a time". Which means that when we get to 10000, all data is deleted, instead of just the first entry. Basically, it's not LRU-style. This can produce distorted stats if the very first executions after the reset are too slow or too quick. And it can also break the emergency disable feature (see comment).
We'll have to fix it sooner or later.

Lastly, is it possible to know the cost of each specific function? (I am guessing no, but thought I should still ask)

No, not even back-end.


At any rate, I suggest waiting before starting to bulk-edit filters for micro-optimizations. We're in the process of switching to the new CachingParser (T156095), and it will run way more quickly, see T156095#5434572 for a preview. We need actual data from production before advertising numbers, but I'm confident that the average runtime will be around 3 times shorter (and around 40 times less memory usage, but that's not something that users are supposed to notice, usually).

Huji closed this task as Resolved.Sep 6 2019, 3:24 PM
Huji claimed this task.

Understood. I am going to mark this as resolved for now (the purpose of the task was to figure out ways that could potentially make the filters faster, and we have achieved that).