
Rethink slow filters and profiling after enabling CachingParser
Open, Needs Triage, Public

Description

The new CachingParser will have a shorter runtime, as previewed in T156095#5434572. This means that the current slow-filters system will mostly stop working, because the current threshold is too high. The problem is that cold-cache executions will still have runtimes similar to the ones we get now. So one idea could be to have two different thresholds, one for cold cache and one for warm cache. But that's not easy: runtime recording is handled inside AbuseFilterRunner and is not limited to parsing, so the Parser has no knowledge of it.
Of note, regarding the runtime of all filters together: some of them may hit a warm cache and others may not, so the grouped runtime may be of limited use.

A possible implementation would be a (singleton?) profiler class, as I said in the comments of https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/AbuseFilter/+/498978/, with the thresholds determined from runtime data in production. But for aggregated stats, there's not much I can think of.
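For illustration, a minimal sketch of what such a profiler could look like, with separate thresholds for cold and warm cache. The class, its methods and the threshold values are hypothetical placeholders, to be tuned against production data:

```php
<?php
// Minimal sketch only: class, methods and thresholds are hypothetical
// placeholders, to be tuned against production data.
class FilterProfiler {
	private const COLD_CACHE_THRESHOLD_MS = 500;
	private const WARM_CACHE_THRESHOLD_MS = 50;

	/** @var self|null */
	private static $instance = null;

	public static function getInstance(): self {
		if ( self::$instance === null ) {
			self::$instance = new self();
		}
		return self::$instance;
	}

	/**
	 * Called by the runner once per filter, with the cache state as
	 * reported by the parser.
	 */
	public function recordFilter( string $filterID, float $timeMs, bool $warmCache ): void {
		$threshold = $warmCache
			? self::WARM_CACHE_THRESHOLD_MS
			: self::COLD_CACHE_THRESHOLD_MS;
		if ( $timeMs > $threshold ) {
			wfDebugLog(
				'AbuseFilter',
				"Slow filter $filterID: {$timeMs}ms (warm cache: " . (int)$warmCache . ')'
			);
		}
	}
}
```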

Event Timeline

I think it'd be great if we could make it very rare/unlikely to have a cache miss, and then apply the per-filter measures only when the rule-parsing was cached. We could then have a separate metric (in Graphite/Logstash, for developers) to closely monitor when cache entries are missing.

There would still be the overall measure for all filters combined (not skipping anything) to make sure we don't regress in terms of impact on Save Timing.

Would it be feasible to make cache misses nearly impossible? Some of the below may already be the case, but here are some ideas:

  • We can avoid a guaranteed cache miss after a filter modification, by parsing and caching the result automatically when modifying/creating an abuse filter. There are several features in WANObjectCache (key touching, check keys, staleTTL, lockTSE, async pre-emptive regeneration) that can help us minimise chances of cache misses (see the sketch after this list).
  • The cache might fall out for a filter that is only rarely reached (e.g. other rules always match first); in that case, we could do a batch query on the cache (getMulti) to always get them together and thus keep them popular. Alternatively, if we want to optimise for shorter individual steps and quicker early return, we could keep things warm by having a deferred update that checks the unused filters and ensures they are re-parsed/cached if they fell out.
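As a rough sketch of how the parsed result could be cached with those WANObjectCache features: the cache key layout and parseFilterRules() below are assumptions, while checkKeys, lockTSE and staleTTL are real options of WANObjectCache::getWithSetCallback().

```php
<?php
// Illustrative sketch only: key layout and parseFilterRules() are assumptions;
// checkKeys, lockTSE and staleTTL are real getWithSetCallback() options.
use MediaWiki\MediaWikiServices;

function getCachedFilterAst( int $filterId, string $rules ) {
	$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();
	$key = $cache->makeKey( 'abusefilter', 'ast', $filterId, md5( $rules ) );

	return $cache->getWithSetCallback(
		$key,
		$cache::TTL_MONTH,
		static function () use ( $rules ) {
			// Hypothetical parse step producing a serialisable AST.
			return parseFilterRules( $rules );
		},
		[
			// A single check key, touched on save, marks all filter ASTs stale at once.
			'checkKeys' => [ $cache->makeKey( 'abusefilter', 'ast-epoch' ) ],
			// Let one runner regenerate while others briefly reuse a stale value.
			'lockTSE' => 30,
			// Keep the stale value around for a week past expiry as a fallback.
			'staleTTL' => $cache::TTL_WEEK,
		]
	);
}
```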

With some of the above applied, and a long expiry (month or year), and assuming it won't expire/fall out of Memcached for other reasons - in theory - what scenarios could still result in a cache miss?

I think it'd be great if we could make it very rare/unlikely to have a cache miss, and then apply the per-filter measures only when the rule-parsing was cached. We could then have a separate metric (in Graphite/Logstash, for developers) to closely monitor when cache entries are missing.

Collecting data only for warm runs makes sense. As I said in the description, profiling data is saved in AbuseFilterRunner, while only the Parser has knowledge about cache hits/misses. So we'd have to write some additional logic, but it shouldn't be too complicated. And not showing cold-run data in the UI is fine, but only as long as we really manage to greatly reduce cache misses.
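A rough sketch of that extra logic, assuming a hypothetical cache-hit flag passed from the Parser to the Runner and placeholder statsd keys: per-filter timings are recorded only on warm runs, while cold runs just feed an aggregate miss counter (the separate Graphite metric mentioned above).

```php
<?php
// Rough sketch of the "record only warm runs, count misses separately" idea.
// The $cacheHit flag is a hypothetical piece of information the Parser would
// need to expose to the Runner; the statsd keys are placeholders.
use MediaWiki\MediaWikiServices;

function profileFilterRun( string $filterID, float $timeMs, bool $cacheHit ): void {
	$stats = MediaWikiServices::getInstance()->getStatsdDataFactory();

	if ( $cacheHit ) {
		// Per-filter timing is only meaningful when the parse was cached.
		$stats->timing( "abusefilter.runtime.filter.$filterID", $timeMs );
	} else {
		// Cold runs only feed an aggregate miss counter for monitoring.
		$stats->increment( 'abusefilter.cachingparser.miss' );
	}
}
```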

There would still be the overall measure for all filters combined (not skipping anything) to make sure we don't regress in terms of impact on Save Timing.

For sure. And after all, I don't think we should care about that runtime being a mix of cold and warm runs. Currently, it's already a mix of filters bailing out at the first condition and filters where every condition is checked, so it should only represent the average extra save time.

Would it be feasible to make cache misses nearly impossible?

We can definitely try :-)

  • We can avoid a guaranteed cache miss after a filter modification, by parsing and caching the result automatically when modifying/creating an abuse filter.

I think this is already in place (inadvertently): when saving a filter, we run a syntax check, which in turn involves parsing the filter and will cache the tokens and the AST with the new parser. And that will keep working if we change the syntax check so that it only tries to build the AST, without evaluating it (T231536).
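In other words, the save path warms the cache as a side effect of the syntax check. A sketch only; the function name is illustrative, not the extension's real entry point:

```php
<?php
// Sketch of the incidental pre-caching on save: the syntax check parses the
// rules, and with the CachingParser that parse also stores the tokens/AST.
function checkSyntaxAndWarmCache( int $filterId, string $rules ): bool {
	try {
		// Building the AST is enough to populate the cache, even if the
		// syntax check later stops evaluating the filter (T231536).
		getCachedFilterAst( $filterId, $rules ); // helper from the earlier sketch
		return true;
	} catch ( Exception $e ) {
		// Syntax error: nothing gets cached and the filter is not saved.
		return false;
	}
}
```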

There are several features in WANObjectCache (key touching, check keys, staleTTL, lockTSE, async pre-emptive regeneration) that can help us minimise chances of cache misses.

We could also use key touching as we do for global filters, yes. [1]

  • The cache might fall out for a filter that is only rarely reached (e.g. other rules always match first); in that case, we could do a batch query on the cache (getMulti) to always get them together and thus keep them popular. Alternatively, if we want to optimise for shorter individual steps and quicker early return, we could keep things warm by having a deferred update that checks the unused filters and ensures they are re-parsed/cached if they fell out.

Every filter is always executed (unless too many conditions are consumed, but this shouldn't happen on WMF wikis), so every edit will recache all missing filters.
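For reference, the getMulti + deferred update idea from the quoted bullet could look roughly like this; key layout and helpers are the same assumptions as in the earlier sketches:

```php
<?php
// Sketch: batch lookup of all filter AST keys, then re-parse whatever is
// missing in a deferred update.
use MediaWiki\MediaWikiServices;

function warmFilterAstCache( array $filters ): void {
	$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();

	$keysToFilters = [];
	foreach ( $filters as $filterId => $rules ) {
		$key = $cache->makeKey( 'abusefilter', 'ast', $filterId, md5( $rules ) );
		$keysToFilters[$key] = [ $filterId, $rules ];
	}

	// One batched lookup keeps rarely-reached filters "popular" in memcached.
	$found = $cache->getMulti( array_keys( $keysToFilters ) );

	DeferredUpdates::addCallableUpdate( static function () use ( $keysToFilters, $found ) {
		foreach ( $keysToFilters as $key => [ $filterId, $rules ] ) {
			if ( !array_key_exists( $key, $found ) ) {
				// Re-parse and re-cache the entry that fell out of the cache.
				getCachedFilterAst( $filterId, $rules ); // helper from the earlier sketch
			}
		}
	} );
}
```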

With some of the above applied, and a long expiry (month or year), and assuming it won't expire/fall out of Memcached for other reasons - in theory - what scenarios could still result in a cache miss?

No other scenarios. Currently, the TTL is one day in the local server cache. We could increase it and use key touching as you proposed above.

[1] - For global filters, we cache the results of the query on the foreign DB. That has a long TTL, and the key is touched when changing a global filter. The mechanism here would be a little different, but we could also cache the query for local filters and delegate all the caching to a separate method.
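A sketch of that approach applied to the local filter query, using a check key that gets touched on save. Key names and loadActiveFiltersFromDb() are assumptions; getWithSetCallback() and touchCheckKey() are real WANObjectCache methods.

```php
<?php
// Sketch of the key-touching approach applied to the local filter query.
use MediaWiki\MediaWikiServices;

function getActiveLocalFilters(): array {
	$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();

	return $cache->getWithSetCallback(
		$cache->makeKey( 'abusefilter', 'active-filters' ),
		$cache::TTL_MONTH,
		static function () {
			// Hypothetical DB query loading the enabled local filters.
			return loadActiveFiltersFromDb();
		},
		[ 'checkKeys' => [ $cache->makeKey( 'abusefilter', 'filters-epoch' ) ] ]
	);
}

function onLocalFilterSaved(): void {
	$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();
	// Touching the check key marks every dependent cache entry as stale at once.
	$cache->touchCheckKey( $cache->makeKey( 'abusefilter', 'filters-epoch' ) );
}
```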

Change 537499 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] Group slow filters by warm/cold cache

https://gerrit.wikimedia.org/r/537499

The implementation above should be good. Ensuring as few cache misses as possible could be the scope of another task, to be opened after looking at prod results.