
Rate limit thresholds requests when the service is down
Closed, ResolvedPublic

Description

Currently, when a cached threshold expires, we keep retrying the service until a good response is received. In the case of pages like Special:RecentChanges, this amounts to hundreds of requests per second. This traffic is potentially exacerbating overload conditions.

We should cache the failure with a short expiry, such as one minute, and not retry until that entry expires.
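
A minimal sketch of the idea, in Python rather than the extension's PHP, and not the actual patch: a failed fetch is stored under the same key with a short TTL, so later lookups return "no thresholds" without touching the service. The class name, the `fetch` callable, the (wiki, model) key shape, and `SUCCESS_TTL` are all assumptions made for illustration.

```python
import time

FAILURE_TTL = 60       # seconds; the "one minute" suggested above
SUCCESS_TTL = 86400    # hypothetical TTL for good responses
_FAILED = object()     # sentinel marking a cached failure


class ThresholdCache:
    """Tiny in-memory TTL cache that also remembers failures."""

    def __init__(self, fetch, clock=time.monotonic):
        self._fetch = fetch    # callable(key) -> value; raises on failure
        self._clock = clock    # injectable so expiry can be tested
        self._store = {}       # key -> (expires_at, value or _FAILED)

    def get(self, key):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            # Fresh entry: either a good value or a remembered failure.
            return None if entry[1] is _FAILED else entry[1]
        try:
            value = self._fetch(key)
        except Exception:
            # Cache the failure briefly so we stop hammering the service.
            self._store[key] = (now + FAILURE_TTL, _FAILED)
            return None
        self._store[key] = (now + SUCCESS_TTL, value)
        return value
```

With a key such as `("enwiki", "damaging")`, a service outage costs at most one request per minute per key, however many page views ask for the thresholds.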

Event Timeline

We aren't seeing these huge floods of failed thresholds requests in the metrics because we're falling back to the old-style thresholds code, which is missing the metrics calls.

https://grafana.wikimedia.org/dashboard/db/ores-extension?orgId=1&panelId=6&fullscreen&edit&from=now-30d&to=now

Change 393922 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/ORES@master] Rate limit thresholds failures to once per (minute x model x wiki)

https://gerrit.wikimedia.org/r/393922

Change 393922 merged by jenkins-bot:
[mediawiki/extensions/ORES@master] Rate limit thresholds failures to once per (minute x model x wiki)

https://gerrit.wikimedia.org/r/393922

Change 393945 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/ORES@master] Cache anti-stampede improvements

https://gerrit.wikimedia.org/r/393945

Change 393960 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/ORES@wmf/1.31.0-wmf.8] Rate limit thresholds failures to once per (minute x model x wiki)

https://gerrit.wikimedia.org/r/393960

Change 393960 merged by jenkins-bot:
[mediawiki/extensions/ORES@wmf/1.31.0-wmf.8] Rate limit thresholds failures to once per (minute x model x wiki)

https://gerrit.wikimedia.org/r/393960

Mentioned in SAL (#wikimedia-operations) [2017-11-29T00:32:18Z] <awight@tin> Synchronized php-1.31.0-wmf.8/extensions/ORES: Hotfix to mitigate cache stampeding, T181567 (duration: 00m 50s)

Change 393966 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/ORES@wmf/1.31.0-wmf.10] Rate limit thresholds failures to once per (minute x model x wiki)

https://gerrit.wikimedia.org/r/393966

Change 393945 merged by jenkins-bot:
[mediawiki/extensions/ORES@master] Cache anti-stampede improvements

https://gerrit.wikimedia.org/r/393945

Change 393966 merged by jenkins-bot:
[mediawiki/extensions/ORES@wmf/1.31.0-wmf.10] Rate limit thresholds failures to once per (minute x model x wiki)

https://gerrit.wikimedia.org/r/393966

Mentioned in SAL (#wikimedia-operations) [2017-11-29T00:44:18Z] <awight@tin> Synchronized php-1.31.0-wmf.10/extensions/ORES: Hotfix to mitigate cache stampeding, T181567 (duration: 00m 49s)

Still need to better understand, and test, https://gerrit.wikimedia.org/r/#/c/393945/ before letting it get to production.

So, my understanding is that lockTSE puts a mutex around the get and set within getWithSetCallback, so multiple threads don’t try to fetch new thresholds at the same time. I set this to an arbitrary 10 seconds; will that be good? pcTTL keeps a cached version in memory so that we don’t recalculate multiple times in one request, since the cache fetch comes from a replica.
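
To make the question concrete, here is a rough Python sketch of the behaviour described above: when the value has expired, one thread takes a lock and regenerates while the others keep using the stale value. This is only an illustration of the general anti-stampede pattern under that reading; it is not how WANObjectCache is implemented, it does not model the 10-second window, and `SingleFlightCache` and `regenerate` are hypothetical names.

```python
import threading
import time


class SingleFlightCache:
    """One cached value; at most one thread regenerates it at a time."""

    def __init__(self, regenerate, ttl, clock=time.monotonic):
        self._regenerate = regenerate   # callable() -> fresh value
        self._ttl = ttl
        self._clock = clock
        self._value = None
        self._expires_at = None         # None means never populated
        self._lock = threading.Lock()

    def get(self):
        if self._expires_at is not None and self._clock() < self._expires_at:
            return self._value          # still fresh
        # Expired or empty: try to become the single regenerating thread.
        if self._lock.acquire(blocking=False):
            try:
                self._value = self._regenerate()
                self._expires_at = self._clock() + self._ttl
            finally:
                self._lock.release()
            return self._value
        # Someone else is regenerating.
        if self._expires_at is not None:
            return self._value          # serve the stale value meanwhile
        with self._lock:                # nothing to serve yet: wait for them
            pass
        return self._value
```

The trade-off is that callers may briefly see an expired value, in exchange for the backend receiving one request instead of one per waiting thread.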

@Legoktm @Krinkle would you mind confirming my understanding, and giving the patch a +1?

I’d also like to do something Krinkle suggested, where we check the cached value before it expires, and if the service is unreachable, put the old value back into the cache for some shorter TTL. This can be future work.
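
A possible shape for that future work, sketched in Python under stated assumptions: the value is refreshed shortly before it would expire, and if the service is unreachable the old value stays in place and the next attempt is postponed by a shorter interval. All names and the three durations are hypothetical, and in this simplified version the old value is kept for as long as the outage lasts.

```python
import time

TTL = 3600            # hypothetical normal lifetime of a good value
REFRESH_AHEAD = 300   # hypothetical: start refreshing this long before expiry
STALE_TTL = 60        # hypothetical shorter lifetime for a re-cached old value


class RefreshAheadCache:
    """Refresh before expiry; on failure keep the old value a bit longer."""

    def __init__(self, fetch, clock=time.monotonic):
        self._fetch = fetch        # callable() -> value; raises if unreachable
        self._clock = clock
        self._value = None
        self._refresh_at = None    # None means never populated

    def get(self):
        now = self._clock()
        has_value = self._refresh_at is not None
        if has_value and now < self._refresh_at:
            return self._value     # not yet due for a refresh
        try:
            self._value = self._fetch()
            self._refresh_at = now + TTL - REFRESH_AHEAD
        except Exception:
            if not has_value:
                raise              # nothing old to fall back on
            # Service unreachable: keep the old value, retry after STALE_TTL.
            self._refresh_at = now + STALE_TTL
        return self._value
```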

awight claimed this task.