Rate limit thresholds requests when the service is down
Closed, Resolved · Public

Assigned To

Authored By

	awight
	Nov 28 2017, 9:38 PM

Description

Currently, when a cached threshold expires, we keep retrying the service until a good response is received. On pages like Special:RecentChanges, this amounts to hundreds of requests per second. This traffic is potentially exacerbating overload conditions.

We should cache the failure with a short expiry, such as one minute, and not retry until that cached failure expires.
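A minimal sketch of the proposed failure caching, written in Python for illustration only (the extension itself is PHP and uses MediaWiki's cache layer). The class name, the `fetch` callable, the in-memory dict, and the one-day success TTL are all assumptions; only the one-minute failure expiry and the per-(wiki, model) keying come from this task.

```python
import time

FAILURE_TTL = 60.0      # seconds; the "one minute" suggested in this task
SUCCESS_TTL = 86400.0   # seconds; an arbitrary placeholder, not from the task

_FAILED = object()      # sentinel stored in place of a value when the fetch failed


class ThresholdsCache:
    """Hypothetical in-memory cache keyed by (wiki, model)."""

    def __init__(self, fetch, clock=time.monotonic):
        self._fetch = fetch    # callable(wiki, model) -> thresholds; may raise
        self._clock = clock
        self._store = {}       # (wiki, model) -> (value_or_sentinel, expires_at)

    def get(self, wiki, model):
        key = (wiki, model)
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now < entry[1]:
            # Cache hit. A cached failure also counts as a hit, so the
            # service is not contacted again until the entry expires.
            return None if entry[0] is _FAILED else entry[0]
        try:
            value, ttl = self._fetch(wiki, model), SUCCESS_TTL
        except Exception:
            # Remember the failure briefly instead of retrying on every call.
            value, ttl = _FAILED, FAILURE_TTL
        self._store[key] = (value, now + ttl)
        return None if value is _FAILED else value
```

With this shape, a burst of page views during an outage costs at most one service request per minute per (wiki, model) pair; callers that receive `None` would need their own fallback.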

Details

Subject	Repo	Branch	Lines +/-
Rate limit thresholds failures to once per (minute x model x wiki)	mediawiki/extensions/ORES	master	+17 -15
Rate limit thresholds failures to once per (minute x model x wiki)	mediawiki/extensions/ORES	wmf/1.31.0-wmf.10	+17 -15
Cache anti-stampede improvements	mediawiki/extensions/ORES	master	+7 -1
Rate limit thresholds failures to once per (minute x model x wiki)	mediawiki/extensions/ORES	wmf/1.31.0-wmf.8	+17 -15


Related Objects

Status	Assigned	Task
Resolved	None	T181538 ORES overload incident, 2017-11-28
Resolved	awight	T181567 Rate limit thresholds requests when the service is down
Declined	None	T182256 Clean up ORES thresholds cache: pre-emptively check before expiry

Event Timeline

awight created this task. Nov 28 2017, 9:38 PM

Restricted Application added a subscriber: Aklapper. Nov 28 2017, 9:38 PM

We aren't seeing these huge floods of failed thresholds requests in the metrics, because we're falling back to the old-style thresholds code which is missing the metrics calls.

https://grafana.wikimedia.org/dashboard/db/ores-extension?orgId=1&panelId=6&fullscreen&edit&from=now-30d&to=now

awight merged a task: T181534: Why is nlwiki requesting the old-style thresholds API?. Nov 28 2017, 10:10 PM

ores_test_stat_correlation.png (778×1 px, 113 KB)

Change 393922 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/ORES@master] Rate limit thresholds failures to once per (minute x model x wiki)

https://gerrit.wikimedia.org/r/393922

gerritbot added a project: Patch-For-Review. Nov 28 2017, 10:11 PM

Halfak mentioned this in T181538: ORES overload incident, 2017-11-28. Nov 28 2017, 11:08 PM

Halfak added a parent task: T181538: ORES overload incident, 2017-11-28.

Change 393922 merged by jenkins-bot:
[mediawiki/extensions/ORES@master] Rate limit thresholds failures to once per (minute x model x wiki)

https://gerrit.wikimedia.org/r/393922

Change 393945 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/ORES@master] Cache anti-stampede improvements

https://gerrit.wikimedia.org/r/393945

ReleaseTaggerBot added a project: MW-1.31-release-notes (WMF-deploy-2017-12-05 (1.31.0-wmf.11)). Nov 29 2017, 12:01 AM

Change 393960 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/ORES@wmf/1.31.0-wmf.8] Rate limit thresholds failures to once per (minute x model x wiki)

https://gerrit.wikimedia.org/r/393960

Change 393960 merged by jenkins-bot:
[mediawiki/extensions/ORES@wmf/1.31.0-wmf.8] Rate limit thresholds failures to once per (minute x model x wiki)

https://gerrit.wikimedia.org/r/393960

Mentioned in SAL (#wikimedia-operations) [2017-11-29T00:32:18Z] <awight@tin> Synchronized php-1.31.0-wmf.8/extensions/ORES: Hotfix to mitigate cache stampeding, T181567 (duration: 00m 50s)

Change 393966 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/ORES@wmf/1.31.0-wmf.10] Rate limit thresholds failures to once per (minute x model x wiki)

https://gerrit.wikimedia.org/r/393966

Change 393945 merged by jenkins-bot:
[mediawiki/extensions/ORES@master] Cache anti-stampede improvements

https://gerrit.wikimedia.org/r/393945

Change 393966 merged by jenkins-bot:
[mediawiki/extensions/ORES@wmf/1.31.0-wmf.10] Rate limit thresholds failures to once per (minute x model x wiki)

https://gerrit.wikimedia.org/r/393966

Mentioned in SAL (#wikimedia-operations) [2017-11-29T00:44:18Z] <awight@tin> Synchronized php-1.31.0-wmf.10/extensions/ORES: Hotfix to mitigate cache stampeding, T181567 (duration: 00m 49s)

We still need to better understand and test https://gerrit.wikimedia.org/r/#/c/393945/ before letting it reach production.

ReleaseTaggerBot edited projects, added MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)); removed MW-1.31-release-notes (WMF-deploy-2017-12-05 (1.31.0-wmf.11)). Nov 29 2017, 1:01 AM

So, my understanding is that lockTSE puts a mutex around the get and set within getWithSetCallback, so that multiple threads don’t try to fetch new thresholds at the same time. I set this to an arbitrary 10 seconds; is that reasonable? pcTTL keeps a cached copy in memory so that we don’t recalculate multiple times in one request, because the cache fetch comes from a replica.
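To make the anti-stampede idea concrete, here is a rough single-process analogy in Python. It is not WANObjectCache and does not reproduce lockTSE or pcTTL semantics; the class name, the `regenerate` callable, and the use of `threading.Lock` are assumptions chosen only to show "one caller regenerates, everyone else reuses what is already cached".

```python
import threading
import time


class SingleFlightCache:
    """Hypothetical cache where only one caller at a time regenerates."""

    def __init__(self, regenerate, ttl, clock=time.monotonic):
        self._regenerate = regenerate   # callable() -> fresh value
        self._ttl = ttl                 # seconds a value stays fresh
        self._clock = clock
        self._lock = threading.Lock()   # guards regeneration only
        self._value = None
        self._expires_at = None         # None means "never populated"

    def get(self):
        if self._expires_at is not None and self._clock() < self._expires_at:
            return self._value          # fresh hit, no lock needed
        # Expired or empty: try to become the single regenerating caller.
        if self._lock.acquire(blocking=False):
            try:
                self._value = self._regenerate()
                self._expires_at = self._clock() + self._ttl
            finally:
                self._lock.release()
            return self._value
        # Someone else is already regenerating: return the stale value
        # (None if nothing was ever cached) instead of hitting the service.
        return self._value
```

The design choice illustrated is that a caller who loses the race never waits and never contacts the backend, so an expiry under heavy traffic produces one backend request rather than one per caller.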

@Legoktm @Krinkle would you mind confirming my understanding, and giving the patch a +1?

I’d also like to do something Krinkle suggested, where we check the cached value before it expires, and if the service is unreachable, put the old value back into the cache for some shorter TTL. This can be future work.
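One possible shape for that future work, sketched in Python purely as an illustration (this was not implemented here, and the follow-up task was later declined). The class name, the `fetch` callable, and all three durations are assumptions; the sketch refreshes shortly before the nominal expiry and, if the service is unreachable, keeps serving the old value and schedules the next attempt after a shorter interval.

```python
import time


class StaleOnErrorCache:
    """Hypothetical cache that refreshes early and keeps stale data on failure."""

    def __init__(self, fetch, ttl=3600.0, refresh_margin=300.0,
                 retry_ttl=60.0, clock=time.monotonic):
        self._fetch = fetch                  # callable() -> value; may raise
        self._ttl = ttl                      # nominal lifetime of a good value
        self._refresh_margin = refresh_margin  # how early to attempt a refresh
        self._retry_ttl = retry_ttl          # shorter lifetime after a failure
        self._clock = clock
        self._value = None
        self._next_check = None              # None means "nothing cached yet"

    def get(self):
        now = self._clock()
        if self._next_check is not None and now < self._next_check:
            return self._value               # still inside the fresh window
        try:
            value = self._fetch()
        except Exception:
            if self._next_check is None:
                raise                        # no old value to fall back on
            # Service unreachable: keep the old value for a shorter period.
            self._next_check = now + self._retry_ttl
            return self._value
        self._value = value
        # Schedule the next refresh attempt before the nominal expiry.
        self._next_check = now + self._ttl - self._refresh_margin
        return value
```

This sketch does not model a hard expiry after which the stale value must be dropped; a real implementation would have to decide how long an old threshold remains acceptable.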

awight closed this task as Resolved. Dec 6 2017, 9:47 PM

awight claimed this task.

awight edited projects, added Wikimedia-Incident, MediaWiki-extensions-ORES; removed Patch-For-Review.

awight moved this task from Active investigation to Active Situation on the Wikimedia-Incident board. Dec 6 2017, 9:53 PM

Mholloway mentioned this in T146933: MCS endpoint checks timing out / flapping in production again. Dec 8 2017, 5:24 PM

elukey closed subtask T182256: Clean up ORES thresholds cache: pre-emptively check before expiry as Declined. May 29 2023, 9:26 AM

isarantopoulos moved this task from Unsorted to 2023-2024 Q3 Done on the Machine-Learning-Team board. Nov 20 2023, 12:18 PM