Page MenuHomePhabricator

Memcached keys sometimes outlive their TTL (affects MW rate limit)
Closed, ResolvedPublic

Description

See logstash, specifically the last one: it reports that the ratelimiter of 5.169.150.213 reached 17 (limit is 8) in 60 seconds. However, their contributions tell a different story.

Apparently, there are no SpamBlacklist or TitleBlacklist entries for the user, so that's not a possible explanation.

Impact

Some or all users are sometimes or always unable to make multiple edits in an unknown span of time. It would appear the rate limit functionality is loosing track of time. As such, it is observed numerous times that edit attempst by end users are rejected when in fact they have (far?) fewer edits and only over a long period of time.

Root cause as of yet is unknown. It could be in the Core ping limiter logic, it could be at a higher level in the API/Revision/PageUpdater layer, or it could be at a lower level e.g. in the BagOStuff abstraction, or perhaps lower still e.g. in Memcached itself and/or atomic particles somewhere in-between influenced by solar rays.

TBD.

Event Timeline

This was discussed on IRC with @Urbanecm. We determined that captchas might have played a role (the user got several captchas), but it still doesn't explain how they could reach 8 total edits. It's interesting to note that the hit count was never reset for at least 6 minutes, which lead us to think that this might have to do with a rogue cache key.

Adding cache subproject, so watchers can comment on Daimona's above comment.

@Urbanecm I'm trying to triage this on the MediaWiki-Cache board, but it's not clear to me what the suspected bug or general question is here with regards to the objectcache library, its BagOStuff classes, or another cache components in MW?

Hi @Krinkle, as you can see in Daimona's LogStash link, the editcount (stored in ObjectCache::getLocalClusterInstance()1) wasn't reset for at least 6 minutes. That suspects some part of MW's cache didn't remove the key, even the key should live only 60 seconds. This could be a bug in whatever part of MW purges cache keys, or, theoretically, by some weird way, the number read from config is not config. I suspect the former, but who knows :). I hope this helps.

Yeah, well, I think this could've been caused either by MW not setting the right expiry, or memcached not evicting the key.

Krinkle renamed this task from Apparently incorrect throttler count for a specific user to Memcached keys sometimes outlive their TTL (affects MW rate limit).Apr 13 2020, 4:27 PM
Krinkle moved this task from Untriaged to libs/objectcache on the MediaWiki-Cache board.

Thanks, I've updated the task to summarise this observation. It makes sense under lib-objectcache indeed.

Change 602935 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] User: Fix pingLimiter() to use makeGlobalKey() for global rate limits

https://gerrit.wikimedia.org/r/602935

Change 602936 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] User: Move per-user logic in pingLimiter() closer together

https://gerrit.wikimedia.org/r/602936

The above two patches do not fix this issue, but they helped me better understand this code while investiging this bug hence tracking the relevant work here. I don't have an answer yet for why this is happening, or how to solve it.

Change 602935 merged by jenkins-bot:
[mediawiki/core@master] User: Fix pingLimiter() to use makeGlobalKey() for global rate limits

https://gerrit.wikimedia.org/r/602935

Change 602936 merged by jenkins-bot:
[mediawiki/core@master] User: Move per-user logic in pingLimiter() closer together

https://gerrit.wikimedia.org/r/602936

Krinkle triaged this task as High priority.Jul 8 2020, 4:28 PM

The above clean up was motivated by an attempt find the cause of this bug. However, they do not resolve this issue and I also have not uncovered any new information.

Can we include the expiry in the payload, and then just ignore any stale limiter info we may get from the cache?

Reedy renamed this task from Memcached keys sometimes outlive their TTL (affects MW rate limit) to Memcached keys sometimes outlive their TTL (affects MW rate limit).Aug 18 2020, 9:31 PM

That could work, although I'd use it as an investigative measure to log a warning in the ratelimit channel with trace for debugging; not as the general recommended best practice for Memcached in general.

Change 622541 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] pingLimiter: enforce expiry time

https://gerrit.wikimedia.org/r/622541

The reported behavior would also be consistent with a "creeping counter": if incrWithInit() applied the new expiry upon increment, rather than just on init, the TLL would be getting bumped forward, and the counter would keep increasing. I spend a couple of hours investigating this possibility, but the various BagOStuff implementations however seem to handle this correctly. I wasn't able to create "creeping counter" behavior locally.

Change 624074 had a related patch set uploaded (by Reedy; owner: Krinkle):
[mediawiki/core@REL1_34] User: Fix pingLimiter() to use makeGlobalKey() for global rate limits

https://gerrit.wikimedia.org/r/624074

Change 624075 had a related patch set uploaded (by Reedy; owner: Krinkle):
[mediawiki/core@REL1_34] User: Move per-user logic in pingLimiter() closer together

https://gerrit.wikimedia.org/r/624075

Change 624083 had a related patch set uploaded (by Reedy; owner: Krinkle):
[mediawiki/core@REL1_31] User: Fix pingLimiter() to use makeGlobalKey() for global rate limits

https://gerrit.wikimedia.org/r/624083

Change 624084 had a related patch set uploaded (by Reedy; owner: Krinkle):
[mediawiki/core@REL1_31] User: Move per-user logic in pingLimiter() closer together

https://gerrit.wikimedia.org/r/624084

Change 624074 merged by jenkins-bot:
[mediawiki/core@REL1_34] User: Fix pingLimiter() to use makeGlobalKey() for global rate limits

https://gerrit.wikimedia.org/r/624074

Change 624075 merged by jenkins-bot:
[mediawiki/core@REL1_34] User: Move per-user logic in pingLimiter() closer together

https://gerrit.wikimedia.org/r/624075

Change 624083 merged by jenkins-bot:
[mediawiki/core@REL1_31] User: Fix pingLimiter() to use makeGlobalKey() for global rate limits

https://gerrit.wikimedia.org/r/624083

Change 624084 merged by jenkins-bot:
[mediawiki/core@REL1_31] User: Move per-user logic in pingLimiter() closer together

https://gerrit.wikimedia.org/r/624084

Seems complete from my side. Can this be closed, or is it needed for backports?

Change 623901 had a related patch set uploaded (by Reedy; owner: Daniel Kinzler):
[mediawiki/core@REL1_35] User: enforce pingLimiter() expiry time

https://gerrit.wikimedia.org/r/623901

Change 622541 merged by jenkins-bot:
[mediawiki/core@master] User: enforce pingLimiter() expiry time

https://gerrit.wikimedia.org/r/622541

Change 624115 had a related patch set uploaded (by Reedy; owner: Daniel Kinzler):
[mediawiki/core@REL1_34] User: enforce pingLimiter() expiry time

https://gerrit.wikimedia.org/r/624115

Change 624121 had a related patch set uploaded (by Reedy; owner: Daniel Kinzler):
[mediawiki/core@REL1_31] User: enforce pingLimiter() expiry time

https://gerrit.wikimedia.org/r/624121

Change 623901 merged by jenkins-bot:
[mediawiki/core@REL1_35] User: enforce pingLimiter() expiry time

https://gerrit.wikimedia.org/r/623901

Change 624115 merged by jenkins-bot:
[mediawiki/core@REL1_34] User: enforce pingLimiter() expiry time

https://gerrit.wikimedia.org/r/624115

Change 624121 merged by jenkins-bot:
[mediawiki/core@REL1_31] User: enforce pingLimiter() expiry time

https://gerrit.wikimedia.org/r/624121