
Memcached keys sometimes outlive their TTL (affects MW rate limit)
Open, High, Public

Description

See Logstash, specifically the last entry: it reports that the rate limiter for 5.169.150.213 reached 17 (the limit is 8) within 60 seconds. However, the user's contributions tell a different story.

Apparently, there are no SpamBlacklist or TitleBlacklist entries for the user, so that's not a possible explanation.

Impact

Some or all users are intermittently unable to make multiple edits within the expected time window. It appears the rate-limit functionality is losing track of time: edit attempts by end users are observed to be rejected repeatedly when in fact they have made (far?) fewer edits, and only over a longer period of time.

The root cause is as yet unknown. It could be in the core ping-limiter logic, it could be at a higher level in the API/Revision/PageUpdater layer, or it could be at a lower level, e.g. in the BagOStuff abstraction, or perhaps lower still, e.g. in Memcached itself and/or atomic particles somewhere in between influenced by solar rays.

TBD.
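For context, a minimal sketch of the pattern under suspicion; this is not the actual User::pingLimiter() code, and the key layout below is an assumption for illustration only. The general idea is a counter stored in the local-cluster BagOStuff with a short TTL that Memcached is expected to honour.

```php
<?php
// Simplified, illustrative sketch only; not the actual User::pingLimiter() code.
// It shows the general pattern under suspicion: a counter stored in the
// local-cluster BagOStuff with a short TTL that Memcached is expected to honour.

$cache = ObjectCache::getLocalClusterInstance();

// Hypothetical key layout; the real keys are built inside pingLimiter().
$key = $cache->makeKey( 'limiter', 'edit', 'ip', '5.169.150.213' );

$limit  = 8;   // maximum number of actions ...
$period = 60;  // ... within this many seconds

$count = $cache->get( $key );
if ( $count === false ) {
	// First action in this window: create the counter, using the period as TTL.
	$cache->set( $key, 1, $period );
} else {
	// Subsequent action: bump the counter. The TTL set at creation is not extended,
	// so the whole window should end at most $period seconds after the first hit.
	$cache->incr( $key );
	if ( $count + 1 > $limit ) {
		// Throttled. If the key outlives its 60-second TTL, this branch fires even
		// for users who edit slowly, which matches the reports in this task.
	}
}
```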

Event Timeline

Daimona created this task. Mar 5 2020, 2:36 PM
Restricted Application added a subscriber: Aklapper. Mar 5 2020, 2:36 PM
Daimona updated the task description. Mar 5 2020, 2:59 PM
Daimona updated the task description. Mar 5 2020, 3:10 PM

This was discussed on IRC with @Urbanecm. We determined that captchas might have played a role (the user got several captchas), but that still doesn't explain how they could reach 8 total edits. It's interesting to note that the hit count was never reset for at least 6 minutes, which led us to think that this might have to do with a rogue cache key.

sbassett moved this task from Incoming to Watching on the Security-Team board.

Adding the cache subproject, so that watchers can weigh in on Daimona's comment above.

@Urbanecm I'm trying to triage this on the MediaWiki-Cache board, but it's not clear to me what the suspected bug or general question is here with regard to the objectcache library, its BagOStuff classes, or other cache components in MW.

Hi @Krinkle, as you can see in Daimona's Logstash link, the edit count (stored in ObjectCache::getLocalClusterInstance()) wasn't reset for at least 6 minutes. That suggests some part of MW's cache didn't remove the key, even though the key should live for only 60 seconds. This could be a bug in whatever part of MW expires cache keys, or, theoretically, in some weird way, the number read from the config is not what the config actually says. I suspect the former, but who knows :). I hope this helps.

Yeah, well, I think this could've been caused either by MW not setting the right expiry, or memcached not evicting the key.
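One way to tell those two apart would be to write an unrelated key with a 60-second TTL into the same cache and watch when it actually disappears. A rough diagnostic sketch, assuming it is run inside a MediaWiki context (e.g. via maintenance/eval.php) and with an invented probe key name:

```php
<?php
// Rough diagnostic sketch, assuming it is run inside a MediaWiki context
// (e.g. pasted into maintenance/eval.php); the probe key name is invented.
// It writes a value with a 60-second TTL into the same local-cluster cache
// and polls it, to check whether TTL expiry works independently of the
// rate-limiter code path.

$cache = ObjectCache::getLocalClusterInstance();
$key = $cache->makeKey( 'ttl-probe', wfTimestampNow() );

$cache->set( $key, 1, 60 );

for ( $elapsed = 0; $elapsed <= 120; $elapsed += 15 ) {
	$value = $cache->get( $key );
	print "t+{$elapsed}s: " . ( $value === false ? 'expired' : 'still present' ) . "\n";
	sleep( 15 );
}

// If the key is still readable well past the 60-second mark, the problem is
// below the rate limiter (the BagOStuff driver or Memcached itself); if it
// expires on time, suspicion shifts back to how pingLimiter() sets its expiry.
```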

Krinkle renamed this task from Apparently incorrect throttler count for a specific user to Memcached keys sometimes outlive their TTL (affects MW rate limit). Apr 13 2020, 4:27 PM
Krinkle moved this task from Untriaged to libs/objectcache on the MediaWiki-Cache board.

Thanks, I've updated the task to summarise this observation. It indeed makes sense under libs/objectcache.

Change 602935 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] User: Fix pingLimiter() to use makeGlobalKey() for global rate limits

https://gerrit.wikimedia.org/r/602935
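For readers unfamiliar with the distinction this patch addresses, a rough illustration follows; the key components are hypothetical and the exact key strings depend on the configured keyspace.

```php
<?php
// Illustration of the makeKey() vs. makeGlobalKey() distinction (the key
// components are hypothetical, and the resulting strings depend on the
// configured keyspace). makeKey() prefixes the key with the current wiki's ID,
// so each wiki gets its own counter; makeGlobalKey() uses a shared keyspace
// instead, which is what a rate limit meant to apply across all wikis needs.

$cache = ObjectCache::getLocalClusterInstance();

$perWikiKey = $cache->makeKey( 'limiter', 'edit', 'ip', '5.169.150.213' );
// e.g. "enwiki:limiter:edit:ip:5.169.150.213"

$globalKey = $cache->makeGlobalKey( 'limiter', 'edit', 'ip', '5.169.150.213' );
// e.g. "global:limiter:edit:ip:5.169.150.213"
```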

Change 602936 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] User: Move per-user logic in pingLimiter() closer together

https://gerrit.wikimedia.org/r/602936

The above two patches do not fix this issue, but they helped me better understand this code while investigating this bug, hence I'm tracking the relevant work here. I don't yet have an answer for why this is happening, or how to solve it.

Change 602935 merged by jenkins-bot:
[mediawiki/core@master] User: Fix pingLimiter() to use makeGlobalKey() for global rate limits

https://gerrit.wikimedia.org/r/602935

Change 602936 merged by jenkins-bot:
[mediawiki/core@master] User: Move per-user logic in pingLimiter() closer together

https://gerrit.wikimedia.org/r/602936

Krinkle triaged this task as High priority. Wed, Jul 8, 4:28 PM

The above clean-up was motivated by an attempt to find the cause of this bug. However, those patches do not resolve the issue, and I have not uncovered any new information.

Krinkle updated the task description.