
Uncached wiki requests partially unavailable due to excessive request rates from a bot
Closed, Resolved, Public

Description

From approximately 07:31 to 08:16 UTC, there was instability serving uncached wiki content (mostly affecting authenticated users and bots, as well as POST requests). This hit commonswiki first, but later most other wikis as well, albeit at a lower rate.

Between those timestamps the impact was increased latency (up to 5+ seconds to return results) and a high level of errors (approximately a quarter of regular requests lost). This was due to an overload of commonswiki requests blocking resources on most API and database servers, which in turn created contention on most other wikis.

The issue was narrowed down to being triggered by SchlurcherBot on commonswiki, which was blocked from editing and from sending requests to Wikimedia sites at 07:49 and 08:08 UTC, respectively.


https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-04-15_appserver_latency

Details

Other Assignee
Reedy

Event Timeline

Aklapper renamed this task from Uncached wiki requests partially unavailable due to excesive request rates from a bot to Uncached wiki requests partially unavailable due to excessive request rates from a bot.Apr 15 2021, 9:55 AM
Aklapper added a subscriber: Schlurcher.

Heads-up to @Schlurcher

The requests the bot made are not that bad, but they probably hit a terrible regression in wmf.1 (action=purge format=json forcelinkupdate= pageids=33330972 in the API):
https://performance.wikimedia.org/xhgui/run/view?id=60780f591e29d3593756eaae

image.png (684×444 px, 44 KB)

The autoloader shouldn't take one second.

This might be the culprit: https://github.com/wikimedia/mediawiki/commit/9881e82ef7fa7a3a8564fae0d6bb557019c9ca32
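
For reference, a rough sketch of the kind of purge request being profiled above, using the parameters quoted in that comment (the User-Agent value is a placeholder, not the bot's real one):

```
import requests

# Rough reconstruction of the profiled request: action=purge must be sent
# as a POST. forcelinkupdate is present but empty, as in the profiled URL;
# the Action API treats the mere presence of the parameter as "true".
API_URL = "https://commons.wikimedia.org/w/api.php"

resp = requests.post(
    API_URL,
    data={
        "action": "purge",
        "format": "json",
        "forcelinkupdate": "",
        "pageids": "33330972",
    },
    headers={"User-Agent": "ExampleBot/0.1 (placeholder contact)"},
)
print(resp.json())
```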

> The requests the bot made are not that bad, but they probably hit a terrible regression in wmf.1 (action=purge format=json forcelinkupdate= pageids=33330972 in the API):
> https://performance.wikimedia.org/xhgui/run/view?id=60780f591e29d3593756eaae
>
> image.png (684×444 px, 44 KB)
>
> The autoloader shouldn't take one second.

Indeed, that's very worrying.

> This might be the culprit: https://github.com/wikimedia/mediawiki/commit/9881e82ef7fa7a3a8564fae0d6bb557019c9ca32

That shipped first in 1.36.0-wmf.36; is it possible we didn't notice for four weeks?

> The requests the bot made are not that bad, but they probably hit a terrible regression in wmf.1 (action=purge format=json forcelinkupdate= pageids=33330972 in the API):
> https://performance.wikimedia.org/xhgui/run/view?id=60780f591e29d3593756eaae
>
> image.png (684×444 px, 44 KB)
>
> The autoloader shouldn't take one second.
>
> This might be the culprit: https://github.com/wikimedia/mediawiki/commit/9881e82ef7fa7a3a8564fae0d6bb557019c9ca32

The AutoLoader has been slow since the Buster migration; see e.g. T274041: Reduce performance impact of HookRunner.php loading 500+ interfaces. I doubt that micro-optimization patch had any significant effect in either direction.

Also, the AutoLoader being slow doesn't really explain the lock contention, or why this appeared suddenly last night despite the bot having been running for quite a while.

There was one thing I missed in my comment on the autoloader: Joe pointed out in IRC that I ran the test on mwdebug (where you can use xhgui), and mwdebug is a VM with a different opcache setup (probably cold most of the time) than production. So the situation in production is better. Of course, we might still need to improve things, but that's another topic not related to this issue. Sorry, my bad.

Thanks for adding me as a subscriber, as my bot apparently caused this issue. The bot has been performing these actions at an edit rate of ~60 edits per minute, more or less, over the last year. The last script change occurred roughly a week ago, so this course of events caught me off guard. Could someone please explain to me in more layman's terms what the underlying issue is and how I can help to address it? From my talk page, a suggestion was to check maxlag. I am checking maxlag, and at the time I ultimately shut down the bot, maxlag events were being triggered on all concurrent workers.
Please also understand why the bot is sending so many purge requests. The bot is currently adding structured data to Commons based on tracking categories (see e.g. https://commons.wikimedia.org/wiki/Category:Pages_with_local_camera_coordinates_and_missing_SDC_coordinates). However, only after a purge do the categories get refreshed and properly removed after adding the missing structured data, so Multichill asked me to perform a purge after each structured data edit. Please advise if there is a better way of handling this.
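
To make the workflow described above concrete, here is a minimal sketch of the loop, assuming an authenticated requests.Session; the structured-data edit itself is left as a stub (add_sdc_coordinates is a hypothetical name, not the bot's actual code):

```
import requests

API_URL = "https://commons.wikimedia.org/w/api.php"
TRACKING_CATEGORY = "Category:Pages with local camera coordinates and missing SDC coordinates"


def add_sdc_coordinates(session, pageid):
    # Hypothetical stub for the structured-data edit (e.g. an edit to the
    # file's MediaInfo entity); not shown here.
    raise NotImplementedError


def process_tracking_category(session):
    # 1. List file pages currently in the tracking category.
    members = session.get(API_URL, params={
        "action": "query",
        "list": "categorymembers",
        "cmtitle": TRACKING_CATEGORY,
        "cmnamespace": 6,      # File: namespace
        "cmlimit": 50,
        "format": "json",
    }).json()["query"]["categorymembers"]

    for page in members:
        # 2. Add the missing structured data.
        add_sdc_coordinates(session, page["pageid"])

        # 3. Purge with forcelinkupdate so the tracking category is
        #    refreshed and the page drops out of it (see T237991).
        session.post(API_URL, data={
            "action": "purge",
            "forcelinkupdate": 1,
            "pageids": page["pageid"],
            "format": "json",
        })
```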

> Thanks for adding me as a subscriber, as my bot apparently caused this issue. The bot has been performing these actions at an edit rate of ~60 edits per minute, more or less, over the last year. The last script change occurred roughly a week ago, so this course of events caught me off guard. Could someone please explain to me in more layman's terms what the underlying issue is and how I can help to address it?

My best guess is that something changed in MediaWiki, probably with 1.37.0-wmf.1, that made the purge requests you were making more expensive and slower. It then took a few hours with the new code for this to turn into database problems, causing an incident. In general, I think we should be able to support the rate at which you were making requests.

> From my talk page, a suggestion was to check maxlag. I am checking maxlag, and at the time I ultimately shut down the bot, maxlag events were being triggered on all concurrent workers.

Can you double check that maxlag is set on the purge requests? That's what I saw missing in the logs. But tbh, I don't even know whether maxlag would have helped in this case.
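
For context, maxlag is just another request parameter: when database replication lag exceeds the given value, the API refuses the request with a maxlag error and a Retry-After header, and the client is expected to wait and retry. A minimal sketch of a purge that honours this (values are illustrative, not the bot's actual code):

```
import time
import requests

API_URL = "https://commons.wikimedia.org/w/api.php"


def purge_with_maxlag(session, pageid, maxlag=5):
    # Keep retrying the purge, backing off whenever the API reports
    # that the databases are lagged.
    while True:
        resp = session.post(API_URL, data={
            "action": "purge",
            "forcelinkupdate": 1,
            "pageids": pageid,
            "maxlag": maxlag,
            "format": "json",
        })
        data = resp.json()
        if data.get("error", {}).get("code") != "maxlag":
            return data
        # The server suggests how long to wait via the Retry-After header.
        time.sleep(int(resp.headers.get("Retry-After", 5)))
```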

> Please also understand why the bot is sending so many purge requests. The bot is currently adding structured data to Commons based on tracking categories (see e.g. https://commons.wikimedia.org/wiki/Category:Pages_with_local_camera_coordinates_and_missing_SDC_coordinates). However, only after a purge do the categories get refreshed and properly removed after adding the missing structured data, so Multichill asked me to perform a purge after each structured data edit. Please advise if there is a better way of handling this.

To be clear, you're editing a page and then immediately purging it? Also, it seems like a pretty clear bug in MediaWiki to me if editing the page doesn't properly refresh categories (but that's a separate issue).

>> From my talk page, a suggestion was to check maxlag. I am checking maxlag, and at the time I ultimately shut down the bot, maxlag events were being triggered on all concurrent workers.

> Can you double check that maxlag is set on the purge requests? That's what I saw missing in the logs. But tbh, I don't even know whether maxlag would have helped in this case.

On 2021-04-04, I moved the purge requests from the bash part of my script to the Python part. I dropped the ball on the maxlag parameter there; I have now updated this. While I was at it, I also added a check for permissiondenied, so the purge request is not executed if the SDC edit fails for missing permissions. This might have been why a block of my bot was not sufficient and my whole IP address was blocked in the end.
My bad, but this still only partly explains why this became such a big issue three weeks later.
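
A sketch of the guard described above, reusing purge_with_maxlag from the earlier sketch; make_sdc_edit is a hypothetical helper assumed to return the decoded API response of the structured-data edit:

```
def edit_then_purge(session, pageid):
    result = make_sdc_edit(session, pageid)   # hypothetical SDC edit helper

    # If the edit itself was rejected (e.g. permissiondenied while the bot
    # account is blocked), skip the purge instead of purging anyway.
    if result.get("error", {}).get("code") == "permissiondenied":
        return None

    return purge_with_maxlag(session, pageid)
```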

>> Please also understand why the bot is sending so many purge requests. The bot is currently adding structured data to Commons based on tracking categories (see e.g. https://commons.wikimedia.org/wiki/Category:Pages_with_local_camera_coordinates_and_missing_SDC_coordinates). However, only after a purge do the categories get refreshed and properly removed after adding the missing structured data, so Multichill asked me to perform a purge after each structured data edit. Please advise if there is a better way of handling this.

> To be clear, you're editing a page and then immediately purging it? Also, it seems like a pretty clear bug in MediaWiki to me if editing the page doesn't properly refresh categories (but that's a separate issue).

Yes, that is what all bots do that systematically add structured data components to Commons. Note that I am not doing this for wikitext edits. The behavior is tracked in T237991.

My bot is still blocked on Commons. Please advise when and how I can return to editing.

Hi all, I have resumed my bot edits on Commons at 1/10th of the edit rate that led to this. I made sure that the code includes maxlag checks for each purge request and other API calls, as applicable. Is there a way for me to monitor whether this will lead to issues again? Please let me know. Please also confirm once the underlying issue is resolved.

Could someone please answer Schlurcher's question? Thanks in advance.

Adding the current SRE clinic duty person, @Dzahn. My guess is that someone from Service Ops should follow up, but that's for him to decide.

SRE folks: Six months later, is there more to do here in this task?

jcrespo claimed this task.
jcrespo updated Other Assignee, added: Reedy.

I don't think it is worth keeping this open anymore. There is indeed a need to review it and produce a better description, but hopefully that will be picked up by the incident review process that is kicking off now.

AFAICS, the actual actionables were put in place, but they have not yet been made public, so the only missing piece is reviewing the information made public about the incident (improving the documentation). In general, better incident documentation is a common "bug" that is being acted on across several tickets at once at T292254.