
Uncached wiki requests partially unavailable due to excessive request rates from a bot
Closed, Resolved, Public

Description

From approximately 07:31 to 08:16 UTC, there was instability serving uncached wiki content (mostly affecting authenticated users and bots, as well as POST requests). This hit commonswiki first, but later most other wikis as well, albeit at a lower rate.

Between those timestamps the impact was increased latency (up to 5+ seconds to return results) and a high level of errors (approximately a quarter of regular requests lost). This was due to an overload of commonswiki requests blocking resources on most API and database servers, which in turn created contention on most other wikis.

The issue was narrowed down to being triggered by SchlurcherBot on commonswiki, which was blocked from editing and from sending requests to Wikimedia sites at 07:49 and 08:08 UTC, respectively.


https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-04-15_appserver_latency

Details

Other Assignee
Reedy

Event Timeline

Aklapper renamed this task from Uncached wiki requests partially unavailable due to excesive request rates from a bot to Uncached wiki requests partially unavailable due to excessive request rates from a bot.Apr 15 2021, 9:55 AM
Aklapper added a subscriber: Schlurcher.

Heads-up to @Schlurcher

The requests the bot made are not that bad, but they probably hit a terrible regression in wmf.1 (action=purge format=json forcelinkupdate= pageids=33330972 in the API):
https://performance.wikimedia.org/xhgui/run/view?id=60780f591e29d3593756eaae

image.png (684×444 px, 44 KB)

The autoloader shouldn't take one second.

This might be the culprit: https://github.com/wikimedia/mediawiki/commit/9881e82ef7fa7a3a8564fae0d6bb557019c9ca32
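
For reference, a rough sketch of the kind of purge request being profiled above, using the parameters quoted in that comment (the User-Agent value is a placeholder, not the bot's real one):

```
import requests

# Rough reconstruction of the profiled request: action=purge must be sent
# as a POST. forcelinkupdate is present but empty, as in the profiled URL;
# the Action API treats the mere presence of the parameter as "true".
API_URL = "https://commons.wikimedia.org/w/api.php"

resp = requests.post(
    API_URL,
    data={
        "action": "purge",
        "format": "json",
        "forcelinkupdate": "",
        "pageids": "33330972",
    },
    headers={"User-Agent": "ExampleBot/0.1 (placeholder contact)"},
)
print(resp.json())
```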

> The requests the bot made are not that bad, but they probably hit a terrible regression in wmf.1 (action=purge format=json forcelinkupdate= pageids=33330972 in the API):
> https://performance.wikimedia.org/xhgui/run/view?id=60780f591e29d3593756eaae
>
> image.png (684×444 px, 44 KB)
>
> The autoloader shouldn't take one second.

Indeed, that's very worrying.

> This might be the culprit: https://github.com/wikimedia/mediawiki/commit/9881e82ef7fa7a3a8564fae0d6bb557019c9ca32

That shipped first in 1.36.0-wmf.36; is it possible we didn't notice for four weeks?

> The requests the bot made are not that bad, but they probably hit a terrible regression in wmf.1 (action=purge format=json forcelinkupdate= pageids=33330972 in the API):
> https://performance.wikimedia.org/xhgui/run/view?id=60780f591e29d3593756eaae
>
> image.png (684×444 px, 44 KB)
>
> The autoloader shouldn't take one second.
>
> This might be the culprit: https://github.com/wikimedia/mediawiki/commit/9881e82ef7fa7a3a8564fae0d6bb557019c9ca32

The AutoLoader has been slow since the Buster migration; see e.g. T274041: Reduce performance impact of HookRunner.php loading 500+ interfaces. I doubt that micro-optimization patch had any significant effect in either direction.

Also, the AutoLoader being slow doesn't really explain the lock contention, or why this appeared suddenly last night despite the bot having been running for quite a while.

There was one thing I missed in my comment on the autoloader: Joe pointed out in IRC that I ran the test on mwdebug (where you can use xhgui), and mwdebug is a VM with a different opcache setup (probably cold most of the time) than production. So the situation in production is better. Of course, we might still need to improve things, but that's another topic not related to this issue. Sorry, my bad.

Thanks for adding me as a subscriber, as my bot apparently caused this issue. The bot has been performing these actions at an edit rate of ~60 edits per minute, more or less, over the last year. The last script change occurred roughly a week ago, so this course of events caught me off guard. Could someone please explain to me in more layman's terms what the underlying issue is and how I can help to address it? From my talk page, a suggestion was to check maxlag. I am checking maxlag, and at the time I ultimately shut down the bot, maxlag events were being triggered on all concurrent workers.
Please also understand why the bot is sending so many purge requests. The bot is currently adding structured data to Commons based on tracking categories (see e.g. https://commons.wikimedia.org/wiki/Category:Pages_with_local_camera_coordinates_and_missing_SDC_coordinates). However, only after a purge do the categories get refreshed and properly removed after adding the missing structured data, so Multichill asked me to perform a purge after each structured data edit. Please advise if there is a better way of handling this.
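
To make the workflow described above concrete, here is a minimal sketch of the loop, assuming an authenticated requests.Session; the structured-data edit itself is left as a stub (add_sdc_coordinates is a hypothetical name, not the bot's actual code):

```
import requests

API_URL = "https://commons.wikimedia.org/w/api.php"
TRACKING_CATEGORY = "Category:Pages with local camera coordinates and missing SDC coordinates"


def add_sdc_coordinates(session, pageid):
    # Hypothetical stub for the structured-data edit (e.g. an edit to the
    # file's MediaInfo entity); not shown here.
    raise NotImplementedError


def process_tracking_category(session):
    # 1. List file pages currently in the tracking category.
    members = session.get(API_URL, params={
        "action": "query",
        "list": "categorymembers",
        "cmtitle": TRACKING_CATEGORY,
        "cmnamespace": 6,      # File: namespace
        "cmlimit": 50,
        "format": "json",
    }).json()["query"]["categorymembers"]

    for page in members:
        # 2. Add the missing structured data.
        add_sdc_coordinates(session, page["pageid"])

        # 3. Purge with forcelinkupdate so the tracking category is
        #    refreshed and the page drops out of it (see T237991).
        session.post(API_URL, data={
            "action": "purge",
            "forcelinkupdate": 1,
            "pageids": page["pageid"],
            "format": "json",
        })
```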

> Thanks for adding me as a subscriber, as my bot apparently caused this issue. The bot has been performing these actions at an edit rate of ~60 edits per minute, more or less, over the last year. The last script change occurred roughly a week ago, so this course of events caught me off guard. Could someone please explain to me in more layman's terms what the underlying issue is and how I can help to address it?

My best guess is that something changed in MediaWiki, probably with 1.37.0-wmf.1, that made the purge requests you were making more expensive and slower. It then took a few hours with the new code for this to turn into database problems, causing an incident. In general, I think we should be able to support the rate at which you were making requests.

> From my talk page, a suggestion was to check maxlag. I am checking maxlag, and at the time I ultimately shut down the bot, maxlag events were being triggered on all concurrent workers.

Can you double check that maxlag is set on the purge requests? That's what I saw missing in the logs. But tbh, I don't even know whether maxlag would have helped in this case.
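
For context, maxlag is just another request parameter: when database replication lag exceeds the given value, the API refuses the request with a maxlag error and a Retry-After header, and the client is expected to wait and retry. A minimal sketch of a purge that honours this (values are illustrative, not the bot's actual code):

```
import time
import requests

API_URL = "https://commons.wikimedia.org/w/api.php"


def purge_with_maxlag(session, pageid, maxlag=5):
    # Keep retrying the purge, backing off whenever the API reports
    # that the databases are lagged.
    while True:
        resp = session.post(API_URL, data={
            "action": "purge",
            "forcelinkupdate": 1,
            "pageids": pageid,
            "maxlag": maxlag,
            "format": "json",
        })
        data = resp.json()
        if data.get("error", {}).get("code") != "maxlag":
            return data
        # The server suggests how long to wait via the Retry-After header.
        time.sleep(int(resp.headers.get("Retry-After", 5)))
```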

> Please also understand why the bot is sending so many purge requests. The bot is currently adding structured data to Commons based on tracking categories (see e.g. https://commons.wikimedia.org/wiki/Category:Pages_with_local_camera_coordinates_and_missing_SDC_coordinates). However, only after a purge do the categories get refreshed and properly removed after adding the missing structured data, so Multichill asked me to perform a purge after each structured data edit. Please advise if there is a better way of handling this.

To be clear, you're editing a page and then immediately purging it? Also, it seems like a pretty clear bug in MediaWiki to me if editing the page doesn't properly refresh categories (but that's a separate issue).

>> From my talk page, a suggestion was to check maxlag. I am checking maxlag, and at the time I ultimately shut down the bot, maxlag events were being triggered on all concurrent workers.

> Can you double check that maxlag is set on the purge requests? That's what I saw missing in the logs. But tbh, I don't even know whether maxlag would have helped in this case.

On 2021-04-04, I moved the purge requests from the bash part of my script to the Python part. I dropped the ball on the maxlag parameter there; I have now updated this. While I was at it, I also added a check for permissiondenied, so the purge request is not executed if the SDC edit fails for missing permissions. This might have been why a block of my bot was not sufficient and my whole IP address was blocked in the end.
My bad, but this still only partly explains why this became such a big issue three weeks later.
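
A sketch of the guard described above, reusing purge_with_maxlag from the earlier sketch; make_sdc_edit is a hypothetical helper assumed to return the decoded API response of the structured-data edit:

```
def edit_then_purge(session, pageid):
    result = make_sdc_edit(session, pageid)   # hypothetical SDC edit helper

    # If the edit itself was rejected (e.g. permissiondenied while the bot
    # account is blocked), skip the purge instead of purging anyway.
    if result.get("error", {}).get("code") == "permissiondenied":
        return None

    return purge_with_maxlag(session, pageid)
```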

>> Please also understand why the bot is sending so many purge requests. The bot is currently adding structured data to Commons based on tracking categories (see e.g. https://commons.wikimedia.org/wiki/Category:Pages_with_local_camera_coordinates_and_missing_SDC_coordinates). However, only after a purge do the categories get refreshed and properly removed after adding the missing structured data, so Multichill asked me to perform a purge after each structured data edit. Please advise if there is a better way of handling this.

> To be clear, you're editing a page and then immediately purging it? Also, it seems like a pretty clear bug in MediaWiki to me if editing the page doesn't properly refresh categories (but that's a separate issue).

Yes, that is what all bots do that systematically add structured data components to Commons. Note that I am not doing this for wikitext edits. The behavior is tracked in T237991.

My bot is still blocked on Commons. Please advise when and how I can return to editing.

Hi all, I have resumed my bot edits on Commons at 1/10th of the edit rate that led to this. I made sure that the code includes maxlag checks for each purge request and other API calls, as applicable. Is there a way for me to monitor whether this will lead to issues again? Please let me know. Please also confirm once the underlying issue is resolved.

Could someone please answer Schlurcher's question? Thanks in advance.

Adding the current SRE clinic duty person, @Dzahn. My guess is that someone from Service Ops should follow up, but that's for him to decide.

SRE folks: Six months later, is there more to do here in this task?

jcrespo claimed this task.
jcrespo updated Other Assignee, added: Reedy.

I don't think it is worth keeping this open anymore. There is indeed a need to review it and produce a better description, but hopefully that will be picked up by the incident review process that is kicking off now.

AFAICS, the actual actionables were put in place, but they have not yet been made public, so the only missing piece is reviewing the information made public about the incident (improving the documentation). In general, better incident documentation is a common "bug" that is being acted on across several tickets at once at T292254.