
Incident: 2022-03-10 MediaWiki availability affected due to a database query processing slowdown affecting most of the rest of the database infrastructure
Closed, Resolved · Public

Description

Wikis were unreachable for about 12 minutes in total, split across two spikes: 08:24-08:30 and 08:39-08:45 UTC.

The root cause appeared to be a network slowdown caused by ongoing maintenance on db1099, which caused fallout across the whole s8 section and later spread to other sections too (s1 was also hosted on db1099, and other wikis need to connect to the Wikidata database, s8).

More specifically, it seems to have been caused by db1099 (an s8 and s1 replica), which was rebooted for maintenance and slowly repooled while a network transfer was running from the same host. The host became slow to respond, but not slow enough to be considered down and depooled automatically. This caused a cascade effect on the other DBs in the same section (s8) and, because s8 is read by every wiki page, a cascade effect on all wikis, which surfaced as the page (alert) for too many busy PHP-FPM workers.
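Why a slow-but-alive replica exhausts PHP-FPM workers can be illustrated with Little's law (average requests in flight = arrival rate x time each request spends in the system). The sketch below is only an illustration: the function names, request rate, latencies and pool size are all hypothetical and are not taken from the incident data.

```python
def average_busy_workers(requests_per_second: float, mean_latency_seconds: float) -> float:
    """Little's law: average requests in flight = arrival rate x time in system."""
    return requests_per_second * mean_latency_seconds


def pool_is_saturated(requests_per_second: float,
                      mean_latency_seconds: float,
                      pool_size: int) -> bool:
    """True when the average demand for workers meets or exceeds the pool size."""
    return average_busy_workers(requests_per_second, mean_latency_seconds) >= pool_size


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the effect of rising latency.
    POOL_SIZE = 100   # assumed number of PHP-FPM workers on one app server
    RPS = 200.0       # assumed requests per second reaching that server

    for label, latency in (("healthy replica", 0.05), ("slow replica", 2.0)):
        busy = average_busy_workers(RPS, latency)
        saturated = pool_is_saturated(RPS, latency, POOL_SIZE)
        print(f"{label}: ~{busy:.0f} busy workers, saturated={saturated}")
```

With the request rate unchanged, the number of busy workers grows in proportion to the time each request spends waiting on the database, so a replica that still answers, only slowly, can fill the worker pool without ever being counted as down.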

This mostly affected logged-in users (editors), as the cache layer was for the most part unaffected (except for cache misses).

https://wikitech.wikimedia.org/wiki/Incident_documentation/2022-03-10_MediaWiki_availability

Event Timeline

Restricted Application added a subscriber: Aklapper.
jcrespo renamed this task from 2022-03-10 Mediawiki availability due to database contention/open connection pileups to 2022-03-10 Mediawiki availability afected due to a database query processing slowdown affecting most of the rest of the database infrastructure. Mar 10 2022, 9:34 AM

Adding the Performance team, at the very least for awareness. FYI: we believe we have found a weakness in the "automatic depooling logic": when a host is up enough to answer probes but slow enough to cause pileups, it can cause general downtime (especially for central wikis), with the slowdown spreading to other DBs in the same section (and, if it hits Wikidata, to other wikis).
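To make the gap described above concrete, here is a minimal sketch contrasting a liveness-only depool decision with a latency-aware one. It is not MediaWiki's actual depooling logic: `ProbeResult`, both functions, and the 0.5 s threshold are hypothetical names and values introduced only for illustration.

```python
from dataclasses import dataclass
from statistics import median
from typing import Sequence


@dataclass
class ProbeResult:
    """Outcome of one health probe against a replica (hypothetical structure)."""
    answered: bool
    latency_seconds: float


def liveness_only_depool(probes: Sequence[ProbeResult]) -> bool:
    """Depool only when no probe was answered at all."""
    return not any(p.answered for p in probes)


def latency_aware_depool(probes: Sequence[ProbeResult],
                         max_median_latency_seconds: float = 0.5) -> bool:
    """Depool when the host is down OR its median answered latency is too high.

    The 0.5 s default is an arbitrary illustrative threshold.
    """
    latencies = [p.latency_seconds for p in probes if p.answered]
    if not latencies:
        return True  # nothing answered: treat the host as down
    return median(latencies) > max_median_latency_seconds


if __name__ == "__main__":
    # A host that answers every probe, but slowly (the failure mode in this task).
    slow_host = [ProbeResult(True, 2.0), ProbeResult(True, 1.8), ProbeResult(True, 2.4)]
    print("liveness-only depools slow host:", liveness_only_depool(slow_host))
    print("latency-aware depools slow host:", latency_aware_depool(slow_host))
```

Under these assumptions the liveness-only check keeps the slow host pooled because every probe is answered, while the latency-aware check removes it. A real implementation would also need to avoid depooling too many replicas at once when a whole section slows down.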

Reedy renamed this task from 2022-03-10 Mediawiki availability afected due to a database query processing slowdown affecting most of the rest of the database infrastructure to 2022-03-10 MediaWiki availability afected due to a database query processing slowdown affecting most of the rest of the database infrastructure. Mar 10 2022, 1:41 PM
Aklapper renamed this task from 2022-03-10 MediaWiki availability afected due to a database query processing slowdown affecting most of the rest of the database infrastructure to 2022-03-10 MediaWiki availability affected due to a database query processing slowdown affecting most of the rest of the database infrastructure. Mar 13 2022, 1:29 PM
lmata renamed this task from 2022-03-10 MediaWiki availability affected due to a database query processing slowdown affecting most of the rest of the database infrastructure to Incident: 2022-03-10 MediaWiki availability affected due to a database query processing slowdown affecting most of the rest of the database infrastructure. Apr 28 2022, 10:23 PM
lmata claimed this task.

Boldly resolving - scorecards marked as done and old incidents.