2022-05-05 Wikimedia full site outage
Closed, Resolved · Public · BUG REPORT

Assigned To

Authored By

	AlexisJazz
	May 5 2022, 5:43 AM

Description

Screenshot 2022-05-05 at 02-02-59 Wikimedia Status.png (689×850 px, 52 KB)

Wikimedia wikis were down from ~5:40 to 5:54 because of a faulty schema change.

Related Objects

Status	Subtype	Assigned	Task
Resolved	BUG REPORT	• Marostegui	T307647 2022-05-05 Wikimedia full site outage
Resolved		• Marostegui	T307501 Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis
Resolved		• Marostegui	T308126 Migrate a s7 DB host to mariadb 10.6
Resolved		Ladsgroup	T307648 Audit database usage of GlobalBlocking extension

Event Timeline

AlexisJazz created this task. · May 5 2022, 5:43 AM

Restricted Application added a subscriber: Aklapper. · May 5 2022, 5:43 AM

AlexisJazz added a project: Wikimedia-production-error. · May 5 2022, 5:47 AM

Ladsgroup edited projects, added Wikimedia-Incident; removed Wikimedia-production-error, Traffic. · May 5 2022, 6:00 AM

This was due to an outage. We identified and fixed the root cause around 5-10 minutes ago, and we should be OK now.
We are now investigating why the outage happened in the first place.

Ladsgroup mentioned this in T307648: Audit database usage of GlobalBlocking extension. · May 5 2022, 6:04 AM

Legoktm renamed this task from "upstream connect error or disconnect/reset before headers. reset reason: overflow" to "2022-05-05 Wikimedia full site outage". · May 5 2022, 6:04 AM

Legoktm added projects: SRE, DBA, GlobalBlocking.

Legoktm updated the task description.

lmata added a project: SRE-OnFire (FY2021/2022-Q4). · May 5 2022, 6:11 AM

lmata subscribed.

RhinosF1 subscribed. · May 5 2022, 6:14 AM

Ahmad_Kanik subscribed. · May 5 2022, 6:17 AM

Bebiezaza subscribed. · May 5 2022, 6:35 AM

Zabe subscribed. · May 5 2022, 6:50 AM

As the description was overwritten: it didn't break instantly and maybe it was never strictly down, just too slow to work. For me it was extremely slow for a few minutes or so first (taking 20s to load a page) until finally I saw "upstream connect error or disconnect/reset before headers. reset reason: overflow".

Btw, Phabricator was up and performing normally, as was beta cluster. https://foundation.wikimedia.org/wiki/ was down, as was enwiki and metawiki.

• toan subscribed. · May 5 2022, 8:30 AM

Updating this task by cross-posting from the original schema change task:
This query seemed to be the one that got stuck:

SELECT /* MediaWiki\Extension\GlobalBlocking\GlobalBlocking::getGlobalBlockingBlock */ gb_id,gb_address,gb_by,gb_by_wiki,gb_reason,gb_timestamp,gb_anon_only,gb_expiry,gb_range_start,gb_range_end FROM `globalblocks` WHERE (gb_range_start LIKE '5B85%' ESCAPE '`') AND (gb_range_start <= '5B85B2D2') AND (gb_range_end >= '5B85B2D2') AND (gb_expiry > '20220505012805');

However, the EXPLAIN output (and the optimizer trace) shows no difference in query plans between the original schema change and the new one (T307501).
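For readers unfamiliar with the literals in that query: `'5B85B2D2'` is eight hex digits and `'5B85%'` is its first four, which looks like a hex-encoded IPv4 address plus a prefix covering its first two octets. The sketch below shows how such lookup parameters could be derived from a dotted-quad address. It is an illustration under that assumption, not GlobalBlocking's actual code; the function names are hypothetical, and the address used is a reserved documentation address.

```python
import ipaddress


def ipv4_to_hex(addr: str) -> str:
    """Encode an IPv4 address as 8 uppercase hex digits (assumed storage format)."""
    return format(int(ipaddress.IPv4Address(addr)), "08X")


def hex_to_ipv4(hex_value: str) -> str:
    """Decode an 8-digit hex string back to dotted-quad form."""
    return str(ipaddress.IPv4Address(int(hex_value, 16)))


def range_lookup_params(addr: str) -> dict:
    """Build the three literals used by a range lookup shaped like the logged query.

    Hypothetical helper: the 4-digit LIKE prefix narrows the scan to rows whose
    range starts within the same first two octets; the two comparisons then
    check that the address falls inside [gb_range_start, gb_range_end].
    """
    hex_ip = ipv4_to_hex(addr)
    return {
        "like_prefix": hex_ip[:4] + "%",
        "range_start_max": hex_ip,
        "range_end_min": hex_ip,
    }


if __name__ == "__main__":
    print(range_lookup_params("192.0.2.1"))
```

Because the hex strings are fixed-width and uppercase, plain string comparison orders them the same way as the numeric addresses, which is what makes the `<=` and `>=` conditions meaningful on a text or binary column.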

• Marostegui added subtasks: T307501: Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis, T307648: Audit database usage of GlobalBlocking extension. · May 5 2022, 10:23 AM

@AlexisJazz Full doc will come later, but for clarification, the impact was the following:

Cached requests (anonymous users' reads) were mostly unaffected. Only 15% to 20% of total requests were affected, mostly from authenticated and power users, edits, and certain kinds of special requests.
Uncached requests started to get slower at 5:36, reaching the highest level of slowness (full outage) at 5:39.
The issue started to improve at 5:51, with average latency dropping.
The issue was fully solved at 5:55.
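As a rough aid for reading that timeline, the phase lengths can be computed from the four clock times given above. This is only arithmetic on the stated times; the phase labels and function names are illustrative, not taken from the incident documentation.

```python
from datetime import datetime

FMT = "%H:%M"

# Clock times as stated in the comment above; labels are illustrative.
EVENTS = [
    ("slowdown begins", "5:36"),
    ("full outage", "5:39"),
    ("recovery begins", "5:51"),
    ("fully solved", "5:55"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day clock times."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


def phase_durations(events):
    """Duration in minutes of each phase between consecutive events."""
    return [
        (f"{a[0]} -> {b[0]}", minutes_between(a[1], b[1]))
        for a, b in zip(events, events[1:])
    ]


if __name__ == "__main__":
    for label, minutes in phase_durations(EVENTS):
        print(f"{label}: {minutes} min")
    print("total:", minutes_between(EVENTS[0][1], EVENTS[-1][1]), "min")
```

On these figures, uncached traffic was degraded for 19 minutes in total, about 12 of them at the worst level.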

• jcrespo mentioned this in T307671: High rate of 5XX errors from maps.wikimedia.org since 2022-05-05 ~03:20. · May 5 2022, 11:06 AM

A more recent test on one of the hosts most affected during the outage does show a different query plan:

https://phabricator.wikimedia.org/T307501#7906462

Just for greater visibility and awareness, there is T301505 for the "upstream connect error or disconnect/reset before headers. reset reason: overflow" error. As pointed out in that task, that error message is a symptom and not the cause.

Stang subscribed. · May 5 2022, 1:42 PM

Umherirrender subscribed. · May 5 2022, 5:16 PM