
2021-09-26 (UTC) Wikimedia sites down
Closed, ResolvedPublic

Description

One of two error messages is displayed:
"upstream connect error or disconnect/reset before headers. reset reason: overflow" or
"upstream connect error or disconnect/reset before headers. reset reason: connection failure"
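When triaging reports like these, it can help to pull the reset reason out of the message text so that reports can be grouped. The following is a minimal sketch, assuming every message ends with the `reset reason: <reason>` shape quoted above; the function name `reset_reason` is hypothetical and not part of any Wikimedia tooling.

```python
import re
from typing import Optional

# Matches the trailing "reset reason: <reason>" part of the error text.
# Assumption: the reason is always the last thing in the message.
_RESET_REASON = re.compile(r"reset reason:\s*(?P<reason>.+?)\s*$")


def reset_reason(message: str) -> Optional[str]:
    """Return the reset reason from an upstream error message, or None."""
    # Drop surrounding whitespace and quotation marks copied from a report.
    cleaned = message.strip().strip('"')
    match = _RESET_REASON.search(cleaned)
    return match.group("reason") if match else None


if __name__ == "__main__":
    samples = [
        "upstream connect error or disconnect/reset before headers. reset reason: overflow",
        "upstream connect error or disconnect/reset before headers. reset reason: connection failure",
    ]
    for sample in samples:
        print(reset_reason(sample))
```

A message without a recognizable reason yields `None`, so unrelated error text is not misclassified.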

Incident document: https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-09-26_appserver_latency

Event Timeline

DannyS712 renamed this task from All wikis are down to 2021-09-25 Wikimedia sites down. Sep 26 2021, 3:13 AM
DannyS712 triaged this task as Unbreak Now! priority.
DannyS712 added projects: SRE, Traffic.
DannyS712 subscribed.
Peachey88 renamed this task from 2021-09-25 Wikimedia sites down to 2021-09-26 (UTC) Wikimedia sites down. Sep 26 2021, 3:22 AM

The production (live) sites seem to be back up now, although pages are loading a bit slower than usual (presumably due to an influx of requests as the sites came back online).

Based on the data logged on Grafana, there was a gradual decline in API requests starting at 2:51 AM (UTC). Production seems to have recovered for one minute at 2:58 AM, before dropping to minimum load ("maximum down") at 2:59. At 3:03 there was a slight recovery lasting two minutes, but the request rate soon dropped again. The system seems to have recovered at 3:15 AM, although it ran at diminished capacity for a couple of minutes afterwards. We now seem to be back at normal capacity and request rate, in line with the load at this time last week.

image.png (1×2 px, 250 KB)

See this on Grafana here.

Open discussion on English Wikipedia (Technical Village Pump)

This error appeared as HTTP 503 when trying to log into Phabricator via MediaWiki. For a brief moment, the standard “our servers are undergoing maintenance” error appeared, but I wasn’t able to retrieve the debug information at the bottom of the page in time.

Andrew claimed this task.
Andrew subscribed.

SREs are investigating and responding to this issue; it should be largely resolved by now.

Because this is a DoS-related issue, the specifics will not be discussed in public until the vulnerability has been fixed; that likely won't happen until normal working hours later in the week.

Thank you for the report! Please re-open if further symptoms appear.