2021-09-18 Wikimedia sites down
Closed, Resolved | Public | BUG REPORT

Assigned To
Authored By
Nirmos
Sep 18 2021, 12:44 AM
Referenced Files
F34646794: image.png
Sep 18 2021, 9:30 AM
F34646789: image.png
Sep 18 2021, 9:26 AM
F34646785: image.png
Sep 18 2021, 9:26 AM
F34646551: image.png
Sep 18 2021, 12:51 AM
F34646548: image.png
Sep 18 2021, 12:51 AM

Description

Wikimedia sites seem to be down right now. After a very long wait, the error message is "upstream connect error or disconnect/reset before headers. reset reason: overflow".

Incident document: https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-09-18_appserver_latency

Event Timeline

can confirm, all projects, all languages.

This is known and actively being investigated. Please stand by.

From T291312, some other errors I have experienced:

  • upstream connect error or disconnect/reset before headers. reset reason: connection failure
  • upstream connect error or disconnect/reset before headers. reset reason: overflow
  • upstream request timeout
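
For anyone collating reports like these, the three messages above can be grouped by their reset reason. This is only a rough sketch: the function name `classify_error` and the labels it returns are made up for illustration, and the pattern simply matches the text as quoted in this task.

```python
import re

# Matches the error text exactly as quoted in the reports above.
RESET_RE = re.compile(
    r"upstream connect error or disconnect/reset before headers\. "
    r"reset reason: (?P<reason>.+)$"
)


def classify_error(message: str) -> str:
    """Return a short, hypothetical label for one reported error message."""
    text = message.strip()
    match = RESET_RE.search(text)
    if match:
        # Keep the reason verbatim so unfamiliar reasons are not lost.
        return "reset:" + match.group("reason").strip()
    if text == "upstream request timeout":
        return "timeout"
    return "unknown"


reports = [
    "upstream connect error or disconnect/reset before headers. reset reason: connection failure",
    "upstream connect error or disconnect/reset before headers. reset reason: overflow",
    "upstream request timeout",
]
labels = [classify_error(r) for r in reports]
print(labels)
```

Any message that does not match either form falls through to "unknown" rather than being guessed at.
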
Urbanecm triaged this task as Unbreak Now! priority. Sep 18 2021, 12:51 AM
Peachey88 renamed this task from Wikimedia sites down 18 Sept 2021 to 2021-09-18 Wikimedia sites down. Sep 18 2021, 12:52 AM

Enwiki, wikidata, and meta are all loading normally for me now (frontend & API).

en.wp (at least) was agonisingly slow shortly before it all went down, and even after en.wp came back up I wasn't able to allow Oauth access to login here.

Seconding this: I was doing a few quick edits on hr.wp, and loading after each one got progressively slower; the last page took a good minute or two to load before the error showed. Gadgets and .js imports stopped working while the site was still up but slow.

Can confirm: fr.wp went slow around 02:10 AM CEST and came back up around 03:00 AM CEST. No abnormal filtered detections from fr.wp's AbuseFilter.

RLazarus claimed this task.
RLazarus subscribed.

This should be fully resolved as of about 1:10 UTC. Sorry for the trouble, and thanks for all the reports.

Because the root cause was a DoS vector, we can't publish an incident report yet -- but we'll do so (and make the private task T284419 public) as soon as the vulnerability is addressed.

Resolving this but please let us know if you continue to experience any issues from this incident.

This comment was removed by Nehaoua.
Krinkle added a parent task: Restricted Task. Oct 20 2021, 7:08 PM
Krinkle removed a parent task: Restricted Task.
Krinkle added a subtask: Restricted Task.