
All wikis down: error 503 (resolved, follow-up pending)
Closed, Resolved · Public · BUG REPORT


List of steps to reproduce (step by step, including full links if applicable):

  • Visit a production Wikimedia project

What happens?:
First, loading failed.

Then, 503 Service Unavailable
No server is available to handle this request.

At some point I also got:
Request from (IP redacted) via cp3062 cp3062, Varnish XID 204440230
Error: 503, Backend fetch failed at Sat, 21 May 2022 19:02:26 GMT

What should have happened instead?:

Phabricator was also affected; the Beta Cluster seemingly was not. I tested different PoP locations, but it made no difference.

Edit: sorry, the 400 was due to a typo of mine. The example uses a URL that was still working, so I changed it and messed it up.

Event Timeline

This should be resolved now.

I tried to report it sooner, but Phabricator was down!

Krinkle renamed this task from "All wikis down: error 400" to "All wikis down: error 503". May 21 2022, 7:17 PM

Wikis are back up. This incident is actively being investigated.

Edit: sorry, the 400 was due to a typo of mine. Reporting_a_connectivity_issue maybe shouldn't use that URL in its example, because that URL was still working, so I had to change it.

It uses that URL precisely because it rarely involves appservers. These www-URLs are less costly to generate, usually cached internally, and more helpful in diagnosing connectivity issues. If you can reach it without issue, then it is not a connectivity issue.
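The diagnostic logic above can be sketched as a small shell snippet. This is a hedged illustration, not official tooling: the `www.wikimedia.org` URL and the `classify_status` helper are assumptions introduced here for the example (the original report's link was not preserved).

```shell
#!/usr/bin/env bash
# Sketch of a quick connectivity check against an edge-cached URL.
# NOTE: the URL and the helper function are illustrative assumptions.

classify_status() {
  # Map an HTTP status code (as reported by curl) to a rough diagnosis.
  # curl reports "000" when it got no HTTP response at all.
  case "$1" in
    000)     echo "connectivity issue: no response at all" ;;
    2??|3??) echo "edge reachable: not a connectivity issue" ;;
    503)     echo "edge reachable, but backend fetch failed (server-side)" ;;
    *)       echo "edge reachable, unexpected status $1" ;;
  esac
}

# Usage (requires network access):
#   status=$(curl -s -o /dev/null -w '%{http_code}' https://www.wikimedia.org)
#   classify_status "$status"
```

The point of the case split is exactly the distinction made in this comment: a 503 from the edge means the caching layer answered you, so your connectivity is fine and the problem is server-side.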

The issue described in this task was due to something server-side that affected some traffic; it was not related to your connectivity or ours.

Dzahn triaged this task as High priority. (Edited) May 21 2022, 8:06 PM

Status is High but no longer UBN (Unbreak Now!). We will follow up with an incident report, but there is currently no ongoing outage.

User impact lasted about 2 minutes.

Dzahn renamed this task from "All wikis down: error 503" to "All wikis down: error 503 (resolved, follow-up pending)". May 21 2022, 8:08 PM

This was a failure at the edge / caching layer. The services behind it were not directly affected, but appeared down because they received no traffic. The Beta Cluster was unaffected because it is not in the production environment.

"A flood of API traffic from an AWS user caused caching servers to be overloaded. Services behind those caching servers were up but not reachable during this time."
Just curious:

  • A single Amazon Web Services user?
  • Could anyone with an Amazon Web Services account repeat this? (I'm not asking what the traffic was, as I wouldn't expect an answer to that; only whether someone could.)
  • Does it appear to have been an accidental action (like a malfunctioning spider, for example), or was it likely malicious?
  • If known, how would the AWS user be classified: a (small group of) individual(s), a small/medium business/organization, a large business/organization, or a (department of a) government?