All wikis down: error 503 (resolved, follow-up pending)
Closed, Resolved · Public · BUG REPORT

Authored By

	AlexisJazz
	May 21 2022, 7:11 PM

Description

List of steps to reproduce (step by step, including full links if applicable):

Visit a production Wikimedia project

What happens?:
First, loading failed.

Then, 503 Service Unavailable
No server is available to handle this request.

At some point I also got:
Request from (IP redacted) via cp3062 cp3062, Varnish XID 204440230
Error: 503, Backend fetch failed at Sat, 21 May 2022 19:02:26 GMT
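
For anyone collecting these error footers while debugging, here is a minimal Python sketch that pulls out the individual fields. The pattern is inferred only from the single sample above, so the field layout, the `parse_varnish_error` name and the dictionary keys are all assumptions, not a documented format.

```python
import re
from email.utils import parsedate_to_datetime

# Pattern guessed from one sample error footer; other pages may differ.
ERROR_RE = re.compile(
    r"via (?P<host>[^\s,]+) [^\s,]+, Varnish XID (?P<xid>\d+)\s+"
    r"Error: (?P<status>\d{3}), (?P<reason>.+?) at (?P<when>.+)"
)

SAMPLE = (
    "Request from (IP redacted) via cp3062 cp3062, Varnish XID 204440230\n"
    "Error: 503, Backend fetch failed at Sat, 21 May 2022 19:02:26 GMT"
)


def parse_varnish_error(text):
    """Return the fields of an error footer as a dict, or None if no match."""
    match = ERROR_RE.search(text.strip())
    if match is None:
        return None
    return {
        "host": match["host"],          # cache host that served the error
        "xid": int(match["xid"]),       # Varnish transaction ID
        "status": int(match["status"]), # HTTP status code
        "reason": match["reason"],      # short error text
        "when": parsedate_to_datetime(match["when"]),  # HTTP-date timestamp
    }


if __name__ == "__main__":
    print(parse_varnish_error(SAMPLE))
```

The cache host and XID are usually the most useful parts to quote in a report, since they point at one specific request on one specific server.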

What should have happened instead?:
200

Phabricator was also affected. Beta cluster seemingly not. I tested different PoP locations per https://wikitech-static.wikimedia.org/wiki/Reporting_a_connectivity_issue but it made no difference.

Edit: sorry, the 400 was due to a typo of mine. https://wikitech-static.wikimedia.org/wiki/Reporting_a_connectivity_issue uses www.wikimedia.org in its example, but that was still working, so I changed the URL and messed it up.

Related Objects

Mentioned In: T308952: get a legend for haproxy "anomalous session termination states"

Event Timeline

AlexisJazz created this task. May 21 2022, 7:11 PM

Restricted Application added a subscriber: Aklapper. May 21 2022, 7:11 PM

This should be resolved now

In T308940#7947028, @RhinosF1 wrote:

This should be resolved now

I tried to report it sooner, but Phabricator was down!

Aklapper added a project: Wikimedia-Incident. May 21 2022, 7:13 PM

Tractopelle-jaune subscribed. May 21 2022, 7:15 PM

Krinkle renamed this task from All wikis down: error 400 to All wikis down: error 503. May 21 2022, 7:17 PM

AlexisJazz updated the task description. May 21 2022, 7:20 PM

GPSLeo subscribed. May 21 2022, 7:26 PM

Se4598 subscribed. May 21 2022, 7:30 PM

Wikis are back up. This incident is actively being investigated.

Edit: sorry the 400 was due to a typo of mine. Reporting_a_connectivity_issue maybe shouldn't use www.wikimedia.org in its example because THAT was still working so I had to change it.

It uses www.wikimedia.org precisely because it rarely involves appservers. These www URLs are less costly to generate, usually cached internally, and more helpful in diagnosing connectivity issues. If you can reach www.wikimedia.org without issue, then it is not a connectivity issue.

The issue described in this task was due to something server-side that affected some traffic, and was not related to your connectivity or ours.
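
To make that reasoning concrete, here is a rough Python sketch of the decision. The `diagnose` function and the `Diagnosis` names are made up for illustration, and the rule is a simplification of the explanation above rather than an official procedure: each argument is the HTTP status you got back, or `None` if the request never got a response at all.

```python
from enum import Enum


class Diagnosis(Enum):
    OK = "ok"
    CONNECTIVITY = "connectivity issue"
    SERVER_SIDE = "server-side issue"


def diagnose(www_status, wiki_status):
    """Classify a failure from two probes.

    www_status:  status from www.wikimedia.org, or None if unreachable.
    wiki_status: status from the wiki you were trying to use, or None.
    """
    # No response at all from the cheap, usually cached www URL:
    # most likely the problem is between you and the edge.
    if www_status is None:
        return Diagnosis.CONNECTIVITY
    # www answered, so the network path works. A missing response or a
    # 5xx from either probe then points at something server-side.
    if wiki_status is None or wiki_status >= 500 or www_status >= 500:
        return Diagnosis.SERVER_SIDE
    return Diagnosis.OK


if __name__ == "__main__":
    # The situation in this task: www still worked, the wikis returned 503.
    print(diagnose(200, 503))
```

The statuses would come from whatever you use to probe (curl, a browser, a script); the sketch deliberately leaves the network calls out so the rule itself stays easy to read.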

AlexisJazz updated the task description. May 21 2022, 7:41 PM

Priority is now High, no longer UBN (Unbreak Now!). We will follow up with an incident report, but there is currently no ongoing outage.

User impact lasted about 2 minutes.

Dzahn renamed this task from All wikis down: error 503 to All wikis down: error 503 (resolved, follow-up pending). May 21 2022, 8:08 PM

Dzahn mentioned this in T308952: get a legend for haproxy "anomalous session termination states". May 21 2022, 8:37 PM

This was a failure at the edge / caching layer. The services behind it were not directly affected, but they appeared down because they received no traffic. The beta cluster was not affected because it is not part of the production environment.

https://wikitech.wikimedia.org/wiki/Incidents/2022-05-21_-_varnish_cache_busting

In T308940#7951736, @Dzahn wrote:

https://wikitech.wikimedia.org/wiki/Incidents/2022-05-21_-_varnish_cache_busting

"A flood of API traffic from an AWS user caused caching servers to be overloaded. Services behind those caching servers were up but not reachable during this time."

Just curious:

A single Amazon Web Services user?
Could anyone with an Amazon Web Services account repeat this? (I'm not asking how it was done, as I wouldn't expect an answer to that, only whether someone could.)
Does it appear to have been an accidental action (like a malfunctioning spider for example) or was it likely malice?
If known, how would the AWS user be classified: a (small group of) individual(s), a small/medium business/organization, large business/organization or a (department of) a government?
