
Intermittent 503's on multiple sites
Closed, ResolvedPublic

Description

Affects Commons and the Dutch Wikipedia. It seemed to have been resolved, then went down again after that. I'm based in Europe, so I'm hitting the European farm.

Example output:

Request: GET http://nl.wikipedia.org/wiki/Speciaal:Volglijst, from 10.20.0.103 via cp1066 cp1066 ([10.64.0.103]:3128), Varnish XID 2215869727
Forwarded for: 87.210.129.192, 10.20.0.176, 10.20.0.176, 10.20.0.103
Error: 503, Service Unavailable at Sun, 13 Sep 2015 14:21:06 GMT
Wikidata:

Request: GET http://www.wikidata.org/wiki/Special:Contributions/Multichill, from 10.20.0.108 via cp1068 cp1068 ([10.64.0.105]:3128), Varnish XID 699828265
Forwarded for: 87.210.129.192, 10.20.0.108, 10.20.0.108, 10.20.0.108
Error: 503, Service Unavailable at Sun, 13 Sep 2015 14:27:58 GMT
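The intermittency is easy to demonstrate with a small probe along these lines (a minimal sketch, not what I actually ran; the URLs come from the traces above, and the round count and polling interval are arbitrary choices):

```python
# Hypothetical probe: repeatedly fetch the affected pages and log any 503s.
import time
import urllib.request
import urllib.error

URLS = [
    "https://nl.wikipedia.org/wiki/Speciaal:Volglijst",
    "https://www.wikidata.org/wiki/Special:Contributions/Multichill",
]

for _ in range(10):              # 10 rounds of probing (arbitrary)
    for url in URLS:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                print(resp.status, url)
        except urllib.error.HTTPError as e:
            # Varnish answers 503 with a "Service Unavailable" error page
            print(e.code, url)
        except urllib.error.URLError as e:
            print("ERR", url, e.reason)
    time.sleep(30)               # 30 s between rounds (arbitrary)
```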

Event Timeline

Multichill raised the priority of this task from to Needs Triage.
Multichill updated the task description. (Show Details)
Multichill added a project: acl*sre-team.
Multichill subscribed.
Multichill renamed this task from 503's on multiple sites to Intermittent 503's on multiple sites. Sep 13 2015, 2:29 PM
Multichill triaged this task as Unbreak Now! priority.
Multichill updated the task description. (Show Details)
Multichill set Security to None.

Same at eswiki:

Request: GET [removed], from 10.20.0.176 via cp1066 cp1066 ([10.64.0.103]:3128), Varnish XID 2216592190
Forwarded for: ***.***.***.***, 10.20.0.109, 10.20.0.109, 10.20.0.176
Error: 503, Service Unavailable at Sun, 13 Sep 2015 14:27:28 GMT

Same for ca.wiki, en.wiki, Phabricator... It's OK now.

Request: GET http://ca.wikipedia.org/wiki/Shakira, from 10.20.0.112 via cp1068 cp1068 ([10.64.0.105]:3128), Varnish XID 700001853
Forwarded for: 88.15.46.69, 10.20.0.109, 10.20.0.109, 10.20.0.112
Error: 503, Service Unavailable at Sun, 13 Sep 2015 14:29:05 GMT

Seems to have been a cascading failure: the application servers were backed up waiting for the databases, which resulted in a full outage. The underlying issue appears to have been a database overload across all s1 slaves, for as-yet-unknown reasons (I seem to have arrived about 5 minutes too late, so the full processlist is of little use at this point).
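For the next incident of this kind, snapshotting the processlist the moment the alert fires would preserve the evidence even if the overload clears quickly. A minimal sketch, assuming PyMySQL and a read-only account; the hostname suffix and credentials are placeholders:

```python
# Hypothetical snapshot helper (a sketch, not something that was run during
# this incident): dump SHOW FULL PROCESSLIST to a timestamped file so the
# evidence survives even if the overload clears before anyone can look.
import datetime
import pymysql

# "db1055" is the slave named in this task; the domain and the
# watchdog credentials are placeholders.
conn = pymysql.connect(host="db1055.example.internal",
                       user="watchdog", password="...")
try:
    with conn.cursor() as cur:
        cur.execute("SHOW FULL PROCESSLIST")
        rows = cur.fetchall()
finally:
    conn.close()

stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
with open(f"processlist-{stamp}.tsv", "w") as out:
    for row in rows:
        out.write("\t".join("" if col is None else str(col) for col in row) + "\n")
print(f"captured {len(rows)} threads at {stamp}")
```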

This is db1055 at the time:

[Image: db1055.png (373×747 px, 29 KB)]

fluorine's db error log is full of "too many connections" as well.
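Bucketing those errors per minute should line up with the graph above and locate the bursts. A minimal sketch, assuming a flat-file log whose lines start with a `YYYY-MM-DD HH:MM:SS` timestamp; the path on fluorine is an assumption:

```python
# Hypothetical tally (a sketch; the log path and line format are assumptions):
# count "too many connections" errors per minute to locate the bursts.
import re
from collections import Counter

per_minute = Counter()
with open("/a/mw-log/dberror.log") as log:   # assumed path on fluorine
    for line in log:
        if "too many connections" not in line.lower():
            continue
        # assume a leading timestamp like "2015-09-13 14:21:06"
        m = re.match(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2})", line)
        if m:
            per_minute[m.group(1)] += 1

for minute, n in sorted(per_minute.items()):
    print(minute, n)
```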

faidon claimed this task.

The cause was determined to be an attack (in three bursts) on our servers, in a successful attempt to overload them. I won't say more — we have a policy on not documenting (or commenting on) such attacks in public as this may let the attacker know of our response measures or give ideas to other attackers. Resolving.