
Intermittent 503's on multiple sites
Closed, ResolvedPublic

Description

Affects Commons and the Dutch Wikipedia. It seemed to have been resolved, then went down again after that. I'm based in Europe, so I'm hitting the European farm.

Example output:

Request: GET http://nl.wikipedia.org/wiki/Speciaal:Volglijst, from 10.20.0.103 via cp1066 cp1066 ([10.64.0.103]:3128), Varnish XID 2215869727
Forwarded for: 87.210.129.192, 10.20.0.176, 10.20.0.176, 10.20.0.103
Error: 503, Service Unavailable at Sun, 13 Sep 2015 14:21:06 GMT
Wikidata:

Request: GET http://www.wikidata.org/wiki/Special:Contributions/Multichill, from 10.20.0.108 via cp1068 cp1068 ([10.64.0.105]:3128), Varnish XID 699828265
Forwarded for: 87.210.129.192, 10.20.0.108, 10.20.0.108, 10.20.0.108
Error: 503, Service Unavailable at Sun, 13 Sep 2015 14:27:58 GMT
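The intermittency is easy to demonstrate with a small probe along these lines (a minimal sketch, not what I actually ran; the URLs come from the traces above, and the round count and polling interval are arbitrary choices):

```python
# Hypothetical probe: repeatedly fetch the affected pages and log any 503s.
import time
import urllib.request
import urllib.error

URLS = [
    "https://nl.wikipedia.org/wiki/Speciaal:Volglijst",
    "https://www.wikidata.org/wiki/Special:Contributions/Multichill",
]

for _ in range(10):              # 10 rounds of probing (arbitrary)
    for url in URLS:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                print(resp.status, url)
        except urllib.error.HTTPError as e:
            # Varnish answers 503 with a "Service Unavailable" error page
            print(e.code, url)
        except urllib.error.URLError as e:
            print("ERR", url, e.reason)
    time.sleep(30)               # 30 s between rounds (arbitrary)
```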

Event Timeline

Multichill raised the priority of this task from to Needs Triage.
Multichill updated the task description. (Show Details)
Multichill added a project: acl*sre-team.
Multichill subscribed.
Multichill renamed this task from 503's on multiple sites to Intermittent 503's on multiple sites. Sep 13 2015, 2:29 PM
Multichill triaged this task as Unbreak Now! priority.
Multichill updated the task description. (Show Details)
Multichill set Security to None.

Same at eswiki:

Request: GET [removed], from 10.20.0.176 via cp1066 cp1066 ([10.64.0.103]:3128), Varnish XID 2216592190
Forwarded for: ***.***.***.***, 10.20.0.109, 10.20.0.109, 10.20.0.176
Error: 503, Service Unavailable at Sun, 13 Sep 2015 14:27:28 GMT

Same for ca.wiki, en.wiki, Phabricator... It's OK now.

Request: GET http://ca.wikipedia.org/wiki/Shakira, from 10.20.0.112 via cp1068 cp1068 ([10.64.0.105]:3128), Varnish XID 700001853
Forwarded for: 88.15.46.69, 10.20.0.109, 10.20.0.109, 10.20.0.112
Error: 503, Service Unavailable at Sun, 13 Sep 2015 14:29:05 GMT

Seems to have been a cascading failure: the application servers were backed up waiting for the databases, which resulted in a full outage. The underlying issue appears to have been a database overload across all s1 slaves, for as-yet-unknown reasons (I seem to have arrived about 5 minutes too late, so the full processlist is of little use at this point).
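For the next incident of this kind, snapshotting the processlist the moment the alert fires would preserve the evidence even if the overload clears quickly. A minimal sketch, assuming PyMySQL and a read-only account; the hostname suffix and credentials are placeholders:

```python
# Hypothetical snapshot helper (a sketch, not something that was run during
# this incident): dump SHOW FULL PROCESSLIST to a timestamped file so the
# evidence survives even if the overload clears before anyone can look.
import datetime
import pymysql

# "db1055" is the slave named in this task; the domain and the
# watchdog credentials are placeholders.
conn = pymysql.connect(host="db1055.example.internal",
                       user="watchdog", password="...")
try:
    with conn.cursor() as cur:
        cur.execute("SHOW FULL PROCESSLIST")
        rows = cur.fetchall()
finally:
    conn.close()

stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
with open(f"processlist-{stamp}.tsv", "w") as out:
    for row in rows:
        out.write("\t".join("" if col is None else str(col) for col in row) + "\n")
print(f"captured {len(rows)} threads at {stamp}")
```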

This is db1055 at the time:

[Image: db1055.png (373×747 px, 29 KB)]

fluorine's db error log is full of "too many connections" as well.
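Bucketing those errors per minute should line up with the graph above and locate the bursts. A minimal sketch, assuming a flat-file log whose lines start with a `YYYY-MM-DD HH:MM:SS` timestamp; the path on fluorine is an assumption:

```python
# Hypothetical tally (a sketch; the log path and line format are assumptions):
# count "too many connections" errors per minute to locate the bursts.
import re
from collections import Counter

per_minute = Counter()
with open("/a/mw-log/dberror.log") as log:   # assumed path on fluorine
    for line in log:
        if "too many connections" not in line.lower():
            continue
        # assume a leading timestamp like "2015-09-13 14:21:06"
        m = re.match(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2})", line)
        if m:
            per_minute[m.group(1)] += 1

for minute, n in sorted(per_minute.items()):
    print(minute, n)
```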

faidon claimed this task.

The cause was determined to be an attack (in three bursts) on our servers, in a successful attempt to overload them. I won't say more — we have a policy on not documenting (or commenting on) such attacks in public as this may let the attacker know of our response measures or give ideas to other attackers. Resolving.