
Multiple 503 Errors
Closed, ResolvedPublic

Description

Currently seeing a high number of 503 errors; see the attached log from the wmops IRC channel.

11:58:28 <icinga-wm> PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
11:58:48 <icinga-wm> PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0]
11:59:48 <icinga-wm> PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
12:00:38 <icinga-wm> PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
12:01:14 → ShakespeareFan00 joined (~chatzilla@gfarlie-adsl.demon.co.uk)
12:01:47 <ShakespeareFan00> Getting a lot of errors of the  "Request from <redacted> via cp1053 cp1053, Varnish XID 883949803
12:01:49 <ShakespeareFan00> Error: 503, Backend fetch failed at Sun, 10 Sep 2017 10:00:53 GMT" kind
12:01:54 <ShakespeareFan00> Replication running?
12:02:03 <ShakespeareFan00> or was there planned mantainence?
12:02:58 <tiddlywink> Wikidata has just died (normally I'd be grateful, but as I'm trying to sort out stuff there, it's a nuisance right now)
12:04:25 <tiddlywink> same error, cp1053, hitting refresh fixes it
12:04:38 <Vito> same here
12:04:43 <Vito> cp1053
12:04:54 <Vito> and no css
12:05:01 <Vito> (from esams)
12:05:57 <rxy> me too.  x-cache:cp1053 int, cp2004 miss, cp4028 pass, cp4027 miss
12:06:36 <ShakespeareFan00> I'd recentlly updated a template at Commons?
12:06:44 <TheresNoTime> cp1053 looks OK in icinga, loads of 503s from 09:50 UTC onwards
12:06:46 <ShakespeareFan00> Would that cause meltdown? I hope not
12:07:57 <ShakespeareFan00> Also https://ganglia.wikimedia.org/latest  seems to giving 403's 
12:08:18 <ShakespeareFan00> Presumably it's intended only approrpiate staff can access it
12:08:28 <ShakespeareFan00> 401's sorru
12:08:54 <Steinsplitter> getting tons of such errors: Caused by: java.io.IOException: Server returned HTTP response code: 503 for URL: https://commons.wikimedia.org/w/api.php?format=xml&rawcontinue=1&maxlag=6&action=query&prop=categories&cllimit=max&titles=File%3A%C3%89glise+St+%C3%89tienne+Montcet+6.jpg
12:09:47 <Steinsplitter> * Ops Clinic Duty: ema
12:09:59 <TheresNoTime> Welp, cp1053 does have a critical (Varnish HTTP text-backend - port 3128)
12:10:18 <icinga-wm> PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0]
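
For context on the bot errors Steinsplitter reports above: the failing request is a MediaWiki API query that already passes maxlag=6, so a well-behaved client is expected to back off and retry on both maxlag errors and cache-layer 503s like the ones in this task. A minimal sketch of that pattern (assuming Python and the requests library; this is not the bot's actual code, and the retry counts and sleeps are illustrative only):

```
import time
import requests  # assumed available; any HTTP client works

API = "https://commons.wikimedia.org/w/api.php"

def query_categories(title, max_attempts=5):
    params = {
        "format": "json",
        "action": "query",
        "prop": "categories",
        "cllimit": "max",
        "maxlag": 6,   # ask the API to refuse work when replication lag exceeds 6s
        "titles": title,
    }
    for attempt in range(max_attempts):
        resp = requests.get(API, params=params, timeout=30)
        if resp.status_code == 503:
            # Cache/backend failure (the errors reported in this task): back off and retry.
            time.sleep(2 ** attempt)
            continue
        data = resp.json()
        if data.get("error", {}).get("code") == "maxlag":
            # Replication lag too high; honour Retry-After if present.
            time.sleep(int(resp.headers.get("Retry-After", 5)))
            continue
        return data
    raise RuntimeError("gave up after repeated 503/maxlag responses")
```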

Event Timeline

Varnish has been serving a heck of a lot of 503s since 09:50 UTC

image.png (283×587 px, 11 KB)

cp1053 did flap the "Varnish HTTP text-backend - port 3128" check but has since recovered

Stack from a related event in Logstash

HTTPError: 503
    at Request.P.try.bind.then.err.retry.HTTPError.status (/srv/deployment/graphoid/deploy-cache/revs/1707b3c0ddab171e0d83b0cf89cbf157200e9e88/node_modules/preq/index.js:254:23)
    at Request.tryCatcher (/srv/deployment/graphoid/deploy-cache/revs/1707b3c0ddab171e0d83b0cf89cbf157200e9e88/node_modules/bluebird/js/release/util.js:16:23)
    at Promise._settlePromiseFromHandler (/srv/deployment/graphoid/deploy-cache/revs/1707b3c0ddab171e0d83b0cf89cbf157200e9e88/node_modules/bluebird/js/release/promise.js:512:31)
    at Promise._settlePromise (/srv/deployment/graphoid/deploy-cache/revs/1707b3c0ddab171e0d83b0cf89cbf157200e9e88/node_modules/bluebird/js/release/promise.js:569:18)
    at Promise._settlePromise0 (/srv/deployment/graphoid/deploy-cache/revs/1707b3c0ddab171e0d83b0cf89cbf157200e9e88/node_modules/bluebird/js/release/promise.js:614:10)
    at Promise._settlePromises (/srv/deployment/graphoid/deploy-cache/revs/1707b3c0ddab171e0d83b0cf89cbf157200e9e88/node_modules/bluebird/js/release/promise.js:693:18)
    at Promise._fulfill (/srv/deployment/graphoid/deploy-cache/revs/1707b3c0ddab171e0d83b0cf89cbf157200e9e88/node_modules/bluebird/js/release/promise.js:638:18)
    at Request._callback (/srv/deployment/graphoid/deploy-cache/revs/1707b3c0ddab171e0d83b0cf89cbf157200e9e88/node_modules/bluebird/js/release/nodeback.js:45:21)
    at Request.self.callback (/srv/deployment/graphoid/deploy-cache/revs/1707b3c0ddab171e0d83b0cf89cbf157200e9e88/node_modules/request/request.js:188:22)
    at emitTwo (events.js:106:13)
    at Request.emit (events.js:191:7)
    at Request.<anonymous> (/srv/deployment/graphoid/deploy-cache/revs/1707b3c0ddab171e0d83b0cf89cbf157200e9e88/node_modules/request/request.js:1171:10)
    at emitOne (events.js:96:13)
    at Request.emit (events.js:188:7)
    at Gunzip.<anonymous> (/srv/deployment/graphoid/deploy-cache/revs/1707b3c0ddab171e0d83b0cf89cbf157200e9e88/node_modules/request/request.js:1091:12)
    at Gunzip.g (events.js:292:16)
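
The trace shows graphoid's HTTP client (preq) receiving a 503 from its upstream and surfacing it as an HTTPError, i.e. the service appears to be hitting the same cache-layer 503s rather than causing them. A rough illustration of that failure mode (a Python sketch for clarity, not graphoid's actual Node.js/preq code):

```
import requests  # assumed available

class UpstreamError(Exception):
    """Raised when a dependency answers with a server error."""
    def __init__(self, status, url):
        super().__init__(f"upstream returned HTTP {status} for {url}")
        self.status = status

def fetch_graph_spec(url):
    resp = requests.get(url, timeout=10)
    if resp.status_code >= 500:
        # This is the point corresponding to "HTTPError: 503" in the stack above:
        # the upstream (behind the Varnish caches) failed, and the service
        # propagates that status to its own caller.
        raise UpstreamError(resp.status_code, url)
    return resp.json()
```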

Appears to be resolved now

image.png (272×1 px, 5 KB)

(13:12 UTC) Spoke too soon

image.png (277×217 px, 3 KB)

Mentioned in SAL (#wikimedia-operations) [2017-09-10T13:13:29Z] <elukey> restart cp1053's varnish backend for mailbox expiry lag and 503s - T175473
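
For context, the "mailbox expiry lag" in the SAL entry is the backlog between objects Varnish hands to its expiry thread and objects that thread has actually processed; a large lag was associated with these 503 spikes, and the mitigation at the time was to restart the backend instance. A rough sketch of how that lag can be read (assuming Varnish 4.x counter names and varnishstat -j output; the restart command in the comment is an assumption, not taken from this task):

```
import json
import subprocess

def mailbox_lag():
    # varnishstat -j emits all counters as JSON; in Varnish 4.x the counters
    # sit at the top level of the object.
    stats = json.loads(subprocess.check_output(["varnishstat", "-j"]))
    mailed = stats["MAIN.exp_mailed"]["value"]      # objects handed to the expiry thread
    received = stats["MAIN.exp_received"]["value"]  # objects the expiry thread has processed
    return mailed - received

if __name__ == "__main__":
    print(f"expiry mailbox lag: {mailbox_lag()}")
    # Operators would restart the backend instance (e.g. `systemctl restart varnish`)
    # once the lag stays above a chosen threshold, as was done for cp1053 here.
```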

Aklapper renamed this task from Multiple 503 Erros to Multiple 503 Errors. Sep 10 2017, 1:36 PM
Aklapper triaged this task as High priority.
elukey lowered the priority of this task from High to Medium. Sep 10 2017, 1:49 PM
elukey subscribed.

The 503s seem to be down to zero now. Lowering the priority to Normal since we are aware of this issue (https://phabricator.wikimedia.org/T174932) and everything looks good at the moment.

TheresNoTime assigned this task to elukey.

The overarching issue is tracked in T174932; there are still minimal 503s in Logstash and no further reports. Closing as resolved (the 503s were "resolved" by restarting the Varnish backends).