Multiple 503 Errors
Closed, ResolvedPublic

Description

Currently seeing a high number of 503 errors; see the attached log from the wmops IRC channel.

11:58:28 <icinga-wm> PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
11:58:48 <icinga-wm> PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0]
11:59:48 <icinga-wm> PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
12:00:38 <icinga-wm> PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
12:01:14 → ShakespeareFan00 joined (~chatzilla@gfarlie-adsl.demon.co.uk)
12:01:47 <ShakespeareFan00> Getting a lot of errors of the  "Request from <redacted> via cp1053 cp1053, Varnish XID 883949803
12:01:49 <ShakespeareFan00> Error: 503, Backend fetch failed at Sun, 10 Sep 2017 10:00:53 GMT" kind
12:01:54 <ShakespeareFan00> Replication running?
12:02:03 <ShakespeareFan00> or was there planned mantainence?
12:02:58 <tiddlywink> Wikidata has just died (normally I'd be grateful, but as I'm trying to sort out stuff there, it's a nuisance right now)
12:04:25 <tiddlywink> same error, cp1053, hitting refresh fixes it
12:04:38 <Vito> same here
12:04:43 <Vito> cp1053
12:04:54 <Vito> and no css
12:05:01 <Vito> (from esams)
12:05:57 <rxy> me too.  x-cache:cp1053 int, cp2004 miss, cp4028 pass, cp4027 miss
12:06:36 <ShakespeareFan00> I'd recentlly updated a template at Commons?
12:06:44 <TheresNoTime> cp1053 looks OK in icinga, loads of 503s from 09:50 UTC onwards
12:06:46 <ShakespeareFan00> Would that cause meltdown? I hope not
12:07:57 <ShakespeareFan00> Also https://ganglia.wikimedia.org/latest  seems to giving 403's 
12:08:18 <ShakespeareFan00> Presumably it's intended only approrpiate staff can access it
12:08:28 <ShakespeareFan00> 401's sorru
12:08:54 <Steinsplitter> getting tons of such errors: Caused by: java.io.IOException: Server returned HTTP response code: 503 for URL: https://commons.wikimedia.org/w/api.php?format=xml&rawcontinue=1&maxlag=6&action=query&prop=categories&cllimit=max&titles=File%3A%C3%89glise+St+%C3%89tienne+Montcet+6.jpg
12:09:47 <Steinsplitter> * Ops Clinic Duty: ema
12:09:59 <TheresNoTime> Welp, cp1053 does have a critical (Varnish HTTP text-backend - port 3128)
12:10:18 <icinga-wm> PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0]
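
For context, the icinga alerts quoted above fire when a given fraction of recent 5xx-rate samples from Graphite sits above a fixed threshold (1000 req/min here). A minimal sketch of that logic follows; the window contents, function name and warning/critical percentages are illustrative assumptions, not the real check configuration.

# Hypothetical sketch of a "percent of data above threshold" check, modelled
# on the icinga alerts above (5xx reqs/min, threshold 1000.0). The window
# length, warning/critical percentages and function name are assumptions,
# not the actual Graphite check configuration.

def check_5xx_rate(samples, threshold=1000.0, warn_pct=25.0, crit_pct=35.0):
    """Return an icinga-style state for a window of 5xx req/min samples."""
    if not samples:
        return "UNKNOWN", 0.0
    above = sum(1 for s in samples if s > threshold)
    pct = 100.0 * above / len(samples)
    if pct >= crit_pct:
        return "CRITICAL", pct
    if pct >= warn_pct:
        return "WARNING", pct
    return "OK", pct

# Example: 4 of 9 samples above 1000 req/min is 44.44%, matching the
# Eqiad/Ulsfo alerts quoted above.
state, pct = check_5xx_rate([200, 300, 1200, 1500, 900, 1100, 400, 1300, 250])
print(state, round(pct, 2))  # CRITICAL 44.44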
Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Sun, Sep 10, 10:13 AM
Samtar added a subscriber: Samtar. · Sun, Sep 10, 10:13 AM

Varnish has been serving a heck of a lot of 503s since 09:50 UTC

cp1053 did flap its "Varnish HTTP text-backend - port 3128" check but has since recovered

Stack from a related event in Logstash

HTTPError: 503
    at Request.P.try.bind.then.err.retry.HTTPError.status (/srv/deployment/graphoid/deploy-cache/revs/1707b3c0ddab171e0d83b0cf89cbf157200e9e88/node_modules/preq/index.js:254:23)
    at Request.tryCatcher (/srv/deployment/graphoid/deploy-cache/revs/1707b3c0ddab171e0d83b0cf89cbf157200e9e88/node_modules/bluebird/js/release/util.js:16:23)
    at Promise._settlePromiseFromHandler (/srv/deployment/graphoid/deploy-cache/revs/1707b3c0ddab171e0d83b0cf89cbf157200e9e88/node_modules/bluebird/js/release/promise.js:512:31)
    at Promise._settlePromise (/srv/deployment/graphoid/deploy-cache/revs/1707b3c0ddab171e0d83b0cf89cbf157200e9e88/node_modules/bluebird/js/release/promise.js:569:18)
    at Promise._settlePromise0 (/srv/deployment/graphoid/deploy-cache/revs/1707b3c0ddab171e0d83b0cf89cbf157200e9e88/node_modules/bluebird/js/release/promise.js:614:10)
    at Promise._settlePromises (/srv/deployment/graphoid/deploy-cache/revs/1707b3c0ddab171e0d83b0cf89cbf157200e9e88/node_modules/bluebird/js/release/promise.js:693:18)
    at Promise._fulfill (/srv/deployment/graphoid/deploy-cache/revs/1707b3c0ddab171e0d83b0cf89cbf157200e9e88/node_modules/bluebird/js/release/promise.js:638:18)
    at Request._callback (/srv/deployment/graphoid/deploy-cache/revs/1707b3c0ddab171e0d83b0cf89cbf157200e9e88/node_modules/bluebird/js/release/nodeback.js:45:21)
    at Request.self.callback (/srv/deployment/graphoid/deploy-cache/revs/1707b3c0ddab171e0d83b0cf89cbf157200e9e88/node_modules/request/request.js:188:22)
    at emitTwo (events.js:106:13)
    at Request.emit (events.js:191:7)
    at Request.<anonymous> (/srv/deployment/graphoid/deploy-cache/revs/1707b3c0ddab171e0d83b0cf89cbf157200e9e88/node_modules/request/request.js:1171:10)
    at emitOne (events.js:96:13)
    at Request.emit (events.js:188:7)
    at Gunzip.<anonymous> (/srv/deployment/graphoid/deploy-cache/revs/1707b3c0ddab171e0d83b0cf89cbf157200e9e88/node_modules/request/request.js:1091:12)
    at Gunzip.g (events.js:292:16)
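
The stack above is graphoid's preq HTTP client rejecting its promise with an HTTPError once the upstream fetch returns a 503. Since the reports above suggest the 503s were transient (a refresh fixed them), a caller can treat them as retryable; below is a minimal sketch of that idea, purely illustrative and not graphoid's actual (Node.js) code.

# Hypothetical illustration of retrying a transient 503. Not graphoid's
# actual code (that is Node.js, as shown in the stack above); the helper
# name, retry count and backoff are assumptions for illustration only.
import time
import urllib.error
import urllib.request

def fetch_with_retry(url, retries=3, delay=0.5):
    """Fetch a URL, backing off and retrying when the server returns 503."""
    for attempt in range(retries + 1):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Only 503 ("Backend fetch failed") is treated as transient here;
            # anything else, or the final attempt, is re-raised to the caller.
            if err.code != 503 or attempt == retries:
                raise
            time.sleep(delay * 2 ** attempt)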
Samtar added a comment. · Edited · Sun, Sep 10, 12:19 PM

Appears to be resolved now

(13:12 UTC) Spoke too soon

elukey added a subscriber: ema.

Mentioned in SAL (#wikimedia-operations) [2017-09-10T13:13:29Z] <elukey> restart cp1053's varnish backend for mailbox expiry lag and 503s - T175473

Aklapper renamed this task from "Multiple 503 Erros" to "Multiple 503 Errors". · Sun, Sep 10, 1:36 PM
Aklapper triaged this task as High priority.
elukey lowered the priority of this task from High to Normal. · Sun, Sep 10, 1:49 PM
elukey added a subscriber: elukey.

The 503s seem to be down to zero now. Lowering the priority to Normal since we are aware of this issue (https://phabricator.wikimedia.org/T174932) and everything looks good at the moment.

Samtar closed this task as Resolved.Sun, Sep 10, 8:26 PM
Samtar assigned this task to elukey.

The overarching issue is tracked in T174932; there are still only minimal 503s in Logstash and no further reports. Closing as resolved (the 503s were "resolved" by restarting the Varnish backends).