
Text eqiad varnish 503 spikes
Closed, Resolved · Public

Description

This morning there were spikes of 503s which recovered by themselves but didn't seem to be related to the "upload mailbox lag" problem (i.e. T145661). In fact, the affected hosts were not alerting in Icinga for mailbox lag at the time and were part of the text cluster in eqiad:

Event Timeline

Restricted Application added a subscriber: Aklapper. Sep 13 2017, 9:18 AM
Samtar added a subscriber: Samtar. Sep 13 2017, 10:14 AM
elukey added a subscriber: elukey. (Edited) Sep 14 2017, 7:26 AM

Same thing happened this morning for cp1068 from 6:45 to 6:48 UTC:

It self-recovered, but caused 503s and alerts for various text domains.

EDIT: it happened twice in an hour, one big spike and a smaller one:

elukey triaged this task as High priority. Sep 14 2017, 7:41 AM

Happened again this morning; cp1053 seems to be the one to blame this time ("int" entries for it in the X-Cache header).

It may be worth noting that the two spikes prior to cp1068's were also caused by cp1053.
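
For reference, the X-Cache response header lists the cache hosts that handled the request together with a per-host result, and an "int" entry marks the host that generated a synthetic (internal) response such as this 503 error page. A minimal sketch of pulling those hosts out of a header; the "host status, host status, ..." layout is an assumption about the format, and the sample header is made up:

```
# Sketch: find which cache host produced a synthetic ("int") response,
# given an X-Cache header. The "host status, host status, ..." layout
# is an assumption about the header format, and the sample is made up.

def hosts_with_int(x_cache: str) -> list[str]:
    """Return the cache hostnames marked 'int' in an X-Cache header."""
    hosts = []
    for entry in x_cache.split(","):
        parts = entry.split()
        if len(parts) >= 2 and parts[1] == "int":
            hosts.append(parts[0])
    return hosts


print(hosts_with_int("cp1053 int, cp3030 miss, cp3040 miss"))
# -> ['cp1053']
```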

Urbanecm raised the priority of this task from High to Unbreak Now!. Sep 16 2017, 4:50 PM

Breaking a lot of things.

Paladox lowered the priority of this task from Unbreak Now! to High. Sep 16 2017, 5:00 PM

Changing to High as things are stable now. If things break again, we can set it back to Unbreak Now!.

Yann added a subscriber: Yann. Sep 16 2017, 6:15 PM

Request from 88.182.181.224 via cp1052 cp1052, Varnish XID 966459488
Error: 503, Backend fetch failed at Sat, 16 Sep 2017 18:15:18 GMT

Hmm, not stable now.
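
The error reports above follow a fixed pattern, which makes it easy to pull out the cache host and Varnish XID for follow-up (e.g. searching the logs on that host). A rough sketch, with the regex inferred from the reports quoted in this task rather than from the actual error template:

```
import re

# Sketch: extract the cache host and transaction ID from the user-facing
# 503 error text quoted above. The regex is inferred from the reports in
# this task and may not match every variant of the error template.
ERROR_RE = re.compile(
    r"via (?P<host>cp\d+).*?Varnish XID (?P<xid>\d+).*?"
    r"Error: (?P<status>\d+)",
    re.DOTALL,
)

report = """Request from 88.182.181.224 via cp1052 cp1052, Varnish XID 966459488
Error: 503, Backend fetch failed at Sat, 16 Sep 2017 18:15:18 GMT"""

m = ERROR_RE.search(report)
if m:
    print(m.group("host"), m.group("xid"), m.group("status"))
# -> cp1052 966459488 503
```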

[19:18:20] <+icinga-wm> PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[19:18:29] <+icinga-wm> PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[19:18:49] <+icinga-wm> PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[19:19:59] <+icinga-wm> PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]

It looks like cp1052 had a spike, but has since recovered.

RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
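
The icinga checks quoted above fire when a given fraction of the recent Graphite datapoints for the 5xx-per-minute metric sit above the critical threshold (1000 req/min here) and recover once the rate falls back under a much lower OK threshold (250). A toy sketch of that "percentage of data above threshold" logic; this is not the actual check_graphite implementation, and the window size and sample values are made up:

```
# Toy sketch of the "percentage of data above threshold" logic reported
# by the icinga alerts above. Not the actual check_graphite code; the
# sample values are invented.

def percent_above(datapoints: list[float], threshold: float) -> float:
    """Percentage of datapoints strictly above the threshold."""
    if not datapoints:
        return 0.0
    over = sum(1 for v in datapoints if v > threshold)
    return 100.0 * over / len(datapoints)


# Hypothetical 5xx req/min samples for the last few minutes:
samples = [120, 340, 2600, 1800, 90, 150, 3100, 110, 95]
pct = percent_above(samples, 1000.0)
print(f"{pct:.2f}% of data above the critical threshold [1000.0]")
# -> 33.33% of data above the critical threshold [1000.0]
```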

Yann added a comment. Sep 16 2017, 7:47 PM

Request from 88.182.181.224 via cp1052 cp1052, Varnish XID 34013240
Error: 503, Backend fetch failed at Sat, 16 Sep 2017 19:46:47 GMT

Steinsplitter added a comment. (Edited) Sep 17 2017, 5:36 AM

> It looks like cp1052 had a spike, but has since recovered.

The problem seems to be back; getting errors like this:

Request from <ip> via cp1053 cp1053, Varnish XID 520454533
Error: 503, Backend fetch failed at Sun, 17 Sep 2017 05:36:03 GMT
07:22:08 <+icinga-wm> PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
07:22:28 <+icinga-wm> PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
07:24:08 <+icinga-wm> PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]

Problem still persists:

<icinga-wm> PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
jrbs added a subscriber: jrbs. Sep 17 2017, 7:19 AM

The frequency of spikes seems to be increasing over the last 24 hours compared to the last seven days.

Restarted varnish-backend on cp1053 and cp1052 since they were showing up frequently in X-Cache with the "int" status.
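
"Showing up frequently in X-Cache with the int status" amounts to counting, per cache host, how often that host is the one marked "int" in sampled 503 responses; the host dominating the count is the restart candidate. A sketch along those lines, reusing the header layout assumed earlier (the sample headers are hypothetical):

```
from collections import Counter

# Sketch: given X-Cache headers sampled from 503 responses, count how
# often each cache host appears with the "int" (synthetic response)
# status. The header layout and the samples below are hypothetical.

def int_host_counts(x_cache_headers: list[str]) -> Counter:
    counts: Counter = Counter()
    for header in x_cache_headers:
        for entry in header.split(","):
            parts = entry.split()
            if len(parts) >= 2 and parts[1] == "int":
                counts[parts[0]] += 1
    return counts


samples = [
    "cp1053 int, cp3030 miss, cp3040 miss",
    "cp1053 int, cp3032 miss, cp3043 miss",
    "cp1052 int, cp3030 miss, cp3040 miss",
]
print(int_host_counts(samples).most_common())
# -> [('cp1053', 2), ('cp1052', 1)]
```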

ema added a subscriber: ema. Sep 17 2017, 9:24 PM
ema moved this task from Triage to Caching on the Traffic board. Sep 18 2017, 8:33 AM

Change 376751 had a related patch set uploaded (by Ema; owner: BBlack):
[operations/puppet@production] VCL: stabilize backend storage patterns

https://gerrit.wikimedia.org/r/376751

Change 376751 merged by BBlack:
[operations/puppet@production] VCL: stabilize backend storage patterns

https://gerrit.wikimedia.org/r/376751

JJMC89 added a subscriber: JJMC89. Sep 20 2017, 12:58 AM
Elitre added a subscriber: Elitre. Sep 20 2017, 10:25 AM

Two weeks later, is there more to do in this open task?
Did https://gerrit.wikimedia.org/r/376751 "properly" fix this?
Or is some "underlying issue investigation" needed?

BBlack added a comment. (Edited) Oct 4 2017, 10:46 PM

The Gerrit patch above, as well as the revert in https://gerrit.wikimedia.org/r/#/c/378731/, helped a lot. I will downgrade the priority for now.

There's still some overlap and/or confusion between the 503 issues in this ticket, T174932, and T145661, and there are still some lesser recurrent 503s in esams that we don't have a good explanation for yet. We're still looking at those, and holding off on resolving the related tickets until we're sure it's all sorted out.

BBlack lowered the priority of this task from High to Normal. Oct 4 2017, 10:47 PM

Change 382477 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache_canary: disable max_connections

https://gerrit.wikimedia.org/r/382477

Change 382477 merged by Ema:
[operations/puppet@production] cache_canary: disable max_connections

https://gerrit.wikimedia.org/r/382477

Change 382651 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache_text: disable max_connections

https://gerrit.wikimedia.org/r/382651

Change 382651 merged by Ema:
[operations/puppet@production] cache_text: disable max_connections

https://gerrit.wikimedia.org/r/382651

Change 382654 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache_text: disable varnish_be<->varnish_be max_connections

https://gerrit.wikimedia.org/r/382654

Change 382654 merged by Ema:
[operations/puppet@production] cache_text: disable varnish_be<->varnish_be max_connections

https://gerrit.wikimedia.org/r/382654
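
For context on the three changes above: a Varnish backend definition can carry a max_connections cap, and once that many connections to the backend are in use, further fetches fail immediately rather than wait, which clients can see as "Backend fetch failed" 503s during short connection bursts; removing the cap lets those fetches proceed. A toy illustration of the fast-fail effect (the numbers are invented and this is not how Varnish is implemented internally):

```
from typing import Optional

# Toy model of the fast-fail effect of a max_connections cap: once the
# cap is reached, additional concurrent fetches fail immediately (503)
# instead of being served. Numbers are invented; this is not Varnish's
# actual implementation.

def fast_failed_fetches(concurrent_fetches: int,
                        max_connections: Optional[int]) -> int:
    """How many fetches fail fast for a given concurrency and cap."""
    if max_connections is None:        # cap disabled
        return 0
    return max(0, concurrent_fetches - max_connections)


burst = 1200  # hypothetical concurrent backend fetches during a spike
for cap in (1000, None):
    print(f"max_connections={cap}: "
          f"{fast_failed_fetches(burst, cap)} fetches fail fast")
# -> max_connections=1000: 200 fetches fail fast
# -> max_connections=None: 0 fetches fail fast
```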

BBlack closed this task as Resolved. Oct 10 2017, 3:56 PM
BBlack claimed this task.

^ The above seems to have resolved the esams-specific 503s. Closing this up!