
Text eqiad varnish 503 spikes
Closed, Resolved · Public

Assigned To
Authored By
fgiunchedi
Sep 13 2017, 9:18 AM

Description

This morning there were spikes of 503s which recovered by themselves but didn't seem to be related to the "upload mailbox lag" problem (i.e. T145661). In fact, the affected hosts were not alerting in Icinga for mailbox lag at the time and were part of the text cluster in eqiad:

2017-09-13-111526_858x496_scrot.png (496×858 px, 31 KB)
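
(For reference, a rough sketch of the kind of check used to spot these spikes: it pulls the per-minute 5xx rate from the Graphite render API on graphite1001 and flags datapoints above the 1000 req/min critical threshold quoted in the Icinga alerts later in this task. The metric path below is a placeholder assumption, not the series the actual Icinga check queries.)

```
#!/usr/bin/env python3
"""Sketch: flag 5xx-per-minute datapoints above the icinga critical threshold."""
import json
import urllib.request

GRAPHITE = "http://graphite1001"      # host named in the icinga alerts
METRIC = "reqstats.text.5xx"          # ASSUMPTION: placeholder metric path
THRESHOLD = 1000.0                    # critical threshold from the alerts

url = f"{GRAPHITE}/render?target={METRIC}&from=-3h&format=json"
with urllib.request.urlopen(url) as resp:
    series = json.load(resp)

for s in series:
    # Graphite returns datapoints as [value, unix_timestamp] pairs
    spikes = [(value, ts) for value, ts in s["datapoints"]
              if value is not None and value > THRESHOLD]
    print(f"{s['target']}: {len(spikes)} datapoints above {THRESHOLD}")
    for value, ts in spikes:
        print(f"  {ts}: {value:.0f} req/min")
```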

Event Timeline

Same thing happened this morning for cp1068 from 6:45 to 6:48 UTC:

Screen Shot 2017-09-14 at 9.25.36 AM.png (809×1 px, 82 KB)

It self-recovered, but caused 503s and alerts for various text domains.

EDIT: it happened twice in an hour, one big spike and a smaller one:

Screen Shot 2017-09-14 at 9.38.35 AM.png (353×1 px, 45 KB)

elukey triaged this task as High priority.Sep 14 2017, 7:41 AM

Screen Shot 2017-09-15 at 10.17.12 AM.png (299×1 px, 65 KB)

It happened again this morning; cp1053 seems to be the one to blame this time (it shows up with "int" in the X-Cache header).

It may be worth noting that the two spikes prior to cp1068's were also caused by cp1053.
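
As an aside, a minimal sketch of this kind of check: sample a text-cluster URL a few times and tally which cp hosts are marked "int" (internally generated response, e.g. a synthetic 503) in the X-Cache header. The URL and the exact header layout (comma-separated "cpNNNN <state>" entries) are assumptions based on the reports above.

```
#!/usr/bin/env python3
"""Sketch: tally cp hosts answering with "int" in X-Cache over a few samples."""
import collections
import urllib.error
import urllib.request

URL = "https://en.wikipedia.org/wiki/Main_Page"   # assumption: any text-cluster URL
SAMPLES = 20

int_hosts = collections.Counter()
for _ in range(SAMPLES):
    req = urllib.request.Request(URL, headers={"User-Agent": "xcache-check"})
    try:
        headers = urllib.request.urlopen(req).headers
    except urllib.error.HTTPError as err:   # a 503 lands here, headers are intact
        headers = err.headers
    for entry in headers.get("X-Cache", "").split(","):
        parts = entry.split()
        if len(parts) >= 2 and parts[1] == "int":
            int_hosts[parts[0]] += 1

print("hosts answering with 'int':", dict(int_hosts) or "none")
```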

Urbanecm raised the priority of this task from High to Unbreak Now!.Sep 16 2017, 4:50 PM

Breaking a lot of things.

Paladox lowered the priority of this task from Unbreak Now! to High.Sep 16 2017, 5:00 PM

Changing to High as things are stable now. If things break again, we can set it back to Unbreak Now!

Request from 88.182.181.224 via cp1052 cp1052, Varnish XID 966459488
Error: 503, Backend fetch failed at Sat, 16 Sep 2017 18:15:18 GMT

Hmm, not stable now.

[19:18:20] <+icinga-wm> PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[19:18:29] <+icinga-wm> PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[19:18:49] <+icinga-wm> PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[19:19:59] <+icinga-wm> PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
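
For quick triage of these alerts, a small helper that pulls the site, percentage and threshold out of icinga-wm lines like the ones above (the regex matches the exact message format pasted in this task; the sample log is copied from it):

```
#!/usr/bin/env python3
"""Sketch: parse site/percentage/threshold out of the pasted icinga-wm alerts."""
import re

PATTERN = re.compile(
    r"PROBLEM - (?P<site>\w+) HTTP 5xx reqs/min .* "
    r"(?P<pct>[\d.]+)% of data above the critical threshold "
    r"\[(?P<thr>[\d.]+)\]"
)

log = """\
[19:18:20] <+icinga-wm> PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[19:18:29] <+icinga-wm> PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
"""

for line in log.splitlines():
    m = PATTERN.search(line)
    if m:
        print(f"{m['site']}: {m['pct']}% of datapoints above {m['thr']} req/min")
```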

It looks like cp1052 had a spike, but has since recovered

image.png (847×1 px, 39 KB)

RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]

Request from 88.182.181.224 via cp1052 cp1052, Varnish XID 34013240
Error: 503, Backend fetch failed at Sat, 16 Sep 2017 19:46:47 GMT

In T175803#3612226, @Samtar wrote:

It looks like cp1052 had a spike, but has since recovered

The problem seems to be back; I'm getting errors like this:

Request from <ip> via cp1053 cp1053, Varnish XID 520454533
Error: 503, Backend fetch failed at Sun, 17 Sep 2017 05:36:03 GMT
07:22:08 <+icinga-wm> PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
07:22:28 <+icinga-wm> PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
07:24:08 <+icinga-wm> PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000

Problem still persists:

<icinga-wm> PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]

image.png (846×1 px, 65 KB)

image.png (295×1 px, 12 KB)

The frequency of spikes seems to have increased over the last 24 hours compared to the last seven days.

image.png (841×1 px, 59 KB)

image.png (295×1 px, 17 KB)

Restarted varnish-backend on cp1053 and cp1052 since they were showing up frequently with "int" in the X-Cache headers.

Change 376751 had a related patch set uploaded (by Ema; owner: BBlack):
[operations/puppet@production] VCL: stabilize backend storage patterns

https://gerrit.wikimedia.org/r/376751

Change 376751 merged by BBlack:
[operations/puppet@production] VCL: stabilize backend storage patterns

https://gerrit.wikimedia.org/r/376751

Two weeks later, is there more to do in this open task?
Did https://gerrit.wikimedia.org/r/376751 "properly" fix this?
Or is some "underlying issue investigation" needed?

The Gerrit patch above, as well as the revert in https://gerrit.wikimedia.org/r/#/c/378731/, helped a lot. Will downgrade the priority for now.

There's still some overlap and/or confusion between the 503 issues in this ticket, T174932, and T145661, and there are still some lesser recurrent 503s in esams that we don't have a good explanation for yet. We're still looking at those, and holding off on resolving the related tickets until we're sure it's all sorted out.

BBlack lowered the priority of this task from High to Medium.Oct 4 2017, 10:47 PM

Change 382477 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache_canary: disable max_connections

https://gerrit.wikimedia.org/r/382477

Change 382477 merged by Ema:
[operations/puppet@production] cache_canary: disable max_connections

https://gerrit.wikimedia.org/r/382477

Change 382651 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache_text: disable max_connections

https://gerrit.wikimedia.org/r/382651

Change 382651 merged by Ema:
[operations/puppet@production] cache_text: disable max_connections

https://gerrit.wikimedia.org/r/382651

Change 382654 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache_text: disable varnish_be<->varnish_be max_connections

https://gerrit.wikimedia.org/r/382654

Change 382654 merged by Ema:
[operations/puppet@production] cache_text: disable varnish_be<->varnish_be max_connections

https://gerrit.wikimedia.org/r/382654

BBlack claimed this task.

^ The above seems to have resolved the esams-specific 503s. Closing this up!