This morning there were spikes of 503s which recovered by themselves but didn't seem to be related to the "upload mailbox lag" problem (i.e. T145661). In fact, the affected hosts were not alerting in Icinga for mailbox lag at the time, and were part of the text cluster in eqiad:
Related Objects
Mentioned In:
- T217893: Traffic (text) instability due to misbehaving cache server (cp1077), causing a 1.5-2% requests failing
- T181315: Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache)
- T189085: Resources and pages occasionally take seconds to respond or fail
- T175319: cp1066 unexplained 503 spikes

Mentioned Here:
- T174932: Recurrent 'mailbox lag' critical alerts and 500s
- T176047: Error 503, Backend fetch failed.
- T145661: varnish backends start returning 503s after ~6 days uptime
Event Timeline
Same thing happened this morning for cp1068 from 6:45 to 6:48 UTC:
Self-recovered, but caused 503s and alerts for various text domains.
EDIT: it happened twice in an hour, one big spike and a smaller one:
Happened again this morning; cp1053 seems to be the one to blame this time (it shows up with 'int' in the X-Cache header).
It may be worth noting that the two spikes prior to cp1068's were also caused by cp1053.
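For reference, a minimal sketch of how the offending backend can be tallied from a sample of X-Cache header values (a hypothetical standalone helper, not part of any existing tooling; it only assumes the usual comma-separated "host status" X-Cache format):

```python
#!/usr/bin/env python3
"""Hypothetical helper: count which cache host is producing synthetic
('int') responses, given one X-Cache header value per line on stdin,
e.g. "cp1053 int, cp3040 miss, cp3030 hit/5"."""
from collections import Counter
import sys

counts = Counter()
for line in sys.stdin:
    for entry in line.strip().split(","):
        parts = entry.split()
        # parts[0] is the cache host, parts[1] its status for this request
        if len(parts) >= 2 and parts[1].startswith("int"):
            counts[parts[0]] += 1

for host, n in counts.most_common():
    print(f"{host}\t{n}")
```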
Changing priority to High as things are stable now, but if things break again we can set it to Unbreak Now.
Request from 88.182.181.224 via cp1052 cp1052, Varnish XID 966459488
Error: 503, Backend fetch failed at Sat, 16 Sep 2017 18:15:18 GMT
Hmm, not stable now.
[19:18:20] <+icinga-wm> PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[19:18:29] <+icinga-wm> PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[19:18:49] <+icinga-wm> PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[19:19:59] <+icinga-wm> PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
It looks like cp1052 had a spike, but it has since recovered.
RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
Request from 88.182.181.224 via cp1052 cp1052, Varnish XID 34013240
Error: 503, Backend fetch failed at Sat, 16 Sep 2017 19:46:47 GMT
The problem seems to be back; I'm getting errors such as:
Request from <ip> via cp1053 cp1053, Varnish XID 520454533
Error: 503, Backend fetch failed at Sun, 17 Sep 2017 05:36:03 GMT
07:22:08 <+icinga-wm> PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
07:22:28 <+icinga-wm> PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
07:24:08 <+icinga-wm> PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
Problem still persists:
<icinga-wm> PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
The frequency of the spikes seems to have increased over the last 24 hours when compared to the previous seven days.
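To put a rough number on that, here is a minimal sketch comparing how many 1-minute samples crossed the 1000 req/min alert threshold in the last 24 hours versus the previous seven days, using the standard Graphite render API. The Graphite URL and the metric path are assumptions for illustration, not the actual production names:

```python
#!/usr/bin/env python3
"""Rough sketch: compare 5xx spike counts over two windows via the
Graphite render API. Host and metric path below are placeholders."""
import requests

GRAPHITE = "https://graphite1001.example.org/render"  # assumed host
METRIC = "reqstats.5xx.text"                           # hypothetical metric path
THRESHOLD = 1000.0                                     # req/min, same as the Icinga alert

def spikes(from_spec, until_spec="now"):
    resp = requests.get(GRAPHITE, params={
        "target": METRIC, "from": from_spec, "until": until_spec, "format": "json",
    })
    resp.raise_for_status()
    datapoints = resp.json()[0]["datapoints"]          # [[value, timestamp], ...]
    return sum(1 for value, _ in datapoints if value is not None and value > THRESHOLD)

print("last 24h:", spikes("-24h"))
print("previous 7d:", spikes("-8d", "-24h"))
```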
Restarted varnish-backend on cp1053 and cp1052 since they were showing up frequently with 'int' in the X-Cache headers.
Change 376751 had a related patch set uploaded (by Ema; owner: BBlack):
[operations/puppet@production] VCL: stabilize backend storage patterns
Change 376751 merged by BBlack:
[operations/puppet@production] VCL: stabilize backend storage patterns
Two weeks later, is there more to do in this open task?
Did https://gerrit.wikimedia.org/r/376751 "properly" fix this?
Or is some "underlying issue investigation" needed?
The Gerrit patch above, as well as the revert in https://gerrit.wikimedia.org/r/#/c/378731/, helped a lot. Will downgrade priority for now.
There's still some overlap and/or confusion between the 503 issues in this ticket, T174932 and T145661, and there are still some lesser recurrent 503s in esams that we don't yet have a good explanation for. We're still looking into those, and holding off on resolving the related tickets until we're sure it's all sorted out.
Change 382477 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache_canary: disable max_connections
Change 382477 merged by Ema:
[operations/puppet@production] cache_canary: disable max_connections
Change 382651 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache_text: disable max_connections
Change 382651 merged by Ema:
[operations/puppet@production] cache_text: disable max_connections
Change 382654 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache_text: disable varnish_be<->varnish_be max_connections
Change 382654 merged by Ema:
[operations/puppet@production] cache_text: disable varnish_be<->varnish_be max_connections