
Recurrent 'mailbox lag' critical alerts and 500s
Closed, Duplicate (Public)

Description

The 'varnish mailbox lag' icinga alerts, as implemented in the parent task, have been going CRITICAL for a while and in some cases result in 503 spikes that last until a manual varnish-backend-restart is issued on the affected machine.

I'm opening a task more specific than T145661 (varnish backends start returning 503s after ~6 days uptime) to investigate whether there's more we can do to mitigate the recurring mailbox problem. This is separate from the general problem of upload 500s and the file backend, of which, AFAIU, mailbox lag could be just a symptom and not the root cause.


Event Timeline

ema moved this task from Triage to Caching on the Traffic board. (Sep 7 2017, 12:29 PM)
Samtar added a subscriber: Samtar. (Sep 10 2017, 1:50 PM)

Change 376751 had a related patch set uploaded (by Ema; owner: BBlack):
[operations/puppet@production] VCL: stabilize backend storage patterns

https://gerrit.wikimedia.org/r/376751

Change 376751 merged by BBlack:
[operations/puppet@production] VCL: stabilize backend storage patterns

https://gerrit.wikimedia.org/r/376751

ema added a comment. (Sep 25 2017, 1:35 PM)

So it looks like stabilizing backend storage patterns in combination with reverting keep time on text back to 7d did improve the situation. The last critical mbox lag alert was raised ~3 days ago, on Sep 22 03:49 (cp2011).

BBlack closed this task as Resolved. (Oct 10 2017, 3:56 PM)
BBlack claimed this task.

Change 419089 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] cron_splay: add a semiweekly mode of operation

https://gerrit.wikimedia.org/r/419089

Change 419090 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] varnish: restart backends every 3.5 days

https://gerrit.wikimedia.org/r/419090

Change 419091 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] varnish: remove weekly restart cron entries

https://gerrit.wikimedia.org/r/419091

Change 419091 abandoned by BBlack:
varnish: remove weekly restart cron entries

https://gerrit.wikimedia.org/r/419091

Change 419089 merged by BBlack:
[operations/puppet@production] cron_splay: add a semiweekly mode of operation

https://gerrit.wikimedia.org/r/419089

Change 419090 merged by BBlack:
[operations/puppet@production] varnish: restart backends every 3.5 days

https://gerrit.wikimedia.org/r/419090
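
As a rough illustration of what a semiweekly splay could look like, the sketch below derives two restart slots per host, 3.5 days apart, from a hash of the hostname. This is not the cron_splay implementation from the patch above; the function names, the seed string, and the hash-based spreading are all assumptions made for the example.

```python
import hashlib

MINUTES_PER_WEEK = 7 * 24 * 60       # 10080
HALF_WEEK = MINUTES_PER_WEEK // 2    # 5040 minutes = 3.5 days


def semiweekly_slots(hostname, seed="varnish-backend-restart"):
    """Return two minute-of-week offsets, exactly 3.5 days apart.

    The seed and the sha256-based spreading are hypothetical; they only
    make the result stable per host and different between hosts.
    """
    digest = hashlib.sha256(f"{seed}:{hostname}".encode()).digest()
    offset = int.from_bytes(digest[:8], "big") % HALF_WEEK
    return [offset, offset + HALF_WEEK]


def to_cron_fields(minute_of_week):
    """Convert a minute-of-week offset to (minute, hour, day_of_week).

    day_of_week follows the cron convention where 0 is Sunday.
    """
    day, rem = divmod(minute_of_week, 24 * 60)
    hour, minute = divmod(rem, 60)
    return (minute, hour, day)
```

For a given host, `[to_cron_fields(s) for s in semiweekly_slots("cp2011")]` yields two cron time specifications per week. A pure hash does not guarantee even spacing between hosts, so a production splay would likely need to assign slots explicitly to keep simultaneous restarts within one cluster from overlapping.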

Mentioned in SAL (#wikimedia-operations) [2018-03-15T10:22:16Z] <ema> apt.w.o: upload varnish=5.1.3-1wm4 to jessie-wikimedia/main (upstream "extrachance" fixes) T174932

Change 419705 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish: remove gethdr_extrachance

https://gerrit.wikimedia.org/r/419705

Change 419705 merged by Ema:
[operations/puppet@production] varnish: move gethdr_extrachance to runtime_params

https://gerrit.wikimedia.org/r/419705

ema reopened this task as Open. (Mar 20 2018, 11:25 AM)

This occurred again today. Reopening.

Change 420680 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnishospital: send origin servers health logs to logstash

https://gerrit.wikimedia.org/r/420680

Change 420680 merged by Ema:
[operations/puppet@production] varnishospital: send origin servers health logs to logstash

https://gerrit.wikimedia.org/r/420680

Change 420977 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnishospital: distinguish between origin server and vcl id

https://gerrit.wikimedia.org/r/420977

Change 420977 merged by Ema:
[operations/puppet@production] varnishospital: distinguish between origin server and vcl id

https://gerrit.wikimedia.org/r/420977

The 'varnish mailbox lag' icinga alerts, as implemented in the parent task, have been going CRITICAL for a while and in some cases result in 503 spikes that last until a manual varnish-backend-restart is issued on the affected machine.
I'm opening a task more specific than T145661 (varnish backends start returning 503s after ~6 days uptime) to investigate whether there's more we can do to mitigate the recurring mailbox problem. This is separate from the general problem of upload 500s and the file backend, of which, AFAIU, mailbox lag could be just a symptom and not the root cause.

I don't think there's much distinction to be drawn here. Almost any time anything goes sideways in terms of functionality or performance at the varnishd level (including some problems inducible by bad clients and/or misbehaving applayers), mailbox lag can spike while 503s occur. When varnish starts falling apart, the expiry thread seems to be one of the first things to go. I think in recent times, as evidenced on the upload cluster in particular, the storage-level mitigations + varnish5 upgrades have largely solved the storage-inefficiency-driven mailbox lag ramps. The rest is just effect rather than cause.
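
For reference, mailbox lag is usually understood as the number of objects mailed to the expiry thread that it has not yet processed, which can be estimated from the `MAIN.exp_mailed` and `MAIN.exp_received` counters in `varnishstat`. The following is a minimal sketch under that assumption; it is not the icinga check from the parent task, and the thresholds and function names are hypothetical.

```python
import json
import subprocess


def mailbox_lag(stats):
    """Return exp_mailed - exp_received from parsed `varnishstat -j` output."""
    # Some Varnish releases nest counters under a "counters" key,
    # others keep them at the top level; accept both layouts.
    counters = stats.get("counters", stats)
    mailed = counters["MAIN.exp_mailed"]["value"]
    received = counters["MAIN.exp_received"]["value"]
    return mailed - received


def classify(lag, warn=10_000, crit=100_000):
    """Map a lag value to an alert state.

    The default thresholds are placeholders, not the production values.
    """
    if lag >= crit:
        return "CRITICAL"
    if lag >= warn:
        return "WARNING"
    return "OK"


def read_varnishstat(instance=None):
    """Run varnishstat once and return its JSON output as a dict."""
    cmd = ["varnishstat", "-j",
           "-f", "MAIN.exp_mailed", "-f", "MAIN.exp_received"]
    if instance is not None:
        # -n selects a named varnishd instance.
        cmd += ["-n", instance]
    return json.loads(subprocess.check_output(cmd))
```

On a cache host, `classify(mailbox_lag(read_varnishstat()))` would give a one-shot reading. As the comment above argues, a high value does not by itself say whether the lag is a cause or an effect of the 503s.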

Change 425045 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish: restart backends every 7 days

https://gerrit.wikimedia.org/r/425045

Change 425045 abandoned by Ema:
varnish: restart backends every 7 days

Reason:
See https://gerrit.wikimedia.org/r/#/c/425046/ instead.

https://gerrit.wikimedia.org/r/425045