
Investigate varnishd child crashes when multiple nodes get depooled/pooled concurrently
Open, Medium, Public

Description

The issues described in T154780 were caused by varnishd crashes triggered by multiple concurrent node depools/repools in codfw. The crashes need further investigation. See https://phabricator.wikimedia.org/P4724 for a sample crash log.

Event Timeline

ema created this task. Jan 6 2017, 8:42 PM
Restricted Application added a project: Operations. Jan 6 2017, 8:42 PM
Restricted Application added a subscriber: Aklapper.
ema triaged this task as Medium priority. Jan 6 2017, 8:42 PM
ema moved this task from Triage to Caching on the Traffic board. Jan 8 2017, 7:16 AM
BBlack added a subscriber: BBlack. Feb 3 2017, 1:13 PM

Recording this while I remember it:

  1. The VSLP director code panics if a director has no backends defined at VCL reload time (see the sketch after this list).
  2. Backend caches at all sites have director definitions for eqiad and/or codfw backends.
  3. Therefore, if conftool is used to depool all of a cluster's backends at either eqiad or codfw, every varnish backend for that cluster, at all sites, will crash.
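
For concreteness, a minimal Python sketch of what the confd VCL template effectively does: emit one add_backend() call per pooled node. All names, paths, and pool data here are illustrative rather than taken from the production template, and the vslp.vslp()/add_backend() calls are assumed to reflect the libvmod-vslp API in use at the time.

```
# Illustrative rendering of the vcl_init block from conftool pool state.
pool = {"cp2001": "pooled", "cp2002": "depooled", "cp2003": "depooled"}

pooled = [node for node, state in pool.items() if state == "pooled"]

lines = ["sub vcl_init {", "    new cluster = vslp.vslp();"]
lines += [f"    cluster.add_backend(be_{node});" for node in pooled]
lines.append("}")
print("\n".join(lines))

# If conftool depools every node, `pooled` is empty, no add_backend()
# calls are rendered, and loading the resulting VCL panics the varnishd
# child -- the crash chain described in the list above.
```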

Should we do something here? The same crash can happen at remote DCs as well (the frontends would crash if all local backends were depooled). Clearly there should be some depool_threshold-like behavior for backend depooling, but I'm not sure at which layer to inject it. Perhaps confctl? Perhaps the confd VCL go template (/shudder)?
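
As a rough sketch of the depool_threshold idea at the confctl layer: a wrapper that refuses any depool that would leave a cluster below a minimum number of pooled backends. The helpers (get_pool_state, depool) and the default threshold are hypothetical, not the real conftool API.

```
def safe_depool(pool, nodes, get_pool_state, depool, threshold=1):
    """Depool `nodes` from `pool` unless that would leave fewer than
    `threshold` pooled backends (which would panic varnishd on reload)."""
    pooled = get_pool_state(pool)  # hypothetical: set of pooled node names
    remaining = pooled - set(nodes)
    if len(remaining) < threshold:
        raise RuntimeError(
            f"refusing to depool {sorted(nodes)} from {pool}: only "
            f"{len(remaining)} backend(s) would remain (threshold={threshold})"
        )
    for node in nodes:
        depool(pool, node)  # hypothetical per-node depool call
```

Enforcing the threshold in the tooling keeps the VCL template simple, though a guard in the confd template would also catch direct etcd writes.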

Aklapper removed ema as the assignee of this task. Jun 19 2020, 4:22 PM

This task has been assigned to the same task owner for more than two years. Resetting the task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you are still realistically working on, or planning to work on, this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)