Investigate varnishd child crashes when multiple nodes get depooled/pooled concurrently
Closed, Invalid · Public

Description

The issues described in T154780 have been caused by varnishd crashes triggered by multiple concurrent node depools/repools in codfw. The crashes need to be investigated further. See https://phabricator.wikimedia.org/P4724 for a crash log sample.

Related incident: https://wikitech.wikimedia.org/wiki/Incident_documentation/2017-01-06_Cache-upload

Event Timeline

Restricted Application added a subscriber: Aklapper.
ema triaged this task as Medium priority. Jan 6 2017, 8:42 PM

Recording this while I remember it:

  1. The VSLP director code panics if there are no backends defined for a director at VCL reload time.
  2. Backend caches at all sites have director definitions for eqiad and/or codfw backends.
  3. Therefore, if conftool is used to depool all backends for a cluster at either eqiad or codfw, the varnishd backend instances for that cluster will crash at all sites.
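The failure mode in the steps above can be sketched as a toy model (plain Python, not Varnish internals; the class and message are illustrative, assuming only that the VSLP director treats an empty backend list at VCL load time as fatal):

```python
# Toy model of why reloading VCL with an empty director backend list
# is fatal: the director requires at least one backend when it
# (re)builds its hash ring, and in varnishd the failure is a panic
# that kills the child process.

class VslpDirectorModel:
    def __init__(self, backends):
        if not backends:
            # Stand-in for the varnishd child panic.
            raise RuntimeError("panic: VSLP director has no backends")
        self.ring = sorted(backends)

ok = VslpDirectorModel(["cp2001", "cp2002"])   # normal reload: fine

try:
    VslpDirectorModel([])   # all backends depooled, then VCL reload
except RuntimeError as e:
    print(e)
```

The point is that the crash is triggered at reload time, not at request time, so a depool that empties a director takes down every varnishd that reloads VCL referencing it.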

Should we do something here? The same crash can occur at remote DCs as well (the frontends would crash if all local backends are depooled). Clearly there should be a depool_threshold sort of behavior for backend depooling, but I'm not sure at which layer we should inject it. Perhaps confctl? Perhaps the confd VCL go template (/shudder)?
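A minimal sketch of what a depool_threshold guard could look like, wherever it ends up living (the function name, data shape, and 0.5 default are hypothetical, not confctl's actual API):

```python
# Hypothetical depool_threshold guard: refuse a depool operation that
# would leave fewer than a threshold fraction of a cluster's backends
# pooled, so a director can never end up with zero backends at reload.

def can_depool(backends, to_depool, depool_threshold=0.5):
    """backends maps backend name -> pooled state (True/False).
    Returns True only if, after depooling to_depool, at least
    max(1, threshold * cluster size) backends remain pooled."""
    pooled = {name for name, is_pooled in backends.items() if is_pooled}
    remaining = pooled - set(to_depool)
    minimum = max(1, int(len(backends) * depool_threshold))
    return len(remaining) >= minimum

cluster = {"cp2001": True, "cp2002": True, "cp2003": True, "cp2004": True}
print(can_depool(cluster, ["cp2001"]))                         # one node: allowed
print(can_depool(cluster, list(cluster)))                      # whole cluster: refused
```

Enforcing this in the tooling layer (confctl) rather than in the VCL template keeps the generated VCL simple, at the cost of the guard being bypassable by anyone writing to etcd directly.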

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action covering all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task; it is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

@Vgutierrez and @BBlack: Is this still an issue? 6 years is a long time. :)

akosiaris subscribed.

Removing SRE; this task has already been triaged to a specific team.

Several major releases of Varnish later, I don't think this task still makes sense.