Our "Varnish frontend child restarted" Icinga check detected two unexpected restarts this morning: one on cp3036 and the other on cp3049. The check counts how many times a varnish child process has been spawned since the manager process started and expects that count to be exactly 1, since we never administratively restart the child without restarting the whole service. Judging by the logs, today's events might actually be due to two different issues, even though the failures happened only a few hours apart, after ~11 days of uneventful uptime since the esams upload upgrade to Varnish 5.
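The check logic described above can be sketched roughly as follows. This is only an illustration, not the actual Icinga plugin: the function name `evaluate_child_starts` is hypothetical, and it assumes the spawn count has already been obtained from somewhere (for example a manager-side counter). The return codes follow the usual Nagios/Icinga plugin convention.

```python
# Nagios/Icinga plugin exit codes (standard convention).
OK, CRITICAL, UNKNOWN = 0, 2, 3


def evaluate_child_starts(count):
    """Map the number of child spawns since manager startup to a check result.

    `count` is assumed to be supplied by the caller; how it is collected
    is outside the scope of this sketch.
    """
    if count is None or count < 1:
        # No usable data: the manager should have spawned at least one child.
        return UNKNOWN, "could not determine child start count"
    if count == 1:
        return OK, "child started once since manager startup"
    # Every spawn beyond the first is an unexpected restart.
    return CRITICAL, f"child restarted unexpectedly {count - 1} time(s)"


if __name__ == "__main__":
    status, message = evaluate_child_starts(2)
    print(status, message)
```

Treating anything other than exactly one spawn as a problem only works because the child is never restarted on its own administratively.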
The varnish-fe child on cp3049 just died without any prior logs:
Jan 29 09:57:41 cp3049 varnishd[41745]: Child (41757) died signal=9
Jan 29 09:57:41 cp3049 varnishd[41745]: Child cleanup complete
Jan 29 09:57:41 cp3049 varnishd[41745]: Child (26540) Started
Jan 29 09:57:41 cp3049 varnishd[41745]: Child (26540) said Child starts
On cp3036, by contrast, the varnish-fe manager process somehow failed to communicate with its child and attempted to kill it:
Jan 29 11:32:14 cp3036 varnishd[34561]: Failed to kill child with PID 34570: Operation not permitted
Jan 29 11:32:14 cp3036 varnishd[34561]: Unexpected reply from ping: 400 CLI communication error (hdr)
Jan 29 11:32:14 cp3036 varnishd[34561]: Failed to kill child with PID 34570: Operation not permitted
Jan 29 11:32:14 cp3036 varnishd[34561]: Unexpected reply from ping: 400 CLI communication error
Jan 29 11:32:14 cp3036 varnishd[34561]: Child (34570) died signal=9
Jan 29 11:32:14 cp3036 varnishd[34561]: Child cleanup complete
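To compare such events across hosts, the "died" lines can be pulled out of the manager logs with a small script. The following is a minimal sketch that assumes the syslog line format shown in the excerpts above; the helper name `child_deaths` and the regular expression are illustrative, not part of any existing tooling.

```python
import re
from collections import defaultdict

# Matches manager log lines such as:
#   Jan 29 09:57:41 cp3049 varnishd[41745]: Child (41757) died signal=9
DIED_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+"
    r"varnishd\[(?P<mgr>\d+)\]:\s+"
    r"Child \((?P<child>\d+)\) died signal=(?P<sig>\d+)$"
)


def child_deaths(lines):
    """Return {host: [(timestamp, child_pid, signal), ...]} for 'died' lines."""
    deaths = defaultdict(list)
    for line in lines:
        m = DIED_RE.match(line.strip())
        if m:
            deaths[m["host"]].append((m["ts"], int(m["child"]), int(m["sig"])))
    return dict(deaths)


SAMPLE = [
    "Jan 29 09:57:41 cp3049 varnishd[41745]: Child (41757) died signal=9",
    "Jan 29 09:57:41 cp3049 varnishd[41745]: Child cleanup complete",
    "Jan 29 11:32:14 cp3036 varnishd[34561]: Failed to kill child with PID 34570: Operation not permitted",
    "Jan 29 11:32:14 cp3036 varnishd[34561]: Child (34570) died signal=9",
]

if __name__ == "__main__":
    for host, events in child_deaths(SAMPLE).items():
        print(host, events)
```

Lines that do not match the "died" pattern (cleanup, ping errors, failed kills) are simply ignored here; they would need their own patterns if they are to be correlated too.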
Note how the management process failed to kill its child in cp3036's case ("Operation not permitted"), possibly because CAP_KILL is not in the Capability Bounding Set defined in the varnish-frontend systemd unit.
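One way to test the CAP_KILL hypothesis is to inspect the bounding set of the running manager process, which Linux exposes as the hexadecimal `CapBnd` mask in `/proc/<pid>/status`. The sketch below decodes that mask; the function name `bounding_set_has` is hypothetical, and using PID 34561 (the manager PID from the cp3036 logs) in the usage example is only an illustration.

```python
# CAP_KILL is capability number 5 in <linux/capability.h>.
CAP_KILL = 5


def bounding_set_has(status_text, cap=CAP_KILL):
    """Return True if `cap` is set in the CapBnd mask of a /proc/<pid>/status dump."""
    for line in status_text.splitlines():
        if line.startswith("CapBnd:"):
            # The value is a hexadecimal bitmask, one bit per capability.
            mask = int(line.split()[1], 16)
            return bool(mask & (1 << cap))
    raise ValueError("no CapBnd line found")


if __name__ == "__main__":
    # Illustrative usage; the PID must be replaced with the live manager PID.
    with open("/proc/34561/status") as f:
        print("CAP_KILL in bounding set:", bounding_set_has(f.read()))
```

If this reports False for the manager, the "Operation not permitted" errors are consistent with the missing capability, since signalling a process running under a different user requires CAP_KILL.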
There is a known bug, introduced with the new VSM/VSC code in Varnish 5, that looks just like the problem described here, at the very least in cp3036's case. The bug report mentions that both 5.1.3 and 5.2.1 are affected, although the issue shows up significantly less often with 5.1.3 (the version we're running). Varnish bug https://github.com/varnishcache/varnish-cache/issues/2518 also tracks the same problem and has been closed with a patch for 5.2.1 that does not apply to 5.1.3. That patch alleviates the issue but does not fully fix it.