Our "Varnish frontend child restarted" Icinga check detected two unexpected restarts this morning: one on cp3036 and one on cp3049. The check counts how many times a varnish child process has been spawned since the manager process started and expects that count to be exactly 1, given that we never administratively restart the child without restarting the whole service. Judging by the logs, today's events might actually be due to two different issues, even though the failures happened only a few hours apart, after ~11 days of uneventful uptime since the esams upload upgrade to Varnish 5.
The varnish-fe child on cp3049 just died without any prior logs:
```
Jan 29 09:57:41 cp3049 varnishd[41745]: Child (41757) died signal=9
Jan 29 09:57:41 cp3049 varnishd[41745]: Child cleanup complete
Jan 29 09:57:41 cp3049 varnishd[41745]: Child (26540) Started
Jan 29 09:57:41 cp3049 varnishd[41745]: Child (26540) said Child starts
```
On cp3036, by contrast, the varnish-fe manager process failed to communicate with its child (the CLI ping got an unexpected reply) and attempted to kill it:
```
Jan 29 11:32:14 cp3036 varnishd[34561]: Failed to kill child with PID 34570: Operation not permitted
Jan 29 11:32:14 cp3036 varnishd[34561]: Unexpected reply from ping: 400 CLI communication error (hdr)
Jan 29 11:32:14 cp3036 varnishd[34561]: Failed to kill child with PID 34570: Operation not permitted
Jan 29 11:32:14 cp3036 varnishd[34561]: Unexpected reply from ping: 400 CLI communication error
Jan 29 11:32:14 cp3036 varnishd[34561]: Child (34570) died signal=9
Jan 29 11:32:14 cp3036 varnishd[34561]: Child cleanup complete
```
Note that on cp3036 the manager process failed to kill its child ("Operation not permitted"), possibly because CAP_KILL is not in the Capability Bounding Set defined in the varnish-frontend systemd unit.
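One way to test the CAP_KILL hypothesis would be to inspect the bounding set of the running manager process. The sketch below is only an illustration, not part of our tooling: it assumes a Linux host, where `/proc/<pid>/status` exposes the bounding set as the hexadecimal `CapBnd` field and CAP_KILL is bit 5. The helper names `has_capability` and `bounding_set_of` are made up for this example.

```python
CAP_KILL = 5  # bit index of CAP_KILL in the Linux capability bitmask


def has_capability(cap_mask_hex, cap_bit):
    """Return True if bit `cap_bit` is set in a hex capability mask,
    such as the CapBnd value from /proc/<pid>/status."""
    return bool((int(cap_mask_hex, 16) >> cap_bit) & 1)


def bounding_set_of(pid):
    """Read the CapBnd hex mask of a process from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("CapBnd:"):
                return line.split()[1]
    raise LookupError(f"no CapBnd field found for PID {pid}")
```

For example, `has_capability(bounding_set_of(34561), CAP_KILL)` would report whether the cp3036 manager (PID 34561 in the log above) is allowed to hold CAP_KILL at all, assuming that process is still running.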
There is a [[ https://github.com/varnishcache/varnish-cache/issues/2513 | known bug ]] introduced in Varnish 5 that looks just like the problem described here, at the very least in cp3036's case. The bug report mentions that both 5.1.3 and 5.2.1 are affected, although the issue shows up significantly less often with 5.1.3 (which is the version we're running). Varnish bug [[ https://github.com/varnishcache/varnish-cache/issues/2518 | #2518 ]] also tracks the same problem and has been closed with a [[ https://github.com/varnishcache/varnish-cache/issues/2518 | patch for 5.2.1 ]] which does not apply to 5.1.3.
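To see whether other hosts hit the same events, the manager log lines quoted above can be scanned for child deaths. This is a minimal sketch, not the Icinga check itself (which counts child spawns rather than parsing logs). It assumes the syslog line format shown in this task, and the function name `child_deaths` is hypothetical.

```python
import re

# Matches manager log lines such as:
#   "... cp3049 varnishd[41745]: Child (41757) died signal=9"
DIED = re.compile(
    r"(?P<host>\S+) varnishd\[(?P<mgr>\d+)\]: "
    r"Child \((?P<child>\d+)\) died signal=(?P<sig>\d+)"
)


def child_deaths(lines):
    """Return (host, manager_pid, child_pid, signal) for each child death."""
    events = []
    for line in lines:
        m = DIED.search(line)
        if m:
            events.append(
                (m["host"], int(m["mgr"]), int(m["child"]), int(m["sig"]))
            )
    return events


# Lines copied from the two log excerpts in this task.
SAMPLE = [
    "Jan 29 09:57:41 cp3049 varnishd[41745]: Child (41757) died signal=9",
    "Jan 29 09:57:41 cp3049 varnishd[41745]: Child (26540) Started",
    "Jan 29 11:32:14 cp3036 varnishd[34561]: Failed to kill child with PID 34570: Operation not permitted",
    "Jan 29 11:32:14 cp3036 varnishd[34561]: Child (34570) died signal=9",
]

for event in child_deaths(SAMPLE):
    print(event)
```

Both deaths show up with signal 9, so this scan alone cannot tell the two cases apart. The preceding "Failed to kill child" and "CLI communication error" lines are what distinguish cp3036 from cp3049.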