This morning we got the following alert:
09:48 < icinga-wm> PROBLEM - Varnish HTTP maps-backend - port 3128 on cp1047 is CRITICAL: Connection refused
Indeed, the backend varnishd child process was dead on the machine, even though systemd still reported the service as active:
● varnish.service - varnish (Varnish HTTP Accelerator)
   Loaded: loaded (/lib/systemd/system/varnish.service; enabled)
   Active: active (running) since Mon 2016-07-11 11:26:53 UTC; 1 months 1 days ago
 Main PID: 20529 (varnishd)
   CGroup: /system.slice/varnish.service
           └─20529 /usr/sbin/varnishd -P /run/varnish.pid -a :3128 -T 127.0.0.1:6083 -f /etc/varnish/wikimedia_maps-backend.vcl -p thread_pool_min=250 -p thread_pool_max=8000 -p thread_poo...

Aug 08 18:32:51 cp1047 varnishd[20529]: CLI telnet 127.0.0.1 45716 127.0.0.1 6083 Wr 200 Message from VCC-compiler: Unused sub cluster_be_recv_applayer_backend, defined: ('/etc/varnish/maps-backend.inc.vcl' Line 5 Pos 5)...
Aug 08 18:32:51 cp1047 varnishd[20529]: CLI telnet 127.0.0.1 45770 127.0.0.1 6083 Rd auth blabla
Aug 08 18:32:51 cp1047 varnishd[20529]: CLI telnet 127.0.0.1 45770 127.0.0.1 6083 Wr 200 ----------------------------- Varnish Cache CLI 1.0 -----------------------------...
Aug 08 18:32:51 cp1047 varnishd[20529]: CLI telnet 127.0.0.1 45770 127.0.0.1 6083 Rd ping
Aug 08 18:32:51 cp1047 varnishd[20529]: CLI telnet 127.0.0.1 45770 127.0.0.1 6083 Wr 200 PONG 1470681171 1.0
Aug 08 18:32:51 cp1047 varnishd[20529]: CLI telnet 127.0.0.1 45770 127.0.0.1 6083 Rd vcl.use 24125a4c-29a9-4702-b184-993780b4db4e
Aug 08 18:32:51 cp1047 varnishd[20529]: CLI telnet 127.0.0.1 45770 127.0.0.1 6083 Wr 200 VCL '24125a4c-29a9-4702-b184-993780b4db4e' now active
Aug 12 07:43:17 cp1047 varnishd[20529]: Child (20225) not responding to CLI, killing it.
Aug 12 07:43:17 cp1047 varnishd[20529]: Child (20225) died signal=6
Aug 12 07:43:17 cp1047 varnishd[20529]: Child (20225) Last panic at: Fri, 12 Aug 2016 07:43:17 GMT "Assert error in smp_oc_getobj(), storage/storage_persistent_silo.c line 417: Condition((o)->magic == 0x32851d42) not true....
Aug 12 07:43:17 cp1047 varnishd[20529]: Child cleanup complete
Aug 12 07:43:17 cp1047 varnishd[20529]: Child (17652) Started
Aug 12 07:43:17 cp1047 varnishd[20529]: Child (17652) Pushing vcls failed: VCL "boot" Failed initialization Message:...
Aug 12 07:43:17 cp1047 varnishd[20529]: Stopping Child
Aug 12 07:43:18 cp1047 varnishd[20529]: Child (17652) died signal=6
Aug 12 07:43:18 cp1047 varnishd[20529]: Child (17652) Last panic at: Fri, 12 Aug 2016 07:43:18 GMT "Assert error in BAN_Shutdown(), cache/cache_ban.c line 798: Condition((pthread_join(ban_thread, &status)) == 0) not true....
Aug 12 07:43:18 cp1047 varnishd[20529]: Child (17652) said Child starts
Aug 12 07:43:18 cp1047 varnishd[20529]: Child (17652) said Dropped 0 segments to make free_reserve
Aug 12 07:43:18 cp1047 varnishd[20529]: Child (17652) said Dropped 0 segments to make free_reserve
Aug 12 07:43:18 cp1047 varnishd[20529]: Child cleanup complete
Hint: Some lines were ellipsized, use -l to show in full.
A simple systemctl restart varnish didn't fix the problem. Since the first panic pointed at persistent storage (storage_persistent_silo.c), I stopped the service, removed /srv/sd*/varnish* and started varnish.service again. The first attempt failed; the second one worked.
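For reference, the stop / wipe / start procedure described above could be scripted roughly as follows. This is only a sketch, not the exact commands that were typed: the helper names run and reset_varnish_storage, the DRY_RUN switch and the -f flag on rm are assumptions added for illustration. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch of the recovery steps from this report. Names and the DRY_RUN
# switch are hypothetical; only the unit name and the glob come from the report.

# Print the command when DRY_RUN=1 (the default), execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        sh -c "$*"
    fi
}

reset_varnish_storage() {
    run "systemctl stop varnish.service"
    # Glob as given in the report; kept quoted so it only expands when executed.
    # The -f flag is an assumption, the report does not say how the files were removed.
    run "rm -f /srv/sd*/varnish*"
    run "systemctl start varnish.service"
}

reset_varnish_storage
```

Running it as is prints the three commands; running it with DRY_RUN=0 executes them. As noted above, the procedure may need to be repeated if the first attempt fails.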