After today's restart of varnishkafkas for a couple of changes (everything went fine) I noticed two weird metrics in the dashboard for cp403 (related to the number of requests waiting to be acked by poll()). I asked some help to Valentin and we quickly found out that those vk instances were stuck with:
%7 VSM_OPEN: Failed to open Varnish VSL: Cannot open /var/lib/varnish/frontend/_.vsm: No such file or directory
(Note: to discover this I started another instance using the webrequest config but emitting data to stdout and with debug logging).
It is not clear why, but the varnishkafka package installed was 1.0.14-1 istead of 1.1.0-1 (the upgrade happened a while ago, IIUC for T264074).
The caching nodes affected were: cp[6002,6005,6009-6013].drmrs.wmnet,cp1087.eqiad.wmnet,cp[4021,4033-4036].ulsfo.wmnet
More specifically, the affected instances were:
- varnishkafka-webrequest on cp1087, cp4035 and cp4036 (text)
- varnishkafka-webrequest on cp4021, cp4033, cp4034 (upload)
- varnishkafka-statsv on cp1087, cp4035 and cp4036
- varnishkafka-eventlogging on cp1087, cp4035 and cp4036
Some notes about these nodes:
- cp1087 is an old node, but the OS was installed for the last time on Fri Jun 4 20:02:59 UTC 2021
- the drmrs nodes are related to the new DC, never served traffic
- cp4021's OS was installed on Wed Oct 13 11:48:34 UTC 2021
- cp4035's OS was installed on Wed Nov 3 18:04:46 UTC 2021
- cp4036's OS was installed on Thu Nov 4 10:42:30 UTC 2021
Why varnishkafka didn't fail triggering an alert
We coded a special feature in vk to retry indefinitely when the Varnish shared memory handle is not available. This is needed to be flexible when Varnish is restarted, since it caused a lot of headaches in the past.
This special function was only present in the old version of Varnishkafka, since in the newer one the Varnish API takes care of everything, even retrying and failing after a bit. Tested it today on cp3035:
elukey@cp3055:~$ sudo varnishkafka -S webrequest.conf ..... VSM: Could not get hold of varnishd, is it running?
I used a wrong Varnish handle name in the config, and I got a meaningful error and exit code after some tries. So the correct/up-to-date version of Varnishkafka doesn't retry indefinitely and fails after some tries. We ended up in this situation due to an old version of Varnishkafka being deployed.
Why didn't we get any alert?
We have alerts for errors like delivery report (so when the message is not delivered to kafka) and alerts on the status of the varnishkafka daemons (up/down) but we don't have any alert related to the traffic that an instance handle (like if it is zero or not).