On multiple cache nodes, most noticeably on text@esams/eqsin, we get frequent errors from `varnishmtail`, namely repeated occurrences of:
```
Oct 19 05:23:03 cp3062 varnishmtail[15013]: Log overrun
Oct 19 05:23:03 cp3062 varnishmtail[15013]: Log reacquired
```
The situation is at times so bad that varnishmtail crashes altogether:
```
Oct 19 05:57:19 cp3062 varnishmtail[15013]: Assert error in vtx_append(), vsl_dispatch.c line 457:
Oct 19 05:57:19 cp3062 varnishmtail[15013]: Condition(i != vsl_check_e_inval) not true.
Oct 19 05:57:20 cp3062 varnishmtail[15013]: varnishncsa seems to have crashed, exiting
Oct 19 05:57:20 cp3062 systemd[1]: varnishmtail.service: Main process exited, code=exited, status=1/FAILURE
Oct 19 05:57:20 cp3062 systemd[1]: varnishmtail.service: Failed with result 'exit-code'.
Oct 19 05:57:20 cp3062 systemd[1]: varnishmtail.service: Consumed 7h 40min 57.670s CPU time.
Oct 19 05:57:20 cp3062 systemd[1]: varnishmtail.service: Service RestartSec=100ms expired, scheduling restart.
Oct 19 05:57:20 cp3062 systemd[1]: varnishmtail.service: Scheduled restart job, restart counter is at 6928.
Oct 19 05:57:20 cp3062 systemd[1]: Stopped Varnish mtail.
Oct 19 05:57:20 cp3062 systemd[1]: varnishmtail.service: Consumed 7h 40min 57.670s CPU time.
Oct 19 05:57:20 cp3062 systemd[1]: Started Varnish mtail.
```
Varnishmtail is nothing more than `varnishncsa | mtail`, so a likely cause for this issue is that mtail cannot process the input coming from varnishncsa fast enough: when mtail stalls, the pipe buffer fills up, varnishncsa blocks writing to it and falls behind varnishd's shared-memory log, which keeps being overwritten regardless of whether readers keep up.
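The shared-memory log is a fixed-size ring written by varnishd and read independently by clients such as varnishncsa. A toy model (illustrative only, not Varnish's actual data structures) shows why a consumer that lags behind the writer loses records and has to resynchronize, which is exactly the "Log overrun" / "Log reacquired" pattern above:

```python
from collections import deque

class RingLog:
    """Toy model of Varnish's shared-memory log: a fixed-size ring that
    the writer keeps appending to whether or not readers keep up."""

    def __init__(self, size):
        self.seq = 0                    # total records ever written
        self.buf = deque(maxlen=size)   # oldest records silently fall off

    def write(self, rec):
        self.buf.append((self.seq, rec))
        self.seq += 1

class Reader:
    """A consumer in the style of varnishncsa: it remembers the next
    sequence number it expects, and if the writer has lapped it, it
    counts an overrun and resyncs ("Log reacquired")."""

    def __init__(self, log):
        self.log = log
        self.next = 0        # sequence number we expect to read next
        self.overruns = 0

    def poll(self):
        oldest = self.log.seq - len(self.log.buf)
        if self.next < oldest:
            # The writer overwrote records we never saw: "Log overrun".
            self.overruns += 1
            self.next = oldest
        out = [rec for seq, rec in self.log.buf if seq >= self.next]
        self.next = self.log.seq
        return out

# A reader that polls promptly sees every record; one stalled behind a
# full pipe (varnishncsa feeding a slow mtail) loses records instead.
log = RingLog(size=4)
reader = Reader(log)
for i in range(10):       # write 10 records into a 4-slot ring...
    log.write(i)
print(reader.poll())      # ...then poll late: records 0-5 are gone
print(reader.overruns)
```

The key property is that the writer never blocks on readers, so backpressure from a slow mtail cannot slow varnishd down; it can only make varnishncsa miss data.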
As a consequence, most of the Varnish metrics produced by varnishmtail are skewed. Compare the 200-response rates reported by [[https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=72&orgId=1&var-site=esams%20prometheus%2Fops&var-instance=cp3062&from=1634609076499&to=1634696994008|ats-tls]] and [[https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=73&orgId=1&var-site=esams%20prometheus%2Fops&var-instance=cp3062&from=1634609076499&to=1634696994008|varnish]]:
{F34701242}
{F34701241}
The overruns happen pretty uniformly throughout the week:
```
11:20:15 ema@cp3062.esams.wmnet:~
$ sudo journalctl -u varnishmtail.service --since '1 week ago' | grep overrun -F | awk '{print $1, $2}' | uniq -c
15679 Oct 13
22299 Oct 14
20593 Oct 15
16158 Oct 16
17279 Oct 17
19051 Oct 18
19804 Oct 19
```
And so do the crashes:
```
11:20:19 ema@cp3062.esams.wmnet:~
$ sudo journalctl -u varnishmtail.service --since '1 week ago' | grep vsl_check_e_inval -F | awk '{print $1, $2}' | uniq -c
98 Oct 13
128 Oct 14
121 Oct 15
94 Oct 16
95 Oct 17
110 Oct 18
107 Oct 19
```
The [[https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop|VarnishTrafficDrop]] alert is affected by this issue, as are dashboards in regular use such as [[https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1|frontend-traffic]], [[https://grafana.wikimedia.org/d/000000500/varnish-caching?orgId=1&refresh=15m&from=now-2d&to=now&var-cluster=cache_text&var-site=esams&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5|varnish-caching]], and others.
Upstream PRs [[https://github.com/varnishcache/varnish-cache/pull/3451|3451]] and [[https://github.com/varnishcache/varnish-cache/pull/3468|3468]] may be of interest. See T151643 for a problem with the same symptoms but different circumstances tackled in the past.