On multiple cache nodes, most noticeably on text@esams/eqsin, we get frequent errors from `varnishmtail`, namely repeated occurrences of:
```
Oct 19 05:23:03 cp3062 varnishmtail[15013]: Log overrun
Oct 19 05:23:03 cp3062 varnishmtail[15013]: Log reacquired
```
The situation is at times so bad that varnishmtail crashes altogether:
```
Oct 19 05:57:19 cp3062 varnishmtail[15013]: Assert error in vtx_append(), vsl_dispatch.c line 457:
Oct 19 05:57:19 cp3062 varnishmtail[15013]: Condition(i != vsl_check_e_inval) not true.
Oct 19 05:57:20 cp3062 varnishmtail[15013]: varnishncsa seems to have crashed, exiting
Oct 19 05:57:20 cp3062 systemd[1]: varnishmtail.service: Main process exited, code=exited, status=1/FAILURE
Oct 19 05:57:20 cp3062 systemd[1]: varnishmtail.service: Failed with result 'exit-code'.
Oct 19 05:57:20 cp3062 systemd[1]: varnishmtail.service: Consumed 7h 40min 57.670s CPU time.
Oct 19 05:57:20 cp3062 systemd[1]: varnishmtail.service: Service RestartSec=100ms expired, scheduling restart.
Oct 19 05:57:20 cp3062 systemd[1]: varnishmtail.service: Scheduled restart job, restart counter is at 6928.
Oct 19 05:57:20 cp3062 systemd[1]: Stopped Varnish mtail.
Oct 19 05:57:20 cp3062 systemd[1]: varnishmtail.service: Consumed 7h 40min 57.670s CPU time.
Oct 19 05:57:20 cp3062 systemd[1]: Started Varnish mtail.
```
Varnishmtail is nothing more than `varnishncsa | mtail`, so a likely cause for this issue is that mtail cannot process the input coming from varnishncsa fast enough: when mtail stalls, the pipe buffer fills up, varnishncsa blocks writing to it and falls behind varnishd's shared-memory log, which keeps being overwritten regardless of whether readers keep up.
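The shared-memory log is a fixed-size ring written by varnishd and read independently by clients such as varnishncsa. A toy model (illustrative only, not Varnish's actual data structures) shows why a consumer that lags behind the writer loses records and has to resynchronize, which is exactly the "Log overrun" / "Log reacquired" pattern above:

```python
from collections import deque

class RingLog:
    """Toy model of Varnish's shared-memory log: a fixed-size ring that
    the writer keeps appending to whether or not readers keep up."""

    def __init__(self, size):
        self.seq = 0                    # total records ever written
        self.buf = deque(maxlen=size)   # oldest records silently fall off

    def write(self, rec):
        self.buf.append((self.seq, rec))
        self.seq += 1

class Reader:
    """A consumer in the style of varnishncsa: it remembers the next
    sequence number it expects, and if the writer has lapped it, it
    counts an overrun and resyncs ("Log reacquired")."""

    def __init__(self, log):
        self.log = log
        self.next = 0        # sequence number we expect to read next
        self.overruns = 0

    def poll(self):
        oldest = self.log.seq - len(self.log.buf)
        if self.next < oldest:
            # The writer overwrote records we never saw: "Log overrun".
            self.overruns += 1
            self.next = oldest
        out = [rec for seq, rec in self.log.buf if seq >= self.next]
        self.next = self.log.seq
        return out

# A reader that polls promptly sees every record; one stalled behind a
# full pipe (varnishncsa feeding a slow mtail) loses records instead.
log = RingLog(size=4)
reader = Reader(log)
for i in range(10):       # write 10 records into a 4-slot ring...
    log.write(i)
print(reader.poll())      # ...then poll late: records 0-5 are gone
print(reader.overruns)
```

The key property is that the writer never blocks on readers, so backpressure from a slow mtail cannot slow varnishd down; it can only make varnishncsa miss data.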
As a consequence, most of the Varnish metrics produced by varnishmtail are skewed. Compare the 200-response rates reported by [[https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=72&orgId=1&var-site=esams%20prometheus%2Fops&var-instance=cp3062&from=1634609076499&to=1634696994008|ats-tls]] and [[https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=73&orgId=1&var-site=esams%20prometheus%2Fops&var-instance=cp3062&from=1634609076499&to=1634696994008|varnish]]:
{F34701242}
{F34701241}
The overruns happen pretty uniformly throughout the week:
```
11:20:15 ema@cp3062.esams.wmnet:~
$ sudo journalctl -u varnishmtail.service --since '1 week ago' | grep overrun -F | awk '{print $1, $2}' | uniq -c
15679 Oct 13
22299 Oct 14
20593 Oct 15
16158 Oct 16
17279 Oct 17
19051 Oct 18
19804 Oct 19
```
And so do the crashes:
```
11:20:19 ema@cp3062.esams.wmnet:~
$ sudo journalctl -u varnishmtail.service --since '1 week ago' | grep vsl_check_e_inval -F | awk '{print $1, $2}' | uniq -c
98 Oct 13
128 Oct 14
121 Oct 15
94 Oct 16
95 Oct 17
110 Oct 18
107 Oct 19
```
The [[https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop|VarnishTrafficDrop]] alert is affected by this issue, as are dashboards in regular use such as [[https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1|frontend-traffic]], [[https://grafana.wikimedia.org/d/000000500/varnish-caching?orgId=1&refresh=15m&from=now-2d&to=now&var-cluster=cache_text&var-site=esams&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5|varnish-caching]], and others.
Upstream PRs [[https://github.com/varnishcache/varnish-cache/pull/3451|3451]] and [[https://github.com/varnishcache/varnish-cache/pull/3468|3468]] may be of interest. See T151643 for a problem with the same symptoms but different circumstances tackled in the past.