Change Details

Since the upgrade to Varnish 5, the rate of sessions accepted (`MAIN.sess_conn`) [[https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?panelId=7&fullscreen&orgId=1&from=1518734045734&to=1520949118034&edit&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend | keeps on increasing]] on varnish backends, without any significant change in frontend traffic patterns. The query used is:

`rate(varnish_main_sessions{layer="$layer", type=~"conn", job="varnish-$cache_type",instance=~"($server):.*"}[5m])`

{F15602996}

The [[https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1520224184056&to=1520617807252&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend&panelId=7&fullscreen|"steps"]] in that graph happen when a varnish backend is [[https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1520224184056&to=1520617807252&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend&panelId=8&fullscreen|restarted by cron]].

The increasing rate does not seem to correlate with the number of established connections on the host. There is, however, a significant increase in the number of connections in time-wait. See for example [[https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=25&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams%20prometheus%2Fops&from=1518667608006&to=1521163855399|cp3040]]:

{F15603422}

As an immediate mitigation in text-esams, I've rebooted all hosts (which had to be done anyway as part of T188092). That resulted in a [[https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?panelId=7&fullscreen&orgId=1&from=1521196773687&to=1521218694031&edit&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend|decrease in fe<->be session rate]] as well as an [[https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1521196773687&to=1521218694031&panelId=11&fullscreen&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=frontend | increase in fe<->be session reuse rate]].

{F15604001} {F15604205}

Open questions:

- What is causing the increase?
- Is this related to T181315 and T174932?
- Is the varnish-be-restart [[https://gerrit.wikimedia.org/r/#/c/419090/|switch from weekly to twice a week]] going to cause harm in this context?
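
To cross-check the established vs. time-wait observation directly on a cache host, something like the following could be used. It is only a minimal sketch, not the method behind the dashboards above: it tallies TCP connection states from text in the Linux `/proc/net/tcp` format. The function name `count_tcp_states` and the `SAMPLE` rows are made up for illustration; the state codes are the ones the Linux kernel uses in the `st` column.

```python
from collections import Counter

# Linux kernel TCP state codes as printed in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED",
    "02": "SYN_SENT",
    "03": "SYN_RECV",
    "04": "FIN_WAIT1",
    "05": "FIN_WAIT2",
    "06": "TIME_WAIT",
    "07": "CLOSE",
    "08": "CLOSE_WAIT",
    "09": "LAST_ACK",
    "0A": "LISTEN",
    "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Return a Counter of TCP state names found in /proc/net/tcp-style text."""
    counts = Counter()
    # The first line is the column header, so skip it.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        # fields: [0] slot, [1] local address, [2] remote address, [3] state
        counts[TCP_STATES.get(fields[3].upper(), "UNKNOWN")] += 1
    return counts


# Hypothetical sample data (addresses and ports are invented).
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:0C38 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 1001 1
   1: 0100007F:0C38 0100007F:A1B2 01 00000000:00000000 00:00000000 00000000     0        0 1002 1
   2: 0100007F:0C38 0100007F:A1B3 06 00000000:00000000 03:00000E10 00000000     0        0 0 3
   3: 0100007F:0C38 0100007F:A1B4 06 00000000:00000000 03:00000E10 00000000     0        0 0 3
"""

if __name__ == "__main__":
    # On a real host, pass open("/proc/net/tcp").read() instead of SAMPLE.
    for state, n in sorted(count_tcp_states(SAMPLE).items()):
        print(state, n)
```

Sampling this periodically and comparing the `ESTABLISHED` and `TIME_WAIT` counts over time would show whether time-wait sockets grow while established connections stay flat, as the cp3040 graph suggests. IPv6 sockets are listed separately in `/proc/net/tcp6`, which uses the same column layout.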
