
Prometheus varnish metric churn due to VCL reloads
Closed, ResolvedPublic

Description

Noticed this while deploying Prometheus in ulsfo: a couple of varnish hosts report errors when collecting metrics from the frontend instance (cp400[13] show the error, while cp400[24] are fine), and only on misc; text/upload/maps are fine.

cp4001:~$ curl localhost:9331/metrics -s
An error has occurred during metrics collection:

48 error(s) occurred:
* collected metric varnish_backend_bereq_bodybytes label:<name:"id" value:"root:aea64d87-aaf3-4557-96f2-31653db14137.be_cp4001_ulsfo_wmnet" > gauge:<value:205725 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_bereq_bodybytes label:<name:"id" value:"root:aea64d87-aaf3-4557-96f2-31653db14137.be_cp4002_ulsfo_wmnet" > gauge:<value:199172 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_happy label:<name:"id" value:"root:34a96b9b-0cd6-48ff-8072-50ddad728930.be_cp4003_ulsfo_wmnet" > gauge:<value:1.8446744073709552e+19 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_happy label:<name:"id" value:"root:34a96b9b-0cd6-48ff-8072-50ddad728930.be_cp4004_ulsfo_wmnet" > gauge:<value:1.8446744073709552e+19 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_pipe_in label:<name:"backend" value:"be_cp4004_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:2.4449164e+07 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_conn label:<name:"backend" value:"be_cp4004_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:5403 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_pipe_out label:<name:"backend" value:"be_cp4001_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:24792 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_happy label:<name:"id" value:"root:aea64d87-aaf3-4557-96f2-31653db14137.be_cp4004_ulsfo_wmnet" > gauge:<value:1.8446744073709552e+19 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_happy label:<name:"id" value:"root:34a96b9b-0cd6-48ff-8072-50ddad728930.be_cp4002_ulsfo_wmnet" > gauge:<value:1.8446744073709552e+19 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_conn label:<name:"backend" value:"be_cp4002_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:10235 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_req label:<name:"backend" value:"be_cp4001_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:8358 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_req label:<name:"backend" value:"be_cp4002_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:10235 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_beresp_bodybytes label:<name:"backend" value:"be_cp4003_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:1.526453131e+09 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_bereq_bodybytes label:<name:"id" value:"root:aea64d87-aaf3-4557-96f2-31653db14137.be_cp4004_ulsfo_wmnet" > gauge:<value:214915 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_beresp_bodybytes label:<name:"backend" value:"be_cp4002_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:1.696922578e+09 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_bereq_hdrbytes label:<name:"backend" value:"be_cp4004_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:3.696003e+06 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_pipe_in label:<name:"backend" value:"be_cp4003_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:1.900312e+07 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_bereq_bodybytes label:<name:"id" value:"root:aea64d87-aaf3-4557-96f2-31653db14137.be_cp4003_ulsfo_wmnet" > gauge:<value:196328 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_pipe_hdrbytes label:<name:"backend" value:"be_cp4001_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:933922 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_conn label:<name:"backend" value:"be_cp4003_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:5343 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_happy label:<name:"id" value:"root:aea64d87-aaf3-4557-96f2-31653db14137.be_cp4002_ulsfo_wmnet" > gauge:<value:1.8446744073709552e+19 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_pipe_out label:<name:"backend" value:"be_cp4003_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:10429 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_bereq_bodybytes label:<name:"id" value:"root:34a96b9b-0cd6-48ff-8072-50ddad728930.be_cp4004_ulsfo_wmnet" > gauge:<value:6984 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_req label:<name:"backend" value:"be_cp4004_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:5403 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_beresp_hdrbytes label:<name:"backend" value:"be_cp4001_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:3.027729e+06 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_pipe_out label:<name:"backend" value:"be_cp4002_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:10206 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_beresp_bodybytes label:<name:"backend" value:"be_cp4001_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:1.469609655e+09 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_pipe_hdrbytes label:<name:"backend" value:"be_cp4003_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:29214 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_bereq_bodybytes label:<name:"id" value:"root:34a96b9b-0cd6-48ff-8072-50ddad728930.be_cp4003_ulsfo_wmnet" > gauge:<value:8325 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_pipe_out label:<name:"backend" value:"be_cp4004_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:4173 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_bereq_hdrbytes label:<name:"backend" value:"be_cp4001_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:3.750007e+06 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_pipe_hdrbytes label:<name:"backend" value:"be_cp4002_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:1.501309e+06 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_bereq_hdrbytes label:<name:"backend" value:"be_cp4003_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:3.797392e+06 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_pipe_hdrbytes label:<name:"backend" value:"be_cp4004_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:105127 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_bereq_hdrbytes label:<name:"backend" value:"be_cp4002_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:3.741113e+06 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_bereq_bodybytes label:<name:"id" value:"root:34a96b9b-0cd6-48ff-8072-50ddad728930.be_cp4002_ulsfo_wmnet" > gauge:<value:0 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_bereq_bodybytes label:<name:"id" value:"root:34a96b9b-0cd6-48ff-8072-50ddad728930.be_cp4001_ulsfo_wmnet" > gauge:<value:727 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_conn label:<name:"backend" value:"be_cp4001_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:8358 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_happy label:<name:"id" value:"root:aea64d87-aaf3-4557-96f2-31653db14137.be_cp4003_ulsfo_wmnet" > gauge:<value:1.8446744073709552e+19 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_happy label:<name:"id" value:"root:aea64d87-aaf3-4557-96f2-31653db14137.be_cp4001_ulsfo_wmnet" > gauge:<value:1.8446744073709552e+19 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_beresp_bodybytes label:<name:"backend" value:"be_cp4004_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:1.394161252e+09 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_pipe_in label:<name:"backend" value:"be_cp4001_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:2.7790998e+07 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_pipe_in label:<name:"backend" value:"be_cp4002_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:3.3237511e+07 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_beresp_hdrbytes label:<name:"backend" value:"be_cp4003_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:3.047851e+06 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_req label:<name:"backend" value:"be_cp4003_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:5343 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_beresp_hdrbytes label:<name:"backend" value:"be_cp4004_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:2.901052e+06 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_happy label:<name:"id" value:"root:34a96b9b-0cd6-48ff-8072-50ddad728930.be_cp4001_ulsfo_wmnet" > gauge:<value:1.8446744073709552e+19 >  has label dimensions inconsistent with previously collected metrics in the same metric family
* collected metric varnish_backend_beresp_hdrbytes label:<name:"backend" value:"be_cp4002_ulsfo_wmnet" > label:<name:"server" value:"3cb4b79d-d995-4aa2-87c0-6c38b34c8b6b" > gauge:<value:2.954338e+06 >  has label dimensions inconsistent with previously collected metrics in the same metric family

Event Timeline

The same also happens on all esams misc frontends; needs further investigation.

ema triaged this task as Medium priority.Nov 11 2016, 10:41 AM
ema moved this task from Triage to General on the Traffic board.

Question for Traffic: I see varnishstat -j reporting some UUID-looking IDs for backends; is that bound to change over the life of a varnish instance (or at all)?

"VBE.root:e64b98b6-dfcc-4f9d-83e3-d2950304ebdd.be_wdqs_svc_eqiad_wmnet.happy": {
  "description": "Happy health probes",
  "type": "VBE", "ident": "root:e64b98b6-dfcc-4f9d-83e3-d2950304ebdd.be_wdqs_svc_eqiad_wmnet", "flag": "b", "format": "b",
  "value": 0
},

I've captured varnishstat -j over the course of half a day on cp4001, and it seems the UUID is part of the backend "identity"?

-  "VBE.root:464fe99d-b7d7-4795-9809-48b32f0acd0b.be_analytics1027_eqiad_wmnet.happy": {
+  "VBE.root:486cbe97-4ac0-4e09-a1e3-4421ed013e02.be_analytics1027_eqiad_wmnet.happy": {
     "description": "Happy health probes",
-    "type": "VBE", "ident": "root:464fe99d-b7d7-4795-9809-48b32f0acd0b.be_analytics1027_eqiad_wmnet", "flag": "b", "format": "b",
+    "type": "VBE", "ident": "root:486cbe97-4ac0-4e09-a1e3-4421ed013e02.be_analytics1027_eqiad_wmnet", "flag": "b", "format": "b",
     "value": 0
   },
-  "VBE.root:464fe99d-b7d7-4795-9809-48b32f0acd0b.be_analytics1027_eqiad_wmnet.bereq_hdrbytes": {
+  "VBE.root:486cbe97-4ac0-4e09-a1e3-4421ed013e02.be_analytics1027_eqiad_wmnet.bereq_hdrbytes": {
     "description": "Request header bytes",
-    "type": "VBE", "ident": "root:464fe99d-b7d7-4795-9809-48b32f0acd0b.be_analytics1027_eqiad_wmnet", "flag": "c", "format": "B",
+    "type": "VBE", "ident": "root:486cbe97-4ac0-4e09-a1e3-4421ed013e02.be_analytics1027_eqiad_wmnet", "flag": "c", "format": "B",
     "value": 0
   },

That will create churn in the metric names, so I think we should get rid of it and keep only the backend hostname instead.
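For illustration, the kind of rewrite meant here looks roughly like the following shell sketch over the varnishstat output (this is not the exporter or patch code, and it assumes jq is installed):

# Sketch only: strip the "root:<uuid>." VCL prefix from VBE counter names so
# that only the backend hostname part remains.
varnishstat -j \
  | jq -r 'keys[] | select(startswith("VBE."))' \
  | sed -E 's/^VBE\.root:[0-9a-f-]{36}\./VBE./'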

Yeah, the UUID in there is actually from the VCL. Every time we change the VCL it's recompiled, and the output is given a UUID. The backends are defined in VCL, so their definitions also get refreshed when the VCL is reloaded and are unique to a VCL compilation. You can also have two different UUIDs for the same hostname both doing traffic at the same time, as each I/O thread doesn't switch to the new VCL until it reaches an idle point.
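For anyone following along, a quick way to see how many compiled VCLs (and hence UUIDs) are live on a host at a given moment is varnishadm; the frontend instance name used below is an assumption, depending on how that instance was started:

# Each loaded VCL listed here is a separate compilation with its own UUID;
# backends defined in an older VCL keep reporting counters until it goes cold.
sudo varnishadm vcl.list
# frontend instance, assuming it runs with an explicit instance name:
sudo varnishadm -n frontend vcl.list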

Change 322025 had a related patch set uploaded (by Filippo Giunchedi):
prometheus: discard varnish backend UUID during collection

https://gerrit.wikimedia.org/r/322025

The error should have been fixed upstream by https://github.com/jonnenauha/prometheus_varnish_exporter/commit/502b06328e1377c1c9881c7865779b2737ef62e3, though I still see that there are always two UUIDs present in varnishstat -j for a given backend. I'm collecting data on cp4001 over the weekend to get a better idea.

Looks like it can be any number really; at the moment I'm seeing three different UUIDs:

varnish_backend_bereq_bodybytes{backend="be_puppetmaster1001_eqiad_wmnet",server="180fe377-0a1a-4871-9b4b-a4d3672579bc"} 0
varnish_backend_bereq_bodybytes{backend="be_puppetmaster1001_eqiad_wmnet",server="a90b49d2-ec16-41ca-bd8d-9f52d32ad738"} 0
varnish_backend_bereq_bodybytes{backend="be_puppetmaster1001_eqiad_wmnet",server="b1c39ec9-8825-4b2c-8d6c-e1c309042262"} 0
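To get a feel for how many UUIDs are exposed per backend at any moment, a one-liner along these lines works against the exporter (same metric family and port as above):

# Count distinct "server" (VCL UUID) label values per backend for one metric
# family; a count of three means three live VCL compilations.
curl -s localhost:9331/metrics \
  | grep '^varnish_backend_bereq_bodybytes{' \
  | sed -E 's/.*backend="([^"]+)".*/\1/' \
  | sort | uniq -c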

That might be problematic in terms of metric churn, since each time the UUID changes (i.e. with a new VCL) that's a new metric. We could drop server at ingestion time, but I think it would result in duplicated metrics. The solution might be to have recording rules doing the aggregation on the Prometheus side and accept that.
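As a sketch of that idea (rule name and file path are made up, and this assumes the Prometheus 1.x rule file syntax), the aggregation could look something like:

# Hypothetical recording rule: sum away the per-VCL "server" label so that
# queries see one stable series per backend; name and path are illustrative.
cat <<'EOF' | sudo tee /etc/prometheus/varnish_backend.rules
backend:varnish_backend_bereq_bodybytes:sum = sum(varnish_backend_bereq_bodybytes) without (server)
EOF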

Mentioned in SAL (#wikimedia-operations) [2016-11-29T01:27:46Z] <godog> upgrade prometheus-varnish-exporter on cache_misc/ulsfo T150479

Mentioned in SAL (#wikimedia-operations) [2016-11-29T01:42:54Z] <godog> upgrade prometheus-varnish-exporter on cache boxes in ulsfo T150479

The errors mentioned in the ticket description seem to be gone from cp4001.

@ema indeed, the new upstream version should fix the errors. What's left to figure out is what to do when multiple VCLs are active, which seems to be the case for us. UUIDs change every now and then, causing "metric churn" on the Prometheus side, e.g. by having to store more series in memory:

series_in_memory.png (206×338 px, 11 KB)

samples_per_second.png (253×999 px, 41 KB)
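A rough way to watch this from the Prometheus side is to count the series in one varnish metric family over a few VCL reloads (host and port are placeholders):

# Returns the current number of distinct series for this metric family.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=count(varnish_backend_bereq_bodybytes)'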

I took another look at the cause of UUID/VCL churn, concentrating for now only on the backend varnish instance.
confd reports the backend directors file as out of sync and rewrites it, then again 5s, 1m and 1m+5s later:

$ sudo zgrep confd /var/log/syslog.1 /var/log/syslog | cut -d' ' -f7- | grep -i md5sum | grep -i backend
2016-12-05T22:35:04Z cp4001 /usr/bin/confd[1001]: INFO /etc/varnish/directors.backend.vcl has md5sum e4bb2d5dec06992e714bf347fe5d293c should be 870c20a2bba7d3977b3f82299552d9b7
2016-12-05T22:35:09Z cp4001 /usr/bin/confd[1001]: INFO /etc/varnish/directors.backend.vcl has md5sum 870c20a2bba7d3977b3f82299552d9b7 should be f7a5210ad5e97228cc5d9a53d448ecff
2016-12-05T22:36:05Z cp4001 /usr/bin/confd[1001]: INFO /etc/varnish/directors.backend.vcl has md5sum f7a5210ad5e97228cc5d9a53d448ecff should be f3c39feac8bb2f303c3d8e196e0d79ad
2016-12-05T22:36:08Z cp4001 /usr/bin/confd[1001]: INFO /etc/varnish/directors.backend.vcl has md5sum f3c39feac8bb2f303c3d8e196e0d79ad should be e4bb2d5dec06992e714bf347fe5d293c
2016-12-06T09:05:05Z cp4001 /usr/bin/confd[1001]: INFO /etc/varnish/directors.backend.vcl has md5sum e4bb2d5dec06992e714bf347fe5d293c should be 5b6bde225abc447161eaca0ca4351aec
2016-12-06T09:05:09Z cp4001 /usr/bin/confd[1001]: INFO /etc/varnish/directors.backend.vcl has md5sum 5b6bde225abc447161eaca0ca4351aec should be 800cba70435d6aafcd84a962bf62fdeb
2016-12-06T09:06:05Z cp4001 /usr/bin/confd[1001]: INFO /etc/varnish/directors.backend.vcl has md5sum 800cba70435d6aafcd84a962bf62fdeb should be 464a458c8d927a6d5be3d2b5d6e9a982
2016-12-06T09:06:09Z cp4001 /usr/bin/confd[1001]: INFO /etc/varnish/directors.backend.vcl has md5sum 464a458c8d927a6d5be3d2b5d6e9a982 should be e4bb2d5dec06992e714bf347fe5d293c
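For a rough rate of these re-syncs (same logs as above, just counted per file):

# Each md5sum mismatch is followed by confd rewriting the directors file,
# which in turn triggers a VCL reload and a new UUID.
sudo zgrep -c 'directors.backend.vcl has md5sum' /var/log/syslog.1 /var/log/syslog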

Regardless of this issue, even if the UUID changed less frequently we'd still see similar churn.

Mentioned in SAL (#wikimedia-operations) [2016-12-06T23:53:28Z] <godog> upgrade prometheus-varnish-exporter on cache boxes in esams - T150479

Mentioned in SAL (#wikimedia-operations) [2016-12-07T00:04:50Z] <godog> upgrade prometheus-varnish-exporter on cache boxes in codfw and eqiad - T150479

I took another quick look at this; part of the churn is the periodic varnish backend restarts pooling/depooling backends from the frontend directors VCL.

fgiunchedi renamed this task from Error collecting metrics from varnish_exporter on some misc hosts to Prometheus varnish metric churn due to VCL reloads.Apr 26 2017, 5:54 PM

There is still some churn due to the fact that multiple VCLs are loaded at the same time and we're generating new UUIDs via reload-vcl, although I don't think there's anything else we can do in this case:

varnish_backend_bereq_hdrbytes{backend="be_cp1075_eqiad_wmnet",server="3e25e821-8da1-49cf-b4d2-2debf07aeaed"} 927793
varnish_backend_bereq_hdrbytes{backend="be_cp1075_eqiad_wmnet",server="a53020e7-9025-42b5-bc84-ca2f6889521e"} 2.57787547e+08
fgiunchedi claimed this task.