
Add Prometheus client support for varnish/statsd metrics daemons
Closed, Resolved (Public)

Description

We have several Python daemons that collect derived metrics by reading and parsing output from varnishstat via the CacheStats class. The same metrics need to be exported to Prometheus too.

Daemons to work on:

  • varnishmedia
  • varnishreqstats
  • varnishrls
  • varnishstatsd
  • varnishxcache
  • varnishxcps

Details

Repo               Branch      Lines +/-
operations/puppet  production  +7 -0
operations/puppet  production  +2 -0
operations/puppet  production  +1 -0
operations/puppet  production  +27 -1
operations/puppet  production  +3 -1
operations/puppet  production  +48 -0
operations/puppet  production  +58 -0
operations/puppet  production  +6 -0
operations/puppet  production  +3 -0
operations/puppet  production  +65 -27
operations/puppet  production  +5 -0
operations/puppet  production  +30 -63
operations/puppet  production  +36 -42
operations/puppet  production  +6 -0
operations/puppet  production  +14 -0
operations/puppet  production  +4 -0
operations/puppet  production  +92 -3
operations/puppet  production  +49 -0
operations/puppet  production  +130 -0
operations/puppet  production  +80 -4
operations/puppet  production  +115 -39

Event Timeline

ema triaged this task as Medium priority. Oct 6 2017, 12:51 PM
ema added a project: Traffic.

IMO we could approach the problem of getting the stats above to Prometheus in at least two ways:

  1. Import the Prometheus Python client, instrument the code natively, and expose the metrics from each script via HTTP for Prometheus to pick up.
  2. Install prometheus-statsd-exporter on each cp machine that runs the scripts, send the existing statsd traffic there as well, and export Prometheus metrics from it.

Going with 2. would mean adding another daemon and having to write a statsd -> prometheus mapping; at that point we might as well instrument the code. Given that we (ops/traffic) control the code and overall it isn't a whole lot of metrics, approach 1. is preferable.
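To make approach 1. a bit more concrete, here is a minimal sketch of what native instrumentation boils down to: keep labelled counters in the daemon and render them in the Prometheus text exposition format. It deliberately uses only the standard library; the `LabeledCounter` class, the metric name `varnish_x_cache_total` and the sample status values are assumptions made up for illustration, not taken from the existing scripts.

```python
from collections import Counter


class LabeledCounter:
    """Tiny stand-in for a Prometheus counter with a single label (hypothetical helper)."""

    def __init__(self, name, help_text, label):
        self.name = name
        self.help_text = help_text
        self.label = label
        self.values = Counter()

    def inc(self, label_value, amount=1):
        # Counters only ever go up; one time series per label value.
        self.values[label_value] += amount

    def render(self):
        # Prometheus text exposition format: HELP and TYPE lines, then samples.
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for label_value, count in sorted(self.values.items()):
            lines.append(f'{self.name}{{{self.label}="{label_value}"}} {count}')
        return "\n".join(lines) + "\n"


# Made-up metric name and status values, purely illustrative.
xcache = LabeledCounter("varnish_x_cache_total", "Requests by X-Cache-Status", "status")
for status in ["hit-front", "miss", "hit-front"]:
    xcache.inc(status)

print(xcache.render(), end="")
```

In an actual daemon the official Python client would presumably replace this class and also take care of serving the rendered text over HTTP; the sketch only shows the shape of the data Prometheus would scrape.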

Change 387817 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] VCL: add layer information to X-Cache-Status

https://gerrit.wikimedia.org/r/387817

Change 388064 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] VCL: log TLS information to VSM

https://gerrit.wikimedia.org/r/388064

Change 387817 merged by Ema:
[operations/puppet@production] VCL: add layer information to X-Cache-Status

https://gerrit.wikimedia.org/r/387817

The approach we have in mind can roughly be summed up with varnishncsa | mtail. We can group the six scripts to be ported into frontend and backend ones:

Frontend

Backend

Let's focus on the frontend part first. Once TLS info logging to VSM is merged, varnishxcache, varnishxcps and varnishmedia would look something like this in (very) pseudo-mtail:

varnishncsa -n frontend -q 'ReqMethod ne "PURGE"' -F 'cache_status=%{X-Cache-Status}o\nkey_exchange=%{VCL_Log:CP-Key-Exchange}x\nhttp2=%{VCL_Log:CP-HTTP2}x\nURL=%U http_status=%s ' | mtail '
/^cache_status/ { xcache[X-Cache-Status]++ }
/^key_exchange/ { key_exchange[CP-Key-Exchange]++ }
[...]
/^URL=\/thumb\// { thumbnail[Http-Status]++ }
'
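The matching logic in the pseudo-mtail above can be tried out without varnish or mtail. The following is a rough Python sketch, not the real mtail program: it assumes each `\n` in the varnishncsa format string yields a separate line, and the sample lines, the rule names and the regular expressions are all made up for illustration.

```python
import re
from collections import Counter

# Sample lines in the shape the varnishncsa format string would emit.
# All values are invented for illustration.
SAMPLE_LINES = [
    "cache_status=hit-front",
    "key_exchange=X25519",
    "http2=1",
    "URL=/thumb/a.jpg http_status=200 ",
    "cache_status=miss",
    "key_exchange=X25519",
    "http2=0",
    "URL=/wiki/Main_Page http_status=200 ",
    "cache_status=hit-front",
    "URL=/thumb/b.jpg http_status=404 ",
]

# One (counter name, pattern) pair per pseudo-mtail rule; the captured
# "label" group plays the role of the counter key.
RULES = [
    ("xcache", re.compile(r"^cache_status=(?P<label>\S+)")),
    ("key_exchange", re.compile(r"^key_exchange=(?P<label>\S+)")),
    ("thumbnail", re.compile(r"^URL=/thumb/\S* http_status=(?P<label>\d+)")),
]


def count_lines(lines):
    """Apply every rule to every line and count matches per label value."""
    counters = {name: Counter() for name, _ in RULES}
    for line in lines:
        for name, pattern in RULES:
            match = pattern.match(line)
            if match:
                counters[name][match.group("label")] += 1
    return counters


counters = count_lines(SAMPLE_LINES)
for name, values in counters.items():
    print(name, dict(values))
```

Each rule maps onto one labelled counter, which is why the frontend scripts look like a natural fit for this approach.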

varnishrls does some more funky stuff, but in the end the only fields it uses are %s, %{Cache-Control}o and %{If-None-Match}i. So I suspect either it's not going to be outrageously hard to write in mtail, or we can easily embed the logic in VCL and use std.log as we've done for varnishxcps.

The remaining backend-related script, varnishstatsd, will need a separate varnishncsa process (the backend is another varnish instance with its own VSM!) and some more thought.

Change 394543 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] WIP: varnish: prometheus equivalent of statsd metrics daemons

https://gerrit.wikimedia.org/r/394543

Change 394597 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] mtail: add varnishmtail tests

https://gerrit.wikimedia.org/r/394597

Change 394543 merged by Ema:
[operations/puppet@production] varnish: prometheus equivalent of statsd metrics daemons

https://gerrit.wikimedia.org/r/394543

Change 388064 merged by Ema:
[operations/puppet@production] VCL: log TLS information to VSM

https://gerrit.wikimedia.org/r/388064

Change 394597 merged by Ema:
[operations/puppet@production] mtail: add varnishmtail tests

https://gerrit.wikimedia.org/r/394597

Change 395578 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] mtail: port varnishxcps

https://gerrit.wikimedia.org/r/395578

Change 397774 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add mtail to varnish jobs

https://gerrit.wikimedia.org/r/397774

Change 395578 merged by Ema:
[operations/puppet@production] mtail: port varnishxcps

https://gerrit.wikimedia.org/r/395578

Change 397831 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: install varnishxcps.mtail

https://gerrit.wikimedia.org/r/397831

Change 397831 merged by Ema:
[operations/puppet@production] cache: install varnishxcps.mtail

https://gerrit.wikimedia.org/r/397831

Change 397774 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add mtail to varnish jobs

https://gerrit.wikimedia.org/r/397774

Change 397851 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add mtail to varnish-upload job

https://gerrit.wikimedia.org/r/397851

Change 397876 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnishxcps.mtail: use prometheus labels

https://gerrit.wikimedia.org/r/397876

Change 397851 merged by Ema:
[operations/puppet@production] prometheus: add mtail to varnish-upload job

https://gerrit.wikimedia.org/r/397851

Change 397889 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] mtail: restructure tests

https://gerrit.wikimedia.org/r/397889

Change 397889 merged by Filippo Giunchedi:
[operations/puppet@production] mtail: restructure tests

https://gerrit.wikimedia.org/r/397889

Change 397876 merged by Ema:
[operations/puppet@production] varnishxcps.mtail: use prometheus labels

https://gerrit.wikimedia.org/r/397876

Change 398441 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] prometheus: add xcps aggregation rules

https://gerrit.wikimedia.org/r/398441

Change 398441 merged by Ema:
[operations/puppet@production] prometheus: add xcps aggregation rules

https://gerrit.wikimedia.org/r/398441

Change 398819 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] mtail: add varnishreqstats.mtail

https://gerrit.wikimedia.org/r/398819

Change 398819 merged by Ema:
[operations/puppet@production] mtail: add varnishreqstats.mtail

https://gerrit.wikimedia.org/r/398819

Change 399199 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] prometheus: add reqstats aggregation rule

https://gerrit.wikimedia.org/r/399199

Change 399199 merged by Ema:
[operations/puppet@production] prometheus: add reqstats aggregation rule

https://gerrit.wikimedia.org/r/399199

Change 401516 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add mtail to varnish-text job

https://gerrit.wikimedia.org/r/401516

Change 401516 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add mtail to varnish-text job

https://gerrit.wikimedia.org/r/401516

Change 401526 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish: add varnishmtail instance for varnish backends

https://gerrit.wikimedia.org/r/401526

Change 401535 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] mtail: add program to count varnish backend metrics

https://gerrit.wikimedia.org/r/401535

Change 401526 merged by Ema:
[operations/puppet@production] varnish: add varnishmtail instance for varnish backends

https://gerrit.wikimedia.org/r/401526

Change 401535 merged by Ema:
[operations/puppet@production] mtail: add program to count varnish backend metrics

https://gerrit.wikimedia.org/r/401535

Change 402022 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] mtail: update varnishbackend.mtail regex

https://gerrit.wikimedia.org/r/402022

Change 402022 merged by Ema:
[operations/puppet@production] mtail: update varnishbackend.mtail regex

https://gerrit.wikimedia.org/r/402022

Change 402055 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add backend varnish mtail job

https://gerrit.wikimedia.org/r/402055

Change 402055 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add backend varnish mtail job

https://gerrit.wikimedia.org/r/402055

Change 402328 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: aggregate varnish_requests rate

https://gerrit.wikimedia.org/r/402328

Change 402342 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnishmtail: specify reload action

https://gerrit.wikimedia.org/r/402342

Change 402353 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnishmtail: notify daemons upon mtail program modification

https://gerrit.wikimedia.org/r/402353

Change 402328 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: aggregate varnish_requests rate

https://gerrit.wikimedia.org/r/402328

Status update:

  • varnishstatsd has been replaced with varnishmtail-backend to get a breakdown of status codes per-backend
  • still missing is the backend ttfb histogram (pending upgrade of mtail to rc5 to fix float operations with +=)
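
For context on why float `+=` matters for the ttfb histogram: a Prometheus-style histogram keeps cumulative bucket counts plus a running sum of the observed values, and that sum is a float. A minimal Python sketch of that bookkeeping follows; the `Histogram` class, the bucket bounds and the sample ttfb values are assumptions for illustration, not the ones used in production.

```python
import bisect


class Histogram:
    """Prometheus-style histogram bookkeeping (hypothetical helper)."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        # One slot per upper bound, plus a final slot for the +Inf bucket.
        self.bucket_counts = [0] * (len(self.bounds) + 1)
        self.total = 0.0  # running float sum of all observations
        self.count = 0

    def observe(self, value):
        # First bucket whose upper bound is >= value ("le" semantics).
        index = bisect.bisect_left(self.bounds, value)
        self.bucket_counts[index] += 1
        self.total += value  # the float += that the histogram depends on
        self.count += 1

    def cumulative(self):
        """Return (upper bound, cumulative count) pairs, ending with +Inf."""
        running = 0
        result = []
        for bound, n in zip(self.bounds + [float("inf")], self.bucket_counts):
            running += n
            result.append((bound, running))
        return result


# Made-up bucket bounds (seconds) and ttfb samples.
ttfb = Histogram([0.05, 0.1, 0.5])
for seconds in [0.03, 0.1, 0.2, 2.0]:
    ttfb.observe(seconds)

for bound, cumulative_count in ttfb.cumulative():
    print(f"le={bound} {cumulative_count}")
print("sum", ttfb.total, "count", ttfb.count)
```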

Change 402342 abandoned by Ema:
varnishmtail: specify reload action

Reason:
Not OK, sending HUP to the script (and to varnishncsa) causes a service restart.

https://gerrit.wikimedia.org/r/402342

Change 402353 merged by Ema:
[operations/puppet@production] varnishmtail: notify daemons upon mtail program modification

https://gerrit.wikimedia.org/r/402353

fgiunchedi claimed this task.