Page MenuHomePhabricator

Collect Backend-Timing in Prometheus
Closed, ResolvedPublic

Description

Per T124954#2009050 we now emit Backend-Timing on all apaches by default (apache/defaults.conf).

I propose to set up a varnishlog daemon similar to the one we have for the general traffic varnishreqstats, resourceloader varnishrls, and others.

It looks like such a daemon already exists (varnishstatsd), though I wasn't able to find its data anywhere in Graphite.

The main thing I'd like to start tracking is backend response time per data centre and varnish cluster (e.g. eqiad/text). We'll need to make sure to only count new backend requests (and not re-read the header from cached responses).

I'd be nice if we could also break down the text cluster slightly. e.g. varnish.esams.text.all but also varnish.esams.text.load_php for load.php requests. Alternatively, instead of hardcoding entry points in this daemon we could add them to varnishrls instead (and others where we specifically want them).

Event Timeline

A few months ago, we (very briefly) enabled reporting of backend latency via the stats interface in MediaWiki for all API request, and it overwhelmed the central statsd aggregator. Since API requests are a subset of app server requests, we have to imagine that naively logging every single request would be even worse.

We can work around that by sampling -- that is, reporting Backend-Timing for 1 in N requests.

This could be implemented either on the Varnishes, as a varnish-log-tailing Python script, or on the application servers themselves, by having Logster tail the Apache logs (which include the backend response time). Logster would be tailing /var/log/apache2/other_vhosts_access.log and slurping the second field (per https://wikitech.wikimedia.org/wiki/Apache_log_format).

Doing this on the Varnish layer would work, but my hunch is that it would be a bit harder to get right, because we would have to be careful not to report the value of the Backend-Timing response header on cached response objects. Some of the rennovations @BBlack has made to the response headers apparently make that easier to do than it has been in the past, but I do not have any direct experience of this.

ori triaged this task as Medium priority.Apr 11 2016, 6:32 PM
ori moved this task from Inbox to Backlog: Maintenance on the Performance-Team board.
Gilles raised the priority of this task from Medium to High.
gilles@deployment-cache-text04:~$ sudo varnishlog -I BerespHeader:Backend-Timing -g raw

@Gilles there's already a number of statsd/graphite python scripts in puppet that read varnish's shared memory. Could be of inspiration e.g. varnishxcache

Right, I wrote/refactored some :) Do you think we should add that feature to an existing one, or write a new one? Since this will look at almost all requests.

Right, I wrote/refactored some :) Do you think we should add that feature to an existing one, or write a new one? Since this will look at almost all requests.

eheh oops! Probably different scripts depending on how --key-prefix is used for aggregation I guess, also aggregate and flush every now and then like varnishreqstats does to avoid udp-meltdown on the graphite machine.

Finally, if you go the separate script idea it'd be interesting to also include the prometheus python client (already installed on varnish machines) and see how that would look like.

Gilles added a subscriber: Gilles.
Krinkle renamed this task from Collect Backend-Timing in Graphite to Collect Backend-Timing in Graphite (or Prometheus).Oct 11 2017, 4:58 PM

Change 418603 had a related patch set uploaded (by BBlack; owner: Ema):
[operations/puppet@production] varnishslowlog: add Backend-Timing D=, in seconds

https://gerrit.wikimedia.org/r/418603

Change 418603 merged by BBlack:
[operations/puppet@production] varnishslowlog: add Backend-Timing D=, in seconds

https://gerrit.wikimedia.org/r/418603

Change 427641 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/vagrant@master] Varnish prometheus daemon for Backend-Timing

https://gerrit.wikimedia.org/r/427641

Change 428287 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/vagrant@master] Introduce mtail role and a couple of varnish mtail programs

https://gerrit.wikimedia.org/r/428287

Change 428287 merged by jenkins-bot:
[mediawiki/vagrant@master] Introduce mtail role and a couple of varnish mtail programs

https://gerrit.wikimedia.org/r/428287

Change 428289 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/vagrant@master] Serve Backend-Timing header

https://gerrit.wikimedia.org/r/428289

Change 428289 merged by jenkins-bot:
[mediawiki/vagrant@master] Serve Backend-Timing header

https://gerrit.wikimedia.org/r/428289

Change 428293 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/vagrant@master] Backend-Timing Varnish mtail program

https://gerrit.wikimedia.org/r/428293

Change 427641 merged by jenkins-bot:
[mediawiki/vagrant@master] Refactor varnishslowlog and varnishhospital

https://gerrit.wikimedia.org/r/427641

Change 428293 merged by jenkins-bot:
[mediawiki/vagrant@master] Backend-Timing Varnish mtail program

https://gerrit.wikimedia.org/r/428293

Change 434879 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Backend-Timing Varnish mtail program

https://gerrit.wikimedia.org/r/434879

Gilles renamed this task from Collect Backend-Timing in Graphite (or Prometheus) to Collect Backend-Timing in Prometheus.May 24 2018, 8:09 AM
Gilles changed the task status from Open to Stalled.Jun 25 2018, 6:44 PM

Stuck in review since May

Gilles changed the task status from Stalled to Open.Oct 5 2018, 2:16 PM

Change 434879 merged by Ema:
[operations/puppet@production] Backend-Timing Varnish mtail program

https://gerrit.wikimedia.org/r/434879

Change 465136 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: enable varnish::logging::backendtiming

https://gerrit.wikimedia.org/r/465136

Change 465136 merged by Ema:
[operations/puppet@production] cache: enable varnish::logging::backendtiming

https://gerrit.wikimedia.org/r/465136

Change 465170 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: fix varnishbackendtiming destination directory

https://gerrit.wikimedia.org/r/465170

Change 465170 merged by Ema:
[operations/puppet@production] cache: fix varnishbackendtiming destination directory

https://gerrit.wikimedia.org/r/465170

We now expose the following metrics to Prometheus:

14:35:31 ema@cp2004.codfw.wmnet:~
$ curl -s http://localhost:3904/metrics | grep ^varnish_backend_timing
varnish_backend_timing_bucket{le="0.25",prog="varnishbackendtiming.mtail"} 36574
varnish_backend_timing_bucket{le="0.5",prog="varnishbackendtiming.mtail"} 38937
varnish_backend_timing_bucket{le="1",prog="varnishbackendtiming.mtail"} 39751
varnish_backend_timing_bucket{le="2.5",prog="varnishbackendtiming.mtail"} 40036
varnish_backend_timing_bucket{le="5",prog="varnishbackendtiming.mtail"} 40078
varnish_backend_timing_bucket{le="10",prog="varnishbackendtiming.mtail"} 40087
varnish_backend_timing_bucket{le="15",prog="varnishbackendtiming.mtail"} 40087
varnish_backend_timing_bucket{le="+Inf",prog="varnishbackendtiming.mtail"} 40088
varnish_backend_timing_bucket{le="0.05",prog="varnishbackendtiming.mtail"} 16336
varnish_backend_timing_bucket{le="0.1",prog="varnishbackendtiming.mtail"} 27889
varnish_backend_timing_bucket{le="0.025",prog="varnishbackendtiming.mtail"} 2051
varnish_backend_timing_bucket{le="0.005",prog="varnishbackendtiming.mtail"} 78
varnish_backend_timing_bucket{le="0.01",prog="varnishbackendtiming.mtail"} 78
varnish_backend_timing_sum{prog="varnishbackendtiming.mtail"} 470.54365380000104
varnish_backend_timing_count{prog="varnishbackendtiming.mtail"} 40088

The basic functionality is there. If we want to iterate on that, it should be the subject of a new task.