
Per-backend ATS Prometheus metrics
Open, Medium, Public

Description

While investigating T184942: Deprecate python varnish cachestats, we've run into the fact that maps runs on cache upload, the ATS migration for upload has been completed, and we don't have per-backend metrics (latency, status code, etc.) from ATS. Ideally we'd have at least the same metrics we're collecting from varnishlog + mtail available from ATS, although I don't know the specifics of what's possible.

In terms of dashboards, we're looking at replacing varnish_backend_requests and varnish_backend_timing in dashboards and possibly alerts. The latter is a subset of the former, so we should be able to rewrite _timing in terms of _requests.

varnish_backend_requests

Matched db/api-frontend-summary (API frontend summary)
Matched db/maps-performances-filippo-t184942 (Maps performances Filippo T184942)
Matched db/wikidata-query-service-frontend (Wikidata Query Service Frontend)

varnish_backend_timing

Matched db/apache-backend-timing (Apache Backend-Timing)

Details

Related Gerrit Patches:
[operations/puppet@production] prometheus: add trafficserver_backend_requests_seconds_count rules
[operations/puppet@production] prometheus: rename trafficserver metrics
[operations/puppet@production] prometheus: fetch ATS origin server metrics
[operations/puppet@production] ATS: add atsbackend.mtail
[operations/puppet@production] ATS: add support for atsmtail systemd services
[operations/puppet@production] ATS: pass -socket and -regexp to fifo-log-tailer
[operations/software/fifo-log-demux@master] 0.3: implement fifo-log-tailer in go
[operations/puppet@production] ATS: log origin server hostname and Backend-Timing

Event Timeline

Restricted Application added a project: Operations. Jul 10 2019, 1:56 PM
Restricted Application added a subscriber: Aklapper.
ema triaged this task as Medium priority. Jul 10 2019, 2:05 PM
ema moved this task from Triage to Caching on the Traffic board.
CDanis added a subscriber: CDanis. Jul 10 2019, 2:07 PM

Change 523130 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: log origin server hostname and Backend-Timing

https://gerrit.wikimedia.org/r/523130

Change 523168 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: add atsbackend.mtail

https://gerrit.wikimedia.org/r/523168

Change 523130 merged by Ema:
[operations/puppet@production] ATS: log origin server hostname and Backend-Timing

https://gerrit.wikimedia.org/r/523130

fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board. Jul 16 2019, 10:30 AM

Change 523705 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: add support for atsmtail systemd services

https://gerrit.wikimedia.org/r/523705

Change 523768 had a related patch set uploaded (by Ema; owner: Ema):
[operations/software/fifo-log-demux@master] 0.3: implement fifo-log-tailer in go

https://gerrit.wikimedia.org/r/523768

Change 523768 merged by Ema:
[operations/software/fifo-log-demux@master] 0.3: implement fifo-log-tailer in go

https://gerrit.wikimedia.org/r/523768

Mentioned in SAL (#wikimedia-operations) [2019-07-17T09:07:44Z] <ema> upload fifo-log-demux 0.3 to stretch-wikimedia T227668

Change 523881 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: pass -socket and -regexp to fifo-log-tailer

https://gerrit.wikimedia.org/r/523881

Change 523881 merged by Ema:
[operations/puppet@production] ATS: pass -socket and -regexp to fifo-log-tailer

https://gerrit.wikimedia.org/r/523881

Mentioned in SAL (#wikimedia-operations) [2019-07-17T09:21:43Z] <ema> cp-ats: upgrade fifo-log-demux to 0.3 T227668

Change 523705 merged by Ema:
[operations/puppet@production] ATS: add support for atsmtail systemd services

https://gerrit.wikimedia.org/r/523705

Change 523168 merged by Ema:
[operations/puppet@production] ATS: add atsbackend.mtail

https://gerrit.wikimedia.org/r/523168

Change 523898 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] prometheus: fetch ATS origin server metrics

https://gerrit.wikimedia.org/r/523898

fgiunchedi updated the task description. Jul 17 2019, 10:47 AM

Change 523898 merged by Ema:
[operations/puppet@production] prometheus: fetch ATS origin server metrics

https://gerrit.wikimedia.org/r/523898

Mentioned in SAL (#wikimedia-operations) [2019-07-17T13:06:47Z] <ema> prometheus servers: remove varnish-upload_$dc_backend.yaml, replaced by ATS equivalent T227668

Change 525081 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] prometheus: add ats_backend_requests_seconds_count rules

https://gerrit.wikimedia.org/r/525081

Change 525085 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] prometheus: rename trafficserver metrics

https://gerrit.wikimedia.org/r/525085

Change 525085 merged by Ema:
[operations/puppet@production] prometheus: rename trafficserver metrics

https://gerrit.wikimedia.org/r/525085

Change 525081 merged by Ema:
[operations/puppet@production] prometheus: add trafficserver_backend_requests_seconds_count rules

https://gerrit.wikimedia.org/r/525081

Per-backend metrics are in place now via mtail, specifically:

  • request count: by backend, method, and status
  • total time spent on requests: by backend, method, and status
  • request latency buckets: by backend and method

Notably, the buckets/histograms are not broken down by status, for cardinality reasons. The _count and _sum metrics are, so after sum()ing them across status codes, histogram_quantile and related functions can be used as expected.

e.g.

cp1075:~$ curl -s localhost:3904/metrics | grep -v ^# | sort | grep -i appservers-rw
trafficserver_backend_requests_seconds_bucket{backend="appservers-rw.discovery.wmnet",le="+Inf",method="GET",prog="atsbackend.mtail"} 431060
trafficserver_backend_requests_seconds_bucket{backend="appservers-rw.discovery.wmnet",le="0.01",method="GET",prog="atsbackend.mtail"} 4931
trafficserver_backend_requests_seconds_bucket{backend="appservers-rw.discovery.wmnet",le="0.05",method="GET",prog="atsbackend.mtail"} 52582
trafficserver_backend_requests_seconds_bucket{backend="appservers-rw.discovery.wmnet",le="0.1",method="GET",prog="atsbackend.mtail"} 110478
trafficserver_backend_requests_seconds_bucket{backend="appservers-rw.discovery.wmnet",le="0.5",method="GET",prog="atsbackend.mtail"} 390573
trafficserver_backend_requests_seconds_bucket{backend="appservers-rw.discovery.wmnet",le="1.0",method="GET",prog="atsbackend.mtail"} 419147
trafficserver_backend_requests_seconds_bucket{backend="appservers-rw.discovery.wmnet",le="5.0",method="GET",prog="atsbackend.mtail"} 430660
trafficserver_backend_requests_seconds_count{backend="appservers-rw.discovery.wmnet",method="GET",status="200",prog="atsbackend.mtail"} 371045
trafficserver_backend_requests_seconds_count{backend="appservers-rw.discovery.wmnet",method="GET",status="204",prog="atsbackend.mtail"} 17315
trafficserver_backend_requests_seconds_count{backend="appservers-rw.discovery.wmnet",method="GET",status="301",prog="atsbackend.mtail"} 14349
trafficserver_backend_requests_seconds_count{backend="appservers-rw.discovery.wmnet",method="GET",status="302",prog="atsbackend.mtail"} 11916
trafficserver_backend_requests_seconds_count{backend="appservers-rw.discovery.wmnet",method="GET",status="303",prog="atsbackend.mtail"} 438
trafficserver_backend_requests_seconds_count{backend="appservers-rw.discovery.wmnet",method="GET",status="304",prog="atsbackend.mtail"} 4641
trafficserver_backend_requests_seconds_count{backend="appservers-rw.discovery.wmnet",method="GET",status="400",prog="atsbackend.mtail"} 353
trafficserver_backend_requests_seconds_count{backend="appservers-rw.discovery.wmnet",method="GET",status="403",prog="atsbackend.mtail"} 8
trafficserver_backend_requests_seconds_count{backend="appservers-rw.discovery.wmnet",method="GET",status="404",prog="atsbackend.mtail"} 10995
trafficserver_backend_requests_seconds_sum{backend="appservers-rw.discovery.wmnet",method="GET",status="200",prog="atsbackend.mtail"} 97109.5199999952
trafficserver_backend_requests_seconds_sum{backend="appservers-rw.discovery.wmnet",method="GET",status="204",prog="atsbackend.mtail"} 1215.767000000002
trafficserver_backend_requests_seconds_sum{backend="appservers-rw.discovery.wmnet",method="GET",status="301",prog="atsbackend.mtail"} 699.352999999978
trafficserver_backend_requests_seconds_sum{backend="appservers-rw.discovery.wmnet",method="GET",status="302",prog="atsbackend.mtail"} 1348.6029999999773
trafficserver_backend_requests_seconds_sum{backend="appservers-rw.discovery.wmnet",method="GET",status="303",prog="atsbackend.mtail"} 1.6779999999999828
trafficserver_backend_requests_seconds_sum{backend="appservers-rw.discovery.wmnet",method="GET",status="304",prog="atsbackend.mtail"} 255.75299999999888
trafficserver_backend_requests_seconds_sum{backend="appservers-rw.discovery.wmnet",method="GET",status="400",prog="atsbackend.mtail"} 34.54900000000002
trafficserver_backend_requests_seconds_sum{backend="appservers-rw.discovery.wmnet",method="GET",status="403",prog="atsbackend.mtail"} 0.424
trafficserver_backend_requests_seconds_sum{backend="appservers-rw.discovery.wmnet",method="GET",status="404",prog="atsbackend.mtail"} 2385.3479999999954
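
To illustrate the point about summing across status codes, here is a minimal Python sketch using the sample values above. It is not what Prometheus runs internally: the `histogram_quantile` function below is a hand-written approximation of the usual linear interpolation within a bucket, and it is applied to raw cumulative counters rather than to `rate()` output, so the numbers describe the whole lifetime of the mtail process on this host. All function and variable names are made up for the example.

```python
# Sample values copied from the curl output above (GET, appservers-rw).
# Each bucket is (upper bound "le" in seconds, cumulative request count).
BUCKETS = [
    (0.01, 4931),
    (0.05, 52582),
    (0.1, 110478),
    (0.5, 390573),
    (1.0, 419147),
    (5.0, 430660),
    (float("inf"), 431060),
]

# _count and _sum are broken down by status, unlike the buckets.
COUNT_BY_STATUS = {
    "200": 371045, "204": 17315, "301": 14349, "302": 11916, "303": 438,
    "304": 4641, "400": 353, "403": 8, "404": 10995,
}
SUM_BY_STATUS = {
    "200": 97109.5199999952, "204": 1215.767000000002,
    "301": 699.352999999978, "302": 1348.6029999999773,
    "303": 1.6779999999999828, "304": 255.75299999999888,
    "400": 34.54900000000002, "403": 0.424, "404": 2385.3479999999954,
}


def histogram_quantile(q, buckets):
    """Estimate the q-quantile by linear interpolation inside a bucket.

    Hypothetical helper approximating the behaviour of the PromQL function
    of the same name; the lowest bucket is assumed to start at 0.
    """
    buckets = sorted(buckets)
    rank = q * buckets[-1][1]  # the +Inf bucket holds the total count
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf") or count == prev_count:
                # No finite upper bound (or empty bucket) to interpolate in.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Summing across status codes gives totals matching the status-less buckets.
total_requests = sum(COUNT_BY_STATUS.values())
total_seconds = sum(SUM_BY_STATUS.values())
mean_latency = total_seconds / total_requests

if __name__ == "__main__":
    print("total requests:", total_requests)  # equals the +Inf bucket, 431060
    print("mean latency (s):", round(mean_latency, 3))
    print("p50 (s):", round(histogram_quantile(0.50, BUCKETS), 3))
    print("p99 (s):", round(histogram_quantile(0.99, BUCKETS), 3))
```

With these samples the summed _count equals the +Inf bucket, the median lands at roughly 0.25s and the p99 somewhere between 1s and 5s. In a dashboard the same idea would presumably be written in PromQL as something along the lines of `histogram_quantile(0.99, sum by (le) (rate(trafficserver_backend_requests_seconds_bucket[5m])))`, but I haven't checked that exact query against the production rules.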
fgiunchedi moved this task from Doing to Backlog on the User-fgiunchedi board. Oct 18 2019, 1:40 PM
fgiunchedi moved this task from Backlog to Radar on the User-fgiunchedi board. Wed, Nov 13, 11:23 AM