
Deprecate python varnish cachestats
Closed, Resolved (Public)

Description

With the parent task done, and once enough data has accumulated in Prometheus (e.g. 10 to 12 weeks), we can deprecate the cachestats varnish python subsystem. The following daemons will need to be removed:

  • varnishmedia
  • varnishreqstats
  • varnishrls
  • varnishstatsd
  • varnishxcache
  • varnishxcps

varnishstatsd

The following dashboards have varnishstatsd Graphite metrics (i.e. varnish\..*\.backends\..*)

  • db/api-frontend-summary API frontend summary

    The Prometheus version is at https://grafana.wikimedia.org/d/Dueegx4Zz/api-frontend-summary-filippo-t184942 although the "rest api varnish hit rate %" for varnish vs restbase metric can't be converted yet because we'd need to calculate over metrics from two different data sources: restbase GET/s (graphite) and varnish restbase backend GET/s (prometheus). @Krinkle @Pchelolo according to dashboard versions you have changed the dashboard, would it be problematic if we drop "REST API Varnish hit rate (GETs, %)" until at least we have restbase req/s in prometheus?
  • db/experimental-backend-5xx (Experimental - backend 5xx)

    Scheduled to be deleted, unused
  • db/media (Media)

    Fixed

varnishreqstats

The following dashboards have varnishreqstats Graphite metrics (i.e. varnish\..+\..+\.frontend)

  • db/experimental-backend-5xx (Experimental - backend 5xx)

    Scheduled for removal
  • db/varnish-http-requests (Varnish HTTP Requests)
  • db/varnish-http-errors (Varnish: HTTP Errors)
  • db/varnish-http-errors-datacenters (Varnish: HTTP Errors (datacenters))

    Prometheus version is live, graphite version is scheduled for deletion
  • db/varnish-http-errors-copy-jun-2019 (Varnish: HTTP Errors Copy Jun 2019)

    Uses Prometheus, varnishreqstats metrics removed from json model
  • db/xxx-cdanis-estimation-prometheus-varnish-http-requests-copy (xxx cdanis estimation Prometheus Varnish HTTP Requests Copy)

    Uses Prometheus, varnishreqstats metrics removed from json model

Details

Repo                Branch      Lines +/-
operations/puppet   production  +1 -11
operations/puppet   production  +1 -156
operations/puppet   production  +0 -385
operations/puppet   production  +0 -1
operations/puppet   production  +8 -7
operations/puppet   production  +0 -92
operations/puppet   production  +10 -8
operations/puppet   production  +7 -7
operations/puppet   production  +340 -443
operations/puppet   production  +1 -570
operations/puppet   production  +0 -16
operations/puppet   production  +7 -95
operations/puppet   production  +0 -16
operations/puppet   production  +8 -119
operations/puppet   production  +3 -0
operations/puppet   production  +3 -0
operations/puppet   production  +4 -3
operations/puppet   production  +3 -2
operations/puppet   production  +9 -9
operations/puppet   production  +0 -16
operations/puppet   production  +7 -102
operations/puppet   production  +0 -16
operations/puppet   production  +9 -139
operations/puppet   production  +54 -4
operations/puppet   production  +1 -1
operations/puppet   production  +21 -4
integration/config  master      +2 -2
integration/config  master      +9 -0
operations/puppet   production  +4 -0
operations/puppet   production  +4 -0

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 422381 merged by Vgutierrez:
[operations/puppet@production] mtail: Add varnish_resourceloader_resp in varnishrls

https://gerrit.wikimedia.org/r/422381

Change 422910 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] prometheus: varnish_x_cache rate for the last 2m

https://gerrit.wikimedia.org/r/422910

Change 422910 abandoned by Vgutierrez:
prometheus: varnish_x_cache rate for the last 2m

Reason:
not needed.. dashboard was using the wrong metric.

https://gerrit.wikimedia.org/r/422910

Change 422155 merged by Vgutierrez:
[operations/puppet@production] mtail: Provide ttfb histogram for varnishbackend

https://gerrit.wikimedia.org/r/422155

Change 421338 merged by Ema:
[operations/puppet@production] varnishxcps: remove python daemon

https://gerrit.wikimedia.org/r/421338

Change 423861 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnishxcps: remove nrpe::monitor_service

https://gerrit.wikimedia.org/r/423861

Change 423861 merged by Ema:
[operations/puppet@production] varnishxcps: post-removal cleanup

https://gerrit.wikimedia.org/r/423861

Change 424611 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] varnish: varnishxcache post-removal cleanup

https://gerrit.wikimedia.org/r/424611

Change 421925 merged by Vgutierrez:
[operations/puppet@production] varnish: Remove varnishxcache python daemon

https://gerrit.wikimedia.org/r/421925

Change 424611 merged by Vgutierrez:
[operations/puppet@production] varnish: varnishxcache post-removal cleanup

https://gerrit.wikimedia.org/r/424611

Change 429833 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnishmedia: remove python daemon

https://gerrit.wikimedia.org/r/429833

@Krinkle I've pushed https://gerrit.wikimedia.org/r/429833 to remove varnishmedia, my understanding is that there's only one dashboard currently using statsd data under media.thumbnail.varnish. We do have prometheus data that can be used to replace it. Thoughts?

Change 431528 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] prometheus: varnish_thumbnails aggregation rule

https://gerrit.wikimedia.org/r/431528

Change 431608 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] mtail: Add test case from current varnishncsa sample

https://gerrit.wikimedia.org/r/431608

@Vgutierrez @ema I'm working on using the Prometheus metrics for the ResourceLoader dashboards but running into an issue with the varnish_resourceloader_inm metrics. Its rate seems to be nearly the same as varnish_resourceloader_resp, which cannot be true (it tends to be around 25% of requests, based on statsd metrics as well as on manual samples from varnishlog I gathered).

[Screenshots comparing Graphite and Prometheus: Screen Shot 2018-05-07 at 18.04.08.png, Screen Shot 2018-05-07 at 18.03.53.png]

Change 431712 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] mtail: Fix varnishrls regex [WIP]

https://gerrit.wikimedia.org/r/431712

@Krinkle It looks like our regex to match inm was too weak and it was capturing all the H2 and TLS info as the inm value. It should be fixed with https://gerrit.wikimedia.org/r/431712

Change 431712 merged by Vgutierrez:
[operations/puppet@production] mtail: Fix varnishrls regex

https://gerrit.wikimedia.org/r/431712

Change 431608 abandoned by Krinkle:
mtail: Update a /w/load.php test case from a current varnishncsa sample

Reason:
Fixed by https://gerrit.wikimedia.org/r/#/c/431712/

https://gerrit.wikimedia.org/r/431608

Change 432090 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] prometheus: Add varnishrls aggregation rules

https://gerrit.wikimedia.org/r/432090

Change 432117 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] mtail: Use a temporary variable for $cache_control

https://gerrit.wikimedia.org/r/432117

Change 432117 abandoned by Krinkle:
mtail: Use a temporary variable for $cache_control

https://gerrit.wikimedia.org/r/432117

Change 432090 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: Add varnishrls aggregation rules

https://gerrit.wikimedia.org/r/432090

Change 431528 merged by Ema:
[operations/puppet@production] prometheus: varnish_thumbnails aggregation rule

https://gerrit.wikimedia.org/r/431528

@ema ResourceLoader dashboards in Grafana have been updated to use Prometheus for all Varnish metrics. The varnishrls daemon for Graphite may now be removed.

Change 435739 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnishrls: remove python daemon

https://gerrit.wikimedia.org/r/435739

Change 435739 merged by Ema:
[operations/puppet@production] varnishrls: remove python daemon

https://gerrit.wikimedia.org/r/435739

Change 435752 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnishrls: post-removal cleanup

https://gerrit.wikimedia.org/r/435752

Change 435752 merged by Ema:
[operations/puppet@production] varnishrls: post-removal cleanup

https://gerrit.wikimedia.org/r/435752

ema changed the task status from Stalled to Open. May 28 2018, 11:33 AM

varnishrls removed, thanks @Krinkle.

Change 429833 merged by Ema:
[operations/puppet@production] varnishmedia: remove python daemon

https://gerrit.wikimedia.org/r/429833

Change 465383 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnishmedia: post-removal cleanup

https://gerrit.wikimedia.org/r/465383

Change 465383 merged by Ema:
[operations/puppet@production] varnishmedia: post-removal cleanup

https://gerrit.wikimedia.org/r/465383

ema updated the task description.

Ran Timo's grafana audit script to find dashboards using the remaining varnish statsd metrics. Note that some hits can be false positives (i.e. the metric is in the dashboard json but is hidden or not displayed).

Dashboard audit for varnishstatsd (i.e. key_prefix => "varnish.${::site}.backends")

$ nodejs 01-search-all-grafana.js 'varnish\..+\.backends' | grep Matched
Matched db/api-frontend-summary (API frontend summary)
Matched db/experimental-backend-5xx (Experimental - backend 5xx)
Matched db/maps-performances (Maps performances)
Matched db/media (Media)
Matched db/wdqs-paper-data (WDQS Paper data)
Matched db/wikidata-query-service-frontend (Wikidata Query Service Frontend)

And varnishreqstats (key_prefix => "varnish.${::site}.${cache_cluster}.frontend.request"):

$ nodejs 01-search-all-grafana.js 'varnish\..+\..+\.frontend' | tee varnishreqstats_dashboards.log | grep Matched
Matched db/experimental-backend-5xx (Experimental - backend 5xx)
Matched db/interactive-team-kpi (Interactive team KPI)
Matched db/interactive-team-kpi-backup (Interactive team KPI (backup))
Matched db/julien-maps-dashboard (Julien Maps Dashboard)
Matched db/maps-dashboard-draft (Maps Dashboard - draft)
Matched db/maps-kpi (Maps KPI)
Matched db/prometheus-varnish-http-requests (Prometheus Varnish HTTP Requests)
Matched db/prometheus-varnish-http-errors-datacenters (Prometheus Varnish: HTTP Errors (datacenters))
Matched db/service-maps-varnish (Service :: Maps - Varnish)
Matched db/varnish-http-requests (Varnish HTTP Requests)
Matched db/varnish-aggregate-client-status-codes (Varnish: Aggregate Client Status Codes)
Matched db/varnish-http-errors (Varnish: HTTP Errors)
Matched db/varnish-http-errors-datacenters (Varnish: HTTP Errors (datacenters))
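
For readers without access to the audit script, the general idea of such a search can be sketched as follows. This is not the nodejs script used above, only a minimal Python illustration under assumptions: the dashboard JSON model is already available as a string, and the sample model, its field layout and the backend name be_example are hypothetical.

```python
import json
import re
from typing import Iterator, List


def iter_strings(node) -> Iterator[str]:
    """Yield every string value found anywhere in a decoded JSON document."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def matching_targets(dashboard_json: str, pattern: str) -> List[str]:
    """Return the strings in a dashboard's JSON model that match `pattern`."""
    regex = re.compile(pattern)
    model = json.loads(dashboard_json)
    return [s for s in iter_strings(model) if regex.search(s)]


# Hypothetical, heavily trimmed dashboard model, for illustration only.
sample = json.dumps({
    "title": "Example",
    "rows": [{"panels": [
        {"targets": [{"target": "varnish.eqiad.backends.be_example.GET.p99"}]},
        {"targets": [{"target": "servers.example.cpu.user"}]},
    ]}],
})

backend_hits = matching_targets(sample, r"varnish\..+\.backends")
frontend_hits = matching_targets(sample, r"varnish\..+\..+\.frontend")
print(backend_hits, frontend_hits)
```

Because the search looks at every string in the model, a hidden or unused panel matches just like a visible one, which is where the false positives mentioned above come from.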

Latest dashboard audit:

'varnish\..+\.backends'

  • "Media"
  • "API frontend summary"
  • "Experimental - backend 5xx"
  • "Maps performances"
  • "WDQS Paper data"
  • "Wikidata Query Service Frontend"

'varnish\..+\..+\.frontend'

  • "Varnish: HTTP Errors (datacenters)" - added deprecation warning in favor of "Prometheus Varnish: HTTP Errors (datacenters)"
  • "Experimental - backend 5xx" - Scheduled for deletion
  • "Varnish: Aggregate Client Status Codes" - Recreated here, needs review

Change 519410 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] grafana: remove legacy varnish-aggregate-client-status-codes

https://gerrit.wikimedia.org/r/519410

The queries for varnishstatsd metrics I've been able to find during the audit:

(varnish.$dc.backends.be_*api_svc*.GET.sample_rate, 60)
alias(scale(varnish.$dc.backends.be_*api_svc*.POST.sample_rate, 60)
alias(scale(varnish.$dc.backends.be_*restbase_svc*.GET.sample_rate, 60)
alias(scale(varnish.$dc.backends.be_*restbase_svc*.POST.sample_rate, 60)
alias(scale(offset(asPercent(varnish.$dc.backends.be_*restbase_svc*.GET.sample_rate, 
varnish.$dc.backends.be_*restbase_svc*.GET.$percentile
varnish.$dc.backends.be_*api_svc*.GET.$percentile
maxSeries(varnish.*.backends.be_*restbase_svc*.GET.median)
maxSeries(varnish.*.backends.be_*api_svc*.GET.median)
varnish.$dc.backends.be_*restbase_svc*.POST.$percentile
varnish.$dc.backends.be_*api_svc*.POST.$percentile
varnish.eqiad.backends.*.5xx.sum
sumSeries(varnish.eqiad.backends.{be_appservers,be_api,be_restbase,be_rendering,be_appservers_debug
(varnish.*.backends.be_kartotherian_svc_*wmnet.*xx.rate, 3)
aliasByNode(averageSeries(varnish.*.backends.be_kartotherian_svc_codfw_wmnet.GET.p99), 3, 5)"
aliasByNode(averageSeries(varnish.*.backends.be_kartotherian_svc_eqiad_wmnet.GET.p99)
aliasByNode(averageSeries(varnish.*.backends.be_kartotherian_svc_codfw_wmnet.GET.p95), 3, 5)
aliasByNode(averageSeries(varnish.*.backends.be_kartotherian_svc_eqiad_wmnet.GET.p95)
"aliasByNode(averageSeries(varnish.*.backends.be_kartotherian_svc_codfw_wmnet.GET.p50), 3, 5)"
"aliasByNode(averageSeries(varnish.*.backends.be_kartotherian_svc_eqiad_wmnet.GET.p50'
'(varnish.*.backends.be_ms_fe.2xx.rate, '
'(varnish.*.backends.be_wdqs_svc*.5xx.count)"}]
integral(varnish.*.backends.be_wdqs_svc*.4xx.count)
integral(varnish.*.backends.be_wdqs_svc*.2xx.count)
integral(varnish.*.backends.be_wdqs_svc*.3xx.count'
(varnish.*.backends.be_wdqs*.[123]xx.rate
aliasByNode(exclude(varnish.*.backends.be_wdqs*.[45]xx.rate, \'wdqs100[12]\'), 3, 4)"
aliasByNode(exclude(sumSeriesWithWildcards(varnish.*.backends.be_wdqs*.*xx.rate, 4), \'be_wdqs100[12]\'), 3)"}
aliasByNode(varnish.*.backends.be_wdqs_svc*.GET.p99, 3, 4, 5)
aliasByNode(varnish.*.backends.be_wdqs_svc*.GET.p95, 3, 4, 5)
aliasByNode(varnish.*.backends.be_wdqs_svc*.GET.p50, 3, 4, 5'

In other words:

  • request rates, per backend and per method
  • request latency, per backend and per method
  • response rates, per backend and per status
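
The grouping above can be made explicit with a small classifier. This is a rough sketch, assuming the metric layout varnish.<dc>.backends.<backend>.<method or status class>.<statistic> as inferred from the queries listed earlier. The function name, the category labels and the backend name be_example are made up for illustration.

```python
import re

# Assumed layout, inferred from the audit queries:
#   varnish.<dc>.backends.<backend>.<method or status class>.<statistic>
METRIC_RE = re.compile(
    r"^varnish\.(?P<dc>[^.]+)\.backends\.(?P<backend>[^.]+)"
    r"\.(?P<key>[^.]+)\.(?P<stat>[^.]+)$"
)


def classify(path: str) -> str:
    """Map a concrete Graphite metric path to one of the three categories."""
    m = METRIC_RE.match(path)
    if m is None:
        return "unknown"
    key, stat = m["key"], m["stat"]
    if re.fullmatch(r"[1-5]xx", key):
        # Status class such as 2xx or 5xx: responses per backend and status.
        return "response_rate"
    if key in ("GET", "POST"):
        if stat == "sample_rate":
            return "request_rate"
        if re.fullmatch(r"p\d+|median", stat):
            return "request_latency"
    return "unknown"


for path in (
    "varnish.eqiad.backends.be_example.GET.sample_rate",
    "varnish.eqiad.backends.be_example.GET.p99",
    "varnish.eqiad.backends.be_example.5xx.rate",
):
    print(path, "->", classify(path))
```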

Change 519664 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] grafana: update varnish-aggregate-client-status-codes to prometheus version

https://gerrit.wikimedia.org/r/519664

Change 519410 abandoned by Cwhite:
grafana: remove legacy varnish-aggregate-client-status-codes

Reason:
superseded by Ibb58806c2166a3200b4685e5a7cea6fb97f010f1

https://gerrit.wikimedia.org/r/519410

Change 519664 merged by Cwhite:
[operations/puppet@production] grafana: update varnish-aggregate-client-status-codes to prometheus version

https://gerrit.wikimedia.org/r/519664

Change 520187 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] varnish: remove varnishstatsd

https://gerrit.wikimedia.org/r/520187

@Krinkle @Pchelolo according to dashboard versions you have changed the dashboard, would it be problematic if we drop "REST API Varnish hit rate (GETs, %)" until at least we have restbase req/s in prometheus?

ye, sure. we don't really monitor this on a daily basis, so there's no need for a dashboard. the number can be calculated manually if needed

Sounds great, thanks! I've replaced the dashboard with the one with Prometheus metrics now

@MSantos @Mathew.onipe we're moving from graphite-based varnish metrics to prometheus-based varnish metrics, I see you were amongst the authors of https://grafana.wikimedia.org/d/000000305/maps-performances, could you take a look at the prometheus version at https://grafana.wikimedia.org/d/kcAMMw4Wk/maps-performances-filippo-t184942?orgId=1 and let us know if it looks good? If so I'll replace the former with the latter. (cc @Gehel, I know Matt is away ATM)

@fgiunchedi overall it looks good, just have one question. In the Varnish response time graph, do you know why eqiad p99 values are so different? The current board has values up to 20s and the new one 5s.

I don't know offhand, although I'd be interested to know what percentiles kartotherian sees. Do you know if we have those available?

Unfortunately, I don't know. Maybe @Mathew.onipe or @Gehel know it better.

Change 521427 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] monitoring: update icinga links to varnish-aggregate-client-status-codes

https://gerrit.wikimedia.org/r/521427

Change 521427 merged by Ema:
[operations/puppet@production] monitoring: update icinga links to varnish-aggregate-client-status-codes

https://gerrit.wikimedia.org/r/521427

I don't know offhand, although I'd be interested to know what percentiles kartotherian sees. Do you know if we have those available?

As far as I know, we don't collect %-iles at kartotherian level. As for the difference in numbers, my guess is that the buckets we use (10ms, 50ms, 100ms, 500ms, 1s, 5s, +Inf) don't have much precision for requests over 5s. And since the p99 of maps was mostly >5s, we're just losing precision and should not trust those values too much. I'm not sure how the math checks out.

It still shows that p99 on maps is way too high, but sadly, that's not really news.

Thanks for taking a look @Gehel ! I agree the difference might be due to the bucketing.
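
To illustrate how the bucketing could produce the 5s ceiling, here is a minimal sketch of quantile estimation from cumulative histogram buckets. It assumes the dashboard derives p99 with Prometheus' histogram_quantile, whose documented behavior is to interpolate linearly inside a bucket and to return the highest finite bucket bound when the quantile falls into the +Inf bucket. The bucket bounds are the ones quoted above; the request counts are invented for the example.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # No finite upper bound to interpolate towards, so the
                # estimate is capped at the highest finite bucket bound.
                return prev_bound
            share = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * share
        prev_bound, prev_count = bound, count
    return prev_bound


# Bounds in seconds as quoted above; cumulative counts are made up.
buckets = [
    (0.01, 100), (0.05, 400), (0.1, 600), (0.5, 850),
    (1.0, 920), (5.0, 970), (math.inf, 1000),
]

p50 = estimate_quantile(0.50, buckets)
p99 = estimate_quantile(0.99, buckets)
print(p50, p99)
```

In this made-up data 3% of requests take longer than 5s, so p99 lands in the +Inf bucket and the estimate is reported as 5s regardless of how slow those requests really were. That would be consistent with the new board topping out at 5s while the Graphite one showed values up to 20s.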

In the process we've also discovered that those metrics stopped updating for both varnish+mtail and varnishstatsd, because the upload cache has fully moved to ATS. Getting equivalent metrics is tracked in T227668: Per-backend ATS Prometheus metrics.

Since the maps performance dashboard with graphite metrics is broken anyways ATM I think it makes sense to go ahead and remove varnishstatsd since the rest of the dashboards are migrated and are using cache text backends.

Agreed, that should not be blocking anything on the maps side.

Change 520187 merged by Filippo Giunchedi:
[operations/puppet@production] varnish: remove varnishstatsd

https://gerrit.wikimedia.org/r/520187

Change 523891 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] varnish: remove varnishreqstats-based alerts

https://gerrit.wikimedia.org/r/523891

Change 523892 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] varnish: ensure varnishreqstats is absent

https://gerrit.wikimedia.org/r/523892

Change 523891 merged by Filippo Giunchedi:
[operations/puppet@production] varnish: remove varnishreqstats-based alerts

https://gerrit.wikimedia.org/r/523891

Change 523892 merged by Filippo Giunchedi:
[operations/puppet@production] varnish: ensure varnishreqstats is absent

https://gerrit.wikimedia.org/r/523892

Change 525252 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] varnish: fix varnishreqstats systemd::service usage

https://gerrit.wikimedia.org/r/525252

Change 525252 merged by Filippo Giunchedi:
[operations/puppet@production] varnish: fix varnishreqstats systemd::service usage

https://gerrit.wikimedia.org/r/525252

Change 525259 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] varnish: remove varnishreqstats and varnishstatsd

https://gerrit.wikimedia.org/r/525259

Change 525259 merged by Filippo Giunchedi:
[operations/puppet@production] varnish: remove varnishreqstats and varnishstatsd

https://gerrit.wikimedia.org/r/525259

fgiunchedi claimed this task.

All varnish statsd daemons have been retired. The "maps performance" dashboard is still missing per-backend ATS metrics, which is tracked in T227668: Per-backend ATS Prometheus metrics

Change 737655 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnish: remove cachestats.py

https://gerrit.wikimedia.org/r/737655

Change 737655 merged by Ema:

[operations/puppet@production] varnish: remove cachestats.py

https://gerrit.wikimedia.org/r/737655

Change 737670 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnish::logging: remove statsd_host and mtail_progs

https://gerrit.wikimedia.org/r/737670

Change 737670 merged by Ema:

[operations/puppet@production] varnish::logging: remove statsd_host and mtail_progs

https://gerrit.wikimedia.org/r/737670