Ensure graphs used by Performance account for Varnish-to-ATS migration
Closed, Resolved (Public)

Description

For each of the following metrics/graphs, verify that the ATS servers already send the same metrics to Prometheus in a compatible way. If they do not, either make it so or figure out what we should query instead.

In particular, for higher-level monitoring (like HTTP status codes and traffic size), it'd be great if we could monitor these through a single aggregated query that does not need to differentiate between Varnish and ATS (for things not specific to Varnish). Otherwise the quantities will be off during the migration, making it hard for our alerts to keep working correctly.
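A minimal sketch of what such a single aggregated query could look like, written as a small Python helper that builds the PromQL string. The metric names `varnish_http_requests_total` and `ats_http_requests_total` are hypothetical placeholders, not names taken from this task, and the approach assumes the series from the two sources still differ in some label other than the metric name (for example `instance` or `job`).

```python
def aggregated_rate_query(metric_names, by_labels, window="5m"):
    """Build a PromQL query that sums the per-second rate of several
    counters, grouped only by the given labels.

    Matching on __name__ with a regex lets one query cover both the
    Varnish and the ATS flavour of a metric during the migration.
    """
    name_re = "|".join(metric_names)
    by = ", ".join(by_labels)
    return f'sum by ({by}) (rate({{__name__=~"{name_re}"}}[{window}]))'


if __name__ == "__main__":
    # Hypothetical metric names, for illustration only.
    print(aggregated_rate_query(
        ["varnish_http_requests_total", "ats_http_requests_total"],
        ["status"],
    ))
```

Grouping only by `status` drops the source-specific labels, so an alert threshold built on the result would not need to change as hosts move from one source to the other.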

ResourceLoader

Dashboards:

Metrics:

  • status_cc_xc:varnish_resourceloader_resp:irate5m{site, status, cache_control, x_cache} (prometheus/global)

Apache/MediaWiki Backend-Timing

Dashboard: https://grafana.wikimedia.org/d/000000580/apache-backend-timing

Metrics:

  • varnish_backend_timing_count{cluster} (prometheus/ops)
  • varnish_backend_timing_bucket{cluster, le} (prometheus/ops)
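For context on why the `le` label matters when porting these metrics: a `_bucket` series carries cumulative counts per upper bound, and latency percentiles are estimated from them. The sketch below is a simplified Python version of the linear interpolation that PromQL's `histogram_quantile` performs. The bucket bounds and counts are made up for illustration and are not taken from the real Backend-Timing data.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets maps each upper bound (the `le` label, including math.inf)
    to the cumulative number of observations <= that bound.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    bounds = sorted(buckets)
    if not bounds or bounds[-1] != math.inf:
        raise ValueError("buckets must include an +Inf bucket")
    total = buckets[math.inf]
    if total == 0:
        return math.nan

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for le in bounds:
        count = buckets[le]
        if count >= rank:
            if le == math.inf:
                # The quantile falls in the open-ended bucket: report the
                # highest finite bound instead of infinity.
                return bounds[-2] if len(bounds) > 1 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (le - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = le, count
    return math.nan


if __name__ == "__main__":
    # Illustrative data: seconds -> cumulative request count.
    sample = {0.1: 10, 0.5: 60, 1.0: 90, math.inf: 100}
    print(histogram_quantile(0.5, sample))
```

Because the estimate depends on the bucket boundaries, a replacement metric is only a drop-in substitute if it exposes the same `le` values.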

Event Timeline

I've added the ones I use most frequently. I'm probably missing others. To be added after Monday's meeting.

Krinkle triaged this task as Medium priority.
Krinkle lowered the priority of this task from Medium to Low.

Letting go and lowering priority. While ATS frontends are on the way, they're only replacing Nginx/TLS right now. Replacing the Varnish part of the cacheproxy frontends is not happening this quarter.

Krinkle edited projects, added Traffic; removed serviceops.

It looks like the Apache Backend-Timing graphs dried up.

https://grafana.wikimedia.org/d/000000580/apache-backend-timing?orgId=1&from=1576540800000&to=1577145600000

Screenshot 2019-12-24 at 02.09.27.png (501×1 px, 63 KB)

Seems to align exactly with the following SAL entry:

2019-12-19
  • 12:52 ema: depool cp2023 and cp1089 for ATS reimages T227432. Reimaged together because of T238817

Change 561266 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] mtail: port varnishbackendtiming to ATS

https://gerrit.wikimedia.org/r/561266

Change 561266 merged by Ema:
[operations/puppet@production] mtail: port varnishbackendtiming to ATS

https://gerrit.wikimedia.org/r/561266

I have ported varnishbackendtiming.mtail to ATS and replaced varnish_backend_timing with ats_backend_timing everywhere in the dashboard:
https://grafana.wikimedia.org/d/000000580/apache-backend-timing?orgId=1&from=now-1h&to=now
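As an illustration of the rename described above, the following sketch rewrites query expressions from the `varnish_backend_timing` prefix to `ats_backend_timing`. It assumes the `_count` and `_bucket` suffixes are unchanged on the ATS side, and the sample expression is hypothetical rather than copied from the dashboard.

```python
import re

OLD_PREFIX = "varnish_backend_timing"
NEW_PREFIX = "ats_backend_timing"


def port_expr(expr: str) -> str:
    """Replace the old metric prefix with the new one in a query string.

    The leading word boundary avoids touching longer names that merely
    end in the old prefix; suffixes such as _count or _bucket are kept.
    """
    return re.sub(rf"\b{re.escape(OLD_PREFIX)}", NEW_PREFIX, expr)


if __name__ == "__main__":
    # Hypothetical panel expression, for illustration only.
    print(port_expr('sum(rate(varnish_backend_timing_count{cluster="appserver"}[5m]))'))
```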

Given that the applayer now receives direct requests from all DCs, and not only eqiad/codfw, I have added each individual DC to the "Apache Requests" graph. All remaining graphs are eqiad-only, as before.

@ema If I understand correctly, varnishrls does not yet require migration because it's logged by varnish-fe instead of the (now migrated to ATS) varnish-be. Is that correct?

That's right!

Not anymore. Thank you!