
Ensure graphs used by Performance account for Varnish-to-ATS migration
Closed, Resolved · Public

Description

For each of the following metrics/graphs, verify that the ATS servers already send the same metrics to Prometheus in a compatible way. If they do not, either make it so or figure out what we should query instead.

In particular, for higher-level monitoring (like HTTP status codes and traffic size), it would be great if we could monitor them through a single aggregated query that does not need to differentiate between Varnish and ATS (for things not specific to Varnish). Otherwise the quantities will be off during the migration, making it hard for our alerts to keep working correctly.

ResourceLoader

Dashboards:

Metrics:

  • status_cc_xc:varnish_resourceloader_resp:irate5m{site, status, cache_control, x_cache} (prometheus/global)

Apache/MediaWiki Backend-Timing

Dashboard: https://grafana.wikimedia.org/d/000000580/apache-backend-timing

Metrics:

  • varnish_backend_timing_count{cluster} (prometheus/ops)
  • varnish_backend_timing_bucket{cluster, le} (prometheus/ops)
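
The "single aggregated query" idea from the description could be sketched as follows. This is only an illustration, not a query taken from any existing dashboard: the helper `combined_rate` is hypothetical, and the `ats_` metric prefix is an assumption about how an ATS-side counterpart of `varnish_backend_timing_count` might be named. Each term falls back to `vector(0)` so the sum still returns a value when one of the two metric families has no series.

```python
def combined_rate(metric_suffix: str, window: str = "5m",
                  prefixes=("varnish", "ats")) -> str:
    """Build a PromQL expression that sums request rates across proxies.

    The prefixes are assumptions for illustration; only the varnish_
    metrics are listed in this task description.
    """
    terms = [
        # "or vector(0)" keeps the term defined when the metric is absent,
        # e.g. before or after a host has been migrated.
        f"(sum(rate({prefix}_{metric_suffix}[{window}])) or vector(0))"
        for prefix in prefixes
    ]
    return " + ".join(terms)


query = combined_rate("backend_timing_count")
print(query)
```

An expression built this way stays continuous during a migration, because requests move from one term to the other rather than disappearing from the total.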

Event Timeline

Krinkle created this task. Sep 21 2019, 1:03 AM
Restricted Application added a subscriber: Aklapper. Sep 21 2019, 1:03 AM

I've added the ones I use most frequently. I'm probably missing others. To be added after Monday's meeting.

Krinkle claimed this task. Sep 23 2019, 8:07 PM
Krinkle triaged this task as Medium priority.
Krinkle moved this task from Inbox to Backlog: Future Goals on the Performance-Team board.
Krinkle removed Krinkle as the assignee of this task. Sep 25 2019, 11:06 PM
Krinkle lowered the priority of this task from Medium to Low.

Unassigning myself and lowering priority. While ATS frontends are on the way, they are only replacing Nginx/TLS right now. Replacing the Varnish part of the cacheproxy frontends is not happening this quarter.

Krinkle assigned this task to ema. Dec 24 2019, 1:11 AM
Krinkle edited projects, added Traffic; removed serviceops.

It looks like the Apache Backend-Timing graphs dried up.

https://grafana.wikimedia.org/d/000000580/apache-backend-timing?orgId=1&from=1576540800000&to=1577145600000

Seems to align exactly with the following SAL entry:

2019-12-19
  • 12:52 ema: depool cp2023 and cp1089 for ATS reimages T227432. Reimaged together because of T238817

CDanis added a subscriber: CDanis. Dec 24 2019, 5:51 AM
ema moved this task from Triage to Caching on the Traffic board. Dec 29 2019, 10:53 AM

Change 561266 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] mtail: port varnishbackendtiming to ATS

https://gerrit.wikimedia.org/r/561266

Change 561266 merged by Ema:
[operations/puppet@production] mtail: port varnishbackendtiming to ATS

https://gerrit.wikimedia.org/r/561266

ema added a comment. Jan 2 2020, 8:54 AM

I have ported varnishbackendtiming.mtail to ATS and replaced varnish_backend_timing with ats_backend_timing everywhere in the dashboard:
https://grafana.wikimedia.org/d/000000580/apache-backend-timing?orgId=1&from=now-1h&to=now

Given that the applayer now receives direct requests from all DCs, and not only from eqiad/codfw, I have added each individual DC to the "Apache Requests" graph. All remaining graphs are eqiad-only, as before.
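
For illustration, the metric rename described above can be sketched as a prefix substitution on a dashboard expression. This is not the actual change made to the dashboard: the helper `port_to_ats`, the example query, and the `cache_text` cluster value are all hypothetical. The pattern only touches `varnish_backend_timing*`, so other Varnish metrics are left alone.

```python
import re


def port_to_ats(expr: str) -> str:
    """Rename varnish_backend_timing* metrics to ats_backend_timing*.

    Hypothetical helper: it rewrites only the metric prefix and leaves
    labels, functions and unrelated metric names unchanged.
    """
    return re.sub(r"\bvarnish_backend_timing", "ats_backend_timing", expr)


# Hypothetical median-latency query; the cluster value is an assumption.
old = ('histogram_quantile(0.5, sum by (le) '
       '(rate(varnish_backend_timing_bucket{cluster="cache_text"}[5m])))')
new = port_to_ats(old)
print(new)
```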

Krinkle updated the task description. Jan 7 2020, 8:51 PM

@ema If I understand correctly, varnishrls does not yet require migration because it's logged by varnish-fe instead of the (now migrated to ATS) varnish-be. Is that correct?

ema added a comment. Jan 9 2020, 10:29 AM

> @ema If I understand correctly, varnishrls does not yet require migration because it's logged by varnish-fe instead of the (now migrated to ATS) varnish-be. Is that correct?

That's right!

fgiunchedi moved this task from Inbox to Radar on the observability board. Apr 6 2020, 12:35 PM
ema added a comment. Nov 5 2020, 8:50 AM

@Krinkle: anything left TBD here?

Krinkle closed this task as Resolved. Dec 15 2020, 12:34 AM

Not anymore. Thank you!