
Ensure graphs used by Performance account for Varnish-to-ATS migration
Open, Low, Public

Description

For each of the following metrics/graphs, verify that the ATS servers already send the same metrics to Prometheus in a compatible way. If they do not, either make it so or figure out what we should query instead.

In particular, for higher-level monitoring (like HTTP status codes and traffic size), it'd be great if we could monitor them through a single aggregated query that doesn't need to differentiate between Varnish and ATS (for things not specific to Varnish). Otherwise, the quantities will be off during the migration, making it hard for our alerts to keep working correctly.
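
One way to picture such a single aggregated query is a small helper that builds the PromQL expression from a list of counters. This is only a sketch: the metric names `example_varnish_requests_total` and `example_ats_requests_total`, the `source` label, and the assumption that both layers expose a shared `status` label are all hypothetical and not taken from this task.

```python
def combined_rate_query(metrics, by, window="5m"):
    """Build a PromQL expression summing per-second rates across several counters.

    metrics: mapping of a source tag (e.g. "varnish") to a counter name.
    by: label names to keep in the aggregated result.
    """
    if not metrics:
        raise ValueError("at least one metric is required")
    # Tag each side with a distinct "source" label so that `or` keeps series
    # from both sides even when their remaining labels are identical.
    parts = [
        f'label_replace(rate({name}[{window}]), "source", "{tag}", "", "")'
        for tag, name in metrics.items()
    ]
    return f'sum by ({", ".join(by)}) ({" or ".join(parts)})'


# Hypothetical metric names, for illustration only.
query = combined_rate_query(
    {
        "varnish": "example_varnish_requests_total",
        "ats": "example_ats_requests_total",
    },
    by=["status"],
)
print(query)
```

Because the outer `sum by (...)` drops the `source` label again, an alert built on this expression would see one combined series per status code regardless of which layer served the request.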

ResourceLoader

Dashboards:

Metrics:

  • status_cc_xc:varnish_resourceloader_resp:irate5m{site, status, cache_control, x_cache} (prometheus/global)

Apache/MediaWiki Backend-Timing

Dashboard: https://grafana.wikimedia.org/d/000000580/apache-backend-timing

Metrics:

  • varnish_backend_timing_count{cluster} (prometheus/ops)
  • varnish_backend_timing_bucket{cluster, le} (prometheus/ops)
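
The `_count` and `_bucket{le}` names look like the standard Prometheus histogram convention, where each bucket holds the cumulative number of observations less than or equal to `le`. Assuming that is the case, any ATS replacement would need compatible cumulative buckets for percentile graphs to keep working. The sketch below shows roughly how a percentile is estimated from such buckets by linear interpolation, in the spirit of PromQL's `histogram_quantile`. The bucket boundaries and counts are made up for illustration.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: iterable of (le, cumulative_count) pairs, including an +Inf bucket.
    Assumes all observations are non-negative, so the lowest bucket starts at 0.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(buckets, key=lambda b: b[0])
    if not ordered or not math.isinf(ordered[-1][0]):
        raise ValueError("buckets must include an +Inf bucket")
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in ordered:
        if count >= rank:
            if math.isinf(le):
                # The quantile falls in the open-ended bucket: report the
                # highest finite upper bound instead of infinity.
                return prev_le
            if count == prev_count:
                return le
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_le + (le - prev_le) * fraction
        prev_le, prev_count = le, count
    return prev_le


# Made-up cumulative counts per upper bound, in seconds.
sample = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
median = estimate_quantile(0.5, sample)
```

The estimate is only as precise as the bucket layout, so differing bucket boundaries between the two layers would shift the resulting percentiles even when the underlying latencies are the same.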

Event Timeline

Krinkle created this task. Sep 21 2019, 1:03 AM
Restricted Application added a subscriber: Aklapper. Sep 21 2019, 1:03 AM

I've added the ones I use most frequently. I'm probably missing others. To be added after Monday's meeting.

Krinkle claimed this task. Sep 23 2019, 8:07 PM
Krinkle triaged this task as Normal priority.
Krinkle moved this task from Inbox to Backlog: Future Goals on the Performance-Team board.
Krinkle removed Krinkle as the assignee of this task. Wed, Sep 25, 11:06 PM
Krinkle lowered the priority of this task from Normal to Low.

Letting go and lowering priority. While ATS frontends are on the way, they're only replacing Nginx/TLS right now. Replacing the Varnish part of the cacheproxy frontends is not happening this quarter.