Change Details

For each of the following metrics/graphs, verify that ATS servers already send the same metrics to Prometheus in a compatible way. If they do not, either make it so or figure out what we should query instead.

In particular, for higher-level monitoring (such as HTTP status codes and traffic size), it would be great if we could monitor them through a single aggregated query, without needing to differentiate between Varnish and ATS (for things not specific to Varnish). Otherwise the quantities will be off during the migration, making it hard for our alerts to keep working correctly.

##### {icon square-o} ResourceLoader

Dashboards:
* https://grafana.wikimedia.org/d/000000402/resourceloader-alerts
* https://grafana.wikimedia.org/d/000000066/resourceloader

Metrics:
* `status_cc_xc:varnish_resourceloader_resp:irate5m{site, status, cache_control, x_cache}` (prometheus/global)

##### {icon check-square-o} Apache/MediaWiki Backend-Timing

Dashboard: <https://grafana.wikimedia.org/d/000000580/apache-backend-timing>

Metrics:
* `varnish_backend_timing_count{cluster}` (prometheus/ops)
* `varnish_backend_timing_bucket{cluster, le}` (prometheus/ops)
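
As a rough illustration of the kind of aggregated query meant here, the sketch below builds a PromQL expression over the `varnish_backend_timing_bucket` histogram listed above. The helper name `backend_timing_quantile_query`, the 0.95 quantile, and the `5m` window are assumptions chosen for the example, not values taken from the dashboards. The `metric` parameter is there so that an equivalent ATS-side histogram, whatever it ends up being called, could be substituted without changing the shape of the query.

```python
def backend_timing_quantile_query(quantile=0.95, window="5m",
                                  metric="varnish_backend_timing_bucket"):
    """Build a PromQL expression for a per-cluster latency quantile.

    Hypothetical helper: only the default metric name and its `cluster`
    and `le` labels come from the task description.
    """
    if not 0 <= quantile <= 1:
        raise ValueError("quantile must be between 0 and 1")
    # Sum the bucket rates over every label except `cluster` and `le`,
    # so the result does not depend on which host reported a sample.
    inner = f"sum by (cluster, le) (rate({metric}[{window}]))"
    return f"histogram_quantile({quantile}, {inner})"


if __name__ == "__main__":
    print(backend_timing_quantile_query())
```

Because the aggregation keeps only `cluster` and `le`, the same expression should keep working during the migration as long as the replacement metric exposes those two labels with compatible bucket boundaries.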