Ensure graphs used by Performance account for Varnish-to-ATS migration
Closed, Resolved · Public

Assigned To

Authored By

	Krinkle
	Sep 21 2019, 1:03 AM

Description

For each of the following metrics/graphs, verify that ATS servers already send the same metrics to Prometheus in a compatible way. If they don't, either make it so or figure out what we should query instead.

In particular for higher-level monitoring (like HTTP status codes and traffic size), it'd be great if we could monitor through a single aggregated query that doesn't need to differentiate between Varnish and ATS (for things not specific to Varnish). Otherwise the quantities will be off during the migration, making it hard for our alerts to keep working correctly.
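A minimal sketch of what such a single aggregated query could look like, built as a PromQL string from Python. The helper name `combined_rate_query` is hypothetical, and `ats_backend_timing_count` is an assumed name for the ATS counterpart of `varnish_backend_timing_count`. It also assumes the series from the two metrics differ in at least one label (for example `instance` or `job`); if they do not, Prometheus is likely to reject the `rate()` result because of duplicate label sets.

```python
import re

# Prometheus metric names may only contain these characters, so they can be
# joined into a regex alternation without escaping.
_METRIC_NAME = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")


def combined_rate_query(metric_names, window="5m", by=()):
    """Build one PromQL expression summing the rate of several counters.

    metric_names: counters to treat as one logical metric (assumed names).
    window:       range-vector window passed to rate().
    by:           labels to keep in the aggregation, e.g. ("cluster",).
    """
    if not metric_names:
        raise ValueError("at least one metric name is required")
    for name in metric_names:
        if not _METRIC_NAME.match(name):
            raise ValueError(f"not a valid metric name: {name!r}")

    # Select all listed metrics at once via the special __name__ label.
    selector = '{__name__=~"' + "|".join(metric_names) + '"}'
    grouping = f" by ({', '.join(by)})" if by else ""
    return f"sum{grouping} (rate({selector}[{window}]))"


if __name__ == "__main__":
    print(combined_rate_query(
        ["varnish_backend_timing_count", "ats_backend_timing_count"],
        by=("cluster",),
    ))
```

Selecting by `__name__` keeps the dashboard or alert expression the same before, during, and after the migration, since hosts that only export one of the two metrics still contribute to the sum.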

ResourceLoader

Dashboards:

Metrics:

status_cc_xc:varnish_resourceloader_resp:irate5m{site, status, cache_control, x_cache} (prometheus/global)

Apache/MediaWiki Backend-Timing

Dashboard: https://grafana.wikimedia.org/d/000000580/apache-backend-timing

Metrics:

varnish_backend_timing_count{cluster} (prometheus/ops)
varnish_backend_timing_bucket{cluster, le} (prometheus/ops)
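For context on how the `_bucket{le}` series are typically consumed: each bucket holds a cumulative count of observations at or below its `le` bound, and a quantile is estimated by linear interpolation inside the bucket that contains the target rank. The sketch below approximates that estimate in plain Python. It is an illustration of the idea and not the exact Prometheus implementation; the function name and the sample numbers are made up.

```python
import math


def bucket_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (le, cumulative_count) pairs sorted by le, where the
             last pair uses le = math.inf (the "+Inf" bucket).
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with an le=+Inf bucket")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, nothing to estimate

    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if le == math.inf or count == prev_count:
                # No finite upper bound (or an empty bucket) to interpolate
                # in, so fall back to the previous bucket's upper bound.
                return prev_le
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_le + (le - prev_le) * fraction
        prev_le, prev_count = le, count
    return prev_le


if __name__ == "__main__":
    # Made-up sample: 100 requests, bucket bounds in seconds.
    sample = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
    print(bucket_quantile(0.95, sample))
```

Because the estimate only depends on `le` and the counts, it works the same way on an ATS-side histogram as long as that histogram keeps the same bucket bounds.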

Details

	Subject	Repo	Branch	Lines +/-
	mtail: port varnishbackendtiming to ATS	operations/puppet	production	+104 -138


Related Objects

Mentioned Here: T227432: Replace Varnish backends with ATS on cache text nodes
T238817: Request routing to active/passive services active in codfw only stopped working

Event Timeline

Krinkle created this task. · Sep 21 2019, 1:03 AM

Restricted Application added a subscriber: Aklapper. · Sep 21 2019, 1:03 AM

I've added the ones I use most frequently. I'm probably missing others. To be added after Monday's meeting.

Krinkle claimed this task. · Sep 23 2019, 8:07 PM

Krinkle triaged this task as Medium priority.

Krinkle moved this task from Inbox, needs triage to To-do: Goals, prioritized next 4 Quarters on the Performance-Team board.

Letting go and lowering priority. While ATS frontends are on the way, they're only replacing Nginx/TLS right now. Replacing the Varnish part of the cacheproxy frontends is not happening this quarter.

It looks like the Apache Backend-Timing graphs dried up.

https://grafana.wikimedia.org/d/000000580/apache-backend-timing?orgId=1&from=1576540800000&to=1577145600000

Screenshot 2019-12-24 at 02.09.27.png (501×1 px, 63 KB)

Seems to align exactly with the following SAL entry:

2019-12-19

12:52 ema: depool cp2023 and cp1089 for ATS reimages T227432. Reimaged together because of T238817

CDanis subscribed. · Dec 24 2019, 5:51 AM

ema moved this task from Backlog to Caching on the Traffic board. · Dec 29 2019, 10:53 AM

Change 561266 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] mtail: port varnishbackendtiming to ATS

https://gerrit.wikimedia.org/r/561266

gerritbot added a project: Patch-For-Review. · Dec 31 2019, 9:03 AM

Change 561266 merged by Ema:
[operations/puppet@production] mtail: port varnishbackendtiming to ATS

https://gerrit.wikimedia.org/r/561266

In T233474#5761873, @Krinkle wrote:

It looks like the Apache Backend-Timing graphs dried up.

https://grafana.wikimedia.org/d/000000580/apache-backend-timing?orgId=1&from=1576540800000&to=1577145600000

I have ported varnishbackendtiming.mtail to ATS and replaced varnish_backend_timing with ats_backend_timing everywhere in the dashboard:
https://grafana.wikimedia.org/d/000000580/apache-backend-timing?orgId=1&from=now-1h&to=now

Given that the applayer now receives direct requests from all DCs, and not only eqiad/codfw, I have added each individual DC to the "Apache Requests" graph. The remaining graphs are eqiad-only as before.
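The dashboard-wide metric rename described in this comment could be sketched roughly as follows, assuming the dashboard is available as exported Grafana JSON in which each panel target keeps its query under an `expr` key. The function name is hypothetical and the `cluster="cache_text"` value in the example is made up for illustration; this is not the procedure that was actually used.

```python
import re


def rename_metric(node, old="varnish_backend_timing", new="ats_backend_timing"):
    """Return a copy of a dashboard JSON tree with `old` replaced by `new`
    in every string stored under an "expr" key. Other fields are untouched.
    """
    # \b keeps the match from starting in the middle of a longer metric name.
    pattern = re.compile(r"\b" + re.escape(old))

    if isinstance(node, dict):
        return {
            key: pattern.sub(new, value)
            if key == "expr" and isinstance(value, str)
            else rename_metric(value, old, new)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [rename_metric(item, old, new) for item in node]
    return node  # scalars (str, int, bool, None) are returned as-is


if __name__ == "__main__":
    dashboard = {
        "panels": [{
            "title": "Apache Requests",
            "targets": [{
                # Hypothetical query, for illustration only.
                "expr": 'sum(rate(varnish_backend_timing_count{cluster="cache_text"}[5m]))',
            }],
        }],
    }
    print(rename_metric(dashboard)["panels"][0]["targets"][0]["expr"])
```

Restricting the substitution to `expr` fields avoids accidentally rewriting panel titles or descriptions that mention Varnish on purpose.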

Maintenance_bot removed a project: Patch-For-Review. · Jan 2 2020, 9:10 AM

Krinkle updated the task description. · Jan 7 2020, 8:51 PM

@ema If I understand correctly, varnishrls does not yet require migration because it's logged by varnish-fe instead of the (now migrated to ATS) varnish-be. Is that correct?

In T233474#5786575, @Krinkle wrote:

@ema If I understand correctly, varnishrls does not yet require migration because it's logged by varnish-fe instead of the (now migrated to ATS) varnish-be. Is that correct?

That's right!

fgiunchedi moved this task from Inbox to Radar on the observability board. · Apr 6 2020, 12:35 PM

@Krinkle: anything left TBD here?

Not anymore. Thank you!

	F31487033: Screenshot 2019-12-24 at 02.09.27.png
	Dec 24 2019, 1:11 AM
