Description
We currently cannot see the effect on cache response time per instance when servers get depooled/repooled, etc.
Details
Status | Subtype | Assigned | Task
---|---|---|---
Open | None | | T238085 Depooling single text caching server in esams had a disproportionate performance impact
Resolved | | Gilles | T238086 Edge cache response time per server should be monitored
Event Timeline
Per an IRC conversation with @Gilles, we do have frontend servers tagged in the navtiming Hadoop data. It would be very useful if we could have this information in Graphite and add the cache frontends as a drop-down to https://grafana.wikimedia.org/d/000000143/navigation-timing
I think it makes more sense to expose new navtiming metrics with Prometheus instead, especially for things like this that require slicing data by a new dimension.
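For what it's worth, the shape of that in Python could be something like the following minimal sketch using prometheus_client. The metric and label names match the navtiming_responsestart_by_host_seconds metric that ends up deployed below; the record_responsestart() helper and the bucket boundaries are my own assumptions, not the actual navtiming code:

```python
from prometheus_client import Histogram

# Sketch only: metric/label names follow this task; the buckets and the
# record_responsestart() helper are hypothetical.
RESPONSESTART = Histogram(
    'navtiming_responsestart_by_host_seconds',
    'Time to responseStart, by edge DC and cache host',
    ['dc', 'host'],
    buckets=(0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10),
)

def record_responsestart(dc, host, seconds):
    # Each dc/host combination becomes its own set of time series,
    # which is exactly the slicing a Grafana drop-down needs.
    RESPONSESTART.labels(dc=dc, host=host).observe(seconds)
```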
Change 591098 had a related patch set uploaded (by Gilles; owner: Gilles):
[performance/navtiming@master] Collect response time per DC/host for Prometheus
Change 591098 merged by jenkins-bot:
[performance/navtiming@master] Collect response time per DC/host for Prometheus
Mentioned in SAL (#wikimedia-operations) [2020-05-04T18:19:55Z] <dpifke@deploy1001> Started deploy [performance/navtiming@239d359]: Deploy navtiming with new/updated Prometheus metrics - T249822, T238086
Mentioned in SAL (#wikimedia-operations) [2020-05-04T18:20:00Z] <dpifke@deploy1001> Finished deploy [performance/navtiming@239d359]: Deploy navtiming with new/updated Prometheus metrics - T249822, T238086 (duration: 00m 05s)
Change 594264 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[performance/navtiming@master] Fix swapped host/dc in Prometheus responsestart
Change 594264 merged by jenkins-bot:
[performance/navtiming@master] Fix swapped host/dc in Prometheus responsestart
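For reference, this is the general shape of bug the patch title suggests (a sketch, not a quote of the actual change): prometheus_client accepts positional label values in whatever order they are given, so a swap is silent, whereas keyword arguments bind each value to the intended label. Reusing the hypothetical RESPONSESTART histogram from the sketch above:

```python
# Buggy: positional values in the wrong order are accepted silently, so
# the hostname gets recorded under the 'dc' label and vice versa.
RESPONSESTART.labels(host, dc).observe(seconds)

# Fixed: keyword arguments make a swap impossible to miss.
RESPONSESTART.labels(dc=dc, host=host).observe(seconds)
```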
@dpifke, once deployed, we will need to nuke the existing data for navtiming_responsestart_by_host_seconds in Prometheus. Otherwise it's going to pollute the label values for Grafana drop-down menus and such.
Put together a dashboard (with the underlying labels swapped for now): https://grafana.wikimedia.org/d/M7xQ_BeWk/response-time-by-host?orgId=1
Mentioned in SAL (#wikimedia-operations) [2020-05-12T19:55:09Z] <dpifke@deploy1001> Started deploy [performance/navtiming@48110b9]: Fixes swapped dc/host labels - T238086
Mentioned in SAL (#wikimedia-operations) [2020-05-12T19:55:14Z] <dpifke@deploy1001> Finished deploy [performance/navtiming@48110b9]: Fixes swapped dc/host labels - T238086 (duration: 00m 05s)
This is deployed, and I updated the Grafana dashboard.
To nuke the data, we would need to restart Prometheus with the --web.enable-admin-api flag and run something like:
curl -X POST -g 'http://127.0.0.1:9090/api/v1/admin/tsdb/delete_series?match[]=navtiming_responsestart_by_host_seconds&end=2020-05-12T19:55:14+0000'
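Note that delete_series only marks the matching samples with tombstones; the data is actually reclaimed at the next compaction, or immediately with a POST to the admin API's /api/v1/admin/tsdb/clean_tombstones endpoint.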
How important is it that the old labels are deleted? If it's just the Grafana variables we're worried about, we can use a regex there (until the old data ages out).
Weird. It was working yesterday (I verified that new data was appearing with the correct labels), but is now hanging for me as well. I'll investigate.
The dashboard was broken such that it would not even load the settings page. It seemed to hang indefinitely; I left it open in the background for an hour or so yesterday, and it was still spinning when I came back. I tried to work around Grafana not rendering the edit button by manually inserting ?editview=settings into the URI, but that hung as well.
I was able to back up a copy of the JSON definition by going to https://grafana.wikimedia.org/api/dashboards/uid/M7xQ_BeWk, but looking at the contents didn't give me any immediate clues as to why it wouldn't load. Re-importing that JSON as a new dashboard failed silently.
Recreating the entire dashboard from scratch and then overwriting the old copy seems to have been successful, although I'm sure I missed a few fields. If it's still working in a day or two, I'll clean it up; otherwise, consider this an experiment to see whether it meets the same demise.
Regarding the label pollution, I added a regex to $dc which excludes values containing numerals (and thus hostnames). This fixes the drop-downs from the days the labels were swapped; let me know if you think that's sufficient, or if I should request access/assistance from SRE to be able to delete the old data.
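(For the curious: a variable regex along the lines of `/^[^0-9]*$/` does the trick, matching only label values with no digits, so hostnames are excluded while DC names like eqiad and codfw still appear. The exact expression on the dashboard may differ.)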
Not sure if this is merely a display issue, but I see fairly odd buckets on the dashboard:
- 0ms - 438ms
- 438ms - 877ms
- …
- 3.07s - 3.51s
- …
- 3.94s - 4.38s
- …
That's because I forgot to change the query format to "heatmap" in the panel settings. :) Fixed.
I think I've fixed the display further: the data format of the heatmap needed to be "Time series buckets". Looking good!
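In case anyone else hits this: with the query format left as a plain time series, Grafana buckets the raw values itself, which is where the evenly spaced ~438 ms bands above came from; setting the query format to "heatmap" plus the panel data format "Time series buckets" makes it use the histogram's own le buckets instead.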
@ema @Vgutierrez you can now use this dashboard when doing per-host experiments!
That's because of the switchover: the metric is now coming from the "codfw prometheus/ops" source instead of "eqiad prometheus/ops". I'm not sure how to fix that; it seems like a panel can only have one source.
For now I've switched the source; we'll have to remember to switch it again when the primary DC is switched back.