We currently cannot see the effect of cache response time per instance when servers get depooled/repooled, etc.
|Open||None||T238085 Depooling single text caching server in esams had a disproportionate performance impact|
|Resolved||Gilles||T238086 Edge cache response time per server should be monitored|
Per an IRC conversation with @Gilles, we do have frontend servers tagged in the navtiming Hadoop data. It would be very useful to have that information in Graphite and to add the cache frontends as a dropdown on https://grafana.wikimedia.org/d/000000143/navigation-timing
This is deployed, and I updated the Grafana dashboard.
To nuke the data, we would need to restart Prometheus with the --web.enable-admin-api flag and run something like:
curl -X POST -g 'http://127.0.0.1:9090/api/v1/admin/tsdb/delete_series?match[]=navtiming_responsestart_by_host_seconds&end=2020-05-12T19:55:14+0000'
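For posterity, a sketch of the full sequence (assuming the metric is navtiming_responsestart_by_host_seconds, and adding the optional clean_tombstones call to reclaim disk space immediately rather than waiting for compaction):

```shell
# Build the delete_series request. Two gotchas worth noting:
#  - the matcher parameter is 'match[]=', not 'match='
#  - the '+' in the timezone offset must be percent-encoded as %2B,
#    or the query-string parser will read it as a space
base='http://127.0.0.1:9090/api/v1/admin/tsdb/delete_series'
url="${base}?match[]=navtiming_responsestart_by_host_seconds&end=2020-05-12T19:55:14%2B0000"
echo "$url"

# With Prometheus restarted under --web.enable-admin-api, the actual
# calls would be (not run here):
#   curl -X POST -g "$url"
#   curl -X POST 'http://127.0.0.1:9090/api/v1/admin/tsdb/clean_tombstones'
# ...then restart Prometheus once more without the admin flag.
```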
How important is it that the old labels are deleted? If it's just the Grafana variables we're worried about, we can use a regex there (until the old data ages out).
The dashboard was broken such that it would not even load the settings page. It seemed to hang indefinitely; I left it open in the background for an hour or so yesterday, and it was still spinning when I came back. I tried to get around Grafana not rendering the edit button by manually inserting ?editview=settings into the URI, but that hung as well.
I was able to back up a copy of the JSON definition by going to https://grafana.wikimedia.org/api/dashboards/uid/M7xQ_BeWk, but looking at the contents didn't give me any immediate clues as to why it wouldn't load. Re-importing that JSON as a new dashboard failed silently.
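In case anyone needs to repeat that round trip, it would look roughly like this via the HTTP API (a sketch, assuming an API token with edit rights; jq is used to rewrite the payload). Note that the export wraps the definition in a "dashboard" key, and that re-importing with the original "id"/"uid" intact is treated as an update of the existing dashboard rather than a new one, which may be related to the silent failure:

```shell
# Export: the response is {"dashboard": {...}, "meta": {...}}.
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  'https://grafana.wikimedia.org/api/dashboards/uid/M7xQ_BeWk' > backup.json

# Import as a *new* dashboard: "id" must be null and "uid" dropped,
# otherwise Grafana treats the POST as an update of the original.
jq '{dashboard: (.dashboard | .id = null | del(.uid)), overwrite: false}' backup.json |
  curl -s -X POST \
    -H "Authorization: Bearer $GRAFANA_TOKEN" \
    -H 'Content-Type: application/json' \
    --data @- \
    'https://grafana.wikimedia.org/api/dashboards/db'
```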
Recreating the entire dashboard from scratch and then overwriting the old copy seemed to be successful, although I'm sure I missed a few fields. If it's still working in a day or two, I'll clean it up; otherwise, consider this an experiment to see if it'll meet the same demise.
Regarding the label pollution, I added a regex to $dc which excludes values containing numerals (and thus hostnames). That cleans the values from the days the labels were swapped out of the drop-downs; let me know whether that's sufficient, or whether I should request access/assistance from SRE to delete the old data.
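To illustrate the filtering (the hostnames below are made-up examples, not the actual leaked values), grep stands in here for Grafana's variable Regex field, which likewise keeps only matching values:

```shell
# Hypothetical label values: real datacenters plus leaked hostnames.
values='eqiad
codfw
esams
cp3050
cp1075'

# A pattern that only matches digit-free values drops the hostnames;
# in Grafana this would go in the $dc variable's Regex field as /^[^0-9]+$/.
filtered=$(printf '%s\n' "$values" | grep -E '^[^0-9]+$')
printf '%s\n' "$filtered"
```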
That's because of the switchover: the metric is now coming from the "codfw prometheus/ops" source instead of the "eqiad prometheus/ops" source. I'm not sure how to fix that; it seems like a panel can only have one source.