Edge cache response time per server should be monitored
Closed, Resolved · Public

Description

We currently cannot see the effect on cache response time per server instance when servers get depooled/repooled, etc.

Event Timeline

ema triaged this task as Medium priority. Nov 12 2019, 4:11 PM

As per IRC conversation with @Gilles, we do have frontend servers tagged in the navtiming Hadoop data. It would be very useful if we could have this information in Graphite and add the cache frontends as a dropdown to https://grafana.wikimedia.org/d/000000143/navigation-timing

I think it makes more sense to expose new navtiming metrics with Prometheus instead, especially for things like this that require slicing data by a new dimension.
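
A minimal sketch of that approach, assuming the Python prometheus_client library; the handler name and code structure here are hypothetical, though the metric name matches the one discussed later in this task:

from prometheus_client import Histogram

# responseStart latency, sliced by edge datacenter and cache host.
RESPONSE_START = Histogram(
    'navtiming_responsestart_by_host_seconds',
    'responseStart in seconds, by edge DC and cache frontend host',
    labelnames=['dc', 'host'],
)

def handle_navtiming_event(dc, host, responsestart_ms):
    # The new dimensions become Prometheus labels rather than new
    # Graphite metric paths, so queries can slice and aggregate on them.
    RESPONSE_START.labels(dc=dc, host=host).observe(responsestart_ms / 1000.0)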

Gilles raised the priority of this task from Medium to High.

Change 591098 had a related patch set uploaded (by Gilles; owner: Gilles):
[performance/navtiming@master] Collect response time per DC/host for Prometheus

https://gerrit.wikimedia.org/r/591098

Change 591098 merged by jenkins-bot:
[performance/navtiming@master] Collect response time per DC/host for Prometheus

https://gerrit.wikimedia.org/r/591098

Mentioned in SAL (#wikimedia-operations) [2020-05-04T18:19:55Z] <dpifke@deploy1001> Started deploy [performance/navtiming@239d359]: Deploy navtiming with new/updated Prometheus metrics - T249822, T238086

Mentioned in SAL (#wikimedia-operations) [2020-05-04T18:20:00Z] <dpifke@deploy1001> Finished deploy [performance/navtiming@239d359]: Deploy navtiming with new/updated Prometheus metrics - T249822, T238086 (duration: 00m 05s)

Change 594264 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[performance/navtiming@master] Fix swapped host/dc in Prometheus responsestart

https://gerrit.wikimedia.org/r/594264

Change 594264 merged by jenkins-bot:
[performance/navtiming@master] Fix swapped host/dc in Prometheus responsestart

https://gerrit.wikimedia.org/r/594264
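
To illustrate the class of bug this patch addresses (hypothetical code, not the actual change): with prometheus_client, positional .labels() arguments must follow the declared labelnames order, so swapping them silently records hostnames under dc and vice versa.

# Buggy: values passed in the wrong positional order relative to
# labelnames=['dc', 'host'] -- hostnames end up under the dc label.
RESPONSE_START.labels(host, dc).observe(seconds)

# Fixed: keyword arguments make the mapping explicit.
RESPONSE_START.labels(dc=dc, host=host).observe(seconds)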

@dpifke once deployed, we will need to nuke the existing data for navtiming_responsestart_by_host_seconds on Prometheus. Otherwise it's going to pollute the label values for Grafana drop-down menus and such.

Mentioned in SAL (#wikimedia-operations) [2020-05-12T19:55:09Z] <dpifke@deploy1001> Started deploy [performance/navtiming@48110b9]: Fixes swapped dc/host labels - T238086

Mentioned in SAL (#wikimedia-operations) [2020-05-12T19:55:14Z] <dpifke@deploy1001> Finished deploy [performance/navtiming@48110b9]: Fixes swapped dc/host labels - T238086 (duration: 00m 05s)

This is deployed, and I updated the Grafana dashboard.

To nuke the data, we would need to restart Prometheus with the --web.enable-admin-api flag and run something like:

curl -X POST -g 'http://127.0.0.1:9090/api/v1/admin/tsdb/delete_series?match[]=navtiming_responsestart_by_host_seconds&end=2020-05-12T19:55:14+0000'
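
Note that delete_series only tombstones the matched samples; the space isn't reclaimed until a clean_tombstones call or the next compaction. A sketch of the full cleanup using Python's requests library, assuming the admin API is enabled on the local instance:

import requests

PROM = 'http://127.0.0.1:9090'

# Tombstone every sample of the mislabeled metric up to the fix deploy.
requests.post(
    f'{PROM}/api/v1/admin/tsdb/delete_series',
    params={
        'match[]': 'navtiming_responsestart_by_host_seconds',
        'end': '2020-05-12T19:55:14+00:00',
    },
).raise_for_status()

# Physically remove the tombstoned samples from disk.
requests.post(f'{PROM}/api/v1/admin/tsdb/clean_tombstones').raise_for_status()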

How important is it that the old labels are deleted? If it's just the Grafana variables we're worried about, we can use a regex there (until the old data ages out).

The dashboard won't open for me now; it's stuck on a spinner:

Screenshot 2020-05-13 at 09.42.57.png (234×414 px, 15 KB)

Weird. It was working yesterday (I verified that new data was appearing with the correct labels), but is now hanging for me as well. I'll investigate.

The dashboard was broken such that it would not even load the settings page. It seemed to hang indefinitely; I left it open in the background for an hour or so yesterday, and it was still spinning when I came back. I tried to get around Grafana not rendering the edit button by manually inserting ?editview=settings into the URI, but that hung as well.

I was able to back up a copy of the JSON definition by going to https://grafana.wikimedia.org/api/dashboards/uid/M7xQ_BeWk, but looking at the contents didn't give me any immediate clues as to why it wouldn't load. Re-importing that JSON as a new dashboard failed silently.

Recreating the entire dashboard from scratch and then overwriting the old copy seemed to be successful, although I'm sure I missed a few fields. If it's still working in a day or two, I'll clean it up; otherwise, consider this an experiment to see if it'll meet the same demise.

Regarding the label pollution, I added a regex to $dc which excludes values containing numerals (and thus hostnames). This fixes the drop-downs for the days when the labels were swapped; let me know if you think that's sufficient, or if I should request access/assistance from SRE to be able to delete the old data.
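
For example, a variable regex along the lines of /^[^0-9]*$/ on $dc would keep datacenter names (which contain no digits) while dropping the hostnames that leaked in while the labels were swapped; the exact expression used isn't recorded in this task.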

Not sure if this is merely a display issue, but I see fairly odd buckets on the dashboard:

  • 0ms - 438ms
  • 438ms - 877ms
  • 3.07s - 3.51s
  • 3.94s - 4.38s

Screenshot 2020-05-15 at 01.11.13.png (1×1 px, 110 KB)

That's because I forgot to change the query format to "Heatmap" in the panel settings. :) Fixed.

Gilles added a subscriber: Vgutierrez.

I think I've fixed the display further: the heatmap's data format needed to be "Time series buckets". Looking good!
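
For context: a Prometheus histogram exposes cumulative _bucket series distinguished by an le label, so a typical panel query would look something like sum(rate(navtiming_responsestart_by_host_seconds_bucket[5m])) by (le) with the query format set to "Heatmap". The panel's data format then needs to be "Time series buckets" so Grafana de-accumulates the le buckets instead of deriving its own bucket boundaries, which is what produced the odd ranges above.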

@ema @Vgutierrez you can now use this dashboard when doing per-host experiments!

Screenshot 2020-05-15 at 08.48.12.png (1×2 px, 781 KB)

This is great! Thank you @dpifke and @Gilles.

The dashboard stopped working on September 1st:

Screenshot from 2020-09-23 11-33-56.png (1×2 px, 276 KB)

That's because of the switchover: the metric is now coming from the "codfw prometheus/ops" source instead of the "eqiad prometheus/ops" one. I'm not sure how to fix that; it seems like a panel can only have one source.

For now I've switched the source; we'll have to remember to do it again when the primary DC is switched back.
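
One possible way to avoid the manual switching (not verified against this dashboard): Grafana supports template variables of type "datasource", so the panels could be bound to a $datasource dropdown and flipped between "eqiad prometheus/ops" and "codfw prometheus/ops" without editing each panel.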