
Edge cache response time per server should be monitored
Closed, ResolvedPublic

Description

We currently cannot see the effect of cache response time per instance when servers get depooled/repooled, etc.

Event Timeline

Gilles created this task.Nov 12 2019, 1:25 PM
ema moved this task from Triage to Caching on the Traffic board.Nov 12 2019, 3:19 PM
ema triaged this task as Medium priority.Nov 12 2019, 4:11 PM
Gilles moved this task from Inbox to Radar on the Performance-Team board.Nov 12 2019, 8:56 PM
Gilles edited projects, added Performance-Team (Radar); removed Performance-Team.
ema added a comment.Nov 19 2019, 4:43 PM

As per an IRC conversation with @Gilles, we do have frontend servers tagged in the navtiming Hadoop data. It would be very useful to have that information in Graphite and to add the cache frontends as a dropdown to https://grafana.wikimedia.org/d/000000143/navigation-timing

I think it makes more sense to expose new navtiming metrics with Prometheus instead, especially for things like this that require slicing data by a new dimension.

CDanis added a subscriber: CDanis.Nov 20 2019, 1:55 PM
Gilles claimed this task.Apr 20 2020, 3:16 PM
Gilles raised the priority of this task from Medium to High.

Change 591098 had a related patch set uploaded (by Gilles; owner: Gilles):
[performance/navtiming@master] Collect response time per DC/host for Prometheus

https://gerrit.wikimedia.org/r/591098

Change 591098 merged by jenkins-bot:
[performance/navtiming@master] Collect response time per DC/host for Prometheus

https://gerrit.wikimedia.org/r/591098

Mentioned in SAL (#wikimedia-operations) [2020-05-04T18:19:55Z] <dpifke@deploy1001> Started deploy [performance/navtiming@239d359]: Deploy navtiming with new/updated Prometheus metrics - T249822, T238086

Mentioned in SAL (#wikimedia-operations) [2020-05-04T18:20:00Z] <dpifke@deploy1001> Finished deploy [performance/navtiming@239d359]: Deploy navtiming with new/updated Prometheus metrics - T249822, T238086 (duration: 00m 05s)

Change 594264 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[performance/navtiming@master] Fix swapped host/dc in Prometheus responsestart

https://gerrit.wikimedia.org/r/594264

Change 594264 merged by jenkins-bot:
[performance/navtiming@master] Fix swapped host/dc in Prometheus responsestart

https://gerrit.wikimedia.org/r/594264

Gilles added a subscriber: dpifke.May 5 2020, 9:31 AM

@dpifke once deployed, we will need to nuke the existing data for navtiming_responsestart_by_host_seconds on Prometheus. Otherwise it's going to pollute the label values for Grafana drop-down menus and such.

Gilles added a comment.May 5 2020, 9:34 AM

Put together a dashboard (with the underlying labels swapped for now): https://grafana.wikimedia.org/d/M7xQ_BeWk/response-time-by-host?orgId=1

Mentioned in SAL (#wikimedia-operations) [2020-05-12T19:55:09Z] <dpifke@deploy1001> Started deploy [performance/navtiming@48110b9]: Fixes swapped dc/host labels - T238086

Mentioned in SAL (#wikimedia-operations) [2020-05-12T19:55:14Z] <dpifke@deploy1001> Finished deploy [performance/navtiming@48110b9]: Fixes swapped dc/host labels - T238086 (duration: 00m 05s)

This is deployed, and I updated the Grafana dashboard.

To nuke the data, we would need to restart Prometheus with --web.enable-admin-api flag and run something like:

curl -X POST -g 'http://127.0.0.1:9090/api/v1/admin/tsdb/delete_series?match[]=navtiming_responsestart_by_host_seconds&end=2020-05-12T19:55:14Z'
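A minimal sketch of building the same request URL in Python, so that `match[]` and the timestamp are percent-encoded instead of relying on curl's `-g`. The helper name `delete_series_url` is hypothetical; the endpoint, address, and metric name come from the command above. The timestamp is written in UTC with a `Z` suffix, since a literal `+` in a query string is decoded as a space. The request itself would still need to be sent as a POST to a Prometheus started with `--web.enable-admin-api`.

```python
from urllib.parse import urlencode


def delete_series_url(base, metric, end):
    """Build a delete_series URL (hypothetical helper, not part of navtiming)."""
    # urlencode percent-encodes the brackets in "match[]" and the colons in the timestamp.
    query = urlencode([("match[]", metric), ("end", end)])
    return f"{base}/api/v1/admin/tsdb/delete_series?{query}"


url = delete_series_url(
    "http://127.0.0.1:9090",
    "navtiming_responsestart_by_host_seconds",
    "2020-05-12T19:55:14Z",
)
print(url)
```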

How important is it that the old labels are deleted? If it's just the Grafana variables we're worried about, we can use a regex there (until the old data ages out).

The dashboard won't open for me now, it's stuck on a spinner:

Weird. It was working yesterday (I verified that new data was appearing with the correct labels), but is now hanging for me as well. I'll investigate.

The dashboard was broken such that it would not even load the settings page. It seemed to hang indefinitely; I left it open in the background for an hour or so yesterday, and it was still spinning when I came back. I tried to get around Grafana not rendering the edit button by manually inserting ?editview=settings into the URI, but that hung as well.

I was able to back up a copy of the JSON definition, by going to https://grafana.wikimedia.org/api/dashboards/uid/M7xQ_BeWk, but looking at the contents didn't give me any immediate clues as to why it wouldn't load. Re-importing that JSON as a new dashboard failed silently.
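The backup step can be sketched roughly as follows, assuming the response has the usual Grafana shape with the dashboard definition under a `dashboard` key next to `meta`. The function name, the sample payload, and the title in it are hypothetical, and the HTTP fetch is left out so the sketch runs offline.

```python
import json


def extract_dashboard(api_response_text):
    """Return the dashboard definition from a /api/dashboards/uid/<uid> response body."""
    payload = json.loads(api_response_text)
    # The API wraps the definition; only the "dashboard" object is needed for a backup.
    return payload["dashboard"]


# Hypothetical, heavily trimmed response body for illustration only.
sample = json.dumps({
    "meta": {"slug": "response-time-by-host"},
    "dashboard": {"uid": "M7xQ_BeWk", "title": "Response time by host", "panels": []},
})

backup = extract_dashboard(sample)
# Serialize with stable key order so two backups can be compared with a plain diff.
backup_text = json.dumps(backup, indent=2, sort_keys=True)
print(backup_text)
```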

Recreating the entire dashboard from scratch and then overwriting the old copy seemed to be successful, although I'm sure I missed a few fields. If it's still working in a day or two, I'll clean it up, otherwise consider this an experiment to see if it'll meet the same demise.

Regarding the label pollution, I added a regex to $dc which excludes values containing numerals (and thus hostnames). This fixes the drop-downs from the days the labels were swapped; let me know if you think that's sufficient, or if I should request access/assistance from SRE to be able to delete the old data.
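A small sketch of the filtering idea, assuming the `$dc` regex amounts to "keep only values with no digits". The exact pattern used in Grafana is not given in this task, and the sample label values (including the `cp…` hostnames) are hypothetical.

```python
import re

# Matches only values with no digits at all (assumed equivalent of the $dc variable regex).
NO_DIGITS = re.compile(r"[^0-9]+")


def dc_values(label_values):
    """Keep label values that contain no numerals, dropping hostname-like entries."""
    return [v for v in label_values if NO_DIGITS.fullmatch(v)]


# Hypothetical mix of label values from the period when dc and host were swapped.
polluted = ["eqiad", "cp1075", "codfw", "cp2027", "esams"]
print(dc_values(polluted))  # ['eqiad', 'codfw', 'esams']
```

This only works as long as data center names never contain digits and hostnames always do.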

Not sure if this is merely a display issue, but I see fairly odd buckets on the dashboard:

  • 0ms - 438ms
  • 438ms - 877ms
  • 3.07s - 3.51s
  • 3.94s - 4.38s

That's because I forgot to change query format to "heatmap" in the panel settings. :) Fixed.
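One likely reason the format matters: Prometheus histogram buckets are cumulative (each `le` bucket counts every observation at or below its bound), while a heatmap needs the count inside each bucket. The sketch below shows that conversion on made-up numbers; the function name and sample data are hypothetical, and this is not Grafana's actual code.

```python
def per_bucket_counts(cumulative):
    """Convert cumulative (upper_bound, count) pairs into per-bucket counts."""
    # Sort by upper bound so each bucket is compared with the one just below it.
    ordered = sorted(cumulative, key=lambda pair: pair[0])
    result = []
    previous = 0
    for upper_bound, count in ordered:
        result.append((upper_bound, count - previous))
        previous = count
    return result


# Hypothetical cumulative counts for bounds of 0.1s, 0.5s, 1s and +Inf.
sample = [(0.5, 70), (0.1, 12), (1.0, 95), (float("inf"), 100)]
print(per_bucket_counts(sample))
```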

Gilles closed this task as Resolved.May 15 2020, 6:50 AM
Gilles added a subscriber: Vgutierrez.

I think that I've fixed the display further: the format of the heatmap needed to be "Time series buckets". Looking good!

@ema @Vgutierrez you can now use this dashboard when doing per-host experiments!

ema added a comment.May 15 2020, 7:18 AM

@ema @Vgutierrez you can now use this dashboard when doing per-host experiments!

This is great! Thank you @dpifke and @Gilles.

ema awarded a token.May 15 2020, 7:19 AM
ema reopened this task as Open.Sep 23 2020, 9:34 AM

The dashboard stopped working on September 1st:

That's because of the switchover: the metric is now coming from the "codfw prometheus/ops" source instead of the "eqiad prometheus/ops" source. I'm not sure how to fix that; it seems like a panel can only have one source.

Gilles closed this task as Resolved.Sep 23 2020, 10:07 AM

For now I've switched the source; we'll have to remember to do it again when the primary DC is switched back.
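If the manual switch becomes tedious, one option is to rewrite the data source in an exported dashboard definition. The sketch below assumes each panel stores its data source as a plain string under a `datasource` key, which may not match every Grafana version. The function name and sample dashboard are hypothetical; the two source names are the ones mentioned above.

```python
import copy


def switch_datasource(dashboard, old, new):
    """Return a copy of the dashboard with panel data sources renamed from old to new."""
    updated = copy.deepcopy(dashboard)  # leave the caller's definition untouched
    for panel in updated.get("panels", []):
        if panel.get("datasource") == old:
            panel["datasource"] = new
    return updated


# Hypothetical, heavily trimmed dashboard definition.
dashboard = {
    "uid": "M7xQ_BeWk",
    "panels": [
        {"title": "Response time", "datasource": "eqiad prometheus/ops"},
        {"title": "Notes", "datasource": None},
    ],
}

switched = switch_datasource(dashboard, "eqiad prometheus/ops", "codfw prometheus/ops")
print([p["datasource"] for p in switched["panels"]])
```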