
Expose CPU benchmark data to Prometheus/Grafana
Closed, Resolved (Public)

Description

This requires:

  • Adding request context to the CPUBenchmark schema
  • Populating that context in the NavigationTiming extension
  • Consuming that context in the navtiming daemon
  • Exposing faceted data in the navtiming daemon
  • Creating a Grafana dashboard consuming the new data

Event Timeline

Change 693163 had a related patch set uploaded (by Gilles; author: Gilles):

[schemas/event/secondary@master] Add request context fields to CpuBenchmark schema

https://gerrit.wikimedia.org/r/693163

Change 693163 merged by jenkins-bot:

[schemas/event/secondary@master] Add request context fields to CpuBenchmark schema

https://gerrit.wikimedia.org/r/693163

Change 693415 had a related patch set uploaded (by Gilles; author: Gilles):

[mediawiki/extensions/NavigationTiming@master] Add request context to CpuBenchmark event

https://gerrit.wikimedia.org/r/693415

Change 693423 had a related patch set uploaded (by Gilles; author: Gilles):

[performance/navtiming@master] Send CpuBenchmark data to Prometheus

https://gerrit.wikimedia.org/r/693423

Change 693415 merged by jenkins-bot:

[mediawiki/extensions/NavigationTiming@master] Add request context to CpuBenchmark event

https://gerrit.wikimedia.org/r/693415

Change 693423 merged by jenkins-bot:

[performance/navtiming@master] Send CpuBenchmark data to Prometheus

https://gerrit.wikimedia.org/r/693423

Krinkle moved this task from Doing: Goals to Inbox, needs triage on the Performance-Team board.
dpifke subscribed.

This should go out tomorrow (Tuesday). I'll verify all the dependencies are deployed, and keep an eye on the rollout of this.

There was an open question (at least in my mind) as to whether https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/693163 had been deployed, or whether we needed to ask the Analytics folks to do something.

I've verified it's live at https://schema.wikimedia.org/#!//secondary/jsonschema/analytics/legacy/cpubenchmark.

Useful background reading: https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate#EventGate_clusters

Change 712725 had a related patch set uploaded (by Dave Pifke; author: Dave Pifke):

[performance/navtiming@master] Fix CpuBenchmark unhandled exception

https://gerrit.wikimedia.org/r/712725

Change 712725 merged by jenkins-bot:

[performance/navtiming@master] Fix CpuBenchmark unhandled exception

https://gerrit.wikimedia.org/r/712725

Mentioned in SAL (#wikimedia-operations) [2021-08-18T00:38:06Z] <dpifke@deploy1002> Started deploy [performance/navtiming@88f12a0]: Re-deploy fixed CpuBenchmark (T281243)

Mentioned in SAL (#wikimedia-operations) [2021-08-18T00:38:15Z] <dpifke@deploy1002> Finished deploy [performance/navtiming@88f12a0]: Re-deploy fixed CpuBenchmark (T281243) (duration: 00m 06s)

Mentioned in SAL (#wikimedia-operations) [2021-08-18T00:39:13Z] <dpifke@deploy1002> Started deploy [performance/navtiming@88f12a0]: Revert CpuBenchmark again (T281243)

Mentioned in SAL (#wikimedia-operations) [2021-08-18T00:39:21Z] <dpifke@deploy1002> Finished deploy [performance/navtiming@88f12a0]: Revert CpuBenchmark again (T281243) (duration: 00m 05s)

Sigh. New issue:

Aug 18 00:39:15 webperf2001 navtiming[12027]: 2021-08-18 00:39:15,462 [ERROR] (run:899) Unhandled exception in main loop, restarting consumer
Aug 18 00:39:15 webperf2001 navtiming[12027]: Traceback (most recent call last):
Aug 18 00:39:15 webperf2001 navtiming[12027]:   File "/srv/deployment/performance/navtiming-cache/revs/88f12a07957a223a2d6805c290ce97a6471cbd6b/navtiming/__init__.py", line 891, in run
Aug 18 00:39:15 webperf2001 navtiming[12027]:     for stat in f(meta):
Aug 18 00:39:15 webperf2001 navtiming[12027]:   File "/srv/deployment/performance/navtiming-cache/revs/88f12a07957a223a2d6805c290ce97a6471cbd6b/navtiming/__init__.py", line 599, in handle_cpu_benchmark
Aug 18 00:39:15 webperf2001 navtiming[12027]:     bucketed_battery_level = str(int(round(event['batteryLevel'] * 10) * 10))
Aug 18 00:39:15 webperf2001 navtiming[12027]: KeyError: 'batteryLevel'
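
The traceback shows handle_cpu_benchmark indexing event['batteryLevel'] directly, which raises for any event that omits the field. Below is a minimal sketch of a defensive version of that one expression. It is not the code merged in change 713723: the helper name bucket_battery_level and the 'unknown' fallback label are hypothetical, and it assumes batteryLevel is a fraction between 0 and 1, as the multiply-by-10 rounding in the traceback suggests.

```python
def bucket_battery_level(event, default='unknown'):
    """Return the battery level as a string label, bucketed to steps of 10.

    Uses the same rounding expression as the line in the traceback, but
    tolerates events that lack 'batteryLevel'. The 'unknown' fallback is
    an assumption made for this sketch.
    """
    level = event.get('batteryLevel')
    if level is None:
        # Field absent (or explicitly null): return a placeholder label
        # instead of raising KeyError and restarting the consumer.
        return default
    # Assumed to be a fraction in [0, 1], e.g. 0.87 -> '90'.
    return str(int(round(level * 10) * 10))
```

With this shape, an event without the field yields a label rather than an unhandled exception in the main loop. Whether such events should be labelled, or skipped entirely, is a design choice this sketch does not settle.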

Change 713723 had a related patch set uploaded (by Dave Pifke; author: Dave Pifke):

[performance/navtiming@master] Fix CpuBenchmark KeyError

https://gerrit.wikimedia.org/r/713723

Change 713723 merged by jenkins-bot:

[performance/navtiming@master] Fix CpuBenchmark KeyError on missing batteryLevel

https://gerrit.wikimedia.org/r/713723

Mentioned in SAL (#wikimedia-operations) [2021-08-19T15:52:56Z] <dpifke@deploy1002> Started deploy [performance/navtiming@f8bf39f]: Deploy CpuBenchmark processor again T281243

Mentioned in SAL (#wikimedia-operations) [2021-08-19T15:53:05Z] <dpifke@deploy1002> Finished deploy [performance/navtiming@f8bf39f]: Deploy CpuBenchmark processor again T281243 (duration: 00m 06s)

Quick and dirty dashboard I threw together to verify the data is coming in: https://grafana.wikimedia.org/d/cFMjrb7nz/cpu-benchmark?orgId=1

Change 716045 had a related patch set uploaded (by Dave Pifke; author: Dave Pifke):

[performance/navtiming@master] Reduce cardinality of CpuBenchmark metrics

https://gerrit.wikimedia.org/r/716045

Change 716045 merged by jenkins-bot:

[performance/navtiming@master] Reduce cardinality of CpuBenchmark metrics

https://gerrit.wikimedia.org/r/716045

Mentioned in SAL (#wikimedia-operations) [2021-09-01T23:04:13Z] <dpifke@deploy1002> Started deploy [performance/navtiming@63c9d31]: Deploy fix for CpuBenchmark-related Prometheus timeouts T281243

Mentioned in SAL (#wikimedia-operations) [2021-09-01T23:04:19Z] <dpifke@deploy1002> Finished deploy [performance/navtiming@63c9d31]: Deploy fix for CpuBenchmark-related Prometheus timeouts T281243 (duration: 00m 06s)
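
For context on why cardinality mattered for the Prometheus timeouts mentioned above: every distinct combination of label values becomes its own time series, so the worst-case series count for a metric is the product of the number of values each label can take. The sketch below is a rough illustration only. The label names and value counts are made up, not taken from the navtiming code or from change 716045.

```python
from math import prod  # available in Python 3.8+


def worst_case_series(label_value_counts):
    """Upper bound on time series for one metric.

    label_value_counts maps each label name to the number of distinct
    values that label can take.
    """
    return prod(label_value_counts.values())


# Hypothetical label sets, for illustration only.
before = {'country': 200, 'browser': 20, 'battery_level': 11}
after = {'continent': 7, 'browser_family': 5, 'battery_level': 11}

print(worst_case_series(before))  # 44000
print(worst_case_series(after))   # 385
```

Coarsening or dropping a single high-cardinality label shrinks the product far more than trimming a label that only takes a few values.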