
Create Icinga check for navtiming.py service health
Closed, Resolved, Public

Description

If navtiming.py stops working correctly for any reason (e.g. it is no longer writing data to Graphite), we should know immediately, rather than relying on someone noticing missing data while browsing Grafana.

The service uses Scap3 for deployments and systemd for automatic start/restart, but we don't monitor its overall health in any way.
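
As a rough illustration of what such a health check could look like, here is a minimal sketch of the decision logic for an Icinga/Nagios-style plugin. It assumes the service's throughput is available from a Prometheus instant query (the standard `/api/v1/query` JSON response) and that a low value means the service is unhealthy. The function name `evaluate`, the thresholds, and the "events per second" interpretation are all hypothetical and not taken from any merged patch.

```python
import json
import math

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(response_body, warning, critical):
    """Map a Prometheus instant-query JSON response to (exit_code, message).

    Assumption: the query returns a rate of processed events, so values at
    or below the thresholds are treated as unhealthy.
    """
    try:
        payload = json.loads(response_body)
        if payload["status"] != "success":
            return UNKNOWN, "Prometheus query failed"
        results = payload["data"]["result"]
        if not results:
            return UNKNOWN, "query returned no samples"
        # Each vector sample looks like:
        # {"metric": {...}, "value": [<unix timestamp>, "<value as string>"]}
        value = float(results[0]["value"][1])
    except (ValueError, KeyError, TypeError, IndexError):
        return UNKNOWN, "could not parse Prometheus response"

    if math.isnan(value):
        return UNKNOWN, "query returned NaN"
    if value <= critical:
        return CRITICAL, f"processing rate is {value:g}/s"
    if value <= warning:
        return WARNING, f"processing rate is {value:g}/s"
    return OK, f"processing rate is {value:g}/s"
```

A wrapper script would fetch the query result over HTTP, print the message, and exit with the returned code. Returning UNKNOWN when no samples come back keeps a broken metrics pipeline from being silently reported as healthy.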

Event Timeline

Change 597176 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[operations/puppet@production] Add check_prometheus rules for navtiming

https://gerrit.wikimedia.org/r/597176

Change 597176 merged by CDanis:
[operations/puppet@production] Add check_prometheus rules for navtiming

https://gerrit.wikimedia.org/r/597176

Change 619256 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::webperf::processors: add generic dashboard_links

https://gerrit.wikimedia.org/r/619256

Change 619256 merged by Elukey:
[operations/puppet@production] profile::webperf::processors: fix prometheus monitors

https://gerrit.wikimedia.org/r/619256

Hi! Puppet was broken on the webperf hosts, so I just merged https://gerrit.wikimedia.org/r/619256 to re-enable it. Ideally, dashboard_links should link to a specific graph for the metric that we are alerting on. I didn't have a lot of context about the alerts, so I just added the generic https://grafana.wikimedia.org/d/000000143/navigation-timing. If you come up with better links, I'll be happy to merge the change!

Thanks. I've filed T260086 as a follow-up, to come up with a better dashboard to link for our alerts.