
Create Icinga check for navtiming.py service health
Closed, Resolved · Public

Description

If for whatever reason the service is not working correctly (e.g. not writing data to Graphite), we should know immediately, rather than relying on someone noticing missing data while browsing Grafana.

The service uses Scap3 for deployments and systemd for automatic start/restart, but we don't monitor its overall health in any way.
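As a rough illustration of the kind of check this task asks for, here is a minimal sketch of a data-freshness check that follows the Nagios/Icinga plugin exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name `check_freshness`, its parameters, and the thresholds are hypothetical and chosen only for illustration. How the timestamp of the newest datapoint is obtained is left out, and this is not necessarily how the check ended up being implemented.

```python
import time

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_freshness(last_datapoint_ts, now=None, warn_after=300, crit_after=900):
    """Return (exit_code, message) based on the age of the newest datapoint.

    last_datapoint_ts is a Unix timestamp, or None if no datapoint was found.
    The thresholds (in seconds) are illustrative assumptions, not values
    taken from this task.
    """
    if now is None:
        now = time.time()
    if last_datapoint_ts is None:
        return UNKNOWN, "UNKNOWN: no datapoint found"

    age = now - last_datapoint_ts
    if age >= crit_after:
        return CRITICAL, f"CRITICAL: last datapoint is {age:.0f}s old"
    if age >= warn_after:
        return WARNING, f"WARNING: last datapoint is {age:.0f}s old"
    return OK, f"OK: last datapoint is {age:.0f}s old"
```

A wrapper script would print the message and call `sys.exit()` with the returned code, which is what Icinga uses to decide the service state.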

See also:

Event Timeline

Krinkle created this task. Jun 13 2019, 6:14 PM
Restricted Application added a subscriber: Aklapper. Jun 13 2019, 6:14 PM
Gilles assigned this task to dpifke. Jan 7 2020, 11:38 AM

Change 597176 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[operations/puppet@production] Add check_prometheus rules for navtiming

https://gerrit.wikimedia.org/r/597176

fgiunchedi moved this task from Inbox to Radar on the observability board. Mon, Jul 20, 1:28 PM

Change 597176 merged by CDanis:
[operations/puppet@production] Add check_prometheus rules for navtiming

https://gerrit.wikimedia.org/r/597176

Change 619256 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::webperf::processors: add generic dashboard_links

https://gerrit.wikimedia.org/r/619256

Change 619256 merged by Elukey:
[operations/puppet@production] profile::webperf::processors: fix prometheus monitors

https://gerrit.wikimedia.org/r/619256

elukey added a subscriber: elukey. Mon, Aug 10, 8:03 AM

Hi! Puppet was broken on the webperf hosts, so I just merged https://gerrit.wikimedia.org/r/619256 to re-enable it. Ideally dashboard_links should contain a link to a specific graph showing the metric that we are alarming on. I didn't have a lot of context about the alerts, so I just added the generic https://grafana.wikimedia.org/d/000000143/navigation-timing. If you come up with better links I'll be happy to merge the change!

dpifke closed this task as Resolved. Mon, Aug 10, 8:16 PM

Thanks. I've filed T260086 as a follow-up, to come up with a better dashboard to link for our alerts.