
Adapt Toolschecker to work with Prometheus
Open, Needs Triage · Public

Description

toolschecker performs certain actions when triggered by Icinga. More details are available here: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Monitoring

Ideally, it should be integrated with Prometheus rather than remaining a one-off solution (in effect, a Python clone of NRPE).

If there are other ways to arrange these checks so they play better with Prometheus, those should be investigated too.
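
As a rough illustration of one possible arrangement (not what toolschecker does today): each check could report a 0/1 gauge plus a duration in the Prometheus text exposition format, which could then be served over HTTP or written to a file for a textfile-style collector. The check names, metric names, and registry below are hypothetical and exist only for this sketch.

```python
import time
from typing import Callable, Dict, Tuple


# Hypothetical checks, for illustration only; the real toolschecker
# checks are not reproduced here.
def check_always_ok() -> bool:
    return True


def check_always_failing() -> bool:
    raise RuntimeError("simulated failure")


# Hypothetical registry mapping a check name to its function.
CHECKS: Dict[str, Callable[[], bool]] = {
    "example_ok": check_always_ok,
    "example_failing": check_always_failing,
}


def run_check(func: Callable[[], bool]) -> Tuple[bool, float]:
    """Run one check; an exception counts as a failure."""
    start = time.monotonic()
    try:
        ok = bool(func())
    except Exception:
        ok = False
    return ok, time.monotonic() - start


def render_metrics(checks: Dict[str, Callable[[], bool]]) -> str:
    """Render all check results in Prometheus text exposition format."""
    success = [
        "# HELP toolschecker_check_success 1 if the check passed, 0 otherwise.",
        "# TYPE toolschecker_check_success gauge",
    ]
    duration = [
        "# HELP toolschecker_check_duration_seconds Time spent running the check.",
        "# TYPE toolschecker_check_duration_seconds gauge",
    ]
    for name in sorted(checks):
        ok, elapsed = run_check(checks[name])
        success.append(
            f'toolschecker_check_success{{check="{name}"}} {int(ok)}'
        )
        duration.append(
            f'toolschecker_check_duration_seconds{{check="{name}"}} {elapsed:.6f}'
        )
    return "\n".join(success + duration) + "\n"


if __name__ == "__main__":
    print(render_metrics(CHECKS), end="")
```

With this shape, alerting would move out of the checker and into Prometheus rules on the success gauge, so the checker itself would only need to report state.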

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jan 10 2019, 11:37 AM
GTirloni removed a subscriber: GTirloni. · Mar 21 2019, 9:06 PM

Change 519718 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] icinga: fix tools checker stretch jobs

https://gerrit.wikimedia.org/r/519718

This fixes these critical alerts in Icinga to reduce noise and give us a clearer direction for migrating to Prometheus.

Bstorm added a subscriber: Bstorm. · Jun 28 2019, 9:15 PM

Those two fixes are probably good, but the issues with them run deeper. There's a race condition in the way they work with cron and webservice.

Change 519718 merged by Jhedden:
[operations/puppet@production] icinga: fix tools checker stretch jobs

https://gerrit.wikimedia.org/r/519718

Never mind! The race condition timeout issue is this one: T221301

What you did there should be a clean fix!