Port most/all Icinga checks to Prometheus/Alertmanager
Open, Needs Triage (Public)

Description

This is a tracking task for the general work of moving alerts from Icinga to Prometheus/Alertmanager.

Note that the title says "most" because, while the ideal end goal is to migrate all alerts (and thus shut down Icinga), that might be impractical and/or too much effort relative to the gains.

On a pragmatic level, though, what we can do is reduce Icinga's scope over time and turn it into a "backend" component. In this scenario, for example, we would stop using Icinga's web UI for all/most operations and delegate all functionality to AM / alerts.w.o.

Related Objects

Status      Subtype     Assigned        Task
Open                    None
Open                    None
Resolved                fgiunchedi
Resolved                lmata
Resolved                fgiunchedi
Resolved                Ladsgroup
Open                    None
Resolved                fgiunchedi
Invalid                 None
Resolved                Volans
Resolved                fgiunchedi
Resolved                herron
Open                    herron
Open                    None
Resolved                fgiunchedi
Resolved                None
Resolved                Arnoldokoth
Resolved                fgiunchedi
Open                    None
Open                    None
Resolved                None
Resolved                EBernhardson
Resolved                BTullis
Resolved                jbond
In Progress BUG REPORT  None
Resolved                jhathaway
Resolved                BCornwall
Resolved                BCornwall
Duplicate               None
Resolved                fgiunchedi
Resolved                fgiunchedi
Resolved                JMeybohm
Resolved                BCornwall
Resolved                fgiunchedi
Resolved                cmooney
Open                    None
Open                    None
Open                    None
Open                    None
Open                    None
Resolved                jbond
Open                    None
Resolved                cmooney
Resolved                SLyngshede-WMF
Resolved                SLyngshede-WMF
Open                    None
Resolved                fgiunchedi
Open                    None
Open                    None
Open                    SLyngshede-WMF
Open                    None
Open                    taavi
Open                    taavi
Open                    None
Open                    None
Resolved                taavi
Resolved                nskaggs
Resolved                taavi
Resolved                taavi
Resolved                dcaro
Open                    None
Resolved                taavi
Open                    taavi

Event Timeline

Have we thought about creating a small middleware that would convert the Nagios output format into Prometheus-scrapable metrics (maybe including some kind of memory/disk cache for long-running checks, or for those that are supposed to run only once per hour)? I checked and I don't see anything existing that does that. (Note that I am talking about Nagios checks, without Icinga; I know there is already a scraper in place for the Icinga service itself.)

While this would not be ideal in many cases (native solutions would be preferred), it would avoid headaches like T315866#8194791, where a very inferior metrics solution is proposed as a substitute for a proper, long-standing Icinga check; instead, the check's specific logic would be reused on a better management system. A single job would scrape all Icinga-based checks for a host and aggregate them into Prometheus metrics, including the error text. That would allow us to fully replace Icinga itself while keeping the custom alert logic, all consolidated in Prometheus. This would also solve the issue of many upstream solutions not having room for a few custom WMF-specific checks, leading the way to an Icinga-free WMF.
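
To make the idea concrete, here is a rough sketch of what the conversion step of such a middleware could look like. It is only an illustration under stated assumptions, not an existing tool: the metric names (`nagios_check_status`, `nagios_check_info`, `nagios_check_perfdata`) and the function names are made up for this example, and the perfdata parsing only handles simple `label=value` tokens without quoted labels.

```python
import re
import subprocess

# Nagios plugin exit codes: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
VALID_STATES = {0, 1, 2, 3}

# Leading numeric part of a perfdata value such as "95%;80;90" or "0.42;5;10".
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def escape_label(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def run_check(argv: list, timeout: float = 30.0) -> tuple:
    """Run a Nagios plugin and return (exit_code, stdout)."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return proc.returncode, proc.stdout


def nagios_to_prometheus(check: str, exit_code: int, output: str) -> str:
    """Render one plugin result as Prometheus text-format metrics.

    All metric names here are hypothetical, chosen for illustration only.
    """
    first_line = output.splitlines()[0] if output.strip() else ""
    text, _, perfdata = first_line.partition("|")
    # Any exit code outside the plugin convention is reported as UNKNOWN (3).
    status = exit_code if exit_code in VALID_STATES else 3
    name = escape_label(check)
    lines = [
        f'nagios_check_status{{check="{name}"}} {status}',
        # The human-readable text is carried as a label on an info-style metric.
        f'nagios_check_info{{check="{name}",output="{escape_label(text.strip())}"}} 1',
    ]
    for token in perfdata.split():
        label, _, rest = token.partition("=")
        match = NUMBER_RE.match(rest)
        if label and match:
            lines.append(
                f'nagios_check_perfdata{{check="{name}",'
                f'label="{escape_label(label)}"}} {float(match.group())}'
            )
    return "\n".join(lines) + "\n"
```

The result of `run_check([...])` for each configured plugin could be fed into `nagios_to_prometheus` and the concatenated text served on a scrape endpoint, with slow checks cached between scrapes. One caveat with this design: putting the error text in a label creates a new time series every time the text changes, so it may be worth truncating or normalising it.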

The closest thing I can think of is nrpe_exporter (https://github.com/canonical/nrpe_exporter), which is certainly something we can consider!

Nice. I see this still uses the NRPE daemon.

I think in general this shouldn't be plan A for any migration. But for complex cases like the one I mention (WMF-specific behaviour potentially not found upstream), or others where there are no longer maintainers around to do the right thing, I am sure we could use this or something similar and migrate the Puppet class to it, eventually getting rid of Icinga itself (which I think we all agree is not a great alert manager).

For example, when I wrote check_bacula.py from scratch, I implemented both the Nagios output format and a Prometheus exporter daemon, but I guess there may be very old, small checks that could take a lot of time to migrate to proper exporters.
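
As an illustration of the "both formats from one check" approach, here is a minimal sketch of a check that computes its result once and renders it either as a Nagios status line or as Prometheus metrics. This is not the actual check_bacula.py code; the `BackupStatus` type, the metric name `backup_last_success_age_hours` and the thresholds are all hypothetical, and job names are assumed not to need label escaping.

```python
from dataclasses import dataclass

OK, WARNING, CRITICAL = 0, 1, 2
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


@dataclass(frozen=True)
class BackupStatus:
    """Result of the (hypothetical) check logic, independent of output format."""
    job: str          # name of the backup job
    age_hours: float  # hours since the last successful backup


def render_nagios(status: BackupStatus, warn: float, crit: float) -> tuple:
    """Return (exit_code, status_line) following the Nagios plugin conventions."""
    if status.age_hours >= crit:
        code = CRITICAL
    elif status.age_hours >= warn:
        code = WARNING
    else:
        code = OK
    line = (
        f"BACKUP {STATE_NAMES[code]} - {status.job} last succeeded "
        f"{status.age_hours:.1f}h ago | "
        f"age_hours={status.age_hours:.1f};{warn:g};{crit:g}"
    )
    return code, line


def render_prometheus(statuses: list) -> str:
    """Render the raw values as a gauge in the Prometheus text format.

    Thresholds are deliberately left out: in this sketch they would live in
    the alerting rules rather than in the check itself.
    """
    lines = [
        "# HELP backup_last_success_age_hours Hours since the last successful backup.",
        "# TYPE backup_last_success_age_hours gauge",
    ]
    for s in statuses:
        lines.append(
            f'backup_last_success_age_hours{{job_name="{s.job}"}} {s.age_hours}'
        )
    return "\n".join(lines) + "\n"
```

With the check logic separated from the rendering like this, the Nagios side can stay in place during a migration while the Prometheus side is scraped in parallel, and the Nagios renderer can be dropped once the alert has moved.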