Port all Icinga checks to Prometheus/Alertmanager
Open, Needs Triage (Public)

Description

This is a tracking task for the general work of moving alerts from Icinga to Prometheus/Alertmanager.

Subtasks with the GOAL subtype include an auto-generated migration table to track progress. The others are related to the migration as well, but they contain additional notes or refer to checks migrated before the automated auditing process was implemented.

Related Objects

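Status | Subtype | Assigned | Task
Open | - | tappof |
Open | - | tappof |
Open | - | tappof |
Resolved | - | fgiunchedi |
Resolved | - | lmata |
Resolved | - | fgiunchedi |
Resolved | - | fgiunchedi |
Resolved | - | None |
Resolved | - | Arnoldokoth |
Resolved | - | fgiunchedi |
Duplicate | - | None |
Resolved | Goal | tappof |
Resolved | - | None |
Resolved | - | EBernhardson |
Resolved | - | BTullis |
Resolved | - | jbond |
Resolved | - | jhathaway |
Resolved | - | BCornwall |
Resolved | - | BCornwall |
Duplicate | - | None |
Resolved | - | fgiunchedi |
Resolved | - | fgiunchedi |
Resolved | - | JMeybohm |
Resolved | - | BCornwall |
Resolved | - | fgiunchedi |
Resolved | - | cmooney |
In Progress | Goal | tappof |
Resolved | - | tappof |
Invalid | - | None |
Open | Goal | None |
Resolved | - | ABran-WMF |
Resolved | - | BTullis |
Resolved | - | BTullis |
Resolved | - | ABran-WMF |
Declined | - | ABran-WMF |
Resolved | - | ABran-WMF |
Open | - | None |
Open | - | None |
Open | - | None |
In Progress | - | None |
Resolved | - | fgiunchedi |
Resolved | - | jbond |
Resolved | Goal | colewhite |
Resolved | - | cmooney |
Resolved | - | SLyngshede-WMF |
Resolved | - | SLyngshede-WMF |
Open | - | None |
Resolved | - | fgiunchedi |
Resolved | - | fgiunchedi |
Open | - | None |
Open | Goal | SLyngshede-WMF |
Resolved | BUG REPORT | fgiunchedi |
Open | Goal | herron |
Open | Goal | SLyngshede-WMF |
Open | Goal | herron |
Open | - | None |
Resolved | - | fgiunchedi |
Invalid | - | None |
Resolved | - | Volans |
Resolved | - | fgiunchedi |
Resolved | - | fgiunchedi |
Open | Goal | herron |
Resolved | - | tappof |
Resolved | BUG REPORT | tappof |
Resolved | - | tappof |
Resolved | - | cmooney |
Resolved | - | tappof |
Resolved | - | tappof |
Open | Goal | tappof |
Resolved | - | andrea.denisse |
Open | Goal | tappof |
Resolved | - | Stevemunene |
Open | Goal | tappof |
Resolved | BUG REPORT | tappof |
Resolved | - | tappof |
Resolved | - | BTullis |
In Progress | Goal | SLyngshede-WMF |
Open | - | SLyngshede-WMF |
Open | Goal | herron |
Resolved | - | fgiunchedi |
Resolved | - | taavi |
Resolved | - | fgiunchedi |
Resolved | - | BTullis |
Open | Goal | tappof |
Resolved | - | fgiunchedi |
Resolved | Goal | colewhite |
Open | Goal | None |
Open | Goal | None |
Open | Goal | None |
Open | Goal | None |
Open | Goal | None |
Resolved | - | fgiunchedi |
Open | Goal | None |
Open | Goal | None |
Open | Goal | None |
Open | Goal | None |
Open | Goal | None |
Open | Goal | None |
Open | Goal | None |
Open | Goal | None |
Open | Goal | None |
Open | Goal | None |
Open | Goal | None |
Open | Goal | None |
Open | Goal | None |
Open | Goal | None |
Open | Goal | bking |
Open | Goal | None |
Open | - | None |
Open | - | None |
Open | - | None |
Resolved | - | taavi |
Resolved | - | nskaggs |
Resolved | - | taavi |
Resolved | - | taavi |
Resolved | - | dcaro |
Open | - | None |
Resolved | - | taavi |
Open | - | None |
Resolved | - | taavi |
Open | - | None |
Open | Goal | None |
Open | Goal | None |
Open | Goal | None |
Open | Goal | None |
Open | Goal | None |
Open | Goal | None |
Open | Goal | None |
Open | Goal | None |
Open | Goal | None |
Open | Goal | None |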

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 991801 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] icinga: remove legacy check_nagios_paging

https://gerrit.wikimedia.org/r/991801

Change 991801 merged by Filippo Giunchedi:

[operations/puppet@production] icinga: remove legacy check_nagios_paging

https://gerrit.wikimedia.org/r/991801

fgiunchedi renamed this task from "Port most/all Icinga checks to Prometheus/Alertmanager" to "Port all Icinga checks to Prometheus/Alertmanager". (Sep 6 2024, 8:18 AM)
fgiunchedi updated the task description.

Change #1074368 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] grafana: remove obsolete check

https://gerrit.wikimedia.org/r/1074368

Change #1074369 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] librenms: remove obsolete checks

https://gerrit.wikimedia.org/r/1074369

Change #1074369 merged by Filippo Giunchedi:

[operations/puppet@production] librenms: remove obsolete checks

https://gerrit.wikimedia.org/r/1074369

Change #1074368 merged by Filippo Giunchedi:

[operations/puppet@production] grafana: remove obsolete check

https://gerrit.wikimedia.org/r/1074368

Change #1075515 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] vopsbot: remove systemd::service alert, replaced by alertmanager

https://gerrit.wikimedia.org/r/1075515

Change #1075516 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] icinga: replace url checks with pingthing

https://gerrit.wikimedia.org/r/1075516

Change #1075515 merged by Filippo Giunchedi:

[operations/puppet@production] vopsbot: remove systemd::service alert, replaced by alertmanager

https://gerrit.wikimedia.org/r/1075515

Change #1075516 merged by Filippo Giunchedi:

[operations/puppet@production] icinga: replace url checks with pingthing

https://gerrit.wikimedia.org/r/1075516

I would like to ask the observability team (Observability-Alerting) how to handle a very simple check that currently lives in Icinga. I am not too worried about this individual check, since I know how to fix it technically; this is more a question of how to approach this kind of check in general.

Normal service checks should either have metrics already integrated, use a dedicated exporter, or have their existing custom Icinga checks converted into a custom Prometheus exporter.

And then we have things like modules/icinga/files/check_legal_html.py. I am not sure how many such things we have, but possibly quite a few: simple one-off scripts that perform a single check (in this case, at Legal's request) and just email the legal team (and some SREs) if they fail. This is not a technical issue; I know how to turn that into a Prometheus metric, but I am not sure if I *should*. This is not a script I have to maintain, nor do I want to, but while it was just a Python file in Puppet, I didn't mind. Now, setting up a dedicated exporter and a dedicated scraper just for Error/No error and a message feels wrong and overblown (it is not a metric, it is literally just an alert).

I could set up an email automation, but emails make for poor alerting.

I wonder if we could have a VM, with some Puppet code, where we could put assorted semi-orphaned Icinga scripts with barely any modification and scrape them all together with minimal effort, for cases where a dedicated metrics service makes no sense?

Hi @jcrespo, and thank you for getting in touch with us.

> Normal service checks should either have metrics already integrated, use a dedicated exporter, or have their existing custom Icinga checks converted into a custom Prometheus exporter.

> And then we have things like modules/icinga/files/check_legal_html.py. I am not sure how many such things we have, but possibly quite a few: simple one-off scripts that perform a single check (in this case, at Legal's request) and just email the legal team (and some SREs) if they fail. This is not a technical issue; I know how to turn that into a Prometheus metric, but I am not sure if I *should*. This is not a script I have to maintain, nor do I want to, but while it was just a Python file in Puppet, I didn't mind. Now, setting up a dedicated exporter and a dedicated scraper

I agree with you: emails are not meant for alerts. On the other hand, you shouldn't feel obligated to write a custom exporter for every script you have. IMHO, you have two possible options, and both require running the script on a schedule. You also don't need a custom scraper at this point: every node runs node-exporter, which automatically exports the metrics from files placed in the directory /var/lib/prometheus/node.d. Alternatively, you can send your metrics to the Pushgateway instance, which could be even more lightweight.
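As a rough illustration of the two options above, here is a minimal Python sketch, not an existing tool. It renders a pass/fail result in the Prometheus text exposition format, writes it atomically as a `.prom` file (the only file suffix node-exporter's textfile collector parses), and builds a Pushgateway request without sending it. The metric names, the `check` label, the check name, and the gateway URL are all hypothetical, and the check name is assumed to contain no characters that need escaping in a label value.

```python
import os
import tempfile
import urllib.request

# Directory named in the comment above; node-exporter's textfile collector
# only parses files whose name ends in ".prom".
TEXTFILE_DIR = "/var/lib/prometheus/node.d"


def render_metrics(check_name: str, ok: bool, now: float) -> str:
    """Render a boolean check result in the Prometheus text exposition format.

    Metric names are hypothetical and chosen for this example only.
    """
    label = '{check="%s"}' % check_name
    lines = [
        "# HELP legacy_check_ok 1 if the last run of the check passed, 0 otherwise.",
        "# TYPE legacy_check_ok gauge",
        "legacy_check_ok%s %d" % (label, 1 if ok else 0),
        "# HELP legacy_check_last_run_timestamp_seconds Unix time of the last run.",
        "# TYPE legacy_check_last_run_timestamp_seconds gauge",
        "legacy_check_last_run_timestamp_seconds%s %d" % (label, int(now)),
    ]
    return "\n".join(lines) + "\n"


def write_textfile(directory: str, check_name: str, body: str) -> str:
    """Write <check_name>.prom atomically so a scrape never sees a partial file."""
    final_path = os.path.join(directory, check_name + ".prom")
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=check_name + ".", suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file 0600; the exporter must read it
    os.replace(tmp_path, final_path)  # atomic rename within the same directory
    return final_path


def build_push_request(gateway_url: str, job: str, body: str) -> urllib.request.Request:
    """Build (but do not send) a Pushgateway request for the same payload.

    PUT on /metrics/job/<job> replaces all metrics in that group.
    """
    url = "%s/metrics/job/%s" % (gateway_url.rstrip("/"), job)
    return urllib.request.Request(url, data=body.encode("utf-8"), method="PUT")
```

A scheduled job (a systemd timer, for instance) would run the existing check, call `render_metrics` with its result and `time.time()`, and then either call `write_textfile(TEXTFILE_DIR, ...)` or pass the request to `urllib.request.urlopen`. The timestamp metric is there so that an alert can also fire when the script stops running, not only when the check fails.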

> just for Error/No error and a message feels wrong and overblown (it is not a metric, it is literally just an alert).

> I could set up an email automation, but emails make for poor alerting.

I totally get what you're saying about boolean metrics not being metrics but literally alerts. However, I still think it's better to have boolean metrics that represent simple on/off alerts than to have multiple custom, divergent alerting systems like the emails you mentioned. This way, you can take advantage of the shared monitoring/alerting infrastructure and its benefits, such as the ability to choose how to enrich and route your alerts, an alerts dashboard, and so on.

> I wonder if we could have a VM, with some Puppet code, where we could put assorted semi-orphaned Icinga scripts with barely any modification and scrape them all together with minimal effort, for cases where a dedicated metrics service makes no sense?

My opinion is that it could be beneficial to consolidate all of this into a single dedicated VM, but I have some concerns regarding the ownership of the machine itself and the onboarded scripts, as well as the commitment required to migrate the scripts. I see that the script you mentioned has no dependencies, but I can imagine other cases where, for example, there are interactions with the file system or with daemons that are not exposed outside of the "original" machine. I think that managing these aspects could potentially slow down the migration further.
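Setting the ownership and dependency concerns aside, the mechanics of running unmodified Icinga scripts and exposing them together could look roughly like the Python sketch below. It assumes the scripts follow the conventional Nagios/Icinga plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The metric name `legacy_check_state`, the `check` label, and the inventory of commands are hypothetical.

```python
import subprocess
import sys

# Conventional Nagios/Icinga plugin exit codes.
STATE_NAMES = {0: "ok", 1: "warning", 2: "critical", 3: "unknown"}


def run_check(command: list, timeout: float = 30.0) -> int:
    """Run one plugin-style script unmodified and return its state (0-3)."""
    try:
        result = subprocess.run(command, capture_output=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return 3  # could not start or timed out: report UNKNOWN
    code = result.returncode
    # Any exit code outside the convention (including death by signal) is UNKNOWN.
    return code if code in STATE_NAMES else 3


def render_states(states: dict) -> str:
    """Render {check name: state} as one gauge in the text exposition format."""
    lines = [
        "# HELP legacy_check_state Plugin exit state: 0 ok, 1 warning, 2 critical, 3 unknown.",
        "# TYPE legacy_check_state gauge",
    ]
    for name in sorted(states):
        lines.append('legacy_check_state{check="%s"} %d' % (name, states[name]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical inventory; a real one would list the migrated scripts.
    checks = {
        "always_ok": [sys.executable, "-c", "raise SystemExit(0)"],
        "always_critical": [sys.executable, "-c", "raise SystemExit(2)"],
    }
    print(render_states({name: run_check(cmd) for name, cmd in checks.items()}), end="")
```

The rendered text could be written to the node-exporter textfile directory or pushed to the Pushgateway. This only covers scripts with no local dependencies; checks that need the file system or daemons of their original host would still have to run there.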

Change #1114650 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] kartotherian: disable icinga check

https://gerrit.wikimedia.org/r/1114650

Change #1114651 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] profile: remove obsolete poolcounter icinga checks

https://gerrit.wikimedia.org/r/1114651

Change #1114655 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] dumps: remove nfs port icinga checks

https://gerrit.wikimedia.org/r/1114655

Change #1114655 merged by Filippo Giunchedi:

[operations/puppet@production] dumps: remove nfs port icinga checks

https://gerrit.wikimedia.org/r/1114655

Change #1114650 merged by Filippo Giunchedi:

[operations/puppet@production] kartotherian: disable icinga check

https://gerrit.wikimedia.org/r/1114650

Change #1114651 merged by Filippo Giunchedi:

[operations/puppet@production] profile: remove obsolete poolcounter icinga checks

https://gerrit.wikimedia.org/r/1114651

Change #1155607 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):

[operations/puppet@production] monitoring services: add migration task T321808 to instances

https://gerrit.wikimedia.org/r/1155607

Change #1155607 merged by Tiziano Fogli:

[operations/puppet@production] monitoring services: add migration task T321808 to instances

https://gerrit.wikimedia.org/r/1155607

Change #1181791 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] opensearch: selectively enable cluster health check

https://gerrit.wikimedia.org/r/1181791

Change #1181791 merged by Cwhite:

[operations/puppet@production] opensearch: selectively enable cluster health check

https://gerrit.wikimedia.org/r/1181791