This task tracks feedback on the current Alertmanager IRC notifications, delivered via alertmanager-irc-relay (i.e. the webhook -> IRC bot that does the actual work), which in turn runs the jinxer-wm bot.
Recoveries for grouped alerts (e.g. SystemdUnitFailed) are not clear
Case in point: for an httpbb unit failure reported by @Joe, the recovery in jinxer-wm showed up as a decrease in the firing count rather than as a resolved message:
2024-02-06T18:03:25 jinxer-wm (SystemdUnitFailed) firing: (4) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2024-02-06T18:05:16 icinga-wm RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2024-02-06T18:05:20 icinga-wm RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2024-02-06T18:14:44 icinga-wm RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2024-02-06T18:17:26 jinxer-wm (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2024-02-06T18:28:25 jinxer-wm (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2024-02-06T18:29:24 icinga-wm RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
In this case icinga instead emitted a recovery for the individual unit + host, which is clearer about what is going on.
This has to do with the labels used for alert grouping: by default we group by alertname (SystemdUnitFailed above). Something we should explore is also grouping by name (i.e. the systemd unit name), but only for systemd-related alerts, to essentially get per-unit alerts (and recoveries).
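As a rough sketch of what that could look like (the receiver name and the exact label set are illustrative assumptions, not our actual config), an Alertmanager sub-route could override group_by for systemd alerts only:

```yaml
route:
  # Default grouping: one IRC notification per alertname.
  group_by: ['alertname']
  receiver: irc-default            # hypothetical receiver name
  routes:
    - matchers:
        - alertname = "SystemdUnitFailed"
      # Also group by the systemd unit name and the host, so each
      # unit/host pair fires and resolves as its own notification.
      group_by: ['alertname', 'name', 'instance']
      receiver: irc-default        # hypothetical receiver name
```

More specific group_by labels mean more (but smaller and less ambiguous) notifications, so the trade-off is IRC volume versus clarity of recoveries.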
Another case is "slow trickle" alert groups, where even as things are recovering the notification still shows FIRING, albeit with a decreasing number of alerts, for example:
09:44 -jinxer-wm:#wikimedia-traffic- FIRING: [10x] PuppetZeroResources: Puppet has failed generate resources on cp1104:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
09:49 -jinxer-wm:#wikimedia-traffic- FIRING: [13x] PuppetZeroResources: Puppet has failed generate resources on cp1102:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
09:50 -jinxer-wm:#wikimedia-traffic- FIRING: [13x] PuppetZeroResources: Puppet has failed generate resources on cp1102:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
09:54 -jinxer-wm:#wikimedia-traffic- FIRING: [15x] PuppetZeroResources: Puppet has failed generate resources on cp1100:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
10:04 -jinxer-wm:#wikimedia-traffic- FIRING: [16x] PuppetZeroResources: Puppet has failed generate resources on cp1100:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
Then things start recovering:
10:09 -jinxer-wm:#wikimedia-traffic- FIRING: [15x] PuppetZeroResources: Puppet has failed generate resources on cp1100:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
10:14 -jinxer-wm:#wikimedia-traffic- FIRING: [13x] PuppetZeroResources: Puppet has failed generate resources on cp1100:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
And finally the group as a whole has recovered:
10:19 -jinxer-wm:#wikimedia-traffic- RESOLVED: [9x] PuppetZeroResources: Puppet has failed generate resources on cp1100:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
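For context on the repeated FIRING lines: Alertmanager re-notifies a receiver whenever a group's membership changes (alerts added or resolved), at most once per group_interval, and the notification only flips to RESOLVED once every alert in the group has resolved. A sketch of the relevant route knobs (values illustrative, not our production settings):

```yaml
route:
  group_by: ['alertname']
  group_wait: 30s        # delay before the first notification for a new group
  group_interval: 5m     # minimum gap between updates when the group changes
  repeat_interval: 4h    # re-notify an unchanged, still-firing group
```

So with a "slow trickle" of recoveries, each group_interval tick can emit another FIRING update with a smaller count, which is exactly the pattern in the log above.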