
Alertmanager IRC notifications feedback and improvements
Open, Needs Triage, Public

Description

This task tracks feedback on the current Alertmanager IRC notifications, delivered by alertmanager-irc-relay (i.e. the actual webhook -> IRC bot that does the work), which in turn runs the jinxer-wm bot.
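
For context, the wiring on the Alertmanager side is a stock webhook receiver that posts grouped notifications to alertmanager-irc-relay, which then formats them into the IRC lines quoted below. A minimal sketch, assuming a hypothetical receiver name and a placeholder relay address (not the production values):

receivers:
  - name: irc-jinxer-wm                       # hypothetical receiver name
    webhook_configs:
      # alertmanager-irc-relay exposes an HTTP listener for Alertmanager
      # webhook POSTs; this URL is a placeholder, not the real endpoint.
      - url: 'http://localhost:8000/jinxer-wm'
        send_resolved: true                   # also forward RESOLVED notifications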

Recoveries for grouped alerts (e.g. SystemdUnitFailed) are not clear

Case in point: for an httpbb unit failure reported by @Joe, the recovery in jinxer-wm showed up as a decrease in the firing count rather than a resolved message:

2024-02-06T18:03:25 jinxer-wm (SystemdUnitFailed) firing: (4) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2024-02-06T18:05:16 icinga-wm RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2024-02-06T18:05:20 icinga-wm RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2024-02-06T18:14:44 icinga-wm RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2024-02-06T18:17:26 jinxer-wm (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2024-02-06T18:28:25 jinxer-wm (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2024-02-06T18:29:24 icinga-wm RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state

In this case Icinga instead emitted a recovery for the individual unit + host, which is clearer about what is going on.

This has to do with the alert grouping labels: by default we group by alertname (SystemdUnitFailed above). Something we should explore is also grouping by name (i.e. the systemd unit name), but only for systemd-related alerts, to essentially get per-unit alerts (and recoveries), as sketched below.
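
A minimal sketch of that idea against a stock Alertmanager routing tree (receiver name hypothetical, matcher syntax assuming a recent Alertmanager):

route:
  group_by: ['alertname']              # current default: one group per alert name
  receiver: irc-jinxer-wm              # hypothetical receiver name
  routes:
    # Sub-route for systemd-related alerts: also group by the 'name' label
    # (the systemd unit), so each unit gets its own firing and resolved
    # notification instead of sharing a single group count.
    - matchers:
        - alertname = "SystemdUnitFailed"
      group_by: ['alertname', 'name']
      receiver: irc-jinxer-wm

The trade-off is more IRC lines during a wide outage, since every failing unit becomes its own notification group.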

Another case is "slow trickle" alert groups, where even as things are recovering the notification still shows "FIRING", albeit with a decreasing number of alerts, for example:

09:44 -jinxer-wm:#wikimedia-traffic- FIRING: [10x] PuppetZeroResources: Puppet has failed generate resources on 
          cp1104:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
          https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
          https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
09:49 -jinxer-wm:#wikimedia-traffic- FIRING: [13x] PuppetZeroResources: Puppet has failed generate resources on 
          cp1102:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
          https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
          https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
09:50 -jinxer-wm:#wikimedia-traffic- FIRING: [13x] PuppetZeroResources: Puppet has failed generate resources on 
          cp1102:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
          https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
          https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
09:54 -jinxer-wm:#wikimedia-traffic- FIRING: [15x] PuppetZeroResources: Puppet has failed generate resources on 
          cp1100:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
          https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
          https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
10:04 -jinxer-wm:#wikimedia-traffic- FIRING: [16x] PuppetZeroResources: Puppet has failed generate resources on 
          cp1100:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
          https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
          https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources

Then things start recovering:

10:09 -jinxer-wm:#wikimedia-traffic- FIRING: [15x] PuppetZeroResources: Puppet has failed generate resources on 
          cp1100:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
          https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
          https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
10:14 -jinxer-wm:#wikimedia-traffic- FIRING: [13x] PuppetZeroResources: Puppet has failed generate resources on 
          cp1100:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
          https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
          https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources

And finally, the group as a whole has recovered:

10:19 -jinxer-wm:#wikimedia-traffic- RESOLVED: [9x] PuppetZeroResources: Puppet has failed generate resources on 
          cp1100:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
          https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
          https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
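
For reference, the cadence of those [Nx] updates is governed by the route's grouping timers: Alertmanager re-sends the grouped notification whenever the group's membership changes, and only emits RESOLVED once every member has resolved. A sketch with illustrative values, not the production settings:

route:
  receiver: irc-jinxer-wm     # hypothetical receiver name
  group_by: ['alertname']
  group_wait: 30s             # delay before the first notification for a new group
  group_interval: 5m          # earliest re-notification when the group changes
                              # (the [13x] -> [15x] -> [13x] updates above)
  repeat_interval: 4h         # re-send an unchanged, still-firing group after this long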