Icinga alerts that should open tasks instead of alerting
Open, Low · Public · 0 Story Points

Description

While chasing Icinga alerts, I noticed that some of them would have been better as tasks than (ignored) alerts.

I don't know if it's possible to have a warning open a task and a critical state raise an alert. If so, the ones below would be good candidates (and unknowns could be good ones as well, as they usually mean something is wrong with the check itself).

  • Disk space (warning)
  • BGP status (warning)
  • Long running screen/tmux
  • Memory correctable errors -EDAC-
  • mgmt SSH not working
  • Hosts XXX is not in mediawiki-installation dsh group

I'd bet other alerts fall into the same category; maybe we can extend this list as they show up?

Event Timeline

ayounsi triaged this task as Low priority. Jun 5 2019, 7:24 PM
ayounsi created this task.
Restricted Application added a subscriber: Aklapper. Jun 5 2019, 7:24 PM

FYI: the current raid_handler.py could be adapted, or (ideally) its generic parts extracted, so that handlers for other types of checks can be added easily. Both the state (WARNING, CRITICAL, etc.) and the state type (HARD, SOFT) can be passed to the handler, which can then decide what to do based on them.
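To make the idea concrete, here is a minimal sketch of how a generic handler could decide what to do from the state and state type. It is not taken from raid_handler.py: the `Action` enum, the `decide_action` name, and the mapping itself (WARNING and UNKNOWN open a task, CRITICAL alerts, SOFT states are ignored) are assumptions that only mirror the proposal in the task description.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical follow-up actions a handler could take."""
    IGNORE = "ignore"
    OPEN_TASK = "open_task"
    ALERT = "alert"


def decide_action(state: str, state_type: str) -> Action:
    """Map a check state and state type to a follow-up action.

    Assumed policy (from the task description, not from existing code):
    HARD WARNING/UNKNOWN open a task, HARD CRITICAL alerts.
    """
    state = state.upper()
    state_type = state_type.upper()

    # SOFT states are still being re-checked, so do nothing yet.
    if state_type != "HARD":
        return Action.IGNORE

    if state in ("WARNING", "UNKNOWN"):
        return Action.OPEN_TASK
    if state == "CRITICAL":
        return Action.ALERT

    # OK (recovery) or any unrecognized state: nothing to do.
    return Action.IGNORE
```

A per-check handler would then only need to supply the task title and body, while the decision logic stays shared; the policy could also be made configurable per check if some warnings should still alert.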

elukey added a subscriber: elukey. Jun 7 2019, 8:35 AM

The proposal makes sense, but I'd make sure that we don't pollute Phabricator when we already have a place to summarize alarms (namely, Icinga). I usually check Icinga daily, and I consider it a summary of outstanding problems. It breaks my workflow if somebody acks the alarm and creates a task, because I might miss it. But as Volans was saying, the raid handler script is already a good use case, so probably my workflow is wrong and a more Phabricator-task-driven approach is better. Just adding my 2c :)

ayounsi updated the task description. Jul 9 2019, 1:40 AM
Volans added a comment. Jul 9 2019, 6:32 AM

@ayounsi I'm not sure the last two added in the last update should not alarm. What are the criteria used? According to the proposed document for incident response, only incidents of level 5 should open tasks instead of alerting, IMHO.

The criterion is that I went through the list of active Icinga alerts, and those two (T227548 and T227547) had been active for 5 days (and still going).
This list is not meant to be authoritative, but more "things I find along the way".