
Icinga alerts that should open tasks instead of alerting
Open, Low · Public · 0 Estimated Story Points


While chasing Icinga alerts, I noticed that some of them would have been better as tasks than (ignored) alerts.

I don't know if it's possible to have warnings open a task and criticals alert. If so, the two below would be good candidates (and unknowns could be good ones as well, as they usually mean something is wrong with the check itself).

  • Disk space (warning)
  • BGP status (warning)
  • Long running screen/tmux
  • Memory correctable errors -EDAC-
  • mgmt SSH not working
  • Hosts XXX is not in mediawiki-installation dsh group
  • SSL WARNING - Certificate XXXX valid until 2020-06-20 07:01:41 +0000 (expires in 53 days)
  • HP RAID - WARNING: Slot 0: Predictive Failure...

I'd bet there are other alerts in the same situation. Maybe we can add to that list as they show up?

Event Timeline

ayounsi created this task.

FYI: the current RAID handler could be adapted, or (ideally) its generic parts extracted, to make it easy to add other handlers for different types of checks. Both the state (WARNING, CRITICAL, etc.) and the state type (HARD, SOFT) can be passed to the handler, which can decide what to do based on those.
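A minimal sketch of that decision step, not the actual RAID handler code. It assumes the handler receives the state and state type as plain strings, and that the policy is the one proposed in the task description (WARNING and UNKNOWN open a task, CRITICAL alerts). The names `Action` and `decide` are hypothetical.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical outcomes a handler could choose between."""
    IGNORE = "ignore"
    OPEN_TASK = "open_task"
    ALERT = "alert"


def decide(state: str, state_type: str) -> Action:
    """Map a check's state and state type to an action.

    Assumed policy (from the task description, not an existing script):
    SOFT states are still being retried, so they are ignored;
    HARD WARNING/UNKNOWN open a task; HARD CRITICAL alerts.
    """
    if state_type != "HARD":
        return Action.IGNORE
    if state in ("WARNING", "UNKNOWN"):
        return Action.OPEN_TASK
    if state == "CRITICAL":
        return Action.ALERT
    # OK (recovery) or anything unexpected: do nothing in this sketch.
    return Action.IGNORE
```

Keeping the policy in one small function like this would let each check type plug in its own rules while the generic parts (argument parsing, task creation) stay shared.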

The proposal makes sense, but I'd make sure that we don't pollute Phabricator when we already have a place to summarize alarms (namely, Icinga). I usually check Icinga daily, and I consider it a summary of outstanding problems. It breaks my workflow if somebody acks the alarm and creates a task, because I might miss it. But as Volans was saying, the RAID handler script is already a good use case, so probably my workflow is wrong and a more Phabricator-task-driven approach is better. Just adding my 2c :)

@ayounsi I'm not sure the last two added in the last update should not alarm. What criteria were used? According to the proposed document for incident response, only incidents of level 5 should open tasks instead of alerting IMHO.

The criterion is that I went through the list of active Icinga alerts, and those two (T227548 and T227547) had been active for 5 days (and still going).
This list is not meant to be authoritative but more "things I find along the way".

"It breaks my workflow if somebody acks the alarm and creates a task, because I might miss it"

I really disagree with this. I think creating a task and ACKing it is the right thing to do. After somebody saw it and made a task that means it has been seen. That is the point of ACK, to have a mechanism to know what has been seen and what is new.

Without ACKing, things stay in the "unhandled" column and you have no way to know if somebody else already saw them or not. This leads to either more people ignoring them or a person getting pinged about the same thing multiple times.

Either we'd have multiple people looking at Icinga every day and wondering about the same things repeatedly or we'd have nobody handle them because people assume others do it. Both are things to avoid. I think the only way to avoid them is to use the provided mechanism for letting you know if it has been handled.

If incoming tickets in Phabricator get no attention then that's a separate problem.

I like what we do for degraded RAIDs. I think something like this will help us move forward.

Thank you for the feedback and use cases. We have this feature on the roadmap (namely, connecting alerts to Phabricator) and it will get prioritized. One of the benefits of alertmanager is that it makes it possible to add this feature easily (e.g.