
Icinga alerts that should open tasks instead of alerting
Open, Low, Public, 0 Estimated Story Points

Description

While chasing Icinga alerts, I noticed that some of them would have been better as tasks than (ignored) alerts.

I don't know if it's possible to have warnings open a task while criticals alert. If so, the two below would be good candidates (and unknowns could be good candidates as well, since an unknown usually means something is wrong with the check itself).

  • Disk space (warning)
  • BGP status (warning)
  • Memory correctable errors -EDAC-
  • mgmt SSH not working
  • Hosts XXX is not in mediawiki-installation dsh group
  • SSL WARNING - Certificate XXXX valid until 2020-06-20 07:01:41 +0000 (expires in 53 days)
  • HP RAID - WARNING: Slot 0: Predictive Failure...
  • Power supply issues should open high priority tasks for DCops - T225140#8803716

I'd bet there are other alerts in the same situation; maybe we can complete this list as they show up?

Event Timeline

ayounsi created this task.

FYI, the current raid_handler.py could be adapted or (ideally) have its generic parts extracted, so that other handlers for different types of checks can be added easily. Both the state (WARNING, CRITICAL, etc.) and the state type (HARD, SOFT) can be passed to the handler, which can decide what to do based on those.

The proposal makes sense, but I'd make sure that we don't pollute Phabricator when we already have a place to summarize alarms (namely, Icinga). I usually check Icinga daily, and I consider it a summary of outstanding problems. It breaks my workflow if somebody acks the alarm and creates a task, because I might miss it. But as Volans was saying, the raid handler script is already a good use case, so probably my workflow is wrong and a more Phabricator-task-driven approach is better. Just adding my 2c :)

@ayounsi I'm not sure the last two added in the latest update should not alarm. What criteria were used? According to the proposed document for incident response, only incidents of level 5 should open tasks instead of alerting IMHO.

The criterion is that I went through the list of active Icinga alerts, and those two (T227548 and T227547) had been active for 5 days (and still going).
This list is not meant to be authoritative, more "things I find along the way".

It breaks my workflow if somebody acks the alarm and creates a task, because I might miss it

I really disagree with this. I think creating a task and ACKing the alert is the right thing to do. Once somebody has seen it and made a task, it has been seen. That is the point of ACK: to have a mechanism for knowing what has been seen and what is new.

Without ACKing, things stay in the "unhandled" column and you have no way to know whether somebody else has already seen them. This leads to either more people ignoring them or a person getting pinged about them multiple times.

Either we'd have multiple people looking at Icinga every day and wondering about the same things repeatedly, or we'd have nobody handling them because everyone assumes somebody else does. Both are things to avoid, and I think the only way to avoid them is to use the provided mechanism for marking what has been handled.

If incoming tickets in Phabricator get no attention, then that's a separate problem.

I like what we do for degraded RAIDs; I think it will help us move forward with something like this.

Thank you for the feedback and use cases; we have this feature on the roadmap (namely connecting alerts to Phabricator) and it will get prioritized. One of the benefits of Alertmanager is that it makes it possible to add this feature easily (e.g. https://github.com/knyar/phalerts).

LibreNMS Inbound interface errors too

Since we've set up task opening for AM alerts this quarter, we can definitely tackle some of these.

Ideally we would add a severity: task label to the alert, but since LibreNMS' AM integration doesn't allow user-defined labels, we'll have to match the alert name inside AM's routing (ditto for Icinga alerts).

Note that the alerts won't get automatically ACK'd (e.g. in Icinga), but tasks will be opened for each distinct alert name.
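
For illustration, a minimal sketch of what that routing could look like, assuming a nested route under the existing top-level Alertmanager route and a task-opening receiver named phabricator-netops (the alert name and receiver name here are assumptions, not the production config):

routes:
  # Match the LibreNMS alert by name, since the integration can't attach custom labels
  - match:
      alertname: 'Inbound interface errors'
    receiver: phabricator-netops
    repeat_interval: 7d  # avoid re-notifying (and re-filing) too often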

We'd also need the PHIDs of the projects to open tasks in (so that renaming a project stays safe). So far I got the following from https://phabricator.wikimedia.org/api/phid.lookup, using ["#sre"] for example as the argument:

  • #netops -> PHID-PROJ-h2zjwfqqi5cxjonrkfa7
  • #sre -> PHID-PROJ-5hj6ygnanfu23mmnlvmd
  • #observability -> PHID-PROJ-dwtj3e5mikntyhdbnohb
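
On the receiver side, a hedged sketch assuming a phalerts-style webhook service listening locally; the address, path, and query parameter are made up for illustration, and how the target project PHID is actually passed depends on the service:

receivers:
  # Hypothetical task-opening receiver: forwards webhook notifications to a
  # phalerts-like service that files Phabricator tasks in the given project.
  - name: phabricator-netops
    webhook_configs:
      - url: 'http://localhost:8292/alerts?phid=PHID-PROJ-h2zjwfqqi5cxjonrkfa7'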

Change 675129 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] alertmanager: open tasks for librenms inbound interface errors

https://gerrit.wikimedia.org/r/675129

Change 675129 merged by Filippo Giunchedi:
[operations/puppet@production] alertmanager: get librenms alerts for dcops to open tasks

https://gerrit.wikimedia.org/r/675129

I've got a query regarding this, which is something of a follow-up to T310359, in which we are trying to make sure that all relevant alerts are routed to our team from Icinga.

How would we decide which tags (or which teams) to apply when creating a ticket from an alert?

For example, with the Disk Space alert mentioned above:

  • if the /var/lib/hadoop/journal file system is filling up on analytics1068, how will the system know that alerts for this host should create a task and tag it with Data-Engineering?

We've already got our data-engineering-task receiver set up in AM, although we don't use it yet.

I'm just trying to work out how this would work for a similar check (e.g. disk space) where there might be multiple teams looking after the different servers involved.

Maybe we can leverage:

analytics1068:~$ cat /etc/wikimedia/contacts.yaml 
---
role::analytics_cluster::hadoop/worker:
- Data Engineering

Good question @BTullis! For node-exporter-level metrics (e.g. disk metrics) we already have the cluster label in the metrics. One solution would be to route alerts based on that and on the information @ayounsi mentioned. We have run into a similar use case for WMCS-related alerts in https://gerrit.wikimedia.org/r/c/operations/puppet/+/802074, where we rewrite the team label based on the cluster label.

I need to give this a little more thought, but we could generalize the cluster -> team mapping; that should be easy since cluster is already a label. Going from Puppet role to team will require a little more thought, but I think it should be doable!
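
As a rough sketch of the first half (cluster -> team), one option would be Prometheus alert relabeling; the cluster values and team names below are examples, not the actual production mapping:

alerting:
  alert_relabel_configs:
    # Rewrite the team label from the cluster label before alerts reach
    # Alertmanager; routing can then match on team.
    - source_labels: [cluster]
      regex: 'analytics|hadoop'      # assumed cluster names
      target_label: team
      replacement: data-engineering
    - source_labels: [cluster]
      regex: 'wmcs'
      target_label: team
      replacement: wmcs

Alertmanager routes could then send team=data-engineering alerts to the data-engineering-task receiver mentioned above.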

Another thought, maybe we could do:
All alerts that have been firing for more than X days (e.g. 7) become a task automatically.

At that point it's clear that the issue is not critical.
It would help keep the alerting dashboard clean and surface real issues more efficiently.
The downside is that it might "hide issues under the rug".
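
One sketch-level way to express this would be a meta-alert on Prometheus' built-in ALERTS metric, with a long for: duration and a task severity so that only week-old alerts get routed to a task-opening receiver; the rule name and label choices below are made up:

groups:
  - name: long_running_alerts_example
    rules:
      # Fires when any other alert has been firing continuously for 7 days.
      # label_replace copies the original alertname into source_alert, since
      # the meta-alert sets its own alertname.
      - alert: AlertFiringForAWeek
        expr: label_replace(ALERTS{alertstate="firing", severity!="task"}, "source_alert", "$1", "alertname", "(.+)")
        for: 7d
        labels:
          severity: task   # assumed label that routes to the task-opening receiver
        annotations:
          summary: '{{ $labels.source_alert }} has been firing for over 7 days'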

Change 854039 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] base: remove check_long_procs, unused

https://gerrit.wikimedia.org/r/854039

Change 854040 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] alertmanager: use 'site' label to route tasks for dcops

https://gerrit.wikimedia.org/r/854040

Change 854039 merged by Dzahn:

[operations/puppet@production] base: remove check_long_procs, unused

https://gerrit.wikimedia.org/r/854039

Change 854040 merged by Filippo Giunchedi:

[operations/puppet@production] alertmanager: use 'site' label to route tasks for dcops

https://gerrit.wikimedia.org/r/854040

Power supply issues should open high priority tasks for DCops

Screenshot 2023-04-25 at 09-13-38 Alerts for wikimedia.org.png (226×736 px, 44 KB)

ayounsi updated the task description.

Change 913110 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: let power supply issues open tasks

https://gerrit.wikimedia.org/r/913110

Change 913110 merged by Filippo Giunchedi:

[operations/alerts@master] sre: let power supply issues open tasks

https://gerrit.wikimedia.org/r/913110