
Icinga alerts that should open tasks instead of alerting
Open, Low, Public, 0 Estimated Story Points

Description

While chasing Icinga alerts, I noticed that some of them would have been better as tasks than (ignored) alerts.

I don't know if it's possible to have warnings open a task while criticals alert. If so, the two below would be good candidates (and unknowns could be good candidates as well, since an unknown usually means something is wrong with the check itself).

  • Disk space (warning)
  • BGP status (warning)
  • Memory correctable errors -EDAC-
  • mgmt SSH not working
  • Hosts XXX is not in mediawiki-installation dsh group
  • SSL WARNING - Certificate XXXX valid until 2020-06-20 07:01:41 +0000 (expires in 53 days)
  • HP RAID - WARNING: Slot 0: Predictive Failure...
  • Power supply issues should open high priority tasks for DCops - T225140#8803716

I'd bet there are other alerts in the same situation; maybe we can complete this list as they show up?

Event Timeline

ayounsi created this task.

FYI, the current raid_handler.py could be adapted or (ideally) have its generic parts extracted, so that other handlers for different types of checks can be added easily. Both the state (WARNING, CRITICAL, etc.) and the state type (HARD, SOFT) can be passed to the handler, which can decide what to do based on those.

The proposal makes sense, but I'd make sure that we don't pollute Phabricator when we already have a place to summarize alarms (namely, Icinga). I usually check Icinga daily, and I consider it a summary of outstanding problems. It breaks my workflow if somebody acks the alarm and creates a task, because I might miss it. But as Volans was saying, the raid handler script is already a good use case, so probably my workflow is wrong and a more Phabricator-task-driven approach is better. Just adding my 2c :)

@ayounsi I'm not sure the last two added in the latest update should not alarm. What criteria were used? According to the proposed document for incident response, only incidents of level 5 should open tasks instead of alerting IMHO.

The criterion is that I went through the list of active Icinga alerts, and those two (T227548 and T227547) had been active for 5 days (and still going).
This list is not meant to be authoritative, more "things I find along the way".

It breaks my workflow if somebody acks the alarm and creates a task, because I might miss it

I really disagree with this. I think creating a task and ACKing the alert is the right thing to do. Once somebody has seen it and made a task, it has been seen. That is the point of ACK: to have a mechanism for knowing what has been seen and what is new.

Without ACKing, things stay in the "unhandled" column and you have no way to know whether somebody else has already seen them. This leads to either more people ignoring them or a person getting pinged about them multiple times.

Either we'd have multiple people looking at Icinga every day and wondering about the same things repeatedly, or we'd have nobody handling them because everyone assumes somebody else does. Both are things to avoid, and I think the only way to avoid them is to use the provided mechanism for marking what has been handled.

If incoming tickets in Phabricator get no attention, then that's a separate problem.

I like what we do for degraded RAIDs; I think it will help us move forward with something like this.

Thank you for the feedback and use cases; we have this feature on the roadmap (namely connecting alerts to Phabricator) and it will get prioritized. One of the benefits of Alertmanager is that it makes it possible to add this feature easily (e.g. https://github.com/knyar/phalerts).

LibreNMS Inbound interface errors too

Since we've set up task opening for AM alerts this quarter, we can definitely tackle some of these.

Ideally we would add a severity: task label to the alert, but since LibreNMS' AM integration doesn't allow user-defined labels, we'll have to match the alert name inside AM's routing (ditto for Icinga alerts).

Note that the alerts won't get automatically ACK'd (e.g. in Icinga), but tasks will be opened for each distinct alert name.
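
For illustration, a minimal sketch of what that routing could look like, assuming a nested route under the existing top-level Alertmanager route and a task-opening receiver named phabricator-netops (the alert name and receiver name here are assumptions, not the production config):

routes:
  # Match the LibreNMS alert by name, since the integration can't attach custom labels
  - match:
      alertname: 'Inbound interface errors'
    receiver: phabricator-netops
    repeat_interval: 7d  # avoid re-notifying (and re-filing) too often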

We'd also need the PHIDs of the projects to open tasks in (so that renaming a project stays safe). So far I got the following from https://phabricator.wikimedia.org/api/phid.lookup, using ["#sre"] for example as the argument:

  • #netops -> PHID-PROJ-h2zjwfqqi5cxjonrkfa7
  • #sre -> PHID-PROJ-5hj6ygnanfu23mmnlvmd
  • #observability -> PHID-PROJ-dwtj3e5mikntyhdbnohb
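
On the receiver side, a hedged sketch assuming a phalerts-style webhook service listening locally; the address, path, and query parameter are made up for illustration, and how the target project PHID is actually passed depends on the service:

receivers:
  # Hypothetical task-opening receiver: forwards webhook notifications to a
  # phalerts-like service that files Phabricator tasks in the given project.
  - name: phabricator-netops
    webhook_configs:
      - url: 'http://localhost:8292/alerts?phid=PHID-PROJ-h2zjwfqqi5cxjonrkfa7'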

Change 675129 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] alertmanager: open tasks for librenms inbound interface errors

https://gerrit.wikimedia.org/r/675129

Change 675129 merged by Filippo Giunchedi:
[operations/puppet@production] alertmanager: get librenms alerts for dcops to open tasks

https://gerrit.wikimedia.org/r/675129

I've got a query regarding this, which is something of a follow-up to T310359, in which we are trying to make sure that all relevant alerts are routed to our team from Icinga.

How would we decide which tags (or which teams) to apply when creating a ticket from an alert?

For example, with the Disk Space alert mentioned above:

  • if the /var/lib/hadoop/journal file system is filling up on analytics1068, how will the system know that alerts for this host should create a task and tag it with Data-Engineering?

We've already got our data-engineering-task receiver set up in AM, although we don't use it yet.

I'm just trying to work out how this would work for a similar check (e.g. disk space) where there might be multiple teams looking after the different servers involved.

Maybe we can leverage:

analytics1068:~$ cat /etc/wikimedia/contacts.yaml 
---
role::analytics_cluster::hadoop/worker:
- Data Engineering

Good question @BTullis! For node-exporter-level metrics (e.g. disk metrics) we already have the cluster label in the metrics. One solution would be to route alerts based on that and on the information @ayounsi mentioned. We have run into a similar use case for WMCS-related alerts in https://gerrit.wikimedia.org/r/c/operations/puppet/+/802074, where we rewrite the team label based on the cluster label.

I need to give this a little more thought, but we could generalize the cluster -> team mapping; that should be easy since cluster is already a label. Going from Puppet role to team will require a little more thought, but I think it should be doable!
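
As a rough sketch of the first half (cluster -> team), one option would be Prometheus alert relabeling; the cluster values and team names below are examples, not the actual production mapping:

alerting:
  alert_relabel_configs:
    # Rewrite the team label from the cluster label before alerts reach
    # Alertmanager; routing can then match on team.
    - source_labels: [cluster]
      regex: 'analytics|hadoop'      # assumed cluster names
      target_label: team
      replacement: data-engineering
    - source_labels: [cluster]
      regex: 'wmcs'
      target_label: team
      replacement: wmcs

Alertmanager routes could then send team=data-engineering alerts to the data-engineering-task receiver mentioned above.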

Another thought, maybe we could do:
All alerts that have been firing for more than X days (e.g. 7) become a task automatically.

At that point it's clear that the issue is not critical.
It would help keep the alerting dashboard clean and surface real issues more efficiently.
The downside is that it might "hide issues under the rug".
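
One sketch-level way to express this would be a meta-alert on Prometheus' built-in ALERTS metric, with a long for: duration and a task severity so that only week-old alerts get routed to a task-opening receiver; the rule name and label choices below are made up:

groups:
  - name: long_running_alerts_example
    rules:
      # Fires when any other alert has been firing continuously for 7 days.
      # label_replace copies the original alertname into source_alert, since
      # the meta-alert sets its own alertname.
      - alert: AlertFiringForAWeek
        expr: label_replace(ALERTS{alertstate="firing", severity!="task"}, "source_alert", "$1", "alertname", "(.+)")
        for: 7d
        labels:
          severity: task   # assumed label that routes to the task-opening receiver
        annotations:
          summary: '{{ $labels.source_alert }} has been firing for over 7 days'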

Change 854039 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] base: remove check_long_procs, unused

https://gerrit.wikimedia.org/r/854039

Change 854040 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] alertmanager: use 'site' label to route tasks for dcops

https://gerrit.wikimedia.org/r/854040

Change 854039 merged by Dzahn:

[operations/puppet@production] base: remove check_long_procs, unused

https://gerrit.wikimedia.org/r/854039

Change 854040 merged by Filippo Giunchedi:

[operations/puppet@production] alertmanager: use 'site' label to route tasks for dcops

https://gerrit.wikimedia.org/r/854040

Power supply issues should open high priority tasks for DCops

Screenshot 2023-04-25 at 09-13-38 Alerts for wikimedia.org.png (226×736 px, 44 KB)

ayounsi updated the task description.

Change 913110 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: let power supply issues open tasks

https://gerrit.wikimedia.org/r/913110

Change 913110 merged by Filippo Giunchedi:

[operations/alerts@master] sre: let power supply issues open tasks

https://gerrit.wikimedia.org/r/913110