ops-monitoring-bot creating dupes
Open, MediumPublic


See T224794: Degraded RAID on helium and the numerous dupes

Content of the tasks is the same

Event Timeline

Reedy created this task. Jun 29 2019, 10:53 PM
Restricted Application added a subscriber: Aklapper. Jun 29 2019, 10:53 PM
Zoranzoki21 triaged this task as High priority. Jun 29 2019, 11:26 PM
Zoranzoki21 added a subscriber: Zoranzoki21.

Still happening...

Peachey88 added a subscriber: Peachey88.

Is this actually a problem with the bot, or with how acknowledgements are working in icinga?

Pass. The user-facing "issue" is the duplicate tasks being created, hence filing a bug as such :P

Volans added a subscriber: Volans. Jun 30 2019, 1:06 PM

Sorry for the spam. My guess is that the check is flapping between critical and unknown. The script ignores the unknowns, but it doesn't know whether a task is already open (long story).
I can check tomorrow, I'm without a laptop at the moment. (It might also be related to the CPU governor task.)
For now I've disabled the event handler on icinga for that check on that host, so it should not spam anymore. Let me know if it generates any additional noise and I can try to have a deeper look tonight or silence it even more.

Volans added a subscriber: crusnov.

Yes, it's confirmed that the Icinga check flaps between critical and unknown due to timeouts, and as a result the event handler created the dupes. See more specific info in T224794#5295606; basically, megacli takes a very long time to gather info from the broken disk.
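
To make the failure mode concrete, here is a minimal sketch of a stateless handler reacting to a flapping check. It is a simplified model, not the actual script: it assumes the handler runs once per state change, opens a task on every change into CRITICAL, and ignores UNKNOWN. The function name `tasks_created` is hypothetical.

```python
from typing import List


def tasks_created(states: List[str]) -> int:
    """Count the tasks a stateless handler would open for a run of check results."""
    created = 0
    previous = "OK"  # assumed starting state
    for state in states:
        if state != previous and state == "CRITICAL":
            # Each change into CRITICAL opens a task; the handler has no
            # memory of tasks it opened earlier.
            created += 1
        # UNKNOWN creates nothing, but it still counts as a state change,
        # so the next CRITICAL looks like a fresh failure.
        previous = state
    return created


flapping = ["CRITICAL", "UNKNOWN", "CRITICAL", "UNKNOWN", "CRITICAL"]
steady = ["CRITICAL", "CRITICAL", "CRITICAL"]
print(tasks_created(flapping))  # 3
print(tasks_created(steady))    # 1
```

In this model a disk that simply stays broken yields one task, while the same disk behind a check that times out now and then yields one task per flap.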

As for the check itself, it doesn't search for existing tasks because tasks are sometimes renamed, which makes it hard to do reliably in the general case. It also doesn't keep the state of open tasks, by choice (we have two icinga hosts and we'd need to keep this state in sync or move it to a DB), hence it is prone to creating duplicates when Icinga checks flap (which shouldn't happen in the case of broken disks: either the disk is broken or not).
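
As a rough illustration of the state trade-off described above, the following sketch keeps a per-process record of open tasks. Everything here is hypothetical (the `DedupHandler` class, the `"raid"` check name, the in-memory set); it is not how the bot works today. It only shows why local state helps on one Icinga host and stops helping once two hosts each keep their own copy.

```python
from typing import Set, Tuple


class DedupHandler:
    """Hypothetical handler that remembers which (host, check) pairs have a task."""

    def __init__(self) -> None:
        self.open_tasks: Set[Tuple[str, str]] = set()
        self.created = 0

    def handle(self, host: str, check: str, state: str) -> bool:
        """Return True if a new task would be opened for this event."""
        key = (host, check)
        if state == "CRITICAL" and key not in self.open_tasks:
            self.open_tasks.add(key)
            self.created += 1
            return True
        if state == "OK":
            # Recovery clears the record so a later failure opens a new task.
            self.open_tasks.discard(key)
        return False  # UNKNOWN and repeated CRITICAL events create nothing


flapping = ["CRITICAL", "UNKNOWN", "CRITICAL", "UNKNOWN", "CRITICAL"]

# One Icinga host with local state: a single task despite the flapping.
single = DedupHandler()
for state in flapping:
    single.handle("helium", "raid", state)
print(single.created)  # 1

# Two Icinga hosts with unsynced state: each opens its own task.
first, second = DedupHandler(), DedupHandler()
first.handle("helium", "raid", "CRITICAL")
second.handle("helium", "raid", "CRITICAL")
print(first.created + second.created)  # 2
```

The second case is the reason the state would have to be synced between the hosts or moved to a shared DB before this approach removes duplicates.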

It would be nice to improve it, but let's also keep in mind this happens more or less once a year, so cost/benefit should be taken into account too.

Volans lowered the priority of this task from High to Medium. Jul 1 2019, 10:25 AM
fgiunchedi moved this task from Inbox to Backlog on the observability board. Mon, Jul 20, 1:28 PM