
shinken is too "volatile" and imprecise to be of use
Closed, Duplicate · Public

Description

I get mail alerts from Shinken for Toolforge and they are currently too "volatile" to be of use. Most (?) of them complain of "PROBLEM alert - $host/Puppet staleness is WARNING", only to be followed up a short time later by "RECOVERY alert - $host/Puppet staleness is OK".

I didn't fix anything in the meantime, and I assume no one else did. And this is the problem: an alert is only helpful if it prompts you to do something. If you don't need to do anything, these alerts just mask the real alerts that would require immediate intervention.

When I investigate alerts after the fact, they can often be traced to an OOM event or similar that caused Puppet to fail. In that case it would have been more helpful if the alert message had been "$host is out of memory".

Unfortunately, I cannot offer a solution that just does the right thing™. But IMHO we should look for one that silences the constant beeping in the background, so that the "red alert!" stands out better.
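To put a number on how "volatile" the alerts are, one could count how many PROBLEM mails are followed by a RECOVERY for the same host and service shortly afterwards. The following is only a rough sketch, not a fix: the subject format is taken from the alerts quoted above, while the 30-minute window, the function name `count_transient` and the host names in the example are assumptions made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Subject format as quoted in the description, e.g.
# "PROBLEM alert - $host/Puppet staleness is WARNING".
SUBJECT = re.compile(
    r"^(?P<kind>PROBLEM|RECOVERY) alert - (?P<host>[^/]+)/(?P<service>.+) is (?P<state>\w+)$"
)


def count_transient(alerts, window=timedelta(minutes=30)):
    """Return (transient, total) incident counts.

    `alerts` is an iterable of (timestamp, subject) pairs sorted by time.
    An incident starts with the first PROBLEM for a host/service and is
    counted as transient if a RECOVERY for the same host/service arrives
    within `window` (an assumed threshold, not a Shinken setting).
    """
    open_problems = {}
    transient = total = 0
    for ts, subject in alerts:
        m = SUBJECT.match(subject)
        if not m:
            continue  # not an alert subject we recognise
        key = (m["host"], m["service"])
        if m["kind"] == "PROBLEM":
            if key not in open_problems:
                # Repeated PROBLEM mails belong to the same incident.
                open_problems[key] = ts
                total += 1
        elif key in open_problems:
            if ts - open_problems.pop(key) <= window:
                transient += 1
    return transient, total


# Hypothetical sample data; host names and times are invented.
sample = [
    (datetime(2015, 1, 1, 10, 0), "PROBLEM alert - host-a/Puppet staleness is WARNING"),
    (datetime(2015, 1, 1, 10, 10), "RECOVERY alert - host-a/Puppet staleness is OK"),
    (datetime(2015, 1, 1, 11, 0), "PROBLEM alert - host-b/Puppet run is CRITICAL"),
]
print(count_transient(sample))  # (1, 2)
```

If most incidents turn out to be transient, that would support delaying or suppressing notifications until a problem has persisted for a while.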

Event Timeline

scfc raised the priority of this task from to Needs Triage.
scfc updated the task description.
scfc added a project: Toolforge.
scfc moved this task to Ready to be worked on on the Toolforge board.
scfc subscribed.
Restricted Application added a subscriber: Aklapper.

I have the same problem with shinken alerts in the integration project.

valhallasw subscribed.

Same issue for the "cvn" project. The notifications are pure noise and have never notified me of anything actionable. For us the main one is "Puppet run is CRITICAL", which presumably means there is something flaky about Puppet runs. The warning about Puppet not having been run for a while never actually happens for us anymore.
