Make sure LibreNMS -> AM alerts don't flap due to lack of notifications
Closed, Resolved (Public)

Description

We've recently been running into an issue where a few console servers have CPU at 100% and the related LibreNMS alerts (and thus the AM notifications) flap at regular intervals, e.g.:

05:07 -jinxer-wm:#wikimedia-operations- (Processor usage over 85%) firing: Processor usage over 85%   - 
          https://alerts.wikimedia.org
05:12 -jinxer-wm:#wikimedia-operations- (Processor usage over 85%) resolved: Processor usage over 85%   - 
          https://alerts.wikimedia.org
07:37 -jinxer-wm:#wikimedia-operations- (Processor usage over 85%) firing: Processor usage over 85%   - 
          https://alerts.wikimedia.org
07:42 -jinxer-wm:#wikimedia-operations- (Processor usage over 85%) resolved: Processor usage over 85%   - 
          https://alerts.wikimedia.org
07:47 -jinxer-wm:#wikimedia-operations- (Processor usage over 85%) firing: Processor usage over 85%   - 
          https://alerts.wikimedia.org
07:57 -jinxer-wm:#wikimedia-operations- (Processor usage over 85%) firing: Processor usage over 85%   - 
          https://alerts.wikimedia.org
08:02 -jinxer-wm:#wikimedia-operations- (Processor usage over 85%) resolved: Processor usage over 85%   - 
          https://alerts.wikimedia.org

This is due to the interaction between LibreNMS' interval for re-sending alerts (e.g. 3h for the alert above) and the fact that AM expects clients to keep sending notifications for as long as an alert is active.

AFAICT AM expects at most ~5 minutes between notifications from a client, while the LibreNMS poller interval we're using ATM is also 5 minutes.
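To make the timing interaction above concrete, here is a minimal sketch in Python. It is a toy model, not Alertmanager's actual logic: it assumes the alert condition stays true the whole time, that the client re-sends the alert every resend_interval minutes without an endsAt, and that AM marks the alert resolved once resolve_timeout minutes pass without a re-send. The function name and parameters are made up for illustration, and grouping or notification delays are ignored.

```python
def simulate(resend_interval, resolve_timeout, duration):
    """Return the (minute, state) transitions a receiver would observe.

    All arguments are in minutes. This is a simplified model: the alert
    condition is assumed to be true for the whole duration, and every
    re-send pushes the expiry to "now + resolve_timeout".
    """
    events = []
    firing = False
    expires_at = None
    for minute in range(duration + 1):
        # No re-send arrived before the timeout: the alert goes stale.
        if firing and minute >= expires_at:
            events.append((minute, "resolved"))
            firing = False
        # The client (re-)sends the alert on its own schedule.
        if minute % resend_interval == 0:
            if not firing:
                events.append((minute, "firing"))
                firing = True
            expires_at = minute + resolve_timeout
    return events


if __name__ == "__main__":
    # 3h re-send interval against a 5 minute timeout: flaps on every re-send.
    for minute, state in simulate(180, 5, 370):
        print(f"{minute:4d}m {state}")
```

With a 180 minute re-send interval and a 5 minute timeout the model alternates between firing and resolved on every re-send, which matches the shape of the IRC log above. With a re-send interval shorter than the timeout it fires once and stays firing.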

Event Timeline

The solution I think will work:

  • Set Alertmanager's resolve_timeout to, say, 20m. This only affects alerts with no endsAt field, which LibreNMS doesn't set
  • Make sure the interval for LibreNMS alerts is set to e.g. 15m, i.e. shorter than resolve_timeout
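For context on the endsAt point above, here is a rough Python sketch of what an alert without endsAt looks like when shaped for Alertmanager's v2 alerts endpoint (/api/v2/alerts, which accepts a JSON array of alert objects). The payload LibreNMS actually sends is not shown in this task, so the helper name build_alert, the "source" label and the empty summary are assumptions for illustration only. Nothing is sent over the network.

```python
import json
from datetime import datetime, timezone


def build_alert(name, labels=None, summary=""):
    """Build one alert object with no endsAt field (hypothetical helper).

    Without endsAt the receiver has to decide on its own when the alert
    is over, which is where resolve_timeout comes into play.
    """
    return {
        "labels": {"alertname": name, **(labels or {})},
        "annotations": {"summary": summary},
        "startsAt": datetime.now(timezone.utc).isoformat(),
        # Deliberately no "endsAt" key here.
    }


# "source" is an illustrative label, not something taken from this task.
payload = [build_alert("Processor usage over 85%", {"source": "librenms"})]
body = json.dumps(payload)  # what a client would POST to /api/v2/alerts
```

A client that does set endsAt controls the resolution time itself, so resolve_timeout would not apply to those alerts.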

Change 700482 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] alertmanager: bump resolve_timeout to 20m

https://gerrit.wikimedia.org/r/700482

Change 700482 merged by Filippo Giunchedi:

[operations/puppet@production] alertmanager: bump resolve_timeout to 20m

https://gerrit.wikimedia.org/r/700482

Mentioned in SAL (#wikimedia-operations) [2021-06-21T12:55:21Z] <godog> move librenms alerts with "max alerts" == -1 to "interval" being 15m - T285205

I've adjusted all LibreNMS "persisting" alerts (i.e. those with "max alerts" == -1, for which LibreNMS keeps sending notifications) to have "interval" set to 15m, which does avoid the flapping. In practice LibreNMS now re-sends notifications to AM for these alerts before AM considers them "stale" (and thus resolved, after 20m).

There are a bunch of alerts with "max alerts" == 1 (i.e. that will notify only once); we'll need to go through those with @ayounsi and @cmooney and figure out what the expectations are.

Changed them both from 0 to 900.

fgiunchedi claimed this task.

This is done!