
LibreNMS monitoring glitch caused paging
Closed, ResolvedPublic

Description

The monitoring glitch is visible in this graph: https://librenms.wikimedia.org/graphs/to=1589328000/id=19265/type=port_bits/from=1589320800/

It caused the following faulty alert in LibreNMS:

Alert for device asw2-esams.mgmt.esams.wmnet - Primary outbound port utilisation over 80%  #page
Device Name: asw2-esams.mgmt.esams.wmnet
Severity: critical
Timestamp: 2020-05-12 23:10:55
Rule:  Primary outbound port utilisation over 80%  #page
Physical Interface: ae2
Interface Description: Core: cr2-esams:ae1
Interface Speed: 80 Gbs
Inbound Utilization: 5.43847256
Outbound Utilization: 395288.13516505
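For scale, the reported outbound utilisation can be converted back into an implied bit rate. This is a minimal sketch, assuming the utilisation figure is a percentage of the interface speed shown in the alert and that "80 Gbs" means 80 Gbit/s; the variable names are illustrative only.

```python
# Values copied from the alert above.
interface_speed_bps = 80e9  # "Interface Speed: 80 Gbs", read as 80 Gbit/s (assumption)
outbound_utilisation_pct = 395288.13516505

# Bit rate the port would have to carry if the utilisation figure were real.
implied_bps = interface_speed_bps * outbound_utilisation_pct / 100
implied_tbps = implied_bps / 1e12

print(f"Implied outbound rate: {implied_tbps:.1f} Tbps")
```

Under those assumptions the value implies roughly 316 Tbps on an 80 Gbit/s aggregate, which is physically impossible and consistent with a bad counter reading rather than real traffic.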

The alert stayed active for 10 minutes:

  1. t=0 device is polled: alert triggered
  2. t=5 device is polled: no recovery
  3. t=10 device is polled: alert recovered

This was enough to trigger the related Icinga alert.

My guess is a bug in the asw2-esams SNMP daemon (see also all the VCPs).

Given that:

  • A switch stack upgrade (especially in esams) is a heavy operation (requires site depool), see T252631
  • Implementing a safeguard in LibreNMS' code is time consuming (if doable)

We can either live with it until the upgrade (low probability of pages, no operational impact), or find a workaround.
For example:

  • Split the Primary outbound port utilisation over 80% alert in two, one that pages (for CRs) and one that alerts normally (for everything else)
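The split could look something like the sketch below. It is not LibreNMS rule syntax, only an illustration of the routing logic. The function name and the idea of recognising CRs by a `cr` hostname prefix are assumptions made for the example.

```python
def alert_severity(hostname: str, utilisation_pct: float,
                   threshold_pct: float = 80.0) -> str:
    """Return 'ok', 'alert', or 'page' for one port utilisation sample."""
    if utilisation_pct <= threshold_pct:
        return "ok"
    # Assumption: core routers (CRs) are recognisable by a "cr" hostname prefix.
    if hostname.startswith("cr"):
        return "page"
    # Everything else (e.g. switch stacks) alerts without paging.
    return "alert"


# The faulty sample from this task would alert but no longer page.
print(alert_severity("asw2-esams.mgmt.esams.wmnet", 395288.13516505))
# A genuinely saturated core router port would still page.
print(alert_severity("cr2-esams", 92.0))
```

The trade-off is that a real saturation on a non-CR device would no longer page either.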

I don't think a LibreNMS upgrade would have helped, but it wouldn't hurt. See T251222.

Event Timeline

ayounsi triaged this task as Medium priority. May 13 2020, 9:37 AM
ayounsi created this task.

This happened again just now.

As a temporary mitigation, we could also accept a longer time-to-page in legitimate incidents and increase retries so that a page requires 15 minutes of sustained alerting. Or (Arzhel's suggestion) configure the LibreNMS alert to also require utilization below 101% before it fires.
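To illustrate the first option: with 5-minute polling, the glitch described in the task kept the alert active for 10 minutes, so a 15-minute requirement would have suppressed it. This is a rough sketch, not the actual Icinga or LibreNMS logic; the helper name and the list-of-polls representation are made up for the example.

```python
POLL_INTERVAL_MIN = 5  # polling interval implied by the t=0/5/10 timeline


def longest_active_minutes(polls: list[bool]) -> int:
    """Longest stretch the alert stayed active.

    Each entry is one poll (True = alert condition met). A stretch is
    measured from the poll that triggered the alert to the poll where
    it recovered, one interval per firing poll.
    """
    longest = current = 0
    for firing in polls:
        if firing:
            current += POLL_INTERVAL_MIN
            longest = max(longest, current)
        else:
            current = 0
    return longest


REQUIRED_MINUTES = 15  # proposed time-to-page

glitch = [True, True, False]        # t=0 triggered, t=5 still active, t=10 recovered
incident = [True, True, True, True]  # a sustained, legitimate saturation

print(longest_active_minutes(glitch) >= REQUIRED_MINUTES)    # glitch: no page
print(longest_active_minutes(incident) >= REQUIRED_MINUTES)  # incident: page
```

The cost is the one already mentioned: a legitimate incident would page later than it does today.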

I added an extra condition to the inbound and outbound alerts: usage must also be < 150% for them to trigger, which is way below the crazy percentages we saw when the bug happens.
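In effect the rule now behaves like the check below. This is only a Python sketch of the logic, not the LibreNMS rule itself; the function and parameter names are illustrative.

```python
def port_alert_fires(utilisation_pct: float,
                     lower_pct: float = 80.0,
                     upper_pct: float = 150.0) -> bool:
    """Fire only for utilisation over 80% and under the 150% sanity bound."""
    return lower_pct < utilisation_pct < upper_pct


# Values from the faulty alert in the task description.
print(port_alert_fires(395288.13516505))  # bogus outbound reading: suppressed
print(port_alert_fires(5.43847256))       # inbound reading: below threshold
# A plausible real saturation still fires.
print(port_alert_fires(95.0))
```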

This should mitigate this specific issue until the switch stack gets upgraded.

ayounsi claimed this task.

With the mitigation in place and the task to upgrade the router, it's fine to close this one.