The monitoring glitch visible in https://librenms.wikimedia.org/graphs/to=1589328000/id=19265/type=port_bits/from=1589320800/ caused the following false alert in LibreNMS:
Alert for device asw2-esams.mgmt.esams.wmnet - Primary outbound port utilisation over 80% #page
Device Name: asw2-esams.mgmt.esams.wmnet
Severity: critical
Timestamp: 2020-05-12 23:10:55
Rule: Primary outbound port utilisation over 80% #page
Physical Interface: ae2
Interface Description: Core: cr2-esams:ae1
Interface Speed: 80 Gbs
Inbound Utilization: 5.43847256
Outbound Utilization: 395288.13516505
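As a quick sanity check on why this reading cannot be real, the reported outbound utilisation can be converted back into an implied bit rate. This is a minimal sketch, assuming the utilisation figure is a percentage of the 80 Gb/s interface speed shown in the alert; the variable names are illustrative only.

```python
# Values copied from the alert above.
speed_gbps = 80.0
outbound_util_pct = 395288.13516505

# Assumption: utilisation is reported as a percentage of the interface speed.
implied_gbps = speed_gbps * outbound_util_pct / 100

print(f"implied outbound rate: {implied_gbps:,.0f} Gb/s")
print(f"multiple of line rate: {implied_gbps / speed_gbps:,.0f}x")
```

Under that assumption the sample implies roughly 316 Tb/s leaving an 80 Gb/s bundle, which points at a bad counter reading and not at real traffic.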
The alert stayed active for 10 minutes:
- t=0: device is polled, alert triggered
- t=5: device is polled, no recovery
- t=10: device is polled, alert recovers
This was enough to trigger the related Icinga alert.
My guess is a bug in the asw2-esams SNMP daemon (see also all the VCPs).
Given that:
- A switch stack upgrade (especially in esams) is a heavy operation (it requires a site depool), see T252631
- Implementing a safeguard in LibreNMS' code is time-consuming (if doable at all)
we can either live with it until the upgrade (low probability of pages, no operational impact), or find a workaround.
For example:
- Split the "Primary outbound port utilisation over 80%" alert into two: one that pages (for CRs) and one that alerts normally (for everything else)
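The split could be reasoned about as a simple routing decision per sample. The sketch below is not LibreNMS rule syntax; it only illustrates the intended behaviour, assuming core routers can be recognised by a hostname starting with "cr" plus digits (as in cr2-esams). The `classify` helper and the regex are hypothetical.

```python
import re
from typing import Optional

# Assumption: core routers are named like "cr2-esams" (prefix "cr" + digits).
CORE_ROUTER_RE = re.compile(r"^cr\d+-")


def classify(hostname: str, outbound_util_pct: float,
             threshold_pct: float = 80.0) -> Optional[str]:
    """Return "page", "alert", or None for one port sample (hypothetical helper)."""
    if outbound_util_pct <= threshold_pct:
        return None  # below threshold: nothing fires
    # Only core routers page; every other device raises a normal alert.
    return "page" if CORE_ROUTER_RE.match(hostname) else "alert"


print(classify("asw2-esams.mgmt.esams.wmnet", 395288.13516505))
print(classify("cr2-esams", 95.0))
```

With this split, the glitch above would still have raised an alert on asw2-esams, but it would not have paged.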
I don't think a LibreNMS upgrade would have helped, but it wouldn't hurt. See T251222.
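For reference, the kind of safeguard mentioned earlier would amount to discarding samples that are physically impossible before they reach alerting. The following is only a sketch of the idea in Python, not LibreNMS code (LibreNMS is written in PHP); the function name, parameters, and the 110% tolerance are assumptions chosen for illustration.

```python
from typing import Optional


def outbound_utilisation(prev_octets: int, curr_octets: int,
                         interval_s: float, speed_bps: float,
                         max_pct: float = 110.0) -> Optional[float]:
    """Return utilisation in percent, or None if the sample is implausible."""
    if interval_s <= 0 or speed_bps <= 0:
        return None
    delta = curr_octets - prev_octets
    if delta < 0:
        # Counter went backwards (reset or wrap): discard instead of guessing.
        return None
    rate_bps = delta * 8 / interval_s
    util_pct = 100 * rate_bps / speed_bps
    if util_pct > max_pct:
        # Far above line rate: treat as a bad reading, not as real traffic.
        return None
    return util_pct


# 1.5e12 octets in 300 s on an 80 Gb/s link is 40 Gb/s.
print(outbound_utilisation(0, 1_500_000_000_000, 300, 80e9))
# An absurd counter jump is dropped.
print(outbound_utilisation(0, 10**18, 300, 80e9))
```

Dropping the sample means the alert rule simply sees no data for that poll, which is usually preferable to paging on a value thousands of times above line rate.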