The monitoring glitch shown in https://librenms.wikimedia.org/graphs/to=1589328000/id=19265/type=port_bits/from=1589320800/ caused the following faulty alert in LibreNMS:
Alert for device asw2-esams.mgmt.esams.wmnet - Primary outbound port utilisation over 80% #page
Device Name: asw2-esams.mgmt.esams.wmnet
Timestamp: 2020-05-12 23:10:55
Rule: Primary outbound port utilisation over 80% #page
Physical Interface: ae2
Interface Description: Core: cr2-esams:ae1
Interface Speed: 80 Gbs
Inbound Utilization: 5.43847256
Outbound Utilization: 395288.13516505
The alert stayed active for 10 minutes:
# t=0: device is polled, alert triggers
# t=5: device is polled, no recovery
# t=10: device is polled, alert recovers
This was enough to trigger the related [[ https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=icinga1001&service=LibreNMS+has+a+critical+alert+%23page | Icinga alert ]].
My guess is a bug in the asw2-esams SNMP daemon (see also all the [[ https://librenms.wikimedia.org/device/device=178/tab=port/port=19285/ | VCPs ]]).
* A switch stack upgrade (especially in esams) is a heavy operation (requires site depool), see T252631
* Implementing a safeguard in LibreNMS' code is time consuming (if doable)
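To illustrate what such a safeguard could look like, here is a minimal sketch in Python. It is not LibreNMS code (LibreNMS is written in PHP), and every function name in it is hypothetical. The idea is simply to drop any sample whose utilisation exceeds line rate before it is compared to the 80% threshold. The only values taken from this task are the 80 Gb/s interface speed and the reported outbound utilisation.

```python
def utilisation_percent(bits_per_second: float, speed_bps: float) -> float:
    """Return link utilisation as a percentage of the interface speed."""
    if speed_bps <= 0:
        raise ValueError("interface speed must be positive")
    return 100.0 * bits_per_second / speed_bps


def is_plausible(utilisation: float, max_percent: float = 100.0) -> bool:
    """A sample above line rate cannot be real traffic, so treat it as a glitch."""
    return 0.0 <= utilisation <= max_percent


def should_alert(utilisation: float, threshold: float = 80.0) -> bool:
    """Alert only on samples that are both plausible and over the threshold."""
    return is_plausible(utilisation) and utilisation > threshold


SPEED_BPS = 80e9  # ae2 interface speed from the alert above (80 Gb/s)
reported = 395288.13516505  # outbound utilisation from the alert, in percent

# Traffic rate implied by the reported value: about 3.16e14 bit/s,
# roughly 4000 times the interface speed, so it cannot be real.
implied_bps = reported / 100.0 * SPEED_BPS
```

With this check, `should_alert(reported)` is false while a genuine 85% sample still alerts. In practice `max_percent` may need a small margin above 100 to tolerate polling jitter; that margin is an assumption to tune, not a measured value.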
We can either live with it until the upgrade (low probability of pages, no operational impact), or find a workaround.
* Split the `Primary outbound port utilisation over 80%` alert in two, one that pages (for CRs) and one that alerts normally (for everything else)
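The routing decision behind that split can be sketched as follows. This is illustrative Python only: in LibreNMS the split would be done with two alert rules scoped to different devices, not with code like this. The `cr<N>-<site>` naming pattern is an assumption inferred from `cr2-esams` in the alert above, and `alert_severity` is a hypothetical helper.

```python
import re

# Assumption: core routers are named cr<N>-<site>, like cr2-esams above.
CORE_ROUTER = re.compile(r"^cr\d+-")


def alert_severity(hostname: str) -> str:
    """Return "page" for core routers and "normal" for every other device."""
    return "page" if CORE_ROUTER.match(hostname) else "normal"
```

Under that assumption, `alert_severity("asw2-esams.mgmt.esams.wmnet")` returns `"normal"`, so a glitch like this one would raise a regular alert instead of a page.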
I don't think a LibreNMS upgrade would have helped, but it wouldn't hurt. See T251222.