Change Details

The monitoring glitch shown in https://librenms.wikimedia.org/graphs/to=1589328000/id=19265/type=port_bits/from=1589320800/ caused the following faulty alert in LibreNMS:

```
Alert for device asw2-esams.mgmt.esams.wmnet - Primary outbound port utilisation over 80% #page
Device Name: asw2-esams.mgmt.esams.wmnet
Severity: critical
Timestamp: 2020-05-12 23:10:55
Rule: Primary outbound port utilisation over 80% #page
Physical Interface: ae2
Interface Description: Core: cr2-esams:ae1
Interface Speed: 80 Gbs
Inbound Utilization: 5.43847256
Outbound Utilization: 395288.13516505
```

The alert stayed active for 10 minutes:
# t=0: device is polled, alert triggered
# t=5: device is polled, no recovery
# t=10: device is polled, alert recovers

This was enough to trigger the related [[ https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=icinga1001&service=LibreNMS+has+a+critical+alert+%23page | Icinga alert ]].

My guess is a bug in the asw2-esams SNMP daemon (see also all the [[ https://librenms.wikimedia.org/device/device=178/tab=port/port=19285/ | VCPs ]]).

Given that:
* A switch stack upgrade (especially in esams) is a heavy operation (it requires a site depool), see T252631
* Implementing a safeguard in LibreNMS' code is time-consuming (if doable at all)

we can either live with it until the upgrade (low probability of pages, no operational impact), or find a workaround. For example:
* Split the `Primary outbound port utilisation over 80%` alert in two: one that pages (for CRs) and one that alerts normally (for everything else)

I don't think a LibreNMS upgrade would have helped, but it doesn't hurt. See T251222.
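To make the two options more concrete, here is a rough Python sketch of the decision logic. It is not LibreNMS code and does not use the LibreNMS rule syntax. All function and constant names are hypothetical, the 100% plausibility ceiling is an assumption, and so is the idea that CRs can be recognised by a `cr` hostname prefix.

```python
# Hypothetical sketch only: none of these names exist in LibreNMS.

PAGE_THRESHOLD_PERCENT = 80.0  # taken from the rule name ("over 80%")
MAX_PLAUSIBLE_PERCENT = 100.0  # assumption: a port cannot exceed line rate


def is_plausible(utilisation_percent):
    """Safeguard: reject values that cannot be a real utilisation."""
    return 0.0 <= utilisation_percent <= MAX_PLAUSIBLE_PERCENT


def is_core_router(hostname):
    """Assumption: CR hostnames start with "cr" (for example "cr2-esams")."""
    return hostname.startswith("cr")


def classify(hostname, outbound_percent):
    """Return "ignore", "ok", "alert" or "page" for one polled value."""
    if not is_plausible(outbound_percent):
        return "ignore"  # glitch, such as the 395288% reading above
    if outbound_percent <= PAGE_THRESHOLD_PERCENT:
        return "ok"
    # Split rule: only CRs page, everything else alerts normally.
    return "page" if is_core_router(hostname) else "alert"


if __name__ == "__main__":
    print(classify("asw2-esams.mgmt.esams.wmnet", 395288.13516505))
    print(classify("cr2-esams", 85.0))
```

With the split alone (no plausibility check), the glitch on asw2-esams would still raise an alert, but it would no longer page.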