4 (codfw, ulsfo, eqsin) of the 7 conservers simultaneously saw their CPU usage jump to 100%, triggering alerting.
This seems to be a conserver bug, as the standard SNMP OID ssCpuIdle ("The percentage of processor time spent idle, calculated over the last minute.") is returning 0 when
However, according to the MIB (http://www.net-snmp.org/docs/mibs/UCD-SNMP-MIB.txt):
"This object has been deprecated in favour of 'ssCpuRawIdle(53)', which can be used to calculate the same metric, but over any desired time period."
And it's supported by OpenGear:
/usr/bin/snmpwalk -v2c -c <secret> -OUQn -M /srv/deployment/librenms/librenms/mibs:/srv/deployment/librenms/librenms/mibs/opengear udp:scs-ulsfo.mgmt.ulsfo.wmnet:161 .1.3.6.1.4.1.2021.11.53.0
.1.3.6.1.4.1.2021.11.53.0 = 2104302521
The following fixed it in ulsfo:
# ps | grep snmp
2720 root 4840 S /bin/snmpd -Lsd -f -x unix:/var/run/snmpd.agentX -c /etc/config/snmpd.conf -p /var/run/snmpd.pid
# kill -9 2720
And snmpd came back up automatically.
More sustainable fixes are:
- Submit a bug report upstream
- Patch LibreNMS to use ssCpuRawIdle
- Disable CPU alerting for the SCS
The second one seems to be the best option for us if the issue happens regularly, but for now an occasional snmpd restart is good enough.
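
For reference, switching to ssCpuRawIdle means deriving the idle percentage from two polls of the raw tick counters rather than reading it directly. The following is a minimal Python sketch of that calculation, not LibreNMS code (LibreNMS is written in PHP). It assumes the counters are Counter32 values that wrap at 2^32, and that the sum of the polled raw counters approximates total CPU time. Which ssCpuRaw* counters should go into that sum varies by platform and has not been checked on the OpenGear devices. The function names and the sample values other than the idle reading above are made up for illustration.

```python
COUNTER32_MODULUS = 2**32  # assumption: ssCpuRaw* objects are Counter32


def counter_delta(previous: int, current: int) -> int:
    """Ticks elapsed between two readings, tolerating a single counter wrap."""
    return (current - previous) % COUNTER32_MODULUS


def idle_percent(previous: dict[str, int], current: dict[str, int]) -> float:
    """Idle share of CPU time between two polls of the raw counters.

    Both dicts map a counter name (e.g. "user", "system", "idle") to its
    raw value. They must contain the same keys, including "idle".
    """
    deltas = {name: counter_delta(previous[name], current[name]) for name in current}
    total = sum(deltas.values())
    if total == 0:
        raise ValueError("no ticks elapsed between the two samples")
    return 100.0 * deltas["idle"] / total


# Hypothetical pair of polls; only the first idle value comes from the walk above.
first = {"user": 1000, "nice": 0, "system": 500, "idle": 2104302521}
second = {"user": 1100, "nice": 0, "system": 550, "idle": 2104303371}
print(idle_percent(first, second))  # 85.0
```

CPU usage is then 100 minus this value, and the polling interval determines the period over which it is averaged.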