SCS CPU monitoring issue
Open, Low, Public

Description

4 (codfw, ulsfo, eqsin) of the 7 conservers simultaneously saw their CPU usage jump to 100%, triggering alerting.

This seems to be a conserver bug, as the standard SNMP OID ssCpuIdle ("The percentage of processor time spent idle, calculated over the last minute.") is returning 0 when

However, according to the MIB http://www.net-snmp.org/docs/mibs/UCD-SNMP-MIB.txt:

"This object has been deprecated in favour of 'ssCpuRawIdle(53)', which can be used to calculate the same metric, but over any desired time period."

And it's supported by OpenGear:

/usr/bin/snmpwalk -v2c -c <secret> -OUQn -M /srv/deployment/librenms/librenms/mibs:/srv/deployment/librenms/librenms/mibs/opengear udp:scs-ulsfo.mgmt.ulsfo.wmnet:161 .1.3.6.1.4.1.2021.11.53.0
.1.3.6.1.4.1.2021.11.53.0 = 2104302521

The following fixed it in ulsfo:

# ps | grep snmp
 2720 root      4840 S    /bin/snmpd -Lsd -f -x unix:/var/run/snmpd.agentX -c /etc/config/snmpd.conf -p /var/run/snmpd.pid 
# kill -9 2720

And snmpd came back up automatically.

More sustainable fixes are:

  • Submit a bug report upstream
  • Patch LibreNMS to use ssCpuRawIdle
  • Disable CPU alerting for the SCS

The second option seems best for us if the issue happens regularly, but for now an occasional snmpd restart is good enough.
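
For reference, a minimal sketch of how an idle percentage could be derived from the raw counters instead of ssCpuIdle. This is not LibreNMS code; the function names and the sample dictionary layout are hypothetical. It assumes two polls of ssCpuRawUser, ssCpuRawNice, ssCpuRawSystem and ssCpuRawIdle (whether the OpenGear agent exposes all four has not been checked here), and that they are 32-bit counters that can wrap. ssCpuRawWait and ssCpuRawKernel are deliberately left out, since the MIB warns that ssCpuRawSystem may already include them.

```python
from typing import Dict, Optional

# ssCpuRaw* objects are Counter32, so they wrap around at 2**32.
COUNTER32_MODULUS = 2 ** 32


def counter_delta(previous: int, current: int) -> int:
    """Ticks elapsed between two polls, tolerating a single counter wrap."""
    return (current - previous) % COUNTER32_MODULUS


def idle_percent(previous: Dict[str, int], current: Dict[str, int]) -> Optional[float]:
    """Idle share of CPU time between two polls, as a percentage.

    Both samples map a counter name (hypothetical keys: "user", "nice",
    "system", "idle") to the raw tick value returned by SNMP.
    Returns None when no ticks elapsed between the two polls.
    """
    deltas = {name: counter_delta(previous[name], current[name]) for name in current}
    total = sum(deltas.values())
    if total == 0:
        return None
    return 100.0 * deltas["idle"] / total


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    first = {"user": 1000, "nice": 0, "system": 500, "idle": 2104302521}
    second = {"user": 1300, "nice": 0, "system": 600, "idle": 2104304121}
    print(idle_percent(first, second))
```

CPU usage is then 100 minus this value, and the polling interval determines the period over which it is averaged.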

Event Timeline

ayounsi created this task.

The CPU usage reported over SNMP went back up shortly after the snmpd restart, so the only option for now is to reboot the device.