SCS CPU monitoring issue
Open, Low, Public

Description

4 (codfw, ulsfo, eqsin) of the 7 conservers simultaneously saw their CPU usage jump to 100%, triggering alerting.

This seems to be a conserver bug, as the standard SNMP OID ssCpuIdle ("The percentage of processor time spent idle, calculated over the last minute.") is returning 0 when

However, according to the MIB (http://www.net-snmp.org/docs/mibs/UCD-SNMP-MIB.txt):

"This object has been deprecated in favour of 'ssCpuRawIdle(53)', which can be used to calculate the same metric, but over any desired time period."

And it's supported by Opengear:

/usr/bin/snmpwalk -v2c -c <secret> -OUQn -M /srv/deployment/librenms/librenms/mibs:/srv/deployment/librenms/librenms/mibs/opengear udp:scs-ulsfo.mgmt.ulsfo.wmnet:161 .1.3.6.1.4.1.2021.11.53.0
.1.3.6.1.4.1.2021.11.53.0 = 2104302521
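
For reference, the calculation the MIB alludes to could look like the sketch below. It takes two samples of ssCpuRawIdle, plus the change in the sum of the ssCpuRaw* counters over the same interval, and derives an idle percentage. The helper names (counter_delta, idle_percent), the second sample and the total delta are hypothetical, and this is not what LibreNMS currently does. Which raw counters to sum is left to the caller, since the MIB notes that some of them may overlap.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of LibreNMS or net-snmp.

# Difference between two Counter32 readings ($1 = previous, $2 = current),
# tolerating a single wrap at 2^32.
counter_delta() {
    if [ "$2" -ge "$1" ]; then
        echo $(( $2 - $1 ))
    else
        echo $(( $2 + 4294967296 - $1 ))
    fi
}

# Integer idle percentage over the sampling interval.
# $1 = idle tick delta, $2 = delta of the sum of all raw CPU counters.
idle_percent() {
    if [ "$2" -le 0 ]; then
        echo "unknown"
        return 1
    fi
    echo $(( $1 * 100 / $2 ))
}

# The first idle sample is the value from the snmpwalk above; the second
# sample and the total delta (6000) are made-up numbers for illustration.
idle_delta=$(counter_delta 2104302521 2104308491)
idle_percent "$idle_delta" 6000
```

Because the result only depends on counter differences, a stuck one-minute average like ssCpuIdle would not affect it, provided the raw counters themselves keep advancing.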

The following fixed it in ulsfo:

# ps | grep snmp
 2720 root      4840 S    /bin/snmpd -Lsd -f -x unix:/var/run/snmpd.agentX -c /etc/config/snmpd.conf -p /var/run/snmpd.pid 
# kill -9 2720

And snmpd came back up automatically.
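
Since snmpd is started with -p /var/run/snmpd.pid, the same restart could be scripted from the PID file instead of reading the ps output by hand. A minimal sketch, assuming the PID file is kept current and that snmpd keeps being respawned as it was here; kill_from_pidfile is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch only: kill_from_pidfile is a hypothetical helper.
# Sends SIGKILL (as in the manual fix above) to the PID stored in the file $1.
kill_from_pidfile() {
    pid=$(cat "$1" 2>/dev/null) || return 1
    # Refuse anything that is not a positive integer, so a corrupt file
    # can never turn into "kill -9 0" or similar.
    case "$pid" in
        ''|*[!0-9]*) echo "no valid PID in $1" >&2; return 1 ;;
    esac
    [ "$pid" -gt 0 ] || return 1
    kill -9 "$pid"
}

# On the console server this would be called as:
# kill_from_pidfile /var/run/snmpd.pid
```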

More sustainable fixes are:

  • Submit a bug report upstream
  • Patch LibreNMS to use ssCpuRawIdle
  • Disable CPU alerting for the SCS

The second one seems to be the best option for us if the issue happens regularly, but for now the occasional snmpd restart is good enough.

Event Timeline

ayounsi created this task.

The CPU usage reported over SNMP went back up shortly after, so the only option for now is to reboot the device.

This alerts regularly and is not actionable, as it's a monitoring glitch. The actual CPU usage on the device is, for example:
Cpu(s): 0.3%us, 0.0%sy, 0.0%ni, 99.7%id
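
To make this cross-check repeatable, the idle figure can be pulled out of the top summary line and compared with what SNMP reports. A rough sketch, assuming the Cpu(s) line format shown above; top_idle is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch only: top_idle is a hypothetical helper.
# Reads a top "Cpu(s):" summary line on stdin and prints the idle percentage.
top_idle() {
    sed -n 's/.*[ ,]\([0-9.][0-9.]*\)%id.*/\1/p'
}

# Sample line copied from the device output above.
echo "Cpu(s): 0.3%us, 0.0%sy, 0.0%ni, 99.7%id" | top_idle
```

A high idle value here while ssCpuIdle returns 0 points at the SNMP agent rather than at real load.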

I excluded the SCS from the high CPU alert.

It would be nice to fix it one way (Opengear support) or the other (LibreNMS patch).

Agreed, the LibreNMS patch is the way to go. I won't have the bandwidth any time soon, but I'm happy to assist.