SNMP Network Checks throw exception when device is unreachable
Closed, Resolved (Public)

Description

While cloudsw1-b1-codfw was rebooting for an upgrade, I noticed that the BFD check in Icinga failed. The SNMP poll of the device failed, as expected, but the check script did not handle the failure correctly:

Traceback (most recent call last):
  File "/usr/lib/nagios/plugins/check_bfd.py", line 65, in <module>
    main()
  File "/usr/lib/nagios/plugins/check_bfd.py", line 37, in main
    for index in snimpyManager.bfdSessState:
  File "/usr/lib/python3/dist-packages/snimpy/manager.py", line 426, in __iter__
    for k, _ in self.iteritems():
  File "/usr/lib/python3/dist-packages/snimpy/manager.py", line 451, in iteritems
    for noid, result in self.session.walk(oid):
  File "/usr/lib/python3/dist-packages/snimpy/manager.py", line 127, in walk
    return self.getorwalk("walkmore", *args)
  File "/usr/lib/python3/dist-packages/snimpy/manager.py", line 112, in getorwalk
    value = getattr(self._session, op)(*args)
  File "/usr/lib/python3/dist-packages/snimpy/snmp.py", line 311, in walkmore
    return self._op(self._cmdgen.bulkCmd, *args)
  File "/usr/lib/python3/dist-packages/snimpy/snmp.py", line 267, in _op
    raise SNMPException(str(errorIndication))
snimpy.snmp.SNMPException: No SNMP response received before timeout

I think we need to add exception handling for snimpy.snmp.SNMPException. The problem seems to be common to all our SNMP-based checks, so I'll try to address them all.
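For illustration, a minimal sketch of what that handling could look like, not the code from the merged patch. It assumes the walk is wrapped in a helper (run_check is a hypothetical name) and that an unreachable device should map to Icinga's UNKNOWN state (exit code 3); mapping it to CRITICAL instead would be an equally valid policy choice. The fallback SNMPException class is only a stand-in so the sketch runs where snimpy is not installed.

```python
import sys

try:
    from snimpy.snmp import SNMPException
except ImportError:
    class SNMPException(Exception):
        """Stand-in so this sketch runs where snimpy is not installed."""

OK, UNKNOWN = 0, 3  # standard Nagios/Icinga plugin exit codes


def run_check(manager):
    """Walk bfdSessState and return (exit_code, message).

    An SNMP failure is reported as UNKNOWN instead of escaping as a traceback.
    """
    try:
        # Iterating the column triggers the SNMP walk, so a timeout surfaces here.
        indexes = list(manager.bfdSessState)
    except SNMPException as exc:
        return UNKNOWN, f"UNKNOWN: SNMP poll failed: {exc}"
    # The real check would evaluate each session's state; omitted in this sketch.
    return OK, f"OK: polled {len(indexes)} BFD sessions"


def main(manager):
    """Print the status line and exit with the matching plugin exit code."""
    code, message = run_check(manager)
    print(message)
    sys.exit(code)
```

Returning the exit code and message from run_check, rather than exiting inside it, keeps the failure path easy to exercise with a fake manager object.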

Event Timeline

cmooney created this task.

Change 898849 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Adjust BFD Icinga check to handle SNMP connection failure

https://gerrit.wikimedia.org/r/898849

Change 899609 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Adjust OSPF Icinga check to ignore OSPFv3 if zero ints configured

https://gerrit.wikimedia.org/r/899609

Change 898849 merged by Cathal Mooney:

[operations/puppet@production] Adjust BFD Icinga check to handle SNMP connection failure

https://gerrit.wikimedia.org/r/898849

cmooney claimed this task.
cmooney renamed this task from "BFD Status Check Fails when device is unavailable" to "SNMP Network Checks throw exception when device is unreachable". Mar 16 2023, 1:55 PM
cmooney reopened this task as Open.
cmooney updated the task description.

Looks like our OSPF check already handles any exceptions:

cmooney@alert1001:~$ ./check_ospf.py --host 1.2.3.4 --community <comm>
Error running check: No SNMP response received before timeout

The VRRP check can fail:

cmooney@alert1001:~$ ./check_vrrp.py -L 1.2.3.4 -R cr2-eqiad --community <comm>
UNKNOWN: snimpy.snmp.SNMPException: No SNMP response received before timeout
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nagiosplugin/runtime.py", line 43, in wrapper
    return func(*args, **kwds)
  File "./check_vrrp.py", line 127, in main
    VRRP(snmp_mgr1, snmp_mgr2),
  File "./check_vrrp.py", line 28, in __init__
    self.left = self.fetch_interface_state(left_m)
  File "./check_vrrp.py", line 37, in fetch_interface_state
    for ifIndex, vrrpOperVrId in snmp_mgr.vrrpOperState:
  File "/usr/lib/python3/dist-packages/snimpy/manager.py", line 426, in __iter__
    for k, _ in self.iteritems():
  File "/usr/lib/python3/dist-packages/snimpy/manager.py", line 451, in iteritems
    for noid, result in self.session.walk(oid):
  File "/usr/lib/python3/dist-packages/snimpy/manager.py", line 127, in walk
    return self.getorwalk("walkmore", *args)
  File "/usr/lib/python3/dist-packages/snimpy/manager.py", line 112, in getorwalk
    value = getattr(self._session, op)(*args)
  File "/usr/lib/python3/dist-packages/snimpy/snmp.py", line 311, in walkmore
    return self._op(self._cmdgen.bulkCmd, *args)
  File "/usr/lib/python3/dist-packages/snimpy/snmp.py", line 267, in _op
    raise SNMPException(str(errorIndication))
snimpy.snmp.SNMPException: No SNMP response received before timeout

As can the VC-check:

cmooney@alert1001:~$ ./check_vcp.py --host 1.2.3.4 --community <comm>
Traceback (most recent call last):
  File "./check_vcp.py", line 40, in <module>
    for index in snimpyManager.jnxVirtualChassisPortAdminStatus:
  File "/usr/lib/python3/dist-packages/snimpy/manager.py", line 426, in __iter__
    for k, _ in self.iteritems():
  File "/usr/lib/python3/dist-packages/snimpy/manager.py", line 451, in iteritems
    for noid, result in self.session.walk(oid):
  File "/usr/lib/python3/dist-packages/snimpy/manager.py", line 127, in walk
    return self.getorwalk("walkmore", *args)
  File "/usr/lib/python3/dist-packages/snimpy/manager.py", line 112, in getorwalk
    value = getattr(self._session, op)(*args)
  File "/usr/lib/python3/dist-packages/snimpy/snmp.py", line 311, in walkmore
    return self._op(self._cmdgen.bulkCmd, *args)
  File "/usr/lib/python3/dist-packages/snimpy/snmp.py", line 267, in _op
    raise SNMPException(str(errorIndication))
snimpy.snmp.SNMPException: No SNMP response received before timeout

I'll submit a patch to fix the last two.

Change 900360 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Modify netops Icinga checks to gracefully deal with SNMP timeout

https://gerrit.wikimedia.org/r/900360

Change 900360 merged by Cathal Mooney:

[operations/puppet@production] Modify netops Icinga checks to gracefully deal with SNMP timeout

https://gerrit.wikimedia.org/r/900360

Closing this one, all our checks now deal with the scenario gracefully.

The exact code to catch the error is slightly different in each check. We could definitely do some work on standardising these checks, but I wasn't convinced the refactoring was worth the time, so I took a bespoke approach to each one.
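If the checks were ever standardised, one possible shape is a shared decorator around each check's entry point. This is only a sketch of the idea, not something that exists in the repository: snmp_guarded and example_check are hypothetical names, and the local SNMPException class stands in for snimpy.snmp.SNMPException so the example runs without snimpy.

```python
import functools


class SNMPException(Exception):
    """Stand-in for snimpy.snmp.SNMPException (assumption for this sketch)."""


UNKNOWN = 3  # Nagios/Icinga plugin exit code for "state could not be determined"


def snmp_guarded(check):
    """Wrap a check function so an SNMP failure becomes an UNKNOWN result."""
    @functools.wraps(check)
    def wrapper(*args, **kwargs):
        try:
            return check(*args, **kwargs)
        except SNMPException as exc:
            # One status line, no traceback, and a well-defined exit code.
            print(f"UNKNOWN: SNMP poll failed: {exc}")
            return UNKNOWN
    return wrapper


@snmp_guarded
def example_check(poll):
    """Hypothetical check body: poll the device, then report OK."""
    poll()
    print("OK: device polled")
    return 0
```

Each script would then end with something like sys.exit(example_check(...)), so the SNMP error handling lives in one place rather than being repeated per check.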