
cr*-eqsin long poll times from librenms
Closed, ResolvedPublic

Description

I was looking at LibreNMS after the latest upgrade (T344136) and noticed that poll times for cr*-eqsin are very close to 300s, which is the LibreNMS polling interval and obviously not ideal:

For cr3-eqsin, for example:

2023-09-18-110847_2466x987_scrot.png (987×2 px, 616 KB)

One low-hanging fruit might be to tune the snmpbulkwalk calls, as mentioned here: https://docs.librenms.org/Support/Performance/#snmp-max-repeaters and https://docs.librenms.org/Support/Performance/#snmp-max-oids

Event Timeline

Probably a combination of latency (the distance between netmon1003 and eqsin) and an increasing number of BGP peers.
Based on https://librenms.wikimedia.org/graphs/type=device_poller_modules_perf/device=159/from=1694940900/ most of the time is spent on BGP peers. That is true for all routers, whereas for switches it is ports, which makes sense.

The tuning you suggested makes sense to test. I had a quick look on the router side, but there are not many knobs available to improve things.

If it's problematic we could also disable BGP peers polling in the config: https://librenms.wikimedia.org/device/device=159/tab=edit/section=modules/ but then we would lose 2 alerts (including 1 important).

Distributed polling could also be a solution, since it reduces the latency between collector and network device, but it is much more complex to deploy. https://docs.librenms.org/Extensions/Distributed-Poller/

Mentioned in SAL (#wikimedia-operations) [2023-09-18T09:28:36Z] <godog> set max-repeaters for cr3-eqsin in librenms - T346606

Mentioned in SAL (#wikimedia-operations) [2023-09-18T09:28:43Z] <godog> set max-repeaters to 20 for cr3-eqsin in librenms - T346606

I tried the setting above on https://librenms.wikimedia.org/device/device=159/tab=edit/section=snmp/ but the web UI reloaded and the text field was empty, suggesting to me that the setting "didn't take".

Mentioned in SAL (#wikimedia-operations) [2023-09-18T10:33:58Z] <godog> set max-repeaters to 20 for cr3-eqsin using "force save" - T346606

Mentioned in SAL (#wikimedia-operations) [2023-09-18T13:02:17Z] <godog> set max-repeaters to 30 for cr3-eqsin in librenms - T346606

Setting max-repeaters to 20 definitely had an impact on BGP peers poll time:

2023-09-18-151001_2467x1004_scrot.png (1×2 px, 570 KB)
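
A rough sketch of why max-repeaters matters over a high-latency link: each GETBULK exchange costs one round trip, and max-repeaters controls how many values come back per exchange. The row count and round-trip time below are hypothetical placeholders, not measurements from cr3-eqsin, and the model ignores device processing time and the final request that detects the end of the walk.

```python
import math


def bulk_round_trips(num_varbinds: int, max_repeaters: int) -> int:
    """Request/response exchanges needed to walk num_varbinds values,
    assuming every response carries a full max_repeaters values."""
    return math.ceil(num_varbinds / max_repeaters)


def latency_cost_seconds(num_varbinds: int, max_repeaters: int, rtt_seconds: float) -> float:
    """Time spent purely waiting on the network for a sequential walk."""
    return bulk_round_trips(num_varbinds, max_repeaters) * rtt_seconds


if __name__ == "__main__":
    NUM_VARBINDS = 10_000  # hypothetical number of values walked by a module
    RTT_SECONDS = 0.2      # hypothetical collector-to-device round-trip time

    for repeaters in (10, 20, 30):
        trips = bulk_round_trips(NUM_VARBINDS, repeaters)
        cost = latency_cost_seconds(NUM_VARBINDS, repeaters, RTT_SECONDS)
        print(f"max-repeaters={repeaters:>2}: {trips} round trips, ~{cost:.0f}s of latency")
```

Under this model, doubling max-repeaters halves the latency-bound part of the poll, with diminishing absolute gains as the value keeps growing.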

Mentioned in SAL (#wikimedia-operations) [2023-09-18T13:38:53Z] <godog> force-set max-repeaters to 20 for cr2-eqsin and cr3-eqsin - T346606

+ netops for visibility since this can impact network devices

We had a quick chat on IRC.

The ports and bgp-peers modules are the ones taking the most time, so there is no need to focus on the snmp-max-oids LibreNMS knob, as it is mostly for sensors.

Regarding max-repeaters, it seems like a great win and we should look at applying it at scale after testing it on a router (done by Filippo above) and on a large switch stack, e.g. asw-a-codfw or asw2-a-eqiad.

Risks I can think of are:

  • overwhelming the network device
  • getting packets larger than the MTU, causing weird issues if it can't fragment them
  • losing more data at once if a packet is lost

The first one can be monitored by watching whether memory or CPU usage increases, and whether some data stops showing up (e.g. the SNMP daemon can't keep up).
The second is unlikely given the small size of the payload, as long as we keep the max-repeaters value small enough (I recommend we don't try to over-optimize it and keep it at 20).
The last is always a risk, but the data will show up at the next poll, and packet loss is infrequent.
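
To put a rough number on the MTU risk, here is a back-of-the-envelope sketch. The per-varbind size and SNMP message overhead are hypothetical guesses (real values depend on OID length and value type); only the 20-byte IPv4 header (without options) and 8-byte UDP header are fixed figures.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
UDP_HEADER_BYTES = 8


def estimated_response_bytes(max_repeaters: int, bytes_per_varbind: int, snmp_overhead: int) -> int:
    """Estimated SNMP payload of one full GETBULK response."""
    return snmp_overhead + max_repeaters * bytes_per_varbind


def fits_in_mtu(max_repeaters: int, bytes_per_varbind: int = 40,
                snmp_overhead: int = 60, mtu: int = 1500) -> bool:
    """True if the estimated response fits in one unfragmented IPv4/UDP packet.
    The defaults of 40 and 60 bytes are assumptions, not measured values."""
    budget = mtu - IPV4_HEADER_BYTES - UDP_HEADER_BYTES
    return estimated_response_bytes(max_repeaters, bytes_per_varbind, snmp_overhead) <= budget


def largest_fitting_max_repeaters(bytes_per_varbind: int = 40,
                                  snmp_overhead: int = 60, mtu: int = 1500) -> int:
    """Largest max-repeaters value whose estimated response still fits."""
    budget = mtu - IPV4_HEADER_BYTES - UDP_HEADER_BYTES
    return (budget - snmp_overhead) // bytes_per_varbind


if __name__ == "__main__":
    for repeaters in (10, 20, 30, 50):
        print(repeaters, fits_in_mtu(repeaters))
    print("largest fitting value:", largest_fitting_max_repeaters())
```

With these assumed sizes a value of 20 leaves comfortable headroom under a 1500-byte MTU; larger varbinds (long OIDs or string values) would shrink that margin.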