
Investigate and deploy 'max-repeaters = 20' to all librenms devices
Closed, Declined · Public

Description

This is a follow-up to https://phabricator.wikimedia.org/T346606#9174753 to investigate the impact of setting max-repeaters = 20 in LibreNMS for all devices. Quoting @ayounsi:

Regarding max-repeaters, it seems like a great win and we should look at applying it at scale after testing it on a router (done by Filippo above) and on a large switch stack, e.g. asw-a-codfw or asw2-a-eqiad.

Risks I can think of are:

  • overwhelming the network device
  • getting packets larger than the MTU, causing weird issues if it can't fragment them
  • losing more data at once if a packet is lost

The first one can be monitored by watching whether memory or CPU usage increases, and whether some data stops showing up (e.g. the SNMP daemon can't keep up).
The second is unlikely given the small size of the payload, as long as we keep the max-repeaters value small enough (I recommend we don't try to over-optimize it and keep it at 20).
The last is always a risk, but data will show up at the next poll, and packet loss is infrequent.
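To put rough numbers on the MTU risk above, here is a minimal back-of-the-envelope sketch. The per-varbind size (30 bytes), the fixed message overhead (50 bytes) and the 1500-byte MTU are illustrative assumptions, not values measured on any of the devices in this task; real varbind sizes depend on the OIDs and value types being polled.

```python
# Rough estimate of an SNMP GETBULK response size versus the path MTU.
# All default values below are assumptions for illustration only.

def estimated_response_bytes(max_repeaters, repeating_oids=1,
                             varbind_bytes=30, overhead_bytes=50):
    """Approximate response payload: one varbind per repetition per repeating OID."""
    return overhead_bytes + max_repeaters * repeating_oids * varbind_bytes


def fits_in_mtu(payload_bytes, mtu=1500, ip_udp_header_bytes=28):
    """True if the payload plus IPv4 (20) and UDP (8) headers fits in one packet."""
    return payload_bytes + ip_udp_header_bytes <= mtu


if __name__ == "__main__":
    for repeaters in (10, 20, 50):
        size = estimated_response_bytes(repeaters)
        print(repeaters, size, fits_in_mtu(size))
```

Under these assumed sizes, a single repeating OID with max-repeaters = 20 stays well below a 1500-byte MTU, while requesting many repeating OIDs at once multiplies the response size accordingly.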

Event Timeline


Mentioned in SAL (#wikimedia-operations) [2023-09-21T10:27:05Z] <XioNoX> set max repeaters = 20 on asw2-a-eqiad - T346759

Thanks, I spent a bit more time on that.

Bumping max-repeaters to 20 didn't change a thing on asw2-a-eqiad, which is an old-ish virtual chassis, local to the LibreNMS host.

I can only make some guesses so far: maybe the latency is already quite low, maybe the device has a hard-coded value for max-repeaters, maybe max-repeaters isn't as effective on ports as it is on bgp-peers, or maybe the PDUs are already full.
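The first guess (latency already being low) can be illustrated with a simplified model of a bulk walk: each request costs one round trip, and the device spends some time per row regardless of how rows are batched. This is only a sketch; the row count, round-trip times and per-row device time below are hypothetical numbers, not measurements from asw2-a-eqiad or any other device.

```python
import math

# Simplified model of an SNMP bulk walk. It ignores the extra request needed
# to detect the end of the table, retries, and PDU size limits on the device.

def walk_requests(rows, max_repeaters):
    """Number of GETBULK requests needed to fetch `rows` rows."""
    return math.ceil(rows / max_repeaters)


def walk_ms(rows, max_repeaters, rtt_ms, per_row_device_ms):
    """Estimated walk duration: round trips plus per-row device processing."""
    return walk_requests(rows, max_repeaters) * rtt_ms + rows * per_row_device_ms


if __name__ == "__main__":
    rows = 1000  # hypothetical table size
    for label, rtt in (("local", 0.5), ("remote", 200.0)):  # hypothetical RTTs in ms
        for repeaters in (10, 20):
            print(label, repeaters, walk_ms(rows, repeaters, rtt, 1.0))
```

With these assumed numbers, doubling max-repeaters on the low-latency path saves about 25 ms out of roughly one second, while on the high-latency path it nearly halves the walk time. That would be consistent with seeing a gain on a distant site and none on a device local to the LibreNMS host, though the other guesses are not ruled out by this model.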

Juniper recommends "to use the ‘max-repetitions’ value of 10, and the maximum number of OIDs per request is 10".

Based on that info, and the fact that devices have polling times within acceptable limits, we should only apply this change on a case-by-case basis, as we did for eqsin.

I updated the doc accordingly.