
LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs
Closed, ResolvedPublic

Description

Since Jan 17th LibreNMS has been alarming about not receiving prefixes from a number of IPv6 Anycast peers on different routers. The error is reported for eqiad, codfw and ulsfo, yet checking the actual sessions everything seems to be ok and we are learning routes from all the hosts listed.

For instance the ones reported for cr1-codfw:

2025-01-17 17:47:13	Not accepting/receiving prefixes from anycast BGP peer cr1-codfw
#1: BGP peer 2620:0:860:1:208:80:153:6, Desc , AS64605, State established
#2: BGP peer 2620:0:860:2:208:80:153:38, Desc , AS64605, State established
#3: BGP peer 2620:0:860:103:10:192:32:58, Desc , AS64605, State established
#4: BGP peer 2620:0:860:104:10:192:48:14, Desc , AS64605, State established
cmooney@re0.cr1-codfw> show route table inet6.0 receive-protocol bgp 2620:0:860:1:208:80:153:6 

inet6.0: 214772 destinations, 548752 routes (214766 active, 0 holddown, 6 hidden)
Restart Complete
  Prefix		  Nexthop	       MED     Lclpref    AS path
  2001:67c:930::1/128     2620:0:860:1:208:80:153:6               64605 I

{master}
cmooney@re0.cr1-codfw> show route table inet6.0 receive-protocol bgp 2620:0:860:2:208:80:153:38   

inet6.0: 214776 destinations, 548760 routes (214768 active, 2 holddown, 6 hidden)
Restart Complete
  Prefix		  Nexthop	       MED     Lclpref    AS path
* 2001:67c:930::1/128     2620:0:860:2:208:80:153:38              64605 I

{master}
cmooney@re0.cr1-codfw> show route table inet6.0 receive-protocol bgp 2620:0:860:103:10:192:32:58   

inet6.0: 214779 destinations, 548755 routes (214766 active, 7 holddown, 6 hidden)
Restart Complete
  Prefix		  Nexthop	       MED     Lclpref    AS path
* 2001:67c:930::2/128     2620:0:860:103:10:192:32:58             64605 I
* 2001:67c:930::3/128     2620:0:860:103:10:192:32:58             64605 I
* 2001:67c:930::4/128     2620:0:860:103:10:192:32:58             64605 I
* 2001:67c:930::5/128     2620:0:860:103:10:192:32:58             64605 I

{master}
cmooney@re0.cr1-codfw> show route table inet6.0 receive-protocol bgp 2620:0:860:104:10:192:48:14    

inet6.0: 214775 destinations, 548729 routes (214767 active, 2 holddown, 6 hidden)
Restart Complete
  Prefix		  Nexthop	       MED     Lclpref    AS path
  2001:67c:930::2/128     2620:0:860:104:10:192:48:14             64605 I
  2001:67c:930::3/128     2620:0:860:104:10:192:48:14             64605 I
  2001:67c:930::4/128     2620:0:860:104:10:192:48:14             64605 I
  2001:67c:930::5/128     2620:0:860:104:10:192:48:14             64605 I

Need to dig in and see why it's reporting this; very odd. SNMP seems to match what we see above too:

cmooney@alert1002:~# snmpwalk -v2c -c <comm> -M +/var/lib/mibs/site -M +BGP4-V2-MIB-JUNIPER cr1-codfw BGP4-V2-MIB-JUNIPER::jnxBgpM2PeerRemoteAddr | egrep "53 00 06|53 00 38"
BGP4-V2-MIB-JUNIPER::jnxBgpM2PeerRemoteAddr.0.ipv6.38.32.0.0.8.96.255.255.0.0.0.0.0.0.0.1.2.38.32.0.0.8.96.0.1.2.8.0.128.1.83.0.6 = Hex-STRING: 26 20 00 00 08 60 00 01 02 08 00 80 01 53 00 06 
BGP4-V2-MIB-JUNIPER::jnxBgpM2PeerRemoteAddr.0.ipv6.38.32.0.0.8.96.255.255.0.0.0.0.0.0.0.1.2.38.32.0.0.8.96.0.2.2.8.0.128.1.83.0.56 = Hex-STRING: 26 20 00 00 08 60 00 02 02 08 00 80 01 53 00 38
cmooney@alert1002:~# snmpget -v2c -c <comm> -M +/var/lib/mibs/site -M +BGP4-V2-MIB-JUNIPER cr1-codfw BGP4-V2-MIB-JUNIPER::jnxBgpM2PeerIndex.0.ipv6.38.32.0.0.8.96.255.255.0.0.0.0.0.0.0.1.2.38.32.0.0.8.96.0.1.2.8.0.128.1.83.0.6
BGP4-V2-MIB-JUNIPER::jnxBgpM2PeerIndex.0.ipv6.38.32.0.0.8.96.255.255.0.0.0.0.0.0.0.1.2.38.32.0.0.8.96.0.1.2.8.0.128.1.83.0.6 = Gauge32: 84
cmooney@alert1002:~# snmpget -v2c -c <comm> -M +/var/lib/mibs/site -M +BGP4-V2-MIB-JUNIPER cr1-codfw BGP4-V2-MIB-JUNIPER::jnxBgpM2PeerIndex.0.ipv6.38.32.0.0.8.96.255.255.0.0.0.0.0.0.0.1.2.38.32.0.0.8.96.0.2.2.8.0.128.1.83.0.56
BGP4-V2-MIB-JUNIPER::jnxBgpM2PeerIndex.0.ipv6.38.32.0.0.8.96.255.255.0.0.0.0.0.0.0.1.2.38.32.0.0.8.96.0.2.2.8.0.128.1.83.0.56 = Gauge32: 85
cmooney@alert1002:~# snmpwalk -v2c -c <comm> -M +/var/lib/mibs/site -M +BGP4-V2-MIB-JUNIPER cr1-codfw BGP4-V2-MIB-JUNIPER::jnxBgpM2PrefixInPrefixesAccepted.84
BGP4-V2-MIB-JUNIPER::jnxBgpM2PrefixInPrefixesAccepted.84.ipv6.1 = Gauge32: 1
cmooney@alert1002:~# snmpwalk -v2c -c <comm> -M +/var/lib/mibs/site -M +BGP4-V2-MIB-JUNIPER cr1-codfw BGP4-V2-MIB-JUNIPER::jnxBgpM2PrefixInPrefixesAccepted.85
BGP4-V2-MIB-JUNIPER::jnxBgpM2PrefixInPrefixesAccepted.85.ipv6.1 = Gauge32: 1
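The jnxBgpM2 OIDs encode the peer addresses as raw bytes, which makes the walks above hard to read. As a sanity check, the Hex-STRING value can be converted back to the familiar notation with plain Python (a standalone sketch, not anything LibreNMS-specific):

```python
import ipaddress

# Hex-STRING value returned for jnxBgpM2PeerRemoteAddr above (cr1-codfw peer)
hex_string = "26 20 00 00 08 60 00 01 02 08 00 80 01 53 00 06"

# 16 bytes -> IPv6 address in canonical (compressed) notation
addr = ipaddress.IPv6Address(bytes.fromhex(hex_string.replace(" ", "")))
print(addr)  # 2620:0:860:1:208:80:153:6
```

which matches the first peer in the alert list.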

Event Timeline

cmooney triaged this task as Medium priority.

Thanks for filing this task and looking into it! Just one more data point: this seems to have started Friday Jan 17 at around 18:51 UTC, when we got the first alert. Do we know of anything that was merged/changed around that time?

Might be a red herring: the only thing I see that might be close is https://sal.toolforge.org/log/h5lbdZQBKFqumxvtiNp0 at 17:42, the LibreNMS update. None of the changes there seem related, but the timing is interesting: the update finishes at 17:42 and the closest alert is at 18:51.

Thanks for filing this task and looking into it! Just one more data point: this seems to have started Friday Jan 17 at around 18:51 UTC, when we got the first alert. Do we know of anything that was merged/changed around that time?

No I don't think so. For instance on cr1-codfw there was a change earlier today, and one on Friday morning, but none around that time (both of those changes were for K8s host BGP being added/removed automatically):

cmooney@re0.cr1-codfw# rollback ?
Possible completions:
  <[Enter]>            Execute this command
  0                    2025-01-20 15:12:34 UTC by cmooney via netconf commit confirmed, rollback in 2mins synchronize
  1                    2025-01-17 10:55:57 UTC by homer via netconf commit confirmed, rollback in 2mins synchronize
  2                    2025-01-16 15:32:37 UTC by homer via netconf commit confirmed, rollback in 2mins synchronize

Might be a red herring: the only thing I see that might be close is https://sal.toolforge.org/log/h5lbdZQBKFqumxvtiNp0 at 17:42, the LibreNMS update. None of the changes there seem related, but the timing is interesting: the update finishes at 17:42 and the closest alert is at 18:51.

Yeah the timing of that is kind of suspicious alright. Still odd, if it was the upgrade, that we only see it in certain POPs. Checking the router versions, they are on different Junos releases as well, so the common factor isn't the OS version.

23.4R2.13 cr1-eqiad
23.4R2.13 cr2-eqiad

21.2R3-S2.9 cr1-codfw
21.2R3-S2.9 cr2-codfw

23.4R2.13 cr3-ulsfo
21.2R3-S2.9 cr4-ulsfo

Not urgent anyway, adding Observability in case they might have any ideas.

So looking at a specific peer - 2620:0:863:1:198:35:26:6 on cr4-ulsfo - I can see the SNMP 'index' for this peer is 23:

BGP4-V2-MIB-JUNIPER::jnxBgpM2PeerRemoteAddr.0.ipv6.38.32.0.0.8.99.255.255.0.0.0.0.0.0.0.2.2.38.32.0.0.8.99.0.1.1.152.0.53.0.38.0.6 = Hex-STRING: 26 20 00 00 08 63 00 01 01 98 00 35 00 26 00 06
BGP4-V2-MIB-JUNIPER::jnxBgpM2PeerIndex.0.ipv6.38.32.0.0.8.99.255.255.0.0.0.0.0.0.0.2.2.38.32.0.0.8.99.0.1.1.152.0.53.0.38.0.6 = Gauge32: 23

Taking a wireshark capture on netmon1003, I can see that LibreNMS is polling the same OID I was querying in the task description, and the correct value is being returned for BGP4-V2-MIB-JUNIPER::jnxBgpM2PrefixInPrefixesAccepted.23.ipv6.1:

image.png (422×1 px, 133 KB)

I can attach the pcap if needed, but it contains the SNMP community string so I didn't want to post it. Either way I am no closer to understanding why the data block for this alert shows "AcceptedPrefixes": 0, since that is not the value it's getting back from the router.

Might be a red herring: the only thing I see that might be close is https://sal.toolforge.org/log/h5lbdZQBKFqumxvtiNp0 at 17:42, the LibreNMS update. None of the changes there seem related, but the timing is interesting: the update finishes at 17:42 and the closest alert is at 18:51.

This is the custom query we have built in LibreNMS for this alert:

image.png (326×969 px, 41 KB)

It occurs to me that something may have changed after the upgrade in the underlying table structure LibreNMS uses, resulting in the zeros in the data (note we see zero, not null, though). I'll need to look under the hood at that when I get a moment.

If I understand the db structure correctly that should convert into this query:

SELECT *
FROM bgpPeers a
LEFT JOIN bgpPeers_cbgp b
    ON b.bgpPeerIdentifier = a.bgpPeerIdentifier
    AND b.device_id = a.device_id
    AND b.context_name = a.context_name
WHERE b.AcceptedPrefixes = 0
    AND a.astext = 'Anycast'
    AND b.safi = 'unicast'
    AND a.bgpPeerState = 'established';

A count of it returns 24 rows; I've saved the output into a file and given it to @cmooney.

Thanks @Volans you have helped me a lot with this and given me confidence to look at the DB.

I suspected this might just be renamed fields/tables or something but it appears not.

MariaDB [librenms]> select AcceptedPrefixes from bgpPeers_cbgp where device_id=89 and bgpPeerIdentifier="2620:0:863:1:198:35:26:6";
+------------------+
| AcceptedPrefixes |
+------------------+
|                0 |
+------------------+
1 row in set (0.000 sec)

Seems it is not recording the value correctly for any of the v6 peers:

MariaDB [librenms]> select bgpPeerIdentifier,AcceptedPrefixes from bgpPeers_cbgp where device_id=89 and LENGTH(bgpPeerIdentifier) > 15;
+-------------------------------+------------------+
| bgpPeerIdentifier             | AcceptedPrefixes |
+-------------------------------+------------------+
| 2001:418:0:5000::6fa          |                0 |
| 2001:504:0:1::1828:1          |                0 |
| 2001:504:0:1::2152:1          |                0 |
| 2001:504:0:1::2518:1          |                0 |
| 2001:504:0:1::2906:2          |                0 |
| 2001:504:0:1::2906:3          |                0 |
| 2001:504:0:1::293:1           |                0 |
| 2001:504:0:1::3292:1          |                0 |
| 2001:504:0:1::3300:2          |                0 |
| 2001:504:0:1::42:3            |                0 |
| 2001:504:0:1::4788:1          |                0 |
| 2001:504:0:1::4826:1          |                0 |
| 2001:504:0:1::6939:1          |                0 |
| 2001:504:0:1::714:1           |                0 |
| 2001:504:0:1::7843:1          |                0 |
| 2001:504:0:1::8075:1          |                0 |
| 2001:504:0:1::8075:2          |                0 |
| 2001:504:0:1::8220:1          |                0 |
| 2001:504:0:1::8966:1          |                0 |
| 2001:504:0:1:0:1:1404:1       |                0 |
| 2001:504:0:1:0:1:310:1        |                0 |
| 2001:504:0:1:0:1:310:2        |                0 |
| 2001:504:0:1:0:1:3335:1       |                0 |
| 2001:504:0:1:0:1:4492:1       |                0 |
| 2001:504:0:1:0:1:4593:1       |                0 |
| 2001:504:0:1:0:1:4860:1       |                0 |
| 2001:504:0:1:0:1:5133:2       |                0 |
| 2001:504:0:1:0:1:5305:1       |                0 |
| 2001:504:0:1:0:1:5830:1       |                0 |
| 2001:504:0:1:0:1:5830:2       |                0 |
| 2001:504:0:1:0:1:6276:1       |                0 |
| 2001:504:0:1:0:1:6509:1       |                0 |
| 2001:504:0:1:0:1:6509:2       |                0 |
| 2001:504:0:1:0:1:6591:1       |                0 |
| 2001:504:0:1:0:1:7072:1       |                0 |
| 2001:504:0:1:0:1:8779:1       |                0 |
| 2001:504:0:1:0:13:2203:1      |                0 |
| 2001:504:0:1:0:13:8915:1      |                0 |
| 2001:504:0:1:0:19:9524:1      |                0 |
| 2001:504:0:1:0:2:4482:1       |                0 |
| 2001:504:0:1:0:2:940:2        |                0 |
| 2001:504:0:1:0:2:940:3        |                0 |
| 2001:504:0:1:0:3:2787:1       |                0 |
| 2001:504:0:1:0:3:2934:1       |                0 |
| 2001:504:0:1:0:3:2934:2       |                0 |
| 2001:504:0:1:0:3:6012:1       |                0 |
| 2001:504:0:1:0:3:6236:1       |                0 |
| 2001:504:0:1:0:3:6351:1       |                0 |
| 2001:504:0:1:0:3:6692:1       |                0 |
| 2001:504:0:1:0:3:8195:1       |                0 |
| 2001:504:0:1:0:4:934:1        |                0 |
| 2001:504:0:1:0:4:934:2        |                0 |
| 2001:504:0:1:0:4:9544:1       |                0 |
| 2001:504:0:1:0:5:4994:1       |                0 |
| 2001:504:0:1:0:6:2597:1       |                0 |
| 2001:504:0:1:0:6:3949:1       |                0 |
| 2001:504:0:1:0:6:5517:1       |                0 |
| 2001:504:0:1:ffff:ffff:ffff:1 |                0 |
| 2001:504:0:1:ffff:ffff:ffff:2 |                0 |
| 2620:0:863:1:198:35:26:14     |                0 |
| 2620:0:863:1:198:35:26:6      |                0 |
| 2620:0:863:101:10:128:0:6     |                0 |
| 2620:0:863:101:10:128:0:7     |                0 |
| 2620:0:863:ffff::1            |                0 |
| 2620:0:863:ffff::1            |                0 |
+-------------------------------+------------------+
65 rows in set (0.000 sec)

So something is going wrong between the SNMP response from the router and what's going into the db. All the counters for these peers show as zero (not just AcceptedPrefixes). The IPv4 stats are populated correctly.

Will need some more digging. If the fix is too hard, I think it might be an idea to cut our losses, export these stats from the routers with gNMI, and replace the LibreNMS rule with an Alertmanager one.

It also appears we are getting AcceptedPrefixes populated for IPv6 peers on some devices, specifically the L3 QFX switches and SRX firewalls we have. But they are zeros across the board on all our MX routers.

Running the poller manually on netmon1003 I can also see it's getting the right value back, but inserting the wrong value in the RRDs. Nothing in the logs gives a clue as to where exactly it's going wrong, or why it's only affecting v6 prefixes on the MX platform.

cmooney@wikilap:~$ egrep "1.198.35.26.6|\[23\]" poller_log2.txt | egrep "Accepted|rrd"
BGP4-V2-MIB-JUNIPER::jnxBgpM2PrefixInPrefixesAccepted[23][ipv6][1] = 1
RRD[update /srv/librenms/rrd/cr4-ulsfo.wikimedia.org/bgp-2620_0_863_1_198_35_26_6.rrd N:0:0:0:0:0]  
Sending to Graphite: librenms.cr4-ulsfo_wikimedia_org.cbgp.cbgp-2620_0_863_1_198_35_26_6_ipv6_unicast.AcceptedPrefixes 0 1737475464
RRD[update /srv/librenms/rrd/cr4-ulsfo.wikimedia.org/cbgp-2620_0_863_1_198_35_26_6.ipv6.unicast.rrd N:0:0:0:0:0]
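The mismatch can be confirmed mechanically from those two log lines: the SNMP value is 1 while every counter in the RRD update is 0. A quick sketch (the parsing is hypothetical, based only on the log format shown above):

```python
import re

# Lines copied from the poller log above
snmp_line = "BGP4-V2-MIB-JUNIPER::jnxBgpM2PrefixInPrefixesAccepted[23][ipv6][1] = 1"
rrd_line = ("RRD[update /srv/librenms/rrd/cr4-ulsfo.wikimedia.org/"
            "cbgp-2620_0_863_1_198_35_26_6.ipv6.unicast.rrd N:0:0:0:0:0]")

# Value the router actually reported over SNMP
snmp_value = int(snmp_line.rsplit("= ", 1)[1])

# Counters LibreNMS is about to write into the RRD (N:<v1>:<v2>:...)
rrd_values = re.search(r"N((?::\d+)+)", rrd_line).group(1).strip(":").split(":")

print(snmp_value, rrd_values)  # 1 vs all-zero counters going into the RRD
```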

Looking at the changelog I wonder if this issue could be related to this change. If it's unrelated to it, do you think we should open an issue upstream?

Looking at the changelog I wonder if this issue could be related to this change. If it's unrelated to it, do you think we should open an issue upstream?

Thanks @andrea.denisse! That change definitely looks like it might be somehow related alright. Let me dig into it a little deeper, I can check how the IPv6 addresses are formatted in the response from the working vs broken devices, and see if that makes sense in the context of the above patch.

Either way it does look to be a bug so we'll probably open an upstream task yes. I'm also working in parallel on pulling BGP stats using gnmic, but whichever we use for alerting we should open an issue with LibreNMS if it seems the problem is there.

I was able to run a manual poller command with the updated 'lnms' CLI and it shows errors processing the IPv6 peers. For example:

Checking BGP peer 2620:0:863:101:10:128:0:6 SQLSTATE[23000]: Integrity constraint violation: 1048 Column 'bgpPeerState' cannot be null (Connection: mysql, SQL: UPDATE `bgpPeers` set `bgpPeerState`=?,`bgpPeerAdminStatus`=?,`bgpPeerInUpdates`=0,`bgpPeerInTotalMessages`=0,`bgpPeerOutTotalMessages`=0,`bgpPeerFsmEstablishedTime`=0,`bgpPeerLastErrorCode`=0,`bgpPeerLastErrorSubCode`=0,`bgpLocalAddr`= WHERE `device_id` = 88 AND `bgpPeerIdentifier` = 2620:0:863:101:10:128:0:6) (Connection: dbFacile, SQL: UPDATE `bgpPeers` set `bgpPeerState`=?,`bgpPeerAdminStatus`=?,`bgpPeerInUpdates`=0,`bgpPeerInTotalMessages`=0,`bgpPeerOutTotalMessages`=0,`bgpPeerFsmEstablishedTime`=0,`bgpPeerLastErrorCode`=0,`bgpPeerLastErrorSubCode`=0,`bgpLocalAddr`= WHERE `device_id` = 88 AND `bgpPeerIdentifier` = 2620:0:863:101:10:128:0:6)#0 /srv/deployment/librenms/librenms-cache/revs/1d8cd7a87db942fc51509095bda71411582eec64/includes/polling/bgp-peers.inc.php(532): dbUpdate()
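The constraint violation suggests the poller failed to match the SNMP-indexed peer against the configured identifier, leaving the state fields null. One plausible failure class for IPv6-only bugs like this (purely an illustration; `normalize` is a hypothetical helper, not the actual LibreNMS code) is comparing the same address in expanded vs compressed form without canonicalizing first:

```python
import ipaddress

def normalize(addr: str) -> str:
    """Canonicalize an IP literal (RFC 5952 compressed form for IPv6)
    so lookup keys match regardless of how the device formats it."""
    try:
        return str(ipaddress.ip_address(addr))
    except ValueError:
        return addr  # not an IP literal; leave unchanged

# Two spellings of the same peer don't compare equal as raw strings:
expanded = "2620:0000:0863:0101:0010:0128:0000:0006"
compressed = "2620:0:863:101:10:128:0:6"
assert expanded != compressed
assert normalize(expanded) == normalize(compressed) == compressed
```

IPv4 addresses only have one common textual form, which would be consistent with the v4 stats being unaffected.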

I've raised an issue on the LibreNMS github now with further info:

https://github.com/librenms/librenms/issues/17068

Hi @cmooney,

I was reviewing the LibreNMS v25.1.0 changelog and noticed that this patch was included. The referenced bug appears to be very similar to the issue we're experiencing.

Would you like me to proceed with upgrading to v25.1.0?

Mentioned in SAL (#wikimedia-operations) [2025-01-29T02:56:21Z] <denisse@deploy2002> Started deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 25.1.0 - T384258

Mentioned in SAL (#wikimedia-operations) [2025-01-29T02:56:34Z] <denisse@deploy2002> Finished deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 25.1.0 - T384258 (duration: 00m 13s)

After upgrading to v25.1.0, the graphs confirm that IPv6 BGP prefix stats are now working as expected:

img-2025-01-28-21-13-42.png (648×1 px, 54 KB)

I also ran a manual poller operation, and the error no longer appears.

Since the issue was resolved upstream and we’re now running the patched version in production, I’m closing this as resolved. Feel free to reopen if needed.

After upgrading to v25.1.0, the graphs confirm that IPv6 BGP prefix stats are now working as expected:

Thank you @andrea.denisse!! Amazing work :)

Indeed all looks to be ok. I have re-enabled the alert that was incorrectly firing due to this, so we have monitoring back on that.

Can't believe I didn't spot that issue; my search skills are really letting me down these days! I've closed my reported issue on their GitHub now too. Thanks.