
Jan 2025 - Magru core router connectivity blips
Closed, Resolved · Public

Description

We got paged for cr2-magru being down, due to a ping failure from alert1002:

2025-01-25T08:20:58 alert1002 icinga: HOST ALERT: cr2-magru;DOWN;HARD;2;PING CRITICAL - Packet loss = 100%
2025-01-25T08:22:26 alert1002 icinga: HOST ALERT: cr2-magru;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 115.53 ms

This seems to be due to the IBGP session flapping between the core routers in magru. cr2 failed first, and the session stayed down for ~20 seconds until the failure was detected on cr1, after which it was re-negotiated and came back:

Jan 25 08:20:52  cr2-magru rpd[34120]: BGP_IO_ERROR_CLOSE_SESSION: BGP peer 195.200.68.136 (Internal AS 65007): Error event Operation timed out(60) for I/O session - closing it (instance master)
Jan 25 08:21:08  cr1-magru rpd[34124]: BGP_IO_ERROR_CLOSE_SESSION: BGP peer 195.200.68.137 (Internal AS 65007): Error event Operation timed out(60) for I/O session - closing it (instance master)

It is not clear what happened. The interfaces are not showing errors, and general health sensors are normal on both routers. The equivalent IPv6 peering over the same link was not interrupted at all and has been steady for 38 weeks. OSPF for both protocols is stable, and there is no ping loss across the link either (see P72436).

The traffic path from alert1002 to cr2-magru goes via cr1-magru, traversing the link that the above session runs over, so to an extent the ping failure/alert makes sense. But the route for cr2 is known via OSPF, which did not flap, so the IBGP flap shouldn't have affected it.

We had a small bump in TCP timed-out NELs at the same time, but I guess that is not unexpected when IBGP paths drop and then reconverge.

image.png (50 KB)

Magru was downtimed by SREs who got the alert (thanks!). Other than that, most health checks on the router seem fine. There was also instability on our EdgeUno transport service back to edfw, before and after the first page, though that path isn't in heavy use:

Jan 25 06:07:23  cr2-magru rpd[34120]: RPD_OSPF_NBRDOWN: OSPF neighbor 195.200.68.152 (realm ospf-v2 xe-0/1/1.0 area 0.0.0.0) state changed from Full to Init due to 1WayRcvd (event reason: neighbor is in one-way mode)
Jan 25 06:07:23  cr2-magru rpd[34120]: RPD_OSPF_NBRUP: OSPF neighbor 195.200.68.152 (realm ospf-v2 xe-0/1/1.0 area 0.0.0.0) state changed from Init to ExStart due to 2WayRcvd (event reason: neighbor detected this router)
Jan 25 06:07:24  cr2-magru rpd[34120]: RPD_OSPF_NBRUP: OSPF neighbor 195.200.68.152 (realm ospf-v2 xe-0/1/1.0 area 0.0.0.0) state changed from Exchange to Full due to ExchangeDone (event reason: DBD exchange of slave completed)

Jan 25 07:59:33  cr2-magru rpd[34120]: RPD_OSPF_NBRDOWN: OSPF neighbor 195.200.68.152 (realm ospf-v2 xe-0/1/1.0 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Jan 25 07:59:33  cr2-magru rpd[34120]: RPD_OSPF_NBRUP: OSPF neighbor 195.200.68.152 (realm ospf-v2 xe-0/1/1.0 area 0.0.0.0) state changed from Init to ExStart due to 2WayRcvd (event reason: neighbor detected this router)
Jan 25 07:59:34  cr2-magru rpd[34120]: RPD_OSPF_NBRUP: OSPF neighbor 195.200.68.152 (realm ospf-v2 xe-0/1/1.0 area 0.0.0.0) state changed from Exchange to Full due to ExchangeDone (event reason: DBD exchange of slave completed)

Jan 25 09:19:57  cr2-magru rpd[34120]: RPD_OSPF_NBRDOWN: OSPF neighbor 195.200.68.152 (realm ospf-v2 xe-0/1/1.0 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Jan 25 09:19:58  cr2-magru rpd[34120]: RPD_OSPF_NBRUP: OSPF neighbor 195.200.68.152 (realm ospf-v2 xe-0/1/1.0 area 0.0.0.0) state changed from Init to ExStart due to 2WayRcvd (event reason: neighbor detected this router)
Jan 25 09:19:59  cr2-magru rpd[34120]: RPD_OSPF_NBRUP: OSPF neighbor 195.200.68.152 (realm ospf-v2 xe-0/1/1.0 area 0.0.0.0) state changed from Exchange to Full due to ExchangeDone (event reason: DBD exchange of slave completed)

I drained that circuit by upping the OSPF cost via Netbox/Homer, so it's only a backup path now.
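
For reference, a rough sketch of the Junos change Homer would render for this (the metric value here is illustrative, not the exact one used):

set protocols ospf area 0.0.0.0 interface xe-0/1/1.0 metric 10000
set protocols ospf3 area 0.0.0.0 interface xe-0/1/1.0 metric 10000

Raising the metric on both address families keeps the adjacency up while making the circuit lose to the alternative transport paths.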

I was about to say things look stable and we could think about re-pooling the site, but IBGP flapped again and we were paged:

2025-01-25T10:54:00 alert1002 icinga: HOST ALERT: cr2-magru;DOWN;HARD;2;PING CRITICAL - Packet loss = 100%
2025-01-25T10:54:30 alert1002 icinga: HOST ALERT: cr2-magru;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 115.40 ms

Again cr2 seemed to have a problem with the session first:

Jan 25 10:53:54  cr2-magru rpd[34120]: BGP_IO_ERROR_CLOSE_SESSION: BGP peer 195.200.68.136 (Internal AS 65007): Error event Operation timed out(60) for I/O session - closing it (instance master)

It was ~30 seconds before cr1 detected the problem, after which the session restarted and came back fine again:

Jan 25 10:54:28  cr1-magru rpd[34124]: bgp_pp_recv: rejecting connection from 195.200.68.137 (Internal AS 65007), peer in state Established
Jan 25 10:54:28  cr1-magru rpd[34124]: bgp_io_mgmt_cb:3021: NOTIFICATION sent to 195.200.68.137 (Internal AS 65007): code 4 (Hold Timer Expired Error), Reason: holdtime expired for 195.200.68.137 (Internal AS 65007)
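
For context, Junos defaults to a 90-second BGP hold time with keepalives every 30 seconds, so a one-sided failure like this can take up to the full hold time to be detected on the healthy side (a later log below shows "hold timer 90s" explicitly). A hedged sketch of how the timer could be tightened if we wanted faster detection (the group name and value are illustrative; we later added BFD instead):

set protocols bgp group Core hold-time 30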

I think for now we should leave the site depooled and we can see how things progress. Both core routers have been downtimed until Monday. We can make a call at that point on what to do, though I think it would be a good idea to reboot both routers before any repool at least.

Event Timeline

magru was depooled:

08:34 <•logmsgbot> !log fabfur@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site magru [reason: depool magru to check for cr issues, no task ID specified]
08:34 <•logmsgbot> !log fabfur@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site magru [reason: depool magru to check for cr issues, no task ID

cmooney renamed this task from "cr2-magru ping down" to "Jan 2025 - Magru core router connectivity blips". Jan 25 2025, 12:07 PM
cmooney claimed this task.
cmooney triaged this task as High priority.
cmooney updated the task description.

This looks quite similar to T374401

Yes. We didn't isolate things to the link between the routers on that occasion, but the symptoms are very similar.

I'm gonna upgrade the software on both devices to JunOS 23.4R2 and reboot them, then monitor for another 24 hours.
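
For the record, the upgrade itself follows the standard Junos flow, roughly as below (the exact image filename is an assumption):

request system software add /var/tmp/junos-install-mx-x86-64-23.4R2.tgz
request system reboot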

Icinga downtime and Alertmanager silence (ID=5e91bf43-5eb3-4e7a-9b94-8ae6c366da3f) set by cmooney@cumin1002 for 2 days, 0:00:00 on 4 host(s) and their services with reason: upgrading JunOS on magru core routers

cr[1-2]-magru,cr[1-2]-magru IPv6

Mentioned in SAL (#wikimedia-operations) [2025-01-27T10:20:06Z] <topranks> installing updated JunOS image on cr2-magru T384774

Mentioned in SAL (#wikimedia-operations) [2025-01-27T10:47:38Z] <topranks> rebooting cr2-magru to complete upgrade T384774

Icinga downtime and Alertmanager silence (ID=400d1435-bc43-43ba-813d-df761b30f623) set by cmooney@cumin1002 for 2:00:00 on 2 host(s) and their services with reason: upgrading JunOS on magru core routers

asw1-b[3-4]-magru.mgmt

Mentioned in SAL (#wikimedia-operations) [2025-01-27T11:20:15Z] <topranks> installing updated JunOS image on cr1-magru T384774

Mentioned in SAL (#wikimedia-operations) [2025-01-27T11:34:45Z] <topranks> rebooting cr1-magru to complete upgrade T384774

Ok both routers have been upgraded to JunOS 23.4R2 and reset. I've extended the downtime until Wednesday morning. Let's make a call in 24h on whether they have been stable and we can repool the site.

Everything remains stable since the upgrade/reset of the routers yesterday. All protocol adjacencies, interfaces etc. look good, as do the general health stats for the site.

I've set the EdgeUno transport circuit back to normal priority; it's been stable since Saturday (its earlier instability seems unrelated to the ping failures, just a coincidence).

Overall I'm happy for us to re-pool the site and keep an eye on how things perform.

Mentioned in SAL (#wikimedia-operations) [2025-01-28T10:35:35Z] <vgutierrez@cumin1002> START - Cookbook sre.dns.admin DNS admin: pool site magru [reason: no reason specified, T384774]

Mentioned in SAL (#wikimedia-operations) [2025-01-28T10:35:49Z] <vgutierrez@cumin1002> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site magru [reason: no reason specified, T384774]

Gonna close this one, all is stable after ~24h.

Unfortunately this happened again this evening. Same symptoms.

Feb  1 16:34:07  cr1-magru rpd[17334]: BGP_IO_ERROR_CLOSE_SESSION: BGP peer 195.200.68.137 (Internal AS 65007): Error event Operation timed out(60) for I/O session - closing it (instance master)
Feb  1 16:34:07  cr1-magru rpd[17334]: bgp_io_mgmt_cb:3105: NOTIFICATION sent to 195.200.68.137 (Internal AS 65007): code 4 (Hold Timer Expired Error), Reason: holdtime expired for 195.200.68.137 (Internal AS 65007), socket buffer sndacc: 10786 rcvacc: 0 , socket buffer sndccc: 10786 rcvccc: 0 TCP state: 4, snd_una: 3366736320 snd_nxt: 3366742833 snd_wnd: 24540 rcv_nxt: 1428539983 rcv_adv: 1428556367, hold timer 90s, hold timer remain 0s, last sent 12s, TCP port (local 179, remote 49526)

I'll open a TAC case with Juniper on Monday about it. On balance I think the disruption these occasional blips are causing does not justify depooling the site. I'll downtime cr2-magru until Monday to stop alerts firing.

I've added BFD to this particular session now. Not that it will fix things, but it should give us more data points for the (likely) case with Juniper.
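
A hedged sketch of the sort of config that adds BFD to the session on cr2 (the group name and intervals are illustrative; cr1 mirrors it with the .137 neighbor):

set protocols bgp group Core neighbor 195.200.68.136 bfd-liveness-detection minimum-interval 300
set protocols bgp group Core neighbor 195.200.68.136 bfd-liveness-detection multiplier 3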

The fact that IPv6 and OSPF are not flapping makes me fairly sure the problem here is not the link - and thus there is no point getting any remote-hands work done, like checking fibers or changing optics. @ayounsi interested if you disagree or if there is anything else we might try here.

We got paged again for ping loss here. I have downtimed cr2-magru for two days and will work with Juniper tomorrow on the issue. We may ultimately need to depool the site, as some of the tracing they wish to do could be disruptive, but I think for now these blips are still not causing sufficient interruption to be worth depooling the whole site.

Mentioned in SAL (#wikimedia-operations) [2025-02-11T13:21:08Z] <topranks> adding BGP trace/verbose debug mode on magru CRs as advised by Juniper support T384774

Mentioned in SAL (#wikimedia-operations) [2025-03-06T11:44:15Z] <topranks> applying interface-specific arp policer on cr2-magru to IX.BR sub-interface ae0.3347 (T384774)

After a good deal of back and forth with JTAC they were able to point us in the right direction.

By default the MX platform has a built-in "arp policer", which limits the number of ARP requests/responses the PFE will process per second. What seems to be happening is that at IX.BR, where we have almost 3,000 ARP entries on the LAN (and who knows how many requests arriving per second), we are hitting this ARP policer limit quite regularly.

This policer is global for all interfaces on the box, meaning that the high level of ARP packets on the IXP interface has the potential to push the global limit over threshold, after which ARP packets are not processed properly on other interfaces. This scenario is also detailed in this blog post I found:

https://lkhill.com/juniper-arp-policer/

For now I've followed the advice from JTAC and the blog and created an arp_policer manually in the config, applying it to the IX.BR IPv4 sub-interface. The counters on this are incrementing (so the limits are being hit - perhaps we need higher ones?), but we no longer have drops on the "default" one, so I think this should stop the CR losing the ARP entry for its peer.

cmooney@cr2-magru> show policer __default_arp_policer__               
Policers:
Name                                                                            Bytes              Packets
__default_arp_policer__                                                             0                    0
cmooney@cr2-magru> show policer arp_limit-ae0.3347-inet-arp           
Policers:
Name                                                                            Bytes              Packets
arp_limit-ae0.3347-inet-arp                                                   1970936                43431
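
For reference, a hedged sketch of roughly what that config looks like, following the pattern from the blog post (the bandwidth/burst values are illustrative; sizing is tuned below):

set firewall policer arp_limit if-exceeding bandwidth-limit 512k
set firewall policer arp_limit if-exceeding burst-size-limit 15k
set firewall policer arp_limit then discard
set interfaces ae0 unit 3347 family inet policer arp arp_limit

Junos instantiates this per interface/family, which is where the arp_limit-ae0.3347-inet-arp name in the output above comes from.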

I've asked Juniper what their recommendations in terms of sizing are, and what reasonable limits on the MX204 might be. We might also think about other mechanisms to limit this; we have 34 peers on the exchange currently, so 3,000 ARP table entries on the interface is excessive for that.

cmooney lowered the priority of this task from High to Medium. Mar 7 2025, 2:00 PM

Everything still seems stable. Juniper also provided this link to their KB article on it:

https://supportportal.juniper.net/s/article/Multiple-BGPBFD-sessions-bouncing-due-to-default-arp-policer-drops

FWIW I should also mention the info in the slide deck below:

https://www.lacnic.net/innovaportal/file/3207/1/lacnog2018-douglasfischer_analisearpstorm.pdf

From my reading it seems like IX.BR have implemented EVPN/SR-IPv6 with ARP suppression in Brasília to deal with the amount of broadcast (presumably ARP) traffic on the IX LAN. They say they are going to expand that to São Paulo in the first half of 2025, so with any luck this situation won't persist for much longer.

In the meantime, it's hard to say what policer level we should set. Right now I have increased the policer limit to 256k and we are still showing constant drops. I'll keep increasing it slowly while keeping an eye on the CPU usage in LibreNMS. Looking at the BGP neighbor stats for the IX peers, the policing does seem to be causing disruptions/resets, so we need to stop dropping ARPs from them.

Change #1127592 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Support setting custom arp-policer on CR interfaces

https://gerrit.wikimedia.org/r/1127592

FWIW the router CPU is still fine with the ARP policer set to 2M, which is how high I had to go before the counters stopped incrementing all the time.

cmooney@cr2-magru> show policer arp_ae0_3347-ae0.3347-inet-arp 
Policers:
Name                                                                            Bytes              Packets
arp_ae0_3347-ae0.3347-inet-arp                                                      0                    0

Change #1127592 merged by jenkins-bot:

[operations/homer/public@master] Support setting custom arp-policer on CR interfaces

https://gerrit.wikimedia.org/r/1127592

Router stable and config added to automation templates, closing task.