We got paged for cr2-magru down, due to a ping failure from alert1001:
2025-01-25T08:20:58 alert1002 icinga: HOST ALERT: cr2-magru;DOWN;HARD;2;PING CRITICAL - Packet loss = 100%
2025-01-25T08:22:26 alert1002 icinga: HOST ALERT: cr2-magru;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 115.53 ms
This seems to be due to IBGP flapping between the core routers in magru. cr2 failed first, and the session stayed down for ~16 seconds until the failure was detected on cr1, after which it re-negotiated and came back:
Jan 25 08:20:52 cr2-magru rpd[34120]: BGP_IO_ERROR_CLOSE_SESSION: BGP peer 195.200.68.136 (Internal AS 65007): Error event Operation timed out(60) for I/O session - closing it (instance master)
Jan 25 08:21:08 cr1-magru rpd[34124]: BGP_IO_ERROR_CLOSE_SESSION: BGP peer 195.200.68.137 (Internal AS 65007): Error event Operation timed out(60) for I/O session - closing it (instance master)
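For reference, the gap between cr2 tearing the session down and cr1 noticing can be read straight off the two syslog lines above (a throwaway Python sketch; the year is filled in by hand since syslog omits it):

```python
from datetime import datetime

# Timestamps from the two BGP_IO_ERROR_CLOSE_SESSION lines above;
# syslog omits the year, so 2025 is added manually.
cr2_closed = datetime(2025, 1, 25, 8, 20, 52)
cr1_closed = datetime(2025, 1, 25, 8, 21, 8)

# cr1 only tore its side down once its own timer gave up on the peer.
gap = (cr1_closed - cr2_closed).total_seconds()
print(gap)  # → 16.0
```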
It is not clear what happened. The interfaces are not showing errors, and general health sensors are normal on both routers. The equivalent IPv6 peering over the same link was not interrupted at all and has been steady for 38 weeks. OSPF for both protocols is stable. No ping loss across the link either (see P72436).
The traffic path from alert1002 to cr2-magru goes via cr1-magru, traversing the link that the above session runs over, so to an extent the ping failure/alert makes sense. But the route for cr2 is known via OSPF, which did not flap, so the IBGP flap shouldn't have affected it.
We had a small bump in TCP timed-out NELs at the same time, but I guess that is not unexpected when IBGP paths drop and then reconverge.
Magru was downtimed by SREs who got the alert (thanks!). Other than that, most health checks on the routers seem fine. There was also instability on our EdgeUno transport service back to edfw, both before and after the first page, though the path isn't in heavy use:
Jan 25 06:07:23 cr2-magru rpd[34120]: RPD_OSPF_NBRDOWN: OSPF neighbor 195.200.68.152 (realm ospf-v2 xe-0/1/1.0 area 0.0.0.0) state changed from Full to Init due to 1WayRcvd (event reason: neighbor is in one-way mode)
Jan 25 06:07:23 cr2-magru rpd[34120]: RPD_OSPF_NBRUP: OSPF neighbor 195.200.68.152 (realm ospf-v2 xe-0/1/1.0 area 0.0.0.0) state changed from Init to ExStart due to 2WayRcvd (event reason: neighbor detected this router)
Jan 25 06:07:24 cr2-magru rpd[34120]: RPD_OSPF_NBRUP: OSPF neighbor 195.200.68.152 (realm ospf-v2 xe-0/1/1.0 area 0.0.0.0) state changed from Exchange to Full due to ExchangeDone (event reason: DBD exchange of slave completed)
Jan 25 07:59:33 cr2-magru rpd[34120]: RPD_OSPF_NBRDOWN: OSPF neighbor 195.200.68.152 (realm ospf-v2 xe-0/1/1.0 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Jan 25 07:59:33 cr2-magru rpd[34120]: RPD_OSPF_NBRUP: OSPF neighbor 195.200.68.152 (realm ospf-v2 xe-0/1/1.0 area 0.0.0.0) state changed from Init to ExStart due to 2WayRcvd (event reason: neighbor detected this router)
Jan 25 07:59:34 cr2-magru rpd[34120]: RPD_OSPF_NBRUP: OSPF neighbor 195.200.68.152 (realm ospf-v2 xe-0/1/1.0 area 0.0.0.0) state changed from Exchange to Full due to ExchangeDone (event reason: DBD exchange of slave completed)
Jan 25 09:19:57 cr2-magru rpd[34120]: RPD_OSPF_NBRDOWN: OSPF neighbor 195.200.68.152 (realm ospf-v2 xe-0/1/1.0 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Jan 25 09:19:58 cr2-magru rpd[34120]: RPD_OSPF_NBRUP: OSPF neighbor 195.200.68.152 (realm ospf-v2 xe-0/1/1.0 area 0.0.0.0) state changed from Init to ExStart due to 2WayRcvd (event reason: neighbor detected this router)
Jan 25 09:19:59 cr2-magru rpd[34120]: RPD_OSPF_NBRUP: OSPF neighbor 195.200.68.152 (realm ospf-v2 xe-0/1/1.0 area 0.0.0.0) state changed from Exchange to Full due to ExchangeDone (event reason: DBD exchange of slave completed)
I drained that circuit by upping the OSPF cost via Netbox/Homer, so it's only a backup path now.
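Concretely, the drain amounts to setting a high OSPF metric on the transport-facing interface so it only carries traffic if the primary paths fail. A hypothetical hand-written Junos equivalent (in practice Homer renders this from Netbox data; the interface name and metric value here are illustrative assumptions):

```
# Illustrative only -- actual config is generated by Homer from Netbox.
set protocols ospf area 0.0.0.0 interface xe-0/1/1.0 metric 10000
set protocols ospf3 area 0.0.0.0 interface xe-0/1/1.0 metric 10000
```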
I was about to say things look stable and we could think about re-pooling the site, but IBGP flapped again and we were paged:
2025-01-25T10:54:00 alert1002 icinga: HOST ALERT: cr2-magru;DOWN;HARD;2;PING CRITICAL - Packet loss = 100%
2025-01-25T10:54:30 alert1002 icinga: HOST ALERT: cr2-magru;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 115.40 ms
Again cr2 seemed to have a problem with the session first:
Jan 25 10:53:54 cr2-magru rpd[34120]: BGP_IO_ERROR_CLOSE_SESSION: BGP peer 195.200.68.136 (Internal AS 65007): Error event Operation timed out(60) for I/O session - closing it (instance master)
It was ~34 seconds before cr1 detected the problem, after which the session restarted and came back fine again:
Jan 25 10:54:28 cr1-magru rpd[34124]: bgp_pp_recv: rejecting connection from 195.200.68.137 (Internal AS 65007), peer in state Established
Jan 25 10:54:28 cr1-magru rpd[34124]: bgp_io_mgmt_cb:3021: NOTIFICATION sent to 195.200.68.137 (Internal AS 65007): code 4 (Hold Timer Expired Error), Reason: holdtime expired for 195.200.68.137 (Internal AS 65007)
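The Hold Timer Expired notification is consistent with plain BGP hold-timer semantics: a speaker only declares its peer dead after going a full hold time without receiving a keepalive, so cr1 lagging behind cr2's close is expected. A minimal sketch of that logic (the 90-second hold time is the common Junos default and an assumption here; we haven't confirmed the configured timers):

```python
from datetime import datetime, timedelta

# Assumed hold time; common Junos default (keepalives every 30 s).
HOLD_TIME = timedelta(seconds=90)

def holdtime_expired(last_keepalive: datetime, now: datetime) -> bool:
    """True once no keepalive has arrived within the hold time."""
    return now - last_keepalive >= HOLD_TIME

# Hypothetical: if cr1 last heard from cr2 at 10:52:58, it would only
# tear the session down at 10:54:28.
last_ka = datetime(2025, 1, 25, 10, 52, 58)
print(holdtime_expired(last_ka, datetime(2025, 1, 25, 10, 54, 27)))  # False
print(holdtime_expired(last_ka, datetime(2025, 1, 25, 10, 54, 28)))  # True
```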
I think for now we should leave the site depooled and see how things progress. Both core routers have been downtimed until Monday. We can make a call then on what to do, though I think it would be a good idea to at least reboot both routers before any repool.
