Yesterday we moved the GTT link from xe-4/2/2 to xe-3/0/7 on cr1-eqiad (part of T304712). Since then, Icinga has been alerting that the v4 and v6 BFD sessions to cr2-drmrs (over vlan 16) are flapping.
Looking at the logs, it seems the old interface (xe-4/2/2.16) was still "stuck" somewhere, see below:
Oct 18 06:57:57 re0.cr1-eqiad bfdd[29381]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 1520, new state: up, interface: xe-4/2/2.16, peer addr: 185.15.58.146
Oct 18 06:58:01 re0.cr1-eqiad bfdd[29381]: BFDD_STATE_UP_TO_DOWN: BFD Session 2a02:ec80:600:fe04::1 (IFL 418) state Up -> Down LD/RD(1533/32) Up time:00:00:12 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Oct 18 06:58:01 re0.cr1-eqiad bfdd[29381]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 1533, new state: down, interface: xe-4/2/2.16, peer addr: 2a02:ec80:600:fe04::1
Oct 18 06:58:02 re0.cr1-eqiad bfdd[29381]: BFDD_STATE_UP_TO_DOWN: BFD Session 185.15.58.146 (IFL 418) state Up -> Down LD/RD(1520/28) Up time:00:00:05 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Oct 18 06:58:02 re0.cr1-eqiad bfdd[29381]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 1520, new state: down, interface: xe-4/2/2.16, peer addr: 185.15.58.146
Oct 18 06:58:02 re0.cr1-eqiad rpd[31216]: RPD_OSPF_NBRDOWN: OSPF neighbor 185.15.58.146 (realm ospf-v2 xe-3/0/7.16 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Oct 18 06:58:02 re0.cr1-eqiad rpd[31216]: RPD_OSPF_NBRUP: OSPF neighbor 185.15.58.146 (realm ospf-v2 xe-3/0/7.16 area 0.0.0.0) state changed from Init to ExStart due to 2WayRcvd (event reason: neighbor detected this router)
Oct 18 06:58:03 re0.cr1-eqiad rpd[31216]: RPD_OSPF_NBRUP: OSPF neighbor 185.15.58.146 (realm ospf-v2 xe-3/0/7.16 area 0.0.0.0) state changed from Exchange to Full due to ExchangeDone (event reason: DBD exchange of master completed)
Oct 18 06:58:06 re0.cr1-eqiad bfdd[29381]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 1520, new state: up, interface: xe-4/2/2.16, peer addr: 185.15.58.146
Oct 18 06:58:07 re0.cr1-eqiad bfdd[29381]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 1533, new state: up, interface: xe-4/2/2.16, peer addr: 2a02:ec80:600:fe04::1
Oct 18 06:58:12 re0.cr1-eqiad bfdd[29381]: BFDD_STATE_UP_TO_DOWN: BFD Session 185.15.58.146 (IFL 418) state Up -> Down LD/RD(1520/28) Up time:00:00:06 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Oct 18 06:58:12 re0.cr1-eqiad bfdd[29381]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 1520, new state: down, interface: xe-4/2/2.16, peer addr: 185.15.58.146
Oct 18 06:58:12 re0.cr1-eqiad rpd[31216]: RPD_OSPF_NBRDOWN: OSPF neighbor 185.15.58.146 (realm ospf-v2 xe-3/0/7.16 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Oct 18 06:58:12 re0.cr1-eqiad rpd[31216]: RPD_OSPF_NBRUP: OSPF neighbor 185.15.58.146 (realm ospf-v2 xe-3/0/7.16 area 0.0.0.0) state changed from Init to ExStart due to 2WayRcvd (event reason: neighbor detected this router)
Oct 18 06:58:13 re0.cr1-eqiad rpd[31216]: RPD_OSPF_NBRUP: OSPF neighbor 185.15.58.146 (realm ospf-v2 xe-3/0/7.16 area 0.0.0.0) state changed from Exchange to Full due to ExchangeDone (event reason: DBD exchange of master completed)
Grepping the full config for "xe-4/2/2" didn't return anything other than the now-disabled interface.
This BFD adjacency is used by OSPF, OSPFv3 as well as BGP.
Bouncing the matching BGP sessions (v4 and v6) solved that specific issue (of using the wrong interface).
Since then, the IPv6 BFD session (and the protocols relying on it) has been up and stable.
However, the v4 side is still flapping:
Oct 18 07:23:31 re0.cr1-eqiad bfdd[29381]: BFDD_STATE_UP_TO_DOWN: BFD Session 185.15.58.146 (IFL 418) state Up -> Down LD/RD(1520/28) Up time:00:00:05 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Oct 18 07:23:31 re0.cr1-eqiad bfdd[29381]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 1520, new state: down, interface: xe-3/0/7.16, peer addr: 185.15.58.146
Oct 18 07:23:31 re0.cr1-eqiad rpd[31216]: RPD_OSPF_NBRDOWN: OSPF neighbor 185.15.58.146 (realm ospf-v2 xe-3/0/7.16 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Oct 18 07:23:31 re0.cr1-eqiad rpd[31216]: RPD_OSPF_NBRUP: OSPF neighbor 185.15.58.146 (realm ospf-v2 xe-3/0/7.16 area 0.0.0.0) state changed from Init to ExStart due to 2WayRcvd (event reason: neighbor detected this router)
Oct 18 07:23:32 re0.cr1-eqiad rpd[31216]: RPD_OSPF_NBRUP: OSPF neighbor 185.15.58.146 (realm ospf-v2 xe-3/0/7.16 area 0.0.0.0) state changed from Exchange to Full due to ExchangeDone (event reason: DBD exchange of master completed)
Oct 18 07:23:35 re0.cr1-eqiad bfdd[29381]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 1520, new state: up, interface: xe-3/0/7.16, peer addr: 185.15.58.146
Clearing BFD on either side didn't help.
Rapid ping (e.g. interval 0.2) between the v4 IPs works fine.
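
To track whether the flapping continues and which interface each session is reported on, the BFDD_TRAP_SHOP_STATE_* lines from the logs above could be tallied per peer and interface. This is only a rough sketch: the function name `count_bfd_transitions` and the regex are my own, and the pattern assumes the syslog line format stays exactly as in the excerpts above.

```python
import re
from collections import Counter

# Matches the BFDD_TRAP_SHOP_STATE_UP/DOWN lines, which carry the new state,
# the interface and the peer address. Format assumed from the excerpts above.
TRAP_RE = re.compile(
    r"BFDD_TRAP_SHOP_STATE_(?:UP|DOWN): local discriminator: (?P<ld>\d+), "
    r"new state: (?P<state>\w+), interface: (?P<ifl>[^,\s]+), "
    r"peer addr: (?P<peer>\S+)"
)


def count_bfd_transitions(log_text):
    """Count BFD trap events per (peer, interface, new state)."""
    counts = Counter()
    for m in TRAP_RE.finditer(log_text):
        counts[(m["peer"], m["ifl"], m["state"])] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "Oct 18 07:23:31 re0.cr1-eqiad bfdd[29381]: BFDD_TRAP_SHOP_STATE_DOWN: "
        "local discriminator: 1520, new state: down, interface: xe-3/0/7.16, "
        "peer addr: 185.15.58.146\n"
        "Oct 18 07:23:35 re0.cr1-eqiad bfdd[29381]: BFDD_TRAP_SHOP_STATE_UP: "
        "local discriminator: 1520, new state: up, interface: xe-3/0/7.16, "
        "peer addr: 185.15.58.146\n"
    )
    for key, n in sorted(count_bfd_transitions(sample).items()):
        print(key, n)
```

Because the regex is applied with `finditer` over the whole text, it works whether the log entries are one per line or pasted as a single run.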