
BFD flapping between cr1-eqiad and cr2-drmrs
Closed, Resolved · Public

Description

Yesterday we moved the GTT link from xe-4/2/2 to xe-3/0/7 on cr1-eqiad (part of T304712). Since then, Icinga has been alerting that the v4 and v6 BFD sessions to cr2-drmrs (over vlan 16) are flapping.

Looking at the logs, it seems the old interface (xe-4/2/2.16) was still "stuck" somewhere, see below:

Oct 18 06:57:57  re0.cr1-eqiad bfdd[29381]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 1520, new state: up, interface: xe-4/2/2.16, peer addr: 185.15.58.146
Oct 18 06:58:01  re0.cr1-eqiad bfdd[29381]: BFDD_STATE_UP_TO_DOWN: BFD Session 2a02:ec80:600:fe04::1 (IFL 418) state Up -> Down LD/RD(1533/32) Up time:00:00:12 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Oct 18 06:58:01  re0.cr1-eqiad bfdd[29381]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 1533, new state: down, interface: xe-4/2/2.16, peer addr: 2a02:ec80:600:fe04::1
Oct 18 06:58:02  re0.cr1-eqiad bfdd[29381]: BFDD_STATE_UP_TO_DOWN: BFD Session 185.15.58.146 (IFL 418) state Up -> Down LD/RD(1520/28) Up time:00:00:05 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Oct 18 06:58:02  re0.cr1-eqiad bfdd[29381]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 1520, new state: down, interface: xe-4/2/2.16, peer addr: 185.15.58.146
Oct 18 06:58:02  re0.cr1-eqiad rpd[31216]: RPD_OSPF_NBRDOWN: OSPF neighbor 185.15.58.146 (realm ospf-v2 xe-3/0/7.16 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Oct 18 06:58:02  re0.cr1-eqiad rpd[31216]: RPD_OSPF_NBRUP: OSPF neighbor 185.15.58.146 (realm ospf-v2 xe-3/0/7.16 area 0.0.0.0) state changed from Init to ExStart due to 2WayRcvd (event reason: neighbor detected this router)
Oct 18 06:58:03  re0.cr1-eqiad rpd[31216]: RPD_OSPF_NBRUP: OSPF neighbor 185.15.58.146 (realm ospf-v2 xe-3/0/7.16 area 0.0.0.0) state changed from Exchange to Full due to ExchangeDone (event reason: DBD exchange of master completed)
Oct 18 06:58:06  re0.cr1-eqiad bfdd[29381]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 1520, new state: up, interface: xe-4/2/2.16, peer addr: 185.15.58.146
Oct 18 06:58:07  re0.cr1-eqiad bfdd[29381]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 1533, new state: up, interface: xe-4/2/2.16, peer addr: 2a02:ec80:600:fe04::1
Oct 18 06:58:12  re0.cr1-eqiad bfdd[29381]: BFDD_STATE_UP_TO_DOWN: BFD Session 185.15.58.146 (IFL 418) state Up -> Down LD/RD(1520/28) Up time:00:00:06 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Oct 18 06:58:12  re0.cr1-eqiad bfdd[29381]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 1520, new state: down, interface: xe-4/2/2.16, peer addr: 185.15.58.146
Oct 18 06:58:12  re0.cr1-eqiad rpd[31216]: RPD_OSPF_NBRDOWN: OSPF neighbor 185.15.58.146 (realm ospf-v2 xe-3/0/7.16 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Oct 18 06:58:12  re0.cr1-eqiad rpd[31216]: RPD_OSPF_NBRUP: OSPF neighbor 185.15.58.146 (realm ospf-v2 xe-3/0/7.16 area 0.0.0.0) state changed from Init to ExStart due to 2WayRcvd (event reason: neighbor detected this router)
Oct 18 06:58:13  re0.cr1-eqiad rpd[31216]: RPD_OSPF_NBRUP: OSPF neighbor 185.15.58.146 (realm ospf-v2 xe-3/0/7.16 area 0.0.0.0) state changed from Exchange to Full due to ExchangeDone (event reason: DBD exchange of master completed)

Grepping the full config for "xe-4/2/2" didn't return anything other than the now-disabled interface.
This BFD adjacency is used by OSPF, OSPFv3 as well as BGP.
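
To make the stale interface easier to spot than by reading the raw syslog, the bfdd trap lines can be tallied per interface. The following is a minimal Python sketch, not something that was run as part of this task: the regular expression is derived only from the log lines quoted above, and the names `TRAP_RE`, `tally_bfd_traps` and `SAMPLE` are hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the BFDD_TRAP_SHOP_STATE_UP/_DOWN lines quoted in this
# task; other Junos releases may format these messages differently.
TRAP_RE = re.compile(
    r"BFDD_TRAP_SHOP_STATE_(?P<event>UP|DOWN): "
    r"local discriminator: (?P<ld>\d+), new state: \w+, "
    r"interface: (?P<ifl>[^,\s]+), peer addr: (?P<peer>\S+)"
)


def tally_bfd_traps(lines):
    """Count BFD trap events per (interface, peer address, event)."""
    counts = Counter()
    for line in lines:
        match = TRAP_RE.search(line)
        if match:
            counts[(match["ifl"], match["peer"], match["event"])] += 1
    return counts


# Three lines copied from the logs in this task.
SAMPLE = [
    "Oct 18 06:57:57  re0.cr1-eqiad bfdd[29381]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 1520, new state: up, interface: xe-4/2/2.16, peer addr: 185.15.58.146",
    "Oct 18 06:58:02  re0.cr1-eqiad bfdd[29381]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 1520, new state: down, interface: xe-4/2/2.16, peer addr: 185.15.58.146",
    "Oct 18 07:23:31  re0.cr1-eqiad bfdd[29381]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 1520, new state: down, interface: xe-3/0/7.16, peer addr: 185.15.58.146",
]

if __name__ == "__main__":
    for (ifl, peer, event), n in sorted(tally_bfd_traps(SAMPLE).items()):
        print(f"{ifl:14} {peer:24} {event:5} {n}")
```

Any event still attributed to xe-4/2/2.16 after the move points at a session that has not been rebuilt on the new interface.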

Bouncing the matching BGP sessions (v4 and v6) fixed that specific issue (BFD using the wrong interface).

Since then, the IPv6 BFD session (and the protocols that depend on it) has been up and stable.

However, the v4 side is still flapping:

Oct 18 07:23:31  re0.cr1-eqiad bfdd[29381]: BFDD_STATE_UP_TO_DOWN: BFD Session 185.15.58.146 (IFL 418) state Up -> Down LD/RD(1520/28) Up time:00:00:05 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Oct 18 07:23:31  re0.cr1-eqiad bfdd[29381]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 1520, new state: down, interface: xe-3/0/7.16, peer addr: 185.15.58.146
Oct 18 07:23:31  re0.cr1-eqiad rpd[31216]: RPD_OSPF_NBRDOWN: OSPF neighbor 185.15.58.146 (realm ospf-v2 xe-3/0/7.16 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Oct 18 07:23:31  re0.cr1-eqiad rpd[31216]: RPD_OSPF_NBRUP: OSPF neighbor 185.15.58.146 (realm ospf-v2 xe-3/0/7.16 area 0.0.0.0) state changed from Init to ExStart due to 2WayRcvd (event reason: neighbor detected this router)
Oct 18 07:23:32  re0.cr1-eqiad rpd[31216]: RPD_OSPF_NBRUP: OSPF neighbor 185.15.58.146 (realm ospf-v2 xe-3/0/7.16 area 0.0.0.0) state changed from Exchange to Full due to ExchangeDone (event reason: DBD exchange of master completed)
Oct 18 07:23:35  re0.cr1-eqiad bfdd[29381]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 1520, new state: up, interface: xe-3/0/7.16, peer addr: 185.15.58.146

Clearing BFD on either side didn't help.

Rapid ping (e.g. interval 0.2) between the v4 IPs works fine.
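
For comparison with the ping result, the lifetime of each v4 session can be read from the `Up time` field of the `BFDD_STATE_UP_TO_DOWN` lines. This is a rough Python sketch under the assumption that the field is always formatted `HH:MM:SS` as in the lines quoted in this task; `DOWN_RE`, `session_lifetimes` and `SAMPLE_DOWN` are hypothetical names.

```python
import re

# Pattern inferred from the BFDD_STATE_UP_TO_DOWN lines quoted in this task.
# Assumption: "Up time" is always HH:MM:SS (longer uptimes may be formatted
# differently and would simply not match).
DOWN_RE = re.compile(
    r"BFDD_STATE_UP_TO_DOWN: BFD Session (?P<peer>\S+) .*?"
    r"Up time:(?P<h>\d+):(?P<m>\d+):(?P<s>\d+) "
    r"Local diag: (?P<diag>\S+)"
)


def session_lifetimes(lines):
    """Return (peer, seconds the session stayed up, local diag) per Up -> Down event."""
    events = []
    for line in lines:
        match = DOWN_RE.search(line)
        if match:
            seconds = int(match["h"]) * 3600 + int(match["m"]) * 60 + int(match["s"])
            events.append((match["peer"], seconds, match["diag"]))
    return events


# Three v4 Up -> Down lines copied from the logs in this task.
SAMPLE_DOWN = [
    "Oct 18 06:58:02  re0.cr1-eqiad bfdd[29381]: BFDD_STATE_UP_TO_DOWN: BFD Session 185.15.58.146 (IFL 418) state Up -> Down LD/RD(1520/28) Up time:00:00:05 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.",
    "Oct 18 06:58:12  re0.cr1-eqiad bfdd[29381]: BFDD_STATE_UP_TO_DOWN: BFD Session 185.15.58.146 (IFL 418) state Up -> Down LD/RD(1520/28) Up time:00:00:06 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.",
    "Oct 18 07:23:31  re0.cr1-eqiad bfdd[29381]: BFDD_STATE_UP_TO_DOWN: BFD Session 185.15.58.146 (IFL 418) state Up -> Down LD/RD(1520/28) Up time:00:00:05 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.",
]

if __name__ == "__main__":
    for peer, seconds, diag in session_lifetimes(SAMPLE_DOWN):
        print(f"{peer} stayed up {seconds}s, local diag {diag}")
```

In the quoted lines the v4 session never stays up for more than a few seconds and always goes down with local diag CtlExpire, even though pings between the same addresses succeed.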

Event Timeline

ayounsi triaged this task as High priority. Oct 18 2022, 7:34 AM
ayounsi created this task.

I think I may have solved this, though not through anything logical; it was similar to the earlier BGP bounce that restored IPv6.

On cr1-eqiad I removed the interface from OSPF and, in the same commit, removed the IPv4 address from the interface. Once both were re-applied, the session came back up.

cmooney@re0.cr1-eqiad> show bfd session address 185.15.58.146 detail    
                                                  Detect   Transmit
Address                  State     Interface      Time     Interval  Multiplier
185.15.58.146            Up        xe-3/0/7.16    0.900     0.300        3   
 Client OSPF realm ospf-v2 Area 0.0.0.0, TX interval 0.300, RX interval 0.300
 Client BGP, TX interval 0.300, RX interval 0.300
 Session up time 00:11:51
 Local diagnostic None, remote diagnostic None
 Remote state Up, version 1
 Replicated 
 Session type: Single hop BFD

1 sessions, 2 clients
Cumulative transmit rate 3.3 pps, cumulative receive rate 3.3 pps
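
The 0.900 detect time in the output above follows from the interval and multiplier. As a sketch of that arithmetic in Python, based on the asynchronous-mode rule in RFC 5880 (detection time is the remote detect multiplier times the larger of the local required minimum RX interval and the remote desired minimum TX interval): the function names are hypothetical, and the remote side's 300 ms and multiplier 3 are assumed to mirror the local values shown, since cr2-drmrs's settings are not quoted here.

```python
def agreed_interval_ms(local_required_min_rx_ms: int, remote_desired_min_tx_ms: int) -> int:
    """Interval at which the remote end is expected to send control packets."""
    return max(local_required_min_rx_ms, remote_desired_min_tx_ms)


def detection_time_ms(local_required_min_rx_ms: int,
                      remote_desired_min_tx_ms: int,
                      remote_detect_mult: int) -> int:
    """Local detection time per RFC 5880 asynchronous mode, in milliseconds."""
    return remote_detect_mult * agreed_interval_ms(
        local_required_min_rx_ms, remote_desired_min_tx_ms
    )


if __name__ == "__main__":
    # Local values from the output above; remote values are assumed identical.
    print(detection_time_ms(300, 300, 3), "ms")
```

With these values the session is declared down after three consecutive control packets go missing, which matches the "Detect Timer Expiry" reason logged during the flaps.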

For the record:

cmooney@re0.cr1-eqiad# show | compare 
[edit interfaces xe-3/0/7 unit 16 family inet]
-       address 185.15.58.147/31;
[edit protocols ospf area 0.0.0.0]
-     interface xe-3/0/7.16 {
-         interface-type p2p;
-         link-protection;
-         metric 968;
-         bfd-liveness-detection {
-             minimum-interval 300;
-         }
-     }

{master}[edit]
cmooney@re0.cr1-eqiad# run show route 185.15.58.146 table inet.0    

inet.0: 894080 destinations, 4028040 routes (893018 active, 4 holddown, 4921 hidden)
Restart Complete
+ = Active Route, - = Last Active, * = Both

185.15.58.146/31   *[Direct/0] 16:02:15
                    >  via xe-3/0/7.16

{master}[edit]
cmooney@re0.cr1-eqiad# commit 
re0: 
configuration check succeeds
re1: 
commit complete
re0: 
commit complete

{master}[edit]
cmooney@re0.cr1-eqiad# run show route 185.15.58.146 table inet.0    

inet.0: 894032 destinations, 4027962 routes (892978 active, 0 holddown, 4912 hidden)
Restart Complete
+ = Active Route, - = Last Active, * = Both

185.15.58.146/31   *[OSPF/10] 00:00:02, metric 1850
                    >  to 185.15.58.139 via xe-3/1/4.0
                       to 91.198.174.251 via xe-3/0/7.13

{master}[edit]
cmooney@re0.cr1-eqiad# rollback 1  
load complete

{master}[edit]
cmooney@re0.cr1-eqiad# show | compare 
[edit interfaces xe-3/0/7 unit 16 family inet]
+       address 185.15.58.147/31;
[edit protocols ospf area 0.0.0.0]
      interface xe-3/0/7.13 { ... }
+     interface xe-3/0/7.16 {
+         interface-type p2p;
+         link-protection;
+         metric 968;
+         bfd-liveness-detection {
+             minimum-interval 300;
+         }
+     }
      interface xe-3/1/4.0 { ... }

{master}[edit]
cmooney@re0.cr1-eqiad# commit 
re0: 
configuration check succeeds
re1: 
commit complete
re0: 
commit complete

{master}[edit]
cmooney@re0.cr1-eqiad#

ayounsi assigned this task to cmooney.

Awesome, thanks! I cleared the Icinga downtimes now that it's all back to normal.