
cr4-ulsfo<>cr2-eqsin GRE tunnel flapping due to BFD timer expired
Closed, Resolved (Public)

Description

Since 20:21:46 UTC on Friday 2021-01-29, the GRE tunnel between cr4-ulsfo and cr2-eqsin has been flapping a few times a minute.

This is affecting the IPv4 and IPv6 OSPF sessions between the two sites.

While the GRE tunnel is only a backup, this does put the eqsin site at N+0 for connectivity to the rest of prod.

However, I am also hesitant to proactively depool eqsin for the whole weekend, as that adds a significant latency penalty.

Event Timeline

CDanis triaged this task as High priority. Jan 29 2021, 9:21 PM

The first few cycles of logs from the ulsfo side:

Jan 29 20:21:46  cr4-ulsfo bfdd[16019]: BFD Session fe80::827f:f800:43:6b66 (IFL 75) state Up -> Down LD/RD(159/26) Up time:4d 18:26 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Jan 29 20:21:46  cr4-ulsfo bfdd[16019]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 159, new state: down, interface: gr-0/0/0.1, peer addr: fe80::827f:f800:43:6b66
Jan 29 20:21:46  cr4-ulsfo rpd[16292]: RPD_OSPF_NBRDOWN: OSPF neighbor fe80::827f:f800:43:6b66 (realm ipv6-unicast gr-0/0/0.1 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Jan 29 20:21:46  cr4-ulsfo rpd[16292]: RPD_OSPF_NBRUP: OSPF neighbor fe80::827f:f800:43:6b66 (realm ipv6-unicast gr-0/0/0.1 area 0.0.0.0) state changed from Init to ExStart due to 2WayRcvd (event reason: neighbor detected this router)
Jan 29 20:21:47  cr4-ulsfo bfdd[16019]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 159, new state: up, interface: gr-0/0/0.1, peer addr: fe80::827f:f800:43:6b66
Jan 29 20:22:09  cr4-ulsfo rpd[16292]: RPD_OSPF_NBRUP: OSPF neighbor fe80::827f:f800:43:6b66 (realm ipv6-unicast gr-0/0/0.1 area 0.0.0.0) state changed from Exchange to Full due to ExchangeDone (event reason: DBD exchange of master completed)
Jan 29 20:22:39  cr4-ulsfo bfdd[16019]: BFD Session fe80::827f:f800:43:6b66 (IFL 75) state Up -> Down LD/RD(159/26) Up time:00:00:52 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Jan 29 20:22:39  cr4-ulsfo bfdd[16019]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 159, new state: down, interface: gr-0/0/0.1, peer addr: fe80::827f:f800:43:6b66
Jan 29 20:22:39  cr4-ulsfo rpd[16292]: RPD_OSPF_NBRDOWN: OSPF neighbor fe80::827f:f800:43:6b66 (realm ipv6-unicast gr-0/0/0.1 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Jan 29 20:22:39  cr4-ulsfo rpd[16292]: RPD_OSPF_NBRUP: OSPF neighbor fe80::827f:f800:43:6b66 (realm ipv6-unicast gr-0/0/0.1 area 0.0.0.0) state changed from Init to ExStart due to 2WayRcvd (event reason: neighbor detected this router)
Jan 29 20:22:41  cr4-ulsfo rpd[16292]: RPD_OSPF_NBRUP: OSPF neighbor fe80::827f:f800:43:6b66 (realm ipv6-unicast gr-0/0/0.1 area 0.0.0.0) state changed from Exchange to Full due to ExchangeDone (event reason: DBD exchange of master completed)
Jan 29 20:22:43  cr4-ulsfo bfdd[16019]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 159, new state: up, interface: gr-0/0/0.1, peer addr: fe80::827f:f800:43:6b66
Jan 29 20:22:50  cr4-ulsfo bfdd[16019]: BFD Session fe80::827f:f800:43:6b66 (IFL 75) state Up -> Down LD/RD(159/26) Up time:00:00:06 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.

and from the eqsin side:

Jan 29 20:21:46  cr2-eqsin bfdd[6957]: BFD Session fe80::ee38:7300:75:34ba (IFL 79) state Up -> Down LD/RD(26/159) Up time:4d 18:26 Local diag: NbrSignal Remote diag: CtlExpire Reason: Received ADMINDOWN from PEER.
Jan 29 20:21:46  cr2-eqsin rpd[7049]: RPD_OSPF_NBRDOWN: OSPF neighbor fe80::ee38:7300:75:34ba (realm ipv6-unicast gr-0/0/0.1 area 0.0.0.0) state changed from Full to Init due to 1WayRcvd (event reason: neighbor is in one-way mode)
Jan 29 20:21:46  cr2-eqsin bfdd[6957]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 26, new state: down, interface: gr-0/0/0.1, peer addr: fe80::ee38:7300:75:34ba
Jan 29 20:21:46  cr2-eqsin rpd[7049]: RPD_OSPF_NBRUP: OSPF neighbor fe80::ee38:7300:75:34ba (realm ipv6-unicast gr-0/0/0.1 area 0.0.0.0) state changed from Init to ExStart due to 2WayRcvd (event reason: neighbor detected this router)
Jan 29 20:21:48  cr2-eqsin bfdd[6957]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 26, new state: up, interface: gr-0/0/0.1, peer addr: fe80::ee38:7300:75:34ba
Jan 29 20:22:09  cr2-eqsin rpd[7049]: RPD_OSPF_NBRUP: OSPF neighbor fe80::ee38:7300:75:34ba (realm ipv6-unicast gr-0/0/0.1 area 0.0.0.0) state changed from Exchange to Full due to ExchangeDone (event reason: DBD exchange of slave completed)
Jan 29 20:22:39  cr2-eqsin rpd[7049]: RPD_OSPF_NBRDOWN: OSPF neighbor fe80::ee38:7300:75:34ba (realm ipv6-unicast gr-0/0/0.1 area 0.0.0.0) state changed from Full to Init due to 1WayRcvd (event reason: neighbor is in one-way mode)
Jan 29 20:22:39  cr2-eqsin bfdd[6957]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 26, new state: down, interface: gr-0/0/0.1, peer addr: fe80::ee38:7300:75:34ba
Jan 29 20:22:39  cr2-eqsin bfdd[6957]: BFD Session fe80::ee38:7300:75:34ba (IFL 79) state Up -> Down LD/RD(26/159) Up time:00:00:50 Local diag: NbrSignal Remote diag: CtlExpire Reason: Received ADMINDOWN from PEER.
Jan 29 20:22:39  cr2-eqsin rpd[7049]: RPD_OSPF_NBRUP: OSPF neighbor fe80::ee38:7300:75:34ba (realm ipv6-unicast gr-0/0/0.1 area 0.0.0.0) state changed from Init to ExStart due to 2WayRcvd (event reason: neighbor detected this router)
Jan 29 20:22:40  cr2-eqsin rpd[7049]: RPD_OSPF_NBRUP: OSPF neighbor fe80::ee38:7300:75:34ba (realm ipv6-unicast gr-0/0/0.1 area 0.0.0.0) state changed from Exchange to Full due to ExchangeDone (event reason: DBD exchange of slave completed)
Jan 29 20:22:42  cr2-eqsin bfdd[6957]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 26, new state: up, interface: gr-0/0/0.1, peer addr: fe80::ee38:7300:75:34ba
Jan 29 20:22:50  cr2-eqsin bfdd[6957]: BFD Session fe80::ee38:7300:75:34ba (IFL 79) state Up -> Down LD/RD(26/159) Up time:00:00:08 Local diag: NbrSignal Remote diag: CtlExpire Reason: Received ADMINDOWN from PEER.

Since the detect timer is expiring on the ulsfo side (Local diag: CtlExpire) while eqsin only reports NbrSignal, I'm guessing the issue is packet loss on the eqsin-->ulsfo path?
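The reasoning behind that guess can be sketched in a few lines of Python. In BFD (RFC 5880), the CtlExpire diagnostic corresponds to "Control Detection Time Expired", meaning the local router stopped receiving control packets from its peer, while NbrSignal means the peer signalled the session down. The router that logs a local CtlExpire is therefore the receiving end of the direction that is probably losing packets. The function name and the sample dictionary below are hypothetical, and the result is only a hint: it does not rule out other causes, such as the sender failing to transmit on time.

```python
import re

# Matches the diag fields of the bfdd "state Up -> Down" lines quoted above.
DIAG_RE = re.compile(r"Local diag: (\w+) Remote diag: (\w+)")

def detecting_side(lines_by_router):
    """Return the routers whose own detect timer expired (Local diag: CtlExpire).

    Loss is most likely in the direction *towards* these routers.
    """
    detectors = set()
    for router, lines in lines_by_router.items():
        for line in lines:
            m = DIAG_RE.search(line)
            if m and m.group(1) == "CtlExpire":
                detectors.add(router)
    return detectors

# One session-down line from each side, copied from the logs in this task.
SAMPLE = {
    "cr4-ulsfo": [
        "Jan 29 20:21:46  cr4-ulsfo bfdd[16019]: BFD Session fe80::827f:f800:43:6b66 (IFL 75) "
        "state Up -> Down LD/RD(159/26) Up time:4d 18:26 Local diag: CtlExpire "
        "Remote diag: None Reason: Detect Timer Expiry.",
    ],
    "cr2-eqsin": [
        "Jan 29 20:21:46  cr2-eqsin bfdd[6957]: BFD Session fe80::ee38:7300:75:34ba (IFL 79) "
        "state Up -> Down LD/RD(26/159) Up time:4d 18:26 Local diag: NbrSignal "
        "Remote diag: CtlExpire Reason: Received ADMINDOWN from PEER.",
    ],
}

print(detecting_side(SAMPLE))
```

On the two sample lines only cr4-ulsfo is reported, which is consistent with loss on the eqsin-->ulsfo direction.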

ayounsi claimed this task.

Thanks, looks like the last flap was at 22:06:55 on Jan 29. As the tunnel runs over the wild Internet, there is nobody to complain to, and it was most likely a provider issue that is now fixed.
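For anyone wanting to check when a flap storm like this one stopped, a minimal Python sketch follows. It pulls the bfdd "state Up -> Down" transitions out of syslog lines in the format quoted in this task and returns their timestamps, so the latest one can be read off. The function name is hypothetical, and because syslog timestamps carry no year, the year is passed in as an assumption.

```python
import re
from datetime import datetime

# Matches bfdd session-down lines in the format quoted in this task.
DOWN_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+) bfdd\[\d+\]: "
    r"BFD Session \S+ .*?state Up -> Down .*?Up time:(?P<up>.+?) Local diag"
)

def bfd_down_events(lines, year=2021):
    """Return (timestamp, host, uptime_before_flap) for each Up -> Down line."""
    events = []
    for line in lines:
        m = DOWN_RE.match(line)
        if m:
            # Syslog omits the year, so prepend the assumed one before parsing.
            when = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            events.append((when, m["host"], m["up"]))
    return events

# Sample lines copied from the ulsfo log above (one non-matching trap line included).
LOG = [
    "Jan 29 20:21:46  cr4-ulsfo bfdd[16019]: BFD Session fe80::827f:f800:43:6b66 (IFL 75) state Up -> Down LD/RD(159/26) Up time:4d 18:26 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.",
    "Jan 29 20:21:46  cr4-ulsfo bfdd[16019]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 159, new state: down, interface: gr-0/0/0.1, peer addr: fe80::827f:f800:43:6b66",
    "Jan 29 20:22:39  cr4-ulsfo bfdd[16019]: BFD Session fe80::827f:f800:43:6b66 (IFL 75) state Up -> Down LD/RD(159/26) Up time:00:00:52 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.",
    "Jan 29 20:22:50  cr4-ulsfo bfdd[16019]: BFD Session fe80::827f:f800:43:6b66 (IFL 75) state Up -> Down LD/RD(159/26) Up time:00:00:06 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.",
]

events = bfd_down_events(LOG)
last_flap = max(when for when, _host, _up in events)
print(len(events), "flaps, last at", last_flap)
```

On these three sample lines the last flap is 20:22:50; run against the full log from either router, the same maximum gives the time of the final flap.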