Page MenuHomePhabricator

BFD session alerts due to inconsistent status on cr3-knams
Open, Needs TriagePublic

Description

It has already happened more than once that a BFD session to cr3-knams is reported down in icinga, but the following happens:

  • on the other router/port, show bfd session reports ipv6 sessions to cr3-knams down (not ipv4 even the last time that happened IIRC)
  • on cr3-knams, show bfd session reports all up.
  • No ospf issue reported, afaik, on both sides (ospf/ospf3)

Executing clear bfd session address X on cr3-knams solves the problem. Logs of the last event:

  • Before execution of clear bfd session:
elukey@cr3-knams> show log messages | match bfd
Dec 12 23:23:24  cr3-knams bfdd[5731]: BFD Session 208.80.153.217 (IFL 355) state Up -> Down LD/RD(19/19) Up time:2d 21:50 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Dec 12 23:23:24  cr3-knams bfdd[5731]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 19, new state: down, interface: xe-0/1/5.23, peer addr: 208.80.153.217
Dec 12 23:23:24  cr3-knams rpd[5844]: bgp_bfd_callback:161: NOTIFICATION sent to 208.80.153.217 (External AS 65002): code 6 (Cease) subcode 9 (Hard Reset), Reason: BFD Session Down
Dec 12 23:23:24  cr3-knams rpd[5844]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 208.80.153.217 (External AS 65002) changed state from Established to Idle (event BfdDown) (instance master)
Dec 12 23:23:24  cr3-knams rpd[5844]: RPD_OSPF_NBRDOWN: OSPF neighbor 208.80.153.217 (realm ospf-v2 xe-0/1/5.23 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Dec 12 23:23:57  cr3-knams bfdd[5731]: BFD Session fe80::ee38:7300:17e8:a04e (IFL 355) state Up -> Down LD/RD(33/389) Up time:2d 21:51 Local diag: AdminDown Remote diag: None Reason: Received Upstream Destroy Session.
Dec 12 23:23:57  cr3-knams bfdd[5731]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 33, new state: down, interface: xe-0/1/5.23, peer addr: fe80::ee38:7300:17e8:a04e
Dec 12 23:24:16  cr3-knams bfdd[5731]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 19, new state: up, interface: xe-0/1/5.23, peer addr: 208.80.153.217
Dec 12 23:24:22  cr3-knams bfdd[5731]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 35, new state: up, interface: xe-0/1/5.23, peer addr: fe80::ee38:7300:17e8:a04e
Dec 12 23:42:04  cr3-knams bfdd[5731]: BFD Session 208.80.153.217 (IFL 355) state Up -> Down LD/RD(19/19) Up time:00:17:48 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Dec 12 23:42:04  cr3-knams bfdd[5731]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 19, new state: down, interface: xe-0/1/5.23, peer addr: 208.80.153.217
Dec 12 23:42:04  cr3-knams rpd[5844]: bgp_bfd_callback:161: NOTIFICATION sent to 208.80.153.217 (External AS 65002): code 6 (Cease) subcode 9 (Hard Reset), Reason: BFD Session Down
Dec 12 23:42:04  cr3-knams rpd[5844]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 208.80.153.217 (External AS 65002) changed state from Established to Idle (event BfdDown) (instance master)
Dec 12 23:42:04  cr3-knams rpd[5844]: RPD_OSPF_NBRDOWN: OSPF neighbor 208.80.153.217 (realm ospf-v2 xe-0/1/5.23 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Dec 12 23:42:23  cr3-knams bfdd[5731]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 19, new state: up, interface: xe-0/1/5.23, peer addr: 208.80.153.217
Dec 13 03:40:24  cr3-knams bfdd[5731]: BFD Session 208.80.153.217 (IFL 355) state Up -> Down LD/RD(19/19) Up time:03:58:02 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Dec 13 03:40:24  cr3-knams bfdd[5731]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 19, new state: down, interface: xe-0/1/5.23, peer addr: 208.80.153.217
Dec 13 03:40:38  cr3-knams bfdd[5731]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 19, new state: up, interface: xe-0/1/5.23, peer addr: 208.80.153.217
Dec 13 03:56:00  cr3-knams bfdd[5731]: BFD Session 208.80.153.217 (IFL 355) state Up -> Down LD/RD(19/19) Up time:00:15:23 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Dec 13 03:56:00  cr3-knams bfdd[5731]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 19, new state: down, interface: xe-0/1/5.23, peer addr: 208.80.153.217
Dec 13 03:56:11  cr3-knams bfdd[5731]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 19, new state: up, interface: xe-0/1/5.23, peer addr: 208.80.153.217
Dec 13 05:21:35  cr3-knams bfdd[5731]: BFD Session 208.80.153.217 (IFL 355) state Up -> Down LD/RD(19/19) Up time:01:25:24 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Dec 13 05:21:35  cr3-knams bfdd[5731]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 19, new state: down, interface: xe-0/1/5.23, peer addr: 208.80.153.217
Dec 13 05:21:35  cr3-knams rpd[5844]: bgp_bfd_callback:161: NOTIFICATION sent to 208.80.153.217 (External AS 65002): code 6 (Cease) subcode 9 (Hard Reset), Reason: BFD Session Down
Dec 13 05:21:35  cr3-knams rpd[5844]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 208.80.153.217 (External AS 65002) changed state from Established to Idle (event BfdDown) (instance master)
Dec 13 05:21:35  cr3-knams rpd[5844]: RPD_OSPF_NBRDOWN: OSPF neighbor 208.80.153.217 (realm ospf-v2 xe-0/1/5.23 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Dec 13 05:21:45  cr3-knams bfdd[5731]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 19, new state: up, interface: xe-0/1/5.23, peer addr: 208.80.153.217
  • After execution of clear bfd session:
Dec 13 07:54:14  cr3-knams mgd[4988]: UI_CMDLINE_READ_LINE: User 'elukey', command 'clear bfd session address fe80::ee38:7300:17e8:a04e '
Dec 13 07:54:14  cr3-knams bfdd[5731]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 35, new state: down, interface: xe-0/1/5.23, peer addr: fe80::ee38:7300:17e8:a04e
Dec 13 07:54:19  cr3-knams bfdd[5731]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 35, new state: up, interface: xe-0/1/5.23, peer addr: fe80::ee38:7300:17e8:a04e

Logs on the other side (this time eqdfw):

Dec 12 03:00:38  cr2-eqdfw bfdd[13844]: BFD Session fe80::ee38:7300:75:34ba (IFL 332) state Up -> Down LD/RD(384/175) Up time:1w3d 04:10 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Dec 12 03:00:38  cr2-eqdfw bfdd[13844]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 384, new state: down, interface: gr-0/0/0.1, peer addr: fe80::ee38:7300:75:34ba
Dec 12 03:00:38  cr2-eqdfw rpd[13967]: RPD_OSPF_NBRDOWN: OSPF neighbor fe80::ee38:7300:75:34ba (realm ipv6-unicast gr-0/0/0.1 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Dec 12 03:00:38  cr2-eqdfw rpd[13967]: RPD_PIM_NBRDOWN: Instance PIM.master: PIM neighbor fe80::ee38:7300:75:34ba (gr-0/0/0.1) removed due to: bfd state down
Dec 12 03:00:41  cr2-eqdfw bfdd[13844]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 384, new state: up, interface: gr-0/0/0.1, peer addr: fe80::ee38:7300:75:34ba
Dec 12 07:38:22  cr2-eqdfw bfdd[13844]: BFD Session 198.35.26.204 (IFL 332) state Up -> Down LD/RD(385/176) Up time:1w3d 08:47 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Dec 12 07:38:22  cr2-eqdfw bfdd[13844]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 385, new state: down, interface: gr-0/0/0.1, peer addr: 198.35.26.204
Dec 12 07:38:22  cr2-eqdfw rpd[13967]: RPD_PIM_NBRDOWN: Instance PIM.master: PIM neighbor 198.35.26.204 (gr-0/0/0.1) removed due to: bfd state down
Dec 12 07:38:22  cr2-eqdfw rpd[13967]: RPD_OSPF_NBRDOWN: OSPF neighbor 198.35.26.204 (realm ospf-v2 gr-0/0/0.1 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Dec 12 07:38:23  cr2-eqdfw bfdd[13844]: BFD Session fe80::ee38:7300:75:34ba (IFL 332) state Up -> Down LD/RD(384/175) Up time:04:37:41 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Dec 12 07:38:23  cr2-eqdfw bfdd[13844]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 384, new state: down, interface: gr-0/0/0.1, peer addr: fe80::ee38:7300:75:34ba
Dec 12 07:38:23  cr2-eqdfw rpd[13967]: RPD_PIM_NBRDOWN: Instance PIM.master: PIM neighbor fe80::ee38:7300:75:34ba (gr-0/0/0.1) removed due to: bfd state down
Dec 12 07:38:36  cr2-eqdfw bfdd[13844]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 385, new state: up, interface: gr-0/0/0.1, peer addr: 198.35.26.204
Dec 12 07:38:37  cr2-eqdfw bfdd[13844]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 384, new state: up, interface: gr-0/0/0.1, peer addr: fe80::ee38:7300:75:34ba
Dec 12 07:40:07  cr2-eqdfw bfdd[13844]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 385, new state: down, interface: gr-0/0/0.1, peer addr: 198.35.26.204
Dec 12 07:40:07  cr2-eqdfw rpd[13967]: RPD_OSPF_NBRDOWN: OSPF neighbor 198.35.26.204 (realm ospf-v2 gr-0/0/0.1 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Dec 12 07:40:07  cr2-eqdfw rpd[13967]: RPD_PIM_NBRDOWN: Instance PIM.master: PIM neighbor 198.35.26.204 (gr-0/0/0.1) removed due to: bfd state down
Dec 12 07:40:22  cr2-eqdfw bfdd[13844]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 384, new state: up, interface: gr-0/0/0.1, peer addr: fe80::ee38:7300:75:34ba
Dec 12 07:40:22  cr2-eqdfw bfdd[13844]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 385, new state: up, interface: gr-0/0/0.1, peer addr: 198.35.26.204
Dec 12 07:40:57  cr2-eqdfw bfdd[13844]: BFD Session fe80::ee38:7300:75:34ba (IFL 332) state Up -> Down LD/RD(384/175) Up time:00:00:35 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Dec 12 07:40:57  cr2-eqdfw bfdd[13844]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 384, new state: down, interface: gr-0/0/0.1, peer addr: fe80::ee38:7300:75:34ba
Dec 12 07:40:57  cr2-eqdfw rpd[13967]: RPD_OSPF_NBRDOWN: OSPF neighbor fe80::ee38:7300:75:34ba (realm ipv6-unicast gr-0/0/0.1 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Dec 12 07:40:57  cr2-eqdfw rpd[13967]: RPD_PIM_NBRDOWN: Instance PIM.master: PIM neighbor fe80::ee38:7300:75:34ba (gr-0/0/0.1) removed due to: bfd state down
Dec 12 07:40:57  cr2-eqdfw bfdd[13844]: BFD Session 198.35.26.204 (IFL 332) state Up -> Down LD/RD(385/176) Up time:00:00:36 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Dec 12 07:40:57  cr2-eqdfw bfdd[13844]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 385, new state: down, interface: gr-0/0/0.1, peer addr: 198.35.26.204
Dec 12 07:40:57  cr2-eqdfw rpd[13967]: RPD_OSPF_NBRDOWN: OSPF neighbor 198.35.26.204 (realm ospf-v2 gr-0/0/0.1 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Dec 12 07:40:57  cr2-eqdfw rpd[13967]: RPD_PIM_NBRDOWN: Instance PIM.master: PIM neighbor 198.35.26.204 (gr-0/0/0.1) removed due to: bfd state down
Dec 12 07:41:03  cr2-eqdfw bfdd[13844]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 384, new state: up, interface: gr-0/0/0.1, peer addr: fe80::ee38:7300:75:34ba
Dec 12 07:41:04  cr2-eqdfw bfdd[13844]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 385, new state: up, interface: gr-0/0/0.1, peer addr: 198.35.26.204
Dec 12 23:23:24  cr2-eqdfw bfdd[13844]: BFD Session 208.80.153.216 (IFL 340) state Up -> Down LD/RD(19/19) Up time:2d 21:50 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Dec 12 23:23:24  cr2-eqdfw bfdd[13844]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 19, new state: down, interface: xe-0/1/3.23, peer addr: 208.80.153.216
Dec 12 23:23:24  cr2-eqdfw rpd[13967]: bgp_bfd_callback:161: NOTIFICATION sent to 208.80.153.216 (External AS 65003): code 6 (Cease) subcode 9 (Hard Reset), Reason: BFD Session Down
Dec 12 23:23:24  cr2-eqdfw rpd[13967]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 208.80.153.216 (External AS 65003) changed state from Established to Idle (event BfdDown) (instance master)
Dec 12 23:23:24  cr2-eqdfw rpd[13967]: RPD_OSPF_NBRDOWN: OSPF neighbor 208.80.153.216 (realm ospf-v2 xe-0/1/3.23 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Dec 12 23:23:24  cr2-eqdfw bfdd[13844]: BFD Session fe80::7a4f:9b00:174e:8004 (IFL 340) state Up -> Down LD/RD(389/33) Up time:2d 21:50 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Dec 12 23:23:24  cr2-eqdfw bfdd[13844]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 389, new state: down, interface: xe-0/1/3.23, peer addr: fe80::7a4f:9b00:174e:8004
Dec 12 23:23:24  cr2-eqdfw rpd[13967]: RPD_OSPF_NBRDOWN: OSPF neighbor fe80::7a4f:9b00:174e:8004 (realm ipv6-unicast xe-0/1/3.23 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Dec 12 23:24:18  cr2-eqdfw bfdd[13844]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 19, new state: up, interface: xe-0/1/3.23, peer addr: 208.80.153.216
Dec 12 23:24:22  cr2-eqdfw bfdd[13844]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 390, new state: up, interface: xe-0/1/3.23, peer addr: fe80::7a4f:9b00:174e:8004
Dec 12 23:42:04  cr2-eqdfw bfdd[13844]: BFD Session fe80::7a4f:9b00:174e:8004 (IFL 340) state Up -> Down LD/RD(390/35) Up time:00:17:42 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Dec 12 23:42:04  cr2-eqdfw bfdd[13844]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 390, new state: down, interface: xe-0/1/3.23, peer addr: fe80::7a4f:9b00:174e:8004
Dec 12 23:42:04  cr2-eqdfw rpd[13967]: RPD_OSPF_NBRDOWN: OSPF neighbor fe80::7a4f:9b00:174e:8004 (realm ipv6-unicast xe-0/1/3.23 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Dec 12 23:42:05  cr2-eqdfw bfdd[13844]: BFD Session 208.80.153.216 (IFL 340) state Up -> Down LD/RD(19/19) Up time:00:17:46 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Dec 12 23:42:05  cr2-eqdfw bfdd[13844]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 19, new state: down, interface: xe-0/1/3.23, peer addr: 208.80.153.216
Dec 12 23:42:05  cr2-eqdfw rpd[13967]: bgp_bfd_callback:161: NOTIFICATION sent to 208.80.153.216 (External AS 65003): code 6 (Cease) subcode 9 (Hard Reset), Reason: BFD Session Down
Dec 12 23:42:05  cr2-eqdfw rpd[13967]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 208.80.153.216 (External AS 65003) changed state from Established to Idle (event BfdDown) (instance master)
Dec 12 23:42:05  cr2-eqdfw rpd[13967]: RPD_OSPF_NBRDOWN: OSPF neighbor 208.80.153.216 (realm ospf-v2 xe-0/1/3.23 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Dec 12 23:42:21  cr2-eqdfw bfdd[13844]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 19, new state: up, interface: xe-0/1/3.23, peer addr: 208.80.153.216
Dec 13 03:40:23  cr2-eqdfw bfdd[13844]: BFD Session 208.80.153.216 (IFL 340) state Up -> Down LD/RD(19/19) Up time:03:58:03 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Dec 13 03:40:23  cr2-eqdfw bfdd[13844]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 19, new state: down, interface: xe-0/1/3.23, peer addr: 208.80.153.216
Dec 13 03:40:23  cr2-eqdfw rpd[13967]: bgp_bfd_callback:161: NOTIFICATION sent to 208.80.153.216 (External AS 65003): code 6 (Cease) subcode 9 (Hard Reset), Reason: BFD Session Down
Dec 13 03:40:23  cr2-eqdfw rpd[13967]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 208.80.153.216 (External AS 65003) changed state from Established to Idle (event BfdDown) (instance master)
Dec 13 03:40:23  cr2-eqdfw rpd[13967]: RPD_OSPF_NBRDOWN: OSPF neighbor 208.80.153.216 (realm ospf-v2 xe-0/1/3.23 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Dec 13 03:40:35  cr2-eqdfw bfdd[13844]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 19, new state: up, interface: xe-0/1/3.23, peer addr: 208.80.153.216
Dec 13 03:55:59  cr2-eqdfw bfdd[13844]: BFD Session 208.80.153.216 (IFL 340) state Up -> Down LD/RD(19/19) Up time:00:15:24 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Dec 13 03:55:59  cr2-eqdfw bfdd[13844]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 19, new state: down, interface: xe-0/1/3.23, peer addr: 208.80.153.216
Dec 13 03:55:59  cr2-eqdfw rpd[13967]: bgp_bfd_callback:161: NOTIFICATION sent to 208.80.153.216 (External AS 65003): code 6 (Cease) subcode 9 (Hard Reset), Reason: BFD Session Down
Dec 13 03:55:59  cr2-eqdfw rpd[13967]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 208.80.153.216 (External AS 65003) changed state from Established to Idle (event BfdDown) (instance master)
Dec 13 03:55:59  cr2-eqdfw rpd[13967]: RPD_OSPF_NBRDOWN: OSPF neighbor 208.80.153.216 (realm ospf-v2 xe-0/1/3.23 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Dec 13 03:56:09  cr2-eqdfw bfdd[13844]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 19, new state: up, interface: xe-0/1/3.23, peer addr: 208.80.153.216
Dec 13 05:21:36  cr2-eqdfw bfdd[13844]: BFD Session 208.80.153.216 (IFL 340) state Up -> Down LD/RD(19/19) Up time:01:25:27 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
Dec 13 05:21:36  cr2-eqdfw bfdd[13844]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 19, new state: down, interface: xe-0/1/3.23, peer addr: 208.80.153.216
Dec 13 05:21:47  cr2-eqdfw bfdd[13844]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 19, new state: up, interface: xe-0/1/3.23, peer addr: 208.80.153.216
Dec 13 07:54:17  cr2-eqdfw bfdd[13844]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 391, new state: up, interface: xe-0/1/3.23, peer addr: fe80::7a4f:9b00:174e:8004

Event Timeline

Restricted Application added a project: Operations. · View Herald TranscriptDec 13 2019, 8:44 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

It happened again between cr1-eqiad and cr3-knams, keeping the session down for troubleshooting, underlying protocols are UP.

Went through all the release notes between S2 and S8, the only one mentioning BFD is https://kb.juniper.net/InfoCenter/index?page=content&id=TSB17605&actp=METADATA and still not very relevant.

Opened Juniper Service Request ID 2019-1218-0038. In addition to staying down on one side, BFD is not behaving as expected when both sides are UP (cr3-knams stays UP even if reachability issues).

An event happen, both sides IPv4 BFD detects and issue, which takes OSPF and PIM (v4) instantly down:
Dec 17 16:55:21  cr3-knams bfdd[5731]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 20, new state: down, interface: xe-0/1/5.13, peer addr: 91.198.174.250
Dec 17 16:55:21  cr3-knams rpd[5844]: RPD_OSPF_NBRDOWN: OSPF neighbor 91.198.174.250 (realm ospf-v2 xe-0/1/5.13 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Dec 17 16:55:21  cr3-knams rpd[5844]: RPD_PIM_NBRDOWN: Instance PIM.master: PIM neighbor 91.198.174.250 (xe-0/1/5.13) removed due to: bfd state down
Dec 17 16:55:21  re0.cr1-eqiad rpd[5239]: RPD_OSPF_NBRDOWN: OSPF neighbor 91.198.174.251 (realm ospf-v2 xe-4/2/2.13 area 0.0.0.0) state changed from Full to Init due to 1WayRcvd (event reason: neighbor is in one-way mode)
Dec 17 16:55:22  re0.cr1-eqiad bfdd[4873]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 84, new state: down, interface: xe-4/2/2.13, peer addr: 91.198.174.251
Dec 17 16:55:22  re0.cr1-eqiad rpd[5239]: RPD_PIM_NBRDOWN: Instance PIM.master: PIM neighbor 91.198.174.251 (xe-4/2/2.13) removed due to: bfd state down

Maybe the event stopped, and PIM (v4) detects its neighbor again, on one side only:
Dec 17 16:55:44  re0.cr1-eqiad rpd[5239]: RPD_PIM_NBRUP: Instance PIM.master: PIM new neighbor 91.198.174.251 interface xe-4/2/2.13

On the cr1-eqiad side only, IPv6 BFD detects a loss connectivity (that's 34s after v4) and takes IPv6 OSPF and PIM down):
Dec 17 16:55:55  re0.cr1-eqiad bfdd[4873]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 215, new state: down, interface: xe-4/2/2.13, peer addr: fe80::7a4f:9b00:d4e:8004
Dec 17 16:55:55  re0.cr1-eqiad rpd[5239]: RPD_OSPF_NBRDOWN: OSPF neighbor fe80::7a4f:9b00:d4e:8004 (realm ipv6-unicast xe-4/2/2.13 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: BFD session timed out and neighbor was declared dead)
Dec 17 16:55:55  re0.cr1-eqiad rpd[5239]: RPD_PIM_NBRDOWN: Instance PIM.master: PIM neighbor fe80::7a4f:9b00:d4e:8004 (xe-4/2/2.13) removed due to: bfd state down

On the cr3-knams side, BFD didn't detect the v6 issue, and OSPFv3 waits for its timers to detect the connectivity issue:
Dec 17 16:55:59  cr3-knams rpd[5844]: RPD_OSPF_NBRDOWN: OSPF neighbor fe80::5e5e:ab00:d3d:85ce (realm ipv6-unicast xe-0/1/5.13 area 0.0.0.0) state changed from Full to Down due to InActiveTimer (event reason: neighbor was inactive and declared dead)

On cr1-eqiad, v6 PIMs detects its neighbor:
Dec 17 16:56:13  re0.cr1-eqiad rpd[5239]: RPD_PIM_NBRUP: Instance PIM.master: PIM new neighbor fe80::7a4f:9b00:d4e:8004 interface xe-4/2/2.13

8s later, IPv4 BFD gets established on the cr3-knams side:
Dec 17 16:56:21  cr3-knams bfdd[5731]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 20, new state: up, interface: xe-0/1/5.13, peer addr: 91.198.174.250

5s later, OSPFv3 comes back up on both sides, without any notification of BFD v6  coming back up:
Dec 17 16:56:26  cr3-knams rpd[5844]: RPD_OSPF_NBRUP: OSPF neighbor fe80::5e5e:ab00:d3d:85ce (realm ipv6-unicast xe-0/1/5.13 area 0.0.0.0) state changed from Exchange to Full due to ExchangeDone (event reason: DBD exchange of slave completed)
Dec 17 16:56:26  cr3-knams rpd[5844]: RPD_OSPF_NBRUP: OSPF neighbor fe80::5e5e:ab00:d3d:85ce (realm ipv6-unicast xe-0/1/5.13 area 0.0.0.0) state changed from Init to ExStart due to 2WayRcvd (event reason: neighbor detected this router)
Dec 17 16:56:26  re0.cr1-eqiad rpd[5239]: RPD_OSPF_NBRUP: OSPF neighbor fe80::7a4f:9b00:d4e:8004 (realm ipv6-unicast xe-4/2/2.13 area 0.0.0.0) state changed from Exchange to Full due to ExchangeDone (event reason: DBD exchange of master completed)
Dec 17 16:56:26  re0.cr1-eqiad rpd[5239]: RPD_OSPF_NBRUP: OSPF neighbor fe80::7a4f:9b00:d4e:8004 (realm ipv6-unicast xe-4/2/2.13 area 0.0.0.0) state changed from Init to ExStart due to 2WayRcvd (event reason: neighbor detected this router)

Then IPv4 for OSPF, and BFD (but still no mention of BFD v6) on both sides:
Dec 17 16:56:28  cr3-knams rpd[5844]: RPD_OSPF_NBRUP: OSPF neighbor 91.198.174.250 (realm ospf-v2 xe-0/1/5.13 area 0.0.0.0) state changed from Init to ExStart due to 2WayRcvd (event reason: neighbor detected this router)
Dec 17 16:56:28  re0.cr1-eqiad rpd[5239]: RPD_OSPF_NBRUP: OSPF neighbor 91.198.174.251 (realm ospf-v2 xe-4/2/2.13 area 0.0.0.0) state changed from Init to ExStart due to 2WayRcvd (event reason: neighbor detected this router)
Dec 17 16:56:29  cr3-knams rpd[5844]: RPD_OSPF_NBRUP: OSPF neighbor 91.198.174.250 (realm ospf-v2 xe-0/1/5.13 area 0.0.0.0) state changed from Exchange to Full due to ExchangeDone (event reason: DBD exchange of slave completed)
Dec 17 16:56:29  re0.cr1-eqiad bfdd[4873]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 84, new state: up, interface: xe-4/2/2.13, peer addr: 91.198.174.251
Dec 17 16:56:29  re0.cr1-eqiad rpd[5239]: RPD_OSPF_NBRUP: OSPF neighbor 91.198.174.251 (realm ospf-v2 xe-4/2/2.13 area 0.0.0.0) state changed from Exchange to Full due to ExchangeDone (event reason: DBD exchange of master completed)

Last cr3-knams finds its v4 neighbor:
Dec 17 16:56:40  cr3-knams rpd[5844]: RPD_PIM_NBRUP: Instance PIM.master: PIM new neighbor 91.198.174.250 interface xe-0/1/5.13
ayounsi claimed this task.Dec 18 2019, 8:40 AM

Mentioned in SAL (#wikimedia-operations) [2019-12-31T10:23:33Z] <elukey> execute 'clear bfd session address fe80::5e5e:ab00:d3d:85ce' on cr3-knams - T240659

Mentioned in SAL (#wikimedia-operations) [2020-01-02T13:00:31Z] <XioNoX> enable BFD traceoptions on cr1-eqiad and cr3-knams - T240659

Mentioned in SAL (#wikimedia-operations) [2020-01-09T07:40:32Z] <XioNoX> enable traceoptions for BFD on cr2-eqdfw - T240659

Mentioned in SAL (#wikimedia-operations) [2020-01-13T07:30:50Z] <XioNoX> cr3-knams> clear bfd session fe80::5e5e:ab00:d3d:85c - T240659

Removed BFD traceoptions on cr1-eqiad, keeping knams-eqdfw down for JTAC investigation.

From JTAC:

Been checking this issue with one of my seniors, on MX204 cr3-knams can we set on the below command:

set routing-options ppm no-delegate-processing

And cofigured the below filter configured on interfaces  xe-0/1/3.23 OF cr2-eqdfw AND xe-0/1/5.23 OF cr3-knams:

User@Router# show firewall filter test-in
term 10 {
                from {
                                protocol udp;
                                port 3784;
                }
                then {
                                count bfd-in;
                                syslog;
                                accept;
                }
}
term 20 {
                                then accept;
}

[edit]
User@Router# show firewall filter test-out
term 10 {
                from {
                                protocol udp;
                                port 3784;
                }
                then {
                                count bfd-out;
                                syslog;
                                accept;
                }
}

term 20 {
                                then accept;
}
 
[edit]
User@Router# show system syslog
                file bfd-fw {
                firewall any;
}

Collect bfd-fw file and run several captures of the below

show firewall filter test-out
show firewall filter test-in

Mentioned in SAL (#wikimedia-operations) [2020-01-21T19:22:20Z] <XioNoX> cr3-knams# set routing-options ppm no-delegate-processing - T240659

Note that the above probably reset the sessions, as they are now up.

Leaving it configured until the issue happen again.