
cr1-esams listing transport ospf interfaces multiple times
Closed, Resolved (Public)

Description

We have an unusual issue on cr1-esams after this week's work. Specifically, for interfaces xe-0/0/2 and xe-0/0/7 the OSPF process lists each interface multiple times in the output of "show ospf interface".

root@re0.cr1-esams> show ospf interface xe-0/0/2.0 detail 
Interface           State   Area            DR ID           BDR ID          Nbrs
xe-0/0/2.0          PtToPt  0.0.0.0         0.0.0.0         0.0.0.0            1
  Type: P2P, Address: 185.15.59.146, Mask: 255.255.255.254, MTU: 9178, Cost: 5000
  Adj count: 1
  Hello: 10, Dead: 40, ReXmit: 5, Not Stub
  Auth type: MD5, Active key ID: 1, Start time: 1970 Jan  1 00:00:00 UTC
  Protection type: Link
  Topology default (ID 0) -> Cost: 5000
xe-0/0/2.0          PtToPt  0.0.0.0         0.0.0.0         0.0.0.0            0
  Type: P2P, Address: 185.15.59.148, Mask: 255.255.255.254, MTU: 9178, Cost: 5000
  Adj count: 0
  Hello: 10, Dead: 40, ReXmit: 5, Not Stub
  Auth type: MD5, Active key ID: 1, Start time: 1970 Jan  1 00:00:00 UTC
  Protection type: Link
  Topology default (ID 0) -> Cost: 5000
xe-0/0/2.0          PtToPt  0.0.0.0         0.0.0.0         0.0.0.0            0
  Type: P2P, Address: 91.198.174.224, Mask: 255.255.255.254, MTU: 9178, Cost: 5000
  Adj count: 0
  Hello: 10, Dead: 40, ReXmit: 5, Not Stub
  Auth type: MD5, Active key ID: 1, Start time: 1970 Jan  1 00:00:00 UTC
  Protection type: Link                 
  Topology default (ID 0) -> Cost: 5000
root@re0.cr1-esams> show ospf interface xe-0/0/7.0 detail    
Interface           State   Area            DR ID           BDR ID          Nbrs
xe-0/0/7.0          PtToPt  0.0.0.0         0.0.0.0         0.0.0.0            1
  Type: P2P, Address: 185.15.59.149, Mask: 255.255.255.254, MTU: 9178, Cost: 5000
  Adj count: 1
  Hello: 10, Dead: 40, ReXmit: 5, Not Stub
  Auth type: MD5, Active key ID: 1, Start time: 1970 Jan  1 00:00:00 UTC
  Protection type: Link
  Topology default (ID 0) -> Cost: 5000
xe-0/0/7.0          PtToPt  0.0.0.0         0.0.0.0         0.0.0.0            0
  Type: P2P, Address: 91.198.174.249, Mask: 255.255.255.254, MTU: 9178, Cost: 5000
  Adj count: 0
  Hello: 10, Dead: 40, ReXmit: 5, Not Stub
  Auth type: MD5, Active key ID: 1, Start time: 1970 Jan  1 00:00:00 UTC
  Protection type: Link
  Topology default (ID 0) -> Cost: 5000

On each interface we have a working adjacency, listed against the IP currently configured on that interface. The additional 'interface' listings show IP addresses that were previously configured on those interfaces but have since been changed due to renumbering (or, in one case, an IP that was briefly on the wrong interface due to a typo).

Very odd. For instance the 91.198.174.224 IP is not in the device configuration anywhere, nor on the network. It's like the OSPF process has cached the IPs formerly on the interface and is showing them. BFD is down over xe-0/0/7 to eqiad, which I believe may be related.
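For anyone wanting to check other routers for the same symptom, here is a rough Python sketch that scans saved "show ospf interface detail" output and reports interfaces listed more than once, along with the addresses that have no neighbours. The function names and the regular expressions are my own assumptions based only on the output pasted above; other Junos versions may format the output differently.

```python
import re
from collections import defaultdict

IPV4 = r"(?:\d+\.){3}\d+"

# Summary rows look like:
# "xe-0/0/2.0   PtToPt  0.0.0.0  0.0.0.0  0.0.0.0   1"
# (layout assumed from the output pasted in this task)
SUMMARY_RE = re.compile(
    r"^(?P<ifname>\S+)\s+(?P<state>\S+)\s+"
    + IPV4 + r"\s+" + IPV4 + r"\s+" + IPV4
    + r"\s+(?P<nbrs>\d+)\s*$"
)
ADDRESS_RE = re.compile(r"Address: (?P<addr>[0-9.]+),")

SAMPLE = """\
root@re0.cr1-esams> show ospf interface xe-0/0/7.0 detail
Interface           State   Area            DR ID           BDR ID          Nbrs
xe-0/0/7.0          PtToPt  0.0.0.0         0.0.0.0         0.0.0.0            1
  Type: P2P, Address: 185.15.59.149, Mask: 255.255.255.254, MTU: 9178, Cost: 5000
  Adj count: 1
xe-0/0/7.0          PtToPt  0.0.0.0         0.0.0.0         0.0.0.0            0
  Type: P2P, Address: 91.198.174.249, Mask: 255.255.255.254, MTU: 9178, Cost: 5000
  Adj count: 0
"""


def parse_ospf_interfaces(text):
    """Return {interface: [(address, neighbour_count), ...]} in listing order."""
    entries = defaultdict(list)
    current = None  # (interface, neighbour count) of the last summary row seen
    for line in text.splitlines():
        summary = SUMMARY_RE.match(line)
        if summary:
            current = (summary["ifname"], int(summary["nbrs"]))
            continue
        address = ADDRESS_RE.search(line)
        if address and current:
            ifname, nbrs = current
            entries[ifname].append((address["addr"], nbrs))
            current = None
    return dict(entries)


def stale_entries(entries):
    """Map each interface listed more than once to its zero-neighbour addresses."""
    return {
        ifname: [addr for addr, nbrs in rows if nbrs == 0]
        for ifname, rows in entries.items()
        if len(rows) > 1
    }


if __name__ == "__main__":
    for ifname, addrs in stale_entries(parse_ospf_interfaces(SAMPLE)).items():
        print(f"{ifname}: possible stale addresses {addrs}")
```

Run against the xe-0/0/7.0 sample above it flags 91.198.174.249 as the likely stale entry. An interface that appears only once with zero neighbours is deliberately not reported, since that is just a link with no adjacency rather than this bug.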

OSPFv3 for IPv6 is fine:

root@re0.cr1-esams> show ospf3 interface  
Interface           State   Area            DR ID           BDR ID          Nbrs
ae0.0               PtToPt  0.0.0.0         0.0.0.0         0.0.0.0            1
lo0.0               DRother 0.0.0.0         0.0.0.0         0.0.0.0            0
xe-0/0/2.0          PtToPt  0.0.0.0         0.0.0.0         0.0.0.0            1
xe-0/0/7.0          PtToPt  0.0.0.0         0.0.0.0         0.0.0.0            1

Given this looks like a bug I've done the following in an attempt to kick the router into behaving itself:

  1. Shutting / unshutting the relevant interfaces
  2. Clearing OSPF for the interfaces
  3. Clearing the entire OSPF database
  4. De-activating and re-activating OSPF completely

Despite this the peculiarity remains. Searching online I don't see any similar issues although it may be worth digging deeper.

Given esams is currently down I am going to take the ultimate "kick it" measure and reboot the CR to see if it fixes it. If not we will have to review / possibly take to JTAC.

Event Timeline

cmooney triaged this task as Medium priority.

Mentioned in SAL (#wikimedia-operations) [2023-08-19T08:37:22Z] <topranks> downtiming esams hosts ahead of core router (cr1-esams) reboot T344546

Icinga downtime and Alertmanager silence (ID=9daa4455-f5dd-4a7a-8558-107a3e4842a2) set by cmooney@cumin1001 for 2:00:00 on 29 host(s) and their services with reason: Downtime esams hosts prior to migration week.

cp[3066-3081].esams.wmnet,durum[3003-3004].esams.wmnet,ganeti[3005-3008].esams.wmnet,lvs[3008-3010].esams.wmnet,ncredir[3003-3004].esams.wmnet,netflow3003.esams.wmnet,prometheus3003.esams.wmnet

@ayounsi advised to try running a "commit full" on the router rather than a reboot.

Apparently "commit full" makes every daemon re-read and evaluate the whole configuration, whereas a plain "commit" only involves the daemons affected by the change? TIL.

Anyway, following this our problem has disappeared:

root@re0.cr1-esams> show ospf interface 
Interface           State   Area            DR ID           BDR ID          Nbrs
ae0.0               PtToPt  0.0.0.0         0.0.0.0         0.0.0.0            1
lo0.0               DRother 0.0.0.0         0.0.0.0         0.0.0.0            0
xe-0/0/2.0          PtToPt  0.0.0.0         0.0.0.0         0.0.0.0            1
xe-0/0/7.0          PtToPt  0.0.0.0         0.0.0.0         0.0.0.0            1

{master}

BFD has also come up across the link to cr2-eqiad:

root@re0.cr1-esams> show bfd session 
                                                  Detect   Transmit
Address                  State     Interface      Time     Interval  Multiplier
80.249.209.211           Up        ae1.380        0.900     0.300        3   
185.15.59.147            Up        xe-0/0/2.0     0.900     0.300        3   
185.15.59.148            Up        xe-0/0/7.0     0.900     0.300        3   
2001:7f8:1::a500:3320:1  Up        ae1.380        0.900     0.300        3   
fe80::8618:88ff:fe0d:dc83 Up       xe-0/0/7.0     0.900     0.300        3   
fe80::a6e1:1aff:fe6f:d3a2 Up       xe-0/0/2.0     0.900     0.300        3

Aug 19 09:09:06  re0.cr2-eqiad bfdd[29401]: BFDD_TRAP_SHOP_STATE_UP: local discriminator: 2039, new state: up, interface: xe-3/2/1.0, peer addr: 185.15.59.149
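If we want to verify this after similar changes in future, a small Python sketch along these lines could parse saved "show bfd session" output and list any session that is not Up. The six-column layout and the set of state strings are assumptions taken from the output above, and the function names are hypothetical.

```python
# State strings assumed for illustration; only "Up" appears in this task.
BFD_STATES = {"Up", "Down", "Init", "AdminDown"}

SAMPLE_BFD = """\
                                                  Detect   Transmit
Address                  State     Interface      Time     Interval  Multiplier
80.249.209.211           Up        ae1.380        0.900     0.300        3
185.15.59.147            Up        xe-0/0/2.0     0.900     0.300        3
185.15.59.148            Up        xe-0/0/7.0     0.900     0.300        3
fe80::8618:88ff:fe0d:dc83 Up       xe-0/0/7.0     0.900     0.300        3
"""


def parse_bfd_sessions(text):
    """Return one dict per session row; header and unrelated lines are skipped."""
    sessions = []
    for line in text.splitlines():
        fields = line.split()
        # A session row is assumed to have exactly six whitespace-separated
        # columns, with a recognised state in the second column.
        if len(fields) == 6 and fields[1] in BFD_STATES:
            sessions.append(
                {
                    "address": fields[0],
                    "state": fields[1],
                    "interface": fields[2],
                    "detect_time": float(fields[3]),
                    "tx_interval": float(fields[4]),
                    "multiplier": int(fields[5]),
                }
            )
    return sessions


def sessions_not_up(sessions):
    """Return the sessions whose state is anything other than Up."""
    return [s for s in sessions if s["state"] != "Up"]


if __name__ == "__main__":
    problems = sessions_not_up(parse_bfd_sessions(SAMPLE_BFD))
    for s in problems:
        print(f"{s['address']} on {s['interface']} is {s['state']}")
    if not problems:
        print("all BFD sessions Up")
```

Splitting on whitespace rather than fixed column offsets is deliberate, because long IPv6 link-local addresses overflow the Address column as seen above.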

Hopefully that's the last we have to worry about this one.