We seem to have had a vlan-wide problem this evening which disrupted comms to multiple hosts on the private1-c-eqiad vlan for approx 9 minutes.
Multiple host down alerts were received, along with higher-level application errors and host BGP errors which were the result of the lower-layer comms issue.
Checking logs, the disruption seems to coincide with these entries on ssw1-d1-eqiad:
2025-11-11T03:03:20.562 sr_l2_mac_mgr: bridgetable|6555|N: A duplicate MAC address 5C:5E:AB:3D:81:54 was detected on vlan-1019.
2025-11-11T03:03:20.562 sr_l2_mac_mgr: bridgetable|6555|N: A duplicate MAC address 5C:5E:AB:3D:81:54 was detected on sub-interface lag2.1019.
2025-11-11T03:12:21.100 sr_l2_mac_mgr: bridgetable|6555|N: A duplicate MAC address 5C:5E:AB:3D:81:54 detected on sub-interface lag2.1019 is now deleted.
2025-11-11T03:12:21.100 sr_l2_mac_mgr: bridgetable|6555|N: A duplicate MAC address 5C:5E:AB:3D:81:54 detected on vlan-1019 is now deleted.
This is the unicast MAC address of et-1/0/5.1019 on cr1-eqiad:
cmooney@re0.cr1-eqiad> show interfaces et-1/0/5 | match "Description|Current address"
Description: Core: ssw1-d1-eqiad:ethernet-1/32 {#B00397}
Current address: 5c:5e:ab:3d:81:54, Hardware address: 5c:5e:ab:3d:81:54
In the current topology this should only be learnt on the other side of that port - ssw1-d1-eqiad ethernet-1/32.1019 - as it currently is:
A:ssw1-d1-eqiad# show network-instance vlan-1019 bridge-table mac-table mac 5C:5E:AB:3D:81:54
--------------------------------------------------------------------------------------------------------------------------
Mac-table of network instance vlan-1019
--------------------------------------------------------------------------------------------------------------------------
Mac                     : 5C:5E:AB:3D:81:54
Destination             : ethernet-1/32.1019
Dest Index              : 35
Type                    : learnt
Programming Status      : Success
Aging                   : 1172
Last Update             : 2025-11-11T03:12:21.000Z
Duplicate Detect time   : N/A
Hold down time remaining: N/A
Duplicate
The port it was learnt as a "duplicate" on was lag2.1019, which is the vlan-1019 subinterface on the 2x40G LAG to asw2-c-eqiad. There is no sensible way a frame with this source MAC should be coming in that port. The MAC is known on 'ae0' of the old asw2-c-eqiad stack. Nothing connected to the old switch stack should be sourcing frames from this MAC. The only place they should be seen coming from is on ae0 (from one of the spine switches) and those frames should not be sent back out towards ae0 again (normal L2 forwarding behavior):
cmooney@asw2-c-eqiad> show ethernet-switching table vlan-id 1019 | match "5C:5E:AB:3D:81:54"
private1-c-eqiad 5c:5e:ab:3d:81:54 D - ae0.0 0 0
No other logs on asw2-c-eqiad show anything out of the ordinary for the affected time.
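For reference, the "normal L2 forwarding behavior" mentioned above can be sketched as a toy learning bridge. This is only an illustration of the general rule (learn the source MAC on the ingress port, never send a frame back out the port it arrived on), not a model of the Junos implementation. The host port names and the host MAC are made up; only ae0 and the CR MAC come from the outputs above.

```python
from typing import Dict, List


class LearningBridge:
    """Toy transparent bridge: learns source MACs, never forwards out the ingress port."""

    def __init__(self, ports: List[str]) -> None:
        self.ports = ports
        self.mac_table: Dict[str, str] = {}  # MAC -> port it was last learnt on

    def receive(self, src: str, dst: str, in_port: str) -> List[str]:
        """Learn the source MAC and return the egress ports for this frame."""
        self.mac_table[src.lower()] = in_port
        out_port = self.mac_table.get(dst.lower())
        if out_port is None:
            # Unknown destination: flood to every port except the ingress port.
            return [p for p in self.ports if p != in_port]
        if out_port == in_port:
            # Destination is behind the ingress port: filter (drop) the frame.
            return []
        return [out_port]


# Hypothetical port list for the old stack; only ae0 appears in the ticket.
asw = LearningBridge(ports=["ae0", "host-port-1", "host-port-2"])
CR_MAC = "5c:5e:ab:3d:81:54"
HOST_MAC = "aa:bb:cc:00:00:01"  # made-up host MAC

# A frame from the CR arrives on ae0 for a not-yet-learnt host:
# it is flooded to the host ports, but never back out ae0.
egress = asw.receive(src=CR_MAC, dst=HOST_MAC, in_port="ae0")
print(egress)  # ['host-port-1', 'host-port-2']
```

Under this rule the spine should never see the CR MAC as a source on lag2, which is why the duplicate detection on lag2.1019 is so hard to explain.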
Outage
Whatever the reason, this brought most hosts on the vlan to a complete stop while the duplicate was present. The obvious explanation is that this MAC is on the interface that is VRRP master for the vlan, so whatever caused it was affecting traffic to the CR/vlan gateway itself.
Despite that being the obvious conclusion, the actual VRRP virtual MAC address has not moved on either spine switch since it was moved to the CR et-1/0/5 link last week:
A:ssw1-d1-eqiad# show network-instance vlan-1019 bridge-table mac-table mac 00:00:5e:00:01:13
---------------------------------------------------------------------------------------------------------
Mac-table of network instance vlan-1019
---------------------------------------------------------------------------------------------------------
Mac                     : 00:00:5E:00:01:13
Destination             : ethernet-1/32.1019
Dest Index              : 35
Type                    : learnt
Programming Status      : Success
Aging                   : 1194
Last Update             : 2025-11-06T14:33:10.000Z
Duplicate Detect time   : N/A
Hold down time remaining: N/A
---------------------------------------------------------------------------------------------------------
A:ssw1-d8-eqiad# show network-instance vlan-1019 bridge-table mac-table mac 00:00:5e:00:01:13
---------------------------------------------------------------------------------------------------------
Mac-table of network instance vlan-1019
---------------------------------------------------------------------------------------------------------
Mac                     : 00:00:5E:00:01:13
Destination             : vxlan-interface:vxlan0.1019 vtep:10.64.128.17 vni:2001019
Dest Index              : 5726513
Type                    : evpn
Programming Status      : Success
Aging                   : N/A
Last Update             : 2025-11-06T14:33:10.000Z
Duplicate Detect time   : N/A
Hold down time remaining: N/A
---------------------------------------------------------------------------------------------------------
That is the MAC hosts are sending frames to for traffic outside their own subnet. Unless I'm missing something, it shouldn't matter where the switches see 5C:5E:AB:3D:81:54 connected (it's the MAC for 10.64.32.2, which hosts don't talk to, not the gateway IP).
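As a sanity check on which MAC is which: IPv4 VRRP virtual MACs have the fixed form 00:00:5e:00:01:XX, where the last octet is the VRID. A small sketch that tells the two addresses apart follows. The function name is my own, and it only decodes the address format; it says nothing about how the group is actually configured on the CRs.

```python
def vrrp_vrid_from_mac(mac: str) -> int:
    """Return the VRID encoded in an IPv4 VRRP virtual MAC (00:00:5e:00:01:XX)."""
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6 or octets[:5] != [0x00, 0x00, 0x5E, 0x00, 0x01]:
        raise ValueError(f"{mac} is not an IPv4 VRRP virtual MAC")
    return octets[5]


# The gateway MAC hosts send to: a VRRP virtual MAC, last octet 0x13.
vrid = vrrp_vrid_from_mac("00:00:5e:00:01:13")
print(vrid)  # 19

# The MAC flagged as duplicate is the interface's own unicast MAC, not a VRRP one.
try:
    vrrp_vrid_from_mac("5C:5E:AB:3D:81:54")
    flagged_is_vrrp = True
except ValueError:
    flagged_is_vrrp = False
print(flagged_is_vrrp)  # False
```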
Root cause
Right now, honestly, I don't know what the problem is. The duplicate MAC on the face of it suggests the Juniper is sending frames back out the interface they arrived on (ae0), but that seems like something that wouldn't happen. The Nokias are obviously the new element here and likely at fault, but I'm not sure which vendor to pursue this with, or whether they will just blame each other.
The times of the duplicate MAC detection seem to match the time host traffic was affected. The 9 minutes is the default hold-down timer:
set / network-instance vlan-1019 bridge-table mac-duplication admin-state enable
set / network-instance vlan-1019 bridge-table mac-duplication monitoring-window 3
set / network-instance vlan-1019 bridge-table mac-duplication num-moves 5
set / network-instance vlan-1019 bridge-table mac-duplication hold-down-time 9
set / network-instance vlan-1019 bridge-table mac-duplication action stop-learning
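As a quick cross-check of that timing, the gap between the "was detected" and "is now deleted" log entries on ssw1-d1-eqiad can be computed directly. The two timestamps are copied from the logs above; the variable names are just for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S.%f"

# Timestamps copied from the sr_l2_mac_mgr log lines on ssw1-d1-eqiad.
detected = datetime.strptime("2025-11-11T03:03:20.562", FMT)
cleared = datetime.strptime("2025-11-11T03:12:21.100", FMT)

held = cleared - detected
print(held)  # 0:09:00.538000

# Compare against the configured hold-down-time of 9 minutes.
HOLD_DOWN_MINUTES = 9
overshoot_s = held.total_seconds() - HOLD_DOWN_MINUTES * 60
print(round(overshoot_s, 3))  # 0.538
```

The entry was held for 9 minutes and about half a second, which lines up with hold-down-time 9 rather than with anything happening on the hosts or the old switch stack.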
Mitigations
For now to try and prevent any re-occurrence I have set the unicast CR MAC addresses for vlan 1019 statically on ssw1-d1-eqiad:
set / network-instance vlan-1019 bridge-table static-mac mac 5C:5E:AB:3D:81:54 destination ethernet-1/32.1019
The switch now shows it as a static entry, which will take precedence over any dynamically learnt entry. In other words, if a frame from this MAC comes in lag2 from asw2-c-eqiad it shouldn't override the static entry:
A:ssw1-d1-eqiad# show network-instance vlan-1019 bridge-table mac-table mac 5C:5E:AB:3D:81:54
---------------------------------------------------------------------------------------------------------
Mac-table of network instance vlan-1019
---------------------------------------------------------------------------------------------------------
Mac                     : 5C:5E:AB:3D:81:54
Destination             : ethernet-1/32.1019
Dest Index              : 35
Type                    : static
Programming Status      : Success
Aging                   : N/A
Last Update             : 2025-11-11T05:40:58.000Z
Duplicate Detect time   : N/A
Hold down time remaining: N/A
---------------------------------------------------------------------------------------------------------
I also reduced the timers for the duplicate MAC detection in this vlan on both spines:
A:ssw1-d1-eqiad# info flat network-instance vlan-1019 bridge-table mac-duplication
set / network-instance vlan-1019 bridge-table mac-duplication monitoring-window 1
set / network-instance vlan-1019 bridge-table mac-duplication hold-down-time 2
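To illustrate what the reduced timers change, here is a rough model of the detection logic as I understand it: a MAC that moves num-moves times inside the monitoring window is declared duplicate and held for the hold-down time, after which monitoring restarts. This is a simplified sketch under those assumptions, not the actual SR Linux implementation. The move times are invented, and num-moves is assumed to still be 5 since it does not show up as changed in the output above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MacDupDetector:
    """Simplified per-MAC duplicate detector; all times are in seconds."""
    monitoring_window_s: int
    num_moves: int
    hold_down_s: int
    recent_moves: List[int] = field(default_factory=list)
    held_until: Optional[int] = None  # set once the MAC is declared duplicate

    def on_move(self, now: int) -> bool:
        """Record a MAC move at `now`; return True while the MAC is held as duplicate."""
        if self.held_until is not None:
            if now < self.held_until:
                return True  # assumed stop-learning behaviour: moves are ignored
            # Hold-down expired: entry is flushed and monitoring restarts.
            self.held_until = None
            self.recent_moves.clear()
        self.recent_moves.append(now)
        # Keep only the moves that fall inside the monitoring window.
        self.recent_moves = [t for t in self.recent_moves
                             if now - t <= self.monitoring_window_s]
        if len(self.recent_moves) >= self.num_moves:
            self.held_until = now + self.hold_down_s
            return True
        return False


# Invented burst: the MAC flips between two ports every 6 seconds.
moves = [0, 6, 12, 18, 24]

old = MacDupDetector(monitoring_window_s=3 * 60, num_moves=5, hold_down_s=9 * 60)
new = MacDupDetector(monitoring_window_s=1 * 60, num_moves=5, hold_down_s=2 * 60)
for t in moves:
    old.on_move(t)
    new.on_move(t)

print(old.held_until, new.held_until)  # 564 144
```

In this model the same burst trips both configurations on the fifth move, but the new timers release the MAC after 2 minutes instead of 9.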
The other potential step would be to disable duplicate mac detection entirely. That obviously brings some risks but perhaps here it would ensure if this weird scenario happens we flap a few times but recover quicker, instead of being dead for 9 mins.