Row C traffic outage Nov 11 2025
Open, High, Public

Description

We seem to have had a vlan-wide problem this evening which disrupted comms to multiple hosts on the private1-c-eqiad vlan for approx 9 minutes.

Multiple host down alerts were received, along with higher-level application errors and host BGP errors which were the result of the lower-layer comms issue.

Checking logs, the outage seems to coincide with these entries on ssw1-d1-eqiad:

2025-11-11T03:03:20.562 sr_l2_mac_mgr: bridgetable|6555|N: A duplicate MAC address 5C:5E:AB:3D:81:54 was detected on vlan-1019.
2025-11-11T03:03:20.562 sr_l2_mac_mgr: bridgetable|6555|N: A duplicate MAC address 5C:5E:AB:3D:81:54 was detected on sub-interface lag2.1019.
2025-11-11T03:12:21.100 sr_l2_mac_mgr: bridgetable|6555|N: A duplicate MAC address 5C:5E:AB:3D:81:54 detected on sub-interface lag2.1019 is now deleted.
2025-11-11T03:12:21.100 sr_l2_mac_mgr: bridgetable|6555|N: A duplicate MAC address 5C:5E:AB:3D:81:54 detected on vlan-1019 is now deleted.

This is the unicast MAC address of et-1/0/5.1019 on cr1-eqiad:

cmooney@re0.cr1-eqiad> show interfaces et-1/0/5 | match "Description|Current address" 
  Description: Core: ssw1-d1-eqiad:ethernet-1/32 {#B00397}
  Current address: 5c:5e:ab:3d:81:54, Hardware address: 5c:5e:ab:3d:81:54

In the current topology this should only be learnt on the other side of that port, ssw1-d1-eqiad ethernet-1/32.1019, as it currently is:

A:ssw1-d1-eqiad# show network-instance vlan-1019 bridge-table mac-table mac 5C:5E:AB:3D:81:54
--------------------------------------------------------------------------------------------------------------------------
Mac-table of network instance vlan-1019
--------------------------------------------------------------------------------------------------------------------------
Mac                     : 5C:5E:AB:3D:81:54
Destination             : ethernet-1/32.1019
Dest Index              : 35
Type                    : learnt
Programming Status      : Success
Aging                   : 1172
Last Update             : 2025-11-11T03:12:21.000Z
Duplicate Detect time   : N/A
Hold down time remaining: N/A

Duplicate

The port it was learnt on as a "duplicate" was lag2.1019, which is the vlan-1019 sub-interface on the 2x40G LAG to asw2-c-eqiad. There is no sensible way a frame with this source MAC should be arriving on that port. On the old asw2-c-eqiad stack the MAC is known on 'ae0', and nothing connected to that stack should be sourcing frames from it. The only place the stack should see these frames arrive is ae0 (from one of the spine switches), and they should not be sent back out towards ae0 again (normal L2 forwarding behavior):

cmooney@asw2-c-eqiad> show ethernet-switching table vlan-id 1019 | match "5C:5E:AB:3D:81:54" 
    private1-c-eqiad    5c:5e:ab:3d:81:54   D             -   ae0.0                  0         0

No other logs on asw2-c-eqiad show anything out of the ordinary for the affected time.

Outage

Whatever the reason, this brought most hosts on the vlan to a complete stop while the duplicate was present. The obvious explanation is that this MAC belongs to the interface that is VRRP master for the vlan, so whatever caused the duplicate was affecting traffic to the CR/vlan gateway itself.

Despite that being the obvious conclusion, the actual VRRP virtual MAC address has not moved on either spine switch since it was moved to the CR et-1/0/5 link last week:

A:ssw1-d1-eqiad# show network-instance vlan-1019 bridge-table mac-table mac 00:00:5e:00:01:13
---------------------------------------------------------------------------------------------------------
Mac-table of network instance vlan-1019
---------------------------------------------------------------------------------------------------------
Mac                     : 00:00:5E:00:01:13
Destination             : ethernet-1/32.1019
Dest Index              : 35
Type                    : learnt
Programming Status      : Success
Aging                   : 1194
Last Update             : 2025-11-06T14:33:10.000Z
Duplicate Detect time   : N/A
Hold down time remaining: N/A
---------------------------------------------------------------------------------------------------------
A:ssw1-d8-eqiad# show network-instance vlan-1019 bridge-table mac-table mac 00:00:5e:00:01:13
---------------------------------------------------------------------------------------------------------
Mac-table of network instance vlan-1019
---------------------------------------------------------------------------------------------------------
Mac                     : 00:00:5E:00:01:13
Destination             : vxlan-interface:vxlan0.1019 vtep:10.64.128.17 vni:2001019
Dest Index              : 5726513
Type                    : evpn
Programming Status      : Success
Aging                   : N/A
Last Update             : 2025-11-06T14:33:10.000Z
Duplicate Detect time   : N/A
Hold down time remaining: N/A
---------------------------------------------------------------------------------------------------------

That is the MAC hosts send frames to for traffic outside their own subnet. Unless I'm missing something, it shouldn't matter where the switches see 5C:5E:AB:3D:81:54 connected: it's the MAC for 10.64.32.2, an address hosts don't talk to, rather than the MAC for the gateway IP.
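As a cross-check on the two MACs involved: for IPv4, VRRP builds the virtual router MAC as 00-00-5E-00-01-{VRID} (RFC 5798). The short Python sketch below, using hypothetical helper names, shows that 00:00:5e:00:01:13 decodes to VRID 19 (0x13), while the CR's physical interface MAC is not a VRRP virtual MAC at all. The VRID is derived here from the MAC, not taken from the router config.

```python
VRRP_IPV4_PREFIX = "00:00:5e:00:01"  # fixed prefix for IPv4 VRRP virtual MACs


def vrrp_ipv4_virtual_mac(vrid: int) -> str:
    """Return the IPv4 VRRP virtual MAC for a given VRID (1..255)."""
    if not 1 <= vrid <= 255:
        raise ValueError("VRID must be in 1..255")
    return f"{VRRP_IPV4_PREFIX}:{vrid:02x}"


def vrid_from_mac(mac: str):
    """Return the VRID if `mac` is an IPv4 VRRP virtual MAC, else None."""
    prefix, _, last = mac.lower().replace("-", ":").rpartition(":")
    if prefix != VRRP_IPV4_PREFIX:
        return None
    vrid = int(last, 16)
    return vrid if 1 <= vrid <= 255 else None


gateway_vrid = vrid_from_mac("00:00:5E:00:01:13")   # virtual MAC from the output above
cr_port_vrid = vrid_from_mac("5C:5E:AB:3D:81:54")   # physical MAC of et-1/0/5
```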

Root cause

Right now, honestly, I don't know what the problem is. On the face of it the duplicate MAC suggests the Juniper is sending frames back out the interface they arrived on (ae0), but that seems like something that wouldn't happen. The Nokias are obviously the new element here and likely at fault, but I'm not sure which vendor to pursue this with, or whether they will just blame each other.

The times of the duplicate MAC detection seem to match the period in which host traffic was affected. The 9 minutes is the default hold-down timer:

set / network-instance vlan-1019 bridge-table mac-duplication admin-state enable
set / network-instance vlan-1019 bridge-table mac-duplication monitoring-window 3
set / network-instance vlan-1019 bridge-table mac-duplication num-moves 5
set / network-instance vlan-1019 bridge-table mac-duplication hold-down-time 9
set / network-instance vlan-1019 bridge-table mac-duplication action stop-learning
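To make the timing concrete, here is a rough Python sketch of how a move-count detector with these parameters could behave. It is a toy model, not Nokia's implementation: the class and its methods are hypothetical, the timers are assumed to be in minutes, the frame pattern is invented, and keeping the entry on the port where the duplicate was flagged is an assumption based on the log lines naming lag2.1019. Only the two timestamps are taken from the logs above.

```python
from collections import deque
from datetime import datetime, timedelta


class MacDupDetector:
    """Toy model of move-count duplicate detection for a single MAC."""

    def __init__(self, monitoring_window_min=3, num_moves=5, hold_down_min=9):
        self.window = timedelta(minutes=monitoring_window_min)
        self.num_moves = num_moves
        self.hold_down = timedelta(minutes=hold_down_min)
        self.port = None          # port the MAC is currently learnt on
        self.moves = deque()      # timestamps of recent moves between ports
        self.held_until = None    # set while the MAC is flagged as duplicate

    def see_frame(self, port, now):
        """Record a frame from the MAC on `port`; return True while flagged."""
        if self.held_until is not None:
            if now < self.held_until:
                return True             # stop-learning: entry stays where it is
            self.held_until = None      # hold-down expired, resume learning
            self.moves.clear()
        if self.port is not None and port != self.port:
            self.moves.append(now)
            # Forget moves that fall outside the monitoring window.
            while now - self.moves[0] > self.window:
                self.moves.popleft()
            if len(self.moves) >= self.num_moves:
                self.held_until = now + self.hold_down
        self.port = port
        return self.held_until is not None


# Timestamps copied from the ssw1-d1-eqiad log lines.
detected = datetime.fromisoformat("2025-11-11T03:03:20.562")
deleted = datetime.fromisoformat("2025-11-11T03:12:21.100")
observed = deleted - detected   # how long the duplicate state lasted

# Invented frame pattern: the MAC alternates between the two ports every
# 5 seconds, with the fifth move landing at the logged detection time.
ports = ["ethernet-1/32.1019", "lag2.1019"]
det = MacDupDetector()
flags = [det.see_frame(ports[i % 2], detected - timedelta(seconds=5 * (5 - i)))
         for i in range(6)]
```

In this model the fifth move flags the MAC and pins it to lag2.1019 until the hold-down expires 9 minutes later, which lines up with the roughly 9 minute gap between the "detected" and "deleted" log lines.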

Mitigations

For now, to try to prevent any recurrence, I have set the unicast CR MAC addresses for vlan 1019 statically on ssw1-d1-eqiad:

set / network-instance vlan-1019 bridge-table static-mac mac 5C:5E:AB:3D:81:54 destination ethernet-1/32.1019

The switch now shows the entry as static, which takes precedence over any dynamically learnt entry. In other words, if a frame from this MAC comes in on lag2 from asw2-c-eqiad it shouldn't override the static entry:

A:ssw1-d1-eqiad# show network-instance vlan-1019 bridge-table mac-table mac 5C:5E:AB:3D:81:54
---------------------------------------------------------------------------------------------------------
Mac-table of network instance vlan-1019
---------------------------------------------------------------------------------------------------------
Mac                     : 5C:5E:AB:3D:81:54
Destination             : ethernet-1/32.1019
Dest Index              : 35
Type                    : static
Programming Status      : Success
Aging                   : N/A
Last Update             : 2025-11-11T05:40:58.000Z
Duplicate Detect time   : N/A
Hold down time remaining: N/A
---------------------------------------------------------------------------------------------------------
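The intended effect of the static entry can be sketched with a small Python model of a bridge table. This is illustrative only: the `BridgeTable` class is hypothetical and simply encodes the expectation stated above, that dynamic learning never replaces a static entry.

```python
class BridgeTable:
    """Toy MAC table in which static entries win over dynamic learning."""

    def __init__(self):
        self.entries = {}  # mac -> (destination, entry_type)

    def set_static(self, mac, destination):
        """Pin a MAC to a destination, replacing any learnt entry."""
        self.entries[mac.lower()] = (destination, "static")

    def learn(self, mac, destination):
        """Dynamic learning from a received frame; True if the table changed."""
        mac = mac.lower()
        current = self.entries.get(mac)
        if current is not None and current[1] == "static":
            return False                      # static entry is never overridden
        if current == (destination, "learnt"):
            return False                      # already learnt on this port
        self.entries[mac] = (destination, "learnt")
        return True


CR_MAC = "5C:5E:AB:3D:81:54"
table = BridgeTable()
moved_before = [table.learn(CR_MAC, "ethernet-1/32.1019"),
                table.learn(CR_MAC, "lag2.1019")]      # dynamic entry moves
table.set_static(CR_MAC, "ethernet-1/32.1019")
moved_after = table.learn(CR_MAC, "lag2.1019")         # ignored once static
```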

I also reduced the timers for the duplicate MAC detection in this vlan on both spines:

A:ssw1-d1-eqiad# info flat network-instance vlan-1019 bridge-table mac-duplication
set / network-instance vlan-1019 bridge-table mac-duplication monitoring-window 1
set / network-instance vlan-1019 bridge-table mac-duplication hold-down-time 2

The other potential step would be to disable duplicate MAC detection entirely. That obviously brings some risks, but here it might mean that if this weird scenario happens again we flap a few times and recover more quickly, instead of being dead for 9 minutes.

Event Timeline

cmooney triaged this task as High priority.

Icinga downtime and Alertmanager silence (ID=e41d36ab-ea9e-437e-a0db-341d018dedf6) set by cmooney@cumin1003 for 2:00:00 on 2 host(s) and their services with reason: shutting down one leg of LAG from ssw1-d8-eqiad to asw2-c7-eqiad

asw2-c-eqiad,ssw1-d8-eqiad

Mentioned in SAL (#wikimedia-operations) [2025-11-12T11:18:56Z] <topranks> shut down link from ssw1-d8-eqiad ethernet-1/28 <-> asw2-c7-eqiad et-7/0/49 to observe results T409800

Mentioned in SAL (#wikimedia-operations) [2025-11-12T11:22:58Z] <topranks> will not shut just yet will log again when about to do so T409800

Mentioned in SAL (#wikimedia-operations) [2025-11-12T12:14:35Z] <topranks> shut down link from ssw1-d8-eqiad ethernet-1/28 <-> asw2-c7-eqiad et-7/0/49 to observe results T409800