
Nokia SR-Linux DHCP Relay Bug
Open, Low, Public

Description

We have observed a bug in Nokia SR-Linux that affects the DHCP relay function in some circumstances. Specifically, the switches do not seem to forward packets returned to them from our install/DHCP server. This behaviour has been observed with SR-Linux v24.10.x and v24.7.x. It has been inconsistent, however: the problem disappeared on lsw1-c3-eqiad and lsw1-d3-eqiad after they were downgraded to v24.7.2 and then upgraded to v24.10.4 again (whether the downgrade/upgrade cycle made a difference, or some other random factor "fixed" it, is unknown). The problem has only been observed on switches running EVPN/VXLAN, meaning the DHCP response packets arrive on the leaf switch VXLAN-encapsulated, following a type-5 EVPN route destined to the L3VNI/RMAC.

Detail

A network diagram we made for Nokia support can be seen here:

The basic symptom was that DHCP responses sent to the Nokia top-of-rack switch (acting as DHCP relay for hosts on attached vlans) were received by the switch but not forwarded to the host that sent the initial DHCP request. We could verify they were received because the counters increased on the configured cpm acl entry:

A:lsw1-d3-eqiad# info flat acl acl-filter cpm type ipv4 entry 215
set / acl acl-filter cpm type ipv4 entry 215 description allow_dhcp_reply4
set / acl acl-filter cpm type ipv4 entry 215 match ipv4 protocol 17
set / acl acl-filter cpm type ipv4 entry 215 match ipv4 source-ip prefix 208.80.152.0/22
set / acl acl-filter cpm type ipv4 entry 215 match transport destination-port range start 67
set / acl acl-filter cpm type ipv4 entry 215 match transport destination-port range end 68
set / acl acl-filter cpm type ipv4 entry 215 match transport source-port value 67
set / acl acl-filter cpm type ipv4 entry 215 action accept
A:lsw1-d3-eqiad# info from state acl acl-filter cpm type ipv4 entry 215 statistics
    acl {
        acl-filter cpm type ipv4 {
            entry 215 {
                statistics {
                    last-clear "2025-10-28T16:11:40.000Z (5 minutes ago)"
                    incomplete false
                    matched-packets 12
                    last-match "2025-10-28T16:16:48.000Z (17 seconds ago)"
                }
            }
        }
    }

However, the statistics for the DHCP relay daemon on the device showed the same number of packets sent to our install server, but zero having come back:

A:lsw1-d3-eqiad# info from state interface irb0 subinterface 1079 ipv4 dhcp-relay statistics
    interface irb0 {
        subinterface 1079 {
            ipv4 {
                dhcp-relay {
                    statistics {
                        client-packets-received 12
                        client-packets-relayed 12
                        client-packets-discarded 0
                        server-packets-received 0
                        server-packets-relayed 0
                        server-packets-discarded 0
                    }
                }
            }
        }
    }
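The comparison between the two counters can be automated. The following is a minimal sketch, assuming the text output of the two `info from state` commands has already been collected by some other means; the function names are hypothetical and the sample text is trimmed from the output in this task.

```python
import re

# Sample counters copied from the lsw1-d3-eqiad output in this task (trimmed).
ACL_STATS = """
                    matched-packets 12
"""
RELAY_STATS = """
                        client-packets-received 12
                        client-packets-relayed 12
                        server-packets-received 0
                        server-packets-relayed 0
"""


def read_counter(text: str, name: str) -> int:
    """Return the integer on the line '<name> <int>' in SR-Linux state output."""
    match = re.search(rf"^\s*{re.escape(name)}\s+(\d+)\s*$", text, re.MULTILINE)
    if match is None:
        raise ValueError(f"counter {name!r} not found")
    return int(match.group(1))


def relay_is_dropping_replies(acl_stats: str, relay_stats: str) -> bool:
    """True when the CPM ACL saw DHCP replies but the relay daemon saw none."""
    acl_matched = read_counter(acl_stats, "matched-packets")
    server_received = read_counter(relay_stats, "server-packets-received")
    return acl_matched > 0 and server_received == 0


if __name__ == "__main__":
    print(relay_is_dropping_replies(ACL_STATS, RELAY_STATS))
```

The check is only meaningful if the ACL counters were cleared recently enough that the matches relate to the same DHCP attempts.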

We were also able to do a mirror on the Spine uplink port and capture the packets arriving on the leaf:

Status

While working to determine which version of SR-Linux introduced the ARP bug we were also dealing with (see T409178), the two switches connecting our test hosts were downgraded through various versions of the OS. It was discovered that following this, when back on v24.10.4, both lsw1-c3-eqiad and lsw1-d3-eqiad correctly relayed DHCP responses to hosts.

While the OS downgrades were focused on the ARP issue, one DHCP test was done on v24.7.2 and the problem still occurred on it. The ARP bug has not been observed on that version of the OS, so the two problems are likely different.

The current status is that this issue remains with Nokia, but we are not observing it on the switches where we formerly had problems. Nokia assure us they are working on it. When we have a test Nokia switch connected we can do more tests ourselves to try to reproduce it.

Event Timeline

cmooney triaged this task as Medium priority.

I'm gonna close this one for now. We have not seen a repeat of this since we have adjusted the config to deal with the ARP resolution bug, though Nokia support said the root cause was not that.

We've had many successful reimages since then with the Nokia DHCP relay working, so whatever specific conditions caused it, it doesn't seem to be affecting us now. As agreed with Nokia we will re-open our case with them immediately if there is a repeat.

Well of course this has occurred again as soon as I made the decision to close. @ayounsi hit it today on lsw1-d7-eqiad trying to reimage aux-k8s-worker1007.

Same symptoms, we can see the switch control-plane receives the DHCP reply packets from the install server:

A:lsw1-d7-eqiad# info from state acl acl-filter cpm type ipv4 entry 215 statistics
    acl {
        acl-filter cpm type ipv4 {
            entry 215 {
                statistics {
                    matched-packets 36
                    last-match "2026-02-09T10:37:46.000Z (4 minutes ago)"
                }
            }
        }
    }

But the DHCP relay daemon reports zero packets having been returned from the server:

A:lsw1-d7-eqiad# info from state interface irb0 subinterface 1085 ipv4 dhcp-relay statistics
    interface irb0 {
        subinterface 1085 {
            ipv4 {
                dhcp-relay {
                    statistics {
                        client-packets-received 36
                        client-packets-relayed 36
                        client-packets-discarded 0
                        server-packets-received 0
                        server-packets-relayed 0
                        server-packets-discarded 0
                    }
                }
            }
        }
    }

Ticket 05430684 created with Nokia

I've removed parent task T409286 to track this independently but commenting for the record.

For the record currently aux-k8s-worker1007 and aqs1023 are blocked by this issue.

Mentioned in SAL (#wikimedia-operations) [2026-02-19T09:37:58Z] <XioNoX> lsw1-d7-eqiad# tools network-instance default protocols bgp neighbor 10.64.128.17 reset-peer - T411054

@ayounsi directed me to this ticket after reading: T418398: Two hosts are failing to do DHCP based PXE booting after renaming and moving vlan

I believe that this is also preventing the reimaging of:

  • dse-k8s-worker1026 on lsw1-c2-eqiad
  • dse-k8s-worker1027 on lsw1-c7-eqiad

I'm currently able to test with the sre.hosts.dhcp cookbook and manually selecting PXE boot, on either host.
Both hosts show no DHCP response, although I have not yet started capturing packets to try to confirm where they are getting dropped.

Mentioned in SAL (#wikimedia-operations) [2026-03-04T07:54:30Z] <topranks> disabling IBGP session between ssw1-d1-eqiad and ssw1-d8-eqiad to remove backup paths T411054

Mentioned in SAL (#wikimedia-operations) [2026-03-04T08:49:16Z] <topranks> disabling IBGP session between ssw1-d1-eqiad and ssw1-d8-eqiad to remove backup paths try #2 T411054

@ayounsi thanks for following up on this. I've done some testing to see if there may be a better way to force a tunnel teardown/re-establishment today.

The reason clearing a single BGP session does not clear the VXLAN tunnel normally is that the Spines peer with each other in BGP, which means that even with the direct BGP session from a leaf to ssw1 down, it still learns EVPN routes with the next-hop of ssw1 (from ssw2).

A:homer@lswtest-d8-eqiad# show network-instance default protocols bgp neighbor 10.64.128.18 received-routes evpn | grep "208.80.154.128/26"
| *   | 10.64.128.17:5000 | 0 | 208.80.154.128/26 | 0 | 10.64.128.17 | - | 100 |
| u*> | 10.64.128.18:5000 | 0 | 208.80.154.128/26 | 0 | 10.64.128.18 | - | 100 |

So to remove all BGP routes with a given spine as next-hop the best way is probably to disable the peering between the Spines, so that Leaf devices only learn ssw1 originated routes from it, and vice-versa. A drawback of this is that it causes a VRRP split-brain on the CRs, because the VRRP frames need to be sent from Spine<->Spine, but it turns out this doesn't actually matter with this setup (see P89795).

With the Spines not exchanging routes we learn only one route to an external destination from each Spine, with its own IP as the next-hop:

A:homer@lswtest-d8-eqiad# show network-instance default protocols bgp neighbor 10.64.128.18 received-routes evpn | grep "208.80.154.128/26"
| u*> | 10.64.128.18:5000 | 0 | 208.80.154.128/26 | 0 | 10.64.128.18 | - | 100 |
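The reasoning about which next-hops a leaf still learns can be illustrated with a toy model. This is a deliberately simplified sketch, not a BGP implementation: it ignores best-path selection and timers, and the names `ssw1`/`ssw2` and the function itself are only illustrative. It assumes each Spine advertises its own routes and, while the Spine-to-Spine peering is up, also passes on the other Spine's routes with the next-hop unchanged.

```python
SPINES = {"ssw1", "ssw2"}


def next_hops_learned(leaf_sessions_up: set[str], spines_peer: bool) -> set[str]:
    """Return the set of Spine next-hops a leaf still has EVPN routes for."""
    learned: set[str] = set()
    for spine in leaf_sessions_up:
        # Routes originated by the Spine the leaf is peering with.
        learned.add(spine)
        if spines_peer:
            # Routes of the other Spine, passed on with their next-hop unchanged.
            learned |= SPINES - {spine}
    return learned


if __name__ == "__main__":
    # Session to ssw1 cleared while the Spines still peer: ssw1 stays a next-hop.
    print(sorted(next_hops_learned({"ssw2"}, spines_peer=True)))
    # Same, with the Spine-to-Spine peering disabled: only ssw2 remains.
    print(sorted(next_hops_learned({"ssw2"}, spines_peer=False)))
```

In this model the routes with ssw1 as next-hop only disappear from the leaf when both the leaf-to-ssw1 session is down and the Spines are not exchanging routes, which matches the behaviour described above.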

We still have the VXLAN tunnel in place with the bad index/id:

A:homer@lswtest-d8-eqiad# show network-instance default tunnel-table ipv4 | grep "10.64.128.17\|10.64.128.18"
| 10.64.128.17/32 | vxlan | 1 | Y | 8 | 0 | 2025-12-19T09:46:10.200Z | 10.64.129.102 | ethernet-1/56.0 |
| 10.64.128.18/32 | vxlan | 9 | Y | 8 | 0 | 2025-12-19T09:38:36.259Z | 10.64.129.100 | ethernet-1/55.0 |

But now when we clear the session to 10.64.128.17 it does get deleted/recreated:

A:homer@lswtest-d8-eqiad# tools network-instance default protocols bgp neighbor 10.64.128.17 reset-peer
/network-instance[name=default]/protocols/bgp/neighbor[peer-address=10.64.128.17]:
    Successfully executed the tools clear command.
A:homer@lswtest-d8-eqiad# show network-instance default protocols bgp neighbor
---------------------------------------------------------------------------------------------------------------------
BGP neighbor summary for network-instance "default"
Flags: S static, D dynamic, L discovered by LLDP, B BFD enabled, - disabled, * slow
+----------+--------------+-----------+-------+---------+-------------+-----------------+----------+----------------+
| Net-Inst | Peer         | Group     | Flags | Peer-AS | State       | Uptime          | AFI/SAFI | [Rx/Active/Tx] |
+==========+==============+===========+=======+=========+=============+=================+==========+================+
| default  | 10.64.128.17 | ibgp_evpn | SB    | 64814   | established | 0d:0h:0m:7s     | evpn     | [2554/2554/7]  |
| default  | 10.64.128.18 | ibgp_evpn | SB    | 64814   | established | 74d:23h:16m:47s | evpn     | [2554/2245/7]  |
+----------+--------------+-----------+-------+---------+-------------+-----------------+----------+----------------+
A:homer@lswtest-d8-eqiad# show network-instance default tunnel-table ipv4 | grep "10.64.128.17\|10.64.128.18"
| 10.64.128.17/32 | vxlan | 16 | Y | 8 | 0 | 2026-03-04T08:55:28.881Z | 10.64.129.102 | ethernet-1/56.0 |
| 10.64.128.18/32 | vxlan | 9  | Y | 8 | 0 | 2025-12-19T09:38:36.259Z | 10.64.129.100 | ethernet-1/55.0 |
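Checking which tunnels were actually recreated can be done by comparing the table rows before and after the reset. A minimal sketch follows, assuming the third column is the tunnel index/id and the seventh is the last-update timestamp (the header row is not shown above, so these positions are an assumption); the function names are hypothetical.

```python
# Rows copied from the tunnel-table output before and after the reset-peer.
BEFORE = """
| 10.64.128.17/32 | vxlan | 1 | Y | 8 | 0 | 2025-12-19T09:46:10.200Z | 10.64.129.102 | ethernet-1/56.0 |
| 10.64.128.18/32 | vxlan | 9 | Y | 8 | 0 | 2025-12-19T09:38:36.259Z | 10.64.129.100 | ethernet-1/55.0 |
"""
AFTER = """
| 10.64.128.17/32 | vxlan | 16 | Y | 8 | 0 | 2026-03-04T08:55:28.881Z | 10.64.129.102 | ethernet-1/56.0 |
| 10.64.128.18/32 | vxlan | 9  | Y | 8 | 0 | 2025-12-19T09:38:36.259Z | 10.64.129.100 | ethernet-1/55.0 |
"""


def parse_tunnel_rows(output: str) -> dict[str, tuple[str, str]]:
    """Map tunnel prefix -> (assumed id column, assumed timestamp column)."""
    tunnels: dict[str, tuple[str, str]] = {}
    for line in output.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) < 7 or cells[1] != "vxlan":
            continue  # skip blank lines, borders and non-VXLAN rows
        tunnels[cells[0]] = (cells[2], cells[6])
    return tunnels


def recreated_tunnels(before: str, after: str) -> list[str]:
    """Return prefixes whose id or timestamp differs between the two outputs."""
    old, new = parse_tunnel_rows(before), parse_tunnel_rows(after)
    return sorted(p for p in new if p in old and new[p] != old[p])


if __name__ == "__main__":
    print(recreated_tunnels(BEFORE, AFTER))
```

Splitting on the column separator rather than matching fixed text avoids being tripped up by the padding change visible in the second row.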

In terms of the hit to traffic from sretest1006, which is connected to this switch, it dropped 6 pings:

64 bytes from install1005.wikimedia.org (2620:0:861:2:208:80:154:134): icmp_seq=18 ttl=63 time=0.509 ms
64 bytes from install1005.wikimedia.org (2620:0:861:2:208:80:154:134): icmp_seq=25 ttl=63 time=0.275 ms

Though it is worth noting that only 50% of the traffic should be affected, as other flows will have been using the tunnel to the other spine. So not hitless, but definitely less disruptive than having to force both BGP sessions to be down concurrently.
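The count of lost pings follows from the gap in `icmp_seq` between the two replies shown. A small sketch of that arithmetic, assuming replies arrive in order and without duplicates (the function name is hypothetical):

```python
import re

# The two reply lines either side of the outage, copied from the ping output.
PING_OUTPUT = """
64 bytes from install1005.wikimedia.org (2620:0:861:2:208:80:154:134): icmp_seq=18 ttl=63 time=0.509 ms
64 bytes from install1005.wikimedia.org (2620:0:861:2:208:80:154:134): icmp_seq=25 ttl=63 time=0.275 ms
"""


def lost_pings(ping_output: str) -> int:
    """Count sequence numbers missing between consecutive replies."""
    seqs = [int(s) for s in re.findall(r"icmp_seq=(\d+)", ping_output)]
    return sum(later - earlier - 1 for earlier, later in zip(seqs, seqs[1:]))


if __name__ == "__main__":
    print(lost_pings(PING_OUTPUT))  # sequence numbers 19 to 24 are missing
```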

Thinking it through, I think this process could be used to "drain" ssw1-d1 if we wanted to attempt this without any impact to hosts.

  • Set cr2 to be VRRP master for all vlans
    • This will ensure row a/b hosts send traffic to cr2, which will then route to ssw1-d8
    • It will ensure that only ssw1-d8 learns the VRRP GW MAC for the row c/d vlans, so leaf switches will not see route to it from ssw1-d1
  • Disable VRRP for row-wide vlan sub-interfaces of cr1-eqiad et-1/0/5 - P89818
    • This is needed as we don't want to create a VRRP "split brain" scenario (P89795)
  • Disable the EVPN IBGP peering between ssw1-d1 and ssw1-d8:
    • ssw1-d1: set / network-instance default protocols bgp neighbor 10.64.128.18 admin-state disable
    • This ensures that ssw1-d8 does not reflect routes from ssw1-d1 to leafs
    • Which means clearing ssw1-d1 BGP session to leaf will remove all routes using it as next-hop
  • Increase the OSPF cost on the far-side of all transport links terminating on cr1
    • This will ensure traffic from other sites to row c/d vlans should instead arrive on cr2, and take path out via ssw1-d8
  • Adjust the ssw1-d1 BGP config to not accept or announce any routes to cr1 or other row e/f spines
    • By changing the import/export policies to 'NONE' - P89816
  • Adjust the cr1 BGP policy for row e/f and cloudsw to not export directly connected routes
    • cr1-eqiad: delete policy-options policy-statement Switch_out term direct
    • This ensures no L3 switches will use cr1 to get to row c/d vlans, instead they will use cr2 uplink

Result

At this point we should be able to observe the graphs and see traffic reduced to zero on the cr1 -> ssw1-d1 link. Because:

  • Traffic from rows a/b will use cr2 as gateway, due to VRRP, and it will use link to ssw1-d8
  • Traffic from rows e/f will use cr2 to get to rows c/d, as we stopped exporting "direct" routes from cr1 in BGP
  • Traffic from remote sites will route to cr2 over WAN
  • Traffic to c/d per-rack vlans will route to cr2, as cr1 no longer receives them in BGP due to policy change
  • Traffic for CR IP gateways will route to ssw1-d8 from every leaf, as that is where VRRP MAC is learnt
  • Traffic to external IP destinations from c/d per-rack vlans will route to ssw1-d8 from leafs, as those ranges are not being accepted by ssw1-d1 in BGP
  • Traffic between c/d row-wide vlans will use cr2 as gateway, which will hairpin it back down through its link to ssw1-d8

Which I think covers everything?

We will still have a vxlan tunnel to ssw1-d1 on every leaf, but this should only be due to the unicast MAC addresses learnt on that spine from the CRs. We should check these are the only routes known with the spine next-hop:

# NOTE: spacing for the grep might be different based on SRL column widths, we want to grep for the IP in the 'next-hop' column
show network-instance default protocols bgp routes evpn route-type 2 summary | grep "| 0      | 10.64.128.17 "
show network-instance default protocols bgp routes evpn route-type 5 summary | grep "| 0      | 10.64.128.17 "
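Because the grep depends on column padding, a column-aware filter over saved output may be more robust. The sketch below is an illustration only: the function name is hypothetical, and the next-hop column position must be checked against the header of the actual `summary` output. The sample rows are the `received-routes evpn` lines shown earlier in this task, where the next-hop is the sixth cell.

```python
# Rows copied from the received-routes output earlier in this task.
SAMPLE = """
| *   | 10.64.128.17:5000 | 0 | 208.80.154.128/26 | 0 | 10.64.128.17 | - | 100 |
| u*> | 10.64.128.18:5000 | 0 | 208.80.154.128/26 | 0 | 10.64.128.18 | - | 100 |
"""


def rows_with_next_hop(output: str, next_hop: str, column: int) -> list[list[str]]:
    """Return table rows whose cell at `column` (0-based) equals `next_hop`."""
    rows: list[list[str]] = []
    for line in output.splitlines():
        if not line.lstrip().startswith("|"):
            continue  # skip borders, headers drawn with +/= and blank lines
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if column < len(cells) and cells[column] == next_hop:
            rows.append(cells)
    return rows


if __name__ == "__main__":
    # Column 5 is the next-hop in the sample rows; this is an assumption
    # for any other command output.
    for row in rows_with_next_hop(SAMPLE, "10.64.128.17", column=5):
        print(row)
```

Matching the whole cell also avoids false hits on the route-distinguisher column, which contains the same IP followed by `:5000`.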

I think if this is the case we can then proceed and use the tools command to tear down the BGP session from ssw1-d1 to all leafs, forcing them to rebuild a working vxlan tunnel with a valid ID. And then we could do it all in reverse for the other spine. Whether it is worth the effort versus organising rack maintenance I don't know.

Will all of the switches in rows C & D be getting this configuration change?

I'm asking because I've got another host that is exhibiting a reimage failure and that's dse-k8s-worker1010, which is connected to lsw1-d6-eqiad.

My assumption is that it's the same bug, so I might have to wait for a similar maintenance window before proceeding with the reimage.
If I'm wrong about that, please let me know and I will investigate some more.

> Will all of the switches in rows C & D be getting this configuration change?

Yes we need to fix it on all of them.

> My assumption is that it's the same bug, so I might have to wait for a similar maintenance window before proceeding with the reimage.

I'm currently planning to drain one spine switch and then the other, to allow us to clear the issue without disrupting servers. I am planning to do that tomorrow morning (EU time). I will confirm afterwards whether it was successful; hopefully you can then complete the reimage.

cmooney lowered the priority of this task from Medium to Low. Mar 17 2026, 3:13 PM

OK, all vxlan tunnels on row c/d leaf switches to ssw1-d1-eqiad and ssw1-d8-eqiad now have a valid vxlan tunnel id. So unless something causes that to change (it shouldn't), we should not hit this issue again.

When Nokia have an updated software version with this issue fixed, and other bugfixes and features we need, we will upgrade the switches at which point we can close this task.