
Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage
Closed, Resolved (Public)

Description

Problem

It seems some of the config we have deployed on the new switches in codfw is preventing hosts being reimaged on the public vlans in those rows.

Specifically, the problem stems from the decision not to configure a unique IP on each switch in these vlans, in order to conserve IP address space. As a reminder, the normal "anycast gw" config we have for vlans that traverse multiple boxes is like this (note the GW IP is configured as a 'virtual-gateway-address'):

set interfaces irb unit 2018 virtual-gateway-accept-data
set interfaces irb unit 2018 description "Subnet private1-b-codfw"
set interfaces irb unit 2018 family inet address 10.192.16.84/22 preferred
set interfaces irb unit 2018 family inet address 10.192.16.84/22 virtual-gateway-address 10.192.16.1

To avoid having to allocate an IP from a public vlan across every top-of-rack switch in the fabric, we use a different config for the public vlan gateways, simply configuring the gateway IP as the main IP on every switch:

set interfaces irb unit 2002 description "Subnet public1-b-codfw"
set interfaces irb unit 2002 family inet address 208.80.153.33/27

The problem here arises when the switch DHCP relay happens on the public vlan. The switch chooses the single, shared IP to source the relayed packets from, which duly hit the install server. However the reply is not routed back to the correct switch in most cases, as when it hits the spine layer the spine has a route to that IP from every leaf in the row, and picks one randomly.
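
A toy model of the return path may make the failure mode concrete. This is only a sketch: it assumes a row of 8 leaf switches with illustrative names, a hypothetical routing-table dictionary, and a uniform random pick among equal routes (a simplification of real ECMP hashing). 10.192.255.15 stands in for a per-switch loopback address.

```python
import random

# Hypothetical row of 8 leaf switches; the names are illustrative.
LEAVES = [f"lsw1-b{n}-codfw" for n in range(1, 9)]

# Spine view: the shared gateway IP is reachable via every leaf,
# while a per-switch loopback is reachable via exactly one leaf.
SPINE_ROUTES = {
    "208.80.153.33": LEAVES,               # shared IRB address on all leaves
    "10.192.255.15": ["lsw1-b8-codfw"],    # unique loopback (assumed owner)
}

def pick_next_hop(dst_ip, rng):
    """Return the leaf the spine forwards to: uniform choice among equal routes."""
    return rng.choice(SPINE_ROUTES[dst_ip])

def delivery_rate(dst_ip, relay_leaf, trials=10_000, seed=0):
    """Fraction of replies that land on the leaf that relayed the request."""
    rng = random.Random(seed)
    hits = sum(pick_next_hop(dst_ip, rng) == relay_leaf for _ in range(trials))
    return hits / trials

shared = delivery_rate("208.80.153.33", "lsw1-b8-codfw")
unique = delivery_rate("10.192.255.15", "lsw1-b8-codfw")
print(f"shared gateway IP: {shared:.2f}, unique loopback: {unique:.2f}")
```

Under these assumptions roughly one reply in eight reaches the relaying switch when the shared IP is the destination, while a unique address always does.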

Potential Solutions

There are two potential solutions I could think of:

  1. Configure the public IRB GWs like we do the private ones, and allocate an additional, separate IP for each switch
    1. If we do this for all 8 switches in each row that's a lot of waste
    2. We could also only configure the GW at the Spine layer, but that makes routing quite inefficient and means the L3 GW is not on the connected device (and thus potential layer-2 complexities)
  2. Configure the switch to use a different IP when relaying DHCP requests to the install server

On the latter there do appear to be config options that allow this:

set routing-instances PRODUCTION forwarding-options dhcp-relay group dhcp_relay source-ip-change

The above config causes the packet to be sent from the switch's lo0.5000 IP, which is unique to each device. Right now those requests are blocked by the install server in iptables, but I also suspect that ISC DHCPd might not respond if the DHCP packet comes from a subnet it's not configured for.

Overall the second option seems better and more scalable if it can be made to work.
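
To put a rough number on the address cost of the first option, here is a small worked calculation. It assumes the public1-b-codfw subnet is 208.80.153.32/27 (derived from the 208.80.153.33/27 gateway config) and that each of the 8 switches would need one unique IP in addition to the shared gateway address.

```python
import ipaddress

subnet = ipaddress.ip_network("208.80.153.32/27")  # public1-b-codfw (assumed from the /27 above)
usable = subnet.num_addresses - 2                  # minus network and broadcast addresses
switches_per_row = 8

option_1 = switches_per_row + 1  # one unique IP per switch plus the shared gateway IP
current = 1                      # only the single shared gateway IP

print(f"usable: {usable}, option 1 uses {option_1} ({option_1 / usable:.0%}), "
      f"current config uses {current}")
```

With these assumptions option 1 would consume 9 of the 30 usable addresses in the subnet, against 1 today.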

Event Timeline

cmooney triaged this task as Medium priority. Feb 26 2024, 1:47 PM
cmooney created this task.

Digging a little deeper on this, the source IP of the packets hitting the install server doesn't really matter. What is more important is the value of the "Gateway Address" / giaddr field, which is a field within the DHCP discover message itself. For instance, with the dhcp_relay source-ip-change applied on a switch we see this kind of packet hit the install server:

14:11:50.232454 IP (tos 0x0, ttl 62, id 58132, offset 0, flags [none], proto UDP (17), length 654)
    10.192.255.15.67 > 208.80.153.105.67: [udp sum ok] BOOTP/DHCP, Request from 18:66:da:84:0e:a4, length 626, hops 1, xid 0xdc840ea4, secs 8, Flags [Broadcast] (0x8000)
	  Gateway-IP 208.80.153.33
	  Client-Ethernet-Address 18:66:da:84:0e:a4

As per RFC2131, the 'Gateway-IP' address from the above packet will be used as the destination for any DHCP response from the install server, regardless of the fact the packet came from 10.192.255.15:

If the 'giaddr' field in a DHCP message from a client is non-zero,
   the server sends any return messages to the 'DHCP server' port on the
   BOOTP relay agent whose address appears in 'giaddr'.
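
The rule quoted above can be sketched as a small function. This is an illustration of the giaddr rule only, not the server's actual code: the function name is hypothetical, and the non-relayed cases (giaddr of zero) are deliberately not modelled.

```python
import ipaddress

DHCP_SERVER_PORT = 67

def relayed_reply_destination(giaddr, packet_source):
    """Where a server sends its reply to a relayed request (RFC 2131 giaddr rule).

    packet_source is accepted only to show that it plays no part in the decision.
    Returns None when giaddr is zero; those cases are not modelled here.
    """
    if ipaddress.ip_address(giaddr) == ipaddress.ip_address("0.0.0.0"):
        return None
    return (giaddr, DHCP_SERVER_PORT)

# Values from the capture above: sourced from the loopback, giaddr is the shared GW IP.
print(relayed_reply_destination("208.80.153.33", "10.192.255.15"))
```

For the packet in the capture this yields 208.80.153.33 on port 67, so changing only the source IP leaves the reply heading for the shared gateway address.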

Another standard, RFC3527, was created specifically to deal with this scenario. It allows the packet source and 'Gateway IP' to be on a different subnet from the one an IP is being requested on (e.g. they could be our loopback IP), while still signalling to the DHCP server which link/subnet should be considered when finding an IP. That information is carried in a separate sub-option of the option 82 data. It seems possible to configure this on our QFX devices as follows:

set routing-instances PRODUCTION forwarding-options dhcp-relay relay-option-82 link-selection
set routing-instances PRODUCTION forwarding-options dhcp-relay group dhcp_relay interface irb.2002 overrides relay-source lo0.5000

With that configured the packets hitting the install server look as follows:

16:25:28.421093 IP (tos 0x0, ttl 62, id 46937, offset 0, flags [none], proto UDP (17), length 660)
    10.192.255.15.67 > 208.80.153.105.67: [udp sum ok] BOOTP/DHCP, Request from 18:66:da:84:0e:a4, length 632, hops 1, xid 0xdf840ea4, secs 64, Flags [Broadcast] (0x8000)
	  Gateway-IP 10.192.255.15
	  Client-Ethernet-Address 18:66:da:84:0e:a4
<------ cut ----->
	    Agent-Information (82), length 82: 
	      Circuit-ID SubOption 1, length 41: lsw1-b8-codfw:ge-0/0/41.0:public1-b-codfw
	      Unknown SubOption 12, length 31: 
		0x0000:  0002 0000 0000 0583 0100 0000 6230 3a65
		0x0010:  623a 3766 3a33 383a 6664 3a32 3000 00
	      Unknown SubOption 5, length 4: 
		0x0000:  d050 9921

Checking the value of the sub-option data, d050 9921, the octets translate to 208.80.153.33, the gateway IP of the public1-b-codfw vlan. When the request is made like that the install server responds with an offer just fine:

16:25:28.422002 IP (tos 0x0, ttl 64, id 36658, offset 0, flags [DF], proto UDP (17), length 510)
    208.80.153.105.67 > 10.192.255.15.67: [bad udp cksum 0x7585 -> 0x3a85!] BOOTP/DHCP, Reply, length 482, hops 1, xid 0xdf840ea4, secs 64, Flags [Broadcast] (0x8000)
	  Your-IP 208.80.153.43
	  Server-IP 208.80.153.105
	  Gateway-IP 10.192.255.15
	  Client-Ethernet-Address 18:66:da:84:0e:a4
	  file "lpxelinux.0"
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message (53), length 1: Offer
	    Server-ID (54), length 4: 208.80.153.105
	    Lease-Time (51), length 4: 43200
	    Subnet-Mask (1), length 4: 255.255.255.224
	    Default-Gateway (3), length 4: 208.80.153.33
	    Domain-Name-Server (6), length 4: 10.3.0.1
	    Hostname (12), length 11: "sretest2004"
	    Domain-Name (15), length 13: "wikimedia.org"
<------ cut ----->

Which all looks great. But for some reason it didn't work at PXE stage, the host just kept sending more DHCP requests as if it wasn't getting the reply. So either the switch is not sending the response back to the host for some reason, or something in the response isn't liked by the PXEboot process. Will run some more tests to investigate.
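
As a cross-check, the option 82 payload from the request above can be rebuilt and split into its sub-options. This is a minimal sketch: parse_suboptions is a hypothetical helper, and the bytes are copied from the tcpdump output. Sub-option 5 is the RFC3527 link-selection sub-option, which this tcpdump version labels as unknown.

```python
import ipaddress

def parse_suboptions(payload):
    """Split a relay agent information (option 82) payload into {code: value}."""
    subopts = {}
    i = 0
    while i < len(payload):
        code, length = payload[i], payload[i + 1]
        subopts[code] = payload[i + 2 : i + 2 + length]
        i += 2 + length
    return subopts

# Sub-option values copied from the capture above.
circuit_id = b"lsw1-b8-codfw:ge-0/0/41.0:public1-b-codfw"
subopt_12 = bytes.fromhex(
    "00020000000005830100000062303a65"
    "623a37663a33383a66643a32300000"
)
link_sel = bytes.fromhex("d0509921")

# Each sub-option is encoded as: code (1 byte), length (1 byte), value.
payload = (
    bytes([1, len(circuit_id)]) + circuit_id
    + bytes([12, len(subopt_12)]) + subopt_12
    + bytes([5, len(link_sel)]) + link_sel
)

subopts = parse_suboptions(payload)
print(len(payload))                       # total option 82 length
print(subopts[1].decode("ascii"))         # circuit ID
print(ipaddress.IPv4Address(subopts[5]))  # link-selection address
```

The rebuilt payload is 82 bytes, matching the length tcpdump reports, and sub-option 5 decodes to 208.80.153.33.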

Juniper seem to document this scenario here, and advise using the "link-selection" keyword:

https://www.juniper.net/documentation/us/en/software/nce/nce-216-evpn-dhcp-relay/topics/concept/nce-216-technical-overview.html

With the config I've been testing this isn't working right though. The install server is using the sub-option 5 information to select the subnet just fine, but for some odd reason the DHCP REQUEST the host sends after receiving the offer is not relayed back to the server.

DHCPDISCOVER on eno1 to 255.255.255.255 port 67 interval 8
DHCPOFFER of 208.80.153.43 from 208.80.153.33
DHCPREQUEST for 208.80.153.43 on eno1 to 255.255.255.255 port 67
DHCPREQUEST for 208.80.153.43 on eno1 to 255.255.255.255 port 67
DHCPDISCOVER on eno1 to 255.255.255.255 port 67 interval 7
<--- continues forever --->

After issuing a manual release of the IP and trying again things seem to be working as expected:

cmooney@install2004:/etc/dhcp$ sudo tcpdump -vvv -i ens13 -l -p -nn host 10.192.255.15 
tcpdump: listening on ens13, link-type EN10MB (Ethernet), snapshot length 262144 bytes

18:17:46.996708 IP (tos 0x0, ttl 62, id 49678, offset 0, flags [none], proto UDP (17), length 412)
    10.192.255.15.67 > 208.80.153.105.67: [udp sum ok] BOOTP/DHCP, Request from 18:66:da:84:0e:a4, length 384, hops 1, xid 0x7f82a356, Flags [none] (0x0000)
	  Gateway-IP 10.192.255.15
	  Client-Ethernet-Address 18:66:da:84:0e:a4
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message (53), length 1: Discover
	    Requested-IP (50), length 4: 208.80.153.43
	    Hostname (12), length 11: "sretest2004"
	    Parameter-Request (55), length 13: 
	      Subnet-Mask (1), BR (28), Time-Zone (2), Default-Gateway (3)
	      Domain-Name (15), Domain-Name-Server (6), Unknown (119), Hostname (12)
	      Netbios-Name-Server (44), Netbios-Scope (47), MTU (26), Classless-Static-Route (121)
	      NTP (42)
	    Agent-Information (82), length 82: 
	      Circuit-ID SubOption 1, length 41: lsw1-b8-codfw:ge-0/0/41.0:public1-b-codfw
	      Unknown SubOption 12, length 31: 
		0x0000:  0002 0000 0000 0583 0100 0000 6230 3a65
		0x0010:  623a 3766 3a33 383a 6664 3a32 3000 00
	      Unknown SubOption 5, length 4: 
		0x0000:  d050 9921
	    END (255), length 0
	    PAD (0), length 0, occurs 22
18:17:46.997134 IP (tos 0x0, ttl 64, id 26353, offset 0, flags [DF], proto UDP (17), length 420)
    208.80.153.105.67 > 10.192.255.15.67: [bad udp cksum 0x752b -> 0x97be!] BOOTP/DHCP, Reply, length 392, hops 1, xid 0x7f82a356, Flags [none] (0x0000)
	  Your-IP 208.80.153.43
	  Server-IP 208.80.153.105
	  Gateway-IP 10.192.255.15
	  Client-Ethernet-Address 18:66:da:84:0e:a4
	  file "lpxelinux.0"
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message (53), length 1: Offer
	    Server-ID (54), length 4: 208.80.153.105
	    Lease-Time (51), length 4: 43200
	    Subnet-Mask (1), length 4: 255.255.255.224
	    BR (28), length 4: 208.80.153.63
	    Default-Gateway (3), length 4: 208.80.153.33
	    Domain-Name (15), length 13: "wikimedia.org"
	    Domain-Name-Server (6), length 4: 10.3.0.1
	    Hostname (12), length 11: "sretest2004"
	    Agent-Information (82), length 82: 
	      Circuit-ID SubOption 1, length 41: lsw1-b8-codfw:ge-0/0/41.0:public1-b-codfw
	      Unknown SubOption 12, length 31: 
		0x0000:  0002 0000 0000 0583 0100 0000 6230 3a65
		0x0010:  623a 3766 3a33 383a 6664 3a32 3000 00
	      Unknown SubOption 5, length 4: 
		0x0000:  d050 9921
	    END (255), length 0
18:17:47.000275 IP (tos 0x0, ttl 62, id 49684, offset 0, flags [none], proto UDP (17), length 412)
    10.192.255.15.67 > 208.80.153.105.67: [udp sum ok] BOOTP/DHCP, Request from 18:66:da:84:0e:a4, length 384, hops 1, xid 0x7f82a356, Flags [none] (0x0000)
	  Gateway-IP 10.192.255.15
	  Client-Ethernet-Address 18:66:da:84:0e:a4
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message (53), length 1: Request
	    Server-ID (54), length 4: 208.80.153.105
	    Requested-IP (50), length 4: 208.80.153.43
	    Hostname (12), length 11: "sretest2004"
	    Parameter-Request (55), length 13: 
	      Subnet-Mask (1), BR (28), Time-Zone (2), Default-Gateway (3)
	      Domain-Name (15), Domain-Name-Server (6), Unknown (119), Hostname (12)
	      Netbios-Name-Server (44), Netbios-Scope (47), MTU (26), Classless-Static-Route (121)
	      NTP (42)
	    Agent-Information (82), length 82: 
	      Circuit-ID SubOption 1, length 41: lsw1-b8-codfw:ge-0/0/41.0:public1-b-codfw
	      Unknown SubOption 12, length 31: 
		0x0000:  0002 0000 0000 0583 0100 0000 6230 3a65
		0x0010:  623a 3766 3a33 383a 6664 3a32 3000 00
	      Unknown SubOption 5, length 4: 
		0x0000:  d050 9921
	    END (255), length 0
	    PAD (0), length 0, occurs 16
18:17:47.000744 IP (tos 0x0, ttl 64, id 26354, offset 0, flags [DF], proto UDP (17), length 420)
    208.80.153.105.67 > 10.192.255.15.67: [bad udp cksum 0x752b -> 0x94be!] BOOTP/DHCP, Reply, length 392, hops 1, xid 0x7f82a356, Flags [none] (0x0000)
	  Your-IP 208.80.153.43
	  Server-IP 208.80.153.105
	  Gateway-IP 10.192.255.15
	  Client-Ethernet-Address 18:66:da:84:0e:a4
	  file "lpxelinux.0"
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message (53), length 1: ACK
	    Server-ID (54), length 4: 208.80.153.105
	    Lease-Time (51), length 4: 43200
	    Subnet-Mask (1), length 4: 255.255.255.224
	    BR (28), length 4: 208.80.153.63
	    Default-Gateway (3), length 4: 208.80.153.33
	    Domain-Name (15), length 13: "wikimedia.org"
	    Domain-Name-Server (6), length 4: 10.3.0.1
	    Hostname (12), length 11: "sretest2004"
	    Agent-Information (82), length 82: 
	      Circuit-ID SubOption 1, length 41: lsw1-b8-codfw:ge-0/0/41.0:public1-b-codfw
	      Unknown SubOption 12, length 31: 
		0x0000:  0002 0000 0000 0583 0100 0000 6230 3a65
		0x0010:  623a 3766 3a33 383a 6664 3a32 3000 00
	      Unknown SubOption 5, length 4: 
		0x0000:  d050 9921
	    END (255), length 0

So I think the solution is:

  1. For any IRBs with only a single anycast IP on them, add "interface irb.2002 overrides relay-source lo0.5000" (with the relevant IRB unit) to the config
  2. Add the "link-selection" command to the config on EVPN switches to add the GW IP to the option 82 block in sub-option 5 (RFC3527).

The first ensures packets are sourced from the loopback IP of the switch, and the giaddr in DHCP messages being relayed is also set to the loopback.

The second adds the GW IP to the request so the install server can still select the correct options for the subnet.
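
How the two pieces interact on the server side can be sketched as follows. This is a simplified model of RFC3527 subnet selection, not ISC DHCPd's implementation: select_subnet and CONFIGURED_SUBNETS are hypothetical names, and the only configured subnet is assumed to be 208.80.153.32/27.

```python
import ipaddress

# Subnets the server has configuration for (only the one from this task, assumed /27).
CONFIGURED_SUBNETS = [ipaddress.ip_network("208.80.153.32/27")]

def select_subnet(giaddr, link_selection=None):
    """Pick the subnet to allocate from: link-selection wins, otherwise giaddr."""
    selector = ipaddress.ip_address(link_selection or giaddr)
    for subnet in CONFIGURED_SUBNETS:
        if selector in subnet:
            return subnet
    return None

# giaddr set to the loopback, without and with the link-selection sub-option.
print(select_subnet("10.192.255.15"))
print(select_subnet("10.192.255.15", "208.80.153.33"))
```

In this model, a loopback giaddr on its own matches no configured subnet, while adding the link-selection value selects 208.80.153.32/27 even though replies still go to the loopback.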

I'll do some more tests and then look at the automation if things are ok.

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest2003.codfw.wmnet with OS bookworm

Change 1006568 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Use loopback for DHCP relay on single-ip EVPN anycast GWs

https://gerrit.wikimedia.org/r/1006568

Regarding step 2 above (add the "link-selection" command to the config on EVPN switches to add the GW IP to the option 82 block in sub-option 5, RFC3527):

This command was only added in JunOS 21.2R1, so we can't add it in eqiad just yet. We don't have any stretched vlans on the evpn devices there so it's not critical, but just noting here.

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host sretest2003.codfw.wmnet with OS bookworm completed:

  • sretest2003 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402261915_cmooney_2252134_sretest2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host sretest2003.codfw.wmnet with OS bookworm completed:

  • sretest2003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402262003_cmooney_2263393_sretest2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Change 1006568 merged by jenkins-bot:

[operations/homer/public@master] Use loopback for DHCP relay on single-ip EVPN anycast GWs

https://gerrit.wikimedia.org/r/1006568

Patch tested again and still working consistently. I think the initial problems were probably due to cached dhcp-relay bindings on the switches (left over from before the config was right).

Merged successfully, closing task. When we upgrade the EVPN switches in eqiad we can remove the constraint that keeps this config from being applied there. I've added a note on that task to remind us.

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest2004.codfw.wmnet with OS bookworm