Investigate Ganeti in routed mode
Closed, ResolvedPublic

Description

Our Ganeti clusters currently run in bridged mode, which means that a guest VM has direct L2 adjacency with its hypervisor's switch VLANs.
For example, rpki1001 is on the "row C" Ganeti cluster and has a private1-c-eqiad IP, like any other private host (physical or VM) in that row.
As the "row C cluster" hypervisors are spread within that L2 domain (all racks in row C), moving the VM from one hypervisor to another (eg. for hypervisor maintenance) is seamless, as the VM stays in the same VLAN.

The new eqiad rows E and F differ from that model as the L2 domains are contained per rack (and not spread across the entire row) for various reasons, including stability (smaller failure domains).
Any overlay/tunneling-based solution (eg. VXLAN) is out of the equation.

If we were to deploy a Ganeti cluster in bridged mode, each Ganeti nodegroup (a subcluster grouping concept in Ganeti) would need to stay within its own rack, and VM mobility would be limited to that rack. That could itself be an option (eg. multiple tiny nodegroups of, say, 2 to 3 nodes).

However, Ganeti can also work in routed mode.
In that mode, the VMs have IPs different from the rows/rack subnet (eg. from a prefix reserved to VMs).
The hypervisor acts as a router and advertises to the network the IPs of the VMs it is hosting (eg. with BGP).
This allows hypervisors of the same cluster (nodegroup as well) to reside in various locations (even in different DCs, though not recommended).

This will require changes in provisioning (IP allocation) as well as in the tooling around Ganeti (to advertise/withdraw prefixes).

Some relevant links:

Details

Repo                              Branch      Lines +/-
operations/puppet                 production  +8 -0
operations/puppet                 production  +4 -23
operations/puppet                 production  +4 -0
operations/homer/public           master      +2 -0
operations/puppet                 production  +43 -39
operations/puppet                 production  +22 -0
operations/puppet                 production  +1 -0
operations/puppet                 production  +3 -3
operations/puppet                 production  +8 -32
operations/puppet                 production  +15 -1
operations/cookbooks              master      +2 -1
operations/puppet                 production  +14 -9
operations/puppet                 production  +3 -1
operations/software/spicerack     master      +67 -16
operations/puppet                 production  +12 -0
operations/puppet                 production  +35 -19
operations/puppet                 production  +31 -9
operations/homer/public           master      +0 -4
operations/puppet                 production  +4 -5
operations/puppet                 production  +1 -1
operations/puppet                 production  +0 -1
operations/puppet                 production  +4 -2
operations/dns                    master      +10 -0
operations/puppet                 production  +3 -0
operations/puppet                 production  +11 -1
operations/homer/public           master      +1 -1
operations/software/homer/deploy  master      +1 -0
operations/homer/public           master      +54 -0
operations/cookbooks              master      +53 -18
operations/dns                    master      +1 -0
operations/software/spicerack     master      +71 -23
operations/puppet                 production  +8 -0
operations/puppet                 production  +375 -11
operations/puppet                 production  +63 -80

Event Timeline

Change 990968 merged by Ayounsi:

[operations/puppet@production] Puppet: Routed Ganeti support

https://gerrit.wikimedia.org/r/990968

Change 993662 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] ganeti: Create /var/lib/ganeti/rapi in Puppet

https://gerrit.wikimedia.org/r/993662

Mentioned in SAL (#wikimedia-operations) [2024-01-29T09:56:29Z] <XioNoX> enable Puppet on all the ganeti servers for CR990968 deployment - T300152

Mentioned in SAL (#wikimedia-operations) [2024-01-29T10:00:58Z] <moritzm> upload prometheus-ganeti-exporter 0.3+deb12u1 to apt.wikimedia.org T300152

Change 993662 merged by Muehlenhoff:

[operations/puppet@production] ganeti: Create /var/lib/ganeti/rapi in Puppet

https://gerrit.wikimedia.org/r/993662

Change 991325 merged by jenkins-bot:

[operations/software/spicerack@master] Spicerack: Add support for routed Ganeti

https://gerrit.wikimedia.org/r/991325

Change 993669 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/dns@master] Add routed ganeti VIP A record

https://gerrit.wikimedia.org/r/993669

Change 993669 merged by Ayounsi:

[operations/dns@master] Add routed ganeti VIP A record

https://gerrit.wikimedia.org/r/993669

Mentioned in SAL (#wikimedia-operations) [2024-01-29T11:38:27Z] <moritzm> upload ganeti 3.0.2-3+wmf1 (bookworm package of Ganeti plus backport for SSL chain handling in RAPI) to apt.wikimedia.org T300152

cookbooks.sre.hosts.decommission executed by ayounsi@cumin2002 for hosts: sretest1005.eqiad.wmnet

  • sretest1005.eqiad.wmnet (WARN)
    • Host not found on Icinga, unable to downtime it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox

Change 991348 merged by jenkins-bot:

[operations/cookbooks@master] sre.ganeti: add support for routed Ganeti

https://gerrit.wikimedia.org/r/991348

Change 993090 merged by jenkins-bot:

[operations/homer/public@master] Homer-public: add Ganeti BGP group

https://gerrit.wikimedia.org/r/993090

Change 993089 merged by Ayounsi:

[operations/software/homer/deploy@master] wmf-netbox: add Ganeti BGP group support

https://gerrit.wikimedia.org/r/993089

Change 993760 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] vms_import policy: fix typo

https://gerrit.wikimedia.org/r/993760

Change 993760 merged by jenkins-bot:

[operations/homer/public@master] vms_import policy: fix typo

https://gerrit.wikimedia.org/r/993760

Change 994114 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Ganeti: readvertise netdev_master VIP

https://gerrit.wikimedia.org/r/994114

Change 994114 merged by Ayounsi:

[operations/puppet@production] Ganeti: readvertise netdev_master VIP

https://gerrit.wikimedia.org/r/994114

Change 994173 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Add routed ganeti cluster to Netbox sync jobs

https://gerrit.wikimedia.org/r/994173

Change 994173 merged by Ayounsi:

[operations/puppet@production] Add routed ganeti cluster to Netbox sync jobs

https://gerrit.wikimedia.org/r/994173

cookbooks.sre.hosts.decommission executed by ayounsi@cumin1002 for hosts: srestest2005.codfw.wmnet

  • srestest2005.codfw.wmnet (WARN)
    • Host not found on Icinga, unable to downtime it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw02 to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw02 to Netbox

Change 994223 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] DHCP: set "use-host-decl-names on"

https://gerrit.wikimedia.org/r/994223

Change 994246 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/dns@master] DNS: add includes for private1-virtual-codfw DNS PTRs

https://gerrit.wikimedia.org/r/994246

Change 994246 merged by Ayounsi:

[operations/dns@master] DNS: add includes for private1-virtual-codfw DNS PTRs

https://gerrit.wikimedia.org/r/994246

Change 994661 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Switch ganeti/routed PoC servers to nftables

https://gerrit.wikimedia.org/r/994661

Change 994661 merged by Muehlenhoff:

[operations/puppet@production] Switch ganeti/routed PoC servers to nftables

https://gerrit.wikimedia.org/r/994661

Change 994663 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Enable forwarding more broadly and fix nftables bug

https://gerrit.wikimedia.org/r/994663

Change 994663 merged by Ayounsi:

[operations/puppet@production] Enable forwarding more broadly and fix nftables bug

https://gerrit.wikimedia.org/r/994663

Change 994666 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Routed Ganeti: rollback global v6 forwarding

https://gerrit.wikimedia.org/r/994666

Change 994667 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Routed Ganeti: fix nftables bug

https://gerrit.wikimedia.org/r/994667

Change 994666 merged by Ayounsi:

[operations/puppet@production] Routed Ganeti: rollback global v6 forwarding

https://gerrit.wikimedia.org/r/994666

Change 994667 merged by Ayounsi:

[operations/puppet@production] Routed Ganeti: fix nftables bug

https://gerrit.wikimedia.org/r/994667

Current status, ignoring IPv6 for now.

The cluster VIP is dynamically announced from the primary cluster node.

A limitation of isc-dhcp-relay: it needs to be run with the tap interfaces defined explicitly, for example sudo /usr/sbin/dhcrelay 208.80.153.105 -U eno12399np0 -i tap0 -d (multiple -i can be defined).
The DHCP relay is only needed during re-images, but there is no clean path to have it working.
If the tap interface doesn't exist, the daemon doesn't start.
A few options were discussed to start the relay with the proper arguments:

  • a wrapper daemon (eg. monitoring the list of interfaces)
  • using the net-common bash script
  • patching dhcp-relay
  • using the re-image cookbook
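
To make the first option more concrete, here is a minimal Python sketch of what a wrapper could compute before (re)starting the relay. It only builds the command line shown above from the tap interfaces that currently exist. The function names and the idea of reading /sys/class/net are assumptions for illustration, not what was deployed.

```python
import os


def list_tap_interfaces(sys_net="/sys/class/net"):
    """Return the names of the tap interfaces currently present, sorted."""
    try:
        names = os.listdir(sys_net)
    except FileNotFoundError:
        return []
    return sorted(name for name in names if name.startswith("tap"))


def build_dhcrelay_cmd(server, upstream, taps):
    """Build the dhcrelay argv, or None when there is no tap interface.

    dhcrelay refuses to start if a listed interface does not exist, so the
    caller should not start it at all when the list is empty.
    """
    if not taps:
        return None
    cmd = ["/usr/sbin/dhcrelay", server, "-U", upstream]
    for tap in taps:
        cmd += ["-i", tap]  # one -i per downstream (VM facing) interface
    cmd.append("-d")  # stay in the foreground, as in the manual test
    return cmd


if __name__ == "__main__":
    argv = build_dhcrelay_cmd("208.80.153.105", "eno12399np0", list_tap_interfaces())
    print(" ".join(argv) if argv else "no tap interface, relay not started")
```

A wrapper daemon would still need to restart the relay whenever the list of tap interfaces changes, which is the cumbersome part discussed in this task.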

When working around this issue manually, the VMs start properly all the way to Puppet running and the host being accessible over SSH. IP propagation over BGP works fine as well.

DHCP gives the VMs an IP from the proper range, but with a /23 subnet mask, as if the private1-virtual-codfw range were a VLAN. This works surprisingly well, but is bogus: for example, two routed VMs won't be able to communicate with each other until these changes are made on the VMs:

ip addr add 10.192.24.4/32 dev ens13
ip route add 10.192.24.1 dev ens13 scope link
ip addr del 10.192.24.4/23 dev ens13
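
For reference, the same three commands can be generated for any VM. This is only a sketch: the helper name is hypothetical, and it simply reproduces the manual fix above (add the host address, add an on-link route to the gateway, then drop the DHCP-provided address).

```python
import ipaddress


def host_route_fixup(ip, dhcp_prefixlen, gateway, dev):
    """Return the ip(8) commands replacing a DHCP-provided address with a host address."""
    addr = ipaddress.ip_address(ip)
    gw = ipaddress.ip_address(gateway)
    full = addr.max_prefixlen  # 32 for IPv4, 128 for IPv6
    return [
        f"ip addr add {addr}/{full} dev {dev}",
        f"ip route add {gw} dev {dev} scope link",
        f"ip addr del {addr}/{dhcp_prefixlen} dev {dev}",
    ]


if __name__ == "__main__":
    for command in host_route_fixup("10.192.24.4", 23, "10.192.24.1", "ens13"):
        print(command)
```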

Question is, when to set the proper IP/mask/route?

  • Directly at the DHCP allocation step, as initially planned using this blogpost as reference
    • Requires a more significant change on our DHCP config
    • Might make iPXE choke
  • At the 2nd DHCP request/allocation, done by the Debian Installer (D-I)
    • More complete environment, less risk of choking
    • Possibly complex to do on the DHCP side
    • If the D-I files are hosted on a VM in the same IP range, iPXE won't be able to fetch them
  • In the late_command.sh script
    • Much more flexible, but same issue with being able to fetch that script

Next step, find a good solution for those 2 points.

Indeed, dhcrelay not working as expected is a bit annoying, also because if we run one dhcrelay per VM, we'd also need to hook into VM shutdown to kill it. Otherwise, at the next startup on the same tap interface we'd get 2 instances running (unless the previous one crashes when the interface is deleted, but I doubt it).

As for the netmask, I agree to either try the DHCP solution to see if it works fine, or alternatively do it in the late_command. Would it also work in the PoPs where our install servers are VMs?

cookbooks.sre.hosts.decommission executed by ayounsi@cumin2002 for hosts: testvm2006.codfw.wmnet

  • testvm2006.codfw.wmnet (WARN)
    • Missing DNSName in Nebox for testvm2006, unable to verify it.
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw02 to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw02 to Netbox

For the latter, some more debug:
I added

shared-network "test" { 
        subnet 10.192.24.1 netmask 255.255.255.255 { 
            option subnet-mask 255.255.255.255; 
            option routers 10.192.24.1; 
        }
        subnet 10.192.24.3 netmask 255.255.255.255 {
            option subnet-mask 255.255.255.255;
            range 10.192.24.3;
            option routers 10.192.24.1;
        }
}

to /etc/dhcp/dhcpd.conf

DHCP server logs after the VM DHCP request:

Jan 31 17:15:39 install2004 dhcpd[1738530]: DHCPDISCOVER from aa:00:00:97:32:4f via 10.192.24.1
Jan 31 17:15:39 install2004 dhcpd[1738530]: DHCPOFFER on 10.192.24.3 to aa:00:00:97:32:4f via 10.192.24.1
Jan 31 17:15:39 install2004 dhcpd[1738530]: DHCPDISCOVER from aa:00:00:97:32:4f via 10.192.21.6
Jan 31 17:15:39 install2004 dhcpd[1738530]: DHCPOFFER on 10.192.24.3 to aa:00:00:97:32:4f via 10.192.21.6
Jan 31 17:15:40 install2004 dhcpd[1738530]: DHCPDISCOVER from aa:00:00:97:32:4f via 10.192.24.1
Jan 31 17:15:40 install2004 dhcpd[1738530]: DHCPOFFER on 10.192.24.3 to aa:00:00:97:32:4f via 10.192.24.1
Jan 31 17:15:40 install2004 dhcpd[1738530]: DHCPDISCOVER from aa:00:00:97:32:4f via 10.192.21.6
Jan 31 17:15:40 install2004 dhcpd[1738530]: DHCPOFFER on 10.192.24.3 to aa:00:00:97:32:4f via 10.192.21.6
Jan 31 17:15:42 install2004 dhcpd[1738530]: Dynamic and static leases present for 10.192.24.3.
Jan 31 17:15:42 install2004 dhcpd[1738530]: Remove host declaration sretest2005 or remove 10.192.24.3
Jan 31 17:15:42 install2004 dhcpd[1738530]: from the dynamic address pool for test
Jan 31 17:15:42 install2004 dhcpd[1738530]: DHCPREQUEST for 10.192.24.3 (208.80.153.105) from aa:00:00:97:32:4f via 10.192.21.6
Jan 31 17:15:42 install2004 dhcpd[1738530]: DHCPACK on 10.192.24.3 to aa:00:00:97:32:4f via 10.192.21.6
Jan 31 17:15:42 install2004 dhcpd[1738530]: Dynamic and static leases present for 10.192.24.3.
Jan 31 17:15:42 install2004 dhcpd[1738530]: Remove host declaration sretest2005 or remove 10.192.24.3
Jan 31 17:15:42 install2004 dhcpd[1738530]: from the dynamic address pool for test
Jan 31 17:15:42 install2004 dhcpd[1738530]: DHCPREQUEST for 10.192.24.3 (208.80.153.105) from aa:00:00:97:32:4f via 10.192.24.1
Jan 31 17:15:42 install2004 dhcpd[1738530]: DHCPACK on 10.192.24.3 to aa:00:00:97:32:4f via 10.192.24.1

And the DHCP exchange captured on tap0:

ayounsi@ganeti2033:~$ sudo dhcpdump -i tap0
  TIME: 2024-01-31 17:15:39.648
    IP: 0.0.0.0 (aa:0:0:97:32:4f) > 255.255.255.255 (ff:ff:ff:ff:ff:ff)
    OP: 1 (BOOTPREQUEST)
 HTYPE: 1 (Ethernet)
  HLEN: 6
  HOPS: 0
   XID: 3d7d025a
  SECS: 4
 FLAGS: 0
CIADDR: 0.0.0.0
YIADDR: 0.0.0.0
SIADDR: 0.0.0.0
GIADDR: 0.0.0.0
CHADDR: aa:00:00:97:32:4f:00:00:00:00:00:00:00:00:00:00
 SNAME: .
 FNAME: .
OPTION:  53 (  1) DHCP message type         1 (DHCPDISCOVER)
OPTION:  57 (  2) Maximum DHCP message size 1472
OPTION:  93 (  2) Client System             0000             ..
OPTION:  94 (  3) Client NDI                010201           ...
OPTION:  60 ( 32) Vendor class identifier   PXEClient:Arch:00000:UNDI:002001
OPTION:  77 (  4) User-class Identification 69505845         iPXE
OPTION:  55 ( 23) Parameter Request List      1 (Subnet mask)
					      3 (Routers)
					      6 (DNS server)
					      7 (Log server)
					     12 (Host name)
					     15 (Domainname)
					     17 (Root path)
					     26 (Interface MTU)
					     43 (Vendor specific info)
					     60 (Vendor class identifier)
					     66 (TFTP server name)
					     67 (Bootfile name)
					    119 (Domain Search)
					    128 (???)
					    129 (???)
					    130 (???)
					    131 (???)
					    132 (???)
					    133 (???)
					    134 (???)
					    135 (???)
					    175 (???)
					    203 (???)
					    
OPTION: 175 ( 48) ???                       b105011af41000eb ........
					    0301000017010122 ......."
					    0101130101110101 ........
					    2701011901012901 '.....).
					    0110010221010115 ....!...
					    0101180101120101 ........                 
OPTION:  61 (  7) Client-identifier         01:aa:00:00:97:32:4f
OPTION:  97 ( 17) UUID/GUID                 008fb1bf433e67d0 ....C>g.
					    4aa09d0b22f779a6 J...".y.
					    fd               .
---------------------------------------------------------------------------

  TIME: 2024-01-31 17:15:39.649
    IP: 10.192.24.1 (22:22:22:22:22:1) > 10.192.24.3 (aa:0:0:97:32:4f)
    OP: 2 (BOOTPREPLY)
 HTYPE: 1 (Ethernet)
  HLEN: 6
  HOPS: 1
   XID: 3d7d025a
  SECS: 4
 FLAGS: 0
CIADDR: 0.0.0.0
YIADDR: 10.192.24.3
SIADDR: 208.80.153.105
GIADDR: 10.192.21.6
CHADDR: aa:00:00:97:32:4f:00:00:00:00:00:00:00:00:00:00
 SNAME: .
 FNAME: lpxelinux.0.
OPTION:  53 (  1) DHCP message type         2 (DHCPOFFER)
OPTION:  54 (  4) Server identifier         208.80.153.105
OPTION:  51 (  4) IP address leasetime      43200 (12h)
OPTION:   1 (  4) Subnet mask               255.255.255.255
OPTION:   3 (  4) Routers                   10.192.24.1
OPTION:   6 (  4) DNS server                10.3.0.1
OPTION:  15 ( 11) Domainname                codfw.wmnet
OPTION:  17 ( 10) Root path                 /tftpboot/
OPTION:  43 ( 82) Vendor specific info      d1197078656c696e ..pxelin
					    75782e6366672f74 ux.cfg/t
					    747953302d313135 tyS0-115
					    323030d235687474 200.5htt
					    703a2f2f6170742e p://apt.
					    77696b696d656469 wikimedi
					    612e6f72672f7466 a.org/tf
					    7470626f6f742f62 tpboot/b
					    6f6f6b776f726d2d ookworm-
					    696e7374616c6c65 installe
					    722f             r/
---------------------------------------------------------------------------

  TIME: 2024-01-31 17:15:40.688
    IP: 0.0.0.0 (aa:0:0:97:32:4f) > 255.255.255.255 (ff:ff:ff:ff:ff:ff)
    OP: 1 (BOOTPREQUEST)
 HTYPE: 1 (Ethernet)
  HLEN: 6
  HOPS: 0
   XID: 3d7d025a
  SECS: 10
 FLAGS: 0
CIADDR: 0.0.0.0
YIADDR: 0.0.0.0
SIADDR: 0.0.0.0
GIADDR: 0.0.0.0
CHADDR: aa:00:00:97:32:4f:00:00:00:00:00:00:00:00:00:00
 SNAME: .
 FNAME: .
OPTION:  53 (  1) DHCP message type         1 (DHCPDISCOVER)
OPTION:  57 (  2) Maximum DHCP message size 1472
OPTION:  93 (  2) Client System             0000             ..
OPTION:  94 (  3) Client NDI                010201           ...
OPTION:  60 ( 32) Vendor class identifier   PXEClient:Arch:00000:UNDI:002001
OPTION:  77 (  4) User-class Identification 69505845         iPXE
OPTION:  55 ( 23) Parameter Request List      1 (Subnet mask)
					      3 (Routers)
					      6 (DNS server)
					      7 (Log server)
					     12 (Host name)
					     15 (Domainname)
					     17 (Root path)
					     26 (Interface MTU)
					     43 (Vendor specific info)
					     60 (Vendor class identifier)
					     66 (TFTP server name)
					     67 (Bootfile name)
					    119 (Domain Search)
					    128 (???)
					    129 (???)
					    130 (???)
					    131 (???)
					    132 (???)
					    133 (???)
					    134 (???)
					    135 (???)
					    175 (???)
					    203 (???)
					    
OPTION: 175 ( 48) ???                       b105011af41000eb ........
					    0301000017010122 ......."
					    0101130101110101 ........
					    2701011901012901 '.....).
					    0110010221010115 ....!...
					    0101180101120101 ........                 
OPTION:  61 (  7) Client-identifier         01:aa:00:00:97:32:4f
OPTION:  97 ( 17) UUID/GUID                 008fb1bf433e67d0 ....C>g.
					    4aa09d0b22f779a6 J...".y.
					    fd               .
---------------------------------------------------------------------------

  TIME: 2024-01-31 17:15:40.689
    IP: 10.192.24.1 (22:22:22:22:22:1) > 10.192.24.3 (aa:0:0:97:32:4f)
    OP: 2 (BOOTPREPLY)
 HTYPE: 1 (Ethernet)
  HLEN: 6
  HOPS: 1
   XID: 3d7d025a
  SECS: 10
 FLAGS: 0
CIADDR: 0.0.0.0
YIADDR: 10.192.24.3
SIADDR: 208.80.153.105
GIADDR: 10.192.21.6
CHADDR: aa:00:00:97:32:4f:00:00:00:00:00:00:00:00:00:00
 SNAME: .
 FNAME: lpxelinux.0.
OPTION:  53 (  1) DHCP message type         2 (DHCPOFFER)
OPTION:  54 (  4) Server identifier         208.80.153.105
OPTION:  51 (  4) IP address leasetime      43200 (12h)
OPTION:   1 (  4) Subnet mask               255.255.255.255
OPTION:   3 (  4) Routers                   10.192.24.1
OPTION:   6 (  4) DNS server                10.3.0.1
OPTION:  15 ( 11) Domainname                codfw.wmnet
OPTION:  17 ( 10) Root path                 /tftpboot/
OPTION:  43 ( 82) Vendor specific info      d1197078656c696e ..pxelin
					    75782e6366672f74 ux.cfg/t
					    747953302d313135 tyS0-115
					    323030d235687474 200.5htt
					    703a2f2f6170742e p://apt.
					    77696b696d656469 wikimedi
					    612e6f72672f7466 a.org/tf
					    7470626f6f742f62 tpboot/b
					    6f6f6b776f726d2d ookworm-
					    696e7374616c6c65 installe
					    722f             r/
---------------------------------------------------------------------------

  TIME: 2024-01-31 17:15:42.664
    IP: 0.0.0.0 (aa:0:0:97:32:4f) > 255.255.255.255 (ff:ff:ff:ff:ff:ff)
    OP: 1 (BOOTPREQUEST)
 HTYPE: 1 (Ethernet)
  HLEN: 6
  HOPS: 0
   XID: 3d7d025a
  SECS: 18
 FLAGS: 0
CIADDR: 0.0.0.0
YIADDR: 0.0.0.0
SIADDR: 0.0.0.0
GIADDR: 0.0.0.0
CHADDR: aa:00:00:97:32:4f:00:00:00:00:00:00:00:00:00:00
 SNAME: .
 FNAME: .
OPTION:  53 (  1) DHCP message type         3 (DHCPREQUEST)
OPTION:  57 (  2) Maximum DHCP message size 1472
OPTION:  93 (  2) Client System             0000             ..
OPTION:  94 (  3) Client NDI                010201           ...
OPTION:  60 ( 32) Vendor class identifier   PXEClient:Arch:00000:UNDI:002001
OPTION:  77 (  4) User-class Identification 69505845         iPXE
OPTION:  55 ( 23) Parameter Request List      1 (Subnet mask)
					      3 (Routers)
					      6 (DNS server)
					      7 (Log server)
					     12 (Host name)
					     15 (Domainname)
					     17 (Root path)
					     26 (Interface MTU)
					     43 (Vendor specific info)
					     60 (Vendor class identifier)
					     66 (TFTP server name)
					     67 (Bootfile name)
					    119 (Domain Search)
					    128 (???)
					    129 (???)
					    130 (???)
					    131 (???)
					    132 (???)
					    133 (???)
					    134 (???)
					    135 (???)
					    175 (???)
					    203 (???)
					    
OPTION: 175 ( 48) ???                       b105011af41000eb ........
					    0301000017010122 ......."
					    0101130101110101 ........
					    2701011901012901 '.....).
					    0110010221010115 ....!...
					    0101180101120101 ........                 
OPTION:  61 (  7) Client-identifier         01:aa:00:00:97:32:4f
OPTION:  97 ( 17) UUID/GUID                 008fb1bf433e67d0 ....C>g.
					    4aa09d0b22f779a6 J...".y.
					    fd               .
OPTION:  54 (  4) Server identifier         208.80.153.105
OPTION:  50 (  4) Request IP address        10.192.24.3
---------------------------------------------------------------------------

  TIME: 2024-01-31 17:15:42.665
    IP: 10.192.24.1 (22:22:22:22:22:1) > 10.192.24.3 (aa:0:0:97:32:4f)
    OP: 2 (BOOTPREPLY)
 HTYPE: 1 (Ethernet)
  HLEN: 6
  HOPS: 1
   XID: 3d7d025a
  SECS: 18
 FLAGS: 0
CIADDR: 0.0.0.0
YIADDR: 10.192.24.3
SIADDR: 208.80.153.105
GIADDR: 10.192.21.6
CHADDR: aa:00:00:97:32:4f:00:00:00:00:00:00:00:00:00:00
 SNAME: .
 FNAME: lpxelinux.0.
OPTION:  53 (  1) DHCP message type         5 (DHCPACK)
OPTION:  54 (  4) Server identifier         208.80.153.105
OPTION:  51 (  4) IP address leasetime      43200 (12h)
OPTION:   1 (  4) Subnet mask               255.255.255.255
OPTION:   3 (  4) Routers                   10.192.24.1
OPTION:   6 (  4) DNS server                10.3.0.1
OPTION:  15 ( 11) Domainname                codfw.wmnet
OPTION:  17 ( 10) Root path                 /tftpboot/
OPTION:  43 ( 82) Vendor specific info      d1197078656c696e ..pxelin
					    75782e6366672f74 ux.cfg/t
					    747953302d313135 tyS0-115
					    323030d235687474 200.5htt
					    703a2f2f6170742e p://apt.
					    77696b696d656469 wikimedi
					    612e6f72672f7466 a.org/tf
					    7470626f6f742f62 tpboot/b
					    6f6f6b776f726d2d ookworm-
					    696e7374616c6c65 installe
					    722f             r/
---------------------------------------------------------------------------
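
As a side note on reading the capture, the "Vendor specific info" (option 43) payload above is a plain sequence of code/length/value sub-options. The sketch below decodes the 82 bytes copied from the dump. Sub-option codes 209 and 210 are the PXELINUX "config file" and "path prefix" options (RFC 5071); the helper name is an assumption.

```python
# Hex payload of option 43, copied from the dhcpdump output above.
OPTION_43_HEX = (
    "d1197078656c696e" "75782e6366672f74" "747953302d313135"
    "323030d235687474" "703a2f2f6170742e" "77696b696d656469"
    "612e6f72672f7466" "7470626f6f742f62" "6f6f6b776f726d2d"
    "696e7374616c6c65" "722f"
)


def decode_suboptions(payload):
    """Decode a code/length/value byte string into {code: ascii_value}."""
    decoded = {}
    offset = 0
    while offset < len(payload):
        if offset + 2 > len(payload):
            raise ValueError("truncated sub-option header")
        code, length = payload[offset], payload[offset + 1]
        value = payload[offset + 2:offset + 2 + length]
        if len(value) != length:
            raise ValueError("truncated sub-option value")
        decoded[code] = value.decode("ascii")
        offset += 2 + length
    return decoded


if __name__ == "__main__":
    for code, value in decode_suboptions(bytes.fromhex(OPTION_43_HEX)).items():
        print(code, value)
```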

Unfortunately... iPXE doesn't seem to be compatible with this sort of setup.

Trying to work around it by setting a /24 subnet (and yes, it's possible to put it all in the cookbook-generated snippet):

shared-network "test" {
        subnet 10.192.24.1 netmask 255.255.255.255 {
            option subnet-mask 255.255.255.255;
            option routers 10.192.24.1;
        }
        subnet 10.192.24.0 netmask 255.255.255.0 {
            option subnet-mask 255.255.255.0;
            option routers 10.192.24.1;
        }
}
host sretest2005 {
    hardware ethernet aa:00:00:97:32:4f;
    fixed-address 10.192.24.3;
    option pxelinux.pathprefix "http://apt.wikimedia.org/tftpboot/bookworm-installer/";
}
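
To illustrate the remark about the cookbook-generated snippet, here is a rough sketch of a function rendering the configuration above from a few parameters. The function name and its signature are hypothetical and do not reflect the actual Spicerack DHCP code; the "test" shared-network name is the one used in this experiment.

```python
import ipaddress


def render_dhcp_snippet(hostname, mac, ip, gateway, subnet, path_prefix):
    """Render the shared-network and host stanzas used in this test."""
    net = ipaddress.ip_network(subnet)
    gw = ipaddress.ip_address(gateway)
    if ipaddress.ip_address(ip) not in net or gw not in net:
        raise ValueError("ip and gateway must both belong to subnet")
    lines = [
        'shared-network "test" {',
        f"        subnet {gw} netmask 255.255.255.255 {{",
        "            option subnet-mask 255.255.255.255;",
        f"            option routers {gw};",
        "        }",
        f"        subnet {net.network_address} netmask {net.netmask} {{",
        f"            option subnet-mask {net.netmask};",
        f"            option routers {gw};",
        "        }",
        "}",
        f"host {hostname} {{",
        f"    hardware ethernet {mac};",
        f"    fixed-address {ip};",
        f'    option pxelinux.pathprefix "{path_prefix}";',
        "}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_dhcp_snippet(
        "sretest2005", "aa:00:00:97:32:4f", "10.192.24.3", "10.192.24.1",
        "10.192.24.0/24", "http://apt.wikimedia.org/tftpboot/bookworm-installer/",
    ))
```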

Fails with the following iPXE error (https://ipxe.org/err/3c0920):

[Screenshot from 2024-01-31 18-26-46.png]

Manually configuring IPv6 is straightforward as well, once we know a couple of points:

When enabling forwarding on an interface (for example with sudo sysctl -w net.ipv6.conf.eno12399np0.forwarding=1 or sudo sysctl -w net.ipv6.conf.all.forwarding=1), the kernel automatically stops listening to inbound router advertisements (RAs) on that interface. As the default route is set using RAs in our infra, the host loses IPv6 connectivity.

To fix (or workaround) it, RA needs to be explicitly enabled before enabling forwarding, for example using sudo sysctl -w net.ipv6.conf.eno12399np0.accept_ra=2

Next, like IPv4, enabling IPv6 forwarding on the primary and tap interfaces isn't enough to make the host forward packets from external hosts. It needs to be enabled globally with sudo sysctl -w net.ipv6.conf.all.forwarding=1
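
Since the ordering matters (accept_ra=2 has to be set before forwarding is enabled), the sequence described above can be summarised with a small sketch. The helper name is hypothetical and the interface names are the ones from this test host.

```python
def forwarding_sysctls(uplink, taps):
    """Return the sysctl commands in an order that keeps the RA-learned default route."""
    # 1. Keep accepting router advertisements even with forwarding enabled.
    commands = [f"sysctl -w net.ipv6.conf.{uplink}.accept_ra=2"]
    # 2. Enable forwarding on the primary and tap interfaces.
    for iface in [uplink, *taps]:
        commands.append(f"sysctl -w net.ipv6.conf.{iface}.forwarding=1")
    # 3. Enable forwarding globally, otherwise packets from external hosts are not forwarded.
    commands.append("sysctl -w net.ipv6.conf.all.forwarding=1")
    return commands


if __name__ == "__main__":
    for command in forwarding_sysctls("eno12399np0", ["tap0"]):
        print(command)
```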

The rest of the IPv6 config is similar to the IPv4 one.
Hypervisor side:
sudo ip -6 route add 2620:0:860:140:10:192:24:3 proto static dev tap0
This is in addition to the line in net-common that adds the IP 2620:0:860:140::1/128 (not necessary, see below) to tap0.

VM side :

ip -6 addr del fe80::10:192:24:3/64 dev ens13            <- bogus from the D-I late_command.sh when no RA is running
ip -6 addr add 2620:0:860:140:10:192:24:3/128 dev ens13
ip -6 route add 2620:0:860:140::1 dev ens13 scope link
ip -6 route add default via 2620:0:860:140::1

Note that thanks to IPv6 link local addresses, it's possible to fully remove the last two lines :

# ip -6 route del default via 2620:0:860:140::1 dev ens13
# ip -6 route add default via fe80::2022:22ff:fe22:2201 dev ens13
# ip -6 route del 2620:0:860:140::1 dev ens13 scope link
# ping -6 en.wikipedia.org
PING en.wikipedia.org(text-lb.codfw.wikimedia.org (2620:0:860:ed1a::1)) 56 data bytes
64 bytes from text-lb.codfw.wikimedia.org (2620:0:860:ed1a::1): icmp_seq=1 ttl=60 time=0.379 ms
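
The link-local next hop used above looks like the EUI-64 address derived from the gateway MAC seen in the earlier dhcpdump output (22:22:22:22:22:1). That derivation is an observation, not something stated in this task; the sketch below shows the computation under that assumption.

```python
import ipaddress


def mac_to_link_local(mac):
    """Derive the EUI-64 based IPv6 link-local address from a MAC address."""
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6 or any(not 0 <= octet <= 255 for octet in octets):
        raise ValueError(f"invalid MAC address: {mac}")
    octets[0] ^= 0x02  # flip the universal/local bit
    eui64 = octets[:3] + [0xFF, 0xFE] + octets[3:]  # insert ff:fe in the middle
    groups = [f"{(eui64[i] << 8) | eui64[i + 1]:x}" for i in range(0, 8, 2)]
    return str(ipaddress.IPv6Address("fe80::" + ":".join(groups)))


if __name__ == "__main__":
    print(mac_to_link_local("22:22:22:22:22:01"))
```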

At first I thought it would be convenient to have 2620:0:860:140::1 in the traceroute/mtr output. But it actually does show up when only defined on the hypervisor side. So we can keep the VM side lean.

Next steps on that front :

  • Figure out how to do the v4-mapped-v6 for the VM static route in the net-common script (probably define the prefix in the script using Puppet, then copy/adapt what's done in late_command.sh)
  • Figure out the best path to configure the VM side, possibly a mix of radvd and patch to late_command.sh
  • Puppetize the sysctl calls
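
For the first item, the v4-to-v6 mapping used throughout this task (10.192.24.3 becoming 2620:0:860:140:10:192:24:3) can be sketched as follows. The function name is hypothetical, and the /64 prefix length is an assumption based on the addresses shown above.

```python
import ipaddress


def v4_mapped_v6(prefix, ipv4):
    """Embed the decimal octets of an IPv4 address in the last 4 groups of a /64."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("this sketch only handles /64 prefixes")
    v4 = ipaddress.IPv4Address(ipv4)
    head = net.network_address.exploded.split(":")[:4]  # the 4 prefix groups
    tail = str(v4).split(".")  # each octet is reused as-is as a hex group
    return str(ipaddress.IPv6Address(":".join(head + tail)))


if __name__ == "__main__":
    print(v4_mapped_v6("2620:0:860:140::/64", "10.192.24.3"))
```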

Change 994997 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Routed Ganeti: enable IPv6 forwarding

https://gerrit.wikimedia.org/r/994997

Change 995032 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Routed Ganeti: Add v6 static route to VM

https://gerrit.wikimedia.org/r/995032

@ayounsi Apologies for the trouble, I didn't realize sretest2005 was in active use. Unfortunately, I reimaged it while I was working on T345778 . The host is re-imaged and ready for use again. Sorry about that.

No problem at all ! Glad it was useful.

Now that everything works as expected when configured manually, the "last" part is to automatically do the VM's IP config.

In some way, the question is : how to get v4 and v6 data from Netbox to the guest VM?

As a reminder, the current setup uses the following :

  • The cookbook/Spicerack gets the hostname's v4 data from Netbox and exposes it to the VM through DHCP
  • For v6, the router's RA advertises the v6 prefix to use in the v4-to-v6 mapping script in late_command.sh

I opened a ticket with iPXE to see if it could support /32 v4 or /128 v6 DHCP allocations: https://github.com/ipxe/ipxe/issues/1141. In the meantime, I'm looking at alternative ways to transfer such info at the earliest possible provisioning stage.

For example I looked at using LLDP (for example to advertise the v6 prefix), but it can only be configured globally.

ganeti2033:~$ sudo lldpcli configure lldp custom-tlv add oui 11,22,33 subtype 44 oui-info 74,65,73,74

sretest2005:~$ sudo lldpctl
-------------------------------------------------------------------------------
LLDP neighbors:
-------------------------------------------------------------------------------
Interface:    ens13, via: LLDP, RID: 1, Time: 0 day, 00:22:57
[...]
  Unknown TLVs:
    TLV:          OUI: 11,22,33, SubType: 44, Len: 4 74,65,73,74

Where 74,65,73,74 is "test" in hex.
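
The conversion between a string and the comma-separated hex bytes expected by the oui-info argument is easy to script. A small sketch, with hypothetical helper names:

```python
def to_oui_info(text):
    """Encode an ASCII string as comma-separated hex bytes."""
    return ",".join(f"{byte:02x}" for byte in text.encode("ascii"))


def from_oui_info(value):
    """Decode comma-separated hex bytes back into an ASCII string."""
    return bytes(int(part, 16) for part in value.split(",")).decode("ascii")


if __name__ == "__main__":
    print(to_oui_info("test"))
    print(from_oui_info("74,65,73,74"))
```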

I also looked at using qemu guest agent:
ganeti2034:~$ sudo gnt-instance modify -H use_guest_agent sretest2005.codfw.wmnet (then restart the VM)
sretest2005:~$ sudo systemctl start qemu-guest-agent (it's already installed on the VMs)
ganeti2033:~$ sudo socat - unix:/var/run/ganeti/kvm-hypervisor/ctrl/sretest2005.codfw.wmnet.qga
From that socket it's possible to create a file on the guest VM, for example by following the JSON commands in https://wiki.qemu.org/Features/GuestAgent, though I'm not sure how secure that is.
Unfortunately it's quite cumbersome to use and requires a daemon running on the VM.
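
To give an idea of why this is cumbersome, writing a file through the guest agent takes three JSON requests (guest-file-open, guest-file-write, guest-file-close), the second one needing the handle returned by the first. The sketch below only builds the request lines to send over the socket; the helper names and the target path are hypothetical, and this was not tested end to end.

```python
import base64
import json


def qga_open_request(path):
    """Request opening a file for writing; the reply carries an integer handle."""
    return json.dumps({"execute": "guest-file-open",
                       "arguments": {"path": path, "mode": "w"}})


def qga_write_request(handle, content):
    """Request writing bytes to an open handle; the payload is base64 encoded."""
    return json.dumps({"execute": "guest-file-write",
                       "arguments": {"handle": handle,
                                     "buf-b64": base64.b64encode(content).decode("ascii")}})


def qga_close_request(handle):
    """Request closing the handle so the data is flushed on the guest."""
    return json.dumps({"execute": "guest-file-close",
                       "arguments": {"handle": handle}})


if __name__ == "__main__":
    # /run/routed-ganeti-ip is a made-up path, and 1000 a made-up handle value.
    print(qga_open_request("/run/routed-ganeti-ip"))
    print(qga_write_request(1000, b"10.192.24.3\n"))
    print(qga_close_request(1000))
```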

Last, and maybe the best option, using fw_cfg:
ganeti2034:~$ sudo gnt-instance modify -H kvm_extra="-fw_cfg name=net/ip\,string=10.192.24.3 -fw_cfg name=net/ip6\,string=2620:0:860:140:10:192:24:3" sretest2005.codfw.wmnet (then restart the VM)
sretest2005:~$ sudo cat /sys/firmware/qemu_fw_cfg/by_name/net/ip/raw returns 10.192.24.3
sretest2005:~$ sudo cat /sys/firmware/qemu_fw_cfg/by_name/net/ip6/raw returns 2620:0:860:140:10:192:24:3

This could be set once and for all during VM creation and queried, at worst, in late_command.sh for proper IP configuration.
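
As a sketch of both ends of that idea: one helper builds the kvm_extra value passed to gnt-instance (with the comma escaped as in the command above), the other returns the path to read on the guest. The names are hypothetical and this is not the actual Spicerack change.

```python
from pathlib import Path

FW_CFG_BASE = "/sys/firmware/qemu_fw_cfg/by_name"


def build_kvm_extra(ip4, ip6):
    """Build the kvm_extra value exposing the v4 and v6 IPs as fw_cfg entries."""
    # The comma is escaped with a backslash, as in the gnt-instance command above.
    return (f"-fw_cfg name=net/ip\\,string={ip4} "
            f"-fw_cfg name=net/ip6\\,string={ip6}")


def fw_cfg_path(name, base=FW_CFG_BASE):
    """Return the sysfs path exposing a fw_cfg entry inside the guest."""
    return Path(base) / name / "raw"


def read_fw_cfg(name, base=FW_CFG_BASE):
    """Read a fw_cfg entry from inside the guest (typically requires root)."""
    return fw_cfg_path(name, base).read_text().strip()


if __name__ == "__main__":
    print(build_kvm_extra("10.192.24.3", "2620:0:860:140:10:192:24:3"))
    print(fw_cfg_path("net/ip"))
```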

Not tested, but in the longer run we could imagine late_command.sh (or similar) querying a Netbox endpoint to directly fetch its final and full IP configuration, for example to do the initial systemd-networkd config (see also T234207: Investigate improvements to how puppet manages network interfaces).

However, it doesn't solve the issue of PXE (and then D-I) getting/setting a bogus netmask, which is at best not clean, and at worst could prevent iPXE from fetching D-I, and D-I from fetching everything else.

Note that one of the prerequisites is to stay as close to the initial provisioning process as possible, thus keeping DHCP/PXE, etc. But I'm wondering what the tradeoff would be if we could boot from a Debian image instead of from the network, and for example use a "preseed/early_command" to configure the IPs from the fw_cfg entries.
This would also have the benefit of speeding up the process.

Change 1003416 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Add support for routed Ganeti in D-I early_command.sh

https://gerrit.wikimedia.org/r/1003416

Change 1003452 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Routed Ganeti: use per tap interface dhcrelay

https://gerrit.wikimedia.org/r/1003452

Change 1003464 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Add routed ganeti support to late_command.sh

https://gerrit.wikimedia.org/r/1003464

Change 1003490 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/cookbooks@master] makevm: pass the v6 IP to GntInstance.add

https://gerrit.wikimedia.org/r/1003490

Change 1003491 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/software/spicerack@master] Ganeti: pass the v4 and v6 IPs to the VM as fw_cfg

https://gerrit.wikimedia.org/r/1003491

Change 1003511 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Remove BFD from routed ganeti peerings on router side

https://gerrit.wikimedia.org/r/1003511

Change 994997 merged by Ayounsi:

[operations/puppet@production] Routed Ganeti: enable IPv6 forwarding

https://gerrit.wikimedia.org/r/994997

Change 1003511 merged by jenkins-bot:

[operations/homer/public@master] Remove BFD from routed ganeti peerings on router side

https://gerrit.wikimedia.org/r/1003511

Change 1003605 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Create a dedicated role for routed Ganeti

https://gerrit.wikimedia.org/r/1003605

Change 1003605 merged by Muehlenhoff:

[operations/puppet@production] Create a dedicated role for routed Ganeti

https://gerrit.wikimedia.org/r/1003605

Change 1003464 merged by Ayounsi:

[operations/puppet@production] Add routed ganeti support to late_command.sh

https://gerrit.wikimedia.org/r/1003464

In theory, if all those patches are merged/deployed, the VM will use /32 IPs from early_command.sh all the way to its final state, and set up the v6 /128 along the way.
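For reference, the v6 address used for sretest2005 earlier in this task (2620:0:860:140:10:192:24:3 for 10.192.24.3) looks like the v4 octets written out as the last four groups of a /64. Assuming that is the mapping, the /128 could be derived as sketched below. This is not the add_ip6_mapped script, `mapped_v6` is a hypothetical name, and the /64 prefix is inferred from that single example.

```python
import ipaddress


def mapped_v6(prefix: str, ip4: str) -> ipaddress.IPv6Address:
    """Embed the v4 octets, read as hex groups, into the low 64 bits of a /64."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("this sketch only handles /64 prefixes")
    octets = str(ipaddress.IPv4Address(ip4)).split(".")
    # "192" becomes the group 0x0192: decimal digits are reused as hex digits.
    interface_id = int("".join(f"{int(octet, 16):04x}" for octet in octets), 16)
    return ipaddress.IPv6Address(int(net.network_address) + interface_id)


print(mapped_v6("2620:0:860:140::/64", "10.192.24.3"))
```

With the inferred prefix this prints 2620:0:860:140:10:192:24:3, matching the address set through fw_cfg above.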

That means it will still use the bogus /23 IP during the Debian Installer and early_command.sh fetch steps, so those files can't be hosted on a host in the same IP range (except the gateway).
The clean path is to have iPXE and debian-installer support /32s out of the box, which is why I opened upstream tasks: https://github.com/ipxe/ipxe/issues/1141 and https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1064005
In the meantime I don't see it as a blocker as I tested multiple workarounds:

  • serve the files from a different site (by changing next-server in DHCP)
  • serve the files from the hypervisors (on the VM's gateway IP), this has the side benefit of speeding up the re-imaging process.
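The constraint behind these workarounds can be shown with a small sketch. With the bogus /23, the VM considers every address in that range directly reachable on the link and never sends the traffic to its gateway, while in routed mode there is no shared L2 segment to answer. The server addresses below are made up for the example; only 10.192.24.3 and the /23 come from this task.

```python
import ipaddress


def is_on_link(vm_interface: str, server: str) -> bool:
    """True if the VM would try to reach `server` directly, bypassing its gateway."""
    network = ipaddress.ip_interface(vm_interface).network
    return ipaddress.ip_address(server) in network


# Hypothetical install server inside the same /23: treated as on-link, so unreachable.
print(is_on_link("10.192.24.3/23", "10.192.24.50"))
# Same server once the VM uses a /32: traffic goes through the gateway.
print(is_on_link("10.192.24.3/32", "10.192.24.50"))
# Hypothetical server in another range (e.g. a different site): routed either way.
print(is_on_link("10.192.24.3/23", "10.192.0.10"))
```

Only the first case is affected, which is why changing next-server to a different site, or serving from the gateway IP, sidesteps the problem.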

Note that one of the prerequisites is to stay as close to the initial provisioning process as possible, thus keeping DHCP/PXE, etc. But I'm wondering what the tradeoff would be if we could boot from a Debian image instead of from the network, and for example use a "preseed/early_command" to configure the IPs from the fw_cfg entries.

It's working well, so no need to change things, but I agree this would be a very clean way to handle things.

BTW the whole idea of using the fw_cfg framework to expose the network info to the VM is genius, @ayounsi, nice work! Now that I'm aware of it, I'm surprised nobody has tried to standardize it:

  • A well-known namespace in fw_cfg to expose IP and DNS resolver info
  • OS network init support to look for this (pre-dhcp) and use it to set up the network if present
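A minimal sketch of what the second point could look like on the guest side, assuming the `net/ip` and `net/ip6` names used in this task stand in for a "well-known namespace" (no such standard exists today). The function names and the returned dictionary layout are invented for the example.

```python
from pathlib import Path

# sysfs location of fw_cfg entries, as used with `cat` earlier in this task.
FW_CFG_DIR = Path("/sys/firmware/qemu_fw_cfg/by_name")


def read_fw_cfg(name: str, base: Path = FW_CFG_DIR):
    """Return the string stored under a fw_cfg entry, or None if unavailable."""
    try:
        value = (base / name / "raw").read_text(encoding="ascii")
    except OSError:
        # Not a QEMU guest, entry not set, or not readable (root is needed).
        return None
    return value.strip("\x00 \n") or None


def choose_network_setup(base: Path = FW_CFG_DIR) -> dict:
    """Prefer addresses handed over by the hypervisor, else fall back to DHCP."""
    ip4 = read_fw_cfg("net/ip", base)
    ip6 = read_fw_cfg("net/ip6", base)
    if ip4 is None and ip6 is None:
        return {"method": "dhcp"}
    return {"method": "static", "ip4": ip4, "ip6": ip6}


print(choose_network_setup())
```

The `base` parameter exists only so the lookup can be pointed at a test directory; on a real guest the default sysfs path would be used.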

cookbooks.sre.hosts.decommission executed by ayounsi@cumin1002 for hosts: sretest2005.codfw.wmnet

  • sretest2005.codfw.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Failed to force sync of VMs in Ganeti cluster codfw02 to Netbox: Cumin execution failed (exit_code=2)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw02 to Netbox

ERROR: some step on some host failed, check the bolded items above

Change 994223 merged by Ayounsi:

[operations/puppet@production] DHCP: set "use-host-decl-names on"

https://gerrit.wikimedia.org/r/994223

Change 1005450 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Routed Ganeti: move the tap v4 IP to Hiera

https://gerrit.wikimedia.org/r/1005450

Change 1003491 merged by jenkins-bot:

[operations/software/spicerack@master] Ganeti: pass the v4 and v6 IPs to the VM as fw_cfg

https://gerrit.wikimedia.org/r/1003491

Change 1005450 merged by Ayounsi:

[operations/puppet@production] Routed Ganeti: move the tap v4 IP to Hiera

https://gerrit.wikimedia.org/r/1005450

Change 1003490 merged by jenkins-bot:

[operations/cookbooks@master] makevm: pass the v6 IP to GntInstance.add

https://gerrit.wikimedia.org/r/1003490

cookbooks.sre.hosts.decommission executed by ayounsi@cumin1002 for hosts: testvm2006.codfw.wmnet

  • testvm2006.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw_test to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw_test to Netbox

Change 995032 merged by Ayounsi:

[operations/puppet@production] Routed Ganeti: Add v6 static route to VM

https://gerrit.wikimedia.org/r/995032

Change 1003416 merged by Ayounsi:

[operations/puppet@production] Add support for routed Ganeti in D-I early_command.sh

https://gerrit.wikimedia.org/r/1003416

Change #1003452 merged by Ayounsi:

[operations/puppet@production] Routed Ganeti: use per tap interface dhcrelay

https://gerrit.wikimedia.org/r/1003452

cookbooks.sre.hosts.decommission executed by ayounsi@cumin1002 for hosts: testvm2006.codfw.wmnet

  • testvm2006.codfw.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Failed to force sync of VMs in Ganeti cluster codfw02 to Netbox: Cumin execution failed (exit_code=2)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw02 to Netbox

ERROR: some step on some host failed, check the bolded items above

Change #1016708 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Routed Ganeti: fix v6 route install

https://gerrit.wikimedia.org/r/1016708

Change #1016708 merged by Ayounsi:

[operations/puppet@production] Routed Ganeti: fix v6 route install

https://gerrit.wikimedia.org/r/1016708

Change #1016714 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Add routed Ganeti to Prometheus monitoring

https://gerrit.wikimedia.org/r/1016714

Change #1016714 merged by Ayounsi:

[operations/puppet@production] Add routed Ganeti to Prometheus monitoring

https://gerrit.wikimedia.org/r/1016714

We can consider this task completed with success.

The next step is to discuss follow-up work and open more specific tasks.

A rough outline of possible next steps is available at https://wikitech.wikimedia.org/wiki/Ganeti#Future/possible_improvements

Change #1017047 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] add_ip6_mapped - don't fail if the host already have a /128 address

https://gerrit.wikimedia.org/r/1017047

cookbooks.sre.hosts.decommission executed by ayounsi@cumin1002 for hosts: testvm2006.codfw.wmnet

  • testvm2006.codfw.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Failed to force sync of VMs in Ganeti cluster codfw02 to Netbox: Cumin execution failed (exit_code=2)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw02 to Netbox

ERROR: some step on some host failed, check the bolded items above

Change #1017047 merged by Ayounsi:

[operations/puppet@production] add_ip6_mapped - don't fail if the host already have a /128 address

https://gerrit.wikimedia.org/r/1017047

Change #1019002 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Add public Ganeti IP ranges

https://gerrit.wikimedia.org/r/1019002

Change #1019005 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Add public testvm200x support

https://gerrit.wikimedia.org/r/1019005

Change #1019002 merged by Ayounsi:

[operations/homer/public@master] Add public Ganeti IP ranges

https://gerrit.wikimedia.org/r/1019002

Change #1019005 merged by Ayounsi:

[operations/puppet@production] Add public testvm200x support

https://gerrit.wikimedia.org/r/1019005

cookbooks.sre.hosts.decommission executed by ayounsi@cumin1002 for hosts: testvm2008.wikimedia.org

  • testvm2008.wikimedia.org (WARN)
    • Host not found on Icinga, unable to downtime it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw02 to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw02 to Netbox

cookbooks.sre.hosts.decommission executed by ayounsi@cumin1002 for hosts: testvm2008.wikimedia.org

  • testvm2008.wikimedia.org (WARN)
    • Host not found on Icinga, unable to downtime it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw02 to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw02 to Netbox

Change #1051342 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Preseed: set /32 netmask for virtual ranges

https://gerrit.wikimedia.org/r/1051342

Change #1051342 merged by Ayounsi:

[operations/puppet@production] Preseed: set /32 netmask for virtual ranges

https://gerrit.wikimedia.org/r/1051342

Change #1051366 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] DHCP: Add support for routed ganeti subnets

https://gerrit.wikimedia.org/r/1051366

Change #1051366 merged by Ayounsi:

[operations/puppet@production] DHCP: send subnet-mask 255.255.255.255 for routed ganeti VMs

https://gerrit.wikimedia.org/r/1051366