
Drive host network config from Netbox, and move away from ifupdown
Open, Low, Public

Description

Background

Our current host-network configuration / provisioning has some rough edges. To give a quick recap on how things currently work:

  1. Servers are assigned an IP in Netbox (with a dummy interface name) when their primary switch link is added to Netbox
  2. We assign this IP to the host using DHCP when it is being reimaged
  3. The debian-installer takes the IP details assigned from DHCP and statically configures them in /etc/network/interfaces
  4. Various roles with more complex network setups use puppet to modify this file
    • For example cloud hosts, lvs, ganeti
    • They typically only augment the d-i generated file rather than replacing it, which is brittle
  5. The information from puppetdb is imported back into Netbox
    • This replaces dummy interface names with the host-derived real ones
    • It also adds any additional interfaces and IPs that were created by puppet to Netbox.

This process, combined with the use of ifupdown, leads to multiple issues as documented in T234207: Investigate improvements to how puppet manages network interfaces and other tasks. Apologies for yet another task, but this didn't seem to fit neatly into the existing ones.

Future

Moving forward I think it makes sense if we use Netbox to drive the host configuration, really making it our "source of truth". Netbox now models the more elaborate host-side network elements correctly, so there should be no need for any role-specific puppet classes that modify network config, or hiera structures to define these elements. Working to that end also provides a perfect opportunity to move away from ifupdown, towards systemd-networkd, netplan.io, connman or similar.

At a high-level we'd need some changes to our process:

  • Improve the netbox provision script, so we can define more complex setups at that stage
    • i.e. have options for cloud hosts, ganeti, lvs (see T346428)
  • Keep the current scenario whereby we assign IPs from DHCP during install, and d-i creates a conf file based on those details
  • When the host is up, Puppet overwrites the d-i generated files, rewriting the whole config based on data synced from netbox
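To make the last point concrete, here's a rough sketch of the kind of thing I have in mind, in Python rather than the Puppet template we'd actually use, and with a made-up data structure standing in for whatever gets synced from Netbox. The point is that the whole systemd-networkd config is regenerated from the source-of-truth data, never patched in place:

import ipaddress

# Hypothetical shape of the per-host data synced from Netbox (names/fields made up).
NETBOX_IFACES = [
    {"name": "enp4s0f0", "address": "10.64.0.42/22", "gateway": "10.64.0.1"},
]

def render_network_unit(iface: dict) -> str:
    """Render a systemd-networkd .network unit for one interface."""
    addr = ipaddress.ip_interface(iface["address"])
    return "\n".join([
        "[Match]",
        f"Name={iface['name']}",
        "",
        "[Network]",
        f"Address={addr.with_prefixlen}",
        f"Gateway={iface['gateway']}",
        "",
    ])

for iface in NETBOX_IFACES:
    # In reality puppet would write e.g. /etc/systemd/network/10-enp4s0f0.network
    print(render_network_unit(iface))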

Interface naming problem

There are some challenges for this last point. They boil down to Netbox having the interface config and IP details, but not knowing the linux device names. The host itself knows the interface names, but does not have access to Netbox. Puppet sort of sits in between.

There are probably various ways to approach this:

  • Can we accurately predict the linux netdev names?
    • I don't believe we track the various host-level NIC and BIOS parameters accurately enough to predict the PCIe numbering
  • Could we have some local service on the box that builds the network conf files based on data puppet pushes to it?
    • This data would be synced from netbox
    • The local service would be able to replace any 'dummy' interface names with the real ones
      • Possibly based on LLDP info? (rough sketch just below)
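To sketch the LLDP idea: assuming lldpd is running on the host, and that the data pushed from Netbox includes the switch and switch-port each (dummy-named) interface is cabled to, the local service could do something along these lines. The exact JSON layout of lldpctl output varies between lldpd versions, so the parsing here is illustrative only:

import json
import subprocess

def lldp_neighbours() -> dict:
    """Map local interface name -> (remote switch name, remote port ID), via lldpd."""
    out = subprocess.run(["lldpctl", "-f", "json"], capture_output=True, text=True, check=True)
    data = json.loads(out.stdout)
    neighbours = {}
    # NB: the JSON shape differs between lldpd versions; this follows one common layout.
    for entry in data.get("lldp", {}).get("interface", []):
        for ifname, details in entry.items():
            switch = next(iter(details.get("chassis", {})), None)      # remote system name
            port = details.get("port", {}).get("id", {}).get("value")  # remote port ID
            neighbours[ifname] = (switch, port)
    return neighbours

def resolve_dummy_names(netbox_links: dict) -> dict:
    """netbox_links: dummy interface name -> (switch, switch port), as synced from Netbox.
    Returns dummy name -> real Linux netdev name, for links we can see via LLDP."""
    by_remote = {remote: local for local, remote in lldp_neighbours().items()}
    return {dummy: by_remote[remote] for dummy, remote in netbox_links.items() if remote in by_remote}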

We'd probably still need the puppetdb netbox import step to rename dummy interfaces in Netbox, but we could reduce its function to just that, and no longer have any other data pulled from live hosts into our source of truth.

Not a trivial thing, but thought I'd create the task to explore what our options are.

Event Timeline

cmooney created this task.

@cmooney thanks for the write-up, I attached an older task to this which lists some of the edge cases and issues etc. Overall I think the proposal is good. As to the open question of how to get the interface names: we may be able to add this to the provisioning script. With our configuration we are using "Predictable Interface Names", which are based on the device location (i.e. bus, slot, etc.) and not the driver. We may be able to use Redfish to get this information (although I couldn't find it from a quick look) and then update Netbox. We have also spoken in the past of having some very simple Debian live system that we could boot into during provisioning to grab a bunch of data from the host and update Netbox.

We may be able to use Redfish to get this information (although I couldn't find it from a quick look) and then update Netbox.

Hmm, that's a very interesting prospect; it might well be a way to do it.

We have also spoken in the past of having some very simple Debian live system that we could boot into during provisioning to grab a bunch of data from the host and update Netbox.

Another good suggestion - thanks!

It's great to see momentum on this recurring pain point!

To add to it, we could have the hosts boot up with only a v6 SLAAC IP (decommission the DHCP) and then get their final/permanent IP at their first Puppet run.

To update Netbox with the proper list of interfaces, this could be done at the early or late_command stage (cf. https://github.com/wikimedia/operations-puppet/tree/production/modules/install_server/files/autoinstall/scripts )
Or even sooner if we use something like iPXE (which could also work around the issue we're having in T304483: PXE boot NIC firmware regression).

Next step should probably be to have a meeting about that, then open (or update) subtasks with the direction we want to take.

I had a little play with the Redfish API and the PCIe info is available. Unfortunately the Linux interface names remain about as predictable as the Lotto numbers.

Taking a random system, cloudservices1006, it has interfaces as follows:

cmooney@cloudservices1006:~$ lspci | grep -i ethernet
04:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
04:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
4b:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57412 NetXtreme-E 10Gb RDMA Ethernet Controller (rev 01)
4b:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57412 NetXtreme-E 10Gb RDMA Ethernet Controller (rev 01)

From these, systemd-udevd.service creates the following udev properties for ID_NET_NAME_PATH (the naming scheme is described in the Debian and systemd documentation):

cmooney@cloudservices1006:~$ udevadm info -e | grep ID_NET_NAME_PATH
E: ID_NET_NAME_PATH=enp4s0f0
E: ID_NET_NAME_PATH=enp4s0f1
E: ID_NET_NAME_PATH=enp75s0f0np0
E: ID_NET_NAME_PATH=enp75s0f1np1

Using the Redfish API we can pull the PCI IDs (ultimately the same as lspci shows, although Redfish reports them in decimal rather than hex), and almost predict the names:

cmooney@cumin1001:~$ ./get_nic_pci.py --host cloudservices1006.mgmt.eqiad.wmnet
iDRAC Password: 
NIC.Embedded.1-1-1 - enp4s0f0 - NetXtreme BCM5720 Gigabit Ethernet PCIe
NIC.Embedded.2-1-1 - enp4s0f1 - NetXtreme BCM5720 Gigabit Ethernet PCIe
NIC.Integrated.1-1-1 - enp75s0f0 - BCM57412 NetXtreme-E 10Gb RDMA Ethernet Controller
NIC.Integrated.1-2-1 - enp75s0f1 - BCM57412 NetXtreme-E 10Gb RDMA Ethernet Controller

Comparing that to what the system shows, the only difference is the "np0" and "np1" suffixes, which get added to the 10G ("integrated") ports but not the 1G ("embedded") ones. In all cases the digit after the "np" matches the PCIe function number, but I can't quite see why it is added in one case and not the other. Looking at the full API output for these devices, and indeed querying their udev properties, I can't see what difference is present to result in one having the 'npX' and the other not.
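For what it's worth the path-based part of the name is mechanical once we have the bus/slot/function, so from the Redfish data we could get most of the way there. If I'm reading the systemd naming scheme right, the 'npX' part is taken from the kernel's phys_port_name attribute (which the BCM57412s expose and the BCM5720s don't), and the BMC has no view of that. Rough sketch of the prediction, assuming we already have the PCI location as integers:

def predicted_path_name(bus: int, slot: int, function: int, domain: int = 0) -> str:
    """Predict the ID_NET_NAME_PATH-style name for a PCI NIC.
    Caveats: the f<function> part only appears for multi-function devices,
    and this ignores the npX suffix derived from the kernel's phys_port_name."""
    name = "en"
    if domain:                 # PCI domain is only included when non-zero
        name += f"P{domain}"
    name += f"p{bus}s{slot}f{function}"
    return name

# bus 0x4b = 75, slot 0, function 1 -> 'enp75s0f1' (actual name on the host: enp75s0f1np1)
print(predicted_path_name(75, 0, 1))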

There is possibly a clue in ID_NET_LABEL_ONBOARD, whose free-text labels show separate 'NICs' for the embedded ports, but a single NIC with two ports for the integrated ones:

cmooney@cloudservices1006:~$ udevadm info -e | grep ID_NET_LABEL_ONBOARD
E: ID_NET_LABEL_ONBOARD=Embedded NIC 1
E: ID_NET_LABEL_ONBOARD=Embedded NIC 2
E: ID_NET_LABEL_ONBOARD=Integrated NIC 1 Port 1-1
E: ID_NET_LABEL_ONBOARD=Integrated NIC 1 Port 2-1

That all may be academic, however, as the netdev names are not being set based on PCI location. Instead they're named as follows:

cmooney@cloudservices1006:~$ ip -br link show | grep ^en
eno12399np0      UP             00:62:0b:c9:19:70 <BROADCAST,MULTICAST,UP,LOWER_UP> 
eno8303          DOWN           c4:5a:b1:ab:d7:f8 <BROADCAST,MULTICAST> 
eno12409np1      DOWN           00:62:0b:c9:19:71 <BROADCAST,MULTICAST> 
eno8403          DOWN           c4:5a:b1:ab:d7:f9 <BROADCAST,MULTICAST>

These names are used because the NICs are exposed as "onboard" devices, which means ID_NET_NAME_ONBOARD is populated, and per the default NamePolicy that name is preferred. Ultimately those numbers come from the device's ACPI index:

cmooney@cloudservices1006:~$ cat /sys/class/net/eno8303/device/acpi_index 
8303
cmooney@cloudservices1006:~$ cat /sys/class/net/eno12399np0/device/acpi_index 
12399
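The mapping itself is trivial once you can see the ACPI index; the problem is only that the running host is the one place it's visible. Something like this shows it (reading the same sysfs attribute as the cat commands above):

from pathlib import Path

# Where a NIC exposes an ACPI index, the onboard name is "eno<index>"
# (plus the same npX suffix where the kernel exposes a phys_port_name).
for dev in sorted(Path("/sys/class/net").iterdir()):
    acpi_index = dev / "device" / "acpi_index"
    if acpi_index.exists():
        idx = acpi_index.read_text().strip()
        print(f"{dev.name}: acpi_index={idx} -> onboard name eno{idx}")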

Certain NICs in our estate are not seen as 'onboard', and expose no 'acpi index'. This results in no ID_NET_NAME_ONBOARD being populated for them, in which case the system does use ID_NET_NAME_PATH for the name:

cmooney@dns5003:~$ ip -br addr show  | egrep ^en
enp94s0f0np0     UP             103.102.166.10/28 2001:df2:e500:1:103:102:166:10/64 fe80::8616:cff:fe5d:a880/64 
eno1             DOWN           
enp94s0f1np1     DOWN           
eno2             DOWN

NOTE: In the above the acpi_index values for the embedded eno1 and eno2 are 1 and 2 respectively. No idea why these get short, single-digit IDs while some other systems have much larger ones.

Either way, despite having a good look, I couldn't find any way to get these ACPI Index IDs from the redfish API. So this may be a bit of a dead-end.

Some things that may be possible, if still trying to predict the names from redfish data:

  1. Change the NamePolicy we configure to "path", so systemd will use the name from ID_NET_NAME_PATH (derived from pci location), rather than the 'onboard'/acpi ID.
    • We would still need to work out when and why the 'npX' suffix gets added in order to do this
  2. Change the NamePolicy to 'mac', so we get names like enxc45ab1abd7f5. Ugly, but should be consistent?

The idea of booting the system into some "simple live system" at the provision stage and grabbing the information is perhaps also a good option.

To add to it, we could have the hosts boot up with only a v6 SLAAC IP (decommission the DHCP) and then get their final/permanent IP at their first Puppet run.

Definitely a good option. I think we need to be mindful of T102099: Fix IPv6 autoconf issues once and for all, across the fleet. here (esp. the weird bug documented in T102099#8537693). The idea of disabling RAs and SLAAC completely did appeal to me, but no strong feelings. I guess in this case we can pull the system's MAC from the Redfish API (or maybe the switch?) and then derive the SLAAC IP?
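On deriving the SLAAC IP: assuming the host forms a plain modified-EUI-64 address (i.e. we're not using privacy / stable-privacy addressing on these interfaces), it's a simple calculation once we have the /64 prefix from Netbox and the MAC from Redfish or the switch. Quick sketch:

import ipaddress

def slaac_address(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """Compute the modified EUI-64 SLAAC address a host will form in the given /64."""
    octets = bytearray(int(b, 16) for b in mac.split(":"))
    octets[0] ^= 0x02                        # flip the universal/local bit
    eui64 = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    return ipaddress.IPv6Network(prefix)[int.from_bytes(eui64, "big")]

# MAC taken from the cloudservices1006 output above; the prefix is just a documentation example
print(slaac_address("2001:db8:101::/64", "00:62:0b:c9:19:70"))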

To update Netbox with the proper list of interfaces, this could be done at the early or late_command stage (cf. https://github.com/wikimedia/operations-puppet/tree/production/modules/install_server/files/autoinstall/scripts )

That's a very good point. There is no hard requirement to update Netbox with the real names at the 'provision' stage; the only requirement is that they are available before we try to build the network conf files. If we can control the order of operations and have an 'export' function run by puppet before it tries to build the config, it might be an easy win.

Next step should probably be to have a meeting about that, then open (or update) subtasks with the direction we want to take.

Agreed, I think that makes sense. We can possibly discuss it briefly during our quarterly planning and, if there is agreement to proceed, schedule a dedicated one?