
Investigate improvements to how puppet manages network interfaces
Open, Medium, Public

Description

It was recently noticed that the current add_ip6_interface resource does not work with virtual devices, so we should investigate alternatives. There was a discussion on the CR, but no clear path forward was devised. Options worth considering: systemd-networkd, https://netplan.io/ (essentially a YAML frontend to systemd-networkd), or connman (https://01.org/connman).
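For illustration, a minimal netplan sketch covering the kind of virtual device (a VLAN sub-interface) that the current resources struggle with. All interface names and addresses below are hypothetical:

```yaml
# Hypothetical netplan config: a physical NIC plus a VLAN sub-interface.
# Interface names and addresses are illustrative only.
network:
  version: 2
  ethernets:
    eno1:
      addresses:
        - 10.64.0.10/22
        - "2620:0:861:101::10/64"
  vlans:
    vlan100:
      id: 100
      link: eno1
      addresses:
        - 10.64.4.10/24
```

netplan renders this into systemd-networkd (or NetworkManager) configuration, so it would be a thin declarative layer rather than a new network stack.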

Listing tasks that would need to be supported by, or could have been easier to resolve with, better puppetised network management:

Puppet modules/resources to consider with this work:

  • interface::alias
  • interface::txqueuelen
  • interface::ring
  • interface::offload
  • interface::rps
  • interface::add_ip6_mapped
  • interface::ip
  • interface::up_command
  • interface::setting
  • bridge_utils
  • lvs::kernel_config
  • systemd::resolved
  • profile::lvs::interface_tweaks
  • profile::lvs::tagged_interface
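As a point of comparison for the list above: something like interface::alias (an extra address on an existing interface) reduces to a single directive in systemd-networkd. A sketch, with hypothetical interface names and addresses:

```ini
# /etc/systemd/network/10-eno1.network (illustrative)
[Match]
Name=eno1

[Network]
Address=10.64.0.10/22
# An additional "alias" address on the same interface:
Address=10.64.0.11/22
Address=2620:0:861:101::10/64
```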

Other hacks that could be resolved

  • ./modules/ganeti/files/ganeti_init.sh


Event Timeline

akosiaris triaged this task as Lowest priority. Sep 30 2019, 3:05 PM
akosiaris updated the task description.
akosiaris renamed this task from "Investigate improvements to how puppet manages interfaces" to "Investigate improvements to how puppet manages network interfaces". Dec 2 2019, 5:11 PM

Change 602350 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] ganeti: Add a ganeti_init.sh script

https://gerrit.wikimedia.org/r/602350

Change 602350 merged by Alexandros Kosiaris:
[operations/puppet@production] ganeti: Add a ganeti_init.sh script

https://gerrit.wikimedia.org/r/602350

jbond raised the priority of this task from Lowest to Medium.Oct 7 2022, 9:11 AM

I changed the priority to medium. The lack of a proper solution for network management causes periodic problems often enough that it's worth trying to solve this.

Thanks for tracking all this John.

As you know, most of our hosts just have a single interface with a single unicast IP (of each address family). But a few have vlan sub-interfaces, and some have bridge devices, most notably our Ganeti and LVS hosts. I'm not sure if this task is the best place to discuss this, but I'm of the opinion we should be able to drive the creation of these from Netbox data.

So for instance we could adjust our Netbox interface_automation, to properly add the additional network devices for hosts when they are added? Possibly we could have templates for say Ganeti and LVS, as well as potentially supporting generic vlan/bridge ints via drop-down or similar.

Am I being naive to think if we did that, Puppet could use the Netbox data to drive the configuration on the hosts themselves? i.e. using a template for /etc/network/interfaces, or the equivalent conf files for systemd-networkd?

I think more exotic things, like the VRFs and namespaces on some of the cloud hosts, should probably be out of scope for the generic network configuration (and be instead handled by role-specific templates if needed). As should adjustments to things like hardware offloads, rx/tx queue tweaks etc., as the requirement for those is very service dependent. And we've no standard way to model them in Netbox.

But adding a 'universal' way to define bridge and vlan ints would cover the vast majority of foreseeable use-cases. The current approach, where every role that needs more than the basic single-int has its own way to add the additional elements, doesn't seem like a good pattern longer term.
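As a sketch of the "Puppet renders host config from Netbox data" idea: assuming a hypothetical @interfaces hash (derived from Netbox via hiera — the variable and its keys are assumptions, not existing code), an ERB template for /etc/network/interfaces could look roughly like:

```erb
# Hypothetical ERB sketch; @interfaces and its key names are assumptions.
<%- @interfaces.each do |name, conf| -%>
auto <%= name %>
iface <%= name %> inet static
    address <%= conf['address'] %>
<%- if conf['vlan_raw_device'] -%>
    vlan-raw-device <%= conf['vlan_raw_device'] %>
<%- end -%>
<%- if conf['bridge_ports'] -%>
    bridge_ports <%= conf['bridge_ports'] %>
<%- end -%>

<%- end -%>
```

The same data structure could equally drive systemd-networkd unit files; the template format is the interchangeable part.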

I'm not sure if this task is the best place to discuss this, but I'm of the opinion we should be able to drive the creation of these from Netbox data.

I think here is as good a place as any. That said, regardless of whether the data is added to hiera manually or comes from netbox, it doesn't really change the underlying implementation of the puppet code. So although it's worth keeping in mind, I don't think that decision blocks this work, or that this work would prevent us from using netbox in the future — i.e. it's hiera all the way down as far as puppet is concerned. Technical points aside, the main question that has come up when this has been discussed before is how much we want to allow netbox to configure hosts, and how much we want to protect against that.

So for instance we could adjust our Netbox interface_automation, to properly add the additional network devices for hosts when they are added? Possibly we could have templates for say Ganeti and LVS, as well as potentially supporting generic vlan/bridge ints via drop-down or similar.

Am I being naive to think if we did that, Puppet could use the Netbox data to drive the configuration on the hosts themselves? i.e. using a template for /etc/network/interfaces, or the equivalent conf files for systemd-networkd?

Yes, from a technical PoV this should all be doable.

I think more exotic things, like the VRFs and namespaces on some of the cloud hosts, should probably be out of scope for the generic network configuration
(and be instead handled by role-specific templates if needed). As should adjustments to things like hardware offloads, rx/tx queue tweaks etc., as the requirement for those is very service dependent. And we've no standard way to model them in Netbox.

Perhaps from the netbox PoV, but any new (networkd) module should support all use cases. The majority of the issues we have in this space are with the more exotic configurations and the fact that users don't have a clear way to configure them, leaving them to create their own solutions.

But adding a 'universal' way to define bridge and vlan ints would cover the vast majority of foreseeable use-cases. The current approach, where every role that needs more than the basic single-int has its own way to add the additional elements, doesn't seem like a good pattern longer term.

completely agree.

Thanks for tracking all this John.

So for instance we could adjust our Netbox interface_automation, to properly add the additional network devices for hosts when they are added? Possibly we could have templates for say Ganeti and LVS, as well as potentially supporting generic vlan/bridge ints via drop-down or similar.

My current 10k feet view of a future setup is that the core network setup would get provided by systemd-networkd as provisioned by the installer. And that we find a model to sync addons settings for cases beyond the standard setup (such as Ganeti or LVS) where we sync the additional data (e.g. the bridge setup for Ganeti) from Netbox (as systemd-networkd provides a fine-grained enough hierarchy) on demand/via cookbook (e.g. in case VLANs change or a server gets moved).
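The fine-grained hierarchy mentioned here is systemd-networkd's split between .netdev files (device creation) and .network files (matching and addressing). A Ganeti-style bridge could be synced from Netbox as a few small drop-in files; names and addresses below are illustrative only:

```ini
# /etc/systemd/network/20-br0.netdev (illustrative): create the bridge
[NetDev]
Name=br0
Kind=bridge

# /etc/systemd/network/20-br0.network (illustrative): address the bridge
[Match]
Name=br0

[Network]
Address=10.64.16.10/22

# /etc/systemd/network/20-eno1.network (illustrative): enslave the NIC
[Match]
Name=eno1

[Network]
Bridge=br0
```

Because each concern lives in its own file, a cookbook could add or replace just the bridge drop-ins without touching the installer-provisioned core config.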

But I think this topic is large enough that it's best discussed in person, so let's set up a meeting in the next few weeks for interested people to discuss options and a way forward (and to figure out what needs further investigation/experiments).

Perhaps from the netbox PoV, but any new (networkd) module should support all use cases. The majority of the issues we have in this space are with the more exotic configurations and the fact that users don't have a clear way to configure them, leaving them to create their own solutions.

Definitely not opposed to any of that, and we do have custom fields in Netbox too. My only thought was not to make the task too difficult for ourselves on day one, or get bogged down trying to support extreme edge cases. But absolutely, the more we can automate/standardize the better.

I think this topic is large enough that it's best discussed in person, so let's set up a meeting in the next few weeks for interested people to discuss options and a way forward (and to figure out what needs further investigation/experiments)?

Agreed that's probably best, let's set it up.

Thanks for tracking all this John.

So for instance we could adjust our Netbox interface_automation

One thing I forgot to highlight is that there is currently a bit of a chicken/egg issue with using interface_automation, which is populated via puppetdb, to drive puppet. We would need some new script that dc-ops would run while racking and doing the initial install of the server.

My current 10k feet view of a future setup is that the core network setup would get provided by systemd-networkd as provisioned by the installer.

I think this is fine; however, after the server is imaged, puppet should also manage these core network items. This should also help with resolving things like VM interfaces being renamed when they come up, i.e. we could choose to always call the primary interface something like primary, production or data.
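Pinning the primary interface to a stable name is something systemd supports natively via a .link file, so puppet (or the installer) could drop one in per host. A sketch — the MAC address is a placeholder:

```ini
# /etc/systemd/network/10-primary.link (illustrative)
# Match the NIC by its MAC address and give it a stable name.
[Match]
MACAddress=aa:bb:cc:dd:ee:ff

[Link]
Name=primary
```

With this in place the interface is called "primary" regardless of enumeration order, which sidesteps the VM rename problem.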

And that we find a model to sync addons settings for cases beyond the standard setup (such as Ganeti or LVS) where we sync the additional data (e.g. the bridge setup for Ganeti) from Netbox (as systemd-networkd provides a fine-grained enough hierarchy) on demand/via cookbook (e.g. in case VLANs change or a server gets moved).

Also agree here; however, the details are where we may differ. Ultimately we currently have the issue that puppet populates netbox, and I think that either netbox should populate puppet, or some data store (hiera) should populate both puppet and netbox. Basically: where do we want the source of truth? Currently, data like bridges and vlan tags is scattered throughout various puppet manifests, which makes discovering it a pain.
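As a rough illustration of "hiera all the way down": the scattered bridge/vlan data could be centralised under a single hiera key that both puppet and a Netbox sync consume. The key and attribute names below are hypothetical, not existing code:

```yaml
# Hypothetical hiera layout acting as the single source of truth.
profile::netconfig::interfaces:
  primary:
    address: 10.64.0.10/22
  vlan100:
    kind: vlan
    parent: primary
    vlan_id: 100
  br0:
    kind: bridge
    ports: [vlan100]
    address: 10.64.4.10/24
```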

But I think this topic is large enough that it's best discussed in person, so let's set up a meeting in the next few weeks for interested people to discuss options and a way forward (and to figure out what needs further investigation/experiments)?

+1 sgtm

Definitely not opposed to any of that, and we do have custom fields in Netbox too. My only thought was not to make the task too difficult for ourselves on day one, or get bogged down trying to support extreme edge cases. But absolutely, the more we can automate/standardize the better.

+1

One thing I forgot to highlight is that there is currently a bit of a chicken/egg issue with using interface_automation, which is populated via puppetdb, to drive puppet. We would need some new script that dc-ops would run while racking and doing the initial install of the server.

Sorry, wires crossed here. Currently the import from puppetdb does happen in that script. But I was actually referring to the server network provisioning part of it, which we could augment with additional host templates for more than a single interface:

https://netbox.wikimedia.org/extras/scripts/interface_automation.ProvisionServerNetwork/

Ideally then the "puppetdb import" bit could be removed from it completely. And instead we could have a Netbox report to warn us of discrepancies between host live state and desired state as defined in Netbox.
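Such a report would essentially boil down to a diff between two interface maps: desired state from Netbox versus live state from facts. A minimal, self-contained sketch — the function name and data shapes are assumptions, not any existing Netbox API:

```python
def find_discrepancies(desired, live):
    """Diff desired interface state (e.g. from Netbox) against live state
    (e.g. from PuppetDB facts). Both arguments map interface name to a dict
    of attributes; returns a per-interface summary of differences."""
    diffs = {}
    for name in set(desired) | set(live):
        if name not in live:
            diffs[name] = "missing on host"
        elif name not in desired:
            diffs[name] = "not defined in Netbox"
        else:
            # Attributes present in either view whose values disagree.
            changed = {
                key: (desired[name].get(key), live[name].get(key))
                for key in set(desired[name]) | set(live[name])
                if desired[name].get(key) != live[name].get(key)
            }
            if changed:
                diffs[name] = changed
    return diffs
```

A Netbox report class could then emit one warning per entry in the returned dict, replacing the puppetdb import with a pure consistency check.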

I think that either netbox should populate puppet, or some data store (hiera) should populate both puppet and netbox. Basically: where do we want the source of truth?

Agree, this is at the core of the discussion I think. I'm fairly agnostic; we should discuss further to see what's the best fit.

For physical servers we indeed need to keep the whole lifecycle/provisioning process (racking/provisioning/re-imaging etc.) in mind.
That means being able to map the real-world interface to the logical one; from previous conversations that's one of the tricky points (e.g. where there are multiple cards with multiple NICs).
Having a first iteration with VMs only could allow us to test the software-side mechanisms without having to solve the "physical" part.
I also agree with @cmooney's ideal end state.

That means being able to map the real-world interface to the logical one; from previous conversations that's one of the tricky points (e.g. where there are multiple cards with multiple NICs).

Yeah that is tricky to predict, good point.

We may well need to import that from puppet still. But potentially it gets reduced to a script that just renames existing Netbox objects, rather than adding anything.