
Investigate improvements to how puppet manages network interfaces
Open, Medium, Public

Description

It was recently noticed that the current add_ip6_interface resource does not work with virtual devices, so we should investigate alternatives. There was a discussion on the CR, but no clear path forward was devised. Options worth considering: systemd-networkd, https://netplan.io/ (essentially a YAML frontend to systemd-networkd), or connman (https://01.org/connman).
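For illustration, a minimal netplan sketch covering the kind of virtual device (a VLAN sub-interface) that the current resources struggle with. All interface names and addresses below are hypothetical:

```yaml
# Hypothetical netplan config: a physical NIC plus a VLAN sub-interface.
# Interface names and addresses are illustrative only.
network:
  version: 2
  ethernets:
    eno1:
      addresses:
        - 10.64.0.10/22
        - "2620:0:861:101::10/64"
  vlans:
    vlan100:
      id: 100
      link: eno1
      addresses:
        - 10.64.4.10/24
```

netplan renders this into systemd-networkd (or NetworkManager) configuration, so it would be a thin declarative layer rather than a new network stack.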

Listing tasks that would need to be supported by, or could have been easier to resolve with, better puppetised network management:

Puppet modules/resources to consider with this work:

  • interface::alias
  • interface::txqueuelen
  • interface::ring
  • interface::offload
  • interface::rps
  • interface::add_ip6_mapped
  • interface::ip
  • interface::up_command
  • interface::setting
  • bridge_utils
  • lvs::kernel_config
  • systemd::resolved
  • profile::lvs::interface_tweaks
  • profile::lvs::tagged_interface
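As a point of comparison for the list above: something like interface::alias (an extra address on an existing interface) reduces to a single directive in systemd-networkd. A sketch, with hypothetical interface names and addresses:

```ini
# /etc/systemd/network/10-eno1.network (illustrative)
[Match]
Name=eno1

[Network]
Address=10.64.0.10/22
# An additional "alias" address on the same interface:
Address=10.64.0.11/22
Address=2620:0:861:101::10/64
```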

Other hacks that could be resolved

  • ./modules/ganeti/files/ganeti_init.sh


Event Timeline

akosiaris triaged this task as Lowest priority. Sep 30 2019, 3:05 PM
akosiaris updated the task description.
akosiaris renamed this task from "Investigate improvements to how puppet manages interfaces" to "Investigate improvements to how puppet manages network interfaces". Dec 2 2019, 5:11 PM

Change 602350 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] ganeti: Add a ganeti_init.sh script

https://gerrit.wikimedia.org/r/602350

Change 602350 merged by Alexandros Kosiaris:
[operations/puppet@production] ganeti: Add a ganeti_init.sh script

https://gerrit.wikimedia.org/r/602350

jbond raised the priority of this task from Lowest to Medium.Oct 7 2022, 9:11 AM

I changed the priority to medium. The lack of a proper solution for network management causes periodic problems often enough that it's worth trying to solve this.

Thanks for tracking all this John.

As you know, most of our hosts just have a single interface with a single unicast IP (of each address family). But a few have vlan sub-interfaces, and some have bridge devices, most notably our Ganeti and LVS hosts. I'm not sure if this task is the best place to discuss this, but I'm of the opinion we should be able to drive the creation of these from Netbox data.

So for instance we could adjust our Netbox interface_automation, to properly add the additional network devices for hosts when they are added? Possibly we could have templates for say Ganeti and LVS, as well as potentially supporting generic vlan/bridge ints via drop-down or similar.

Am I being naive to think if we did that, Puppet could use the Netbox data to drive the configuration on the hosts themselves? i.e. using a template for /etc/network/interfaces, or the equivalent conf files for systemd-networkd?

I think more exotic things, like the VRFs and namespaces on some of the cloud hosts, should probably be out of scope for the generic network configuration (and be instead handled by role-specific templates if needed). As should adjustments to things like hardware offloads, rx/tx queue tweaks etc., as the requirement for those is very service dependent. And we've no standard way to model them in Netbox.

But adding a 'universal' way to define bridge and vlan ints would cover the vast majority of foreseeable use-cases. The current approach, where every role that needs more than the basic single-int has its own way to add the additional elements, doesn't seem like a good pattern longer term.
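As a sketch of the "Puppet renders host config from Netbox data" idea: assuming a hypothetical @interfaces hash (derived from Netbox via hiera — the variable and its keys are assumptions, not existing code), an ERB template for /etc/network/interfaces could look roughly like:

```erb
# Hypothetical ERB sketch; @interfaces and its key names are assumptions.
<%- @interfaces.each do |name, conf| -%>
auto <%= name %>
iface <%= name %> inet static
    address <%= conf['address'] %>
<%- if conf['vlan_raw_device'] -%>
    vlan-raw-device <%= conf['vlan_raw_device'] %>
<%- end -%>
<%- if conf['bridge_ports'] -%>
    bridge_ports <%= conf['bridge_ports'] %>
<%- end -%>

<%- end -%>
```

The same data structure could equally drive systemd-networkd unit files; the template format is the interchangeable part.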

I'm not sure if this task is the best place to discuss this, but I'm of the opinion we should be able to drive the creation of these from Netbox data.

I think here is as good a place as any. That said, regardless of whether the data is added to hiera manually or comes from netbox, it doesn't really change the underlying implementation of the puppet code. So although it's worth keeping in mind, I don't think that decision blocks this work, or that this work would prevent us from using netbox in the future — i.e. it's hiera all the way down as far as puppet is concerned. Technical points aside, the main question that has come up when this has been discussed before is how much we want to allow netbox to configure hosts, and how much we want to protect against that.

So for instance we could adjust our Netbox interface_automation, to properly add the additional network devices for hosts when they are added? Possibly we could have templates for say Ganeti and LVS, as well as potentially supporting generic vlan/bridge ints via drop-down or similar.

Am I being naive to think if we did that, Puppet could use the Netbox data to drive the configuration on the hosts themselves? i.e. using a template for /etc/network/interfaces, or the equivalent conf files for systemd-networkd?

Yes, from a technical PoV this should all be doable.

I think more exotic things, like the VRFs and namespaces on some of the cloud hosts, should probably be out of scope for the generic network configuration
(and be instead handled by role-specific templates if needed). As should adjustments to things like hardware offloads, rx/tx queue tweaks etc., as the requirement for those is very service dependent. And we've no standard way to model them in Netbox.

Perhaps from the netbox PoV, but any new (networkd) module should support all use cases. The majority of the issues we have in this space are with the more exotic configurations and the fact that users don't have a clear way to configure them, leaving them to create their own solutions.

But adding a 'universal' way to define bridge and vlan ints would cover the vast majority of foreseeable use-cases. The current approach, where every role that needs more than the basic single-int has its own way to add the additional elements, doesn't seem like a good pattern longer term.

completely agree.

Thanks for tracking all this John.

So for instance we could adjust our Netbox interface_automation, to properly add the additional network devices for hosts when they are added? Possibly we could have templates for say Ganeti and LVS, as well as potentially supporting generic vlan/bridge ints via drop-down or similar.

My current 10k feet view of a future setup is that the core network setup would get provided by systemd-networkd as provisioned by the installer. And that we find a model to sync addons settings for cases beyond the standard setup (such as Ganeti or LVS) where we sync the additional data (e.g. the bridge setup for Ganeti) from Netbox (as systemd-networkd provides a fine-grained enough hierarchy) on demand/via cookbook (e.g. in case VLANs change or a server gets moved).
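The fine-grained hierarchy mentioned here is systemd-networkd's split between .netdev files (device creation) and .network files (matching and addressing). A Ganeti-style bridge could be synced from Netbox as a few small drop-in files; names and addresses below are illustrative only:

```ini
# /etc/systemd/network/20-br0.netdev (illustrative): create the bridge
[NetDev]
Name=br0
Kind=bridge

# /etc/systemd/network/20-br0.network (illustrative): address the bridge
[Match]
Name=br0

[Network]
Address=10.64.16.10/22

# /etc/systemd/network/20-eno1.network (illustrative): enslave the NIC
[Match]
Name=eno1

[Network]
Bridge=br0
```

Because each concern lives in its own file, a cookbook could add or replace just the bridge drop-ins without touching the installer-provisioned core config.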

But I think this topic is large enough that it's best discussed in person, so let's set up a meeting in the next few weeks for interested people to discuss options and a way forward (and to figure out what needs further investigation/experiments).

Perhaps from the netbox PoV, but any new (networkd) module should support all use cases. The majority of the issues we have in this space are with the more exotic configurations and the fact that users don't have a clear way to configure them, leaving them to create their own solutions.

Definitely not opposed to any of that, and we do have custom fields in Netbox too. My only thought was not to make the task too difficult for ourselves on day one, or get bogged down trying to support extreme edge cases. But absolutely, the more we can automate/standardize the better.

I think this topic is large enough that it's best discussed in person, so let's set up a meeting in the next few weeks for interested people to discuss options and a way forward (and to figure out what needs further investigation/experiments)?

Agreed that's probably best, let's set it up.

Thanks for tracking all this John.

So for instance we could adjust our Netbox interface_automation

One thing I forgot to highlight is that there is currently a bit of a chicken/egg issue with using interface_automation, which is populated via puppetdb, to drive puppet. We would need some new script that dc-ops would run while racking and doing the initial install of the server.

My current 10k feet view of a future setup is that the core network setup would get provided by systemd-networkd as provisioned by the installer.

I think this is fine; however, after the server is imaged, puppet should also manage these core network items. This should also help with resolving things like VM interfaces being renamed when they come up, i.e. we could choose to always call the primary interface something like primary, production or data.
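Pinning the primary interface to a stable name is something systemd supports natively via a .link file, so puppet (or the installer) could drop one in per host. A sketch — the MAC address is a placeholder:

```ini
# /etc/systemd/network/10-primary.link (illustrative)
# Match the NIC by its MAC address and give it a stable name.
[Match]
MACAddress=aa:bb:cc:dd:ee:ff

[Link]
Name=primary
```

With this in place the interface is called "primary" regardless of enumeration order, which sidesteps the VM rename problem.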

And that we find a model to sync addons settings for cases beyond the standard setup (such as Ganeti or LVS) where we sync the additional data (e.g. the bridge setup for Ganeti) from Netbox (as systemd-networkd provides a fine-grained enough hierarchy) on demand/via cookbook (e.g. in case VLANs change or a server gets moved).

Also agree here; however, the details are where we may differ. Ultimately we currently have the issue that puppet populates netbox, and I think that either netbox should populate puppet, or some data store (hiera) should populate both puppet and netbox. Basically: where do we want the source of truth? Currently, data like bridges and vlan tags is scattered throughout various puppet manifests, which makes discovering it a pain.
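As a rough illustration of "hiera all the way down": the scattered bridge/vlan data could be centralised under a single hiera key that both puppet and a Netbox sync consume. The key and attribute names below are hypothetical, not existing code:

```yaml
# Hypothetical hiera layout acting as the single source of truth.
profile::netconfig::interfaces:
  primary:
    address: 10.64.0.10/22
  vlan100:
    kind: vlan
    parent: primary
    vlan_id: 100
  br0:
    kind: bridge
    ports: [vlan100]
    address: 10.64.4.10/24
```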

But I think this topic is large enough that it's best discussed in person, so let's set up a meeting in the next few weeks for interested people to discuss options and a way forward (and to figure out what needs further investigation/experiments)?

+1 sgtm

Definitely not opposed to any of that, and we do have custom fields in Netbox too. My only thought was not to make the task too difficult for ourselves on day one, or get bogged down trying to support extreme edge cases. But absolutely, the more we can automate/standardize the better.

+1

One thing I forgot to highlight is that there is currently a bit of a chicken/egg issue with using interface_automation, which is populated via puppetdb, to drive puppet. We would need some new script that dc-ops would run while racking and doing the initial install of the server.

Sorry, wires crossed here. Currently the import from puppetdb does happen in that script. But I was actually referring to the server network provisioning part of it, which we could augment with additional host templates for more than a single interface:

https://netbox.wikimedia.org/extras/scripts/interface_automation.ProvisionServerNetwork/

Ideally then the "puppetdb import" bit could be removed from it completely. And instead we could have a Netbox report to warn us of discrepancies between host live state and desired state as defined in Netbox.
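Such a report would essentially boil down to a diff between two interface maps: desired state from Netbox versus live state from facts. A minimal, self-contained sketch — the function name and data shapes are assumptions, not any existing Netbox API:

```python
def find_discrepancies(desired, live):
    """Diff desired interface state (e.g. from Netbox) against live state
    (e.g. from PuppetDB facts). Both arguments map interface name to a dict
    of attributes; returns a per-interface summary of differences."""
    diffs = {}
    for name in set(desired) | set(live):
        if name not in live:
            diffs[name] = "missing on host"
        elif name not in desired:
            diffs[name] = "not defined in Netbox"
        else:
            # Attributes present in either view whose values disagree.
            changed = {
                key: (desired[name].get(key), live[name].get(key))
                for key in set(desired[name]) | set(live[name])
                if desired[name].get(key) != live[name].get(key)
            }
            if changed:
                diffs[name] = changed
    return diffs
```

A Netbox report class could then emit one warning per entry in the returned dict, replacing the puppetdb import with a pure consistency check.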

I think that either netbox should populate puppet, or some data store (hiera) should populate both puppet and netbox. Basically: where do we want the source of truth?

Agree, this is at the core of the discussion I think. I'm fairly agnostic; we should discuss further to see what's the best fit.

For physical servers we indeed need to keep the whole lifecycle/provisioning process (racking/provisioning/re-imaging etc.) in mind.
That means being able to map the real-world interface to the logical one; from previous conversations that's one of the tricky points (e.g. where there are multiple cards with multiple NICs).
Having a first iteration with VMs only could allow us to test the software-side mechanisms without having to solve the "physical" part.
I also agree with @cmooney's ideal end state.

That means being able to map the real-world interface to the logical one; from previous conversations that's one of the tricky points (e.g. where there are multiple cards with multiple NICs).

Yeah that is tricky to predict, good point.

We may well need to import that from puppet still. But potentially it gets reduced to a script that just renames existing Netbox objects, rather than adding anything.