
Netbox device location information not available on the first Puppet run of a device
Open, Medium, Public

Description

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, Class[Profile::Wmcs::Cloud_private_subnet]: parameter 'netbox_location' expects a Netbox::Device::Location = Variant[Netbox::Device::Location::BareMetal = Struct[{'site' => Wmflib::Sites = Enum['codfw', 'drmrs', 'eqdfw', 'eqiad', 'eqord', 'eqsin', 'esams', 'ulsfo'], 'row' => String[2], 'rack' => String[2]}], Netbox::Device::Location::Virtual = Struct[{'site' => Wmflib::Sites = Enum['codfw', 'drmrs', 'eqdfw', 'eqiad', 'eqord', 'eqsin', 'esams', 'ulsfo'], 'ganeti_cluster' => String[1], 'ganeti_group' => String[1]}]] value, got Undef (file: /etc/puppet/modules/role/manifests/wmcs/openstack/eqiad1/control.pp, line: 9, column: 9) on node cloudcontrol1007.eqiad.wmnet

It seems like the Netbox Hiera import happens later in a host's lifecycle than the cloud_private connection needs?
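For reference, the type in the error admits exactly two shapes. Below is a minimal sketch of the Hiera data the parameter expects; the struct fields and types are taken verbatim from the error above, while the key name and the values are invented for illustration:

```yaml
---
# Bare-metal shape (Netbox::Device::Location::BareMetal); the key name
# profile::netbox::host::location and all values are illustrative only.
profile::netbox::host::location:
  site: eqiad   # must be one of the Wmflib::Sites enum values
  row: c8       # 'row'  => String[2], i.e. at least two characters
  rack: c8      # 'rack' => String[2]
---
# Ganeti VM shape (Netbox::Device::Location::Virtual):
profile::netbox::host::location:
  site: eqiad
  ganeti_cluster: eqiad   # 'ganeti_cluster' => String[1]
  ganeti_group: c8        # 'ganeti_group'  => String[1]
```

On a first Puppet run this data has not been exported yet, so the lookup returns Undef and the type check rejects it, producing the error above.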

Event Timeline

The Netbox Hiera won't be generated until the server is made 'active' in Netbox.

My understanding is that you're one step ahead of Prod here, as you're deriving host networking from Netbox data (e.g. rack from VLAN, etc.), so you might catch new issues.
We should look at provisioning from beginning to end so we can mutualise the efforts here.

For this specific use case, the server lifecycle states specify that a host in the "planned" state isn't in Puppet, so it makes sense that it should be moved to "active" before Puppet works.

taavi renamed this task from "First Puppet run of a cloud_private connected node fails" to "Netbox device location information not available on the first Puppet run of a device". Sep 26 2023, 12:38 PM
taavi added subscribers: jbond, Volans.

Tagging @jbond @Volans as this is closely related to the server provisioning workflow. Looking at https://wikitech.wikimedia.org/wiki/Server_Lifecycle, the simple and obvious solution would be to change the device to the STAGED Netbox status and run the Hiera generation cookbook just before starting the initial Puppet run. IIRC there were some concerns with exporting network information before running the interface_automation.ImportPuppetDB script, but I think at least the location information should be usable at that point.

I think these are two separate issues. That task is about automatically provisioning the needed networks and IPs in Netbox, which was done manually for the host in question. This issue is about when that data is exported to the hiera repo.

The current assumptions are:

  • Hosts with Active or Failed status in Netbox must be in PuppetDB (there is a Netbox report to check this)
  • Hosts in PuppetDB must have an Active or Failed status in Netbox (there is a Netbox report to check this)
  • The Staged status in Netbox cannot be used (there is a Netbox validator that prevents it from being set). See T320696#8383673 for the previous reasoning about it.
  • The Hiera key profile::netbox::data::mgmt, which is populated from the exported data, is used by Prometheus to monitor the mgmt interface. Including Planned hosts as well might cause false-positive alerts (see the sketch below).
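To make the last point concrete, here is a purely hypothetical sketch of that key; the real structure is not shown in this task, so both the shape and the values below are assumptions. The point is that any host listed under it gets its mgmt interface probed, so a Planned host that is racked but not yet reachable would alert:

```yaml
# Hypothetical shape, for illustration only; the real export may differ.
profile::netbox::data::mgmt:
  cloudcontrol1007.mgmt.eqiad.wmnet:   # mgmt FQDNs use host.mgmt.<site>.wmnet
    site: eqiad
    rack: c8   # invented value
```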

Re-introducing the usage of the Staged status without all the required automation to ensure that the status is correctly updated in all cases would be a step back IMHO.

The netbox hiera sync happens when the sre.puppet.sync-netbox-hiera cookbook is run (this is also triggered by sre.dns.netbox), and indeed only imports data from Netbox devices in state 'active'.

The reimage cookbook does change the server status from 'planned' to 'active' when it completes, but that happens after the first Puppet run. As a work-around, however, I believe it should be possible to set the status to 'active' in Netbox manually once the Netbox Provision script has been run, before the reimage is triggered.

Tbh I wasn't planning on making any change there. Currently status is set to active once reimage completes. If we want to change that perhaps we can, as discussed elsewhere in this task, but I'm not sure it should be done differently for cloud servers versus the rest of our estate.

> My understanding is that you're one step ahead of Prod here, as you're deriving host networking from Netbox data (e.g. rack from VLAN, etc.), so you might catch new issues.
> We should look at provisioning from beginning to end so we can mutualise the efforts here.

Agreed. I think the Puppet approach is the only option we have right now. But as we look to configure networking with systemd-networkd, we should aim to drive it all from Netbox and remove the Puppet-driven network config for cloud hosts, Ganeti hosts, LVS, etc.

The components they require (bridges, VLAN interfaces, etc.) are all now modelled correctly in Netbox, so there is no need to use other sources of data (Hiera, etc.) to drive it.
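As a rough illustration of what that modelling provides, here is a sketch of how a bridge plus a VLAN subinterface might look when pulled out of Netbox. The field names follow Netbox's interface model (type, parent, bridge, mode, untagged_vlan), but the interface names, VLAN name, and overall document layout are invented:

```yaml
# Invented example; Netbox models bridge membership and VLANs natively.
interfaces:
  - name: br-cloud-private   # the bridge the cloud hosts attach to
    type: bridge
  - name: eno1.1105          # tagged subinterface on the physical NIC
    type: virtual
    parent: eno1
    bridge: br-cloud-private # enslaved to the bridge above
    mode: access
    untagged_vlan: cloud-private-c8-eqiad
```

With data like this available, a generator could render the equivalent systemd-networkd units directly from Netbox instead of from Hiera.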

T346428 will help with that, in terms of allocating all required IPs and interfaces when dc-ops add servers to racks. I'll improve that for Ganeti and LVS too once the picture of what they need longer term is clear.