
Netbox device location information not available on the first Puppet run of a device
Open, Medium, Public

Description

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, Class[Profile::Wmcs::Cloud_private_subnet]: parameter 'netbox_location' expects a Netbox::Device::Location = Variant[Netbox::Device::Location::BareMetal = Struct[{'site' => Wmflib::Sites = Enum['codfw', 'drmrs', 'eqdfw', 'eqiad', 'eqord', 'eqsin', 'esams', 'ulsfo'], 'row' => String[2], 'rack' => String[2]}], Netbox::Device::Location::Virtual = Struct[{'site' => Wmflib::Sites = Enum['codfw', 'drmrs', 'eqdfw', 'eqiad', 'eqord', 'eqsin', 'esams', 'ulsfo'], 'ganeti_cluster' => String[1], 'ganeti_group' => String[1]}]] value, got Undef (file: /etc/puppet/modules/role/manifests/wmcs/openstack/eqiad1/control.pp, line: 9, column: 9) on node cloudcontrol1007.eqiad.wmnet

It seems like the Netbox Hiera import happens later in a host's lifecycle than the cloud_private connection needs?
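For reference, the type in the error admits exactly two shapes. Below is a minimal sketch of the Hiera data the parameter expects; the struct fields and types are taken verbatim from the error above, while the key name and the values are invented for illustration:

```yaml
---
# Bare-metal shape (Netbox::Device::Location::BareMetal); the key name
# profile::netbox::host::location and all values are illustrative only.
profile::netbox::host::location:
  site: eqiad   # must be one of the Wmflib::Sites enum values
  row: c8       # 'row'  => String[2], i.e. at least two characters
  rack: c8      # 'rack' => String[2]
---
# Ganeti VM shape (Netbox::Device::Location::Virtual):
profile::netbox::host::location:
  site: eqiad
  ganeti_cluster: eqiad   # 'ganeti_cluster' => String[1]
  ganeti_group: c8        # 'ganeti_group'  => String[1]
```

On a first Puppet run this data has not been exported yet, so the lookup returns Undef and the type check rejects it, producing the error above.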

Event Timeline

The Netbox Hiera won't be generated until the server is made 'active' in Netbox.

My understanding is that you're one step ahead of Prod here, as you're deriving host networking from Netbox data (e.g. rack from VLAN, etc.), so you might catch new issues.
We should look at provisioning from beginning to end so we can mutualise the efforts here.

For this specific use case, the server lifecycle states specify that a host in the "planned" state isn't in Puppet, so it makes sense that it should be moved to "active" before Puppet works.

taavi renamed this task from "First Puppet run of a cloud_private connected node fails" to "Netbox device location information not available on the first Puppet run of a device". Sep 26 2023, 12:38 PM
taavi added subscribers: jbond, Volans.

Tagging @jbond @Volans as this is closely related to the server provisioning workflow. Looking at https://wikitech.wikimedia.org/wiki/Server_Lifecycle, the simple and obvious solution would be to change the device to the STAGED Netbox status and run the Hiera generation cookbook just before starting the initial Puppet run. IIRC there were some concerns with exporting network information before running the interface_automation.ImportPuppetDB script, but I think at least the location information should be usable at that point.

I think these are two separate issues. That task is about automatically provisioning the needed networks and IPs in Netbox, which was done manually for the host in question. This issue is about when that data is exported to the hiera repo.

The current assumptions are:

  • Hosts with Active or Failed status in Netbox must be in PuppetDB (there is a Netbox report to check this)
  • Hosts in PuppetDB must have an Active or Failed status in Netbox (there is a Netbox report to check this)
  • The Staged status in Netbox cannot be used (there is a Netbox validator that prevents it from being set). See T320696#8383673 for the previous reasoning about it.
  • The Hiera key profile::netbox::data::mgmt, which is populated from the exported data, is used by Prometheus to monitor the mgmt interface. Including Planned hosts as well might cause false-positive alerts (see the sketch below).
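To make the last point concrete, here is a purely hypothetical sketch of that key; the real structure is not shown in this task, so both the shape and the values below are assumptions. The point is that any host listed under it gets its mgmt interface probed, so a Planned host that is racked but not yet reachable would alert:

```yaml
# Hypothetical shape, for illustration only; the real export may differ.
profile::netbox::data::mgmt:
  cloudcontrol1007.mgmt.eqiad.wmnet:   # mgmt FQDNs use host.mgmt.<site>.wmnet
    site: eqiad
    rack: c8   # invented value
```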

Re-introducing the usage of the Staged status without all the required automation to ensure that the status is correctly updated in all cases would be a step back IMHO.

The netbox hiera sync happens when the sre.puppet.sync-netbox-hiera cookbook is run (this is also triggered by sre.dns.netbox), and indeed only imports data from Netbox devices in state 'active'.

The reimage cookbook does change the server status from 'planned' to 'active' when it completes, but that happens after the first Puppet run. As a work-around, however, I believe it should be possible to set the status to 'active' in Netbox manually once the Netbox Provision script has been run, before the reimage is triggered.

Tbh I wasn't planning on making any change there. Currently status is set to active once reimage completes. If we want to change that perhaps we can, as discussed elsewhere in this task, but I'm not sure it should be done differently for cloud servers versus the rest of our estate.

> My understanding is that you're one step ahead of Prod here, as you're deriving host networking from Netbox data (e.g. rack from VLAN, etc.), so you might catch new issues.
> We should look at provisioning from beginning to end so we can mutualise the efforts here.

Agreed. I think the Puppet approach is the only option we have right now. But as we look to configure networking with systemd-networkd, we should aim to drive it all from Netbox and remove the Puppet-driven network config for cloud hosts, Ganeti hosts, LVS, etc.

The components they require (bridges, VLAN interfaces, etc.) are all now modelled correctly in Netbox, so there is no need to use other sources of data (Hiera, etc.) to drive it.
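As a rough illustration of what that modelling provides, here is a sketch of how a bridge plus a VLAN subinterface might look when pulled out of Netbox. The field names follow Netbox's interface model (type, parent, bridge, mode, untagged_vlan), but the interface names, VLAN name, and overall document layout are invented:

```yaml
# Invented example; Netbox models bridge membership and VLANs natively.
interfaces:
  - name: br-cloud-private   # the bridge the cloud hosts attach to
    type: bridge
  - name: eno1.1105          # tagged subinterface on the physical NIC
    type: virtual
    parent: eno1
    bridge: br-cloud-private # enslaved to the bridge above
    mode: access
    untagged_vlan: cloud-private-c8-eqiad
```

With data like this available, a generator could render the equivalent systemd-networkd units directly from Netbox instead of from Hiera.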

T346428 will help with that, in terms of allocating all required IPs and interfaces when dc-ops add servers to racks. I'll improve that for Ganeti and LVS too once the picture of what they need longer term is clear.