Problem
Hit a bit of an edge-case today we'd not seen before.
Running Homer against cr[1|2].eqiad, it tried to remove working BGP sessions to a bunch of nodes. On closer inspection, all of the affected nodes turned out to be VMs on newly commissioned ganeti hosts in Eqiad.
Cause
The problem is that the Homer code expects the primary_ip of any ganeti host to be attached to a 'bridge' device (usually "private"), and looks for a physical link on the host that belongs to this bridge to get the associated switch port.
However, at the point the PuppetDB Import script is run during the reimage process, ganeti hosts do not yet have the configuration in place for the bridge devices, and the IP address is still attached to the physical interface. With the current Homer plugin code this means it does not find the attached switch port, and as a result fails to add the VMs on that host to the BGP list:
    for hypervisor in hypervisors:
        ganeti_bridge = hypervisor.primary_ip.assigned_object
        ganeti_uplink = self._api.dcim.interfaces.get(device_id=hypervisor.id, bridge_id=ganeti_bridge.id,
                                                      type__neq='virtual')
        switchport = ganeti_uplink.connected_endpoints[0]
Fixes
There are a few ways we can approach this:
- Improve the ganeti install process so we re-run the puppetdb interface import *after* the host has been fully configured
  - This is probably a good idea anyway so Netbox properly reflects what is set up
- Move away from the current network setup process so we add everything in Netbox the way we want from day one
- Adjust the Homer code so it will find the switchport in the case where the device primary IP is on either a bridge or physical Ethernet interface
- Adjust the Homer code to assume by default that VMs on legacy "row-wide" vlans should peer with the CRs
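To make the third option concrete, here is a rough sketch of the lookup logic, not the actual Homer plugin code. The `Interface` class is a hypothetical, simplified stand-in for a Netbox interface record, and `find_uplink` is an illustrative name; real code would query Netbox rather than walk an in-memory list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Interface:
    """Hypothetical, simplified stand-in for a Netbox interface record."""
    name: str
    type: str                          # e.g. 'bridge', 'virtual' or a physical type
    bridge: Optional[str] = None       # name of the bridge this interface belongs to
    switchport: Optional[str] = None   # connected switch port, if cabled


def find_uplink(primary_iface: Interface, host_ifaces: List[Interface]) -> Optional[Interface]:
    """Return the physical interface behind the host's primary IP, if any."""
    if primary_iface.type == 'bridge':
        # Fully configured host: find the non-virtual member of the bridge.
        for iface in host_ifaces:
            if iface.bridge == primary_iface.name and iface.type not in ('virtual', 'bridge'):
                return iface
        return None
    if primary_iface.type != 'virtual':
        # Freshly imported host: the IP is still on the physical interface itself.
        return primary_iface
    return None
```

With this shape, a host whose primary IP sits on "private" resolves to the bridge's physical member, while a newly reimaged host resolves to the physical interface holding the IP. In both cases the switch port would then be read from the returned interface.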
Of all the options I actually like the last one best. The current code is in some ways more flexible: it would work in certain potential scenarios we don't have right now (peering to top-of-rack from a row-wide vlan in an EVPN row), and option 3 would retain that flexibility. But we don't expect to ever set things up that way, so I think simply selecting the hosts based on the vlan membership of their primary IP will suffice. It should also need fewer Netbox API calls, which will hopefully speed things up.
As mentioned, we should think about how to approach #1 anyway, to ensure Netbox reflects the actual setup.
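A minimal sketch of what selecting by vlan membership could look like, assuming we have a list of the legacy row-wide vlans and their prefixes. The vlan names and prefixes in `ROW_WIDE_PREFIXES` are made up for illustration, as are the function names; the real data would come from Netbox.

```python
import ipaddress
from typing import Optional

# Hypothetical data: legacy row-wide vlans and their prefixes (illustrative values only).
ROW_WIDE_PREFIXES = {
    'private1-a': ipaddress.ip_network('10.0.0.0/22'),
    'private1-b': ipaddress.ip_network('10.0.16.0/22'),
}


def row_wide_vlan(primary_ip: str) -> Optional[str]:
    """Return the name of the row-wide vlan containing the primary IP, or None."""
    addr = ipaddress.ip_interface(primary_ip).ip
    for name, prefix in ROW_WIDE_PREFIXES.items():
        if addr in prefix:
            return name
    return None


def peers_with_core_routers(primary_ip: str) -> bool:
    """A VM whose primary IP is on a row-wide vlan is assumed to peer with the CRs."""
    return row_wide_vlan(primary_ip) is not None
```

The decision here depends only on the VM's own primary IP, so the state of the hypervisor's interfaces in Netbox no longer matters for building the BGP list.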