Page MenuHomePhabricator

Homer trying to delete BGP peerings for VMs on new Eqiad ganeti nodes
Open, LowPublic

Description

Problem

Hit a bit of an edge-case today we'd not seen before.

Running Homer against cr[1|2].eqiad it was trying to remove working BGP sessions to a bunch of nodes. On closer inspection it turned out that all of the affected devices were VMs on newly-commissioned ganeti hosts in Eqiad.

Cause

The problem is the Homer code expects to find the primary_ip for any ganeti host to be attached to a 'bridge' device ("private" usually), and tries to find a physical link on the host that belongs to this bridge to get the associated switch port.

However at the point the PuppetDB Import script is run during the reimage process, ganeti hosts do not yet have the configuration in place for the bridge devices, and the IP address is still attached to the physical interface. With the current Homer plugin code this means it does not properly find the attached switch port, and fails to add the VM to the BGP list as a result:

for hypervisor in hypervisors:
         ganeti_bridge = hypervisor.primary_ip.assigned_object
         ganeti_uplink = self._api.dcim.interfaces.get(device_id=hypervisor.id, bridge_id=ganeti_bridge.id,
                                                       type__neq='virtual')
         switchport = ganeti_uplink.connected_endpoints[0]

Fixes

There are a few ways we can approach this:

  1. Improve the ganeti install process so we re-run the puppetdb interface import *after* the host has been fully configured
    • This is probably a good idea anyway so Netbox properly reflects what is set up
  2. Move away from the current network setup process so we add everything in Netbox the way we want from day one
  3. Adjust the Homer code so it will find the switchport in the case where the device primary IP is on either a bridge or physical Ethernet interface
  4. Adjust the Homer code to instead by default assume VMs on legacy "row-wide" vlans should peer with the CRs

Of all the options I actually like the last one best. The current code is in some ways more flexible, it would work in certain potential scenarios we don't have right now (peering to top-of-rack from row-wide vlan in an EVPN row). Option 3 would retain that flexibility. But we don't expect to ever set things up that way, so I think simply selecting the hosts based on the vlan membership of their primary IP will suffice. It should also be less Netbox API calls which will hopefully speed things up.

As mentioned we should think about how to approach #1 to ensure Netbox reflects the actual setup anyway.

Event Timeline

cmooney triaged this task as High priority.

Change #1099225 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/software/homer/deploy@master] Adjust how we build list of server BGP peerings for CRs

https://gerrit.wikimedia.org/r/1099225

The above patch determines what devices need to peer with CRs based on vlan membership (and the vlan naming pattern being legacy). Tested running against all CRs and lsw's and it is a no-diff.

A nice benefit is that it only takes ~50 seconds to generate the config for cr1-eqiad on my laptop, as opposed to over 5 minutes for the previous version :)

For now, as I've tested it and know it's ok for the issue detailed in the task description, I will run the puppetdb import script for the new ganeti hosts so the issue geos away (even before merging the patch).

Change #1099225 merged by Cathal Mooney:

[operations/software/homer/deploy@master] Adjust how we build list of server BGP peerings for CRs

https://gerrit.wikimedia.org/r/1099225

For (1) we can have the sre.ganeti.addnode cookbook call the PuppetDBImport script towards the end. What do you and @MoritzMuehlenhoff think ?

For (1) we can have the sre.ganeti.addnode cookbook call the PuppetDBImport script towards the end. What do you and @MoritzMuehlenhoff think ?

That's actually not a bad idea yeah, though on the most recent ones its been done out of process I think.

Either way I'm gonna close this particular task as the issue has been addressed through number 4 (which we may need to revisit if we ever reconsider T360772).

Change #1117554 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/cookbooks@master] ganeti.addnode: run ImportPuppetDB script after node addition

https://gerrit.wikimedia.org/r/1117554

cmooney lowered the priority of this task from High to Low.Feb 6 2025, 12:58 PM