
Routed Ganeti: Add support for VM BGP
Closed, ResolvedPublic

Description

The ideal path would require upstream changes in Bird (especially for v6) - http://trubka.network.cz/pipermail/bird-users/2024-April/017580.html
If this is fixed/implemented, the guest VM side would work out of the box with our current global Bird config.

In the most basic setup the Ganeti side would only require an additional config block (plus the same for v6):

protocol bgp {
    ipv4 {
        import BGP-FROM-VMS;
        export none;
    };
    local 10.192.24.1 as 64612;
    neighbor range 10.192.24.0/23 external;
    multihop;
}

Where BGP-FROM-VMS is a filter like we have on the switches, defining at least the IPs allowed to be advertised from VMs.
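For illustration, such a filter could look like the sketch below (the prefix list is a placeholder; the real one would be generated from the per-VM allowed-IPs data):

    filter BGP-FROM-VMS {
        # Placeholder: only the service IPs VMs are allowed to advertise
        if net ~ [ 10.3.0.1/32 ] then accept;
        reject;
    }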
Using dynamic neighbors carries a slight security risk, as we can't easily enforce which IPs a VM can advertise from the allowed BGP-FROM-VMS list. For example, if a VM set up a rogue BGP daemon and advertised 10.3.0.1 (recdns), we couldn't catch it here.
As all the VMs are trusted, this is not a blocker.
Setting up BGP authentication would remove the "rogue BGP speaker" risk.
An internal RPKI infrastructure would remove the "BGP VM advertise a different IP than it's supposed to" risk, but it's a significant task.
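For reference, Bird supports TCP MD5 session authentication via the password option; a minimal sketch (the secret is a placeholder, and compatibility with dynamic neighbor ranges would need testing):

    protocol bgp {
        ...
        password "placeholder-shared-secret";  # TCP MD5 (RFC 2385)
    }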
Dynamic neighbors have the advantage of not needing any specific config on the hypervisor side, so when a VM is migrated to a different hypervisor there is no need for a Puppet run or to prepopulate BGP neighbors (which would be down most of the time).
Another downside of dynamic neighbors is that only the guest VM side can initiate the connection, unless we can find a way to trigger some kind of probing at each VM creation in net-common (which would be ignored for non-BGP-speaking VMs).

For the record, the VM side test config:

router id 10.192.24.4;
debug protocols all;

protocol bgp {
    ipv4 {
        import none;
        export none;
    };
    local 10.192.24.4 as 64613;
    neighbor 10.192.24.1 external;
    multihop;
}

Here and above, the multihop; statement is only a workaround (see the bird-users mailing list link). Note that BGP establishes fine with external set on both sides.

2024-04-12T09:40:15.585454+00:00 testvm2006 bird: bgp1: Started
2024-04-12T09:40:15.585593+00:00 testvm2006 bird: bgp1: Connect delayed by 5 seconds
2024-04-12T09:40:19.492098+00:00 testvm2006 bird: bgp1: Connecting to 10.192.24.1 from local address 10.192.24.4
2024-04-12T09:40:19.493146+00:00 testvm2006 bird: bgp1: Connected
2024-04-12T09:40:19.493305+00:00 testvm2006 bird: bgp1: Sending OPEN(ver=4,as=64613,hold=240,id=0ac01804)
2024-04-12T09:40:19.493384+00:00 testvm2006 bird: bgp1: Got OPEN(as=64612,hold=240,id=10.192.21.6)
2024-04-12T09:40:19.493560+00:00 testvm2006 bird: bgp1: Sending KEEPALIVE
2024-04-12T09:40:19.494178+00:00 testvm2006 bird: bgp1: Got KEEPALIVE
2024-04-12T09:40:19.494610+00:00 testvm2006 bird: bgp1: BGP session established
2024-04-12T09:40:19.494993+00:00 testvm2006 bird: bgp1: State changed to up
2024-04-12T09:40:19.786584+00:00 testvm2006 bird: bgp1: Got UPDATE
2024-04-12T09:40:19.786741+00:00 testvm2006 bird: bgp1: Got END-OF-RIB
2024-04-12T09:40:19.786823+00:00 testvm2006 bird: bgp1: Sending END-OF-RIB

There are then 2 other topics worth discussing:

1/ AS path length
The current eBGP setup is the most straightforward to configure, troubleshoot, etc., but has the downside of adding an extra AS hop to the prefixes advertised by the VM. So if we keep going that way we would need to do some AS-path prepending on all the other prefixes advertised by a similar ASN, for example if we decide to have a Routed Ganeti VM hosting a recdns server.
It's not an issue, but we should investigate if there are better ways of proceeding (eg. iBGP, or some other config knob).

2/ Migration BGP failover
When a VM is migrated to a different hypervisor, the BGP session will be shut down and re-established on the new hypervisor.
Assuming we don't use multihop, Bird on the hypervisor side will detect the tap interface going down, tear down the session, and stop propagating the prefix, so no outage (or a few ms).
On the other side, the VM might take up to 30s to send an update, realize it's not talking to the same peer, and re-establish the session. It might be an acceptable tradeoff to not add extra complexity.
If faster session re-establishment is needed, we could implement BFD, shorter BGP timers, or investigate whether the hypervisor can notify the VM of the migration in some way.
We could also investigate how to cleanly shut down BGP when a VM is about to migrate, to eliminate the few ms outage. Note that the VM's IP will take some ms to propagate as well.

Event Timeline

ayounsi triaged this task as Low priority.

In general I'm a fan of dynamic neighbors so happy to use it here on the Ganeti side. I don't think the security concerns are significant for this use-case.

An internal RPKI infrastructure would remove the "BGP VM advertise a different IP than it's supposed to" risk, but it's a significant task.

Agreed this would be overkill. It also won't constrain which source IPs can announce a given range, only the ASNs.

1/ AS path length

Yeah this one is tricky. We are running into the issue of BGP (and most routing protocols) being designed to find the shortest path to a given destination, and Anycast being kind of a hack.

IBGP would certainly remove one variable, but I fear we could be chasing it forever. My own feeling (sort of similar to T360772) is that we should do what's needed to ensure internet traffic is load-balanced equally amongst VMs, and not worry too much about the internal flows (let them go to the nearest instance).

2/ Migration BGP failover

BFD is the obvious one here. We could also perhaps reduce the timers; 5/15 might be doable (below that BFD is better). But maybe check with application owners - if most use-cases are anycast, perhaps a blip during moves is acceptable.

The ideal path would require upstream changes in Bird (especially for v6) - http://trubka.network.cz/pipermail/bird-users/2024-April/017580.html

I think we only have a few VM types that do BGP do we? Shouldn't be too inconvenient to have two variants of the config (depending what ganeti cluster the VMs are on).

I was playing with this a bit and hit some issues with using IPv6 link-local. The link-local is a normal /64 network, so no complications due to a /128 there. However I think there is a snag, in that dynamic neighbors on the hypervisor side won't work without an "interface" specified. i.e. just using a statement like this:

neighbor range fe80::/64 external;

isn't enough to make the connection work; the hypervisor logs this error, as the interface is included in the 'address' it is parsing:

ganeti2033 bird[3917127]: BGP: Unexpected connect from unknown address fe80::a800:ff:fe6b:aa1c%tap0 (port 58323)

Adding interface "tap0"; to the protocol block corrects the issue, allowing dynamic range fe80::/64 to be used, but that doesn't seem very useful - we'd need to have a separate configuration block for each VM/tap interface.

It does work if configured from the VMs global unicast to the hypervisors link-local address, however. We need to make sure the multihop statement is added on the hypervisor side (as its peer isn't on a local subnet), and not on the VM (as its peer is). Fwiw the configs I was playing with are here and here.

In terms of live migration and session re-establishment we should bear in mind that BIRD bgp sessions will use default BGP timers of 240/80. So the session will send a keepalive every 80 seconds and take up to 4 minutes to go down.

root@testvm2006:~# birdc show protocols all ganeti_v4 | egrep "timer|^[a-z|A-Z]"
BIRD 2.0.12 ready.
Name       Proto      Table      State  Since         Info
ganeti_v4  BGP        ---        up     08:55:59.568  Established   
    Hold timer:       194.377/240
    Keepalive timer:  34.664/80

If we don't use BFD we should tweak these in the config down to something more sensible.
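For example, Bird lets us override these per protocol; a sketch with illustrative 5/15 values:

    protocol bgp {
        ...
        keepalive time 5;
        hold time 15;
    }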

@ayounsi just to document our chat on irc about the direct/multihop stuff.

To me the problem is that there is no way to have a neighbor in Bird which is considered "direct" or "onlink" if the neighbor IP is not on a local subnet. There is a 'direct' statement that can be added, but with such a neighbor this has no effect (BIRD still does not attempt to establish the session). Looking at the code I believe the complication may be that a "direct" neighbor is always bound to an interface, and when a neighbor cannot be associated with one based on the set of interface IPs it is much harder to work out that relationship. It is probably possible, but it would take more checks (routing table / ARP table) than are currently done, and I suspect these could vary based on underlying platform/OS. Routes and ARP entries are also liable to change much more than interface IP config, so constantly monitoring those and updating the interface/neighbor status may be difficult.

I think what might be easier to implement in BIRD would be to mark a neighbor direct if:

  • The interface it is reachable on is specified in the protocol block
  • The direct statement is included in the protocol block

I'm not sure if this should be the default or not tbh. Perhaps for EBGP 'direct' should be assumed unless 'multihop' is specified, so that statement may not be required. But I do wonder if there are any existing setups such a default would break. Either way finding the interface to bind to seems complicated, so providing it in the config might be an easier change in Bird. Luckily with our ganeti VMs the interface name is predictable and constant, so it ought not be a headache for us.

For now, if we are required to use multi-hop, the peer won't fail if the interface status changes. I'm not sure if there is an up/down link transition when a VM is moved, but we can use BFD to ensure quick teardown if required. So hopefully this is only a minor concern operationally.
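As a note, enabling bfd yes; on a BGP protocol also requires a protocol bfd instance to be defined; a minimal sketch (timer values are placeholders):

    protocol bfd {
        interface "*" {
            interval 300 ms;
            multiplier 3;
        };
    }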

https://trubka.network.cz/pipermail/bird-users/2024-May/017687.html

I already made it a feature request and plan to implement it.

I'll leave it to the Bird maintainers to figure out the best way to implement it :)

Change #1052109 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Bird: use the "interface" config option for v6 peers

https://gerrit.wikimedia.org/r/1052109

The above patch should work around the issue for v6 (based on @cmooney's testing).

Next for a full workaround we need to:

  • Figure out how to programmatically (and cleanly) add the "multihop" config option for v4 on routed Ganeti VMs only
  • Enable/test BFD between Ganeti and its VMs

Enable/test BFD between Ganeti and its VMs

Adding the BFD statement works fine for v4, but on the hypervisor side I don't think it can be added for v6 in the current state of things.

hypervisor
protocol bgp guest_private_v4 {
    ipv4 {
        import none;
        export none;
    };
    bfd yes;
    local 10.192.24.1 as 64612;
    neighbor range 10.192.24.0/23 external;
    multihop;
}
protocol bgp guest_private_v6 {
    ipv6 {
        import none;
        export none;
    };
    dynamic name "private_vm6_";
    dynamic name digits 2;
    local as 64612;
    neighbor range 2620:0:860:140::/64 external;
    multihop;
}

Adding bfd yes; returns bird: /etc/bird/bird.conf:90:1 Multihop BGP with BFD requires specified local address
Adding local fe80::2022:22ff:fe22:2201 requires an interface stanza, which is not possible with dynamic tapX interfaces.

The upstream feature request should hopefully solve that.

VM
protocol bgp ganeti_v4 {
    advertise hostname on;
    ipv4 {
        import none;
        export none;
    };
    bfd yes;
    local 10.192.24.4 as 64613;
    neighbor 10.192.24.1 as 64612 external;
    multihop;
}
protocol bgp ganeti_v6 {
    advertise hostname on;
    ipv6 {
        import none;
        export none;
    };
    local 2620:0:860:140:10:192:24:4 as 64613;
    interface "ens13";
    neighbor fe80::2022:22ff:fe22:2201 as 64612 external;
}

The multihop knob is required on the hypervisor side for v6; if it's not set, incoming sessions are rejected with:
2024-07-05T08:36:56.811256+00:00 ganeti2033 bird: dynbgp2: Incoming connection from 2620:0:860:140:10:192:24:4 (port 40993) rejected

Figure out how to programmatically (and cleanly) add the "multihop" config option for v4 on routed Ganeti VMs only

Cleanest I think is to use the following in Puppet, but it's obviously not ideal to push a code change to all the Bird-speaking devices:

$location = lookup('profile::netbox::host::location')
$location['ganeti_cluster'] == 'codfw02'

An alternative is to match on the VM's IP.

Adding bfd yes; returns bird: /etc/bird/bird.conf:90:1 Multihop BGP with BFD requires specified local address
Adding local fe80::2022:22ff:fe22:2201 requires an interface stanza, which is not possible with dynamic tapX interfaces.

Ok yeah that's a drawback. Still, I think it's perhaps not a blocker? We don't shuffle VMs so often that it would cause a major headache.

As per my previous comment, we'd better set more aggressive BGP timers if that's the case; 5/30 would seem reasonable (or even 4/20 perhaps).

Figure out how to programmatically (and cleanly) add the "multihop" config option for v4 on routed Ganeti VMs only

Cleanest I think is to use the following in Puppet, but it's obviously not ideal to push a code change to all the Bird-speaking devices:

$location = lookup('profile::netbox::host::location')
$location['ganeti_cluster'] == 'codfw02'

An alternative is to match on the VM's IP.

Seems ok to me. Regarding changes I think longer term we should maybe think if there is some mechanism to allow for staggered roll-outs or similar. We ought to be able to make small changes here without being afraid, but obviously as a mistake will get pushed everywhere we need to be mindful of that risk.

Adding the BFD statement works fine for v4, but on the hypervisor side I don't think it can be added for v6 in the current state of things.

Adding bfd yes; returns bird: /etc/bird/bird.conf:90:1 Multihop BGP with BFD requires specified local address
Adding local fe80::2022:22ff:fe22:2201 requires an interface stanza, which is not possible with dynamic tapX interfaces.

One potential solution to this, and perhaps cleaner in general, would be to make all the 'tap' interfaces on the hypervisor a member of a bridge device. We would add the IPs used for peering to this bridge and not configure them on the individual tap interfaces. I guess this would need tweaking of the Ganeti scripts to set it up this way (or maybe not - the default L2-based Ganeti uses bridges?).

We should then be able to use the bridge device link-local for BGP and BFD on the hypervisor side, by adding an interface br0 or similar to the Bird config stanza. Using the link-local both sides probably means we could remove the 'multihop' too.

Would need testing of course, but off the top of my head I can't think why it wouldn't work, or any issues it may cause otherwise. The VMs in theory could communicate on their link-locals directly over the Ethernet bridge, but comms to the unicast or v4 address shouldn't change.
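A hypothetical hypervisor-side stanza under that bridge model (the br0 name and its link-local address are assumptions):

    protocol bgp guest_private_v6 {
        ipv6 {
            import none;
            export none;
        };
        local fe80::1 as 64612;            # link-local assigned to br0 (placeholder)
        interface "br0";                   # stable bridge device instead of per-VM tapX
        neighbor range fe80::/64 external; # no multihop needed: peers are on-link
    }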

Interesting idea, definitely worth a try. I'm particularly curious how routing between VMs would work in that setup, and where to apply filtering. But not requiring multihop would be a plus.

I'm also wondering if we could set an extra non-link-local IP on the tap interfaces.
Alternatively, we could set the same unicast IP on all the hypervisors' loopbacks and establish multihop sessions with it. But that would also require "special casing" routed Ganeti VMs' neighbor IPs.

Interesting idea, definitely worth a try. I'm particularly curious how routing between VMs would work in that setup, and where to apply filtering.

Routing between VMs would work as before. Each has a /32 or /128 IP, a static route for the host's br0 IP via their main interface onlink, and a default route towards that IP. So even though they are members of the same bridge they don't try to ARP for adjacent VMs, everything is routed - or being specific sent to the MAC of the br0 device. The hypervisor then routes traffic to the other VM with a normal routing lookup same as if the packet came in from the network-side.
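For illustration, the VM's routes expressed as a Bird static protocol (in practice they are set by the OS network config; addresses and interface name are placeholders):

    protocol static {
        ipv4;
        route 10.192.24.1/32 via "ens13";  # host's br0 IP, on-link via the VM's interface
        route 0.0.0.0/0 via 10.192.24.1;   # default route towards the hypervisor
    }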

I'm also wondering if we could set an extra non-link-local IP on the tap interfaces.

Yeah I think that ought to be possible.

Alternatively, we could set the same unicast IP on all the hypervisors' loopbacks and establish multihop sessions with it. But that would also require "special casing" routed Ganeti VMs' neighbor IPs.

Yeah I guess that's a similar idea to putting it on a common bridge device. I probably prefer the latter as it means we should be able to declare the interface (br0) in the bird config and have it treated as direct. We can't say "interface lo0" in the bird conf.

Good news and good timing: the contract to implement the new feature in Bird is in the procurement approval pipeline, with an (extremely short) timeline of having the feature land in Bird's master branch by end of June.

Change #1052109 merged by Ayounsi:

[operations/puppet@production] Bird: use the "interface" config option for v6 peers

https://gerrit.wikimedia.org/r/1052109

Mentioned in SAL (#wikimedia-operations) [2025-06-16T13:32:10Z] <sukhe> sudo cumin -b1 -s30 'A:dnsbox' "run-puppet-agent --enable 'CR1052109'": T362392

Mentioned in SAL (#wikimedia-operations) [2025-06-16T13:32:58Z] <sukhe> sudo cumin -b1 -s30 'A:wikidough' "run-puppet-agent --enable 'CR1052109'": T362392

Mentioned in SAL (#wikimedia-operations) [2025-06-30T08:50:57Z] <XioNoX> test routed ganeti compatible bird on ganeti2034/testvm2006 - T362392

I've been testing the newly released feature and hitting a few issues, I've already emailed the Bird team about them.

1/ Dynamic neighbors

On the hypervisor side, I had to specify the "interface" configuration option to allow for onlink

protocol bgp {
    ipv4 {
        import none;
        export none;
    };
    local 10.192.24.1 as 64612;
    neighbor range 10.192.24.0/23 onlink external;
    interface "tap0";
}

Commenting out interface shows this error message:

birdc[2972069]: /etc/bird/bird.conf:57:1 Onlink BGP must have interface configured

Unfortunately the "tapX" interfaces (facing the VMs) are created/removed dynamically based on the VM operation (creation/removal/migration).

2/ IPv6 unknown address

On my VM:

protocol bgp ganeti_v6 {
    advertise hostname on;
    ipv6 {
        import none;
        export none;
    };
    local 2620:0:860:140:10:192:24:4 as 64613;
    interface "ens13";
    neighbor fe80::2022:22ff:fe22:2201 onlink as 64612 external;
}

On my hypervisor:

protocol bgp dynamic_v6 {
    ipv6 {
        import none;
        export none;
    };
    local fe80::2022:22ff:fe22:2201 as 64612;
    neighbor range 2620:0:860:140::/64 onlink external;
    interface "tap0";
}

But the logs on the hypervisor side show:

ganeti2034 bird: BGP: Unexpected connect from unknown address 2620:0:860:140:10:192:24:4 (port 35605)

While there is a proper route to that VM's IP:
2620:0:860:140:10:192:24:4 dev tap0 proto static metric 1024 pref medium

On the first issue, a possible workaround might be to pre-populate the config from tap0 to tap200, as the below seems to work (or at least doesn't throw an error):

protocol bgp dynamic_v4 {
    ipv4 {
        import none;
        export none;
    };
    local 10.192.24.1 as 64612;
    neighbor range 10.192.24.0/23 onlink external;
    interface "tap0";
}
protocol bgp dynamic_v4_test {
    ipv4 {
        import none;
        export none;
    };
    local 10.192.24.1 as 64612;
    neighbor range 10.192.24.0/23 onlink external;
    interface "tap20";
}

For the first problem I still think the better option is to make all tap* interfaces a member of a bridge, and specify the bridge in the BGP config.

On the second I'm somewhat confused. Looks like a bug that needs fixing, yeah; with or without a route back, the dynamic-neighbor range should mean the peer is allowed to connect if the packet arrives on the right interface.

For the first problem I still think the better option is to make all tap* interfaces a member of a bridge, and specify the bridge in the BGP config.

Arzhel called my bluff here by asking what this would look like so I had to validate it a little. Some notes here: P78727

Mentioned in SAL (#wikimedia-operations) [2025-07-15T07:50:58Z] <XioNoX> more Bird test on ganeti2034 & testvm2006 - T362392

Change #1169662 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] WIP: Ganeti Bird BGP

https://gerrit.wikimedia.org/r/1169662

Change #1169663 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Routed ganeti: disable IPv4 ICMP redirects

https://gerrit.wikimedia.org/r/1169663

Change #1169663 merged by Ayounsi:

[operations/puppet@production] Routed Ganeti: disable IPv4 ICMP redirects

https://gerrit.wikimedia.org/r/1169663

Change #1170570 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] WIP: Bird: VM side - add support for Routed Ganeti

https://gerrit.wikimedia.org/r/1170570

Mentioned in SAL (#wikimedia-operations) [2025-07-21T09:20:18Z] <XioNoX> manually install bird2_2.17.1+branch.mq.bgp.multilisten.c47b08 on ganeti2033 and ganeti700x - T362392

Change #1169662 merged by Ayounsi:

[operations/puppet@production] Ganeti Bird BGP

https://gerrit.wikimedia.org/r/1169662

Change #1171212 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] site.pp: remove doh7003 from insetup

https://gerrit.wikimedia.org/r/1171212

Change #1171212 merged by Ssingh:

[operations/puppet@production] site.pp: remove doh7003 from insetup

https://gerrit.wikimedia.org/r/1171212

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1003 for host doh7003.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 for host doh7003.wikimedia.org with OS bookworm completed:

  • doh7003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507211418_sukhe_2652658_doh7003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1171236 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Routed Ganeti: also permit anycast to be advertised from VMs

https://gerrit.wikimedia.org/r/1171236

Change #1171236 merged by jenkins-bot:

[operations/homer/public@master] Routed Ganeti: also permit anycast to be advertised from VMs

https://gerrit.wikimedia.org/r/1171236

Mentioned in SAL (#wikimedia-operations) [2025-07-22T16:06:15Z] <sukhe> sudo cumin "A:dnsbox" "disable-puppet 'merging CR 1170570'": T362392

Change #1170570 merged by Ayounsi:

[operations/puppet@production] Bird: VM side - add support for Routed Ganeti

https://gerrit.wikimedia.org/r/1170570

Change #1171603 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] site.pp: move durum700[34] and doh7004 to specific roles

https://gerrit.wikimedia.org/r/1171603

Change #1171603 merged by Ssingh:

[operations/puppet@production] site.pp: move durum700[34] and doh7004 to specific roles

https://gerrit.wikimedia.org/r/1171603

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1003 for host durum7003.magru.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1003 for host doh7004.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 for host durum7003.magru.wmnet with OS bookworm completed:

  • durum7003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507221744_sukhe_2956639_durum7003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 for host doh7004.wikimedia.org with OS bookworm completed:

  • doh7004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507221732_sukhe_2956627_doh7004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1179660 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Create repository components for Bird version with support for routed Ganeti

https://gerrit.wikimedia.org/r/1179660

Change #1179660 merged by Muehlenhoff:

[operations/puppet@production] Create repository components for Bird version with support for routed Ganeti

https://gerrit.wikimedia.org/r/1179660

Mentioned in SAL (#wikimedia-operations) [2025-08-18T13:14:06Z] <moritzm> imported bird2 2.17.1+branch.mq.bgp.multilisten.c47b08a1524c-cznic.1 into component/bird-routed-ganeti for Bookworm T362392

Change #1179689 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add a parameter to the Bird class to install the component enabled for routed Ganeti

https://gerrit.wikimedia.org/r/1179689

Change #1179689 merged by Muehlenhoff:

[operations/puppet@production] Bird: Add a parameter to install the Bird enabled for routed Ganeti

https://gerrit.wikimedia.org/r/1179689

Change #1179706 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] ganeti-routed: Enable bird component for routed Ganeti

https://gerrit.wikimedia.org/r/1179706

Change #1179722 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] bird::anycast: Add a parameter to install the Bird enabled for routed Ganeti

https://gerrit.wikimedia.org/r/1179722

Change #1179706 merged by Muehlenhoff:

[operations/puppet@production] ganeti-routed: Enable bird component for routed Ganeti

https://gerrit.wikimedia.org/r/1179706

Change #1179722 merged by Muehlenhoff:

[operations/puppet@production] bird::anycast: Add a parameter to install the Bird enabled for routed Ganeti

https://gerrit.wikimedia.org/r/1179722

Change #1179972 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] durum: Enable bird component in magru

https://gerrit.wikimedia.org/r/1179972

Change #1179981 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] doh: Enable bird component in magru

https://gerrit.wikimedia.org/r/1179981

Change #1179972 merged by Muehlenhoff:

[operations/puppet@production] durum: Enable bird component in magru

https://gerrit.wikimedia.org/r/1179972

Change #1179981 merged by Muehlenhoff:

[operations/puppet@production] doh: Enable bird component in magru

https://gerrit.wikimedia.org/r/1179981

This is all done. Follow up will need to happen (documented in https://wikitech.wikimedia.org/wiki/Ganeti#VMs_BGP )