Routed Ganeti: Add support for VM BGP
Open, Low, Public

Description

The ideal path would require upstream changes in Bird (especially for v6) - http://trubka.network.cz/pipermail/bird-users/2024-April/017580.html
If this is fixed/implemented, the guest VM side would work out of the box with our current global Bird config.

In the most basic setup the Ganeti side would only require an additional config block (plus the same for v6):

protocol bgp {
    ipv4 {
        import BGP-FROM-VMS;
        export none;
    };
    local 10.192.24.1 as 64612;
    neighbor range 10.192.24.0/23 external;
    multihop;
}

Where BGP-FROM-VMS is a filter like we have on the switches, defining at least the IPs allowed to be advertised from VMs.
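
As a rough sketch of what such a filter could look like in Bird 2 syntax: the filter name and the single allowed prefix are illustrative assumptions, not production values (10.3.0.1 is the recdns address used as an example in this task). The import line in the protocol block would reference whatever name is actually chosen.

```
# Hypothetical import filter for routes learned from VMs.
# Only prefixes in the allow-list are accepted, everything else is dropped.
filter vm_routes_in {
    if net ~ [ 10.3.0.1/32 ] then accept;   # illustrative allow-list entry
    reject;
}
```
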
Using dynamic neighbors carries a slight security risk, as we can't easily enforce which of the IPs in the BGP-FROM-VMS allow-list a given VM may advertise. For example, if a VM sets up a rogue BGP daemon and advertises 10.3.0.1 (recdns), we couldn't catch it here.
As all the VMs are trusted, this is not a blocker.
Setting up BGP authentication would remove the "rogue BGP speaker" risk.
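
A minimal sketch of what that could look like on the hypervisor side, assuming Bird's TCP MD5 `password` option can be combined with `neighbor range` on our kernels (not tested as part of this task). The secret is a placeholder and the VM side would need the same `password` line.

```
protocol bgp {
    ipv4 {
        import BGP-FROM-VMS;
        export none;
    };
    local 10.192.24.1 as 64612;
    neighbor range 10.192.24.0/23 external;
    multihop;                      # workaround, see the mailing list link above
    password "placeholder-secret"; # TCP MD5 key, must match on the VM
}
```
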
An internal RPKI infrastructure would remove the "BGP VM advertise a different IP than it's supposed to" risk, but it's a significant task.
Dynamic neighbors have the advantage of not needing any specific config on the hypervisor side, so when a VM is migrated to a different hypervisor, there is no need for a Puppet run or to prepopulate BGP neighbors (and have them down most of the time).
Another downside of dynamic neighbors is that only the guest VM side can initiate the connection, unless we can find a way to trigger some kind of probing at each VM creation in net-common (which would be ignored for non-BGP-speaking VMs).

For the record, the VM side test config:

router id 10.192.24.4;
debug protocols all;

protocol bgp {
    ipv4 {
        import none;
        export none;
    };
    local 10.192.24.4 as 64613;
    neighbor 10.192.24.1 external;
    multihop;
}

Here and above, the multihop; is only a workaround (see the bird-users mailing list link). Note that BGP establishes fine with external set on both sides.

2024-04-12T09:40:15.585454+00:00 testvm2006 bird: bgp1: Started
2024-04-12T09:40:15.585593+00:00 testvm2006 bird: bgp1: Connect delayed by 5 seconds
2024-04-12T09:40:19.492098+00:00 testvm2006 bird: bgp1: Connecting to 10.192.24.1 from local address 10.192.24.4
2024-04-12T09:40:19.493146+00:00 testvm2006 bird: bgp1: Connected
2024-04-12T09:40:19.493305+00:00 testvm2006 bird: bgp1: Sending OPEN(ver=4,as=64613,hold=240,id=0ac01804)
2024-04-12T09:40:19.493384+00:00 testvm2006 bird: bgp1: Got OPEN(as=64612,hold=240,id=10.192.21.6)
2024-04-12T09:40:19.493560+00:00 testvm2006 bird: bgp1: Sending KEEPALIVE
2024-04-12T09:40:19.494178+00:00 testvm2006 bird: bgp1: Got KEEPALIVE
2024-04-12T09:40:19.494610+00:00 testvm2006 bird: bgp1: BGP session established
2024-04-12T09:40:19.494993+00:00 testvm2006 bird: bgp1: State changed to up
2024-04-12T09:40:19.786584+00:00 testvm2006 bird: bgp1: Got UPDATE
2024-04-12T09:40:19.786741+00:00 testvm2006 bird: bgp1: Got END-OF-RIB
2024-04-12T09:40:19.786823+00:00 testvm2006 bird: bgp1: Sending END-OF-RIB

There are then 2 other topics worth discussing:

1/ AS path length
The current eBGP setup is the most straightforward to configure, troubleshoot, etc., but has the downside of adding an extra AS hop for the prefix advertised by the VM. So if we keep going that way we would need to do some AS-path prepending on all the other prefixes advertised by a similar ASN, for example if we decide to have a Routed Ganeti VM hosting a recdns server.
It's not an issue, but we should investigate if there are better ways of proceeding (eg. iBGP, or some other config knob).
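
For illustration only, the prepending option could be done in the Bird export filter of another host that announces the same prefix. The filter name, the prefix and ASN 64605 below are hypothetical placeholders, and this is a sketch of the idea rather than a proposed config.

```
# Hypothetical export filter for a non-Ganeti host announcing the same
# anycast prefix as a Routed Ganeti VM.
define OWN_AS = 64605;              # placeholder ASN

filter anycast_out_prepended {
    if net ~ [ 10.3.0.1/32 ] then {
        bgp_path.prepend(OWN_AS);   # one extra hop, to match the VM-originated path length
        accept;
    }
    reject;
}
```
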

2/ Migration BGP failover
When a VM is migrated to a different hypervisor, the BGP session will be shut down and re-established on the new hypervisor.
Assuming we don't use multihop, Bird on the hypervisor side will detect the tap interface going down, tear down the session and stop propagating the prefix, so no outage (or a few ms).
On the other hand, the VM side might take up to 30s to send an update, realize it's not talking to the same peer, and re-establish the session. It might be an OK tradeoff to avoid adding extra complexity.
If faster session re-establishment is needed, we could implement BFD, shorter BGP timers, or investigate if the hypervisor can notify the VM of the migration in some way.
We could also investigate how to cleanly shut down BGP when a VM is about to migrate, to eliminate the few ms of outage. Note that the VM's IP will take some ms to propagate as well.

Event Timeline

ayounsi created this task.

In general I'm a fan of dynamic neighbors so happy to use it here on the Ganeti side. I don't think the security concerns are significant for this use-case.

An internal RPKI infrastructure would remove the "BGP VM advertise a different IP than it's supposed to" risk, but it's a significant task.

Agreed this would be overkill. It also won't constrain which source IPs can announce a given range, only the ASNs.

1/ AS path length

Yeah this one is tricky. We are running into the issue of BGP (and most routing protocols) being designed to find the shortest path to a given destination, and Anycast being kind of a hack.

IBGP would certainly remove one variable, but I fear we could be chasing it forever. My own feeling (sort of similar to T360772) is that we should do what's needed to ensure internet traffic is load-balanced equally amongst VMs, and not worry too much about the internal flows (let them go to the nearest instance).

2/ Migration BGP failover

BFD is the obvious one here. We could also perhaps reduce the timers; 5/15 might be doable (below that BFD is better). But maybe check with application owners: if most use-cases are anycast, perhaps a blip during moves is acceptable.
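
A sketch of the BFD option on the hypervisor side, assuming BFD works with dynamic neighbors the same way it does with static ones (untested here). The interval and multiplier are arbitrary example values, and the `multihop` section is used because the BGP sessions currently rely on the multihop workaround. The VM would need a matching `protocol bfd` and `bfd on;`.

```
protocol bfd {
    multihop {
        interval 300 ms;   # example value, not a recommendation
        multiplier 3;      # session declared down after 3 missed packets
    };
}

protocol bgp {
    ipv4 {
        import BGP-FROM-VMS;
        export none;
    };
    local 10.192.24.1 as 64612;
    neighbor range 10.192.24.0/23 external;
    multihop;
    bfd on;                # tear the BGP session down when BFD reports the peer gone
}
```
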

The ideal path would require upstream changes in Bird (especially for v6) - http://trubka.network.cz/pipermail/bird-users/2024-April/017580.html

I think we only have a few VM types that do BGP, don't we? It shouldn't be too inconvenient to have two variants of the config (depending on which Ganeti cluster the VMs are on).

I was playing with this a bit and hit some issues with using IPv6 link-local. The link-local is a normal /64 network, so no complications due to a /128 there. However I think there is a snag, in that the dynamic neighbors on the hypervisor side won't work without an "interface" specified, i.e. just using a statement like this:

neighbor range fe80::/64 external;

isn't enough to make the connection work. The hypervisor logs this error, as the interface is included in the 'address' it is parsing:

ganeti2033 bird[3917127]: BGP: Unexpected connect from unknown address fe80::a800:ff:fe6b:aa1c%tap0 (port 58323)

Adding interface "tap0"; to the protocol block corrects the issue, allowing dynamic range fe80::/64 to be used, but that doesn't seem very useful - we'd need to have a separate configuration block for each VM/tap interface.
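
For reference, a sketch of the per-interface variant described above. It is reconstructed from the description rather than copied from the test host: the filter name is reused from the v4 example in the task description and the rest of the block is an assumption.

```
protocol bgp {
    ipv6 {
        import BGP-FROM-VMS;
        export none;
    };
    local as 64612;
    neighbor range fe80::/64 external;
    interface "tap0";    # binds the range to one tap, so one block per VM interface
}
```
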

It does work if configured from the VM's global unicast to the hypervisor's link-local address, however. We need to make sure the multihop statement is added on the hypervisor side (as its peer isn't on a local subnet), and not on the VM (as its peer is). Fwiw the configs I was playing with are here and here.

In terms of live migration and session re-establishment we should bear in mind that BIRD BGP sessions use default timers of 240/80, so a session sends a keepalive every 80 seconds and takes up to 4 minutes to go down.

root@testvm2006:~# birdc show protocols all ganeti_v4 | egrep "timer|^[a-z|A-Z]"
BIRD 2.0.12 ready.
Name       Proto      Table      State  Since         Info
ganeti_v4  BGP        ---        up     08:55:59.568  Established   
    Hold timer:       194.377/240
    Keepalive timer:  34.664/80

If we don't use BFD we should tweak these in the config down to something more sensible.
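
As a sketch of that tweak, the VM-side test config with the 5/15 values floated earlier in this thread would look roughly like this. The values are an example, not an agreed setting. Since the hold time is negotiated down to the lower of the two peers' values, setting it on one side should be enough to shorten it for the session.

```
protocol bgp ganeti_v4 {
    ipv4 {
        import none;
        export none;
    };
    local 10.192.24.4 as 64613;
    neighbor 10.192.24.1 external;
    multihop;             # workaround, as in the test config
    hold time 15;         # default is 240
    keepalive time 5;     # default is one third of the hold time
}
```
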

@ayounsi just to document our chat on irc about the direct/multihop stuff.

To me the problem is that there is no way to have a neighbor in Bird which is considered "direct" or "onlink" if the neighbor IP is not on a local subnet. There is a 'direct' statement that can be added, but with such a neighbor this has no effect (BIRD still does not attempt to establish the session). Looking at the code I believe the complication may be that a "direct" neighbor is always bound to an interface, and when a neighbor cannot be associated with one based on the set of interface IPs it is much harder to work out that relationship. It is probably possible, but it would take more checks (routing table / ARP table) than are currently done, and I suspect these could vary based on underlying platform/OS. Routes and ARP entries are also liable to change much more than interface IP config, so constantly monitoring those and updating the interface/neighbor status may be difficult.

I think what might be easier to implement in BIRD would be to mark a neighbor direct if:

  • The interface it is reachable on is specified in the protocol block
  • The direct statement is included in the protocol block

I'm not sure if this should be the default or not tbh. Perhaps for EBGP 'direct' should be assumed unless 'multihop' is specified, so that statement may not be required. But I do wonder if there are any existing setups such a default would break. Either way finding the interface to bind to seems complicated, so providing it in the config might be an easier change in Bird. Luckily with our ganeti VMs the interface name is predictable and constant, so it ought not be a headache for us.

For now, if we are required to use multi-hop, the peer won't fail if the interface status changes. I'm not sure if there is an up/down link transition when a VM is moved, but we can use BFD to ensure quick teardown if required. So hopefully this is only a minor concern operationally.