Ganeti on modern network design
Virtualisation with per rack L3 subnets

Context

For reasons already mentioned in other docs (e.g. Eqiad Expansion Network Design) we’re moving towards a network architecture where the servers’ layer 3 domains (subnets) are constrained to each rack. Currently (and in most of our core DCs) those layer 3 domains are stretched across all the racks of a given row. In that setting, a Ganeti cluster of a given row (where its hypervisors are spread across the row) leverages this L2 adjacency to live-migrate VMs between hypervisors.
In other words, if work is going to be done on hypervisor1, all the VMs it hosts can be temporarily and transparently distributed across the other hypervisors to prevent any disruption. Having the same vlan trunked to all the hypervisors of the same row allows the VMs to move to a different hypervisor without requiring any IP renumbering, and thus without downtime.

There are multiple ways to have Ganeti fully operational on the new network design, each of course with its own set of tradeoffs (cost, implementation or migration complexity, uptime).

Per rack clusters

This is the easiest to implement, as we already have all the tooling and automation. This is also what we’re doing in the new POPs design. As each rack is its own domain, we can have one or more bridged hypervisors per rack. We currently have between 6 and 7 hypervisors per row, with 24 to 38 VMs per row.
On one side of the spectrum there is 1 hypervisor per rack: this fully prevents any kind of live migration (automatic or manual), and all the VMs have the same constraints as physical servers. For example, if the ToR or hypervisor needs any kind of maintenance, the VMs will go down during the maintenance window. 1 per rack also means a large number of “micro-clusters”, making VM allocation more difficult.
At the other end of the spectrum, we could have all 6 or 7 hypervisors in the same rack. This option makes live migration and hypervisor maintenance easy, but a ToR maintenance or failure means losing all 24 to 38 VMs. This could also be problematic in terms of overall server placement between racks (for both rack space and network usage).
In between, clusters of 2 to 3 hypervisors per rack mitigate the downsides of both extremes, but only mitigate them. Unless running hypervisors at 50% capacity, which is not economically viable, not all VMs could be drained. Similarly, when maintenance needs to happen on a ToR, many VMs will go down.

Even though we’re designing systems to be redundant between racks, rows and sites, some services don’t or can’t follow those principles, for example active/passive services with no automatic failover. Not being able to migrate VMs would increase the workload, especially during planned maintenance.

L2 abstraction at the ToR

This option is in some way mimicking the current situation, but instead of using a proprietary Juniper technology to bridge the same vlans across rows, we use a more standardized technology: VXLAN.
The main downsides to this solution are the increased license cost (Juniper/Dell SONiC require a special license to handle VXLAN), the lower interoperability between network vendors, as well as the configuration and operational complexity, when it’s usually preferred to keep the network layer as lean as possible. This could be a temporary solution, for example during a migration phase, but not a long-term one.

Routed Ganeti

This consists of having each Ganeti host behave as a basic router. This makes each VM independent from a networking point of view, as the hypervisors take care of propagating reachability information (IP routes) to the rest of the infrastructure.

Going that way, the requirements are:

  • Functional live migration
  • Minimal modification to our automation (e.g. the makevm cookbook)
  • Minimal modification to our Debian installer and guest OS
  • Existing VMs can be re-imaged into this new mode

Moving away from L2 adjacency also means that LVS in their current form won’t be able to forward traffic to those VMs, the solution is IPIP support in LVS (T348837).

Setup and investigation

To get this working, I followed the current Building a new cluster step-by-step instructions with a couple of adjustments.

The first adjustment is that I used the following cluster init command:
sudo gnt-cluster init --no-ssh-init --enabled-hypervisors=kvm --vg-name=ganeti --master-netdev=eno1 --hypervisor-parameters kvm:kvm_path=/usr/bin/qemu-system-x86_64,kvm_flag=enabled,serial_speed=115200,migration_bandwidth=64,migration_downtime=500,kernel_path= --nic-parameters=mode=routed,link=main ganeti-test01.svc.eqiad.wmnet

The --master-netdev is bound to the hypervisor’s primary (and only) NIC; in --nic-parameters, link=main means using the default routing table.

The second adjustment was to manually apply the few commands from the sre.ganeti.add_node cookbook, in order to bypass the checks specific to L2 Ganeti (def is_valid_bridge()).

Then creating a VM (for example in the private range) requires just:
sudo gnt-instance add -t drbd -I hail --net 0:ip=10.66.2.10 --hypervisor-parameters=kvm:boot_order=network,spice_bind=127.0.0.1 -o debootstrap+default --no-install --no-wait-for-sync -g eqiad-test -B vcpus=1,memory=1024m --disk 0:size=10g testvm1001.eqiad.wmnet

The VM is not yet ready to be started but we see here that the VM’s IP needs to be present for the init script to set up the static route. spice_bind=127.0.0.1 is only necessary to access the UI of my test VMs using SPICE.

Guest VM IPv4 connectivity

When a VM is started, Ganeti calls the kvm-ifup bash script, which runs the setup_route function. This takes care of attaching the VM interface to the proper routing table, as well as adding a static route to this routing table (“if you need to reach IP X, ask interface Y”). As it doesn’t seem possible to pass a custom script to Ganeti, modifying /usr/lib/ganeti/3.0/usr/lib/ganeti/net-common, for example with Puppet, seems like the best approach to perform additional post-VM-startup actions.

So far that script needs the following modifications:

  • Disable proxy_arp (enabled by default) by commenting out that command
  • Add an IP on the VM-facing interface (with scope link)
  • Send a gratuitous ARP, for faster recovery after live migration
ip addr add 10.66.1.1/32 dev $INTERFACE scope link
arping -c1 -A -I $INTERFACE 10.66.1.1
#echo 1 > /proc/sys/net/ipv4/conf/$INTERFACE/proxy_arp (commented out)

TODO: investigate improvements in addition to those commands, such as:

net.ipv4.conf.<int>.arp_ignore=3 
net.ipv4.conf.<int>.arp_notify=1

Starting that test VM with a basic Debian installer ISO in “rescue mode” to get a prompt:
sudo gnt-instance start -H boot_order=cdrom,cdrom_image_path=/tmp/debian.iso testvm1001.eqiad.wmnet

In the VM, set up its IP and routing configuration:

ip addr add 10.66.2.10/32 dev ens13
ip route add 10.66.1.1 dev ens13 scope link
ip route add default via 10.66.1.1

This can of course look weird, as we’re setting a /32 NIC IP as well as a static route pointing to an interface, but that’s how the Linux kernel expects it to be configured.

We can then ping the VM on 10.66.2.10 from the hypervisor, as well as ping the VM from a different host than the hypervisor (as long as that 3rd-party host has a route to the VM pointing at the hypervisor). Pings from the VM to that 3rd-party host work as well after enabling forwarding on the host’s main NIC.

sysctl -w net.ipv4.conf.eno1.forwarding=1
This is promising, as we don’t need to rely on proxy_arp and we don’t need to change the guest OS much for IPv4 to work, as long as DHCP behaves.
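
To make this persistent across reboots (for example via Puppet), a minimal sketch of a sysctl.d snippet; the file name is an example, not an existing one:

# /etc/sysctl.d/70-ganeti-routed.conf (example file name)
net.ipv4.conf.eno1.forwarding = 1
# IPv6 forwarding will also be needed once v6 routing is in place:
# net.ipv6.conf.all.forwarding = 1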

As Ganeti configures the static route pointing to the VM, and supports only one v4 IP (via the ip=10.66.2.10 parameter), it’s not possible to manually configure multiple IPs on the guest VMs without relying on a dynamic routing protocol (e.g. BGP) between the VM and the hypervisor. In our infra only 3 existing VMs are set up that way: lists1001.wikimedia.org and mx[1001,2001].wikimedia.org.

Live migration, even though the hypervisors are still in the same VLAN at this point, shows continuous reachability between the VM and its gateway.
sudo gnt-instance migrate testvm1001.eqiad.wmnet

Similarly, two VMs on the same hypervisor can reach each other. Their interfaces always have the same “router” IP:

10: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 1000
[...]
	inet 10.66.1.1/32 scope link tap0
11: tap1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 1000
[...]
	inet 10.66.1.1/32 scope link tap1

Going one step further, I set up a 3rd Ganeti node in Dallas (the first two are adjacent in Ashburn) as a proof of concept. Live migrating a VM from Dallas to Ashburn worked perfectly. Only 2 pings of the constant ping running from the VM to its gateway (10.66.1.1) were lost, which is more than acceptable for a live migration across a ~31 ms link. Not that we will want to do this with production VMs, but if it works stretched that far, it will work within the same datacenter.

Guest VM DHCP

When starting a VM, the debootstrap installer will initialize the network and ask for an IP using DHCP. The firewall rule configured on the hypervisor (see Hypervisor firewalling below) permits the DHCP request to reach the hypervisor. We now have 2 options:

  • Run a DHCP server on the hypervisor and directly reply to the VM
  • Run a DHCP relay on the hypervisor to… relay the request to our DHCP server

We could imagine, for example, packaging and automating nfdhcpd using spicerack or the makevm cookbooks. However I preferred the 2nd option, as it leverages our current DHCP server and makes the hypervisor a “dumb” relay.
For that I installed isc-dhcp-relay and modified /etc/default/isc-dhcp-relay so it points to our local DHCP server. The first limitation is that it only binds to the interfaces that exist at daemon startup time. To work around this limitation I added service isc-dhcp-relay restart to the net-common script so the daemon gets restarted after the VM interface is created.
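
For reference, the relay configuration amounts to something like the sketch below; the server address is a placeholder for our local DHCP server, not the real one.

# /etc/default/isc-dhcp-relay
SERVERS="10.0.0.1"        # placeholder for our local DHCP server
INTERFACES=""             # interfaces are bound at daemon startup, hence the restart workaround
OPTIONS=""

# appended to net-common so newly created tap interfaces are picked up:
service isc-dhcp-relay restart
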
The second limitation is due to the way we do DHCP relaying on our core routers: those routers intercept the relayed packets and drop them.
Using the Juniper configuration forwarding-options dhcp-relay forward-only fixes the issue, but breaks DHCP relaying for regular “non-routed” hosts. The path forward here is most likely to move away from DHCP option 82 and instead use DHCP option 97 (see T304677).

TODO: deploy a specific DHCP config snippet on the DHCP server to deliver the proper IP and route info to the VM (see https://blog.fhrnet.eu/2020/03/07/dhcp-server-on-a-32-subnet/), currently blocked by the issue above. A possible snippet is sketched below.
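
For reference only (untested here because of the blocker above), a sketch of what such a dhcpd snippet could look like, following the approach from the linked blog post; the MAC address is a placeholder, and a matching subnet/shared-network declaration for the relay’s address would likely also be needed:

# Declare DHCP option 121 (RFC 3442 classless static routes) if not already defined:
option rfc3442-classless-static-routes code 121 = array of unsigned integer 8;

host testvm1001 {
    hardware ethernet aa:00:00:00:00:01;    # placeholder MAC
    fixed-address 10.66.2.10;
    option subnet-mask 255.255.255.255;
    # 10.66.1.1/32 on-link first, then default via 10.66.1.1
    option rfc3442-classless-static-routes 32, 10, 66, 1, 1, 0, 0, 0, 0,  0, 10, 66, 1, 1;
}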

Guest VM IPv6 connectivity

First we can start with a little security housekeeping by modifying the net-common script:

  • Disable learning any router-advertisements from the guest VMs
  • Disable NDP proxying (like ARP proxying but for IPv6)
echo 0  > /proc/sys/net/ipv6/conf/$INTERFACE/accept_ra
echo 0  > /proc/sys/net/ipv6/conf/$INTERFACE/proxy_ndp

While Ganeti is v4-aware (remember the ip=10.66.2.10 parameter for gnt-instance add), this is not the case for IPv6, which is, so far, a blocker. Note that “attaching” an IP to a Ganeti instance object is especially needed for live migration, as the target hypervisor needs to know which static routes to create.

There are multiple possible workarounds here, none of which are great:

  • Implement the missing feature (not really a workaround per se, and significant work, so this is the least preferred option)
  • Rely on the “tags” feature, which supports arbitrary key-value pairs, but this needs to be tested, especially to know if they are passed on to the net-common script
  • Advertise the v6 prefixes over IPv4, with the major downside being that all guest VMs would need to run BGP…
  • Write a daemon that inserts static routes based on the NDP (neighbor) table (with safeguards)
  • Leverage our current mechanism of deriving the v6 IP from the v4 one (sketched below)
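
For the last option, a minimal sketch of the v4-to-v6 derivation, following the mapping used later in this doc (e.g. 10.66.0.10 becoming 2001:db8:cb00:7100:10:66:0:10); the prefix is an example value:

# Derive the conventional v6 address from the VM's v4 address
V4="10.66.2.10"
V6_PREFIX="2001:db8:cb00:7100"                 # example prefix
V6="${V6_PREFIX}:$(echo "${V4}" | tr '.' ':')"
echo "${V6}"                                   # -> 2001:db8:cb00:7100:10:66:2:10
# net-common could then add the matching hypervisor-side route:
# ip -6 route add "${V6}/128" dev $INTERFACE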

Setting this blocker aside, v6 is in better shape than v4 as there is no shortage of IPs. This leads to the question: should I assign a /128 or a /64 per VM?

/128

This can be configured dynamically using router advertisements. I tested radvd with this /etc/radvd.conf:

interface tap0 {
  IgnoreIfMissing on;
  AdvSendAdvert on;
  AdvDefaultPreference high;
  prefix 2001:db8:cb00:7100::10/128 {
	AdvRouterAddr on;
  };
};

On the VM side this configures the default gateway, leveraging the link-local IPs. It then only requires setting the NIC IP on the VM side (2001:db8:cb00:7100::10/128) as well as the static route on the hypervisor side.
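
A sketch of those two remaining manual steps, reusing the addresses from the radvd example above:

# Hypervisor side: static route towards the VM through its tap interface
ip -6 route add 2001:db8:cb00:7100::10/128 dev tap0
# VM side: only the NIC IP; the default route comes from the RA via the hypervisor's link-local address
ip -6 addr add 2001:db8:cb00:7100::10/128 dev ens13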

If we have to use a 3rd-party stateful (config state) tool like radvd, we could just as well use nfdhcpd, but the latter seems abandoned.

To stay stateless on that side, the alternative is to modify the Debian autoinstall script so it sets the NIC v6 IP “manually”. In that case, radvd is only used to advertise the next hop IP:

interface tap0 {
  IgnoreIfMissing on;
  AdvSendAdvert on;
  AdvDefaultPreference high;
  prefix 2001:db8:cb00:7000::/52 {
	AdvOnLink off;
	AdvAutonomous off;
	AdvRouterAddr on;
  };
};

At this point it may be easier to treat it fully like IPv4: ditch radvd and manually configure fe80::1 on tap0, with the matching routes on the VM side. As the v6 IP is generated from the v4 IP and the v6 prefix, it’s then also possible to add the hypervisor-side route using net-common, as sketched below.
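
A sketch of that fully static approach, with the same example addresses as above; the hypervisor-side commands would live in net-common and the VM-side ones in the autoinstall script:

# Hypervisor side (net-common)
ip -6 addr add fe80::1/64 dev $INTERFACE
ip -6 route add 2001:db8:cb00:7100::10/128 dev $INTERFACE

# VM side (autoinstall / guest network config)
ip -6 addr add 2001:db8:cb00:7100::10/128 dev ens13
ip -6 route add default via fe80::1 dev ens13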

/64

More efficient in some ways, the following configuration allows the guest OS to perform automatic IP configuration using SLAAC. Only the static route on the Ganeti side is then needed.

interface tap0 {
  IgnoreIfMissing on;
  AdvSendAdvert on;
  AdvDefaultPreference high;
  prefix 2001:db8:cb00:7100::/64 {
	AdvRouterAddr on;
  };
};

This is also compatible with our Debian autoinstall script, as it would still apply the v4-to-v6 mapping, in our case configuring the IP 2001:db8:cb00:7100:10:66:0:10/64 on the VM’s primary interface. It also allows for additional guest IPs (even though that feature is not widely used). Assigning a prefix to a VM, however, means changing the way we do IP allocation within our automation.

This however raises the question of how to update that radvd.conf file (as it needs entries for all the VMs) at each VM movement. This seems tedious for the net-common bash script. Upstream started looking into support for including config files, which would help. That’s why my preference is to use a static /128 IP per VM.

Overall there is significant work to be done regarding IPv6 that is outside the scope of this project, tracked in T102099: Fix IPv6 autoconf issues once and for all, across the fleet, and more globally in T234207: Investigate improvements to how puppet manages network interfaces. We could for example use DHCPv6. However, this Ganeti work should, as much as possible, go in the same general direction as what’s planned for those tasks.

Hypervisor firewalling

The test Ganeti hosts have been freshly migrated to nftables (see T336497: Add support for nftables in profile::firewall). For the testing phase a single line in a newly created /etc/nftables/input/10_ganeti_guestvm.nft was enough to permit traffic from the guest VMs to the hypervisor.
iifname "tap*" accept

Run sudo systemctl reload nftables.service so it’s taken into account, then sudo nft list ruleset to confirm it.
Before making it production-ready (through Puppet), this needs to be tightened up by allowing only a few ports and protocols (a possible rule set is sketched after the list):

  • DHCP (see Guest VM DHCP above)
  • BGP & BFD (see Guest VM BGP below)

The forwarding chain already allows all traffic by default, and a rule is already present to permit IPv6 neighbor discovery.

Guest VM routes redistribution

At this point everything local to the hypervisor works. It’s time to make the rest of the infra aware of the VMs.

As we already use Bird on multiple systems across the infra, it makes sense to use it here as well.
The snippet below instructs Bird to import the static v4 and v6 routes from the Linux routing table into Bird, checking for changes every second.

protocol kernel kernel_v4 {
	learn;
	scan time 1;
	ipv4 {
		import where krt_source = 4; # statics
	};
}
protocol kernel kernel_v6 {
	learn;
	scan time 1;
	ipv6 {
		import where krt_source = 3; # statics
	};
}
[...]

The other side of the same coin is the “regular” BGP configuration towards the routers, along with the additional safeguard filtering needed.

[edit policy-options]
+   prefix-list ganeti4 {
+       10.66.2.0/24;
+   }
[edit policy-options]
+   policy-statement ganeti_import {
+       term ganeti4 {
+           from {
+               prefix-list-filter ganeti4 longer;
+           }
+           then accept;
+       }
+       then reject;
+   }
[edit protocols bgp]
+   group Ganeti4 {
+       type external;
+       multihop {
+           ttl 193;
+       }
+       local-address 208.80.153.192;
+       import ganeti_import;
+       family inet {
+           unicast {
+               prefix-limit {
+                   maximum 5;
+                   teardown 80;
+               }
+           }
+       }
+       export NONE;
+       peer-as 64650;
+       neighbor 10.192.48.73 {
+           description ganeti-test2004;
+       }
+   }

Here BFD isn’t strictly needed as these are unicast prefixes (at least for now). If the hypervisor goes down there is no need for faster failover, as there is no alternative host anyway.

I tested it with IPv4 only, but the IPv6 behavior is expected to be similar: pings to the VM IP from bast4005 in ulsfo, with an interval of 0.5 s, show that VM migration downtime and full convergence were achieved in less than 2 seconds.

64 bytes from 10.66.2.15: icmp_seq=30 ttl=60 time=74.3 ms   <- VM in Ashburn
64 bytes from 10.66.2.15: icmp_seq=31 ttl=60 time=73.5 ms
64 bytes from 10.66.2.15: icmp_seq=32 ttl=60 time=73.9 ms
64 bytes from 10.66.2.15: icmp_seq=36 ttl=60 time=42.3 ms   <- VM in Dallas
64 bytes from 10.66.2.15: icmp_seq=37 ttl=60 time=41.9 ms
64 bytes from 10.66.2.15: icmp_seq=38 ttl=60 time=41.9 ms

Guest VMs BGP

As mentioned in the previous section, we rely on BGP on the end hosts to advertise Anycast prefixes for high availability and improved service latency. Some of those services are running in VMs, for example Wikimedia DNS.

For those services (that are likely to grow in numbers) the BGP sessions need to be established with the hypervisor, or in other terms with the VM's next hop gateway. This is how they're currently configured on hosts behind L3 switches.

Adding an extra hop (the hypervisor) in the AS path (router > switch > hypervisor > VM) means that an additional prepend is needed on the non-Ganeti anycast prefixes, like we did when we introduced the new switching fabric. This is in order to maintain a constant AS-path length wherever the end host is located and thus offer proper balancing (otherwise, in normal operation, traffic would never reach the hosts with the longer AS path).

Additional configuration needs to be added to Bird for this to work on the Ganeti side. The VM side's config can be left untouched.

First, BFD becomes necessary for faster anycast failover between the hypervisor and the network, but not between the hypervisor and the VMs, as Bird will track the VM-facing interface (tapX) and withdraw the prefixes if it goes down.
To keep the system dynamic, we should keep as little state on the hypervisor as possible. That’s why Bird is passive here and waits for the VM to initiate the session. This could make establishing the session a bit longer after a migration: the time for the VM to notice it’s not speaking with the same hypervisor anymore, shut down the session and re-create it. If this is deemed too long, BFD could be introduced at this layer too for faster recovery.

Security-wise, this could permit a rogue user with root on a VM (or a misconfiguration) to pretend to be any allowed AS and advertise an IP permitted in the “VMs_import” filter. This risk is quite low, but additional security mechanisms such as MD5 could be used (at least until TCP-AO is implemented); this wouldn’t prevent misconfigurations though. Another option is to pre-populate, using Puppet, the full list of BGP peers with their respective AS, with the significant downside of causing config/alerting fatigue and slower provisioning time.

protocol bgp bgp_v4 {
	ipv4 {
		import filter VMs_import;
		export none;
	};
	local as 64650;
	neighbor range 10.66.2.0/24 external;
}
protocol bgp bgp_v6 {
	ipv6 {
		import filter VMs6_import;
		export none;
	};
	local as 64650;
	neighbor fe80::/10 external;
}
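
If MD5 is chosen as an interim safeguard, Bird’s password option could be added to those protocol blocks; whether it plays well with dynamic neighbor ranges still needs to be verified:

protocol bgp bgp_v4 {
	[...]
	password "example-shared-secret";   # placeholder value; enables TCP MD5 signatures (RFC 2385)
}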

This will require thorough testing before using it in production.

v4 and v6 prefixes allocations

In addition to not painting ourselves into a corner with a bad addressing plan, this is important as the prefix allocation defines the scope of each Ganeti cluster. As each VM is routed, it can technically live in any location of our network.

There are 2 options here:
Either we use per-DC prefixes, to mimic our current way of doing things. For example, use 10.66.2.0/23 for eqiad v4 private (with a total of 512 IPs and the possibility to grow it). Unfortunately 208.80.154.0/23 is fully allocated, so any new v4 public IPs will need to come from a subnet re-sizing or a new, larger prefix.
V6 being much easier, the allocation only depends on whether we allocate a /128 or a /64 per VM. Private vs. public IPv6 could even be enforced at the hypervisor and come from the same pool of IPs. Probably not the best option as it goes against our current way of operating hosts, but it’s a possibility.
A variant of this option is to group IPs (using sub-allocations) by Ganeti cluster; this allows aggregating prefixes to reduce the size of routing tables across the infra, but is not necessary due to the small number of VMs we’re running.

The other option is to use a global pool of IPs: for example, start naming the VMs testvm1.global.wmnet and assign them an IP from a prefix outside of any of our POPs, like we have 10.3.0.0/24 for internal anycast. The major advantage is that the VM can be moved anywhere in our infra without having to be renumbered. The major inconvenience is that it becomes quite confusing and would require significant changes to our infrastructure, while providing a false sense of security and increasing the blast radius of a single Ganeti cluster. It’s better to design a service with multiple VMs per site than to rely on being able to move a VM from site to site.
Just because we CAN do it doesn’t mean we SHOULD, but we could for example have a special long-distance cluster for problematic applications that can’t be active/passive (if there are any).

L2 to L3 cluster migration

Going down that path will require a multi-step migration: first focusing on simple VMs (e.g. not running BGP) in the private subnet, then extending the scope. Tooling will need to be adjusted first.
A hard requirement is that VMs will need to be re-IPed, which means re-imaged, like we’re planning to do for bare-metal servers.
I haven’t tested whether a cluster can run both routed and bridged VMs at the same time. Even if it can, this sounds like a risky move, which is why it is preferable to spin up a new cluster.

This cluster can start with two nodes, then progressively receive migrated VMs, freeing up space on the other clusters and allowing us to re-purpose their hypervisors, etc.

L2 abstraction at the hypervisor

This solution differs from the previous one by using VXLAN (or any tunneling technology) to provide an L2 domain to the VMs. Instead of relying on Linux’s ability to use a /32 v4 or /128 v6 prefix on their virtual NIC, the VMs would be assigned a regular /27-ish v4 or /64 v6. The abstraction takes care of propagating reachability information between VMs. Other than that, it reuses most of the building blocks from the previous solution: DHCP relay, BGP, hypervisor firewalling, router advertisements. It also offers shorter downtime during switchover, as even if the VM is now live on a different hypervisor, traffic can still be bridged from the previous hypervisor until BGP converges (we’re talking about milliseconds to seconds here). This would have been the preferred option if simulating an L2 adjacency were required, but in the current state of things it only adds an extra layer, making management and troubleshooting more complex.

Conclusion

Within Wikimedia’s infrastructure (Debian-based, all but 3 VMs having a single IP, BGP on the host needed), migrating the Ganeti clusters to work in a “routed” mode is a viable option to permit VM live migration between hypervisors spread over any number of L3 domains. The main downside is that this solution requires more preparation and deployment work compared to an L2-only solution and possibly a tunnel-based one. On the other hand, it only uses standards and open-source components, which makes it a sustainable and low-maintenance-cost option as well.

Next steps

The first next step is to get this document reviewed for any pitfall or oversight I may have made. Then, shortly after, reach a common agreement, including on the few open questions if we stick with routed Ganeti:

  • /128 or /64 for IPv6 VMs?
  • Prefix allocations?
  • How far should a Ganeti cluster and Ganeti group spread?

Once this is decided we need to start allocating hardware resources, as my test devices are decommissioned servers that need to be returned to DCops. Then we can list and start working on the prerequisites needed to make it happen: our automation, (host and network) BGP, Puppet, DHCP, etc. (a few are already listed throughout the document).
Timeline-wise, this needs to be prioritized to match the core DCs’ network refreshes, which means ideally being fully ready in Dallas within 6 months at most.

Other considerations

Not going that way

Creating distinct routing tables for public vs. private zones.

echo "100   private" >> /etc/iproute2/rt_tables
echo "200   public" >> /etc/iproute2/rt_tables

This initially sounded like a good idea as we separate public vs. private vlans in our infrastructure, but the only reason we actually separate them is to be able to provide different IPs. All the firewalling is done on the hosts (and a bit on the routers).

Use Linux VRF to separate hypervisor from VM traffic

A bit similar to the one above but with stricter separation between VM and hypervisor. This would have added extra security if we were providing VMs for untrusted customers (like a cloud provider).

Mixed clusters

Test if a cluster can have both routed and bridged VMs in parallel. I didn’t spend time testing this option as even if it works, there is a risk of impacting production VMs.

Possible future work

Dynamic Ganeti cluster VIP

The master-netdev is the interface on which the cluster’s dedicated management IP lives. It’s currently using a row-specific IP while being the management IP for a site-wide cluster. If that IP becomes unreachable the cluster keeps functioning, but operations (create/delete/modify) can’t be performed. We might benefit from assigning the --master-netdev to the loopback interface and the cluster’s FQDN to a VIP, advertised by BGP like the VM IPs. That would allow for seamless VIP migration and thus easier hypervisor maintenance.
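
A minimal sketch of what that could look like, assuming gnt-cluster modify accepts --master-netdev and that Bird exports the resulting /32 like the VM routes:

# Move the cluster management IP onto loopback; Ganeti brings it up on whichever
# node is the current master, and Bird can then advertise it over BGP as a /32.
sudo gnt-cluster modify --master-netdev=lo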

Apply the same mechanism to bare-metal servers

This could for example help us save on public IPs instead of having a dedicated public rack or multiple racks with a public IPv4 prefix.

Add support for multiple IPs per host

This would require patching Ganeti and thus might be a complex operation to support only 3 hosts (lists1001.wikimedia.org and mx[1001,2001].wikimedia.org). It can however be done if no alternative exists. Leveraging BGP here again might be the easiest way to go.

Use fixed MAC address for tap* interfaces

This has been suggested by Cathal: “Unsure of whether it's an option but potentially all TAP interfaces could also be forced to the same MAC address? Thus making no ARP update for this on the VM side required (similar to anycast gw idea in evpn).”
If this works as expected, it would make live migration even faster and remove the need for the “arping” mentioned earlier in this doc, as from the VM's point of view its gateway would look exactly the same. A possible tweak is sketched below.
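
A minimal sketch, as one more net-common addition; the MAC value is an arbitrary locally administered address:

# Force the same locally administered MAC on every VM-facing tap interface so the
# gateway's MAC never changes from the VM's point of view, even after a migration.
ip link set dev $INTERFACE address 02:00:0a:42:01:01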

Use iBGP between hypervisor and VMs

This path would allow us to avoid adding an extra BGP hop between the hypervisor and the end VM. Exact tradeoffs to be investigated.

Possible limitations

Applications or OSes improperly handling /32 or /128 interface IPs

Legacy software or specialized OSes might choke on the seemingly odd NIC IP. If it happens, it will need to be tackled on a case-by-case basis. As the migration to routed Ganeti will take time, those specific cases could keep running on the “former” Ganeti until a solution is found.

Resources

https://blog.fhrnet.eu/2020/03/07/dhcp-server-on-a-32-subnet/
https://vincent.bernat.ch/en/blog/2018-l3-routing-hypervisor
https://docs.ganeti.org/docs/ganeti/3.0/html/
https://linux.die.net/man/5/radvd.conf
https://bird.network.cz/?get_doc&v=20&f=bird-6.html
https://www.netfilter.org/projects/nftables/manpage.html
https://docs.ganeti.org/docs/ganeti/2.2/html/design-2.1.html?highlight=routed#non-bridged-instances-support
https://github.com/grnet/snf-network/blob/develop/docs/routed.rst
http://blkperl.github.io/split-brain-ganeti.html
https://blog.cloudflare.com/virtual-networking-101-understanding-tap/

Written by ayounsi on Dec 15 2023, 9:57 AM.