Ganeti on modern network design
For reasons already mentioned in other docs (e.g. Eqiad Expansion Network Design), we’re moving towards a network architecture where the servers’ layer 3 domains (subnets) are confined to a single rack. Currently, in most of our core DCs, those layer 3 domains are stretched across all the racks of a given row. In that setting, a Ganeti cluster whose hypervisors are spread across a row leverages this L2 adjacency to live migrate VMs between hypervisors.
In other words, if work is going to be done on hypervisor1, all the VMs it hosts can be temporarily and transparently distributed across the other hypervisors to prevent any disruption. Having the same VLAN trunked to all the hypervisors of a row allows VMs to move to a different hypervisor without any IP renumbering, and thus without downtime.
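The live migration constraint described above comes down to a subnet membership check: a VM can keep its IP only if that address still belongs to the subnet available on the destination hypervisor. Below is a minimal sketch of that idea using Python's standard ipaddress module. The hypervisor names, the prefixes and the can_migrate_without_renumbering helper are all hypothetical, chosen for illustration only; they do not reflect our actual addressing plan or any Ganeti API.

```python
import ipaddress

# Hypothetical addressing, for illustration only.
# Row-wide design: every hypervisor in the row sits in the same stretched subnet.
ROW_WIDE_VLAN = {
    "hypervisor1": ipaddress.ip_network("10.0.0.0/22"),
    "hypervisor2": ipaddress.ip_network("10.0.0.0/22"),
}

# Per-rack design: each hypervisor only sees the subnet of its own rack.
PER_RACK_SUBNETS = {
    "hypervisor1": ipaddress.ip_network("10.0.0.0/24"),
    "hypervisor2": ipaddress.ip_network("10.0.1.0/24"),
}


def can_migrate_without_renumbering(vm_ip, destination, subnets):
    """Return True if vm_ip belongs to the subnet of the destination hypervisor."""
    return ipaddress.ip_address(vm_ip) in subnets[destination]


if __name__ == "__main__":
    vm_ip = "10.0.0.10"  # hypothetical VM currently hosted on hypervisor1
    print(can_migrate_without_renumbering(vm_ip, "hypervisor2", ROW_WIDE_VLAN))
    print(can_migrate_without_renumbering(vm_ip, "hypervisor2", PER_RACK_SUBNETS))
```

With the stretched VLAN the check passes for any hypervisor in the row, while with per-rack subnets it fails as soon as the destination sits in another rack, which is why the new design needs a different approach to keep live migration working.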
Multi-platform network configuration
Network configuration is a rapidly evolving area that has gone through multiple phases. It’s also surprisingly tied to monitoring. Below is some historical context from the industry, as well as what we’re doing in SRE.
Netbox news
Netbox is a tool used by all SREs, either directly or abstracted through cookbooks and various scripts. Managed by Infrastructure-Foundations, it went through a major (and much needed!) upgrade this past quarter, led by John Bond and myself, with the help of Riccardo.
RPKI Origin Validation
Since the late 90s, databases named Internet Routing Registries (IRR) have been trying to fulfill that (single) source of truth role. Unfortunately, they suffer from many issues: fragmentation (many databases exist, not all equally well maintained), security (some databases allow anyone to “claim” a prefix) and complexity (for network operators). They also contain a lot of inaccurate data that has accumulated over time.
Internal anycast
This project brought two major changes to our infrastructure. Firstly, servers that used to be fronted by LVS for load balancing are now peering directly with our routers. Secondly, we have started using IP anycast for a highly critical service: recursive DNS.
Header picture: https://commons.wikimedia.org/wiki/File:SunsetTracksCrop.JPG