An intro on routed Ganeti can be found here: https://phabricator.wikimedia.org/phame/post/view/312/ganeti_on_modern_network_design/
We have piloted routed Ganeti on two test servers, which are running some initial workloads. As the next step of the rollout, we've decided to migrate one of the PoPs (namely magru) to it, to give it more exposure to real-world issues. This will also allow us to install future PoPs directly with routed Ganeti.
Magru currently consists of two separate Ganeti clusters in two different rows with two servers each.
row B4: ganeti7002 and ganeti7004
- bast7001
- doh7002
- durum7002
- ncredir7002
- prometheus7001
row B3: ganeti7001 and ganeti7003
- atlas7001
- doh7001
- durum7001
- install7001
- ncredir7001
- netflow7001
When the migration is completed, we'll have a single four-node Ganeti cluster spanning the two rows (which also gives us flexibility in case of row changes at the DC).
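As a quick sanity check on the inventory above, the layout can be written down as data. This is only a sketch: the `CLUSTERS` structure and both helper functions are hypothetical, and it assumes the DoH VM in row B3 is doh7001 (the name used in the decom step further down).

```python
# Hypothetical model of the current magru layout, transcribed from the lists above.
# Assumption: the DoH VM in row B3 is doh7001, as named in the later decom step.
CLUSTERS = {
    "B4": {
        "nodes": ["ganeti7002", "ganeti7004"],
        "vms": ["bast7001", "doh7002", "durum7002", "ncredir7002", "prometheus7001"],
    },
    "B3": {
        "nodes": ["ganeti7001", "ganeti7003"],
        "vms": ["atlas7001", "doh7001", "durum7001", "install7001",
                "ncredir7001", "netflow7001"],
    },
}


def duplicate_vms(clusters):
    """Return VM names listed in more than one place (expected to be empty)."""
    seen = set()
    dupes = set()
    for cluster in clusters.values():
        for vm in cluster["vms"]:
            if vm in seen:
                dupes.add(vm)
            seen.add(vm)
    return sorted(dupes)


def merged_nodes(clusters):
    """Return the node list of the single cluster that spans all rows."""
    return sorted(node for c in clusters.values() for node in c["nodes"])
```

Running `duplicate_vms(CLUSTERS)` before merging the two clusters would flag any hostname that is listed twice, which matters once all VMs share one cluster.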
There is some pending upstream work we have commissioned for Bird which will unblock the use of BGP in VMs (T362392). Until this work is completed, we will keep one node on the old setup (ganeti7004), which will continue to run doh7002 and durum7002.
The migration path will look like the following:
- Move all VMs in ganeti7002 to ganeti7004
- Switch B4 VMs to plain disk storage, i.e. disable DRBD for them.
During this initial period, the B4 VMs are no longer redundant, so if ganeti7004 were to fail, we'd lose the Prometheus metrics, but would still have ncredir/wikidough/durum operational via their row B3 counterparts.
- Allocate IPs for magru routed Ganeti - https://netbox.wikimedia.org/ipam/prefixes/?role_id=41&site_id=11
- Add allocated IPs to modules/network/data/data.yaml in Puppet
- Reimage ganeti7002 with routed Ganeti
- Update ganeti7002 switch port to remove the trunked public vlan
- Set up routing between ganeti7002 and its ToR switch
- Create bast7002, ncredir7003 and prometheus7002 on routed Ganeti and fail over services
- Decom bast7001, ncredir7002, prometheus7001
- Move all VMs on ganeti7001 to ganeti7003
- Switch B3 VMs to plain disk storage, i.e. disable DRBD for them.
- Decom atlas7001 (We'll re-add a probe in magru at a later point)
- Decom doh7001, durum7001 (doh and durum will then no longer be redundant, but with the current request rate in magru that's acceptable)
- Create install7002, ncredir7002, netflow7002 on the routed Ganeti cluster and fail over services
- Decom install7001, ncredir7001, netflow7001
- Reimage ganeti7001 with routed Ganeti and add it to the cluster
- Set up routing between ganeti7001 and its ToR switch
- Switch VMs back to DRBD
- Reimage ganeti7003 with routed Ganeti and add it to the cluster
- Set up routing between ganeti7003 and its ToR switch
- Create doh7003, doh7004, durum7003, durum7004 on the routed Ganeti cluster
Once support in Bird is available (T362392):
- Move doh7003, doh7004, durum7003, durum7004 to production
- Decommission doh7002, durum7002
- Reimage ganeti7004 with routed Ganeti and add it to the cluster
- Set up routing between ganeti7004 and its ToR switch
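The first two steps of the migration path (draining a node, then dropping DRBD for its VMs) could be sketched as a dry-run command list. This is a rough illustration, not the actual procedure: the `plain_conversion_plan` helper is hypothetical, it only builds strings and runs nothing, and it assumes the stock Ganeti CLI (`gnt-node migrate`, `gnt-instance modify -t plain`) is used directly rather than any site-specific tooling. It also assumes each VM is shut down for the disk template conversion.

```python
def plain_conversion_plan(node_to_drain, vms):
    """Build (but do not run) commands to drain a node and convert VMs to plain disks.

    Assumption: the instances are DRBD-backed with the other node in the row as
    their secondary, so migrating primaries off the node moves them there.
    """
    # Move every primary instance off the node that will be reimaged.
    commands = [f"gnt-node migrate -f {node_to_drain}"]
    for vm in vms:
        # The conversion is done with the instance stopped, so each VM
        # sees a short outage.
        commands.append(f"gnt-instance shutdown {vm}")
        commands.append(f"gnt-instance modify -t plain {vm}")
        commands.append(f"gnt-instance startup {vm}")
    return commands


if __name__ == "__main__":
    b4_vms = ["bast7001", "doh7002", "durum7002", "ncredir7002", "prometheus7001"]
    for line in plain_conversion_plan("ganeti7002", b4_vms):
        print(line)
```

Printing the plan first makes it easy to review the order of operations in the task before anything is executed on the cluster.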