Currently we run 2 control planes and 3 etcd nodes per DC as ganeti VMs. We have already hit IOPS limits on the etcd instances, and the control planes are scraping the upper "limit" for memory on ganeti (currently 12GB).
We should draft a plan to migrate from the 2+3 ganeti instances to 3 hardware nodes (repurposing mw appservers) and co-locate a kubernetes master and an etcd server on each of them.
It should be possible to do this by adding the new control-plane/etcd nodes first and removing the ganeti ones afterwards.
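To sanity-check the add-then-remove approach on the etcd side, here is a rough sketch of the quorum arithmetic. etcd needs a majority of members (n // 2 + 1) to be available. The sketch assumes members are added and removed one at a time, and the function names are made up for illustration only.

```python
def quorum(members: int) -> int:
    """Number of members that must be healthy for a majority."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """Number of members that can fail while a majority is still available."""
    return members - quorum(members)


def migration_steps(old: int = 3, new: int = 3):
    """Walk through adding `new` members, then removing `old` members.

    Assumption: membership changes happen one member at a time.
    Returns a list of (action, members, quorum, fault_tolerance) tuples.
    """
    steps = []
    members = old
    for _ in range(new):
        members += 1
        steps.append(("add", members, quorum(members), fault_tolerance(members)))
    for _ in range(old):
        members -= 1
        steps.append(("remove", members, quorum(members), fault_tolerance(members)))
    return steps


if __name__ == "__main__":
    print(f"start: members=3 quorum={quorum(3)} tolerance={fault_tolerance(3)}")
    for action, members, q, tolerance in migration_steps():
        print(f"{action:>6}: members={members} quorum={q} tolerance={tolerance}")
```

Under these assumptions the cluster never tolerates fewer than one failed member at any step. Even member counts (4 and 6) raise the quorum without adding tolerance, so it makes sense not to linger in those states.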
In the spreadsheet at T351074: Move servers from the appserver/api cluster to kubernetes I've reserved 3 R440 nodes per DC to be used as apiservers:
- mw2391
- mw2331
- mw2361
- mw1372
- mw1492
- mw1436
These should be renamed during reimage because of their special role in the cluster:
- https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging
- https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1008818
I wrote documentation on how to add stacked control-planes and how to remove them, as well as etcd nodes, at: https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/Add_or_remove_control-planes
In preparation, we should reimage the above appservers to insetup using the same partition layout as we use for kubernetes workers.
What I totally failed to think about while doing staging is the opportunity to align the wikikube control-plane names with the other clusters, which use names like ml-serve-ctrlXXXX/aux-k8s-ctrlXXXX. So maybe we could rename to wikikube-ctrlXXXX (I really don't like the k8s that dse and aux threw in the mix) to come one step closer to T336861: Fix naming confusion around main/wikikube kubernetes clusters.
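To make the rename idea concrete, here is a small sketch that derives candidate names from the reserved hosts. It assumes the first digit of the host number identifies the DC and simply numbers hosts per DC in the order listed. The resulting names are purely illustrative, not decided, and `propose_names` is a hypothetical helper.

```python
import re
from collections import defaultdict


def propose_names(old_hosts, prefix="wikikube-ctrl"):
    """Map old hostnames to candidate <prefix><dc digit><3-digit sequence> names.

    Assumptions (not confirmed anywhere in this task): the first digit of
    the numeric suffix encodes the DC, and sequence numbers start at 001
    per DC, assigned in the order the hosts are listed.
    """
    counters = defaultdict(int)
    mapping = {}
    for host in old_hosts:
        match = re.fullmatch(r"[a-z-]+(\d)\d{3}", host)
        if match is None:
            raise ValueError(f"unexpected hostname format: {host}")
        dc_digit = match.group(1)
        counters[dc_digit] += 1
        mapping[host] = f"{prefix}{dc_digit}{counters[dc_digit]:03d}"
    return mapping


if __name__ == "__main__":
    reserved = ["mw2391", "mw2331", "mw2361", "mw1372", "mw1492", "mw1436"]
    for old, new in propose_names(reserved).items():
        print(f"{old} -> {new}")
```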