As recently demonstrated elsewhere, individual Kubernetes clusters can fail. Having all of the Toolforge tools on a single cluster makes upgrades and other routine maintenance a lot scarier. And as a bonus, the ability to move tools from one cluster to another would let us skip Kubernetes versions when upgrading.
Fully implementing this would be a major project. Roughly I see three large things that we would need:
- Block users from directly accessing the Kubernetes API, and force them to interact via Toolforge custom APIs instead.
  - We would need to convert the webservice tooling to use the APIs.
- Add support to our APIs for moving a tool from one cluster to another.
  - One possible way to implement this would be to make the APIs store canonical data in a database outside the Kubernetes cluster, and then add functionality to sync a tool from the database to a cluster. The current implementation of the jobs API, for example, treats the Kubernetes resources that currently exist as the canonical list of jobs.
  - Another possibility is to build a tool that reads all jobs (and webservices, and whatever else exists by then) from the origin cluster and recreates them in the target cluster.
- Build tooling to assign tools to clusters, route tools correctly, coordinate moves, etc.
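
To make the "canonical data outside the cluster" option a bit more concrete, here is a rough sketch of what a sync and a move could look like. Everything in it is hypothetical: `JobSpec`, `FakeCluster`, `sync_tool` and `move_tool` are made-up names, and the cluster is an in-memory stand-in rather than a real Kubernetes API client. It only illustrates the idea that the cluster state is derived from the canonical list, not the other way around.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class JobSpec:
    """Hypothetical canonical record for one job, stored outside any cluster."""
    name: str
    image: str
    command: str


@dataclass
class FakeCluster:
    """In-memory stand-in for a cluster: tool name -> {job name -> spec}."""
    name: str
    jobs: dict[str, dict[str, JobSpec]] = field(default_factory=dict)


def sync_tool(tool: str, desired: list[JobSpec], cluster: FakeCluster) -> list[str]:
    """Make the cluster match the canonical job list for one tool.

    Returns a list of the actions taken, so repeated calls can be checked
    for idempotency (a second call should return an empty list).
    """
    actions = []
    current = cluster.jobs.setdefault(tool, {})
    wanted = {job.name: job for job in desired}

    # Remove anything the canonical data no longer lists.
    for name in sorted(current.keys() - wanted.keys()):
        del current[name]
        actions.append(f"delete {name}")

    # Create missing jobs and update ones whose spec has drifted.
    for name, spec in sorted(wanted.items()):
        if name not in current:
            actions.append(f"create {name}")
        elif current[name] != spec:
            actions.append(f"update {name}")
        else:
            continue
        current[name] = spec
    return actions


def move_tool(tool: str, desired: list[JobSpec],
              origin: FakeCluster, target: FakeCluster) -> list[str]:
    """Bring the tool up on the target first, then remove it from the origin."""
    actions = [f"{target.name}: {a}" for a in sync_tool(tool, desired, target)]
    actions += [f"{origin.name}: {a}" for a in sync_tool(tool, [], origin)]
    origin.jobs.pop(tool, None)
    return actions
```

With this shape, a move is just "sync to the new cluster, then sync an empty list to the old one". A real version would presumably also have to handle routing changes and workloads that are running mid-move, which this sketch ignores.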