
Spread Toolforge tools to multiple Kubernetes clusters
Open, Needs Triage, Public

Description

As recently demonstrated elsewhere, individual Kubernetes clusters can fail. Having all of the Toolforge tools on a single cluster makes upgrades and other routine maintenance a lot scarier. As a bonus, the ability to move tools from one cluster to another would let us skip Kubernetes versions when upgrading.

Fully implementing this would be a major project. Roughly I see three large things that we would need:

  • Block users from directly accessing the Kubernetes API, and force them to interact via Toolforge custom APIs instead.
    • We would need to convert the webservice tooling to use the APIs.
  • Add support to our APIs to move a tool from one cluster to another.
    • One possible way to implement this would be to make the APIs store canonical data outside the Kubernetes cluster, and then add functionality to sync a tool from that database to a cluster (a rough sketch follows this list). The current implementation of the jobs API, for example, treats the Kubernetes resources that currently exist as the canonical list of jobs.
    • Another possibility is to make a tool which reads all jobs (and webservices, and whatever else exists by then) from the origin cluster and recreates them in the target cluster.
  • Build tooling to assign tools to clusters, route tools correctly, coordinate moves, etc
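
As a rough illustration of the first option, here is what "sync a tool from the canonical store into a cluster" could look like. All names, fields and the cluster_api client below are hypothetical, not existing code:

```python
# Hypothetical sketch: the canonical definition of a tool's jobs lives in a
# database outside Kubernetes, and a sync step materialises it into whichever
# cluster the tool is currently assigned to.
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobSpec:
    name: str
    image: str
    command: str
    schedule: Optional[str] = None  # cron expression for scheduled jobs


def load_canonical_jobs(tool: str) -> list[JobSpec]:
    """Read the tool's job definitions from the canonical store (e.g. a SQL table)."""
    raise NotImplementedError("backed by the external database, not the cluster")


def sync_tool_to_cluster(tool: str, cluster_api) -> None:
    """Re-create every job defined for the tool in the target cluster.

    `cluster_api` stands in for a client of the target cluster's Toolforge API.
    """
    desired = load_canonical_jobs(tool)
    existing = {job.name for job in cluster_api.list_jobs(tool)}

    for spec in desired:
        if spec.name not in existing:
            cluster_api.create_job(tool, spec)

    # Remove anything in the cluster that is no longer in the canonical store.
    for name in existing - {spec.name for spec in desired}:
        cluster_api.delete_job(tool, name)
```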

Event Timeline

Thanks for sharing this idea Taavi. If we allow a tool to be restarted as part of a transition to a different cluster does that change the requirements mentioned? In general, are there other things we can do to simplify the list of requirements? For example, if some downtime was ok, does that change anything? If users weren't blocked on using kubectl, but also had no support if a k8s upgrade broke them, does that lessen requirements in a useful way? What if we deployed a clone of toolforge today, but did so with a current version of k8s? Could tools be migrated?

My questions are mostly around other alternatives, even undesirable ones, for sake of understanding all options we might have. Thanks!

taavi updated the task description.

I am not completely sure I understand all of your questions, but I will try to explain a bit more about my thinking on the technical work required for this and then answer them.

First, there are two somewhat separate things involved:

  1. Moving tools from one cluster to another.
  2. Running multiple Kubernetes clusters at the same time and spreading all of the tools to all of them.

(1) can be done without (2), and we actually did that once in the past. In theory we could do that again, but at least currently my opinion is that it is a worse option than continuing with our current approach to Kubernetes upgrades.

There are two approaches for moving tools: you can move the Kubernetes object specs (think Deployment, Service, and so on) directly, or you can work with our abstractions (webservices and jobs) and tell the abstraction layer to re-create the objects on the new cluster based on data loaded from the old one. Both methods have benefits and drawbacks. Most notably, moving objects directly can be fully automated to work with our current cluster with less effort, but it is significantly riskier and blocks us from making certain types of breaking changes to the cluster. Working with the abstractions takes more work (we need to get everything to use them), but it greatly reduces the risk of weird edge cases appearing and lets us change things underneath however we want. I currently believe that the abstraction method is better long-term, which is why I mentioned it in the task description.
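
To make the abstraction approach more concrete, here is a minimal sketch; the per-cluster API clients and their methods are hypothetical, not an existing interface:

```python
# Hypothetical sketch of an abstraction-level move: workload definitions are
# read through the jobs/webservices abstractions on the origin cluster and
# re-created on the target cluster; no raw Kubernetes objects are copied.
def move_tool(tool: str, origin, target) -> None:
    """Re-create a tool's workloads on `target`, then stop them on `origin`.

    `origin` and `target` stand in for clients of the per-cluster Toolforge APIs.
    """
    jobs = origin.list_jobs(tool)              # abstraction-level specs, not k8s objects
    webservice = origin.get_webservice(tool)   # None if the tool has no webservice

    for job in jobs:
        target.create_job(tool, job)
    if webservice is not None:
        target.start_webservice(tool, webservice)

    # Only stop the originals once the new cluster has accepted everything,
    # so a failed move leaves the tool running where it was.
    for job in jobs:
        origin.delete_job(tool, job.name)
    if webservice is not None:
        origin.stop_webservice(tool)
```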

Now the more interesting (at least to me) part is about running tools on several clusters at once. Right now we're fine with running everything in one cluster (as T333929 demonstrated, we can simply throw hardware at scalability issues for now), but in the longer term there certainly are benefits to using multiple clusters. As long as we have a reliable and automated way to move a tool from cluster A to cluster B, the remaining multi-cluster work is to update our infrastructure to support it. Roughly that means creating a source of truth that assigns tools to clusters, and updating the web ingress, the custom APIs and other support tooling to handle multi-cluster operations. I expect this part to involve far fewer surprises and much less complexity than the tool moving part.
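
For the assignment and routing piece, the core could be as simple as a single lookup that both the web ingress and the custom APIs consult. Cluster names, endpoints and the fallback behaviour below are invented for the example:

```python
# Illustrative sketch only: one table assigning each tool to a cluster,
# acting as the source of truth for routing and for the custom APIs.
DEFAULT_CLUSTER = "toolforge-k8s-1"

# In practice this would live in a database or a config service, not in code.
TOOL_CLUSTER_ASSIGNMENTS: dict[str, str] = {
    "example-tool": "toolforge-k8s-2",
}

# Hypothetical internal ingress endpoints, one per cluster.
INGRESS_ENDPOINTS = {
    "toolforge-k8s-1": "http://ingress.k8s-1.svc.internal",
    "toolforge-k8s-2": "http://ingress.k8s-2.svc.internal",
}


def cluster_for_tool(tool: str) -> str:
    """Return the cluster a tool's traffic and API calls should be routed to."""
    return TOOL_CLUSTER_ASSIGNMENTS.get(tool, DEFAULT_CLUSTER)


def ingress_backend(tool: str) -> str:
    """Pick the per-cluster ingress endpoint for an incoming web request."""
    return INGRESS_ENDPOINTS[cluster_for_tool(tool)]
```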

Again: this is a lot of work, and there's no particular hurry to get it done. I find it interesting and plan to slowly work on everything needed, but I don't expect everyone else to do the same.

So, with all of that being said (sorry!):

If we allow a tool to be restarted as part of a transition to a different cluster does that change the requirements mentioned?

I'm not exactly sure what you mean by this. Fundamentally, the process of moving a tool involves stopping everything it is doing in the origin cluster and restarting it in the destination cluster.

In general, are there other things we can do to simplify the list of requirements?

The tool moving part is the main piece of complexity here. As I mentioned above, there are two separate approaches to it with varying levels of requirements.

For example, if some downtime was ok, does that change anything?

If you mean a Toolforge-wide outage then no, I don't think that significantly changes anything.

If users weren't blocked on using kubectl, but also had no support if a k8s upgrade broke them, does that lessen requirements in a useful way?

This is already the case (as in we say 'if you use Kubernetes directly, we're not responsible for keeping your tool updated for API changes'), although breaking changes to the very core K8s components are very rare and will take several K8s versions to complete.

What if we deployed a clone of toolforge today, but did so with a current version of k8s? Could tools be migrated?

We could do that, but I don't see any real advantage over upgrading the current cluster. We do not have any pre-existing code to migrate tools to the new cluster and would have to write it from scratch.

Block users from directly accessing the Kubernetes API

I know that others have proposed this as well, but I personally do not believe that we can get our homegrown abstractions to the point where they sufficiently cover all current Toolforge use cases.

I can see a potential need for a higher quality-of-service offering where services become fully managed, one that could require the end users to give up flexibility in how their code is packaged, scheduled, and executed in exchange for a promise of active monitoring and increased stability. I do not believe that the majority of tools and maintainers would opt in to that system, however.

What if we supported multiple clusters via sharding of some kind? The idea would not be to keep all tools safe all the time, but to limit the damage of a single cluster failure. With N clusters and M active workloads, a balanced distribution of those workloads would mean that only M/N tools fail when a single cluster does: N==2 and 50% fail, N==10 and 10% fail, N==20 and 5% fail, etc. In theory, failed workloads that use one of our offered abstractions (which track enough configuration state to redeploy from nothing) could also be migrated to other, healthy clusters. Tools that are not using those abstractions would be no worse off than they are today.
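
To put rough numbers on that, a toy sketch; the hash-based assignment is just one possible way to keep the distribution balanced, not a concrete proposal:

```python
# Rough illustration of the blast-radius argument: with tools spread evenly
# over N clusters, one cluster failing takes out roughly 1/N of them.
import hashlib


def assign_cluster(tool: str, clusters: list[str]) -> str:
    """Deterministically shard a tool onto one of the available clusters."""
    digest = int(hashlib.sha256(tool.encode()).hexdigest(), 16)
    return clusters[digest % len(clusters)]


def failure_impact(n_clusters: int) -> float:
    """Fraction of tools affected when a single cluster fails."""
    return 1 / n_clusters


for n in (2, 10, 20):
    print(f"N={n}: ~{failure_impact(n):.0%} of tools affected by one cluster failure")
```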

Turning mostly stateless things (jobs-framework-api) into stateful things (with a dedicated database or storage of some sort, etcd??) feels like the wrong move to me.
We may be protecting ourselves from a problem we have never had, one that is mostly theoretical. I believe the Reddit incident is not enough justification to claim there is something wrong with the technology itself.
Also, I don't see how skipping Kubernetes versions would simplify the core of the problem, which is facing API migrations like the recent T292238: Figure out certificate generation for admission webhooks before we lose the certificates/v1beta1.

I could understand the point of having clusters dedicated to specific workloads, like a cluster for webservices, another cluster for jobs, etc., because it may simplify tailoring each cluster to the kind of usage/workload it runs. But even then, I'd like to see some factual data that proves (or at least suggests) the necessity/business case for it.

Side note: we are far away from the maximum scalability figures for Kubernetes, see https://kubernetes.io/docs/setup/best-practices/cluster-large/. In that regard our usage of the technology is small.

I see a strong use case to help upgrades.

On the user side, this could be as simple as passing '--toolforge-cluster=new-cluster' when starting or restarting a tool, and changing the default for that option whenever the new cluster is ready to take load.
That should be easily automatable too.
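
Something along these lines, for example; the flag name is from above, while the argparse wiring and the default-resolution logic are just an assumption about how the CLI could be structured:

```python
# Hedged sketch of the user-facing flag: the default cluster is flipped in one
# place once the new cluster is ready to take load.
import argparse

CURRENT_DEFAULT_CLUSTER = "toolforge-k8s-1"  # change this when switching the default


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="toolforge-jobs")
    parser.add_argument(
        "--toolforge-cluster",
        default=CURRENT_DEFAULT_CLUSTER,
        help="Kubernetes cluster to run this tool's workloads on",
    )
    parser.add_argument("action", choices=["start", "restart", "stop"])
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(f"{args.action} on cluster {args.toolforge_cluster}")
```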

On the infrastructure side, I think it would be simpler to run two deployments of each API at the same time, one on each cluster, rather than making the APIs cluster-aware. I would also move as much CLI logic as possible into the APIs running on the cluster, to avoid having to implement switches of the type "if running against cluster1, do this, if cluster2, do that", and instead just deploy a different version of each API with the minimum code needed to support the Kubernetes version it runs on.
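
As a sketch of that thin-client idea (endpoint URLs and the API path are invented for the example):

```python
# The CLI only resolves which cluster's API endpoint to talk to and forwards
# the request; all version-specific logic lives in the API deployed on that
# cluster, so there are no per-cluster branches in the client.
import requests

API_ENDPOINTS = {
    "toolforge-k8s-1": "https://api.k8s-1.svc.internal",
    "toolforge-k8s-2": "https://api.k8s-2.svc.internal",
}


def start_job(tool: str, cluster: str, job_spec: dict) -> None:
    """Forward the request to the chosen cluster's API deployment."""
    base = API_ENDPOINTS[cluster]
    resp = requests.post(f"{base}/v1/tools/{tool}/jobs", json=job_spec, timeout=30)
    resp.raise_for_status()
```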

I really resonate with creating a clear distinction between a PaaS offering and a K8s-as-a-service offering (currently both are intertwined in Toolforge). I think it will help both us and the users understand what is expected, and it will ease the management. It would also greatly help detach the service from the implementation, and avoid the current issues with the grid from happening again when we want to move to some other compute implementation (because that will happen sooner or later, be it for budget, technical, or other reasons).