
Discussion: Adopt an upgrade policy / cadence for Toolforge Kubernetes
Closed, Resolved · Public

Description

Similar to T316866: Decision Request - Openstack Upgrade Cadence, it would be useful to have a dependable schedule for planning upgrades and keeping the Toolforge Kubernetes cluster up to date. This is not a decision request, as I am not personally proposing a cadence yet.

Some notes to consider for this discussion:

  • Kubernetes currently releases on a 4-month cycle (https://kubernetes.io/releases/release/). This means 3 releases a year, generally around May, September and December. Release branches for the most recent three minor releases are maintained, providing ~12-14 months of patch support.
  • The latest version of k8s is 1.26, released on 2022-12-09.
  • Toolforge k8s is running 1.21 (EOL on 2022-06-28) and is almost ready for 1.22 (EOL on 2022-12-08) (thanks @taavi!).
  • We cannot skip versions when upgrading (per @taavi, Kubernetes upstream does not support skipping versions when upgrading an existing cluster, and we cannot currently redeploy a cluster, so we must upgrade in place; see the sketch after this list).
  • According to the CNCF user group, most major hosting services are upgrading twice a year, skipping a version each time, and generally trying to move a bit slower than upstream. Most are running 1.22/1.23 at this time. The production (wikiprod) k8s cluster is on v1.16.15, with an upgrade to 1.23 ongoing.
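
To make the "no skipping" constraint concrete, here is a minimal sketch (plain Python; the version numbers are just the ones from the notes above) of the sequence of minor upgrades needed to catch up from the running release to the latest:

```python
# Minimal sketch: upstream only supports moving one minor version at a time on
# an existing cluster, so the path from the running release to the latest is a
# strict sequence. Versions are taken from the notes above (1.21 running, 1.26 latest).
def upgrade_path(current_minor: int, latest_minor: int) -> list[str]:
    """Return every intermediate minor release that must be applied, in order."""
    return [f"1.{m}" for m in range(current_minor + 1, latest_minor + 1)]

print(upgrade_path(21, 26))
# ['1.22', '1.23', '1.24', '1.25', '1.26'] -- five in-place upgrades just to
# catch up, while upstream publishes three new minors per year.
```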

Given this, and assuming we want to run a supported version of Kubernetes, we would need to upgrade at least once per year (i.e. faster than the ~12-14 months of patch support). However, if we cannot or will not skip versions, we will likely need an even faster cadence to stay on a supported release.

Lastly, note that upgrading has historically been a pain point for k8s cluster operators. See https://github.com/kubernetes/enhancements/blob/master/keps/sig-release/1498-kubernetes-yearly-support-period/README.md: "The survey conducted in early 2019 by the WG LTS showed that a significant subset of Kubernetes end-users fail to upgrade within the 9-month support period...This, and other responses from the survey, suggest that this 30% of users would better be able to keep their deployments on supported versions if the patch support period were extended to 12-14 months. This appears to be true regardless of whether the users are on DIY build or commercially vendored distributions. An extension would thus lead to more than 80% of users being on supported versions, instead of the 50-60% we have now."

Event Timeline

Also noting that we currently don't really pay attention to the patch releases that happen after we upgrade to a 'minor' (1.x) version.
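
As an illustration only (not something we run today), a rough sketch of how we could at least notice that patch drift, assuming the `kubernetes` and `requests` Python packages, a working kubeconfig, and the public dl.k8s.io release channel:

```python
# Rough sketch (assumption, not current practice): compare the cluster's running
# patch release against the newest patch published for the same minor branch.
import requests
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run inside the cluster
running = client.VersionApi().get_code().git_version.lstrip("v")  # e.g. "1.21.1"
minor = ".".join(running.split(".")[:2])                          # e.g. "1.21"

# dl.k8s.io publishes the newest patch for each maintained minor branch.
latest = requests.get(f"https://dl.k8s.io/release/stable-{minor}.txt", timeout=10).text.strip().lstrip("v")

if running != latest:
    print(f"Patch drift: running {running}, newest {minor} patch is {latest}")
else:
    print(f"Up to date on the {minor} branch ({running})")
```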

We cannot skip versions when upgrading (Can someone provide more context for this one?)

Kubernetes upstream does not support skipping versions when upgrading an existing cluster.

In yesterday's meeting this came up in relation to the wikiprod clusters being upgraded directly from 1.16 -> 1.23. They do that by completely re-creating the cluster, which is not an option for us at the moment due to the custom workloads running on our cluster. (This also ties in to the Toolforge-as-a-platform / "don't expose the k8s api" discussion on cloud-admin@.)

Last call for any opinions or proposed cadences! If none come in, I will create a proposal this quarter and reference this ticket.

This also ties in to the toolforge as a platform / "don't expose the k8s api" discussion on cloud-admin@

This is indeed a very good argument for not exposing the k8s api to end users...

Given that at the moment we cannot skip versions, and that we want to stay on upstream or upstream-1, we must somehow do 3 upgrades per year.

I find it hard to predict if it would be more effective to:

  • upgrade once a year to the latest available version at the time (doing 3 upgrades back-to-back)
  • upgrade 3 times a year, every time a new version comes out

I was also thinking of an in-between option where:

  • we set an upgrade window once a year, and we try to get to the latest available version
  • but if one of the 3 back-to-back upgrades proves difficult or time-consuming because of some breaking change, we only do 1 or 2 of the 3 available upgrades, and we set an extra upgrade window 6 months later to tackle the remaining ones

Just my 2c really, I look forward to hearing other proposals.

After thinking about something similar for Ceph (see T325223), I think this can be thought of as two separate decisions: the upgrade frequency, and the upgrade content.

In that sense, I prefer frequent upgrades with small content.

One issue we have right now is that it's hard to create a full testing environment, so we only have one, toolsbeta (hoping that lima-kilo will change that soon).

We can put more effort into that, and it would greatly help with testing newer versions, and thus with more frequent releases.

Another issue is users depending directly on k8s APIs; yes, there's already some discussion about that.

So I propose to strive for the following:

  • Start a specific effort to continue (or start, if it's not already a goal) working towards easy, repeatable Toolforge deployments, so we can redeploy Toolforge environments easily (lima-kilo or otherwise; I'd try to use something we can later reuse to redeploy Toolforge itself if needed, so maybe terraform+helm+lima-kilo as glue, or even cookbooks if needed for orchestration).
  • Start a specific effort to decouple users from infrastructure (toolforge 2.0? Toolforge API? probably where https://docs.google.com/document/d/12fzFPE96KpHMXqdZzrGH6WApqDa5ZxNjL3iQE6KSAPc/edit is going)
  • Do a periodical release every 4 months, in which:
    • We upgrade toolsbeta to the latest version and start working on any changes needed
    • We upgrade tools to the N-1 version, which has already been tested on toolsbeta for a full upgrade cycle, and for which we "should" have fixed all the issues (sketched below)
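
To illustrate the staggering, a small sketch (the windows and version numbers are hypothetical, just following the 4-month rhythm proposed above) of how toolsbeta would always run one minor ahead of tools:

```python
# Hypothetical illustration of the staggered 4-month cycle proposed above:
# toolsbeta tracks the latest minor, tools follows one cycle (one minor) behind.
windows = ["2023-05", "2023-09", "2024-01", "2024-05"]  # example upgrade windows
latest_minor = 27                                       # assumed latest at the first window

for i, window in enumerate(windows):
    toolsbeta = f"1.{latest_minor + i}"   # gets the newest minor, shakes out breaking changes
    tools = f"1.{latest_minor + i - 1}"   # gets the minor toolsbeta already ran for a full cycle
    print(f"{window}: toolsbeta -> {toolsbeta}, tools -> {tools}")
```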

Now, to get there we have to do a big push now, so we would probably need some dedicated effort to stabilize (which might mean partially freezing the rest of the Toolforge projects that depend on k8s, like the build service or jobs). It will also probably need some extra synchronizing/pairing, as this is likely more than one person's worth of work (and knowledge).

Also note that this means any changes we test on toolsbeta would happen on a newer version than tools, so we should keep that in mind (having another staging environment, or rebuilding it, would help greatly here).

Otherwise, if we only want to do the "minimum" to keep the current state, doing a big effort every year to upgrade to the latest in one batch seems the simplest, though I strongly recommend doing something that improves the process over time instead of, or along with, that.

Some "raw" ideas that I have to think more about xd are:

  • Would it help to create a cluster per Toolforge subservice (ingress/jobs/build/webservices/...)? This might allow decoupling the upgrades for each of them (given that different people work on each).
  • Can we create a new cluster and partially migrate the workload, then plan downtime for the workloads we can't migrate? (Guessing this might be a pain for users that are very bound to k8s and/or expect very high uptime.)
  • Can we reproduce all of this environment on codfw somehow? To what extent?

Sorry, getting off topic, I'll stop there.

In reading these comments, it sounds like perhaps our time would be better spent improving our ability to upgrade rather than adopting a (faster) cadence. If so, I wonder if we can leave the existing version of Kubernetes running (i.e. plan no upgrades for now) while the work to improve Toolforge orchestration is prioritized. There's also the possibility of iterating on Toolforge such that version jumps of k8s would become possible. I'm open to this idea of investing in Toolforge first, provided we are intentional in planning and doing the work, and realistic about the timelines to do so.

Thoughts?

Just to clarify, are you proposing to hold off updating the cluster until a hypothetical future where we can move tools between clusters on a whim?

That's an interesting idea, and I think it would require the kind of future where tools don't have direct access to the cluster. It's possible, and I think it's something we should be working towards (or at least planning for, given everything else going on), but it's a lot of work, so I'm a bit uncomfortable leaving the cluster as-is until that is complete.

Just to clarify, are you proposing to hold off updating the cluster until a hypothetical future where we can move tools between clusters on a whim?

Yes, but also if there are other smaller investments to be made first, let's consider them. My comment was that there's a large gap in how we orchestrate toolforge today that might make more sense to invest in right now, versus trying for faster upgrades.

That's an interesting idea, and I think it would require the kind of future where tools don't have direct access to the cluster. It's possible, and I think it's something we should be working towards (or at least planning for, given everything else going on), but it's a lot of work, so I'm a bit uncomfortable leaving the cluster as-is until that is complete.

Thanks for sharing what prerequisites need to exist! I was hoping to get a better idea of what work would need to happen, what risks would be involved, and of course everyone's opinion on whether the tradeoff makes sense. No matter what we decide, I wanted us to think about how best to invest our time in the short term. The comments seem to suggest that adopting a fast upgrade cadence isn't the right problem to tackle first. So I'll ask: what is?

Barring the ability to skip versions, are there other investments we can target first, and potentially delay an upgrade for? For example, perhaps investing in further automation (repeatable deployment?) should happen before we adopt a faster cadence. We don't have to build the entire ideal scenario of being able to skip versions before upgrading. Are there steps along the way? Should we target an interim step first to make upgrading easier? Are there steps we can take to reduce the impact on users? If so, should we do those first?

nskaggs closed this task as Resolved. (Edited Feb 27 2023, 6:37 PM)
nskaggs claimed this task.

On 21 Feb 2023, the Toolforge council met and discussed this ticket. The discussion centered around uncertainty about how to deal with Kubernetes moving quickly and deprecating things. It was noted that PodSecurityPolicy will go away in 1.25, and Toolforge currently depends on it. This was also cited as one of the difficulties in choosing an upgrade policy: upgrading isn't a simple package update. I would echo the suggestion from the discussion to take the next upgrades as an opportunity to learn by doing: figure out how to best prepare for changes and stay current. As it stands, we want to upgrade beyond 1.25, but we need to get to 1.24 first anyway.
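
For context on that dependency: upstream's replacement for PodSecurityPolicy is the built-in Pod Security Admission controller, configured through namespace labels. A minimal sketch of the kind of change involved, using the `kubernetes` Python client; the namespace name and the "restricted" level are illustrative assumptions, not something the council decided:

```python
# Minimal sketch (assumptions only): PodSecurityPolicy is removed in 1.25; the
# upstream replacement, Pod Security Admission, is driven by namespace labels.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

labels = {
    "pod-security.kubernetes.io/enforce": "restricted",   # enforcement level is an assumption
    "pod-security.kubernetes.io/enforce-version": "v1.25",
}
# "tool-example" is a hypothetical namespace name used only for illustration.
core.patch_namespace("tool-example", {"metadata": {"labels": labels}})
```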

The overarching goals are to:

  1. Stay up to date on at minimum a supported kubernetes version
  2. Not be surprised by, or require lots of re-engineering before, an upgrade (this happened, I believe, with 1.18, 1.22, 1.25, etc.)

In addition, WMCS still has a longer-term goal of a portable and easily repeatable toolforge deployment, which should be helped by this.

I believe there is consensus to continue upgrades as planned. I think there is more discussion to be had, but it will happen in the future, post-upgrades. Given that, I'm going to resolve this ticket rather than leave it as stalled. I anticipate a decision request proposal will come at some point in the future, as more knowledge and experience is gained. In the interim, don't hesitate to add comments here as desired.

For the records here: I've filed T333059: Spread Toolforge tools to multiple Kubernetes clusters which amongst other things would let us skip versions when upgrading.