
Upgrade K3s cluster to most recent stable version
Closed, Resolved | Public | 5 Estimated Story Points

Description

Our current cluster version is already two years old at the time of writing. We should schedule an upgrade to a more recent version at some point. Some guidelines here.

We probably want to create a test cluster somewhere to test out the upgrade and create instructions/a runbook to capture the necessary steps for the future. This process is a good candidate to be automated in our tofu repository; at the very least we can store any tools or scripts in there, even if full automation can't happen.

T408379 is a prerequisite for the version upgrade.

We're upgrading 7 minor versions here, from v1.28 to v1.35, so we will need to do the upgrade in stages. So far I've only found one pitfall that will very likely need manual intervention on our part:

I also found another couple of breaking changes, but fortunately they shouldn't affect us:

With a bit of luck (and daring moral turpitude) we may be able to upgrade in just three hops: v1.28 -> v1.31, v1.31 -> v1.32, v1.32 -> v1.35
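As a quick sanity check on the hop arithmetic above, here is a minimal sketch that counts how many minor versions each hop crosses. The helper names `minor_of` and `hop_sizes` are made up for illustration and are not part of any K3s tooling; the sketch only does arithmetic on version strings and says nothing about whether a given hop is safe.

```python
import re

# Matches K3s release tags such as "v1.31.14+k3s1" as well as bare "v1.28".
_VERSION_RE = re.compile(r"^v(\d+)\.(\d+)(?:\.(\d+))?(?:\+k3s\d+)?$")


def minor_of(version: str) -> int:
    """Return the Kubernetes minor version number of a K3s version string."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognised version: {version!r}")
    return int(match.group(2))


def hop_sizes(plan: list[str]) -> list[int]:
    """Return how many minor versions each consecutive hop in the plan crosses."""
    minors = [minor_of(v) for v in plan]
    sizes = [b - a for a, b in zip(minors, minors[1:])]
    if any(size <= 0 for size in sizes):
        raise ValueError("every hop must move to a newer minor version")
    return sizes


plan = ["v1.28", "v1.31", "v1.32", "v1.35"]
sizes = hop_sizes(plan)
print(sizes, sum(sizes))  # [3, 1, 3] 7
```

The total of 7 matches the number of minor versions between v1.28 and v1.35 mentioned earlier.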

Details

Related Changes in GitLab:
| Title | Reference | Author | Source Branch | Dest Branch |
| --- | --- | --- | --- | --- |
| update K3s cluster to version `v1.35.3+k3s1` | repos/test-platform/catalyst/catalyst-tofu!40 | jnuche | T400077 | main |
| update K3s cluster to version `v1.31.14+k3s1` | repos/test-platform/catalyst/catalyst-tofu!38 | jnuche | T400077 | main |

Event Timeline

thcipriani edited projects, added Catalyst (musi); removed Catalyst.

IIRC, during the 2026-02-19 Catalyst triage we agreed that an acceptable disaster recovery plan is a snapshot of the volumes. We'll need to ask for additional resources to make that snapshot.

As a first step, I'll test locally and on catalyst-dev how risky the v1.28 -> v1.31 jump is.

As it turns out, none of the components we use has any Traefik configuration incompatible with v3. The following components all use the native K8s Ingress resource in their routes:

I couldn't find any references to Traefik's own IngressRoute.

The only Traefik-specific configuration we seem to have at the moment is this middleware in the OpenTofu repo. I verified that the syntax it uses is compatible with Traefik v3.
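A search like the one described above could be sketched roughly as follows. This is not the procedure that was actually used; it is a plain text scan over YAML files in a local checkout, and the function name `scan_manifests` and the choice of patterns are assumptions. In particular, the `traefik.containo.us` pattern assumes that this legacy API group is what Traefik v3 no longer supports, which is worth confirming against the Traefik migration guide. Templated manifests (for example Helm charts) are scanned as raw text, not rendered.

```python
import re
from pathlib import Path

# Patterns worth flagging before a Traefik v2 -> v3 move (assumed list, not exhaustive).
# The legacy "traefik.containo.us" API group is assumed to be unsupported in v3.
PATTERNS = {
    "IngressRoute resource": re.compile(r"^\s*kind:\s*IngressRoute(TCP|UDP)?\s*$", re.M),
    "legacy API group": re.compile(r"traefik\.containo\.us"),
}


def scan_manifests(root: str) -> list[tuple[str, str]]:
    """Return (file, finding) pairs for every YAML file under root that matches."""
    findings = []
    for path in sorted(Path(root).rglob("*")):
        if path.suffix not in {".yaml", ".yml"} or not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for label, pattern in PATTERNS.items():
            if pattern.search(text):
                findings.append((str(path), label))
    return findings
```

Calling `scan_manifests` on the root of a component's checkout returns an empty list when nothing matches, which corresponds to the "no references found" result reported here.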

I updated my local cluster to v1.35.3+k3s1 and catalyst-dev too. Everything I tested worked correctly, including the code server routing that relies on the Traefik middleware.

It seems to me that we can deploy production all the way to v1.35.3+k3s1.
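One way to confirm that every node ended up on the target version after an upgrade is to inspect the output of `kubectl get nodes -o json`. The sketch below assumes that output has been captured as a string; `nodes_not_on` is a hypothetical helper, and the sample data is invented to show the shape of the check, not taken from any real cluster.

```python
import json


def nodes_not_on(nodes_json: str, expected: str) -> dict[str, str]:
    """Map node name -> reported kubelet version for nodes that differ from expected."""
    nodes = json.loads(nodes_json)["items"]
    return {
        node["metadata"]["name"]: node["status"]["nodeInfo"]["kubeletVersion"]
        for node in nodes
        if node["status"]["nodeInfo"]["kubeletVersion"] != expected
    }


# Hypothetical sample shaped like `kubectl get nodes -o json` output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "node-a"},
         "status": {"nodeInfo": {"kubeletVersion": "v1.35.3+k3s1"}}},
        {"metadata": {"name": "node-b"},
         "status": {"nodeInfo": {"kubeletVersion": "v1.31.14+k3s1"}}},
    ]
})
print(nodes_not_on(sample, "v1.35.3+k3s1"))  # {'node-b': 'v1.31.14+k3s1'}
```

An empty result means all nodes report the expected version.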

jnuche set the point value for this task to 5. (Thu, Apr 16, 1:49 PM)
jnuche moved this task from In progress to Done on the Catalyst (Luka Ijo Pimeja Jan) board.

Production cluster is now running v1.35.3+k3s1.