
Update Kubernetes clusters to >1.25
Open, High, Public

Description

Umbrella task to track the work required towards upgrading our Kubernetes clusters to Kubernetes >1.25 (1.24 is EOL on 2023-07-28).

We're currently running 1.23, which went EOL on 2023-02-08, and there are some bigger requirements to be dealt with before moving to a newer version:

  • We need to migrate away from Docker as container runtime: T269684
  • We need to migrate away from PodSecurityPolicies: T273507

Together with the Kubernetes update, we need to update the following other components:

  • Calico
  • Istio
  • cert-manager
  • kserve
  • knative-serving
  • coredns
  • helm

Preparation for the Kubernetes update

  • Ensure all our charts are compatible with the new Kubernetes version (currently validating against 1.27)
  • Read Kubernetes changelogs (yellow/red flags just linked below each version. Tick the box if all action required items have been addressed, use ✅ for single items), https://relnotes.k8s.io
    • v1.24
      • Action Required
      • Note
    • v1.25
      • Action Required
      • Note
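
The chart compatibility check above could be approximated by scanning rendered manifests for API versions that are no longer served. The following is a minimal sketch, not the validation actually in use: `find_removed_apis` and `REMOVED_APIS` are hypothetical names, the removal table is limited to the removals listed in the upstream deprecated API migration guide for 1.25 through 1.27, and the parsing is a plain regex over top-level `apiVersion`/`kind` keys rather than a real YAML parser.

```python
import re

# Kubernetes minor version in which each (apiVersion, kind) stops being served.
# Assumption: table is illustrative and only covers 1.25 to 1.27.
REMOVED_APIS = {
    (1, 25): {
        ("batch/v1beta1", "CronJob"),
        ("discovery.k8s.io/v1beta1", "EndpointSlice"),
        ("events.k8s.io/v1beta1", "Event"),
        ("autoscaling/v2beta1", "HorizontalPodAutoscaler"),
        ("policy/v1beta1", "PodDisruptionBudget"),
        ("policy/v1beta1", "PodSecurityPolicy"),
        ("node.k8s.io/v1beta1", "RuntimeClass"),
    },
    (1, 26): {
        ("flowcontrol.apiserver.k8s.io/v1beta1", "FlowSchema"),
        ("flowcontrol.apiserver.k8s.io/v1beta1", "PriorityLevelConfiguration"),
        ("autoscaling/v2beta2", "HorizontalPodAutoscaler"),
    },
    (1, 27): {
        ("storage.k8s.io/v1beta1", "CSIStorageCapacity"),
    },
}

# Only match keys at column 0, i.e. the top level of each YAML document.
_API_RE = re.compile(r"^apiVersion:\s*(\S+)\s*$", re.MULTILINE)
_KIND_RE = re.compile(r"^kind:\s*(\S+)\s*$", re.MULTILINE)


def find_removed_apis(rendered, target):
    """Return (apiVersion, kind, removed_in) for each manifest in `rendered`
    whose API is no longer served at Kubernetes version `target`."""
    findings = []
    for doc in re.split(r"^---\s*$", rendered, flags=re.MULTILINE):
        api = _API_RE.search(doc)
        kind = _KIND_RE.search(doc)
        if not (api and kind):
            continue
        pair = (api.group(1).strip("\"'"), kind.group(1).strip("\"'"))
        for removed_in, pairs in sorted(REMOVED_APIS.items()):
            if removed_in <= target and pair in pairs:
                findings.append((pair[0], pair[1], removed_in))
    return findings


SAMPLE = """\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
---
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
"""

for api_version, kind, removed_in in find_removed_apis(SAMPLE, (1, 27)):
    print(f"{kind} ({api_version}) removed in {removed_in[0]}.{removed_in[1]}")
```

Feeding it the output of `helm template` for each chart would be one way to use it; the changelog review in the next bullet is still needed for behavioural changes that do not show up as removed API versions.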

Upgrade process

Related Objects

Status     Subtype  Assigned         Task
Resolved            JMeybohm
Open                None
Open                JMeybohm
Open                JMeybohm
Open                None
Open                None
Resolved            ayounsi
Resolved            elukey
Resolved            JMeybohm
Resolved            Clement_Goubert
Open                elukey
Resolved            JMeybohm
Resolved            JMeybohm
Open                None
Open                None
Open                None
Duplicate           None
Duplicate           None
Resolved   Request  Jclark-ctr
Open                None
Stalled             hnowlan
Duplicate           Clement_Goubert
Resolved            JMeybohm
Resolved            JMeybohm
Resolved            JMeybohm
Resolved            JMeybohm
Resolved            RobH
Open       Request  Papaul
Open       Request  VRiley-WMF

Event Timeline

JMeybohm triaged this task as Medium priority. Jul 17 2023, 12:24 PM
JMeybohm created this task.
JMeybohm raised the priority of this task from Medium to High. Sep 22 2023, 9:12 AM

With the next k8s upgrade we already have the following dependency problems:

  • We need to migrate to containerd before moving to k8s >=1.24 (T269684)
  • containerd version (< 1.6) in bullseye is only supported in kubelet <=1.25 (see)
  • PSPs gone in >=1.25 (T273507)
  • VAPs available in >=1.26 (T273507)
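
To make the ordering implied by these constraints explicit, here is a small sketch that lists what still blocks a given target version. It only encodes the four bullets above; `ClusterState`, `upgrade_blockers` and `vap_available` are hypothetical names, and the other components' version requirements are deliberately ignored.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    runtime: str       # "docker" or "containerd"
    containerd: tuple  # (major, minor) of the packaged containerd
    uses_psp: bool     # True while PodSecurityPolicies are still in use


def upgrade_blockers(state, target):
    """Return the work that must happen before running k8s `target`."""
    blockers = []
    # Docker as runtime has to go before k8s >= 1.24.
    if target >= (1, 24) and state.runtime != "containerd":
        blockers.append("migrate to containerd (T269684)")
    # containerd < 1.6 is only supported by kubelet <= 1.25.
    if target > (1, 25) and state.containerd < (1, 6):
        blockers.append("containerd >= 1.6 required for kubelet > 1.25")
    # PSPs are gone in k8s >= 1.25.
    if target >= (1, 25) and state.uses_psp:
        blockers.append("replace PodSecurityPolicies (T273507)")
    return blockers


def vap_available(target):
    """VAPs are available in k8s >= 1.26."""
    return target >= (1, 26)


# Roughly the starting point described in this task: docker runtime,
# bullseye containerd 1.4, PSPs still in use.
today = ClusterState(runtime="docker", containerd=(1, 4), uses_psp=True)
for version in [(1, 23), (1, 25), (1, 27)]:
    print(version, upgrade_blockers(today, version))
```

The gap this exposes is that PSPs must be gone one minor version before VAPs become available, which is why the plans below move non-mediawiki workloads to PSSs first and add VAPs only after the upgrade.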

With this in mind (and leaving the version requirements of all the other components out for now), we drafted the following update plans for the wikikube clusters. The main problem is the need for containerd >=1.6 if we want to go newer than kubelet 1.25, which would require us to reimage all nodes. In previous updates, that could easily be done during a cluster downtime, but with >130 nodes per cluster (and counting) that is probably no longer feasible (or at least requires extended cluster downtime and still puts us under some level of stress).

One big reimage plan
This is how we did it in the past. Downtime one cluster (during DC switchover) and update/reimage all of its nodes at once.

  • Switch to containerd (node per node)
  • Move everything that is not mediawiki to PSSs (depending on how T273507 goes)
  • Upgrade to k8s 1.29, reimaging all nodes to bookworm (while cluster is depooled)
  • Add VAP to enforce policy to mediawiki namespaces (depending on how T273507 goes)

Rolling reimage plan
This idea leverages the fact that there is a supported version skew between kube-apiserver and kubelets.

  • Switch to containerd (node per node)
  • Move everything that is not mediawiki to PSSs (depending on how the PSP replacement goes)
  • Upgrade apiserver to 1.27, kubelets to 1.25 (still bullseye, containerd 1.4)
  • Add VAP to enforce policy to mediawiki namespaces (depending on how T273507 goes)
  • rolling reimage of nodes to bookworm and kubelet 1.27
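
The skew this plan relies on can be checked with a few lines. This is a sketch under the assumption that a kubelet must not be newer than kube-apiserver and may lag it by at most two minor versions; newer upstream releases document a wider window, so the limit is a parameter. `kubelet_skew_ok` is a hypothetical helper, not part of any Kubernetes tooling.

```python
def kubelet_skew_ok(apiserver, kubelet, max_skew=2):
    """True if `kubelet` (major, minor) may run against `apiserver`.

    Assumption: kubelet must not be newer than kube-apiserver and may be
    at most `max_skew` minor versions older.
    """
    if apiserver[0] != kubelet[0]:
        return False
    return 0 <= apiserver[1] - kubelet[1] <= max_skew


# Intermediate state of the rolling plan: apiserver 1.27, kubelets on 1.25.
print(kubelet_skew_ok((1, 27), (1, 25)))
# During the rolling reimage, old and new kubelets coexist.
print(all(kubelet_skew_ok((1, 27), k) for k in [(1, 25), (1, 27)]))
# Jumping the apiserver straight to 1.29 would exceed a two-version skew.
print(kubelet_skew_ok((1, 29), (1, 25)))
```

Under this assumption, 1.27 is the newest apiserver version that still accepts the 1.25 kubelets, which is the last kubelet version supporting the bullseye containerd.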

Decouple reimage plan
This would require us to backport the bookworm containerd version (1.6) to bullseye.

  • Switch to containerd (node per node)
  • Move everything that is not mediawiki to PSSs (depending on how T273507 goes)
  • Upgrade to k8s 1.29
  • Add VAP to enforce policy to mediawiki namespaces (depending on how T273507 goes)
  • Do bookworm reimaging whenever we like