
Update wikikube-staging-codfw to kubernetes 1.31
Closed, Resolved (Public)

Description

Things are going to be a bit different this time (compared to T326340: Update staging-codfw to k8s 1.23), since we now need to be able to upgrade clusters without reimaging all workers.

The very generic plan is:

  • Downtime the cluster (ctrl, worker, etcd)
  • Disable puppet on ctrl and worker nodes
  • Stop k8s components on ctrl and worker nodes
  • Delete etcd data
  • Merge updated version and calico_version in hieradata/common/kubernetes.yaml (1110813)
  • Enable and run puppet on ctrl
  • Enable and run puppet on workers
  • Deploy admin_ng
  • Deploy services
  • Deploy some version of mediawiki to validate PSP to VAP migration (T273507)
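The sequence above can be sketched as an ordered dry-run runbook. All selector strings and command lines below are illustrative assumptions (only the sre.k8s.wipe-cluster cookbook is confirmed later in this task), not the verbatim invocations:

```python
# Hypothetical dry-run sketch of the upgrade plan above. Every selector and
# command string here is an illustrative placeholder, not a real invocation.

UPGRADE_PLAN = [
    ("downtime cluster", "cookbook sre.hosts.downtime -r 'k8s 1.31 upgrade' 'A:wikikube-staging-codfw'"),
    ("disable puppet on ctrl+workers", "cumin 'A:wikikube-staging-codfw' 'disable-puppet \"k8s 1.31 upgrade\"'"),
    ("stop k8s components", "cumin 'A:wikikube-staging-codfw' 'systemctl stop kubelet'"),
    ("wipe etcd data", "cookbook sre.k8s.wipe-cluster"),
    ("merge hiera change", "manual: bump version/calico_version in hieradata/common/kubernetes.yaml"),
    ("enable+run puppet on ctrl", "cumin 'A:wikikube-staging-codfw-ctrl' 'run-puppet-agent --enable \"k8s 1.31 upgrade\"'"),
    ("enable+run puppet on workers", "cumin 'A:wikikube-staging-codfw-worker' 'run-puppet-agent --enable \"k8s 1.31 upgrade\"'"),
    ("deploy admin_ng", "manual: helmfile apply in admin_ng for staging-codfw"),
    ("deploy services", "manual: helmfile apply per service"),
]


def main() -> None:
    """Print each step in order instead of executing it (dry run only)."""
    for idx, (name, cmd) in enumerate(UPGRADE_PLAN, start=1):
        print(f"[{idx}/{len(UPGRADE_PLAN)}] {name}: {cmd}")


if __name__ == "__main__":
    main()
```

Ordering matters here: puppet runs on the control plane before the workers so the new apiserver is up before kubelets try to rejoin.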

Event Timeline

Gehel subscribed.

Moving to our current work board to keep visibility on the progress for DPE SRE.

JMeybohm changed the task status from Open to In Progress. Jan 30 2025, 5:19 PM
JMeybohm claimed this task.

Mentioned in SAL (#wikimedia-operations) [2025-01-30T17:20:12Z] <jayme> staging-codfw k8s cluster is currently being updated to k8s 1.31 and in an unusable state - T384450

FTR: A CertManagerCertNotReady critical alert fired for staging-codfw; this alert should not fire for staging-codfw at all.

jijiki triaged this task as Medium priority. Feb 3 2025, 1:33 PM
jijiki moved this task from Incoming 🐫 to Doing 😎 on the serviceops-deprecated board.
JMeybohm raised the priority of this task from Medium to High. Feb 28 2025, 1:05 PM

Change #1124831 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] staging-codfw: Unset image.tag for coredns to apply the default version

https://gerrit.wikimedia.org/r/1124831

Change #1124831 merged by jenkins-bot:

[operations/deployment-charts@master] staging-codfw: Unset image.tag for coredns to apply the default version

https://gerrit.wikimedia.org/r/1124831

@JMeybohm, @kamila, and I did an upgrade of staging-codfw to 1.31. We used the sre.k8s.wipe-cluster cookbook from https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1115380.

After fixing a few dependency issues in admin-ng, we were able to apply admin-ng properly to the fresh cluster. However, the next step, deploying istio, failed: the istio pods running on the control plane nodes had DNS issues and did not become ready. This was related to the pod IP range change from 10.192.75.0/24 to 10.192.64.0/21, which we combined with the cluster reinstall.
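Worth noting about those two ranges: despite the similar prefixes, the new /21 does not contain (or even overlap) the old /24, so router configuration matching the old pod range could not cover the new pods, consistent with the need to re-run homer. A quick check with Python's ipaddress module:

```python
import ipaddress

# Old and new pod IP ranges from the staging-codfw reinstall.
old = ipaddress.ip_network("10.192.75.0/24")
new = ipaddress.ip_network("10.192.64.0/21")

# The /21 spans 10.192.64.0 - 10.192.71.255, so the old /24 sits entirely
# outside it: the ranges are disjoint, not nested.
print(old.subnet_of(new))   # False
print(new.overlaps(old))    # False
print(new[0], new[-1])      # 10.192.64.0 10.192.71.255
```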

After running homer on the core routers, the istio pods became ready and inter-pod traffic worked fine. Thanks @JMeybohm for troubleshooting this.

Deploying a service (ipoid) was successful as well.