
Update wikikube eqiad to kubernetes 1.31
Closed, Resolved · Public

Description

We're planning to update the wikikube eqiad cluster to kubernetes 1.31 before the eqiad repool: Thursday 2 October @ 15:00 UTC (T399891). The exact deployment window is 1 October 10:00-15:00 UTC (12:00-17:00 UTC+2).

Required patches:

As of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1127859 we're still running mw-web and mw-api-ext with replicas suitable for single-DC serving. So for the depool test, no further changes are required.

Upgrade process is:

  • Inform DPE SRE at least 48 hours ahead of time via the SRE mailing list (because of T404605)
  • Announce maintenance start to #wikimedia-operations and the email chain
  • Deploy all services to ensure the current version in git can be deployed, revert all patches that break deployments (if any)
  • scap lock --all "Kubernetes upgrade"
  • Depool toolhub
    • sudo confctl --object-type discovery select 'dnsdisc=toolhub.*' set/pooled=false
  • Depool thumbor
    • Bump thumbor replicas in codfw
    • sudo confctl --object-type discovery select 'dnsdisc=swift.*,name=eqiad' set/pooled=false
    • sudo confctl --object-type discovery select 'dnsdisc=thumbor.*,name=codfw' set/pooled=true
    • sudo confctl --object-type discovery select 'dnsdisc=thumbor.*,name=eqiad' set/pooled=false
  • Double check all services are depooled: sudo cookbook sre.k8s.pool-depool-cluster status --k8s-cluster wikikube-eqiad
  • Take a note on which services are currently deployed (helm list -A > all_services_helm_list.txt)
  • cookbook sre.k8s.wipe-cluster --k8s-cluster wikikube-eqiad -H 2 --reason "Kubernetes upgrade"
    • Merge patches after "Cluster's state has been wiped. "
  • Apply admin-ng to all other clusters (because of the IP pool change)
  • Deploy istio CRDs first and delete the istio-system namespace (so that it can be recreated by helm): istioctl-1.24.2 install --set profile=remote --skip-confirmation && kubectl delete ns istio-system
  • helmfile sync admin_ng
  • istioctl-1.24.2 manifest apply -f /srv/deployment-charts/custom_deploy.d/istio/main/config_1.24.2.yaml
  • Deploy and repool toolhub first to minimize its downtime
    • cd /srv/deployment-charts/services.d/toolhub; helmfile -e eqiad -i apply
    • Repool toolhub
      • sudo confctl --object-type discovery select 'name=eqiad,dnsdisc=toolhub.*' set/pooled=true
  • Deploy all the services
    • deploy mw-mcrouter first to make sure the daemonset finds available nodes
    • charlie (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1188456 installed to /usr/local/bin/charlie)
      • WARNING: charlie will operate on mediawiki services as well, which is probably not what we want (see "Deploy mediawiki" below and T397685: helmfile/scap does not reliably bootstrap mediawiki).
        • Before the next upgrade, we may want to give charlie the ability to optionally exclude mediawiki services, so that they can be sequenced independently (e.g., via SKIP_DIRS).
        • Alternatively, if we do want charlie to bring up mediawiki services (and use scap only to smoke test deployments), we should shift the "Ensure all mediawiki support releases are ready" step earlier, just prior to running charlie.
  • Repool thumbor
    • sudo confctl --object-type discovery select 'dnsdisc=thumbor.*,name=eqiad' set/pooled=true
    • sudo confctl --object-type discovery select 'dnsdisc=thumbor.*,name=codfw' set/pooled=false
    • sudo confctl --object-type discovery select 'dnsdisc=swift.*,name=eqiad' set/pooled=true
  • Deploy mediawiki
    • Ensure all mediawiki support releases are ready (otherwise the subsequent scap deploy will fail). For example, from /srv/deployment-charts/helmfile.d/services and setting $CLUSTER appropriately:
      • statsd exporters: for mw in mw-{api-ext,api-int,cron,debug,experimental,jobrunner,misc,parsoid,script,web,wikifunctions}; do pushd $mw; helmfile -e $CLUSTER -l name=prometheus -i apply; popd; done
      • mediawiki-common resources: for mw in mw-{cron,script}; do pushd $mw; helmfile -e $CLUSTER -l name=mediawiki-common -i apply; popd; done
    • Deploy mediawiki itself: scap sync-world --k8s-only -Dbuild_mw_container_image:False
  • Announce maintenance end to #wikimedia-operations and the email chain
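The mediawiki-exclusion idea discussed in the charlie bullets above can be sketched as a plain shell filter over the service directories. To be clear, SKIP_DIRS is not an existing charlie option (that is the feature request), and the mw-* glob, the stand-in directory names, and the helmfile invocation are assumptions based on this task rather than charlie's actual interface:

```shell
# Hypothetical sketch only: charlie has no skip/exclude option yet (see the
# feature-request bullet above). The mw-* glob and the directory names below
# are assumptions; mediawiki itself is bootstrapped later via scap.
SKIP_PATTERN='mw-*'
for dir in mw-web mw-api-ext toolhub thumbor zotero; do  # stand-in for services.d/*
  case "$dir" in
    $SKIP_PATTERN) echo "skip $dir (mediawiki, deployed later via scap)" ;;
    *)             echo "deploy $dir" ;;  # e.g. (cd "$dir" && helmfile -e eqiad -i apply)
  esac
done
```

The unquoted $SKIP_PATTERN in the case arm is deliberate: POSIX shell treats the expanded value as a glob pattern there, so a single variable can carry the exclusion rule.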

Todos/fallout from the wikikube-eqiad upgrade:

Related Objects

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change #1191647 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/deployment-charts@master] admin_ng: Change eqiad pod ip range to 10.67.128.0/17

https://gerrit.wikimedia.org/r/1191647

Change #1191652 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] Update eqiad pod ip range

https://gerrit.wikimedia.org/r/1191652

Change #1191653 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] Update eqiad to kubernetes 1.31, calico 3.29

https://gerrit.wikimedia.org/r/1191653

Change #1191656 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/deployment-charts@master] Update eqiad to k8s 1.31

https://gerrit.wikimedia.org/r/1191656

Since we're going to depool the whole eqiad cluster, we will be running a test depool during the UTC mid-day MW-Infra window on 2025-06-18. TBD: is this still needed?

It's not, if we push the eqiad repool back a week or a couple of days after the switchover, which should be fine.

As of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1127859 we're still running mw-web and mw-api-ext with replicas suitable for single-DC serving. So for the depool test, no further changes are required.

We're already single DC post switchover, nothing needed there.

JMeybohm updated the task description.

I will be running the upgrade, @Jelto is backup, and @JMeybohm will be around for moral support and consulting :P

Target time 10:00 UTC, Wednesday October 1st

The update is currently blocked by T406094.

Mentioned in SAL (#wikimedia-operations) [2025-10-01T10:55:13Z] <claime> Starting eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703

Mentioned in SAL (#wikimedia-operations) [2025-10-01T11:03:53Z] <cgoubert@deploy2002> Locking from deployment [ALL REPOSITORIES]: eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703

Zotero was updated to docker-registry.discovery.wmnet/repos/mediawiki/services/zotero:2025-09-18-102701-production during the process

Mentioned in SAL (#wikimedia-operations) [2025-10-01T11:48:38Z] <cgoubert@cumin1003> START - Cookbook sre.k8s.wipe-cluster Wipe the K8s cluster wikikube-eqiad: eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703

Change #1191652 merged by Clément Goubert:

[operations/puppet@production] Update eqiad pod ip range

https://gerrit.wikimedia.org/r/1191652

Change #1191653 merged by Clément Goubert:

[operations/puppet@production] Update eqiad to kubernetes 1.31, calico 3.29

https://gerrit.wikimedia.org/r/1191653

Clement_Goubert changed the task status from Open to In Progress. Oct 1 2025, 1:13 PM

Change #1191647 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Change eqiad pod ip range to 10.67.128.0/17

https://gerrit.wikimedia.org/r/1191647

Change #1191656 merged by jenkins-bot:

[operations/deployment-charts@master] Update eqiad to k8s 1.31

https://gerrit.wikimedia.org/r/1191656

Mentioned in SAL (#wikimedia-operations) [2025-10-01T13:35:18Z] <cgoubert@cumin1003> END (FAIL) - Cookbook sre.k8s.wipe-cluster (exit_code=99) Wipe the K8s cluster wikikube-eqiad: eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703

Icinga downtime and Alertmanager silence (ID=c05638d5-ce6f-47ba-a690-94d6f96d3881) set by jelto@cumin1003 for 0:30:00 on 239 host(s) and their services with reason: eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703

wikikube-worker[1002-1007,1011-1012,1015-1016,1019-1021,1029-1031,1034-1168,1240-1289,1291-1327].eqiad.wmnet,wikikube-worker-exp1001.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2025-10-01T13:51:26Z] <jelto@cumin1003> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 239 hosts with reason: eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703

Followup:

  • charlie feature request: Just do it and don't show me any diff or ask for confirmation

Mentioned in SAL (#wikimedia-operations) [2025-10-01T14:24:59Z] <cgoubert@deploy2002> Unlocked for deployment [ALL REPOSITORIES]: eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703 (duration: 201m 05s)

Mentioned in SAL (#wikimedia-operations) [2025-10-01T14:25:25Z] <cgoubert@deploy2002> Started scap sync-world: eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703

@BTullis @bking there is an undeployed change for flink in the dse-k8s-eqiad cluster which made it a bit tricky to deploy the new wikikube-eqiad CIDRs (to update network policies).

I made a manual diff and excluded the flink, spark and production releases and it seems all other admin_ng components are up to date.

I used:

for name in rbac-rules pod-security-policies namespaces calico-crds calico coredns external-services cert-manager-networkpolicies cert-manager cfssl-issuer-crds cfssl-issuer namespace-certificates istio-gateways-networkpolicies istio-gateways-envoyfilters istio-proxy-settings eventrouter knative-serving-crds knative-serving kserve kube-state-metrics helm-state-metrics k8s-controller-sidecars main-opentelemetry-collector ceph-csi-rbd ceph-csi-cephfs cloudnative-pg-crds cloudnative-pg priority-classes opensearch-operator-crds; do
  helmfile -e dse-k8s-eqiad diff -l name="$name"
  helmfile -e dse-k8s-codfw diff -l name="$name"
done

which did not produce any diff. But you might want to double check the admin_ng chart yourself, especially if you see connectivity issues to services in wikikube-eqiad.
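For a next time, a variant of that loop can make "no diff" machine-checkable instead of relying on eyeballing the output. This is a sketch assuming helmfile diff supports --detailed-exitcode (exit code 0 when there are no changes, non-zero otherwise); worth confirming against the deployed helmfile version before relying on it, and the component list here is abridged from the full one above:

```shell
# Sketch: collect admin_ng components with pending diffs, assuming
# `helmfile diff --detailed-exitcode` exits 0 only when nothing changed.
# Component list abridged from the full loop above.
failed=""
for name in rbac-rules namespaces calico coredns external-services; do
  for env in dse-k8s-eqiad dse-k8s-codfw; do
    helmfile -e "$env" diff -l name="$name" --detailed-exitcode >/dev/null 2>&1 \
      || failed="$failed $env/$name"
  done
done
if [ -n "$failed" ]; then
  echo "pending diffs:$failed"
else
  echo "no pending diffs"
fi
```

The advantage over the plain loop is that the result is a single line (and an easily greppable list of env/component pairs) rather than a long diff transcript to scan by hand.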

@Jelto thanks and apologies for that. The pending admin_ng diff should be mainly related to the spark-operator, which I'm actively working on as part of: T405490

Mentioned in SAL (#wikimedia-operations) [2025-10-01T15:25:49Z] <cgoubert@deploy2002> Started scap sync-world: eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703

Mentioned in SAL (#wikimedia-operations) [2025-10-01T15:27:59Z] <cgoubert@deploy2002> Finished scap sync-world: eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703 (duration: 03m 16s)

Mentioned in SAL (#wikimedia-operations) [2025-10-01T15:35:40Z] <claime> Finished eqiad Wikikube kubernetes cluster upgrade to 1.31 - T405703

Followup:

  • charlie feature request: Just do it and don't show me any diff or ask for confirmation

Noted! I was thinking about adding this, and hesitated just because you could really powerfully make mistakes with it (plus it's a brand-new tool I didn't trust yet) but good to know it would have helped here. Let me know if there's anything else while I'm at it, I think you're the only one who's tried the tool besides me.


For posterity, that discussion is T406212.

Before the next upgrade, we may want to give charlie the ability to optionally exclude mediawiki services, so that they can be sequenced independently (e.g., via SKIP_DIRS).

You and I talked about this, but noting for history's sake: I think this makes sense and especially in combination with --dangerously_fast (T406212) it might be a sensible default. Happy to review a patch for this, or if you'd like to spin off a subtask and assign it to me, I'm happy to take it.