We're planning to update the wikikube eqiad cluster to Kubernetes 1.31 before the eqiad repool on Thursday 2 October at 15:00 UTC (T399891). The exact deployment window is 1 October 10:00-15:00 UTC (12:00-17:00 UTC+2).
Required patches:
- Patches similar to what was required for wikikube-codfw: T397148: Update wikikube codfw to kubernetes 1.31
- The new, bigger Pod IP pools can be found at T375845: WikiKube clusters close to exhausting Calico IPPool allocations. Routers and ToR switches have already been configured with the new ranges.
- Plus changing the kubernetesVersion for mw deployments to eqiad, required because of T388390: Ensure the correct helm version is used for each cluster / T388969: MW deployments shouldn't need a hard-coded kubernetesVersion
As of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1127859 we're still running mw-web and mw-api-ext with replicas suitable for single-DC serving. So for the depool test, no further changes are required.
Upgrade process is:
- Inform DPE SRE at least 48 hours ahead of time via the SRE mailing list (because of T404605)
- Announce maintenance start to #wikimedia-operations and the email chain
- Deploy all services to ensure the current version in git can be deployed, revert all patches that break deployments (if any)
- charlie (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1188456 installed to /usr/local/bin/charlie)
- scap lock --all "Kubernetes upgrade"
- Depool toolhub
- sudo confctl --object-type discovery select 'dnsdisc=toolhub.*' set/pooled=false
- Depool thumbor
- Bump thumbor replicas in codfw
- sudo confctl --object-type discovery select 'dnsdisc=swift.*,name=eqiad' set/pooled=false
- sudo confctl --object-type discovery select 'dnsdisc=thumbor.*,name=codfw' set/pooled=true
- sudo confctl --object-type discovery select 'dnsdisc=thumbor.*,name=eqiad' set/pooled=false
- Double-check that all services are depooled: sudo cookbook sre.k8s.pool-depool-cluster status --k8s-cluster wikikube-eqiad
- Take note of which services are currently deployed (helm list -A > all_services_helm_list.txt)
- cookbook sre.k8s.wipe-cluster --k8s-cluster wikikube-eqiad -H 2 --reason "Kubernetes upgrade"
- Merge patches once the cookbook reports "Cluster's state has been wiped."
- Apply admin-ng to all other clusters (because of ip pool change)
- deploy istio CRDs first and delete namespace (so that it can be recreated by helm): istioctl-1.24.2 install --set profile=remote --skip-confirmation && kubectl delete ns istio-system
- helmfile sync admin_ng
- istioctl-1.24.2 manifest apply -f /srv/deployment-charts/custom_deploy.d/istio/main/config_1.24.2.yaml
- Deploy and repool toolhub first to resolve downtime
- cd /srv/deployment-charts/services.d/toolhub; helmfile -e eqiad -i apply
- Repool toolhub
- sudo confctl --object-type discovery select 'name=eqiad,dnsdisc=toolhub.*' set/pooled=true
- Deploy all the services
- deploy mw-mcrouter first to make sure the daemonset finds available nodes
- charlie (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1188456 installed to /usr/local/bin/charlie)
- WARNING: charlie will operate on mediawiki services as well, which is probably not what we want (see "Deploy mediawiki" below and T397685: helmfile/scap does not reliably bootstrap mediawiki).
- Before the next upgrade, we may want to give charlie the ability to optionally exclude mediawiki services, so that they can be sequenced independently (e.g., via SKIP_DIRS).
- Alternatively, if we do want charlie to bring up mediawiki services (and use scap only to smoke test deployments), we should shift the "Ensure all mediawiki support releases are ready" step earlier, just prior to running charlie.
- Repool thumbor
- sudo confctl --object-type discovery select 'dnsdisc=thumbor.*,name=eqiad' set/pooled=true
- sudo confctl --object-type discovery select 'dnsdisc=thumbor.*,name=codfw' set/pooled=false
- sudo confctl --object-type discovery select 'dnsdisc=swift.*,name=eqiad' set/pooled=true
- Deploy mediawiki
- Ensure all mediawiki support releases are ready (otherwise the subsequent scap deploy will fail). For example, from /srv/deployment-charts/helmfile.d/services and setting $CLUSTER appropriately:
- statsd exporters: for mw in mw-{api-ext,api-int,cron,debug,experimental,jobrunner,misc,parsoid,script,web,wikifunctions}; do pushd $mw; helmfile -e $CLUSTER -l name=prometheus -i apply; popd; done
- mediawiki-common resources: for mw in mw-{cron,script}; do pushd $mw; helmfile -e $CLUSTER -l name=mediawiki-common -i apply; popd; done
- Deploy mediawiki itself: scap sync-world --k8s-only -Dbuild_mw_container_image:False
- Announce maintenance end to #wikimedia-operations and the email chain
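The helm list -A snapshot saved before the wipe (all_services_helm_list.txt) can be compared against a fresh listing once everything has been redeployed, to catch releases that did not come back. The following is a minimal sketch and not part of the existing tooling: missing_releases is a hypothetical helper name, and it assumes both files contain the default table output of helm list -A (one header line, then NAME and NAMESPACE as the first two columns).

```shell
# Hypothetical helper, not part of charlie or the cookbooks.
# Usage: missing_releases BEFORE_FILE AFTER_FILE
# Both files are assumed to hold the default table output of `helm list -A`.
# Prints "namespace/name" for each release listed in BEFORE_FILE but not in
# AFTER_FILE, in the order the releases appear in BEFORE_FILE.
missing_releases() {
    before=$1
    after=$2
    awk '
        FNR == 1 { next }            # skip the header line of each file
        NF < 2   { next }            # ignore blank or truncated lines
        # First argument is the post-upgrade listing: remember its releases.
        FILENAME == ARGV[1] { seen[$2 "/" $1] = 1; next }
        # Second argument is the pre-wipe snapshot: report what is missing.
        !(($2 "/" $1) in seen) { print $2 "/" $1 }
    ' "$after" "$before"
}
```

One way to use it after the "Deploy mediawiki" step, with after_upgrade_helm_list.txt as a placeholder file name: run helm list -A > after_upgrade_helm_list.txt, then missing_releases all_services_helm_list.txt after_upgrade_helm_list.txt. Empty output suggests every release from the snapshot exists again; it does not say anything about whether those releases are healthy.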
Todos/fallout from the wikikube-eqiad upgrade: