We're planning to update the wikikube codfw cluster to Kubernetes 1.31 on Monday, 2025-06-23, during the UTC mid-day MW-Infra window, 10:00 - 11:00 UTC (which leaves us another 2 hours before the UTC afternoon backport window).
Required patches:
- Patches similar to what was required for staging-eqiad: T389045: Update wikikube-staging-eqiad to kubernetes 1.31
- The new, bigger Pod IP pools can be found at T375845: WikiKube clusters close to exhausting Calico IPPool allocations. Routers and ToR switches have already been configured with the new ranges.
- Plus changing the kubernetesVersion for mw deployments to codfw, because of T388390: Ensure the correct helm version is used for each cluster / T388969: MW deployments shouldn't need a hard-coded kubernetesVersion
Since we're going to depool the whole codfw cluster, we will run a test depool during the UTC mid-day MW-Infra window on 2025-06-18.
As of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1127859 we're still running mw-web and mw-api-ext with replicas suitable for single-DC serving. So for the depool test, no further changes are required.
Upgrade process is:
- Deploy all services to ensure the current version in git can be deployed, revert all patches that break deployments (if any)
- scap lock --all "Kubernetes upgrade"
- cookbook sre.k8s.pool-depool-cluster depool codfw codfw
- Double check that all services are depooled: cookbook sre.k8s.pool-depool-cluster status codfw codfw
- Take note of which services are currently deployed (helm list -A)
- cookbook sre.k8s.wipe-cluster --k8s-cluster wikikube-codfw -H 2 --reason "Kubernetes upgrade"
- Merge patches after the "Cluster's state has been wiped." message
- Apply admin-ng to all other clusters (because of ip pool change)
- deploy istio CRDs first and delete namespace (so that it can be recreated by helm): istioctl-1.24.2 install --set profile=remote --skip-confirmation && kubectl delete ns istio-system
- helmfile sync admin_ng
- istioctl-X.X manifest apply -f /srv/deployment-charts/custom_deploy.d/istio/<your-cluster>/config.yaml
- Deploy all the services
- deploy_all.sh
- Deploy mediawiki: scap sync-world --k8s-only -Dbuild_mw_container_image:False
- repool
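The steps up to and including the cluster wipe can be sketched as a small wrapper that only prints the commands by default. This is a hedged illustration, not an existing tool: the DRY_RUN switch and the run and pre_wipe_steps helpers are hypothetical names, while the commands themselves are copied from the list above. Checking the depool status output and the remaining post-wipe steps stay manual.

```shell
#!/usr/bin/env bash
# Sketch of the pre-wipe phase. Prints each command; executes only if DRY_RUN=0.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"        # hypothetical switch: 1 = print only, 0 = really run
REASON="Kubernetes upgrade"

# run: echo the command line, and execute it only when DRY_RUN=0
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# pre_wipe_steps: lock deployments, depool, verify, record releases, wipe
pre_wipe_steps() {
  run scap lock --all "$REASON"
  run cookbook sre.k8s.pool-depool-cluster depool codfw codfw
  run cookbook sre.k8s.pool-depool-cluster status codfw codfw
  run helm list -A
  run cookbook sre.k8s.wipe-cluster --k8s-cluster wikikube-codfw -H 2 --reason "$REASON"
}

pre_wipe_steps
```

Running it without DRY_RUN=0 is a cheap way to review the exact command order before the maintenance window.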
Todos/fallout from the wikikube-codfw upgrade:
- Fix kubernetes-client installations @Jelto, T387548
- increase batch size for puppet run in wipe-cluster cookbook to 50
- create a task to increase the default batch size for puppet.run() from 10 to...25? T397687
- add more downtimes:
- alertname="ProbeDown" family="ip4" instance=~"(chart\-renderer:30443|citoid:4003|cxserver:4002|eventgate\-analytics:4592|eventgate\-main:4492|k8s\-ingress\-wikikube:30443|mathoid:4001|mobileapps:4102|mw\-api\-ext\-next:4455|mw\-api\-ext:4447|mw\-api\-int:4446|mw\-parsoid:4452|mw\-web\-next:4454|mw\-web:4450|sessionstore:8081|shellbox\-constraints:4010|shellbox\-media:4015|shellbox\-syntaxhighlight:4014|shellbox\-timeline:4012|shellbox\-video:4080|shellbox:4008|termbox:4004|thumbor:8800|wikifeeds:4101|zotero:4969)" job="probes/service" module=~"(http_chart\-renderer_ip4|http_citoid_ip4|http_cxserver_ip4|http_eventgate\-analytics_ip4|http_eventgate\-main_ip4|http_mathoid_ip4|http_mobileapps_ip4|http_mw\-api\-ext\-next_ip4|http_mw\-api\-ext_ip4|http_mw\-api\-int_ip4|http_mw\-parsoid_ip4|http_mw\-web\-next_ip4|http_mw\-web_ip4|http_sessionstore_ip4|http_shellbox\-constraints_ip4|http_shellbox\-media_ip4|http_shellbox\-syntaxhighlight_ip4|http_shellbox\-timeline_ip4|http_shellbox\-video_ip4|http_shellbox_ip4|http_termbox_ip4|http_thumbor_ip4|http_wikifeeds_ip4|http_zotero_ip4|tcp_k8s\-ingress\-wikikube_ip4)" prometheus="ops" severity="page" site="codfw" source="prometheus" team="sre"
- alertname="SwaggerProbeHasFailures" instance=~"(https:\/\/citoid\.svc\.codfw\.wmnet:4003|https:\/\/cxserver\.svc\.codfw\.wmnet:4002|https:\/\/echostore\.svc\.codfw\.wmnet:8082|https:\/\/eventgate\-analytics\-external\.svc\.codfw\.wmnet:4692|https:\/\/eventgate\-analytics\.svc\.codfw\.wmnet:4592|https:\/\/eventgate\-logging\-external\.svc\.codfw\.wmnet:4392|https:\/\/eventgate\-main\.svc\.codfw\.wmnet:4492|https:\/\/eventstreams\-internal\.svc\.codfw\.wmnet:4992|https:\/\/eventstreams\.svc\.codfw\.wmnet:4892|https:\/\/mathoid\.svc\.codfw\.wmnet:4001|https:\/\/mobileapps\.svc\.codfw\.wmnet:4102|https:\/\/proton\.svc\.codfw\.wmnet:4030|https:\/\/sessionstore\.svc\.codfw\.wmnet:8081|https:\/\/termbox\.svc\.codfw\.wmnet:4004)" job="probes/swagger" prometheus="ops" severity="critical" site="codfw" source="prometheus" team="sre"
- create a task to add discovery to thumbor and remove the hardcoded backend config for swift, T397618
- create a task to give the mw-mcrouter pods a higher priority so they can evict others, T397683
- productionize deploy-all.sh, T397684
- next time, run deploy-all.sh before wiping the cluster to ensure services are in a deployable state
- somehow fix scap's ability to bootstrap a mediawiki deployment without failing (helmfile sync instead of helmfile apply), T397685
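The instance matchers in the downtime todo above follow a simple pattern: every hyphen is backslash-escaped and the targets are joined with | inside one group. A minimal sketch of assembling such a matcher from a plain target list is shown below. build_instance_matcher is a hypothetical helper, not part of any existing tooling, and it only handles the hyphen escaping seen in the ProbeDown matcher (the dots and slashes of the swagger URLs would need additional escaping).

```shell
#!/usr/bin/env bash
set -euo pipefail

# build_instance_matcher (hypothetical): escape hyphens in each target and
# join all targets with '|' into a single instance=~"(...)" matcher.
build_instance_matcher() {
  local joined
  joined="$(printf '%s\n' "$@" | sed 's/-/\\-/g' | paste -sd '|' -)"
  printf 'instance=~"(%s)"\n' "$joined"
}

# Three targets taken from the ProbeDown matcher above
build_instance_matcher chart-renderer:30443 citoid:4003 mw-web:4450
# prints: instance=~"(chart\-renderer:30443|citoid:4003|mw\-web:4450)"
```

Generating the matcher from the list of deployed services would avoid missing a service when new ones are added, but where that list should come from is left open here.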