
Update wikikube codfw to kubernetes 1.31
Closed, Resolved · Public

Description

We're planning to update the wikikube codfw cluster to kubernetes 1.31 on Monday, 2025-06-23 during the UTC mid-day MW-Infra window, 10:00 - 11:00 UTC (which gives us another 2 hours before the UTC afternoon backport window).

Required patches:

Since we're going to depool the whole codfw cluster we will be running a test depool during the UTC mid-day MW-Infra window on 2025-06-18.

As of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1127859 we're still running mw-web and mw-api-ext with replicas suitable for single-DC serving. So for the depool test, no further changes are required.

Upgrade process is:

  • Deploy all services to ensure the current version in git can be deployed, revert all patches that break deployments (if any)
  • scap lock --all "Kubernetes upgrade"
  • cookbook sre.k8s.pool-depool-cluster depool codfw codfw
    • double check all services are depooled cookbook sre.k8s.pool-depool-cluster status codfw codfw
  • Take note of which services are currently deployed (helm list -A)
  • cookbook sre.k8s.wipe-cluster --k8s-cluster wikikube-codfw -H 2 --reason "Kubernetes upgrade"
    • Merge patches after "Cluster's state has been wiped."
  • Apply admin-ng to all other clusters (because of ip pool change)
  • deploy istio CRDs first and delete namespace (so that it can be recreated by helm): istioctl-1.24.2 install --set profile=remote --skip-confirmation && kubectl delete ns istio-system
  • helmfile sync admin_ng
  • istioctl-X.X manifest apply -f /srv/deployment-charts/custom_deploy.d/istio/<your-cluster>/config.yaml
  • Deploy all the services
    • deploy_all.sh
  • Deploy mediawiki: scap sync-world --k8s-only -Dbuild_mw_container_image:False
  • repool

Todos/fallout from the wikikube-codfw upgrade:

  • Fix kubernetes-client installations @Jelto, T387548
  • increase batch size for puppet run in wipe-cluster cookbook to 50
  • create a task to increase the default batch size for puppet.run() from 10 to...25? T397687
  • add more downtimes:
    • alertname="ProbeDown"family="ip4"instance=~"(chart\-renderer:30443|citoid:4003|cxserver:4002|eventgate\-analytics:4592|eventgate\-main:4492|k8s\-ingress\-wikikube:30443|mathoid:4001|mobileapps:4102|mw\-api\-ext\-next:4455|mw\-api\-ext:4447|mw\-api\-int:4446|mw\-parsoid:4452|mw\-web\-next:4454|mw\-web:4450|sessionstore:8081|shellbox\-constraints:4010|shellbox\-media:4015|shellbox\-syntaxhighlight:4014|shellbox\-timeline:4012|shellbox\-video:4080|shellbox:4008|termbox:4004|thumbor:8800|wikifeeds:4101|zotero:4969)"job="probes/service"module=~"(http_chart\-renderer_ip4|http_citoid_ip4|http_cxserver_ip4|http_eventgate\-analytics_ip4|http_eventgate\-main_ip4|http_mathoid_ip4|http_mobileapps_ip4|http_mw\-api\-ext\-next_ip4|http_mw\-api\-ext_ip4|http_mw\-api\-int_ip4|http_mw\-parsoid_ip4|http_mw\-web\-next_ip4|http_mw\-web_ip4|http_sessionstore_ip4|http_shellbox\-constraints_ip4|http_shellbox\-media_ip4|http_shellbox\-syntaxhighlight_ip4|http_shellbox\-timeline_ip4|http_shellbox\-video_ip4|http_shellbox_ip4|http_termbox_ip4|http_thumbor_ip4|http_wikifeeds_ip4|http_zotero_ip4|tcp_k8s\-ingress\-wikikube_ip4)"prometheus="ops"severity="page"site="codfw"source="prometheus"team="sre"
    • alertname="SwaggerProbeHasFailures"instance=~"(https:\/\/citoid\.svc\.codfw\.wmnet:4003|https:\/\/cxserver\.svc\.codfw\.wmnet:4002|https:\/\/echostore\.svc\.codfw\.wmnet:8082|https:\/\/eventgate\-analytics\-external\.svc\.codfw\.wmnet:4692|https:\/\/eventgate\-analytics\.svc\.codfw\.wmnet:4592|https:\/\/eventgate\-logging\-external\.svc\.codfw\.wmnet:4392|https:\/\/eventgate\-main\.svc\.codfw\.wmnet:4492|https:\/\/eventstreams\-internal\.svc\.codfw\.wmnet:4992|https:\/\/eventstreams\.svc\.codfw\.wmnet:4892|https:\/\/mathoid\.svc\.codfw\.wmnet:4001|https:\/\/mobileapps\.svc\.codfw\.wmnet:4102|https:\/\/proton\.svc\.codfw\.wmnet:4030|https:\/\/sessionstore\.svc\.codfw\.wmnet:8081|https:\/\/termbox\.svc\.codfw\.wmnet:4004)"job="probes/swagger"prometheus="ops"severity="critical"site="codfw"source="prometheus"team="sre"
  • create a task to add discovery to thumbor and remove the hardcoded backend config for swift, T397618
  • create a task to make mw-mcrouter with higher priority pods so they can evict others, T397683
  • productionize deploy-all.sh T397684
  • next time, run deploy-all.sh before wiping the cluster to ensure services are in a deployable state
  • somehow fix scap's ability to bootstrap a mediawiki deployment without failing (helmfile sync instead of helmfile apply), T397685
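The batch-size items above boil down to how many hosts Puppet is run on concurrently; a minimal sketch of the chunking involved (illustrative only, not the actual spicerack API — the host names and fleet size are made up):

```python
from itertools import islice


def batches(hosts, batch_size):
    """Yield successive groups of at most batch_size hosts."""
    it = iter(hosts)
    while chunk := list(islice(it, batch_size)):
        yield chunk


# Hypothetical fleet of 200 nodes: with the old default of 10 that is
# 20 sequential Puppet runs, with a batch size of 50 it is only 4.
nodes = [f"kubernetes{i:04d}.codfw.wmnet" for i in range(1, 201)]
print(len(list(batches(nodes, 10))))  # → 20
print(len(list(batches(nodes, 50))))  # → 4
```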

Event Timeline

Unfortunately the pool-depool-cluster cookbook immediately fails:

jayme@cumin1002:~$ sudo cookbook sre.k8s.pool-depool-cluster -r pre-upgrade-test depool --wipe-cache codfw codfw
Acquired lock for key /spicerack/locks/cookbooks/sre.k8s.pool-depool-cluster: {'concurrency': 20, 'created': '2025-06-18 10:10:32.970217', 'owner': 'jayme@cumin1002 [4126063]', 'ttl': 1800}
START - Cookbook sre.k8s.pool-depool-cluster depool 44 services in codfw/codfw: pre-upgrade-test
Found 44 services for cluster codfw
Exception raised while executing cookbook sre.k8s.pool-depool-cluster:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 265, in _run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/k8s/pool-depool-cluster.py", line 149, in run
    elif self.args.action == "check" or service.discovery.active_active:
AttributeError: 'ServiceDiscovery' object has no attribute 'active_active'
Released lock for key /spicerack/locks/cookbooks/sre.k8s.pool-depool-cluster: {'concurrency': 20, 'created': '2025-06-18 10:10:32.970217', 'owner': 'jayme@cumin1002 [4126063]', 'ttl': 1800}
END (FAIL) - Cookbook sre.k8s.pool-depool-cluster (exit_code=99) depool 44 services in codfw/codfw: pre-upgrade-test

It accesses a bunch of non-existent attributes of ServiceDiscovery - I'll go ahead and fix those and we can retry tomorrow.
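For reference, the crash is the usual symptom of a cookbook outliving the API it was written against; a stand-in reproduction (the class and attribute names mirror the traceback, everything else is illustrative):

```python
class ServiceDiscovery:
    """Stand-in for spicerack's ServiceDiscovery after its API changed;
    the cookbook still expected an `active_active` attribute on it."""

    def __init__(self, records):
        self.records = records  # no `active_active` here anymore


disc = ServiceDiscovery(records=["k8s-ingress-wikikube-ro"])
try:
    disc.active_active  # what pool-depool-cluster.py line 149 effectively does
except AttributeError as err:
    print(err)  # → 'ServiceDiscovery' object has no attribute 'active_active'
```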

Change #1160816 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] k8s.pool-depool-cluster: Black format

https://gerrit.wikimedia.org/r/1160816

Change #1160817 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter

https://gerrit.wikimedia.org/r/1160817

Change #1161485 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] service: remove ProxyFetch checks for kartotherian, thumbor

https://gerrit.wikimedia.org/r/1161485

Depool test went fine today. We should be good to update on Monday, I'd say.

Change #1161929 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] Update codfw to kubernetes 1.31, calico 3.29

https://gerrit.wikimedia.org/r/1161929

Change #1161930 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] Update codfw eqiad pod ip range

https://gerrit.wikimedia.org/r/1161930

Change #1161945 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] Update codfw to k8s 1.31

https://gerrit.wikimedia.org/r/1161945

Change #1161948 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] admin_ng: Change codfw pod ip range to 10.194.128.0/17

https://gerrit.wikimedia.org/r/1161948

Mentioned in SAL (#wikimedia-operations) [2025-06-23T10:39:13Z] <claime> cookbook sre.k8s.wipe-cluster --k8s-cluster wikikube-codfw -H 2 --reason "Kubernetes upgrade" - T397148

Change #1161929 merged by Kamila Součková:

[operations/puppet@production] Update codfw to kubernetes 1.31, calico 3.29

https://gerrit.wikimedia.org/r/1161929

Change #1161930 merged by Kamila Součková:

[operations/puppet@production] Update codfw pod ip range

https://gerrit.wikimedia.org/r/1161930

Change #1161945 merged by jenkins-bot:

[operations/deployment-charts@master] Update codfw to k8s 1.31

https://gerrit.wikimedia.org/r/1161945

Change #1161948 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Change codfw pod ip range to 10.194.128.0/17

https://gerrit.wikimedia.org/r/1161948

Change #1162897 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/deployment-charts@master] Revert "miscweb(design-landing-page): bump version"

https://gerrit.wikimedia.org/r/1162897

Change #1162897 merged by jenkins-bot:

[operations/deployment-charts@master] Revert "miscweb(design-landing-page): bump version"

https://gerrit.wikimedia.org/r/1162897

Mentioned in SAL (#wikimedia-operations) [2025-06-23T13:12:03Z] <cgoubert@deploy1003> Started scap sync-world: Redeploying mediawiki following kubernets upgrade T397148

Mentioned in SAL (#wikimedia-operations) [2025-06-23T13:42:54Z] <cgoubert@deploy1003> Started scap sync-world: Redeploying mediawiki following kubernets upgrade T397148

Mentioned in SAL (#wikimedia-operations) [2025-06-23T13:43:23Z] <cgoubert@deploy1003> cgoubert: Redeploying mediawiki following kubernets upgrade T397148 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-06-23T13:44:55Z] <cgoubert@deploy1003> Finished scap sync-world: Redeploying mediawiki following kubernets upgrade T397148 (duration: 02m 00s)

Repool command for ingress:
sudo confctl --object-type discovery select 'dnsdisc=k8s-ingress-wikikube.*,name=codfw' set/pooled=true

Overall this did not really go as planned since we had a couple of issues:

  • We had to re-run the wipe-cluster cookbook since the kubernetes-client131 package could not be installed (error in postinst, T387548) and that does not fail the puppet run (or at least not spicerack.puppet.run())
  • spicerack.puppet.run() has a default batch size of 10 - which is way too low (probably even as a default), making the cookbook run much slower than necessary
  • We caused an outage because swift was hardcoded to use the codfw thumbor and thumbor does not have a discovery record (so it practically can't be depooled)
  • Re-deploying services got blocked because
    • ml had made hard-to-reverse changes to the machinetranslation production release without deploying them
    • miscweb contained broken changes which were never deployed to production
  • scap was unable to deploy mediawiki due to configmaps being created too late, so we fell back to the bash way
  • mw-mcrouter could not be deployed because some nodes were already out of capacity

Todos:
Moved to task description.

Change #1162952 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] peopleweb: add KUBEPOD ranges to firewall

https://gerrit.wikimedia.org/r/1162952

Repool command for ingress:

Correct command to repool just ro in codfw:
sudo confctl --object-type discovery select 'dnsdisc=k8s-ingress-wikikube-ro,name=codfw' set/pooled=true

machinetranslation in codfw was deployed successfully.

jelto@cumin1003:~$ sudo confctl --object-type discovery select 'dnsdisc=k8s-ingress-wikikube-ro,name=codfw' set/pooled=true
k8s-ingress-wikikube-ro/codfw: pooled changed False => True
WARNING:conftool.announce:conftool action : set/pooled=true; selector: dnsdisc=k8s-ingress-wikikube-ro,name=codfw

Change #1162952 merged by Jelto:

[operations/puppet@production] peopleweb: add KUBEPOD ranges to firewall

https://gerrit.wikimedia.org/r/1162952

Change #1163401 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] k8s.wipe-cluster: Run puppet in batches of 50

https://gerrit.wikimedia.org/r/1163401

Change #1163402 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] sre.wipe-cluster: Ask user to confirm target k8s version

https://gerrit.wikimedia.org/r/1163402

Change #1163401 merged by jenkins-bot:

[operations/cookbooks@master] k8s.wipe-cluster: Run puppet in batches of 50

https://gerrit.wikimedia.org/r/1163401

Change #1163402 merged by jenkins-bot:

[operations/cookbooks@master] sre.wipe-cluster: Ask user to confirm target k8s version

https://gerrit.wikimedia.org/r/1163402

Change #1161485 merged by Hnowlan:

[operations/puppet@production] service: remove ProxyFetch checks for kartotherian, thumbor

https://gerrit.wikimedia.org/r/1161485

Change #1165026 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/cookbooks@master] sre.k8s.wipe-cluster: Downtime services

https://gerrit.wikimedia.org/r/1165026

JMeybohm updated the task description.

I've addressed all action items (apart from the ones with dedicated tasks) and moved the upgrade documentation to a wikitech page at: https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/Upgrade/1.31

Change #1160816 merged by jenkins-bot:

[operations/cookbooks@master] k8s.pool-depool-cluster: Black format

https://gerrit.wikimedia.org/r/1160816

Change #1160817 merged by jenkins-bot:

[operations/cookbooks@master] sre.k8s.pool-depool-cluster: Refactor do reuse sre.discovery.datacenter

https://gerrit.wikimedia.org/r/1160817

Change #1165026 merged by jenkins-bot:

[operations/cookbooks@master] sre.k8s.wipe-cluster: Downtime services

https://gerrit.wikimedia.org/r/1165026