This is scheduled for Feb 21st, 09:00-16:00 UTC (the actual downtime of the cluster should be shorter than this window). We will piggyback on T327991: codfw row B switches upgrade, as codfw will be depooled for that anyway.
Some hosts relevant in this context will be affected by the 30 min downtime during the switch upgrade. Ideally, reimaging of those hosts should be completed before 14:00 UTC:
- kubetcd2006
- kubemaster2002
- kubernetes[2006,2009-2010,2020,2023]
Todos:
- Announce cluster downtime/reimage to ops@
- Ensure PKI intermediates have been created
- Depool wdqs and wcqs in codfw (should already be done as part of T327991)
- Downtime: etcd, master, nodes
- Properly stop rdf-streaming-updater flink job (@dcausse)
- Merge hiera changes for 1.23 (including PKI for etcd): https://gerrit.wikimedia.org/r/c/operations/puppet/+/890390/
- Reimage etcd nodes with bullseye
- Reimage masters
- Reimage ganeti node: kubernetes2006 @JMeybohm
- Reimage nodes: kubernetes[2009-2010,2013,2014,2020,2022] @JMeybohm
- Reimage ganeti nodes: kubernetes2005,kubernetes2015,kubernetes2016 @JMeybohm
- Reimage nodes: kubernetes[2007-2008,2011-2012] @elukey
- Reimage nodes: kubernetes[2017-2021] (failing with "Unable to establish IPMI v2 / RMCP+ session", probably caused by T328832 / T330048?)
- Verify basic k8s functionality (nodes joining the cluster)
- Merge deployment-charts changes for 1.23: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/890392/
- Deploy admin_ng & istio
- Deploy services
- Properly start rdf-streaming-updater flink job (@dcausse)
- Repool wdqs in codfw
- Lift downtimes (apart from kubernetes2017-2021)
- Reply to the ops/wikitech-l announcement "codfw wikikube kubernetes cluster upgrade on 2023-02-21" to announce the cluster operational again
- Repool services, see bottom of T327991
Detailed steps and commands can be found in T326340: Update staging-codfw to k8s 1.23
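For the wdqs/wcqs depool and later repool, a hedged sketch using conftool DNS discovery objects; the object type, the selector fields (`dnsdisc`, `name`), and the service names are assumptions to double-check before running:

```
# Depool the query services in codfw (if not already done as part of T327991).
sudo confctl --object-type discovery select 'dnsdisc=wdqs,name=codfw' set/pooled=false
sudo confctl --object-type discovery select 'dnsdisc=wcqs,name=codfw' set/pooled=false

# Repool wdqs once the cluster and the rdf-streaming-updater job are healthy again.
sudo confctl --object-type discovery select 'dnsdisc=wdqs,name=codfw' set/pooled=true
```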
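A minimal sketch of the downtime step, assuming it runs from a cluster management (cumin) host and that the sre.hosts.downtime cookbook accepts a duration, a reason, and a host query; the flag names and query syntax below are assumptions to verify with `--help`:

```
# Downtime etcd, masters, and workers for the upgrade window.
# Query syntax and flag names are assumptions; adjust to the actual cookbook interface.
sudo cookbook sre.hosts.downtime --hours 8 \
  -r "wikikube codfw k8s 1.23 upgrade" \
  'kubetcd2*.codfw.wmnet or kubemaster2*.codfw.wmnet or kubernetes2*.codfw.wmnet'
```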
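For the reimages, a sketch assuming the sre.hosts.reimage cookbook with an `--os` flag and an optional task reference; the flag names, the short-hostname argument, and the placeholder task ID are assumptions:

```
TASK="Txxxxxx"   # placeholder: the tracking task to attach reimage output to
# etcd and control plane first, then workers.
sudo cookbook sre.hosts.reimage --os bullseye -t "$TASK" kubetcd2006
sudo cookbook sre.hosts.reimage --os bullseye -t "$TASK" kubemaster2002
# Workers, a few at a time:
for host in kubernetes2009 kubernetes2010 kubernetes2013 kubernetes2014 kubernetes2020 kubernetes2022; do
  sudo cookbook sre.hosts.reimage --os bullseye -t "$TASK" "$host"
done
```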
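Basic post-reimage verification, run from a host with a kubeconfig for the codfw wikikube cluster (how the kubeconfig is obtained is left out here):

```
# All nodes should be Ready and report a 1.23.x kubelet.
kubectl get nodes -o wide
# Anything not Running/Completed deserves a look.
kubectl get pods --all-namespaces | grep -vE 'Running|Completed'
```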
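For the admin_ng and service deploys, a sketch assuming the usual deployment-server layout under /srv/deployment-charts/helmfile.d; the paths, the environment name, and the example service are assumptions, and istio may need its own deploy procedure, which is not shown here:

```
# Cluster-level components.
cd /srv/deployment-charts/helmfile.d/admin_ng
helmfile -e codfw apply

# Then each service, e.g. thumbor (repeat per service namespace).
cd /srv/deployment-charts/helmfile.d/services/thumbor
helmfile -e codfw apply
```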
Issues
Reimage of kubernetes2017-2021 fails with "Unable to establish IPMI v2 / RMCP+ session" (probably caused by T328832 / T330048). That means we're down 5 nodes. We have kubernetes2023-2024 in role::insetup, so we could compensate for 2 of them. Alternatively, we could just run puppet (without reimaging) on kubernetes2017-2021, which should work as well. I tried that in Pontoon, but never on real workers.
Because the cluster is missing 3 nodes, I:
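To confirm the IPMI failure independently of the cookbook, a hedged sketch probing the management interface directly; the mgmt FQDN pattern and the user are assumptions, and `-E` reads the password from the IPMI_PASSWORD environment variable:

```
# Expect this to fail with the same RMCP+ session error while T328832 / T330048 are unresolved.
ipmitool -I lanplus -H kubernetes2017.mgmt.codfw.wmnet -U root -E chassis power status
```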
- scaled thumbor down to 1 replica
- scaled mw-api-ext/mediawiki-main from 4 to 2 replicas
- scaled mw-debug/mediawiki-pinkunicorn from 2 to 1 replica
- scaled mw-web/mediawiki-main from 8 to 4 replicas
As I did this manually using kubectl scale, the change needs to be persisted in the deployment-charts repo for the mw deployments so it is not overridden by scap.
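The manual scale-downs were done with kubectl scale commands along these lines; the namespace and deployment names are taken from the list above and may not match the actual resource names exactly:

```
# Run with the kubeconfig for each namespace on the deployment server.
kubectl -n thumbor scale deployment thumbor --replicas=1                  # deployment name assumed
kubectl -n mw-api-ext scale deployment mediawiki-main --replicas=2
kubectl -n mw-debug scale deployment mediawiki-pinkunicorn --replicas=1
kubectl -n mw-web scale deployment mediawiki-main --replicas=4
```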
- Persist scale down of mw deployments in codfw