This is scheduled for March 7th, 09:00-16:00 UTC (the actual downtime of the cluster should be shorter than this window). Hopefully it will not overlap with T329073, which is also happening that day, as it is the last day that eqiad will be fully depooled. The following hosts will be affected by that task and should be done before 14:00 UTC:
- kubetcd1005
- kubemaster1001
- kubernetes[1005,1007-1008,1017-1018]
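
For readers unfamiliar with the bracketed host notation above, here is a minimal Python sketch of how such an expression expands into individual host names. The helper name `expand_hosts` is hypothetical and only handles the single-bracket form used in the list above; it is not the implementation used by any particular tool.

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand 'prefix[1005,1007-1008]' into individual host names.

    Hypothetical helper for illustration: supports one bracket group
    containing comma-separated numbers and inclusive numeric ranges.
    """
    match = re.fullmatch(r"([A-Za-z0-9.-]*?)\[([0-9,\-]+)\](.*)", expr)
    if match is None:
        return [expr]  # no bracket group: treat as a single host name
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        start, _, end = part.partition("-")
        end = end or start  # a bare number is a range of one
        width = len(start)  # keep any zero padding
        for n in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{n:0{width}d}{suffix}")
    return hosts


print(expand_hosts("kubernetes[1005,1007-1008,1017-1018]"))
# ['kubernetes1005', 'kubernetes1007', 'kubernetes1008', 'kubernetes1017', 'kubernetes1018']
```

So the third entry covers five hosts, which together with kubetcd1005 and kubemaster1001 makes seven hosts to finish before 14:00 UTC.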
Todos:
- Announce cluster downtime/reimage to ops@
- Ensure PKI intermediates have been created
- Depool wdqs and wcqs in eqiad (This is one of the exceptions in the datacenter switchover)
- Downtime: etcd, master, nodes
- Properly stop rdf-streaming-updater flink job (@dcausse)
- Merge hiera changes for 1.23 (including PKI for etcd): https://gerrit.wikimedia.org/r/c/operations/puppet/+/894586
- Reimage etcd nodes with bullseye
- Reimage masters
- Reimage ganeti nodes: kubernetes100{5,6}, kubernetes101{5,6}
- Reimage nodes: kubernetes1007-kubernetes1022
- Add new nodes kubernetes1023, kubernetes1024 T313874
- Verify basic k8s functionality is working (nodes joining the cluster)
- Merge deployment-charts/admin_ng changes for 1.23: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/894591
- Deploy admin_ng & istio
- Deploy services (do toolhub with a higher priority)
- Properly start rdf-streaming-updater flink job (@dcausse)
- Repool wdqs in eqiad
- Lift downtimes
- Reply to the ops/wikitech-l announcement to announce that the cluster is operational again
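
As an illustration of the "nodes joining the cluster" verification step, here is a rough Python sketch that reads the output of `kubectl get nodes -o json` and reports which expected nodes are missing or not Ready. The function name `not_ready_nodes` and the sample node names are assumptions made for this example; the actual commands used are the ones documented in T326340.

```python
import json


def not_ready_nodes(kubectl_json: str, expected: set[str]) -> dict[str, str]:
    """Return {node: problem} for expected nodes that are missing or not Ready.

    Hypothetical helper: `kubectl_json` is the text printed by
    `kubectl get nodes -o json`; `expected` is the set of node names
    that should have rejoined the cluster.
    """
    data = json.loads(kubectl_json)
    seen = {}
    for item in data.get("items", []):
        name = item["metadata"]["name"]
        conditions = item.get("status", {}).get("conditions", [])
        # A node is Ready when its "Ready" condition has status "True".
        seen[name] = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
    problems = {}
    for name in sorted(expected):
        if name not in seen:
            problems[name] = "missing"
        elif not seen[name]:
            problems[name] = "not ready"
    return problems


# Illustrative sample only; real node names and output will differ.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "node-a"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "node-b"},
         "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
    ]
})
print(not_ready_nodes(sample, {"node-a", "node-b", "node-c"}))
# {'node-b': 'not ready', 'node-c': 'missing'}
```

An empty result would mean every expected node has registered and reports Ready, which is a reasonable gate before deploying admin_ng and the services.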
Services won't be repooled after the reimaging, since the eqiad datacenter is depooled anyway due to the switchover.
Detailed steps and commands can be found in T326340: Update staging-codfw to k8s 1.23
@akosiaris will be running point.
Issues
None yet
Impact
The only user-visible impact will be on Toolhub, which per T329319#8619246 and related discussions is expected and acceptable. Toolhub will be prioritized when deploying services to the new cluster. Adding @bd808 for their convenience.