We will be upgrading the Kubernetes cluster in codfw to Kubernetes 1.16 and Calico 3.17, as we did for staging-eqiad in T276305.
This includes:
* Setting up new master VMs `kubemaster200[12].codfw.wmnet` (VMs created in T276204)
* Rebooting `kubetcd[2004-2006].codfw.wmnet` for T273278
* Reimaging worker nodes `kubernetes[2001-2016].codfw.wmnet`
** Add role to `kubernetes2017.codfw.wmnet` (latest addition to cluster)
** With kernel 4.19 (T262527), which also fixes the issues described in T273279
The plan is roughly:
== Preparation ==
* Prepare all needed patches
[] Aggregate the IPv4 pools into respective /21
[] Add the role to the new master VMs.
[] Enable the kernel 4.19 profile for nodes
[] Double check `deployment-charts/helmfile.d/admin_ng` has correct values populated and the cluster enabled
[] Private puppet patches for tokens (controllermanager_token, are certs already done?)
* Generate a list of all services (service FQDNs) for this DC (from namespaces or service.yaml) as $SERVICE_NAMES
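A minimal sketch of how `$SERVICE_NAMES` could be built from namespaces. It assumes that each service namespace maps to an FQDN of the form `<namespace>.svc.<dc>.wmnet` and that `kube-system`, `kube-public` and `default` should be skipped; both are assumptions, and services whose discovery name differs from the namespace would need to be added by hand. `namespaces_to_fqdns` is a hypothetical helper, not an existing tool.

```shell
#!/bin/bash
# Hypothetical helper: read namespace names (one per line) on stdin and
# print one assumed service FQDN per line for the given datacenter.
namespaces_to_fqdns() {
  dc="$1"
  # Drop system namespaces (assumed list), then append the assumed suffix.
  grep -v -E '^(kube-system|kube-public|default)$' \
    | sed "s/\$/.svc.${dc}.wmnet/"
}

# Usage sketch (needs cluster access, so it is left commented out):
# SERVICE_NAMES=$(kubectl get namespaces -o name | cut -d/ -f2 | namespaces_to_fqdns codfw)
```

The result should still be compared against `service.yaml` before it is used for downtimes or traffic changes.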
== Actions ==
* Downtime masters and nodes
** `sudo cookbook sre.hosts.downtime -r 'Reinitialize codfw k8s cluster with new etcd' -t TXXX -H 4 'A:codfw and (A:kubernetes-masters or A:kubernetes-workers)'`
* Downtime all services in the cluster
** `sudo cookbook sre.hosts.downtime -r 'Reinitialize codfw k8s cluster with new etcd' -t TXXX -H 4 $SERVICE_NAMES`
* Cut traffic to all services in the cluster
** `sudo cookbook sre.discovery.service-route ... ... $SERVICE_NAMES`
* Disable puppet on masters and nodes
** `sudo cumin 'A:codfw and (A:kubernetes-masters or A:kubernetes-workers)' 'disable-puppet "Reinitializing cluster - TXXX"'`
* Stop apiserver, controller manager, scheduler on masters
** `sudo cumin 'A:codfw and (A:kubernetes-masters)' 'shutdown -h now'`
* Maybe power them off too?
* Empty etcd (`ETCDCTL_API=3 etcdctl --endpoints https://foobar.site.wmnet:2379 del "" --from-key=true`)
* Reboot etcd servers (checkmarks in T273278)
* Image the new masters (Merge patch applying the role/hiera and apply puppet)
* Start reimaging nodes (checkmarks in T273279)
* `helmfile sync` admin_ng
* Deploy all services (use deploy_all.sh script)
* Check all services (service-checker if possible)
* End downtime of services
* Decommission the old masters
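For the "Empty etcd" step, it may be worth confirming that no keys are left before the new masters are imaged. A minimal sketch, assuming the same placeholder endpoint as above and that `etcdctl get "" --prefix --keys-only` prints one key per line separated by blank lines; `count_keys` is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: count non-empty lines on stdin.
# With etcdctl --keys-only output, that is the number of remaining keys.
count_keys() {
  # grep -c exits non-zero when the count is 0; "|| true" keeps that from
  # being treated as a failure while still printing "0".
  grep -c -v '^$' || true
}

# Usage sketch (needs access to the etcd cluster, so it is left commented out):
# ETCDCTL_API=3 etcdctl --endpoints https://foobar.site.wmnet:2379 \
#   get "" --prefix --keys-only | count_keys
```

A result of `0` would indicate that the keyspace is empty; anything else means the `del` should be checked again before continuing.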