Change Details

We will be upgrading the Kubernetes cluster in codfw to Kubernetes 1.16 and Calico 3.17, like we did for staging-eqiad in T276305. This includes:
* Setting up new master VMs `kubemaster200[12].codfw.wmnet` (VMs set up in T276204)
* Rebooting `kubetcd[2004-2006].codfw.wmnet` for T273278
* Reimaging worker nodes `kubernetes[2001-2017].codfw.wmnet`
** With kernel 4.19, T262527 (which also fixes them in T273279)

The plan is roughly:
* Prepare all needed patches
** Aggregate the IPv4 pools into respective /21
** Enable the kernel 4.19 profile for nodes
** Double check `deployment-charts/helmfile.d/admin_ng` has correct values populated and the cluster enabled
** Don't forget private puppet:
*** controllermanager_token
*** are certs already done?
* Downtime all services in the cluster
* Cut traffic to all services in the cluster (`sre.discovery.service-route` cookbook?)
* Disable puppet on master and nodes
* Stop apiserver, controller manager, scheduler on master
* Empty etcd (`ETCDCTL_API=3 etcdctl --endpoints https://foobar.site.wmnet:2379 del "" --from-key=true`)
* Reboot etcd servers (checkmarks in T273278)
* Image the new master
* Start reimaging nodes (checkmarks in T273279)
* Start apiserver, controller manager, scheduler
* `helmfile sync` admin_ng
* Deploy all services
* Check all services (service-checker if possible)
* End downtime of services
* Decommission the old masters
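
For the "Aggregate the IPv4 pools into respective /21" patch, a quick sanity check of the aggregation could look like the sketch below. It uses Python's standard `ipaddress` module; the helper name `aggregate_pools` and the example prefixes are hypothetical placeholders, not the real codfw pools.

```python
import ipaddress


def aggregate_pools(pools):
    """Collapse adjacent or overlapping CIDR blocks into the fewest covering networks."""
    networks = [ipaddress.ip_network(p) for p in pools]
    return [str(n) for n in ipaddress.collapse_addresses(networks)]


# Hypothetical example pools (assumption, NOT the actual cluster ranges):
# two adjacent /22 blocks that should merge into a single /21.
example_pools = ["10.0.0.0/22", "10.0.4.0/22"]
aggregated = aggregate_pools(example_pools)
print(aggregated)
```

If the result is not a single /21 per pool, the input blocks are either not adjacent or not aligned on a /21 boundary, which is worth catching before the patch is merged.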