We will be upgrading the Kubernetes cluster in codfw to Kubernetes 1.16 and Calico 3.17, as we did for staging-eqiad in T276305.
This includes:
* Setting up new master VMs `kubemaster200[12].codfw.wmnet` (the VMs themselves were set up in T276204)
* Rebooting `kubetcd[2004-2006].codfw.wmnet` for T273278
* Reimaging worker nodes `kubernetes[2001-2016].codfw.wmnet`
** Add role to `kubernetes2017.codfw.wmnet` (latest addition to cluster)
** With kernel 4.19 (T262527), which also fixes the issues described in T273279
The plan is roughly:
== Preparation ==
* Prepare all needed patches
[] Aggregate the IPv4 pools into respective /21: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/671144
[] Add the role to the new master VMs.
[] Enable the kernel 4.19 profile for nodes
[] Double check `deployment-charts/helmfile.d/admin_ng` has correct values populated and the cluster enabled
[] Private puppet patches for controller manager token
[] The cergen configuration for `kubemaster.svc.codfw.wmnet`
* Generate a list of all services (service FQDNs) for this DC (from namespaces or service.yaml) as $SERVICE_NAMES
```
apertium.svc.codfw.wmnet
api-gateway.svc.codfw.wmnet
blubberoid.svc.codfw.wmnet
citoid.svc.codfw.wmnet
cxserver.svc.codfw.wmnet
echostore.svc.codfw.wmnet
eventgate-analytics.svc.codfw.wmnet
eventgate-analytics-external.svc.codfw.wmnet
eventgate-logging-external.svc.codfw.wmnet
eventgate-main.svc.codfw.wmnet
eventstreams.svc.codfw.wmnet
eventstreams-internal.svc.codfw.wmnet
linkrecommendation.svc.codfw.wmnet
mathoid.svc.codfw.wmnet
mobileapps.svc.codfw.wmnet
proton.svc.codfw.wmnet
push-notifications.svc.codfw.wmnet
recommendation-api.svc.codfw.wmnet
sessionstore.svc.codfw.wmnet
similar-users.svc.codfw.wmnet
termbox.svc.codfw.wmnet
wikifeeds.svc.codfw.wmnet
```
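The FQDN list above follows a fixed `<name>.svc.<dc>.wmnet` pattern, so it can be produced from the short service names. A minimal sketch, assuming the short names are already known (for example from the namespaces); the function name `service_fqdns` is made up for illustration and only three of the services are shown:

```shell
#!/bin/sh
# Sketch: expand short service names into per-DC service FQDNs.
# service_fqdns is a hypothetical helper, not an existing tool.
DC=codfw

service_fqdns() {
    dc="$1"; shift
    for name in "$@"; do
        printf '%s.svc.%s.wmnet\n' "$name" "$dc"
    done
}

# Example with three of the services listed above.
SERVICE_NAMES="$(service_fqdns "$DC" apertium api-gateway blubberoid)"
echo "$SERVICE_NAMES"
```

The resulting `$SERVICE_NAMES` is newline-separated, which is enough for a `for` loop over the downtime or depool steps.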
== Actions ==
* Downtime masters and nodes
** `sudo cookbook sre.hosts.downtime -r 'Reinitialize codfw k8s cluster with new etcd' -t T277191 -H 24 'A:codfw and (A:kubernetes-masters or A:kubernetes-workers)'`
* Downtime all services in the cluster
~~** `for i in api-gateway blubberoid changeprop changeprop-jobqueue citoid cxserver echostore eventgate-analytics eventgate-analytics-external eventgate-logging-external eventgate-main eventstreams eventstreams-internal linkrecommendation mathoid mobileapps proton push-notifications recommendation-api sessionstore similar-users termbox wikifeeds zotero; do sudo cookbook sre.hosts.downtime -r 'Reinitialize codfw k8s cluster' -t T277191 -H 24 $i.svc.codfw.wmnet ; done`~~ This did not work; we had to switch to the Icinga UI.
* Cut traffic to all services in the cluster
~~** `sudo cookbook sre.discovery.service-route depool codfw apertium api-gateway blubberoid citoid cxserver echostore eventgate-analytics eventgate-analytics-external eventgate-logging-external eventgate-main eventstreams eventstreams-internal linkrecommendation mathoid mobileapps proton push-notifications recommendation-api sessionstore similar-users termbox wikifeeds`~~ This did not work; we had to use confctl instead.
* Switch restbase-async to eqiad:
** `sudo confctl --object discovery select "name=codfw,dnsdisc=restbase-async" set/pooled=false && sudo confctl --object discovery select "name=eqiad,dnsdisc=restbase-async" set/pooled=true`
* Disable puppet on masters and nodes
** `sudo cumin 'A:codfw and (A:kubernetes-masters or A:kubernetes-workers)' 'disable-puppet "Reinitializing cluster - T277191"'`
* Power off the old masters too. SSH to `ganeti01.svc.<site>.wmnet` and run:
** `sudo gnt-instance shutdown -f argon.eqiad.wmnet`
** `sudo gnt-instance shutdown -f chroline.eqiad.wmnet`
** `sudo gnt-instance shutdown -f acrux.codfw.wmnet`
** `sudo gnt-instance shutdown -f acrab.codfw.wmnet`
* Regenerate the codfw master cert using cergen.
** Revoke and remove the old cert on puppetmaster1001 with `sudo puppet cert clean kubemaster.svc.codfw.wmnet`
** Run cergen `sudo cergen --base-path /srv/private/modules/secret/secrets/certificates --generate /srv/private/modules/secret/secrets/certificates/certificate.manifests.d`
** Copy certs to files/ssl
* Merge: https://gerrit.wikimedia.org/r/c/operations/puppet/+/671174
* Start reimaging nodes (checkmarks in T273279)
* Empty etcd (`ETCDCTL_API=3 etcdctl --endpoints https://foobar.site.wmnet:2379 del "" --from-key=true`)
* Reboot etcd servers (checkmarks in T273278)
* Merge: https://gerrit.wikimedia.org/r/c/operations/puppet/+/671171
* Set `profile::kubernetes::master::controllermanager_token: ...` in the private puppet repo (`hieradata/role/codfw/kubernetes/master.yaml`)
* Image the new masters
* `helmfile sync` admin_ng
* Deploy all services (use deploy_all.sh script)
* Check all services (service-checker if possible)
* End downtime of services
* Decommission the old masters
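Because the depool had to be done with confctl, one call per service, the step can be looped. A minimal sketch that reuses the confctl invocation from the restbase-async step; it assumes each service's `dnsdisc` record is named after the short service name, and both `DRY_RUN` and `depool_services` are hypothetical names. By default it only prints the commands:

```shell
#!/bin/sh
# Sketch: depool several discovery services from one DC, one confctl call each.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

depool_services() {
    dc="$1"; shift
    for svc in "$@"; do
        # Assumption: the dnsdisc name equals the short service name.
        selector="name=${dc},dnsdisc=${svc}"
        if [ "$DRY_RUN" = "1" ]; then
            echo "sudo confctl --object discovery select \"${selector}\" set/pooled=false"
        else
            sudo confctl --object discovery select "${selector}" set/pooled=false
        fi
    done
}

depool_services codfw apertium api-gateway blubberoid
```

Reviewing the printed commands first and then re-running with `DRY_RUN=0` keeps the loop from touching services that are not in the cluster.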
== Action items ==
[] Support downtiming services in our cookbooks
[] Support multiple services at once in sre.discovery.service-route
[] Fix sre.discovery.service-route (it's not working)
[] Write a cookbook to put a k8s cluster into maintenance mode (shift all traffic, downtime all services, nodes and masters)
[] Add a step to downtime/whatever LVS to not alert on "marked down but pooled" and "hosts in IPVS but unknown to PyBal" for old masters
[] Add a step to downtime/whatever prometheus to not alert on unavailable masters
[] Have a way to schedule downtime for confd errors of services
[] Track down the error starting calico-kube-controllers on first sync:
```
create pod sandbox: rpc error: code = Unknown desc =
[failed to set up sandbox container "b3e99e8b33affd2190813ee6a11b61d5e55f55815b5a3c010b685145b8e9ed48" network for pod "calico-kube-controllers-84649ffc66-n6qtp": networkPlugin cni failed to set up pod "calico-kube-controllers-84649ffc66-n6qtp_kube-system" network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org "default" is forbidden: User "kubelet" cannot get resource "clusterinformations" in API group "crd.projectcalico.org" at the cluster scope, failed to clean up sandbox container "X" network for pod "calico-kube-controllers-84649ffc66-n6qtp": networkPlugin cni failed to teardown pod "calico-kube-controllers-84649ffc66-n6qtp_kube-system" network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org "default" is forbidden: User "kubelet" cannot get resource "clusterinformations" in API group "crd.projectcalico.org" at the cluster scope]
```
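The error above is an RBAC denial: the user `kubelet` is not allowed to `get` `clusterinformations` in the `crd.projectcalico.org` API group. One way to start tracking it down might be to turn the message into a `kubectl auth can-i` check and run that against the cluster before and after the first sync. A minimal sketch; `rbac_check_cmd` is a hypothetical helper, it only prints the command, and it does not handle core-group resources (empty API group):

```shell
#!/bin/sh
# Sketch: extract user, verb, resource and API group from an RBAC
# "forbidden" message and print a matching `kubectl auth can-i` command.
rbac_check_cmd() {
    # sed -n with the p flag prints only when the pattern matches.
    printf '%s\n' "$1" | sed -n -E \
        's/.*User "([^"]+)" cannot ([a-z]+) resource "([^"]+)" in API group "([^"]+)".*/kubectl auth can-i \2 \3.\4 --as=\1/p'
}

rbac_check_cmd 'User "kubelet" cannot get resource "clusterinformations" in API group "crd.projectcalico.org" at the cluster scope'
```

Running the printed command needs credentials that are allowed to impersonate the `kubelet` user.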