Sister task of T277191
We will be upgrading the kubernetes cluster in eqiad to Kubernetes 1.16 and Calico 3.17, as we did for codfw in T277191
This includes:
- Setting up the new master VMs kubemaster100[12].eqiad.wmnet (VMs created in T276204)
- Rebooting kubetcd[1004-1006].eqiad.wmnet for T273278
- Reimaging worker nodes kubernetes[1001-1016].eqiad.wmnet
- Add role to kubernetes1017.eqiad.wmnet (latest addition to the cluster)
- Add homer/public change for kubernetes1017
- Pool kubernetes1017 to conftool
- Reimaged nodes will run kernel 4.19 (T262527), which also fixes the issues described in T273279
The plan is roughly:
Preparation
- Prepare all needed patches
- Aggregate the IPv4 pools into respective /21 and enable eqiad: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/673955
- Add the role to the new master VMs: https://gerrit.wikimedia.org/r/c/operations/puppet/+/673952
- Enabling the kernel 4.19 profile for nodes: https://gerrit.wikimedia.org/r/c/operations/puppet/+/673949/1
- Double-check that deployment-charts/helmfile.d/admin_ng has the correct values populated and the eqiad cluster enabled
- Private puppet patch for the controller-manager token - prepared as T277741 in /srv/private on puppetmaster1001
- The cergen configuration for kubemaster.svc.eqiad.wmnet - Same as above
- Add homer/public patch for kubernetes1017: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/672709/
- Generate a list of all services (service FQDNs) for this DC (from the namespaces or service.yaml) as $SERVICE_NAMES; one possible way to build it is sketched after the list below
apertium.svc.eqiad.wmnet api-gateway.svc.eqiad.wmnet blubberoid.svc.eqiad.wmnet citoid.svc.eqiad.wmnet cxserver.svc.eqiad.wmnet echostore.svc.eqiad.wmnet eventgate-analytics.svc.eqiad.wmnet eventgate-analytics-external.svc.eqiad.wmnet eventgate-logging-external.svc.eqiad.wmnet eventgate-main.svc.eqiad.wmnet eventstreams.svc.eqiad.wmnet eventstreams-internal.svc.eqiad.wmnet linkrecommendation.svc.eqiad.wmnet mathoid.svc.eqiad.wmnet mobileapps.svc.eqiad.wmnet proton.svc.eqiad.wmnet push-notifications.svc.eqiad.wmnet recommendation-api.svc.eqiad.wmnet sessionstore.svc.eqiad.wmnet similar-users.svc.eqiad.wmnet termbox.svc.eqiad.wmnet wikifeeds.svc.eqiad.wmnet
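A sketch of one way to build $SERVICE_NAMES. The deployment-charts checkout path and the 1:1 mapping between helmfile.d/services directories and <name>.svc.eqiad.wmnet records are assumptions, so cross-check the result against the namespaces / service.yaml:
```
# Sketch: derive the service FQDN list from a deployment-charts checkout.
# Assumptions: checkout lives at /srv/deployment-charts and every directory
# under helmfile.d/services maps to a <name>.svc.eqiad.wmnet record.
SERVICE_NAMES=$(ls /srv/deployment-charts/helmfile.d/services/ \
  | sed 's/$/.svc.eqiad.wmnet/' | tr '\n' ' ')
echo "$SERVICE_NAMES"
```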
Actions
- Downtime masters and nodes
- sudo cookbook sre.hosts.downtime -r 'Reinitialize eqiad k8s cluster with new etcd' -t T277741 -H 24 'A:eqiad and (A:kubernetes-masters or A:kubernetes-workers)'
- Downtime all services in the cluster
- Use https://gerrit.wikimedia.org/r/c/operations/puppet/+/674147 on alert1001 in a for loop over the service list above ($SERVICE_NAMES), as sketched below.
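A rough sketch of that loop on alert1001. `downtime-service` is a hypothetical name standing in for whatever helper change 674147 actually provides; substitute the real script and arguments:
```
# Sketch only: "downtime-service" is a hypothetical stand-in for the helper from
# https://gerrit.wikimedia.org/r/c/operations/puppet/+/674147 - use the real
# script name and flags. 86400s matches the 24h host downtime above.
for svc in $SERVICE_NAMES; do
  sudo downtime-service "$svc" 86400 "Reinitialize eqiad k8s cluster - T277741"
done
```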
- Cut traffic to all services in the cluster
- Use confctl to depool the eqiad discovery records (sketch below)
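A sketch of the depool loop, reusing the confctl syntax from the restbase-async step below. It assumes each service's dnsdisc name is its FQDN minus the .svc.eqiad.wmnet suffix; verify against conftool-data first:
```
# Sketch: depool the eqiad side of each service's discovery record.
# Assumption: the dnsdisc object name equals the service name without the
# .svc.eqiad.wmnet suffix - check conftool-data before running this for real.
for svc in $SERVICE_NAMES; do
  name="${svc%.svc.eqiad.wmnet}"
  sudo confctl --object discovery select "name=eqiad,dnsdisc=${name}" set/pooled=false
done
```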
- Downtimes will also be needed for the following alerts (example texts below; expect the eqiad equivalents of the hostnames shown):
- PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers acrab.codfw.wmnet are marked down but pooled. Use https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=lvs1016&service=PyBal+backends+health+check and https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=lvs1015&service=PyBal+backends+health+check
- PROBLEM - Prometheus k8s cache not updating on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus Use https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=k8s+cache#
- PROBLEM - Confd template for /srv/config-master/pybal/eqiad/... on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/... is broken. Use https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=confd+template+for+%2Fsrv%2Fconfig-master%2Fpybal# (this needs to be downtimed for both DCs!)
- Switch restbase-async to codfw
- sudo confctl --object discovery select "name=eqiad,dnsdisc=restbase-async" set/pooled=false && sudo confctl --object discovery select "name=codfw,dnsdisc=restbase-async" set/pooled=true
- Disable puppet on masters and nodes
- sudo cumin 'A:eqiad and (A:kubernetes-masters or A:kubernetes-workers)' 'disable-puppet "Reinitializing cluster - T277741"'
- Power the old masters off: ssh to ganeti01.svc.eqiad.wmnet and run
- sudo gnt-instance shutdown -f argon.eqiad.wmnet
- sudo gnt-instance shutdown -f chlorine.eqiad.wmnet
- Regenerate the eqiad master cert (kubemaster.svc.eqiad.wmnet) using cergen:
- Revoke and remove the old cert on puppetmaster1001 with sudo puppet cert clean kubemaster.svc.eqiad.wmnet
- Apply the prepared patch: cd /srv/private ; git apply T277741
- Run cergen: sudo cergen --base-path /srv/private/modules/secret/secrets/certificates --generate /srv/private/modules/secret/secrets/certificates/certificate.manifests.d
- Copy the new certs into files/ssl - careful, this is still a manual and error-prone step (see the sketch below).
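A hedged sketch of the copy step. The cergen output filenames and the puppet checkout path are assumptions; compare with how kubemaster.svc.codfw.wmnet was handled in T277191 before trusting it:
```
# Sketch only - filenames and paths below are assumptions, double-check them
# against the codfw run in T277191.
CERT_DIR=/srv/private/modules/secret/secrets/certificates/kubemaster.svc.eqiad.wmnet
PUPPET_CHECKOUT=~/operations-puppet   # wherever your operations/puppet checkout lives
cp "${CERT_DIR}/kubemaster.svc.eqiad.wmnet.crt.pem" \
   "${PUPPET_CHECKOUT}/files/ssl/kubemaster.svc.eqiad.wmnet.crt"
```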
- Merge: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/672709
- Merge: https://gerrit.wikimedia.org/r/c/operations/puppet/+/673949
- Start reimaging nodes (checkmarks in T273279)
- Empty etcd, running against one of the eqiad etcd members (ETCDCTL_API=3 etcdctl --endpoints https://kubetcd1004.eqiad.wmnet:2379 del "" --from-key=true)
- Reboot etcd servers (checkmarks in T273278); one possible approach is sketched below
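A sketch of the reboots, assuming the sre.hosts.reboot-single cookbook is acceptable here (not stated in this task), one member at a time with an etcd health check in between:
```
# Sketch only: cookbook choice is an assumption - adjust to however etcd hosts
# are normally rebooted. Run one member at a time and verify cluster health
# (etcdctl endpoint health) before moving to the next.
for host in kubetcd1004 kubetcd1005 kubetcd1006; do
  sudo cookbook sre.hosts.reboot-single "${host}.eqiad.wmnet"
done
```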
- Merge: https://gerrit.wikimedia.org/r/c/operations/puppet/+/673952
- Move profile::kubernetes::master::controllermanager_token: ... from the private puppet repo's hieradata/role/codfw/kubernetes/master.yaml to hieradata/role/common/kubernetes/master.yaml (part of the T277741 patchfile in /srv/private)
- Image the new masters
- Merge: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/673955
- helmfile sync admin_ng (see the sketch below)
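What that step could look like on the deployment host - a sketch assuming the standard admin_ng layout and that the helmfile environment for this cluster is named eqiad:
```
# Sketch: apply the admin_ng state to the reinitialized eqiad cluster.
# Path and environment name are assumptions; check helmfile.d/admin_ng/helmfile.yaml.
cd /srv/deployment-charts/helmfile.d/admin_ng
helmfile -e eqiad sync
```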
- Deploy all services (use deploy_all.sh script)
- Check all services (with service-checker if possible; see the sketch below)
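A sketch of a bulk check with service-checker-swagger. Each service listens on its own TLS port, so PORT below is only a placeholder - look the real ports up (e.g. in puppet's service catalog) and loop accordingly:
```
# Sketch: smoke-test each service. PORT is a placeholder; every service has its
# own port, so look it up per service before running this.
PORT=4000
for svc in $SERVICE_NAMES; do
  service-checker-swagger "$svc" "https://${svc}:${PORT}" || echo "FAILED: ${svc}"
done
```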
- End downtime of services
- Decommission the old masters
Action items
TBD