We will be upgrading the Kubernetes cluster in codfw to Kubernetes 1.16 and Calico 3.17, as we did for staging-eqiad in T276305.
This includes:
- Setting up new master VMs kubemaster200[12].codfw.wmnet, VMs set up in T276204
- Rebooting kubetcd[2004-2006].codfw.wmnet for T273278
- Reimaging worker nodes kubernetes[2001-2016].codfw.wmnet
- Adding the role to kubernetes2017.codfw.wmnet (the latest addition to the cluster)
- With Kernel 4.19 T262527 (which also fixes issues described in T273279)
The plan is roughly:
Preparation
- Prepare all needed patches
- Aggregate the IPv4 pools into respective /21: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/671144
- Add the role to the new master VMs.
- Enabling the kernel 4.19 profile for nodes
- Double check deployment-charts/helmfile.d/admin_ng has correct values populated and the cluster enabled
- Private puppet patches for controller manager token
- The cergen configuration for kubemaster.svc.codfw.wmnet
- Generate a list of all services (service FQDNs) for this DC (from namespaces or service.yaml) as $SERVICE_NAMES
apertium.svc.codfw.wmnet api-gateway.svc.codfw.wmnet blubberoid.svc.codfw.wmnet citoid.svc.codfw.wmnet cxserver.svc.codfw.wmnet echostore.svc.codfw.wmnet eventgate-analytics.svc.codfw.wmnet eventgate-analytics-external.svc.codfw.wmnet eventgate-logging-external.svc.codfw.wmnet eventgate-main.svc.codfw.wmnet eventstreams.svc.codfw.wmnet eventstreams-internal.svc.codfw.wmnet linkrecommendation.svc.codfw.wmnet mathoid.svc.codfw.wmnet mobileapps.svc.codfw.wmnet proton.svc.codfw.wmnet push-notifications.svc.codfw.wmnet recommendation-api.svc.codfw.wmnet sessionstore.svc.codfw.wmnet similar-users.svc.codfw.wmnet termbox.svc.codfw.wmnet wikifeeds.svc.codfw.wmnet
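The $SERVICE_NAMES list above could be built from short service names roughly as follows. This is only a sketch: the `service_fqdns` helper is hypothetical, the three names are a subset picked for illustration, and it assumes every service follows the `<name>.svc.<dc>.wmnet` pattern seen in the list. In practice the short names would come from the namespaces or service.yaml.

```shell
# Hypothetical helper: print a space-separated list of service FQDNs
# for one datacenter, given the short service names.
# Assumes the <name>.svc.<dc>.wmnet naming pattern.
service_fqdns() {
  dc="$1"
  shift
  out=""
  for name in "$@"; do
    # Append with a separating space only when the list is non-empty.
    out="${out:+$out }$name.svc.$dc.wmnet"
  done
  printf '%s\n' "$out"
}

SERVICE_NAMES="$(service_fqdns codfw apertium api-gateway blubberoid)"
echo "$SERVICE_NAMES"
# prints: apertium.svc.codfw.wmnet api-gateway.svc.codfw.wmnet blubberoid.svc.codfw.wmnet
```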
Actions
- Downtime masters and nodes
- sudo cookbook sre.hosts.downtime -r 'Reinitialize codfw k8s cluster with new etcd' -t T277191 -H 24 'A:codfw and (A:kubernetes-masters or A:kubernetes-workers)'
- Downtime all services in the cluster
** for i in api-gateway blubberoid changeprop changeprop-jobqueue citoid cxserver echostore eventgate-analytics eventgate-analytics-external eventgate-logging-external eventgate-main eventstreams eventstreams-internal linkrecommendation mathoid mobileapps proton push-notifications recommendation-api sessionstore similar-users termbox wikifeeds zotero; do sudo cookbook sre.hosts.downtime -r 'Reinitialize codfw k8s cluster' -t T277191 -H 24 $i.svc.codfw.wmnet ; done (this did not work; we had to switch to the Icinga UI instead)
- Cut traffic to all services in the cluster
** sudo cookbook sre.discovery.service-route depool codfw apertium api-gateway blubberoid citoid cxserver echostore eventgate-analytics eventgate-analytics-external eventgate-logging-external eventgate-main eventstreams eventstreams-internal linkrecommendation mathoid mobileapps proton push-notifications recommendation-api sessionstore similar-users termbox wikifeeds (this did not work; we had to use confctl instead)
- Switch restbase-async to eqiad:
- sudo confctl --object discovery select "name=codfw,dnsdisc=restbase-async" set/pooled=false && sudo confctl --object discovery select "name=eqiad,dnsdisc=restbase-async" set/pooled=true
- Disable puppet on masters and nodes
- sudo cumin 'A:codfw and (A:kubernetes-masters or A:kubernetes-workers)' 'disable-puppet "Reinitializing cluster - T277191"'
- Power them off too: SSH to ganeti01.svc.<site>.wmnet and run:
- sudo gnt-instance shutdown -f argon.eqiad.wmnet
- sudo gnt-instance shutdown -f chroline.eqiad.wmnet
- sudo gnt-instance shutdown -f acrux.codfw.wmnet
- sudo gnt-instance shutdown -f acrab.codfw.wmnet
- Regenerate the codfw master cert using cergen.
- Revoke and remove the old cert on puppetmaster1001 with sudo puppet cert clean kubemaster.svc.codfw.wmnet
- Run cergen: sudo cergen --base-path /srv/private/modules/secret/secrets/certificates --generate /srv/private/modules/secret/secrets/certificates/certificate.manifests.d
- Copy certs to files/ssl
- Merge: https://gerrit.wikimedia.org/r/c/operations/puppet/+/671174
- Start reimaging nodes (checkmarks in T273279)
- Empty etcd (ETCDCTL_API=3 etcdctl --endpoints https://foobar.site.wmnet:2379 del "" --from-key=true)
- Reboot etcd servers (checkmarks in T273278)
- Merge: https://gerrit.wikimedia.org/r/c/operations/puppet/+/671171
- Set profile::kubernetes::master::controllermanager_token: ... in the private puppet repository's hieradata/role/codfw/kubernetes/master.yaml
- Image the new masters
- helmfile sync admin_ng
- Deploy all services (use deploy_all.sh script)
- Check all services (service-checker if possible)
- End downtime of services
- Decommission the old masters
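For the "Cut traffic" step, where the cookbook did not work and confctl was used, the per-service depool commands could be generated along these lines. This is a dry-run sketch only: the `depool_cmds` helper is hypothetical, it prints the commands instead of running them, and it assumes each discovery record's dnsdisc name matches the short service name, which should be checked before use. The confctl syntax is copied from the restbase-async step above.

```shell
# Hypothetical helper: print (do not run) one confctl depool command per service.
# Assumption: each dnsdisc record is named after the short service name.
depool_cmds() {
  dc="$1"
  shift
  for svc in "$@"; do
    printf 'sudo confctl --object discovery select "name=%s,dnsdisc=%s" set/pooled=false\n' "$dc" "$svc"
  done
}

# Review the printed commands before running any of them by hand.
depool_cmds codfw apertium api-gateway
```

Printing first keeps the operator in the loop, which seems prudent for a step that removes a whole datacenter from service discovery.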
Action items
- Support downtiming services in our cookbooks T277740
- Support multiple services at once in sre.discovery.service-route T260663
- Fix sre.discovery.service-route (it's not working) T260663
- Write a cookbook to set a k8s cluster in maintenance mode (shift all traffic, downtime all services, nodes and masters) T277677
- Add a step to downtime/whatever LVS to not alert on "marked down but pooled" and "hosts in IPVS but unknown to PyBal" for old masters
- Add a step to downtime/whatever prometheus to not alert on unavailable masters
- Have a way to schedule downtime for confd errors of services
- Track down error starting calico-kube-controller on first sync: https://gerrit.wikimedia.org/r/c/operations/puppet/+/672713
- Add docs on how to add a kubernetes node (https://gerrit.wikimedia.org/r/c/operations/puppet/+/672724/ , https://gerrit.wikimedia.org/r/c/operations/homer/public/+/672708, https://gerrit.wikimedia.org/r/c/operations/puppet/+/672537/6/modules/docker_registry_ha/manifests/web.pp#83)