Update Kubernetes cluster codfw to kubernetes 1.16
Closed, Resolved, Public

Description

We will be upgrading the Kubernetes cluster in codfw to Kubernetes 1.16 and Calico 3.17, as we did for staging-eqiad in T276305.

This includes:

  • Setting up the new master VMs kubemaster200[12].codfw.wmnet (VMs set up in T276204)
  • Rebooting kubetcd[2004-2006].codfw.wmnet for T273278
  • Reimaging worker nodes kubernetes[2001-2016].codfw.wmnet
    • Add role to kubernetes2017.codfw.wmnet (latest addition to cluster)
    • With kernel 4.19 (T262527), which also fixes the issues described in T273279

The plan is roughly:

Preparation

  • Prepare all needed patches
  • Aggregate the IPv4 pools into respective /21: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/671144
  • Add the role to the new master VMs.
  • Enable the kernel 4.19 profile for nodes
  • Double-check that deployment-charts/helmfile.d/admin_ng has the correct values populated and the cluster enabled
  • Prepare private Puppet patches for the controller manager token
  • Prepare the cergen configuration for kubemaster.svc.codfw.wmnet
  • Generate a list of all services (service FQDNs) for this DC (from namespaces or service.yaml) as $SERVICE_NAMES
apertium.svc.codfw.wmnet
api-gateway.svc.codfw.wmnet
blubberoid.svc.codfw.wmnet
citoid.svc.codfw.wmnet
cxserver.svc.codfw.wmnet
echostore.svc.codfw.wmnet
eventgate-analytics.svc.codfw.wmnet
eventgate-analytics-external.svc.codfw.wmnet
eventgate-logging-external.svc.codfw.wmnet
eventgate-main.svc.codfw.wmnet
eventstreams.svc.codfw.wmnet
eventstreams-internal.svc.codfw.wmnet
linkrecommendation.svc.codfw.wmnet
mathoid.svc.codfw.wmnet
mobileapps.svc.codfw.wmnet
proton.svc.codfw.wmnet
push-notifications.svc.codfw.wmnet
recommendation-api.svc.codfw.wmnet
sessionstore.svc.codfw.wmnet
similar-users.svc.codfw.wmnet
termbox.svc.codfw.wmnet
wikifeeds.svc.codfw.wmnet
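
The list above can be produced from the short service names. A minimal sketch follows, assuming the short names are already known (for example from the namespaces). The function name service_fqdns is hypothetical, and only three of the names are used for illustration.

```shell
#!/bin/sh
# Hypothetical helper: turn short service names into service FQDNs for one DC.
# Usage: service_fqdns <dc> <name>...
service_fqdns() {
    dc="$1"
    shift
    for name in "$@"; do
        printf '%s.svc.%s.wmnet\n' "$name" "$dc"
    done
}

# Example with three names taken from the list above; extend as needed.
SERVICE_NAMES="$(service_fqdns codfw apertium api-gateway blubberoid)"
printf '%s\n' "$SERVICE_NAMES"
```

The resulting $SERVICE_NAMES holds one FQDN per line, so it can be fed to a loop or to xargs in later steps.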

Actions

  • Downtime masters and nodes
    • sudo cookbook sre.hosts.downtime -r 'Reinitialize codfw k8s cluster with new etcd' -t T277191 -H 24 'A:codfw and (A:kubernetes-masters or A:kubernetes-workers)'
  • Downtime all services in the cluster

    • for i in api-gateway blubberoid changeprop changeprop-jobqueue citoid cxserver echostore eventgate-analytics eventgate-analytics-external eventgate-logging-external eventgate-main eventstreams eventstreams-internal linkrecommendation mathoid mobileapps proton push-notifications recommendation-api sessionstore similar-users termbox wikifeeds zotero; do sudo cookbook sre.hosts.downtime -r 'Reinitialize codfw k8s cluster' -t T277191 -H 24 $i.svc.codfw.wmnet ; done
    • This did not work; we had to switch to the Icinga UI.

  • Cut traffic to all services in the cluster

    • sudo cookbook sre.discovery.service-route depool codfw apertium api-gateway blubberoid citoid cxserver echostore eventgate-analytics eventgate-analytics-external eventgate-logging-external eventgate-main eventstreams eventstreams-internal linkrecommendation mathoid mobileapps proton push-notifications recommendation-api sessionstore similar-users termbox wikifeeds
    • This did not work; we had to use confctl instead.

  • Switch restbase-async to eqiad:
    • sudo confctl --object discovery select "name=codfw,dnsdisc=restbase-async" set/pooled=false && sudo confctl --object discovery select "name=eqiad,dnsdisc=restbase-async" set/pooled=true
  • Disable puppet on masters and nodes
    • sudo cumin 'A:codfw and (A:kubernetes-masters or A:kubernetes-workers)' 'disable-puppet "Reinitializing cluster - T277191"'
  • Power off the old masters too. SSH to ganeti01.svc.<site>.wmnet and run:
    • sudo gnt-instance shutdown -f argon.eqiad.wmnet
    • sudo gnt-instance shutdown -f chlorine.eqiad.wmnet
    • sudo gnt-instance shutdown -f acrux.codfw.wmnet
    • sudo gnt-instance shutdown -f acrab.codfw.wmnet
  • Regenerate the codfw master cert using cergen.
    • Revoke and remove the old cert on puppetmaster1001 with sudo puppet cert clean kubemaster.svc.codfw.wmnet
    • Run cergen: sudo cergen --base-path /srv/private/modules/secret/secrets/certificates --generate /srv/private/modules/secret/secrets/certificates/certificate.manifests.d
    • Copy certs to files/ssl
  • Merge: https://gerrit.wikimedia.org/r/c/operations/puppet/+/671174
  • Start reimaging nodes (checkmarks in T273279)
  • Empty etcd (ETCDCTL_API=3 etcdctl --endpoints https://foobar.site.wmnet:2379 del "" --from-key=true)
  • Reboot etcd servers (checkmarks in T273278)
  • Merge: https://gerrit.wikimedia.org/r/c/operations/puppet/+/671171
  • Set profile::kubernetes::master::controllermanager_token: ... in the private Puppet repo's hieradata/role/codfw/kubernetes/master.yaml
  • Image the new masters
  • helmfile sync admin_ng
  • Deploy all services (use deploy_all.sh script)
  • Check all services (service-checker if possible)
  • End downtime of services
  • Decommission the old masters
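
The task does not record the exact confctl invocations used for the depool after the cookbook failed. As a hedged sketch, the loop below only prints one command per service, reusing the command shape from the restbase-async step. Whether the same selector was used for every service is an assumption, and the function name depool_cmds is hypothetical.

```shell
#!/bin/sh
# Dry run: print (do not execute) one confctl command per service.
# The command shape is copied from the restbase-async step above; whether
# the same selector was used for the actual depool is an assumption.
depool_cmds() {
    dc="$1"
    shift
    for svc in "$@"; do
        printf 'sudo confctl --object discovery select "name=%s,dnsdisc=%s" set/pooled=false\n' "$dc" "$svc"
    done
}

# Example with three of the services from the depool list.
depool_cmds codfw apertium api-gateway blubberoid
```

Printing the commands first makes it possible to review the selectors before anything is depooled; repooling after the upgrade would use set/pooled=true in the same way.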

Action items

Event Timeline

JMeybohm updated the task description.
JMeybohm moved this task from Incoming 🐫 to Doing 😎 on the serviceops board.
JMeybohm updated the task description.
JMeybohm updated the task description.
JMeybohm removed a subscriber: JMeybohm.

Change 671144 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] Aggregate IPPools in codfw and eqiad, enable codfw

https://gerrit.wikimedia.org/r/671144

Change 671170 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] admin/: Remove codfw

https://gerrit.wikimedia.org/r/671170

Change 671171 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] kubernetes codfw: Apply role/hiera to new masters

https://gerrit.wikimedia.org/r/671171

Change 671174 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] kubernetes codfw: Populate new worker hiera keys for k8s update

https://gerrit.wikimedia.org/r/671174

Icinga downtime set by akosiaris@cumin1001 for 1 day, 0:00:00 18 host(s) and their services with reason: Reinitialize codfw k8s cluster with new etcd

acrab.codfw.wmnet,acrux.codfw.wmnet,kubernetes[2001-2016].codfw.wmnet
akosiaris updated the task description.
JMeybohm updated the task description.

Mentioned in SAL (#wikimedia-operations) [2021-03-16T09:34:49Z] <akosiaris> poweroff acrux and acrab T277191

Change 672672 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Add new kubemaster.svc.codfw.wmnet cert

https://gerrit.wikimedia.org/r/672672

Change 672672 merged by Alexandros Kosiaris:
[operations/puppet@production] Add new kubemaster.svc.codfw.wmnet cert

https://gerrit.wikimedia.org/r/672672

Mentioned in SAL (#wikimedia-operations) [2021-03-16T09:59:17Z] <akosiaris> Push new certs for kubemaster.svc.codfw.wmnet - T277191

Change 671174 merged by JMeybohm:
[operations/puppet@production] kubernetes codfw: Populate new worker hiera keys for k8s update

https://gerrit.wikimedia.org/r/671174

Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts:

['kubernetes2001.codfw.wmnet', 'kubernetes2002.codfw.wmnet', 'kubernetes2003.codfw.wmnet', 'kubernetes2004.codfw.wmnet', 'kubernetes2007.codfw.wmnet', 'kubernetes2008.codfw.wmnet', 'kubernetes2009.codfw.wmnet', 'kubernetes2010.codfw.wmnet', 'kubernetes2011.codfw.wmnet', 'kubernetes2012.codfw.wmnet', 'kubernetes2013.codfw.wmnet', 'kubernetes2014.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103161025_jayme_8000.log.

Change 671171 merged by Alexandros Kosiaris:
[operations/puppet@production] kubernetes codfw: Apply role/hiera to new masters

https://gerrit.wikimedia.org/r/671171

Change 672690 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Correctly add new kubemaster.svc.codfw.wmnet cert

https://gerrit.wikimedia.org/r/672690

Change 672690 merged by Alexandros Kosiaris:
[operations/puppet@production] Correctly add new kubemaster.svc.codfw.wmnet cert

https://gerrit.wikimedia.org/r/672690

Completed auto-reimage of hosts:

['kubernetes2009.codfw.wmnet', 'kubernetes2011.codfw.wmnet', 'kubernetes2001.codfw.wmnet', 'kubernetes2002.codfw.wmnet', 'kubernetes2004.codfw.wmnet', 'kubernetes2003.codfw.wmnet', 'kubernetes2007.codfw.wmnet', 'kubernetes2014.codfw.wmnet', 'kubernetes2010.codfw.wmnet', 'kubernetes2008.codfw.wmnet', 'kubernetes2012.codfw.wmnet', 'kubernetes2013.codfw.wmnet']

and were ALL successful.

Change 671144 merged by jenkins-bot:
[operations/deployment-charts@master] Aggregate IPPools in codfw and eqiad, enable codfw

https://gerrit.wikimedia.org/r/671144

Change 672708 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/homer/public@master] Add kubernetes2017 to BGP

https://gerrit.wikimedia.org/r/672708

Change 672708 merged by jenkins-bot:
[operations/homer/public@master] Add kubernetes2017 to BGP

https://gerrit.wikimedia.org/r/672708

Change 672713 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Move profile::kubernetes::node::cni_config to eqiad only

https://gerrit.wikimedia.org/r/672713

Change 672713 merged by Alexandros Kosiaris:
[operations/puppet@production] Move profile::kubernetes::node::cni_config to eqiad only

https://gerrit.wikimedia.org/r/672713

Mentioned in SAL (#wikimedia-operations) [2021-03-16T13:03:11Z] <akosiaris> sync all services on the new codfw kubernetes cluster T277191

Icinga downtime set by akosiaris@cumin1001 for 16 days, 16:00:00 1 host(s) and their services with reason: Extend downtime for like a month until we remove the VMs

acrab.codfw.wmnet

Icinga downtime set by akosiaris@cumin1001 for 16 days, 16:00:00 1 host(s) and their services with reason: Extend downtime for like a month until we remove the VMs

acrux.codfw.wmnet

Change 671170 merged by jenkins-bot:
[operations/deployment-charts@master] admin/: Remove codfw

https://gerrit.wikimedia.org/r/671170

Change 674147 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] downtime: Support services and other special icinga host

https://gerrit.wikimedia.org/r/674147

akosiaris claimed this task.
akosiaris updated the task description.

Added the steps from the action items list to the eqiad task T277741. I am gonna boldly resolve this one; any extra followups will be tracked in the respective eqiad task.

Change 674147 merged by Alexandros Kosiaris:
[operations/puppet@production] downtime: Support services and other special icinga host

https://gerrit.wikimedia.org/r/674147

Change 674270 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] contool-data: Add kubernetes2017.codfw.wmnet

https://gerrit.wikimedia.org/r/674270

Change 674270 merged by Alexandros Kosiaris:
[operations/puppet@production] conftool-data: Add kubernetes2017.codfw.wmnet

https://gerrit.wikimedia.org/r/674270

Mentioned in SAL (#wikimedia-operations) [2021-03-23T12:58:00Z] <akosiaris> remove and decomission argon, chroline, acrab, acrux T277741, T277191

Change 674307 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Decommission argon, chlorine, acrab, acrux

https://gerrit.wikimedia.org/r/674307

Change 674307 merged by Alexandros Kosiaris:
[operations/puppet@production] Decommission argon, chlorine, acrab, acrux

https://gerrit.wikimedia.org/r/674307

cookbooks.sre.hosts.decommission executed by akosiaris@cumin1001 for hosts: acrux.codfw.wmnet

  • acrux.codfw.wmnet (WARN)
    • Failed downtime host on Icinga (likely already removed)
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2)

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by akosiaris@cumin1001 for hosts: acrab.codfw.wmnet

  • acrab.codfw.wmnet (WARN)
    • Failed downtime host on Icinga (likely already removed)
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox

cookbooks.sre.hosts.decommission executed by akosiaris@cumin1001 for hosts: argon.eqiad.wmnet

  • argon.eqiad.wmnet (WARN)
    • Failed downtime host on Icinga (likely already removed)
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox