
Update Kubernetes cluster eqiad to kubernetes 1.16
Closed, Resolved, Public

Description

Sister task of T277191

We will upgrade the Kubernetes cluster in eqiad to Kubernetes 1.16 and Calico 3.17, as we did for codfw in T277191.

This includes:

  • Setting up new master VMs kubemaster100[12].eqiad.wmnet (VMs created in T276204)
  • Rebooting kubetcd[1004-1006].eqiad.wmnet for T273278
  • Reimaging worker nodes kubernetes[1001-1016].eqiad.wmnet
    • Add role to kubernetes1017.eqiad.wmnet (latest addition to cluster)
    • Add homer/public change for kubernetes1017
    • Pool kubernetes1017 in conftool
    • With Kernel 4.19 T262527 (which also fixes issues described in T273279)

The plan is roughly:

Preparation

  • Generate a list of all service FQDNs for this DC (from namespaces or service.yaml) as $SERVICE_NAMES:
apertium.svc.eqiad.wmnet
api-gateway.svc.eqiad.wmnet
blubberoid.svc.eqiad.wmnet
citoid.svc.eqiad.wmnet
cxserver.svc.eqiad.wmnet
echostore.svc.eqiad.wmnet
eventgate-analytics.svc.eqiad.wmnet
eventgate-analytics-external.svc.eqiad.wmnet
eventgate-logging-external.svc.eqiad.wmnet
eventgate-main.svc.eqiad.wmnet
eventstreams.svc.eqiad.wmnet
eventstreams-internal.svc.eqiad.wmnet
linkrecommendation.svc.eqiad.wmnet
mathoid.svc.eqiad.wmnet
mobileapps.svc.eqiad.wmnet
proton.svc.eqiad.wmnet
push-notifications.svc.eqiad.wmnet
recommendation-api.svc.eqiad.wmnet
sessionstore.svc.eqiad.wmnet
similar-users.svc.eqiad.wmnet
termbox.svc.eqiad.wmnet
wikifeeds.svc.eqiad.wmnet
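
The list above follows a single pattern, <name>.svc.<dc>.wmnet. As a minimal sketch of the preparation step, assuming the short service names have already been collected (from namespaces or service.yaml) into a plain Python list, the FQDNs could be built like this. The function name `service_fqdns` and the sample names are illustrative only, not part of any existing tooling.

```python
def service_fqdns(names, dc="eqiad", domain="wmnet"):
    """Return de-duplicated service FQDNs, ordered by short service name.

    Assumption: every service follows the <name>.svc.<dc>.<domain> pattern
    seen in the list above.
    """
    return [f"{name}.svc.{dc}.{domain}" for name in sorted(set(names))]


# Hypothetical input: a few short names gathered beforehand.
names = ["mathoid", "citoid", "apertium", "citoid"]
SERVICE_NAMES = service_fqdns(names)
print("\n".join(SERVICE_NAMES))
```

Sorting by the short name rather than by the full FQDN keeps a service such as eventgate-analytics ahead of eventgate-analytics-external, which is the order used in the list above.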

Actions

Action items

TBD

Event Timeline

akosiaris updated the task description.

Change 672709 had a related patch set uploaded (by JMeybohm; owner: Alexandros Kosiaris):
[operations/homer/public@master] Add kubernetes1017 to BGP peers

https://gerrit.wikimedia.org/r/672709

Change 673949 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] kubernetes eqiad: Populate hiera keys for k8s worker updates

https://gerrit.wikimedia.org/r/673949

Change 673952 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] kubernetes eqiad: Apply role and hiera values to new masters

https://gerrit.wikimedia.org/r/673952

Change 673955 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] admin_ng: Enable eqiad

https://gerrit.wikimedia.org/r/673955

Change 673956 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] Remove helmfile.d/admin

https://gerrit.wikimedia.org/r/673956

Change 674147 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] downtime: Support services and other special icinga host

https://gerrit.wikimedia.org/r/674147

Change 674147 merged by Alexandros Kosiaris:
[operations/puppet@production] downtime: Support services and other special icinga host

https://gerrit.wikimedia.org/r/674147

Icinga downtime set by akosiaris@cumin1001 for 1 day, 0:00:00 on 18 host(s) and their services with reason: Reinitialize eqiad k8s cluster with new etcd

argon.eqiad.wmnet,chlorine.eqiad.wmnet,kubernetes[1001-1016].eqiad.wmnet
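
The host expression above uses a bracketed numeric range. As a rough sketch, assuming only simple `[start-end]` ranges and comma-separated items (not the full syntax the real tooling accepts), it can be expanded to confirm the host count reported in the downtime message. The helper `expand_hosts` is hypothetical.

```python
import re

# Matches items such as "kubernetes[1001-1016].eqiad.wmnet".
_RANGE = re.compile(
    r"^(?P<prefix>[^\[]*)\[(?P<start>\d+)-(?P<end>\d+)\](?P<suffix>.*)$"
)


def expand_hosts(expr):
    """Expand a comma-separated host expression into individual hostnames.

    Assumption: each item is either a plain hostname or contains exactly
    one [start-end] numeric range; no other syntax is handled.
    """
    hosts = []
    for item in expr.split(","):
        match = _RANGE.match(item)
        if match is None:
            hosts.append(item)
            continue
        width = len(match["start"])  # preserve zero padding, if any
        for n in range(int(match["start"]), int(match["end"]) + 1):
            hosts.append(f"{match['prefix']}{n:0{width}d}{match['suffix']}")
    return hosts


hosts = expand_hosts(
    "argon.eqiad.wmnet,chlorine.eqiad.wmnet,kubernetes[1001-1016].eqiad.wmnet"
)
print(len(hosts))  # 18: two old masters plus sixteen workers
```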

Mentioned in SAL (#wikimedia-operations) [2021-03-23T08:25:10Z] <akosiaris> beginning the k8s upgrade/reinit process. T277741

Mentioned in SAL (#wikimedia-operations) [2021-03-23T08:28:17Z] <akosiaris> downtime all services in T277741 for 24H

Mentioned in SAL (#wikimedia-operations) [2021-03-23T08:33:39Z] <akosiaris> eqiad services in k8s depooled. T277741

Mentioned in SAL (#wikimedia-operations) [2021-03-23T08:43:22Z] <akosiaris> poweroff argon and chlorine T277741

Change 672709 merged by jenkins-bot:
[operations/homer/public@master] Add kubernetes1017 to BGP peers

https://gerrit.wikimedia.org/r/672709

Change 674261 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Add kubemaster.svc.eqiad.wmnet.cert

https://gerrit.wikimedia.org/r/674261

Change 674261 merged by Alexandros Kosiaris:
[operations/puppet@production] Add kubemaster.svc.eqiad.wmnet.cert

https://gerrit.wikimedia.org/r/674261

Change 673949 merged by JMeybohm:
[operations/puppet@production] kubernetes eqiad: Populate hiera keys for k8s worker updates

https://gerrit.wikimedia.org/r/673949

Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts:

['kubernetes1001.eqiad.wmnet', 'kubernetes1002.eqiad.wmnet', 'kubernetes1003.eqiad.wmnet', 'kubernetes1004.eqiad.wmnet', 'kubernetes1007.eqiad.wmnet', 'kubernetes1008.eqiad.wmnet', 'kubernetes1009.eqiad.wmnet', 'kubernetes1010.eqiad.wmnet', 'kubernetes1011.eqiad.wmnet', 'kubernetes1012.eqiad.wmnet', 'kubernetes1013.eqiad.wmnet', 'kubernetes1014.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103230904_jayme_30602.log.

Mentioned in SAL (#wikimedia-operations) [2021-03-23T09:05:03Z] <akosiaris> reboot kubetcd100[456] for kernel upgrades. T277741 T273278

Change 673952 merged by Alexandros Kosiaris:
[operations/puppet@production] kubernetes eqiad: Apply role and hiera values to new masters

https://gerrit.wikimedia.org/r/673952

Change 674269 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] contool-data: Add kubernetes1017.eqiad.wmnet

https://gerrit.wikimedia.org/r/674269

Change 674269 merged by Alexandros Kosiaris:
[operations/puppet@production] contool-data: Add kubernetes1017.eqiad.wmnet

https://gerrit.wikimedia.org/r/674269

Change 673955 merged by jenkins-bot:
[operations/deployment-charts@master] admin_ng: Enable eqiad

https://gerrit.wikimedia.org/r/673955

Mentioned in SAL (#wikimedia-operations) [2021-03-23T09:53:57Z] <akosiaris> deploy helmfile.d/admin_ng for eqiad T277741

Completed auto-reimage of hosts:

['kubernetes1003.eqiad.wmnet', 'kubernetes1002.eqiad.wmnet', 'kubernetes1010.eqiad.wmnet', 'kubernetes1008.eqiad.wmnet', 'kubernetes1013.eqiad.wmnet', 'kubernetes1007.eqiad.wmnet', 'kubernetes1012.eqiad.wmnet', 'kubernetes1004.eqiad.wmnet', 'kubernetes1009.eqiad.wmnet', 'kubernetes1001.eqiad.wmnet', 'kubernetes1014.eqiad.wmnet', 'kubernetes1011.eqiad.wmnet']

and were ALL successful.
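
The completion message lists the hosts in a different order than the launch message, so a set comparison is an easy way to cross-check them. The sketch below uses only the hostnames recorded in this task; the variable and function names are illustrative. It also reports which workers from the planned range kubernetes[1001-1016] were not part of this particular run. The task does not say how those were handled, so the code only lists them.

```python
def fqdn(n):
    """Build a worker FQDN from its number."""
    return f"kubernetes{n}.eqiad.wmnet"


# Planned range from the task description.
planned = {fqdn(n) for n in range(1001, 1017)}

# Hosts from the wmf-auto-reimage launch message.
launched = {fqdn(n) for n in (1001, 1002, 1003, 1004, *range(1007, 1015))}

# Hosts from the completion message, in the order reported.
completed = {
    fqdn(n)
    for n in (1003, 1002, 1010, 1008, 1013, 1007,
              1012, 1004, 1009, 1001, 1014, 1011)
}

missing = sorted(launched - completed)    # launched but never completed
not_in_run = sorted(planned - launched)   # planned but not in this run

print("missing:", missing)
print("not in this run:", not_in_run)
```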

Mentioned in SAL (#wikimedia-operations) [2021-03-23T10:56:14Z] <jayme> all services re-deployed to k8s eqiad - T277741

Mentioned in SAL (#wikimedia-operations) [2021-03-23T12:17:07Z] <akosiaris> remove all scheduled downtimes for k8s cluster. T277741

Change 673956 merged by jenkins-bot:
[operations/deployment-charts@master] Remove helmfile.d/admin

https://gerrit.wikimedia.org/r/673956

Mentioned in SAL (#wikimedia-operations) [2021-03-23T12:58:00Z] <akosiaris> remove and decommission argon, chlorine, acrab, acrux T277741, T277191

Change 674307 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Decommission argon, chlorine, acrab, acrux

https://gerrit.wikimedia.org/r/674307

Mentioned in SAL (#wikimedia-operations) [2021-03-23T14:06:05Z] <akosiaris> pool a few services in eqiad k8s. T277741

Mentioned in SAL (#wikimedia-operations) [2021-03-23T14:20:03Z] <akosiaris> pool a few more services in eqiad k8s. T277741

Change 674307 merged by Alexandros Kosiaris:
[operations/puppet@production] Decommission argon, chlorine, acrab, acrux

https://gerrit.wikimedia.org/r/674307

Mentioned in SAL (#wikimedia-operations) [2021-03-23T14:43:27Z] <akosiaris> pool more services in eqiad k8s. T277741. Only the very large ones, traffic-wise, are still on codfw

JMeybohm claimed this task.
JMeybohm subscribed.

It's safe to say we did this, and we have tasks for the follow-ups (mostly from T277191).

cookbooks.sre.hosts.decommission executed by akosiaris@cumin1001 for hosts: chlorine.eqiad.wmnet

  • chlorine.eqiad.wmnet (WARN)
    • Failed downtime host on Icinga (likely already removed)
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox