⚓ T277191 Update Kubernetes cluster codfw to kubernetes 1.16

Subject	Repo	Branch	Lines +/-
Decommission argon, chlorine, acrab, acrux	operations/puppet	production	+1 -25
conftool-data: Add kubernetes2017.codfw.wmnet	operations/puppet	production	+1 -0
downtime: Support services and other special icinga host	operations/puppet	production	+5 -2
admin/: Remove codfw	operations/deployment-charts	master	+0 -201
Move profile::kubernetes::node::cni_config to eqiad only	operations/puppet	production	+2 -1
Add kubernetes2017 to BGP	operations/homer/public	master	+1 -0
Aggregate IPPools in codfw and eqiad, enable codfw	operations/deployment-charts	master	+8 -22
Correctly add new kubemaster.svc.codfw.wmnet cert	operations/puppet	production	+23 -29
kubernetes codfw: Apply role/hiera to new masters	operations/puppet	production	+23 -3
kubernetes codfw: Populate new worker hiera keys for k8s update	operations/puppet	production	+15 -4
Add new kubemaster.svc.codfw.wmnet cert	operations/puppet	production	+29 -24

		Status	Subtype	Assigned	Task
		Resolved		akosiaris	T244335 Upgrade kubernetes clusters to v1.16
		Resolved		akosiaris	T277191 Update Kubernetes cluster codfw to kubernetes 1.16

Restricted Application added a subscriber: Aklapper. · Mar 11 2021, 4:22 PM

JMeybohm triaged this task as High priority. · Mar 11 2021, 4:23 PM

JMeybohm updated the task description.

JMeybohm moved this task from Incoming 🐫 to Doing 😎 on the serviceops board.

JMeybohm updated the task description. · Mar 11 2021, 4:25 PM

akosiaris updated the task description. · Mar 12 2021, 11:18 AM

JMeybohm updated the task description. · Mar 12 2021, 11:28 AM

JMeybohm updated the task description. · Mar 12 2021, 11:41 AM

JMeybohm updated the task description. · Mar 12 2021, 11:51 AM

JMeybohm updated the task description. · Mar 12 2021, 11:54 AM

akosiaris updated the task description. · Mar 12 2021, 11:56 AM

JMeybohm updated the task description. · Mar 12 2021, 11:58 AM

JMeybohm updated the task description.

JMeybohm unsubscribed.

akosiaris updated the task description. · Mar 12 2021, 12:06 PM

JMeybohm updated the task description. · Mar 12 2021, 12:06 PM

akosiaris updated the task description. · Mar 12 2021, 12:10 PM

Change 671144 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] Aggregate IPPools in codfw and eqiad, enable codfw

https://gerrit.wikimedia.org/r/671144

gerritbot added a project: Patch-For-Review. · Mar 12 2021, 2:23 PM

JMeybohm updated the task description. · Mar 12 2021, 2:45 PM

Change 671170 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] admin/: Remove codfw

https://gerrit.wikimedia.org/r/671170

Change 671171 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] kubernetes codfw: Apply role/hiera to new masters

https://gerrit.wikimedia.org/r/671171

Change 671174 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] kubernetes codfw: Populate new worker hiera keys for k8s update

https://gerrit.wikimedia.org/r/671174

JMeybohm updated the task description. · Mar 12 2021, 3:17 PM

JMeybohm updated the task description. · Mar 12 2021, 3:21 PM

JMeybohm updated the task description. · Mar 16 2021, 8:34 AM

akosiaris updated the task description. · Mar 16 2021, 8:37 AM

JMeybohm updated the task description. · Mar 16 2021, 8:37 AM

akosiaris updated the task description. · Mar 16 2021, 8:40 AM

JMeybohm updated the task description. · Mar 16 2021, 8:45 AM

JMeybohm updated the task description. · Mar 16 2021, 8:48 AM

akosiaris updated the task description. · Mar 16 2021, 8:58 AM

Icinga downtime set by akosiaris@cumin1001 for 1 day, 0:00:00 18 host(s) and their services with reason: Reinitialize codfw k8s cluster with new etcd

acrab.codfw.wmnet,acrux.codfw.wmnet,kubernetes[2001-2016].codfw.wmnet
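The host expression above uses a bracketed numeric range. As a rough sketch (the helper `expand_hosts` is hypothetical, and it handles only the single `[start-end]` form seen here, not the full syntax of whatever tool produced the message), expanding it gives the 18 hosts mentioned in the downtime message:

```python
import re


def expand_hosts(expression):
    """Expand a comma-separated host expression.

    Hypothetical helper: supports at most one [start-end] numeric
    range per comma-separated item, which is all this message needs.
    """
    hosts = []
    for item in expression.split(","):
        item = item.strip()
        match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", item)
        if match is None:
            # Plain hostname without a range.
            hosts.append(item)
            continue
        prefix, start, end, suffix = match.groups()
        width = len(start)  # keep zero padding if the range uses it
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


hosts = expand_hosts(
    "acrab.codfw.wmnet,acrux.codfw.wmnet,kubernetes[2001-2016].codfw.wmnet"
)
print(len(hosts))
```

The two etcd VMs plus the sixteen `kubernetes20xx` workers account for the host count in the message.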

akosiaris updated the task description. · Mar 16 2021, 9:01 AM

akosiaris updated the task description.

JMeybohm updated the task description. · Mar 16 2021, 9:07 AM

JMeybohm updated the task description. · Mar 16 2021, 9:12 AM

JMeybohm updated the task description. · Mar 16 2021, 9:24 AM

JMeybohm updated the task description.

Mentioned in SAL (#wikimedia-operations) [2021-03-16T09:34:49Z] <akosiaris> poweroff acrux and acrab T277191

JMeybohm updated the task description. · Mar 16 2021, 9:47 AM

Change 672672 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Add new kubemaster.svc.codfw.wmnet cert

https://gerrit.wikimedia.org/r/672672

Change 672672 merged by Alexandros Kosiaris:
[operations/puppet@production] Add new kubemaster.svc.codfw.wmnet cert

https://gerrit.wikimedia.org/r/672672

Mentioned in SAL (#wikimedia-operations) [2021-03-16T09:59:17Z] <akosiaris> Push new certs for kubemaster.svc.codfw.wmnet - T277191

Change 671174 merged by JMeybohm:
[operations/puppet@production] kubernetes codfw: Populate new worker hiera keys for k8s update

https://gerrit.wikimedia.org/r/671174

Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts:

['kubernetes2001.codfw.wmnet', 'kubernetes2002.codfw.wmnet', 'kubernetes2003.codfw.wmnet', 'kubernetes2004.codfw.wmnet', 'kubernetes2007.codfw.wmnet', 'kubernetes2008.codfw.wmnet', 'kubernetes2009.codfw.wmnet', 'kubernetes2010.codfw.wmnet', 'kubernetes2011.codfw.wmnet', 'kubernetes2012.codfw.wmnet', 'kubernetes2013.codfw.wmnet', 'kubernetes2014.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103161025_jayme_8000.log.

Change 671171 merged by Alexandros Kosiaris:
[operations/puppet@production] kubernetes codfw: Apply role/hiera to new masters

https://gerrit.wikimedia.org/r/671171

Change 672690 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Correctly add new kubemaster.svc.codfw.wmnet cert

https://gerrit.wikimedia.org/r/672690

Change 672690 merged by Alexandros Kosiaris:
[operations/puppet@production] Correctly add new kubemaster.svc.codfw.wmnet cert

https://gerrit.wikimedia.org/r/672690

akosiaris updated the task description. · Mar 16 2021, 11:15 AM

Completed auto-reimage of hosts:

['kubernetes2009.codfw.wmnet', 'kubernetes2011.codfw.wmnet', 'kubernetes2001.codfw.wmnet', 'kubernetes2002.codfw.wmnet', 'kubernetes2004.codfw.wmnet', 'kubernetes2003.codfw.wmnet', 'kubernetes2007.codfw.wmnet', 'kubernetes2014.codfw.wmnet', 'kubernetes2010.codfw.wmnet', 'kubernetes2008.codfw.wmnet', 'kubernetes2012.codfw.wmnet', 'kubernetes2013.codfw.wmnet']

and were ALL successful.
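The completed list is reported in a different order from the launched list. A quick check, with the host numbers copied from the two messages above (the variable names are illustrative), suggests both contain the same 12 hosts:

```python
# Host numbers taken from the "launched" and "completed" messages above.
launched_ids = [2001, 2002, 2003, 2004, 2007, 2008,
                2009, 2010, 2011, 2012, 2013, 2014]
completed_ids = [2009, 2011, 2001, 2002, 2004, 2003,
                 2007, 2014, 2010, 2008, 2012, 2013]

launched = {f"kubernetes{n}.codfw.wmnet" for n in launched_ids}
completed = {f"kubernetes{n}.codfw.wmnet" for n in completed_ids}

# Hosts that were launched but never reported as completed, and vice versa.
missing = sorted(launched - completed)
unexpected = sorted(completed - launched)

print(len(launched), missing, unexpected)
```

Both difference lists come out empty, which matches the "ALL successful" summary.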

Change 671144 merged by jenkins-bot:
[operations/deployment-charts@master] Aggregate IPPools in codfw and eqiad, enable codfw

https://gerrit.wikimedia.org/r/671144

Change 672708 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/homer/public@master] Add kubernetes2017 to BGP

https://gerrit.wikimedia.org/r/672708

Change 672708 merged by jenkins-bot:
[operations/homer/public@master] Add kubernetes2017 to BGP

https://gerrit.wikimedia.org/r/672708

JMeybohm updated the task description. · Mar 16 2021, 12:30 PM

akosiaris mentioned this in rOHPUc91a6c97cb78: Add kubernetes2017 to BGP. · Mar 16 2021, 12:30 PM

Change 672713 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Move profile::kubernetes::node::cni_config to eqiad only

https://gerrit.wikimedia.org/r/672713

Change 672713 merged by Alexandros Kosiaris:
[operations/puppet@production] Move profile::kubernetes::node::cni_config to eqiad only

https://gerrit.wikimedia.org/r/672713

akosiaris updated the task description. · Mar 16 2021, 12:57 PM

Mentioned in SAL (#wikimedia-operations) [2021-03-16T13:03:11Z] <akosiaris> sync all services on the new codfw kubernetes cluster T277191

JMeybohm updated the task description. · Mar 16 2021, 1:37 PM

Icinga downtime set by akosiaris@cumin1001 for 16 days, 16:00:00 1 host(s) and their services with reason: Extend downtime for like a month until we remove the VMs

acrab.codfw.wmnet

Icinga downtime set by akosiaris@cumin1001 for 16 days, 16:00:00 1 host(s) and their services with reason: Extend downtime for like a month until we remove the VMs

acrux.codfw.wmnet
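The durations in the downtime messages (`1 day, 0:00:00` and `16 days, 16:00:00`) look like Python `timedelta` strings. Assuming that is how they were produced (the task does not say), a short sketch shows what the extended downtime amounts to in hours:

```python
from datetime import timedelta

# Durations as they appear in the two downtime messages.
initial = timedelta(days=1)
extended = timedelta(days=16, hours=16)

print(str(initial))
print(str(extended))

# Total length of the extended downtime in hours.
extended_hours = extended.total_seconds() / 3600
print(extended_hours)
```

Under that assumption the extension is 400 hours, somewhat over two weeks rather than a full month.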

RhinosF1 subscribed. · Mar 16 2021, 3:55 PM

Mvolz subscribed. · Mar 17 2021, 9:41 AM

JMeybohm updated the task description. · Mar 17 2021, 2:00 PM

Change 671170 merged by jenkins-bot:
[operations/deployment-charts@master] admin/: Remove codfw

https://gerrit.wikimedia.org/r/671170

JMeybohm updated the task description. · Mar 17 2021, 5:00 PM

JMeybohm updated the task description. · Mar 18 2021, 9:51 AM

akosiaris mentioned this in T277741: Update Kubernetes cluster eqiad to kubernetes 1.16. · Mar 18 2021, 10:02 AM

JMeybohm updated the task description. · Mar 18 2021, 11:42 AM

Change 674147 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] downtime: Support services and other special icinga host

https://gerrit.wikimedia.org/r/674147

Added steps from the action items list to the eqiad task T277741. I am gonna boldly resolve this one; any extra follow-ups will be tracked in the respective eqiad task.

Change 674147 merged by Alexandros Kosiaris:
[operations/puppet@production] downtime: Support services and other special icinga host

https://gerrit.wikimedia.org/r/674147

Change 674270 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] conftool-data: Add kubernetes2017.codfw.wmnet

https://gerrit.wikimedia.org/r/674270

Change 674270 merged by Alexandros Kosiaris:
[operations/puppet@production] conftool-data: Add kubernetes2017.codfw.wmnet

https://gerrit.wikimedia.org/r/674270

Mentioned in SAL (#wikimedia-operations) [2021-03-23T12:58:00Z] <akosiaris> remove and decommission argon, chlorine, acrab, acrux T277741, T277191

Change 674307 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Decommission argon, chlorine, acrab, acrux

https://gerrit.wikimedia.org/r/674307

Change 674307 merged by Alexandros Kosiaris:
[operations/puppet@production] Decommission argon, chlorine, acrab, acrux

https://gerrit.wikimedia.org/r/674307

JMeybohm mentioned this in T277677: Write a cookbook to set a k8s cluster in maintenance mode. · Mar 24 2021, 9:16 AM

cookbooks.sre.hosts.decommission executed by akosiaris@cumin1001 for hosts: acrux.codfw.wmnet

acrux.codfw.wmnet (WARN)
- Failed downtime host on Icinga (likely already removed)
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox

COMMON_STEPS (FAIL)
- Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2)

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by akosiaris@cumin1001 for hosts: acrab.codfw.wmnet

acrab.codfw.wmnet (WARN)
- Failed downtime host on Icinga (likely already removed)
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox

cookbooks.sre.hosts.decommission executed by akosiaris@cumin1001 for hosts: argon.eqiad.wmnet

argon.eqiad.wmnet (WARN)
- Failed downtime host on Icinga (likely already removed)
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
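The decommission reports above share a simple shape: a `name (STATUS)` header followed by `- step` lines. As a minimal sketch, assuming only that shape (the function `parse_report` is hypothetical and not part of the cookbook), the sections that failed can be picked out like this:

```python
import re


def parse_report(text):
    """Map each 'name (STATUS)' header to its status and step lines.

    Hypothetical parser for the report layout shown in this task;
    it is not the cookbook's own code.
    """
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("- "):
            if current is not None:
                sections[current]["steps"].append(line[2:])
            continue
        header = re.fullmatch(r"(\S+) \((\w+)\)", line)
        if header:
            current = header.group(1)
            sections[current] = {"status": header.group(2), "steps": []}
    return sections


# Shortened excerpt of the acrux report above.
sample = """\
acrux.codfw.wmnet (WARN)
- Failed downtime host on Icinga (likely already removed)
- Found Ganeti VM

COMMON_STEPS (FAIL)
- Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2)
"""

report = parse_report(sample)
failed = [name for name, section in report.items() if section["status"] == "FAIL"]
print(failed)
```

For the acrux run this singles out `COMMON_STEPS`, the section where the `sre.dns.netbox` cookbook failed; the per-host section only carries a warning.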

akosiaris mentioned this in T302701: Re-evaluate ip pools for ml-serve-{eqiad,codfw}. · Mar 2 2022, 3:27 PM

elukey mentioned this in T304673: Re-initialize the Kubernetes ML Serve clusters. · Mar 25 2022, 10:22 AM

Update Kubernetes cluster codfw to kubernetes 1.16
Closed, Resolved · Public