Upgrade the ml-etcd clusters to bullseye and PKI
Closed, Declined (Public)
Assigned To

None

Authored By

	elukey
	Feb 27 2023, 2:39 PM

Description

The ml-etcd clusters should be migrated to bullseye and PKI before the whole clusters are upgraded to k8s 1.23.

The idea is to reimage one node at a time, removing and re-adding each one as an etcd member so that etcd can bootstrap correctly. Finally, we'll apply the PKI settings.
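
A minimal sketch of the remove/add step for each node, assuming the etcdctl v3 commands `member remove` and `member add`. The member IDs, the use of the FQDN as member name, the full hostnames for 2002 and 2003, and the peer port (2380, etcd's default) are assumptions for illustration and are not taken from this task. Where the reimage itself happens relative to these two commands is also left open here.

```python
PEER_PORT = 2380  # etcd's default peer port; assumed, not stated in the task


def member_swap_commands(host, member_id):
    """Build the etcdctl (v3 API) commands that drop and re-register one member."""
    peer_url = f"https://{host}:{PEER_PORT}"
    return [
        f"etcdctl member remove {member_id}",
        # Using the FQDN as the member name is an assumption.
        f"etcdctl member add {host} --peer-urls={peer_url}",
    ]


def rolling_plan(members):
    """Return (host, commands) pairs, one host at a time, in the given order."""
    return [(host, member_swap_commands(host, mid)) for host, mid in members.items()]


# Hypothetical member IDs; real ones come from `etcdctl member list`.
members = {
    "ml-etcd2001.codfw.wmnet": "aaaaaaaaaaaaaaa1",
    "ml-etcd2002.codfw.wmnet": "aaaaaaaaaaaaaaa2",
    "ml-etcd2003.codfw.wmnet": "aaaaaaaaaaaaaaa3",
}

for host, commands in rolling_plan(members):
    print(f"# {host}")
    for command in commands:
        print(command)
```

Handling one host per iteration keeps two of the three members up at any time, which is what lets the ensemble retain quorum during the procedure.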

Details

	Subject	Repo	Branch	Lines +/-
	role::etcd::v3::ml_etcd: set the new discovery records	operations/puppet	production	+1 -1
	role::etcd::v3::ml_etcd: set cluster status to existing	operations/puppet	production	+0 -1

Related Objects

Status	Assigned	Task
Resolved	JMeybohm	T307943 Update Kubernetes clusters to v1.23
Resolved	elukey	T324542 Upgrade ML clusters to Kubernetes 1.23
Declined	None	T330662 Upgrade the ml-etcd clusters to bullseye and PKI

Event Timeline

elukey created this task. Feb 27 2023, 2:39 PM

Cookbook cookbooks.sre.ganeti.reimage was started by elukey@cumin1001 for host ml-etcd2001.codfw.wmnet with OS bullseye

Change 892462 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::etcd::v3::ml_etcd: set cluster status to existing

https://gerrit.wikimedia.org/r/892462

gerritbot added a project: Patch-For-Review. Feb 27 2023, 2:52 PM

Change 892462 merged by Elukey:

[operations/puppet@production] role::etcd::v3::ml_etcd: set cluster status to existing

https://gerrit.wikimedia.org/r/892462

Change 892466 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::etcd::v3::ml_etcd: set the new discovery records

https://gerrit.wikimedia.org/r/892466

Change 892466 merged by Elukey:

[operations/puppet@production] role::etcd::v3::ml_etcd: set the new discovery records

https://gerrit.wikimedia.org/r/892466

Maintenance_bot removed a project: Patch-For-Review. Feb 27 2023, 3:10 PM

Tried with 2001 but failed to make it work. The new etcd version on bullseye requires a new TLS SAN in every etcd daemon's certificate to be able to run leader elections. Since the other two nodes in the ensemble, 2002 and 2003, are still running Buster, their certificates don't have the new SAN and 2001 fails to trust them.

We could change the puppet code for etcd to allow this use case, but since that code is shared by a lot of important clusters, I'd rather reimage all the etcd nodes as part of the upgrade to k8s 1.23.
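
To illustrate the trust failure described above, here is a rough sketch that models each node's certificate as a set of SAN entries and lists the peers a bullseye node would reject. The SAN value is a hypothetical placeholder (the task does not name the actual SAN), and the per-node SAN sets are made up for the example rather than read from real certificates.

```python
# Hypothetical placeholder; the task does not say which SAN the new etcd requires.
REQUIRED_SAN = "new-san.example"


def untrusted_peers(peer_sans, required_san=REQUIRED_SAN):
    """Return the peers whose certificate SAN set lacks the required entry."""
    return sorted(peer for peer, sans in peer_sans.items() if required_san not in sans)


# Illustrative SAN sets: only the reimaged node carries the new entry.
peer_sans = {
    "ml-etcd2001": {"ml-etcd2001.codfw.wmnet", REQUIRED_SAN},
    "ml-etcd2002": {"ml-etcd2002.codfw.wmnet"},
    "ml-etcd2003": {"ml-etcd2003.codfw.wmnet"},
}

print(untrusted_peers(peer_sans))  # ['ml-etcd2002', 'ml-etcd2003']
```

With a mixed ensemble the list is never empty, which is why reimaging every node together sidesteps the problem.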

Cookbook cookbooks.sre.ganeti.reimage started by elukey@cumin1001 for host ml-etcd2001.codfw.wmnet with OS bullseye executed with errors:

ml-etcd2001 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Set boot to disk
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run failed, asking the operator what to do
- First Puppet run failed, asking the operator what to do
- First Puppet run failed, asking the operator what to do
- First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202302271445_elukey_247312_ml-etcd2001.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- The reimage failed, see the cookbook logs for the details

isarantopoulos moved this task from Unsorted to 2023-2024 Q3 Done on the Machine-Learning-Team board. Nov 20 2023, 11:38 AM
