Upgrade the ml-etcd clusters to bullseye and PKI
Closed, Declined (Public)

Description

The ml-etcd clusters should be migrated to bullseye and PKI before the clusters as a whole are upgraded to k8s 1.23.

The idea is to do one reimage at a time, removing and re-adding each member in turn so that etcd can bootstrap correctly. Finally we'll just apply the PKI settings.
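A minimal sketch of the per-node remove/add sequence described above, written as a command plan rather than something that touches a live cluster. The member IDs, the use of the FQDN as the member name, and the peer port 2380 are assumptions for illustration; the real values would come from `etcdctl member list`. The reimage itself happens between the two commands, and the reimaged node is expected to start with its cluster state set to "existing".

```python
from typing import Dict, List, Tuple

Command = List[str]


def member_swap_commands(name: str, member_id: str, peer_url: str) -> List[Command]:
    """Return the etcdctl (v3) commands to drop and re-add a single member.

    The reimage of the host happens between the two commands.
    """
    return [
        ["etcdctl", "member", "remove", member_id],
        ["etcdctl", "member", "add", name, f"--peer-urls={peer_url}"],
    ]


def rolling_plan(members: Dict[str, str], peer_port: int = 2380) -> List[Tuple[str, List[Command]]]:
    """Build a one-node-at-a-time plan from a {hostname: member_id} mapping.

    peer_port defaults to etcd's usual peer port; adjust if the cluster differs.
    """
    plan = []
    for name, member_id in members.items():
        peer_url = f"https://{name}:{peer_port}"
        plan.append((name, member_swap_commands(name, member_id, peer_url)))
    return plan


if __name__ == "__main__":
    # Hypothetical member IDs; only the 2001 hostname appears verbatim in this task.
    members = {
        "ml-etcd2001.codfw.wmnet": "aaaaaaaaaaaaaaaa",
        "ml-etcd2002.codfw.wmnet": "bbbbbbbbbbbbbbbb",
        "ml-etcd2003.codfw.wmnet": "cccccccccccccccc",
    }
    for host, commands in rolling_plan(members):
        print(f"# {host}")
        for cmd in commands:
            print(" ".join(cmd))
```

Printing the plan instead of executing it keeps the operator in the loop, since each step should only proceed once the cluster is healthy again.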

Event Timeline

Cookbook cookbooks.sre.ganeti.reimage was started by elukey@cumin1001 for host ml-etcd2001.codfw.wmnet with OS bullseye

Change 892462 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::etcd::v3::ml_etcd: set cluster status to existing

https://gerrit.wikimedia.org/r/892462

Change 892462 merged by Elukey:

[operations/puppet@production] role::etcd::v3::ml_etcd: set cluster status to existing

https://gerrit.wikimedia.org/r/892462

Change 892466 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::etcd::v3::ml_etcd: set the new discovery records

https://gerrit.wikimedia.org/r/892466

Change 892466 merged by Elukey:

[operations/puppet@production] role::etcd::v3::ml_etcd: set the new discovery records

https://gerrit.wikimedia.org/r/892466

I tried with 2001 but could not make it work. The new etcd version on bullseye requires a new TLS SAN in every etcd daemon's certificate in order to run leader elections. Since the other two nodes in the ensemble, 2002 and 2003, are still running buster, their certificates don't have the new SAN, and 2001 fails to trust them.
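The failure mode can be illustrated with a small check over the SAN lists of each node's certificate. This is only a sketch: the task does not name the SAN that the new etcd version expects, so `REQUIRED_SAN` and the per-node SAN lists below are hypothetical placeholders.

```python
from typing import Dict, List

# Hypothetical placeholder: the task does not state which SAN is required.
REQUIRED_SAN = "required-san.example"


def nodes_missing_san(sans_by_node: Dict[str, List[str]], required_san: str) -> List[str]:
    """Return the nodes whose certificate SAN list lacks required_san.

    A reimaged peer that insists on this SAN would refuse to trust these nodes.
    """
    return sorted(
        node for node, sans in sans_by_node.items() if required_san not in sans
    )


if __name__ == "__main__":
    # Hypothetical SAN lists for a mixed bullseye/buster ensemble.
    sans_by_node = {
        "ml-etcd2001": ["ml-etcd2001.codfw.wmnet", REQUIRED_SAN],
        "ml-etcd2002": ["ml-etcd2002.codfw.wmnet"],
        "ml-etcd2003": ["ml-etcd2003.codfw.wmnet"],
    }
    print(nodes_missing_san(sans_by_node, REQUIRED_SAN))
```

With the placeholder data above, the two buster nodes are reported as lacking the SAN, which matches the situation described in the comment.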

We could change the Puppet code for etcd to allow this use case, but since that code is shared by a lot of important clusters, I'd rather just reimage all the etcd nodes as part of the upgrade to k8s 1.23.

Cookbook cookbooks.sre.ganeti.reimage started by elukey@cumin1001 for host ml-etcd2001.codfw.wmnet with OS bullseye executed with errors:

  • ml-etcd2001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run failed, asking the operator what to do
    • First Puppet run failed, asking the operator what to do
    • First Puppet run failed, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202302271445_elukey_247312_ml-etcd2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details