Upgrade ganeti/ulsfo to Bullseye
Closed, Resolved (Public)

Description

Upgrade the Ganeti cluster in ulsfo to Bullseye:

  • Add component/ganeti3 to all nodes
  • Upgrade all nodes to the new Ganeti 3
  • Run sudo gnt-cluster upgrade --to 3.0

For each node in turn: empty it of primary and secondary instances, remove it from the cluster with gnt-node remove, reimage it to Bullseye, then re-add it with gnt-node add:

  • ganeti4001
  • ganeti4002
  • ganeti4003
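
The per-node procedure above can be sketched as a dry-run plan. This is a minimal sketch, not the tooling used on this task: node_plan is a hypothetical helper that only builds command strings and runs nothing. gnt-node remove and gnt-node add come from the description; using gnt-node migrate -f and gnt-node evacuate -s to empty a node, and the ulsfo.wmnet suffix for every node, are assumptions. The reimage step is left as a comment because the description does not give the exact invocation.

```python
NODES = ["ganeti4001", "ganeti4002", "ganeti4003"]
DOMAIN = "ulsfo.wmnet"  # assumption: all three nodes share this suffix


def node_plan(node: str, domain: str = DOMAIN) -> list[str]:
    """Return the ordered steps for one node as plain strings (nothing is run)."""
    fqdn = f"{node}.{domain}"
    return [
        # Assumption: migrate moves primary instances away,
        # evacuate -s moves secondary instances away.
        f"sudo gnt-node migrate -f {fqdn}",
        f"sudo gnt-node evacuate -s {fqdn}",
        # From the task description: remove, reimage, re-add.
        f"sudo gnt-node remove {fqdn}",
        f"# reimage {fqdn} to bullseye (sre.hosts.reimage cookbook)",
        f"sudo gnt-node add {fqdn}",
    ]


if __name__ == "__main__":
    # Print the plan one node at a time; nodes are handled sequentially
    # so the cluster always keeps capacity for the running instances.
    for node in NODES:
        print("\n".join(node_plan(node)))
        print()
```

A node that currently holds the cluster master role has to hand it over before it can be removed, which is why a master failover to ganeti4001 shows up later in the timeline.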

Event Timeline

Change 790643 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Enable ganeti3 component in ulsfo

https://gerrit.wikimedia.org/r/790643

Change 790643 merged by Muehlenhoff:

[operations/puppet@production] Enable ganeti3 component in ulsfo

https://gerrit.wikimedia.org/r/790643

Mentioned in SAL (#wikimedia-operations) [2022-05-11T07:05:18Z] <moritzm> updating ganeti4* to Ganeti 3.0.1-1~bpo10+1 T307997

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1001 for host ganeti4001.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1001 for host ganeti4001.ulsfo.wmnet with OS bullseye executed with errors:

  • ganeti4001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1001 for host ganeti4001.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1001 for host ganeti4001.ulsfo.wmnet with OS bullseye executed with errors:

  • ganeti4001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1001 for host ganeti4001.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1001 for host ganeti4001.ulsfo.wmnet with OS bullseye executed with errors:

  • ganeti4001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Mentioned in SAL (#wikimedia-sre) [2022-05-11T15:15:51Z] <robh> ganeti4001 updating all firmware revisions T307997

Mentioned in SAL (#wikimedia-sre) [2022-05-11T15:53:04Z] <robh> firmware upgrade for ganeti4001 complete T307997 (bios, nics, idrac) and manually confirmed first 10G port is link active (it is) and is set to pxe

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1001 for host ganeti4001.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1001 for host ganeti4001.ulsfo.wmnet with OS bullseye executed with errors:

  • ganeti4001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Without an update of the system firmware and the NIC firmware, there was no network link in the Debian installer. Rob updated everything to the latest version, but it seems we are now running into the firmware regression that Papaul previously identified for cloudvirt1025/cloudvirt1026 (https://phabricator.wikimedia.org/T304483): the initial PXE boot fails with "Failed to load ldlinux.c32". We'll need a similar downgrade here, since these servers use the same NICs as the cloudvirts.

I've downgraded ganeti4001 to 21.60.22.11, flashed and confirmed. @MoritzMuehlenhoff give it a shot now!

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1001 for host ganeti4001.ulsfo.wmnet with OS bullseye

OK, since the last update, the switch now shows no connection. All previous firmware versions had a link light on the switch (I checked when I updated to the newest one on May 11). Currently troubleshooting.

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1001 for host ganeti4001.ulsfo.wmnet with OS bullseye executed with errors:

  • ganeti4001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host ganeti4001.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti4001.ulsfo.wmnet with OS bullseye completed:

  • ganeti4001 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205121751_robh_1295746_ganeti4001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
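
The cookbook log path reported above packs several fields into the file name. As a minimal sketch, assuming the name is laid out as timestamp, user, a numeric run identifier and host (the meaning of the third field is a guess, and parse_log_path is a hypothetical helper), it can be split like this to line a log up with the timeline:

```python
from datetime import datetime
from pathlib import PurePosixPath
from typing import NamedTuple


class ReimageLog(NamedTuple):
    started: datetime  # naive timestamp; the time zone is not stated in the name
    user: str
    run_id: str        # assumption: a numeric run/process identifier
    host: str


def parse_log_path(path: str) -> ReimageLog:
    """Split a reimage log path into its underscore-separated fields."""
    stem = PurePosixPath(path).stem  # drop the directory and the ".out" suffix
    stamp, user, run_id, host = stem.split("_", 3)
    return ReimageLog(datetime.strptime(stamp, "%Y%m%d%H%M"), user, run_id, host)


if __name__ == "__main__":
    log = parse_log_path(
        "/var/log/spicerack/sre/hosts/reimage/"
        "202205121751_robh_1295746_ganeti4001.out"
    )
    print(log.started.isoformat(), log.user, log.host)
```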

Summary of work:

  • Initially asked to update firmware; updated to the very latest version, 22.00.07.60.
    • This introduced a new PXE boot failure, known from T304483. Moritz asked me to match the known-working version 21.60.22.11 (works on the cloudvirts in eqiad).
  • Flashed the 10G NIC firmware to 21.60.22.11.
    • This introduced a new error: the link light and connection confirmation went away with no apparent cause. The settings were all consistent with a working NIC, but the link was gone.
  • Flashed the 10G NIC firmware to 21.85.21.92, a release just below the newest release, 22.00.07.60.
    • Link light verification and port info in the iDRAC interface returned; the link is up.
    • Launched the install via the cookbook. The TFTP load of the Debian installer took a while (about 4 minutes), but the installer did eventually load and ran successfully.
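
To make the relationship between the three NIC firmware versions explicit, here is a minimal sketch that orders them by comparing the dotted fields numerically. Treating numeric order as release order is an assumption about the vendor's versioning scheme, and parse_version is a hypothetical helper, not part of any tooling used here.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version such as '21.85.21.92' into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


# Versions and outcomes as reported in the summary above.
OUTCOMES = {
    "22.00.07.60": "PXE boot failure (T304483)",
    "21.60.22.11": "no link",
    "21.85.21.92": "link up, install succeeded",
}

if __name__ == "__main__":
    # Tuples compare field by field, so '21.85...' sorts between the other two.
    for version in sorted(OUTCOMES, key=parse_version):
        print(f"{version}: {OUTCOMES[version]}")
```

Under that assumption, the version that worked sits between the two that failed, which matches the description of 21.85.21.92 as a release just below the newest one.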

ganeti4002 is from the same batch. I've migrated instances, removed it from the cluster for the reimage and downtimed it. @RobH Can you please update it to the same firmware and NIC firmware versions as ganeti4001?

Mentioned in SAL (#wikimedia-sre) [2022-05-16T21:47:17Z] <robh> ganeti4002 rebooting for firmware update via T307997

Moritz,

I flashed the updates on ganeti4002, but it reminded me that I need to go onsite this Thursday for T303318 to swap the defective memory (an existing case, opened before the warranty expired).

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1001 for host ganeti4002.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1001 for host ganeti4002.ulsfo.wmnet with OS bullseye completed:

  • ganeti4002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205170848_jmm_2136727_ganeti4002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-05-17T11:53:51Z] <moritzm> failover Ganeti master in ulsfo to ganeti4001 T307997

ganeti4003 is from the same batch and needs the same updates. I've migrated instances, removed it from the cluster for the reimage and downtimed it.

Mentioned in SAL (#wikimedia-sre) [2022-05-17T17:16:44Z] <robh> ganeti4003 rebooting for firmware updates via T307997

Updates are all done; the system is back up and ready for reimage whenever.

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti4003.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti4003.ulsfo.wmnet with OS bullseye completed:

  • ganeti4003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205180756_jmm_1188152_ganeti4003.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

This is complete. The ulsfo cluster is affected by T309724, which will be investigated in that task; it has no functional impact apart from gnt-cluster verify failing.