Upgrade ganeti/eqsin to Bullseye
Closed, Resolved, Public

Description

Upgrade the Ganeti cluster in Singapore to Bullseye:

  • Add component/ganeti3 to all nodes
  • Upgrade all nodes to the new Ganeti 3
  • Run sudo gnt-cluster upgrade --to 3.0
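The ordering above matters: the cluster-level upgrade comes only after every node has the new packages. A minimal sketch of that guard follows. It assumes a hypothetical input of "node version" pairs on stdin (how those pairs are collected is not part of this task), and it only prints the cluster command rather than running it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads "node version" pairs on stdin and succeeds only
# if at least one node is listed and every node's version starts with the
# target prefix (for example "3.0").
all_nodes_at_version() {
    local target="$1" node version seen=0
    while read -r node version; do
        [ -n "$node" ] || continue
        seen=1
        case "$version" in
            "$target"*) ;;
            *) echo "not ready: $node has $version" >&2; return 1 ;;
        esac
    done
    [ "$seen" -eq 1 ]
}

# Dry run: print the cluster-level step only when all nodes are ready.
plan_cluster_upgrade() {
    local target="$1"
    if all_nodes_at_version "$target"; then
        echo "sudo gnt-cluster upgrade --to $target"
    else
        echo "skip: upgrade the Ganeti packages on every node first"
        return 1
    fi
}
```

Feeding it one line per node and calling plan_cluster_upgrade 3.0 prints the upgrade command only if no node lags behind.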

Firmware updates completed on ganeti500[123]: nic 21.85.21.92, bios 2.14.2, idrac 5.10.10.00

For each node in turn: empty it of primary/secondary instances, remove it from the cluster with gnt-node remove, reimage it to Bullseye, then re-add it with gnt-node add:

  • ganeti5001
  • ganeti5002
  • ganeti5003
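The per-node cycle above can be sketched as a dry run that prints the commands instead of executing them. The gnt-node migrate -f and gnt-node evacuate -s invocations are an assumption about how the node is emptied (the task only says to empty it), and the reimage step is left as a comment line because its exact invocation is not given here.

```bash
#!/usr/bin/env bash
# Dry run: print the sequence for one node; nothing is executed.
plan_node_cycle() {
    local node="$1" os="$2"
    # Assumed way to empty the node: move primaries to their secondary node,
    # then move the remaining secondary replicas elsewhere.
    echo "sudo gnt-node migrate -f $node"
    echo "sudo gnt-node evacuate -s $node"
    echo "sudo gnt-node remove $node"
    # Placeholder line for the reimage, for example via the
    # sre.hosts.reimage cookbook.
    echo "# reimage $node to $os"
    echo "sudo gnt-node add $node"
}

# One node at a time, so the cluster never loses more than one member.
for n in ganeti5002.eqsin.wmnet ganeti5003.eqsin.wmnet ganeti5001.eqsin.wmnet; do
    plan_node_cycle "$n" bullseye
done
```

Printing rather than executing keeps the sketch safe to run anywhere; each emitted line would still need to be reviewed and run by hand on the cluster master.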

Event Timeline

Change 791306 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Enable Ganeti3 component on eqsin servers

https://gerrit.wikimedia.org/r/791306

Change 791306 merged by Muehlenhoff:

[operations/puppet@production] Enable Ganeti3 component on eqsin servers

https://gerrit.wikimedia.org/r/791306

Mentioned in SAL (#wikimedia-operations) [2022-05-18T08:11:13Z] <moritzm> upgrading ganeti packages in eqsin to Ganeti 3.0 T308211

Mentioned in SAL (#wikimedia-operations) [2022-05-18T08:25:43Z] <moritzm> sudo gnt-cluster upgrade --to 3.0 for ganeti/eqsin T308211

ganeti5002 is removed from the cluster and needs the same firmware/NIC updates as ganeti4* to enable the reimage to Bullseye.

RobH subscribed.

ganeti5002 firmware updates completed: nic 21.85.21.92, bios 2.14.2, idrac 5.10.10.00. The system booted back into the OS and is online for a later reimage.

You can reimage and repool it, then kick this back to me for ganeti500[13], whichever is next.

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti5002.eqsin.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti5002.eqsin.wmnet with OS bullseye completed:

  • ganeti5002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205190901_jmm_1370260_ganeti5002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

ganeti5003 is removed from the cluster and needs the same firmware/NIC updates as ganeti4* to enable the reimage to Bullseye.

Mentioned in SAL (#wikimedia-sre) [2022-05-19T20:21:55Z] <robh> ganeti5003 updating firmware via T308211

Moritz,

ganeti5003 firmware updates completed: nic 21.85.21.92, bios 2.14.2, idrac 5.10.10.00. The system booted back into the OS and is online for a later reimage.

You can reimage and repool it, then kick this back to me for ganeti5001.

@RobH I'm unable to reimage ganeti5003, the ipmitool call fails with "Error: Unable to establish IPMI v2 / RMCP+ session"

I've tried a racreset, but that didn't change anything. Some digging showed that we had similar symptoms with ganeti5002 before: https://phabricator.wikimedia.org/T261130#6752341 So could you please reset the IDRAC password?

Mentioned in SAL (#wikimedia-sre) [2022-05-20T16:33:50Z] <robh> troubleshooting ganeti5003 ipmi failure via T308211

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host ganeti5003.eqsin.wmnet with OS bullseye

I just overwrote the password with the exact same password, as described in the comment you linked. That didn't fix it, so I checked the settings. It seems that either the firmware load disabled IPMI over LAN on the mgmt interface, or the setting was wrong from the start and the host had never been reimaged via the script.

[Attached screenshot: Screen Shot 2022-05-20 at 9.36.42 AM.png]

Fixed, fired off the reimage.
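For similar failures later, here is a minimal sketch of a triage helper for ipmitool output. The function name and the returned tokens are hypothetical. The only error string it recognises is the one quoted above, and the hint it prints reflects what turned out to be the cause on this host.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify the text printed by an ipmitool call.
classify_ipmi_failure() {
    local output="$1"
    case "$output" in
        "")
            echo "no-output" ;;
        *"Unable to establish IPMI v2 / RMCP+ session"*)
            # On ganeti5003 the password was fine; IPMI over LAN was disabled.
            echo "hint: check credentials, then whether IPMI over LAN is enabled" >&2
            echo "session-failed" ;;
        *"Chassis Power is"*)
            echo "ok" ;;
        *)
            echo "other" ;;
    esac
}

# Example call (MGMT_HOST and the root user are placeholders; -E makes
# ipmitool read the password from the IPMI_PASSWORD environment variable):
# classify_ipmi_failure "$(ipmitool -I lanplus -H MGMT_HOST -U root -E chassis power status 2>&1)"
```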

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti5003.eqsin.wmnet with OS bullseye completed:

  • ganeti5003 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205201637_robh_505017_ganeti5003.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

@MoritzMuehlenhoff,

The WARN is because the host was in the BIOS when I fired off the script, so it couldn't disable Puppet on the old OS. This host is now ready for redeployment. I have not run the gnt-node add command, since I'm not familiar with how best to monitor the cluster after the addition to ensure it is optimal.

https://wikitech.wikimedia.org/wiki/Ganeti#Add_nodes_to_new_cluster_%28or_extend_an_existing_one

This seems to indicate that I should add it via a different command than gnt-node add, perhaps because it is a reimaged host rather than a 'new' one. Rather than do it wrong and risk an outage on a Friday, I'll let you evaluate and push this into service. Once it's good to go and you want the ganeti5001 firmware updated, feel free to reassign this back to me with a comment saying so. Thanks!
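The question of which add command applies can be sketched as a small dry-run helper. This rests on an assumption about the upstream Ganeti CLI rather than anything stated in this task: a node that was removed with gnt-node remove is added again with plain gnt-node add, while --readd is meant for a node that is still listed in the cluster configuration. The state names are hypothetical, and the follow-up gnt-cluster verify is only one possible post-add check.

```bash
#!/usr/bin/env bash
# Dry run: print the add command for a reimaged node, then a cluster check.
# state is a hypothetical label: "removed" or "still-listed".
plan_node_add() {
    local node="$1" state="$2"
    case "$state" in
        removed)
            echo "sudo gnt-node add $node" ;;
        still-listed)
            # Assumption: --readd targets nodes still present in the config.
            echo "sudo gnt-node add --readd $node" ;;
        *)
            echo "unknown state: $state" >&2
            return 2 ;;
    esac
    echo "sudo gnt-cluster verify"
}

plan_node_add ganeti5003.eqsin.wmnet removed
```

Since the nodes in this task were removed from the cluster before reimaging, the "removed" branch is the one that matches the procedure in the description.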

Mentioned in SAL (#wikimedia-operations) [2022-05-23T09:54:32Z] <moritzm> failover ganeti master in eqsin to ganeti5003 T308211

ganeti5001 is removed from the cluster and needs the same firmware/NIC updates as ganeti4* to enable the reimage to Bullseye.

Mentioned in SAL (#wikimedia-sre) [2022-05-23T15:12:54Z] <robh> updating firmware on ganeti5001 per T308211

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host ganeti5001.eqsin.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti5001.eqsin.wmnet with OS bullseye completed:

  • ganeti5001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205231543_robh_1113641_ganeti5001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

@MoritzMuehlenhoff, per our earlier IRC discussion, ganeti5001 has had all the firmware updated and reimaged successfully. All yours!

Mentioned in SAL (#wikimedia-operations) [2022-05-24T10:18:03Z] <moritzm> rebalance Ganeti cluster in eqsin T308211

This is complete. The eqsin cluster is affected by T309724, which will be investigated in that task; it has no functional impact beyond causing gnt-cluster verify to fail.