Upgrade ganeti/esams to Bullseye
Closed, Resolved (Public)

Description

Upgrade the Ganeti cluster in Amsterdam to Bullseye:

  • Add component/ganeti3 to all nodes
  • Upgrade all nodes to the new Ganeti 3
  • sudo gnt-cluster upgrade --to 3.0

Then, for each node in turn: drain it of all primary and secondary instances, remove it from the cluster with gnt-node remove, reimage it to Bullseye, and re-add it to the cluster.

  • ganeti3001
  • ganeti3002
  • ganeti3003
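
The per-node procedure above can be sketched as a dry run that only prints the commands it would issue. This is a rough illustration, not the runbook actually used for this task: the function name plan_node_reimage is made up, the choice of the hail allocator for secondary evacuation is an assumption, and the reimage step is left as a placeholder because it depends on site tooling. The node order simply follows the list above.

```shell
#!/bin/sh
# Dry-run sketch: prints the steps for one node instead of executing them.
# Flags should be checked against the installed Ganeti version before use.

plan_node_reimage() {
    node="$1"
    # 1. Migrate all primary instances off the node (-f skips the prompt).
    echo "sudo gnt-node migrate -f ${node}"
    # 2. Move secondary replicas elsewhere; using the hail allocator
    #    here is an assumption, a target node could be named instead.
    echo "sudo gnt-node evacuate -s -I hail ${node}"
    # 3. Remove the now-empty node from the cluster.
    echo "sudo gnt-node remove ${node}"
    # 4. Placeholder: reimaging is done with site-specific tooling.
    echo "# reimage ${node} to bullseye"
    # 5. Re-add the freshly installed node (site-specific options omitted).
    echo "sudo gnt-node add ${node}"
}

for node in ganeti3001 ganeti3002 ganeti3003; do
    plan_node_reimage "${node}"
done
```

Printing rather than executing keeps the sketch safe to run anywhere; in practice each step would be run by hand and checked (for example with gnt-cluster verify) before moving on to the next node.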

Event Timeline

Change 793491 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Enable component/ganeti3 for the esams cluster

https://gerrit.wikimedia.org/r/793491

Change 793491 merged by Muehlenhoff:

[operations/puppet@production] Enable component/ganeti3 for the esams cluster

https://gerrit.wikimedia.org/r/793491

Mentioned in SAL (#wikimedia-operations) [2022-06-07T07:44:52Z] <moritzm> upgrading ganeti/esams to Ganeti 3 T308238

Mentioned in SAL (#wikimedia-operations) [2022-06-07T08:29:54Z] <moritzm> drain ganeti3003 for reimage T308238

ganeti3003 is removed from the cluster, downtimed and needs the same firmware/NIC updates as ganeti4*/ganeti5* to enable the reimage to Bullseye.

Mentioned in SAL (#wikimedia-sre) [2022-06-08T16:26:37Z] <robh> ganeti5003 firmware updates in progress via T308238

Mentioned in SAL (#wikimedia-sre) [2022-06-08T16:44:48Z] <robh> ganeti3003 (already depooled) coming down for firmware update and reimage via T308238

ganeti3003 firmware updates

bios 2.2.11 to 2.14.2
nic 21.40.22.20 to 21.85.21.92
idrac 3.34.34.34 to 5.10.10.00

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host ganeti3003.esams.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti3003.esams.wmnet with OS bullseye completed:

  • ganeti3003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206081715_robh_441138_ganeti3003.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

RobH added a subscriber: RobH.

ganeti3003 firmware updated and reimaged to Bullseye. (It was easy enough to run the reimage cookbook right after the firmware update, which also confirms that the update fixes the PXE issues.)

Mentioned in SAL (#wikimedia-operations) [2022-06-09T07:13:32Z] <moritzm> drain ganeti3002 for firmware update/reimage T308238

MoritzMuehlenhoff updated the task description.

ganeti3002 is removed from the cluster, downtimed and needs the same firmware/NIC updates to enable the reimage to Bullseye.

Mentioned in SAL (#wikimedia-sre) [2022-06-09T15:58:38Z] <robh> ganeti3002 rebooting into firmware update then reimage via T308238

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host ganeti3002.esams.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti3002.esams.wmnet with OS bullseye completed:

  • ganeti3002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206091701_robh_623293_ganeti3002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

ganeti3002 firmware updated for NIC, BIOS, and iDRAC. Reimaged and ready for the next one once you juggle this back into service =]

Mentioned in SAL (#wikimedia-operations) [2022-06-13T07:54:58Z] <moritzm> failover ganeti master in esams to ganeti3003 T308238

Mentioned in SAL (#wikimedia-operations) [2022-06-13T09:12:16Z] <moritzm> drain ganeti3001 for firmware update/reimage T308238

ganeti3001 is removed from the cluster, downtimed and needs the same firmware/NIC updates to enable the reimage to Bullseye.

Mentioned in SAL (#wikimedia-sre) [2022-06-13T16:10:15Z] <robh> ganeti3001 rebooting and reimaging for firmware updates via T308238

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host ganeti3001.esams.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti3001.esams.wmnet with OS bullseye completed:

  • ganeti3001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206131632_robh_1344098_ganeti3001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

ganeti3001 firmware updates

bios 2.2.11 to 2.14.2
nic 21.40.22.20 to 21.85.21.92
idrac 3.34.34.34 to 5.10.10.00

Moritz,

ganeti3001 firmware updated and reimaged to bullseye, back to you for service implementation

Mentioned in SAL (#wikimedia-operations) [2022-06-14T11:02:02Z] <moritzm> rebalancing ganeti cluster in esams T308238
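
For illustration only, a rebalance like the one logged above is typically driven by Ganeti's hbal tool. The exact command used here is not recorded in this task, so the sketch below is an assumption; rebalance_cmd is a hypothetical helper that only prints the command line. Without -X, hbal computes and shows a proposed set of moves; with -X (and the LUXI backend selected by -L) it also executes them.

```shell
#!/bin/sh
# Hypothetical helper: prints the hbal invocation rather than running it.
rebalance_cmd() {
    if [ "${1:-}" = "--execute" ]; then
        # -L: query the master daemon via LUXI; -X: execute the moves.
        echo "sudo hbal -L -X"
    else
        # Plan only: hbal reports the proposed moves without applying them.
        echo "sudo hbal -L"
    fi
}

rebalance_cmd
```

Reviewing the plan-only output first is a cheap way to see how many instance moves a rebalance would trigger before committing to it.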