
Upgrade cloudvirt-wdqs servers to Debian Bullseye
Closed, Resolved · Public

Description

There are three of these servers, running two user-managed VMs:

root@cloudcontrol1003:~# openstack server list --all-projects --host cloudvirt-wdqs1001
+--------------------------------------+--------------------+---------+----------------------------------------+---------------------------------------------+------------------+
| ID                                   | Name               | Status  | Networks                               | Image                                       | Flavor           |
+--------------------------------------+--------------------+---------+----------------------------------------+---------------------------------------------+------------------+
| 38d50a32-c637-4d64-805a-a4afb9144218 | canary-wdqs1001-01 | SHUTOFF | lan-flat-cloudinstances2b=172.16.6.155 | debian-10.0-buster (deprecated 2021-03-01)  | cloudvirt-canary |
| 87acbb5a-ddac-457a-9fab-19b4f8af7916 | wcqs-beta-01       | ACTIVE  | lan-flat-cloudinstances2b=172.16.1.248 | debian-9.11-stretch (deprecated 2020-10-17) | t206636          |
+--------------------------------------+--------------------+---------+----------------------------------------+---------------------------------------------+------------------+
root@cloudcontrol1003:~# openstack server list --all-projects --host cloudvirt-wdqs1002
+--------------------------------------+--------------------+--------+---------------------------------------+--------------------------------------------+------------------+
| ID                                   | Name               | Status | Networks                              | Image                                      | Flavor           |
+--------------------------------------+--------------------+--------+---------------------------------------+--------------------------------------------+------------------+
| 6b8f11de-31b0-4a61-979e-451890f05ebf | canary-wdqs1002-01 | ACTIVE | lan-flat-cloudinstances2b=172.16.4.81 | debian-10.0-buster (deprecated 2021-03-01) | cloudvirt-canary |
+--------------------------------------+--------------------+--------+---------------------------------------+--------------------------------------------+------------------+
root@cloudcontrol1003:~# openstack server list --all-projects --host cloudvirt-wdqs1003
+--------------------------------------+--------------------+--------+----------------------------------------+--------------------------------------------+------------------------+
| ID                                   | Name               | Status | Networks                               | Image                                      | Flavor                 |
+--------------------------------------+--------------------+--------+----------------------------------------+--------------------------------------------+------------------------+
| a244842b-7204-47ee-8bcb-7d3478971e6b | mwoffliner4        | ACTIVE | lan-flat-cloudinstances2b=172.16.4.162 | debian-11.0-bullseye                       | g3.cores8.ram32.disk20 |
| bdf71776-b78d-4c41-9705-571b01a732e6 | canary-wdqs1003-01 | ACTIVE | lan-flat-cloudinstances2b=172.16.3.21  | debian-10.0-buster (deprecated 2021-07-30) | cloudvirt-canary-ceph  |
+--------------------------------------+--------------------+--------+----------------------------------------+--------------------------------------------+------------------------+

I'm reimaging 1002 right now since that won't affect our users.

(Note to self: pxe-booting these hosts requires the same network hack as for e.g. cloudvirt1016)
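
The per-hypervisor inventory check above can be scripted before each reimage. A minimal dry-run sketch (host names are the three from this task; actually executing the commands assumes an `openstack` CLI with admin credentials, as on cloudcontrol1003):

```shell
# Dry-run sketch: build the per-hypervisor inventory commands rather than
# executing them, since they only work with admin credentials on a
# cloudcontrol host. Pipe the output to sh to run them for real.
hosts="cloudvirt-wdqs1001 cloudvirt-wdqs1002 cloudvirt-wdqs1003"
cmds=""
for host in $hosts; do
    cmds="${cmds}openstack server list --all-projects --host ${host}
"
done
printf '%s' "$cmds"
```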

Event Timeline

@EBernhardson, you created the VM 'wcqs-beta-01', which will be affected by this maintenance. Here are the options:

  1. I delete that VM, reimage the server, and you recreate it at your leisure
  2. I reimage the server leaving that VM in place and then restart it when done. This will result in a couple of hours of downtime.
  3. The same as option #2 except we do this by appointment so you're around to deal with the downtime.

What is your preference? (Obviously #1 is slightly easier for me but I'm fairly confident that I can accomplish options 2 or 3).
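
Option 2 amounts to a stop/reimage/start cycle around the hypervisor maintenance. A dry-run sketch, using the wcqs-beta-01 ID from the description (the exact sequencing here is an assumption, not a documented procedure):

```shell
# Dry-run sketch: stop the affected VM, reimage its hypervisor, then start
# the VM again. The reimage itself is the sre.hosts.reimage cookbook run
# from a cumin host; here it is only a placeholder comment.
vm_id="87acbb5a-ddac-457a-9fab-19b4f8af7916"   # wcqs-beta-01
plan="openstack server stop ${vm_id}
# ... reimage cloudvirt-wdqs1001 ...
openstack server start ${vm_id}"
printf '%s\n' "$plan"
```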

@Kelson I'm not clear on why mwoffliner4 is scheduled on a special-purpose wdqs server. It could be by mistake, or it could be by prearrangement and I've just forgotten why.

If it's there on purpose, then I offer you the same three options as above. If not, then I propose two more:

  4. You (or I) delete that VM and you recreate it
  5. I figure out how to migrate the VM off of that special-purpose host and onto a 'normal' hypervisor

Option 5 is probably possible, but I haven't done it in ages :)
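
For reference, migrating the VM to another hypervisor would look roughly like a cold migration. A hedged sketch only: "cloudvirt-normal.example" is a hypothetical target host, and the confirmation subcommand's exact syntax varies with the openstack client version (older clients use `server resize --confirm`):

```shell
# Dry-run sketch of a cold migration. The target host is hypothetical;
# the server ID is mwoffliner4's from the task description. Targeting a
# specific host with --host needs a new enough compute API / client.
vm_id="a244842b-7204-47ee-8bcb-7d3478971e6b"   # mwoffliner4
target="cloudvirt-normal.example"              # hypothetical target host
steps="openstack server migrate --host ${target} ${vm_id}
openstack server resize confirm ${vm_id}"
printf '%s\n' "$steps"
```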

@Andrew I don't know what a "special-purpose wdqs server" is; I suspect this is a mistake. This server has a "special" profile, and I remember specifically asking you for this config.

I currently have a long-running process which should finish within a week. Then I will delete the VM and recreate it. If there is anything I should take care of when recreating the VM, please let me know.

Thanks for the quick response! Is it safe to assume that stopping/starting that VM would also ruin this long process? (If so then I'll just be patient.)

@Andrew Yes it would kill the process and I would have to restart everything.

@Andrew after some discussion with my team, we've decided that wcqs-beta-01 is no longer in use, so feel free to delete it any time.

that's easy :) Thanks for the follow-up.

@Andrew Sorry, we should have just deleted this ourselves instead of telling you to do it. ;)

But, I did just delete wcqs-beta-01. If we can do anything else to help, please let us know.

@Kelson is your job still in progress, or can we start moving things?

@Andrew I recreated mwoffliner4. I guess it should be OK now for you.

Yes! Thanks @Kelson

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt-wdqs1003.eqiad.wmnet with OS bullseye
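
The log entries here correspond to runs of the sre.hosts.reimage cookbook on the cumin host. A hedged sketch of the invocation shape, with the flags inferred from the log text rather than verified against the cookbook's help:

```shell
# Sketch of the invocation behind the log lines above; built as a string
# (dry run) since it only makes sense on a cumin host with sudo access.
cmd="sudo cookbook sre.hosts.reimage --os bullseye cloudvirt-wdqs1002"
printf '%s\n' "$cmd"
```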

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt-wdqs1001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt-wdqs1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt-wdqs1003.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt-wdqs1003 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt-wdqs1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt-wdqs1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bullseye completed:

  • cloudvirt-wdqs1002 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204041554_andrew_2932947_cloudvirt-wdqs1002.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt-wdqs1003.eqiad.wmnet with OS bullseye completed:

  • cloudvirt-wdqs1003 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204041558_andrew_2933313_cloudvirt-wdqs1003.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB