Our need for cloudvirt hypervisors with local disks
Closed, Resolved (Public)

Description

Until last week we were using cloudvirt1018 to host exactly two VMs:

toolsbeta-test-k8s-etcd-17
tools-k8s-etcd-13

cloudvirt1018 is having RAID problems (and is due for a refresh), so it's now drained and idle.

Each of the above VMs is part of a three-VM cluster that was spread over cloudvirt1018, 1019 and 1020. Now we have less redundancy, as each cluster has two nodes on cloudvirt1019.
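To make the redundancy concern concrete, here is a rough sketch of the quorum arithmetic for a three-member etcd cluster. etcd needs a majority of members to stay available, so a cluster with two members on one hypervisor loses quorum if that hypervisor goes down. The member names (`etcd-a` etc.) are hypothetical placeholders, and placing the third member on cloudvirt1020 is an assumption inferred from the description, not something stated in the task.

```python
from typing import Mapping


def quorum(members: int) -> int:
    """Majority needed by a Raft-based cluster such as etcd."""
    return members // 2 + 1


def survives_host_loss(placement: Mapping[str, str], failed_host: str) -> bool:
    """True if the cluster still has a majority after one hypervisor fails."""
    remaining = sum(1 for host in placement.values() if host != failed_host)
    return remaining >= quorum(len(placement))


# Hypothetical member names; hypervisor names are from the task description.
before = {
    "etcd-a": "cloudvirt1018",
    "etcd-b": "cloudvirt1019",
    "etcd-c": "cloudvirt1020",
}

# Assumed layout after draining cloudvirt1018: two members share cloudvirt1019.
after = {
    "etcd-a": "cloudvirt1019",
    "etcd-b": "cloudvirt1019",
    "etcd-c": "cloudvirt1020",
}

for label, placement in (("before", before), ("after", after)):
    for host in sorted(set(placement.values())):
        ok = survives_host_loss(placement, host)
        print(f"{label}: lose {host} -> quorum kept: {ok}")
```

Under these assumptions, the original layout tolerates the loss of any single hypervisor, while the current one does not tolerate the loss of cloudvirt1019.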

We will likely always need two 'fatvirt' hosts for the DB workloads on 1019 and 1020. Do we need a third fatvirt for those other two VMs? It seems like a lot of hardware to use on two VMs.

Event Timeline

Change 743055 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Convert cloudvirt1028 into a local storage hypervisor

https://gerrit.wikimedia.org/r/743055

Change 743055 merged by Andrew Bogott:

[operations/puppet@production] Convert cloudvirt1028 into a local storage hypervisor

https://gerrit.wikimedia.org/r/743055

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster executed with errors:

  • cloudvirt1028 (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster executed with errors:

  • cloudvirt1028 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster executed with errors:

  • cloudvirt1028 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster executed with errors:

  • cloudvirt1028 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster executed with errors:

  • cloudvirt1028 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster executed with errors:

  • cloudvirt1028 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster executed with errors:

  • cloudvirt1028 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster executed with errors:

  • cloudvirt1028 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster executed with errors:

  • cloudvirt1028 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster executed with errors:

  • cloudvirt1028 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster executed with errors:

  • cloudvirt1028 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster executed with errors:

  • cloudvirt1028 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster executed with errors:

  • cloudvirt1028 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster executed with errors:

  • cloudvirt1028 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster executed with errors:

  • cloudvirt1028 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster executed with errors:

  • cloudvirt1028 (FAIL)
    • Downtimed on Icinga
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster completed:

  • cloudvirt1028 (WARN)
    • Downtimed on Icinga
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112080337_andrew_654_cloudvirt1028.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1028.eqiad.wmnet with OS buster executed with errors:

  • cloudvirt1028 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Andrew claimed this task.

cloudvirt1028 is now the third localdisk host. No actions left here until the time comes to refresh 1019, 1020 or 1028.