Upgrade cloudelastic clusters to Debian Bullseye
Closed, Resolved | Public | 3 Estimated Story Points

Description

In T308606 we created a new rolling reimage operation for our generic rolling-operation cookbook; let's use that to upgrade the cluster automatically.

Event Timeline

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1004 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Mentioned in SAL (#wikimedia-operations) [2022-05-26T19:46:43Z] <bking@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (2 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster reimage - bking@cumin1001 - T309343

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1004 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1004 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1004 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Mentioned in SAL (#wikimedia-operations) [2022-05-26T20:40:54Z] <inflatador> bking@install1003 removed cloudelastic1004.conf pxe config file T309343

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1004 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Looking at the console, the installer keeps coming up in interactive mode. I tried clicking through, but it said it couldn't download the preseed file. Will raise this issue with the Infrastructure Foundations team tomorrow.

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1004 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1006.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1006.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1006 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

^^ Two different hosts (cloudelastic1004 and cloudelastic1006) have now failed to reimage in the same way.
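To check that two runs really stopped at the same point, the reports posted above can be compared programmatically. The following is only a sketch: `parse_report` and `last_completed_step` are hypothetical helper names, not part of the reimage cookbook, and the parsing assumes the bullet layout exactly as it appears in this task.

```python
FAILURE_NOTICE = "The reimage failed"


def parse_report(text):
    """Split a pasted cookbook report into host, status, and step lines."""
    host = status = None
    steps = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("•"):
            continue
        item = line[1:].strip()
        if host is None:
            # First bullet looks like "cloudelastic1004 (FAIL)".
            name, _, rest = item.partition(" (")
            host, status = name, rest.rstrip(")")
        else:
            steps.append(item)
    return {"host": host, "status": status, "steps": steps}


def last_completed_step(report):
    """Return the last step before the failure notice, or None."""
    done = [s for s in report["steps"] if not s.startswith(FAILURE_NOTICE)]
    return done[-1] if done else None


REPORT_1004 = """\
  • cloudelastic1004 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details
"""

REPORT_1006 = REPORT_1004.replace("cloudelastic1004", "cloudelastic1006")

for text in (REPORT_1004, REPORT_1006):
    report = parse_report(text)
    print(report["host"], report["status"], "->", last_completed_step(report))
```

Both first-attempt reports end after "Host rebooted via IPMI", which is consistent with the installer never coming up unattended after the PXE reboot.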

Per IRC conversation with Infrastructure Foundations, "when you hit this kind of issues your best bet is usually the dcops IRC channel. In this case it might be another occurrence of outdated HW firmwares".

Will check in #wikimedia-dcops now

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1006.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cloudelastic1006.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cloudelastic1006.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1006.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cloudelastic1006.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cloudelastic1006.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cloudelastic1006.wikimedia.org with OS bullseye

Updated firmware on cloudelastic1006 as follows:

  • 10G NIC: 21.40.25.31 to 21.85.21.92
  • iDRAC: 4.00.00.00 to 5.10.10.00
  • BIOS: 2.4.8 to 2.14.2

Once those were done, the installer successfully loaded up and is running through things now.
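When checking whether the remaining hosts still need the same updates, note that dotted firmware versions have to be compared numerically, not as strings. A minimal sketch, using only the version numbers quoted above (`parse_version` and `is_upgrade` are hypothetical helper names, and the code assumes every version component is a plain integer):

```python
def parse_version(version):
    """Turn a dotted version such as '2.14.2' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_upgrade(old, new):
    """True when 'new' is numerically newer than 'old'."""
    return parse_version(new) > parse_version(old)


# Before/after versions as reported for cloudelastic1006.
UPDATES = {
    "10g": ("21.40.25.31", "21.85.21.92"),
    "idrac": ("4.00.00.00", "5.10.10.00"),
    "bios": ("2.4.8", "2.14.2"),
}

for component, (old, new) in UPDATES.items():
    print(component, old, "->", new, "upgrade:", is_upgrade(old, new))
```

The BIOS entry is the case that plain string comparison gets wrong: as text, "2.14.2" sorts before "2.4.8".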

Per IRC conversation with RobH, we will work together to reimage one cloudelastic host at a time (6 in the cluster). We'll try to keep 5 hosts in the cluster at all times.

Note that the largest cluster, cloudelastic-chi-eqiad, is in red status, so we'll need to address that before any more reimages take place.
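The two conditions above (cluster not red, at least 5 of 6 hosts left in the cluster) can be expressed as a simple pre-flight check. This is a sketch under stated assumptions, not what the rolling-operation cookbook does: `may_reimage_next` is a hypothetical name, the input is assumed to be the parsed JSON body of an Elasticsearch cluster health response (which carries `status` and `number_of_nodes`), and requiring "green" rather than merely "not red" is a stricter choice made here for illustration.

```python
def may_reimage_next(health, min_remaining=5, allowed_statuses=("green",)):
    """Decide whether one more host can be taken out for a reimage.

    health: dict parsed from a cluster health response.
    Returns (ok, reason).
    """
    status = health.get("status")
    nodes = health.get("number_of_nodes", 0)
    if status not in allowed_statuses:
        return False, f"cluster status is {status!r}; resolve it first"
    if nodes - 1 < min_remaining:
        return False, (
            f"{nodes} nodes present; removing one would leave "
            f"fewer than {min_remaining}"
        )
    return True, "ok"


# Illustrative inputs only, not captured from the real cluster.
print(may_reimage_next({"status": "red", "number_of_nodes": 6}))
print(may_reimage_next({"status": "green", "number_of_nodes": 6}))
print(may_reimage_next({"status": "green", "number_of_nodes": 5}))
```

Passing `allowed_statuses=("green", "yellow")` would relax the check to block only on red, which matches the wording of the comment more literally.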

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cloudelastic1006.wikimedia.org with OS bullseye completed:

  • cloudelastic1006 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205311651_robh_3130270_cloudelastic1006.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-07-06T17:06:13Z] <inflatador> bking@cloudelastic1006 "restarting elastic services in preparation for cloudelastic reimage T309343"

Mentioned in SAL (#wikimedia-operations) [2022-07-06T18:02:07Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster reimage to bullseye - bking@cumin1001 - T309343

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1003 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Mentioned in SAL (#wikimedia-operations) [2022-07-06T18:45:51Z] <bking@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster reimage to bullseye - bking@cumin1001 - T309343

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Step Zero: See https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Updating_Firmware

Once you've downloaded the firmware, update it via the iDRAC 8 web GUI:

  • Click "Overview", then go to "Update and Rollback" under "Quick Launch Tasks".
  • From there, click "update".

The NIC updates failed with the Broadcom package, so I tried a different set of drivers. The sparse info under Hardware > NIC Slot 2 on the iDRAC suggests the external NIC is a Marvell QLogic, so I used that package.

iDRAC firmware updates failed several times. I had to use the 64-bit Windows package.

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1003.wikimedia.org with OS bullseye completed:

  • cloudelastic1003 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207071602_bking_1613518_cloudelastic1003.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-07-07T16:48:29Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster reimage to bullseye - bking@cumin1001 - T309343

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1002.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1002.wikimedia.org with OS bullseye completed:

  • cloudelastic1002 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207071649_bking_1627048_cloudelastic1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-07-07T18:36:12Z] <bking@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster reimage to bullseye - bking@cumin1001 - T309343

Mentioned in SAL (#wikimedia-operations) [2022-07-08T02:25:30Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster reimage to bullseye - bking@cumin1001 - T309343

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1004 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Mentioned in SAL (#wikimedia-operations) [2022-07-08T03:32:59Z] <bking@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster reimage to bullseye - bking@cumin1001 - T309343

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1004.wikimedia.org with OS bullseye completed:

  • cloudelastic1004 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207081409_bking_1843956_cloudelastic1004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1005 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1005 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1005 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1005 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1001.wikimedia.org with OS bullseye completed:

  • cloudelastic1001 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207081802_bking_1884337_cloudelastic1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1005 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1005 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1005 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1005 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1005 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1005 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1005 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1005.wikimedia.org with OS bullseye completed:

  • cloudelastic1005 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207111528_bking_2554469_cloudelastic1005.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB