
Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of cloudelastic10[07-10].wikimedia.org

Hostname / Racking / Installation Details

Refresh/Replace cloudelastic100[1-4]

Hostnames: cloudelastic10[07-10].wikimedia.org

Racking Proposal: Where should these systems be racked?

These should be straight replacements of cloudelastic1001-1004. As such:

cloudelastic1007 row A (replaces cloudelastic1001)
cloudelastic1008 row B (replaces cloudelastic1002)
cloudelastic1009 row C (replaces cloudelastic1003)
cloudelastic1010 row D (replaces cloudelastic1004)
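
The bracket notation and the row assignments above can be expanded programmatically. The following is a minimal Python sketch, assuming the bracket range means the numeric suffixes from the first to the last value inclusive, with zero padding preserved. The helper name `expand_range` is hypothetical and is not part of any existing tooling.

```python
import re


def expand_range(pattern: str) -> list[str]:
    """Expand one '[start-end]' numeric range, e.g. 'host10[07-10].example'."""
    m = re.search(r"\[(\d+)-(\d+)\]", pattern)
    if m is None:
        return [pattern]
    start, end = m.group(1), m.group(2)
    width = len(start)  # keep zero padding such as '07'
    return [
        pattern[: m.start()] + str(n).zfill(width) + pattern[m.end():]
        for n in range(int(start), int(end) + 1)
    ]


new_hosts = expand_range("cloudelastic10[07-10].wikimedia.org")
old_hosts = expand_range("cloudelastic100[1-4]")
rows = ["A", "B", "C", "D"]

# Pair each new host with its row and the old host it replaces.
racking_plan = {
    new.split(".")[0]: {"row": row, "replaces": old}
    for new, old, row in zip(new_hosts, old_hosts, rows)
}
```

Printing `racking_plan` gives one entry per new host, which can be compared line by line against the proposal above.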

Networking Setup:

  • # of Connections: 1
  • Speed: 10G
  • VLAN: Private
  • AAAA records: Y
  • Additional IP records: N

Partitioning/Raid:

  • SW Raid, using partman recipe in puppet repo (not yet written, see T342463 for progress)

OS Distro: Bullseye (default unless otherwise specified)

Sub-team Technical Contact: Brian King, Ryan Kemper (Search Platform SREs)

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

cloudelastic10[07-10].wikimedia.org:
  • Receive in system on procurement task T341214 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Add mgmt DNS (asset tag and hostname) and production DNS entries in Netbox, then run cookbook sre.dns.netbox
  • Network port setup via Netbox, then run homer from an active cumin host to commit
  • BIOS/DRAC/serial setup and testing, see Lifecycle Steps & Automatic BIOS setup details
  • Firmware update (iDRAC, BIOS, network, RAID controller)
  • operations/puppet update: this should include updates to netboot.pp and site.pp, using role(insetup), or role(insetup::nofirm) for cp systems
  • OS installation & initial Puppet run via the sre.hosts.reimage cookbook

Event Timeline


Change 961167 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: Add new hosts into site.pp

https://gerrit.wikimedia.org/r/961167

Change 961167 merged by RobH:

[operations/puppet@production] cloudelastic: Add new hosts into site.pp

https://gerrit.wikimedia.org/r/961167

bking added a project: Data-Platform-SRE.

Change 961173 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] site.pp: Fix number of cloudelastic hosts

https://gerrit.wikimedia.org/r/961173

Change 961173 merged by RobH:

[operations/puppet@production] site.pp: Fix number of cloudelastic hosts

https://gerrit.wikimedia.org/r/961173

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye executed with errors:

  • cloudelastic1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye executed with errors:

  • cloudelastic1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye executed with errors:

  • cloudelastic1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye executed with errors:

  • cloudelastic1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye executed with errors:

  • cloudelastic1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye executed with errors:

  • cloudelastic1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye executed with errors:

  • cloudelastic1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye executed with errors:

  • cloudelastic1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye executed with errors:

  • cloudelastic1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye executed with errors:

  • cloudelastic1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye executed with errors:

  • cloudelastic1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye executed with errors:

  • cloudelastic1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye

Hello DC Ops,

I'm still getting PXE boot failures on cloudelastic1007.
I've upgraded/downgraded to the BIOS versions recommended on the DC Ops Dell page, which has worked in the past but isn't working now.

I also tried to set the legacy PXE boot option in the NIC BIOS, but either that didn't work or I did it wrong. Are you able to take a look at this one?

Thanks for your time!

bking removed bking as the assignee of this task. Sep 28 2023, 10:01 PM

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye executed with errors:

  • cloudelastic1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye

bking changed the task status from Open to In Progress. Oct 4 2023, 2:40 PM
bking claimed this task.
bking lowered the priority of this task from Medium to Low.

Taking this back: I was able to get the host to boot by changing the boot option for the second NIC interface (confusing, but it works).

Now we're getting a partman error, so will continue work in T342463 and change blocking status.

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye executed with errors:

  • cloudelastic1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye executed with errors:

  • cloudelastic1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye executed with errors:

  • cloudelastic1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye

Hello DC Ops,

I've confirmed in T342463 that our new partman recipe works, but the reimage for cloudelastic1007 is still failing with this error:

Attempt to run 'cookbooks.sre.hosts.reimage.ReimageRunner._populate_puppetdb.<locals>.poll_puppetdb' raised: Nagios_host resource with title cloudelastic1007 not found yet

Also: the firmware update box above was checked, but I had to roll back the NIC firmware and configure PXE boot manually. Can you confirm that the provisioning and firmware update cookbooks have been run against these hosts, and run them if not?

Sorry for the trouble, ping me in IRC (inflatador) if you have any questions.
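
The error quoted above is raised by a polling step that waits for the host's Nagios_host resource to appear in PuppetDB. The sketch below illustrates the general poll-and-retry pattern in Python. It is not the cookbook's actual code: `poll_until`, `NotFoundYet`, `fake_lookup`, and the retry parameters are hypothetical names chosen for illustration.

```python
import time
from typing import Callable


class NotFoundYet(Exception):
    """Raised when the polled resource never appeared."""


def poll_until(
    lookup: Callable[[], bool],
    tries: int = 5,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call lookup() up to `tries` times; return the attempt that succeeded."""
    for attempt in range(1, tries + 1):
        if lookup():
            return attempt
        if attempt < tries:
            sleep(delay)  # wait before the next attempt
    raise NotFoundYet(f"resource not found after {tries} attempts")


# Stand-in for a PuppetDB query: succeeds on the third call.
calls = {"n": 0}


def fake_lookup() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


attempts_needed = poll_until(fake_lookup, tries=5, delay=0.0)
```

If the resource is never exported, for example because the Puppet run that should populate PuppetDB fails, a loop like this exhausts its retries and the caller sees a "not found yet" style failure.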

bking removed bking as the assignee of this task. Oct 4 2023, 4:39 PM
bking raised the priority of this task from Low to Medium.

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye executed with errors:

  • cloudelastic1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudelastic1007.eqiad.wmnet with OS bullseye executed with errors:

  • cloudelastic1007 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

@bking I tried the reimage on cloudelastic1007, and the OS install finished without an issue. The part that failed was the Puppet run, because the server was not in site.pp. Another issue is that the servers are supposed to be on the public VLAN, but Netbox shows them placed in the private VLAN.
@VRiley-WMF @Jclark-ctr can someone please add those servers to site.pp and fix Netbox by putting those servers in the public VLAN?

Thanks

@Papaul I see the cloudelastic servers in site.pp; they were added by Bking previously:

node /^cloudelastic10(0[7-9]|10)\.wikimedia\./ {
    role(insetup::search_platform)
}
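
To double-check which hostnames the node regex quoted above covers, the pattern can be exercised outside Puppet. This is a small Python sketch. Puppet node regexes use Ruby syntax; the assumption here is that this simple pattern behaves the same under Python's `re` module, which is not true of Ruby regexes in general. The candidate hostnames are generated for illustration only.

```python
import re

# Same pattern as the site.pp node definition, without the surrounding slashes.
NODE_RE = re.compile(r"^cloudelastic10(0[7-9]|10)\.wikimedia\.")

# Candidate names cloudelastic1001 .. cloudelastic1012 under wikimedia.org.
candidates = [f"cloudelastic10{n:02d}.wikimedia.org" for n in range(1, 13)]
matched = [h for h in candidates if NODE_RE.search(h)]

# The earlier reimage attempts in this task used the internal domain name.
matches_internal = bool(NODE_RE.search("cloudelastic1007.eqiad.wmnet"))
```

Under these assumptions `matched` contains only cloudelastic1007 through cloudelastic1010, and `matches_internal` is false: the pattern does not match the `.eqiad.wmnet` name used in the earlier reimage attempts. Whether that contributed to the Puppet failure reported above is not confirmed in this task.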

@Jclark-ctr OK, then the only thing left is to change it in Netbox to use the public VLAN.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1007.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1007.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1007.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1007.wikimedia.org with OS bullseye completed:

  • cloudelastic1007 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202310121514_jclark_2780752_cloudelastic1007.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

@bking @Papaul I was able to change Netbox to the public VLAN by redoing most of the server setup steps, and was then able to image cloudelastic1007.

Networking Setup:

  • # of Connections: 1
  • Speed: 10G
  • VLAN: Private
  • AAAA records: Y
  • Additional IP records: N

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1009.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1010.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1008.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1009.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1009 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1010.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1010 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1008.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1008 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1010.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1009.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1008.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1010.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1010 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1009.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1009 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1008.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1008 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1009.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1010.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1010.wikimedia.org with OS bullseye completed:

  • cloudelastic1010 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202310122005_jclark_2930961_cloudelastic1010.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Failed to run the sre.puppet.sync-netbox-hiera cookbook, run it manually

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1009.wikimedia.org with OS bullseye completed:

  • cloudelastic1009 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202310122002_jclark_2930845_cloudelastic1009.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1008.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1008.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1008 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1008.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1008.wikimedia.org with OS bullseye completed:

  • cloudelastic1008 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202310131429_jclark_3469158_cloudelastic1008.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Jclark-ctr claimed this task.
Jclark-ctr updated the task description.
bking reopened this task as In Progress. Nov 15 2023, 6:03 PM
bking claimed this task.
bking updated Other Assignee, added: RKemper.
bking updated the task description.

Reopening, as cloudelastic1008-1010 don't appear to have reimaged properly and we may need them for T350826.

bking moved this task from In Progress to Done on the Data-Platform-SRE board.

Not sure what happened, but the cloudelastic1008-1010 hosts are up after a reimage. I had to manually power-cycle the DRAC, log in to its console, and force-enable and run Puppet.

Closing...