Q4:rack/setup/install dbproxy10[22-27].
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of dbproxy10[22-27].

This order was placed in Q3, but delivery won't be until around April 15th (Q4) for budget purposes.

Hostname / Racking / Installation Details

Hostnames: dbproxy10[22-27]
Racking Proposal: We don't mind as long as they can be in different rows.
Networking Setup: # of Connections: 1, Speed: 1G, Vlan: Private, AAAA records: N
Partitioning/Raid: HW Raid: N, Partman recipe and/or desired Raid Level: @Marostegui will take care of this
OS Distro: Bullseye (default unless otherwise specified)
Sub-team Technical Contact: @Marostegui
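
The bracket notation dbproxy10[22-27] used above stands for six individual hostnames. A minimal sketch of that expansion, assuming a single numeric range per pattern; expand_hosts is a hypothetical helper written for illustration, not the syntax parser used by the cookbooks:

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one 'prefix[start-end]suffix' range into individual names."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        return [pattern]  # no range present: treat as a single hostname
    prefix, start, end, suffix = match.groups()
    width = len(start)  # preserve zero padding, e.g. [01-03]
    return [
        f"{prefix}{n:0{width}d}{suffix}"
        for n in range(int(start), int(end) + 1)
    ]


hosts = expand_hosts("dbproxy10[22-27]")
print(hosts)
```

The sketch does not handle character-class forms such as dbproxy102[4567], which also appears later in this task.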

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

dbproxy1022:
  • - receive in system on procurement task 325227 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
dbproxy1023:
  • - receive in system on procurement task 325227 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
dbproxy1024:
  • - receive in system on procurement task 325227 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
dbproxy1025:
  • - receive in system on procurement task 325227 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
dbproxy1026:
  • - receive in system on procurement task 325227 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
dbproxy1027:
  • - receive in system on procurement task 325227 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::data_persistence
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.

Event Timeline

Change 897454 merged by Marostegui:

[operations/puppet@production] dbproxy10[22-27]: Add hosts

https://gerrit.wikimedia.org/r/897454

All the puppet patches needed are done.

dbproxy1022. A6. U10. PORT.9 CABLEID 1036
dbproxy1023. B6. U7. PORT.4 CABLEID 1273
dbproxy1024. C6. U28. PORT.28 CABLEID 3250
dbproxy1025. D6. U34. PORT.34 CABLEID 3754
dbproxy1026. E1. U38. PORT.42 CABLEID 23000040
dbproxy1027. F1. U24. PORT.42 CABLEID 2013339101884
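
The racking proposal asked for the hosts to be in different rows. A quick sketch of how the positions above could be checked against that request, assuming the row is the first letter of the rack name (A6 is row A); shared_rows is a hypothetical helper, not part of any Netbox tooling:

```python
# Rack positions as posted in this task (hostname -> rack).
racks = {
    "dbproxy1022": "A6",
    "dbproxy1023": "B6",
    "dbproxy1024": "C6",
    "dbproxy1025": "D6",
    "dbproxy1026": "E1",
    "dbproxy1027": "F1",
}


def shared_rows(rack_by_host: dict[str, str]) -> dict[str, list[str]]:
    """Return the rows that hold more than one host.

    Assumes the row is the first character of the rack name.
    """
    by_row: dict[str, list[str]] = {}
    for host, rack in sorted(rack_by_host.items()):
        by_row.setdefault(rack[0], []).append(host)
    return {row: hosts for row, hosts in by_row.items() if len(hosts) > 1}


conflicts = shared_rows(racks)
print(conflicts or "every host is in its own row")
```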

Any ETA on when these will be installed? Thanks!

@Volans I am having issues with the provisioning script on all servers right now; it is not limited to the servers on this ticket. If you have time this week, can we work through this together? I am trying to avoid skipping steps.

sudo secure-cookbook sre.hosts.provision dbproxy1023

Testing Redfish API connection to dbproxy1023 (10.65.1.8)
Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7ff55f390550>, 'Connection to 10.65.1.8 timed out. (connect timeout=10)')': /redfish
Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7ff55f3906a0>, 'Connection to 10.65.1.8 timed out. (connect timeout=10)')': /redfish
Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7ff55f3904c0>, 'Connection to 10.65.1.8 timed out. (connect timeout=10)')': /redfish
Failed to run cookbooks.sre.hosts.provision.ProvisionRunner.run.<locals>.check_connection: Unable to connect to the Redfish API of dbproxy1023. Follow https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Troubleshooting_2
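
The ConnectTimeoutError lines above show the connection to the management IP timing out before any Redfish request is answered. A minimal sketch of a plain TCP reachability check that could help separate a network problem from a Redfish problem. It assumes the management interface listens on the default HTTPS port 443; mgmt_reachable and its parameters are hypothetical and not part of the provision cookbook:

```python
import socket


def mgmt_reachable(ip: str, port: int = 443, timeout: float = 10.0,
                   attempts: int = 3,
                   connect=socket.create_connection) -> bool:
    """Return True if a TCP connection to ip:port succeeds within `attempts` tries.

    `connect` is injectable so the function can be exercised without a network.
    """
    for _ in range(attempts):
        try:
            conn = connect((ip, port), timeout=timeout)
        except OSError:  # covers timeouts and refused/unreachable errors
            continue
        conn.close()
        return True
    return False
```

Running mgmt_reachable("10.65.1.8") from a cumin host would repeat the connection attempt seen in the log with the same 10 second timeout. A False result points at the network path (for example the missing DHCP traffic discussed below) rather than at the Redfish API itself.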

@Jclark-ctr that's weird, I've opened T337345 as I don't see any DHCP traffic at all.

@Jclark-ctr the DHCP traffic is back to the install servers (see the related task for more details). For now with a workaround but netops are looking for a permanent fix. This should unblock you. Let me know if you encounter any issue.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors:

  • dbproxy1022 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors:

  • dbproxy1023 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors:

  • dbproxy1023 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors:

  • dbproxy1022 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1027.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1026.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1026.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1027.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors:

  • dbproxy1022 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

@Papaul having issues imaging servers dbproxy1022,dbproxy1023,dbproxy1026,dbproxy1027

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 212, in run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 564, in run
    self._install_os()
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 352, in _install_os
    self.remote_installer.wait_reboot_since(pxe_reboot_time, print_progress_bars=False)
  File "/usr/lib/python3/dist-packages/wmflib/decorators.py", line 210, in wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 556, in wait_reboot_since
    raise RemoteCheckError(
spicerack.remote.RemoteCheckError: Reboot for dbproxy1022.eqiad.wmnet not found yet, keep polling for it: unable to get uptime

dbproxy1024,dbproxy1025 have same error

ATTENTION: destructive action for host: dbproxy1024
Are you sure to proceed?
Type "go" to proceed or "abort" to interrupt the execution
go
User input is: "go"
Management Password:
Exception raised while initializing the Cookbook sre.hosts.reimage:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 197, in run
    runner = self.instance.get_runner(args)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 92, in get_runner
    return ReimageRunner(args, self.spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 192, in __init__
    self.dhcp_config = self._get_dhcp_config_opt82()
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 326, in _get_dhcp_config_opt82
    vlan=switch_iface.untagged_vlan.name,
AttributeError: 'NoneType' object has no attribute 'name'
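
The AttributeError above means switch_iface.untagged_vlan was None when the cookbook read its name, which suggests the switch port for this host had no untagged VLAN assigned at that point. A minimal sketch of a guard that would turn this into a clearer error. Vlan and SwitchIface are hypothetical stand-ins for the Netbox objects, and vlan_name_or_fail is not the cookbook's actual code:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Vlan:  # hypothetical stand-in for a Netbox VLAN record
    name: str


@dataclass
class SwitchIface:  # hypothetical stand-in for a Netbox switch interface
    name: str
    untagged_vlan: Optional[Vlan] = None


def vlan_name_or_fail(iface: SwitchIface) -> str:
    """Return the untagged VLAN name, or raise an error naming the likely cause."""
    if iface.untagged_vlan is None:
        raise RuntimeError(
            f"Switch interface {iface.name} has no untagged VLAN; "
            "check the host's network port setup"
        )
    return iface.untagged_vlan.name
```

The design choice here is to fail early with a message that points at the network port setup step, instead of letting the attribute access fail deep inside the DHCP configuration code.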

Additionally, the servers in rows E/F are not posting their failed runs to the ticket.

@Jclark-ctr

spicerack.remote.RemoteCheckError: Reboot for dbproxy1022.eqiad.wmnet not found yet, keep polling for it: unable to get uptime

When you launch the reimage cookbook on one of those servers and log in to the server console, what do you see? Is it PXE booting? Is it trying to load the Debian installer?

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1027.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1027.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors:

  • dbproxy1022 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye

@Jclark-ctr I took a quick look at dbproxy1022: the server is connected using the second NIC rather than the first, which is why it is not PXE booting. You can check the other ones and fix them; after that you should be good.

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors:

  • dbproxy1022 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors:

  • dbproxy1022 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors:

  • dbproxy1023 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors:

  • dbproxy1022 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1027.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors:

  • dbproxy1022 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors:

  • dbproxy1023 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors:

  • dbproxy1022 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye completed:

  • dbproxy1022 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306201913_robh_2986932_dbproxy1022.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Ok, I figured out the issue on the installations here. The first round of R450s mistakenly came with raid controllers (this has since been corrected in the standard configs for future orders). This batch of hosts needs to have the following done before the OS will image properly, and I did this already on dbproxy1022:

  • reboot into raid bios, clear config, setup 2 virtual raid0 disks, one for each disk into its own raid0
    • set the boot disk in raid bios after setting up the VDs
  • reboot into bios, set the boot order to raid then NIC (won't see the raid disks until reboot)
  • exit bios, this host can now have the reimage script fired

Once I fired the script, I had a failure for LVM existing (due to the previous failed install), but simply re-PXE booting the host fixed that (as the second firing cleared the LVM data properly), and then the host reimaged with the script just fine.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye

fix applied to dbproxy102[4567]

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1027.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1026.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors:

  • dbproxy1023 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1027.eqiad.wmnet with OS bullseye completed:

  • dbproxy1027 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306202029_jclark_3006028_dbproxy1027.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1026.eqiad.wmnet with OS bullseye completed:

  • dbproxy1026 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306202031_jclark_3006054_dbproxy1026.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors:

  • dbproxy1023 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors:

  • dbproxy1023 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors:

  • dbproxy1023 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Nice catch Rob!!

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors:

  • dbproxy1023 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1025.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dbproxy1024.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dbproxy1024.eqiad.wmnet with OS bullseye completed:

  • dbproxy1024 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306211407_robh_3214464_dbproxy1024.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

dbproxy1024's network settings were a bit off, so rather than try to figure out why, I just dumped the primary interface out and re-ran the netbox network provision script, DNS cookbook, and network port cookbook, which resolved the issue. Install completed; checklist updated for dbproxy1024.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1025.eqiad.wmnet with OS bullseye completed:

  • dbproxy1025 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306211404_jclark_3214369_dbproxy1025.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Failed to run the sre.puppet.sync-netbox-hiera cookbook, run it manually

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors:

  • dbproxy1023 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors:

  • dbproxy1023 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye completed:

  • dbproxy1023 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306212352_jclark_3329835_dbproxy1023.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1025.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1025.eqiad.wmnet with OS bullseye completed:

  • dbproxy1025 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306220025_jclark_3338305_dbproxy1025.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
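
With this many reimage attempts in the timeline, the final state of each host is easy to lose track of. A minimal sketch that reads the cookbook result lines (such as "• dbproxy1022 (FAIL)") in order and keeps the most recent outcome per host. latest_status is a hypothetical helper, and the sample lines are copied from this task:

```python
import re

# Matches result lines such as "  • dbproxy1022 (FAIL)".
RESULT = re.compile(r"^\s*•\s+(\S+)\s+\((PASS|FAIL|WARN)\)\s*$")


def latest_status(lines):
    """Map each host to the outcome of its most recent reimage attempt."""
    status = {}
    for line in lines:
        match = RESULT.match(line)
        if match:
            host, outcome = match.groups()
            status[host] = outcome  # later attempts overwrite earlier ones
    return status


sample = [
    "  • dbproxy1025 (WARN)",
    "    • Failed to run the sre.puppet.sync-netbox-hiera cookbook, run it manually",
    "  • dbproxy1023 (FAIL)",
    "  • dbproxy1023 (PASS)",
    "  • dbproxy1025 (PASS)",
]
print(latest_status(sample))
```

Only the per-host summary lines are matched; the indented step lines under each result are ignored.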
Jclark-ctr updated the task description.