⚓ T326346 Q4:rack/setup/install dbproxy10[22-27].

	Subject	Repo	Branch	Lines +/-
	dbproxy10[22-27]: Add hosts	operations/puppet	production	+12 -0
	site.pp: Add dbproxy10[22-27] insetup	operations/puppet	production	+5 -0

Status	Subtype	Assigned	Task
			Unknown Object (Task)
Resolved		Jclark-ctr	T326346 Q4:rack/setup/install dbproxy10[22-27].
Resolved		Marostegui	T337812 Productionize dbproxy10[22-27]
Resolved		Marostegui	T340003 Remove IPv6 from dbproxy10[22-27]
Resolved		Marostegui	T341121 Decommission dbproxy10[12-17]
Resolved	Request	Jclark-ctr	T341510 decommission dbproxy1012.eqiad.wmnet
Resolved	Request	Jclark-ctr	T341711 decommission dbproxy1013.eqiad.wmnet
Resolved	Request	Jclark-ctr	T341782 decommission dbproxy1014.eqiad.wmnet
Resolved	Request	Jclark-ctr	T342103 decommission dbproxy1015.eqiad.wmnet
Resolved	Request	Jclark-ctr	T348956 decommission dbproxy1017.eqiad.wmnet

Change 897454 merged by Marostegui:

[operations/puppet@production] dbproxy10[22-27]: Add hosts

https://gerrit.wikimedia.org/r/897454

All the puppet patches needed are done.

Maintenance_bot removed a project: Patch-For-Review. Mar 13 2023, 7:10 AM

Marostegui updated the task description. Apr 14 2023, 12:03 PM

Jclark-ctr claimed this task. Apr 21 2023, 12:28 AM

Jclark-ctr updated the task description.

Host	Rack	U	Port	Cable ID
dbproxy1022	A6	10	9	1036
dbproxy1023	B6	7	4	1273
dbproxy1024	C6	28	28	3250
dbproxy1025	D6	34	34	3754
dbproxy1026	E1	38	42	23000040
dbproxy1027	F1	24	42	2013339101884

Jclark-ctr moved this task from Racking Tasks to Remote Work on the ops-eqiad board. May 1 2023, 5:13 PM

Jclark-ctr updated the task description. May 1 2023, 7:56 PM

Any ETA on when these will be installed? Thanks!

Jclark-ctr updated the task description. May 23 2023, 7:38 PM

@Volans I am having issues with the provisioning script on all servers right now; it is not limited to the servers on this ticket. If you have time this week, can we work through this? I am trying to avoid skipping steps.

sudo secure-cookbook sre.hosts.provision dbproxy1023

Testing Redfish API connection to dbproxy1023 (10.65.1.8)
Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7ff55f390550>, 'Connection to 10.65.1.8 timed out. (connect timeout=10)')': /redfish
Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7ff55f3906a0>, 'Connection to 10.65.1.8 timed out. (connect timeout=10)')': /redfish
Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7ff55f3904c0>, 'Connection to 10.65.1.8 timed out. (connect timeout=10)')': /redfish
Failed to run cookbooks.sre.hosts.provision.ProvisionRunner.run.<locals>.check_connection: Unable to connect to the Redfish API of dbproxy1023. Follow https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Troubleshooting_2
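
The output above shows the Redfish check giving up after repeated connect timeouts against `/redfish`. As a rough way to reproduce that kind of check by hand, the sketch below probes the same path a few times. It is not the cookbook's implementation: the function name `redfish_reachable`, its parameters, and the choice to skip certificate verification are all assumptions made for illustration.

```python
import ssl
import urllib.error
import urllib.request


def redfish_reachable(address, attempts=3, timeout=10, opener=urllib.request.urlopen):
    """Return True if https://<address>/redfish answers within `attempts` tries.

    Hypothetical helper: `opener` is injectable so the logic can be exercised
    without touching the network.
    """
    # Assumption: the management interface presents a self-signed certificate,
    # so verification is disabled for this connectivity probe only.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE

    url = f"https://{address}/redfish"
    for remaining in range(attempts - 1, -1, -1):
        try:
            with opener(url, timeout=timeout, context=context):
                return True
        except urllib.error.HTTPError:
            # An HTTP-level error (for example 401) still proves the
            # endpoint is reachable over the network.
            return True
        except OSError as error:
            # Covers connect timeouts, refused connections and URLError.
            print(f"{url}: {error} ({remaining} attempts left)")
    return False
```

A call such as `redfish_reachable("10.65.1.8")` returning `False` would point at a network-level problem (as with the missing DHCP traffic found later in this task) rather than at the provisioning script itself.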

@Jclark-ctr that's weird, I've opened T337345 as I don't see any DHCP traffic at all.

@Jclark-ctr the DHCP traffic is reaching the install servers again (see the related task for more details). For now this relies on a workaround, but netops are looking for a permanent fix. This should unblock you. Let me know if you encounter any issues.

Jclark-ctr updated the task description. May 24 2023, 6:50 PM

Jclark-ctr updated the task description. May 25 2023, 2:30 PM

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors:

dbproxy1022 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors:

dbproxy1023 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Marostegui changed the status of subtask T337812: Productionize dbproxy10[22-27] from Open to Stalled. May 31 2023, 5:43 AM

Marostegui mentioned this in T337812: Productionize dbproxy10[22-27].

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors:

dbproxy1023 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors:

dbproxy1022 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1027.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1026.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1027.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors:

dbproxy1022 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

@Papaul I am having issues imaging servers dbproxy1022, dbproxy1023, dbproxy1026 and dbproxy1027:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 212, in run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 564, in run
    self._install_os()
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 352, in _install_os
    self.remote_installer.wait_reboot_since(pxe_reboot_time, print_progress_bars=False)
  File "/usr/lib/python3/dist-packages/wmflib/decorators.py", line 210, in wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 556, in wait_reboot_since
    raise RemoteCheckError(
spicerack.remote.RemoteCheckError: Reboot for dbproxy1022.eqiad.wmnet not found yet, keep polling for it: unable to get uptime
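
The traceback ends in a polling check that never saw the host come back after the PXE reboot. To illustrate the general idea of such a check (poll the host's uptime and accept only a boot time newer than the reboot request), here is a minimal sketch. It is not spicerack's code: `poll_for_reboot`, `RebootNotFound` and the injected `get_uptime` callable are hypothetical names, and a real check would obtain the uptime over SSH.

```python
import time
from datetime import datetime, timedelta, timezone


class RebootNotFound(Exception):
    """Raised when no boot newer than the reference time was observed."""


def poll_for_reboot(since, get_uptime, tries=5, delay=10.0, sleep=time.sleep):
    """Return the host's boot time once it is later than `since`.

    `get_uptime` is a hypothetical callable returning the uptime in seconds
    and raising OSError while the host is unreachable.
    """
    last_problem = "no attempts made"
    for attempt in range(1, tries + 1):
        try:
            uptime_seconds = get_uptime()
        except OSError as error:
            # Host not reachable yet, e.g. still rebooting or never PXE booted.
            last_problem = f"unable to get uptime ({error})"
        else:
            boot_time = datetime.now(timezone.utc) - timedelta(seconds=uptime_seconds)
            if boot_time > since:
                return boot_time
            last_problem = f"last boot at {boot_time.isoformat()} predates the reboot request"
        if attempt < tries:
            sleep(delay)
    raise RebootNotFound(f"Reboot not found after {tries} tries: {last_problem}")
```

With a host that never reaches the installer, every attempt takes the `OSError` branch and the final message reports the uptime as unavailable, which is consistent with the error text in the traceback.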

dbproxy1024,dbproxy1025 have same error

ATTENTION: destructive action for host: dbproxy1024
Are you sure to proceed?
Type "go" to proceed or "abort" to interrupt the execution
go
User input is: "go"
Management Password:
Exception raised while initializing the Cookbook sre.hosts.reimage:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 197, in run
    runner = self.instance.get_runner(args)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 92, in get_runner
    return ReimageRunner(args, self.spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 192, in __init__
    self.dhcp_config = self._get_dhcp_config_opt82()
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 326, in _get_dhcp_config_opt82
    vlan=switch_iface.untagged_vlan.name,
AttributeError: 'NoneType' object has no attribute 'name'
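
The `AttributeError` indicates that `switch_iface.untagged_vlan` was `None` when the DHCP configuration was being built, i.e. the switch interface recorded for the host had no untagged VLAN. The sketch below shows one way such a lookup could fail with a more descriptive message. The classes, the helper `vlan_name_for` and the sample interface and VLAN names are hypothetical stand-ins, not the cookbook's or Netbox's actual objects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Vlan:
    name: str


@dataclass
class SwitchInterface:
    name: str
    # None models an interface that has no untagged VLAN assigned yet.
    untagged_vlan: Optional[Vlan] = None


def vlan_name_for(switch_iface: SwitchInterface) -> str:
    """Return the untagged VLAN name, or raise a descriptive error if unset."""
    if switch_iface.untagged_vlan is None:
        raise ValueError(
            f"Switch interface {switch_iface.name} has no untagged VLAN; "
            "check the host's network configuration before reimaging"
        )
    return switch_iface.untagged_vlan.name


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    print(vlan_name_for(SwitchInterface("port-28", Vlan("example-vlan"))))
```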

Additionally, the servers in rows E/F are not posting their failures to the ticket.

@Jclark-ctr

spicerack.remote.RemoteCheckError: Reboot for dbproxy1022.eqiad.wmnet not found yet, keep polling for it: unable to get uptime
When you launch the reimage cookbook on one of those servers and log in to the server console, what do you see? Is it PXE booting? Is it trying to load the Debian installer?

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1027.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors:

dbproxy1022 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye

@Jclark-ctr I took a quick look at dbproxy1022: the server is connected using the second NIC and not the first NIC, which is why it is not PXE booting. You can check the other ones and fix them. After that you should be good.

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors:

dbproxy1022 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors:

dbproxy1022 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors:

dbproxy1023 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors:

dbproxy1022 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1027.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors:

dbproxy1022 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors:

dbproxy1023 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors:

dbproxy1022 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye completed:

dbproxy1022 (PASS)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306201913_robh_2986932_dbproxy1022.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> active
- The sre.puppet.sync-netbox-hiera cookbook was run successfully

OK, I figured out the issue with the installations here. The first round of R450s mistakenly came with RAID controllers (this has since been corrected in the standard configs for future orders). This batch of hosts needs the following done before the OS will image properly; I have already done this on dbproxy1022:

- Reboot into the RAID BIOS, clear the config, and set up 2 virtual RAID0 disks, one for each disk in its own RAID0.
- Set the boot disk in the RAID BIOS after setting up the VDs.
- Reboot into the BIOS and set the boot order to RAID, then NIC (the RAID disks won't show up until after a reboot).
- Exit the BIOS; the host can now have the reimage script fired.

Once I fired the script, I had a failure because LVM already existed (due to the previous failed install), but simply re-PXE booting the host fixed that (the second firing cleared the LVM data properly), and then the host reimaged with the script just fine.

RobH updated the task description. Jun 20 2023, 7:33 PM

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye

In reply to T326346#8950646 (@RobH's RAID controller fix):

fix applied to dbproxy102[4567]
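
This task uses two shorthand notations for groups of hosts: a numeric range (`dbproxy10[22-27]`) and a glob-style character set (`dbproxy102[4567]`). A minimal sketch of expanding both follows, assuming a single bracket group per name. `expand_hosts` is a hypothetical helper, not part of cumin or spicerack, and cluster tools may interpret `[4567]` differently.

```python
import re


def expand_hosts(pattern):
    """Expand one bracket group in a host pattern into a list of hostnames.

    "[22-27]" is read as an inclusive numeric range; "[4567]" is read as a
    set of single characters, as in a shell glob. Both readings are
    assumptions based on how the notation is used in this task.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        # No bracket group: the pattern is already a single hostname.
        return [pattern]
    prefix, body, suffix = match.groups()
    if "-" in body:
        start, end = body.split("-", 1)
        width = len(start)  # keep zero padding such as [08-11]
        items = [str(number).zfill(width) for number in range(int(start), int(end) + 1)]
    else:
        items = list(body)
    return [f"{prefix}{item}{suffix}" for item in items]


if __name__ == "__main__":
    print(expand_hosts("dbproxy10[22-27]"))
    print(expand_hosts("dbproxy102[4567]"))
```

Under this reading, `dbproxy102[4567]` covers dbproxy1024 through dbproxy1027.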

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1027.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1026.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors:

dbproxy1023 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1027.eqiad.wmnet with OS bullseye completed:

dbproxy1027 (PASS)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306202029_jclark_3006028_dbproxy1027.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> active
- The sre.puppet.sync-netbox-hiera cookbook was run successfully
- Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1026.eqiad.wmnet with OS bullseye completed:

dbproxy1026 (PASS)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306202031_jclark_3006054_dbproxy1026.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> active
- The sre.puppet.sync-netbox-hiera cookbook was run successfully
- Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors:

dbproxy1023 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye

Jclark-ctr updated the task description. Jun 20 2023, 9:27 PM

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors:

dbproxy1023 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors:

dbproxy1023 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

In reply to T326346#8950646 (@RobH's RAID controller fix):

Nice catch Rob!!

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors:

dbproxy1023 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1025.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dbproxy1024.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dbproxy1024.eqiad.wmnet with OS bullseye completed:

dbproxy1024 (PASS)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306211407_robh_3214464_dbproxy1024.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> active
- The sre.puppet.sync-netbox-hiera cookbook was run successfully

dbproxy1024's network settings were a bit off, so rather than try to figure out why, I just dumped the primary interface and re-ran the Netbox network provision script, the DNS cookbook, and the network port cookbook, which resolved the issue. Install completed; checklist updated for dbproxy1024.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1025.eqiad.wmnet with OS bullseye completed:

dbproxy1025 (WARN)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306211404_jclark_3214369_dbproxy1025.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> active
- Failed to run the sre.puppet.sync-netbox-hiera cookbook, run it manually

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors:

dbproxy1023 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors:

dbproxy1023 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye completed:

dbproxy1023 (PASS)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306212352_jclark_3329835_dbproxy1023.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> active
- The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1025.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1025.eqiad.wmnet with OS bullseye completed:

dbproxy1025 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306220025_jclark_3338305_dbproxy1025.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Jclark-ctr closed this task as Resolved. Jun 22 2023, 12:41 AM

Jclark-ctr updated the task description.

Marostegui changed the status of subtask T337812: Productionize dbproxy10[22-27] from Stalled to Open. Jun 22 2023, 6:50 AM

Marostegui closed subtask T337812: Productionize dbproxy10[22-27] as Resolved. Jun 29 2023, 5:29 AM

Q4:rack/setup/install dbproxy10[22-27].
Closed, Resolved (Public)

Description

Hostname / Racking / Installation Details

Per host setup checklist

dbproxy1022:

dbproxy1023:

dbproxy1024:

dbproxy1025:

dbproxy1026:

dbproxy1027:

Event Timeline

	RobH
	Jan 5 2023, 6:55 PM
