⚓ T343124 Migrate WDQS and WCQS servers to Debian Bullseye

Status	Assigned	Task
Open	None	T291916 Tracking task for Bullseye migrations in production
Resolved	Gehel	T323921 [Epic] Migrate all Search Platform servers to Debian Bullseye
Resolved	bking	T343124 Migrate WDQS and WCQS servers to Debian Bullseye
Resolved	Papaul	T344518 hw troubleshooting: wdqs1010 unreachable from SSH or DRAC

Restricted Application added a subscriber: Aklapper. Jul 31 2023, 12:55 PM

bking subscribed. Aug 4 2023, 2:06 PM

bking claimed this task. Aug 8 2023, 3:52 PM

bking moved this task from Incoming to In Progress on the Data-Platform-SRE board.

WCQS is now completely on Bullseye. The next step is to determine which WDQS hosts need to be upgraded (we don't want to bother upgrading hosts that will be retired soon).

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2009.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2008.codfw.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-operations) [2023-08-11T14:53:48Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs[2008-2009].codfw.wmnet with reason: T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-11T14:53:55Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs[2008-2009].codfw.wmnet with reason: T343124

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2008.codfw.wmnet with OS bullseye executed with errors:

wdqs2008 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308111444_bking_1979643_wdqs2008.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2009.codfw.wmnet with OS bullseye executed with errors:

wdqs2009 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308111449_bking_1979813_wdqs2009.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- The reimage failed, see the cookbook logs for the details

Mentioned in SAL (#wikimedia-operations) [2023-08-11T15:23:30Z] <inflatador> bking@deploy1002 'deploying WDQS on newly-reimaged Bullseye hosts T343124'

Mentioned in SAL (#wikimedia-operations) [2023-08-11T15:37:01Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-11T15:37:23Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 22s)

Mentioned in SAL (#wikimedia-operations) [2023-08-11T17:32:44Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-11T17:33:28Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 44s)

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2010.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2010.codfw.wmnet with OS bullseye completed:

wdqs2010 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308111906_bking_2035666_wdqs2010.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2011.codfw.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-operations) [2023-08-11T20:01:51Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-11T20:02:41Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 41s)

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2011.codfw.wmnet with OS bullseye executed with errors:

wdqs2011 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308112008_bking_2048721_wdqs2011.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- The reimage failed, see the cookbook logs for the details

Mentioned in SAL (#wikimedia-operations) [2023-08-11T20:43:22Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-11T20:46:06Z] <bking@deploy1002> deploy aborted: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 02m 44s)

Mentioned in SAL (#wikimedia-operations) [2023-08-11T20:46:09Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-11T20:46:20Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 12s)

Headed out for the weekend; we are exactly halfway:

 sudo cumin A:wdqs-all 'cat /etc/debian_version'
30 hosts will be targeted:
wdqs[2007-2022].codfw.wmnet,wdqs[1003-1016].eqiad.wmnet
===== NODE GROUP =====
(15) wdqs[2007-2011,2013-2022].codfw.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
11.7
===== NODE GROUP =====
(15) wdqs2012.codfw.wmnet,wdqs[1003-1016].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
10.13
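The two node groups above can be tallied mechanically. The following is a minimal sketch, assuming the output format shown above. `expand_hosts` and `hosts_by_version` are hypothetical helper names (they are not part of cumin), and the range expansion only handles the simple bracket syntax seen in this task.

```python
import re


def expand_hosts(expr):
    """Expand an expression like 'wdqs[2007-2011,2013-2022].codfw.wmnet'."""
    hosts = []
    # Split on commas that separate hosts, not on commas inside [...].
    for part in re.split(r",(?![^\[]*\])", expr):
        m = re.fullmatch(r"([^\[\]]*)\[([^\]]+)\]([^\[\]]*)", part)
        if m is None:
            hosts.append(part)  # plain hostname, nothing to expand
            continue
        prefix, ranges, suffix = m.groups()
        for item in ranges.split(","):
            low, _, high = item.partition("-")
            high = high or low  # a single number is a range of one
            for n in range(int(low), int(high) + 1):
                hosts.append(f"{prefix}{n:0{len(low)}d}{suffix}")
    return hosts


def hosts_by_version(output):
    """Map each reported version string to the hosts that returned it."""
    result = {}
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    for i, line in enumerate(lines):
        m = re.fullmatch(r"\((\d+)\)\s+(\S+)", line)
        # A host line is followed by the OUTPUT banner, then the value.
        if m and i + 2 < len(lines) and lines[i + 1].startswith("----- OUTPUT"):
            result.setdefault(lines[i + 2], []).extend(expand_hosts(m.group(2)))
    return result


sample = """\
===== NODE GROUP =====
(15) wdqs[2007-2011,2013-2022].codfw.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
11.7
===== NODE GROUP =====
(15) wdqs2012.codfw.wmnet,wdqs[1003-1016].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
10.13
"""

counts = {version: len(hosts) for version, hosts in hosts_by_version(sample).items()}
print(counts)  # {'11.7': 15, '10.13': 15}
```

With 15 hosts on 11.7 and 15 on 10.13 out of 30 targeted, the count matches the "exactly halfway" remark.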

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2012.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1016.eqiad.wmnet with OS bullseye

wdqs10[03-05] will be decommissioned soon, so we're going to skip those. Work continues on the other hosts...

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1016.eqiad.wmnet with OS bullseye completed:

wdqs1016 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308141419_bking_2766967_wdqs1016.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2012.codfw.wmnet with OS bullseye completed:

wdqs2012 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308141404_bking_2765349_wdqs2012.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-08-14T15:29:29Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-14T15:30:13Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 43s)

Mentioned in SAL (#wikimedia-operations) [2023-08-14T15:45:43Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-14T15:45:59Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 15s)

bking mentioned this in T331300: Ensure WDQS stack works on Bullseye. Aug 14 2023, 9:08 PM

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1015.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1014.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1014.eqiad.wmnet with OS bullseye completed:

wdqs1014 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308151454_bking_3069979_wdqs1014.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1015.eqiad.wmnet with OS bullseye completed:

wdqs1015 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308151452_bking_3069936_wdqs1015.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Mentioned in SAL (#wikimedia-operations) [2023-08-15T15:56:42Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-15T15:56:56Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 14s)

Mentioned in SAL (#wikimedia-operations) [2023-08-15T15:58:03Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-15T15:58:18Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 15s)

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1012.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1013.eqiad.wmnet with OS bullseye

===== NODE GROUP =====
(19) wdqs[2007-2022].codfw.wmnet,wdqs[1014-1016].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
11.7
===== NODE GROUP =====
(8) wdqs[1006-1013].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
10.13

8 hosts left; 2 of those 8 are currently reimaging.

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1013.eqiad.wmnet with OS bullseye completed:

wdqs1013 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308152153_bking_3162616_wdqs1013.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1012.eqiad.wmnet with OS bullseye completed:

wdqs1012 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308152150_bking_3162451_wdqs1012.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-08-16T21:49:33Z] <ryankemper> T343124 [WDQS] Pooled wdqs1012 and wdqs1013 (passing checks after reimage/data transfer)

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1010.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1011.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1010.eqiad.wmnet with OS bullseye executed with errors:

wdqs1010 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- The reimage failed, see the cookbook logs for the details
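The cookbook summaries in this task all share one shape: a `host (STATUS)` header followed by `- ` step lines. When comparing failures, it helps to see how far each run got. The following is a minimal sketch, assuming only that shape; `parse_reimage_summary` is a hypothetical helper and not part of the reimage cookbook.

```python
import re


def parse_reimage_summary(text):
    """Split a cookbook summary into host, status, and last completed step."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    header = re.fullmatch(r"(\S+) \((\w+)\)", lines[0])
    if header is None:
        raise ValueError(f"unexpected header line: {lines[0]!r}")
    steps = [ln[2:] for ln in lines[1:] if ln.startswith("- ")]
    # The closing "The reimage failed..." line is a verdict, not a step.
    completed = [s for s in steps if not s.startswith("The reimage failed")]
    return {
        "host": header.group(1),
        "status": header.group(2),
        "last_completed": completed[-1] if completed else None,
    }


summary = """\
wdqs1010 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- The reimage failed, see the cookbook logs for the details
"""

parsed = parse_reimage_summary(summary)
print(parsed["host"], parsed["status"], "| last step:", parsed["last_completed"])
```

For this run the last completed step is "Host up (Debian installer)", whereas the earlier codfw failures got as far as "Rebooted" on the new OS.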

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1011.eqiad.wmnet with OS bullseye completed:

wdqs1011 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308172058_bking_3729755_wdqs1011.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-08-17T21:40:38Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-17T21:40:54Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 16s)

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1010.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1010.eqiad.wmnet with OS bullseye executed with errors:

wdqs1010 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- The reimage failed, see the cookbook logs for the details

bking added a subtask: T344518: hw troubleshooting: wdqs1010 unreachable from SSH or DRAC. Aug 18 2023, 9:17 PM

Since wdqs1010 is in an unreachable state after an attempted reimage, I'm going to update firmware on wdqs10[06-09] before attempting their reimages (wdqs10[03-05] are already scheduled for refresh, so we won't be reimaging them).

Working my way back through tickets, I found the following firmware recommendations:

iDRAC firmware should be 6.10.00.00 [for bullseye installer]

Will work on this now.

Mentioned in SAL (#wikimedia-operations) [2023-08-21T19:37:23Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs1008.eqiad.wmnet with reason: T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-21T19:37:36Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs1008.eqiad.wmnet with reason: T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-21T19:38:05Z] <inflatador> bking@wdqs1008 'depooling for firmware update T343124'

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1008.eqiad.wmnet with OS bullseye

DRAC firmware updates have been staged on wdqs1006 and wdqs1007. We need to reboot these hosts before we start the reimage, so that the firmware is actually updated.

wdqs1008 is currently running a data-transfer from wdqs1013 after a successful firmware update/reimage.

@Addshore is currently using wdqs1009 to export a JNL (blazegraph data) file, and so we should wait to reimage this host until tomorrow afternoon US time.

Note that wdqs1010 and below have only 1Gbps ethernet, whereas newer hosts have 10Gbps. This means data transfers should take significantly longer. Once wdqs2008 has finished its transfer, it's best that we use it as a source for all the other 1Gbps hosts, so we won't have 10Gbps hosts offline for extended periods of time.
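To put a rough number on "significantly longer", here is a back-of-the-envelope sketch. It assumes the link runs at full line rate with no protocol or disk overhead, and the data size used is a hypothetical placeholder because the task does not state the real one. `transfer_hours` is an illustrative helper, not an existing tool.

```python
def transfer_hours(size_gb, link_gbps):
    """Ideal transfer time in hours (size in decimal GB, link in Gbit/s)."""
    seconds = size_gb * 8 / link_gbps  # 8 bits per byte
    return seconds / 3600


size_gb = 1000  # hypothetical size; the real figure is not given in this task
slow = transfer_hours(size_gb, 1)
fast = transfer_hours(size_gb, 10)
print(f"1 Gbps: {slow:.2f} h, 10 Gbps: {fast:.2f} h, ratio: {slow / fast:.0f}x")
```

Whatever the real size is, the ideal ratio is 10x, which is why tying up a 10Gbps host as the source for a 1Gbps destination keeps it out of service far longer than a transfer between two 10Gbps hosts would.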

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1008.eqiad.wmnet with OS bullseye completed:

wdqs1008 (WARN)
- Downtimed on Icinga/Alertmanager
- Unable to disable Puppet, the host may have been unreachable
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308212036_bking_633052_wdqs1008.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB

bking added a subscriber: Addshore. Aug 21 2023, 9:24 PM

bking mentioned this in T341042: Unmanaged envoyproxy installation on wdqs1009 and wdqs1010. Aug 22 2023, 3:50 PM

Current status:

gehel@cumin1001:~$ sudo cumin 'A:wdqs-all OR A:wcqs-public' 'cat /etc/debian_version'
35 hosts will be targeted:
wcqs[2001-2003].codfw.wmnet,wcqs[1001-1003].eqiad.wmnet,wdqs[2007-2022].codfw.wmnet,wdqs[1003-1009,1011-1016].eqiad.wmnet
OK to proceed on 35 hosts? Enter the number of affected hosts to confirm or "q" to quit: 35
===== NODE GROUP =====
(6) wdqs[1003-1007,1009].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
10.13
===== NODE GROUP =====
(29) wcqs[2001-2003].codfw.wmnet,wcqs[1001-1003].eqiad.wmnet,wdqs[2007-2022].codfw.wmnet,wdqs[1008,1011-1016].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
11.7

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1009.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1009.eqiad.wmnet with OS bullseye executed with errors:

wdqs1009 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308251718_bking_1770347_wdqs1009.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- The reimage failed, see the cookbook logs for the details

Mentioned in SAL (#wikimedia-operations) [2023-08-25T17:39:09Z] <bking@deploy1002> Started deploy [wdqs/wdqs@16e3dcf]: push deploy after bullseye reimage T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-25T17:39:29Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@16e3dcf]: push deploy after bullseye reimage T343124 (duration: 00m 19s)

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1004.eqiad.wmnet with OS bullseye completed:

wdqs1004 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308281915_bking_2593343_wdqs1004.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB

Papaul closed subtask T344518: hw troubleshooting: wdqs1010 unreachable from SSH or DRAC as Resolved. Aug 30 2023, 9:15 PM

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1010.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1010.eqiad.wmnet with OS bullseye completed:

wdqs1010 (WARN)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308311335_bking_2799272_wdqs1010.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
- Updated Netbox status failed -> active
- The sre.puppet.sync-netbox-hiera cookbook was run successfully

lojo_wmde mentioned this in T345425: Move WDQS package off of deprecated OpenJDK Docker Image. Sep 1 2023, 10:50 AM

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1006.eqiad.wmnet with OS bullseye completed:

wdqs1006 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309062112_bking_2451042_wdqs1006.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1007.eqiad.wmnet with OS bullseye completed:

wdqs1007 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309062115_bking_2453784_wdqs1007.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB

bking@cumin1001:~$ sudo cumin A:wdqs-all 'cat /etc/debian_version'
===== NODE GROUP =====
(1) wdqs1003.eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
10.13
===== NODE GROUP =====
(28) wdqs[2007-2022].codfw.wmnet,wdqs[1004,1006-1016].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
11.7
================

We are done (except for wdqs1003, which will be decommissioned instead). Moving to "Done" status...
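
For future checks of this kind, the grouped cumin output above can be condensed to one line per OS version. This is only a sketch: `summarize_versions` is a hypothetical helper name, and it assumes the output keeps the layout pasted above (a "(N) hosts" line, then a "----- OUTPUT" line, then the value).

```shell
# Hypothetical helper: reads cumin-style grouped output on stdin and prints
# "<value> <host count> <host expression>" for each node group.
# Assumes the layout shown in the paste above; any other lines are ignored.
summarize_versions() {
  awk '
    # "(28) wdqs[...]" line: strip the parentheses to get the host count
    /^\([0-9]+\) / { count = substr($1, 2, length($1) - 2); hosts = $2; next }
    # The line after "----- OUTPUT of ... -----" holds the command output
    /^----- OUTPUT/ { want = 1; next }
    want && NF { print $1, count, hosts; want = 0 }
  '
}

# Possible usage (not executed here):
#   sudo cumin A:wdqs-all 'cat /etc/debian_version' | summarize_versions
```

Any group whose first field does not start with 11 is still waiting for a reimage or a decommission.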

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1016.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1016.eqiad.wmnet with OS bullseye completed:

wdqs1016 (WARN)
- Downtimed on Icinga/Alertmanager
- Set pooled=inactive for the following services on confctl:

{"wdqs1016.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wdqs,service=wdqs"}
{"wdqs1016.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wdqs,service=wdqs-heavy-queries"}
{"wdqs1016.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wdqs,service=wdqs-ssl"}

- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309071941_bking_100446_wdqs1016.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Services in confctl are not automatically pooled, to restore the previous state you have to run the following commands:

sudo confctl select 'dc=eqiad,cluster=wdqs,service=wdqs' set/pooled=yes
sudo confctl select 'dc=eqiad,cluster=wdqs,service=wdqs-heavy-queries' set/pooled=yes
sudo confctl select 'dc=eqiad,cluster=wdqs,service=wdqs-ssl' set/pooled=yes

- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
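
The repool commands in the cookbook output follow one pattern per service, so they can be generated rather than retyped. The sketch below is a dry run that only prints the commands and never calls confctl. `print_repool_commands` is a hypothetical helper name, and the selector format is copied from the output above rather than taken from the conftool documentation.

```shell
# Hypothetical dry-run helper: prints one confctl repool command per service.
# Nothing is executed; review the printed lines before running any of them.
print_repool_commands() {
  dc=$1
  cluster=$2
  shift 2
  for service in "$@"; do
    # Selector format mirrors the commands printed by the reimage cookbook
    printf "sudo confctl select 'dc=%s,cluster=%s,service=%s' set/pooled=yes\n" \
      "$dc" "$cluster" "$service"
  done
}

print_repool_commands eqiad wdqs wdqs wdqs-heavy-queries wdqs-ssl
```

With these arguments it prints the same three commands listed in the cookbook output.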

Addshore unsubscribed. Sep 8 2023, 9:30 AM

^^ Last message can be ignored; we reimaged wdqs1016 after a role change. Still done!

RKemper closed this task as Resolved. Sep 18 2023, 10:21 PM
