Description
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T291916 Tracking task for Bullseye migrations in production
Resolved | | Gehel | T323921 [Epic] Migrate all Search Platform servers to Debian Bullseye
Resolved | | bking | T343124 Migrate WDQS and WCQS servers to Debian Bullseye
Resolved | | Papaul | T344518 hw troubleshooting: wdqs1010 unreachable from SSH or DRAC
Event Timeline
WCQS is now completely on Bullseye. The next step is to determine which WDQS hosts need to be upgraded (we don't want to bother upgrading the hosts that'll be retired soon).
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2009.codfw.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2008.codfw.wmnet with OS bullseye
Mentioned in SAL (#wikimedia-operations) [2023-08-11T14:53:48Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs[2008-2009].codfw.wmnet with reason: T343124
Mentioned in SAL (#wikimedia-operations) [2023-08-11T14:53:55Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs[2008-2009].codfw.wmnet with reason: T343124
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2008.codfw.wmnet with OS bullseye executed with errors:
- wdqs2008 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308111444_bking_1979643_wdqs2008.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- The reimage failed, see the cookbook logs for the details
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2009.codfw.wmnet with OS bullseye executed with errors:
- wdqs2009 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308111449_bking_1979813_wdqs2009.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- The reimage failed, see the cookbook logs for the details
Mentioned in SAL (#wikimedia-operations) [2023-08-11T15:23:30Z] <inflatador> bking@deploy1002 'deploying WDQS on newly-reimaged Bullseye hosts T343124'
Mentioned in SAL (#wikimedia-operations) [2023-08-11T15:37:01Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124
Mentioned in SAL (#wikimedia-operations) [2023-08-11T15:37:23Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 22s)
Mentioned in SAL (#wikimedia-operations) [2023-08-11T17:32:44Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124
Mentioned in SAL (#wikimedia-operations) [2023-08-11T17:33:28Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 44s)
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2010.codfw.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2010.codfw.wmnet with OS bullseye completed:
- wdqs2010 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308111906_bking_2035666_wdqs2010.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2011.codfw.wmnet with OS bullseye
Mentioned in SAL (#wikimedia-operations) [2023-08-11T20:01:51Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124
Mentioned in SAL (#wikimedia-operations) [2023-08-11T20:02:41Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 41s)
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2011.codfw.wmnet with OS bullseye executed with errors:
- wdqs2011 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308112008_bking_2048721_wdqs2011.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- The reimage failed, see the cookbook logs for the details
Mentioned in SAL (#wikimedia-operations) [2023-08-11T20:43:22Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124
Mentioned in SAL (#wikimedia-operations) [2023-08-11T20:46:06Z] <bking@deploy1002> deploy aborted: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 02m 44s)
Mentioned in SAL (#wikimedia-operations) [2023-08-11T20:46:09Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124
Mentioned in SAL (#wikimedia-operations) [2023-08-11T20:46:20Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 12s)
Headed out for the weekend, we are exactly halfway:
sudo cumin A:wdqs-all 'cat /etc/debian_version'
30 hosts will be targeted:
wdqs[2007-2022].codfw.wmnet,wdqs[1003-1016].eqiad.wmnet
===== NODE GROUP =====
(15) wdqs[2007-2011,2013-2022].codfw.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
11.7
===== NODE GROUP =====
(15) wdqs2012.codfw.wmnet,wdqs[1003-1016].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
10.13
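The host expressions in the cumin output above use a compact bracket notation (for example `wdqs[2007-2011,2013-2022].codfw.wmnet`). As a minimal sketch, assuming only that simple numeric-range form, the per-version host counts can be double-checked offline. The `expand_hosts` helper is hypothetical and is not part of cumin; the two expressions are copied from the output above.

```python
from __future__ import annotations

import re


def expand_hosts(expr: str) -> list[str]:
    """Expand a bracketed host expression into individual host names.

    Hypothetical helper: handles only one [a-b,c,...] group per host pattern.
    """
    hosts = []
    # Split on commas that are not inside square brackets.
    for part in re.split(r",(?![^\[]*\])", expr):
        part = part.strip()
        match = re.fullmatch(r"([^\[\]]*)\[([^\]]+)\]([^\[\]]*)", part)
        if not match:
            if part:
                hosts.append(part)  # plain host name, nothing to expand
            continue
        prefix, ranges, suffix = match.groups()
        for item in ranges.split(","):
            low, _, high = item.partition("-")
            high = high or low  # a single number such as "1009"
            width = len(low)    # keep any zero padding
            for number in range(int(low), int(high) + 1):
                hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


# Node groups as reported in the cumin output above.
groups = {
    "11.7": "wdqs[2007-2011,2013-2022].codfw.wmnet",
    "10.13": "wdqs2012.codfw.wmnet,wdqs[1003-1016].eqiad.wmnet",
}
counts = {version: len(expand_hosts(expr)) for version, expr in groups.items()}
print(counts)
```

With these inputs both groups expand to 15 hosts, which matches the "exactly halfway" note.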
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2012.codfw.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1016.eqiad.wmnet with OS bullseye
wdqs10[03-05] will be decommissioned soon, so we're going to skip those. Work continues on the other hosts...
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1016.eqiad.wmnet with OS bullseye completed:
- wdqs1016 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308141419_bking_2766967_wdqs1016.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2012.codfw.wmnet with OS bullseye completed:
- wdqs2012 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308141404_bking_2765349_wdqs2012.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
Mentioned in SAL (#wikimedia-operations) [2023-08-14T15:29:29Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124
Mentioned in SAL (#wikimedia-operations) [2023-08-14T15:30:13Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 43s)
Mentioned in SAL (#wikimedia-operations) [2023-08-14T15:45:43Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124
Mentioned in SAL (#wikimedia-operations) [2023-08-14T15:45:59Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 15s)
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1015.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1014.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1014.eqiad.wmnet with OS bullseye completed:
- wdqs1014 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308151454_bking_3069979_wdqs1014.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1015.eqiad.wmnet with OS bullseye completed:
- wdqs1015 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308151452_bking_3069936_wdqs1015.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
Mentioned in SAL (#wikimedia-operations) [2023-08-15T15:56:42Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124
Mentioned in SAL (#wikimedia-operations) [2023-08-15T15:56:56Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 14s)
Mentioned in SAL (#wikimedia-operations) [2023-08-15T15:58:03Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124
Mentioned in SAL (#wikimedia-operations) [2023-08-15T15:58:18Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 15s)
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1012.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1013.eqiad.wmnet with OS bullseye
===== NODE GROUP =====
(19) wdqs[2007-2022].codfw.wmnet,wdqs[1014-1016].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
11.7
===== NODE GROUP =====
(8) wdqs[1006-1013].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
10.13
8 hosts left, 2 of those 8 are currently reimaging.
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1013.eqiad.wmnet with OS bullseye completed:
- wdqs1013 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308152153_bking_3162616_wdqs1013.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1012.eqiad.wmnet with OS bullseye completed:
- wdqs1012 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308152150_bking_3162451_wdqs1012.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
Mentioned in SAL (#wikimedia-operations) [2023-08-16T21:49:33Z] <ryankemper> T343124 [WDQS] Pooled wdqs1012 and wdqs1013 (passing checks after reimage/data transfer)
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1010.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1011.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1010.eqiad.wmnet with OS bullseye executed with errors:
- wdqs1010 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- The reimage failed, see the cookbook logs for the details
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1011.eqiad.wmnet with OS bullseye completed:
- wdqs1011 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308172058_bking_3729755_wdqs1011.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
Mentioned in SAL (#wikimedia-operations) [2023-08-17T21:40:38Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124
Mentioned in SAL (#wikimedia-operations) [2023-08-17T21:40:54Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 16s)
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1010.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1010.eqiad.wmnet with OS bullseye executed with errors:
- wdqs1010 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- The reimage failed, see the cookbook logs for the details
Since wdqs1010 is in an unreachable state after an attempted reimage, I'm going to update firmware on wdqs10[06-09] before attempting their reimages (wdqs10[03-05] are already scheduled for refresh, so we won't be reimaging them).
Working my way back thru tickets, I found the following firmware recommendations:
iDRAC firmware should be 6.10.00.00 [for bullseye installer]
Will work on this now.
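A minimal sketch of how a host's installed iDRAC version could be checked against the 6.10.00.00 recommendation. It treats the recommendation as a minimum, which is an assumption (the note only says the firmware "should be" 6.10.00.00), and the installed versions in the loop are made-up sample values, not readings from these hosts. Comparing as integer tuples avoids the string-comparison trap where "6.9" would sort after "6.10".

```python
from __future__ import annotations


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version such as '6.10.00.00' into (6, 10, 0, 0)."""
    return tuple(int(piece) for piece in text.strip().split("."))


# Recommendation quoted in the task; treated here as a minimum (assumption).
RECOMMENDED_IDRAC = parse_version("6.10.00.00")


def needs_update(installed: str) -> bool:
    """True when the installed version is older than the recommendation."""
    return parse_version(installed) < RECOMMENDED_IDRAC


# Hypothetical installed versions, for illustration only.
for sample in ("5.10.50.00", "6.9.00.00", "6.10.00.00"):
    print(sample, "update needed" if needs_update(sample) else "ok")
```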
Mentioned in SAL (#wikimedia-operations) [2023-08-21T19:37:23Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs1008.eqiad.wmnet with reason: T343124
Mentioned in SAL (#wikimedia-operations) [2023-08-21T19:37:36Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs1008.eqiad.wmnet with reason: T343124
Mentioned in SAL (#wikimedia-operations) [2023-08-21T19:38:05Z] <inflatador> bking@wdqs1008 'depooling for firmware update T343124'
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1008.eqiad.wmnet with OS bullseye
DRAC firmware updates have been staged on wdqs1006 and wdqs1007. We need to reboot these hosts before we start the reimage, so the firmware is actually updated.
wdqs1008 is currently running a data-transfer from wdqs1013 after a successful firmware update/reimage.
@Addshore is currently using wdqs1009 to export a JNL (blazegraph data) file, and so we should wait to reimage this host until tomorrow afternoon US time.
Note that wdqs1010 and below have only 1Gbps ethernet, whereas newer hosts have 10Gbps. This means data transfers involving the older hosts should take significantly longer. Once wdqs2008 has finished its transfer, it's best that we use it as a source for all the other 1Gbps hosts, so we won't have 10Gbps hosts offline for extended periods of time.
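To put the 1Gbps versus 10Gbps difference in rough numbers, here is a back-of-the-envelope sketch. The journal size (`JOURNAL_GIB`) and the link `efficiency` factor are hypothetical placeholders, not figures from this task, and real transfers may be limited by disk or compression rather than by the network.

```python
def transfer_hours(size_gib: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Estimate hours needed to copy size_gib over a link of link_gbps.

    efficiency is an assumed fraction of line rate actually achieved.
    """
    size_bits = size_gib * 1024**3 * 8
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return size_bits / usable_bits_per_second / 3600


JOURNAL_GIB = 1000  # hypothetical data size, NOT taken from the task

for gbps in (1, 10):
    hours = transfer_hours(JOURNAL_GIB, gbps)
    print(f"{gbps:>2} Gbps link: roughly {hours:.1f} h")
```

Whatever the real size is, a network-bound transfer takes about ten times longer on the 1Gbps link, which is the reasoning behind keeping the 10Gbps hosts out of the source role.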
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1008.eqiad.wmnet with OS bullseye completed:
- wdqs1008 (WARN)
- Downtimed on Icinga/Alertmanager
- Unable to disable Puppet, the host may have been unreachable
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308212036_bking_633052_wdqs1008.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
Current status:
gehel@cumin1001:~$ sudo cumin 'A:wdqs-all OR A:wcqs-public' 'cat /etc/debian_version'
35 hosts will be targeted:
wcqs[2001-2003].codfw.wmnet,wcqs[1001-1003].eqiad.wmnet,wdqs[2007-2022].codfw.wmnet,wdqs[1003-1009,1011-1016].eqiad.wmnet
OK to proceed on 35 hosts? Enter the number of affected hosts to confirm or "q" to quit: 35
===== NODE GROUP =====
(6) wdqs[1003-1007,1009].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
10.13
===== NODE GROUP =====
(29) wcqs[2001-2003].codfw.wmnet,wcqs[1001-1003].eqiad.wmnet,wdqs[2007-2022].codfw.wmnet,wdqs[1008,1011-1016].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
11.7
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1009.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1009.eqiad.wmnet with OS bullseye executed with errors:
- wdqs1009 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308251718_bking_1770347_wdqs1009.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- The reimage failed, see the cookbook logs for the details
Mentioned in SAL (#wikimedia-operations) [2023-08-25T17:39:09Z] <bking@deploy1002> Started deploy [wdqs/wdqs@16e3dcf]: push deploy after bullseye reimage T343124
Mentioned in SAL (#wikimedia-operations) [2023-08-25T17:39:29Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@16e3dcf]: push deploy after bullseye reimage T343124 (duration: 00m 19s)
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1004.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1004.eqiad.wmnet with OS bullseye completed:
- wdqs1004 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308281915_bking_2593343_wdqs1004.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1010.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1010.eqiad.wmnet with OS bullseye completed:
- wdqs1010 (WARN)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308311335_bking_2799272_wdqs1010.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
- Updated Netbox status failed -> active
- The sre.puppet.sync-netbox-hiera cookbook was run successfully
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1006.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1007.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1006.eqiad.wmnet with OS bullseye completed:
- wdqs1006 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309062112_bking_2451042_wdqs1006.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1007.eqiad.wmnet with OS bullseye completed:
- wdqs1007 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309062115_bking_2453784_wdqs1007.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
bking@cumin1001:~$ sudo cumin A:wdqs-all 'cat /etc/debian_version'
===== NODE GROUP =====
(1) wdqs1003.eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
10.13
===== NODE GROUP =====
(28) wdqs[2007-2022].codfw.wmnet,wdqs[1004,1006-1016].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
11.7
================
We are done (except for wdqs1003, which will be decommissioned instead). Moving to "Done" status...
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1016.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1016.eqiad.wmnet with OS bullseye completed:
- wdqs1016 (WARN)
- Downtimed on Icinga/Alertmanager
- Set pooled=inactive for the following services on confctl:
{"wdqs1016.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wdqs,service=wdqs"}
{"wdqs1016.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wdqs,service=wdqs-heavy-queries"}
{"wdqs1016.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wdqs,service=wdqs-ssl"}
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309071941_bking_100446_wdqs1016.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Services in confctl are not automatically pooled, to restore the previous state you have to run the following commands:
sudo confctl select 'dc=eqiad,cluster=wdqs,service=wdqs' set/pooled=yes
sudo confctl select 'dc=eqiad,cluster=wdqs,service=wdqs-heavy-queries' set/pooled=yes
sudo confctl select 'dc=eqiad,cluster=wdqs,service=wdqs-ssl' set/pooled=yes
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
^^ Last message can be ignored; we reimaged wdqs1016 after a role change. Still done!