Migrate WDQS and WCQS servers to Debian Bullseye
Closed, Resolved (Public)

Description

Migrate all W[CD]QS servers to Debian Bullseye

Notes:

  • Validation that the stack works on Bullseye was done in T331300
  • The newest servers were already migrated in T328325

AC:

  • all wdqs* and wcqs* servers are running Bullseye

Event Timeline

WCQS is now completely on Bullseye. The next step is to determine which WDQS hosts need to be upgraded (we don't want to bother upgrading hosts that will be retired soon).

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2009.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2008.codfw.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-operations) [2023-08-11T14:53:48Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs[2008-2009].codfw.wmnet with reason: T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-11T14:53:55Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs[2008-2009].codfw.wmnet with reason: T343124

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2008.codfw.wmnet with OS bullseye executed with errors:

  • wdqs2008 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308111444_bking_1979643_wdqs2008.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2009.codfw.wmnet with OS bullseye executed with errors:

  • wdqs2009 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308111449_bking_1979813_wdqs2009.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • The reimage failed, see the cookbook logs for the details

Mentioned in SAL (#wikimedia-operations) [2023-08-11T15:23:30Z] <inflatador> bking@deploy1002 'deploying WDQS on newly-reimaged Bullseye hosts T343124'

Mentioned in SAL (#wikimedia-operations) [2023-08-11T15:37:01Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-11T15:37:23Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 22s)

Mentioned in SAL (#wikimedia-operations) [2023-08-11T17:32:44Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-11T17:33:28Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 44s)

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2010.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2010.codfw.wmnet with OS bullseye completed:

  • wdqs2010 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308111906_bking_2035666_wdqs2010.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2011.codfw.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-operations) [2023-08-11T20:01:51Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-11T20:02:41Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 41s)

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2011.codfw.wmnet with OS bullseye executed with errors:

  • wdqs2011 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308112008_bking_2048721_wdqs2011.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Mentioned in SAL (#wikimedia-operations) [2023-08-11T20:43:22Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-11T20:46:06Z] <bking@deploy1002> deploy aborted: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 02m 44s)

Mentioned in SAL (#wikimedia-operations) [2023-08-11T20:46:09Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-11T20:46:20Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 12s)

Heading out for the weekend; we are exactly halfway:

 sudo cumin A:wdqs-all 'cat /etc/debian_version'
30 hosts will be targeted:
wdqs[2007-2022].codfw.wmnet,wdqs[1003-1016].eqiad.wmnet
===== NODE GROUP =====
(15) wdqs[2007-2011,2013-2022].codfw.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
11.7
===== NODE GROUP =====
(15) wdqs2012.codfw.wmnet,wdqs[1003-1016].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
10.13
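
For tracking progress, grouped output like the above can be turned into a small summary. What follows is only a minimal sketch: the function name `parse_cumin_groups` is made up for illustration, and the parser assumes the output keeps exactly the `NODE GROUP` / `OUTPUT of` layout pasted in this ticket. It is not part of cumin.

```python
import re
from typing import Dict, Tuple

# Sample copied from the cumin output pasted above.
SAMPLE = """===== NODE GROUP =====
(15) wdqs[2007-2011,2013-2022].codfw.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
11.7
===== NODE GROUP =====
(15) wdqs2012.codfw.wmnet,wdqs[1003-1016].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
10.13
"""


def parse_cumin_groups(output: str) -> Dict[str, Tuple[int, str]]:
    """Map each command output (e.g. '11.7') to (host count, host expression).

    Hypothetical helper: assumes one single-line output value per node group.
    """
    groups: Dict[str, Tuple[int, str]] = {}
    lines = [line.strip() for line in output.splitlines()]
    for i, line in enumerate(lines):
        # A group header looks like "(15) wdqs[2007-2011,2013-2022].codfw.wmnet".
        header = re.match(r"^\((\d+)\)\s+(\S+)$", line)
        if not header:
            continue
        # The value is the first non-empty line after the "-----" separator.
        for follow in lines[i + 1:]:
            if follow.startswith("====="):
                break  # next group started without any output line
            if follow and not follow.startswith("-----"):
                groups[follow] = (int(header.group(1)), header.group(2))
                break
    return groups


if __name__ == "__main__":
    for version, (count, hosts) in parse_cumin_groups(SAMPLE).items():
        print(f"Debian {version}: {count} hosts ({hosts})")
```

Keying on the output value means two groups with identical output would overwrite each other; cumin already merges such hosts into one group in the output shown here, so that seemed acceptable for a sketch.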

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2012.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1016.eqiad.wmnet with OS bullseye

wdqs10[03-05] will be decommissioned soon, so we're going to skip those. Work continues on the other hosts...

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1016.eqiad.wmnet with OS bullseye completed:

  • wdqs1016 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308141419_bking_2766967_wdqs1016.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2012.codfw.wmnet with OS bullseye completed:

  • wdqs2012 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308141404_bking_2765349_wdqs2012.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-08-14T15:29:29Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-14T15:30:13Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 43s)

Mentioned in SAL (#wikimedia-operations) [2023-08-14T15:45:43Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-14T15:45:59Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 15s)

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1015.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1014.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1014.eqiad.wmnet with OS bullseye completed:

  • wdqs1014 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308151454_bking_3069979_wdqs1014.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1015.eqiad.wmnet with OS bullseye completed:

  • wdqs1015 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308151452_bking_3069936_wdqs1015.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Mentioned in SAL (#wikimedia-operations) [2023-08-15T15:56:42Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-15T15:56:56Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 14s)

Mentioned in SAL (#wikimedia-operations) [2023-08-15T15:58:03Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-15T15:58:18Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 15s)

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1012.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1013.eqiad.wmnet with OS bullseye

===== NODE GROUP =====
(19) wdqs[2007-2022].codfw.wmnet,wdqs[1014-1016].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
11.7
===== NODE GROUP =====
(8) wdqs[1006-1013].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
10.13

8 hosts left, 2 of those 8 are currently reimaging.
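
To double-check counts like these, the bracketed host expressions can be expanded into individual hostnames. The sketch below is a simplified stand-in written for illustration (the name `expand_hosts` is hypothetical); it handles only one bracket group per comma-separated item and is not the parser cumin itself uses.

```python
import re
from typing import List


def expand_hosts(expression: str) -> List[str]:
    """Expand e.g. 'wdqs[1006-1013].eqiad.wmnet' into individual hostnames.

    Simplified: at most one [..] group per item, numeric ranges only.
    """
    hosts: List[str] = []
    # Split on commas that are not inside a bracket group.
    for item in re.split(r",(?![^\[]*\])", expression):
        match = re.fullmatch(r"([^\[\]]*)\[([^\]]+)\]([^\[\]]*)", item)
        if not match:
            hosts.append(item)  # plain hostname, nothing to expand
            continue
        prefix, ranges, suffix = match.groups()
        for part in ranges.split(","):
            start, _, end = part.partition("-")
            end = end or start  # a single number such as "1008"
            width = len(start)  # preserve zero padding such as "03"
            for number in range(int(start), int(end) + 1):
                hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


if __name__ == "__main__":
    remaining = expand_hosts("wdqs[1006-1013].eqiad.wmnet")
    print(len(remaining), remaining)
```

Run against the Buster group above, this yields the 8 hosts wdqs1006 through wdqs1013.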

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1013.eqiad.wmnet with OS bullseye completed:

  • wdqs1013 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308152153_bking_3162616_wdqs1013.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1012.eqiad.wmnet with OS bullseye completed:

  • wdqs1012 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308152150_bking_3162451_wdqs1012.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-08-16T21:49:33Z] <ryankemper> T343124 [WDQS] Pooled wdqs1012 and wdqs1013 (passing checks after reimage/data transfer)

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1010.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1011.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1010.eqiad.wmnet with OS bullseye executed with errors:

  • wdqs1010 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1011.eqiad.wmnet with OS bullseye completed:

  • wdqs1011 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308172058_bking_3729755_wdqs1011.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-08-17T21:40:38Z] <bking@deploy1002> Started deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-17T21:40:54Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@f1a6177]: deploying WDQS on newly-reimaged Bullseye hosts T343124 (duration: 00m 16s)

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1010.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1010.eqiad.wmnet with OS bullseye executed with errors:

  • wdqs1010 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Since wdqs1010 is in an unreachable state after an attempted reimage, I'm going to update firmware on wdqs10[06-09] before attempting their reimages (wdqs10[03-05] are already scheduled for refresh, so we won't be reimaging them).

Working my way back through tickets, I found the following firmware recommendation:

iDRAC firmware should be 6.10.00.00 [for bullseye installer]

Will work on this now.
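
As a rough illustration of checking a host against that recommendation, here is a minimal sketch of a dotted-version comparison. It assumes that 6.10.00.00 can be treated as a minimum (the recommendation only says "should be"), and the installed versions in the usage lines are invented examples, not values read from any wdqs host.

```python
from typing import Tuple

# Recommendation quoted above; treating it as a minimum is an assumption.
RECOMMENDED_IDRAC_VERSION = "6.10.00.00"


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '6.10.00.00' into (6, 10, 0, 0) so fields compare numerically."""
    return tuple(int(part) for part in version.split("."))


def needs_firmware_update(current: str,
                          recommended: str = RECOMMENDED_IDRAC_VERSION) -> bool:
    """Return True when the installed version is older than the recommended one."""
    return parse_version(current) < parse_version(recommended)


if __name__ == "__main__":
    # Hypothetical installed versions, for illustration only.
    for installed in ("5.10.50.00", "6.9.00.00", "6.10.00.00"):
        print(installed, needs_firmware_update(installed))
```

Comparing integer tuples rather than raw strings matters here: as strings, "6.9.00.00" would sort after "6.10.00.00".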

Mentioned in SAL (#wikimedia-operations) [2023-08-21T19:37:23Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs1008.eqiad.wmnet with reason: T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-21T19:37:36Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs1008.eqiad.wmnet with reason: T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-21T19:38:05Z] <inflatador> bking@wdqs1008 'depooling for firmware update T343124'

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1008.eqiad.wmnet with OS bullseye

DRAC firmware updates have been staged on wdqs1006 and wdqs1007. We need to reboot these hosts before we start the reimage, so that the staged firmware is actually applied.

wdqs1008 is currently running a data-transfer from wdqs1013 after a successful firmware update/reimage.

@Addshore is currently using wdqs1009 to export a JNL (blazegraph data) file, and so we should wait to reimage this host until tomorrow afternoon US time.

Note that wdqs1010 and below have only 1Gbps ethernet, whereas newer hosts have 10Gbps. This means data transfers to or from these hosts should take significantly longer. Once wdqs1008 has finished its transfer, it's best that we use it as the source for all the other 1Gbps hosts, so we won't have 10Gbps hosts offline for extended periods of time.
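
To put a rough number on the link-speed difference, here is a back-of-the-envelope sketch. The 1000 GB size is a made-up placeholder (the actual size of the Blazegraph data file is not stated in this ticket), and the calculation assumes the network link is the only bottleneck.

```python
def transfer_hours(size_gb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Estimate hours needed to copy size_gb gigabytes over a link_gbps link.

    efficiency is the usable fraction of the nominal link rate (assumption:
    1.0 means the full line rate is available, which is optimistic).
    """
    if size_gb < 0 or link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, link > 0, and 0 < efficiency <= 1")
    bits = size_gb * 8e9  # decimal gigabytes to bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    size_gb = 1000  # hypothetical data size, for illustration only
    for link in (1, 10):
        print(f"{link} Gbps: {transfer_hours(size_gb, link):.2f} h")
```

Whatever the real size is, the ratio holds: at the same efficiency, the 1Gbps hosts need about ten times as long as the 10Gbps ones, which is why keeping a 1Gbps host as the transfer source avoids tying up the faster hosts.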

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1008.eqiad.wmnet with OS bullseye completed:

  • wdqs1008 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308212036_bking_633052_wdqs1008.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Current status:

gehel@cumin1001:~$ sudo cumin 'A:wdqs-all OR A:wcqs-public' 'cat /etc/debian_version'
35 hosts will be targeted:
wcqs[2001-2003].codfw.wmnet,wcqs[1001-1003].eqiad.wmnet,wdqs[2007-2022].codfw.wmnet,wdqs[1003-1009,1011-1016].eqiad.wmnet
OK to proceed on 35 hosts? Enter the number of affected hosts to confirm or "q" to quit: 35
===== NODE GROUP =====                                                          
(6) wdqs[1003-1007,1009].eqiad.wmnet                                            
----- OUTPUT of 'cat /etc/debian_version' -----                                 
10.13                                                                           
===== NODE GROUP =====                                                          
(29) wcqs[2001-2003].codfw.wmnet,wcqs[1001-1003].eqiad.wmnet,wdqs[2007-2022].codfw.wmnet,wdqs[1008,1011-1016].eqiad.wmnet                                       
----- OUTPUT of 'cat /etc/debian_version' -----                                 
11.7

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1009.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1009.eqiad.wmnet with OS bullseye executed with errors:

  • wdqs1009 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308251718_bking_1770347_wdqs1009.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Mentioned in SAL (#wikimedia-operations) [2023-08-25T17:39:09Z] <bking@deploy1002> Started deploy [wdqs/wdqs@16e3dcf]: push deploy after bullseye reimage T343124

Mentioned in SAL (#wikimedia-operations) [2023-08-25T17:39:29Z] <bking@deploy1002> Finished deploy [wdqs/wdqs@16e3dcf]: push deploy after bullseye reimage T343124 (duration: 00m 19s)

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1004.eqiad.wmnet with OS bullseye completed:

  • wdqs1004 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308281915_bking_2593343_wdqs1004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1010.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1010.eqiad.wmnet with OS bullseye completed:

  • wdqs1010 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308311335_bking_2799272_wdqs1010.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status failed -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1006.eqiad.wmnet with OS bullseye completed:

  • wdqs1006 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309062112_bking_2451042_wdqs1006.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1007.eqiad.wmnet with OS bullseye completed:

  • wdqs1007 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309062115_bking_2453784_wdqs1007.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

bking@cumin1001:~$ sudo cumin A:wdqs-all 'cat /etc/debian_version'
===== NODE GROUP =====
(1) wdqs1003.eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
10.13
===== NODE GROUP =====
(28) wdqs[2007-2022].codfw.wmnet,wdqs[1004,1006-1016].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
11.7
================

We are done (except for wdqs1003, which will be decommissioned instead). Moving to "Done" status...
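
The check above can also be done mechanically. The following is a minimal sketch, assuming the cumin output keeps exactly the layout pasted above; `parse_groups` and `hosts_not_on` are hypothetical helper names, not part of cumin. It groups hosts by reported Debian version and lists any group that is not on major version 11 (Bullseye).

```python
import re

# Sample copied from the cumin output pasted in this task.
SAMPLE = """\
===== NODE GROUP =====
(1) wdqs1003.eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
10.13
===== NODE GROUP =====
(28) wdqs[2007-2022].codfw.wmnet,wdqs[1004,1006-1016].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
11.7
================
"""


def parse_groups(text):
    """Return a list of (host_count, host_expression, output) tuples."""
    groups = []
    # Everything before the first group header is discarded.
    for block in text.split("===== NODE GROUP =====")[1:]:
        lines = [ln for ln in block.strip().splitlines() if ln.strip()]
        match = re.match(r"\((\d+)\)\s+(\S+)", lines[0])
        # lines[1] is the "----- OUTPUT of ... -----" header; skip it and
        # drop the closing "=====" rule that ends the whole listing.
        output = [ln for ln in lines[2:] if not set(ln) <= {"="}]
        groups.append((int(match.group(1)), match.group(2), "\n".join(output)))
    return groups


def hosts_not_on(major, groups):
    """Return the groups whose Debian major version differs from `major`."""
    return [g for g in groups if g[2].split(".")[0] != str(major)]


if __name__ == "__main__":
    for count, hosts, version in hosts_not_on(11, parse_groups(SAMPLE)):
        print(f"{count} host(s) still on Debian {version}: {hosts}")
```

Run against the output above, only the wdqs1003 group is reported, which matches the exception noted in the comment.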

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs1016.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs1016.eqiad.wmnet with OS bullseye completed:

  • wdqs1016 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Set pooled=inactive for the following services on confctl:

{"wdqs1016.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wdqs,service=wdqs"}
{"wdqs1016.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wdqs,service=wdqs-heavy-queries"}
{"wdqs1016.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wdqs,service=wdqs-ssl"}

    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309071941_bking_100446_wdqs1016.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Services in confctl are not automatically pooled, to restore the previous state you have to run the following commands:

sudo confctl select 'dc=eqiad,cluster=wdqs,service=wdqs' set/pooled=yes
sudo confctl select 'dc=eqiad,cluster=wdqs,service=wdqs-heavy-queries' set/pooled=yes
sudo confctl select 'dc=eqiad,cluster=wdqs,service=wdqs-ssl' set/pooled=yes

    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

^^ Last message can be ignored; we reimaged wdqs1016 after a role change. Still done!
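
For reference, the restore commands in the wdqs1016 log correspond one-to-one with the saved state lines printed under "Set pooled=inactive". The sketch below rebuilds those commands from the JSON lines, assuming each line has exactly the shape shown in the log (one host key plus a "tags" key); `restore_commands` is a hypothetical helper, not part of conftool or the reimage cookbook.

```python
import json

# Saved state lines copied from the wdqs1016 reimage log in this task.
SAVED_STATE = [
    '{"wdqs1016.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wdqs,service=wdqs"}',
    '{"wdqs1016.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wdqs,service=wdqs-heavy-queries"}',
    '{"wdqs1016.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wdqs,service=wdqs-ssl"}',
]


def restore_commands(lines):
    """Build one confctl command per saved state line."""
    commands = []
    for line in lines:
        record = json.loads(line)
        tags = record.pop("tags")
        # The remaining key is the host; its value holds the previous state.
        for state in record.values():
            commands.append(
                f"sudo confctl select '{tags}' set/pooled={state['pooled']}"
            )
    return commands


if __name__ == "__main__":
    print("\n".join(restore_commands(SAVED_STATE)))
```

The selectors are reproduced exactly as the log printed them; check what they match before running any of the commands.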