Reclaim jobrunner hardware for k8s
Open, Needs Triage, Public

Description

Now that we have moved the majority of jobs to k8s, we can take some hardware back from the jobrunner cluster. Based on the graphs, it seems we can reclaim at least 50% of the jobrunner hardware while refreshLinks and cirrusSearchLinksUpdate remain on metal. We should still account for spikes and avoid overwhelming the remaining hosts, so a good starting point is 25%.

At the time of writing, there are 30 jobrunners in codfw and, excluding parse1* hosts, 23 in eqiad. Take care not to repurpose videoscaler-only hosts.

Initial candidates for ~25% migration with rack placement:

  • mw1460 (A8)
  • mw1461 (A8)
  • mw1469 (B6)
  • mw1439 (D8)
  • mw1486 (C5)
  • mw1495 (F)
  • mw2395 (A3)
  • mw2427 (A6)
  • mw2267 (B3)
  • mw2430 (B6)
  • mw2357 (C6)
  • mw2282 (D4) (blocked T355333)
  • mw2446 (D5)
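
As a rough sanity check on the ~25% figure, the sketch below derives the target host count from the fleet sizes quoted above (30 in codfw, 23 in eqiad). The function name and the round-to-nearest choice are assumptions made for illustration; they are not part of any existing tooling.

```python
# Fleet sizes as quoted in the task description.
JOBRUNNERS = {"codfw": 30, "eqiad": 23}


def reclaim_target(fleet: dict[str, int], fraction: float) -> int:
    """Return how many hosts `fraction` of the combined fleet amounts to.

    Hypothetical helper: rounds to the nearest whole host.
    """
    total = sum(fleet.values())
    return round(total * fraction)


if __name__ == "__main__":
    # 25% of 53 hosts is 13.25, which rounds to 13.
    print(reclaim_target(JOBRUNNERS, 0.25))
```

That result lines up with the 13 initial candidates listed above (6 in eqiad, 7 in codfw).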

Further migrations:

  • mw2260 (B3)
  • mw2355 (C6)
  • mw2381 (A3)
  • mw2429 (B6)
  • mw2445 (D5)
  • mw1440 (D8)
  • mw1457 (A8)
  • mw1466 (B6)
  • mw1482 (C5)
  • mw1459 (A8)

Final servers (excluding videoscalers and hardware that is out of warranty):

  • mw1437.eqiad.wmnet (canary)
  • mw1438.eqiad.wmnet (canary)
  • mw1458.eqiad.wmnet
  • mw1467.eqiad.wmnet
  • mw1468.eqiad.wmnet
  • mw1483.eqiad.wmnet
  • mw1484.eqiad.wmnet
  • mw1485.eqiad.wmnet
  • mw1494.eqiad.wmnet
  • mw2351.codfw.wmnet
  • mw2353.codfw.wmnet
  • mw2382.codfw.wmnet
  • mw2394.codfw.wmnet
  • mw2419.codfw.wmnet
  • mw2426.codfw.wmnet
  • mw2428.codfw.wmnet
  • mw2444.codfw.wmnet
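
Since the site is the second label of each FQDN, a list like the one above can be split per datacenter before batching reimages. This is a minimal sketch; `group_by_site` is a hypothetical helper and the sample input is only a subset of the hosts listed.

```python
from collections import defaultdict


def group_by_site(fqdns: list[str]) -> dict[str, list[str]]:
    """Group short hostnames by the site label of their FQDN.

    Assumes names shaped like <host>.<site>.wmnet, as in the list above.
    """
    sites: dict[str, list[str]] = defaultdict(list)
    for fqdn in fqdns:
        host, site, *_ = fqdn.split(".")
        sites[site].append(host)
    return dict(sites)


if __name__ == "__main__":
    sample = [
        "mw1437.eqiad.wmnet",
        "mw1438.eqiad.wmnet",
        "mw2351.codfw.wmnet",
    ]
    for site, hosts in group_by_site(sample).items():
        print(site, hosts)
```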

Event Timeline

Change 992100 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] kubernetes: reclaim eqiad jobrunner hosts

https://gerrit.wikimedia.org/r/992100

Change 992100 merged by Hnowlan:

[operations/puppet@production] kubernetes: reclaim eqiad jobrunner hosts

https://gerrit.wikimedia.org/r/992100

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1486.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1495.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1486.eqiad.wmnet with OS bullseye completed:

  • mw1486 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401221612_hnowlan_3543603_mw1486.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1495.eqiad.wmnet with OS bullseye completed:

  • mw1495 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401221614_hnowlan_3543590_mw1495.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

hnowlan updated the task description.

Change 992679 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] kubernetes: move more jobrunner hosts to workers

https://gerrit.wikimedia.org/r/992679

Change 992679 merged by Hnowlan:

[operations/puppet@production] kubernetes: move more jobrunner hosts to workers

https://gerrit.wikimedia.org/r/992679

Change 992900 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] installserver: fix disk profiles for new k8s workers

https://gerrit.wikimedia.org/r/992900

Change 992900 merged by Hnowlan:

[operations/puppet@production] installserver: fix disk profiles for new k8s workers

https://gerrit.wikimedia.org/r/992900

Change 992973 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] kubernetes: make 5 jobrunners kubernetes workers

https://gerrit.wikimedia.org/r/992973

Change 992973 merged by Hnowlan:

[operations/puppet@production] kubernetes: make 5 jobrunners kubernetes workers

https://gerrit.wikimedia.org/r/992973

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2260.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2355.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2381.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2429.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2445.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2260.codfw.wmnet with OS bullseye completed:

  • mw2260 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401291315_hnowlan_3966495_mw2260.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2381.codfw.wmnet with OS bullseye completed:

  • mw2381 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401291318_hnowlan_3968633_mw2381.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2429.codfw.wmnet with OS bullseye completed:

  • mw2429 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401291320_hnowlan_3969120_mw2429.out
    • Unable to run puppet on config-master2001.codfw.wmnet,config-master1001.eqiad.wmnet to update configmaster.wikimedia.org with the new host SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2445.codfw.wmnet with OS bullseye completed:

  • mw2445 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401291325_hnowlan_3969352_mw2445.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2355.codfw.wmnet with OS bullseye completed:

  • mw2355 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401291329_hnowlan_3966756_mw2355.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 993714 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] kubernetes: make 5 jobrunners kubernetes workers

https://gerrit.wikimedia.org/r/993714

Change 993714 merged by Hnowlan:

[operations/puppet@production] kubernetes: make 5 jobrunners kubernetes workers

https://gerrit.wikimedia.org/r/993714

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1440.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1457.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1482.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1459.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1466.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1459.eqiad.wmnet with OS bullseye completed:

  • mw1459 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401301544_hnowlan_1017982_mw1459.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1482.eqiad.wmnet with OS bullseye completed:

  • mw1482 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401301550_hnowlan_1018528_mw1482.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1440.eqiad.wmnet with OS bullseye completed:

  • mw1440 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401301556_hnowlan_1017853_mw1440.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1457.eqiad.wmnet with OS bullseye completed:

  • mw1457 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401301602_hnowlan_1017895_mw1457.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1466.eqiad.wmnet with OS bullseye completed:

  • mw1466 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401301606_hnowlan_1018697_mw1466.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-02-07T16:46:25Z] <hnowlan> homer 'cr*codfw*' commit 'T354791' for 5 new k8s ex-appservers

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors:

  • mw2282 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors:

  • mw2282 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors:

  • mw2282 (FAIL)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors:

  • mw2282 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors:

  • mw2282 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors:

  • mw2282 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" mw2282.codfw.wmnet to get a root shell, but depending on the failure this may not work.
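
The cookbook summaries in this timeline all carry a `host (STATUS)` line, so repeated outcomes such as the mw2282 failures above can be tallied from pasted text. The sketch below assumes that shape; the regex and the `tally` helper are illustrative and not part of spicerack or the reimage cookbook.

```python
import re
from collections import Counter

# Matches summary lines such as "  • mw2282 (FAIL)"; the pattern is an
# assumption based on the lines visible in this timeline.
SUMMARY_RE = re.compile(
    r"^\s*•\s*(?P<host>mw\d+)\s+\((?P<status>PASS|WARN|FAIL)\)\s*$"
)


def tally(lines: list[str]) -> Counter:
    """Count (host, status) pairs found in cookbook summary lines."""
    counts: Counter = Counter()
    for line in lines:
        match = SUMMARY_RE.match(line)
        if match:
            counts[(match["host"], match["status"])] += 1
    return counts


if __name__ == "__main__":
    pasted = [
        "  • mw2282 (FAIL)",
        "    • Forced PXE for next reboot",
        "  • mw2282 (FAIL)",
        "  • mw1486 (PASS)",
    ]
    for (host, status), count in tally(pasted).items():
        print(host, status, count)
```

Step lines (the indented bullets under each host) do not match the pattern and are skipped.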

Change 1005776 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] kubernetes: move all remaining eligible jobrunners to k8s

https://gerrit.wikimedia.org/r/1005776

Change 1005786 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] kubernetes: move all remaining eligible codfw jobrunners to k8s

https://gerrit.wikimedia.org/r/1005786

Change 1005776 merged by Hnowlan:

[operations/puppet@production] kubernetes: move all remaining eligible eqiad jobrunners to k8s

https://gerrit.wikimedia.org/r/1005776

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1468.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1483.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1484.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1485.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1494.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1483.eqiad.wmnet with OS bullseye completed:

  • mw1483 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402221757_hnowlan_1506700_mw1483.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1468.eqiad.wmnet with OS bullseye completed:

  • mw1468 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402221759_hnowlan_1506726_mw1468.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1484.eqiad.wmnet with OS bullseye completed:

  • mw1484 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402221804_hnowlan_1506704_mw1484.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1494.eqiad.wmnet with OS bullseye completed:

  • mw1494 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402221801_hnowlan_1506779_mw1494.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1485.eqiad.wmnet with OS bullseye completed:

  • mw1485 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402221811_hnowlan_1506709_mw1485.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 1005786 merged by Hnowlan:

[operations/puppet@production] kubernetes: move all remaining eligible codfw jobrunners to k8s

https://gerrit.wikimedia.org/r/1005786

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2351.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2353.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2382.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2394.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2419.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2426.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2428.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2444.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2353.codfw.wmnet with OS bullseye completed:

  • mw2353 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402231234_hnowlan_611874_mw2353.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2444.codfw.wmnet with OS bullseye completed:

  • mw2444 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402231236_hnowlan_612393_mw2444.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2351.codfw.wmnet with OS bullseye completed:

  • mw2351 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402231239_hnowlan_611852_mw2351.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2382.codfw.wmnet with OS bullseye completed:

  • mw2382 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402231241_hnowlan_611890_mw2382.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2426.codfw.wmnet with OS bullseye completed:

  • mw2426 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402231243_hnowlan_611944_mw2426.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2428.codfw.wmnet with OS bullseye completed:

  • mw2428 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402231249_hnowlan_611956_mw2428.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2419.codfw.wmnet with OS bullseye completed:

  • mw2419 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402231246_hnowlan_611928_mw2419.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2394.codfw.wmnet with OS bullseye completed:

  • mw2394 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402231253_hnowlan_611902_mw2394.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)
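
In the reimage summaries above, every host marked WARN also reports "Icinga status is not optimal, downtime not removed", so those hosts still need a manual check before their downtime is lifted. As a rough aid for triaging a batch like this, the following is a minimal sketch that groups hosts by reported status. It assumes the summaries have been pasted into a string in the format shown in this task. The function name `hosts_by_status` and the `SAMPLE` text are illustrative only and are not part of spicerack or the cookbook.

```python
import re
from collections import defaultdict

# Matches the per-host result line, e.g. "  • mw2382 (WARN)".
# Step lines such as "• Host up (Debian installer)" do not match because
# the pattern requires a lowercase host name followed by digits.
RESULT_RE = re.compile(r"^\s*•\s+(?P<host>[a-z]+\d+)\s+\((?P<status>[A-Z]+)\)\s*$")


def hosts_by_status(log_text):
    """Group host names by the status shown in the cookbook summary."""
    grouped = defaultdict(list)
    for line in log_text.splitlines():
        match = RESULT_RE.match(line)
        if match:
            grouped[match.group("status")].append(match.group("host"))
    return dict(grouped)


# Hypothetical excerpt in the same shape as the summaries in this task.
SAMPLE = """\
  • mw2353 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Host up (Debian installer)
    • Icinga status is optimal
  • mw2382 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • Icinga status is not optimal, downtime not removed
"""

if __name__ == "__main__":
    for status, hosts in sorted(hosts_by_status(SAMPLE).items()):
        print(status, ", ".join(hosts))
```

Hosts listed under WARN are the ones to re-check in Icinga by hand. The sketch reads only the summary line for each host and does not inspect the individual steps.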

Mentioned in SAL (#wikimedia-operations) [2024-02-23T14:42:04Z] <hnowlan> running homer 'cr*codfw*' commit 'T354791' for reclaimed codfw jobrunners moving to k8s workers