Page MenuHomePhabricator

Q2:rack/setup/install cloudvirt10[68-76]
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of cloudvirt10[68-74]

Hostname / Racking / Installation Details

Hostnames: cloudvirt10[68-74]
Racking Proposal: WMCS racks. rows E/F are fine

Networking Setup: # of Connections: 1 - Speed: 10G. - VLAN: cloud-hosts1-eqiad
OS Distro: Bookworm
Sub-team Technical Contact: @Andrew @dcaro

Per host setup checklist

cloudvirt1068
  • Receive in system on procurement task T379370 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
cloudvirt1069
  • Receive in system on procurement task T379370 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
cloudvirt1070
  • Receive in system on procurement task T379370 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
cloudvirt1071
  • Receive in system on procurement task T379370 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
cloudvirt1072
  • Receive in syste72m on procurement task T379370 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
cloudvirt1073
  • Receive in system on procurement task T379370 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
cloudvirt1074
  • Receive in system on procurement task T379370 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
cloudvirt1075
  • Receive in system on procurement task T379370 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
cloudvirt1076
  • Receive in system on procurement task T379370 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook

Details

Event Timeline

RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH unsubscribed.

@Andrew,

Two call outs! The original ordering task had a bad hostname range provided by you for racking "Hostnames: cloudvirt1068 - cloudvirt1066" but that makes no sense as it goes down by 2, and even a swap of the numbers isn't accurate. If we order 9 hosts and start with cloudvirt1068 as a new hostname, then its cloudvirt1068 to cloudvirt1074 for a total of 9 hosts.

Additionally,

Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the new servers. This is due to the majority of DC Ops not having root/merge puppet rights.

Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add the new servers to preseed.yml for partition info.

If possible, please reference this task number in your patch set, so it is clear when complete. Once complete, just un-assign yourself (leaving no assignee) for this task and once the hardware arrives on-sites will claim this task for racking and setup. Please don't re-subscribe me to this task unless there is a direct question for me.

Thank you!

RobH added a parent task: Unknown Object (Task).Dec 19 2024, 1:41 PM
RobH mentioned this in Unknown Object (Task).

Change #1105724 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] site + preseed entries for new cloudvirts

https://gerrit.wikimedia.org/r/1105724

Change #1105724 merged by Andrew Bogott:

[operations/puppet@production] site + preseed entries for new cloudvirts

https://gerrit.wikimedia.org/r/1105724

@Andrew,

Two call outs! The original ordering task had a bad hostname range provided by you for racking "Hostnames: cloudvirt1068 - cloudvirt1066" but that makes no sense as it goes down by 2, and even a swap of the numbers isn't accurate. If we order 9 hosts and start with cloudvirt1068 as a new hostname, then its cloudvirt1068 to cloudvirt1074 for a total of 9 hosts.

clearly a copy/paste error, thanks for correcting! Except if I count up by 9 from 1068 I get to 1076 -- I had to count several times because I kept doing base 12 for some reason.

Andrew renamed this task from Q2:rack/setup/install cloudvirt10[68-74] to Q2:rack/setup/install cloudvirt10[68-76].Dec 19 2024, 2:57 PM
Andrew updated the task description. (Show Details)
Andrew updated the task description. (Show Details)
Jclark-ctr subscribed.

@VRiley-WMF procurement T379370 looks like you updated racking ticket Jan 30 2025 have these been Received in coupa / or can the ticket be closed out? What is status of Racking ticket? No updates since you racked them? T382492

cloudvirt1068
E4
U 35
Port 43
CableID 230304500116

cloudvirt1069
E4
U 39
Port 41
CableID 230304500114

cloudvirt1070
E4
U 13
Port 10
CableID 230304500168

cloudvirt1071
E4
U42
Port 39
CableID 230304500115

cloudvirt1072
F4
U39
Port 39
CableID 230304500124

cloudvirt1073
F4
U17
Port 10
CableID 230304500296

cloudvirt1074
F4
U16
Port 0
CableID 2022005

cloudvirt1075
F4
U15
Port 1
CableID 20220012

cloudvirt1076
F4
U14
Port 5
CableID 20220006

@Jclark-ctr Appreciate the follow-up. Like a good fullback making sure nothing slips past. You’re right, I’ve been caught up with a few other items, but I’ll check on the Coupa side today and confirm if everything’s been received so we can close out T379370.

Thanks for keepin’ us on track, teamwork makes the dream work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudvirt1076.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudvirt1076.eqiad.wmnet with OS bookworm completed:

  • cloudvirt1076 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202505132005_jhancock_2595517_cloudvirt1076.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudvirt1075.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudvirt1075.eqiad.wmnet with OS bookworm completed:

  • cloudvirt1075 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202505132116_jhancock_2629230_cloudvirt1075.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Failed to run the sre.puppet.sync-netbox-hiera cookbook, run it manually

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1068.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1068.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1068.eqiad.wmnet with OS bookworm executed with errors:

  • cloudvirt1068 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudvirt1068.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1070.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1068.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1070.eqiad.wmnet with OS bookworm executed with errors:

  • cloudvirt1070 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudvirt1070.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1070.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1069.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1070.eqiad.wmnet with OS bookworm executed with errors:

  • cloudvirt1070 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudvirt1070.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1069.eqiad.wmnet with OS bookworm executed with errors:

  • cloudvirt1069 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudvirt1069.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1070.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1069.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1070.eqiad.wmnet with OS bookworm completed:

  • cloudvirt1070 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202505142110_vriley_2130022_cloudvirt1070.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1069.eqiad.wmnet with OS bookworm completed:

  • cloudvirt1069 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202505142117_vriley_2132564_cloudvirt1069.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1072.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1074.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1071.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1073.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1073.eqiad.wmnet with OS bookworm executed with errors:

  • cloudvirt1073 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudvirt1073.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1073.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1072.eqiad.wmnet with OS bookworm completed:

  • cloudvirt1072 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202505151340_vriley_2253541_cloudvirt1072.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1074.eqiad.wmnet with OS bookworm completed:

  • cloudvirt1074 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202505151343_vriley_2253629_cloudvirt1074.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1073.eqiad.wmnet with OS bookworm executed with errors:

  • cloudvirt1073 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudvirt1073.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1071.eqiad.wmnet with OS bookworm executed with errors:

  • cloudvirt1071 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudvirt1071.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1073.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudvirt1071.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1073.eqiad.wmnet with OS bookworm completed:

  • cloudvirt1073 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202505151437_vriley_2265179_cloudvirt1073.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1071.eqiad.wmnet with OS bookworm completed:

  • cloudvirt1071 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202505151440_vriley_2265364_cloudvirt1071.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
17:32 <+icinga-wm> PROBLEM - ensure kvm processes are running on cloudvirt1073 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64

^ same alerts for all 3 hosts

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudvirt1068.eqiad.wmnet with OS bookworm completed:

  • cloudvirt1068 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202505141950_vriley_2117347_cloudvirt1068.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Failed to run the sre.puppet.sync-netbox-hiera cookbook, run it manually