Q2:rack/setup/install cloudcephosd10[42-47]
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of cloudcephosd10[42-47].

Hostname / Racking / Installation Details

Hostnames: Assign to @Andrew for feedback on racking task, likely cloudcephosd.
Racking Proposal:

3x hosts in D5
2x hosts in C8
1x host in F4

Networking Setup: # of Connections: 2 - Speed: 10G - VLAN: plug into the cloudsw in the rack; 2 x 10G ports per server. Each host should have its primary on cloud-hosts1-eqiad and its secondary on cloud-storage1-eqiad
OS Distro: Bullseye
Sub-team Technical Contact: @Andrew
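
The task title uses the range notation cloudcephosd10[42-47]. As a small illustration only, the sketch below expands that notation into the six fully qualified hostnames used in the checklist. The helper name `expand_host_range` is hypothetical and is not part of any Wikimedia tooling; the default domain is taken from the hostnames in this task.

```python
import re


def expand_host_range(pattern: str, domain: str = "eqiad.wmnet") -> list[str]:
    """Expand 'prefix[NN-MM]' into a list of fully qualified hostnames.

    Hypothetical helper for illustration; only the simple single-range
    form used in this task's title is supported.
    """
    match = re.fullmatch(
        r"(?P<prefix>[a-z0-9-]+)\[(?P<start>\d+)-(?P<end>\d+)\]", pattern
    )
    if match is None:
        raise ValueError(f"unsupported range pattern: {pattern!r}")
    start, end = match["start"], match["end"]
    width = len(start)  # keep any zero padding used in the range
    return [
        f"{match['prefix']}{number:0{width}d}.{domain}"
        for number in range(int(start), int(end) + 1)
    ]


hosts = expand_host_range("cloudcephosd10[42-47]")
for host in hosts:
    print(host)
```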

Per host setup checklist

cloudcephosd1042.eqiad.wmnet:
  • Receive in system on procurement task <T376327> & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned) D5
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
cloudcephosd1043.eqiad.wmnet:
  • Receive in system on procurement task <T376327> & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned) D5
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
cloudcephosd1044.eqiad.wmnet:
  • Receive in system on procurement task <T376327> & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned) C8
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
cloudcephosd1045.eqiad.wmnet:
  • Receive in system on procurement task <T376327> & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned) C8
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
cloudcephosd1046.eqiad.wmnet:
  • Receive in system on procurement task <T376327> & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned) D5
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
cloudcephosd1047.eqiad.wmnet:
  • Receive in system on procurement task <T376327> & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned) F4
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
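
As a quick sanity check, the rack letters at the end of each "Rack system" line in the checklist can be tallied against the racking proposal (3x D5, 2x C8, 1x F4). This is a minimal sketch; the function name `rack_mismatches` and the dictionary layout are assumptions made for illustration, with the data copied from the checklist above.

```python
from collections import Counter

# Racking proposal from the task description.
PROPOSED = {"D5": 3, "C8": 2, "F4": 1}

# Rack noted at the end of each host's "Rack system" checklist line.
CHECKLIST_RACKS = {
    "cloudcephosd1042": "D5",
    "cloudcephosd1043": "D5",
    "cloudcephosd1044": "C8",
    "cloudcephosd1045": "C8",
    "cloudcephosd1046": "D5",
    "cloudcephosd1047": "F4",
}


def rack_mismatches(assignments: dict[str, str], proposal: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Return {rack: (proposed_count, actual_count)} for racks that differ."""
    actual = Counter(assignments.values())
    racks = sorted(set(actual) | set(proposal))
    return {
        rack: (proposal.get(rack, 0), actual[rack])
        for rack in racks
        if proposal.get(rack, 0) != actual[rack]
    }


mismatches = rack_mismatches(CHECKLIST_RACKS, PROPOSED)
print(mismatches or "checklist matches the racking proposal")
```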

Event Timeline

A few things:

  1. Ceph uses a JBOD setup, so we don't want hardware RAID involved at all. All drives should be set to non-RAID.
  2. The partman recipe will make a mirrored RAID of the two small OS drives and leave the other drives untouched for Ceph to deal with.
  3. The partman recipe works very poorly on Bullseye. I will ultimately need things running on Bullseye, but for the current setup you'll have better luck if you use Bookworm; I can revert things to Bullseye myself later.
  4. (Reverting to Bullseye requires wiping the OS drives before each reimage.)

#3 above explains the issue where partman hangs during the initial image if imaging with Bullseye. The other issues (specifically the refusal to boot from the OS after the Debian install completes) are a mystery to me.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1043.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1043.eqiad.wmnet with OS bookworm executed with errors:

  • cloudcephosd1043 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1043.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1043.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1043.eqiad.wmnet with OS bookworm executed with errors:

  • cloudcephosd1043 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1043.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1043.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1043.eqiad.wmnet with OS bookworm completed:

  • cloudcephosd1043 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202508050053_vriley_1476872_cloudcephosd1043.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1042 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1042.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1042 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1042.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bookworm executed with errors:

  • cloudcephosd1042 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1042.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bookworm executed with errors:

  • cloudcephosd1042 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1042.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bookworm executed with errors:

  • cloudcephosd1042 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1042.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

cloudcephosd1043 was able to finish with "bookworm"; however, cloudcephosd1042 is still having issues.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1045.eqiad.wmnet with OS bullseye

Could not connect to cloudcephosd1044; will need to check the management cables. On cloudcephosd1045 it seems to be failing on the 10G cable. Will double-check that.

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1045.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1045 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1045.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bookworm executed with errors:

  • cloudcephosd1042 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1042.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bookworm executed with errors:

  • cloudcephosd1042 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1042.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bookworm completed:

  • cloudcephosd1042 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202508072238_vriley_1695_cloudcephosd1042.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1047.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1045.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1047.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1047.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1045.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1045.eqiad.wmnet with OS bookworm executed with errors:

  • cloudcephosd1045 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1045.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1047.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1047.eqiad.wmnet with OS bookworm executed with errors:

  • cloudcephosd1047 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1047.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1047.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1047.eqiad.wmnet with OS bookworm executed with errors:

  • cloudcephosd1047 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1047.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1047.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1047.eqiad.wmnet with OS bookworm executed with errors:

  • cloudcephosd1047 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1047.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1047.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1045.eqiad.wmnet with OS bookworm

I have been working on cloudcephosd1045 and cloudcephosd1047, and both of those units are giving me a lot of issues. Still troubleshooting them at the moment.

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1047.eqiad.wmnet with OS bookworm executed with errors:

  • cloudcephosd1047 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1047.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1045.eqiad.wmnet with OS bookworm executed with errors:

  • cloudcephosd1045 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1045.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1047.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1047.eqiad.wmnet with OS bookworm completed:

  • cloudcephosd1047 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202508111854_vriley_1172862_cloudcephosd1047.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

After troubleshooting this with @Papaul for a bit, we found that cloudcephosd1045 has a bad port on the NIC, and cloudcephosd1047 was plugged into a switch port that was rated for another speed. After updating the port on cloudcephosd1047, I was able to finish imaging that server.
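
Looking back at the failed runs, the summaries for cloudcephosd1045 and cloudcephosd1047 stop before "Host up (Debian installer)", while the cloudcephosd1042 failures get past it. A rough way to sort failures by how far they got is sketched below. The stage labels and the function name `failure_stage` are my own and purely illustrative; the idea that a failure before the installer points at the network path is an assumption that merely fits what was found here, not a rule from the cookbook.

```python
INSTALLER_MARKER = "Host up (Debian installer)"
NEW_OS_PREFIX = "Host up (new fresh"


def failure_stage(steps: list[str]) -> str:
    """Classify a failed run by the furthest milestone in its step list.

    Labels are illustrative only:
      before-installer : never reached the Debian installer
      during-install   : installer came up, new OS never did
      after-first-boot : new OS booted, a later step failed
    """
    if any(step.startswith(NEW_OS_PREFIX) for step in steps):
        return "after-first-boot"
    if INSTALLER_MARKER in steps:
        return "during-install"
    return "before-installer"


# Step lists abbreviated from failed runs quoted in this task.
run_1045 = ["Forced PXE for next reboot", "Host rebooted via IPMI"]
run_1042 = [
    "Host rebooted via IPMI",
    "Host up (Debian installer)",
    "Checked BIOS boot parameters are back to normal",
]
run_1042_late = run_1042 + [
    "Host up (new fresh bookworm OS)",
    "Generated Puppet certificate",
]

for name, steps in [("1045", run_1045), ("1042", run_1042), ("1042 late", run_1042_late)]:
    print(name, failure_stage(steps))
```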

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1046.eqiad.wmnet with OS bookworm

Submitted request SR214138427 for cloudcephosd1045.

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1046.eqiad.wmnet with OS bookworm completed:

  • cloudcephosd1046 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202508112259_vriley_1420001_cloudcephosd1046.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1042 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1042.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1044.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1044.eqiad.wmnet with OS bookworm executed with errors:

  • cloudcephosd1044 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1044.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1044.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1044.eqiad.wmnet with OS bookworm

cloudcephosd1044 seems to time out during the install. Checked to make sure it was booting from the first disk. I will be looking into the connections very soon.

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1044.eqiad.wmnet with OS bookworm executed with errors:

  • cloudcephosd1044 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcephosd1044.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

I'm still unable to reimage cloudcephosd1042; it still PXE boots every time, never landing back in the OS. I tried with 1047 and that worked fine, so I suspect something is still broken with 1042. Does it actually work for you reliably?

Hey @VRiley-WMF just a reminder to update me about port 42 on cloudsw1-d5-eqiad. Currently has config on it for cloudcephosd1046 but in Netbox there is no cable attached and it's disabled.

If you're unsure we can just allow homer to disable that port for now, I just don't want to do that in case it might be used (currently shows as UP).

@cmooney Thanks, this has been updated and completed on port 42

Setting up cloudcephosd1044; having to decom and provision it again.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host cloudcephosd1044.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host cloudcephosd1044.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host cloudcephosd1044.eqiad.wmnet with OS bookworm completed:

  • cloudcephosd1044 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202508191314_vriley_2757813_cloudcephosd1044.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

cloudcephosd1044 is completed

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host cloudcephosd1042.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host cloudcephosd1042.eqiad.wmnet with OS bookworm completed:

  • cloudcephosd1042 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202508191623_vriley_2781057_cloudcephosd1042.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

So, here is what has been completed so far:

cloudcephosd1042
C8
U12
CableID 5204
Port 29
CableID 20220266 (Not set as of yet)
Port 28

cloudcephosd1043
C8
U13
CableID 5205
Port 30
CableID 5322 (Not set as of yet)
Port 13

cloudcephosd1044
D5
U19
CableID 230304500273
Port 20
CableID 5338 (Not set as of yet)
Port 19

cloudcephosd1046
D5
U21
CableID 5357
Port 42
CableID 5359
Port 44

cloudcephosd1047
F4
U35
CableID 240707900038
Port 35
CableID 230304500153 (Not set as of yet)
Port 37

The reason cloudcephosd1046 has been fully configured is that the switch port is open and connected due to troubleshooting with 1045. I spoke to Dell about 1045 and they were looking to obtain TSR logs. I am gathering those and sending them over.
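
To make it easier to see which second links still need attention, the cabling summary above can be written down as structured records. This is only a sketch: the class and field names are hypothetical, and treating the first cable of each host as the primary link and the second as the secondary is an assumption based on the order in which they are listed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    cable_id: str
    port: int
    configured: bool = True  # False where the summary says "Not set as of yet"


@dataclass(frozen=True)
class HostCabling:
    host: str
    rack: str
    unit: int
    primary: Link
    secondary: Link


# Values copied from the summary in this comment.
INVENTORY = [
    HostCabling("cloudcephosd1042", "C8", 12, Link("5204", 29), Link("20220266", 28, False)),
    HostCabling("cloudcephosd1043", "C8", 13, Link("5205", 30), Link("5322", 13, False)),
    HostCabling("cloudcephosd1044", "D5", 19, Link("230304500273", 20), Link("5338", 19, False)),
    HostCabling("cloudcephosd1046", "D5", 21, Link("5357", 42), Link("5359", 44)),
    HostCabling("cloudcephosd1047", "F4", 35, Link("240707900038", 35), Link("230304500153", 37, False)),
]


def pending_secondary(inventory: list[HostCabling]) -> list[str]:
    """Hosts whose second link is cabled but not yet set up."""
    return [entry.host for entry in inventory if not entry.secondary.configured]


pending = pending_secondary(INVENTORY)
print("secondary link still pending:", ", ".join(pending))
```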

Many thanks to @ayounsi, who set this secondary cable:

cloudcephosd1042
C8
U12
CableID 5204
Port 29
CableID 20220266
Port 28

I was able to make a bit more progress with cloudcephosd1045. There was some foam that was stuck inside the port, and once removed it seemingly came up and the card started to communicate better. Will need to test this out a bit more.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host cloudcephosd1045.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host cloudcephosd1045.eqiad.wmnet with OS bookworm completed:

  • cloudcephosd1045 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202508221437_vriley_3209740_cloudcephosd1045.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
VRiley-WMF updated the task description.

Everything should be completed with this ticket

This is getting very close! I still see ping failures with cloudcephosd1045, probably because the second network connection isn't properly configured yet, or doesn't have jumbo frames enabled.

Jumbos are enabled globally so it won't be that. But yes the second connection is not in Netbox and hence the switch port is disabled.
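
For reference when testing jumbo frames with ping, the largest ICMP payload that fits in a single unfragmented packet is the MTU minus the IP and ICMP headers. The sketch below does that arithmetic. The 9000-byte MTU in the example is an assumption for illustration (the actual MTU on these links is not stated in this task), and `max_icmp_payload` is a hypothetical helper name.

```python
def max_icmp_payload(mtu: int, ipv6: bool = False) -> int:
    """Largest ICMP echo payload that fits in one packet of the given MTU.

    Uses the fixed header sizes: 20 bytes for IPv4 without options,
    40 bytes for IPv6, and 8 bytes for the ICMP/ICMPv6 echo header.
    """
    ip_header = 40 if ipv6 else 20
    icmp_header = 8
    payload = mtu - ip_header - icmp_header
    if payload < 0:
        raise ValueError(f"MTU {mtu} is too small to carry an ICMP echo")
    return payload


# Assumed MTU values for illustration only.
for mtu in (1500, 9000):
    print(f"MTU {mtu}: IPv4 payload {max_icmp_payload(mtu)}, "
          f"IPv6 payload {max_icmp_payload(mtu, ipv6=True)}")
```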

@VRiley-WMF can you advise if the second link on cloudcephosd1045 is connected and if so what the port is?

@cmooney Thanks! The second link on cloudcephosd1045 is in port 23 on cloudsw1-d5-eqiad. I also made a few changes to the cable itself. I pushed out the update as well. I'm hoping this should be all set?

Thanks, yep, you set everything up exactly as we need for that, and the link is showing as up.

These are all working now! Thanks @VRiley-WMF and @cmooney