Page MenuHomePhabricator

Q1:rack/setup/install db22[21-40]
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of db22[21-40]

Hostname / Racking / Installation Details

Hostnames: db2221 db2222 db2223 db2224 db2225 db2226 db2227 db2228 db2229 db2230 db2231 db2232 db2233 db2234 db2235 db2236 db2237 db2238 db2239 db2240
Racking Proposal: If possible scattered across rows and racks.
Networking Setup: # of Connections:1 - Speed:1G. - VLAN: Private : AAAA records:N, Additional IP records (Cassandra)? No
Partitioning/Raid: HW Raid: Y, Partman recipe and/or desired Raid Level: 10 @Marostegui will add the hosts to the correct partman recipe
OS Distro: Bookworm (default unless otherwise specified)
Sub-team Technical Contact: @Marostegui (going on sabbatical 2nd Sept 2024) @ABran-WMF

Per host setup checklist

db2221
  • Receive in system on procurement task T368922 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
db2222
  • Receive in system on procurement task T368922 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
db2223
  • Receive in system on procurement task T368922 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
db2224
  • Receive in system on procurement task T368922 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
db2225
  • Receive in system on procurement task T368922 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
db2226
  • Receive in system on procurement task T368922 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
db2227
  • Receive in system on procurement task T368922 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
db2228
  • Receive in system on procurement task T368922 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
db2229
  • Receive in system on procurement task T368922 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
db2230
  • Receive in system on procurement task T368922 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
db2231
  • Receive in system on procurement task T368922 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
db2232
  • Receive in system on procurement task T368922 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
db2233
  • Receive in system on procurement task T368922 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
db2234
  • Receive in system on procurement task T368922 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
db2235
  • Receive in system on procurement task T368922 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
db2236
  • Receive in system on procurement task T368922 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
db2237
  • Receive in system on procurement task T368922 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
db2238
  • Receive in system on procurement task T368922 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
db2239
  • Receive in system on procurement task T368922 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
db2240
  • Receive in system on procurement task T368922 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2238.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2239.codfw.wmnet with OS bookworm completed:

  • db2239 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407262252_pt1979_3078474_db2239.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2237.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2238.codfw.wmnet with OS bookworm completed:

  • db2238 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407262318_pt1979_3107197_db2238.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2236.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2237.codfw.wmnet with OS bookworm completed:

  • db2237 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407262344_pt1979_3128468_db2237.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2235.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2236.codfw.wmnet with OS bookworm completed:

  • db2236 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407270004_pt1979_3153874_db2236.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Failed to run the sre.puppet.sync-netbox-hiera cookbook, run it manually

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2235.codfw.wmnet with OS bookworm completed:

  • db2235 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407270026_pt1979_3180615_db2235.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2234.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2233.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2233.codfw.wmnet with OS bookworm executed with errors:

  • db2233 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details,You can also try typing "install-console" db2233.codfw.wmnet to get a root shellbut depending on the failure this may not work.

@Jhancock.wm when you back onsite next week before me please check the cable on db2233 thank you.

xe-0/0/30       up    down db2233

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2234.codfw.wmnet with OS bookworm completed:

  • db2234 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407270110_pt1979_3226616_db2234.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2221.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2222.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2223.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2224.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2225.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2226.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2227.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2222.codfw.wmnet with OS bookworm executed with errors:

  • db2222 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details,You can also try typing "install-console" db2222.codfw.wmnet to get a root shellbut depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2221.codfw.wmnet with OS bookworm completed:

  • db2221 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407291235_jhancock_2652191_db2221.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2222.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2224.codfw.wmnet with OS bookworm completed:

  • db2224 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407291238_jhancock_2655822_db2224.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2225.codfw.wmnet with OS bookworm completed:

  • db2225 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407291241_jhancock_2656318_db2225.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2225.codfw.wmnet with OS bookworm executed with errors:

  • db2225 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407291241_jhancock_2656318_db2225.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
    • The reimage failed, see the cookbook logs for the details,You can also try typing "install-console" db2225.codfw.wmnet to get a root shellbut depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2223.codfw.wmnet with OS bookworm completed:

  • db2223 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407291243_jhancock_2655342_db2223.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2225.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2226.codfw.wmnet with OS bookworm completed:

  • db2226 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407291247_jhancock_2656703_db2226.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2225.codfw.wmnet with OS bookworm completed:

  • db2225 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407291307_jhancock_2710215_db2225.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2222.codfw.wmnet with OS bookworm completed:

  • db2222 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407291313_jhancock_2704819_db2222.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2228.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2229.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2230.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2231.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2232.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2233.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2227.codfw.wmnet with OS bookworm executed with errors:

  • db2227 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details,You can also try typing "install-console" db2227.codfw.wmnet to get a root shellbut depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2228.codfw.wmnet with OS bookworm completed:

  • db2228 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407291349_jhancock_2745293_db2228.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2227.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2232.codfw.wmnet with OS bookworm completed:

  • db2232 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407291352_jhancock_2745530_db2232.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2230.codfw.wmnet with OS bookworm completed:

  • db2230 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407291355_jhancock_2745416_db2230.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2231.codfw.wmnet with OS bookworm completed:

  • db2231 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407291402_jhancock_2745466_db2231.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2229.codfw.wmnet with OS bookworm completed:

  • db2229 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407291359_jhancock_2745360_db2229.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

27 and 33 having some issues. will check again shortly.

27 is not connecting to the right puppet hosts.

Generated Puppet certificate
[1/10, retrying in 5.00s] Attempt to run 'spicerack.puppet.PuppetServer.wait_for_csr' raised: The puppet server has no CSR for db2227.codfw.wmnet
[2/10, retrying in 10.00s] Attempt to run 'spicerack.puppet.PuppetServer.wait_for_csr' raised: The puppet server has no CSR for db2227.codfw.wmnet
[3/10, retrying in 20.00s] Attempt to run 'spicerack.puppet.PuppetServer.wait_for_csr' raised: The puppet server has no CSR for db2227.codfw.wmnet
[4/10, retrying in 40.00s] Attempt to run 'spicerack.puppet.PuppetServer.wait_fo

33 is not registering a reboot even after manually rebooting

[160/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Reboot for db2233.codfw.wmnet not found yet, keep polling for it: unable to get uptime
Caused by: Cumin execution failed (exit_code=2)
[161/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Reboot for db2233.codfw.wmnet not found yet, keep polling for it: unable to get uptime
Caused by: Cumin execution failed (exit_code=2)
[162/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Reboot for db2233.codfw.wmnet not found yet, keep polling for it: unable to get uptime
Caused by: Cumin execution failed (exi

@Jhancock.wm I've fixed db2227's certificate issues. Puppet finished correctly. I am going to reimage it again and see if it works fine this time.

I can't check db2233 as I just get a blank screen when I connect via IDRAC.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2233.codfw.wmnet with OS bookworm executed with errors:

  • db2233 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details,You can also try typing "install-console" db2233.codfw.wmnet to get a root shellbut depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2233.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2227.codfw.wmnet with OS bookworm executed with errors:

  • db2227 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details,You can also try typing "install-console" db2227.codfw.wmnet to get a root shellbut depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host db2227.codfw.wmnet with OS bookworm

@Marostegui 2233 was a switch port issue so it should be fix now. @Jhancock.wm started the re -image already on it
Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2233.codfw.wmnet with OS bookworm

@Papaul can you also check db2227? It is not rebooting after I issued the reimage cookbook. The idrac screen is also blank so I cannot see where it is.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2233.codfw.wmnet with OS bookworm completed:

  • db2233 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407291513_jhancock_2839244_db2233.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

@Marostegui serial was off on the server I set it up. We had an issue with the provision cookbook not setting the serial co we did all the servers manually so we might have missed this one. see task below
https://phabricator.wikimedia.org/T371132

@Marostegui since all those servers are on 10G when you put them in productions can you please let me know if you noticed any improvement.

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host db2227.codfw.wmnet with OS bookworm executed with errors:

  • db2227 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details,You can also try typing "install-console" db2227.codfw.wmnet to get a root shellbut depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host db2227.codfw.wmnet with OS bookworm

@Papaul I am trying to reimage db2227 but it is not doing PXE boot

@Marostegui since all those servers are on 10G when you put them in productions can you please let me know if you noticed any improvement.

Ah I guess this is why it's not doing PXE. Can you move it to 1G so I can reimage? I can ping you once done

@Marostegui please run the cookbook this way:

sudo cookbook sre.hosts.reimage  -t T369654 --os bookworm --force-dhcp-tftp db2227 --new

add the ---force-dhcp-tftp flag if it doesn't work let me know i can check if pxe is set on the 10G nic

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host db2227.codfw.wmnet with OS bookworm executed with errors:

  • db2227 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details,You can also try typing "install-console" db2227.codfw.wmnet to get a root shellbut depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host db2227.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host db2227.codfw.wmnet with OS bookworm executed with errors:

  • db2227 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details,You can also try typing "install-console" db2227.codfw.wmnet to get a root shellbut depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2227.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2227.codfw.wmnet with OS bookworm executed with errors:

  • db2227 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details,You can also try typing "install-console" db2227.codfw.wmnet to get a root shellbut depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2227.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2227.codfw.wmnet with OS bookworm executed with errors:

  • db2227 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details,You can also try typing "install-console" db2227.codfw.wmnet to get a root shellbut depending on the failure this may not work.

We reiamge 5 times db2227 for some reason the server is still sending the certificate request to puppetmaster1001

pt1979@puppetmaster1001:~$ sudo puppet cert --list
Warning: `puppet cert` is deprecated and will be removed in a future release.
   (location: /usr/lib/ruby/vendor_ruby/puppet/application.rb:370:in `run')
  "db2227.codfw.wmnet" (SHA256) B0:67:3B:F8:29:11:75:B3:8B:7A:FF:5A:75:B4:DD:F7:F9:58:DE:8F:2F:74:C8:A1:51:C5:AD:79:E6:0F:90:C3

@Volans any idea why this may be happening during this reimage? I don't see anything different within its puppet definition that could explain why.

Has it been cleared from puppet5 between each reimage? After the first reimage if the host is in puppetdb the reimage cookbook will use the current puppet version of the host. That could explain it. Let me know if you want me to do something with it.

@Volans after the first reimage that failed, I did clean all the certificates from puppet, but looks like Papaul found the same issues.

The reimage cookbook did set it to puppet 7 each time:

$ grep "node=db2227.codfw.wmnet, command='printf" reimage-extended.log
2024-07-29 12:27:45,345 jhancock 2656784 [DEBUG clustershell.py:590 in ev_pickup] node=db2227.codfw.wmnet, command='printf 7 > /tmp/puppet_version'
2024-07-29 17:59:29,599 pt1979 3022287 [DEBUG clustershell.py:590 in ev_pickup] node=db2227.codfw.wmnet, command='printf 7 > /tmp/puppet_version'
2024-07-29 19:35:24,239 pt1979 3092430 [DEBUG clustershell.py:590 in ev_pickup] node=db2227.codfw.wmnet, command='printf 7 > /tmp/puppet_version'

Can I try a reimage? I can't access it right now (both ssh and console ask for a login and users are not yet set)

Cookbook cookbooks.sre.hosts.reimage was started by volans@cumin2002 for host db2227.codfw.wmnet with OS bookworm

I've just cleaned the host from both puppetmaster and puppetserver to be sure it was not there and run:

sudo cookbook sre.hosts.reimage -t T369654 --os bookworm -p 7 --force-dhcp-tftp --new db2227

And is currently doing it's first puppet run and is registered on the puppetserver as expected.

Cookbook cookbooks.sre.hosts.reimage started by volans@cumin2002 for host db2227.codfw.wmnet with OS bookworm completed:

  • db2227 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407301032_volans_4020369_db2227.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

I've just cleaned the host from both puppetmaster and puppetserver to be sure it was not there and run:

sudo cookbook sre.hosts.reimage -t T369654 --os bookworm -p 7 --force-dhcp-tftp --new db2227

And is currently doing it's first puppet run and is registered on the puppetserver as expected.

@Volans just for my own education, yesterday what I did was:

  • Sign the cert manually on puppetmaster to see if puppet would run fine, it did.
  • Clean the cert on puppetmaster1001 (per wikitech doc)
  • Clean everything on the host (per wikitech doc)
  • Issue the reimage with the same command

And the host failed again with the error show at T369654#10023235

What else did you do that I missed?

Thank you

@Volans thank you. Do we have a long term fix for this because we are having this issue often when we are running the re-image, some nodes are sending their puppet request to puppetmaster1001 and not the puppet server making the cookbook to failed. We all don't have the permission to clean the cert on the puppetmaster and i will like to still away also from doing that.
thank you.

Jhancock.wm updated the task description. (Show Details)

@Papaul what's the timeline for deciding to not reimage anymore buster host so we can just have puppet 7 and solve all problems?

@Papaul We've debugged this live during the tooling and automation office hours and we think it's a race condition due do the fact that late_command.sh stil defaults to puppet 5 if unable to read the version. Luca is changing that.

Change #1058641 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] install_server: fix late_command.sh to avoid race conditions

https://gerrit.wikimedia.org/r/1058641

Change #1058641 merged by Elukey:

[operations/puppet@production] install_server: fix late_command.sh to avoid race conditions

https://gerrit.wikimedia.org/r/1058641