
cp4037 reimage for cookbook getting stuck at PXE boot
Closed, Resolved; Public

Description

This is not specific to cp4037, but a separate task may be helpful for tracking the reimage cookbook getting stuck at PXE boot and needing multiple attempts to finish.

Details

Other Assignee
Fabfur

Event Timeline

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cp4037.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cp4037.ulsfo.wmnet with OS bullseye executed with errors:

  • cp4037 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cp4037.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cp4037.ulsfo.wmnet with OS bullseye executed with errors:

  • cp4037 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Mentioned in SAL (#wikimedia-operations) [2023-12-13T12:02:42Z] <vgutierrez> setting cp4037 as inactive - T352876

Vgutierrez added subscribers: Papaul, Vgutierrez.

@Papaul I see that you triggered the cookbook last week. Are you stuck on something? Do you need help from our side? It would be great to get this host back into production before the break.

@Vgutierrez please give me until the end of today. Thank you

@Vgutierrez I had a meeting with the network and automation teams today. We discussed this issue, and we do not seem to know its real cause. We decided to let Traffic take this server back and put it in service; we can still track the issue at T350179.

Thanks

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp4037.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp4037.ulsfo.wmnet with OS bullseye executed with errors:

  • cp4037 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202312181047_fabfur_735643_cp4037.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp4037.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp4037.ulsfo.wmnet with OS bullseye executed with errors:

  • cp4037 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp4037.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp4037.ulsfo.wmnet with OS bullseye executed with errors:

  • cp4037 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp4037.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp4037.ulsfo.wmnet with OS bullseye executed with errors:

  • cp4037 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp4037.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp4037.ulsfo.wmnet with OS bullseye completed:

  • cp4037 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202312181245_fabfur_753803_cp4037.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Vgutierrez assigned this task to Fabfur.