
cloudvirt104[346] reimage failures
Closed, Resolved, Public

Description

When rebooting cloudvirt1043 after the OS install:

Booting from Hard drive C:
GRUB loading..
Welcome to GRUB!

error: disk `lvmid/UUYzRf-z200-W1JE-5dHp-cXfr-ATil-4hhYzV/WEpa3p-OHvH-0Zwj-lql3-msdg-kUdj-8vuhUh' not found.
grub rescue>

This has happened twice in a row.
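
For anyone debugging the same rescue prompt: GRUB refers to an LVM logical volume as `lvmid/<VG UUID>/<LV UUID>`, so the error above means GRUB could not find that logical volume by UUID. Below is a minimal sketch that pulls the two UUIDs out of the wrapped console text, so they can be compared with the output of `vgs -o vg_name,vg_uuid` and `lvs -o lv_name,lv_uuid` on the installed system. The helper name `parse_lvmid` is made up for illustration, and the pattern assumes purely alphanumeric UUID groups.

```python
import re

# Assumption: LVM UUIDs are seven alphanumeric groups sized 6-4-4-4-4-4-6.
_UUID = r"[A-Za-z0-9]{6}(?:-[A-Za-z0-9]{4}){5}-[A-Za-z0-9]{6}"
_LVMID = re.compile(rf"lvmid/({_UUID})/({_UUID})")


def parse_lvmid(console_text: str) -> tuple[str, str]:
    """Return (vg_uuid, lv_uuid) from a GRUB 'disk ... not found' message."""
    # The console wraps long lines, so drop all whitespace before matching.
    joined = re.sub(r"\s+", "", console_text)
    match = _LVMID.search(joined)
    if match is None:
        raise ValueError("no lvmid/<VG UUID>/<LV UUID> identifier found")
    return match.group(1), match.group(2)


# Console text copied from the task description, including the line wrap.
console = """error: disk `lvmid/UUYzRf-z200-W1JE-5dHp-cXfr-ATil-4hhYzV/WEpa3p-OHvH-0Zwj-lql3-
msdg-kUdj-8vuhUh' not found."""

vg_uuid, lv_uuid = parse_lvmid(console)
print("VG UUID:", vg_uuid)
print("LV UUID:", lv_uuid)
```

If neither UUID shows up in the `vgs`/`lvs` output of the freshly installed system, the GRUB configuration on disk probably refers to a volume from an earlier install attempt. That is only one possible explanation, not something this task confirmed.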

cloudvirt1044 fails in a similar way, although it just shows a blank screen indefinitely after the reboot.

Event Timeline

Andrew renamed this task from cloudvirt1043 reimage failures to cloudvirt1043 + cloudvirt1044 reimage failures.Nov 14 2023, 5:41 AM
Andrew updated the task description. (Show Details)

cloudvirt1046 also shows a blank screen indefinitely on console com2.

Full output of the reimage cookbook for cloudvirt1046: https://phabricator.wikimedia.org/P53419

fnegri renamed this task from cloudvirt1043 + cloudvirt1044 reimage failures to cloudvirt104[346] reimage failures.Nov 14 2023, 2:32 PM

We may find more of these as we roll through the remaining dozen cloudvirts. For now, though, let's start with FW updates for these hosts.

> let's start with FW updates for these hosts.

What is the procedure for FW updates?

fnegri changed the task status from Open to In Progress.Nov 14 2023, 4:40 PM
fnegri triaged this task as High priority.

@Jclark-ctr did firmware upgrades.

  • 1043 has the same grub prompt issue as before
  • 1044 is now working properly and back in service.
  • 1046 has the same 'hangs at a blank screen during reboot' issue

btw those hosts (cloudvirt1043 and cloudvirt1046) are fully out of service and can be restarted or reimaged at any time.

The reimage process is, for example:

andrew@cumin1001:~$ sudo cookbook sre.hosts.reimage --new --puppet 5 --os bookworm -t T345811 cloudvirt1043

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cloudvirt1043.eqiad.wmnet with OS bullseye

After chatting with @Andrew in IRC I decided to take a look at this to help out with cloudvirt1043:

  • checked all firmware versions were indeed updated correctly, yep
  • checked all bios settings were correctly applied, yep
  • reimaged it and the issue magically went away, yep
  • the reimage then failed at the Puppet CSR step; rerunning with --puppet 5 given on the command line rather than at the script prompt

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cloudvirt1043.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt1043 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cloudvirt1043.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cloudvirt1043.eqiad.wmnet with OS bullseye completed:

  • cloudvirt1043 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311142049_robh_757635_cloudvirt1043.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

This comment was removed by Andrew.
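
To see where the failed run stopped, one option is to compare its step list with the step list of the run that passed. The sketch below does that. The function name is illustrative, and the step lists are trimmed copies of the two bullseye cookbook reports above.

```python
def first_missing_step(failed_steps, passed_steps):
    """Return the first step of the passing run that the failed run never logged."""
    reached = set(failed_steps)
    for step in passed_steps:
        if step not in reached:
            return step
    return None


# Trimmed from the FAIL report for cloudvirt1043.
failed = [
    "Host up (new fresh bullseye OS)",
    "Generated Puppet certificate",
    "The reimage failed, see the cookbook logs for the details",
]

# Trimmed from the PASS report for cloudvirt1043.
passed = [
    "Host up (new fresh bullseye OS)",
    "Generated Puppet certificate",
    "Signed new Puppet certificate",
    "Run Puppet in NOOP mode to populate exported resources in PuppetDB",
]

missing = first_missing_step(failed, passed)
print(missing)
```

For these two reports the first step missing from the failed run is the certificate signing, which matches the note that the reimage failed at the Puppet CSR step.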

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cloudvirt1043.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cloudvirt1043.eqiad.wmnet with OS bookworm completed:

  • cloudvirt1043 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311142133_robh_782680_cloudvirt1043.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

  • 1043 works fine when Rob reimages it.
  • 1044 is now working properly and back in service.
  • 1046 has the same 'hangs at a blank screen during reboot' issue

Andrew added a subscriber: RobH.

Rob will have a go at 1046. @RobH, if you get it to reimage, please reassign this ticket to me so I can put it in service. Thanks!

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudvirt1046.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudvirt1046.eqiad.wmnet with OS bookworm completed:

  • cloudvirt1046 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202312011207_fnegri_2885211_cloudvirt1046.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

fnegri moved this task from In progress to Done on the cloud-services-team (FY2023/2024-Q1-Q2) board.

I tried reimaging again and it worked!

Mentioned in SAL (#wikimedia-cloud-feed) [2023-12-01T14:18:27Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (T351171)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-12-01T14:19:51Z] <fnegri@cloudcumin1001> END (ERROR) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=97) (T351171)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-12-01T14:20:27Z] <wm-bot2> fran@wmf3169 START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (T351171)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-12-01T14:24:45Z] <wm-bot2> fran@wmf3169 END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) (T351171)

Host rebooted by fnegri@cumin1001 with reason: Rebooting to test the host is stable

Mentioned in SAL (#wikimedia-cloud-feed) [2023-12-01T16:19:45Z] <wm-bot2> fran@wmf3169 START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (T351171)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-12-01T16:19:55Z] <wm-bot2> fran@wmf3169 END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) (T351171)
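
The SAL mentions above follow a regular enough shape that the failed cookbook run (exit_code=97) can be picked out mechanically. This is a rough sketch only: the regular expression is inferred from the lines in this task, not from any documented SAL format, and the names are illustrative.

```python
import re

# Pattern inferred from the SAL lines in this task (an assumption, not a spec).
SAL_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] <[^>]+> (?:\S+ )?"
    r"(?P<phase>START|END)(?: \((?P<result>\w+)\))?"
    r" - Cookbook (?P<cookbook>\S+)"
    r"(?: \(exit_code=(?P<code>\d+)\))?"
)


def failed_runs(lines):
    """Return (timestamp, exit_code) for every END entry with a non-zero exit code."""
    failures = []
    for line in lines:
        match = SAL_RE.search(line)
        if match and match["phase"] == "END" and match["code"] not in (None, "0"):
            failures.append((match["ts"], int(match["code"])))
    return failures


# The first four SAL mentions, copied from this task.
sal_lines = [
    "Mentioned in SAL (#wikimedia-cloud-feed) [2023-12-01T14:18:27Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (T351171)",
    "Mentioned in SAL (#wikimedia-cloud-feed) [2023-12-01T14:19:51Z] <fnegri@cloudcumin1001> END (ERROR) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=97) (T351171)",
    "Mentioned in SAL (#wikimedia-cloud-feed) [2023-12-01T14:20:27Z] <wm-bot2> fran@wmf3169 START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (T351171)",
    "Mentioned in SAL (#wikimedia-cloud-feed) [2023-12-01T14:24:45Z] <wm-bot2> fran@wmf3169 END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) (T351171)",
]

failures = failed_runs(sal_lines)
print(failures)
```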