elastic2033 without bootable devices available (repeat of T281621)
Closed, Resolved (Public)

Description

Hello Papaul,

As described in T281621, elastic2033 can't boot. It reaches the PXE boot screen but gets no further. I've done a 'power off hard/power on' from the iLO 3 times, but it still won't boot. Are you able to try a power drain, or anything else that might help get it booting again?

Thanks for your time!

The host is depooled and has been banned from the relevant Elasticsearch clusters, so it's ready to be serviced.

Brian

Event Timeline

Banned host like so:

curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_host": "","_name": "elastic2033-production-search-codfw"}}}'

curl -H 'Content-Type: application/json' -XPUT http://localhost:9400/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_host": "","_name": "elastic2033-production-search-omega-codfw"}}}'

curl -H 'Content-Type: application/json' -XPUT http://localhost:9600/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_host": "","_name": "elastic2033-production-search-psi-codfw"}}}'
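The three commands above differ only in port and cluster-name suffix. As a rough sketch, the same requests can be generated from one table. The ports, suffixes, and JSON body are copied from the curl commands; the function names `ban_request` and `ban_requests` are hypothetical, and the script only prints the requests rather than sending them.

```python
import json

# Port -> cluster-name suffix, copied from the three curl commands above.
CLUSTERS = {
    9200: "production-search-codfw",
    9400: "production-search-omega-codfw",
    9600: "production-search-psi-codfw",
}


def ban_request(host, port, cluster):
    """Return (url, body) for a transient settings update that excludes one node.

    Hypothetical helper: it only builds the request, it does not send it.
    """
    url = f"http://localhost:{port}/_cluster/settings"
    body = {
        "transient": {
            "cluster.routing.allocation.exclude": {
                "_host": "",
                "_name": f"{host}-{cluster}",
            }
        }
    }
    return url, json.dumps(body)


def ban_requests(host):
    """Build one ban request per cluster listed in CLUSTERS."""
    return [ban_request(host, port, cluster) for port, cluster in CLUSTERS.items()]


if __name__ == "__main__":
    for url, body in ban_requests("elastic2033"):
        print("PUT", url, body)
```

Each printed body can be passed to `curl -XPUT ... -d` exactly as in the commands above.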
Papaul triaged this task as Medium priority.

Boot was set to UEFI for some reason. I changed it back to Legacy BIOS, and the system is back online.

Mentioned in SAL (#wikimedia-operations) [2022-04-12T22:39:06Z] <ryankemper> T305646 Re-enabling puppet on elastic2033; still need to unban from elasticsearch cluster tomorrow

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic2033.codfw.wmnet with OS stretch

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2033.codfw.wmnet with OS stretch completed:

  • elastic2033 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh stretch OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204130044_ryankemper_1736819_elastic2033.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-06-02T19:08:26Z] <ryankemper> T305646 T308647 Unbanned elastic2033 and elastic2054 from clusters; also pooled elastic2033
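The task does not record how the unban was performed. As a hedged sketch, one way to reverse a ban is to send the same `_cluster/settings` update with the node's name removed from the `_name` exclusion list, or with `null` once the list is empty, since a `null` value resets a transient setting to its default. The helper names `remaining_exclusions` and `unban_body` are hypothetical, and the current exclusion string is assumed to have been read from the cluster beforehand.

```python
import json


def remaining_exclusions(current, node):
    """Drop `node` from a comma-separated exclusion list.

    Returns the remaining names joined by commas, or None when nothing is
    left (None serializes to JSON null, which resets the setting).
    """
    names = [n.strip() for n in current.split(",") if n.strip()]
    kept = [n for n in names if n != node]
    return ",".join(kept) if kept else None


def unban_body(current, node):
    """Build the JSON body for a transient settings update that unbans `node`."""
    return json.dumps({
        "transient": {
            "cluster.routing.allocation.exclude": {
                "_name": remaining_exclusions(current, node),
            }
        }
    })


if __name__ == "__main__":
    # Assumed current value: only elastic2033 is excluded on this cluster.
    current = "elastic2033-production-search-codfw"
    print(unban_body(current, "elastic2033-production-search-codfw"))
```

Removing a single name rather than blanking the whole setting keeps any other banned hosts excluded.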