elastic2033 without bootable devices available (repeat of T281621)
Closed, Resolved · Public
Assigned To

Authored By

	bking
	Apr 7 2022, 5:14 PM

Description

Hello Papaul,

As described in T281621, elastic2033 can't boot. It reaches the PXE boot screen but gets no further. I've done a 'power off hard/power on' from the iLO 3 times, but it still won't boot. Are you able to try a power drain, or anything else that might help get it booting again?

Thanks for your time!

The host is depooled and has been banned from the relevant elasticsearch clusters so it's ready to be serviced.

Brian

Related Objects

Mentioned In: T308647: elastic2054 is having H/W issues
Mentioned Here: T308647: elastic2054 is having H/W issues
T281621: elastic2033 without bootable devices available

Event Timeline

bking created this task. Apr 7 2022, 5:14 PM

Restricted Application added a subscriber: Aklapper. Apr 7 2022, 5:14 PM

Maintenance_bot added a project: SRE. Apr 7 2022, 5:29 PM

Banned host like so:

curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_host": "","_name": "elastic2033-production-search-codfw"}}}'

curl -H 'Content-Type: application/json' -XPUT http://localhost:9400/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_host": "","_name": "elastic2033-production-search-omega-codfw"}}}'

curl -H 'Content-Type: application/json' -XPUT http://localhost:9600/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_host": "","_name": "elastic2033-production-search-psi-codfw"}}}'
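The three commands above differ only in the port and the cluster suffix of the node name, so they can be sketched as a loop. This is a minimal dry-run sketch, not the procedure that was actually used: the `ban_payload` helper and the `port:suffix` list format are illustrative names, and the ports and node names are copied from the commands above. Note that writing `_name` this way replaces any node names already in the exclusion list.

```shell
#!/bin/sh
# Sketch: print the ban request for each Elasticsearch instance on one host.
# Ports and cluster suffixes are taken from the three curl commands above.
host="elastic2033"

# Hypothetical helper: build the JSON body that excludes one node name
# from shard allocation (same body shape as the original commands).
ban_payload() {
  printf '{"transient":{"cluster.routing.allocation.exclude":{"_host": "","_name": "%s"}}}' "$1"
}

# Each entry is "port:cluster-suffix".
for entry in \
  9200:production-search-codfw \
  9400:production-search-omega-codfw \
  9600:production-search-psi-codfw
do
  port=${entry%%:*}      # text before the first colon
  cluster=${entry#*:}    # text after the first colon
  # Dry run: echo the command instead of sending it.
  echo curl -H 'Content-Type: application/json' -XPUT \
    "http://localhost:${port}/_cluster/settings" \
    -d "$(ban_payload "${host}-${cluster}")"
done
```

Removing the `echo` would send the requests for real; keeping it lets the generated commands be reviewed first.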

Papaul claimed this task. Apr 11 2022, 11:12 PM

Papaul triaged this task as Medium priority.

Boot was set to UEFI for some reason. I changed it back to Legacy BIOS, and the system is back online.
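Once the host is back up, the boot mode can be cross-checked from the running OS. A minimal sketch, assuming a Linux host: the kernel exposes `/sys/firmware/efi` only when the system was booted through UEFI. The `boot_mode` function and its optional root-prefix argument are illustrative and are not part of any tooling mentioned in this task.

```shell
#!/bin/sh
# Report the firmware mode the running Linux system booted in.
# /sys/firmware/efi exists only after a UEFI boot.
# $1 is an optional alternate root prefix (hypothetical, used to make
# the check easy to exercise against a scratch directory).
boot_mode() {
  if [ -d "${1:-}/sys/firmware/efi" ]; then
    echo "UEFI"
  else
    echo "Legacy BIOS"
  fi
}

boot_mode
```

This only reflects how the current OS was booted; it does not read the boot order stored in the BIOS setup.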

Mentioned in SAL (#wikimedia-operations) [2022-04-12T22:39:06Z] <ryankemper> T305646 Re-enabling puppet on elastic2033; still need to unban from elasticsearch cluster tomorrow

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic2033.codfw.wmnet with OS stretch

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2033.codfw.wmnet with OS stretch completed:

elastic2033 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh stretch OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204130044_ryankemper_1736819_elastic2033.out
- Checked BIOS boot parameters are back to normal
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-06-02T19:08:26Z] <ryankemper> T305646 T308647 Unbanned elastic2033 and elastic2054 from clusters; also pooled elastic2033
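For reference, an unban is the reverse of the earlier ban: Elasticsearch resets a cluster setting to its default when it is assigned `null`. The sketch below is a dry run under stated assumptions and not necessarily what was executed here: the `unban_payload` helper is an illustrative name, the ports are the ones used in the ban commands, and resetting `_name` clears every name in the exclusion list, not only elastic2033.

```shell
#!/bin/sh
# Hypothetical helper: build the JSON body that resets the transient
# name-based allocation exclusion to its default (empty) value.
# Assumption: no other node should remain banned by name.
unban_payload() {
  printf '{"transient":{"cluster.routing.allocation.exclude._name":null}}'
}

# Ports of the three Elasticsearch instances, as in the ban commands.
for port in 9200 9400 9600; do
  # Dry run: echo the command instead of sending it.
  echo curl -H 'Content-Type: application/json' -XPUT \
    "http://localhost:${port}/_cluster/settings" \
    -d "$(unban_payload)"
done
```

If other hosts are still banned, the safer variant is to write back the exclusion list with only this host's node name removed.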

Stashbot mentioned this in T308647: elastic2054 is having H/W issues. Jun 2 2022, 7:08 PM
