
elastic1043.eqiad.wmnet stuck in power off state
Closed, Resolved | Public | 1 Estimated Story Points

Description

The affected host, elastic1043, is due to be replaced soon per https://phabricator.wikimedia.org/T279158. Rather than file a proper hardware failure ticket with dc-ops, I'm using this ticket as a placeholder so that I have something to ack the alerts with.


I can't find much of use in the system event log (https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/HP_Documentation#Show_system_event_log_entries). Here's the latest entry, which is not recent:

/system1/log1/record30
  Targets
  Properties
    number=30
    severity=Repaired
    date=09/15/2020
    time=15:56
    description=System Power Supplies Not Redundant
  Verbs
    cd version exit show

However, the host is powered off and does not power back on when asked:

</>hpiLO-> power

status=0
status_tag=COMMAND COMPLETED
Tue Dec 14 17:54:28 2021



power: server power is currently: Off


</>hpiLO-> power on

status=0
status_tag=COMMAND COMPLETED
Tue Dec 14 17:54:32 2021



Server powering on .......

</>hpiLO-> power

status=0
status_tag=COMMAND COMPLETED
Tue Dec 14 17:57:27 2021



power: server power is currently: Off
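
The check done by hand above (query power, issue `power on`, query again a few minutes later) can be sketched as a small parser over captured iLO output. This is only an illustration: `parse_power_state` and `stuck_off` are hypothetical helper names, and the sketch assumes the status line keeps the exact wording shown in the transcript.

```python
import re

# Matches the status line printed by the iLO `power` command,
# e.g. "power: server power is currently: Off".
POWER_LINE = re.compile(r"server power is currently:\s*(\w+)", re.IGNORECASE)


def parse_power_state(output):
    """Return "On" or "Off" from captured `power` output, or None if no status line is found."""
    match = POWER_LINE.search(output)
    return match.group(1).capitalize() if match else None


def stuck_off(before, after):
    """True if the host reported Off both before and after a `power on` attempt."""
    return parse_power_state(before) == "Off" and parse_power_state(after) == "Off"
```

Fed the two `power` outputs from this ticket (17:54:28 and 17:57:27), `stuck_off` would return True, which matches what was observed.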

Event Timeline

We're not going to open a dc-ops ticket. This host will be replaced when we add the new eqiad elastic hosts in early January.

Mentioned in SAL (#wikimedia-operations) [2021-12-22T18:42:42Z] <inflatador> T297735 removing/banning elastic1039 and elastic1043 from all EQIAD prod clusters

@RKemper FYI I've set a downtime for the host in Icinga until the end of January.

ayounsi subscribed.

FYI the host is still set to "active" in Netbox.
https://netbox.wikimedia.org/dcim/devices/1366/

This is surfaced in one of the reports because the host is not in PuppetDB:
https://netbox.wikimedia.org/extras/reports/puppetdb.PhysicalHosts/

(Active servers need to be in PuppetDB).

It probably needs to be set to FAILED or DECOM.
https://wikitech.wikimedia.org/wiki/Server_Lifecycle#States
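
The rule behind that report ("Active servers need to be in PuppetDB") amounts to a set difference. The sketch below is not the actual Netbox report code; it is a minimal illustration that assumes the device list and the PuppetDB host list have already been fetched and use the same hostname form. The function name and input shapes are hypothetical.

```python
def active_hosts_missing_from_puppetdb(netbox_devices, puppetdb_hosts):
    """Return the sorted names of devices marked active that PuppetDB does not know about.

    netbox_devices: iterable of (name, status) pairs (hypothetical input shape).
    puppetdb_hosts: iterable of host names, in the same form as the Netbox names.
    """
    known = {host.lower() for host in puppetdb_hosts}
    return sorted(
        name
        for name, status in netbox_devices
        if status.lower() == "active" and name.lower() not in known
    )
```

A host that is powered off and never runs Puppet eventually drops out of PuppetDB, so it shows up in this list until its status is changed to something other than active.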

Thanks for catching this. I just set the host to FAILED in Netbox.

RKemper triaged this task as Low priority.

Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: elastic1043.eqiad.wmnet