
hw troubleshooting: power up failure for an-worker1132.eqiad.wmnet
Closed, Resolved | Public | Request

Description

  • - Provide FQDN of system.
  • - If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
  • - Put system into a failed state in Netbox.
  • - Provide urgency of request, along with justification (redundancy, dependencies, etc)
  • - Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
  • - Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

Hello. We have noticed that an-worker1132 is refusing to power up and has therefore dropped out of both the Hadoop cluster and Puppet.

I have attempted to power up the server as follows.

btullis@cumin1003:~$ sudo ipmitool -I lanplus -H "an-worker1132.mgmt.eqiad.wmnet" -U root -E shell
Unable to read password from environment
Password: 
ipmitool> chassis power status
Chassis Power is off
ipmitool> chassis power on
Chassis Power Control: Up/On
ipmitool> chassis power status
Chassis Power is off

The power status never changes to on.
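
For anyone repeating this check, the manual sequence above can be sketched as a small script: issue a power-on, then poll the status a few times before giving up. This is only an illustration, not part of any existing tooling. The function names, the retry count, and the delay are assumptions, and it assumes the BMC password is exported as IPMI_PASSWORD so that -E does not fall back to an interactive prompt.

```python
import subprocess
import time
from typing import Callable, Sequence


def ipmi(host: str, args: Sequence[str]) -> str:
    """Run one ipmitool command against a management interface, return stdout.

    Uses the same connection flags as the session above. -E reads the
    password from the IPMI_PASSWORD environment variable.
    """
    cmd = ["ipmitool", "-I", "lanplus", "-H", host, "-U", "root", "-E", *args]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def power_on_and_wait(
    host: str,
    run: Callable[[str, Sequence[str]], str] = ipmi,
    attempts: int = 6,      # hypothetical retry count
    delay: float = 5.0,     # hypothetical pause between polls, in seconds
) -> bool:
    """Request power-on, then poll until the chassis reports 'on'.

    Returns True as soon as the status reads 'Chassis Power is on',
    False if it still reads anything else after all attempts.
    """
    run(host, ["chassis", "power", "on"])
    for _ in range(attempts):
        status = run(host, ["chassis", "power", "status"]).strip().lower()
        if status.endswith("is on"):
            return True
        time.sleep(delay)
    return False


if __name__ == "__main__":
    ok = power_on_and_wait("an-worker1132.mgmt.eqiad.wmnet")
    print("powered on" if ok else "still off after retries")
```

A False result here corresponds to the behaviour reported in this task: the power-on request is accepted, but the status stays off.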

I also attempted a BMC restart with "bmc reset cold", but this did not fix the problem.

The event log shows some recent assertions relating to voltage.

ipmitool> sel list
   1 | 04/05/2023 | 02:36:52 PM UTC | Event Logging Disabled #0x72 | Log area reset/cleared | Asserted
   2 | 04/05/2023 | 07:16:48 PM UTC | Physical Security #0x73 | General Chassis intrusion () | Asserted
   3 | 04/05/2023 | 07:16:52 PM UTC | Physical Security #0x73 | General Chassis intrusion () | Deasserted
   4 | 04/05/2023 | 07:34:18 PM UTC | Drive Slot / Bay #0x98 | Drive Present () | Deasserted
   5 | 04/05/2023 | 07:34:33 PM UTC | Drive Slot / Bay #0x9e | Drive Present () | Deasserted
   6 | 04/05/2023 | 07:34:48 PM UTC | Drive Slot / Bay #0xa0 | Drive Present () | Deasserted
   7 | 04/05/2023 | 07:35:04 PM UTC | Drive Slot / Bay #0xa1 | Drive Present () | Deasserted
   8 | 04/05/2023 | 07:35:09 PM UTC | Drive Slot / Bay #0xa2 | Drive Present () | Deasserted
   9 | 04/05/2023 | 07:37:44 PM UTC | Drive Slot / Bay #0xa4 | Drive Present () | Deasserted
   a | 04/05/2023 | 07:37:49 PM UTC | Drive Slot / Bay #0xa5 | Drive Present () | Deasserted
   b | 04/05/2023 | 07:40:04 PM UTC | Drive Slot / Bay #0xa4 | Drive Present () | Asserted
   c | 04/05/2023 | 07:40:04 PM UTC | Drive Slot / Bay #0xa5 | Drive Present () | Asserted
   d | 04/05/2023 | 07:45:59 PM UTC | Drive Slot / Bay #0xa1 | Drive Present () | Asserted
   e | 04/05/2023 | 07:45:59 PM UTC | Drive Slot / Bay #0xa2 | Drive Present () | Asserted
   f | 04/05/2023 | 07:46:03 PM UTC | Drive Slot / Bay #0x9e | Drive Present () | Asserted
  10 | 04/05/2023 | 07:46:08 PM UTC | Drive Slot / Bay #0xa0 | Drive Present () | Asserted
  11 | 04/05/2023 | 07:46:13 PM UTC | Drive Slot / Bay #0x98 | Drive Present () | Asserted
  12 | 04/11/2023 | 10:11:35 AM UTC | Battery #0x88 | Failed | Asserted
  13 | 04/11/2023 | 01:30:45 PM UTC | Battery #0x88 | Failed | Deasserted
  14 | 06/17/2025 | 06:49:41 AM UTC | Memory #0x1b | Monitor | Asserted
  15 | 06/26/2025 | 03:39:46 PM UTC | Voltage #0x74 | State Asserted | Asserted
  16 | 09/22/2025 | 11:30:54 AM UTC | Voltage #0x74 | State Asserted | Asserted
ipmitool>
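
To pick the voltage entries out of a longer log, the `sel list` output above can be parsed with a few lines of Python. This is a rough sketch that assumes the pipe-separated, six-column layout shown in this task (hexadecimal record ID, MM/DD/YYYY date, 12-hour UTC time, sensor, event, state). The `SelEntry` type and `parse_sel` function are hypothetical names, and other ipmitool versions or BMCs may format the log differently.

```python
from datetime import datetime
from typing import List, NamedTuple


class SelEntry(NamedTuple):
    record_id: int       # record IDs are printed in hexadecimal
    timestamp: datetime  # naive datetime; the log reports UTC
    sensor: str
    event: str
    state: str


def parse_sel(text: str) -> List[SelEntry]:
    """Parse 'ipmitool sel list' output, skipping lines that do not match."""
    entries = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 6:
            continue  # prompts, blank lines, anything unexpected
        rec, date, clock, sensor, event, state = fields
        clock = clock.replace(" UTC", "")
        ts = datetime.strptime(f"{date} {clock}", "%m/%d/%Y %I:%M:%S %p")
        entries.append(SelEntry(int(rec, 16), ts, sensor, event, state))
    return entries


# The last three records from the log above, copied verbatim.
SAMPLE = """\
  14 | 06/17/2025 | 06:49:41 AM UTC | Memory #0x1b | Monitor | Asserted
  15 | 06/26/2025 | 03:39:46 PM UTC | Voltage #0x74 | State Asserted | Asserted
  16 | 09/22/2025 | 11:30:54 AM UTC | Voltage #0x74 | State Asserted | Asserted
ipmitool>
"""

if __name__ == "__main__":
    for entry in parse_sel(SAMPLE):
        if entry.sensor.startswith("Voltage"):
            print(entry.timestamp.isoformat(), entry.sensor, entry.event)
```

Filtering on the sensor column isolates the two Voltage #0x74 assertions from June and September 2025.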

Please could you investigate and try to fix this? It is not urgent.

If the server can be made to power up, it should rejoin the Hadoop cluster without any intervention, so feel free to let this happen.

I marked the server as failed in Netbox.
Thanks very much

Event Timeline

BTullis moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.
BTullis moved this task from Incoming to Watching on the Data-Platform-SRE board.
Jclark-ctr subscribed.

This server is out of warranty.

Performed a flea power drain. The server booted with no errors at this time.
Updated BIOS: 2.10.0 -> 2.24.0
Updated iDRAC: 4.40.00.00 -> 7.00.00

Please note that this server was never returned from failed to active status in Netbox. This caused an issue during the switch migration earlier today, so the server has not been moved yet. I've set the Netbox status to active, as the server appears to be in use.