- Provide FQDN of system.
- If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
- Put system into a failed state in Netbox.
- Provide urgency of request, along with justification (redundancy, dependencies, etc.).
- Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help.)
- Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.
Hello. We have noticed that an-worker1132 is refusing to power up and has therefore dropped out of both the Hadoop cluster and Puppet.
I have attempted to power up the server as follows.
btullis@cumin1003:~$ sudo ipmitool -I lanplus -H "an-worker1132.mgmt.eqiad.wmnet" -U root -E shell
Unable to read password from environment
Password:
ipmitool> chassis power status
Chassis Power is off
ipmitool> chassis power on
Chassis Power Control: Up/On
ipmitool> chassis power status
Chassis Power is off
The power status never changes to on.
I also attempted a BMC restart with `bmc reset cold`, but this did not fix the problem.
The event log shows two recent voltage assertions (sensor #0x74, on 06/26/2025 and 09/22/2025), preceded by a memory event on 06/17/2025.
ipmitool> sel list
1 | 04/05/2023 | 02:36:52 PM UTC | Event Logging Disabled #0x72 | Log area reset/cleared | Asserted
2 | 04/05/2023 | 07:16:48 PM UTC | Physical Security #0x73 | General Chassis intrusion () | Asserted
3 | 04/05/2023 | 07:16:52 PM UTC | Physical Security #0x73 | General Chassis intrusion () | Deasserted
4 | 04/05/2023 | 07:34:18 PM UTC | Drive Slot / Bay #0x98 | Drive Present () | Deasserted
5 | 04/05/2023 | 07:34:33 PM UTC | Drive Slot / Bay #0x9e | Drive Present () | Deasserted
6 | 04/05/2023 | 07:34:48 PM UTC | Drive Slot / Bay #0xa0 | Drive Present () | Deasserted
7 | 04/05/2023 | 07:35:04 PM UTC | Drive Slot / Bay #0xa1 | Drive Present () | Deasserted
8 | 04/05/2023 | 07:35:09 PM UTC | Drive Slot / Bay #0xa2 | Drive Present () | Deasserted
9 | 04/05/2023 | 07:37:44 PM UTC | Drive Slot / Bay #0xa4 | Drive Present () | Deasserted
a | 04/05/2023 | 07:37:49 PM UTC | Drive Slot / Bay #0xa5 | Drive Present () | Deasserted
b | 04/05/2023 | 07:40:04 PM UTC | Drive Slot / Bay #0xa4 | Drive Present () | Asserted
c | 04/05/2023 | 07:40:04 PM UTC | Drive Slot / Bay #0xa5 | Drive Present () | Asserted
d | 04/05/2023 | 07:45:59 PM UTC | Drive Slot / Bay #0xa1 | Drive Present () | Asserted
e | 04/05/2023 | 07:45:59 PM UTC | Drive Slot / Bay #0xa2 | Drive Present () | Asserted
f | 04/05/2023 | 07:46:03 PM UTC | Drive Slot / Bay #0x9e | Drive Present () | Asserted
10 | 04/05/2023 | 07:46:08 PM UTC | Drive Slot / Bay #0xa0 | Drive Present () | Asserted
11 | 04/05/2023 | 07:46:13 PM UTC | Drive Slot / Bay #0x98 | Drive Present () | Asserted
12 | 04/11/2023 | 10:11:35 AM UTC | Battery #0x88 | Failed | Asserted
13 | 04/11/2023 | 01:30:45 PM UTC | Battery #0x88 | Failed | Deasserted
14 | 06/17/2025 | 06:49:41 AM UTC | Memory #0x1b | Monitor | Asserted
15 | 06/26/2025 | 03:39:46 PM UTC | Voltage #0x74 | State Asserted | Asserted
16 | 09/22/2025 | 11:30:54 AM UTC | Voltage #0x74 | State Asserted | Asserted
ipmitool>
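In case it helps with triage, here is a minimal Python sketch for picking the recent entries out of SEL output like the above. It assumes one entry per line in the pipe-delimited, six-field layout that ipmitool prints. The function names `parse_sel` and `events_since` are hypothetical helpers written for this illustration, not part of ipmitool or any existing tooling, and the sample is just the last four entries from the log.

```python
from datetime import datetime

def parse_sel(text):
    """Parse `ipmitool sel list` text into dicts, one per six-field entry."""
    entries = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 6:
            continue  # skip prompts, blank lines and anything unexpected
        entry_id, date, time, sensor, event, state = fields
        try:
            # Timestamps look like "06/26/2025 03:39:46 PM UTC"
            when = datetime.strptime(f"{date} {time}", "%m/%d/%Y %I:%M:%S %p UTC")
        except ValueError:
            continue  # skip entries without a parseable timestamp
        entries.append({"id": entry_id, "when": when, "sensor": sensor,
                        "event": event, "state": state})
    return entries

def events_since(entries, year):
    """Keep only the entries logged in the given year or later."""
    return [e for e in entries if e["when"].year >= year]

# Last four entries copied from the SEL output in this task.
SAMPLE = """\
13 | 04/11/2023 | 01:30:45 PM UTC | Battery #0x88 | Failed | Deasserted
14 | 06/17/2025 | 06:49:41 AM UTC | Memory #0x1b | Monitor | Asserted
15 | 06/26/2025 | 03:39:46 PM UTC | Voltage #0x74 | State Asserted | Asserted
16 | 09/22/2025 | 11:30:54 AM UTC | Voltage #0x74 | State Asserted | Asserted
"""

recent = events_since(parse_sel(SAMPLE), 2025)
for e in recent:
    print(e["id"], e["when"].isoformat(), e["sensor"], e["state"])
```

Filtering `recent` on `e["sensor"].startswith("Voltage")` narrows this down to the two voltage assertions mentioned above.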
Please could you investigate and try to fix this? It is not urgent.
If the server can be made to power up, it should rejoin the Hadoop cluster without any intervention, so feel free to let this happen.
I marked the server as failed in Netbox.
Thanks very much