
hw troubleshooting: power supply alert for cloudcephosd1031.eqiad.wmnet
Closed, Resolved | Public | Request

Description

  • Provide FQDN of system.
  • If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
  • Put system into a failed state in Netbox.
  • Provide urgency of request, along with justification (redundancy, dependencies, etc.).
  • Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help.)
  • Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

Icinga shows an issue with the power supply:

Current Status:	  CRITICAL  (for 20d 7h 17m 38s)
Status Information:	Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical]
Performance Data:	'Temp'=37.00 'Temp'=38.00 'Inlet Temp'=23.00;3.00:38.00;-7.00:42.00 'Exhaust Temp'=41.00;8.00:75.00;3.00:80.00
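
For reference, the temperature readings in the performance data can be checked against their own thresholds. The following is only a rough sketch, assuming the usual Nagios-style `'label'=value;warn;crit` layout with simple `low:high` ranges; the helper names (`in_range`, `sensor_state`, `check_perfdata`) are made up for illustration and are not part of the actual check plugin.

```python
import re

# 'label'=value[;warn[;crit]] ; any min/max fields are ignored in this sketch.
PERF_RE = re.compile(r"'([^']+)'=(-?\d+(?:\.\d+)?)(?:;([^;\s]*))?(?:;([^;\s]*))?")

def in_range(value, spec):
    """Return True if value lies inside a simple 'low:high' range (assumed form)."""
    low, high = spec.split(":")
    return float(low) <= value <= float(high)

def sensor_state(value, warn, crit):
    """Classify one reading; a missing threshold is treated as 'no limit'."""
    if crit and not in_range(value, crit):
        return "CRITICAL"
    if warn and not in_range(value, warn):
        return "WARNING"
    return "OK"

def check_perfdata(text):
    """Return (label, value, state) for every reading found in the text."""
    return [
        (label, float(value), sensor_state(float(value), warn, crit))
        for label, value, warn, crit in PERF_RE.findall(text)
    ]

# Performance data copied from the Icinga alert above.
PERFDATA = (
    "'Temp'=37.00 'Temp'=38.00 "
    "'Inlet Temp'=23.00;3.00:38.00;-7.00:42.00 "
    "'Exhaust Temp'=41.00;8.00:75.00;3.00:80.00"
)

for label, value, state in check_perfdata(PERFDATA):
    print(f"{label}: {value} -> {state}")
```

Under those assumptions all four temperature readings fall inside their ranges, which is consistent with the alert being driven by the power supply sensor rather than by temperature.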

Output of ipmi-sel:

$ sudo ipmi-sel
ID  | Date        | Time     | Name             | Type                        | Event
1   | Dec-20-2021 | 14:40:37 | SEL              | Event Logging Disabled      | Log Area Reset/Cleared
2   | Dec-20-2021 | 14:46:49 | Sensor #0        | OS Boot                     | C: boot completed
3   | Dec-20-2021 | 14:46:49 | N/A              | N/A                         | OEM defined = 00h 52h 97h C0h 61h 00h
4   | Jun-26-2022 | 16:24:28 | Status           | Power Supply                | Power Supply input lost (AC/DC)
5   | Jun-26-2022 | 16:25:24 | PS Redundancy    | Power Supply                | Redundancy Lost
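
In case it helps with triage on the other new hosts, the power supply entries can be pulled out of the SEL listing with a few lines of Python. This is a minimal sketch that assumes the pipe-delimited, six-column layout shown above; `parse_sel` and `power_supply_events` are hypothetical helper names, not part of FreeIPMI.

```python
def parse_sel(text):
    """Parse pipe-delimited `ipmi-sel` output into a list of dicts."""
    events = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        # Skip the header row and anything that is not a six-column record.
        if len(fields) != 6 or not fields[0].isdigit():
            continue
        rec_id, date, time, name, etype, event = fields
        events.append({
            "id": int(rec_id),
            "date": date,
            "time": time,
            "name": name,
            "type": etype,
            "event": event,
        })
    return events

def power_supply_events(events):
    """Keep only the records whose Type column is 'Power Supply'."""
    return [e for e in events if e["type"] == "Power Supply"]

# SEL output copied from this task.
SAMPLE = """\
ID  | Date        | Time     | Name             | Type                        | Event
1   | Dec-20-2021 | 14:40:37 | SEL              | Event Logging Disabled      | Log Area Reset/Cleared
2   | Dec-20-2021 | 14:46:49 | Sensor #0        | OS Boot                     | C: boot completed
3   | Dec-20-2021 | 14:46:49 | N/A              | N/A                         | OEM defined = 00h 52h 97h C0h 61h 00h
4   | Jun-26-2022 | 16:24:28 | Status           | Power Supply                | Power Supply input lost (AC/DC)
5   | Jun-26-2022 | 16:25:24 | PS Redundancy    | Power Supply                | Redundancy Lost
"""

for e in power_supply_events(parse_sel(SAMPLE)):
    print(e["id"], e["date"], e["time"], e["name"], "-", e["event"])
```

On a live host the text would come from `sudo ipmi-sel` instead of the pasted sample.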

Event Timeline

Please note that the host is not currently in use; it is part of a new group of hosts that are being added to a Ceph cluster. I won't add this one to the cluster until the power supply issue is resolved.

Jclark-ctr claimed this task.
Jclark-ctr added subscribers: Cmjohnson, Jclark-ctr.

Reseated power cable