hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet
Closed, Resolved · Public · Request

Description

  • Provide FQDN of system.
  • If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
  • Put system into a failed state in Netbox.
  • Provide urgency of request, along with justification (redundancy, dependencies, etc.)
  • Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
  • Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.
  • FQDN: mw1486.eqiad.wmnet
  • Urgency: Low (server works but risks not coming back without console intervention on reboot)
  • Hardware failure: On last reboot, I had to connect to console and force boot via F1 because of a power issue (log in comments)

Event Timeline

Clement_Goubert created this task.

racadm getsel log:

-------------------------------------------------------------------------------
Record:      5
Date/Time:   01/05/2023 14:37:39
Source:      system
Severity:    Critical
Description: The system halted because system power exceeds capacity.
-------------------------------------------------------------------------------

It was still possible to ignore the error and force the boot, which I did to unblock deployments. However, I am now concerned that the PSU may be unstable. Although the host is functioning correctly, I have depooled it for now, pending investigation if DC-Ops deems one necessary.
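
For anyone triaging similar hosts, here is a minimal sketch of how the `racadm getsel` output quoted above could be filtered for critical records. It assumes the log keeps the `Key: value` layout with dashed separator lines shown in this task; the helper names `parse_sel` and `critical_records` are hypothetical and not part of any existing tooling.

```python
import re

# Sample copied from the racadm getsel output in this task.
SAMPLE = """\
-------------------------------------------------------------------------------
Record:      5
Date/Time:   01/05/2023 14:37:39
Source:      system
Severity:    Critical
Description: The system halted because system power exceeds capacity.
-------------------------------------------------------------------------------
"""


def parse_sel(text):
    """Split getsel-style text into a list of dicts, one per record.

    Assumes records are delimited by lines made only of dashes and that
    each field is on its own line as 'Key: value'.
    """
    records, current = [], {}
    for line in text.splitlines():
        if re.fullmatch(r"-{5,}", line.strip()):
            # A separator closes the record collected so far, if any.
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def critical_records(text):
    """Return only the records whose Severity field is 'Critical'."""
    return [r for r in parse_sel(text) if r.get("Severity") == "Critical"]


if __name__ == "__main__":
    for rec in critical_records(SAMPLE):
        print(rec["Date/Time"], rec["Description"])
```

Splitting on the first colon only keeps the `Date/Time` value whole, which matters because the timestamp itself contains colons.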

Thank you for depooling. I will investigate today while on site.

Created ticket
Confirmed: Service Request 159722060 was successfully submitted.
Submitted TSR report to Dell

18:16 <+icinga-wm> PROBLEM - Host mw1486 is DOWN: PING CRITICAL - Packet loss = 100%

Mentioned in SAL (#wikimedia-operations) [2023-01-06T18:20:35Z] <sukhe@cumin2002> START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on mw1486.eqiad.wmnet with reason: downtimed, hw failure: T326425

Mentioned in SAL (#wikimedia-operations) [2023-01-06T18:20:50Z] <sukhe@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on mw1486.eqiad.wmnet with reason: downtimed, hw failure: T326425

Dzahn changed the task status from Open to In Progress. Jan 6 2023, 6:21 PM

Sorry @Dzahn, @ssingh, I had forgotten to downtime it.

Performed flea power drain as requested by Dell.

Don't worry about it please, but do note that we set it for four days so you may want to extend that; I will leave that to you.

Can we pool it back, or do you still need it for further troubleshooting?

Icinga downtime and Alertmanager silence (ID=edb03633-d9b6-4a06-849d-2c3da0e62688) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services with reason: hardware troubleshooting

mw1486.eqiad.wmnet

I checked yesterday afternoon and did not see any alerts. Let’s repool the server and close the ticket.

Mentioned in SAL (#wikimedia-operations) [2023-01-11T11:51:23Z] <claime> repooled mw1486 in api_appserver eqiad after hardware investigation - T326425