hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet
Closed, Resolved | Public | Request

Authored By

	Clement_Goubert
	Jan 6 2023, 3:58 PM

Description

- Provide FQDN of system.
- If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
- Put system into a failed state in Netbox.
- Provide urgency of request, along with justification (redundancy, dependencies, etc)
- Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
- Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

FQDN: mw1486.eqiad.wmnet
Urgency: Low (server works but risks not coming back without console intervention on reboot)
Hardware failure: On last reboot, I had to connect to console and force boot via F1 because of a power issue (log in comments)

Event Timeline

Clement_Goubert triaged this task as Low priority. Jan 6 2023, 3:58 PM

Clement_Goubert created this task.

racadm getsel log:

-------------------------------------------------------------------------------
Record:      5
Date/Time:   01/05/2023 14:37:39
Source:      system
Severity:    Critical
Description: The system halted because system power exceeds capacity.
-------------------------------------------------------------------------------

Ignoring the error to force boot was still possible, which I did to unblock deployments. However, I am now concerned that the PSU may be unstable. Although the host is currently functioning correctly, I have depooled it for now, pending investigation if DC-Ops deems it necessary.
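For anyone triaging similar reports, the SEL record above can be checked mechanically. The following is only a minimal sketch: it assumes the `Key: value` layout and dashed separators shown in the log above, and the helper names `parse_sel` and `critical_records` are hypothetical, not part of racadm or any WMF tooling.

```python
# Sample text copied from the racadm getsel output quoted in this task.
SEL_TEXT = """\
-------------------------------------------------------------------------------
Record:      5
Date/Time:   01/05/2023 14:37:39
Source:      system
Severity:    Critical
Description: The system halted because system power exceeds capacity.
-------------------------------------------------------------------------------
"""


def parse_sel(text):
    """Split SEL text into one dict per record (hypothetical helper)."""
    records = []
    current = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if set(line) == {"-"}:
            # A dashed separator closes the record being built, if any.
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so the time value keeps its colons.
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def critical_records(records):
    """Keep only the records whose Severity field is 'Critical'."""
    return [r for r in records if r.get("Severity") == "Critical"]


records = parse_sel(SEL_TEXT)
critical = critical_records(records)
for rec in critical:
    print(rec["Date/Time"], rec["Description"])
```

Splitting on the first colon keeps the timestamp intact; real SEL output may contain fields or layouts this sketch does not handle.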

Clement_Goubert updated the task description. Jan 6 2023, 4:00 PM

Clement_Goubert moved this task from Incoming 🐫 to 🛠 Upgrades and Hardware on the serviceops board. Jan 6 2023, 4:06 PM

Maintenance_bot added a project: SRE. Jan 6 2023, 4:29 PM

Thank you for depooling. I will investigate today while on site.

Created ticket
Confirmed: Service Request 159722060 was successfully submitted.
Submitted TSR report to Dell

18:16 <+icinga-wm> PROBLEM - Host mw1486 is DOWN: PING CRITICAL - Packet loss = 100%

Mentioned in SAL (#wikimedia-operations) [2023-01-06T18:20:35Z] <sukhe@cumin2002> START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on mw1486.eqiad.wmnet with reason: downtimed, hw failure: T326425

Mentioned in SAL (#wikimedia-operations) [2023-01-06T18:20:50Z] <sukhe@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on mw1486.eqiad.wmnet with reason: downtimed, hw failure: T326425

Dzahn changed the task status from Open to In Progress. Jan 6 2023, 6:21 PM

In T326425#8505438, @Dzahn wrote:

18:16 <+icinga-wm> PROBLEM - Host mw1486 is DOWN: PING CRITICAL - Packet loss = 100%

Sorry @Dzahn, @ssingh, I had forgotten to downtime it.

Performed flea power drain as requested by Dell.

In T326425#8508075, @Clement_Goubert wrote:

In T326425#8505438, @Dzahn wrote:

18:16 <+icinga-wm> PROBLEM - Host mw1486 is DOWN: PING CRITICAL - Packet loss = 100%

Sorry @Dzahn, @ssingh, I had forgotten to downtime it.

Don't worry about it please, but do note that we set it for four days so you may want to extend that; I will leave that to you.

In T326425#8509196, @Jclark-ctr wrote:

Performed flea power drain as requested by Dell.

Can we pool it back, or do you still need it for further troubleshooting?

Icinga downtime and Alertmanager silence (ID=edb03633-d9b6-4a06-849d-2c3da0e62688) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services with reason: hardware troubleshooting

mw1486.eqiad.wmnet

I checked yesterday afternoon and did not see any alerts. Let’s repool the server and close the ticket.

Jclark-ctr closed this task as Resolved. Jan 11 2023, 11:41 AM

Mentioned in SAL (#wikimedia-operations) [2023-01-11T11:51:23Z] <claime> repooled mw1486 in api_appserver eqiad after hardware investigation - T326425

Server repooled, thanks a bunch.

Aklapper removed a subscriber: serviceops. May 16 2023, 1:16 PM
