db2174 lost power
Closed, ResolvedPublic

Description

Today we were paged because db2174 suddenly lost power and restarted as a result:

11/21/2022 14:10:55: The power input for power supply 1 is lost.

Event Timeline

The first timeout on Icinga matches that log:

[2022-11-21 15:11:00] SERVICE ALERT: db2174;Check for large files in client bucket;UNKNOWN;SOFT;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.

However, that alone wouldn't explain a full host failure, since the host had power redundancy. Did the second power supply fail at the same time without being logged, or did it fail uncleanly? Weird. 🤔

-------------------------------------------------------------------------------
Record:      5
Date/Time:   08/10/2022 17:30:25
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   11/21/2022 14:10:55
Source:      system
Severity:    Critical
Description: The power input for power supply 1 is lost.
-------------------------------------------------------------------------------
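
To double-check which power supplies actually logged an input loss, the excerpt above can be parsed with a short script. This is only a rough sketch: it assumes the log keeps the layout shown here (dashed separator lines, one `Key: value` field per line), and the names `parse_sel` and `psus_with_input_loss` are made up for illustration, not part of any existing tool.

```python
import re

# Excerpt copied from the log above; a real run would read the full dump instead.
SEL_EXCERPT = """\
-------------------------------------------------------------------------------
Record:      5
Date/Time:   08/10/2022 17:30:25
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   11/21/2022 14:10:55
Source:      system
Severity:    Critical
Description: The power input for power supply 1 is lost.
-------------------------------------------------------------------------------
"""


def parse_sel(text):
    """Split the dump on dashed separator lines and turn each block into a dict."""
    records = []
    for block in re.split(r"^-{5,}[ \t]*$", text, flags=re.MULTILINE):
        fields = {}
        for line in block.strip().splitlines():
            # Split on the first colon only, so timestamps keep their colons.
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields:
            records.append(fields)
    return records


def psus_with_input_loss(records):
    """Return the set of power supply numbers that logged an input loss."""
    lost = set()
    for rec in records:
        match = re.search(
            r"power input for power supply (\d+) is lost",
            rec.get("Description", ""),
        )
        if match:
            lost.add(int(match.group(1)))
    return lost


records = parse_sel(SEL_EXCERPT)
lost = psus_with_input_loss(records)
print(sorted(lost))
```

On this excerpt only power supply 1 shows up, which is consistent with the question above: nothing was logged for the second supply.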
Papaul renamed this task from db1174 lost power to db2174 lost power. Nov 21 2022, 5:04 PM
Papaul moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board.

I checked and everything looks good on the server. @Ladsgroup, can you confirm that all is good on your end for this server?
Thanks

@Papaul, I wonder if we could run a "simple" test of the power supply redundancy by "pulling the plug" (literally, or just by pushing the on/off button) to check that it works as expected, once on each power supply (maybe you already did that).

@Ladsgroup told me the host is downtimed and depooled, so any time now would be OK; otherwise, sync with him for a better time later on.

jcrespo triaged this task as High priority. Nov 21 2022, 6:00 PM

@Papaul, I wonder if we could run a "simple" test of the power supply redundancy by "pulling the plug" (literally, or just by pushing the on/off button) to check that it works as expected (maybe you already did that).

+1

Change 859328 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db2174: Disable notifications

https://gerrit.wikimedia.org/r/859328

Change 859328 merged by Marostegui:

[operations/puppet@production] db2174: Disable notifications

https://gerrit.wikimedia.org/r/859328

I have left MySQL stopped so @Papaul can do the test whenever he wants.

Given that it is a public holiday in the US and Papaul won't be onsite until Monday, I am starting replication so the host doesn't fall that many days behind. I will stop it again on Monday.

MySQL is now off again, so @Papaul, you can run the test whenever you are able to.

I tested the hardware on the server and everything looks good. The only error I got was error code 2000-0251, which is not a big issue; see the link below for more information on that error code. I think the task can be closed. Thanks.
https://www.dell.com/support/kbdoc/en-us/000139065/resolving-error-code-2000-0251-when-launching-the-epsa-diagnostics-on-dell-pc

Thank you Papaul, I will get this host back to the load balancer and then close the task.

Host being repooled automatically. Notifications enabled.