Today we were paged because db2174 suddenly lost power and restarted as a result:
11/21/2022 14:10:55: The power input for power supply 1 is lost.
Subject | Repo | Branch | Lines +/-
---|---|---|---
db2174: Disable notifications | operations/puppet | production | +1 -0
The first timeout on Icinga matches that log entry:
Service Unknown [2022-11-21 15:11:00] SERVICE ALERT: db2174;Check for large files in client bucket;UNKNOWN;SOFT;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
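For anyone who wants to line such alerts up against the hardware log programmatically, here is a minimal sketch that splits the alert line above into its fields. It assumes the usual semicolon-separated `SERVICE ALERT` layout (host, service, state, state type, attempt, plugin output) and the human-readable timestamp shown in this task; the `ServiceAlert` type and `parse_service_alert` function are hypothetical names used only for illustration, not part of any Icinga tooling.

```python
import re
from datetime import datetime
from typing import NamedTuple


class ServiceAlert(NamedTuple):
    """Hypothetical container for one SERVICE ALERT log line."""
    timestamp: datetime
    host: str
    service: str
    state: str
    state_type: str
    attempt: int
    output: str


# Matches "[YYYY-MM-DD HH:MM:SS] SERVICE ALERT: <fields>" anywhere in the line,
# so a leading label such as "Service Unknown" is simply skipped.
ALERT_RE = re.compile(r"\[(?P<ts>[^\]]+)\]\s+SERVICE ALERT:\s+(?P<rest>.*)")


def parse_service_alert(line: str) -> ServiceAlert:
    match = ALERT_RE.search(line)
    if match is None:
        raise ValueError("not a SERVICE ALERT line")
    # Split at most 5 times so semicolons in the plugin output are preserved.
    host, service, state, state_type, attempt, output = match.group("rest").split(";", 5)
    return ServiceAlert(
        timestamp=datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S"),
        host=host,
        service=service,
        state=state,
        state_type=state_type,
        attempt=int(attempt),
        output=output,
    )


# The line quoted in this task.
LINE = (
    "Service Unknown[2022-11-21 15:11:00] SERVICE ALERT: db2174;"
    "Check for large files in client bucket;UNKNOWN;SOFT;1;"
    "CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds."
)
alert = parse_service_alert(LINE)
print(alert.host, alert.state, alert.state_type, alert.timestamp)
```

The timestamp is parsed as a naive datetime; which time zone each system reports in is not stated here, so any comparison against the hardware log would need to account for that separately.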
However, technically, that wouldn't explain a full host failure: the host had power redundancy. Did the second power supply fail at the same time without being logged, or did it fail uncleanly? Weird. 🤔
-------------------------------------------------------------------------------
Record: 5
Date/Time: 08/10/2022 17:30:25
Source: system
Severity: Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record: 6
Date/Time: 11/21/2022 14:10:55
Source: system
Severity: Critical
Description: The power input for power supply 1 is lost.
-------------------------------------------------------------------------------
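To check a longer event log for other power-related entries (for example, whether the second power supply ever logged anything), a rough sketch like the following could parse records in the format quoted above and filter them by severity. It assumes records are separated by runs of dashes and use exactly the five field names seen here; `parse_sel` is a hypothetical helper, not an existing tool, and the sample text is just the two records from this task.

```python
import re
from datetime import datetime

# The two records quoted in this task, one field per line.
SEL_TEXT = """
-------------------------------------------------------------------------------
Record: 5
Date/Time: 08/10/2022 17:30:25
Source: system
Severity: Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record: 6
Date/Time: 11/21/2022 14:10:55
Source: system
Severity: Critical
Description: The power input for power supply 1 is lost.
-------------------------------------------------------------------------------
"""

KEYS = r"(?:Record|Date/Time|Source|Severity|Description)"
# A field value runs until the next known field name or the end of the chunk,
# so this works whether fields are on separate lines or flattened onto one.
FIELD_RE = re.compile(rf"({KEYS}):\s*(.*?)(?=\s*{KEYS}:|\s*$)", re.S)


def parse_sel(text):
    """Return a list of dicts, one per event log record (assumed format)."""
    records = []
    for chunk in re.split(r"-{5,}", text):
        fields = {key: value.strip() for key, value in FIELD_RE.findall(chunk)}
        if "Record" not in fields:
            continue  # separator residue or empty chunk
        fields["Date/Time"] = datetime.strptime(
            fields["Date/Time"], "%m/%d/%Y %H:%M:%S"
        )
        records.append(fields)
    return records


records = parse_sel(SEL_TEXT)
critical = [r for r in records if r["Severity"] == "Critical"]
for rec in critical:
    print(rec["Date/Time"], rec["Description"])
```

With only these two records, the filter returns the single power supply 1 event; nothing in the quoted log mentions power supply 2.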
I checked and everything looks good on the server. @Ladsgroup, can you confirm that all is good on your end for this server?
Thanks
@Papaul, I wonder if we could do a "simple" test of the power supply redundancy by "pulling the plug" (literally, or just by pushing the on/off button) on each power supply, one at a time, to check that redundancy works as expected (maybe you already did that).
@Ladsgroup told me the host is downtimed and depooled, so any time now would be OK, or you can sync with him for a better time later on.
Change 859328 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/puppet@production] db2174: Disable notifications
Change 859328 merged by Marostegui:
[operations/puppet@production] db2174: Disable notifications
Given that it is a public holiday in the US and Papaul won't be onsite until Monday, I am starting replication so the host doesn't fall that many days behind. I will stop it again on Monday.
I tested the hardware on the server and everything looks good. The only error I got was error code 2000-0251, which is not a big issue; see the link below for more information on that error code. I think the task can be closed. Thanks.
https://www.dell.com/support/kbdoc/en-us/000139065/resolving-error-code-2000-0251-when-launching-the-epsa-diagnostics-on-dell-pc
Thank you Papaul, I will get this host back to the load balancer and then close the task.