db1178 didn't come back online after reboot
Closed, Resolved · Public

Description

It's in s8.

Event Timeline

SEL:

racadm>>getsel
Record:      1
Date/Time:   02/11/2021 14:31:13
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   07/28/2023 19:04:01
Source:      system
Severity:    Non-Critical
Description: The memory health monitor feature has detected a degradation in the DIMM installed in DIMM_A7. Reboot system to initiate self-heal process.
-------------------------------------------------------------------------------

But that entry is old.
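For reference, a minimal sketch of checking getsel output like the above for non-Ok entries and their timestamps. It assumes the record layout shown in this paste (Record, Date/Time, Source, Severity, Description fields, records separated by dashed lines, dates as MM/DD/YYYY). The names parse_sel, non_ok and entry_time are hypothetical helpers, not part of racadm.

```python
from datetime import datetime

# Sample copied from the getsel output in this task.
SAMPLE = """\
Record:      1
Date/Time:   02/11/2021 14:31:13
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   07/28/2023 19:04:01
Source:      system
Severity:    Non-Critical
Description: The memory health monitor feature has detected a degradation in the DIMM installed in DIMM_A7. Reboot system to initiate self-heal process.
-------------------------------------------------------------------------------
"""

FIELDS = ("Record", "Date/Time", "Source", "Severity", "Description")


def parse_sel(text):
    """Split getsel-style text into a list of dicts, one per record."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("---"):
            # A dashed line closes the current record.
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so the time of day stays intact.
        key, sep, value = line.partition(":")
        if sep and key in FIELDS:
            current[key] = value.strip()
    if current:
        records.append(current)
    return records


def non_ok(records):
    """Return the records whose severity is anything other than Ok."""
    return [r for r in records if r.get("Severity") != "Ok"]


def entry_time(record):
    """Parse the Date/Time field (assumed MM/DD/YYYY HH:MM:SS)."""
    return datetime.strptime(record["Date/Time"], "%m/%d/%Y %H:%M:%S")


if __name__ == "__main__":
    for rec in non_ok(parse_sel(SAMPLE)):
        print(entry_time(rec).isoformat(), rec["Severity"], rec["Description"])
```

Comparing entry_time against the reboot time is one way to tell whether an entry predates the current incident, as the DIMM_A7 one here does.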

I ran serveraction hardreset, but I still can't SSH into the host.

Change 952036 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] db1178: Disable notification

https://gerrit.wikimedia.org/r/952036

Change 952036 merged by Ladsgroup:

[operations/puppet@production] db1178: Disable notification

https://gerrit.wikimedia.org/r/952036

Server is in a boot loop. Troubleshooting now.

Server is out of warranty. Pulled a DIMM from a recently decommissioned server and replaced the one in slot A7.

Server is back up and running.

Thanks for the fast fix. I really appreciate it.