Broken RAM on db1127
Closed, ResolvedPublic

Description

db1127 rebooted itself due to a broken DIMM. The server is still under warranty.

Record:      18
Date/Time:   07/15/2021 16:31:24
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
Record:      19
Date/Time:   07/15/2021 16:31:40
Source:      system
Severity:    Critical
Description: Correctable memory error logging disabled for a memory device at location DIMM_A3.
-------------------------------------------------------------------------------
Record:      20
Date/Time:   07/16/2021 08:03:59
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
-------------------------------------------------------------------------------
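
For anyone triaging similar reboots, here is a minimal sketch of how the event log records above could be parsed to see which DIMM the errors point at. It assumes the log keeps the `Key: value` layout and dashed separators shown in this task. The names `SEL_TEXT`, `parse_sel` and `affected_dimms` are hypothetical and not part of any Dell or Wikimedia tooling.

```python
import re
from collections import defaultdict

# The three records pasted in this task, copied verbatim.
SEL_TEXT = """\
Record:      18
Date/Time:   07/15/2021 16:31:24
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
Record:      19
Date/Time:   07/15/2021 16:31:40
Source:      system
Severity:    Critical
Description: Correctable memory error logging disabled for a memory device at location DIMM_A3.
-------------------------------------------------------------------------------
Record:      20
Date/Time:   07/16/2021 08:03:59
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
-------------------------------------------------------------------------------
"""


def parse_sel(text):
    """Split the log on dashed separator lines; return one dict per record."""
    records = []
    for block in re.split(r"^-{5,}\s*$", text, flags=re.MULTILINE):
        fields = {}
        for line in block.strip().splitlines():
            # Split on the first colon only, so timestamps stay intact.
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields:
            records.append(fields)
    return records


def affected_dimms(records):
    """Map each DIMM slot named in a Description to the record numbers citing it."""
    hits = defaultdict(list)
    for rec in records:
        for dimm in re.findall(r"DIMM_[A-Z]\d+", rec.get("Description", "")):
            hits[dimm].append(rec.get("Record"))
    return dict(hits)


if __name__ == "__main__":
    for dimm, record_ids in affected_dimms(parse_sel(SEL_TEXT)).items():
        print(dimm, "cited in records", ", ".join(record_ids))
```

On the records in this task, every entry names the same slot. That matches the progression from correctable errors on 07/15 to multi-bit errors on 07/16.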

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2021-07-16T08:28:30Z] <kormat@cumin1001> dbctl commit (dc=all): 'Depooling db1127 due to RAM failures T286763', diff saved to https://phabricator.wikimedia.org/P16827 and previous config saved to /var/cache/conftool/dbconfig/20210716-082829-kormat.json

Change 704926 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] db1127: Disable notificactions.

https://gerrit.wikimedia.org/r/704926

Change 704926 merged by Kormat:

[operations/puppet@production] db1127: Disable notificactions.

https://gerrit.wikimedia.org/r/704926

Marostegui triaged this task as Medium priority. Aug 2 2021, 8:58 AM
Marostegui added a project: DBA.

Dispatch created with Dell. Request SR1066677487 was submitted successfully.

The DIMM has arrived. The server will need to be taken offline for a few minutes to swap the DIMM.

@Cmjohnson I can do that now, let me know if that works. If not, just let me know when it would work for you and I will get the server offline for you.

@Cmjohnson host off - you can proceed as needed

DIMM A3 was replaced and the log was cleared.

Memory looks good now.
This host needs to be recloned - I will do that tomorrow.

Thanks Chris

Mentioned in SAL (#wikimedia-operations) [2021-08-04T04:34:39Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1174 to clone db1127 T286763', diff saved to https://phabricator.wikimedia.org/P16948 and previous config saved to /var/cache/conftool/dbconfig/20210804-043438-marostegui.json

I am cloning db1127 from db1174

Cloned - waiting for replication to catch up

Mentioned in SAL (#wikimedia-operations) [2021-08-04T06:45:48Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db1174 and db1127 T286763', diff saved to https://phabricator.wikimedia.org/P16954 and previous config saved to /var/cache/conftool/dbconfig/20210804-064548-marostegui.json

Host pooled, GTID enabled, notifications enabled. All sorted.