
Broken RAM on db1127
Open, Needs Triage, Public

Description

db1127 rebooted itself due to a broken DIMM (DIMM_A3, per the hardware log entries below). The server is still under warranty.

Record:      18
Date/Time:   07/15/2021 16:31:24
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
Record:      19
Date/Time:   07/15/2021 16:31:40
Source:      system
Severity:    Critical
Description: Correctable memory error logging disabled for a memory device at location DIMM_A3.
-------------------------------------------------------------------------------
Record:      20
Date/Time:   07/16/2021 08:03:59
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
-------------------------------------------------------------------------------
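For anyone triaging similar logs, here is a minimal Python sketch that groups the log records above by the DIMM slot they mention. It assumes the log is plain "Key: value" lines with dashed separator lines between records, as in the excerpt. The function names `parse_records` and `dimms_by_record` are hypothetical and not part of any existing tooling.

```python
import re
from collections import defaultdict

# Log excerpt copied verbatim from the task description.
SEL_TEXT = """\
Record:      18
Date/Time:   07/15/2021 16:31:24
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
Record:      19
Date/Time:   07/15/2021 16:31:40
Source:      system
Severity:    Critical
Description: Correctable memory error logging disabled for a memory device at location DIMM_A3.
-------------------------------------------------------------------------------
Record:      20
Date/Time:   07/16/2021 08:03:59
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
-------------------------------------------------------------------------------
"""


def parse_records(text):
    """Split the log on dashed separator lines and parse 'Key: value' pairs."""
    records = []
    for chunk in re.split(r"^-{5,}$", text, flags=re.MULTILINE):
        fields = {}
        for line in chunk.strip().splitlines():
            # Split on the first colon only, so timestamps stay intact.
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields:
            records.append(fields)
    return records


def dimms_by_record(records):
    """Map each DIMM slot name (e.g. DIMM_A3) to the record numbers mentioning it."""
    hits = defaultdict(list)
    for rec in records:
        # Assumption: slot names look like DIMM_<letter><digits>.
        for slot in re.findall(r"DIMM_[A-Z]\d+", rec.get("Description", "")):
            hits[slot].append(int(rec["Record"]))
    return dict(hits)


records = parse_records(SEL_TEXT)
print(dimms_by_record(records))  # {'DIMM_A3': [18, 19, 20]}
```

All three records point at the same slot, which is consistent with a single failing module rather than a wider memory problem.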

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2021-07-16T08:28:30Z] <kormat@cumin1001> dbctl commit (dc=all): 'Depooling db1127 due to RAM failures T286763', diff saved to https://phabricator.wikimedia.org/P16827 and previous config saved to /var/cache/conftool/dbconfig/20210716-082829-kormat.json

Change 704926 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] db1127: Disable notificactions.

https://gerrit.wikimedia.org/r/704926

Change 704926 merged by Kormat:

[operations/puppet@production] db1127: Disable notificactions.

https://gerrit.wikimedia.org/r/704926