Memory error on restbase1016
Open, Needs Triage, Public

Description

When rebooting restbase1016, it stopped the boot complaining about a broken memory DIMM:

UEFI0107: One or more memory errors have occurred on memory slot: A2. Remove 
input power to the system, reseat the DIMM module and restart the system. If the 
issues persist, replace the faulty memory module identified in the message.

UEFI0081: Memory configuration has changed from the last time the system was
started.
If the change is expected, no action is necessary. Otherwise, check the DIMM
population inside the system and memory settings in System Setup.

UEFI0058: Uncorrectable Memory Error has occurred because a Dual Inline Memory
Module (DIMM) is not functioning.
Check the System Event Log (SEL) to identify the non-functioning DIMM, and then
replace it.

Change 481495 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/restbase/deploy@master] Also increase temporarily the delay because of T212418

https://gerrit.wikimedia.org/r/481495

Change 481495 merged by Mobrovac:
[mediawiki/services/restbase/deploy@master] Also increase temporarily the delay because of T212418

https://gerrit.wikimedia.org/r/481495

Eevans added a subscriber: Eevans. Thu, Jan 3, 3:13 PM

Is there any status update, or ETA on this?

Mentioned in SAL (#wikimedia-operations) [2019-01-08T16:56:50Z] <urandom> forcing removal of restbase1016-a (host down way too long to salvage) -- T212418

Eevans added a comment. Tue, Jan 8, 5:32 PM

We're currently in the process of force-removing these instances. We'll need to coordinate when the host comes back up, as we'll have to re-bootstrap all 3 instances.

Mentioned in SAL (#wikimedia-operations) [2019-01-08T22:12:54Z] <urandom> forcing removal of restbase1016-b (host down way too long to salvage) -- T212418

Mentioned in SAL (#wikimedia-operations) [2019-01-09T13:32:04Z] <urandom> forcing removal of restbase1016-c (host down way too long to salvage) -- T212418

Record: 4
Date/Time: 11/17/2017 19:18:35
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_A1.

Record: 5
Date/Time: 11/17/2017 19:22:08
Source: system
Severity: Critical

Description: Correctable memory error rate exceeded for DIMM_A1.

Record: 6
Date/Time: 02/13/2018 22:08:17
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_A2.

Record: 7
Date/Time: 02/14/2018 12:26:34
Source: system
Severity: Critical

Description: Correctable memory error rate exceeded for DIMM_A2.

Record: 8
Date/Time: 12/20/2018 12:12:05
Source: system
Severity: Ok

Description: A problem was detected in Memory Reference Code (MRC).

Record: 9
Date/Time: 12/20/2018 12:12:05
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A2.
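
For reference, the SEL records above can be grouped by DIMM to see which slot the errors follow. The snippet below is only a rough sketch: it assumes the log is available as plain text in the same "Key: value" layout pasted here, and the names `SEL_TEXT`, `parse_sel` and `events_by_dimm` are made up for illustration, not part of any Dell or iDRAC tooling.

```python
import re
from collections import defaultdict

# Sample input: the six SEL records quoted in this task, copied as-is.
SEL_TEXT = """\
Record: 4
Date/Time: 11/17/2017 19:18:35
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_A1.

Record: 5
Date/Time: 11/17/2017 19:22:08
Source: system
Severity: Critical

Description: Correctable memory error rate exceeded for DIMM_A1.

Record: 6
Date/Time: 02/13/2018 22:08:17
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_A2.

Record: 7
Date/Time: 02/14/2018 12:26:34
Source: system
Severity: Critical

Description: Correctable memory error rate exceeded for DIMM_A2.

Record: 8
Date/Time: 12/20/2018 12:12:05
Source: system
Severity: Ok

Description: A problem was detected in Memory Reference Code (MRC).

Record: 9
Date/Time: 12/20/2018 12:12:05
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A2.
"""


def parse_sel(text):
    """Turn 'Key: value' lines into one dict per record (hypothetical helper)."""
    records = []
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if not sep:
            continue  # blank line or a line without 'Key: value' shape
        if key == "Record":
            records.append({})  # a 'Record:' line starts a new entry
        if records:
            records[-1][key] = value.strip()
    return records


def events_by_dimm(records):
    """Group (date, severity) pairs by the DIMM named in each description."""
    by_dimm = defaultdict(list)
    for rec in records:
        for dimm in re.findall(r"DIMM_[A-Z]\d+", rec.get("Description", "")):
            by_dimm[dimm].append((rec.get("Date/Time"), rec.get("Severity")))
    return dict(by_dimm)


if __name__ == "__main__":
    for dimm, events in sorted(events_by_dimm(parse_sel(SEL_TEXT)).items()):
        print(dimm, len(events), "event(s), latest:", events[-1])
```

Grouped this way, the 2017 entries point at DIMM_A1, while everything from February 2018 onward that names a slot, including the multi-bit error at boot, points at DIMM_A2. That matches the slot A2 named in the UEFI0107 message.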

I need to move DIMM around and do standard troubleshooting. Is this server able to be powered off and down in icinga?

I need to move DIMM around and do standard troubleshooting. Is this server able to be powered off and down in icinga?

It's completely unusable on our end; we had no choice but to brute-force remove the Cassandra nodes over the last few days. You can take it down to do whatever you need. We should consider re-imaging it before trying to bring it back online anyway.

@Eevans I am going to have to power it back on and let it go for a few days to see if the error returns, will that present an issue for you?

While the server was offline, I took the opportunity to update the firmware on the BIOS and iDRAC.

Eevans added a comment. Edited Thu, Jan 10, 9:51 PM

@Eevans I am going to have to power it back on and let it go for a few days to see if the error returns, will that present an issue for you?

It should not, but if you could give me a heads up when you power it back on, that would be appreciated.

I ended up leaving the production cables disconnected.

The log remains clear and no errors have returned. I will give it another 24 hours, and if nothing changes it can go back into service.

@Eevans The error has not returned. I cannot say with 100% certainty that it will not return, but for now please take the server back and do what you need. All the cables are plugged back in and the server is off. I will leave this open for a few days; let me know if the error returns.

Volans added a subscriber: Volans. Tue, Jan 15, 7:32 PM

FYI the host is currently down due to a partial power issue in that rack.

Mentioned in SAL (#wikimedia-operations) [2019-01-15T23:54:08Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@a04ebdd]: Restart RESTBase to pick up the fact that restbase1016 is not there - T212418

Mentioned in SAL (#wikimedia-operations) [2019-01-16T00:15:42Z] <mobrovac@deploy1001> Finished deploy [restbase/deploy@a04ebdd]: Restart RESTBase to pick up the fact that restbase1016 is not there - T212418 (duration: 21m 34s)