
Memory test failure on elastic1021
Closed, ResolvedPublic

Description

While rebooting the elasticsearch eqiad cluster, elastic1021 failed to come back up with the following message:

Error: Memory initialization warning detected.
MEMBIST Memory Test failure DIMM A3

@Cmjohnson: could you have a look (and probably change that RAM)?

Event Timeline

Gehel created this task. Mar 1 2018, 8:04 AM
Restricted Application added a project: Operations. Mar 1 2018, 8:04 AM
Restricted Application added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-operations) [2018-03-01T08:20:48Z] <gehel> banning elastic1021 from cluster (failed memory) - T188595
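
The exact mechanism behind the ban above is not recorded in this task. As a minimal sketch, assuming it was done with Elasticsearch's shard allocation filtering (the `cluster.routing.allocation.exclude._name` cluster setting), the request body could be built like this. The helper name `build_ban_settings` and the `elastic1021*` node-name pattern are illustrative assumptions, not taken from the SAL entry.

```python
import json


def build_ban_settings(node_pattern: str) -> dict:
    """Build a cluster-settings body that excludes matching nodes from shard allocation.

    `node_pattern` is matched against node names; the pattern used below is
    an assumption about how the node is named in the cluster.
    """
    return {
        "transient": {
            # Shards are moved off (and kept off) nodes whose name matches.
            "cluster.routing.allocation.exclude._name": node_pattern,
        }
    }


if __name__ == "__main__":
    body = build_ban_settings("elastic1021*")
    # This body would be sent with: PUT /_cluster/settings
    print(json.dumps(body, indent=2))
```

A transient setting is used in the sketch because a ban for failed hardware is normally temporary; it is cleared on a full cluster restart or by setting the value back to null.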

MoritzMuehlenhoff triaged this task as Normal priority. Mar 1 2018, 9:32 AM

Swapped the DIMM from slot A3 to B2 to see if the error follows the DIMM. Powered on.

The error did follow the DIMM. The server is out of warranty, but I will see if I can snag a similar DIMM from a decommissioned server.

I do not have anything that size as a spare. @RobH and @faidon will need to comment on buying a new DIMM.

Current DIMM is a Samsung 16GB 2Rx4 PC3L-12800R-11-12-E2-D4.

RobH mentioned this in Unknown Object (Task). Mar 8 2018, 4:59 PM

The decision is to not replace this out-of-warranty RAM. We'll run with 3% less capacity until this batch of servers is renewed (in ~1 year).

@Gehel do you want to decommission this server then?

RobH closed subtask Unknown Object (Task) as Declined. Mar 14 2018, 7:11 PM
RobH closed this task as Resolved. Mar 14 2018, 7:17 PM
RobH claimed this task.

Discussion on T189223 resulted in the decision to decommission elastic1021. I've created T189727 listing the decom process/steps. I'm going to close this task as resolved, since we're decommissioning the host.