cp5001 memory errors on DIMM A2
Open, Medium, Public

Description

cp5001.eqsin.wmnet reports the following error from racadm getsel:

-------------------------------------------------------------------------------
Record:      42
Date/Time:   07/31/2022 01:02:16
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      43
Date/Time:   07/31/2022 01:02:16
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A2.
-------------------------------------------------------------------------------
Record:      44
Date/Time:   07/31/2022 01:02:16
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A2.
-------------------------------------------------------------------------------
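
For triage, the SEL dump above can be reduced to the affected DIMM slots with a short script. This is a minimal sketch, assuming the text layout shown above (dash-separated records with `Key: value` lines). The helper names `parse_sel` and `critical_dimms` are hypothetical and not part of racadm or any Wikimedia tooling.

```python
import re

# Records in the dump are separated by lines made of dashes.
SEPARATOR = re.compile(r"^-{5,}\s*$", re.MULTILINE)

# Two records copied from the output above, used as sample input.
SAMPLE = """\
-------------------------------------------------------------------------------
Record:      42
Date/Time:   07/31/2022 01:02:16
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      43
Date/Time:   07/31/2022 01:02:16
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A2.
-------------------------------------------------------------------------------
"""


def parse_sel(text):
    """Split the dump into one dict per record (hypothetical helper)."""
    records = []
    for chunk in SEPARATOR.split(text):
        fields = {}
        for line in chunk.splitlines():
            # Split on the first colon only, so time values stay intact.
            key, sep, value = line.partition(":")
            if sep and key.strip():
                fields[key.strip()] = value.strip()
        if fields:
            records.append(fields)
    return records


def critical_dimms(records):
    """Return the sorted DIMM slots named in Critical records."""
    slots = set()
    for rec in records:
        if rec.get("Severity") == "Critical":
            slots.update(re.findall(r"DIMM_[A-Z]\d+", rec.get("Description", "")))
    return sorted(slots)


if __name__ == "__main__":
    print(critical_dimms(parse_sel(SAMPLE)))
```

Record 42 has severity Ok and is skipped; only the Critical records contribute slot names.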

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2022-07-31T18:13:48Z] <sukhe@cumin2002> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cp5001.eqsin.wmnet with reason: depooled: faulty DIMM: T314256

Mentioned in SAL (#wikimedia-operations) [2022-07-31T18:14:04Z] <sukhe@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cp5001.eqsin.wmnet with reason: depooled: faulty DIMM: T314256

Vgutierrez triaged this task as Medium priority. Jul 31 2022, 6:21 PM
Vgutierrez moved this task from Triage to Active Issues on the Traffic board.
Vgutierrez added a subscriber: Vgutierrez.

I've set it as inactive rather than just depooling it, so that pybal ignores it when applying the depool threshold.
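
The difference between the two states can be illustrated with a toy model. This is only a sketch, not pybal's actual code: it assumes the depool threshold is the minimum fraction of configured hosts that must stay pooled, and that an inactive host is dropped from the configured set while a depooled host still counts toward it. The function names, the cluster size of 8, and the threshold of 0.7 are all hypothetical.

```python
def can_depool_one_more(pooled, configured, threshold):
    """Toy check: would the pool stay at or above the threshold after one more depool?"""
    return pooled - 1 >= configured * threshold


def extra_depools_allowed(pooled, configured, threshold):
    """Count how many further hosts could be depooled before the threshold blocks it."""
    allowed = 0
    while pooled > 0 and can_depool_one_more(pooled, configured, threshold):
        pooled -= 1
        allowed += 1
    return allowed


if __name__ == "__main__":
    threshold = 0.7  # hypothetical value

    # Faulty host merely depooled: it still counts among the 8 configured hosts.
    print(extra_depools_allowed(pooled=7, configured=8, threshold=threshold))

    # Faulty host set inactive: only 7 hosts are configured at all.
    print(extra_depools_allowed(pooled=7, configured=7, threshold=threshold))
```

Under these assumptions, marking the host inactive leaves room to depool two more hosts, while a plain depool leaves room for only one.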

Mentioned in SAL (#wikimedia-operations) [2022-08-08T14:46:44Z] <sukhe@cumin2002> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cp5001.eqsin.wmnet with reason: depooled: faulty DIMM: T314256

Mentioned in SAL (#wikimedia-operations) [2022-08-08T14:47:00Z] <sukhe@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cp5001.eqsin.wmnet with reason: depooled: faulty DIMM: T314256

wiki_willy added a subscriber: wiki_willy.

Assigning over to Rob, who's currently working on getting the eqsin hardware refresh ordered.

@wiki_willy @RobH I'm assuming this host will be decommissioned rather than fixed, considering that we are already working on refreshing eqsin?

Meanwhile, I'll remove it from puppet, because it's been a month since the host crashed and it has already been pruned from puppetdb.

Change 829118 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] cache: Remove cp5001

https://gerrit.wikimedia.org/r/829118

Change 829118 merged by Vgutierrez:

[operations/puppet@production] cache: Remove cp5001

https://gerrit.wikimedia.org/r/829118

Hi @Vgutierrez - yeah, it probably makes more sense to replace the host than to purchase a replacement part, since the new servers have already been ordered and are expected to arrive in October. Thanks, Willy

@wiki_willy @RobH I'm assuming this host will be decommissioned rather than fixed, considering that we are already working on refreshing eqsin?

I assumed exactly the same thing =]