codfw: cp2038 Correctable memory error on DIMM A3
Closed, Resolved (Public)

Description

There is a potentially bad memory module on cp2038. I would like the system to be depooled, if possible, so that I can swap DIMM A3 with DIMM B3.
Thanks.

Correctable memory error logging disabled for a memory device at location DIMM_A3.

Details

Other Assignee
Vgutierrez

Event Timeline

Papaul triaged this task as Medium priority. May 16 2022, 5:10 PM
Papaul moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board.

Mentioned in SAL (#wikimedia-operations) [2022-05-20T13:24:53Z] <sukhe@cumin2002> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cp2038.codfw.wmnet with reason: downtimed because of DIMM replacement: T308459

Mentioned in SAL (#wikimedia-operations) [2022-05-20T13:24:58Z] <sukhe@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cp2038.codfw.wmnet with reason: downtimed because of DIMM replacement: T308459

Hi @Papaul: Thanks for letting us know! The host is depooled and downtimed, so please proceed whenever you want. Thanks!

@ssingh thanks, I will work on it when I am back on site next week.

Vgutierrez subscribed.

I swapped DIMM A3 with DIMM B3. No error is showing on DIMM B3 for now. I also upgraded iDRAC from version 4.10 to 5.00. Resolving this task for now.
@Vgutierrez you can put the server back in service for now. Thanks

@Papaul thanks, cp2038 is happily serving traffic :)

Papaul reassigned this task from Papaul to Jhancock.wm.
Papaul added a subscriber: Jhancock.wm.

Reopening this task since we are now seeing the error on DIMM B3. @Jhancock.wm, since this server is out of warranty, can you please check whether there is a 32G DDR-4 2400 DIMM on site that we can use to replace the bad DIMM? Please coordinate with the traffic team to see when it is best to swap the DIMM.

Thanks.

We have that on hand. @Vgutierrez (or anyone else in traffic) when is a good time to do this swap?

Mentioned in SAL (#wikimedia-operations) [2024-11-21T17:58:22Z] <sukhe@puppetserver1001> conftool action : set/pooled=no; selector: name=cp2038.codfw.wmnet [reason: DIMM failure T308459]

Hi Jenn. The host has been depooled so you can do it whenever you want. Thanks!

Replaced with a new DIMM. The host is coming up now.

Mentioned in SAL (#wikimedia-operations) [2024-11-21T20:24:08Z] <sukhe@puppetserver1001> conftool action : set/pooled=yes; selector: name=cp2038.codfw.wmnet [reason: DIMM replaced, T308459]