
codfw: cp2038 Correctable memory error on DIMM A3
Closed, Resolved, Public

Description

There is potentially bad memory on cp2038. If possible, I would like the system to be depooled so that I can swap DIMM A3 with DIMM B3.
Thanks.

Correctable memory error logging disabled for a memory device at location DIMM_A3.

Details

Other Assignee
Vgutierrez

Event Timeline

Papaul triaged this task as Medium priority. May 16 2022, 5:10 PM
Papaul moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board.

Mentioned in SAL (#wikimedia-operations) [2022-05-20T13:24:53Z] <sukhe@cumin2002> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cp2038.codfw.wmnet with reason: downtimed because of DIMM replacement: T308459

Mentioned in SAL (#wikimedia-operations) [2022-05-20T13:24:58Z] <sukhe@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cp2038.codfw.wmnet with reason: downtimed because of DIMM replacement: T308459

Hi @Papaul: Thanks for letting us know! The host is depooled and downtimed, so please proceed whenever you want. Thanks!

@ssingh Thanks, I will work on it when I am back on site next week.

Vgutierrez subscribed.

I swapped DIMM A3 with DIMM B3. No error is showing on DIMM B3 for now. I also upgraded iDRAC from version 4.10 to 5.00. Resolving this task for now.
@Vgutierrez you can put the server back in service for now. Thanks.

@Papaul thanks, cp2038 is happily serving traffic :)