There is a potentially bad memory module on cp2038. I would like the system to be depooled, if possible, so that I can swap DIMM A3 with DIMM B3.
Thanks.
Correctable memory error logging disabled for a memory device at location DIMM_A3.
Mentioned in SAL (#wikimedia-operations) [2022-05-20T13:24:53Z] <sukhe@cumin2002> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cp2038.codfw.wmnet with reason: downtimed because of DIMM replacement: T308459
Mentioned in SAL (#wikimedia-operations) [2022-05-20T13:24:58Z] <sukhe@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cp2038.codfw.wmnet with reason: downtimed because of DIMM replacement: T308459
Hi @Papaul: Thanks for letting us know! The host is depooled and downtimed, so please proceed whenever you want. Thanks!
I swapped DIMM A3 with DIMM B3. No error is showing on DIMM B3 for now. I also upgraded iDRAC from version 4.10 to 5.00. Resolving this task for now.
@Vgutierrez you can put the server back in service for now. Thanks.
Mentioned in SAL (#wikimedia-operations) [2022-05-23T15:39:13Z] <vgutierrez> pool cp2038 - T308459
Reopening this task since we are now seeing the error on DIMM B3. @Jhancock.wm since this server is out of warranty, can you please check whether there is a 32 GB DDR4-2400 DIMM on site that we can use to replace the bad DIMM? Please coordinate with the traffic team to see when it is best to swap the DIMM.
Thanks.
We have that on hand. @Vgutierrez (or anyone else in traffic) when is a good time to do this swap?
Mentioned in SAL (#wikimedia-operations) [2024-11-21T17:58:22Z] <sukhe@puppetserver1001> conftool action : set/pooled=no; selector: name=cp2038.codfw.wmnet [reason: DIMM failure T308459]
Mentioned in SAL (#wikimedia-operations) [2024-11-21T20:24:08Z] <sukhe@puppetserver1001> conftool action : set/pooled=yes; selector: name=cp2038.codfw.wmnet [reason: DIMM replaced, T308459]