
hw troubleshooting: Memory failure for cp2029.codfw.wmnet
Closed, Resolved · Public · Request

Description

cp2029 has been depooled due to potentially failing memory, discovered while running run-puppet-agent:

May 12 21:04:43 cp2029 kernel: Disabling lock debugging due to kernel taint
May 12 21:04:43 cp2029 kernel: mce: [Hardware Error]: Machine check events logged
May 12 21:04:43 cp2029 kernel: mce: Uncorrected hardware memory error in user-access at 3a7d753880
May 12 21:04:43 cp2029 kernel: Memory failure: 0x3a7d753: Sending SIGBUS to puppet:2533255 due to hardware memory corruption
May 12 21:04:43 cp2029 kernel: Memory failure: 0x3a7d753: recovery action for dirty LRU page: Recovered

The host has been depooled and can safely be serviced. Thanks!
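For cross-checking errors like the above, the kernel's EDAC drivers (when loaded) expose per-DIMM error counters under /sys/devices/system/edac. A minimal sketch that dumps them, assuming the dimm-based sysfs layout and BIOS slot labels are present on this platform (neither is guaranteed):

```
#!/usr/bin/env python3
"""Sketch: dump per-DIMM EDAC error counters from sysfs.

Assumes the EDAC drivers are loaded and expose the dimm-based layout
(/sys/devices/system/edac/mc/mc*/dimm*/dimm_{label,ce_count,ue_count});
availability and label naming vary by platform, so this is illustrative only.
"""
from pathlib import Path

EDAC_MC = Path("/sys/devices/system/edac/mc")


def read_attr(path, default=""):
    """Read a sysfs attribute, returning a default if it is missing."""
    try:
        return path.read_text().strip()
    except OSError:
        return default


def main():
    dimms = sorted(EDAC_MC.glob("mc*/dimm*"))
    if not dimms:
        print("No per-DIMM EDAC entries found (driver not loaded or older layout)")
        return
    for dimm in dimms:
        label = read_attr(dimm / "dimm_label") or dimm.name
        corrected = int(read_attr(dimm / "dimm_ce_count", "0") or 0)
        uncorrected = int(read_attr(dimm / "dimm_ue_count", "0") or 0)
        flag = "  <-- errors logged here" if uncorrected else ""
        print(f"{label:45s} ce={corrected:6d} ue={uncorrected:6d}{flag}")


if __name__ == "__main__":
    main()
```

A non-zero uncorrected count on a single slot usually narrows the reseat/replace target without waiting on the iDRAC log.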

- Provide FQDN of system.
- If other than a hard drive issue, please depool the machine (and confirm that it's been depooled) for us to work on it. If not, please provide a time frame for us to take the machine down.
- Put system into a failed state in Netbox.
- Provide urgency of request, along with justification (redundancy, dependencies, etc.)
- Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
- Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2025-05-12T21:40:50Z] <brett@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2029.codfw.wmnet with reason: Potential failed memory - T393968

Mentioned in SAL (#wikimedia-operations) [2025-05-12T21:41:07Z] <brett@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cp2029.codfw.wmnet with reason: Potential failed memory - T393968

Jhancock.wm subscribed.

iDRAC said DIMM_B3 was at fault. I swapped it with DIMM_A3.
If reseating it didn't fix the issue, we should be able to diagnose the cause by seeing where it errors next: if the errors follow the module into A3, the DIMM itself is bad; if they stay on B3, the slot is suspect. I'll check back on this tomorrow to see whether the error repeated and how it behaves. (We might need to repool the server to get it to error again.)
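A quick way to check for recurrence from the host side, assuming journalctl is available and the kernel logs the same message formats quoted in the description (the patterns below are taken from those lines):

```
#!/usr/bin/env python3
"""Sketch: look for new machine-check / memory-failure lines in the
current boot's kernel log. The patterns mirror the messages quoted in
the task description; run as root so journalctl can read the kernel log.
"""
import subprocess

PATTERNS = ("mce: [Hardware Error]", "Memory failure:")


def main():
    kernel_log = subprocess.run(
        ["journalctl", "-k", "-b", "--no-pager"],
        capture_output=True, text=True, check=True,
    ).stdout
    hits = [line for line in kernel_log.splitlines()
            if any(p in line for p in PATTERNS)]
    if hits:
        print(f"{len(hits)} machine-check/memory-failure lines this boot:")
        print("\n".join(hits))
    else:
        print("No machine-check or memory-failure lines this boot.")


if __name__ == "__main__":
    main()
```

If new lines show up after the swap, the slot they point at (via the EDAC labels above or the iDRAC SEL) tells us whether the module or the slot is the problem.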

@BCornwall The alert has cleared in the iDRAC and I don't see anything new in the history since yesterday. We might be good; you can repool the server. If it errors again, make a new ticket and we'll get the part replaced depending on what it does.