
hw troubleshooting: cp6006 b2 dimm issue
Closed, Resolved · Public · Request

Description

  • Provide FQDN of system: cp6006.drmrs.wmnet
  • If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
  • Put system into a failed state in Netbox.
  • Provide urgency of request, along with justification (redundancy, dependencies, etc.)
  • Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
  • Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

Rob noticed the host has a memory warning (not an error) in the iDRAC while logged in to correct the power redundancy settings (setting the hot spare PSU to off so the power draw is split evenly).

cp6006

The memory health monitor feature has detected a degradation in the DIMM installed in DIMM_B2. Reboot system to initiate self-heal process.

Event Timeline

Mentioned in SAL (#wikimedia-sre) [2022-05-24T18:03:50Z] <robh> cp6006 in maint mode and depooled for memory troubleshooting via T309123

It fixed itself with a reboot:

Normal,Tue 24 May 2022 18:06:22,The self-heal operation successfully completed at DIMM DIMM_B2.,
Normal,Tue 24 May 2022 18:06:22,The self-heal operation successfully completed at DIMM DIMM_B2.,
Normal,Tue 24 May 2022 18:06:22,A problem was detected during Power-On Self-Test (POST).,
Warning,Fri 13 May 2022 12:41:31,The memory health monitor feature has detected a degradation in the DIMM installed in DIMM_B2. Reboot system to initiate self-heal process.,
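
For anyone wanting to check how long the warning sat before the reboot cleared it, here is a minimal sketch that parses the log lines above. It assumes the export format is exactly as pasted (severity, timestamp, message, trailing empty field) and an English locale for day and month names; `parse_log` and the variable names are hypothetical, not part of any iDRAC tooling.

```python
import csv
from datetime import datetime

# Log lines copied verbatim from the comment above.
LOG = """\
Normal,Tue 24 May 2022 18:06:22,The self-heal operation successfully completed at DIMM DIMM_B2.,
Normal,Tue 24 May 2022 18:06:22,The self-heal operation successfully completed at DIMM DIMM_B2.,
Normal,Tue 24 May 2022 18:06:22,A problem was detected during Power-On Self-Test (POST).,
Warning,Fri 13 May 2022 12:41:31,The memory health monitor feature has detected a degradation in the DIMM installed in DIMM_B2. Reboot system to initiate self-heal process.,
"""


def parse_log(text):
    """Return a list of (severity, timestamp, message) tuples (hypothetical helper)."""
    entries = []
    for row in csv.reader(text.splitlines()):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        severity, stamp, message = row[0], row[1], row[2]
        # Assumed timestamp layout, e.g. "Tue 24 May 2022 18:06:22"
        when = datetime.strptime(stamp, "%a %d %b %Y %H:%M:%S")
        entries.append((severity, when, message))
    return entries


entries = parse_log(LOG)
warnings = [e for e in entries if e[0] == "Warning"]
healed = [e for e in entries if "self-heal operation successfully completed" in e[2]]

# Time between the degradation warning and the first self-heal completion entry.
gap = healed[0][1] - warnings[0][1]
print(f"DIMM_B2 warning was open for {gap}")
```

The gap works out to roughly eleven days, which matches the warning being logged on 13 May and only noticed during the 24 May login.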

Mentioned in SAL (#wikimedia-sre) [2022-05-24T18:11:13Z] <robh> cp6006 memory issue resolved, returned system to service and ended maint window via T309123