hw troubleshooting: memory DIMM_B3 multi-bit memory errors for prometheus1005
Closed, Resolved · Public · Request

Description

  • Provide FQDN of system: prometheus1005.eqiad.wmnet
  • If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide a time frame for us to take the machine down.
  • Put system into a failed state in Netbox.
  • Provide urgency of request, along with justification (redundancy, dependencies, etc.): without prometheus1005, we have only one other host.
  • Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help.)
  • Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   04/19/2024 08:01:45
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B3.
-------------------------------------------------------------------------------

... lots of self heal operations ...

-------------------------------------------------------------------------------
Record:      492
Date/Time:   04/19/2024 08:08:20
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
-------------------------------------------------------------------------------

Event Timeline

colewhite created this task.
colewhite updated the task description.

Change #1023423 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] trafficserver: move prometheus-eqiad to prometheus1006

https://gerrit.wikimedia.org/r/1023423

Change #1023423 merged by Filippo Giunchedi:

[operations/puppet@production] trafficserver: move prometheus-eqiad to prometheus1006

https://gerrit.wikimedia.org/r/1023423

Jclark-ctr subscribed.

Opened request with Dell
You have successfully submitted request SR189381173.

This was a duplicate ticket that was opened for https://phabricator.wikimedia.org/T360687

Closing ticket as this is resolved.