
hw troubleshooting: CPU 2 machine check error detected for rdb1014.eqiad.wmnet
Closed, Resolved · Public · Request

Description

  • Provide FQDN of system.
  • If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
  • Put system into a failed state in Netbox.
  • Provide urgency of request, along with justification (redundancy, dependencies, etc.)
  • Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
  • Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.
  • FQDN: rdb1014.eqiad.wmnet
  • Urgency: High, only replica for redis-misc-pair1
  • Issue: CPU 2 machine check error. The host went down over the weekend due to this error. After a power cycle it comes up for a little while, then crashes again with the same error.
-------------------------------------------------------------------------------
Record:      137
Date/Time:   07/22/2024 13:15:52
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
-------------------------------------------------------------------------------
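For anyone triaging similar reports, records in the layout above can be split into fields and filtered by severity. The following is a minimal sketch, assuming the dashed-separator, `Key: value` layout shown in this task; `parse_records` is a hypothetical helper written for illustration, not part of any Wikimedia or vendor tooling.

```python
import re

# Sample text copied from the record in this task.
SAMPLE = """\
-------------------------------------------------------------------------------
Record:      137
Date/Time:   07/22/2024 13:15:52
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
-------------------------------------------------------------------------------
"""


def parse_records(text):
    """Split text on dashed separator lines and return one dict per record.

    Assumes each field is a single 'Key: value' line (hypothetical layout
    inferred from the log excerpt above).
    """
    records = []
    for chunk in re.split(r"^-{10,}[ \t]*$", text, flags=re.MULTILINE):
        fields = {}
        for line in chunk.splitlines():
            # Split on the first colon only, so timestamps keep their colons.
            key, sep, value = line.partition(":")
            if sep and key.strip():
                fields[key.strip()] = value.strip()
        if fields:
            records.append(fields)
    return records


# Keep only the records marked Critical.
critical = [r for r in parse_records(SAMPLE) if r.get("Severity") == "Critical"]
```

Splitting on the first colon only is a deliberate choice here: the `Date/Time` value itself contains colons and would otherwise be truncated.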

Host not put into failed status in Netbox because T336275 (Upgrade Netbox to 4.x) is in progress.

Event Timeline

Clement_Goubert created this task.

Icinga downtime and Alertmanager silence (ID=9bfde8c2-0f71-4de8-908c-ff3a74fdbe71) set by cgoubert@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Hardware issue

rdb1014.eqiad.wmnet

Confirmed: Service Request 195103368 was successfully submitted.

Updated firmware per Dell's request last week; monitoring to see if any errors return.

jijiki subscribed.

@Jclark-ctr Thank you! Closing for now and will reopen if the problem persists

Mentioned in SAL (#wikimedia-operations) [2024-08-19T13:09:35Z] <cgoubert@cumin1002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "rdb1014 back to active - cgoubert@cumin1002 - T370633"

Mentioned in SAL (#wikimedia-operations) [2024-08-19T13:09:54Z] <cgoubert@cumin1002> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "rdb1014 back to active - cgoubert@cumin1002 - T370633"