- - Provide FQDN of system.
- - If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
- - Put system into a failed state in Netbox.
- - Provide urgency of request, along with justification (redundancy, dependencies, etc): Medium to high as we already have 5 servers of this k8s cluster in this exact state
- - Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
- - Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.
I am seeing the following in logs after a failed reimage on wikikube-worker1057 (previously known as kubernetes1038):
racadm>>getsel
Record: 1
Date/Time: 08/09/2023 02:06:35
Source: system
Severity: Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record: 2
Date/Time: 08/09/2023 02:15:20
Source: system
Severity: Ok
Description: OEM software event.
-------------------------------------------------------------------------------
Record: 3
Date/Time: 08/09/2023 02:15:21
Source: system
Severity: Ok
Description: OEM software event.
-------------------------------------------------------------------------------
Record: 4
Date/Time: 08/09/2023 02:15:28
Source: system
Severity: Ok
Description: C: boot completed.
-------------------------------------------------------------------------------
Record: 5
Date/Time: 08/09/2023 02:15:28
Source: system
Severity: Ok
Description: OEM software event.
-------------------------------------------------------------------------------
Record: 6
Date/Time: 12/06/2024 12:27:25
Source: system
Severity: Critical
Description: The System Configuration Check operation resulted in the following issue: Comm Error: Backplane 0.
-------------------------------------------------------------------------------
I have power cycled the host via racadm, but the error persists.
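Since several servers of this cluster are in the same state, it may help to compare their SEL output side by side. The following is a minimal sketch, not an existing tool: it assumes the `getsel` text has already been captured as a string in the format pasted above, and the names `SelRecord`, `parse_sel`, and `non_ok` are hypothetical helpers made up for illustration.

```python
import re
from typing import List, NamedTuple


class SelRecord(NamedTuple):
    """One entry of captured `getsel` output (hypothetical helper type)."""
    number: int
    timestamp: str
    source: str
    severity: str
    description: str


# Assumes the field order shown in the log above; records are separated by
# a run of dashes. Whitespace between fields may be spaces or newlines.
_RECORD_RE = re.compile(
    r"Record:\s*(\d+)\s+"
    r"Date/Time:\s*(\S+\s+\S+)\s+"
    r"Source:\s*(\S+)\s+"
    r"Severity:\s*(\S+)\s+"
    r"Description:\s*(.*?)\s*(?:-{10,}|\Z)",
    re.DOTALL,
)


def parse_sel(text: str) -> List[SelRecord]:
    """Parse captured getsel text into records, normalising whitespace."""
    return [
        SelRecord(int(num), " ".join(ts.split()), src, sev, " ".join(desc.split()))
        for num, ts, src, sev, desc in _RECORD_RE.findall(text)
    ]


def non_ok(records: List[SelRecord]) -> List[SelRecord]:
    """Return only the records whose severity is not 'Ok'."""
    return [r for r in records if r.severity.lower() != "ok"]


SEPARATOR = "-" * 79

# Two records copied from the log in this task, in the single-line form.
SAMPLE = (
    "Record: 1 Date/Time: 08/09/2023 02:06:35 Source: system "
    f"Severity: Ok Description: Log cleared. {SEPARATOR} "
    "Record: 6 Date/Time: 12/06/2024 12:27:25 Source: system "
    "Severity: Critical Description: The System Configuration Check "
    "operation resulted in the following issue: Comm Error: Backplane 0. "
    f"{SEPARATOR}"
)

if __name__ == "__main__":
    for record in non_ok(parse_sel(SAMPLE)):
        print(record.number, record.timestamp, record.severity, record.description)
```

Run against the captured output of each affected host, this would show whether they all report the same backplane communication error or whether the failures differ.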