- - Provide FQDN of system.
- - If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
- - Put system into a failed state in Netbox.
- - Provide urgency of request, along with justification (redundancy, dependencies, etc.)
The system is fully down at the moment. Its redundant partner server is now working. I will try to bring the server up via iDRAC. This is a highly active wikireplicas database server, and the service is in a degraded state. Urgency: medium.
- - Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
The only usable logs are those on the web console.
The LC logs from the web console show:
2021-09-28 16:35:13 CPU0001 CPU 1 has a thermal trip (over-temperature) event.
Log Sequence Number:
The processor temperature increased beyond the operational range. Thermal protection shut down the processor. Factors external to the processor may have induced this exception.
Review logs for fan failures, replace failed fans. If no fan failures are detected, check inlet temperature (if available) and reinstall processor heatsink.
- - Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.