severity=Critical date=11/16/2016 time=17:01 description=Automatic Operating System Shutdown Initiated Due to Overheat Condition Critical Temperature Threshold Exceeded (Temperature Sensor 18, Location System, Temperature 127C) Power on request received by: Automatic Power Recovery
This is what I have seen
- There is a big spike on disk writes just before the server died
- The ILO logs after the reset show:
description=POST Error: 1792-Slot X Drive Array - Valid Data Found in Cache Module. Data will automatically be written to drive array. Verbs
The controller is now showing:
Cache Status Details: The current array controller had valid data stored in its battery/capacitor backed write cache the last time it was reset or was powered up. This indicates that the system may not have been shut down gracefully. The array controller has automatically written, or has attempted to write, this data to the drives. This message will continue to be displayed until the next reset or power-cycle of the array controller.
The overheat message specifically mentions sensor 18
Sensor18's logs look fine
/system1/sensor18 Targets Properties DeviceID=18-VR P2 ElementName=System OperationalStatus=Ok RateUnits=Celsius CurrentReading=52 SensorType=Temperature HealthState=Ok oemhp_CautionValue=115 oemhp_CriticalValue=120
I have started MySQL and let it recover, as there was no errors. I have started replication.
Even though the burning tests were fine, I have depooled the server and silenced it for the weekend, just in case.