severity=Critical date=11/16/2016 time=17:01 description=Automatic Operating System Shutdown Initiated Due to Overheat Condition Critical Temperature Threshold Exceeded (Temperature Sensor 18, Location System, Temperature 127C) Power on request received by: Automatic Power Recovery
Description
Details
Related Objects
- Mentioned In
- T166853: Degraded RAID on db2049
Event Timeline
This is what I have seen
- There is a big spike on disk writes just before the server died
- The ILO logs after the reset show:
description=POST Error: 1792-Slot X Drive Array - Valid Data Found in Cache Module. Data will automatically be written to drive array. Verbs
The controller is now showing:
Cache Status Details: The current array controller had valid data stored in its battery/capacitor backed write cache the last time it was reset or was powered up. This indicates that the system may not have been shut down gracefully. The array controller has automatically written, or has attempted to write, this data to the drives. This message will continue to be displayed until the next reset or power-cycle of the array controller.
The overheat message specifically mentions sensor 18
Sensor18's logs look fine
/system1/sensor18 Targets Properties DeviceID=18-VR P2 ElementName=System OperationalStatus=Ok RateUnits=Celsius CurrentReading=52 SensorType=Temperature HealthState=Ok oemhp_CautionValue=115 oemhp_CriticalValue=120
db2050 is located right on top of db2049 and its logs do not reveal any warning or any trace of overheat
Mentioned in SAL (#wikimedia-operations) [2016-11-17T09:30:04Z] <marostegui> Reboot db2049 for maintenance - https://phabricator.wikimedia.org/T150876
I am running a burn test - I have started burning 3 CPUs and will leave it for a little while before starting with 3 more.
I am burning 12 CPUs now. For the night I am planning to leave 24 of them and see what happens tomorrow morning.
Change 322238 had a related patch set uploaded (by Marostegui):
db-codfw.php: Depool db2049 for the weekend
Mentioned in SAL (#wikimedia-operations) [2016-11-18T08:21:20Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Depool db2049 - T150876 (duration: 00m 49s)
I have started MySQL and let it recover, as there was no errors. I have started replication.
Even though the burning tests were fine, I have depooled the server and silenced it for the weekend, just in case.
This has been running fine for 6 days already. Once the deploys are un blocked, I will pool it back.
Change 323823 had a related patch set uploaded (by Marostegui):
db-codfw.php: Repool db2049
Mentioned in SAL (#wikimedia-operations) [2016-11-28T13:01:48Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Repool db2049 - T150876 (duration: 00m 45s)
This server didn't have any other issue and not even when it gots its cpu burned for some hours. So I have pooled it back