Page MenuHomePhabricator

db2049 overheated and restarted
Closed, ResolvedPublic

Description

severity=Critical
date=11/16/2016
time=17:01
description=Automatic Operating System Shutdown Initiated Due to Overheat Condition

Critical Temperature Threshold Exceeded (Temperature Sensor 18, Location System, Temperature 127C)

Power on request received by: Automatic Power Recovery

Related Objects

Event Timeline

Restricted Application added subscribers: Southparkfan, Aklapper. · View Herald Transcript

I am checking the fans logs and they look fine.

This is what I have seen

  • There is a big spike on disk writes just before the server died
  • The ILO logs after the reset show:
  description=POST Error: 1792-Slot X Drive Array - Valid Data Found in Cache Module. Data will automatically be written to drive array.
Verbs

The controller is now showing:

Cache Status Details: The current array controller had valid data stored in its battery/capacitor backed write cache the last time it was reset or was powered up.  This indicates that the system may not have been shut down gracefully.  The array controller has automatically written, or has attempted to write, this data to the drives.  This message will continue to be displayed until the next reset or power-cycle of the array controller.

The overheat message specifically mentions sensor 18
Sensor18's logs look fine

/system1/sensor18
  Targets
  Properties
    DeviceID=18-VR P2
    ElementName=System
    OperationalStatus=Ok
    RateUnits=Celsius
    CurrentReading=52
    SensorType=Temperature
    HealthState=Ok
    oemhp_CautionValue=115
    oemhp_CriticalValue=120

db2050 is located right on top of db2049 and its logs do not reveal any warning or any trace of overheat

After the reboot the Cache message is gone.

I am running a burn test - I have started burning 3 CPUs and will leave it for a little while before starting with 3 more.

Assigning it to you to credit you are working more on this.

jcrespo triaged this task as Medium priority.Nov 17 2016, 3:15 PM
jcrespo moved this task from Triage to In progress on the DBA board.

I am burning 12 CPUs now. For the night I am planning to leave 24 of them and see what happens tomorrow morning.

I am burning 32 cores until tomorrow morning.

Change 322238 had a related patch set uploaded (by Marostegui):
db-codfw.php: Depool db2049 for the weekend

https://gerrit.wikimedia.org/r/322238

Change 322238 merged by jenkins-bot:
db-codfw.php: Depool db2049 for the weekend

https://gerrit.wikimedia.org/r/322238

Mentioned in SAL (#wikimedia-operations) [2016-11-18T08:21:20Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Depool db2049 - T150876 (duration: 00m 49s)

I have started MySQL and let it recover, as there was no errors. I have started replication.
Even though the burning tests were fine, I have depooled the server and silenced it for the weekend, just in case.

This has been running fine for 6 days already. Once the deploys are un blocked, I will pool it back.

Change 323823 had a related patch set uploaded (by Marostegui):
db-codfw.php: Repool db2049

https://gerrit.wikimedia.org/r/323823

Change 323823 merged by jenkins-bot:
db-codfw.php: Repool db2049

https://gerrit.wikimedia.org/r/323823

Mentioned in SAL (#wikimedia-operations) [2016-11-28T13:01:48Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Repool db2049 - T150876 (duration: 00m 45s)

This server didn't have any other issue and not even when it gots its cpu burned for some hours. So I have pooled it back