
aqs1001 getting multiple and repeated heat MCEs
Closed, Resolved · Public

Description

The logs contain repeated entries like these:

Oct 26 13:52:52 aqs1001 mcelog: Processor 8 heated above trip temperature. Throttling enabled.
Oct 26 13:52:52 aqs1001 mcelog: Please check your system cooling. Performance will be impacted
Oct 26 13:52:52 aqs1001 mcelog: Processor 8 below trip temperature. Throttling disabled
Oct 26 13:52:52 aqs1001 mcelog: Processor 20 heated above trip temperature. Throttling enabled.
Oct 26 13:52:52 aqs1001 mcelog: Please check your system cooling. Performance will be impacted

They recur over a long period. It appears that the box is overheating for some reason.
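To get a rough idea of which cores are tripping and how often, the mcelog lines can be tallied per processor. This is a minimal sketch, assuming the messages keep the exact wording quoted above; the function name `count_trip_events` and the log path in the usage note are illustrative, not taken from this task.

```shell
# Count "heated above trip temperature" events per processor.
# Reads syslog-style lines from the files given as arguments, or from stdin.
# Output: one "<processor> <count>" pair per line, sorted by processor number.
count_trip_events() {
    awk '/mcelog: Processor [0-9]+ heated above trip temperature/ {
             # The processor number is the field right after "Processor".
             for (i = 1; i < NF; i++)
                 if ($i == "Processor") { count[$(i + 1)]++; break }
         }
         END { for (p in count) print p, count[p] }' "$@" | sort -n
}
```

It could be run as `count_trip_events /var/log/syslog`, assuming that is where mcelog messages end up on the host.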

Event Timeline

coren created this task. Oct 26 2015, 1:57 PM
coren assigned this task to Cmjohnson.
coren raised the priority of this task to High.
coren updated the task description.
coren added a project: ops-eqiad.
coren added a subscriber: coren.
Restricted Application added a project: Operations. Oct 26 2015, 1:57 PM
Restricted Application added a subscriber: Aklapper.
BBlack added a subscriber: BBlack. Oct 26 2015, 3:13 PM

Note that the software read of the temperature sensors is also showing ~90-91C:

root@aqs1001:~# cat /sys/class/thermal/thermal_zone*/temp
91000
90000
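The sysfs `temp` files report millidegrees Celsius, which is why 91000 reads as about 91C. A small sketch of the conversion follows; the helper names `millideg_to_c` and `show_zone_temps` are made up for illustration.

```shell
# Convert a sysfs thermal reading (millidegrees Celsius) to degrees Celsius.
millideg_to_c() {
    # $1: integer reading such as 91000
    awk -v m="$1" 'BEGIN { printf "%.1f\n", m / 1000 }'
}

# Print each readable thermal zone with its converted temperature.
show_zone_temps() {
    for zone in /sys/class/thermal/thermal_zone*; do
        [ -r "$zone/temp" ] || continue   # skip if no zones are exposed
        printf '%s: %s C\n' "${zone##*/}" "$(millideg_to_c "$(cat "$zone/temp")")"
    done
}
```

Calling `show_zone_temps` prints one line per zone; how many zones exist depends on the hardware.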

This looks similar to the symptoms we saw on the cache hosts in T103226, which were eventually addressed with thermal paste.

I have thermal paste on-site. Let me know when you would like to schedule downtime to try the fix.

Chris

@Cmjohnson: anytime is ok, aqs1002 and aqs1003 should be ok by themselves if aqs1001 isn't down for too long.

The thermal paste was very crusty and caked on. I cleaned it off and reapplied, and the temps are much better now. Leaving this open; I will check back in 24-48 hours.

cmjohnson@aqs1001:~$ sudo cat /sys/class/thermal/thermal_zone*/temp
60000
57000

Hm, that sucks, now how are we going to cook our eggs?

Thx Chris :)

Cmjohnson closed this task as Resolved. Nov 2 2015, 2:16 PM

aqs1001 has been up for several days and temps are holding. Resolving the ticket.

cmjohnson@aqs1001:~$ sudo cat /sys/class/thermal/thermal_zone*/temp
71000
71000
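For follow-up checks like this one, the manual `cat` could be wrapped in a small threshold test. The sketch below is only an illustration: the 85C default limit and the names `temp_over_limit` and `check_zones` are assumptions, not values or tools from this task.

```shell
# Succeed (exit 0) when a millidegree reading exceeds a limit given in degrees C.
# The 85 C default is an illustrative assumption, not a vendor trip point.
temp_over_limit() {
    reading_mc=$1
    limit_c=${2:-85}
    [ "$reading_mc" -gt $((limit_c * 1000)) ]
}

# Warn about every thermal zone above the limit; return 1 if any zone is over.
check_zones() {
    status=0
    for zone in /sys/class/thermal/thermal_zone*; do
        [ -r "$zone/temp" ] || continue   # skip if no zones are exposed
        if temp_over_limit "$(cat "$zone/temp")" "$1"; then
            echo "WARNING: ${zone##*/} is above the limit"
            status=1
        fi
    done
    return $status
}
```

With the assumed default, the earlier 91000 reading would be flagged while 71000 would pass.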