While looking into an HHVM crash I noticed that about a third of our mw* servers log temperature alerts like the ones below (I also checked the mw servers in codfw, but none of them show this kind of error):
Oct 27 06:31:11 mw1233 mcelog: Processor 28 heated above trip temperature. Throttling enabled.
Oct 27 06:31:11 mw1233 mcelog: Processor 12 heated above trip temperature. Throttling enabled.
Oct 27 06:31:46 mw1233 mcelog: Processor 20 heated above trip temperature. Throttling enabled.
Oct 27 06:31:46 mw1233 mcelog: Processor 4 heated above trip temperature. Throttling enabled.
Old Phab tasks show that this was fixed in the past by reapplying thermal paste. Here's the full list of affected systems:
mw1161.eqiad.wmnet
mw1162.eqiad.wmnet
mw1163.eqiad.wmnet
mw1164.eqiad.wmnet
mw1165.eqiad.wmnet
mw1166.eqiad.wmnet
mw1167.eqiad.wmnet
mw1168.eqiad.wmnet
mw1169.eqiad.wmnet
mw1174.eqiad.wmnet
mw1179.eqiad.wmnet
mw1180.eqiad.wmnet
mw1181.eqiad.wmnet
mw1182.eqiad.wmnet
mw1184.eqiad.wmnet
mw1187.eqiad.wmnet
mw1189.eqiad.wmnet
mw1190.eqiad.wmnet
mw1191.eqiad.wmnet
mw1193.eqiad.wmnet
mw1194.eqiad.wmnet
mw1195.eqiad.wmnet
mw1197.eqiad.wmnet
mw1198.eqiad.wmnet
mw1199.eqiad.wmnet
mw1200.eqiad.wmnet
mw1201.eqiad.wmnet
mw1202.eqiad.wmnet
mw1203.eqiad.wmnet
mw1204.eqiad.wmnet
mw1205.eqiad.wmnet
mw1206.eqiad.wmnet
mw1207.eqiad.wmnet
mw1208.eqiad.wmnet
mw1209.eqiad.wmnet
mw1221.eqiad.wmnet
mw1222.eqiad.wmnet
mw1225.eqiad.wmnet
mw1226.eqiad.wmnet
mw1227.eqiad.wmnet
mw1229.eqiad.wmnet
mw1230.eqiad.wmnet
mw1231.eqiad.wmnet
mw1232.eqiad.wmnet
mw1233.eqiad.wmnet
mw1234.eqiad.wmnet
mw1236.eqiad.wmnet
mw1237.eqiad.wmnet
mw1238.eqiad.wmnet
mw1240.eqiad.wmnet
mw1241.eqiad.wmnet
mw1242.eqiad.wmnet
mw1244.eqiad.wmnet
mw1246.eqiad.wmnet
mw1255.eqiad.wmnet
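
To re-check this list after the thermal paste has been reapplied, a small script along these lines could count the mcelog trip events per host. This is only a sketch: the regex is based on the four log lines quoted above, the function name count_trip_events is made up for illustration, and it assumes the syslog lines of the hosts are available as plain text (for example, collected from each server).

```python
import re
from collections import Counter

# Matches syslog lines shaped like the excerpt in this task, e.g.:
#   Oct 27 06:31:11 mw1233 mcelog: Processor 28 heated above trip temperature. Throttling enabled.
# Assumption: the default syslog timestamp format (month, day, HH:MM:SS).
TRIP_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+(?P<host>\S+)\s+mcelog:\s+"
    r"Processor\s+(?P<cpu>\d+)\s+heated above trip temperature"
)


def count_trip_events(lines):
    """Return a Counter mapping hostname -> number of thermal trip events."""
    counts = Counter()
    for line in lines:
        match = TRIP_RE.match(line)
        if match:
            counts[match.group("host")] += 1
    return counts


# The four lines quoted in this task, used as sample input.
sample = [
    "Oct 27 06:31:11 mw1233 mcelog: Processor 28 heated above trip temperature. Throttling enabled.",
    "Oct 27 06:31:11 mw1233 mcelog: Processor 12 heated above trip temperature. Throttling enabled.",
    "Oct 27 06:31:46 mw1233 mcelog: Processor 20 heated above trip temperature. Throttling enabled.",
    "Oct 27 06:31:46 mw1233 mcelog: Processor 4 heated above trip temperature. Throttling enabled.",
]

counts = count_trip_events(sample)
for host, events in sorted(counts.items()):
    print(host, events)
```

Hosts that no longer show up in the output after the fix can be dropped from the list.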