
Heating alerts for mw servers in eqiad
Open, Normal, Public

Description

While looking into an HHVM crash, I noticed that about a third of our mw* servers in eqiad have temperature alerts like the ones below. (I also checked the mw servers in codfw, but none of them show this kind of error.)

Oct 27 06:31:11 mw1233 mcelog: Processor 28 heated above trip temperature. Throttling enabled.
Oct 27 06:31:11 mw1233 mcelog: Processor 12 heated above trip temperature. Throttling enabled.
Oct 27 06:31:46 mw1233 mcelog: Processor 20 heated above trip temperature. Throttling enabled.
Oct 27 06:31:46 mw1233 mcelog: Processor 4 heated above trip temperature. Throttling enabled.
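
To tally these events per host across collected syslog lines, something like the following could be used. It is only a minimal sketch: the regular expression is inferred from the four sample lines above (other mcelog versions may word the message differently), and the name `count_throttle_events` is hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the sample lines above (assumption, not an
# official mcelog log format specification).
THROTTLE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+mcelog:\s+"
    r"Processor\s+(?P<cpu>\d+)\s+heated above trip temperature"
)


def count_throttle_events(lines):
    """Return a Counter mapping hostname -> number of throttle events."""
    counts = Counter()
    for line in lines:
        match = THROTTLE_RE.match(line)
        if match:
            counts[match.group("host")] += 1
    return counts


sample = [
    "Oct 27 06:31:11 mw1233 mcelog: Processor 28 heated above trip temperature. Throttling enabled.",
    "Oct 27 06:31:11 mw1233 mcelog: Processor 12 heated above trip temperature. Throttling enabled.",
    "Oct 27 06:31:46 mw1233 mcelog: Processor 20 heated above trip temperature. Throttling enabled.",
    "Oct 27 06:31:46 mw1233 mcelog: Processor 4 heated above trip temperature. Throttling enabled.",
]
throttle_counts = count_throttle_events(sample)
```

Lines that do not match the pattern are skipped, so the same function can be fed a full syslog file.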

Digging through old Phab tasks shows that this was fixed by reapplying thermal paste in the past. Here's the full list of affected systems:

mw1161.eqiad.wmnet
mw1162.eqiad.wmnet
mw1163.eqiad.wmnet
mw1164.eqiad.wmnet
mw1165.eqiad.wmnet
mw1166.eqiad.wmnet
mw1167.eqiad.wmnet
mw1168.eqiad.wmnet
mw1169.eqiad.wmnet
mw1174.eqiad.wmnet
mw1179.eqiad.wmnet
mw1180.eqiad.wmnet
mw1181.eqiad.wmnet
mw1182.eqiad.wmnet
mw1184.eqiad.wmnet
mw1187.eqiad.wmnet
mw1189.eqiad.wmnet
mw1190.eqiad.wmnet
mw1191.eqiad.wmnet
mw1193.eqiad.wmnet
mw1194.eqiad.wmnet
mw1195.eqiad.wmnet
mw1197.eqiad.wmnet
mw1198.eqiad.wmnet
mw1199.eqiad.wmnet
mw1200.eqiad.wmnet
mw1201.eqiad.wmnet
mw1202.eqiad.wmnet
mw1203.eqiad.wmnet
mw1204.eqiad.wmnet
mw1205.eqiad.wmnet
mw1206.eqiad.wmnet
mw1207.eqiad.wmnet
mw1208.eqiad.wmnet
mw1209.eqiad.wmnet
mw1221.eqiad.wmnet
mw1222.eqiad.wmnet
mw1225.eqiad.wmnet
mw1226.eqiad.wmnet
mw1227.eqiad.wmnet
mw1229.eqiad.wmnet
mw1230.eqiad.wmnet
mw1231.eqiad.wmnet
mw1232.eqiad.wmnet
mw1233.eqiad.wmnet
mw1234.eqiad.wmnet
mw1236.eqiad.wmnet
mw1237.eqiad.wmnet
mw1238.eqiad.wmnet
mw1240.eqiad.wmnet
mw1241.eqiad.wmnet
mw1242.eqiad.wmnet
mw1244.eqiad.wmnet
mw1246.eqiad.wmnet
mw1255.eqiad.wmnet

Event Timeline

Restricted Application added subscribers: Southparkfan, Aklapper. Oct 27 2016, 8:16 AM
MoritzMuehlenhoff triaged this task as Normal priority. Nov 4 2016, 11:25 AM
elukey added a subscriber: elukey.

@Cmjohnson ping :) Can we apply some thermal paste to one host as a test?

In T132256 Chris applied thermal paste to analytics1039 and the alerts stopped, so it seems to be the right way to go. Since there are many appservers, we might need to schedule it properly to avoid 8 hours of thermal paste per day for Chris :)

Cmjohnson moved this task from Backlog to Not urgent on the ops-eqiad board. May 25 2017, 6:09 PM

mw1227 has been alerting over the weekend for high load. I depooled it and noticed it was on the list of overheating machines as well, so the two are likely related.

Mentioned in SAL (#wikimedia-operations) [2018-02-19T08:11:16Z] <godog> repool mw1227 - T149287

mw1227 has been alerting over the weekend for high load. I depooled it and noticed it was on the list of overheating machines as well, so the two are likely related.

Turns out that's not the case; HHVM was likely locking up instead.

Most of the servers have been decommissioned. Are you still having problems with the following?

mw1221.eqiad.wmnet
mw1222.eqiad.wmnet
mw1225.eqiad.wmnet
mw1226.eqiad.wmnet
mw1227.eqiad.wmnet
mw1229.eqiad.wmnet
mw1230.eqiad.wmnet
mw1231.eqiad.wmnet
mw1232.eqiad.wmnet
mw1233.eqiad.wmnet
mw1234.eqiad.wmnet
mw1236.eqiad.wmnet
mw1237.eqiad.wmnet
mw1238.eqiad.wmnet
mw1240.eqiad.wmnet
mw1241.eqiad.wmnet
mw1242.eqiad.wmnet
mw1244.eqiad.wmnet
mw1246.eqiad.wmnet
mw1255.eqiad.wmnet

mw1221, mw1230 and mw1235 are fine; the others are still showing the symptoms mentioned.

Thanks Moritz. I have a procurement task open for more thermal paste. Once it arrives, we can schedule a time to take care of these.
Procurement task: https://phabricator.wikimedia.org/T198326

@Cmjohnson: Per the procurement task, is thermal paste now available?

CDanis added a subscriber: CDanis. Nov 12 2018, 6:24 PM

We observed overheating symptoms on the following machines today:
mw[1221-1227,1229,1231-1235,1238,1240-1248,1250-1251,1253,1255].eqiad.wmnet
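
To compare this range expression against the per-host lists earlier in the task, it can be expanded into individual hostnames. The following is a minimal sketch, not the tool used to produce the list: `expand_hosts` is a hypothetical helper that handles a single bracket group of plain integers and does not preserve zero-padding.

```python
import re


def expand_hosts(expr):
    """Expand 'prefix[a-b,c,...]suffix' into a list of hostnames.

    Hypothetical helper: supports one bracket group of plain integers only.
    """
    match = re.fullmatch(
        r"(?P<prefix>[^\[\]]*)\[(?P<body>[^\[\]]+)\](?P<suffix>[^\[\]]*)", expr
    )
    if match is None:
        # No bracket group: treat the expression as a single hostname.
        return [expr]
    hosts = []
    for part in match.group("body").split(","):
        # "1221-1227" is an inclusive range; "1229" is a single host.
        start, _, end = part.partition("-")
        for number in range(int(start), int(end or start) + 1):
            hosts.append(f"{match.group('prefix')}{number}{match.group('suffix')}")
    return hosts


affected = expand_hosts(
    "mw[1221-1227,1229,1231-1235,1238,1240-1248,1250-1251,1253,1255].eqiad.wmnet"
)
```

The resulting list can then be turned into a set and intersected with the hosts reported earlier to see which machines show up repeatedly.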

Many of these machines are always running hot -- ambient temps of 85C or more, even when only lightly loaded.

I whipped up a quick Grafana graph of some of the temperatures:
https://grafana.wikimedia.org/dashboard/db/xxxx-cdanis-test?panelId=1&fullscreen&orgId=1&from=now-7d&to=now

jijiki added a subscriber: jijiki. Feb 12 2019, 8:18 PM
jijiki assigned this task to RobH. Feb 12 2019, 8:35 PM
jijiki added a subscriber: RobH.

@RobH We had another alert for an mw server having a high load. After investigating with @CDanis, do you think we could add some thermal paste to the following servers?

  • mw1222
  • mw1227
  • mw1233
  • mw1238
  • mw1244