eqiad: investigate thermal issues with some cp10xx machines
Closed, Resolved, Public

Assigned To
	Cmjohnson

Authored By
	BBlack
	Jun 20 2015, 6:31 AM

Description

I'm still troubled by the fact that while the CPU core freqs all seem to be reasonably-managed (they seem to vary dynamically as expected with intel_idle at the wheel and such), we're still seeing thermal-throttle events on some machines. I decided to investigate this a bit deeper on a software level to see whether this was still potentially a configuration issue, or a real cooling issue.

In eqiad, what I'm seeing is that out of 34 cp* cache machines, only 9 are logging recurring thermal-throttle events, and 8 of those are clustered together in a single rack right next to each other (other than one anomalous machine in the midst of them). Additionally, a software view of CPU package temperatures shows these 9 to stand out from the normal temps observed in the other machines.

The data...

Thermal Throttle Event Counts

root@palladium:~# salt --out=raw --verbose -t 30 'cp10*' cmd.run 'grep -c "Package temp" /var/log/kern.log'|sort
{'cp1008.wikimedia.org': '0'}
{'cp1043.eqiad.wmnet': '0'}
{'cp1044.eqiad.wmnet': '0'}
{'cp1045.eqiad.wmnet': '0'}
{'cp1046.eqiad.wmnet': '11277'}
{'cp1047.eqiad.wmnet': '0'}
{'cp1048.eqiad.wmnet': '0'}
{'cp1049.eqiad.wmnet': '0'}
{'cp1050.eqiad.wmnet': '0'}
{'cp1051.eqiad.wmnet': '0'}
{'cp1052.eqiad.wmnet': '0'}
{'cp1053.eqiad.wmnet': '0'}
{'cp1054.eqiad.wmnet': '0'}
{'cp1055.eqiad.wmnet': '0'}
{'cp1056.eqiad.wmnet': '0'}
{'cp1057.eqiad.wmnet': '0'}
{'cp1058.eqiad.wmnet': '0'}
{'cp1059.eqiad.wmnet': '1407'}
{'cp1060.eqiad.wmnet': '6160'}
{'cp1061.eqiad.wmnet': '46981'}
{'cp1062.eqiad.wmnet': '59774'}
{'cp1063.eqiad.wmnet': '0'}
{'cp1064.eqiad.wmnet': '20047'}
{'cp1065.eqiad.wmnet': '73695'}
{'cp1066.eqiad.wmnet': '98366'}
{'cp1067.eqiad.wmnet': '86553'}
{'cp1068.eqiad.wmnet': '0'}
{'cp1069.eqiad.wmnet': '0'}
{'cp1070.eqiad.wmnet': '0'}
{'cp1071.eqiad.wmnet': '0'}
{'cp1072.eqiad.wmnet': '0'}
{'cp1073.eqiad.wmnet': '0'}
{'cp1074.eqiad.wmnet': '0'}
{'cp1099.eqiad.wmnet': '0'}
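
To pull just the affected hosts out of a listing like the one above, a small filter over the raw salt output is probably enough. This is only a sketch: it assumes the one-dict-per-line format shown here, and `throttled_hosts` is a made-up helper name, not an existing tool.

```shell
# Hypothetical helper: read salt --out=raw lines such as
#   {'cp1046.eqiad.wmnet': '11277'}
# on stdin and print "host count" for every host with a non-zero count.
# Lines that do not carry a plain number (errors, verbose banners) are ignored.
throttled_hosts() {
    awk -F"'" '$4 ~ /^[0-9]+$/ && $4 + 0 > 0 { print $2, $4 }'
}
```

Usage would be to append `| throttled_hosts` to the salt command above, in place of (or after) the `sort`.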

Core0 Temps (m°C)

root@palladium:~# salt --out=raw --verbose -t 30 'cp10*' cmd.run 'cat /sys/class/thermal/thermal_zone0/temp'|sort
{'cp1008.wikimedia.org': 'cat: /sys/class/thermal/thermal_zone0/temp: No such file or directory'}
{'cp1043.eqiad.wmnet': 'cat: /sys/class/thermal/thermal_zone0/temp: No such file or directory'}
{'cp1044.eqiad.wmnet': 'cat: /sys/class/thermal/thermal_zone0/temp: No such file or directory'}
{'cp1045.eqiad.wmnet': '49000'}
{'cp1046.eqiad.wmnet': '94000'}
{'cp1047.eqiad.wmnet': '67000'}
{'cp1048.eqiad.wmnet': '66000'}
{'cp1049.eqiad.wmnet': '67000'}
{'cp1050.eqiad.wmnet': '69000'}
{'cp1051.eqiad.wmnet': '68000'}
{'cp1052.eqiad.wmnet': '67000'}
{'cp1053.eqiad.wmnet': '68000'}
{'cp1054.eqiad.wmnet': '68000'}
{'cp1055.eqiad.wmnet': '66000'}
{'cp1056.eqiad.wmnet': '53000'}
{'cp1057.eqiad.wmnet': '51000'}
{'cp1058.eqiad.wmnet': '48000'}
{'cp1059.eqiad.wmnet': '81000'}
{'cp1060.eqiad.wmnet': '89000'}
{'cp1061.eqiad.wmnet': '95000'}
{'cp1062.eqiad.wmnet': '87000'}
{'cp1063.eqiad.wmnet': '71000'}
{'cp1064.eqiad.wmnet': '83000'}
{'cp1065.eqiad.wmnet': '102000'}
{'cp1066.eqiad.wmnet': '99000'}
{'cp1067.eqiad.wmnet': '101000'}
{'cp1068.eqiad.wmnet': '71000'}
{'cp1069.eqiad.wmnet': '51000'}
{'cp1070.eqiad.wmnet': '48000'}
{'cp1071.eqiad.wmnet': '83000'}
{'cp1072.eqiad.wmnet': '83000'}
{'cp1073.eqiad.wmnet': '86000'}
{'cp1074.eqiad.wmnet': '85000'}
{'cp1099.eqiad.wmnet': '69000'}
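
The raw values above are millidegrees Celsius, so it can help to convert them and list only the hosts over some cut-off. A minimal sketch follows; the helper name `hot_hosts` and the default of 80 are assumptions taken from this discussion, not a vendor limit.

```shell
# Hypothetical helper: read salt --out=raw lines on stdin and print
# "host temp" (whole degrees C) for hosts strictly above a threshold.
# The threshold is the first argument and defaults to 80 (an assumption).
# Hosts that returned an error string instead of a number are skipped.
hot_hosts() {
    awk -F"'" -v limit="${1:-80}" '
        $4 ~ /^[0-9]+$/ {
            c = $4 / 1000          # sysfs reports millidegrees Celsius
            if (c > limit) printf "%s %d\n", $2, c
        }'
}
```

For example, piping the salt output into `hot_hosts 95` would list only the machines currently reading above 95°C.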

What I think I see here is this: among the old hardware, the 9 machines with thermal throttle event counts (46, 59-62, 64-67) are also the only ones showing temps above 80°C (in some cases way above). The newest machines (71-74) are a new generation of hardware and show low-80s without throttle events; the limits may differ for this hardware and they're probably OK. The 65-67 set of three is particularly bad, regularly topping 100°C.

Looking at racktables: cp1046 is an oddball, it's off in rack C8 with a bunch of other machines without issue. The other 8 are all in rack A5 in adjacent slots 10-18, other than one not-hot machine sitting in the middle of them (cp1063, A5:14). The hottest 3 are at the top of that set, so perhaps there's some "heat rises" stuff going on there with all the other hot machines below them.

My suspicion is that there are physical issues in play here. In the A5 case, perhaps something broad: a cooling airflow issue, some rack hardware/cabling blocking ventilation, etc. The cp1046 oddball could be any of those sorts of things, or it could be something like a failed fan for all I know. It might be interesting as a first step to at least confirm (by feel, or with an IR temp probe) that the machines showing excess temps + thermal throttle events in software are indeed running physically hotter, and also check whether there are any obviously-contributing physical factors.

I tried to correlate this against cluster assignments in case there was a pattern there based on cluster load behavior / config, but there is no obvious correlation (3 different clusters are represented in the A5 hot area, and a few of the hot machines are in the usually-quite-idle "mobile" cluster).

I pulled the same info at all datacenters to compare as well. codfw and ulsfo were completely-clean and looked good, but are also much lighter on load. esams shows a similar issue, but with much less severity and a little less conclusive (will put that in a separate ticket later, when/if we figure out what's going on in the eqiad case).

I'm tempted to get some dumps of the mappings of machine names to rack names and correlate this data across them all to see if there are patterns in the non-cp* machines that corroborate, but it's late. Maybe later!
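
If such a host-to-rack dump were available, the correlation could be as simple as counting hot hosts per rack. The sketch below assumes a plain text file of "host rack" pairs, which is an invented format and not an actual racktables export; `hot_per_rack` is likewise a hypothetical name.

```shell
# Hypothetical helper: count hot hosts per rack.
#   $1    : file of "host rack" pairs (assumed format, must not be empty)
#   stdin : one hot host per line (only the first field is used)
# Prints "rack count", highest count first. Hosts missing from the map
# are silently ignored.
hot_per_rack() {
    awk 'NR == FNR { rack[$1] = $2; next }
         ($1 in rack) { n[rack[$1]]++ }
         END { for (r in n) print r, n[r] }' "$1" - | sort -k2,2nr -k1,1
}
```

A rack that dominates the output would support the airflow theory; an even spread would point more toward per-machine causes.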

Related Objects

Mentioned In: T125205: Monitor hardware thermal issues
T116584: aqs1001 getting multiple and repeated heat MCEs
rOPUP61c6ac6623d8: Revert "depool cp1065 for thermal stuff: T103226"
rOPUPc5a7e4411fdb: depool cp1065 for thermal stuff: T103226
Mentioned Here: T116584: aqs1001 getting multiple and repeated heat MCEs

Event Timeline

BBlack created this task. Jun 20 2015, 6:31 AM

BBlack assigned this task to Cmjohnson.

BBlack set the priority of this task to High.

BBlack updated the task description.

BBlack added projects: acl*sre-team, Traffic, ops-eqiad.

BBlack added subscribers: BBlack, faidon, RobH.

Restricted Application added a subscriber: Aklapper. Jun 20 2015, 6:31 AM

BBlack moved this task from Backlog to Traffic team actively servicing on the Traffic board. Jun 20 2015, 6:39 AM

All the cp10xx had front bezels. I removed them to allow more airflow.

Re-ran my temp display command from above (just now, a few mins before this comment), and pattern looks unchanged overall (other than cp1059 happening to fall just below the 80°C mark, probably temporarily).

root@palladium:~# salt --out=raw --verbose -t 30 'cp10*' cmd.run 'cat /sys/class/thermal/thermal_zone0/temp'|sort
{'cp1008.wikimedia.org': 'cat: /sys/class/thermal/thermal_zone0/temp: No such file or directory'}
{'cp1043.eqiad.wmnet': 'cat: /sys/class/thermal/thermal_zone0/temp: No such file or directory'}
{'cp1044.eqiad.wmnet': 'cat: /sys/class/thermal/thermal_zone0/temp: No such file or directory'}
{'cp1045.eqiad.wmnet': '50000'}
{'cp1046.eqiad.wmnet': '89000'}
{'cp1047.eqiad.wmnet': '73000'}
{'cp1048.eqiad.wmnet': '67000'}
{'cp1049.eqiad.wmnet': '68000'}
{'cp1050.eqiad.wmnet': '65000'}
{'cp1051.eqiad.wmnet': '69000'}
{'cp1052.eqiad.wmnet': '67000'}
{'cp1053.eqiad.wmnet': '67000'}
{'cp1054.eqiad.wmnet': '66000'}
{'cp1055.eqiad.wmnet': '68000'}
{'cp1056.eqiad.wmnet': '53000'}
{'cp1057.eqiad.wmnet': '51000'}
{'cp1058.eqiad.wmnet': '47000'}
{'cp1059.eqiad.wmnet': '79000'}
{'cp1060.eqiad.wmnet': '84000'}
{'cp1061.eqiad.wmnet': '102000'}
{'cp1062.eqiad.wmnet': '95000'}
{'cp1063.eqiad.wmnet': '68000'}
{'cp1064.eqiad.wmnet': '92000'}
{'cp1065.eqiad.wmnet': '100000'}
{'cp1066.eqiad.wmnet': '99000'}
{'cp1067.eqiad.wmnet': '100000'}
{'cp1068.eqiad.wmnet': '69000'}
{'cp1069.eqiad.wmnet': '52000'}
{'cp1070.eqiad.wmnet': '47000'}
{'cp1071.eqiad.wmnet': '84000'}
{'cp1072.eqiad.wmnet': '85000'}
{'cp1073.eqiad.wmnet': '86000'}
{'cp1074.eqiad.wmnet': '85000'}
{'cp1099.eqiad.wmnet': '67000'}

I polled the system board temperatures on the hosts listed with the highest temps, and the system boards are well within their range. The inlet and exhaust air temperatures are close to the readings I took with the thermal gun.

The system boards for 1065-67 were all about the same at:
System Board Exhaust Temp 39 °C (102.2 °F)
System Board Inlet Temp 21 °C (69.8 °F) (1067 was 1° higher)

Thermal Gun Readings 1065-1067
Inlet approx 18° C
Exhaust approx 41° C
(for comparison I picked a random server in another rack approx same level and the readings were very similar).

CPU temps were not given. If we're getting thermal events on the CPUs, my recommendation is to clean and re-apply thermal paste on each of them and see if we get lower temps.

Restricted Application added a subscriber: Matanya. Jun 30 2015, 6:29 PM

Ok just so I understand, the system board temps on cp1065-67 that are ~21C input and ~39C output are normal, and we get similar readings on other servers like, say, cp1048?

We do get constant thermal-throttle events in syslog on the affected machines (the exact same ones that show higher CPU package temps in my outputs above: 46, 59-62, 64-67).

The paste below shows there are ~11K matching syslog lines on cp1065 since log rotation this morning at ~06:30 UTC. I've cut off the paste to show just the first 100 of those:

root@cp1065:~# egrep '(kernel|mcelog):' /var/log/syslog|egrep -i '(cpu|processor)'|wc -l
10955
root@cp1065:~# egrep '(kernel|mcelog):' /var/log/syslog|egrep -i '(cpu|processor)'|head -100
Jun 30 06:26:53 cp1065 kernel: [3591210.596177] CPU1: Core temperature above threshold, cpu clock throttled (total events = 84761573)
Jun 30 06:26:53 cp1065 kernel: [3591210.606275] CPU1: Core temperature/speed normal
Jun 30 06:27:38 cp1065 kernel: [3591255.669312] CPU17: Core temperature/speed normal
Jun 30 06:28:22 cp1065 mcelog: Processor 1 heated above trip temperature. Throttling enabled.
Jun 30 06:28:22 cp1065 mcelog: Processor 1 below trip temperature. Throttling disabled
Jun 30 06:28:22 cp1065 mcelog: Processor 17 below trip temperature. Throttling disabled
Jun 30 06:28:22 cp1065 mcelog: CPU 1 on socket 1 received unknown error
Jun 30 06:28:22 cp1065 mcelog: Location: CPU 1 on socket 1
Jun 30 06:29:01 cp1065 kernel: [3591339.036314] CPU1: Package temperature above threshold, cpu clock throttled (total events = 181810773)
Jun 30 06:29:01 cp1065 kernel: [3591339.036316] CPU15: Package temperature above threshold, cpu clock throttled (total events = 181824542)
Jun 30 06:29:01 cp1065 kernel: [3591339.036318] CPU19: Package temperature above threshold, cpu clock throttled (total events = 181829704)
Jun 30 06:29:01 cp1065 kernel: [3591339.036320] CPU13: Package temperature above threshold, cpu clock throttled (total events = 181825850)
Jun 30 06:29:01 cp1065 kernel: [3591339.036322] CPU21: Package temperature above threshold, cpu clock throttled (total events = 181833233)
Jun 30 06:29:01 cp1065 kernel: [3591339.036324] CPU11: Package temperature above threshold, cpu clock throttled (total events = 181831186)
Jun 30 06:29:01 cp1065 kernel: [3591339.036325] CPU29: Package temperature above threshold, cpu clock throttled (total events = 181831186)
Jun 30 06:29:01 cp1065 kernel: [3591339.036326] CPU31: Package temperature above threshold, cpu clock throttled (total events = 181829320)
Jun 30 06:29:01 cp1065 kernel: [3591339.036327] CPU5: Package temperature above threshold, cpu clock throttled (total events = 181824744)
Jun 30 06:29:01 cp1065 kernel: [3591339.036329] CPU25: Package temperature above threshold, cpu clock throttled (total events = 181835053)
Jun 30 06:29:01 cp1065 kernel: [3591339.036330] CPU27: Package temperature above threshold, cpu clock throttled (total events = 181834418)
Jun 30 06:29:01 cp1065 kernel: [3591339.036331] CPU9: Package temperature above threshold, cpu clock throttled (total events = 181831364)
Jun 30 06:29:01 cp1065 kernel: [3591339.036333] CPU23: Package temperature above threshold, cpu clock throttled (total events = 181832673)
Jun 30 06:29:01 cp1065 kernel: [3591339.036335] CPU17: Package temperature above threshold, cpu clock throttled (total events = 181827205)
Jun 30 06:29:01 cp1065 kernel: [3591339.037325] CPU5: Package temperature/speed normal
Jun 30 06:29:01 cp1065 kernel: [3591339.037325] CPU27: Package temperature/speed normal
Jun 30 06:29:01 cp1065 kernel: [3591339.037326] CPU23: Package temperature/speed normal
Jun 30 06:29:01 cp1065 kernel: [3591339.037327] CPU11: Package temperature/speed normal
Jun 30 06:29:01 cp1065 kernel: [3591339.037328] CPU9: Package temperature/speed normal
Jun 30 06:29:01 cp1065 kernel: [3591339.037329] CPU25: Package temperature/speed normal
Jun 30 06:29:01 cp1065 kernel: [3591339.037331] CPU13: Package temperature/speed normal
Jun 30 06:29:01 cp1065 kernel: [3591339.037332] CPU19: Package temperature/speed normal
Jun 30 06:29:01 cp1065 kernel: [3591339.037333] CPU29: Package temperature/speed normal
Jun 30 06:29:01 cp1065 kernel: [3591339.037334] CPU15: Package temperature/speed normal
Jun 30 06:29:01 cp1065 kernel: [3591339.037335] CPU31: Package temperature/speed normal
Jun 30 06:29:01 cp1065 kernel: [3591339.037336] CPU21: Package temperature/speed normal
Jun 30 06:29:01 cp1065 kernel: [3591339.037337] CPU17: Package temperature/speed normal
Jun 30 06:29:01 cp1065 kernel: [3591339.184055] CPU1: Package temperature/speed normal
Jun 30 06:29:04 cp1065 kernel: [3591341.342435] CPU16: Package temperature above threshold, cpu clock throttled (total events = 216987567)
Jun 30 06:29:04 cp1065 kernel: [3591341.353016] CPU16: Package temperature/speed normal
Jun 30 06:29:09 cp1065 kernel: [3591346.500226] CPU3: Package temperature above threshold, cpu clock throttled (total events = 181816558)
Jun 30 06:29:09 cp1065 kernel: [3591346.510714] CPU3: Package temperature/speed normal
Jun 30 06:29:14 cp1065 kernel: [3591351.965111] CPU10: Package temperature/speed normal
Jun 30 06:29:14 cp1065 kernel: [3591351.965113] CPU18: Package temperature/speed normal
Jun 30 06:29:14 cp1065 kernel: [3591351.965114] CPU26: Package temperature/speed normal
Jun 30 06:29:14 cp1065 kernel: [3591351.965116] CPU24: Package temperature/speed normal
Jun 30 06:29:14 cp1065 kernel: [3591351.965117] CPU6: Package temperature/speed normal
Jun 30 06:29:15 cp1065 kernel: [3591353.100362] CPU28: Package temperature above threshold, cpu clock throttled (total events = 216959355)
Jun 30 06:29:15 cp1065 kernel: [3591353.100367] CPU8: Package temperature above threshold, cpu clock throttled (total events = 216987472)
Jun 30 06:29:15 cp1065 kernel: [3591353.100390] CPU22: Package temperature above threshold, cpu clock throttled (total events = 216991847)
Jun 30 06:29:15 cp1065 kernel: [3591353.102354] CPU8: Package temperature/speed normal
Jun 30 06:29:15 cp1065 kernel: [3591353.102357] CPU22: Package temperature/speed normal
Jun 30 06:29:16 cp1065 kernel: [3591353.457086] CPU14: Package temperature/speed normal
Jun 30 06:29:16 cp1065 kernel: [3591354.027711] CPU2: Package temperature above threshold, cpu clock throttled (total events = 216986631)
Jun 30 06:29:17 cp1065 kernel: [3591355.076033] CPU4: Package temperature above threshold, cpu clock throttled (total events = 216978882)
Jun 30 06:29:17 cp1065 kernel: [3591355.076036] CPU12: Package temperature above threshold, cpu clock throttled (total events = 216941087)
Jun 30 06:29:17 cp1065 kernel: [3591355.076039] CPU20: Package temperature above threshold, cpu clock throttled (total events = 216989659)
Jun 30 06:29:17 cp1065 kernel: [3591355.078062] CPU12: Package temperature/speed normal
Jun 30 06:29:17 cp1065 kernel: [3591355.078064] CPU20: Package temperature/speed normal
Jun 30 06:29:21 cp1065 kernel: [3591358.343806] CPU0: Package temperature above threshold, cpu clock throttled (total events = 216981669)
Jun 30 06:29:21 cp1065 kernel: [3591358.354288] CPU0: Package temperature/speed normal
Jun 30 06:29:31 cp1065 kernel: [3591368.541822] CPU30: Package temperature above threshold, cpu clock throttled (total events = 216983471)
Jun 30 06:29:31 cp1065 kernel: [3591368.552393] CPU30: Package temperature/speed normal
Jun 30 06:29:50 cp1065 kernel: [3591387.789947] CPU7: Package temperature above threshold, cpu clock throttled (total events = 181829362)
Jun 30 06:31:58 cp1065 kernel: [3591515.376622] CPU1: Core temperature above threshold, cpu clock throttled (total events = 84780348)
Jun 30 06:31:58 cp1065 kernel: [3591515.386719] CPU1: Core temperature/speed normal
Jun 30 06:33:23 cp1065 mcelog: Processor 1 heated above trip temperature. Throttling enabled.
Jun 30 06:33:23 cp1065 mcelog: Processor 1 below trip temperature. Throttling disabled
Jun 30 06:33:23 cp1065 mcelog: CPU 1 on socket 1 received unknown error
Jun 30 06:33:23 cp1065 mcelog: Location: CPU 1 on socket 1
Jun 30 06:33:33 cp1065 kernel: [3591610.553313] CPU12: Core temperature above threshold, cpu clock throttled (total events = 95131637)
Jun 30 06:33:33 cp1065 kernel: [3591610.553316] CPU28: Core temperature above threshold, cpu clock throttled (total events = 95149121)
Jun 30 06:33:33 cp1065 kernel: [3591610.554220] CPU28: Core temperature/speed normal
Jun 30 06:34:01 cp1065 kernel: [3591638.824249] CPU19: Package temperature/speed normal
Jun 30 06:34:01 cp1065 kernel: [3591638.824251] CPU9: Package temperature/speed normal
Jun 30 06:34:01 cp1065 kernel: [3591638.824252] CPU25: Package temperature/speed normal
Jun 30 06:34:01 cp1065 kernel: [3591638.824254] CPU15: Package temperature/speed normal
Jun 30 06:34:01 cp1065 kernel: [3591638.824256] CPU11: Package temperature/speed normal
Jun 30 06:34:01 cp1065 kernel: [3591638.824257] CPU29: Package temperature/speed normal
Jun 30 06:34:01 cp1065 kernel: [3591638.824258] CPU13: Package temperature/speed normal
Jun 30 06:34:01 cp1065 kernel: [3591638.824259] CPU27: Package temperature/speed normal
Jun 30 06:34:01 cp1065 kernel: [3591638.824262] CPU17: Package temperature/speed normal
Jun 30 06:34:01 cp1065 kernel: [3591638.824262] CPU31: Package temperature/speed normal
Jun 30 06:34:01 cp1065 kernel: [3591638.824264] CPU21: Package temperature/speed normal
Jun 30 06:34:01 cp1065 kernel: [3591638.824266] CPU23: Package temperature/speed normal
Jun 30 06:34:01 cp1065 kernel: [3591638.824268] CPU5: Package temperature/speed normal
Jun 30 06:34:01 cp1065 kernel: [3591638.972080] CPU1: Package temperature above threshold, cpu clock throttled (total events = 181836454)
Jun 30 06:34:01 cp1065 kernel: [3591638.982563] CPU1: Package temperature/speed normal
Jun 30 06:34:04 cp1065 kernel: [3591641.357228] CPU16: Package temperature above threshold, cpu clock throttled (total events = 217014719)
Jun 30 06:34:04 cp1065 kernel: [3591641.367808] CPU16: Package temperature/speed normal
Jun 30 06:34:12 cp1065 kernel: [3591649.566825] CPU3: Package temperature above threshold, cpu clock throttled (total events = 181842651)
Jun 30 06:34:14 cp1065 kernel: [3591651.756074] CPU10: Package temperature above threshold, cpu clock throttled (total events = 217013098)
Jun 30 06:34:14 cp1065 kernel: [3591651.756076] CPU18: Package temperature above threshold, cpu clock throttled (total events = 217018608)
Jun 30 06:34:14 cp1065 kernel: [3591651.756078] CPU24: Package temperature above threshold, cpu clock throttled (total events = 217019013)
Jun 30 06:34:14 cp1065 kernel: [3591651.756083] CPU6: Package temperature above threshold, cpu clock throttled (total events = 217013567)
Jun 30 06:34:14 cp1065 kernel: [3591651.756086] CPU26: Package temperature above threshold, cpu clock throttled (total events = 217018225)
Jun 30 06:34:14 cp1065 kernel: [3591651.761069] CPU6: Package temperature/speed normal
Jun 30 06:34:14 cp1065 kernel: [3591651.761075] CPU24: Package temperature/speed normal
Jun 30 06:34:14 cp1065 kernel: [3591651.761077] CPU18: Package temperature/speed normal
Jun 30 06:34:14 cp1065 kernel: [3591651.761079] CPU26: Package temperature/speed normal
Jun 30 06:34:14 cp1065 kernel: [3591651.808845] CPU10: Package temperature/speed normal
Jun 30 06:34:15 cp1065 kernel: [3591652.890258] CPU22: Package temperature/speed normal
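
A paste like this is easier to reason about once it is reduced to counts. As a rough sketch (the helper name `throttle_summary` is made up, and it only understands the kernel message wording seen above), the "above threshold" lines can be tallied by scope:

```shell
# Hypothetical helper: count kernel "above threshold" throttle messages
# on stdin, split by scope (Core vs Package), and print "scope count".
# The scope word is located by scanning fields, so a kernel timestamp
# containing padding spaces does not throw the parsing off.
throttle_summary() {
    awk '/temperature above threshold, cpu clock throttled/ {
             for (i = 1; i < NF; i++)
                 if ($(i + 1) == "temperature" && ($i == "Core" || $i == "Package"))
                     n[$i]++
         }
         END { for (k in n) print k, n[k] }' | sort
}
```

Running `throttle_summary < /var/log/syslog` on an affected host would show whether the events are mostly per-core or package-wide.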

I may have some thermal paste in storage. Let's pick one and see if that helps.

Is cp1067 acceptable? If so I'll downtime/depool it.

Change 222319 had a related patch set uploaded (by BBlack):
depool cp1065 for thermal stuff: T103226

https://gerrit.wikimedia.org/r/222319

Change 222319 merged by BBlack:
depool cp1065 for thermal stuff: T103226

https://gerrit.wikimedia.org/r/222319

BBlack mentioned this in rOPUPc5a7e4411fdb: depool cp1065 for thermal stuff: T103226. Jul 2 2015, 3:16 PM

cp1065 downtimed and depooled in various places and software poweroff'd, can use that one.

Change 222327 had a related patch set uploaded (by BBlack):
Revert "depool cp1065 for thermal stuff: T103226"

https://gerrit.wikimedia.org/r/222327

Change 222327 merged by BBlack:
Revert "depool cp1065 for thermal stuff: T103226"

https://gerrit.wikimedia.org/r/222327

BBlack mentioned this in rOPUP61c6ac6623d8: Revert "depool cp1065 for thermal stuff: T103226". Jul 2 2015, 4:02 PM

I think there's a good chance redoing the thermal paste addressed the issue on cp1065. It's been roughly 4 hours since it was powered back up and repooled to full load. In that time, there have been zero syslogged thermal events, and my random observations of /sys/class/thermal/thermal_zone0/temp have all been in the ~65-71°C range, as opposed to ~100°C before.
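
For spot checks like these, a tiny wrapper around the sysfs file saves doing the millidegree conversion by hand. This is a sketch only; `zone_temp_c` is a hypothetical name, and the default path is simply the one used throughout this task.

```shell
# Hypothetical helper: print a thermal zone reading in whole degrees C.
# The path defaults to the sysfs file used throughout this task; pass
# another path as the first argument to read a different zone.
zone_temp_c() {
    file="${1:-/sys/class/thermal/thermal_zone0/temp}"
    [ -r "$file" ] || { echo "no readable sensor at $file" >&2; return 1; }
    read -r milli < "$file"
    echo $(( milli / 1000 ))
}
```

Something like `while sleep 60; do zone_temp_c; done` on the host would then give a rough running view over time.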

Let's let it go for more observation over the (holiday) weekend first, and then if it continues to look healthy, start trying to fix the others similarly?

cp1065 over the weekend: Zero thermal events and temperature still looks good! Can we set up some schedule/time to try this on the others? The rest of the list is basically cp1046, 59-62, 64, 66, 67 (8 more machines). They'll all need depool/downtime and such, and most of them would have to be one at a time.

Yes, but I will need to buy more thermal paste first. I only had enough to do the one server on-site.

ok great, let me know! if it makes this easier, we can probably chunk this up into 4 sets of 2 machines at a time, just have to sort out which sets ahead of time.

Ordered thermal paste and it should be here in 3 days. Let's work on fixing them next week. I will ping you and we can do them in chunks.

The thermal paste is on-site. @BBlack let me know the first chunk of servers. The whole process is pretty quick.

So the affected machines by-cluster:

mobile
- cp1046
- cp1059
- cp1060
upload
- cp1061
- cp1062
- cp1064
text
- cp1066
- cp1067

We could group these up into 3 batches (at most one machine per-cluster at a time) like so:

Batch 1
- cp1046 (mobile)
- cp1061 (upload)
- cp1066 (text)
Batch 2
- cp1059 (mobile)
- cp1062 (upload)
- cp1067 (text)
Batch 3
- cp1060 (mobile)
- cp1064 (upload)

But before we dive into that: let me re-check cp1065's status now that it's been fixed for a while, and reconfirm that the "bad temperature" set still looks the same as back when I first logged all of this, etc...

(lists above edited, I had mistakenly used 106[78] when it should have been 106[67] in the text cluster...)

Confirmed: throttle/temp data still looks like it did before, other than cp1065, which still looks like it's fine after the thermal-paste fix.

Timing is up to you, I can work with your schedule. Just give me some warning so I can depool/shutdown each batch.

cp1046, cp1061 and cp1063 are complete.

@Cmjohnson did the thermal paste work on the other 8 hosts. So far everything looks peachy on the core temp values:

root@palladium:~# salt --out=raw --verbose -t 30 'cp10*' cmd.run 'cat /sys/class/thermal/thermal_zone0/temp'|sort

-------------------------------------------
{'cp1008.wikimedia.org': 'cat: /sys/class/thermal/thermal_zone0/temp: No such file or directory'}
{'cp1043.eqiad.wmnet': 'cat: /sys/class/thermal/thermal_zone0/temp: No such file or directory'}
{'cp1044.eqiad.wmnet': 'cat: /sys/class/thermal/thermal_zone0/temp: No such file or directory'}
{'cp1045.eqiad.wmnet': '48000'}
{'cp1046.eqiad.wmnet': '65000'}
{'cp1047.eqiad.wmnet': '67000'}
{'cp1048.eqiad.wmnet': '67000'}
{'cp1049.eqiad.wmnet': '63000'}
{'cp1050.eqiad.wmnet': '67000'}
{'cp1051.eqiad.wmnet': '67000'}
{'cp1052.eqiad.wmnet': '68000'}
{'cp1053.eqiad.wmnet': '65000'}
{'cp1054.eqiad.wmnet': '67000'}
{'cp1055.eqiad.wmnet': '68000'}
{'cp1056.eqiad.wmnet': '52000'}
{'cp1057.eqiad.wmnet': '50000'}
{'cp1058.eqiad.wmnet': '48000'}
{'cp1059.eqiad.wmnet': '70000'}
{'cp1060.eqiad.wmnet': '58000'}
{'cp1061.eqiad.wmnet': '71000'}
{'cp1062.eqiad.wmnet': '72000'}
{'cp1063.eqiad.wmnet': '70000'}
{'cp1064.eqiad.wmnet': '62000'}
{'cp1065.eqiad.wmnet': '71000'}
{'cp1066.eqiad.wmnet': '73000'}
{'cp1067.eqiad.wmnet': '70000'}
{'cp1068.eqiad.wmnet': '69000'}
{'cp1069.eqiad.wmnet': '51000'}
{'cp1070.eqiad.wmnet': '50000'}
{'cp1071.eqiad.wmnet': '83000'}
{'cp1072.eqiad.wmnet': '84000'}
{'cp1073.eqiad.wmnet': '85000'}
{'cp1074.eqiad.wmnet': '84000'}
{'cp1099.eqiad.wmnet': '71000'}

Will leave this open a few days to see if temps eventually rise or any more kernel throttle events happen before closing it up.

Still looking good

cmjohnson@palladium:~$ sudo salt --out=raw --verbose -t 30 'cp10*' cmd.run 'cat /sys/class/thermal/thermal_zone0/temp'|sort

{'cp1008.wikimedia.org': 'cat: /sys/class/thermal/thermal_zone0/temp: No such file or directory'}
{'cp1043.eqiad.wmnet': 'cat: /sys/class/thermal/thermal_zone0/temp: No such file or directory'}
{'cp1044.eqiad.wmnet': 'cat: /sys/class/thermal/thermal_zone0/temp: No such file or directory'}
{'cp1045.eqiad.wmnet': '47000'}
{'cp1046.eqiad.wmnet': '65000'}
{'cp1047.eqiad.wmnet': '68000'}
{'cp1048.eqiad.wmnet': '66000'}
{'cp1049.eqiad.wmnet': '61000'}
{'cp1050.eqiad.wmnet': '69000'}
{'cp1051.eqiad.wmnet': '68000'}
{'cp1052.eqiad.wmnet': '68000'}
{'cp1053.eqiad.wmnet': '66000'}
{'cp1054.eqiad.wmnet': '68000'}
{'cp1055.eqiad.wmnet': '70000'}
{'cp1056.eqiad.wmnet': '52000'}
{'cp1057.eqiad.wmnet': '50000'}
{'cp1058.eqiad.wmnet': '46000'}
{'cp1059.eqiad.wmnet': '70000'}
{'cp1060.eqiad.wmnet': '69000'}
{'cp1061.eqiad.wmnet': '68000'}
{'cp1062.eqiad.wmnet': '67000'}
{'cp1063.eqiad.wmnet': '70000'}
{'cp1064.eqiad.wmnet': '70000'}
{'cp1065.eqiad.wmnet': '67000'}
{'cp1066.eqiad.wmnet': '68000'}
{'cp1067.eqiad.wmnet': '69000'}
{'cp1068.eqiad.wmnet': '72000'}
{'cp1069.eqiad.wmnet': '52000'}
{'cp1070.eqiad.wmnet': '49000'}
{'cp1071.eqiad.wmnet': '84000'}
{'cp1072.eqiad.wmnet': '84000'}
{'cp1073.eqiad.wmnet': '83000'}
{'cp1074.eqiad.wmnet': '84000'}
{'cp1099.eqiad.wmnet': '68000'}

Temps all still look good today, closing this up. Thanks @Cmjohnson!

BBlack moved this task from Traffic team actively servicing to Done on the Traffic board. Aug 8 2015, 4:38 PM

BBlack mentioned this in T116584: aqs1001 getting multiple and repeated heat MCEs. Oct 26 2015, 3:13 PM

I do have thermal paste on-site. Let me know when you want to schedule downtime on each of these.

I think @Cmjohnson's comment above was meant for T116584.

BBlack mentioned this in T125205: Monitor hardware thermal issues. Jan 29 2016, 12:21 PM
