I'm still troubled that, while the CPU core frequencies all seem to be reasonably managed (they vary dynamically as expected with intel_idle at the wheel), we're still seeing thermal-throttle events on some machines. I decided to dig a bit deeper at the software level to see whether this is still potentially a configuration issue or a real cooling issue.
In eqiad, out of 34 cp* cache machines, only 9 are logging recurring thermal-throttle events, and 8 of those sit in adjacent slots in a single rack (with one anomalous, non-throttling machine in the midst of them). Additionally, a software view of CPU package temperatures shows these 9 standing out from the normal temps observed on the other machines.
The data...
root@palladium:~# salt --out=raw --verbose -t 30 'cp10*' cmd.run 'grep -c "Package temp" /var/log/kern.log'|sort
{'cp1008.wikimedia.org': '0'}
{'cp1043.eqiad.wmnet': '0'}
{'cp1044.eqiad.wmnet': '0'}
{'cp1045.eqiad.wmnet': '0'}
{'cp1046.eqiad.wmnet': '11277'}
{'cp1047.eqiad.wmnet': '0'}
{'cp1048.eqiad.wmnet': '0'}
{'cp1049.eqiad.wmnet': '0'}
{'cp1050.eqiad.wmnet': '0'}
{'cp1051.eqiad.wmnet': '0'}
{'cp1052.eqiad.wmnet': '0'}
{'cp1053.eqiad.wmnet': '0'}
{'cp1054.eqiad.wmnet': '0'}
{'cp1055.eqiad.wmnet': '0'}
{'cp1056.eqiad.wmnet': '0'}
{'cp1057.eqiad.wmnet': '0'}
{'cp1058.eqiad.wmnet': '0'}
{'cp1059.eqiad.wmnet': '1407'}
{'cp1060.eqiad.wmnet': '6160'}
{'cp1061.eqiad.wmnet': '46981'}
{'cp1062.eqiad.wmnet': '59774'}
{'cp1063.eqiad.wmnet': '0'}
{'cp1064.eqiad.wmnet': '20047'}
{'cp1065.eqiad.wmnet': '73695'}
{'cp1066.eqiad.wmnet': '98366'}
{'cp1067.eqiad.wmnet': '86553'}
{'cp1068.eqiad.wmnet': '0'}
{'cp1069.eqiad.wmnet': '0'}
{'cp1070.eqiad.wmnet': '0'}
{'cp1071.eqiad.wmnet': '0'}
{'cp1072.eqiad.wmnet': '0'}
{'cp1073.eqiad.wmnet': '0'}
{'cp1074.eqiad.wmnet': '0'}
{'cp1099.eqiad.wmnet': '0'}
root@palladium:~# salt --out=raw --verbose -t 30 'cp10*' cmd.run 'cat /sys/class/thermal/thermal_zone0/temp'|sort
{'cp1008.wikimedia.org': 'cat: /sys/class/thermal/thermal_zone0/temp: No such file or directory'}
{'cp1043.eqiad.wmnet': 'cat: /sys/class/thermal/thermal_zone0/temp: No such file or directory'}
{'cp1044.eqiad.wmnet': 'cat: /sys/class/thermal/thermal_zone0/temp: No such file or directory'}
{'cp1045.eqiad.wmnet': '49000'}
{'cp1046.eqiad.wmnet': '94000'}
{'cp1047.eqiad.wmnet': '67000'}
{'cp1048.eqiad.wmnet': '66000'}
{'cp1049.eqiad.wmnet': '67000'}
{'cp1050.eqiad.wmnet': '69000'}
{'cp1051.eqiad.wmnet': '68000'}
{'cp1052.eqiad.wmnet': '67000'}
{'cp1053.eqiad.wmnet': '68000'}
{'cp1054.eqiad.wmnet': '68000'}
{'cp1055.eqiad.wmnet': '66000'}
{'cp1056.eqiad.wmnet': '53000'}
{'cp1057.eqiad.wmnet': '51000'}
{'cp1058.eqiad.wmnet': '48000'}
{'cp1059.eqiad.wmnet': '81000'}
{'cp1060.eqiad.wmnet': '89000'}
{'cp1061.eqiad.wmnet': '95000'}
{'cp1062.eqiad.wmnet': '87000'}
{'cp1063.eqiad.wmnet': '71000'}
{'cp1064.eqiad.wmnet': '83000'}
{'cp1065.eqiad.wmnet': '102000'}
{'cp1066.eqiad.wmnet': '99000'}
{'cp1067.eqiad.wmnet': '101000'}
{'cp1068.eqiad.wmnet': '71000'}
{'cp1069.eqiad.wmnet': '51000'}
{'cp1070.eqiad.wmnet': '48000'}
{'cp1071.eqiad.wmnet': '83000'}
{'cp1072.eqiad.wmnet': '83000'}
{'cp1073.eqiad.wmnet': '86000'}
{'cp1074.eqiad.wmnet': '85000'}
{'cp1099.eqiad.wmnet': '69000'}
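What I think I see here is this (the sysfs values are in millidegrees Celsius, so 94000 is 94°C): among the old hardware, the 9 machines with thermal-throttle event counts (46, 59-62, 64-67) are also the only ones showing temps above 80°C, in some cases way above. The newest ones (71-74) are a new generation of hardware and show low-to-mid 80s without throttle events; the limits may differ for this hardware and they're probably ok. The 65-67 set of three is particularly bad, regularly sitting around or above 100°C (99-102°C in this sample). Three hosts (cp1008, cp1043, cp1044) have no thermal_zone0 at all, so there's no temperature reading for them here.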
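For anyone who wants to repeat this comparison without eyeballing two long lists, here is a minimal sketch of joining the two salt outputs. It assumes the `--out=raw` format shown above (one Python-style dict literal per line); the function names `parse_salt_raw` and `hot_hosts` and the 80°C cutoff parameter are made up for illustration, and the sample strings are a small subset of the data above.

```python
import ast

def parse_salt_raw(text):
    """Merge salt --out=raw lines ({'host': 'value'} per line) into one dict."""
    merged = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("{"):          # skip prompts, blank lines, etc.
            merged.update(ast.literal_eval(line))
    return merged

def hot_hosts(throttle_raw, temp_raw, limit_c=80.0):
    """Return (host, temp_C, throttle_events) for hosts that are hot or throttling."""
    throttle = parse_salt_raw(throttle_raw)
    temps = parse_salt_raw(temp_raw)
    rows = []
    for host in sorted(temps):
        value = temps[host]
        if not value.isdigit():
            continue                      # e.g. "cat: ...: No such file or directory"
        temp_c = int(value) / 1000.0      # sysfs reports millidegrees Celsius
        events_raw = throttle.get(host, "0")
        events = int(events_raw) if events_raw.isdigit() else 0
        if temp_c > limit_c or events > 0:
            rows.append((host, temp_c, events))
    return rows

# Small subset of the outputs above, for illustration only.
THROTTLE_SAMPLE = """
{'cp1044.eqiad.wmnet': '0'}
{'cp1045.eqiad.wmnet': '0'}
{'cp1046.eqiad.wmnet': '11277'}
{'cp1063.eqiad.wmnet': '0'}
{'cp1065.eqiad.wmnet': '73695'}
{'cp1071.eqiad.wmnet': '0'}
"""
TEMP_SAMPLE = """
{'cp1044.eqiad.wmnet': 'cat: /sys/class/thermal/thermal_zone0/temp: No such file or directory'}
{'cp1045.eqiad.wmnet': '49000'}
{'cp1046.eqiad.wmnet': '94000'}
{'cp1063.eqiad.wmnet': '71000'}
{'cp1065.eqiad.wmnet': '102000'}
{'cp1071.eqiad.wmnet': '83000'}
"""

for host, temp_c, events in hot_hosts(THROTTLE_SAMPLE, TEMP_SAMPLE):
    print(f"{host:24s} {temp_c:6.1f} C  {events:6d} throttle events")
```

On this subset it lists cp1046 and cp1065 (hot and throttling) plus cp1071 (above 80°C, zero events), which is the same split between old and new hardware described above.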
Looking at racktables: cp1046 is an oddball. It's off in rack C8 with a bunch of other machines that have no issue. The other 8 are all in rack A5, in adjacent slots 10-18, apart from one not-hot machine sitting in the middle of them (cp1063, A5:14). The hottest 3 are at the top of that set, so perhaps there's some "heat rises" effect going on there with all the other hot machines below them.
My suspicion is that there are physical issues in play here. In the A5 case, perhaps something broad: a cooling airflow issue, some rack hardware/cabling blocking ventilation, etc. The cp1046 oddball could be any of those sorts of things, or it could be something like a failed fan for all I know. It might be interesting as a first step to at least confirm (by feel, or with an IR temp probe) that the machines showing excess temps + thermal throttle events in software are indeed running physically hotter, and also check whether there are any obviously-contributing physical factors.
I tried to correlate this against cluster assignments in case there was a pattern there based on cluster load behavior / config, but there is no obvious correlation (3 different clusters are represented in the A5 hot area, and a few of the hot machines are in the usually-quite-idle "mobile" cluster).
I pulled the same info at all datacenters to compare as well. codfw and ulsfo were completely clean and looked good, but they also carry much lighter load. esams shows a similar issue, but it's much less severe and the picture is less conclusive (I'll put that in a separate ticket later, when/if we figure out what's going on in the eqiad case).
I'm tempted to get some dumps of the mappings of machine names to rack names and correlate this data across them all to see if there are patterns in the non-cp* machines that corroborate, but it's late. Maybe later!
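A rough sketch of what that correlation could look like, once a host-to-rack mapping is available. The `rack_of` dict format and the `summarize_by_rack` helper are hypothetical (a real racktables export will look different); the entries below are only the placements and throttle counts already stated in this comment, so the other, unnamed C8 machines are not represented.

```python
from collections import defaultdict

def summarize_by_rack(rack_of, throttle_counts):
    """Count hosts and throttling hosts per rack.

    rack_of: {host: rack}  (hypothetical format, not a real racktables dump)
    throttle_counts: {host: number of thermal-throttle log lines}
    """
    summary = defaultdict(lambda: {"hosts": 0, "throttling": 0})
    for host, rack in rack_of.items():
        summary[rack]["hosts"] += 1
        if throttle_counts.get(host, 0) > 0:
            summary[rack]["throttling"] += 1
    return dict(summary)

# Placements mentioned above: cp1046 in C8, cp1059-cp1067 in A5.
rack_of = {"cp1046": "C8"}
rack_of.update({f"cp10{n}": "A5" for n in range(59, 68)})

# Throttle counts taken from the kern.log grep output above.
throttle_counts = {
    "cp1046": 11277, "cp1059": 1407, "cp1060": 6160, "cp1061": 46981,
    "cp1062": 59774, "cp1063": 0, "cp1064": 20047, "cp1065": 73695,
    "cp1066": 98366, "cp1067": 86553,
}

for rack, stats in sorted(summarize_by_rack(rack_of, throttle_counts).items()):
    print(f"{rack}: {stats['throttling']} of {stats['hosts']} known hosts throttling")
```

Feeding in the full mapping, including non-cp* machines, would show whether A5 (or any other rack) stands out beyond the cache hosts.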