Machines sometimes overheat due to hardware faults, rack airflow problems, etc. The kernel warns us with kern.log spam, and we can often poll the data from /sys/ as well (hardware/generation-dependent). We should be monitoring and alerting on this somehow, and resolving these issues as they come up instead of waiting for high temperatures to induce failures and/or performance throttling.
Some quick-and-dirty commands I've used to audit the cache boxes:
Count of recent Package Temp alerts from the running kernel (most machines will report zero; machines with thermal issues will often show tens of thousands):
grep -c "Package temp" /var/log/kern.log
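For reference, a matched line looks roughly like this (the event count here is made up; the message text is what the kernel's thermal-throttle code emits, though exact wording can vary by kernel version):

CPU2: Package temperature above threshold, cpu clock throttled (total events = 40732)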
Actual temperature (not available on some older machines; also, the proper limit varies by hardware/generation. For example, some classes of hardware seem to start generating the messages above when they cross 80C, others at 85C):
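One way to read this on Linux, assuming the thermal sysfs interface is present:

# Values are in millidegrees C; exact paths vary by hardware/kernel generation.
cat /sys/class/thermal/thermal_zone*/temp
# Friendlier labeled output, if lm-sensors is installed:
sensors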
A random audit right now turns up a handful of cache machines (at esams, ulsfo, and eqiad) whose temperature data looks bad, but rather than filing bugs for these individual cases again, I think we should really look at getting monitoring for the whole fleet (which will probably turn up a lot more cases...).
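As a rough sketch of what a per-host check could look like (standard Nagios/Icinga plugin exit codes; the 80C/85C thresholds are placeholders that would need tuning per hardware class, per the above):

#!/bin/bash
# Sketch of a thermal check (exit 0=OK, 1=WARNING, 2=CRITICAL).
# Thresholds are placeholders and need tuning per hardware class.
WARN=80000   # 80C, in millidegrees
CRIT=85000   # 85C
max=0
for f in /sys/class/thermal/thermal_zone*/temp; do
    [ -r "$f" ] || continue      # skip hosts/zones without this sysfs entry
    t=$(cat "$f")
    [ "$t" -gt "$max" ] && max="$t"
done
if [ "$max" -ge "$CRIT" ]; then
    echo "CRITICAL: max temp ${max} millidegrees C"; exit 2
elif [ "$max" -ge "$WARN" ]; then
    echo "WARNING: max temp ${max} millidegrees C"; exit 1
fi
echo "OK: max temp ${max} millidegrees C"
exit 0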
For reference, see the past ticket for the eqiad caches: T103226.