Machine is depooled from service.
CPU temp trips and MCE errors have been logging on cp1053 for at least a week, e.g.:
May 14 06:26:21 cp1053 kernel: [1536133.112970] CPU7: Core temperature above threshold, cpu clock throttled (total events = 4508187)
May 14 06:26:21 cp1053 kernel: [1536133.112971] CPU23: Core temperature above threshold, cpu clock throttled (total events = 4508763)
May 14 06:26:21 cp1053 kernel: [1536133.112984] mce_notify_irq: 1 callbacks suppressed
May 14 06:26:21 cp1053 kernel: [1536133.112984] mce: [Hardware Error]: Machine check events logged
As of today, we've had some small spikes of user-facing 503s localized to this varnish backend, almost certainly related in some way to the hardware errors above.
Meta-point (perhaps a separate task): why aren't we catching things like CPU temp trips and MCEs in icinga alerting?
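For discussion on the meta-point, here is a rough sketch of what such a check could look like. It is not an existing plugin: the function names (`check_kernel_log`, `main`) and the choice of WARNING for throttling vs. CRITICAL for MCEs are assumptions for illustration. The regexes are taken from the kernel messages quoted above, and the return values follow the usual Nagios/Icinga plugin exit-code convention (0 OK, 1 WARNING, 2 CRITICAL).

```python
import re

# Patterns based on the kernel messages quoted in this task.
THROTTLE_RE = re.compile(
    r"CPU(\d+): Core temperature above threshold, cpu clock throttled"
)
MCE_RE = re.compile(r"mce: \[Hardware Error\]")

# Conventional Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_kernel_log(lines):
    """Scan kernel log lines and return (exit_code, status_message).

    Severity mapping is an assumption: any MCE line is CRITICAL,
    thermal throttling alone is WARNING.
    """
    throttled_cpus = set()
    mce_lines = 0
    for line in lines:
        match = THROTTLE_RE.search(line)
        if match:
            throttled_cpus.add(int(match.group(1)))
        if MCE_RE.search(line):
            mce_lines += 1

    cpus = ",".join(str(c) for c in sorted(throttled_cpus)) or "none"
    if mce_lines:
        return CRITICAL, f"CRITICAL: {mce_lines} MCE line(s), throttled CPUs: {cpus}"
    if throttled_cpus:
        return WARNING, f"WARNING: thermal throttling on CPUs: {cpus}"
    return OK, "OK: no thermal throttling or MCE messages"


def main(stream):
    """Hypothetical entry point: print the status line, return the exit code."""
    code, message = check_kernel_log(stream)
    print(message)
    return code
```

A wrapper would feed it recent kernel log output on stdin and exit with the returned code, e.g. `sys.exit(main(sys.stdin))`. How far back to look, and whether to alert on the "total events" counter increasing rather than on mere presence of the message, is left open here.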