Several alerts fired over the past few days:
19:22 <icinga-wm> PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
04:37 #wikimedia-operations: <icinga-wm> PROBLEM - Hadoop NodeManager on analytics1046 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
06:30 #wikimedia-operations: <icinga-wm> PROBLEM - Hadoop NodeManager on analytics1057 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
Output of dmesg -T:
[...]
[Sat Apr 9 17:46:09 2016] mce: [Hardware Error]: Machine check events logged
[Sat Apr 9 17:48:18 2016] CPU22: Package temperature above threshold, cpu clock throttled (total events = 378279009)
[Sat Apr 9 17:48:18 2016] CPU8: Package temperature above threshold, cpu clock throttled (total events = 378273836)
[Sat Apr 9 17:48:18 2016] CPU20: Package temperature above threshold, cpu clock throttled (total events = 378279595)
[Sat Apr 9 17:48:18 2016] CPU0: Package temperature above threshold, cpu clock throttled (total events = 378271750)
[Sat Apr 9 17:48:18 2016] CPU4: Package temperature above threshold, cpu clock throttled (total events = 378275623)
[Sat Apr 9 17:48:18 2016] CPU16: Package temperature above threshold, cpu clock throttled (total events = 378279736)
[Sat Apr 9 17:48:18 2016] CPU18: Package temperature above threshold, cpu clock throttled (total events = 378280021)
[...]
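To compare hosts quickly, the throttle messages above can be reduced to one counter per CPU. The following is a minimal sketch, assuming the message format is exactly as pasted above. The function name throttle_totals and the sample string are hypothetical, used only for illustration.

```python
import re

# Matches kernel messages of the form pasted above, e.g.
# [Sat Apr 9 17:48:18 2016] CPU22: Package temperature above threshold,
# cpu clock throttled (total events = 378279009)
THROTTLE_RE = re.compile(
    r"\[(?P<when>[^\]]+)\]\s+CPU(?P<cpu>\d+): "
    r"Package temperature above threshold, "
    r"cpu clock throttled \(total events = (?P<total>\d+)\)"
)


def throttle_totals(text):
    """Return {cpu: highest 'total events' value seen} for throttle lines in text."""
    totals = {}
    for match in THROTTLE_RE.finditer(text):
        cpu = int(match.group("cpu"))
        total = int(match.group("total"))
        # Keep the largest counter per CPU, since the same CPU can appear many times.
        totals[cpu] = max(total, totals.get(cpu, 0))
    return totals


# Hypothetical sample: two of the lines from the excerpt above.
sample = (
    "[Sat Apr 9 17:48:18 2016] CPU22: Package temperature above threshold, "
    "cpu clock throttled (total events = 378279009)\n"
    "[Sat Apr 9 17:48:18 2016] CPU8: Package temperature above threshold, "
    "cpu clock throttled (total events = 378273836)\n"
)

for cpu, total in sorted(throttle_totals(sample).items()):
    print(f"CPU{cpu}: {total} throttle events")
```

In practice the text would come from the saved output of dmesg -T on each affected host instead of the inline sample.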
and /var/log/mcelog:
mcelog: failed to prefill DIMM database from DMI data
mcelog: Warning: cpu 0 offline?, imc_log not set : No such file or directory
mcelog: Warning: cpu 1 offline?, imc_log not set : No such file or directory
mcelog: Warning: cpu 2 offline?, imc_log not set : No such file or directory
mcelog: Warning: cpu 3 offline?, imc_log not set : No such file or directory
mcelog: Warning: cpu 4 offline?, imc_log not set : No such file or directory
[...]
Hardware event. This is not a software error.
MCE 0
CPU 12 THERMAL EVENT TSC da97d489bc000
TIME 1447816783 Wed Nov 18 03:19:43 2015
Processor 12 heated above trip temperature. Throttling enabled.
Please check your system cooling. Performance will be impacted
STATUS 88000bc3 MCGSTATUS 0
MCGCAP 1000c17 APICID 1 SOCKETID 0
CPUID Vendor Intel Family 6 Model 62
Hardware event. This is not a software error.
MCE 1
CPU 12 THERMAL EVENT TSC da97d48bb43f3
TIME 1447816783 Wed Nov 18 03:19:43 2015
Processor 12 below trip temperature. Throttling disabled
STATUS 88010a82 MCGSTATUS 0
MCGCAP 1000c17 APICID 1 SOCKETID 0
CPUID Vendor Intel Family 6 Model 62
Hardware event. This is not a software error.
[...]
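The thermal records in /var/log/mcelog can be turned into a timeline of throttling being enabled and disabled per CPU. This is a rough sketch, assuming each record carries the "CPU n THERMAL EVENT", "TIME epoch" and "Throttling enabled/disabled" fields in the order seen above. The function name thermal_events and the sample text are hypothetical.

```python
import re
from datetime import datetime, timezone

# One thermal record: CPU number, epoch seconds from the TIME field, and
# whether throttling was switched on or off. DOTALL lets the lazy ".*?"
# span the line breaks between the TIME field and the "Throttling" message.
EVENT_RE = re.compile(
    r"CPU (?P<cpu>\d+) THERMAL EVENT TSC [0-9a-f]+\s+"
    r"TIME (?P<epoch>\d+) .*?"
    r"Throttling (?P<state>enabled|disabled)",
    re.DOTALL,
)


def thermal_events(text):
    """Return a list of (cpu, utc_datetime, state) tuples found in mcelog text."""
    events = []
    for match in EVENT_RE.finditer(text):
        when = datetime.fromtimestamp(int(match.group("epoch")), tz=timezone.utc)
        events.append((int(match.group("cpu")), when, match.group("state")))
    return events


# Hypothetical sample: the two records from the excerpt above, trimmed.
sample = """\
CPU 12 THERMAL EVENT TSC da97d489bc000
TIME 1447816783 Wed Nov 18 03:19:43 2015
Processor 12 heated above trip temperature. Throttling enabled.
CPU 12 THERMAL EVENT TSC da97d48bb43f3
TIME 1447816783 Wed Nov 18 03:19:43 2015
Processor 12 below trip temperature. Throttling disabled
"""

for cpu, when, state in thermal_events(sample):
    print(f"{when.isoformat()} CPU {cpu}: throttling {state}")
```

The epoch value in the TIME field is used instead of the human-readable date next to it, so the result does not depend on the time zone the log was written in.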