We had some events where videoscalers were flapping in Icinga checks, like this:
10:15 < icinga-wm> RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.004 second response time
10:17 < icinga-wm> PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
10:18 < icinga-wm> RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 5.238 second response time
I checked on mw1168 and the load was around 19 with lots of "ffmpeg2theora" processes but things were moving along.
Then I saw this in syslog:
Mar 31 17:23:28 mw1168 kernel: [8710669.041549] CPU9: Package temperature/speed normal
Mar 31 17:23:28 mw1168 kernel: [8710669.049201] CPU5: Package temperature/speed normal
Mar 31 17:23:28 mw1168 kernel: [8710669.057568] CPU3: Package temperature/speed normal
Mar 31 17:23:42 mw1168 kernel: [8710682.380504] CPU12: Package temperature above threshold, cpu clock throttled (total events = 925538556)
Mar 31 17:23:42 mw1168 kernel: [8710682.380508] CPU6: Package temperature above threshold, cpu clock throttled (total events = 925537101)
Mar 31 17:23:42 mw1168 kernel: [8710682.380511] CPU14: Package temperature above threshold, cpu clock throttled (total events = 925540291)
Mar 31 17:23:42 mw1168 kernel: [8710682.380515] CPU18: Package temperature above threshold, cpu clock throttled (total events = 925544865)
Mar 31 17:23:42 mw1168 kernel: [8710682.380519] CPU0: Package temperature above threshold, cpu clock throttled (total events = 925533879)
Mar 31 17:23:42 mw1168 kernel: [8710682.380521] CPU2: Package temperature above threshold, cpu clock throttled (total events = 925537129)
Mar 31 17:23:42 mw1168 kernel: [8710682.380523] CPU30: Package temperature above threshold, cpu clock throttled (total events = 925542900)
Mar 31 17:23:42 mw1168 kernel: [8710682.380526] CPU28: Package temperature above threshold, cpu clock throttled (total events = 925541124)
Mar 31 17:23:42 mw1168 kernel: [8710682.380530] CPU4: Package temperature above threshold, cpu clock throttled (total events = 925537618)
Mar 31 17:23:42 mw1168 kernel: [8710682.380532] CPU16: Package temperature above threshold, cpu clock throttled (total events = 925544994)
Mar 31 17:23:42 mw1168 kernel: [8710682.380535] CPU22: Package temperature above threshold, cpu clock throttled (total events = 925545082)
Mar 31 17:23:42 mw1168 kernel: [8710682.380539] CPU24: Package temperature above threshold, cpu clock throttled (total events = 925542973)
Mar 31 17:23:42 mw1168 kernel: [8710682.381551] CPU6: Package temperature/speed normal
..
Mar 31 17:23:42 mw1168 kernel: [8710682.381569] CPU4: Package temperature/speed normal
Mar 31 17:23:42 mw1168 kernel: [8710682.381571] CPU30: Package temperature/speed normal
Mar 31 17:23:42 mw1168 kernel: [8710682.391615] CPU20: Package temperature/speed normal
Mar 31 17:23:42 mw1168 kernel: [8710682.522683] CPU8: Package temperature above threshold, cpu clock throttled (total events = 925538075)
Mar 31 17:23:42 mw1168 kernel: [8710682.563697] CPU26: Core temperature above threshold, cpu clock throttled (total events = 587254781)
Mar 31 17:23:42 mw1168 kernel: [8710682.575761] CPU26: Package temperature/speed normal
and
Mar 31 17:21:13 mw1168 kernel: [8710534.187738] mce: [Hardware Error]: Machine check events logged
Mar 31 17:21:13 mw1168 mcelog: Processor 3 heated above trip temperature. Throttling enabled.
Mar 31 17:21:13 mw1168 mcelog: Please check your system cooling. Performance will be impacted
Mar 31 17:21:13 mw1168 mcelog: Processor 3 below trip temperature. Throttling disabled
Mar 31 17:21:13 mw1168 mcelog: Processor 19 heated above trip temperature. Throttling enabled.
Mar 31 17:21:13 mw1168 mcelog: Please check your system cooling. Performance will be impacted
Mar 31 17:21:13 mw1168 mcelog: Processor 19 below trip temperature. Throttling disabled
Mar 31 17:22:28 mw1168 kernel: [8710609.162128] mce: [Hardware Error]: Machine check events logged
Mar 31 17:22:28 mw1168 mcelog: Processor 13 heated above trip temperature. Throttling enabled.
Mar 31 17:22:28 mw1168 mcelog: Please check your system cooling. Performance will be impacted
So which is it: does the host first get very busy, then warm, and then get throttled as a result? Or does it first get too warm and get throttled, and that is why it is slow? Maybe it makes sense to apply thermal paste here as well, like we did on other servers before?
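
One way to get at the ordering question would be to tally the kernel throttle messages per minute and line those counts up against the load history for the host. The following is a minimal sketch in Python, assuming the syslog line format shown above; the function name `count_throttle_events` and the regular expression are illustrative and not part of any existing tooling.

```python
import re
from collections import Counter

# Assumed format, taken from the syslog excerpt above, e.g.:
#   Mar 31 17:23:42 mw1168 kernel: [8710682.380504] CPU12: Package temperature
#   above threshold, cpu clock throttled (total events = 925538556)
THROTTLE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"\[\s*[\d.]+\]\s+CPU(?P<cpu>\d+):\s+"
    r"(?P<scope>Package|Core) temperature above threshold"
)


def count_throttle_events(lines):
    """Count 'temperature above threshold' messages per minute.

    Returns a Counter keyed by 'Mon DD HH:MM' (seconds dropped).
    Lines that do not match, such as 'temperature/speed normal', are ignored.
    """
    per_minute = Counter()
    for line in lines:
        match = THROTTLE_RE.match(line)
        if match:
            minute = match.group("ts")[:-3]  # strip the trailing ':SS'
            per_minute[minute] += 1
    return per_minute


if __name__ == "__main__":
    # Three lines copied from the excerpt above; in practice the input
    # would be the syslog file itself.
    sample = [
        "Mar 31 17:23:28 mw1168 kernel: [8710669.041549] CPU9: Package temperature/speed normal",
        "Mar 31 17:23:42 mw1168 kernel: [8710682.380504] CPU12: Package temperature above threshold, cpu clock throttled (total events = 925538556)",
        "Mar 31 17:23:42 mw1168 kernel: [8710682.563697] CPU26: Core temperature above threshold, cpu clock throttled (total events = 587254781)",
    ]
    for minute, count in sorted(count_throttle_events(sample).items()):
        print(minute, count)
```

If throttle messages also show up in minutes where the load is low, that would point towards a cooling problem rather than the transcoding load alone; if they only appear under heavy load, the first explanation looks more likely. Either way this only narrows things down and does not replace checking the hardware.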