Page MenuHomePhabricator

videoscalers (mw1168, mw1169) - high load / overheating
Closed, DeclinedPublic

Description

We had some events where videoscalers are flapping in Icinga checks like this:

10:15 < icinga-wm> RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.004 second response time
10:17 < icinga-wm> PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
10:18 < icinga-wm> RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 5.238 second response time

I checked on mw1168 and the load was around 19 with lots of "ffmpeg2theora" processes but things were moving along.

Then i saw this in syslog:

Mar 31 17:23:28 mw1168 kernel: [8710669.041549] CPU9: Package temperature/speed normal
Mar 31 17:23:28 mw1168 kernel: [8710669.049201] CPU5: Package temperature/speed normal
Mar 31 17:23:28 mw1168 kernel: [8710669.057568] CPU3: Package temperature/speed normal
Mar 31 17:23:42 mw1168 kernel: [8710682.380504] CPU12: Package temperature above threshold, cpu clock throttled (total events = 925538556)
Mar 31 17:23:42 mw1168 kernel: [8710682.380508] CPU6: Package temperature above threshold, cpu clock throttled (total events = 925537101)
Mar 31 17:23:42 mw1168 kernel: [8710682.380511] CPU14: Package temperature above threshold, cpu clock throttled (total events = 925540291)
Mar 31 17:23:42 mw1168 kernel: [8710682.380515] CPU18: Package temperature above threshold, cpu clock throttled (total events = 925544865)
Mar 31 17:23:42 mw1168 kernel: [8710682.380519] CPU0: Package temperature above threshold, cpu clock throttled (total events = 925533879)
Mar 31 17:23:42 mw1168 kernel: [8710682.380521] CPU2: Package temperature above threshold, cpu clock throttled (total events = 925537129)
Mar 31 17:23:42 mw1168 kernel: [8710682.380523] CPU30: Package temperature above threshold, cpu clock throttled (total events = 925542900)
Mar 31 17:23:42 mw1168 kernel: [8710682.380526] CPU28: Package temperature above threshold, cpu clock throttled (total events = 925541124)
Mar 31 17:23:42 mw1168 kernel: [8710682.380530] CPU4: Package temperature above threshold, cpu clock throttled (total events = 925537618)
Mar 31 17:23:42 mw1168 kernel: [8710682.380532] CPU16: Package temperature above threshold, cpu clock throttled (total events = 925544994)
Mar 31 17:23:42 mw1168 kernel: [8710682.380535] CPU22: Package temperature above threshold, cpu clock throttled (total events = 925545082)
Mar 31 17:23:42 mw1168 kernel: [8710682.380539] CPU24: Package temperature above threshold, cpu clock throttled (total events = 925542973)
Mar 31 17:23:42 mw1168 kernel: [8710682.381551] CPU6: Package temperature/speed normal
..
Mar 31 17:23:42 mw1168 kernel: [8710682.381569] CPU4: Package temperature/speed normal
Mar 31 17:23:42 mw1168 kernel: [8710682.381571] CPU30: Package temperature/speed normal
Mar 31 17:23:42 mw1168 kernel: [8710682.391615] CPU20: Package temperature/speed normal
Mar 31 17:23:42 mw1168 kernel: [8710682.522683] CPU8: Package temperature above threshold, cpu clock throttled (total events = 925538075)
Mar 31 17:23:42 mw1168 kernel: [8710682.563697] CPU26: Core temperature above threshold, cpu clock throttled (total events = 587254781)
Mar 31 17:23:42 mw1168 kernel: [8710682.575761] CPU26: Package temperature/speed normal

and

Mar 31 17:21:13 mw1168 kernel: [8710534.187738] mce: [Hardware Error]: Machine check events logged
Mar 31 17:21:13 mw1168 mcelog: Processor 3 heated above trip temperature. Throttling enabled.
Mar 31 17:21:13 mw1168 mcelog: Please check your system cooling. Performance will be impacted
Mar 31 17:21:13 mw1168 mcelog: Processor 3 below trip temperature. Throttling disabled
Mar 31 17:21:13 mw1168 mcelog: Processor 19 heated above trip temperature. Throttling enabled.
Mar 31 17:21:13 mw1168 mcelog: Please check your system cooling. Performance will be impacted
Mar 31 17:21:13 mw1168 mcelog: Processor 19 below trip temperature. Throttling disabled
Mar 31 17:22:28 mw1168 kernel: [8710609.162128] mce: [Hardware Error]: Machine check events logged
Mar 31 17:22:28 mw1168 mcelog: Processor 13 heated above trip temperature. Throttling enabled.
Mar 31 17:22:28 mw1168 mcelog: Please check your system cooling. Performance will be impacted

So is this first getting very busy and then warm and then throttled because of that. Or is it first getting too warm, then throttled and then that's why it's slow? Maybe it makes sense to also apply the thermal paste here like we did on other servers before?

Event Timeline

Dzahn renamed this task from videoscalers - high load / overheating to videoscalers (mw1168) - high load / overheating.Mar 31 2017, 5:30 PM
Dzahn added a subscriber: Joe.
Dzahn renamed this task from videoscalers (mw1168) - high load / overheating to videoscalers (mw1168, mw1169) - high load / overheating.Mar 31 2017, 5:49 PM

mw1169 is behaving the same way. It also has the throttling log entries due to temperature. Not randomly just on one of them.

We tried to lower down transcode_prioritized to 10 with https://gerrit.wikimedia.org/r/#/c/345829/ but it probably didn't help much. The issue appears only on mw116[89] and not on the other two videoscalers (less powerful and configured with less hhvm threads and transcodes), so it is probably due to hhvm getting loaded (~20) and start queueing (hhvmadm check-health can probably confirm this).

Change 345881 had a related patch set uploaded (by Dzahn; owner: Elukey):
[operations/puppet@production] Lower down total transcodes for mw116[89]

https://gerrit.wikimedia.org/r/345881

Change 345881 merged by Elukey:
[operations/puppet@production] Lower down total transcodes for mw116[89]

https://gerrit.wikimedia.org/r/345881

Note there are probably a lot of jobs in the non-prioritized queue still backed up last I looked; Note I also have a fill-in batch job on terbium that's throttled by queue length, which should be adding jobs only as fast as they get done. I can turn that queue job off if necessary.

earlier we thought that mw1259 (videoscaler but other hardware) was not affected, but...

13:12 < mutante> yea, why do we never see the issue with mw1259
..
16:05 < icinga-wm> PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
16:06 < icinga-wm> RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time

Note there are probably a lot of jobs in the non-prioritized queue still backed up last I looked; Note I also have a fill-in batch job on terbium that's throttled by queue length, which should be adding jobs only as fast as they get done. I can turn that queue job off if necessary.

From terbium I can see:

webVideoTranscode: 6380 queued; 195 claimed (162 active, 33 abandoned); 0 delayed
webVideoTranscodePrioritized: 0 queued; 69 claimed (6 active, 63 abandoned); 0 delayed

As far as I can see the problem seems to be related to hhvm getting all its thread busy and starting to queue requests, ending up in the errors reported by Icinga. We'd need to find a configuration for the jobrunners/scalers that allows peaks of load without getting saturated.

The high load may be caused by T161836 as @Revent has notice transcodes showing as 10 - 20 hours. He also found some videos was failing to transcode with errors like <Revent> When I investigated, they were all ‘source file missing’ in the db...

I reset the transcodes that had been running for egregiously long periods... this included ones running for 10+ hours that should have run in a minute or two. The transcodes then showed up in https://quarry.wmflabs.org/query/17726. The list was initially 90 or so... most have since 'reappeared', and I have then reset them to get the transcodes done.

This ticket has (at least) two independant tasks: (1) Fine-tuning the video scaler queue and (2) applying thermal paste to the video scalers (which turned out to be effective on one of the Hadoop workers last week). @elukey, is something still needed for (1) or are the additional, open tasks?

Yes there is some work to do for 1), I'll take care of it in a separate code review. For this particular issue, namely the videoscalers alarming, I am not sure what fixed it, since right after my patch some alarms were still firing. We haven't seen any alert for the past hours so I guess that we are good. The hosts still need to get the thermal paste but I believe this is something to discuss in the task to apply the new paste everywhere.

fgiunchedi triaged this task as Medium priority.Apr 10 2017, 1:02 PM

This is obsolete, these servers are gone for a while.