Rework job queue usage for TimedMediaHandler (video scalers)
Closed, Resolved · Public

Description

We've had a number of problems with the job queue usage for TimedMediaHandler on the video scalers:

  • the job queue's HHVM connections time out during long-running jobs, causing overscheduling that bogs down servers (a runner configured for 4 jobs may end up with 16 running)
  • the server pool is small and queue depth is extremely variable; most of the time there's little activity, but sometimes HUNDREDS of batch uploads flood the system with jobs, which get backed up
  • there's no prioritization when jobs are backed up -- everything goes first-come, first-served, meaning new uploads may have *no* transcodes while high-res transcode jobs are running for a batch upload or someone's attempt to re-encode old transcodes

I'm planning to retool the jobs for TMH to help fix/workaround these:

  • On a new upload, instead of queueing immediate jobs for each output format/resolution, queue a single job
  • When the job is processed, check in the transcode table how many transcodes are currently running -- if more than a threshold, stop right there!
  • If under the threshold, check the transcode table for the best next candidate -- if there are no candidates, stop. Prioritize by resolution and then upload time, so low-resolution versions always get done first and output becomes visible more quickly.
  • When processing is complete, queue another job.

This should avoid requiring any changes to the actual job queue system; while the long-running jobs may still time out, the threshold check will prevent the explosion of extra processes. The prioritization should help a lot with getting good user-visible behavior when we're flooded with large upload sets.
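For illustration, the flow above could be sketched roughly like this -- Python pseudocode with hypothetical names (the table/queue objects are stand-ins, not the real TMH schema or the MediaWiki JobQueue API):

```python
RUNNING_THRESHOLD = 4  # illustrative; the real limit comes from runner config

def pick_next(pending):
    """Best next candidate: lowest resolution first, then oldest upload,
    so small outputs become user-visible quickly."""
    if not pending:
        return None
    return min(pending, key=lambda t: (t["resolution"], t["upload_time"]))

def run_transcode_job(state, transcode, queue_job):
    """One queued job: throttle, pick a candidate, process it, re-queue."""
    if len(state["running"]) >= RUNNING_THRESHOLD:
        return  # over threshold: stop right there
    best = pick_next(state["pending"])
    if best is None:
        return  # no candidates: stop
    state["pending"].remove(best)
    state["running"].append(best)
    transcode(best)
    state["running"].remove(best)
    # when processing is complete, queue another job
    queue_job()
```

The key point is that the job itself decides whether to run, so the job queue never needs to know about videoscaler capacity.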

Related Objects

brion created this task. Jan 11 2017, 5:07 PM
Restricted Application added a subscriber: Aklapper. Jan 11 2017, 5:07 PM
Yann added a subscriber: Yann. Jan 11 2017, 5:12 PM
brion added a subscriber: Joe. Jan 11 2017, 5:13 PM
Revent added a subscriber: Revent. Jan 11 2017, 5:18 PM
brion added a comment. Jan 11 2017, 5:19 PM

Note -- need to be able to trigger additional jobs at _start_ not just at _end_ to make sure we're multitasking. :)

brion added a comment. Jan 11 2017, 5:50 PM

joe informs me the timeout problem is solved, which should help reduce the need to check for thresholds. :D

We may want to divide this into high-priority and low-priority queues; it's not too hard to set them up.
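A hypothetical sketch of what that routing could look like. The "webVideoTranscodePrioritized" job type is the one that later shows up on the Grafana dashboard; the resolution cutoff here is purely an assumption for illustration, not the actual rule in the patches below:

```python
PRIORITY_CUTOFF = 480  # assumed: small derivatives go to the fast queue

def job_type_for(height):
    """Route quick low-res transcodes to the prioritized queue and
    expensive high-res ones to the default (low-priority) queue."""
    if height <= PRIORITY_CUTOFF:
        return "webVideoTranscodePrioritized"
    return "webVideoTranscode"
```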

Change 331662 had a related patch set uploaded (by Brion VIBBER):
TMH job queue split into low and high priority

https://gerrit.wikimedia.org/r/331662

Change 331668 had a related patch set uploaded (by Brion VIBBER):
Split TMH transcode queue into two for prioritization

https://gerrit.wikimedia.org/r/331668

Change 331669 had a related patch set uploaded (by Brion VIBBER):
Exclude new high-priority video transcode jobs from default queue

https://gerrit.wikimedia.org/r/331669

Change 331669 merged by jenkins-bot:
Exclude new high-priority video transcode jobs from default queue

https://gerrit.wikimedia.org/r/331669

elukey added a subscriber: elukey. Jan 11 2017, 11:19 PM
zhuyifei1999 added a subscriber: zhuyifei1999.

Uh so what's the ETA for this?

Yann added a comment. Jan 18 2017, 9:01 PM

joe informs me the timeout problem is solved, which should help reduce the need to check for thresholds. :D

I don't see the timeout problem as solved. All big transcodes (720p, 1080p, and long ones) still fail.
And we have 360,000+ failed transcodes, and growing.

I don't see the timeout problem as solved. All big transcodes (720p, 1080p, and long ones) still fail.
And we have 360,000+ failed transcodes, and growing.

I'm pretty sure he meant the apache timeout (T153488#2887091), not the transcode timeout.

Joe added a comment. Jan 19 2017, 7:17 AM

joe informs me the timeout problem is solved, which should help reduce the need to check for thresholds. :D

I don't see the timeout problem as solved. All big transcodes (720p, 1080p, and long ones) still fail.
And we have 360,000+ failed transcodes, and growing.

Transcodes might fail for a number of reasons, not necessarily timeouts. In fact, any transcode that takes less than one day to run should not fail because of issues in the way the videoscalers are set up; moreover, while monitoring TMH/videoscalers, I've seen several hundred long videos (~1 hour plus) successfully transcoded to 1080p/720p, which mostly proves the problem is no longer system-level timeouts.

Joe added a comment. Jan 19 2017, 7:19 AM

@brion before the change to TMH goes into production, we also need to tweak the jobrunner setup in operations/puppet.

Transcodes might fail for a number of reasons, not necessarily timeouts. In fact, any transcode that takes less than one day to run should not fail because of issues in the way the videoscalers are set up; moreover, while monitoring TMH/videoscalers, I've seen several hundred long videos (~1 hour plus) successfully transcoded to 1080p/720p, which mostly proves the problem is no longer system-level timeouts.

FYI: We still have a transcode background time limit in TMH ($wgTranscodeBackgroundTimeLimit).

Yann added a comment. Jan 19 2017, 1:21 PM

FYI: We still have a transcode background time limit in TMH ($wgTranscodeBackgroundTimeLimit).

That's the reason most big transcodes fail now.
I think they should pass with a longer timeout (16h or 24h).
Is there any reason not to allow a longer timeout?

brion added a comment. Jan 19 2017, 5:09 PM

@Joe the puppet changes are in https://gerrit.wikimedia.org/r/#/c/331668/ -- can you take a look and see if it makes sense?

brion added a comment. Jan 19 2017, 5:20 PM

@Yann can you break out a separate task for raising the background time limit? I want to make sure we don't lose that. Currently most of the failures I'm seeing aren't reporting back what their errors were, which makes it tricky to see what's going on; I'll have to poke around and find out why.

Yann added a comment. Jan 19 2017, 5:36 PM

@Yann can you break out a separate task for raising the background time limit? I want to make sure we don't lose that. Currently most of the failures I'm seeing aren't reporting back what their errors were, which makes it tricky to see what's going on; I'll have to poke around and find out why.

OK done in T155750.

brion added a comment. Jan 19 2017, 5:36 PM

(OK, I can find the error messages; they're just not linked from the popup anymore. Weird.) Yeah, I see those "Exitcode: 137" entries, which indicates a SIGKILL... but the times don't seem to match up with a strict 8-hour limit. I see some apparently killed at 6 hours or so, others sooner. Actually I think they're dying on the memory limit -- "Exitcode: 137 / Memory: 4194304" etc.
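(As a sanity check on reading those codes: shells report 128 + signal number when a process dies from a signal, so 137 decodes to signal 9, SIGKILL. A quick illustration:)

```python
import signal

def signal_from_status(status):
    """Decode a shell-style exit status: values above 128 mean the
    process died from signal (status - 128)."""
    if status > 128:
        return signal.Signals(status - 128).name
    return None

print(signal_from_status(137))  # SIGKILL
```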

brion added a comment. Jan 19 2017, 5:39 PM

Ah that 'memory' is just the memory limit being reported. That doesn't help us know why it died. Sigh.

brion added a comment. Jan 19 2017, 6:02 PM

Memory usage looks sane enough after logging in... I do see a few processes at 240+ minutes of CPU time; the longest running now is at 346 minutes.

It also occurs to me that ulimit limits *CPU time*, not wall-clock time, which may explain why the limit seems to go off prematurely -- ffmpeg uses multithreading to do some operations in parallel (at least decode / scale / encode, and there may or may not be some parallelization in the final encode), so CPU usage sits around 175% rather than 100% on these jobs.

Ok switching over to T155750 for that. :)
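The arithmetic behind the "premature" kills, assuming an 8 CPU-hour budget (the 8-hour figure and ~175% usage come from the comments above; the rest is a back-of-the-envelope sketch, not measured data):

```python
CPU_LIMIT_HOURS = 8  # assumed RLIMIT_CPU-style budget, in CPU-hours
CPU_USAGE = 1.75     # ~175% observed across ffmpeg's threads

# RLIMIT_CPU counts CPU seconds summed over all threads, so the
# wall-clock time before the limit trips is the budget divided by
# the parallelism factor.
wall_clock_hours = CPU_LIMIT_HOURS / CPU_USAGE
print(round(wall_clock_hours, 2))  # about 4.57 hours
```

That lines up with jobs dying well short of 8 wall-clock hours, with the exact point varying with how much parallelism each encode achieves.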

Yann added a comment. Jan 22 2017, 10:18 AM
This comment was removed by Yann.
Yann awarded a token. Feb 2 2017, 2:52 PM

Change 331668 merged by Filippo Giunchedi:
Split TMH transcode queue into two for prioritization

https://gerrit.wikimedia.org/r/331668

Change 331662 merged by jenkins-bot:
TMH job queue split into low and high priority

https://gerrit.wikimedia.org/r/331662

Change 336846 had a related patch set uploaded (by Brion VIBBER):
TMH job queue split into low and high priority

https://gerrit.wikimedia.org/r/336846

brion added a comment. Feb 9 2017, 5:41 PM

OK, the puppet changes went out at this morning's Puppet SWAT; the MediaWiki side is ready to go for the 11am Pacific MediaWiki SWAT.

Change 336846 merged by jenkins-bot:
TMH job queue split into low and high priority

https://gerrit.wikimedia.org/r/336846

Mentioned in SAL (#wikimedia-operations) [2017-02-09T19:47:56Z] <thcipriani@tin> Synchronized php-1.29.0-wmf.11/extensions/TimedMediaHandler/TimedMediaHandler.php: SWAT: [[gerrit:336846|TMH job queue split into low and high priority]] PART I T155098 (duration: 00m 41s)

Mentioned in SAL (#wikimedia-operations) [2017-02-09T19:49:02Z] <thcipriani@tin> Synchronized php-1.29.0-wmf.11/extensions/TimedMediaHandler/TimedMediaHandler.hooks.php: SWAT: [[gerrit:336846|TMH job queue split into low and high priority]] PART II T155098 (duration: 00m 40s)

Mentioned in SAL (#wikimedia-operations) [2017-02-09T19:50:04Z] <thcipriani@tin> Synchronized php-1.29.0-wmf.11/extensions/TimedMediaHandler: SWAT: [[gerrit:336846|TMH job queue split into low and high priority]] PART III T155098 (duration: 00m 44s)

brion closed this task as "Resolved". Feb 9 2017, 8:06 PM

MW changes are live. Existing files will continue to process on the low-prio queue. I'll watch the server load for the next couple days and see if we need to change the queue runner balance between the two.

Joe added a comment. Feb 10 2017, 6:54 AM

The prioritized queue is working well, but I'll probably raise the number of non-prioritized workers today as we're now underutilizing the systems.

Good news:

https://grafana.wikimedia.org/dashboard/db/job-queue-health?var-jobType=webVideoTranscodePrioritized

shows the prioritized jobs have a 99th-percentile wait time below 1 hour, with most waits on the order of a few minutes. This is a decisive improvement over what we had before.

Change 337230 had a related patch set uploaded (by Brion VIBBER):
Bump up number of queue runners for transcodes

https://gerrit.wikimedia.org/r/337230

Change 337230 merged by Giuseppe Lavagetto:
Bump up number of queue runners for transcodes

https://gerrit.wikimedia.org/r/337230