We've had a number of problems with how TimedMediaHandler uses the job queue on the video scalers:
- the job queue's HHVM connections time out during long-running jobs, causing jobs to be rescheduled while the originals are still running; this overscheduling bogs down the servers (a scaler set to run 4 jobs may end up with 16 running)
- small server pool and extremely variable queue depth; most of the time there's little activity, but sometimes HUNDREDS of batch uploads flood the system with jobs, which then get backed up
- no prioritization when jobs are backed up -- everything is first-come, first-served, meaning a new upload may get *no* transcodes at all while high-res transcode jobs are running for a batch upload or for someone's attempt to re-encode old transcodes
I'm planning to retool the TMH jobs to help fix or work around these (a rough sketch of the job logic follows the list):
- On a new upload, instead of queueing immediate jobs for each output format/resolution, queue a single job
- When the job is processed, check the transcode table for how many transcodes are currently running -- if more than a threshold, stop right there!
- If under the threshold, check the transcode table for the best next candidate -- if there are no candidates, stop. Prioritize by resolution, then by upload time, so low-resolution versions always finish first and output becomes visible more quickly.
- When processing is complete, queue another job.
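Here's a minimal sketch of that job logic in Python, assuming a hypothetical in-memory transcode table; `TranscodeRow`, `TranscodeTable`, `MAX_RUNNING`, `transcode`, and `queue_job` are all illustrative stand-ins, not real TimedMediaHandler or MediaWiki APIs:

```python
# Hypothetical sketch only: names below are illustrative stand-ins,
# not actual TimedMediaHandler/MediaWiki APIs.
from dataclasses import dataclass

MAX_RUNNING = 4  # assumed concurrency threshold for the scaler pool


@dataclass
class TranscodeRow:
    file: str
    height: int            # output resolution (240, 480, 720, ...)
    upload_time: float     # when the source file was uploaded
    state: str = "queued"  # queued | running | done


class TranscodeTable:
    """Stand-in for the real transcode tracking table."""

    def __init__(self) -> None:
        self.rows: list[TranscodeRow] = []

    def running_count(self) -> int:
        return sum(1 for r in self.rows if r.state == "running")

    def next_candidate(self) -> TranscodeRow | None:
        queued = [r for r in self.rows if r.state == "queued"]
        if not queued:
            return None
        # Lowest resolution first, then oldest upload: every new upload
        # gets a playable low-res file before any batch gets its high-res.
        return min(queued, key=lambda r: (r.height, r.upload_time))


def transcode(row: TranscodeRow) -> None:
    """Placeholder for the long-running ffmpeg work."""


def run_transcode_job(table: TranscodeTable, queue_job) -> None:
    """One queued job: do at most one transcode, then chain another job."""
    if table.running_count() >= MAX_RUNNING:
        return  # over threshold: stop right there, no extra processes
    row = table.next_candidate()
    if row is None:
        return  # nothing left to do; the chain of jobs ends here
    row.state = "running"
    transcode(row)
    row.state = "done"
    # Completion queues exactly one follow-up job, so concurrency stays
    # bounded by the threshold rather than by the size of the backlog.
    queue_job(run_transcode_job)
```

The point of the `(height, upload_time)` sort key is that a fresh upload's low-res job outranks a batch upload's high-res jobs, so something watchable appears quickly even when the queue is flooded. On a new upload we'd insert one row per output format/resolution and queue a single job; each completion chains exactly one more.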
This should avoid requiring any changes to the actual job queue system; the long-running jobs may still time out, but the threshold check will prevent the explosion of extra processes. The prioritization should go a long way toward good user-visible behavior when we're flooded with large upload sets.