The MediaWiki job queue is a poor fit for long-running transcode processes, and is notoriously fragile (it fails to update the status table in many failure cases). Experiment with refactoring the transcode jobs so they can run from a (PHP-based) 'microservice' that can more reliably be checked in on.
Is this just a problem with timeouts being too low in HHVM, or do you want some kind of RPC mechanism where we can know whether the process that claimed a job is still running?
Are any of the failures internal problems in the Job subclasses (low timeouts, moving large files around for no reason, jobs that should be split up, etc.)? If that's the case it might just be easier to rewrite those.
It'd be nice to have a more general RPC status check; the Job subclasses maintain that information in the 'transcode' database table -- e.g. if ffmpeg crashes, execution returns to the PHP Job subclass and it updates the table with error info -- but when the timeout hits at the HHVM level, the cleanup code never gets a chance to run. :(
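One way to get that "is the claimer still alive?" signal without trusting the worker's own cleanup path is a heartbeat/lease scheme: the worker periodically touches its row, and a separate monitor marks any job whose heartbeat has gone stale as failed. A minimal sketch of the idea (illustrative Python with an in-memory table standing in for the 'transcode' table; all names here are hypothetical, not MediaWiki APIs):

```python
import time

# How long a running job may go without a heartbeat before a monitor
# assumes its worker was killed (e.g. by a runtime-level timeout).
LEASE_SECONDS = 60

# Stand-in for the status table: job_id -> {"status", "last_heartbeat"}.
jobs = {}

def claim_job(job_id, now=None):
    """Worker claims a job and records an initial heartbeat."""
    now = time.time() if now is None else now
    jobs[job_id] = {"status": "running", "last_heartbeat": now}

def heartbeat(job_id, now=None):
    """Worker calls this periodically while the transcode runs."""
    now = time.time() if now is None else now
    jobs[job_id]["last_heartbeat"] = now

def reap_stale_jobs(now=None):
    """Monitor: fail any running job whose heartbeat is stale.
    This runs in a different process, so it still works when the
    worker is killed before its own cleanup code can execute."""
    now = time.time() if now is None else now
    failed = []
    for job_id, job in jobs.items():
        if job["status"] == "running" and now - job["last_heartbeat"] > LEASE_SECONDS:
            job["status"] = "failed"
            failed.append(job_id)
    return failed
```

The point of the design is that the failure path no longer depends on code running inside the doomed worker process, which is exactly the case the HHVM timeout breaks.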
For now it looks like we've got the timeouts sorted out, so I'm putting this one on the back burner.