
Increase transcode background time limit
Closed, Resolved, Public

Description

Most large transcodes (720p, 1080p, and long videos) now fail because the transcode background time limit in TimedMediaHandler (TMH), $wgTranscodeBackgroundTimeLimit, is set to 8h.
I think they should pass with a longer timeout (16h or 24h).

Examples of failed transcodes from https://commons.wikimedia.org/wiki/Special:TimedMediaHandler :
https://commons.wikimedia.org/wiki/File:President_Obama_Meets_with_the_Export_Council.webm 2 h 5 min 41 s, 1,280 × 720 (2.43 GB) -> WebM 360P, 480P, and 720p failed
https://commons.wikimedia.org/wiki/File:CIA_Director_Brennan_Press_Conference.webm 46 min 11 s, 1,920 × 1,080 (524.17 MB) -> WebM 1080P failed
https://commons.wikimedia.org/wiki/File:Awaara_(1951).webm 2 h 43 min 39 s, 1,920 × 1,080 (3.11 GB) -> WebM 480P, 720p and 1080p failed
https://commons.wikimedia.org/wiki/File:Mahal_(1949).webm 2 h 21 min 37 s, 1,200 × 720 (1.14 GB) -> WebM 480P and 720p failed

Event Timeline

Yann renamed this task from Increase background time limit to Increase transcode background time limit. Jan 19 2017, 5:36 PM

I was a bit baffled why the timeouts seem to be happening significantly before the 8-hour limit is hit, but it turns out ulimit is based on *CPU time*, not *wall-clock time*. Since decode, scaling, and re-encoding run partly in parallel, the CPU usage is around 175% on these ffmpeg processes, not a 'mere' 100%, so we'll hit an 8-hour CPU-time limit after only 4-6 hours of wall-clock time.
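To make the arithmetic concrete, here is a rough Python sketch of how a CPU-time limit translates into wall-clock time. The function name is made up for illustration, and the 175% figure is just the approximate usage observed above, so treat the output as an estimate rather than a measurement.

```python
def wall_clock_hours_until_cpu_limit(cpu_limit_hours: float,
                                     cpu_utilization: float) -> float:
    """Estimate wall-clock hours elapsed when a CPU-time limit is reached.

    cpu_utilization is the average CPU usage as a fraction of one core
    (1.75 means 175%). This helper is hypothetical, not part of TMH.
    """
    if cpu_utilization <= 0:
        raise ValueError("cpu_utilization must be positive")
    # CPU time accrues at cpu_utilization seconds per wall-clock second,
    # so the limit is reached after limit / utilization wall-clock hours.
    return cpu_limit_hours / cpu_utilization


if __name__ == "__main__":
    # Assumed utilization of ~175%, taken from the observation above.
    for limit in (8, 16):
        hours = wall_clock_hours_until_cpu_limit(limit, 1.75)
        print(f"{limit}h CPU-time limit ~ {hours:.2f}h wall-clock")
```

Under that assumption an 8h CPU-time limit is reached after roughly 4.6 hours of wall-clock time, and a 16h limit after roughly 9.1 hours.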

Just cutting these processes off is wasteful, since we lose all the encoding time already spent, so I recommend bumping the limit up so that it matches the intended wall-clock time.

Change 333035 had a related patch set uploaded (by Brion VIBBER):
Double $wgTranscodeBackgroundTimeLimit to compensate for threading

https://gerrit.wikimedia.org/r/333035

(patch in the works to double the timeout based on our threading setting)

Change 333035 merged by jenkins-bot:
Double $wgTranscodeBackgroundTimeLimit to compensate for threading

https://gerrit.wikimedia.org/r/333035

Ok, this is merged and live as of today's SWAT updates. Already-running jobs will still have the lower limit and may still time out, but those that start from now on should have a doubled time limit, which will be more in line with wall-clock time and should avoid timing out on most of the 1-2 hour 720p/1080p videos.

Now there are a few transcodes with time over 8h, but some are still failing: https://commons.wikimedia.org/wiki/File:Janmabhoomi,_1936.webm
https://quarry.wmflabs.org/query/15684
Exitcode: 137
startwork = 20170121151803, error = 20170121235548
8 hours
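For reference, a small Python sketch of how the numbers above can be read. It assumes the startwork/error values are UTC timestamps in the 14-digit YYYYMMDDHHMMSS form, and it relies on the common shell convention that an exit code above 128 means "killed by signal (code - 128)". The helper names are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed format of the startwork/error values: YYYYMMDDHHMMSS.
TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"


def elapsed_between(start: str, end: str) -> timedelta:
    """Wall-clock time between two 14-digit timestamps."""
    return (datetime.strptime(end, TIMESTAMP_FORMAT)
            - datetime.strptime(start, TIMESTAMP_FORMAT))


def signal_number_from_exit_code(exit_code: int) -> Optional[int]:
    """Return the signal number implied by a shell-style exit code.

    By convention, codes above 128 mean the process was terminated by
    signal (exit_code - 128); otherwise None is returned.
    """
    return exit_code - 128 if exit_code > 128 else None


if __name__ == "__main__":
    print(elapsed_between("20170121151803", "20170121235548"))
    # Signal 9 is SIGKILL on Linux.
    print(signal_number_from_exit_code(137))
```

With the values quoted above this gives 8:37:45 of wall-clock time and signal 9 (SIGKILL on Linux), which would be consistent with the process being killed from outside rather than ffmpeg failing on its own, though that reading is an inference and not something the log states.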