Video transcoding on Commons does not work
Closed, Resolved
Public

Description

Transcoding of newly uploaded videos to derivative formats stopped yesterday.

Event Timeline

Pristurus raised the priority of this task from to Medium.
Pristurus updated the task description.
Pristurus subscribed.

Hi @Pristurus,
unfortunately this report is not very useful because it does not describe the problem well. If you have time and can still reproduce the problem, please add a more useful and complete description to this report.

Aklapper raised the priority of this task from Medium to Needs Triage. Apr 16 2015, 12:37 PM

Sorry, as an ordinary Commons and Wikipedia user I am not very familiar with Phabricator. Please have a look at https://commons.wikimedia.org/wiki/File:VID_20150327_164408.webm. In the "Transcode status" section of the file page you will find something like "Added to Job queue 1 day, 8 hours, 39 minutes, 3 seconds ago". Clicking "Reset transcode" doesn't solve the problem, and all videos uploaded since then show the same behaviour.

Tgr triaged this task as High priority. Apr 16 2015, 5:44 PM
Tgr added a project: TimedMediaHandler.
Tgr added a project: Commons.
Tgr subscribed.
tgr@terbium:~$ mwscript maintenance/showJobs.php --wiki=commonswiki --type=webVideoTranscode --group
webVideoTranscode: 256 queued; 89 claimed (0 active, 89 abandoned); 0 delayed
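
For anyone wanting to watch for this condition automatically, here is a minimal sketch that parses a summary line like the one above and flags a queue with waiting jobs but no active runner. The regular expression is derived only from the sample output quoted here, and `parse_show_jobs_line` and `looks_stalled` are hypothetical helper names, not part of MediaWiki.

```python
import re

# Pattern for one summary line of `showJobs.php --group`, based solely on the
# sample output quoted in this task; other versions may format it differently.
LINE_RE = re.compile(
    r"^(?P<type>\w+): (?P<queued>\d+) queued; (?P<claimed>\d+) claimed "
    r"\((?P<active>\d+) active, (?P<abandoned>\d+) abandoned\); "
    r"(?P<delayed>\d+) delayed$"
)


def parse_show_jobs_line(line):
    """Return (job_type, counts) for one summary line; hypothetical helper."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognised showJobs line: %r" % line)
    fields = match.groupdict()
    job_type = fields.pop("type")
    return job_type, {name: int(value) for name, value in fields.items()}


def looks_stalled(counts):
    """Assumed heuristic: jobs are waiting but no runner is working on any."""
    return counts["queued"] > 0 and counts["active"] == 0


sample = "webVideoTranscode: 256 queued; 89 claimed (0 active, 89 abandoned); 0 delayed"
job_type, counts = parse_show_jobs_line(sample)
print(job_type, counts, "stalled" if looks_stalled(counts) else "ok")
```

The heuristic is deliberately crude: a queue can briefly show zero active jobs between runs, so a real check would want to sample more than once.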

Transcoding died on Wednesday around 11h UTC: http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=Video+scalers+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report

Nothing relevant in the logs when grepping for TimedMediaHandler and MwEmbedSupport in /a/mw-log/*.log on fluorine, and none of the 1.25wmf24 changes in core, TMH or mwEmbed seem relevant (the timing would also not match). No puppet or site config changes jump out either.

The job runner log has a bunch of errors but they are not so useful:

/a/mw-log/runJobs.log:11618172:2015-04-17 15:06:22 terbium commonswiki: webVideoTranscode File:Héron_cendré_(_Ardea_cinerea).webm transcodeMode=derivative transcodeKey=480p.webm (uuid=1c8270c7cd5045998df108767c5a8f52,QueuePartition=rdb2) t=179 error='/usr/bin/avconv' -y -i '/tmp/localcopy_e28d4167377d-1.webm' -threads 2 -skip_threshold 0 -bufsize 6000k -rc_init_occupancy 4000 -qmin 1 -qmax 51 -vb '1024000' -vcodec libvpx -g '128' -keyint_min '128' -f webm -s 854x480 -an -pass '1' -passlogfile '/tmp/transcode_480p.webm070712451c16-1.webm.log' /dev/null
/a/mw-log/runJobs.log:11618133:2015-04-17 15:06:21 terbium commonswiki: webVideoTranscode File:Kroz_srednju_Bosnu_073534.webm transcodeMode=derivative transcodeKey=480p.ogv (uuid=d6279cf5f86e47f6b5e8bdbe0bf4514f,QueuePartition=rdb1) t=317 error='/usr/bin/ffmpeg2theora' '/tmp/localcopy_da37953e0d7a-1.webm' -V '1024' -a '2' -H '44100' -c '2' --no-upscaling --keyint '128' --buf-delay '256' --width '848' --height '480' --aspect '848:480' -o '/tmp/transcode_480p.ogv577d5677f0a6-1.ogv'

There are no recent logs in /tmp on tmh1001/tmh1002. mw-runJobs-backoffs.json contains {"webVideoTranscode":1429025552.3145} (which is Apr 14 15h UTC) so maybe that's the real time of failure and the rest of the activity on ganglia is just existing jobs running out?
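
The date conversion above can be double-checked with a short sketch that reads the JSON quoted from mw-runJobs-backoffs.json and interprets the value as a Unix timestamp in UTC. Treating the value as seconds since the epoch is an assumption, although it matches the date given above.

```python
import json
from datetime import datetime, timezone

# Contents of mw-runJobs-backoffs.json as quoted in this task.
raw = '{"webVideoTranscode":1429025552.3145}'

backoffs = json.loads(raw)
# Assumption: the value is a Unix timestamp (seconds since 1970-01-01 UTC).
when = datetime.fromtimestamp(backoffs["webVideoTranscode"], tz=timezone.utc)
print(when.strftime("%Y-%m-%d %H:%M:%S %Z"))  # prints 2015-04-14 15:32:32 UTC
```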

/var/log/mediawiki/jobrunner.log on the scalers is full of

Fatal error: Uncaught exception 'Exception' with message 'Could not parse JSON file '/etc/jobrunner/jobrunner.conf'.' in /srv/deployment/jobrunner/jobrunner/src/RedisJobService.php:91
Stack trace:
#0 /srv/deployment/jobrunner/jobrunner/redisJobRunnerService(186): RedisJobService::init(Array)
#1 {main}
  thrown in /srv/deployment/jobrunner/jobrunner/src/RedisJobService.php on line 91

(/etc/jobrunner/jobrunner.conf seems okay, though.) That's for Apr 16-17; there are no logfiles for 14-15, and the older files have no errors.

In T96236#1215977, @Tgr wrote:

/etc/jobrunner/jobrunner.conf seems okay, though.

It's not; there is an extra comma, added in https://gerrit.wikimedia.org/r/#/c/203379/4/modules/mediawiki/templates/jobrunner/jobrunner.conf.erb.
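
This kind of error is easy to miss by eye, because many languages tolerate a stray comma while strict JSON does not. Below is a minimal sketch of a check that could be run on a rendered config before deploying it. The two config fragments are invented for illustration and are not the real jobrunner.conf; the trailing-comma placement is likewise an assumption, and `is_valid_json` is a hypothetical helper.

```python
import json

# Hypothetical fragments for illustration only; NOT the real jobrunner.conf.
valid_conf = '{"groups": {"transcode": {"runners": 2}}}'
broken_conf = '{"groups": {"transcode": {"runners": 2},}}'  # stray comma before "}"


def is_valid_json(text):
    """Return True if text parses as strict JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


for name, text in (("valid_conf", valid_conf), ("broken_conf", broken_conf)):
    print(name, "parses" if is_valid_json(text) else "does NOT parse")
```

Python's parser is used here only as a stand-in for any strict JSON parser; the job runner itself is PHP and reported the failure as "Could not parse JSON file".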

Change 204815 had a related patch set uploaded (by Gergő Tisza):
Fix invalid JSON in job runner config

https://gerrit.wikimedia.org/r/204815

Change 204815 merged by Ori.livneh:
Fix invalid JSON in job runner config

https://gerrit.wikimedia.org/r/204815

Tgr claimed this task.

Re-running failed jobs with

tgr@terbium:~$ mwscript eval.php --wiki=commonswiki
> $iter = JobQueueGroup::singleton()->get( 'webVideoTranscode' )->getAllAbandonedJobs();
> foreach ( $iter as $job ) { JobQueueGroup::singleton()->push( $job ); }

There are 73 abandoned jobs and 49 uploads since the problem started, so that should cover all failed transcodes.

Assuming this is fixed. Please reopen if you still see files that cannot be fixed via the "Reset transcode" link, or if lots of files remain broken by the end of the week.

CC-ing Victor who complained about a video earlier. Should be all good now.