The transcoding of newly uploaded videos to derivative formats stopped yesterday.
Description
Details
Subject | Repo | Branch | Lines +/- |
---|---|---|---|
Fix invalid JSON in job runner config | operations/puppet | production | +1 -1 |
Event Timeline
Hi @Pristurus,
unfortunately this report is not very useful because it does not describe the problem well. If you have time and can still reproduce the problem, please add a more useful and complete description to this report.
Sorry, as an ordinary Commons and Wikipedia user I am not very familiar with Phabricator. Please have a look at https://commons.wikimedia.org/wiki/File:VID_20150327_164408.webm. In the "Transcode status" section of the file page you will find something like "Added to Job queue 1 day, 8 hours, 39 minutes, 3 seconds ago". Clicking "Reset transcode" doesn't solve the problem, and all videos uploaded since then show the same behaviour.
tgr@terbium:~$ mwscript maintenance/showJobs.php --wiki=commonswiki --type=webVideoTranscode --group
webVideoTranscode: 256 queued; 89 claimed (0 active, 89 abandoned); 0 delayed
Transcoding died on Wednesday around 11h UTC: http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=Video+scalers+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report
Nothing relevant in the logs when grepping for TimedMediaHandler and MwEmbedSupport in /a/mw-log/*.log on fluorine, and none of the 1.25wmf24 changes in core, TMH or mwEmbed seem relevant (the timing would also not match). No puppet or site config changes jump out either.
The job runner log has a bunch of errors but they are not so useful:
/a/mw-log/runJobs.log:11618172:2015-04-17 15:06:22 terbium commonswiki: webVideoTranscode File:Héron_cendré_(_Ardea_cinerea).webm transcodeMode=derivative transcodeKey=480p.webm (uuid=1c8270c7cd5045998df108767c5a8f52,QueuePartition=rdb2) t=179 error='/usr/bin/avconv' -y -i '/tmp/localcopy_e28d4167377d-1.webm' -threads 2 -skip_threshold 0 -bufsize 6000k -rc_init_occupancy 4000 -qmin 1 -qmax 51 -vb '1024000' -vcodec libvpx -g '128' -keyint_min '128' -f webm -s 854x480 -an -pass '1' -passlogfile '/tmp/transcode_480p.webm070712451c16-1.webm.log' /dev/null
/a/mw-log/runJobs.log:11618133:2015-04-17 15:06:21 terbium commonswiki: webVideoTranscode File:Kroz_srednju_Bosnu_073534.webm transcodeMode=derivative transcodeKey=480p.ogv (uuid=d6279cf5f86e47f6b5e8bdbe0bf4514f,QueuePartition=rdb1) t=317 error='/usr/bin/ffmpeg2theora' '/tmp/localcopy_da37953e0d7a-1.webm' -V '1024' -a '2' -H '44100' -c '2' --no-upscaling --keyint '128' --buf-delay '256' --width '848' --height '480' --aspect '848:480' -o '/tmp/transcode_480p.ogv577d5677f0a6-1.ogv'
There are no recent logs in /tmp on tmh1001/tmh1002. mw-runJobs-backoffs.json contains {"webVideoTranscode":1429025552.3145} (which is Apr 14 15h UTC) so maybe that's the real time of failure and the rest of the activity on ganglia is just existing jobs running out?
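The date can be double-checked by converting the backoff timestamp to UTC. This is only a minimal sketch: the JSON string is the value quoted above, while the function and variable names are made up for illustration.

```python
import json
from datetime import datetime, timezone

# Content of mw-runJobs-backoffs.json as quoted in this comment.
backoffs = json.loads('{"webVideoTranscode":1429025552.3145}')


def backoff_time_utc(job_type):
    """Return the backoff timestamp for job_type as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(backoffs[job_type], tz=timezone.utc)


# Prints the timestamp in a human-readable form.
print(backoff_time_utc("webVideoTranscode").strftime("%Y-%m-%d %H:%M:%S UTC"))
```

The result falls on Apr 14 in the 15h UTC hour, which agrees with the reading above.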
/var/log/mediawiki/jobrunner.log on the scalers is full of
Fatal error: Uncaught exception 'Exception' with message 'Could not parse JSON file '/etc/jobrunner/jobrunner.conf'.' in /srv/deployment/jobrunner/jobrunner/src/RedisJobService.php:91
Stack trace:
#0 /srv/deployment/jobrunner/jobrunner/redisJobRunnerService(186): RedisJobService::init(Array)
#1 {main}
thrown in /srv/deployment/jobrunner/jobrunner/src/RedisJobService.php on line 91
(/etc/jobrunner/jobrunner.conf seems okay, though.) Those errors are from Apr 16-17; there are no log files for Apr 14-15, and the older files contain no errors.
It's not: there is an extra comma, added in https://gerrit.wikimedia.org/r/#/c/203379/4/modules/mediawiki/templates/jobrunner/jobrunner.conf.erb.
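JSON does not allow a comma after the last element of an object or array, and such a comma is easy to miss by eye, so running the rendered file through a strict parser is a quick way to confirm it. A minimal sketch using Python's standard json module; the config fragments below are hypothetical and are not the real jobrunner.conf content.

```python
import json


def check_json(text):
    """Return None if text is valid JSON, otherwise a short error description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


# Hypothetical fragment with a trailing comma after the last entry (invalid JSON).
broken = '{\n  "groups": {\n    "transcode": {"runners": 2},\n  }\n}'
# The same fragment with the trailing comma removed (valid JSON).
fixed = '{\n  "groups": {\n    "transcode": {"runners": 2}\n  }\n}'

print(check_json(broken))  # an error description pointing near the stray comma
print(check_json(fixed))
```

The exact error wording depends on the Python version, so only the presence of an error should be relied on.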
Change 204815 had a related patch set uploaded (by Gergő Tisza):
Fix invalid JSON in job runner config
Re-running failed jobs with
tgr@terbium:~$ mwscript eval.php --wiki=commonswiki
> $iter = JobQueueGroup::singleton()->get( 'webVideoTranscode' )->getAllAbandonedJobs();
> foreach ( $iter as $job ) { JobQueueGroup::singleton()->push( $job ); }
There are 73 abandoned jobs and 49 uploads since the problem started, so that should cover all failed transcodes.
I'm assuming this is fixed. Please reopen if you still see files that cannot be fixed via the "Reset transcode" link, or if lots of files remain broken by the end of the week.