
Transcodes of audio-only samples are not running for new uploads
Open, Needs Triage | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

What happens?:
The uploads are added to the queue, but do not seem to get picked up from it.
The transcodes have 'add job' timestamps but no 'started work' timestamps, meaning the jobs never started executing.

What should have happened instead?:

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):
We can also see this by querying the most recently added transcodes and looking for audio clips:
https://quarry.wmcloud.org/query/84371

Event Timeline

See also T368333: No sound in MP4 video files downloaded from Commons (which is likely not related; still mentioning it for the paper trail).

So pretty early on the 20th around 01:20 (last Thursday) it started rising. So it wasn't the train, as the train arrived on Commons on Wednesday. In SAL, I only see db maintenance T367856 going on around that time.

Could be a backfill run but that shouldn't be interfering with anything... I'll check on it

Batch requeueTranscodes failed on June 22 with this error:

Hebben_wij_Nederlanders_es_gewusst-_(5-5).webm
.. removing 360p.video.vp9.mp4
JobQueueError from line 134 of /srv/mediawiki/php-1.43.0-wmf.9/extensions/EventBus/includes/Adapters/JobQueue/JobQueueEventBus.php: Could not enqueue jobs
#0 /srv/mediawiki/php-1.43.0-wmf.9/includes/jobqueue/JobQueue.php(380): MediaWiki\Extension\EventBus\Adapters\JobQueue\JobQueueEventBus->doBatchPush(Array, 0)
#1 /srv/mediawiki/php-1.43.0-wmf.9/includes/jobqueue/JobQueue.php(352): JobQueue->batchPush(Array, 0)
#2 /srv/mediawiki/php-1.43.0-wmf.9/includes/jobqueue/JobQueueGroup.php(155): JobQueue->push(Array)
#3 /srv/mediawiki/php-1.43.0-wmf.9/includes/jobqueue/JobQueueGroup.php(189): JobQueueGroup->push(Array)
#4 /srv/mediawiki/php-1.43.0-wmf.9/extensions/TimedMediaHandler/includes/WebVideoTranscode/WebVideoTranscode.php(1117): JobQueueGroup->lazyPush(Object(HTMLCacheUpdateJob))
#5 /srv/mediawiki/php-1.43.0-wmf.9/extensions/TimedMediaHandler/includes/WebVideoTranscode/WebVideoTranscode.php(1091): MediaWiki\TimedMediaHandler\WebVideoTranscode\WebVideoTranscode::invalidatePagesWithFile(Object(MediaWiki\Title\Title))
#6 /srv/mediawiki/php-1.43.0-wmf.9/extensions/TimedMediaHandler/maintenance/requeueTranscodes.php(92): MediaWiki\TimedMediaHandler\WebVideoTranscode\WebVideoTranscode::removeTranscodes(Object(LocalFile), '360p.video.vp9....')
#7 /srv/mediawiki/php-1.43.0-wmf.9/extensions/TimedMediaHandler/maintenance/TimedMediaMaintenance.php(69): RequeueTranscodes->processFile(Object(LocalFile))
#8 /srv/mediawiki/php-1.43.0-wmf.9/extensions/TimedMediaHandler/maintenance/requeueTranscodes.php(33): TimedMediaMaintenance->execute()
#9 /srv/mediawiki/php-1.43.0-wmf.9/maintenance/includes/MaintenanceRunner.php(696): RequeueTranscodes->execute()
#10 /srv/mediawiki/php-1.43.0-wmf.9/maintenance/run.php(51): MediaWiki\Maintenance\MaintenanceRunner->run()
#11 /srv/mediawiki/multiversion/MWScript.php(158): require_once('/srv/mediawiki/...')
#12 {main}

and hasn't been running since. At that time there were 5,498 items estimated in the transcode queue for Commons.

Mentioned in SAL (#wikimedia-operations) [2024-06-25T16:23:55Z] <bvibber> running requeueTranscodes for missing audio files on commons (mwmaint1002) cf T368364

Live system thinks it has 9,223 items queued on commons and requeue is throttling there for now.... occasionally it goes down an item and moves on.

Currently running a check for "active" items via the db, which will have lots of false positives probably:

https://quarry.wmcloud.org/query/79545

I have no way that I know of to query what jobs are *actually* running on job queue servers or whether there are stuck jobs that should be failing out.
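For illustration, here is a rough Python sketch of how a db-side check like this could bucket transcode rows by their timestamps. The function name, argument names and the 8-hour staleness threshold are all assumptions made up for this example, not the actual schema or the Quarry query. It also shows why "active" has false positives: a row with a 'started work' timestamp and no result looks the same whether the job is running or died silently.

```python
from datetime import datetime, timedelta
from typing import Optional


def classify_transcode(
    added: Optional[datetime],
    started: Optional[datetime],
    succeeded: Optional[datetime],
    errored: Optional[datetime],
    now: datetime,
    stale_after: timedelta = timedelta(hours=8),  # assumed threshold, not a real config value
) -> str:
    """Guess the state of one transcode row from its timestamps alone."""
    if succeeded is not None:
        return "done"
    if errored is not None:
        return "failed"
    if started is None:
        # 'add job' timestamp but no 'started work' timestamp: never picked up.
        return "queued" if added is not None else "missing"
    if now - started > stale_after:
        # Started long ago with no result: may be stuck, or may just be slow.
        return "possibly stuck"
    # Started recently with no result: running, or a false positive.
    return "active"
```

Rows that come back as "queued" correspond to the symptom in the description; rows that come back as "possibly stuck" are only candidates, since the db cannot tell whether a worker is still alive.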

I'm bulk-adding the missing audio transcodes which should force them to run through as fast as possible between other jobs, and hopefully will handle the prioritized queue split better.

The list of "active" (may or may not actually be active) includes a number of 2160p high-res videos hitting since June 21. We've also gotten reports before about certain kinds of AV1 videos slowing down the input handling, which I haven't checked for.

Note also that the replag is listed as 3 hours on Quarry. Is this worrying or normal?

Looks like we've got a couple problems with high-res videos:

  • a bunch of 4K videos got uploaded at once and they all queued up
  • some of them are stuck! they should be timing out
  • it's also possible the audio clips are going to the wrong queue, I have to double-check this

cf T368433 for temporarily disabling the highest res transcodes until they're made more performant and .... not breaky

Funnily enough, that's exactly the kind of problem I was also concerned would pop up with the new k8s cluster T357309#9561624
Didn't know we were so susceptible to it on the old setup as well.

I'm seriously considering bringing back my "chunked" scheme that would at least produce smaller, standalone jobs that encode say 10 seconds worth of video, then reassemble the final into a single video at the end. :P Main reason I haven't is that the logic needs to be able to handle missing chunks if individual ones time out or fail and that sounds like a pain, but it'll be a lot friendlier to the job queue infrastructure.
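A minimal sketch of the bookkeeping side of such a chunked scheme, in Python. This is not the earlier implementation mentioned above; the function names and the idea of tracking finished chunks by index are hypothetical, and the actual encoding and reassembly steps are left out. It only shows splitting a duration into standalone 10-second chunk jobs and working out which chunks still need a retry before the final file can be assembled.

```python
import math
from typing import List, Set, Tuple

Chunk = Tuple[int, float, float]  # (index, start_seconds, end_seconds)


def plan_chunks(duration: float, chunk_len: float = 10.0) -> List[Chunk]:
    """Split a video of `duration` seconds into fixed-length chunk jobs."""
    count = math.ceil(duration / chunk_len) if duration > 0 else 0
    # Derive each start from the index to avoid accumulating float drift.
    return [
        (i, i * chunk_len, min((i + 1) * chunk_len, duration))
        for i in range(count)
    ]


def missing_chunks(plan: List[Chunk], completed: Set[int]) -> List[Chunk]:
    """Return the chunks whose jobs timed out or failed and must be re-run."""
    return [chunk for chunk in plan if chunk[0] not in completed]


def ready_to_assemble(plan: List[Chunk], completed: Set[int]) -> bool:
    """Reassembly can only start once every planned chunk has finished."""
    return not missing_chunks(plan, completed)
```

For a 25-second clip, `plan_chunks(25.0)` yields three chunks; if only chunks 0 and 2 finished, `missing_chunks` returns just the middle one, so a single short job gets re-queued instead of the whole transcode.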

Ok, 1440p and 2160p transcodes are temporarily disabled for now until better fixes, and we did a kill of the old stuck processes. Might still take a bit to shake everything out; I'm trying to flush through all the missing audio.

The queue had been shrinking since yesterday, but has been climbing again since this morning. This doesn't make any sense. With those high-res jobs disabled, we should have more than enough capacity to catch up, should we not?

Hi, apologies that I don't know how to help, and not sure if it's the same problem; the manual intervention here worked for files I uploaded to Commons on Monday (e.g. File:3-5B set class on C.mid), but files I uploaded yesterday are still waiting, e.g. File:4-3 set class on C.mid. I made a quarry query for these.

Hmm, it's down under 4k entries but still high.

Might be in part from my background process re-queueing audio items shuffling them around, but shouldn't be adding significantly. I'll keep an eye on it...
