
Increase capacity for Mercurius webvideoTranscode job (1080p) processing
Open, High, Public

Description

According to my calculations, we have some 900 files for which, after more than a month, we have still not finished generating the 1080p transcode. That is a backlog we cannot catch up with without adding capacity, which suggests we are chronically underprovisioned. And this is AFTER we got rid of the 720p transcodes, which I suspect were in the same class (this in turn might indicate that 1080p transcodes are often simply not finishing and are crashing nodes).

One of the problems with not having many transcode nodes for this class is that transcodes of large files often get stuck, and a stuck node is only restarted after something like 3 hours (I'm not sure what the exact maximum is, but since transcodes can take around a day, I assume the timeout is fairly conservative about shooting down a node). In the meantime, that node handles no other transcodes in this category.

SELECT COUNT(transcode_id)
FROM transcode
WHERE transcode_key = '1080p.vp9.webm'
  AND transcode_time_startwork IS NULL
  AND transcode_time_addjob IS NOT NULL
  AND transcode_time_success IS NULL
  AND transcode_time_error IS NULL
  AND transcode_time_addjob > DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 30 DAY), '%Y%m%d%H%i%S');

903 entries

Over the same period, 1165 1080p entries succeeded and 62 failed.
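(For reference, the succeeded/failed counts over the same window can be pulled with a similar query. This is a minimal sketch, assuming the same `transcode` schema and 30-day window as the query above:

```sql
-- Succeeded vs. failed 1080p transcodes over the last 30 days.
-- SUM over a boolean expression counts the rows where it is true (MySQL).
SELECT
  SUM(transcode_time_success IS NOT NULL) AS succeeded,
  SUM(transcode_time_error IS NOT NULL)   AS failed
FROM transcode
WHERE transcode_key = '1080p.vp9.webm'
  AND transcode_time_addjob > DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 30 DAY), '%Y%m%d%H%i%S');
```
)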

Multiple people on Commons have noted that 1080p transcodes are simply "not being generated" any longer. We need to tackle this, because otherwise we are just burning CPU while practically not supporting 1080p at all.

What is strange is that this class used to include 720p, 1080p, 1440p AND 2160p. While 1440p and 2160p were a bit too much and were eventually removed (for good reasons, I think), we were able to keep up with the rest fairly easily. That was probably when we were still on bare metal, however.

Event Timeline

JMeybohm triaged this task as Medium priority. Jan 13 2026, 2:29 PM
JMeybohm added a project: ServiceOps new.
MLechvien-WMF raised the priority of this task from Medium to High.
MLechvien-WMF moved this task from Inbox to Scheduled (this Q) on the ServiceOps new board.
MLechvien-WMF subscribed.

Assigning to Raine to take a look

Reassigning to Hugh, who agreed to take a look

I think this has actually recovered now, thanks to all the fixes we made recently to transcode queue processing. The stuck jobs were obscuring the numbers, and they also kept hosts occupied because the hosts didn't realize their job had actually already ended.

Good to hear! Are the numbers you queried in the ticket body exposed as metrics anywhere by any chance?

Not as grafana metrics or anything like that (other than the job queue counts)

They are on https://commons.wikimedia.org/wiki/Special:Transcode_statistics
Note that the counts there are accurate, but the tables are 'broken'. (The underlying indexes are in the process of being replaced at the DB level, which should fix that and improve the performance of that page.)

Thanks - I've added a note to T385707 to include those details