
webm issue: Some videos on Commons have no duration in their metadata
Open, Needs Triage, Public, Bug Report

Description

Found some odd video entries on Commons that have no duration in their metadata, even though the player shows a duration as soon as playback starts. Example:
https://commons.wikimedia.org/wiki/File:The_Sea_Beast_(1926).webm
and
https://commons.wikimedia.org/w/api.php?action=query&format=json&prop=imageinfo&iiprop=metadata&&titles=File:The%20Sea%20Beast%20(1926).webm

Other files have the duration encoded under playtime_seconds, length, or duration, which is not ideal but solvable. Files like the example above, which have no duration field at all, are hard to tackle.
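For the solvable case, a minimal sketch of normalizing the three key names into one value. It assumes the metadata has already been flattened into a plain dict and that all three keys hold seconds; `extract_duration` and `DURATION_KEYS` are hypothetical names, not part of MediaWiki or getid3.

```python
import math
from typing import Any, Mapping, Optional

# Key names taken from the report above; the lookup order is an assumption.
DURATION_KEYS = ("playtime_seconds", "length", "duration")


def extract_duration(metadata: Mapping[str, Any]) -> Optional[float]:
    """Return the first usable duration in seconds, or None if there is none."""
    for key in DURATION_KEYS:
        raw = metadata.get(key)
        if raw is None:
            continue
        try:
            seconds = float(raw)
        except (TypeError, ValueError):
            continue  # value present but not numeric
        # Treat zero, negative, NaN and infinite values as "no duration".
        if math.isfinite(seconds) and seconds > 0:
            return seconds
    return None
```

A result of None is the case this task is about: no key carries a usable duration, so the value has to come from somewhere other than the stored metadata.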

Event Timeline

This may be another getid3 issue, this time in the webm/mkv parser. I'll sort through these in a bit and collect a list of affected files to test fixes with.

ffprobe says it has no listed duration o_O in which case i'm not sure if a duration can be read without parsing the whole file, but getid3 happily does that for mp3 etc ;)
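To check other files for the same symptom, something like the following could work. It is a sketch that assumes ffprobe is on PATH; the function names are made up for illustration. The parser treats both a missing duration field and an "N/A" value as "no duration listed", since I have not verified which of the two ffprobe's JSON writer emits.

```python
import json
import subprocess
from typing import Optional


def parse_container_duration(ffprobe_json: str) -> Optional[float]:
    """Pull format.duration out of ffprobe's JSON output, or None if unusable."""
    fmt = json.loads(ffprobe_json).get("format", {})
    try:
        return float(fmt.get("duration"))
    except (TypeError, ValueError):
        return None  # key missing, or a non-numeric value such as "N/A"


def probe_container_duration(path: str) -> Optional[float]:
    """Ask ffprobe for the container-level duration of a local file."""
    result = subprocess.run(
        ["ffprobe", "-v", "error",
         "-show_entries", "format=duration",
         "-of", "json", path],
        capture_output=True, text=True, check=True,
    )
    return parse_container_duration(result.stdout)
```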

If you load it into firefox over web you can see that the duration listing shows up as the amount of whatever has been downloaded so far -- as you seek farther into the file and it downloads more it increases the duration to what it's seen so far. :D

We should be able to handle that fine, we just need to be able to get the duration input via getid3, and that'll take a patch.

Workaround: if you remux the file with something like:

ffmpeg -i brokenfile.webm -c copy newfile.webm

and reupload over it, that should clear it up.

The funny thing is, if you click on the movie https://commons.wikimedia.org/wiki/File:The_Sea_Beast_(1926).webm to play it, the moment the player comes up, it shows the correct duration. Where does it get that from?

In T357035#9526675, @brion wrote:

ffprobe says it has no listed duration o_O in which case i'm not sure if a duration can be read without parsing the whole file, but getid3 happily does that for mp3 etc ;)

Looks like a YouTube rip. YT stores all metadata separately and if the rip software doesn’t get that meta info and write it correctly, it’s essentially a live stream that you downloaded.

Where does it get that from?

It likely takes the presentation timestamp (PTS) of the last frame and subtracts the PTS of the first frame, but that's not reliable metadata. What @brion said.
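The last-minus-first estimate can be sketched as below. This is only an illustration of the arithmetic, not what any particular player does; `estimate_duration_from_pts` is a hypothetical helper and the timestamps are assumed to already be in seconds.

```python
from typing import Iterable, Optional


def estimate_duration_from_pts(pts_seconds: Iterable[float]) -> Optional[float]:
    """Estimate duration as (largest PTS seen) - (smallest PTS seen)."""
    first = last = None
    for pts in pts_seconds:
        # min/max rather than first/last so out-of-order frames don't matter.
        first = pts if first is None else min(first, pts)
        last = pts if last is None else max(last, pts)
    if first is None:
        return None  # no frames seen yet
    return last - first
```

Fed only the frames downloaded so far, the result grows as more of the file arrives, and it leaves out the display time of the final frame, which is part of why it is not reliable metadata.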

The funny thing is, if you click on the movie https://commons.wikimedia.org/wiki/File:The_Sea_Beast_(1926).webm to play it, the moment the player comes up, it shows the correct duration. Where does it get that from?

From the duration listed in the transcoded output that's being played.

Asking the user what they are using. https://commons.wikimedia.org/wiki/User_talk:SnowyCinema#c-TheDJ-20240208195200-Software_used

In the end, we can only really correct this by reading the entire file, which is expensive. But maybe if the duration as originally stored is 0, the transcode job could fire off a job that updates the MediaWiki metadata once the duration becomes known? That has all kinds of race condition issues though…

Since we aggressively cache file metadata, I don't expect parsing the whole file to be significantly expensive here. No compressed data has to be decoded; it'd just be seeking through the file during the getid3() call at upload time, which it may well be doing anyway. (I haven't checked the code for mkv/webm, but I know that's exactly what the mpeg parser does to get durations.)
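To illustrate why no decoding is needed, here is a rough Python sketch of such a scan. It is not getid3's code, just my reading of the Matroska/EBML layout: walk the element headers, remember each Cluster's timestamp, read only the first bytes of every block, and seek past everything else. It ignores lacing and block durations, so the result is the start time of the latest block and slightly underestimates the real duration.

```python
import struct
from typing import BinaryIO, Optional, Tuple

# Matroska element IDs (marker bits included).
SEGMENT, INFO, CLUSTER, BLOCK_GROUP = 0x18538067, 0x1549A966, 0x1F43B675, 0xA0
TIMESTAMP_SCALE, CLUSTER_TIMESTAMP = 0x2AD7B1, 0xE7
SIMPLE_BLOCK, BLOCK = 0xA3, 0xA1
MASTERS = {SEGMENT, INFO, CLUSTER, BLOCK_GROUP}  # containers we descend into


def read_vint(f: BinaryIO, keep_marker: bool) -> Optional[Tuple[int, int]]:
    """Read one EBML variable-length integer; return (value, bytes_used)."""
    first = f.read(1)
    if not first:
        return None  # end of file
    mask, length = 0x80, 1
    while mask and not first[0] & mask:
        mask >>= 1
        length += 1
    if not mask:
        raise ValueError("invalid EBML variable-length integer")
    value = first[0] if keep_marker else first[0] & (mask - 1)
    rest = f.read(length - 1)
    if len(rest) != length - 1:
        return None  # truncated file
    for byte in rest:
        value = (value << 8) | byte
    return value, length


def scan_duration_seconds(f: BinaryIO) -> Optional[float]:
    """Return the latest block start time in seconds, or None if no blocks."""
    scale_ns = 1_000_000  # Matroska default: one tick is 1 ms
    cluster_ts = 0
    last_ticks = None
    while True:
        head = read_vint(f, keep_marker=True)
        size = read_vint(f, keep_marker=False) if head else None
        if head is None or size is None:
            break
        element_id, payload_size = head[0], size[0]
        if element_id in MASTERS:
            continue  # children follow directly; also copes with unknown sizes
        if element_id == TIMESTAMP_SCALE:
            scale_ns = int.from_bytes(f.read(payload_size), "big")
        elif element_id == CLUSTER_TIMESTAMP:
            cluster_ts = int.from_bytes(f.read(payload_size), "big")
        elif element_id in (SIMPLE_BLOCK, BLOCK):
            track = read_vint(f, keep_marker=False)
            rel = f.read(2)  # signed 16-bit offset from the cluster timestamp
            if track is None or len(rel) != 2:
                break
            ticks = cluster_ts + struct.unpack(">h", rel)[0]
            last_ticks = ticks if last_ticks is None else max(last_ticks, ticks)
            f.seek(payload_size - track[1] - 2, 1)  # skip the compressed frame
        else:
            f.seek(payload_size, 1)  # skip elements we don't care about
    return None if last_ticks is None else last_ticks * scale_ns / 1e9
```

Calling `scan_duration_seconds(open(path, "rb"))` touches only a few bytes per block, so the cost is mostly seeks rather than reads.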