Page MenuHomePhabricator

GetID3 returns junk bytes for description of some mp3 files. FFmpeg skips these bytes
Open, LowPublicBUG REPORT

Description

Event Timeline

Aklapper changed the subtype of this task from "Production Error" to "Bug Report".
Aklapper added a project: TimedMediaHandler.
$:acko\> ffprobe Notes_p1_05_dostoyevsky.mp3 
ffprobe version 6.0 Copyright (c) 2007-2023 the FFmpeg developers
[mp3 @ 0x558b8b521000] Skipping 417 bytes of junk at 503.
[mp3 @ 0x558b8b521000] Estimating duration from bitrate, this may be inaccurate

There is junk (binary data) in the description field of the mp3 metadata of these files.
Files can be corrected with:

ffmpeg -i Notes_p1_05_dostoyevsky.mp3 -f mp3 -acodec copy output.mp3

This likely is a limitation in the ability of the php metadata parser (getID3) to handle invalid binary data in mp3s.

TheDJ renamed this task from MP3 files show 0 second to GetID3 returns junk bytes for description of some mp3 files. FFmpeg skips these bytes.Nov 1 2023, 2:41 PM

When time permits I'm planning to do a big refactor on the MPEG and MP3 parsers for getid3 and submit that upstream; there's a lot of things that are legit in bitstreams but it doesn't handle properly.

(Some MPEG video files also don't have their audio tracks detected correctly, either because they're a type that's not handled by the current code or because the parser took shortcuts in parsing a few structures that sometimes produce wrong results)

Until then the most effective fix for individual affected files is probably to re-mux them with ffmpeg, as recommended above... I can do this in a batch process relatively easily if nobody else wants to take it. :)

I was looking at this file again and discovered something new.

It does get parsed properly when I feed it directly to getID3, but it throws these warnings.

Array
(
    [0] => Some ID3v1 fields do not use NULL characters for padding
    [1] => Unknown data before synch (ID3v2 header ends at 503, then 417 bytes garbage, synch detected at 920). This is a known problem with some versions of LAME (3.90-3.92) DLL in CBR mode.
)

And this confused me, as we simply fail. I looked into the metadata column, and there it says:

{"data":{"GETID3_VERSION":"1.9.23-202310190849","filesize":6385792,"encoding":"UTF-8","id3v2":{"header":true,"majorversion":3,"minorversion":0},"error":["unable to determine file format"],"version":2}}

Now that was new to me... I attempted to reproduce this, but initially wasn't able to and it was somewhat baffling. Until i was reading the mediawiki logs:

/private/var/tmp/php9u5cdf225v7c89gtllk for magic numbers

And then i realized that this is also the file we pass to getID3 to extract the metadata. But that doesn't have a file extension, and due to the warning listed above, that causes getID3 to be unable to find the start of the mp3 audio stream. When the filename ends in .mp3, it can fallback to that to determine it is mp3.

So possible solutions are:

  1. pass a proper filename to getid3
  2. bypass getid3 analyze and directly load the module we need (we have already DONE the format detection, so why do it twice?)
  3. fix getid3 to scan more of the file if the file has an id3v2 tag.

@bvibber i wonder if this also explains some of the mpeg problems we have seen.

@TheDJ *chef's kiss* that's some awesome anaylsys work there, thanks :D

I'm not sure we can add an extension cleanly since iirc the temporary file is created by the upload handling system before we've done a file type check on our end I think? Whereas the metadata extraction is done later, once we already know the type but on the same file.

Cleanest thing might be to directly load the modules with a map from our media handler types to the target getid3 modules. We already do type-specific handling of the output in many cases, it should be straightforward to add a module mapping into each handler sublcass.

@TheDJ *chef's kiss* that's some awesome anaylsys work there, thanks :D

I'm not sure we can add an extension cleanly since iirc the temporary file is created by the upload handling system before we've done a file type check on our end I think?

It seems that we can pass it a name via original_filename

public function analyze($filename, $filesize=null, $original_filename='', $fp=null) {

Change #1231052 had a related patch set uploaded (by TheDJ; author: TheDJ):

[mediawiki/extensions/TimedMediaHandler@master] Pass file extension to getID3

https://gerrit.wikimedia.org/r/1231052

worked. Bed time, to be refined later :)

Screenshot 2026-01-24 at 00.01.25.png (488×1 px, 95 KB)