Some WebM video files are misdetected as audio files due to the MIME detector not scanning enough bytes
Closed, Resolved · Public

Description

Hi there,

I came across some videos that are large enough but are displayed too small in normal galleries.
Check out the following link:

https://commons.wikimedia.org/wiki/Commons:Featured_videos/Animated (at the bottom of this page)

Event Timeline

I was able to reproduce this locally using the files via InstantCommons, and it looks like the root problem is that some of the files are misidentified as audio, causing the packed-gallery code to use a small icon width instead of a large width constraint here:

	protected function getThumbParams( $img ) {
		if ( $img && $img->getMediaType() === MEDIATYPE_AUDIO ) {
			$width = $this->mWidths;
		} else {
			// We want the width not to be the constraining
			// factor, so use random big number.
			$width = $this->mHeights * 10 + 100;
		}

		// self::SCALE_FACTOR so the js has some room to manipulate sizes.
		return [
			'width' => $width * self::SCALE_FACTOR,
			'height' => $this->mHeights * self::SCALE_FACTOR,
		];
	}

Probably need to find and fix a bug with audio vs. video detection for WebM.

Ah this is of course due to our naive parsing in MimeAnalyzer.

} elseif ( strncmp( $data, "webm", 4 ) == 0 ) {
        // XXX HACK look for a video track, if we don't find it, this is an audio file
        $videotrack = strpos( $head, "\x86\x85V_VP" );

In the case of File:Elephants Dream (2006).webm, this V_VP is likely outside of the scanned range of bytes.

Yeah, looks like we only read 1024 header bytes and it appears later in both those files, around 4000+ bytes in.

Proper fix would be to read as many bytes as are required through the metadata fetch. Hacky fix would be to just read more header bytes. ;)
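To illustrate why the size of the scanned window matters, here is a minimal Python sketch (not MediaWiki's PHP code). `has_vp_video_marker` is a hypothetical helper that mirrors the `strpos` check quoted above, and the header bytes are synthetic, with the marker placed roughly 4000 bytes in as described.

```python
# Same byte pattern as the strpos() check quoted above.
VP_MARKER = b"\x86\x85V_VP"


def has_vp_video_marker(data: bytes, scan_bytes: int) -> bool:
    """Return True if the marker lies entirely within the first scan_bytes bytes."""
    return VP_MARKER in data[:scan_bytes]


# Synthetic stand-in for a WebM header; real files contain EBML elements here.
fake_header = b"\x00" * 4000 + VP_MARKER + b"8" + b"\x00" * 100

print(has_vp_video_marker(fake_header, 1024))  # marker lies beyond the window
print(has_vp_video_marker(fake_header, 8192))  # marker lies inside the window
```

With a 1024-byte window the marker is never seen, so a check like this would fall back to "audio"; a larger window finds it. Reading more bytes only moves the cutoff, which is why it is the hacky option.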

Change 518426 had a related patch set uploaded (by Brion VIBBER; owner: Brion VIBBER):
[mediawiki/core@master] Workaround for misdetection of some WebM files as audio

https://gerrit.wikimedia.org/r/518426

@TheDJ @brion Do you know what happened here? Was the work paused/discontinued?

Jonteemil renamed this task from Full size videos displayed as small videos in gallery to Some WebM video files are misdetected as audio files due to the MIME detecter not scanning enough bytes. Oct 24 2021, 10:12 PM
Jonteemil renamed this task from Some WebM video files are misdetected as audio files due to the MIME detecter not scanning enough bytes to Some WebM video files are misdetected as audio files due to the MIME detector not scanning enough bytes.

I figured this was a more precise title.

I see there is a patch to fix this already made. What is the progress on it? What is yet to be done for this to be addressed?

From my observations of the imageinfo metadata available for these files (as stored in the MediaWiki image table), I think an easy fix for many (if not all) of these files would be to simply look at the existing imageinfo metadata: if a file is application/ogg and has a width and height, mark it as video; if width and height are 0, mark it as audio. The same goes for audio/webm and video/webm. So this could be fixed now for existing files with a simple script going over the image table, while we wait for the perfect fix for new files.
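The heuristic proposed above could be sketched roughly as follows. This is an illustrative Python function, not an existing MediaWiki script; `guess_media_type` and its parameters are hypothetical names, and the case where only one dimension is zero is left undecided because the comment does not cover it.

```python
from typing import Optional

# Container types named in the comment above.
AMBIGUOUS_CONTAINERS = {
    ("application", "ogg"),
    ("audio", "webm"),
    ("video", "webm"),
}


def guess_media_type(major_mime: str, minor_mime: str,
                     width: int, height: int) -> Optional[str]:
    """Guess AUDIO or VIDEO from stored dimensions; None if undecided."""
    if (major_mime, minor_mime) not in AMBIGUOUS_CONTAINERS:
        return None  # not one of the container types discussed here
    if width > 0 and height > 0:
        return "VIDEO"
    if width == 0 and height == 0:
        return "AUDIO"
    return None  # mixed zero/non-zero dimensions: leave for manual review
```

For example, a row stored as audio/webm with dimensions 1280x720 would be classified as VIDEO, while one with 0x0 would stay AUDIO.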

Change 518426 merged by jenkins-bot:

[mediawiki/core@master] Workaround for misdetection of some WebM files as audio

https://gerrit.wikimedia.org/r/518426

Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See the email sent to the task assignee on February 06th 2022 (and T295729).

Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome.

If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".

Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator.

The patch was merged, fixing this problem for newer files, but older files will have to reparse their basic information into the DB. I think this can be done with the refreshImageMetadata script?

I know @Ladsgroup has recently run this for djvu files. For this, maybe we could select on mediatype === AUDIO and maybe add an option to the script to filter on width/height, as these videos will have width and height even though they are marked as 'audio'.

I can run that, and even for old file versions too. It might take a while depending on how many audio files we have. It won't fix the deleted files though (and the issue will resurface once they get undeleted). I will find a way for them as well. Thanks for letting me know.

Mentioned in SAL (#wikimedia-operations) [2022-03-14T17:47:46Z] <Amir1> start of foreachwikiindblist all maintenance/refreshImageMetadata.php --force --verbose --mediatype=AUDIO --sleep 2 (T226311)

Mentioned in SAL (#wikimedia-operations) [2022-03-15T10:13:56Z] <Amir1> start of foreachwikiindblist all maintenance/refreshImageMetadata.php --force --verbose --mediatype=AUDIO --sleep 2 --oldimage (T226311)

That is strange.. it didn't seem to refresh that data..

Example of files:
https://commons.wikimedia.org/w/api.php?action=query&prop=videoinfo&viprop=mediatype|mime|metadata&titles=File:FEZ_trial_gameplay_HD.webm
https://commons.wikimedia.org/w/api.php?action=query&prop=videoinfo&viprop=mediatype|mime|metadata&titles=File:Elephants_Dream_(2006).webm

I tested the FEZ file locally, overwriting the data in the table, then running:

php ./maintenance/refreshImageMetadata.php --force --verbose --mediatype=AUDIO
Processing next 2 row(s) starting with FEZ_trial_gameplay_HD_local.webm.
Forcibly refreshed File:FEZ_trial_gameplay_HD_local.webm.
Forcibly refreshed File:Local_audio.ogg.

Finished refreshing file metadata for 2 files. 0 needed to be refreshed, 2 did not need to be but were refreshed anyways, and 0 refreshes were suspicious.

And after running, it did update the data in the table for me. Perhaps we encountered an error when running the script, or something else caused it to not fully process all files?

Hmm, the metadata for that file is interesting btw. Some ID3 warnings, both locally and in wikimedia though, so I doubt that explains the difference in behaviour of the maintenance script.

a:15:{s:14:"GETID3_VERSION";s:19:"1.9.21-202109171300";s:8:"filesize";i:143769846;s:12:"avdataoffset";i:0;s:9:"avdataend";i:143769846;s:10:"fileformat";s:4:"webm";s:5:"audio";a:6:{s:10:"dataformat";s:6:"A_OPUS";s:11:"sample_rate";d:48000;s:8:"channels";i:2;s:8:"language";s:3:"und";s:7:"streams";a:1:{s:2:"02";a:5:{s:10:"dataformat";s:6:"A_OPUS";s:7:"default";b:1;s:11:"sample_rate";d:48000;s:8:"channels";i:2;s:8:"language";s:3:"und";}}s:11:"channelmode";s:6:"stereo";}s:5:"video";a:8:{s:10:"dataformat";s:5:"V_VP9";s:12:"resolution_x";i:1280;s:12:"resolution_y";i:720;s:12:"display_unit";s:6:"pixels";s:9:"display_x";i:1280;s:9:"display_y";i:720;s:10:"frame_rate";d:30;s:7:"streams";a:1:{s:2:"01";a:8:{s:10:"dataformat";s:5:"V_VP9";s:7:"default";b:1;s:12:"resolution_x";i:1280;s:12:"resolution_y";i:720;s:12:"display_unit";s:6:"pixels";s:9:"display_x";i:1280;s:9:"display_y";i:720;s:10:"frame_rate";d:30;}}}s:7:"warning";a:8:{i:0;s:101:"Unhandled seekhead element [module.audio-video.matroska.php:627] (5036::SeekPosition [2 bytes]) at 63";i:1;s:101:"Unhandled seekhead element [module.audio-video.matroska.php:627] (5036::SeekPosition [2 bytes]) at 78";i:2;s:101:"Unhandled seekhead element [module.audio-video.matroska.php:627] (5036::SeekPosition [4 bytes]) at 93";i:3;s:102:"Unhandled seekhead element [module.audio-video.matroska.php:627] (5036::SeekPosition [4 bytes]) at 110";i:4;s:99:"Unhandled track.video element [module.audio-video.matroska.php:712] (5552::15b0 [16 bytes]) at 4314";i:5;s:92:"Unhandled track element [module.audio-video.matroska.php:819] (5819::16bb [4 bytes]) at 4353";i:6;s:92:"Unhandled track element [module.audio-video.matroska.php:819] (5802::16aa [3 bytes]) at 4389";i:7;s:29:"Unhandled audio type "A_OPUS"";}s:8:"encoding";s:5:"UTF-8";s:9:"mime_type";s:10:"video/webm";s:8:"matroska";a:1:{s:8:"comments";a:2:{s:9:"muxingapp";a:1:{i:0;s:35:"libebml v1.3.6 + libmatroska v1.4.9";}s:10:"writingapp";a:1:{i:0;s:39:"mkvmerge v26.0.0 ('In The Game') 64-bit";}}}s:16:"playtime_seconds";d:390.016;s:7:"bitrate";d:2949004.061371841;s:15:"playtime_string";s:4:"6:30";s:7:"version";i:2;}

@brion any ideas ?

The fact that it has php metadata stored in db (in production) says the script was not run on it at all.
Strangely when I run the script without mediatype (or with it), it doesn't pick up the file:

ladsgroup@mwmaint1002:~$ mwscript maintenance/refreshImageMetadata.php --wiki=commonswiki --force --verbose --start=FEZ --end=FEZZ
Processing next 6 row(s) starting with FEZ00.JPG.
Forcibly refreshed File:FEZ00.JPG.
Forcibly refreshed File:FEZ0010.jpg.
Forcibly refreshed File:FEZ1.JPG.
Forcibly refreshed File:FEZ9BrKX0Ac8x8-.jpg.
Forcibly refreshed File:FEZE.jpg.
Forcibly refreshed File:FEZIFY.svg.

Finished refreshing file metadata for 6 files. 0 needed to be refreshed, 6 did not need to be but were refreshed anyways, and 0 refreshes were suspicious.

Let me see what's going on.

The query made by the script is:

SELECT  img_name,img_size,img_width,img_height,img_metadata,img_bits,img_media_type,img_major_mime,img_minor_mime,img_timestamp,img_sha1,img_actor,image_actor.actor_user AS `img_user`,image_actor.actor_name AS `img_user_text`,comment_img_description.comment_text AS `img_description_text`,comment_img_description.comment_data AS `img_description_data`,comment_img_description.comment_id AS `img_description_cid`,img_metadata  FROM `image` JOIN `actor` `image_actor` ON ((actor_id=img_actor)) JOIN `comment` `comment_img_description` ON ((comment_img_description.comment_id = img_description_id))   WHERE (img_name <= 'FEZZ') AND (img_name >= 'FEZ')  ORDER BY img_name ASC LIMIT 200;

Which doesn't bring back that file in production :/

I don't know why but the name was the issue. Now re-ran and this is the result:

| FEZ_trial_gameplay_HD.webm | 143769846 |      1280 |        720 | {"data":{"GETID3_VERSION":"1.9.21-202109171300","filesize":143769846,"avdataoffset":0,"avdataend":143769846,"fileformat":"webm","audio":{"dataformat":"A_OPUS","sample_rate":48000,"channels":2,"language":"und","streams":{"02":{"dataformat":"A_OPUS","default":true,"sample_rate":48000,"channels":2,"language":"und"}},"channelmode":"stereo"},"video":{"dataformat":"V_VP9","resolution_x":1280,"resolution_y":720,"display_unit":"pixels","display_x":1280,"display_y":720,"frame_rate":30,"streams":{"01":{"dataformat":"V_VP9","default":true,"resolution_x":1280,"resolution_y":720,"display_unit":"pixels","display_x":1280,"display_y":720,"frame_rate":30}}},"warning":["Unhandled seekhead element [module.audio-video.matroska.php:627] (5036::SeekPosition [2 bytes]) at 63","Unhandled seekhead element [module.audio-video.matroska.php:627] (5036::SeekPosition [2 bytes]) at 78","Unhandled seekhead element [module.audio-video.matroska.php:627] (5036::SeekPosition [4 bytes]) at 93","Unhandled seekhead element [module.audio-video.matroska.php:627] (5036::SeekPosition [4 bytes]) at 110","Unhandled track.video element [module.audio-video.matroska.php:712] (5552::15b0 [16 bytes]) at 4314","Unhandled track element [module.audio-video.matroska.php:819] (5819::16bb [4 bytes]) at 4353","Unhandled track element [module.audio-video.matroska.php:819] (5802::16aa [3 bytes]) at 4389","Unhandled audio type \"A_OPUS\""],"encoding":"UTF-8","mime_type":"video/webm","matroska":{"comments":{"muxingapp":["libebml v1.3.6 + libmatroska v1.4.9"],"writingapp":["mkvmerge v26.0.0 ('In The Game') 64-bit"]}},"playtime_seconds":390.016,"bitrate":2949004.061371841,"playtime_string":"6:30","version":2}} |        0 | VIDEO          | video          | webm           |           26993399 |     28552 | 20180908224504 | tm9gvasmoevps3l2urt0c55x4qi626n |

Found a rather easier way to pick up all of them. Run this:

mwscript maintenance/refreshImageMetadata.php --wiki=commonswiki --force --verbose --mediatype=AUDIO --mime audio/webm

And found the problem:

Forcibly refreshed File:Asteroid_Belts.webm.

mmap() failed: [12] Cannot allocate memory

mmap() failed: [12] Cannot allocate memory
Fatal error: Out of memory (allocated 42205184) (tried to allocate 72057594037927968 bytes) in /srv/mediawiki/php-1.38.0-wmf.26/vendor/james-heinrich/getid3/getid3/getid3.php on line 2215

sigh

Running it with batch-size of 1. Hopefully it'll fix everything.

mmap() failed: [12] Cannot allocate memory
Fatal error: Out of memory (allocated 42205184) (tried to allocate 72057594037927968 bytes) in /srv/mediawiki/php-1.38.0-wmf.26/vendor/james-heinrich/getid3/getid3/getid3.php on line 2215

Not the first to run into this.
https://github.com/JamesHeinrich/getID3/issues/285

Weird number… pretty far from a 64bit signed, but also pretty far from a long (53 bit I believe?)
Probably something ebml variable int related? Ah look this is pretty close: “An Element Data Size with an octet length of 8 is able to express a size of 2^56-2 or 72,057,594,037,927,934 octets”

https://tools.ietf.org/id/draft-ietf-cellar-ebml-03.html#rfc.section.6
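For reference, the EBML size encoding quoted above can be decoded with a few lines of Python. This is an illustrative sketch of the variable-length integer scheme only; `decode_ebml_size` is a hypothetical helper, not getID3's code, and whether this encoding actually explains the allocation size in the error is the guess made above, not something the sketch proves.

```python
def decode_ebml_size(buf: bytes, pos: int = 0):
    """Decode an EBML Element Data Size starting at buf[pos].

    Returns (size, octet_length); size is None for the reserved
    "unknown size" value (all data bits set to 1).
    """
    first = buf[pos]
    if first == 0:
        raise ValueError("size fields longer than 8 octets are not handled")
    # The number of leading zero bits before the marker bit gives the length.
    length = 1
    mask = 0x80
    while not first & mask:
        mask >>= 1
        length += 1
    if pos + length > len(buf):
        raise ValueError("truncated size field")
    value = first & (mask - 1)  # strip the marker bit
    for byte in buf[pos + 1:pos + length]:
        value = (value << 8) | byte
    if value == (1 << (7 * length)) - 1:
        return None, length  # reserved: size unknown
    return value, length


print(decode_ebml_size(b"\x85"))                          # one octet, size 5
print(decode_ebml_size(b"\x01" + b"\xff" * 6 + b"\xfe"))  # largest 8-octet size
```

The second call yields 72057594037927934 (2^56-2), the maximum mentioned in the quoted draft; an 8-octet field with all data bits set is reserved for "unknown size".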

I re-ran it on most of those files and the re-run didn't fix those cases. It fixed some before, but not all. For example, the run on Estación_de_Barbantes.webm was successful without any errors, but nothing changed as a result.

OK.

Estación_de_Barbantes.webm makes sense. It is a V_AV1 video, which apparently we don't properly detect in MimeAnalyzer. There are a total of 8 such files in the results.

For some of the others, for instance https://commons.wikimedia.org/wiki/File:2019-12-26_Annular_Solar_Eclipse,_Liwa,_Abu_Dhabi,_UAE.webm I found that we check for \x86\x85V_VP, but these files have \x86\x86V_VP. I don't remember right now what those numbers mean, but it should be easy to dig up.

hexdump -n 16000 -C /Users/hartman/Downloads/2019-12-26_Annular_Solar_Eclipse,_Liwa,_Abu_Dhabi,_UAE.webm | grep V_
00001090  83 75 6e 64 86 86 56 5f  56 50 38 00 23 e3 83 84  |.und..V_VP8.#...|

Right, it's:

0x86: CodecID element
0x85: Size (the leading 0x80 bit marks a 1-byte size field; the remaining bits give a data size of 5)
V_VP8 == 5 bytes

So 0x86 0x86 means a length of 6, and in this case, someone (WonderShare Matroska Muxer) wrote a null-terminated string, which added a byte to the length (which is forbidden in Matroska/WebM).
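A check that tolerates both the 5-byte and the null-terminated 6-byte codec strings, and also recognises AV1, might look like the following Python sketch. It is an illustration of the idea only, not the actual MediaWiki change; `find_video_codec` and the regular expression are assumptions made for this example.

```python
import re
from typing import Optional

# 0x86 element ID, then a one-octet size:
#   0x85 -> 5 data bytes, e.g. "V_VP8", "V_VP9", "V_AV1"
#   0x86 -> 6 data bytes, the same strings with a trailing NUL byte
VIDEO_CODEC_RE = re.compile(
    rb"\x86(?:\x85(?:V_VP[89]|V_AV1)|\x86(?:V_VP[89]|V_AV1)\x00)"
)


def find_video_codec(head: bytes) -> Optional[str]:
    """Return the first video codec ID found in head, or None."""
    match = VIDEO_CODEC_RE.search(head)
    if match is None:
        return None
    # Drop the element ID and size octets, then any trailing NUL.
    return match.group(0)[2:].rstrip(b"\x00").decode("ascii")


# The 16 bytes shown in the hexdump above.
sample = bytes.fromhex("83756e648686565f5650380023e38384")
print(find_video_codec(sample))
```

On the hexdump bytes this finds `V_VP8` even though the size octet is 0x86 rather than 0x85. A plain byte-pattern search like this can still produce false positives or miss tracks outside the scanned range; parsing the track entries properly would be more robust.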

About 300 of the quarry results are WonderShare files.
Another 250 or so files are GStreamer Matroska muxer, which has the same issue.
Without the 8 AV1 files, that leaves 34 files

Other files I tried:
https://commons.wikimedia.org/wiki/File:Comfort_behaviour_of_greylag_goose_including_somersault.webm
Lavf56.40.101. Has a 7-day duration even though it is only 1m56s. This file does get the right type for me locally, however.

https://commons.wikimedia.org/wiki/File:1987_Bezirksfeuerwehrtag_Übersbach_Fürstenfeld.webm
Lavf56.40.101 works correctly on my local machine

https://commons.wikimedia.org/wiki/File:1988_stoerche_beringung.webm
Lavf56.40.101 works correctly on my local machine

https://commons.wikimedia.org/wiki/File:Internet_in_a_Box.webm
Lavf56.40.101 Has a 7 hour duration even though it is only 12minutes (mux error). But works correctly on my local machine

https://commons.wikimedia.org/wiki/File:Yugorno._Mari_folk_epos_in_Mari_national_dramatic_theatre_16.03.2019._Fragment_7.webm
Has a 4h:43m duration, even though it is only 56secs long (mux error). But works correctly on my local machine.

So it seems that some Lavf56.40.101 files definitely have a problem and most of them should be remuxed, but when I upload them locally they are stored correctly. Perhaps the metadata refresh fails in a way that is different from an original upload? Anyway, let's fix the AV1 and length issues first; those other 34 files we can deal with later.

I'm getting this for one of the ones that should be fixed. I think you locally set it to be php serialization? If you set it to json, you'll probably be able to reproduce this (reminds me that I need to change the default serialization):

Processing next 1 row(s) starting with 1988_stoerche_beringung.webm.
1988_stoerche_beringung.webm failed. LocalFile::jsonEncode: metadata is not JSON-serializable (type = video/webm)

Bingo

I can reproduce. The framerate is INF, which cannot be encoded in json.
That in turn is a side effect of an issue in getID3, which doesn't account for a DefaultDuration of 0 when making this calculation:

if (isset($trackarray['DefaultDuration'])) { $track_info['frame_rate'] = round(1000000000 / $trackarray['DefaultDuration'], 3); }

Which throws

Division by zero in .... TimedMediaHandler/vendor/james-heinrich/getid3/getid3/module.audio-video.matroska.php on line 299

The spec specifically says that 0 is an invalid value for DefaultDuration... So I'd say this one should get mkvclean'ed eventually.
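A guarded version of that calculation could look like this Python sketch. It is not getID3's code; `frame_rate_from_default_duration` is a hypothetical name, and returning None for a missing or invalid duration is one possible choice among several.

```python
from typing import Optional


def frame_rate_from_default_duration(
    default_duration_ns: Optional[int],
) -> Optional[float]:
    """Convert a DefaultDuration in nanoseconds to frames per second.

    Returns None when the value is missing, zero, or negative, so the
    caller never sees an infinite frame rate or a division error.
    """
    if default_duration_ns is None or default_duration_ns <= 0:
        return None
    return round(1_000_000_000 / default_duration_ns, 3)


print(frame_rate_from_default_duration(40_000_000))  # 40 ms per frame
print(frame_rate_from_default_duration(0))           # invalid per the spec
```

Leaving the frame rate unset for invalid input keeps a non-finite value out of the metadata array in the first place.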

For the JSON encoding problem specifically, however: we could check for JSON_ERROR_INF_OR_NAN and convert such values to 0. Or alternatively (and maybe better?), we should expand the exception we throw with json_last_error_msg, so that it is easier to determine what kind of error is occurring.
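Both options can be illustrated with a short Python sketch (Python's `json` module stands in for PHP's `json_encode` here, and the helper names are hypothetical). `replace_non_finite` shows the "convert to 0" approach; `encode_metadata` shows the "more informative exception" approach.

```python
import json
import math


def replace_non_finite(value, replacement=0):
    """Recursively replace INF/NaN floats so the structure is JSON-safe."""
    if isinstance(value, float) and not math.isfinite(value):
        return replacement
    if isinstance(value, dict):
        return {k: replace_non_finite(v, replacement) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [replace_non_finite(v, replacement) for v in value]
    return value


def encode_metadata(metadata) -> str:
    """Encode strictly, reporting why encoding failed."""
    try:
        # allow_nan=False makes json.dumps reject INF/NaN instead of
        # emitting non-standard tokens.
        return json.dumps(metadata, allow_nan=False)
    except ValueError as exc:
        raise ValueError(f"metadata is not JSON-serializable: {exc}") from exc


meta = {"video": {"frame_rate": float("inf")}}
print(encode_metadata(replace_non_finite(meta)))
```

Calling `encode_metadata(meta)` directly raises an error whose message includes the underlying reason, which is the kind of detail the suggestion about json_last_error_msg is after.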

Change 774552 had a related patch set uploaded (by TheDJ; author: TheDJ):

[mediawiki/core@master] Handle webm files with AV1 and/or nullbyte terminated VP8/9

https://gerrit.wikimedia.org/r/774552

@Ladsgroup btw. considering that a release is coming up

reminds me that I need to change the default serialization

Does this need a ticket ?

Already done ^^.

BTW If you're happy with the patch, I can +2 it.

Change 774552 merged by jenkins-bot:

[mediawiki/core@master] Handle webm files with AV1 and/or nullbyte terminated VP8/9

https://gerrit.wikimedia.org/r/774552

@Ladsgroup can you rerun the refreshmetadata now that the second patch was deployed ?

Mentioned in SAL (#wikimedia-operations) [2022-05-11T06:31:55Z] <Amir1> mwscript maintenance/refreshImageMetadata.php --wiki=commonswiki --force --verbose --mediatype=AUDIO --mime audio/webm (T226311)

@TheDJ Done now. It was ~700 files having this issue and it's now only 122:

wikiadmin@10.64.0.219(commonswiki)> select count(*) from image where img_media_type = 'AUDIO' and img_major_mime= 'audio' and img_minor_mime = 'webm';
+----------+
| count(*) |
+----------+
|      122 |
+----------+
1 row in set (0.002 sec)

There are only 35 files left now. I suspect most of these will require remux'ing

TheDJ claimed this task.

I have remuxed all remaining problematic files on commons. Every single one of them had broken metadata for the seekpoints.

I suspect these files have not had their info updated in the search index yet. If you use the API for one of them, you can see that (even though the specific file I'm querying here has broken metadata) it is actually recognized as video/webm.

https://commons.wikimedia.org/wiki/Special:ApiSandbox#action=query&format=json&prop=imageinfo&meta=&titles=File%3AHof%20vs.%20Corona%2020200319%20WebM%20ohne%20Ton%20002.webm&iiprop=timestamp%7Cuser%7Cmetadata%7Cmediatype%7Cmime

Okay, how can their info be updated in the search index then?