Ogg Opus files should be classified as audio, not multimedia.
Closed, Resolved · Public

Description

Since MediaWiki 1.28, CirrusSearch allows searching for files of a given type with the keyword filetype: (→ T145560). Audio files in Ogg Opus format are classified as multimedia instead of audio, as one would expect.

Description:
There are three codecs we allow in the Ogg container on Commons: Vorbis (audio), Theora (video) and Opus (audio, successor of Vorbis). If a file contains a Theora stream (and maybe an audio stream), it is correctly classified as 'video'. If a file contains only a Vorbis stream, it is correctly classified as 'audio'. But if a file contains only an Opus stream, it is classified as 'multimedia'.
No other type of file allowed on Commons is classified as multimedia so far.
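
The expected behaviour described above can be written down as a small decision rule. This is only an illustrative sketch in Python, not MediaWiki's actual detection code; the function name `classify_ogg` and the fallback to "multimedia" for unknown codecs are assumptions made for the example.

```python
# Hypothetical helper: maps the codecs found in an Ogg container to the
# media type a searcher would expect. Not taken from MediaWiki core.
VIDEO_CODECS = {"theora"}
AUDIO_CODECS = {"vorbis", "opus"}


def classify_ogg(codecs):
    """Return 'video', 'audio' or 'multimedia' for a list of codec names."""
    names = {c.lower() for c in codecs}
    if names & VIDEO_CODECS:
        # Any Theora stream makes the file a video, with or without audio.
        return "video"
    if names and names <= AUDIO_CODECS:
        # Only Vorbis and/or Opus streams: plain audio.
        return "audio"
    # Assumed fallback for empty or unrecognised codec lists.
    return "multimedia"


print(classify_ogg(["Opus"]))              # the case this task is about
print(classify_ogg(["Theora", "Vorbis"]))
```

With this rule, an Opus-only file lands in the same bucket as a Vorbis-only file, which is what filetype:audio searches rely on.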

Searching for "filetype:audio" thus does not return all audio files: the ones in Opus encoding are missing. Searching for "filetype:multimedia" returns only Ogg Opus files, which is also not what one would expect. Due to T151347, there is no workaround by combining both types with the OR keyword. A concrete example would be searching on Commons for “O du fröhliche filetype:audio” and “O du fröhliche filetype:multimedia”: both show one result, and those differ! There is no way to find both.

Details

Related Gerrit Patches:
mediawiki/core : master · Add test case for Opus file check
mediawiki/core : master · Special case opus mime detction

Event Timeline

Restricted Application added a subscriber: Aklapper. · Nov 22 2016, 5:44 PM
Restricted Application added projects: Discovery, Discovery-Search. · Nov 22 2016, 5:48 PM
Restricted Application added a subscriber: Poyekhali.

@Smalyshev Do you have any thoughts on this? I'm not sure what's going on.

Deskana triaged this task as Low priority. · Dec 15 2016, 11:10 PM
Deskana moved this task from needs triage to later on... on the Discovery-Search board.

Search uses already existing file data, which is generated by MimeAnalyzer::getMediaType. It looks like there is some code for Ogg there, but it may need updating. In any case, it's not really a search thing; it's a general core file type detection thing.

MarkTraceur moved this task from Untriaged to Triaged on the Multimedia board. · Dec 22 2016, 5:36 PM
MarkTraceur added a subscriber: MarkTraceur.

@MichaelSchoenitzer_WMDE could you give an example of a file that produces this result? I cannot tell from the code which thing is going wrong, and it would be useful to have a debug case. Thanks for reporting!

@MarkTraceur, almost all files in https://commons.wikimedia.org/wiki/Special:Search/filetype:multimedia should not be classified as "multimedia" but as "audio". You should find plenty of examples there, plus the specific example given in the task description.

brion added a subscriber: brion.

Adding TimedMediaHandler tag as the determination will be in there somewhere.

Change 332072 had a related patch set uploaded (by TheDJ):
Special case opus mime detction

https://gerrit.wikimedia.org/r/332072

TheDJ added a subscriber: TheDJ. · Jan 14 2017, 8:22 PM

This is because we don't actually know the difference between these files; they are all mapped to application/ogg. We really should figure out a way to stop guessing MIME types for A/V and instead analyze the actual file and return a proper MIME type.

For now, we can add one more hack...
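
As a rough illustration of what "analyze the actual file" could mean for Ogg, the sketch below walks the beginning-of-stream pages at the start of a file and checks each first packet for the well-known codec signatures (`OpusHead`, `\x01vorbis`, `\x80theora`). It is written in Python for brevity and is not the merged patch; the names `sniff_ogg_codecs` and `make_bos_page` are hypothetical, and the demo pages carry no valid CRC.

```python
# Codec identification strings found at the start of the first packet
# of each logical stream in an Ogg file.
MAGIC = {
    b"OpusHead": "opus",
    b"\x01vorbis": "vorbis",
    b"\x80theora": "theora",
}


def sniff_ogg_codecs(data):
    """Return codec names found in the leading BOS pages of an Ogg file."""
    codecs = []
    pos = 0
    # An Ogg page header is 27 bytes followed by a segment table.
    while data[pos:pos + 4] == b"OggS" and len(data) >= pos + 27:
        header_type = data[pos + 5]
        n_segments = data[pos + 26]
        segment_table = data[pos + 27:pos + 27 + n_segments]
        body_start = pos + 27 + n_segments
        body_len = sum(segment_table)
        if not header_type & 0x02:
            # BOS pages are grouped at the start; stop at the first other page.
            break
        body = data[body_start:body_start + body_len]
        for magic, name in MAGIC.items():
            if body.startswith(magic):
                codecs.append(name)
        pos = body_start + body_len
    return codecs


def make_bos_page(payload):
    """Demo helper: build a minimal BOS page (zeroed CRC, payload < 255 bytes)."""
    header = b"OggS" + b"\x00" + b"\x02" + bytes(20) + bytes([1, len(payload)])
    return header + payload


print(sniff_ogg_codecs(make_bos_page(b"OpusHead" + bytes(11))))
```

A real implementation would also need to verify checksums and handle truncated or chained files; this only shows that the codec can be read from the file itself rather than guessed from application/ogg.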

TheDJ moved this task from To sort to Doing on the TimedMediaHandler board. · Jan 14 2017, 8:49 PM

@EBernhardson @Smalyshev Is this something we could review, since it touches on our work?

Change 332072 merged by jenkins-bot:
Special case opus mime detction

https://gerrit.wikimedia.org/r/332072

Change 332640 had a related patch set uploaded (by Brion VIBBER):
Add test case for Opus file check

https://gerrit.wikimedia.org/r/332640

Change 332640 merged by jenkins-bot:
Add test case for Opus file check

https://gerrit.wikimedia.org/r/332640

Quarry call to list these types of files:
https://quarry.wmflabs.org/query/15764

TheDJ added a comment. · May 20 2017, 5:20 PM

We should probably run a bot to purge out the remaining metadata: https://quarry.wmflabs.org/query/15764

brion added a comment. · May 20 2017, 9:00 PM

Note that requeueTranscodes used for the formats cleanup won't redo the media type in the image table; it just does transcodes. I'll check if there's anything suitable in core to rerun them or if I need to add a maint script.

brion added a comment. · May 20 2017, 9:05 PM

This should work:

maintenance/refreshImageMetadata.php --wiki=xyz --force --mime=application/ogg --verbose

I'll do a test run.

brion added a comment. · May 20 2017, 9:16 PM

Seems to work (quarry shows the old data tho). Went ahead and ran on terbium:

mwscript maintenance/refreshImageMetadata.php --wiki=commonswiki --force --mediatype=MULTIMEDIA --verbose 2>&1 | tee refresh-image-metadata.log

Finished refreshing file metadata for 327 files. 0 needed to be refreshed, 327 did not need to be but were refreshed anyways, and 3 refreshes were suspicious.

Does this need to be run on all wikis?

brion added a comment. · May 20 2017, 9:37 PM

(Now running on all wikis :D)

So... this is done, but the search links in the task summary still return the old data, probably due to CirrusSearch index being out of date? Do the pages need to be refreshed somehow in the search index or will this 'just happen' after some time?

TheDJ awarded a token. · May 21 2017, 8:59 AM

It'll happen over time as pages get re-parsed. I purged the cache of the two files in the example queries in the description; after a few minutes, the first query correctly returned two results and the second correctly returned none. :-)

A reindex would speed this process up, but reindexing takes effort and is somewhat error prone. In this case, it does not seem urgent enough to warrant the risks and time that come with a reindex.

Thanks for fixing this, @brion!

Deskana closed this task as Resolved. · May 21 2017, 10:30 AM
Deskana assigned this task to brion.