Implement strict mime type detection and media type inferring of audio/video files
Open, Needs Triage, Public

Description

Currently, all Ogg files are registered in the database as application/ogg.

Inside the mime analyzer there is this horrible check to infer the media type for Ogg files:

// Special code for ogg - detect if it's video (theora),
// else label it as sound.
if ( $mime == 'application/ogg' && file_exists( $path ) ) {

        // Read a chunk of the file
        $f = fopen( $path, "rt" );
        if ( !$f ) {
                return MEDIATYPE_UNKNOWN;
        }
        $head = fread( $f, 256 );
        fclose( $f );

        $head = str_replace( 'ffmpeg2theora', '', strtolower( $head ) );

        // This is an UGLY HACK, file should be parsed correctly
        if ( strpos( $head, 'theora' ) !== false ) {
                return MEDIATYPE_VIDEO;
        } elseif ( strpos( $head, 'vorbis' ) !== false ) {
                return MEDIATYPE_AUDIO;
        } elseif ( strpos( $head, 'flac' ) !== false ) {
                return MEDIATYPE_AUDIO;
        } elseif ( strpos( $head, 'speex' ) !== false ) {
                return MEDIATYPE_AUDIO;
        } elseif ( strpos( $head, 'opus' ) !== false ) {
                return MEDIATYPE_AUDIO;
        } else {
                return MEDIATYPE_MULTIMEDIA;
        }
}

This really needs a better, more maintainable long-term implementation.

Event Timeline

TheDJ created this task. Jan 14 2017, 8:37 PM
Restricted Application added a subscriber: Aklapper. Jan 14 2017, 8:37 PM

Adding T103421 as a dependency; the proper way to check is to actually look at the header packets and see what they contain, which the File_Ogg package can do. We need to move our copy out of TMH and use it via Composer in a shared manner.

TheDJ added a comment. Jan 18 2017, 7:15 AM

I believe the problem here is partially that we are not guaranteed to have a Skeleton track on original files, right? Doesn't that make detection more involved/expensive for those files?

brion added a comment. Jan 18 2017, 5:05 PM

Skeleton track, if present, can list the types of the various streams, but yes it's usually not there for audio and not always there for video depending on which software was used to convert. But parsing the skeleton track isn't trivial either -- it's embedded in the Ogg stream multiplexing too -- so we may as well just check the header packets of each stream, which is what File_Ogg does.

If you dive into it you might be horrified to see that it also does substring matches... ;) but it's checking the actual initial header packets, not just 'whatever is in the first 255 bytes' so shouldn't fall prey to false positives (like the hack to avoid matching on comments mentioning ffmpeg2theora)

The whole concept of ogg horrifies me and always has :)
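
For illustration, checking the initial header packets of each stream could look roughly like the sketch below. This is not File_Ogg's code; the function name and the magic table are made up for this example, and it assumes the caller passes the leading bytes of the file (read in binary mode) and that all beginning-of-stream pages come first, as they do in a non-chained Ogg file.

```php
<?php
// Hypothetical helper for illustration; not MediaWiki or File_Ogg code.

// Byte strings that open the first header packet of each codec's stream.
const OGG_HEADER_MAGIC = [
    "\x80theora" => 'theora',
    "\x01vorbis" => 'vorbis',
    "OpusHead" => 'opus',
    "Speex   " => 'speex',
    "\x7FFLAC" => 'flac',
    "fishead\0" => 'skeleton',
];

/**
 * List the codec of every stream announced at the start of an Ogg file.
 *
 * @param string $data Leading bytes of the file
 * @return array Map of stream serial number => codec name
 */
function oggListCodecs( string $data ): array {
    $codecs = [];
    $pos = 0;
    $len = strlen( $data );
    // Each page starts with a 27-byte header beginning with "OggS".
    while ( $pos + 27 <= $len && substr( $data, $pos, 4 ) === 'OggS' ) {
        $flags = ord( $data[$pos + 5] );
        $segments = ord( $data[$pos + 26] );
        if ( $pos + 27 + $segments > $len ) {
            break; // truncated segment table
        }
        // The payload length is the sum of the segment table entries.
        $bodyLen = 0;
        for ( $i = 0; $i < $segments; $i++ ) {
            $bodyLen += ord( $data[$pos + 27 + $i] );
        }
        if ( !( $flags & 0x02 ) ) {
            break; // first page without the beginning-of-stream flag
        }
        $serial = unpack( 'V', substr( $data, $pos + 14, 4 ) )[1];
        $packet = substr( $data, $pos + 27 + $segments, $bodyLen );
        $codecs[$serial] = 'unknown';
        foreach ( OGG_HEADER_MAGIC as $magic => $name ) {
            if ( strncmp( $packet, $magic, strlen( $magic ) ) === 0 ) {
                $codecs[$serial] = $name;
                break;
            }
        }
        $pos += 27 + $segments + $bodyLen;
    }
    return $codecs;
}
```

Because only the start of each beginning-of-stream packet is compared, a comment packet that happens to mention "theora" or "ffmpeg2theora" is never looked at.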

TheDJ added a comment. Edited Jan 25 2017, 1:07 PM

Right, so I think part of the problem here is that this information is simply in the wrong place.

1: We infer the media-type based on the mime-type
2: The mime-type is what controls the handler we use and thus reflects the file's mime-type (application/ogg)
3: In the 'frontend', however, we are often interested in an alias mime-type (audio/vorbis instead of application/ogg)
4: The alias mime-type is based on the codecs and type of tracks used inside the file
5: The alias mime-type can thus only be accurately given by the file handler.
6: The proper media-type can thus only really be determined by the handler.
7: We have a 'hack' in MimeAnalyzer that finds an 'alias' based on the file extension, but not all files use separate file extensions for these alias mime-types.
8: Our Swift filebackend (which serves the frontend) ignores most of this and does its own mime-type detection. See T131012: SVGs without XML prolog (<?xml ... ?>) served with Content-Type text/html from upload.wikimedia.org
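
To make points 4 to 6 concrete, here is a rough sketch of deriving both values from the codecs a handler found in the file. The function name is hypothetical, the strings 'VIDEO', 'AUDIO' and 'MULTIMEDIA' merely stand in for the MEDIATYPE_* constants, and using video/ogg and audio/ogg (the RFC 5334 types) as the aliases is an assumption for the example, not a decision.

```php
<?php
// Hypothetical helper; the codec list is assumed to come from the file handler.

/**
 * Derive a media type and an alias mime-type from the codecs in an Ogg file.
 *
 * @param string[] $codecs Codec names such as 'theora' or 'vorbis'
 * @return array With keys 'media_type' and 'mime_type'
 */
function oggDescribe( array $codecs ): array {
    $video = [ 'theora' ];
    $audio = [ 'vorbis', 'flac', 'speex', 'opus' ];
    // Any video track makes the whole file a video.
    if ( array_intersect( $codecs, $video ) ) {
        return [ 'media_type' => 'VIDEO', 'mime_type' => 'video/ogg' ];
    }
    if ( array_intersect( $codecs, $audio ) ) {
        return [ 'media_type' => 'AUDIO', 'mime_type' => 'audio/ogg' ];
    }
    // Nothing recognised: fall back to the container type.
    return [ 'media_type' => 'MULTIMEDIA', 'mime_type' => 'application/ogg' ];
}
```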

Questions:
1: What is the intended purpose of media_type? We need to clarify this.
2: Do we actually need this information in the database? Maybe it's enough if the file handler can provide it? Maybe it should be in a separate table for caching purposes? This question is also raised in T589: RFC: image and oldimage tables.
3: Should we make the Handler (through the File interface) the canonical provider of mime and media-type information, instead of MimeAnalyzer filling the database with guesses?
4: Or are we going to create complex integrations directly in MimeAnalyzer with the handlers and/or their format libraries, so that they provide the desired information directly to the Analyzer? That would add a lot of extra complexity to this relatively simple class.
5: How does this mix with the more advanced mime inspection that the file command-line utility does?

Quoting brion: "If you dive into it you might be horrified to see that it also does substring matches... ;) but it's checking the actual initial header packets, not just 'whatever is in the first 255 bytes' so shouldn't fall prey to false positives (like the hack to avoid matching on comments mentioning ffmpeg2theora)"

If you mean a scan for the string "OggS", note that the Ogg spec has a resynchronisation feature which requires scanning the data for page headers, identified by that string, which is called the capture pattern. "The bitstream is captured (or recaptured) by looking for the beginning of a page, specifically the capture pattern. Once the capture pattern is found, the decoder verifies page sync and integrity by computing and comparing the checksum. At that point, the decoder can extract the packets themselves." https://www.xiph.org/vorbis/doc/framing.html
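
As a sketch of what the quoted passage describes: scan for the capture pattern, then accept the page only if its checksum matches. The function names are hypothetical. The CRC parameters (generator polynomial 0x04c11db7, initial value and final XOR of zero, computed over the page with the checksum field zeroed) are taken from the framing spec linked above; the code assumes a 64-bit PHP build so the 32-bit masks stay integers.

```php
<?php
// Hypothetical helpers for illustration; assumes 64-bit PHP integers.

/** CRC-32 as used by Ogg: polynomial 0x04c11db7, no reflection, init 0. */
function oggCrc32( string $bytes ): int {
    static $table = null;
    if ( $table === null ) {
        $table = [];
        for ( $i = 0; $i < 256; $i++ ) {
            $r = $i << 24;
            for ( $j = 0; $j < 8; $j++ ) {
                $r = ( $r & 0x80000000 ) ? ( ( $r << 1 ) ^ 0x04C11DB7 ) : ( $r << 1 );
                $r &= 0xFFFFFFFF;
            }
            $table[$i] = $r;
        }
    }
    $crc = 0;
    $n = strlen( $bytes );
    for ( $i = 0; $i < $n; $i++ ) {
        $index = ( ( $crc >> 24 ) & 0xFF ) ^ ord( $bytes[$i] );
        $crc = ( ( $crc << 8 ) & 0xFFFFFFFF ) ^ $table[$index];
    }
    return $crc;
}

/**
 * Find the offset of the first complete page whose checksum verifies.
 *
 * @return int|null Offset of the page, or null if none is found
 */
function oggFindValidPage( string $data, int $offset = 0 ): ?int {
    $len = strlen( $data );
    while ( ( $pos = strpos( $data, 'OggS', $offset ) ) !== false ) {
        // Resume one byte further on if this candidate is rejected.
        $offset = $pos + 1;
        if ( $pos + 27 > $len ) {
            continue; // header does not fit
        }
        $segments = ord( $data[$pos + 26] );
        if ( $pos + 27 + $segments > $len ) {
            continue; // segment table does not fit
        }
        $bodyLen = 0;
        for ( $i = 0; $i < $segments; $i++ ) {
            $bodyLen += ord( $data[$pos + 27 + $i] );
        }
        $pageLen = 27 + $segments + $bodyLen;
        if ( $pos + $pageLen > $len ) {
            continue; // payload does not fit
        }
        $page = substr( $data, $pos, $pageLen );
        // The checksum is stored little-endian in bytes 22..25.
        $stored = unpack( 'V', substr( $page, 22, 4 ) )[1];
        $zeroed = substr_replace( $page, "\0\0\0\0", 22, 4 );
        if ( oggCrc32( $zeroed ) === $stored ) {
            return $pos;
        }
    }
    return null;
}
```

A stray "OggS" inside packet data or a damaged page fails the checksum comparison and the scan simply continues, which is what makes the substring search safe here.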

TheDJ added a comment. Edited Jul 4 2017, 9:35 AM

This mime-type thing keeps bothering me. I was thinking the following:

  1. make mime_type the actual 'frontend' desired mimetype of a specific file.
  2. add a new column filehandler_id
  3. add filehandler_id's to our FileHandlers
  4. add a scan method to file handlers that detects magic bytes and returns true or false depending on whether the handler can handle the file.
  5. when ingesting files, scan with each file handler to find the right file handler
  6. add a 'getMimeType' to file handlers to return the correct mime type for a file.
  7. register the filehandler_id in the database for fast instantiation of the file handler, and store the mime-type alias in the database for fast serving of the correct mime_type.
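
A minimal sketch of how steps 3 to 7 might fit together. Every name here (the interface, the two handlers, sketchIngest) is hypothetical and not existing MediaWiki API; the Ogg handler only inspects the first stream's header packet, which is a deliberate simplification, and the returned array stands in for the values that would be written to the database.

```php
<?php
// Hypothetical names throughout; none of this is existing MediaWiki API.

interface SketchFileHandler {
    /** Stable identifier, i.e. the proposed filehandler_id. */
    public function getId(): string;

    /** Step 4: check the magic bytes at the start of the file. */
    public function canHandle( string $head ): bool;

    /** Step 6: the mime-type the frontend should serve. */
    public function getMimeType( string $head ): string;
}

class SketchPngHandler implements SketchFileHandler {
    public function getId(): string {
        return 'png';
    }

    public function canHandle( string $head ): bool {
        return strncmp( $head, "\x89PNG\r\n\x1a\n", 8 ) === 0;
    }

    public function getMimeType( string $head ): string {
        return 'image/png';
    }
}

class SketchOggHandler implements SketchFileHandler {
    public function getId(): string {
        return 'ogg';
    }

    public function canHandle( string $head ): bool {
        return strncmp( $head, 'OggS', 4 ) === 0;
    }

    public function getMimeType( string $head ): string {
        // Simplification: only the first stream's header packet is inspected.
        $segments = strlen( $head ) > 26 ? ord( $head[26] ) : 0;
        $packet = substr( $head, 27 + $segments, 7 );
        return $packet === "\x80theora" ? 'video/ogg' : 'audio/ogg';
    }
}

/**
 * Step 5: ask each handler in turn; steps 2 and 7: return what would be stored.
 *
 * @param string $head Leading bytes of the uploaded file
 * @param SketchFileHandler[] $handlers
 * @return array|null Row values, or null if no handler claims the file
 */
function sketchIngest( string $head, array $handlers ): ?array {
    foreach ( $handlers as $handler ) {
        if ( $handler->canHandle( $head ) ) {
            return [
                'filehandler_id' => $handler->getId(),
                'mime_type' => $handler->getMimeType( $head ),
            ];
        }
    }
    return null;
}
```

With this shape MimeAnalyzer no longer needs format-specific knowledge; the cost is that ingestion calls canHandle on every registered handler until one matches.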

Maybe simply base the whole MediaHandler on ContentHandler concepts?

Restricted Application added projects: Multimedia, Commons. Jul 25 2017, 5:49 PM
Ramsey-WMF moved this task from Untriaged to Tracking on the Multimedia board. Nov 28 2017, 8:42 PM