Extract video encoder metadata from WebM videos
Closed, Resolved, Public

Description

The Wikipedia Zero pirates (T129845) have switched from hidden RAR files to blatant copyright violations. Most of the videos come from YouTube, where they can be downloaded in WebM format and easily re-uploaded. Unfortunately, Google does not include any video identifier in the file, but it does list the encoder as "Google". This should be included in img_metadata.

Event Timeline

A sample from اتحداك تشوف المقطع بدون ما تضحك مقاطع مضحكة جداََ عالم الاندرويد و المجانيات.webm (Identified as a YouTube video)

# hachoir-metadata AQCWqFD9iBU.webm
Common:
- Duration: 10 min 7 sec 44 ms
- Producer: Google
- MIME type: video/webm
- Endianness: Big endian
Video stream:
- Image width: 640 pixels
- Image height: 360 pixels
- Compression: V_VP8
Audio stream:
- Channel: stereo
- Sample rate: 44.1 kHz
- Bits/sample: 32 bits
- Compression: A_VORBIS

YouTube only outputs MP4 and WebM. The streams are separate audio and video, combined on the client. They keep some non-DASH streams, but have deleted the HD WebM transcodes, so a WebM download will always be 360p or lower.

Note that legitimate videos come from YouTube pretty frequently...

Metadata extraction, if it's not done already, would I think need to be done in getid3 (the library that we use for fetching stream info), or else in a reimplementation (ugh).

Is there a sample file that's not deleted?

So there are a couple of EBML elements in the WebM/Matroska stream that I think could be added to getid3's extraction easily:

MuxingApp (level 2, EBML ID [4D][80])

WritingApp (level 2, EBML ID [57][41])

These are both "Google" in the typhoon example file.

Also there can be Vorbis comments in the Vorbis audio stream, which list 'encoder=google' too. :) But that doesn't seem to be exposed to getid3 at the moment, and I don't know how hard it would be to integrate the Vorbis comment extraction.
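
For anyone who wants to check a file by hand, here is a rough sketch of pulling MuxingApp and WritingApp straight out of the EBML structure. It is not getid3 or TimedMediaHandler code. It is a standalone Python illustration, and the function name `read_apps` is made up for this example. The Segment and Info element IDs are taken from the Matroska element list as I understand it. It only descends Segment > Info and skips everything else.

```python
# Illustrative only: a tiny EBML walker, not the getid3 implementation.
SEGMENT = 0x18538067      # Matroska Segment (assumed from the Matroska element list)
INFO = 0x1549A966         # Segment Info (assumed from the Matroska element list)
NAMES = {0x4D80: "MuxingApp", 0x5741: "WritingApp"}


def _read_vint(buf, pos, keep_marker):
    """Read one EBML variable-length integer; return (value, next_pos)."""
    if pos >= len(buf) or buf[pos] == 0:
        raise ValueError("truncated or unsupported EBML integer")
    first = buf[pos]
    # The count of leading zero bits in the first byte gives the total length.
    length = 8 - first.bit_length() + 1
    if pos + length > len(buf):
        raise ValueError("truncated EBML integer")
    # Element IDs keep the length marker bit; sizes drop it.
    value = first if keep_marker else first & ((1 << (8 - length)) - 1)
    for byte in buf[pos + 1:pos + length]:
        value = (value << 8) | byte
    return value, pos + length


def read_apps(data):
    """Return {'MuxingApp': ..., 'WritingApp': ...} for whichever are present."""
    found = {}

    def walk(pos, end):
        while pos < end and len(found) < len(NAMES):
            elem_id, pos = _read_vint(data, pos, keep_marker=True)
            size_start = pos
            size, pos = _read_vint(data, pos, keep_marker=False)
            width = pos - size_start
            # All size bits set means "unknown size": run to the parent's end.
            if size == (1 << (7 * width)) - 1:
                size = end - pos
            body_end = min(pos + size, end)
            if elem_id in (SEGMENT, INFO):
                walk(pos, body_end)
            elif elem_id in NAMES:
                text = data[pos:body_end].decode("utf-8", "replace")
                found[NAMES[elem_id]] = text.rstrip("\x00")
            pos = body_end

    walk(0, len(data))
    return found
```

Something like `read_apps(open("sample.webm", "rb").read())` should then return a small dict with whichever of the two elements the file carries. The filename here is a placeholder, and reading the whole file into memory is only reasonable for small samples.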

Hrm, those two *should* already be being fetched. Looking as to why they don't show up...

Change 376088 had a related patch set uploaded (by Brion VIBBER; owner: Brion VIBBER):
[mediawiki/extensions/TimedMediaHandler@master] Retain some WebM metadata for processing purposes

https://gerrit.wikimedia.org/r/376088

OK, the prior code was removing all the Matroska-specific metadata on the MediaWiki side, presumably because it was heavy on binary junk. The patch in https://gerrit.wikimedia.org/r/376088 puts the 'comments' subsection back, which contains the WritingApp and MuxingApp tags that list 'Google'.

Change 376088 merged by jenkins-bot:
[mediawiki/extensions/TimedMediaHandler@master] Retain some WebM metadata for processing purposes

https://gerrit.wikimedia.org/r/376088

brion claimed this task.

This was merged, so ... done? If there are specific things to expose, open a new bug with details.

File:An Itch in Time.webm is a recent upload with the encoder metadata extracted:

Software used: Google
Software used: Google

I assume it is listed twice: once for the muxing application and again for the writing application.