Extract video encoder metadata from WebM videos
Closed, Resolved · Public

Authored By

	Dispenser
	Jun 5 2017, 3:30 AM

Description

The Wikipedia Zero pirates (T129845) have switched from hidden RAR files to blatant copyright violations. Most of the videos come from YouTube, where they can be downloaded in WebM format and easily re-uploaded. Unfortunately, Google does not include any video identifier in the file, but it does list the encoder as Google. This should be included in img_metadata.

Details

	Subject	Repo	Branch	Lines +/-
	Retain some WebM metadata for processing purposes	mediawiki/extensions/TimedMediaHandler	master	+56 -1

Event Timeline

Dispenser created this task. Jun 5 2017, 3:30 AM

Restricted Application added a subscriber: Aklapper. Jun 5 2017, 3:30 AM

A sample from اتحداك تشوف المقطع بدون ما تضحك مقاطع مضحكة جداََ عالم الاندرويد و المجانيات.webm (Identified as a YouTube video)

# hachoir-metadata AQCWqFD9iBU.webm
Common:
- Duration: 10 min 7 sec 44 ms
- Producer: Google
- MIME type: video/webm
- Endianness: Big endian
Video stream:
- Image width: 640 pixels
- Image height: 360 pixels
- Compression: V_VP8
Audio stream:
- Channel: stereo
- Sample rate: 44.1 kHz
- Bits/sample: 32 bits
- Compression: A_VORBIS

Steinsplitter awarded a token. Jun 5 2017, 6:34 AM

zhuyifei1999 added a project: TimedMediaHandler. Jun 5 2017, 12:51 PM

YouTube only outputs MP4 and WebM. The streams are separate audio and video that get combined on the client. They keep some non-DASH streams, but have deleted the HD WebM transcodes, so WebM will always be 360p or lower.

Nick awarded a token. Jun 13 2017, 9:18 PM

zhuyifei1999 subscribed. Aug 28 2017, 7:17 PM

Keegan subscribed. Sep 5 2017, 5:58 PM

Note that legitimate videos come from YouTube pretty frequently...

Metadata extraction, if it's not done already, would need to be done in getid3 I think (the library that we use for fetching stream info), or else in a reimplementation (ugh).

Is there a sample file that's not deleted?

In T167000#3581121, @brion wrote:

Is there a sample file that's not deleted?

https://commons.wikimedia.org/wiki/File:Super_Typhoon_Haiyan_Impacts_the_Philippines.webm

https://commons.wikimedia.org/wiki/File:%28inizio%29_Va,_pensiero_-_Festa_dell%27orgoglio_leghista_a_Bergamo,_10_04_2012.webm contains a lowercase "google" as producer.

So there's a couple of EBML elements in the WebM/Matroska stream that I think could be added to getid3's extraction easily:

MuxingApp (level 2, EBML ID [4D][80])

WritingApp (level 2, EBML ID [57][41])

These are both "Google" in the typhoon example file.
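For anyone who wants to check a file by hand, here is a minimal Python sketch of reading those two elements straight from the EBML structure. It assumes the usual Matroska layout (Segment [18][53][80][67] containing Info [15][49][A9][66], which holds MuxingApp and WritingApp). The function names are made up for illustration; this is not how getid3 does it, and it does no error handling for truncated files.

```python
# Element IDs from the Matroska specification (ID bytes include the length marker).
SEGMENT = 0x18538067
INFO = 0x1549A966
MUXING_APP = 0x4D80
WRITING_APP = 0x5741


def read_vint(buf, pos, keep_marker):
    """Read one EBML variable-length integer starting at buf[pos].

    Returns (value, next_pos, length_in_bytes). Element IDs keep the
    leading marker bit; element sizes have it stripped.
    """
    first = buf[pos]
    length, mask = 1, 0x80
    while length <= 8 and not (first & mask):
        mask >>= 1
        length += 1
    if length > 8:
        raise ValueError("invalid EBML variable-length integer")
    value = first if keep_marker else first & (mask - 1)
    for byte in buf[pos + 1:pos + length]:
        value = (value << 8) | byte
    return value, pos + length, length


def iter_elements(buf, start, end):
    """Yield (element_id, data_start, data_end) for elements in buf[start:end]."""
    pos = start
    while pos < end:
        elem_id, pos, _ = read_vint(buf, pos, keep_marker=True)
        size, pos, size_len = read_vint(buf, pos, keep_marker=False)
        if size == (1 << (7 * size_len)) - 1:
            size = end - pos  # "unknown size": treat as running to the end
        data_end = min(pos + size, end)
        yield elem_id, pos, data_end
        pos = data_end


def find_app_tags(buf):
    """Return MuxingApp / WritingApp strings found in Segment -> Info."""
    names = {MUXING_APP: "MuxingApp", WRITING_APP: "WritingApp"}
    found = {}
    for seg_id, seg_start, seg_end in iter_elements(buf, 0, len(buf)):
        if seg_id != SEGMENT:
            continue
        for info_id, info_start, info_end in iter_elements(buf, seg_start, seg_end):
            if info_id != INFO:
                continue
            for tag_id, start, end in iter_elements(buf, info_start, info_end):
                if tag_id in names:
                    found[names[tag_id]] = buf[start:end].decode("utf-8", "replace")
    return found
```

Calling find_app_tags() on the bytes of a WebM file returns a dict with whichever of the two tags are present. Since one of the samples above has a lowercase "google", any comparison against the result should probably be case-insensitive.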

Also there can be Vorbis comments in the Vorbis audio stream, which lists 'encoder=google' too. :) But that doesn't seem to be exposed to getid3 at the moment and I don't know how hard it would be to integrate the vorbis comment extraction.
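As a rough illustration of what the Vorbis comment extraction would involve, here is a minimal Python sketch that parses a raw Vorbis comment header packet (packet type 3, then "vorbis", a vendor string, and a list of length-prefixed KEY=value entries). It assumes the packet has already been pulled out of the WebM track's codec private data, which is the harder part and is not shown. The function name is hypothetical and nothing here reflects getid3's internals.

```python
import struct


def parse_vorbis_comment_packet(packet):
    """Parse a Vorbis comment header packet into (vendor, comments).

    comments maps lowercased field names to lists of values, since a
    field name may appear more than once and names are case-insensitive.
    """
    if packet[:7] != b"\x03vorbis":
        raise ValueError("not a Vorbis comment header packet")
    pos = 7

    # Vendor string: 32-bit little-endian length, then UTF-8 bytes.
    (vendor_len,) = struct.unpack_from("<I", packet, pos)
    pos += 4
    vendor = packet[pos:pos + vendor_len].decode("utf-8", "replace")
    pos += vendor_len

    # Number of user comments, each stored as length + "KEY=value".
    (count,) = struct.unpack_from("<I", packet, pos)
    pos += 4

    comments = {}
    for _ in range(count):
        (length,) = struct.unpack_from("<I", packet, pos)
        pos += 4
        entry = packet[pos:pos + length].decode("utf-8", "replace")
        pos += length
        key, _sep, value = entry.partition("=")
        comments.setdefault(key.lower(), []).append(value)
    return vendor, comments
```

With a packet like the one in the typhoon file, looking up "encoder" in the returned comments would be where the 'google' value shows up.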

Hrm, those two *should* already be being fetched. Looking as to why they don't show up...

Change 376088 had a related patch set uploaded (by Brion VIBBER; owner: Brion VIBBER):
[mediawiki/extensions/TimedMediaHandler@master] Retain some WebM metadata for processing purposes

https://gerrit.wikimedia.org/r/376088

gerritbot added a project: Patch-For-Review. Sep 5 2017, 8:13 PM

Ok, prior code was removing all the matroska-specific metadata on the MediaWiki side because it was heavy on binary junk, presumably. Patch in https://gerrit.wikimedia.org/r/376088 puts the 'comments' subsection back, which contains the WritingApp and MuxingApp tags which list 'Google'.
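The idea behind the patch can be sketched in a few lines of Python: drop the bulky Matroska-specific data, but keep its 'comments' subsection. The real change is PHP in TimedMediaHandler; the dictionary layout and the function name below are hypothetical stand-ins, not the extension's code.

```python
def prune_matroska(metadata):
    """Return a copy of metadata with only 'comments' kept under 'matroska'.

    'metadata' is a hypothetical stand-in for the stream info array; every
    other key under 'matroska' (binary junk and the like) is dropped.
    """
    pruned = {key: value for key, value in metadata.items() if key != "matroska"}
    comments = metadata.get("matroska", {}).get("comments")
    if comments:
        pruned["matroska"] = {"comments": comments}
    return pruned
```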

Keegan awarded a token. Sep 5 2017, 9:23 PM

Change 376088 merged by jenkins-bot:
[mediawiki/extensions/TimedMediaHandler@master] Retain some WebM metadata for processing purposes

https://gerrit.wikimedia.org/r/376088

ReleaseTaggerBot added a project: MW-1.31-release-notes (WMF-deploy-2017-10-10 (1.31.0-wmf.3)). Oct 3 2017, 7:00 PM

Ramsey-WMF moved this task from Untriaged to Tracking on the Multimedia board. Oct 13 2017, 11:05 PM

This was merged, so ... done? If there are specific things to expose, please open a new bug with details.

File:An Itch in Time.webm is a recent upload with the encoder metadata extracted:

Software used: Google
Software used: Google

I assume it is listed twice: once for the muxing application and again for the writing application.
