Display standard webm metadata/tag
Closed, Resolved, Public

Description

This is a request to extend the metadata parsing of video files to display standard metadata available in webm video files on Commons image pages.

The definitions at https://www.webmproject.org/docs/container/ point to https://matroska.org/technical/specs/tagging/index.html for a listing of standard tags, which should be accepted for display when available. ffmpeg can add metadata in this format when reprocessing from other formats, and EXIF readers will display these tags for webm files, but Commons currently ignores them.

Some tags have immediate and obvious value to Wikimedia Commons and reusers, such as the COPYRIGHT and URL tags, which can help to confirm the status and source of files even when separated from their original web pages or renamed.

As an example, Defending_Our_Future-_Protecting_Humans_and_Animals_from_Antibiotic_Resistance.webm has the following tag name/string pairs embedded, as shown when the file is examined in http://exif.regex.info/exif.cgi:

Tag Name	Tag String
COMMENT	https://commons.wikimedia.org/wiki/User:Fae/Project_list/CDC_videos
URL	https://www.youtube.com/watch?v=5VNIL3gbqfI
PUBLISHER	Centers for Disease Control and Prevention (CDC)
COPYRIGHT	Public Domain
SUBJECT	Antibiotic Resistance
DATE_RELEASED	2019-10-31
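
For anyone wanting to check these tags locally rather than through the web viewer, here is a minimal Python sketch. It assumes `ffprobe` is installed and that it reports the container-level Matroska tags under `format.tags` in its JSON output. The helper names `parse_format_tags` and `read_tags` are made up for illustration and are not part of MediaWiki or TimedMediaHandler.

```python
import json
import subprocess


def parse_format_tags(ffprobe_json: str) -> dict[str, str]:
    """Return container-level tags from ffprobe JSON, keyed by upper-cased name."""
    data = json.loads(ffprobe_json)
    # Assumption: global tags appear under "format" -> "tags".
    tags = data.get("format", {}).get("tags", {})
    return {name.upper(): value for name, value in tags.items()}


def read_tags(path: str) -> dict[str, str]:
    """Run ffprobe on `path` (must be on PATH) and return its format tags."""
    result = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_format", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_format_tags(result.stdout)
```

Calling `read_tags("some_file.webm")` on a file tagged like the one above would be expected to return entries such as `COPYRIGHT` and `URL`, though the exact keys depend on how the file was muxed.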

Event Timeline

When Wikimedia Commons generates alternate transcodes (e.g. converting a WebM audio/video file, VP9/Opus, length 20 s, 1,080 × 1,080 pixels, to a smaller VP9 360P version), different tags are created for the file, and several of the standard Matroska entries are dropped.

For example, Cúbrete la nariz y la boca al toser o estornudar (niños).webm contains the following tag/string pairs, and all of these are lost in the transcode versions:

Tag Name	Tag String
COMMENT	https://commons.wikimedia.org/wiki/User:Fae/Project_list/CDC_videos
PUBLISHER	Centers for Disease Control and Prevention (CDC)
COPYRIGHT	Public Domain
SUBJECT	CDC-TV
DATE_RELEASED	2019-03-06

Raising this as a related concern, discovered today, on the presumption that it does not need a separate ticket as any addition of standard fields should be passed on within the transcode process.
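
To make the loss concrete, a small Python sketch of the comparison: given the tag dictionaries of the original and of a transcode, list the tag names that did not survive. The function name `dropped_tags` is hypothetical, and the transcode tags in the usage example are an assumed illustration, not values read from a real Commons derivative.

```python
def dropped_tags(original: dict[str, str], transcode: dict[str, str]) -> list[str]:
    """Return the sorted names of tags present in the original but not the transcode."""
    return sorted(name for name in original if name not in transcode)


# Tags of the original file, as listed above.
original = {
    "COMMENT": "https://commons.wikimedia.org/wiki/User:Fae/Project_list/CDC_videos",
    "PUBLISHER": "Centers for Disease Control and Prevention (CDC)",
    "COPYRIGHT": "Public Domain",
    "SUBJECT": "CDC-TV",
    "DATE_RELEASED": "2019-03-06",
}

# Assumed example of what a transcode might carry instead.
transcode = {"ENCODER": "example encoder string"}

print(dropped_tags(original, transcode))
```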

TheDJ subscribed.

This has nothing to do with the CommonsMetadata extension. Moving into TimedMediaHandler, which implements the webm parsing and the video transcoding pipeline.

With regard to preserving metadata in the derivative, it should be possible to use

-map_metadata 0

aka, map metadata from the first input file to the output
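
As a rough sketch of where that option would sit, the Python below builds an ffmpeg argument list for a 360p VP9/Opus derivative. This is not the command line TimedMediaHandler actually uses; the function name `build_transcode_command`, the scale filter, and the codec choices are assumptions for illustration. Only `-map_metadata 0` comes from the comment above.

```python
def build_transcode_command(src: str, dst: str, height: int = 360) -> list[str]:
    """Build an ffmpeg argument list that keeps the input's global metadata."""
    return [
        "ffmpeg",
        "-i", src,
        # Copy global metadata from input file 0 to the output.
        "-map_metadata", "0",
        # Scale to the target height; -2 keeps the aspect ratio with an even width.
        "-vf", f"scale=-2:{height}",
        "-c:v", "libvpx-vp9",
        "-c:a", "libopus",
        dst,
    ]


cmd = build_transcode_command("input.webm", "output.360p.webm")
print(" ".join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with file names containing spaces or parentheses.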

Change 977116 had a related patch set uploaded (by Brian Wolff; author: Brian Wolff):

[mediawiki/extensions/TimedMediaHandler@master] Show more metadata on image description page of WebM files

https://gerrit.wikimedia.org/r/977116

Bawolff added subscribers: C.Suthorn, Bawolff.

While looking at a file by @C.Suthorn in the API, I noticed it was full of metadata which we were extracting but not displaying, which seemed sad, so I thought I'd have a go at this task.

As I asked for exactly that a number of times, there may be one or more other phab tasks asking for the feature.

Two notes:

The same is true for ogv, ogg, mpg, mp3, wav, oga, opus, and flac files (and it would be very useful for detecting copyright violations).

The display of metadata in file descriptions is inconsistent: if a PNG and a JPEG file come with the same Exif description field, it gets displayed for one file format but not the other. The same goes for other EXIF/XMP/IPTC fields. Even if a text field is displayed for both file types, an HTML tag may be shown as-is, shown in quoted format, or actually evaluated (anchor tags sometimes show as links and sometimes as text).

And I forgot: webp is also missing display of metadata!

Change 977116 merged by jenkins-bot:

[mediawiki/extensions/TimedMediaHandler@master] Show more metadata on image description page of WebM files

https://gerrit.wikimedia.org/r/977116

Patch is merged, should be live on Commons in about a week.

It's always possible to do more here, but I think we should call this fixed for now.

Metadata in other media formats should go on other tasks.

We probably need to run refreshMetadata on webm files for anything to show up on existing files, but that seems like a follow-up task.

I'm not sure we actually do, as this is just changing how the metadata is processed/shown, but not what is stored in the DB.

Purging the file description pages may be needed?

I'm not sure. There are definitely some short-term caches, but I don't think they last that long. I guess we will find out on Wednesday.