Embed image author, description, and copyright data in file metadata fields
Open, Low, Public

Assigned To

None

Authored By

	• bzimport
	Sep 4 2005, 10:33 PM

Description

Image files downloaded from Wikimedia sites do not contain any information about the images (author, license, description page URL, etc.). This makes it hard to identify the source or the author in certain contexts (e.g. an image reused on the web without proper attribution); arguably it makes certain ways in which Wikimedia publishes these files (such as image tarballs) license violations. The situation could be improved by embedding this information into the file as metadata (e.g. EXIF fields).

This is tricky as it would mean that the image needs to change on upload and potentially every time someone makes an edit to the file page; or we would have to make the original image hard to access and instead offer a modified version (a kind of full-size thumbnail - see T67383) for download/view/export.

See also:

T2657 - using metadata in the opposite direction
T20871 - the same issue for thumbnails
T95217 - same issue for audio files

(Original reporter: reflection)

Details

Reference: bz3361

Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T87268 Copyright license and attribution issues (tracking)
		Open		None	T5361 Embed image author, description, and copyright data in file metadata fields

Event Timeline

• bzimport raised the priority of this task to Low. Nov 21 2014, 8:49 PM

• bzimport added projects: MediaWiki-Uploading, TestMe.

• bzimport set Reference to bz3361.

• bzimport added a subscriber: Unknown Object (MLST).

• bzimport created this task. Sep 4 2005, 10:33 PM

reflection wrote:

Corollary to bug 657: http://bugzilla.wikimedia.org/show_bug.cgi?id=657

(In reply to alterego from comment #0)

an image dump contains no metadata concerning any of the images

What are ways to reproduce the problem nowadays? How to get an "image dump" in 2014, so to say? Is this still a problem?

reflection wrote:

Do the EXIF data about images contain copyright information etc? If not, the bug should be left open, and probably elevated in importance.

Errm, I'm a bit confused by the counter questions.
Could you answer comment 2, please?

Plus this is de-facto low priority and not planned for a future release, until somebody provides a patch. Resetting Target Milestone and priority to previous values which seem more realistic.

OK, to clarify:
* Files are not stored by their md5sum; it's the md5sum of the file *name*. Deleted files do use their sha1 sum as file name.
* However, we still make the assumption pretty much everywhere that each version of the file has a constant sha1 sum / is bit-for-bit identical. So any change must be a reupload.
* The file versioning code is not well adapted to having an excessively large number of versions of a file. (If an edit -> pseudo new upload, it would probably explode if someone made 5000 edits, especially to a large file.)
* To do this automatically (or perhaps to have malleable metadata included with the dump), it might be easier once Wikidata hits Commons.
* The most likely solution, at least in the meantime, I think would be to have an extension hook up to exiftool, which allows people to modify EXIF on the server side, triggering an upload (perhaps with a button to import data from the wiki page). This wouldn't be as quick as total automation, but would be something, and more easily turned off if there is an issue.
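To illustrate the first two points, here is a minimal Python sketch (not MediaWiki's actual code). It assumes the usual hashed upload layout, where the directory is derived from the md5 of the file name with spaces normalized to underscores; the helper names are made up for the example.

```python
import hashlib


def hashed_upload_path(file_name: str) -> str:
    """Relative storage path derived from the md5 of the file *name*."""
    # Assumption: spaces are normalized to underscores before hashing.
    key = file_name.replace(" ", "_")
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    # First hex digit, then first two hex digits, then the name itself.
    return f"{digest[0]}/{digest[:2]}/{key}"


def content_sha1(data: bytes) -> str:
    """sha1 of the file *contents*, which is assumed constant per version."""
    return hashlib.sha1(data).hexdigest()


original = b"...image bytes..."
rewritten = original + b"...embedded license metadata..."

# The path does not change when the bytes change, but the sha1 does,
# which is why embedding metadata amounts to a reupload.
path = hashed_upload_path("Example image.jpg")
changed = content_sha1(original) != content_sha1(rewritten)
```

Rewriting metadata leaves the storage path alone but changes the content hash, so every metadata edit would look like a new file version.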

Re Andre: well, we don't really have image dumps anymore (afaik, which is sad), but the bug equally applies to people reusing our images in any form, or just wget'ing them off the server. The original poster wants the data from the image wiki page to be directly embedded in the file so that the data cannot be separated from the file (without malicious intent), whereas currently it's common for reusers to lose this data if they don't care.

I agree this would be nice, think it may be difficult to do (fully) given our current infrastructure, and ultimately is a low priority compared to other more pressing issues we have with media files.

(In reply to Bawolff (Brian Wolff) from comment #5)

I agree this would be nice, think it may be difficult to do (fully)

Difficult as in time-consuming, or as in really complex? I'm wondering whether this could become a GSoC project idea one day (not in the current round).

Very complex to do it fully (The original request of auto recording edits into image metadata). Doing it in a somewhat superficial manner (Just having an on-wiki interface to edit metadata) might potentially be gsoc worthy (Kind of like a continuation of my gsoc project from 2010)

• Gilles added a project: Multimedia. Nov 24 2014, 3:41 PM

Tgr mentioned this in T20871: Include at least some EXIF metadata in resized pictures. Jun 5 2015, 5:58 PM

T95217 is the audio version of this task.

Restricted Application added subscribers: Steinsplitter, Matanya, Aklapper. Jul 31 2015, 6:57 PM

Tgr renamed this task from Image author, description, and copyright data saved in EXIF fields to Embed image author, description, and copyright data in file metadata fields. Jul 31 2015, 7:17 PM

Tgr updated the task description.

Tgr set Security to None.

Tgr edited projects, added MediaWiki-File-management; removed MediaWiki-Uploading. Jul 31 2015, 7:21 PM

Tgr updated the task description.

Ijon subscribed. Jul 31 2015, 7:50 PM

Ricordisamoa subscribed. Aug 1 2015, 5:24 PM

ThurnerRupert subscribed. Aug 2 2015, 7:23 PM

• Tbayer subscribed. Aug 30 2015, 8:50 PM

Jdforrester-WMF moved this task from Untriaged to Backlog on the Multimedia board. Sep 4 2015, 6:03 PM

in germany it is common to send cease and desist letters which cost 500-1000 euro each. a couple of contributors showed up like this, one discussion here: https://commons.wikimedia.org/wiki/Commons:Administrators%27_noticeboard/Archive_53#Legal_action_resulting_from_photographs_by_Haraldbischoff

interesting is the numbers the club against cease and desist letters (interessensgemeinschaft gegen den abmahnwahn, http://www.iggdaw.de/) presents: 200'000 cases a year with a value requested 165 million euro. cc-by cases are only a low percentage.

putting the license in the metadata would allow adjusting the toolchain afterwards. e.g. make an announcement so image programs can keep this information, or web browsers get an option to display the data, print programs get an option to include it automatically, etc.

In T5361#77302, @Bawolff wrote:

Doing it in a somewhat superficial manner (Just having an on-wiki interface to edit metadata) might potentially be gsoc worthy (Kind of like a continuation of my gsoc project from 2010)

Adding Possible-Tech-Projects, then.

In T5361#1633637, @ThurnerRupert wrote:

in germany it is common to send cease and desist letters which cost 500-1000 euro each. a couple of contributors showed up like this, one discussion here: https://commons.wikimedia.org/wiki/Commons:Administrators%27_noticeboard/Archive_53#Legal_action_resulting_from_photographs_by_Haraldbischoff

interesting is the numbers the club against cease and desist letters presents: 200'000 cases a year with a value requested 165 million euro. cc-by cases are only a low percentage.

putting the license in the metadata would allow adjusting the toolchain afterwards. e.g. make an announcement so image programs can keep this information, or web browsers get an option to display the data, print programs get an option to include it automatically, etc.

Effectively leveraging metadata for use with non-specialist users is hard. Lots of people have tried to adjust the /general tool chain/ and usually met with only limited success (see semantic web, microformats, etc. For an example more relevant to this bug, see http://commonsmachinery.se/ ).

That's not to say we shouldn't be doing much better on embedding data; we should be better. Our entire approach to media metadata on Commons and in MediaWiki generally is extremely haphazard and, well, sucks. But I just want to caution you about being too optimistic. Even if we fix this bug, there is a long, long, long way to go to solving the types of problems you want to solve.

Shrutika719 subscribed. Sep 13 2015, 12:17 PM

This is a message posted to all tasks under "Backlog" at Possible-Tech-Projects. Outreachy-Round-11 is around the corner. If you want to propose this task as a featured project idea, we need a clear plan with community support, and two mentors willing to support it.

This is a message sent to all Possible-Tech-Projects. The new round of Wikimedia Individual Engagement Grants is open until 29 Sep. For the first time, technical projects are within scope, thanks to the feedback received at Wikimania 2015, before, and after (T105414). If someone is interested in obtaining funds to push this task, this might be a good way.

Iamneha subscribed. Oct 1 2015, 10:49 AM

Qgil moved this task from Backlog to Need Discussion on the Possible-Tech-Projects board. Oct 5 2015, 11:38 AM

IMPORTANT: This is a message posted to all tasks under "Need Discussion" at Possible-Tech-Projects. Wikimedia has been accepted as a mentor organization for GSoC '16. If you want to propose this task as a featured project idea, we need a clear plan with community support, and two mentors willing to support it.

Restricted Application added a project: Commons. Mar 1 2016, 5:37 PM

• ZhouZ subscribed. Apr 19 2016, 6:43 PM

Restricted Application added subscribers: TerraCodes, Poyekhali. Apr 19 2016, 6:43 PM

• ZhouZ added a project: WMF-Legal. Apr 19 2016, 6:43 PM

Restricted Application added a subscriber: JEumerus. Apr 19 2016, 6:43 PM

• ZhouZ moved this task from Backlog to Assigned on the WMF-Legal board. Apr 19 2016, 6:43 PM

why not start with something easy? when making a thumbnail for wikipedia, leave exif in place?

In T5361#2295848, @ThurnerRupert wrote:

why not start with something easy? when making a thumbnail for wikipedia, leave exif in place?

If you leave the entire EXIF in place you can get very large files. Sometimes EXIF metadata can be larger than the entire rest of the thumbnail (especially when it starts to have embedded thumbnails inside it). Leaving EXIF in would actually cause a significant increase in file size.

interesting point, what exif fields would be necessary to get the copyright ok? or add the copyright related fields of xmp?

Could this project be a good candidate for the current Outreachy-13 internship ( Dec 6 to March 6 )?
I had recently seen an RFC - T589 related to images. Are the two related?

In T5361#2627520, @Sumit wrote:

Could this project be a good candidate for the current Outreachy-13 internship ( Dec 6 to March 6 )?

If someone was willing to mentor it, probably. You would need to clarify what the current status is of thumbor and how it fits into this.

I had recently seen an RFC - T589 related to images. Are the two related?

Nope.

In T5361#2296425, @ThurnerRupert wrote:

interesting point, what exif fields would be necessary to get the copyright ok? or add the copyright related fields of xmp?

Bare minimum, that would probably be the "Artist" field and the "Copyright" field of Exif. (ImageDescription sometimes has some copyright-related info too, but it's not as critical.) [Part of the problem here is that ImageMagick doesn't really have fine-grained options for what fields to keep, if I remember correctly, although that's not something you should take my word for.]

However, most guides for how to mark your image as Creative Commons strongly suggest adding XMP metadata, so for properly maintaining copyright info for freely licensed works, it's definitely a plus to keep at least those XMP fields.
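As a rough illustration of what writing those fields could look like with exiftool, here is a small Python sketch that only builds the command line. The function name, the example values, and the particular choice of XMP tags (dc:Creator, dc:Rights, cc:License) are assumptions made for the example, not something decided in this task.

```python
from typing import List


def build_exiftool_args(path: str, artist: str, notice: str,
                        license_url: str) -> List[str]:
    """Build an exiftool command that writes attribution fields into a file."""
    return [
        "exiftool",
        # The two Exif fields named above as the bare minimum.
        f"-EXIF:Artist={artist}",
        f"-EXIF:Copyright={notice}",
        # Assumed XMP counterparts, as suggested by Creative Commons guides.
        f"-XMP-dc:Creator={artist}",
        f"-XMP-dc:Rights={notice}",
        f"-XMP-cc:License={license_url}",
        path,
    ]


args = build_exiftool_args(
    "Example.jpg",
    "Jane Doe",
    "CC BY-SA 3.0, Jane Doe",
    "https://creativecommons.org/licenses/by-sa/3.0/",
)
# To actually run it (requires exiftool to be installed):
#   import subprocess; subprocess.run(args, check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with author names or descriptions that contain spaces or quotes.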

CCing @Gilles to weigh in ref Thumbor.

The current Thumbor implementation, due to roll out for all thumbnail traffic by the end of the year, has selective EXIF filtering for thumbnails. While implementing the same thing in Mediawiki is worthwhile for Mediawiki itself, it soon won't be of any use for Wikimedia.

The idea of populating EXIF fields based on wiki metadata is still something worth looking into, but it's much more challenging as a project, imho.
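For readers unfamiliar with the idea, selective filtering can be sketched as a simple allowlist. This is not Thumbor's actual code; the tag names in the allowlist and the dictionary representation of the metadata are assumptions for the example.

```python
# Hypothetical allowlist: only attribution-related tags survive thumbnailing.
KEEP_TAGS = frozenset({"Artist", "Copyright", "ImageDescription"})


def filter_exif(tags: dict, keep: frozenset = KEEP_TAGS) -> dict:
    """Return only the tags on the allowlist, dropping everything else."""
    return {name: value for name, value in tags.items() if name in keep}


source_tags = {
    "Artist": "Jane Doe",
    "Copyright": "CC BY-SA 3.0",
    "Model": "Some Camera",
    # Embedded previews are what can make EXIF larger than the thumbnail.
    "ThumbnailImage": b"\xff\xd8" + b"\x00" * 20000,
}
thumb_tags = filter_exif(source_tags)
```

An allowlist keeps the thumbnail small by default: any tag nobody thought about, such as an embedded preview, is dropped rather than copied.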

In T5361#2627793, @Gilles wrote:

The idea of populating EXIF fields based on wiki metadata is still something worth looking into, but it's much more challenging as a project, imho.

As long as you don't want to change the original image, just the thumbnails, it doesn't seem that bad. We have an API for getting the wiki metadata; formatting, filtering out HTML, length limiting etc. is nontrivial but not particularly hard.
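A minimal Python sketch of the formatting step, assuming the wiki metadata arrives as small HTML fragments (as it would, for example, from the `extmetadata` property of the imageinfo API, if that is the API meant here). The helper names and the default length limit are made up for the example.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and ignores all tags."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def to_plain_text(html_value: str, max_len: int = 200) -> str:
    """Strip markup, collapse whitespace, and cap the length."""
    parser = _TextOnly()
    parser.feed(html_value)
    parser.close()
    text = " ".join("".join(parser.parts).split())
    if len(text) > max_len:
        # Leave room for the three-character truncation marker.
        text = text[: max_len - 3] + "..."
    return text


artist = to_plain_text('<a href="//example.org/User:Jane">Jane &amp; Co</a>')
```

Here `artist` ends up as plain text suitable for a metadata field. Real input would also need decisions about multiple authors, templates, and languages, which this sketch ignores.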

I don't think that making thumbnail generation dependent on a mediawiki API is a great idea, the whole point of decoupling the thumbnailing infrastructure is to avoid having mediawiki in the mix of actual thumbnail generation. Having an API as a dependency would be a step back in terms of performance and availability of thumbnailing.

I would be more leaning towards storing that data as extra headers for the original in swift. This way thumbor thumbnail generation remains as mediawiki-agnostic as it currently is, it just applies extra data when it finds some in the existing request to read the original from swift.

Mediawiki would be responsible for keeping the extra headers up to date, and filtering the intake of metadata. Which is more in line with the soon-to-be new status quo where mediawiki is still responsible for all the metadata wrangling. The separation of concerns stays the same.
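The header idea could look roughly like the Python sketch below. Swift stores custom object metadata in headers with the `X-Object-Meta-` prefix; the field names and the choice to percent-encode values are assumptions for the example, not an agreed format.

```python
from urllib.parse import quote


def swift_metadata_headers(fields: dict) -> dict:
    """Map attribution fields to Swift custom object metadata headers."""
    headers = {}
    for name, value in fields.items():
        # Percent-encode so non-ASCII author names survive as header values.
        headers[f"X-Object-Meta-{name}"] = quote(value)
    return headers


headers = swift_metadata_headers({
    "Artist": "Jane Doe",
    "License": "CC BY-SA 3.0",
})
```

In this arrangement MediaWiki would set these headers whenever the file page changes, and the thumbnailer would copy whatever it finds into the thumbnail without calling back into MediaWiki.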

@Gilles can we have this or some part of it for an Outreachy-13 internship, and would you be willing to mentor?

Quiddity unsubscribed. Sep 14 2016, 9:10 PM

I don't have time to mentor this at the moment, sorry, but I'm happy to keep commenting and providing feedback if some of it gets picked up for outreachy.

MarkTraceur unsubscribed. Jan 27 2017, 10:22 PM

We're having a new round of GSoC, Outreachy and RGSoC internship. @Gilles do you have time to mentor this, in this term?

I don't, sorry :(

Liuxinyu970226 subscribed. Nov 27 2017, 10:24 AM

another case which could have been solved by this:

https://de.wikipedia.org/wiki/Wikipedia:Urheberrechtsfragen#Hilfe_-_Rechnung_f%C3%BCr_Commons-Foto_erhalten
here the glitch: https://web.archive.org/web/20170821133848/https://2017.langertagderstadtnatur.de/angebote/details/2732 - no link to the license

the upload wizard could guide the user in adding the information. or require the user to add it herself? so we do not have the problem of "original photo does not contain the license information, but then it is added". what do you think?

Removing the Possible-Tech-Projects tag as we are planning to kill it soon! This project does not seem to fit in the Outreach-Programs-Projects category in its current state, so I'm not adding this tag right now!

Qgil unsubscribed. Jul 12 2018, 1:04 PM

• gpaumier unsubscribed. Jul 18 2018, 6:02 PM

Tgr mentioned this in T250317: Add schema.org structured data to images on Commons and Wikipedia to meet Google's requirements. May 29 2020, 9:32 PM

a couple of reports of copyright / copyleft trolling, which could be avoided by such metadata:

cory doctorow, 2022, A Bug in Early Creative Commons Licenses Has Enabled a New Breed of Superpredator, https://doctorow.medium.com/a-bug-in-early-creative-commons-licenses-has-enabled-a-new-breed-of-superpredator-5f6360713299.
2018 reported by martin steiger, cc switzerland: https://steigerlegal.ch/2018/10/15/wladyslaw-sojka-wikipedia-abmahnungen/, image in question, as most cases cc-by-sa 3.0: https://commons.wikimedia.org/wiki/File:Aerial_View_-_Goetheanum1.jpg
