
Add Structured data during file upload with Upload Wizard
Open, Low, Public


Many SDC properties should be added to the file during the upload process with Upload Wizard.

At the moment, Upload Wizard is hardwired to add several Commons templates, such as {{Information}}, {{Own}}, {{FlickrVerifiedByUploadWizard}}, {{Location}}, and {{cc-by-4.0}}. Most of those should also be matched to specific SDC properties, which should be added at upload time. We should have more discussions at c:Commons:Structured_data/Modeling about how to model different cases, but Upload Wizard should at a minimum handle the case of own work by the uploader under a standard CC license, with a standard date and optional camera coordinates and heading.

Initially the information would be stored in both wikitext and SDC, but once {{Information}} is updated we should store such info, for basic cases, in SDC only. The {{Location}} template is already capable of fetching camera coordinates and heading from SDC, so if coordinates are added to SDC, then all that needs to be added to wikitext is an empty "{{Location}}".
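The wikitext side of this could look roughly like the sketch below. It is only an illustration: the function name and the `coords_in_sdc` flag are hypothetical, and the fallback uses only the plain positional latitude/longitude form of {{Location}}, leaving heading out.

```python
from typing import Optional


def location_wikitext(lat: Optional[float], lon: Optional[float],
                      coords_in_sdc: bool) -> str:
    """Return the {{Location}} wikitext for an upload (hypothetical helper)."""
    if lat is None or lon is None:
        # The uploader gave no camera coordinates, so emit nothing.
        return ""
    if coords_in_sdc:
        # The template fetches coordinates and heading from SDC itself.
        return "{{Location}}"
    # Fallback: coordinates live in wikitext only.
    return "{{Location|" + str(lat) + "|" + str(lon) + "}}"
```

With coordinates stored in SDC, `location_wikitext(52.5, 13.4, True)` yields the empty template; with `False` it yields `{{Location|52.5|13.4}}`.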

Examples of SDC properties that could be automatically filled in by UploadWizard:

  • copyright status
  • copyright license
  • inception
  • coordinates of the point of view
  • image captured with
  • source of file: original creation by uploader (for own works)
  • author user name
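As a rough sketch of how the form values could map onto the properties listed above: the `UploadForm` class, its field names, and the plain-string values are assumptions made for illustration. Real statements would use Wikibase property and item IDs following whatever modeling is agreed at c:Commons:Structured_data/Modeling.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class UploadForm:
    """Hypothetical container for what the uploader entered."""
    username: str
    own_work: bool
    license: Optional[str] = None
    date_taken: Optional[str] = None
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    heading: Optional[float] = None
    camera: Optional[str] = None


def sdc_statements(form: UploadForm) -> Dict[str, Any]:
    """Map form values to SDC property labels (labels, not real property IDs)."""
    statements: Dict[str, Any] = {}
    if form.own_work:
        # Illustrative values; the actual modeling is a community decision.
        statements["copyright status"] = "copyrighted"
        statements["source of file"] = "original creation by uploader"
        statements["author user name"] = form.username
    if form.license:
        statements["copyright license"] = form.license
    if form.date_taken:
        statements["inception"] = form.date_taken
    if form.latitude is not None and form.longitude is not None:
        point: Dict[str, float] = {"latitude": form.latitude,
                                   "longitude": form.longitude}
        if form.heading is not None:
            point["heading"] = form.heading
        statements["coordinates of the point of view"] = point
    if form.camera:
        statements["image captured with"] = form.camera
    return statements
```

Fields the uploader left blank produce no statement, so a third-party upload with only a date would yield just an inception statement.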

Event Timeline

Task T243926 already proposes the same for license statements.

Ramsey and Keegan, can you consider this task? It would be nice if we mostly had to worry about adding SDC properties to old files, while new ones are handled automatically at upload time.

My only concern is around CC-0. SDC managed to avoid the political minefield of CC-0 within Commons by making such contributions optional; there are many that are firmly ideologically opposed to CC-0. If I'm reading this proposal right, the UW would automatically generate CC-0 contributions. I'm not 100% sure how well that will go down with the rest of the Commons community.

Do you have any thoughts on how to navigate this issue, @Jarekt? I suppose it could still be made an optional step, but then there's more clutter in the UI...

I don't think many oppose, only a few loud mouths.

The CC-0 issue, I think, applies only to captions. Other metadata, like the author's username, date, or source, does not rise to the level of originality that would warrant any copyright protection. That is my opinion, and I do not know if it would be controversial. I think it would be safe to add SDC statements related to author, source and date, the way Multichill is doing it right now with his bot.

As for the captions, we could ask the user to fill in the caption, and show grayed-out placeholder text in the description field saying to leave it blank if one wants it to be the same as the caption, since that is what happens even with the current version if the description is missing. If one is opposed to CC0, then they can use descriptions instead of captions, but I doubt more than a handful of users would know about those distinctions.

Oh, I forgot about the camera coordinates and heading, which we collect in the Upload Wizard. That data can also be safely stored in SDC, as it would not be copyrighted.

Fair enough, thanks for the responses!

MarkTraceur added a subscriber: MarkTraceur.

As a feature request in a project that is relatively low priority for us right now, marking as low.

That said, I think this is a cool idea and could certainly make an interesting hackathon project or similar if anyone is interested. It also seems to me that, barring the need for Official Legal Advice™, there may be a way forward without causing too much uproar about the licensing of the entered data, possibly with the stipulation that we note a dual-licensing (or whatever) scheme in a message somewhere in the interface.

I am not sure why this is a low priority. We have structured data at upload time (the username of the uploader, whether they claim "own work" or not, geo coordinates, a date picked on the calendar, etc.), then we downgrade it all to unstructured wikitext, and then much later (possibly after many error-prone edits) rely on user-written bots to parse it and recover the original structured data. I know we have to do it the hard way for the old uploads, but for new ones we do not have to rely on this error-prone process and can store the original structured data directly. I do not mind an "interesting hackathon project", but this seems to me to be basic functionality of SDC.

Also, I do not think we need to get "Official Legal Advice™" to store the username of the uploader, geo-coordinates, or the date a photo was taken, as that level of data is not eligible for copyright. Saving it as CC-BY-SA and then parsing the CC-BY-SA string to extract CC-0 components seems less safe.

Ideally, Upload Wizard could allow the Commons community to decide how to store each piece of information in SDC. Maybe store the Upload Wizard values in JSON and allow community-controlled code to create wikitext and SDC data based on it.
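One way to picture that suggestion is the sketch below. The rule format, the field names, and the `render` function are all hypothetical; the only grounded parts are the {{Information}} parameter names `date` and `author` and the SDC property labels from the task description.

```python
import json

# Hypothetical community-editable rules: one entry per upload-form field,
# naming the SDC property label and (optionally) the {{Information}} parameter.
RULES = [
    {"field": "date", "sdc": "inception", "information_param": "date"},
    {"field": "username", "sdc": "author user name", "information_param": "author"},
    {"field": "license", "sdc": "copyright license", "information_param": None},
]


def render(upload_json, rules=RULES):
    """Turn the stored JSON values into (wikitext, SDC statements)."""
    values = json.loads(upload_json)
    statements = {}
    params = []
    for rule in rules:
        value = values.get(rule["field"])
        if value is None:
            continue  # field not filled in by the uploader
        statements[rule["sdc"]] = value
        if rule["information_param"]:
            params.append(f"|{rule['information_param']}={value}")
    wikitext = "\n".join(["{{Information", *params, "}}"])
    return wikitext, statements
```

Because the rules are data rather than code inside the wizard, the community could change where a value ends up (wikitext, SDC, or both) without touching the stored JSON.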

> I don't think many oppose, only a few loud mouths.

We (WMF) have a long history of making the mistake of thinking exactly this (and acting on it).

We’ve actually wanted to do this from the start, but have not pursued this because of the dual licensing.

I’d be delighted if we could implement this, but I think we’re going to need a clearer mandate. Anyone willing to champion this, to the point that we’ve sufficiently addressed those loud mouths’ (valid) concerns?

I've arrived here from the Community Wishlist Survey 2021. Does this issue suggest "extract metadata upon upload and make it part of the file page"? Because if yes, then hell yeah...

> We’ve actually wanted to do this from the start, but have not pursued this because of the dual licensing.

The licensing is not an issue for any data that would be automatically filled in by UploadWizard. None of that data is copyrightable. Bots have been adding this data from the template content since at least January 2020 with no controversy.

We should cater for cases where the uploader is stating "own work", and can tell us that they are the subject of a Wikidata item; or that the work is that of a third party who is the subject of a Wikidata item.

I agree that this should not be a low priority. SDC is a WMF priority, and is supported by the community at large.

This currently makes it hard to migrate various tooling to SDC, as one cannot reliably expect media uploaded with Upload Wizard to have basic SDC data.