Allow structured data to be added via API:Upload
Open, Needs Triage, Public

Description

UploadWizard uploads structured data via one-at-a-time Wikibase API calls after the image itself has been uploaded.

A better user experience would be possible if the structured data were uploaded together with the image in a single request.

The image is uploaded via API:Upload. Add processing of structured data to that API call.

(Once that is added, a separate ticket can be filed to update UploadWizard to use the new functionality.)

Event Timeline

UploadWizard aside, this functionality would also be useful for users who rely on the API for bulk uploads, since it would let them upload a file with structured data attached, in the same way that the unstructured file description is currently included in the same request as the media upload. This would reduce the number of API requests needed to add structured data statements to new uploads.

Just to chime in on this again, as it's no longer hypothetical for me. DPLA_bot on Wikimedia Commons has fully implemented SDC statements, in the sense that it has added SDC statements to over 2 million of its past uploads, and it can also update statements on existing Commons files from the DPLA API data source when changes are made. However, new uploads must still be made with plain wikitext only, and SDC then has to be added as a second step. This is further complicated by the fact that wbeditentity requires the MediaInfo id, which the user has to look up after uploading even if they know the file name they uploaded to. That lookup wouldn't be necessary if the statements could be included in the upload request. It is even more complicated if you are using async, as you can't immediately add the structured data in a subsequent edit without waiting.

A second reason, also an important one for large-scale batch uploads, is that there is some risk in having to complete the upload of a file before you can submit the API request for the structured data: if the second API request returns an error for any reason, you are left with a potentially orphaned/incomplete upload, in terms of the data you wanted to accompany it. (The same applies to any other issue that could interrupt a series of edits, such as server lag, rate limiting, etc.)

This gap between upload and SDC may also cause problems in the growing number of cases where metadata displayed by the wikitext template is derived from SDC. If you rely on SDC to populate your images' copyright information (à la https://commons.wikimedia.org/wiki/Module:License), your uploads will be flagged within minutes as lacking copyright information. In all these cases, the user would probably also prefer that, if their statements were going to be rejected by the API, the upload itself be rejected as well, to avoid this scenario.

Essentially, I think the work this calls for is combining the functionality of the MediaWiki API's upload and wbeditentity actions. In upload, the wikitext of the page is provided in the text parameter. In wbeditentity, the statements are provided in the data parameter. So I think the most practical way for this to work on the front end would be for action=upload to also accept a data parameter, which functions in exactly the same way as wbeditentity's and can be supplied along with text.
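As a rough illustration of that proposal, the sketch below builds the POST fields such a combined request might carry. The data parameter on action=upload is hypothetical (it does not exist today); the helper name, the example file name, and the P180/Q146 statement are illustrative assumptions. The file itself would still be sent as the multipart file field.

```python
import json


def build_combined_upload_params(filename, wikitext, claims, csrf_token):
    """Build POST fields for a HYPOTHETICAL action=upload that also accepts `data`."""
    return {
        "action": "upload",
        "format": "json",
        "filename": filename,
        # Existing parameter: wikitext of the new file description page.
        "text": wikitext,
        # Hypothetical parameter: same JSON shape that wbeditentity takes in `data`.
        "data": json.dumps({"claims": claims}),
        "token": csrf_token,
    }


# Illustrative statement: "depicts" (P180) with an item value.
depicts_statement = {
    "mainsnak": {
        "snaktype": "value",
        "property": "P180",
        "datavalue": {
            "type": "wikibase-entityid",
            "value": {"entity-type": "item", "numeric-id": 146, "id": "Q146"},
        },
    },
    "type": "statement",
    "rank": "normal",
}

params = build_combined_upload_params(
    "Example.jpg", "== {{int:filedesc}} ==", [depicts_statement], "PLACEHOLDER_TOKEN"
)
```

With this shape, a client that already builds a wbeditentity payload could reuse it unchanged for uploads.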

@Dominicbm you bring up a good point that I also had concerns about: if the upload is not synchronous/simultaneous with attaching the structured data, we can end up with orphaned/incomplete uploaded files. Even though we would want batch uploads to be efficient, I think the reality is that a "batch" would be treated as a series of individual file uploads with attached structured data. The whole batch could fail, or only one of the individual files could fail, in which case the batch would be considered partially completed.
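To make the partial-failure cases concrete, here is a minimal sketch of how a client might track per-file outcomes in today's two-step flow. The names Outcome, upload_batch, upload_file and attach_sdc are hypothetical; the two callables stand in for whatever code performs the real API requests and are assumed to raise an exception on failure.

```python
from enum import Enum


class Outcome(Enum):
    COMPLETE = "complete"  # file uploaded and structured data attached
    ORPHANED = "orphaned"  # file uploaded, but structured data failed
    FAILED = "failed"      # file upload itself failed


def upload_batch(items, upload_file, attach_sdc):
    """Process (filename, claims) pairs one at a time and record each outcome."""
    results = {}
    for filename, claims in items:
        try:
            upload_file(filename)
        except Exception:
            results[filename] = Outcome.FAILED
            continue
        try:
            attach_sdc(filename, claims)
        except Exception:
            # The file now exists on the wiki without its statements.
            results[filename] = Outcome.ORPHANED
        else:
            results[filename] = Outcome.COMPLETE
    return results
```

A combined upload request would remove the ORPHANED state entirely, since each file would either succeed with its data or fail as a whole.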

One helpful step in that direction would be to return the page id in the JSON response of a successful file upload. That should be fairly straightforward and would bring the number of requests required from three to two.

At the moment the process to upload a file with structured metadata is as follows:

  • upload the file with action=upload
  • retrieve the page id from the filename with action=query
  • add the structured data with action=wbeditentity

If the page id was exposed, the second request would not be needed.
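The three requests above can be sketched as parameter builders. This is only an outline under stated assumptions: the helper names are hypothetical, the sketch does not send anything over the network, and it assumes the default JSON response format for action=query, in which pages are keyed by page id. The MediaInfo id is the page id prefixed with "M".

```python
import json


def upload_params(filename, wikitext, token):
    # Step 1: upload the file (the file body goes in the multipart `file` field).
    return {"action": "upload", "format": "json",
            "filename": filename, "text": wikitext, "token": token}


def pageid_query_params(filename):
    # Step 2: look up the page id from the file name.
    return {"action": "query", "format": "json", "titles": f"File:{filename}"}


def pageid_from_query(response):
    # Default JSON format: {"query": {"pages": {"<pageid>": {"pageid": ...}}}}
    page = next(iter(response["query"]["pages"].values()))
    if "pageid" not in page:
        raise LookupError(f"page not found: {page.get('title')}")
    return page["pageid"]


def wbeditentity_params(pageid, claims, token):
    # Step 3: add statements to the MediaInfo entity, whose id is "M" + page id.
    return {"action": "wbeditentity", "format": "json", "id": f"M{pageid}",
            "data": json.dumps({"claims": claims}), "token": token}
```

If the upload response carried the page id, step 2 and pageid_from_query could be dropped, and the id for step 3 could be taken directly from the first response.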

I created the subtask T307096 about this.

I do think this is good practice to have in the API response, and not just for this use case. However, wouldn't it be even simpler for the user to not even need the page id, and be able to add claims with a Commons file name? Considering there is a one-to-one and unchanging relationship between Commons file names and MediaInfo entities, I have not understood why all the SDC functions are not designed to work with either M-id or file name supplied. In that scenario, the user already knows the name of the page where they need to add the structured data before they even read the upload response, if it's just going to be the same name as the file they are uploading. There are also many other cases where determining the Commons page id is an additional step when the user may already have Commons file names in hand. For example, I get a bunch of file names from a WDQS query (e.g. P18 values). WDQS, as far as I'm aware, can only provide me the file names (or URIs, which are formed using file names, in the Commons case), and I'd have to take an additional step to generate IDs from them. It all feels unnecessary, since the file name and MediaInfo ID are both basically about the same thing, and should be easily translatable on the back end if we allowed users to use them interchangeably.

However, wouldn't it be even simpler for the user to not even need the page id, and be able to add claims with a Commons file name?

I cannot speak for the team who designed the SDC features, but working with filenames is not always ideal. First, filenames can change over time (as files can be renamed), whereas pageids remain stable. Also, file names must be normalized, whereas Mids are simpler.
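To illustrate what normalization involves, here is a deliberately simplified sketch. The function name is hypothetical, and it covers only a few of the rules (underscores as spaces, whitespace trimming and collapsing, a case-insensitive "File:" prefix, first letter capitalized as on Commons). The real title normalization in MediaWiki handles more cases than this, so treat it as an approximation rather than a reimplementation.

```python
def normalize_file_title(name):
    """Approximate a few MediaWiki title normalization rules for file pages."""
    # Underscores and spaces are equivalent; runs of whitespace collapse to one space.
    name = " ".join(name.replace("_", " ").split())
    # Accept the namespace prefix in any letter case, then strip it.
    if name[:5].lower() == "file:":
        name = name[5:].lstrip()
    if not name:
        raise ValueError("empty file name")
    # Assumes first-letter capitalization is enabled, as it is on Commons.
    return "File:" + name[0].upper() + name[1:]
```

Two differently typed names such as "file:example_image.jpg" and "File:Example image.jpg" map to the same title under these rules, which is the kind of ambiguity a numeric M-id avoids.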

I'm not sure normalization should be much of a problem, since isn't it already solved for all the other places in the API where only a page name is used? Regarding stability, you're right, of course. I just think file renaming is rare, and even a simple approach, where the user receives an error whenever trying to post to a file name that is a redirect (rather than following the redirect), would be better than not having this at all. That is especially true if there are a lot of cases like the ones in this story, where the user is adding SDC immediately after upload, and so the file name will definitely be current. I can make a separate ticket for this request.