
Add Structured Data on Commons M-ID to Wikidata dumps
Open, High, Public

Description

Currently, there is no easy way to fetch the Structured Data on Commons (SDOC) M-ID for e.g. the value of the "image" property on Wikidata.

The value of the "image" property in Wikidata RDF dumps is a URL like http://commons.wikimedia.org/wiki/Special:FilePath/Leon%20Cogniet%20-%20Jean-Francois%20Champollion.jpg
However, this is the URL of the file itself, and not of the SDOC MediaInfo entity.
As a result, people have to write "hacky" SPARQL queries spanning Wikidata and SDOC just to fetch the M-ID for a given "Commons File" value on Wikidata.

It would be nice to provide the M-IDs in Wikidata dumps using the same "normalized value" system that is already used for RDF URIs for some "external id" properties. For example, we could have:

wd:Q123 wdt:P18 <http://commons.wikimedia.org/wiki/Special:FilePath/Leon%20Cogniet%20-%20Jean-Francois%20Champollion.jpg> .
wd:Q123 wdtn:P234 sdoc:M22222

instead of just

wd:Q123 wdt:P18 <http://commons.wikimedia.org/wiki/Special:FilePath/Leon%20Cogniet%20-%20Jean-Francois%20Champollion.jpg> .

The "normalized value" system is not yet used for "Commons files", so adding it there keeps the RDF backward compatible.
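
To make the proposal concrete, here is a minimal Python sketch of the two triples for one statement. It assumes the Special:FilePath URI is simply the base URL plus the percent-encoded file name (which matches the example URL above), and it treats `sdoc:` as a placeholder prefix for MediaInfo entities. The function names are hypothetical; this is not how the dump generator is implemented.

```python
from urllib.parse import quote

# Base URI as it appears in the example above.
FILE_PATH_BASE = "http://commons.wikimedia.org/wiki/Special:FilePath/"


def file_path_uri(filename: str) -> str:
    """Build a Special:FilePath URI; spaces are percent-encoded as %20."""
    return FILE_PATH_BASE + quote(filename, safe="")


def proposed_triples(item_id: str, property_id: str, filename: str, m_id: str) -> list:
    """Return the existing wdt: triple plus the proposed wdtn: triple.

    The wdtn: triple is only the proposal from this task; `sdoc:` is a
    placeholder prefix, not an established one.
    """
    return [
        f"wd:{item_id} wdt:{property_id} <{file_path_uri(filename)}> .",
        f"wd:{item_id} wdtn:{property_id} sdoc:{m_id} .",
    ]


if __name__ == "__main__":
    for triple in proposed_triples(
        "Q123", "P18", "Leon Cogniet - Jean-Francois Champollion.jpg", "M6919529"
    ):
        print(triple)
```

The existing `wdt:` triple is emitted unchanged, which is the backward-compatibility point: consumers that ignore `wdtn:` see exactly what they see today.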

Event Timeline

I think you meant

wd:Q123  wdtn:P18  sdoc:M6919529

in that second line ?

The value of the "image" property in Wikidata RDF dumps is an URL like https://commons.wikimedia.org/wiki/File:Leon%20Cogniet%20-%20Jean-Francois%20Champollion.jpg

As far as I’m aware, the real URL in RDF is more like http://commons.wikimedia.org/wiki/Special:FilePath/Leon%20Cogniet%20-%20Jean-Francois%20Champollion.jpg – the query service UI rewrites it to the file description URL (/wiki/File:) on display.

I also don’t understand your first example – is sdoc:P18 meant to be something like sdoc:M123 instead?

And finally, I’m not sure this will be doable, since it would require looking up the page ID for every Commons file mentioned in a dump, and also sending updates to the query service when that page ID changes due to page moves… I think it would be more feasible to add the Special:FilePath URL to the WikibaseMediaInfo RDF, and combine WDQS and WCQS that way.

PS. It's also stupidly hard to find the M-ID from a WikiCommons file page at the moment. This would be a good thing to display in the "structured data" tab there, I think.
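
In the meantime, a rough workaround sketch in Python: since the M-ID is "M" followed by the file page's page ID, it can be derived from a standard MediaWiki `action=query` response. The helper names are hypothetical, the response in the usage part is hand-written for illustration rather than fetched, and actually sending the request is left out.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"


def page_id_query_url(filename: str) -> str:
    """Build an action=query URL asking for the File: page of `filename`."""
    params = {"action": "query", "format": "json", "titles": f"File:{filename}"}
    return API_ENDPOINT + "?" + urlencode(params)


def m_ids_from_response(response: dict) -> dict:
    """Map each existing page title in the response to its M-ID."""
    result = {}
    for page in response.get("query", {}).get("pages", {}).values():
        if "pageid" in page:  # missing pages carry no "pageid"
            result[page["title"]] = f"M{page['pageid']}"
    return result


if __name__ == "__main__":
    print(page_id_query_url("Leon Cogniet - Jean-Francois Champollion.jpg"))
    # Hand-written response in the default JSON format, for illustration only.
    sample = {
        "query": {
            "pages": {
                "6919529": {
                    "pageid": 6919529,
                    "ns": 6,
                    "title": "File:Leon Cogniet - Jean-Francois Champollion.jpg",
                }
            }
        }
    }
    print(m_ids_from_response(sample))
```

Doing this for every file in a dump means one lookup per file (batched at best), which is the cost concern raised above.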

@Lucas_Werkmeister_WMDE : I think it would be more feasible to add the Special:FilePath URL to the WikibaseMediaInfo RDF, and combine WDQS and WCQS that way

This has been suggested at T258769.

Tpt updated the task description.

As far as I’m aware, the real URL in RDF is more like http://commons.wikimedia.org/wiki/Special:FilePath/Leon%20Cogniet%20-%20Jean-Francois%20Champollion.jpg – the query service UI rewrites it to the file description URL (/wiki/File:) on display.
I also don’t understand your first example – is sdoc:P18 meant to be something like sdoc:M123 instead?

Yes, indeed, sorry. I should have proofread more carefully.

And finally, I’m not sure this will be doable, since it would require looking up the page ID for every Commons file mentioned in a dump, and also sending updates to the query service when that page ID changes due to page moves…
I think it would be more feasible to add the Special:FilePath URL to the WikibaseMediaInfo RDF, and combine WDQS and WCQS that way.

That's a very good point. Changing schema:contentUrl to use Special:FilePath is indeed an easier way to do it.

I think Wikidata and SDC have an issue with the representation of Wikimedia pages. Sometimes they are stored as URLs (see Property:P1957), sometimes as strings (see Property:P1472); images are stored using a data type just for images (see Property:P18), and in SDC they are stored as M-IDs based on page ID. Some years ago there was a proposal to have a property data type just for Wikimedia pages, which would allow links to stay valid when the original page is renamed, and would allow comparison of links saved in different formats, for example comparing Property:P1472 with a sitelink to Commons. If we add SDoC M-IDs to WDQS, maybe we could also find a way of representing other links to Wikimedia pages in a unified way, as right now it is kind of a mess.

dcausse added a subscriber: dcausse.

I agree with @Lucas_Werkmeister_WMDE: it seems hard and very costly to do it this way. I'd be in favor of changing the way we encode the file URL on the Commons side.

Gehel triaged this task as High priority. Sep 15 2020, 8:00 AM

The URI for the image is https://commons.wikimedia.org/entity/M6919529 (yes, https, not http, that got messed up, see T258590). You can see an RDF representation of that at https://commons.wikimedia.org/entity/M6919529.rdf . We should somehow include this URI on Wikidata too. The royal way would be to update the image data type so that it accepts the MediaInfo ID and does all the logic that the current image data type does.

Currently it is something like this:

{"snaktype":"value","property":"P18","datavalue":{"value":"Woman Mending Stockings f888r jh68.jpg","type":"string"},"datatype":"commonsMedia"},

Would be something like:

{"snaktype":"value","property":"P18","datavalue":{"value":{"entity-type":"mediainfo","numeric-id":82541649,"id":"M82541649"},"type":"wikibase-entityid"},"datatype":"commonsMedia"},
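
A small Python sketch of that conversion, purely illustrative: it rewrites a current `commonsMedia` snak into the entity-ID shape shown above, given a numeric MediaInfo ID that has already been looked up. The function names are hypothetical and nothing like this exists in Wikibase today; the entity URI pattern is the one quoted earlier in this comment.

```python
def entity_uri(m_id: str) -> str:
    """Return the MediaInfo entity URI (https, as noted above)."""
    return f"https://commons.wikimedia.org/entity/{m_id}"


def proposed_snak(snak: dict, numeric_id: int) -> dict:
    """Return a copy of a commonsMedia snak whose value is a MediaInfo entity ID.

    `numeric_id` must be looked up separately; the file name alone does not
    determine it.
    """
    if snak.get("datatype") != "commonsMedia":
        raise ValueError("expected a commonsMedia snak")
    converted = dict(snak)  # shallow copy; the original snak is left untouched
    converted["datavalue"] = {
        "value": {
            "entity-type": "mediainfo",
            "numeric-id": numeric_id,
            "id": f"M{numeric_id}",
        },
        "type": "wikibase-entityid",
    }
    return converted


if __name__ == "__main__":
    current = {
        "snaktype": "value",
        "property": "P18",
        "datavalue": {"value": "Woman Mending Stockings f888r jh68.jpg", "type": "string"},
        "datatype": "commonsMedia",
    }
    new = proposed_snak(current, 82541649)
    print(new)
    print(entity_uri(new["datavalue"]["value"]["id"]))
```

Note that the file name disappears from the snak entirely in this shape, so consumers that need the name would have to resolve it from the MediaInfo entity.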

I think Wikidata and SDC have an issue with the representation of Wikimedia pages. Sometimes they are stored as URLs (see Property:P1957), sometimes as strings (see Property:P1472); images are stored using a data type just for images (see Property:P18), and in SDC they are stored as M-IDs based on page ID. Some years ago there was a proposal to have a property data type just for Wikimedia pages, which would allow links to stay valid when the original page is renamed, and would allow comparison of links saved in different formats, for example comparing Property:P1472 with a sitelink to Commons. If we add SDoC M-IDs to WDQS, maybe we could also find a way of representing other links to Wikimedia pages in a unified way, as right now it is kind of a mess.

This!! Filenames are not useful data at all and are completely messy especially when the only difference is capitalized vs non-capitalized letters. More problematic are the letters I cannot even recreate on my keyboard without doing some kind of lookup. As a misspeller I beg you not to use the Filenames as reliable interwiki connectors.

Note: in an attempt to unblock the status quo, I created T277665 with some practical solutions (especially the first one suggested in T258769#6332430).

I'm going to close this as a duplicate of T277665 as that is what we see as the practical solution to this problem. Please re-open if you don't think that will solve your use case.

I'm going to close this as a duplicate of T277665 as that is what we see as the practical solution to this problem. Please re-open if you don't think that will solve your use case.

This is clearly not a duplicate because this task is about the Wikidata RDF and T277665 is about Commons RDF.

I'm going to close this as a duplicate of T277665 as that is what we see as the practical solution to this problem. Please re-open if you don't think that will solve your use case.

This is clearly not a duplicate because this task is about the Wikidata RDF and T277665 is about Commons RDF.

While it's a different solution, it seems like T277665 would solve the same problem that this task would solve, in a more practical way that is simpler to implement. If you have a problem that changing the Wikidata RDF would solve that T277665 (changing the Commons RDF) wouldn't, please let us know here so we can understand why this should stay open. Thanks!