
[L] Determine an IRI to join commons mediainfo entities and wikidata properties referencing commons images
Closed, ResolvedPublic

Description

As a user of WDQS and WCQS I want to be able to join a mediainfo item with a property value referencing a commons image such as P18.

There are no obvious ways to do this currently.

In Wikidata's RDF output, properties that are instances of Q18610173 (e.g. P18) use an IRI of the form:
http://commons.wikimedia.org/wiki/Special:FilePath/_filename_ while commons entities reference their contentUrl using https://upload.wikimedia.org/wikipedia/commons/X/XY/_filename_.

These IRIs could have been used for joining, but they differ in two ways:

  • the path: Special:FilePath on the wikidata side vs the hashed upload.wikimedia.org path on the commons side
  • the scheme: http vs https

There should exist a common IRI identifying a commons file.

With a commons RDF output like:

sdc:M10031710 a wikibase:Mediainfo,
		schema:MediaObject,
		schema:ImageObject ;
	schema:encodingFormat "image/jpeg" ;
	schema:contentUrl <https://upload.wikimedia.org/wikipedia/commons/c/c0/Douglas_adams_portrait_cropped.jpg> ;

One approach could be to change the URL emitted by commons mediainfo to match the one emitted by wikidata. This would be a breaking change:

sdc:M10031710 a wikibase:Mediainfo,
		schema:MediaObject,
		schema:ImageObject ;
	schema:encodingFormat "image/jpeg" ;
	schema:contentUrl <http://commons.wikimedia.org/wiki/Special:FilePath/Douglas%20adams%20portrait%20cropped.jpg> ;

Another (preferred) approach would be to introduce a new triple, e.g. schema:url. This would increase the size of the graph, but by an acceptable amount, and without introducing any breaking changes:

sdc:M10031710 a wikibase:Mediainfo,
		schema:MediaObject,
		schema:ImageObject ;
	schema:encodingFormat "image/jpeg" ;
	schema:contentUrl <https://upload.wikimedia.org/wikipedia/commons/c/c0/Douglas_adams_portrait_cropped.jpg> ;
	schema:url <http://commons.wikimedia.org/wiki/Special:FilePath/Douglas%20adams%20portrait%20cropped.jpg> ;

Note on similar tickets:

  • T258769 is very similar but is worded to simplify the use of the image grid feature of the UI
  • T258776 serves similar purposes but is harder to achieve:
    • it requires "synchronization" between wikidata and commons to obtain the page ID of a media info item
    • MediaInfo item may not yet exist while the commons image is still referenceable from wikidata

AC:

  • A WCQS query that joins wikidata items from WDQS via federation can be written without complex string manipulation
  • A new schema:url triple is added
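
The kind of federated query these criteria aim for could look roughly like the sketch below. It assumes the schema:url triple is emitted as proposed above and that WCQS may federate to the public WDQS endpoint; wd:Q42 is only an illustrative item.

```sparql
PREFIX schema: <http://schema.org/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

# Run on WCQS: find the MediaInfo entity for the file used as the P18 image of Q42.
SELECT ?item ?image ?file WHERE {
  # Federated call to WDQS; the endpoint and the example item are assumptions of this sketch.
  SERVICE <https://query.wikidata.org/sparql> {
    VALUES ?item { wd:Q42 }
    ?item wdt:P18 ?image .
  }
  # Join on the shared Special:FilePath IRI, with no string manipulation.
  ?file schema:url ?image .
}
```

The join only matches when both sides emit exactly the same IRI, including scheme and percent-encoding.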

Event Timeline

File names are bad URIs. Files get renamed all the time (see https://commons.wikimedia.org/w/index.php?title=Special:Log&offset=&limit=500&type=move ), causing all sorts of breakage. The pageid stays the same, so the mediaid also stays the same. That's a much more stable identifier.

CBogen renamed this task from Determine an IRI to join commons mediainfo entities and wikidata properties referencing commons images to [L] Determine an IRI to join commons mediainfo entities and wikidata properties referencing commons images. Apr 7 2021, 4:43 PM

Change 704881 had a related patch set uploaded (by Seddon; author: Seddon):

[mediawiki/extensions/WikibaseMediaInfo@master] WIP Determine an IRI to join commons mediainfo entities and wikidata properties

https://gerrit.wikimedia.org/r/704881

Just to note that this ticket will require a config patch prior to deployment.

Change 704881 merged by jenkins-bot:

[mediawiki/extensions/WikibaseMediaInfo@master] Determine an IRI to join commons mediainfo entities and wikidata properties

https://gerrit.wikimedia.org/r/704881

File names are bad URIs. Files get renamed all the time (see https://commons.wikimedia.org/w/index.php?title=Special:Log&offset=&limit=500&type=move ), causing all sorts of breakage. The pageid stays the same, so the mediaid also stays the same. That's a much more stable identifier.

+1

Joining between Wikimedia Commons and Wikidata via P18 in the WCQS can be done using schema:url going forward. Both are using the same encoding and should match.

<schema:url rdf:resource="http://commons.wikimedia.org/wiki/Special:FilePath/Stilleven%2C%20SK-C-1561.jpg"/>
<ps:P18 rdf:resource="http://commons.wikimedia.org/wiki/Special:FilePath/Stilleven%2C%20SK-C-1561.jpg"/>

Based on this query it seems to be working for WCQS/WMQS queries.

Seddon updated the task description.

Yeah, great! This simplifies the queries so much, without that awful hack of using the string concat method. Finding all images depicting Douglas Adams is now just:

#defaultView:ImageGrid
select ?file ?image where {
  ?file wdt:P180 wd:Q42;
        schema:url ?image.
}

Joining between Wikimedia Commons and Wikidata via P18 in the WCQS can be done using schema:url going forward. Both are using the same encoding and should match.

<schema:url rdf:resource="http://commons.wikimedia.org/wiki/Special:FilePath/Stilleven%2C%20SK-C-1561.jpg"/>
<ps:P18 rdf:resource="http://commons.wikimedia.org/wiki/Special:FilePath/Stilleven%2C%20SK-C-1561.jpg"/>

That should be httpS on both sides.