
Sitelink URIs should be IRIs
Closed, Declined (Public)

Description

The RDF representation of all Wikidata sitelinks as URIs rather than IRIs is problematic.

These sitelinks (as primary source identifiers) should be represented as unencoded IRIs in the RDF because the sitelinks refer directly to their canonical IRI (RFC3987 compliant) representation.

For applications, mapping to an RFC3986 compliant URI from the IRI can easily be done programmatically, but decoding an IRI from an encoded URI in a SPARQL 1.1 query is just not possible.
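As an illustration of the IRI-to-URI direction mentioned above, here is a minimal Python sketch. It assumes the host name is already ASCII (as it is for Wikipedia sitelinks) and skips the normalization and host-name steps that RFC 3987 section 3.1 also describes. The helper name `iri_to_uri` and the Reykjavík article IRI are illustrative assumptions, not taken from the Wikidata dump.

```python
from urllib.parse import quote

# RFC 3986 reserved characters plus "%", so that delimiters and any
# existing percent-escapes are left untouched (assumption: ASCII host).
_SAFE = ":/?#[]@!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    """Percent-encode the non-ASCII characters of an IRI as UTF-8 bytes."""
    return quote(iri, safe=_SAFE)


iri = "https://en.wikipedia.org/wiki/Reykjavík"
uri = iri_to_uri(iri)
print(uri)  # https://en.wikipedia.org/wiki/Reykjav%C3%ADk
```

Because `%` is treated as safe, applying the helper to an already encoded URI leaves it unchanged.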

See https://tools.ietf.org/html/rfc3987 for guidance on IRIs.

Event Timeline

I'm not sure I fully understand what the problem is. Could you give a couple of examples of the pages that are problematic and specify which format you think it should be in and why?

Also, I do not think sitelinks are primary identifiers for anything. Non-Wikidata sites are not part of the dataset, so they are just URIs.

The main issue is with string comparison of the percent encoded and unencoded forms of Unicode IRIs as resources.

Per https://tools.ietf.org/html/rfc3987#section-5.3.1

When comparing character by character, the comparison function MUST NOT map IRIs to URIs, because such a mapping would create additional spurious equivalences. It follows that an IRI SHOULD NOT be modified when being transported if there is any chance that this IRI might be used as an identifier.
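The "spurious equivalences" the RFC warns about can be sketched in Python as follows. The two strings are distinct IRIs under character-by-character comparison, yet they collapse to the same string once both are mapped to URIs. The helper `to_uri` and the example article IRI are assumptions made for illustration only.

```python
from urllib.parse import quote


def to_uri(iri: str) -> str:
    """Simplified IRI-to-URI mapping: encode non-ASCII characters only."""
    return quote(iri, safe=":/?#[]@!$&'()*+,;=%")


unencoded = "https://en.wikipedia.org/wiki/Reykjavík"
encoded = "https://en.wikipedia.org/wiki/Reykjav%C3%ADk"

# Compared as IRIs, character by character, the two are different resources.
distinct_as_iris = unencoded != encoded

# After mapping both to URIs, the difference disappears.
equal_after_mapping = to_uri(unencoded) == to_uri(encoded)

print(distinct_as_iris, equal_after_mapping)  # True True
```

An RDF store that compares IRIs as plain strings will therefore not join a triple using one form with a triple using the other.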

For experimentation, I have created a named graph (http://wikidata.org/en-sitelinks) in Virtuoso that contains all 7M+ English sitelinks in their normal Unicode form. The query below shows a side-by-side comparison of the two variants.

http://wdm-rdf.wmflabs.org/sparql?default-graph-uri=&query=PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0D%0APREFIX+wd%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%0D%0ASELECT+%3Fs+%3Fsitelink+%3Fenclink+%0D%0AFROM+%3Chttp%3A%2F%2Fwikidata.org%3E%0D%0AFROM+%3Chttp%3A%2F%2Fwikidata.org%2Fen-sitelinks%3E%0D%0AWHERE+%7B%3Fs+foaf%3AisPrimaryTopicOf+%3Fsitelink%3B%0D%0Awdt%3AP17+wd%3AQ189+.%0D%0A%3Fenclink+schema%3Aabout+%3Fs%3B%0D%0Aschema%3AinLanguage+%22en%22%0D%0A%7D+LIMIT+100&format=text%2Fhtml&CXML_redir_for_subjs=121&CXML_redir_for_hrefs=&timeout=0&debug=on

It seems quite obvious that the unencoded form is what should be represented in the RDF. I can see no reason why the RDF should have a percent-encoded IRI that is not only ugly, but practically useless for connecting disparate external sources that may reference the canonical Wikipedia article name.

I don't think sitelinks are used as identifiers anywhere. Also, absent encoding they may not be safe to export in RDF. Also, sitelinks are actually properly represented by URIs by virtue of them being links.

Note also that the RDF standard (https://www.w3.org/TR/rdf11-concepts/#section-IRIs) specifically mentions percent-encoding of IRIs.

Do we have any specific problem with connecting with some data source? Which data source is it and what is the specific problem?

The RDF standard that you reference explicitly supports my point.

IRI normalization: Interoperability problems can be avoided by minting only IRIs that are normalized according to Section 5 of [RFC3987].
Non-normalized forms that are best avoided include:
Percent-encoding of characters where it is not required by IRI syntax

The sitelinks are not properly represented in the RDF according to the standards. When a sitelink is rendered as a dereferenced URI (a webpage link), yes, it should be percent-encoded because its function is defined by the HTTP protocol. When in RDF, however, the link should be represented as a reference that is string-comparable to the source, in this case the wiki article IRI.

I have many problems with Wikidata as a linked data source, but I am able to work around them. What is not fixed at the source can be normalized with Java and then rectified with SPARQL update, though it would certainly be better for the developers to try and produce data that was not so proprietary.

For example, T121274 can easily be fixed in the RDF with SPARQL update once it is in the linked data store. For some reason, it is an epic task for Wikibase to distinguish object properties from datatype properties, and this major interoperability problem remains. The consequence of this übercomplexity is that you have hacks like the authority control gadget that the whole project is dependent on. Why?

I am sorry but I've reread the comments a number of times and I can't identify what could be done here. Sitelinks are encoded because it makes them possible to be used as links, and processed by tools that expect URIs. It would be pretty trivial to decode them if needed, and also they are not identifiers to anything but articles in Wikipedia - which by nature are addressed as URLs. If there is another problem I am missing here, please specify it.
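For reference, decoding outside of SPARQL could look roughly like the Python sketch below. It only turns percent-escaped UTF-8 sequences back into non-ASCII characters and leaves escaped ASCII such as `%2F` alone, since decoding those would change the meaning of the URI. The helper name `uri_to_iri` and the sample URIs are assumptions for illustration, and the additional exclusions that RFC 3987 section 3.2 lists for this conversion are not handled.

```python
import re

# Runs of percent-escapes whose byte value is >= 0x80, i.e. candidates
# for UTF-8 encoded non-ASCII characters.
_NON_ASCII_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def uri_to_iri(uri: str) -> str:
    """Decode percent-escaped UTF-8 sequences; keep everything else as is."""

    def decode(match: re.Match) -> str:
        raw = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            # Not valid UTF-8: leave the original escapes untouched.
            return match.group(0)

    return _NON_ASCII_RUN.sub(decode, uri)


print(uri_to_iri("https://en.wikipedia.org/wiki/Reykjav%C3%ADk"))
# https://en.wikipedia.org/wiki/Reykjavík
```

Escaped reserved characters, as in `https://en.wikipedia.org/wiki/AC%2FDC`, pass through unchanged.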

Smalyshev changed the task status from Open to Stalled. May 20 2017, 9:34 AM
Gehel subscribed.

Since there hasn't been any update since 2016 and the problem still does not seem to be fully understood, let's close this. Feel free to reopen and add more context if it is needed.