"_" character encoded as %20 in Wikidata URI RDF serialization
Closed, Resolved · Public

Description

Wikipedia and Commons URIs do not match their RDF representation in Wikidata if there is an underscore.

For example, although Wikipedia's rewrite rules accept a space and translate it to an underscore, the canonical URI for a Wikipedia article uses the underscore. This underscore form of the URI is what should be represented in the RDF.

DBpedia uses the foaf:isPrimaryTopicOf property for Wikipedia sitelinks, and its URIs contain the underscores. This effectively breaks federation between DBpedia resources and Wikidata entities when the sitelink is used as the join key (if the Wikidata sitelink has a %20).

Examples:
A query (http://tinyurl.com/gntg9wx) for a sitelink for the entity Q3032 with the article URI https://de.wikipedia.org/wiki/Darwin_Harbour

returns:
https://de.wikipedia.org/wiki/Darwin%20Harbour
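
To illustrate why the mismatch breaks federation: RDF IRIs are compared as exact strings, so the two spellings never join. A minimal Python sketch, assuming two tiny in-memory stand-ins for the datasets. Only the two URIs and Q3032 come from the example above; the dictionary names and the placeholder value are hypothetical.

```python
# Hypothetical stand-ins for the two datasets.
# Only the two URIs and the entity ID Q3032 are taken from the example above.
dbpedia_topic_of = {
    "https://de.wikipedia.org/wiki/Darwin_Harbour": "placeholder DBpedia resource",
}
wikidata_sitelinks = {
    "https://de.wikipedia.org/wiki/Darwin%20Harbour": "Q3032",
}

# IRIs are compared character by character, so joining on the sitelink finds nothing.
joined = set(dbpedia_topic_of) & set(wikidata_sitelinks)

# Client-side workaround: rewrite %20 to _ before joining.
normalized = {uri.replace("%20", "_"): qid for uri, qid in wikidata_sitelinks.items()}
joined_after_fix = set(dbpedia_topic_of) & set(normalized)

print(joined)            # empty set: no common key
print(joined_after_fix)  # the underscore form now matches
```

The workaround only covers the space case; other characters that the two sides encode differently would need the same treatment.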

Event Timeline

Restricted Application added a subscriber: Aklapper.
Smalyshev subscribed.

Need more input on this one. Right now the RDF export uses $siteLink->getPageName(), which is the only method SiteLink has for getting page info. If we need to apply some transformation here, which one?

I'd like to resolve this issue one way or another. What we have, for a page named "Category:Pretty flowers", is basically:

  • $siteLink->getPageName() returns article name as-is (Category:Pretty flowers)
  • Rdf generator now applies rawurlencode, producing: Category%3APretty%20flowers
  • Title->getCanonicalURL produces: Category:Pretty_flowers

Both Title and the RDF generator percent-encode non-ASCII characters, but they treat spaces and some ASCII characters (such as the colon) differently.
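
The difference can be reproduced outside MediaWiki. A rough Python sketch, assuming urllib.parse.quote with an empty safe set is a fair stand-in for PHP's rawurlencode, and that the Title form amounts to "spaces to underscores, colon left readable". The second assumption is a simplification; the variable names and the non-ASCII sample are made up for illustration.

```python
from urllib.parse import quote

page_name = "Category:Pretty flowers"

# Stand-in for rawurlencode: encodes everything except letters, digits and -_.~
rdf_style = quote(page_name, safe="")

# Simplified stand-in for what Title produces: spaces become underscores first,
# and the colon is left unencoded (assumption; Title keeps more characters readable).
title_style = quote(page_name.replace(" ", "_"), safe=":")

# Non-ASCII characters come out the same in both styles (UTF-8 bytes, percent-encoded).
sample = "Z\u00fcrich"  # illustrative name containing a non-ASCII letter
non_ascii_rdf = quote(sample, safe="")
non_ascii_title = quote(sample.replace(" ", "_"), safe=":")

print(rdf_style)
print(title_style)
```

With these assumptions the two outputs match the second and third bullets above.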

Currently the RDF algorithm is either:

$baseUrl = str_replace( '$1', rawurlencode( $siteLink->getPageName() ), $site->getLinkPath() );
if ( !parse_url( $baseUrl, PHP_URL_SCHEME ) ) {
   $url = "http:".$baseUrl;
} else {
   $url = $baseUrl;
}

or:

global $wgArticlePath;
$url = str_replace( '$1', rawurlencode( $title->getPrefixedText() ), $wgArticlePath );

Neither matches what Title does. This can be annoying, especially since you can't copy page URLs and use them in SPARQL queries directly.

Should we change the behavior to match what Title does? Can we do it reliably? SiteLinksRdfBuilder may not have access to the configs needed to actually use Title, but maybe we could simulate it. Should we use wfUrlencode instead of rawurlencode?
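
One way such a simulation could look, sketched in Python rather than PHP so it can be run standalone. It mirrors the first snippet above ($1 substitution, "http:" prefix for scheme-less link paths) but swaps in a Title-like encoding. The set of characters left readable is an assumption modeled on wfUrlencode and should be checked against the MediaWiki source; the function names are hypothetical, and this is not the patch itself.

```python
from urllib.parse import quote, urlsplit

# Assumption: characters left readable in addition to letters, digits and -_.
# (modeled on wfUrlencode; verify against MediaWiki before relying on it).
READABLE = ";@$!*(),/~:"


def title_like_encode(page_name: str) -> str:
    """Spaces become underscores, then everything outside the readable set is percent-encoded."""
    return quote(page_name.replace(" ", "_"), safe=READABLE)


def sitelink_url(page_name: str, link_path: str) -> str:
    """Substitute the encoded page name for $1 and add a scheme if the link path has none."""
    base_url = link_path.replace("$1", title_like_encode(page_name))
    if not urlsplit(base_url).scheme:
        # Protocol-relative link path such as //example.org/wiki/$1
        return "http:" + base_url
    return base_url


print(sitelink_url("Darwin Harbour", "https://de.wikipedia.org/wiki/$1"))
print(sitelink_url("Category:Pretty flowers", "//commons.wikimedia.org/wiki/$1"))
```

This needs nothing beyond the page name and the site's link path, which is all SiteLinksRdfBuilder has today; whether that is enough to match Title in every edge case is an open question.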

@daniel and I agree that the URI should match what Title does (i.e., Category:Pretty_flowers in the example above or Darwin_Harbour in the original description). We should announce the change beforehand and will have to do a DB reload, but making the URLs in RDF sitelinks match the URLs people see in the browser and can copy-paste seems to be the right way to go.
I'll add a patch for it a bit later.

Smalyshev raised the priority of this task from Low to Medium. May 20 2017, 1:54 PM

Change 355316 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/Wikibase@master] Change link encoding to match what Title is doing.

https://gerrit.wikimedia.org/r/355316

The change is checked in and deployed on Wikidata; the WDQS reload is underway (T166244).