
schema:about URLs in Wikidata are encoded (mangled)
Closed, Resolved (Public)

Description

Searching for a Wikidata (WD) entity by its Wikipedia (WP) IRI finds nothing:

<https://en.wikipedia.org/wiki/University_of_Tromsø> schema:about ?x

One has to search by the encoded URL:

select * {
  <https://en.wikipedia.org/wiki/University_of_Troms%C3%B8> schema:about ?x
}

Other "special" chars in URLs (eg parentheses) are not encoded.
Can you confirm whether only non-ASCII chars are encoded?

This makes it hard to integrate WD data with other datasets by WP URL: one has to encode some chars before querying WD.

The percent-encoded URL is equivalent to the IRI, so why does the WD query service find nothing for the first query?

If I take the str() form of the URL, it must return the decoded URL, but it does not:

select * {
 ?x schema:about wd:Q279724; schema:isPartOf <https://en.wikipedia.org/>.
 bind(str(?x) as ?y)
}

returns the string https://en.wikipedia.org/wiki/University_of_Troms%C3%B8.

Event Timeline

This is the standard MediaWiki title encoding. WDQS represents the title exactly as MediaWiki does, since MediaWiki (Wikidata) is the data source.

If I take the str() form of the URL, it must return the decoded URL

I don't think this is the case. Could you point to the place in the standard where it says the URIs should be returned in decoded form when converted to string?

Can you point to the encoding rules, so we can do the same before querying? SPARQL's encode-for-uri encodes everything except upper- and lower-case letters A-Z, the digits 0-9, HYPHEN-MINUS ("-"), LOW LINE ("_"), FULL STOP ".", and TILDE "~". See https://www.w3.org/TR/xpath-functions/#func-encode-for-uri.
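To make the encode-for-uri rule quoted above concrete, here is a rough Python sketch. It assumes that `urllib.parse.quote` with an empty `safe` set is a fair stand-in for encode-for-uri (both leave only letters, digits, "-", "_", "." and "~" unescaped); the helper name `encode_for_uri` and the second title are illustrative choices, not anything taken from WDQS.

```python
from urllib.parse import quote


def encode_for_uri(text: str) -> str:
    """Approximate SPARQL's encode-for-uri (an assumption, not the real thing).

    With safe="" Python escapes every character except ASCII letters,
    digits, "-", "_", "." and "~", as UTF-8 percent-escapes.
    """
    return quote(text, safe="")


# Non-ASCII characters are escaped, matching the URL stored in Wikidata.
print(encode_for_uri("University_of_Troms\u00f8"))  # University_of_Troms%C3%B8

# Parentheses are escaped too, which Wikipedia URLs do not do.
print(encode_for_uri("Mercury_(planet)"))  # Mercury_%28planet%29
```

So applying encode-for-uri to a title would over-encode it, and the resulting IRI would not match the one in the data whenever the title contains characters such as parentheses.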

https://www.w3.org/TR/sparql11-query/#func-str says "returns the codepoint representation of an IRI". It doesn't say that anything needs to be encoded. Since it talks of IRIs, Unicode chars should be preserved. And the example shows that "@", at least, is preserved.

Cheers!

This returns false, which I believe is a bug in Blazegraph:

select * {
  bind (
    sameTerm(<https://en.wikipedia.org/wiki/University_of_Tromsø>,
             <https://en.wikipedia.org/wiki/University_of_Troms%C3%B8>)
    as ?x)
}

Hmmm, rdf4j also returns false, and it does the same if I use = instead of sameTerm.

I've posted https://github.com/eclipse/rdf4j/issues/1291, let's see what the rdf4j developers say.

https://www.w3.org/TR/rdf-concepts/#section-Graph-URIref says "Because of the risk of confusion between RDF URI references that would be equivalent if derefenced, the use of %-escaped characters in RDF URI references is strongly discouraged. See also the URI equivalence issue of the Technical Architecture Group"

https://github.com/eclipse/rdf4j/issues/1291 got this answer:

In http://w3.org/TR/rdf11-concepts/#section-IRIs it says the following:

IRI equality: Two IRIs are equal if and only if they are equivalent under Simple String Comparison according to section 5.1 of [RFC3987]. Further normalization MUST NOT be performed when comparing IRIs for equality.

This explicitly states that in RDF, IRIs are considered equal only under simple string comparison, and even goes so far as to explicitly forbid any other normalizations. Normalization of %-encoding is one of these "other" normalizations (see http://tools.ietf.org/html/rfc3987#section-5.3.2.3), and is therefore explicitly ruled out.
This is further enforced by RFC3987 itself, which in section 5.1 states:

Applications using IRIs as identity tokens with no relationship to a protocol MUST use the Simple String Comparison (see section 5.3.1). All other applications MUST select one of the comparison practices from the Comparison Ladder (see section 5.3 or, after IRI-to-URI conversion, select one of the comparison practices from the URI comparison ladder in [RFC3986], section 6.2)

RDF4J (and RDF triplestores in general) fall in the first category: IRIs are considered identity tokens not related to a specific protocol.
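The difference between the two comparison practices can be shown with a small Python sketch. It is only an illustration of the rule quoted above, not how RDF4J or Blazegraph are implemented; the function names are made up for this example, and `unquote` stands in for the percent-encoding normalization that RDF forbids.

```python
from urllib.parse import unquote

# The two IRIs from the sameTerm test; \u00f8 is the letter "ø".
IRI = "https://en.wikipedia.org/wiki/University_of_Troms\u00f8"
ENCODED = "https://en.wikipedia.org/wiki/University_of_Troms%C3%B8"


def simple_string_equal(a: str, b: str) -> bool:
    """Simple String Comparison: character-by-character equality."""
    return a == b


def equal_after_percent_decoding(a: str, b: str) -> bool:
    """Comparison after percent-decoding, which RDF rules out."""
    return unquote(a) == unquote(b)


print(simple_string_equal(IRI, ENCODED))           # False
print(equal_after_percent_decoding(IRI, ENCODED))  # True
```

Under the first function the two IRIs are different identity tokens, which is why the triple pattern with the unencoded IRI matches nothing.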

rdf4j will probably add a function like wikibase:decodeUri to deal with this, but won't do %-decoding automatically.


So there's nothing to do on this issue except: document the Wikipedia encoding rules so we can do the same before querying.
SPARQL's encode-for-uri encodes everything except upper- and lower-case letters A-Z, the digits 0-9, HYPHEN-MINUS ("-"), LOW LINE ("_"), FULL STOP ".", and TILDE "~", but Wikipedia URLs encode fewer characters than that.
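Until those rules are documented, a client-side workaround might look like the Python sketch below. The set of characters left unescaped (`ASSUMED_SAFE`) is an assumption: this ticket only confirms that parentheses stay literal and that non-ASCII characters are percent-encoded, so the set should be checked against MediaWiki before being relied on. The function names are hypothetical.

```python
from urllib.parse import quote

# ASSUMPTION: punctuation that Wikipedia URLs leave unescaped. Only the
# parentheses are confirmed by this ticket; verify the rest before use.
ASSUMED_SAFE = "/:@$!*(),;"


def wikipedia_url(title: str, lang: str = "en", safe: str = ASSUMED_SAFE) -> str:
    """Build a Wikipedia URL in the (assumed) form stored by Wikidata."""
    # Spaces become underscores; everything outside the safe set and the
    # unreserved characters is written as UTF-8 percent-escapes.
    path = quote(title.replace(" ", "_"), safe=safe)
    return f"https://{lang}.wikipedia.org/wiki/{path}"


def about_query(title: str) -> str:
    """Return a SPARQL query that looks up the entity for a WP title."""
    return f"select * {{ <{wikipedia_url(title)}> schema:about ?x }}"


print(about_query("University of Troms\u00f8"))
```

For the title in this ticket the sketch produces the encoded IRI from the second query in the description, the one that does find the entity.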