
https switch changed wdata prefix to https:
Closed, Resolved · Public · 1 Story Points

Description

After the https switch, the page http://www.wikidata.org/wiki/Special:EntityData/Q1.ttl?flavor=dump redirects to https://www.wikidata.org/wiki/Special:EntityData/Q1.ttl?flavor=dump - which produces this prefix:

@prefix wdata: <https://www.wikidata.org/wiki/Special:EntityData/> .

This doesn't seem right, since our URLs are canonically defined as http://. Should we change the definition or (my preference) fix the output so that it uses http:// even if the current page is served over https? Data should not depend on the transfer protocol, IMO.
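To illustrate the preferred behaviour described above, here is a minimal sketch in Python (not the PHP used by Wikibase). The function names `force_http` and `wdata_prefix` are hypothetical and exist only for this example; the sketch assumes the prefix is derived from whatever base URL the current request yields.

```python
from urllib.parse import urlsplit


def force_http(url: str) -> str:
    """Return the URL with its scheme replaced by http; the rest is unchanged."""
    return urlsplit(url)._replace(scheme="http").geturl()


def wdata_prefix(entity_data_url: str) -> str:
    """Build the Turtle prefix line, independent of the transfer protocol.

    Hypothetical helper: the real output is produced by Wikibase's RDF code.
    """
    return f"@prefix wdata: <{force_http(entity_data_url)}> ."


# Base URL as seen when the page is served over https (from the report above).
print(wdata_prefix("https://www.wikidata.org/wiki/Special:EntityData/"))
```

With this approach the emitted prefix is the same whether the page was requested over http or https.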

Event Timeline

Smalyshev claimed this task.
Smalyshev raised the priority of this task from to Needs Triage.
Smalyshev updated the task description.
Smalyshev added subscribers: Smalyshev, daniel.
Restricted Application added a project: Discovery. Jun 16 2015, 10:09 PM
Restricted Application added a subscriber: Aklapper.
Smalyshev triaged this task as Normal priority. Jun 16 2015, 10:09 PM
Smalyshev set Security to None.

Change 218994 had a related patch set uploaded (by Smalyshev):
T102717: Fix the RDF links always use http:

https://gerrit.wikimedia.org/r/218994

Instead of the proposed patch, which would lead to inconsistencies, just set the concept base URI explicitly in the config:

$wgWBRepoSettings['conceptBaseUri'] = 'http://www.wikidata.org/entity/';

That would go into the wikibase.php config file, if I understand correctly.

Tobi_WMDE_SW edited a custom field. Jun 18 2015, 2:24 PM
Smalyshev added a comment. Edited Jun 18 2015, 6:16 PM

@daniel - check out the link: conceptBaseUri is fine. However, the data URL, which is derived from the canonical URL, is not. I'm also not sure which inconsistencies the patch would lead to; could you explain?

daniel added a comment. Edited Jun 22 2015, 6:45 PM

Sorry for the initial misunderstanding. I was thinking of concept URIs, which are controlled by the conceptBaseUri setting.

So, after some digging around and talking to @Smalyshev, I think he's right: the canonical document URI should use the http scheme, even though the canonical document URL should be an https URL.

The reason is: we want documents to be retrieved via HTTPS if possible, so our canonical document URLs should use the https protocol, and the plain http URLs should redirect to the https URLs, as per section 2.11 of W3C's Linked Data recommendation.

However, by the same recommendations, URIs should always use the HTTP scheme; see section 2.1, https://dvcs.w3.org/hg/ldpwg/raw-file/default/ldp-bp/ldp-bp.html#predicate-uris-should-be-http-urls. This seems a bit arbitrary and annoying, but it's backed by other documents, such as section 4.1 of the Study on persistent URIs by the ISA.

We currently construct the base URI for documents in two places, SpecialEntityData.php and dumpRdf.php. Both rely on the canonical URL of Special:EntityData. Note that the notion of canonical page URLs in core has recently been discussed, and the implementation improved, see https://gerrit.wikimedia.org/r/#/c/219782/2

So, currently, our canonical URIs are based on canonical URLs. Considering the protocol conundrum above, we probably need to change this and make the canonical document URIs use plain HTTP. The downside of this is security: when a client resolves the document URI (or the concept URI, for that matter), which should be supported in the spirit of Open Data, the initial request will be unencrypted and subject to manipulation. While this is not critical for the application at hand, it should be avoided in general.

@Smalyshev's original patch will change the document URIs to use plain http for RDF output - do we need a more generic solution? In particular, how should URLs for rel="alternate" be constructed? See T96298 and https://gerrit.wikimedia.org/r/#/c/219001/ for that.

@Lydia_Pintscher have a look above. What do you think?

Markus: I seem to remember you saying pretty much everyone does http and not https? If this is indeed the case then I think we should follow conventions and do the same.

Smalyshev added a comment. Edited Jun 23 2015, 5:16 PM

So, I think we have a consensus that we want to use http:. Let's now figure out the best way to do it. In RdfVocabulary I did it by brute force: the constructor always rewrites the URLs to http:. That may be too harsh. Instead of RdfVocabulary forcing http in the ctor, clients could call RdfVocabulary::alwaysHTTP voluntarily. (I'm not 100% happy having such a basic function as "canonicalize URL to be http" reside in a very specialized class, but I couldn't find an existing one and I don't want to introduce cross-module dependencies.)
Or maybe there's another way of getting a canonical URL with http that I am missing.

Thoughts welcome on this.
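The two options in this comment can be contrasted with a small sketch. It is written in Python for illustration only; the classes `ForcingVocabulary` and `PlainVocabulary` and the function `always_http` are hypothetical stand-ins, not the actual RdfVocabulary code from the patch.

```python
from urllib.parse import urlsplit


def always_http(url: str) -> str:
    """Canonicalize a URL to the http scheme (stand-in for alwaysHTTP)."""
    return urlsplit(url)._replace(scheme="http").geturl()


class ForcingVocabulary:
    """Option 1: the vocabulary rewrites the scheme in its constructor."""

    def __init__(self, data_base_url: str) -> None:
        self.data_base_uri = always_http(data_base_url)


class PlainVocabulary:
    """Option 2: the vocabulary stores what it is given; callers opt in."""

    def __init__(self, data_base_url: str) -> None:
        self.data_base_uri = data_base_url


canonical = "https://www.wikidata.org/wiki/Special:EntityData/"

forced = ForcingVocabulary(canonical)            # scheme rewritten implicitly
opt_in = PlainVocabulary(always_http(canonical)) # caller rewrites explicitly
untouched = PlainVocabulary(canonical)           # caller forgot or chose not to

print(forced.data_base_uri)
print(opt_in.data_base_uri)
print(untouched.data_base_uri)
```

The trade-off in this sketch: option 1 cannot be bypassed by accident but hides the rewrite from callers, while option 2 keeps the class neutral at the cost of every caller having to remember the call.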

It is important to pick one and stick to it. WMF hosted things changed to https being the default for good reason, which was in preparation for years. We should pick https for things we host.

This seems a bit arbitrary and annoying, but it's backed by other documents, such as section 4.1 of the Study on persistent URIs by the ISA.

This actually does not talk about not using https at all. The example has explanations for all parts except the missing s; probably it only matters that it is one or the other. Throughout the document they treat http and https as the same thing.

https://dvcs.w3.org/hg/ldpwg/raw-file/default/ldp-bp/ldp-bp.html#predicate-uris-should-be-http-urls

This states that it is about being able to dereference, not that it should be http over https.

https://dvcs.w3.org/hg/ldpwg/raw-file/default/ldp-bp/ldp-bp.html#respond-with-primary-urls-and-use-them-for-identity-comparison

This explicitly says you should pick one and always return that. It also mentions that http without s is not sufficient to trust that a certain response may define what is canonical. Note that this one comes after the one above in the same document, i.e. they technically suggest https without making it so explicit. Possibly that was done so as not to divide the Working Group over things that were already decided and enshrined as BCP 188.

While I did say that pretty much all URIs I know use http, I do not have any reason to believe that https would cause problems. It is not so extensively tested maybe, but in most contexts it should work fine.

A bigger issue is that some people are already using our http URIs.

I agree that we need to pick one. But the URI in the data and the URL ultimately used to deliver the data don't have to be the same, I think; they only need to lead ultimately to the same resource. If we choose https, then we'll still need to ensure the URL is always https even if the page is somehow accessed with http (I know there's a redirect right now, but what if that redirect is or will be disabled for some clients/IPs/countries/etc.?).

Based on @daniel's comments on the ticket, I think we can commit this one as a short-term fix and then develop a longer-term consensus in T103767 (if it's different from the short-term one).

Change 218994 merged by jenkins-bot:
T102717: Fix the RDF links always use http:

https://gerrit.wikimedia.org/r/218994

Smalyshev closed this task as Resolved. Jun 25 2015, 6:28 AM

Change 220832 had a related patch set uploaded (by Smalyshev):
T103767: Revert "T102717: Fix the RDF links always use http:"

https://gerrit.wikimedia.org/r/220832

Change 220832 merged by jenkins-bot:
T103767: Revert "T102717: Fix the RDF links always use http:"

https://gerrit.wikimedia.org/r/220832