
Wikidata seems to still be utilizing insecure HTTP URIs
Open, HighPublic

Description

It has come to our attention via T330906 that some part of the Wikidata software/ecosystem is emitting insecure HTTP URIs that some UAs are consuming for insecure access. We need to find a way to secure these accesses. We also need to understand a little more about the nature of the use of these URIs as identifiers and what the challenges are in changing them at some level (either rewriting them just for output purposes, or changing them in a deeper way).

Event Timeline

BBlack triaged this task as High priority.Mar 6 2023, 9:25 PM
BBlack created this task.
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Some remarks:

  • We should consider these canonical HTTP URIs to be names in the first place, which are unique worldwide and issued by the Wikidata project as the "owner" [1] of the wikidata.org domain. The purpose of these names is to identify things.
  • Following linked data principles, it is no coincidence that these names happen to be valid URIs. These are meant to be used to look up information about the named entity. It is okay to redirect a canonical URI to another location, including of course to a secure HTTPS location.
  • Pretty much every external project (i.e. outside Wikimedia) that has aligned its content with Wikidata in the past 10+ years uses these canonical HTTP URIs. While the canonical HTTP URIs are not very present within Wikidata (but still relevant e.g. in WDQS and hardcoded in plenty of tools/bots), external usage is huge—not necessarily to look information up, but primarily to express identity with names issued by others for the same entity [2].
  • To my understanding, HSTS can be used to secure all but the first request of a client (that supports HSTS).
  • Canonical HTTP URIs are still widespread in many other linked data resources, since many projects started issuing them before everything transitioned to HTTPS. Some projects have transitioned to canonical HTTPS URIs, however, with GND doing this in 2019 being a prominent example [3].

[1] Yeah, this is legally not super precise but it does not matter here
[2] Examples: linked data for the Wikimedia Foundation at VIAF and GND. The principle is the same everywhere else.
[3] Background here: https://wiki.dnb.de/display/DINIAGKIM/HTTP+vs.+HTTPS+in+resource+identification

Some remarks:

  • We should consider these canonical HTTP URIs to be names in the first place, which are unique worldwide and issued by the Wikidata project as the "owner" [1] of the wikidata.org domain. The purpose of these names is to identify things.

If they're only names, that's relatively fine. However, there are user agents that end up following them as access URIs. If we could control every agent, we could require that they all upconvert to HTTPS for access, but we can't.

  • Following linked data principles, it is no coincidence that these names happen to be valid URIs. These are meant to be used to look up information about the named entity. It is okay to redirect a canonical URI to another location, including of course to a secure HTTPS location.

The problem with relying on redirects is that they're insecure. The initial request goes over the wire in the clear, as does the initial redirect response. They can both be hijacked, modified, censored, and surveilled, before the redirect to HTTPS ever happens. An advanced agent on the wire (like a national telecom) can even persistently hijack a whole session this way, by proxying the traffic into our servers as HTTPS.

We support redirects as a "better than breakage/nothing" solution, but ideally UAs shouldn't ever use insecure HTTP to begin with. This is why all of our canonical URIs (in the HTTP/HTML sense) begin with https, as evidenced by the <link rel="canonical" href="https://... tags in all normal pageviews.

  • To my understanding, HSTS can be used to secure all but the first request of a client (that supports HSTS).

It can be, and we even participate in HSTS preload for all of our canonical domains as well, which protects even the first request to a domain from browsers that use the preload list. However, there are many clients, especially bots and scripted tools, which rely on HTTP libraries or CLI tools that do not, by default, honor HSTS or load the preload list.
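To illustrate the bot/tool side of this: since most HTTP libraries do not consult the HSTS preload list, a tool that holds canonical http:// concept URIs has to upconvert them itself before making a request. A minimal sketch (the domain set here is illustrative, not an authoritative list of Wikimedia's preloaded domains):

```python
from urllib.parse import urlsplit, urlunsplit

# Illustrative set of hosts we know serve everything over TLS.
# A real tool would cover all relevant Wikimedia canonical domains.
HTTPS_ONLY_HOSTS = {"wikidata.org", "www.wikidata.org"}

def upgrade_to_https(uri: str) -> str:
    """Return an https:// access URL for a canonical http:// concept URI,
    leaving any other URI untouched. The http:// form can still be kept
    as the identifier; only the *access* URL is upgraded."""
    parts = urlsplit(uri)
    if parts.scheme == "http" and parts.hostname in HTTPS_ONLY_HOSTS:
        return urlunsplit(("https",) + tuple(parts[1:]))
    return uri
```

A tool would call `upgrade_to_https()` on every concept URI just before fetching it, so the insecure first hop (and the cleartext redirect) never happens.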

  • Canonical HTTP URIs are still widespread in many other linked data resources, since many projects started issuing them before everything transitioned to HTTPS. Some projects have transitioned to canonical HTTPS URIs, however, with GND doing this in 2019 being a prominent example [3].

This would be the ideal end-outcome: that we're able to transition the URLs to be HTTPS everywhere. Barring that, we could also look at where and how they're being emitted. We may have HTML page outputs which are rendering these canonical URIs for access purposes, where it would make sense to convert them to HTTPS as part of the rendering process to cut down on the problem.
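One hypothetical shape for the "convert at render time" option: a filter over generated HTML that rewrites http:// concept URIs to https:// only where they appear as clickable href targets, leaving the canonical http:// form untouched anywhere it is used purely as an identifier. This is a sketch, not the actual MediaWiki/Wikibase rendering code:

```python
import re

# Match the canonical concept URI only when it is used as a link target.
CONCEPT_HREF = re.compile(r'href="http://(www\.wikidata\.org/entity/[^"]+)"')

def secure_hrefs(html: str) -> str:
    """Rewrite http:// Wikidata concept URIs to https:// in href attributes,
    so page output never invites an insecure access."""
    return CONCEPT_HREF.sub(r'href="https://\1"', html)
```

Visible text showing the canonical http:// name would be unaffected; only the navigation target changes.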

Until it gets changed to HTTPS, basically we have two options:

  • Remove the link from sidebar and add it as text field in action=info
  • Make the link copy the URL on click, instead of navigating to the target.

Which one do you prefer @Lydia_Pintscher ?

Until it gets changed to HTTPS, basically we have two options:

  • Remove the link from sidebar and add it as text field in action=info
  • Make the link copy the URL on click, instead of navigating to the target.

The sidebar is just one place the concept URI appears. What about all the concept URIs that are the result of SPARQL queries? (Such as the ?item column in https://w.wiki/6iAE.)

The sidebar is just one place the concept URI appears. What about all the concept URIs that are the result of SPARQL queries? (Such as the ?item column in https://w.wiki/6iAE.)

That is a valid point, but most of the traffic comes from the sidebar, so fixing that alone improves things a lot; we can look into SPARQL next. Using HTTP in a concept URI is fine, we just should not link to it directly.
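For the SPARQL side, the same "identifier stays, link target changes" idea could be applied as a post-processing step over query results. A sketch, assuming the standard SPARQL 1.1 JSON results format (this is not WDQS's actual implementation):

```python
def https_uri_bindings(results: dict) -> dict:
    """Rewrite http:// Wikidata URIs in SPARQL JSON result bindings to
    https://, so result tables never link over insecure HTTP. The
    canonical http:// names could instead be kept as display text if
    identity, not access, is what a consumer needs."""
    prefix = "http://www.wikidata.org/"
    for row in results.get("results", {}).get("bindings", []):
        for binding in row.values():
            if binding.get("type") == "uri" and binding["value"].startswith(prefix):
                binding["value"] = "https://" + binding["value"][len("http://"):]
    return results
```
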