
Set up a public interface to the wikidata query service
Closed, ResolvedPublic

Description

We need to set up an interface for the Wikidata query service on the open internet.

My questions here are:

  • what hostname would we use? query.wikidata.org?
  • is a zero-HTTP-caching strategy advisable (as in: should we just cache static assets and not queries? Is caching inside the application good enough?)

Event Timeline

Joe raised the priority of this task to Medium.
Joe updated the task description. (Show Details)
Joe added subscribers: Addshore, Laddo, bd808 and 12 others.

> what hostname would we use? query.wikidata.org

Yes, it looks like it, from discussion with the Wikidata team.

> should we just cache static assets and not queries

Yes. I don't think caching queries makes a lot of sense for now, as we couldn't cache them for any meaningful time, and repetitiveness wouldn't be that high. If we discover some patterns later, we can think about per-pattern caching.
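
To make the split concrete, here is a minimal sketch of what that policy could look like in VCL (Varnish 3 style; the host check and asset-path pattern are assumptions for illustration, not the deployed config):

    # Sketch: cache only static UI assets, pass every query to the backend.
    sub vcl_recv {
        if (req.http.Host == "query.wikidata.org") {
            # Static assets (extensions assumed for illustration) may be cached.
            if (req.url ~ "\.(css|js|png|svg|ico)$") {
                return (lookup);
            }
            # Query results are never cached at the edge.
            return (pass);
        }
    }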

Change 228411 had a related patch set uploaded (by JanZerebecki):
Add query.wikidata.org

https://gerrit.wikimedia.org/r/228411


CentralAuth cookies are currently set for ".wikidata.org". Should this service have access to those cookies?

The service does not need to access them, but I'm not sure how we can avoid them being sent... Maybe have some Varnish rule to strip them? I see that the "interesting" cookies are HttpOnly, so it looks less problematic.

> The service does not need to access them, but I'm not sure how we can avoid them being sent... Maybe have some Varnish rule to strip them? I see that the "interesting" cookies are HttpOnly, so it looks less problematic.

@BBlack, is that possible?

Another alternative could be to have CentralAuth set the cookie for "www.wikidata.org". The only downside is that test.wikidata.org users would have to visit that wiki once and reload the page to get logged in... but it's a test wiki.
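
For reference, the cookie scope in question comes down to a single CentralAuth setting; a hypothetical sketch of the restriction (per-wiki configuration mechanics omitted):

    // Sketch: scope CentralAuth cookies to the wiki's own host instead of
    // the whole ".wikidata.org" domain, so query.wikidata.org never sees them.
    $wgCentralAuthCookieDomain = 'www.wikidata.org';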

Stripping cookies at the Varnish layer is possible, but not advisable in general, IMO.
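
For the record, the kind of rule being discussed would be roughly the following (a hypothetical Varnish 3 sketch, not a proposal for the actual config):

    # Sketch: strip all cookies before requests reach the query service, so
    # CentralAuth cookies scoped to ".wikidata.org" never leave the cache layer.
    sub vcl_recv {
        if (req.http.Host == "query.wikidata.org") {
            unset req.http.Cookie;
        }
    }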

Joe set Security to None.

Bringing this conversation back here from the comments in https://gerrit.wikimedia.org/r/#/c/228411/

The short summary of what this does is:
A read-only mirror of Wikidata.org (only the public information) in a special database that anyone can run queries against (think http://quarry.wmflabs.org/). The database natively receives queries via HTTP and answers with JSON-encoded results. In front of it there is an HTTP server that adds static assets for a JS-based UI and disallows POST requests to the DB.
However, according to T107601 (https://phabricator.wikimedia.org/T107601) we need to later move this from misc to a new LVS service, unless we change that plan and can somehow implement failover / load balancing over two backend servers with misc.
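
To illustrate the "disallow POST" part: the front layer only needs to let read methods through to the database's HTTP endpoint. A minimal sketch in VCL terms (the real setup puts an HTTP server in front, so this only shows the shape of the rule):

    # Sketch: reject anything but read requests to the query service.
    sub vcl_recv {
        if (req.http.Host == "query.wikidata.org" && req.request !~ "^(GET|HEAD)$") {
            error 405 "Method Not Allowed";
        }
    }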

The part about failover is orthogonal to the decision about misc-web. Our standard model for a raw internal service will be to LVS it across redundant backends as discussed in the other ticket (T107601), but this ticket/commit are about how to present the frontend of it to the world via standard termination on the public LVS/cache endpoints (which will then backend into the T107601 internal LVS'd endpoint).

I don't think we've ever really elucidated exactly what differentiates misc-web from text-lb for termination. The easy cases are obvious, but there's a grey area too, and this is a somewhat-grey case...

@Smalyshev, is wikidata.org required for some reason? Or was that just OK with them? Running on wikimedia.org would have a number of security benefits: no cookies, and no CORS accepted from the service.

@csteipp Well, it's the Wikidata Query Service, which serves Wikidata content... so having the domain under wikimedia.org and not wikidata.org would not be ideal. But if it's easier, we could start with that and add the wikidata one later. Also, doesn't wikimedia.org have cookies too?

I'd very much prefer query.wikidata.org for the query service. wikidata-query.wikimedia.org is rather ugly and not memorable for outsiders.

The intent is for the service to allow CORS, but I'm not sure about the implications. Anyway, that means it is not an argument for wikimedia.org and against wikidata.org. So we are left with the cookies, which we should isolate from the service, either by restricting them to www or by filtering them out. Which is the better route?

> The part about failover is orthogonal to the decision about misc-web. Our standard model for a raw internal service will be to LVS it across redundant backends as discussed in the other ticket (T107601), but this ticket/commit are about how to present the frontend of it to the world via standard termination on the public LVS/cache endpoints (which will then backend into the T107601 internal LVS'd endpoint).

> I don't think we've ever really elucidated exactly what differentiates misc-web from text-lb for termination. The easy cases are obvious, but there's a grey area too, and this is a somewhat-grey case...

If we put it in misc then this would be the first service that has another level behind misc instead of one named server. I have no preference. You or whoever wants to merge it chooses?

> If we put it in misc then this would be the first service that has another level behind misc instead of one named server. I have no preference. You or whoever wants to merge it chooses?

Another level of LVS, yes, I suppose so, but it still doesn't matter. It's a confusing topology, but the internal and external LVS bits have nothing to do with each other in a logical sense, even though they happen to share hosts and puppet configuration stanzas and all that. All that said, though, do we actually need an internal endpoint separate from the public one? We could also skip that whole layer if we don't have an explicit need. I'll bring it up in the other ticket.

FWIW on the rest of the above, I think query.wikidata.org makes more sense as well.

> The intent is for the service to allow CORS, but I'm not sure about the implications. Anyway, that means it is not an argument for wikimedia.org and against wikidata.org. So we are left with the cookies, which we should isolate from the service, either by restricting them to www or by filtering them out. Which is the better route?

It's fine for the service to accept CORS from anywhere: that has no impact on our other sites. But if our wikis accept CORS requests from the service's domain, then an XSS in this service can lead to significant issues on the wikis (steal user tokens, perform checkusers, etc.). We whitelist *.wikidata.org for CORS on our wikis, but only specific wikimedia.org domains (commons and meta, IIRC).

So if we go with wikidata.org, we need to change the CORS settings for the rest of our sites to whitelist specific domains (www and test), and also restrict cookie setting.
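
In configuration terms the change would look roughly like this; $wgCrossSiteAJAXdomains is the real MediaWiki setting, but the exact domain list here is an assumption for illustration:

    // Sketch: replace the '*.wikidata.org' wildcard with the specific wikis,
    // so query.wikidata.org is not CORS-whitelisted by accident.
    $wgCrossSiteAJAXdomains = [
        'www.wikidata.org',
        'test.wikidata.org',
        // ... plus the other WMF domains already on the whitelist
    ];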

Will the query service return raw HTML or SVG content? If it's only returning other content types like JSON, then CORS might not end up mattering too much.

An alternative to a separate domain could be to use something like https://wikidata.org/api/query_v1/. We are just setting up a listing for /api/, which will list all domain-specific APIs, including the REST API at /api/rest_v1/.

> Will the query service return raw HTML or SVG content?

Check out https://wiki.blazegraph.com/wiki/index.php/REST_API#QUERY. The formats the query endpoint accepts are XML and JSON.

However, I don't think we do URL filtering now, which means one could access not only the query URL but also other URLs, which could potentially store or produce HTML, now or in the future. We could add stricter limits on which URLs are passed to Blazegraph and limit it to just the query URL. I'll add a task for that.
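
The stricter limit could be a simple whitelist of the query endpoint at the front layer; a hypothetical sketch (the Blazegraph path shown is an assumption, and static UI assets would need their own allowance):

    # Sketch: only the SPARQL query endpoint is reachable from outside;
    # all other Blazegraph REST API URLs are rejected.
    sub vcl_recv {
        if (req.http.Host == "query.wikidata.org"
            && req.url !~ "^/bigdata/namespace/wdq/sparql") {
            error 403 "Forbidden";
        }
    }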

> An alternative to a separate domain could be to use something like https://wikidata.org/api/query_v1/

Not sure how this solves the issue - wouldn't it still be on wikidata.org and thus come under *.wikidata.org permissions?

> But if our wikis accept CORS requests from the service's domain, then an XSS in this service can lead to significant issues on the wikis (steal user tokens,

Aren't our tokens HttpOnly? If we allow content from *.wikidata.org to be injected into any wiki, then this means test.wikidata.org is included too, as is any other subdomain of wikidata.org. Maybe we can set it to www.wikidata.org only? I'm not sure we need other wikis to pull anything from test.wikidata.org, do we?

> Aren't our tokens HttpOnly?

Our session cookies are, but anti-CSRF tokens are available via API call. So JavaScript running on a wikidata.org subdomain can edit on any other WMF wiki via CORS.

> If we allow content from *.wikidata.org to be injected into any wiki, then this means test.wikidata.org is included too, as is any other subdomain of wikidata.org. Maybe we can set it to www.wikidata.org only? I'm not sure we need other wikis to pull anything from test.wikidata.org, do we?

I know we edit www.wikidata.org from many WMF domains (and I believe test.wikidata.org from test.wikipedia.org, for... testing), so wikidata.org needs to allow CORS requests from all WMF domains. However, I don't know if we ever edit other WMF domains from wikidata.org via CORS, so we might be able to cut *.wikidata.org out of our CORS policy entirely. Maybe @hoo knows?

I don't know of any deployed functionality that edits from www.wikidata.org to other wikis. But who knows what gadgets and user scripts do?

> including the REST API at /api/rest_v1/

What REST API are you talking about?

I've taken a bit of an alternative approach:

  • deploy behind misc-web, as query.wikidata.org
  • as logstash does, do not use LVS but Varnish directly here (a rough sketch follows)
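
Concretely, "Varnish directly" means misc-web gets a backend definition pointing straight at the service host instead of an internal LVS VIP; a rough Varnish 3 sketch (the host name is assumed for illustration):

    # Sketch: misc-web Varnish talks straight to the query service host,
    # logstash-style, with no internal LVS layer in between.
    backend wdqs {
        .host = "wdqs1001.eqiad.wmnet";   # assumed host name
        .port = "80";
    }

    sub vcl_recv {
        if (req.http.Host == "query.wikidata.org") {
            set req.backend = wdqs;
        }
    }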

Change 229392 had a related patch set uploaded (by Giuseppe Lavagetto):
wikidata query: add misc-web configuration

https://gerrit.wikimedia.org/r/229392

@Smalyshev, before we deploy this, can we task someone with updating $wgCrossSiteAJAXdomains to remove it from CORS domains, and set cookies for only the specific wikidata subdomains from CentralAuth?

@csteipp sure. but I have no idea who that would be. Could you create a task and assign it to appropriate person?

Change 229392 merged by Giuseppe Lavagetto:
wikidata query: add misc-web configuration

https://gerrit.wikimedia.org/r/229392