
Set up a public interface to the wikidata query service
Closed, ResolvedPublic

Description

We need to set up an interface for the Wikidata query service on the open internet.

My questions here are:

  • what hostname would we use? query.wikidata.org?
  • is a zero-HTTP-caching strategy advisable (as in: should we just cache static assets and not queries? Is caching inside the application good enough?)

Event Timeline

Joe raised the priority of this task to Medium.
Joe updated the task description. (Show Details)
Joe added subscribers: Addshore, Laddo, bd808 and 12 others.

> what hostname would we use? query.wikidata.org

Yes, it looks like it, from discussion with the Wikidata team.

> should we just cache static assets and not queries

Yes. I don't think caching queries makes a lot of sense for now, as we couldn't cache them for any meaningful time, and repetitiveness wouldn't be that high. If we discover some patterns later, we can think about per-pattern caching.
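
To make the split concrete, here is a minimal sketch of what that policy could look like in VCL (Varnish 3 style; the host check and asset-path pattern are assumptions for illustration, not the deployed config):

    # Sketch: cache only static UI assets, pass every query to the backend.
    sub vcl_recv {
        if (req.http.Host == "query.wikidata.org") {
            # Static assets (extensions assumed for illustration) may be cached.
            if (req.url ~ "\.(css|js|png|svg|ico)$") {
                return (lookup);
            }
            # Query results are never cached at the edge.
            return (pass);
        }
    }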

Change 228411 had a related patch set uploaded (by JanZerebecki):
Add query.wikidata.org

https://gerrit.wikimedia.org/r/228411


CentralAuth cookies are currently set for ".wikidata.org". Should this service have access to those cookies?

The service does not need to access them, but I'm not sure how we can avoid them being sent... Maybe have some Varnish rule to strip them? I see that the "interesting" cookies are HttpOnly, so it looks less problematic.

> The service does not need to access them, but I'm not sure how we can avoid them being sent... Maybe have some Varnish rule to strip them? I see that the "interesting" cookies are HttpOnly, so it looks less problematic.

@BBlack, is that possible?

Another alternative could be to have CentralAuth set the cookie for "www.wikidata.org". The only downside is that test.wikidata.org users would have to visit that wiki once and reload the page to get logged in... but it's a test wiki.
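
For reference, the cookie scope in question comes down to a single CentralAuth setting; a hypothetical sketch of the restriction (per-wiki configuration mechanics omitted):

    // Sketch: scope CentralAuth cookies to the wiki's own host instead of
    // the whole ".wikidata.org" domain, so query.wikidata.org never sees them.
    $wgCentralAuthCookieDomain = 'www.wikidata.org';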

Stripping cookies at the Varnish layer is possible, but not advisable in general, IMO.
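
For the record, the kind of rule being discussed would be roughly the following (a hypothetical Varnish 3 sketch, not a proposal for the actual config):

    # Sketch: strip all cookies before requests reach the query service, so
    # CentralAuth cookies scoped to ".wikidata.org" never leave the cache layer.
    sub vcl_recv {
        if (req.http.Host == "query.wikidata.org") {
            unset req.http.Cookie;
        }
    }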

Joe set Security to None.

Bringing this conversation back here from the comments in https://gerrit.wikimedia.org/r/#/c/228411/

The short summary of what this does is:
A read-only mirror of Wikidata.org (only the public information) in a special database that anyone can run queries against (think http://quarry.wmflabs.org/). The database natively receives queries via HTTP and answers with JSON-encoded results. In front of it there is an HTTP server that adds static assets for a JS-based UI and disallows POST requests to the DB.
However, according to T107601 (https://phabricator.wikimedia.org/T107601) we need to later move this from misc to a new LVS service, unless we change that plan and can somehow implement failover / load balancing over two backend servers with misc.
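
To illustrate the "disallow POST" part: the front layer only needs to let read methods through to the database's HTTP endpoint. A minimal sketch in VCL terms (the real setup puts an HTTP server in front, so this only shows the shape of the rule):

    # Sketch: reject anything but read requests to the query service.
    sub vcl_recv {
        if (req.http.Host == "query.wikidata.org" && req.request !~ "^(GET|HEAD)$") {
            error 405 "Method Not Allowed";
        }
    }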

The part about failover is orthogonal to the decision about misc-web. Our standard model for a raw internal service will be to LVS it across redundant backends as discussed in the other ticket (T107601), but this ticket/commit are about how to present the frontend of it to the world via standard termination on the public LVS/cache endpoints (which will then backend into the T107601 internal LVS'd endpoint).

I don't think we've ever really elucidated exactly what differentiates misc-web from text-lb for termination. The easy cases are obvious, but there's a grey area too, and this is a somewhat-grey case...

@Smalyshev, is wikidata.org required for some reason? Or was that just OK with them? Running on wikimedia.org would have a number of security benefits: no cookies, and no CORS accepted from the service.

@csteipp Well, it's the Wikidata Query Service, which serves Wikidata content... so having the domain under wikimedia.org and not wikidata.org would not be ideal. But if it's easier, we could start with that and add the wikidata one later. Also, doesn't wikimedia.org have cookies too?

I'd very much prefer query.wikidata.org for the query service. wikidata-query.wikimedia.org is rather ugly and not memorable for outsiders.

The intent is for the service to allow CORS, but I'm not sure about the implications. Anyway, that means it is not an argument for wikimedia.org and against wikidata.org. So we are left with the cookies, which we should isolate from the service, either by restricting them to www or by filtering them out. Which is the better route?

> The part about failover is orthogonal to the decision about misc-web. Our standard model for a raw internal service will be to LVS it across redundant backends as discussed in the other ticket (T107601), but this ticket/commit are about how to present the frontend of it to the world via standard termination on the public LVS/cache endpoints (which will then backend into the T107601 internal LVS'd endpoint).

> I don't think we've ever really elucidated exactly what differentiates misc-web from text-lb for termination. The easy cases are obvious, but there's a grey area too, and this is a somewhat-grey case...

If we put it in misc then this would be the first service that has another level behind misc instead of one named server. I have no preference. You or whoever wants to merge it chooses?

> If we put it in misc then this would be the first service that has another level behind misc instead of one named server. I have no preference. You or whoever wants to merge it chooses?

Another level of LVS, yes, I suppose so, but it still doesn't matter. It's a confusing topology, but the internal and external LVS bits have nothing to do with each other in a logical sense, even though they happen to share hosts and puppet configuration stanzas and all that. All that said, though, do we actually need an internal endpoint separate from the public one? We could also skip that whole layer if we don't have an explicit need. I'll bring it up in the other ticket.

FWIW on the rest of the above, I think query.wikidata.org makes more sense as well.

> The intent is for the service to allow CORS, but I'm not sure about the implications. Anyway, that means it is not an argument for wikimedia.org and against wikidata.org. So we are left with the cookies, which we should isolate from the service, either by restricting them to www or by filtering them out. Which is the better route?

It's fine for the service to accept CORS from anywhere: that has no impact on our other sites. But if our wikis accept CORS requests from the service's domain, then an XSS in this service can lead to significant issues on the wikis (steal user tokens, perform checkusers, etc.). We whitelist *.wikidata.org for CORS on our wikis, but only specific wikimedia.org domains (commons and meta, IIRC).

So if we go with wikidata.org, we need to change the CORS settings for the rest of our sites to whitelist specific domains (www and test), and also restrict cookie setting.
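
In configuration terms the change would look roughly like this; $wgCrossSiteAJAXdomains is the real MediaWiki setting, but the exact domain list here is an assumption for illustration:

    // Sketch: replace the '*.wikidata.org' wildcard with the specific wikis,
    // so query.wikidata.org is not CORS-whitelisted by accident.
    $wgCrossSiteAJAXdomains = [
        'www.wikidata.org',
        'test.wikidata.org',
        // ... plus the other WMF domains already on the whitelist
    ];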

Will the query service return raw HTML or SVG content? If it's only returning other content types like JSON, then CORS might not end up mattering too much.

An alternative to a separate domain could be to use something like https://wikidata.org/api/query_v1/. We are just setting up a listing for /api/, which will list all domain-specific APIs, including the REST API at /api/rest_v1/.

> Will the query service return raw HTML or SVG content?

Check out https://wiki.blazegraph.com/wiki/index.php/REST_API#QUERY. The formats the query endpoint accepts are XML and JSON.

However, I don't think we do URL filtering now, which means one could access not only the query URL but also other URLs, which could potentially store or produce HTML, now or in the future. We could add stricter limits on which URLs are passed to Blazegraph and limit it to just the query URL. I'll add a task for that.
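
The stricter limit could be a simple whitelist of the query endpoint at the front layer; a hypothetical sketch (the Blazegraph path shown is an assumption, and static UI assets would need their own allowance):

    # Sketch: only the SPARQL query endpoint is reachable from outside;
    # all other Blazegraph REST API URLs are rejected.
    sub vcl_recv {
        if (req.http.Host == "query.wikidata.org"
            && req.url !~ "^/bigdata/namespace/wdq/sparql") {
            error 403 "Forbidden";
        }
    }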

> An alternative to a separate domain could be to use something like https://wikidata.org/api/query_v1/

Not sure how this solves the issue - wouldn't it still be on wikidata.org and thus come under *.wikidata.org permissions?

> But if our wikis accept CORS requests from the service's domain, then an XSS in this service can lead to significant issues on the wikis (steal user tokens,

Aren't our tokens HttpOnly? If we allow content from *.wikidata.org to be injected into any wiki, then this means test.wikidata.org is included too, as is any other subdomain of wikidata.org. Maybe we can set it to www.wikidata.org only? I'm not sure we need other wikis to pull anything from test.wikidata.org, do we?

> Aren't our tokens HttpOnly?

Our session cookies are, but anti-CSRF tokens are available via API call. So JavaScript running on a wikidata.org subdomain can edit on any other WMF wiki via CORS.

> If we allow content from *.wikidata.org to be injected into any wiki, then this means test.wikidata.org is included too, as is any other subdomain of wikidata.org. Maybe we can set it to www.wikidata.org only? I'm not sure we need other wikis to pull anything from test.wikidata.org, do we?

I know we edit www.wikidata.org from many WMF domains (and I believe test.wikidata.org from test.wikipedia.org, for... testing), so wikidata.org needs to allow CORS requests from all WMF domains. However, I don't know if we ever edit other WMF domains from wikidata.org via CORS, so we might be able to cut *.wikidata.org out of our CORS policy entirely. Maybe @hoo knows?

I don't know of any deployed functionality that edits from www.wikidata.org to other wikis. But who knows what gadgets and user scripts do?

> including the REST API at /api/rest_v1/

What REST API are you talking about?

I've taken a bit of an alternative approach:

  • deploy behind misc-web, as query.wikidata.org
  • as logstash does, do not use LVS but Varnish directly here (a rough sketch follows)
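
Concretely, "Varnish directly" means misc-web gets a backend definition pointing straight at the service host instead of an internal LVS VIP; a rough Varnish 3 sketch (the host name is assumed for illustration):

    # Sketch: misc-web Varnish talks straight to the query service host,
    # logstash-style, with no internal LVS layer in between.
    backend wdqs {
        .host = "wdqs1001.eqiad.wmnet";   # assumed host name
        .port = "80";
    }

    sub vcl_recv {
        if (req.http.Host == "query.wikidata.org") {
            set req.backend = wdqs;
        }
    }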

Change 229392 had a related patch set uploaded (by Giuseppe Lavagetto):
wikidata query: add misc-web configuration

https://gerrit.wikimedia.org/r/229392

@Smalyshev, before we deploy this, can we task someone with updating $wgCrossSiteAJAXdomains to remove it from CORS domains, and set cookies for only the specific wikidata subdomains from CentralAuth?

@csteipp sure. but I have no idea who that would be. Could you create a task and assign it to appropriate person?

Change 229392 merged by Giuseppe Lavagetto:
wikidata query: add misc-web configuration

https://gerrit.wikimedia.org/r/229392