
Create a more controlled WDQS cluster
Closed, Resolved · Public

Description

WDQS is by design a fragile service: we allow arbitrary users to run arbitrary SPARQL queries. This is similar to allowing any user direct access to our production MySQL databases. That is fine as long as expectations are managed, but it is definitely wrong when services from which we expect stable response times and availability start depending on it.

Splitting WDQS into a public "do whatever you want" service and a separate, more controlled service makes sense. This would be similar to what we do with MySQL, which is split between the production service and the labs replicas, with different expectations.

My understanding is that the structured data on commons project will rely on WDQS for some of its functionalities. If we want that project to be stable, we need to address the WDQS stability issues.

Event Timeline

Restricted Application added a subscriber: Aklapper.

My understanding is that the structured data on commons project will rely on WDQS for some of its functionalities. If we want that project to be stable, we need to address the WDQS stability issues.

In what form does it rely on it?
What I can say is that Wikibase-Quality-Constraints depends on WDQS and also on short response times.
Could we maybe create an official list?

I am not familiar with the details of the dependent services (it would be nice to document them somewhere, or point me to the place where they are already documented :) but in general, unless you need an instant response, a 429 should not be a cause for immediate failure. For offline functionality, a 429 should cause a retry within the prescribed time. For online use it would depend on the case, but if it's a secondary UI element, the script can just try to load it later, maybe?
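To illustrate the retry-within-prescribed-time behaviour, here is a minimal sketch of a client that honours 429 responses from the public endpoint; the endpoint URL is the real public one, but the retry policy (attempt count, fallback delay, user agent) is just an assumption for illustration:

```python
import time
import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

def run_query(sparql, max_attempts=3, fallback_delay=5):
    """Run a SPARQL query, retrying on 429 within the server-prescribed time."""
    for _ in range(max_attempts):
        resp = requests.get(
            WDQS_ENDPOINT,
            params={"query": sparql, "format": "json"},
            headers={"User-Agent": "example-offline-job/0.1"},  # placeholder UA
        )
        if resp.status_code == 429:
            # Honour Retry-After (in seconds) if present, otherwise back off a bit.
            try:
                delay = int(resp.headers.get("Retry-After", fallback_delay))
            except ValueError:
                delay = fallback_delay
            time.sleep(delay)
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError("WDQS kept throttling the query; giving up for now")
```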

The Recommendation API service depends entirely on WDQS (and partially on AQS). While in theory we could (and should) account for 429s sent by WDQS in the service, the problem described in the task description is genuine. I second the effort of splitting the service (or its access) into two logical entities - one for external, the other for internal clients.

Random thoughts gathered in multiple discussions (note that this is very much brainstorming; nothing written below is an actual decision on a path to follow):

  • I have heard WDQS mentioned in the context of structured data on commons, but I have no idea how it is going to be used. If anyone has a pointer to some documentation / discussion, please let me know.
  • Since we have 2 active / active WDQS clusters (eqiad / codfw), we could use one of them to serve internal traffic and one as an external endpoint. This defeats the purpose of having a backup datacenter, so that's not a long-term solution.
  • The hard use case to support is the external endpoint, since it is subject to uncontrolled load. Do we want to continue supporting this use case? (I very much think we should, but it isn't my place to make that decision; the question should be asked, and answered.)
  • Since each WDQS node is independent and does its own updates by querying wikidata, increasing the number of nodes increases the load on wikidata. Is that an issue? Is there a way to better share the resources required for updates?

Perhaps a viable solution for the short to mid term is to have two LVS endpoints in each DC: one that serves external traffic and another internal one, much like we have api.svc.{site}.wmnet and api-async.svc.{site}.wmnet for the MW API, respectively.

But if those 2 LVS endpoints are served by the same cluster, we've not achieved anything in terms of isolation...

It might make sense to add whitelisting / authentication / authorization on the "controlled" WDQS cluster to ensure the usage stays under control. This might or might not be required. If we go that way, this is really a more general problem, and it might make sense to address it in a more general way.

I think for now limiting it by IP should probably work? I think IP ranges from production hosts, labs and outside are segregated?
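For what an IP-based restriction could look like, here is a minimal sketch of a per-request check, assuming the controlled endpoint sits behind something that can run such a filter; the listed ranges are placeholders, not the actual production / labs networks:

```python
from ipaddress import ip_address, ip_network

# Placeholder ranges -- the real production / labs networks would go here.
ALLOWED_RANGES = [
    ip_network("10.0.0.0/8"),      # e.g. production hosts
    ip_network("172.16.0.0/21"),   # e.g. labs instances, if we decide to allow them
]

def is_allowed(client_ip: str) -> bool:
    """Return True if the client IP falls inside one of the allowed ranges."""
    addr = ip_address(client_ip)
    return any(addr in net for net in ALLOWED_RANGES)

# Example: a production-like address passes, an outside address does not.
assert is_allowed("10.64.0.17")
assert not is_allowed("203.0.113.42")
```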

Random thoughts gathered in multiple discussions (note that this is very much brainstorming; nothing written below is an actual decision on a path to follow):

  • I have heard WDQS mentioned in the context of structured data on commons, but I have no idea how it is going to be used. If anyone has a pointer to some documentation / discussion, please let me know.

We will have a lot of structured data on Commons in the same way/format as it currently exists on Wikidata. We will want to provide people with the ability to search/query for media files on Commons based on the statements on these files. People will also want to combine the data on Commons and Wikidata in queries. To make this all work we'll want to load that data into WDQS as well (or set up a separate query service but iirc Stas said no and I can see the benefits of not doing it.) I am not sure we are much farther in planning than this.
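To make the "combine Commons and Wikidata data in one query" idea a bit more concrete, here is a rough sketch of what a federated query might look like from a client, assuming a hypothetical Commons SPARQL endpoint and a made-up "depicts" predicate; only query.wikidata.org and the wd:/wdt: prefixes are real today:

```python
import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Rough sketch: find media files whose "depicts" statement points at an item
# that Wikidata classifies as a lighthouse. The Commons-side endpoint and
# predicate below are invented; the wd:/wdt: terms are built-in WDQS prefixes.
FEDERATED_QUERY = """
SELECT ?file ?item WHERE {
  SERVICE <https://commons-query.example.org/sparql> {   # hypothetical endpoint
    ?file <http://example.org/prop/direct/depicts> ?item .
  }
  ?item wdt:P31/wdt:P279* wd:Q39715 .   # instance of (a subclass of) lighthouse
}
LIMIT 10
"""

def run():
    resp = requests.get(
        WDQS_ENDPOINT,
        params={"query": FEDERATED_QUERY, "format": "json"},
        headers={"User-Agent": "example-bot/0.1"},  # placeholder UA
    )
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]
```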

  • Since we have 2 active / active WDQS clusters (eqiad / codfw), we could use one of them to serve internal traffic and one as an external endpoint. This defeats the purpose of having a backup datacenter, so that's not a long-term solution.
  • The hard use case to support is the external endpoint, since it is subject to uncontrolled load. Do we want to continue supporting this use case? (I very much think we should, but it isn't my place to make that decision; the question should be asked, and answered.)

Yes, that is definitely a use case we want to continue to support.

  • Since each WDQS node is independent and does its own updates by querying wikidata, increasing the number of nodes increases the load on wikidata. Is that an issue? Is there a way to better share the resources required for updates?
  • I have heard WDQS mentioned in the context of structured data on commons, but I have no idea how it is going to be used. If anyone has a pointer to some documentation / discussion, please let me know.

We will have a lot of structured data on Commons in the same way/format as it currently exists on Wikidata. We will want to provide people with the ability to search/query for media files on Commons based on the statements on these files. People will also want to combine the data on Commons and Wikidata in queries. To make this all work we'll want to load that data into WDQS as well (or set up a separate query service but iirc Stas said no and I can see the benefits of not doing it.) I am not sure we are much farther in planning than this.

As I understand it, this would allow users to run queries in a fairly controlled way, not to run arbitrary SPARQL queries. This means we can probably fairly easily ensure that those queries have a reasonable cost and don't endanger the "controlled" WDQS cluster. As I understand it, this pattern is similar to what we do with any MySQL-backed service: we don't allow arbitrary SQL, but we take input from users and inject it into known queries.
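A minimal sketch of that pattern, assuming a hypothetical search feature where the query shape and its cost are fixed by us and the only user-controlled part is a validated item ID (the "depicts" predicate is a placeholder):

```python
import re

# Hypothetical sketch: the query shape and its LIMIT are fixed by us; the user
# only supplies an item ID, which is validated before being injected --
# conceptually the same as using parameterized SQL instead of raw SQL.
QUERY_TEMPLATE = """
SELECT ?file WHERE {
  ?file <http://example.org/prop/direct/depicts> wd:%s .   # placeholder predicate
}
LIMIT 100
"""

def build_query(user_item_id: str) -> str:
    """Build a fixed-cost query from untrusted user input."""
    if not re.fullmatch(r"Q[0-9]+", user_item_id):
        raise ValueError("not a valid item ID")
    return QUERY_TEMPLATE % user_item_id
```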

I'm not too worried about the additional data at this point (we have more headroom on storage space than on computational resources).

I believe people will also want to run arbitrary queries. However, I guess the majority, coming from users who are searching for files on the Commons website, will be fairly controlled.

I think for now limiting it by IP should probably work? I think IP ranges from production hosts, labs and outside are segregated?

Yes, allowing access to a controlled cluster only from production hosts is probably fine and is similar to what we do for all our services. The idea of more controlled access has been raised a few times (I remember a conversation with @BBlack about maps and similar concerns). Having API keys or a similar mechanism would allow us to open up a controlled cluster to trusted clients (even from labs), or to more easily identify / block abusive consumers, including production consumers.
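As a sketch of what such an API-key mechanism could look like in front of the controlled cluster, assuming keys are provisioned out of band and sent in a request header; the header name, client names and key values are all invented for illustration:

```python
from typing import Optional

# Hypothetical sketch: keys for trusted clients (production services, selected
# labs tools) would be provisioned out of band; names and values are made up.
TRUSTED_KEYS = {
    "recommendation-api": "secret-key-1",
    "wikibase-quality-constraints": "secret-key-2",
}

def client_for_request(headers: dict) -> Optional[str]:
    """Return the client name if the request carries a known API key, else None.

    Attributing every request to a named client is what makes it easy to
    identify and, if necessary, block an abusive consumer, internal or not.
    """
    key = headers.get("X-Api-Key")  # invented header name
    for client, secret in TRUSTED_KEYS.items():
        if key == secret:
            return client
    return None
```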

I believe people will also want to run arbitrary queries. However, I guess the majority, coming from users who are searching for files on the Commons website, will be fairly controlled.

So we want users to be able to run arbitrary searches on Commons, but not arbitrary SPARQL queries, right? (At least in the context of the structured data on Commons project; we of course still want users to be able to run arbitrary SPARQL queries on the current query.wikidata.org.) Feel free to ping me for a chat (IRC, hangouts, other, ...) to clarify that more if you think that's needed.

Gehel closed subtask Restricted Task as Resolved. Feb 20 2018, 9:35 AM
Smalyshev claimed this task.

New cluster is up and serving traffic.